
Unsupervised Data Mining in Nominally-Supported Databases

Thomas Prang

Los Alamos National Laboratory

04-30-98

Abstract

The meaning of structure or pattern for data which fulfills only nominal requirements will be investigated. Basic information and uncertainty measures will be discussed, and a theoretical framework of four basic techniques for structure-finding introduced. I will conclude with an overview of a set of both well-known and less known methods, which will be related to each other, to the structure-finding problem, and to the introduced basic techniques. To contrast the nominal-data, unsupervised approach, some fundamental methods of the much more investigated continuous, supervised domain will also be presented.

Contents

1 Problem
  1.1 Introduction
  1.2 Epistemological background

2 Structure
  2.1 What is structure
  2.2 Information and Uncertainty Measures
    2.2.1 Hartley Information
    2.2.2 Shannon's Entropy
    2.2.3 Transmission
    2.2.4 Cross-Entropy
  2.3 Finding structure
    2.3.1 Projection and Extension
    2.3.2 Subset and Superset
    2.3.3 Coarsening and Refining
    2.3.4 New dimensions - Meta-dimensions

3 Methods
  3.1 Supervised, Scalar methods
    3.1.1 Fisher / Perceptron (NN) - linear classifiers
    3.1.2 Logistic Regression
    3.1.3 Neural Networks
  3.2 Unsupervised, Ordered Methods
    3.2.1 Clustering
    3.2.2 Mask Analysis
  3.3 Unsupervised, Nominal methods
    3.3.1 Analysis of Variance (ANOVA)
    3.3.2 Reconstructability Analysis (RA)
    3.3.3 DEEP
    3.3.4 Log-linear models
    3.3.5 Rule inference
  3.4 Supervised, Nominal methods
    3.4.1 Decision Trees

4 Dangers in Data-mining
  4.1 Causality
  4.2 Parts and Wholes
  4.3 Training Data
  4.4 Summary of Dangers

1 Problem

In my project I want to deal with unsupervised approaches for finding structures, patterns, or relationships in large databases. Many methods have been developed for structure finding in continuous and ordered data, but nominal aspects of this problem are often ignored.


An important characteristic of databases is how entries (rows) within a table are related to each other. In a customer table the entries may be unrelated; in a table of daily temperature measurements the individual entries may be time-related. Database tables usually contain a variable (or variables) for uniquely identifying and distinguishing each entry (a functional relationship). This variable, which often doesn't have any other purpose, is referred to as the "support" [26]. For example there might be a customer ID for uniquely identifying customers and a time stamp for identifying the temperature measurements. We can now express the kind of relationship among data entries by the support type. In the case of time or space coordinates the support is ordered. The consequence for structure finding is that we can look for behavior patterns over this support (time, space, etc.); for example the influence a previous entry has on a current one: what influence does yesterday's temperature (and perhaps some other climate variables) have on today's temperature? In the case of customer IDs, record numbers, and Social Security Numbers, the support is nominal. Here structure finding needs to concentrate on general patterns within data entries. In this paper I focus on nominal support.

Similar to nominal support, I also want to concentrate on nominal data, that is, data fields whose "real world" counterparts are neither continuous nor ordered themselves. Examples of this are 'area code' (which could be sorted numerically, but the ordering would not be meaningful), 'name of friend', 'product bought', and 'kind of payment used'. Other examples include normalized string variables like normalized addresses. Note that we can still have a distance measurement (not necessarily a metric in the mathematical sense) on these kinds of fields, so that clustering is possible. For example, there is no intrinsic ordering of persons as friends, but we can have a distance measurement of friendship; products have no intrinsic ordering (though they could be ordered by price, as persons could be ordered by their IQ, which would be an implicit relabeling of the fields) but they can have a similarity measure.

The distinction between "supervised" and "unsupervised" data mining methods comes from the classification problem. If methods use a training data set with correct classifications for learning specific predictive patterns, they are called supervised. Many neural networks as well as logistic regression, Fisher analysis, etc. work this way. If we just use the data itself to find internal structure, the method is called unsupervised. Summarizing, one could state that "supervised" denotes structure finding directed to one classification variable, while "unsupervised" means general structure finding.


In general three data types are distinguished: nominal, ordinal, and scalar. Nominal data values have no ordered relationship to each other; ordinal values can be ordered, but the ordering has no associated distance. We can't say that "one value is twice as good as another one". Finally, scalar values, also referred to as continuous values, have a quantitative relationship associated with the ordering.

Although I will concentrate my discussion of basic techniques, methods and implementations on the more general unsupervised, nominal domain, for completeness I will present some supervised methods and methods on continuous data. Relating them to the discussed background will help to "see" the complete picture and better understand the unsupervised, nominal problem.

1.1 Introduction

"Database Mining can be defined as the process of mining for implicit, previously unknown, and potentially useful information from very large databases by efficient knowledge discovery techniques." (workshop program, 1995 ACM Computer Science Conference) [31].

"Implicit" expresses that the information is inductive, discovered from data, instead of deductive, derived from laws.

"Previously unknown" has two meanings and can therefore be somewhat misleading. The first is the obvious one: we are only interested in knowledge that is not already known. This is true for all science. But the other important meaning is that "data mining is distinguished by the fact that it is aimed at the discovery of information, without a previously formulated hypothesis" [6, pg. 12]. In this sense it is different from most sciences, where we usually first state hypotheses and then test them. Data mining is aimed at deriving hypotheses from the data in the first place. This is then also the main distinction between data mining and data warehousing (data management). Data warehousing allows queries for "validating" a hypothesis within the data, while data mining searches for general patterns to "explain" something. Because we don't start out with hypotheses and we are mostly dealing only with archival data (section 1.2), data-mining results need to be interpreted with care (section 4). Depending on the method and measure used they may not reflect any correlation among variables, let alone causality relationships. Therefore data-mining results should be treated as what they are - hypotheses.


"Potentially useful" is obvious: we only care about information which is in some sense useful.

"Very large databases" (VLDB) expresses the problem of dealing with enormous amounts of data. "Computerization of daily life has caused data about individual behavior to be collected and stored by banks, credit card companies, reservation systems, and electronic points of sale" [6, pg. 9]. Satellite photographs, climate measurements, video safety recordings, etc. also add to the growing amounts of data. "It has been estimated that the amount of information in the world doubles every 20 months. The size and number of databases probably increase even faster." [42, pg. 1]. This also highlights an important, sometimes misunderstood, point: data mining methods are not developed to replace humans (because they are somehow better), instead they are developed to guide and support humans in the process of discovering information. The amounts of data are so enormous that it is totally impossible for humans to examine them. If we don't use computerized support in data mining, most of the collected data will stay unobserved.

"By efficient knowledge discovery techniques" emphasizes again the point that we need fast algorithms for dealing with the data in a useful way. Especially the curse of dimensionality should be noted: the complexity of methods tends to increase exponentially with the number of dimensions, as does the amount of data needed to meaningfully "cover" that dimensionality. This is the main impetus for the "variable selection", "field selection", and "variable reduction" methods.

There are several other definitions of data-mining, but they usually differ only slightly depending on what the author wants to emphasize, or what kind of readership he addresses. For example Cabena et al. address business managers: "Data Mining is the process of extracting previously unknown, valid, and actionable information from very large databases and then using the information to make crucial business decisions." [6, pp. 12-13]. A good introduction to data mining can be found in Adriaans and Zantinge [1, pp. 1-10].

In this paper I will present some fundamental structure-finding ideas (section 2) and overview a representative selection of data-mining methods (section 3). As my title expresses, the emphasis will be on unsupervised methods (meaning not directed to any particular classification variable), and on nominal data. Therefore, structure and knowledge connected with ordered or even quantitative relationships of variables, e.g. Newton's law, is disregarded in this discussion and postponed for a later paper. Some methods for continuous data are also introduced to contrast with the nominal approach.

1.2 Epistemological background

In this section I will discuss the similarities and significant differences between databases and source systems in the system science view. It will be demonstrated that these differences are not only specific to the source system perspective but apply to the data-mining problem in general, focusing on the questions of what the important data for our investigation is and how it can be cleaned, preprocessed, and transformed.

As we will see, databases are basically collections of data entries over specified sets of variables with specified domains (state sets, value sets). This view fits closely with the idea of source and data systems [26]. The definition of variables, their connection to the "real world" and assigned sets of "allowed" values constitute a source system. It's important to note that this definition already contains a lot of constraints and information. A source system already specifies what aspects of the "real world" are important, where "important" must always be seen in the problem context of our system. Also specified is how to map these aspects into our problem space. This is normally a homomorphic mapping - the values are simplified and fewer in number, but the "relevant" structure is preserved.

Relational databases are seen as a collection of relations [39]. Joining all these relations we can obtain one overall relation which combines the relational information. The definition of the overall relation, called the relational schema, specifies a set of variables, corresponding domains, and respective "real-world connections". The constraints and knowledge already given by the variables and corresponding sets of values are the same as in a source system. The relational schema of the total join is therefore an equivalent (isomorphic) description to a source system of the whole database.

Measurements add data to the source system. We obtain a data system. In the database view a relation or table is obtained from the relational schema. The data represents structure and relationships between our variables, and, therefore, information about specific "real world" connections.

Actually the set of (unjoined) relations in a database already contains deeper constraints and structure among the variables, as some variables are disconnected in different relations. The unjoined relations with their additional structure correspond to a GSPS structure system. More details on structure systems will be given and related in section 3.3.2. In relational databases this structure within the relational schema is often induced by so-called "normal forms" for the manageability of the database. The following discussion will therefore concentrate on the overall relation as one data system, with its relational schema isomorphic to a source system.

Even though source and data systems are theoretically connected with databases, there is still a big and practical difference between them. In the system science view, the building of a source system is preceded by several premethodological considerations [26, chapter 1]: The Purpose of Investigation expresses our idealized intention - what do we want to achieve with the system, what is the reason for our modeling. The Constraints of Investigation restrict our idealized intention and reapply our purpose to the real world with all its constraints. For example, parts of our intention may not be realizable - some information may be unavailable, technologies may not be advanced enough, etc. This leads us to our final Object of Investigation from which we abstract variables and corresponding state sets, as well as support and support sets for differentiating our data. Building a source system in the system science approach means carefully selecting variables and domains for the kind of things we want to investigate. For the domains we restrict our attention to as few values as possible, as this simplifies the modeling process.

In data mining the situation looks entirely different. Neither variables nor state sets or supports are selected by ourselves; rather, they are given to us with the direction to find some specific structures, patterns, and dependencies. Often these databases are large, incomplete and contain "noisy", uncertain, redundant, useless, NULL or missing values. Questions are how to deal with the overabundance of information, how to find that information, and even whether there is adequate information for interesting discoveries.

The situation can be compared with the three levels of data one can use for scientific discovery (see also [42, pp. 33]). The first level is the experimental level. One can actively select appropriate input variables and an environment to create the needed data. This normally leaves us with an abundance of data, as many parameters can be varied. Also, new data can be created, measurements can be improved and tests repeated (to validate results).

Observational data constitutes the second level. In this situation parameters of an investigated object cannot be changed and therefore data cannot be created just as needed. But for an observer it is still possible to choose what kinds of data he wants to select for his investigation. He can measure any observable properties of any available object and refine his methods of measurement.

The lowest level consists of historical or archival data. This is data which is already recorded, and we have no way to change, improve or add other dimensions. In most situations this is the information we need to deal with in databases, whereas in the system science approach we often start with experimental or observational data, though constraints can also require the use of archival data.

There are two approaches to deal with these differences between databases and source/data systems.

The first solution is the obvious one. For our investigation we define the given database as the universe of discourse. This means that constraints restrict the available data to that in the database. From the database we then select our variables and corresponding domains (which could be different from the variables and domains in the database) according to our purpose.

The second solution is similar in that it also selects variables, but we don't choose static domains for these variables. When exploring databases we often want to be able to see the details in the data as well as simplified and abstracted values for finding general patterns. Therefore we use the detailed "precision" of the database for our source system, but we also incorporate the concept of "simplification" (coarsening and refinement) as hierarchies. This means that we introduce hierarchies of values for each variable in the source system. This makes it possible not only to search for structure at each level of the hierarchy but also to use different levels for different variables. For example, for income we might use only three different states {high, medium, low}, for address a clustering into small towns and city districts, and for number of children the whole precision of the database. Later in this process we might want to see how the medium wages spread out over five more refined wage ranges. Hierarchies of values are often used in OLAP (On-Line Analytical Processing) [2, 3, 1], and in the next chapter I will discuss how they can be used for structure finding. I also want to mention that a "3-4-5 rule" for building hierarchies is known in the database community. It suggests dividing each node of the hierarchy into 3, 4 or 5 subnodes [18].

One might ask why I concentrated so much on the relationship of databases with source systems in this chapter. First of all, source systems are very similar to relational databases and are therefore an "interesting" viewpoint. Second, the mentioned differences between databases and source systems express the importance and necessity of preparing data for data-mining. In this sense "building a source system" is just a labeling example (I think a very appropriate one) for a step which in the literature is also called "Data Cleaning", "Data Preprocessing", "Data Transformation" and "Variable Selection" [1, pp. 37-47]. This process includes knowing the purpose for investigating the database; dealing with useless, noisy, false, uncertain and redundant data; transforming it and selecting useful dimensions and state sets (or hierarchies of state sets). Basically it summarizes all the preprocessing which needs to be done before the "real" investigation. For more details on source or data systems see [26, chapters 1, 2].

2 Structure

The structures or patterns we are looking for are constraints, relations anddependencies within the data which are not obvious from the wide database-structure.

Additional structures can help us understand some connections in the database; in the extreme case they can lead to the elimination of redundant data entries. The most important reason for finding structure is predictability: for example, if some of the data is known over some fields, then what are the probability or possibility measures for other fields; an owner of a business may want to predict the buying behavior of his customers; some companies may want to predict fraudulent behavior of customers; a sociologist tries to predict group-dynamic behavior in specific social groups and to find out what role-behavior has the most influence on it; etc.

2.1 What is structure

In this section I want to investigate what we mean by structure in data. "Dimension", "Variable", and "Aspect" are used interchangeably to describe a certain dimension of our data. For each dimension, data entries (also called records, entities, etc.) can take any values out of a specified domain. From the actual occurrence of these values in our data we can derive a count table, indicating how often a specific value or set of values occurs. This count table is a basis to derive probability distributions. In the following examples and the rest of this paper I will mostly refer to a probability distribution as relative frequencies, although other measures may also be used, see [26, pp. 103-105].

As a formal example assume a table R, here also called a relation, of cars:

    R \subseteq dom(S) \times dom(X_1) \times dom(X_2)

where X_1, dom(X_1) = {"Ford", "Dodge", "Chevy"}, denotes the variable car type, X_2, dom(X_2) = {"black", "white", "blue", "red"}, the car's color, and S, dom(S) = {1, 2, 3, 4, 5, 6, 7}, the uniquely identifying (nominal) support. The support is just a car number and is ignored for the count table. Note that the distinction between support and variables describes the following functional relationship:

    d : dom(S) \to \times_{i=1}^{n} dom(X_i)

Let R be defined by the tuples in the following table:

    s (No.)   x1 (car-type)   x2 (color)
    1         Ford            black
    2         Dodge           red
    3         Ford            white
    4         Ford            black
    5         Chevy           blue
    6         Dodge           red
    7         Dodge           red

The relation R can also be described by its characteristic function \chi over the support S and the variables X_1, X_2:

    \chi : dom(S) \times dom(X_1) \times dom(X_2) \to \{0, 1\}

    \chi(s, x_1, x_2) :=
    \begin{cases}
      1, & (s, x_1, x_2) \in R \\
      0, & (s, x_1, x_2) \notin R
    \end{cases}                                                        (1)

A count function c over the variable domain is introduced by aggregating over the uniquely identifying support:

    c : dom(X_1) \times dom(X_2) \to \{0, 1, 2, \ldots\}

    c(x_1, x_2) := \sum_{s \in dom(S)} \chi(s, x_1, x_2)               (2)

The relative frequencies f, which will be seen as induced probabilities, are obtained by dividing by the total number of data entries |R|:

    f : dom(X_1) \times dom(X_2) \to [0, 1]

    f(x_1, x_2) := \frac{c(x_1, x_2)}{|R|}                             (3)

Note that not all possible tuples of type and color (x_1, x_2) \in dom(X_1) \times dom(X_2) need to occur in the relation R. Therefore c(x_1, x_2) = f(x_1, x_2) = 0 may hold for some tuples (x_1, x_2). As in the table representation of R, these tuples are omitted from the count table:

    x1 (type)   x2 (color)   c(x1, x2) (count)   f(x1, x2) (probability)
    Ford        black        2                   2/7
    Dodge       red          3                   3/7
    Ford        white        1                   1/7
    Chevy       blue         1                   1/7
    total                    7                   1
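The count function and relative frequencies above are straightforward to compute. The following is a minimal sketch of my own (not part of the original paper) that builds c(x1, x2) of equation (2) and f(x1, x2) of equation (3) for the car relation R with the Python standard library.

```python
# Sketch: count function c and relative frequencies f for the car relation R.
from collections import Counter
from fractions import Fraction

# The relation R as (support, car-type, color) tuples; the support s is only a
# unique identifier and is dropped when counting, as in the text.
R = [(1, "Ford", "black"), (2, "Dodge", "red"), (3, "Ford", "white"),
     (4, "Ford", "black"), (5, "Chevy", "blue"), (6, "Dodge", "red"),
     (7, "Dodge", "red")]

c = Counter((x1, x2) for _, x1, x2 in R)            # c(x1, x2), equation (2)
f = {k: Fraction(v, len(R)) for k, v in c.items()}  # f(x1, x2), equation (3)

for (x1, x2), count in c.items():
    print(f"{x1:6} {x2:6} {count}  {f[(x1, x2)]}")
# Ford   black  2  2/7
# Dodge  red    3  3/7
# ...
```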

Can we have structure in a single dimension? Definitely not if the values in that dimension are randomly distributed. That random distribution would tell us something about this dimension (i.e. that there is no structure) but would leave us with an unstructured mess of values. Thus we associate some structure with a variable if its value distribution allows some predictability, that is, if the value distribution is different from a random distribution. As an example, imagine a distribution where 50% of the cars are red and 50% are black. If red and black are the only values for car colors then this dimension is randomly distributed and doesn't give us any information for prediction. If 90% of the cars are red and only 10% are black, the situation is entirely different. We can find the "structure" that red cars are much more likely to appear than black ones. "Structure" seems to be connected with the distribution of the variable values. You can compare this with fitting a distribution (Normal, Exponential, Gamma, ...) in the continuous case.

Going to two or more dimensions, the relationships between variables are involved. We think of high structure if specific values of one variable mostly appear together with a specific value of another variable. In statistical terms, we say the variables are "correlated"; however, statistical correlation does not work for nominal variables (neither mean, variance, nor covariance are defined on a nominal probability space). Looking at the joint distribution of the variables we see that this "appearing together" again just means a "structured" distribution of values instead of a random distribution. The probability for some value tuples is quite high (for those values which mostly appear together) while other probabilities remain small.


This becomes even more clear if we reduce the distribution again to a one-dimensional case by looking at the conditional distribution f(Y|X = x). We fix one value in dimension X and look at how the values of Y are distributed in this case (see Section 2.3.2). If the resulting distribution is random, then our chosen value in X seems unrelated to the dimension Y. But if the value x mostly occurs with one value in Y, the conditional distribution will be highly structured and the predictive uncertainty low.

When we look for some kind of pattern we often have some entities which have something in common (on which we conditionalize). We want to figure out what else they have in common (what structure there might be in the conditional distribution). Consider this example from programming. In some cases a program returns an error (in the space of program runs this is the first thing they have in common). The programmer then wants to know what else these runs have in common so he can find out what could have triggered the error. If all these runs show a specific and distinct pattern in the input values then the problem might be connected with these inputs.

In this sense the structure within a dataset can be measured by the randomness or uncertainty of its value distribution (or conditional distribution, etc.).

2.2 Information and Uncertainty Measures

In this section some classical information and uncertainty measures based on probability distributions are introduced. As discussed in the last section, structure is connected with the size and randomness of the underlying distribution. The more random a distribution, the higher our uncertainty and the lower the amount of structure. Also, the more different values occur in a distribution, the higher the uncertainty. The first measure introduced, the Hartley information, deals with the general properties of uncertainty depending only on the domain size. The properties of uncertainty for different probability assignments are discussed in connection with Shannon's famous entropy measure. Joint uncertainties and conditional uncertainties are shown as important structure and association indicators. Finally some "distinctiveness" measures between two distributions, useful e.g. for outlier detection and rule finding (3.3.5), are introduced. Altogether these measures build a significant repertoire for evaluating structure and pattern within probability distributions.


2.2.1 Hartley Information

In 1928 Hartley introduced [15] a simple measure of information. When one message is chosen from a finite set of equally likely choices, then the number of possible choices, or any monotonic function of this number, can be regarded as a measure of information. Hartley pointed out that the logarithmic function is the most "natural" measure. It is practically more useful because time, bandwidth, etc. tend to vary linearly with the logarithm of the number of possibilities: adding one relay to a group doubles the number of possible states of the relays. In this sense the logarithm also feels more intuitive as a proper measure: two identical channels should have twice the capacity for transmitting information as one.

Today usually the logarithm with base 2 is chosen and the resulting information units are called binary digits, or bits. Therefore one relay or flip-flop which can be in any of two stable positions holds 1 bit of information. N such devices can store N bits, since the total number of possible states is 2^N and I = \log_2(2^N) = N (adapted from [40]):

    S_n = \{s_1, \ldots, s_n\}, \quad |S_n| = n, \quad \mathcal{S} = \{S_n \mid n = 1, 2, \ldots\}

    I : \mathcal{S} \to [0, \infty)

    I(S_n) := \log_2(|S_n|) = \log_2(n) \text{ bits}                           (4)

where \mathcal{S} is the set of all finite sets of equally likely states. To evaluate the information content of knowing a current state s \in S_n we compare the a priori information (set S_n) with the a posteriori information (set \{s\}). In general our a posteriori information can be any subset U \subseteq S_n. Our knowledge about a particular subset U or state \{s\} is expressed by the conditional Hartley information I(a posteriori set | a priori set):

    I(U \mid S_n) := I(S_n) - I(U) = \log_2\!\left(\frac{|S_n|}{|U|}\right) \text{ bits}        (5)

If U = \{s\} then |U| = 1 and I(U | S_n) = I(S_n). Therefore we usually identify the information of knowing a particular state s \in S_n with the information content of the set S_n.

Note that the information we have (in bits) if we know the state can be interpreted as the uncertainty (in bits) if we don't know the state. In the above example our uncertainty is N bits if we don't know the states of N (binary) relays, and thus we are uncertain about N bits of information.


Sometimes the set of states is partitioned into clusters. Let C_k be a partition of S_n:

    C_k = \{c_1, c_2, \ldots, c_k\}, \quad c_i \subseteq S_n, \quad c_i \cap c_j = \emptyset \ (i \neq j), \quad \bigcup_{j=1}^{k} c_j = S_n

Now two information measures are of interest: I(\{s\} | S_n) = I(S_n), the information content of the state s \in S_n, and I(\{c\} | C_k) = I(C_k), the information content of the cluster c \in C_k. A relative Hartley information of the partition over the overall state set is defined as follows:

    I_{relative}(C_k, S_n) := \frac{\log_2(|C_k|)}{\log_2(|S_n|)} = \frac{\log_2(k)}{\log_2(n)} = \frac{I(C_k)}{I(S_n)}          (6)

It describes the relative information contained in the clustering compared toknowing the states.

The relative information is also used to evaluate the (statistical) significance of distributions. Here the single data entries are the states and their respective variable values are the corresponding clusters. For example, if there are 400 data entries D = \{d_1, \ldots, d_{400}\}, 200 with value v_1 \in V and 200 with value v_2 \in V, then I_{relative}(V, D) = \log_2(2)/\log_2(400) = 0.1157. If there are only two data entries D = \{d_1, d_2\}, one with value v_1 and one with v_2, then I_{relative}(V, D) = \log_2(2)/\log_2(2) = 1, which is the highest possible amount of relative information for a partition. We know that the estimated probabilities \tilde{f}(v_1) = \tilde{f}(v_2) = 0.5 are much more significant in the first case than in the second. Therefore a lower relative information among the values indicates higher significance for a probability approximation.
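As an illustration, here is a small sketch of my own (not from the paper) of the Hartley information of equation (4) and the relative Hartley information of equation (6), reproducing the two significance examples above.

```python
# Sketch: Hartley information and relative Hartley information of a partition.
from math import log2

def hartley(n):
    """I(S_n) = log2(|S_n|) in bits, equation (4)."""
    return log2(n)

def relative_hartley(k, n):
    """I_relative(C_k, S_n) = log2(k) / log2(n), equation (6)."""
    return log2(k) / log2(n)

# 400 data entries partitioned into two equally frequent values:
print(relative_hartley(2, 400))   # ~0.1157 -> the estimate f = 0.5 is well supported
# only two data entries, one per value:
print(relative_hartley(2, 2))     # 1.0     -> the same estimate is poorly supported
```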

Probabilistic interpretation of the Hartley measure: Instead of talking about known or unknown states selected out of a finite set S_n, we can also see the situation from the viewpoint of a random variable X which induces a probability measure F and a probability distribution f on S_n, where \mathcal{P}(S_n) denotes the power set of S_n:

    X : A \to S_n

    F : \mathcal{P}(S_n) \to [0, 1], \quad F(U) = F_A(X^{-1}(U)), \quad U \subseteq S_n \Rightarrow X^{-1}(U) \in \mathcal{A}

    f : S_n \to [0, 1], \quad f(s) = F(\{s\}), \quad s \in S_n

where X is a measurable function from the probability space (A, \mathcal{A}, F_A) to S_n, and F_A is the probability measure defined on the sigma algebra \mathcal{A} over the set A. Further mathematical details will be ignored; the interested reader might refer to any book on probability theory. In the following discussion variables will be seen as equivalent to random variables, and a state set S_n will be denoted as the domain of its corresponding variable X, dom(X) := S_n; x \in dom(X) denotes a state or value of variable X.

The important point is that we can see the Hartley measure as defined on a special case of random variables which assigns equal probabilities to all values:

    f(x) = \frac{1}{|dom(X)|}, \quad x \in dom(X)

where |dom(X)| is the number of values in the domain of X. We can now define the Hartley information again on the space of probability distributions with equal probabilities (denoted P^0):

    P^0 := \left\{ f \;\middle|\; f : S_n \to [0, 1], \; f(s) = \frac{1}{|S_n|} = \frac{1}{n}, \; s \in S_n, \; n = 1, 2, \ldots \right\}

    I : P^0 \to [0, \infty)

    I(X) = I(f(s) \mid s \in S_n) := \log_2(|S_n|) \text{ bits}                (7)

Now Hartley information means the information we have when we know thestate of the random variable X, or the uncertainty if we don't know it.

This different viewpoint may seem not to make sense, as we are still only interested in the size of the domain. I introduced this interpretation to show the context of a more general information measure introduced in 1948 by Shannon. Note, though, that this is only one of many interpretations; the Hartley measure can be seen in this probability context, but by no means implies it. Hartley's measure also has other useful interpretations, e.g. as a nonspecificity measure [28].

2.2.2 Shannon's Entropy

In 1948 Shannon introduced a general uncertainty measure on random variables which takes different probabilities among states into account [40, pp. 392-396]. Today this measure is well known as "Shannon's entropy" [26, pp. 112-116], [28, pp. 153-167], etc. Let X be a random variable and P the space of all finite probability distributions:

    P := \{ f \mid f : dom(X) \to [0, 1], \; x \in dom(X) = \{s_1, \ldots, s_n\}, \; n = 1, 2, \ldots \}

    H : P \to [0, \infty)

    H(X) = H(f(x) \mid x \in dom(X)) := -\sum_{x \in dom(X)} f(x) \log_2 f(x) \text{ bits}       (8)

where dom(X) is the value set of variable X, x \in dom(X) a specific value, and f the probability distribution of X.

The conditional entropy of a variable Y knowing variable X is defined as the average of the entropies of Y for each value x \in dom(X), weighted according to the probability that x occurs:

    H(Y|X) := \sum_{x \in dom(X)} f(x) \cdot \left( -\sum_{y \in dom(Y)} f(y|x) \log_2 f(y|x) \right)        (9)

where f(y|x) denotes the conditional probability of y \in dom(Y) when variable X is in state x. The conditional entropy expresses how uncertain we are of Y on the average when we know X (which could be any of the values x \in dom(X)).

Shannon's entropy is an important measure for evaluating structures and patterns in our data. The lower the entropy (uncertainty), the more structure is already given in the relation. Its usefulness becomes more obvious by looking at some properties of Shannon's entropy:

1. H = 0 if and only if f(s) = 1 for one s \in dom(X) and f(x) = 0 for all other x \in dom(X). This means the entropy H is 0 only if we are certain about the outcome.

2. For any number of states N = |dom(X)| the entropy H is maximal and equal to \log_2(N) if all states x \in dom(X) have equal probability (Hartley information). This is the situation where we have no structure in our distribution and are most uncertain.

3. Any change toward equalization of the probabilities increases H. Themore the states are equally likely to occur the less structure we haveand the higher the uncertainty.


4. The uncertainty of two independent variables (X, Y) is the sum of their respective uncertainties. This conforms with the initial comments we made about the Hartley information measure. Knowing X gives us no information about Y; therefore the conditional entropy of Y knowing X equals the entropy of Y:

    H(X, Y) = H(X) + H(Y), \quad X, Y \text{ independent}

    H_X(Y) = H(Y)

5. The uncertainty of two dependent variables (X, Y) is less than the sum of the individual uncertainties. This is caused by the information (structure) which is given in the correlation of the two variables. Because of the structure relating Y and X, the conditional entropy of Y knowing X is smaller than the 'a priori' entropy of Y:

    H(X, Y) < H(X) + H(Y), \quad X, Y \text{ dependent}

    H_X(Y) < H(Y)

6. The uncertainty of two variables (X, Y) is the sum of the uncertainty of one variable X and the conditional uncertainty of the other variable Y knowing X. This also shows that the uncertainty of a variable Y is never increased by knowledge of X:

    H(X, Y) = H(X) + H_X(Y)

    H_X(Y) \leq H(Y)

In connection with Shannon's entropy several similar measures are defined.

The relative uncertainty of a variable, also called the normalized uncertainty, is obtained by dividing by the maximum uncertainty \log_2(|dom(X)|):

    H_{relative}(X) := \frac{H(X)}{\log_2(|dom(X)|)} = \frac{H(X)}{I(X)}                    (10)
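The following is a hedged sketch of my own (function names and data layout are my choices, not the paper's) showing how Shannon's entropy (8), the conditional entropy (9), and the normalized uncertainty (10) can be computed for the car relation of section 2.1.

```python
# Sketch: Shannon entropy, conditional entropy, and normalized uncertainty.
from collections import Counter
from math import log2

def entropy(dist):
    """H(X) = -sum f(x) log2 f(x) over a value -> probability dictionary."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def conditional_entropy(pairs):
    """H(Y|X) from a list of (x, y) observations, equation (9)."""
    n = len(pairs)
    by_x = Counter(x for x, _ in pairs)
    h = 0.0
    for x, n_x in by_x.items():
        cond = Counter(y for x2, y in pairs if x2 == x)
        h += (n_x / n) * entropy({y: c / n_x for y, c in cond.items()})
    return h

def normalized_uncertainty(dist):
    """H_relative(X) = H(X) / log2(|dom(X)|), assuming dist lists the whole domain."""
    return entropy(dist) / log2(len(dist))

cars = [("Ford", "black"), ("Dodge", "red"), ("Ford", "white"),
        ("Ford", "black"), ("Chevy", "blue"), ("Dodge", "red"), ("Dodge", "red")]
n = len(cars)
type_dist = {t: c / n for t, c in Counter(t for t, _ in cars).items()}
print(entropy(type_dist))                 # uncertainty about the car type
print(normalized_uncertainty(type_dist))  # the same, scaled to [0, 1]
print(conditional_entropy(cars))          # uncertainty about the color once the type is known
```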


2.2.3 Transmission

The strength of the relationship between two variables (in bits) can be measured by the following quantity, known as "information transmission" [28, pg. 164]:

    T(X, Y) = H(X) + H(Y) - H(X, Y)                                            (11)
            = H(X) - H(X|Y)
            = H(Y) - H(Y|X)

From the discussion about Shannon's entropy we know that this measure equals 0 if X and Y are independent. It increases with a stronger relationship. In this sense transmission is a kind of nominal measure of correlation.
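A small sketch of my own (not the paper's code) of the information transmission of equation (11), computed from the relative frequencies of a list of (x, y) observations:

```python
# Sketch: information transmission T(X, Y) = H(X) + H(Y) - H(X, Y).
from collections import Counter
from math import log2

def H(counts):
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def transmission(pairs):
    return (H(Counter(x for x, _ in pairs))
            + H(Counter(y for _, y in pairs))
            - H(Counter(pairs)))

cars = [("Ford", "black"), ("Dodge", "red"), ("Ford", "white"),
        ("Ford", "black"), ("Chevy", "blue"), ("Dodge", "red"), ("Dodge", "red")]
print(transmission(cars))   # > 0: car type and color are not independent here
```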

2.2.4 Cross-Entropy

This measure, also known as directed divergence, measures how well one distribution approximates another distribution. It is used in reconstructability analysis (section 3.3.2) as a distance measure between a reconstructed hypothesis and the original distribution.

    H(f, f_h) = \sum_{x \in dom(X)} f(x) \cdot \log_2\!\left(\frac{f(x)}{f_h(x)}\right)              (12)
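A brief sketch of my own of the directed divergence (12); the distributions below are given as value-to-probability maps, and the uniform hypothesis f_h is only an illustration:

```python
# Sketch: directed divergence between a distribution f and a hypothesis f_h.
from math import log2

def directed_divergence(f, f_h):
    # Terms with f(x) = 0 contribute nothing; f_h(x) = 0 with f(x) > 0 would be infinite.
    return sum(p * log2(p / f_h[x]) for x, p in f.items() if p > 0)

f   = {"Ford": 3/7, "Dodge": 3/7, "Chevy": 1/7}   # observed car-type distribution
f_h = {"Ford": 1/3, "Dodge": 1/3, "Chevy": 1/3}   # hypothesis: uniform over the types
print(directed_divergence(f, f_h))                # 0 only if f_h reproduces f exactly
```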

There are several more interesting measures of structure and information which I will refer to in section 3.

2.3 Finding structure

So far finding structures appears to be easy. One just needs to compute the amount of uncertainty in the value distribution and use it as a measure of the amount of structure. In wide databases the situation is more difficult because finding structures means first finding sets of variables and corresponding values which are highly related and thus have low uncertainty in their distributions. In a huge personal database, for example, there could be a relationship between high income, no kids, and red cars, which wouldn't have been obvious from the entire data set.

For finding these kinds of structures in huge, wide databases there are basically four techniques which can be applied in certain combinations. In the following sections I will discuss these techniques. Note that the first three techniques require only nominal data; though they are also applied in the continuous case, they are unable to find quantitative and ordered relationships, e.g., if X increases in value, then Y increases in value, or quantized relations such as X = 2.3 \cdot Z. Also note that these techniques are presented more as a theoretical framework than for direct practical implementation. The practical use of these techniques will be discussed in the methods overview in chapter 3.

2.3.1 Projection and Extension

To project a dataset means to aggregate it according to a subset of dimensions (formally defined in section 3.3.2). This means that the data is projected onto a given set of dimensions and viewed independently of all other dimensions. Projection is used to find direct relationships in a smaller subset of dimensions. Such a relationship may not be obvious from the whole dataset because of other 'noisy' dimensions. A simple example is a personal database of mood, quality of food, amount of work, and weather conditions with 61 data records. This data is given in the form of the following counts table:

    Mood     Food     Work     Weather   Count   Probability
    bad      bad      few      rainy     9       9/61
    good     good     much     sunny     10      10/61
    good     good     few      sunny     12      12/61
    medium   good     medium   cloudy    13      13/61
    good     medium   medium   sunny     9       9/61
    medium   bad      few      cloudy    2       2/61
    bad      medium   much     rainy     6       6/61
    total                                61      1

A 2-dimensional projection (Mood and Weather) of this data shows a direct relation between these two variables: 'rainy' always comes with 'bad' mood, 'cloudy' with 'medium' and 'sunny' with 'good' mood.

    Mood     Weather   Count   Probability
    bad      rainy     15      15/61
    good     sunny     31      31/61
    medium   cloudy    15      15/61
    total              61      1


This direct relation between mood and weather is easily indicated by uncertainty measures: the relative entropy H_{relative}(Mood, Weather) = 0.4706 is already low, as only 3 of the 9 possible mood-weather combinations occur in the projection. The direct connection between the values becomes obvious by calculating the conditional entropy: H(Mood|Weather) = H(Weather|Mood) = 0; if we know one of the variables, the uncertainty about the other one is 0 in our given table. This shows how entropy measures can be used on projections to indicate inner structures.
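As an illustration (my own sketch, not the paper's implementation), the projection onto (Mood, Weather) amounts to summing the counts over all values of the dropped dimensions:

```python
# Sketch: projecting the 4-dimensional counts table onto the (Mood, Weather) subset.
from collections import Counter

rows = [  # (mood, food, work, weather, count)
    ("bad",    "bad",    "few",    "rainy",   9),
    ("good",   "good",   "much",   "sunny",  10),
    ("good",   "good",   "few",    "sunny",  12),
    ("medium", "good",   "medium", "cloudy", 13),
    ("good",   "medium", "medium", "sunny",   9),
    ("medium", "bad",    "few",    "cloudy",  2),
    ("bad",    "medium", "much",   "rainy",   6),
]

projection = Counter()
for mood, _, _, weather, count in rows:
    projection[(mood, weather)] += count   # aggregate over Food and Work

print(dict(projection))
# {('bad', 'rainy'): 15, ('good', 'sunny'): 31, ('medium', 'cloudy'): 15}
# Only 3 of the 9 possible (mood, weather) combinations occur, and each mood value
# pairs with exactly one weather value, so H(Mood|Weather) = H(Weather|Mood) = 0.
```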

So far we have only dealt with counts and probability distributions as measures for our table (relation). They are important as we use them to measure the amount of structure in our given relationship (table). But we can also have other measures which can be used for further investigation, for example summary statistics like percentiles, averages, minimum, maximum, and standard deviation. These aggregational statistics are defined over some dimensions which, due to a projection, are not shown directly. Every dimension can be folded into a measure. It is just another visualization and viewpoint of the same aspect. Instead of showing all possible combinations for the variable "food" we could aggregate this information into a measure "percentile of good food". This process of interchanging measures with dimensions is described in more detail in [2]. With this new viewpoint, relationships between measures or between dimensions and measures can be investigated.

From our point of view, "folding" a dimension into a measure is a mixture of projection and then creating a new dimension with the aggregational statistics. First we get rid of the "food" dimension by projection and aggregation over its values, then we add the aggregational measure "percentile of good food" to this table. Note that these aggregational measures usually combine several of the original rows (due to the projection) into one value. Therefore a "folded" dimension usually consists of fewer rows.

As described, we can always see our data as a table of dimensions with counts attached, but for understanding different viewpoints of the data the distinction between dimensions and measures makes sense. More on new dimensions in Section 2.3.4.

In statistics projections are known as "marginalization". By projecting our relation (table) to fewer dimensions we marginalize the probability distribution.

In OLAP terminology this is called "pivoting" and is equivalent to the "SELECT x1, x2, ..., COUNT(*) ... GROUP BY x1, x2, ..." statement in SQL.


2.3.2 Subset and Superset

Subsetting of the dataset is another important operation, in which the available data is restricted according to specified conditions. This means that specified values are eliminated from given variables. This allows focusing on "interesting" values. Especially with nominal variables there might be no general linkage between variables but some between specific values.

In a health database of patients we could create a projection of just the variables illness and food to find a linkage between them. But the distribution of this projection might be random. In general there may be no relationship between illness and the kind of food the patient ate prior to his illness. After focusing on one specific illness by subsetting the data, the result can look entirely different. Perhaps all patients with stomach pain ate 'hamburger' before being hospitalized.

The opposite of subsetting is supersetting. It allows going back and seeingthe relation between the unconditioned values. There might be some generalconnection, but the chosen values reveal only random behavior. In this casethe structure lies in some other values, and we want to go back to see thewhole picture.

Subsetting corresponds to "conditionalizing" in statistics. We restrict the values of some variables and look at the "conditional probabilities". In subsetting variables to different sets of values we get several conditional probability distributions which can then be compared. Differences in distributions can be tested by several statistical tests (\chi^2-test, H-test, U-test) [34].

In OLAP terminology this technique is called "slicing and dicing" and corresponds to the "WHERE" clause in SQL; in GSPS terminology this is known as "simplification".
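A small sketch of my own of subsetting: the patient records below are invented purely for illustration, and the filter plays the role of the SQL WHERE clause.

```python
# Sketch: subsetting (conditionalizing) and looking at a conditional distribution.
from collections import Counter

patients = [  # invented illustration data
    {"illness": "stomach pain", "food": "hamburger"},
    {"illness": "stomach pain", "food": "hamburger"},
    {"illness": "flu",          "food": "salad"},
    {"illness": "flu",          "food": "hamburger"},
    {"illness": "headache",     "food": "pizza"},
]

subset = [p for p in patients if p["illness"] == "stomach pain"]   # WHERE illness = ...
conditional = Counter(p["food"] for p in subset)                   # f(food | stomach pain)
print(conditional)
```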

2.3.3 Coarsening and Refining

Coarsening and refining work with the simplification of values and are closely connected with the hierarchies of values mentioned in Section 1.2. This technique is important as we often need to generalize the values to see structure. An example would be the observation that most trees have green leaves (in summer). In reality all these trees show many different kinds of green. If you distinguished 1000 different shades of green you probably would not recognize that these trees have something in common. Only by "generalizing" these 1000 different color shades to one "green" does this pattern become obvious.

Another example is family income. If you distinguish single dollar amounts you are unlikely to find patterns. But by looking at the coarsened values {high, medium, low} income you will probably find a relationship (i.e. low entropy, difference in conditional distributions) to the kind of car people drive.

The opposite of coarsening is refining, which specializes one value into its subcategories and gives rise to inner structure. If I have data on the wavelength of visible light it seems to be randomly distributed over the interval between 4 \cdot 10^{-7} m and 7 \cdot 10^{-7} m (actually sunlight has a peak at about 5 \cdot 10^{-7} m). By "drilling down" visible light into its different colors, an inner relationship becomes apparent: blue light is always around 4.5 \cdot 10^{-7} m, while red light is around 7 \cdot 10^{-7} m.

Coarsening and refining on values is related to projection / extension on dimensions. While coarsening aggregates several values into one, projection aggregates over whole dimensions. The result of coarsening all values into one is the same as eliminating this dimension by projection.

In OLAP terminology this technique is known as "rolling up" and "drilling down".
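A minimal sketch of my own of coarsening through a value hierarchy; the income thresholds are invented and stand in for whatever hierarchy the analyst defines:

```python
# Sketch: coarsening detailed income values onto the states {low, medium, high}.
def coarsen_income(income):
    if income < 30000:        # invented threshold
        return "low"
    elif income < 80000:      # invented threshold
        return "medium"
    return "high"

incomes = [21500, 47200, 112800, 64950, 28300]    # detailed (refined) values
print([coarsen_income(i) for i in incomes])       # ['low', 'medium', 'high', 'medium', 'low']
```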

2.3.4 New dimensions - Meta-dimensions

This technique enables us to add new useful dimensions to our table. These can be "measures" as discussed in section 2.3.1, extracted features, labels of identified clusters, or any kind of function of other variables. This is especially useful for getting other viewpoints of the data. It helps connect the results of other algorithms (e.g. clustering) with the structure-finding techniques discussed previously. For continuous variables linear combinations, logarithms, multiplication of variables, etc. may be useful. For example, regression and neural networks use these transformations for classification (see sections 3.1.2 and 3.1.3).

It should also be mentioned that new dimensions are almost always used for dimensionality reduction. In linear regression n dimensions are replaced with n - 1, n - k or 1 dimension(s), expressed e.g. as linear combinations of the original dimensions. Several dimensions are replaced with fewer new dimensions. This is especially important in the context of the dramatic complexity increase with the number of dimensions (section 1). Furthermore, a primary function is that a new dimension by itself represents dependencies and relations among the old dimensions.


3 Methods

In this section I will present several methods which are used for data mining. Some of them are traditional methods like neural networks (NNs) for classification, clustering for similarity hierarchies, and regression and other statistical methods for modeling. Others are based on GSPS (General Systems Problem Solver) as an overall problem-solving framework for inductive modeling of databases [26]. In particular, reconstructability analysis for finding simpler overall models of the database, mask analysis for investigating the behavior of a system on an ordered support (time, etc.), and DEEP for determining high local structure are GSPS methods. Methods derived from other fields include decision trees and rule inference.

Although I am concentrating in this paper on nominal data and unsupervised methods, I will first present some of the classical approaches which are supervised and use continuous, ordered data, because they are historical methods and because it is interesting to see their connection and ideas for data mining. Moreover, the unsupervised, nominal structure-finding problem is more understandable if it is contrasted with methods using a different approach.

Note that "Genetic Algorithms" (GAs) are also often mentioned in this context of structure finding. But GAs are not directly data-mining methods, though they are often used to search and optimize huge search spaces (e.g. spaces of models). In combination with other methods this can be a very useful approach. See [12, 35] for more information.

One other process applicable to all domains is "goodness of fit". Using a simple chi-square (\chi^2) statistic, a K-S test, etc., the quality of a model compared to test data can be evaluated; the expected values of the model, e.g. e_{ijk} for the input variables i, j and k, are compared with the actually occurring values, e.g. o_{ijk} [34, pp. 294-300, 216-217].
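As a sketch of my own (not the paper's code), a Pearson chi-square goodness-of-fit statistic compares the model's expected counts with the observed counts; here the observed counts are the (Mood, Weather) projection of section 2.3.1 and the model is a uniform hypothesis chosen purely for illustration:

```python
# Sketch: Pearson chi-square goodness-of-fit statistic sum((o - e)^2 / e).
def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [15, 31, 15]          # counts of the occurring (mood, weather) combinations
expected = [61 / 3] * 3          # model: the three combinations are equally likely
print(chi_square(observed, expected))   # large values indicate a poor fit
```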

In the following sections I will first introduce, discuss and relate the problem contexts of the different approaches: supervised and unsupervised with scalar, ordinal or nominal data. Then, after a short overview of each method, I will show how they relate to the four basic techniques I introduced in Section 2.3.

Note that all these methods use slightly different notation, though I tried to standardize it, and traditionally they use different labels for similar concepts. In NN terminology the investigated database is referred to as part training set and part cross-validation set. In other situations I might write about a relation or the counts table. All these labels denote the same data set, although sometimes from different viewpoints. In the case of a training set we are interested neither in counts nor in induced probabilities of our relation (section 2.1); the data with the correct classification matters, as NNs usually ignore the significance (count) of specific value combinations. In other methods I refer to counts and probabilities, and the reader will understand the appropriate meaning in that context. Single occurring value combinations in the relation will be referred to as entities, rows, data records, etc.

The introduced methods can be categorized as follows:

    Type / Data    Scalar                       Ordinal         Nominal
    supervised     Fisher, linear classifiers   Logit-model     Decision Trees
                   Logistic Regression
                   Neural Networks
    unsupervised   Clustering                   Mask-Analysis   ANOVA
                                                                Reconstructability Analysis
                                                                DEEP
                                                                Log-linear models
                                                                Rule Induction
                                                                Goodness of fit

3.1 Supervised, Scalar methods

In this section we will see that supervised methods on continuous data are only somewhat connected to our four basic techniques, mostly just by creating new dimensions. Here I want to discuss what makes these methods different, so the reader might concentrate on these differences in the following method introductions.

Supervised methods have the purpose of classification and prediction: known, correctly classified data is used to derive or train a model which is then tested on different data for classification. Though we also try to find predictive patterns with other methods, we don't just look for a model connecting some input variables with a classification output and use a training set for "supervising" the model. Some problems that can arise with supervised methods will be discussed in section 4. On the other hand, supervised methods have the advantage that you normally know more precisely what you are looking for. You have data and classifications, and the aim is to find an adequate mapping which supports the structure of the problem but avoids overfitting the training set.

Another big di�erence is that all the presented methods make use of theordering of the variables. Several variables are aggregated by the proposedmodel which transforms the input-variables into an output. This model isoften a simple (continuous) function (i.e. linear), or consists of an iterativeconnecting of inputs (i.e. Neural Networks). The output (the new dimension)is either direct a classi�cation, a value used for classi�cation (i.e. Fisher) orsome other \useful" information (i.e. log-odds-ratio in logistic Regression).

By linking the variables in a continuous function with its result used forclassi�cation we create a \decision surface" in the input variable space. Thepurpose of the described methods is to adjust the proposed model parameterssuch that this surface re ects the actual classes as accurately as possible. Inthe simplest case (presented in the �rst section with Fisher's method and thePerceptron) this surface is just a hyperplane which separates the two possibleclassi�cations.

Note that all these methods rely on an implicit assumption for the decision surface to work: entities "close" to each other according to the ordering of the dimensions are assumed to be also "close" in their classification. In a "chaotic" region (i.e. where the classification is very sensitive to small variations in the input variables) these methods (but probably also all other methods) are unable to find correct classifications. This is one of the reasons why weather forecasts are so uncertain in some specific (but surely not all) situations.

This whole concept of a "decision surface" does not make sense if we have nominal (i.e. not ordered) data. A "surface" always assumes that in any dimension data points on one side are "larger" in value and on the other side "smaller" in value. It is important to understand that this is not the case for nominal data. Weighted combinations of variables are only possible in the sense of logic functions. So in the case of nominal data we only deal with decision (hyper-) points, or planes in the sense of projected data in which some variables are ignored as irrelevant.

This should be enough for a brief discussion before presenting some basic supervised methods on ordered data. The issues of nominal data, finding decision hyper-points, and how nominal variables can be combined in logic functions will be extended in the following sections.


3.1.1 Fisher / Perceptron (NN) - linear classifiers

Fisher's method [10], [20, pp. 470] is a supervised method for the classification problem into two classes using the knowledge of some continuous input variables. It corresponds to the historic Neural Network "Perceptron" [37] (also known as "Linear Machine") but uses a different approach for "learning" the model.

Let $\vec{X} = [X_1, X_2, \ldots, X_p]$ be the $p$ known continuous input variables, and $Y$ the classification variable with two states. Fisher's method assumes that the conditional probability distributions are normally distributed with the same (invertible) covariance matrix:

$$f(\vec{X} \mid Y=0) \sim N(\vec{\mu}_0, \Sigma), \qquad E(\vec{X} \mid Y=0) = \vec{\mu}_0$$

$$f(\vec{X} \mid Y=1) \sim N(\vec{\mu}_1, \Sigma), \qquad E(\vec{X} \mid Y=1) = \vec{\mu}_1$$

The normal distribution assumption guarantees an optimal fit of the model, but is not necessary, whereas the assumption of equal covariance matrices can be critical for Fisher's model calculation.

Now the aim of the method is to find a linear combination of the variables $X_i$ which is best able to discriminate the two classes of $Y$. Formally, we are looking for a vector $l$:

$$W = l^T \vec{X}$$

such that

$$\frac{(E(W \mid Y=1) - E(W \mid Y=0))^2}{VAR(W)}$$

is maximal. This vector $l$ defines the best linear combination of the $X_i$ variables for the scalar (univariate) random variable $W$ to discriminate between the two classes (the difference between the conditional expected values is maximal with respect to its variance). The expected values and variance are:

$$\mu_{w0} := E(W \mid Y=0) = l^T \vec{\mu}_0$$

$$\mu_{w1} := E(W \mid Y=1) = l^T \vec{\mu}_1$$

$$VAR(W) = l^T \Sigma\, l$$

For classifying a new observation $\vec{x}^* \in dom(\vec{X})$ we just multiply it with $l$ to calculate its linear combination (the value of the random variable $W$).


Then we classify according to this value. If it is closer to the mean $\mu_{w0}$ then we classify it as class 0, otherwise as class 1. Let $m := \frac{\mu_{w0} + \mu_{w1}}{2}$, and assume $\mu_{w1} > \mu_{w0}$; then the classification for $\vec{x}^*$ is:

$$classify(\vec{x}^*) := \begin{cases} 1, & l^T \vec{x}^* > m \\ 0, & l^T \vec{x}^* \le m \end{cases} \qquad (13)$$

The remaining question is how to train the model, i.e. how to obtain the vector $l$. Here Fisher's Method and the Perceptron differ in their approaches.

If both populations $f(\vec{X} \mid Y=0)$ and $f(\vec{X} \mid Y=1)$ have the same covariance matrix $\Sigma$, then:

$$l^T = (\vec{\mu}_1 - \vec{\mu}_0)^T \Sigma^{-1} \qquad (14)$$

maximizes the differences of the expected values relative to the variance of $W$. From the generalized Cauchy-Schwarz inequality:

$$(l^T \delta)^2 \le (l^T \Sigma\, l)(\delta^T \Sigma^{-1} \delta)$$

it follows (letting $\delta := \vec{\mu}_1 - \vec{\mu}_0$):

$$\frac{(E(W \mid Y=1) - E(W \mid Y=0))^2}{VAR(W)} = \frac{(l^T \vec{\mu}_1 - l^T \vec{\mu}_0)^2}{l^T \Sigma\, l} = \frac{(l^T \delta)^2}{l^T \Sigma\, l} \le \delta^T \Sigma^{-1} \delta$$

Substituting $l^T = (\vec{\mu}_1 - \vec{\mu}_0)^T \Sigma^{-1}$ we reach this maximum:

$$\frac{(l^T \delta)^2}{l^T \Sigma\, l} = \frac{((\vec{\mu}_1 - \vec{\mu}_0)^T \Sigma^{-1} \delta)^2}{(\vec{\mu}_1 - \vec{\mu}_0)^T \Sigma^{-1} \Sigma\, \Sigma^{-1} (\vec{\mu}_1 - \vec{\mu}_0)} = \frac{(\delta^T \Sigma^{-1} \delta)^2}{\delta^T \Sigma^{-1} \delta} = \delta^T \Sigma^{-1} \delta$$

Fisher's Method uses this result to derive the statistically "optimal" model. The covariance matrix $\Sigma$ and the means $\vec{\mu}_0$ and $\vec{\mu}_1$ are obtained via unbiased estimators [20, pp. 474]. In this case the random variable $W$ is also known as "Fisher's Sample Linear Discrimination Function".
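As a concrete illustration, the following is a minimal Python sketch of this estimation and classification step, assuming the training entities of the two classes are given as the arrays X0 and X1; the function names and the use of NumPy are my own choices, not part of Fisher's original formulation.

import numpy as np

def fit_fisher(X0, X1):
    # Estimate Fisher's linear discriminant from two samples of shape (n0, p) and (n1, p).
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled unbiased estimate of the common covariance matrix Sigma.
    n0, n1 = len(X0), len(X1)
    Sigma = ((n0 - 1) * np.cov(X0, rowvar=False) + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    # l^T = (mu1 - mu0)^T Sigma^{-1}
    l = np.linalg.solve(Sigma, mu1 - mu0)
    m = 0.5 * (l @ mu0 + l @ mu1)        # midpoint between the projected class means
    return l, m

def classify_fisher(x, l, m):
    # Classification rule (13): class 1 if l^T x > m, else class 0 (assuming mu_w1 > mu_w0).
    return 1 if l @ x > m else 0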

While Fisher's Method is grounded in statistical inference, the Perceptron doesn't make any assumptions about the distributions. The Perceptron with initial random weights (the vector $l$) is sequentially given training data for classification. Whenever the classification is wrong, $l$ is adjusted by adding or subtracting (depending on the right classification) a learning parameter $\eta$ times the input vector. It is known that the "weight vector" $l$ of the Perceptron will converge [37].
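A minimal sketch of this error-correcting learning rule is given below; the bias term is omitted for simplicity, and the stopping criterion and parameter values are illustrative assumptions of mine.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    # X has shape (n, p), y holds 0/1 labels; l is corrected by +/- eta * x after each error.
    rng = np.random.default_rng(0)
    l = rng.normal(size=X.shape[1])          # initial random weights
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            pred = 1 if l @ x > 0 else 0
            if pred != target:
                l += eta * x if target == 1 else -eta * x
                errors += 1
        if errors == 0:                       # converged on the training set
            break
    return l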

How do these methods relate to the presented basic structure-finding techniques? First of all they require continuous variables and don't work with nominal data. But as they are famous historical methods I wanted to present them for completeness and for understanding of the structure-finding problem. They are also good examples for using the technique of creating new dimensions.

In this case the new dimension is computed by a linear combination of the other dimensions. The aim is to select this dimension according to the model such that the conditional distributions $f(\vec{X} \mid Y=0)$ and $f(\vec{X} \mid Y=1)$ are as different as possible. Using these linear methods means that we are looking for an $(n-1)$-dimensional hyperplane in the $n$-dimensional variable space for separating our two classes; we are looking for one new dimension which can then be projected and conditionalized for classifying new, unknown data entities.

3.1.2 Logistic Regression

"Logistic regression", like Fisher's Method and the Perceptron (Section 3.1.1), is a supervised method for the two-class classification problem [16]. Though a different model is used, it can be shown that logistic discrimination and Fisher discrimination are the same when sampling from multivariate distributions with common covariance matrices [17].

Logistic regression tries to model the (logarithmic) odds-ratio for the classification (variable $Y$) as a linear function of the $p$ "input" variables $\vec{X} = \{X_1, X_2, \ldots, X_p\}$; $\vec{\beta}$ is the $(p+1)$-dimensional coefficient vector:

$$\log\left[\frac{f(Y=1 \mid \vec{X})}{f(Y=0 \mid \vec{X})}\right] = \beta_0 + X_1\beta_1 + \ldots + X_p\beta_p = \beta_0 + \vec{X}^T\vec{\beta} \qquad (15)$$

The odds-ratio is the factor by which the event $(Y=1)$ is more likely to happen than the event $(Y=0)$ given the knowledge of $\vec{X}$. By taking the logarithm we map the values $(0, \infty)$ to $(-\infty, \infty)$. As:

$$f(Y=1 \mid \vec{X}) > f(Y=0 \mid \vec{X}) \iff \frac{f(Y=1 \mid \vec{X})}{f(Y=0 \mid \vec{X})} > 1 \iff \log\left[\frac{f(Y=1 \mid \vec{X})}{f(Y=0 \mid \vec{X})}\right] > 0$$


we can see the similarity to Fisher's method and the Perceptron in classifying. In logistic discrimination the log-odds-ratio of the conditional classification, and therefore indirectly the conditional probabilities $f(Y=1 \mid \vec{X})$ and $f(Y=0 \mid \vec{X})$, are modeled. For classification purposes we just need to know which of the probabilities is the higher one. This means our decision surface reduces to:

$$w := \beta_0 + X_1\beta_1 + \ldots + X_p\beta_p \quad \begin{cases} > 0 \implies \text{classify } 1 \\ \le 0 \implies \text{classify } 0 \end{cases}$$

which is the same $(n-1)$-dimensional hyperplane as used by the linear classifiers. Actually we can use all different kinds of functions to model the logarithmic odds-ratio. We could also weight our classification in such a way that we only classify something as "1" if the probability for this event is higher than some given probability $p$. This just means changing 0 to a different value in the above formula.

In standard logistic regression the model parameters $\beta_i$ are obtained via maximum likelihood estimators. By transforming model (15) for the log-odds-ratio we get (using $f(Y=1 \mid \vec{X}) = 1 - f(Y=0 \mid \vec{X})$):

$$\pi(\vec{X}) := f(Y=1 \mid \vec{X}) = \frac{\exp(\beta_0 + X_1\beta_1 + \ldots + X_p\beta_p)}{1 + \exp(\beta_0 + X_1\beta_1 + \ldots + X_p\beta_p)} \qquad (16)$$

Assuming that all data entities are independent, the joint probability distribution $P$ of our $n$ training entities is the product of the individual distributions. Let $(\vec{x}_i, y_i), 1 \le i \le n$ be a training tuple from the data set, where $\vec{x}_i \in dom(\vec{X})$ are the values of the $p$ input variables and $y_i \in dom(Y) = \{0, 1\}$ is the corresponding correct classification:

$$\prod_{i=1}^{n} \pi(\vec{x}_i)^{y_i} \cdot (1 - \pi(\vec{x}_i))^{(1-y_i)} =: P\left( (\vec{x}_i, y_i)_{i=1,\ldots,n} ; \vec{\beta} \right)$$

This joint distribution $P\left( (\vec{x}_i, y_i)_{i=1,\ldots,n} ; \vec{\beta} \right)$ depends on the model parameters $\vec{\beta} = (\beta_0, \ldots, \beta_p)$ and on our training set $(\vec{x}_i, y_i)_{i=1,\ldots,n}$, $\vec{x}_i = (x_{i1}, \ldots, x_{ip}) \in dom(\vec{X})$. Given our training data we want to adjust the model parameters $\vec{\beta}$ such that the joint probability is maximized (maximum likelihood for our training data to occur). This is done by basic mathematics for finding the maxima of a function (differentiating, etc., or numerically for more difficult model functions).
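As an illustration, the following sketch maximizes this joint probability (via its logarithm) by simple gradient ascent; the learning rate and step count are arbitrary assumptions of mine, and in practice more refined numerical procedures are used.

import numpy as np

def fit_logistic(X, y, lr=0.1, steps=5000):
    # Maximize the log-likelihood of model (16); X is (n, p), y holds 0/1 labels.
    Xb = np.hstack([np.ones((len(X), 1)), X])     # prepend a column of ones for beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        pi = 1.0 / (1.0 + np.exp(-(Xb @ beta)))   # pi(x_i) = f(Y=1 | x_i)
        gradient = Xb.T @ (y - pi)                # derivative of the log-likelihood
        beta += lr * gradient / len(X)
    return beta                                    # (beta_0, beta_1, ..., beta_p)

def classify_logistic(x, beta):
    # Decision rule: classify 1 if the modeled log-odds beta_0 + x^T beta are positive.
    return 1 if beta[0] + x @ beta[1:] > 0 else 0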


3.1.3 Neural Networks

Although there are many different neural networks for classification, clustering, and modeling, the most popular one is by far the Multi-Layer Perceptron (MLP) for classification using the "backpropagation" learning algorithm. In this short introduction I will concentrate my attention on the history and ideas of this method.

In general the concept of "Neural Networks" just means using the metaphor of interacting neurons. Each neuron is a relatively simple structure, computing some kind of function from its inputs and delivering the result as an output. In a network several neurons are connected, each one using outputs of other neurons for its inputs. Normally there are also some general inputs from outside of the NN-system, and outputs of some neurons are also used as general outputs. In the example of classification the observed data may be the input to the system and the classification category the output.

The metaphor and idea of Neural Networks was introduced in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts in connection with brain research [33]. The brain is viewed as consisting of billions of interacting neurons. General inputs to the brain are delivered by the senses: seeing, hearing, feeling, etc. Actions and decisions can then be seen as outputs of the neural network "brain".

All the interactions between neurons can make a neural network a fairly complex system. The advantage of this is that NN are known to be "universal classifiers". This means that in theory they can approximate every classification function as closely as required. But this complexity also has a significant disadvantage. Using NN for classification it is often not comprehensible how the network came up with its decision, or what the important evaluated classification criteria are. For most business decisions this is unacceptable. Often the only way to get a "feeling" for the classification process lies in doing a sensitivity analysis.

For practical computing purposes the connectivity of neurons is usually restricted. To avoid "circles of dependencies" (feedbacks), where the output from one neuron goes through several other neurons but then ends up again as an input for itself, neurons are organized in "layers". The outputs of neurons in one layer are only used as inputs for the following layer (also called feedforward).


From Perceptrons to Multi-Layer Perceptrons (MLP): A very good and mathematically founded introduction to the following historical methods is given in [37].

In Section 3.1.1 we already saw the simplest and original case of an artificial NN, the perceptron, which is viewed as one single neuron. The perceptron receives a vector of inputs, multiplies it with a weight vector, and uses this linear combination to give a classification output of either 0 or 1. In the metaphor of NN the weights just correspond to the strength of each input connection. During "training" of the network the strengths of the connections are adjusted according to how useful and how "right" they are in doing the correct classification.

As we know from section 3.1.1, perceptrons are limited in changing their decision surface: all "adaptations" of the model still result in a hyperplane. They are also limited in the number of classes, in that only two classes can be distinguished. The second problem was overcome just by using one perceptron for each class with the weighted sums as output. A single neuron in a second layer then made the classification according to the perceptron with the highest output. More flexible decision surfaces could be reached by not only giving the original variables (in NN often called features) as inputs to the first layer, but by also using the products $X_i \cdot X_j,\; i, j = 1, \ldots, n$ of the $n$ input variables $\vec{X} = (X_1, \ldots, X_n)$, allowing parabolic decision surfaces.

Creating all these new features can also be seen as another preceding layer of neurons. This NN is then called a "Quadratic Machine". We could proceed in this way, creating more and more features out of our variables and gaining the ability to create more and more flexible decision surfaces. The only serious problem is that we get an enormous number of inputs and neurons. This alone is a computational and a storage problem. But the real problem is that the training requirements also increase drastically: all the new connections need to be trained. One heuristic for a good-sized training set is that it should contain about 10 times as many training entities as there are connections in the network.

Backpropagation MLP: A different approach would be not to precalculate all the different variable combinations, but instead let the network "learn" which connections are important. We would do this by introducing more layers (one is actually enough), each extracting decision-making features from the previous layer. Here the problem is how to train the first layers of a multilayer network. The correct classification for the training data is only available for the last layer, the output classification layer. One solution out of this dilemma is the famous "backpropagation" algorithm.

The "backpropagation" algorithm is based on continuous and differentiable transformation functions for each neuron in a network with several layers (MLP). Each neuron takes the weighted sum of its inputs (which is continuous and differentiable)

$$S_i = \sum_j w_{ij}\, out_j \qquad (17)$$

where $out_j$ is the output of neuron $j$ in the previous layer and $w_{ij}$ is the weight from neuron $j$ to neuron $i$, or the strength of this connection. To this sum a transfer function is applied to get an output in the interval $[0, 1]$ or $[-1, 1]$. The following transfer functions are popular:

Sigmoid function:

$$out_i = f(S_i) = \frac{1}{1 + e^{-S_i / T}} \qquad (18)$$

where $T$ is called the "Temperature" of the neuron. The higher $T$, the smoother the function is between 0 and 1. As $T$ approaches 0 the transfer function approaches a step function and therefore the classical case of a perceptron.

Hyperbolic tangent:

$$out_i = \tanh(S_i) = \frac{1 - e^{-2S_i}}{1 + e^{-2S_i}} \qquad (19)$$

which is an antisymmetric function.

Because the $out_i$ are differentiable we are able to "backpropagate" the error. In the reverse order of classification, each neuron receives an error feedback from the neurons of the following layer about the partial "fault" it has in a misclassification. According to this information each neuron can train and adjust its weights. More detailed information on the background and mathematics of the backpropagation algorithm can be found in [43, pp. 87-95], [48, pp. 122-133], or nearly any other book on NN.
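The following is a minimal sketch of a one-hidden-layer MLP trained in this way on a two-class problem; the squared-error loss, network size, and learning rate are my own illustrative assumptions rather than part of any particular reference.

import numpy as np

def sigmoid(s, T=1.0):
    # Transfer function (18); T is the "temperature" of the neuron.
    return 1.0 / (1.0 + np.exp(-s / T))

def train_mlp(X, y, hidden=5, eta=0.5, epochs=2000, seed=0):
    # One-hidden-layer MLP trained by backpropagation; X is (n, p), y holds 0/1 targets.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))   # input -> hidden weights
    W2 = rng.normal(scale=0.5, size=hidden)                  # hidden -> output weights
    for _ in range(epochs):
        for x, t in zip(X, y):
            h = sigmoid(x @ W1)               # forward pass: hidden layer outputs
            out = sigmoid(h @ W2)             # network output in [0, 1]
            # backpropagate the error: each weight receives its share of the "fault"
            delta_out = (out - t) * out * (1 - out)
            delta_hid = delta_out * W2 * h * (1 - h)
            W2 -= eta * delta_out * h
            W1 -= eta * np.outer(x, delta_hid)
    return W1, W2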


Discussion: Relating the Multi-Layer Perceptron (MLP) back to the basic techniques, we see that MLPs iterate the process of creating functional dimensions of the given variables. The hope is to end up with some final dimensions which are strongly correlated with our desired classification. Or in NN words: the focus is on final dimensions which transform the input dimensions sufficiently to approximate the unknown decision surface.

Other NN methods include Radial Basis Functions (RBF) for similarity-based classification, Adaptive Resonance Theory (ART) for clustering, Kohonen Networks (self-organizing maps), Hopfield Networks, and many more. They are all based on the idea that neurons create functional new dimensions whose output is then used by other neurons. See [43, 48, 49] for more details and references about NN.

There are also other techniques for improving the use of neural networks and increasing their adaptability while keeping their complexity (number of neurons) low. One of them is called "Boosting". This meta-learning technique weights the training data. In the beginning we use equal weights and obtain one hypothesis (NN-structure) by training. Now the weights of training samples which are classified wrongly are increased, and those of correctly classified samples decreased. A second NN is trained, and a linear combination of their results (voting) is used for classification. The weights of the training set are adjusted again and we continue creating more NN hypotheses.
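A rough sketch of this reweighting idea is given below; the concrete weight-update factors and the simple majority vote are illustrative assumptions and not the exact formulas of any specific boosting algorithm, and train_fn stands for any weight-aware training routine (e.g. an NN trainer).

import numpy as np

def boost(train_fn, X, y, rounds=5):
    # train_fn(X, y, weights) is assumed to return a classifier h with h(x) in {0, 1}.
    n = len(X)
    weights = np.ones(n) / n
    hypotheses = []
    for _ in range(rounds):
        h = train_fn(X, y, weights)
        wrong = np.array([h(x) != t for x, t in zip(X, y)])
        hypotheses.append(h)
        # increase the weights of wrongly classified samples, decrease the others
        weights *= np.where(wrong, 2.0, 0.5)
        weights /= weights.sum()
    def vote(x):
        # linear combination (here: simple majority vote) of all trained hypotheses
        return int(np.mean([h(x) for h in hypotheses]) > 0.5)
    return vote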

3.2 Unsupervised, Ordered Methods

In this section two methods for unsupervised structure-finding are introduced. The first, Clustering, requires some kind of distance measure between the variables for finding similar entities, which are then grouped into "clusters". The second, mask analysis, needs an ordered support (time, 1-dimensional space) over which it aims to detect behavior patterns in the other variables.

3.2.1 Clustering

Clustering describes a collection of unsupervised methods whose aim is to partition an overall data set into a significantly smaller number of "clusters". These methods in general require some kind of distance measure among the data entities in order to group them together and identify each data entity with one cluster.


Similarity Clustering: Most clustering algorithms partition the data based on how similar individual records are; the more similar, the more likely that they belong to the same cluster. Their main purpose is to identify clusters which maximize the inter-cluster distance and minimize the intra-cluster distance, so that we obtain clearly distinct groups of similar entities. This grouping introduces a "natural" unsupervised classification schema based on similarities according to the given distance measure.

Creating unsupervised classification schemas is also an important part of the human recognition process. Humans always group new things and assign these groups natural language labels: the group of trees, houses, cars, clouds, etc. These labels are abstractions which identify specific sets of entities that are similar in some aspects while other aspects are unimportant. The shape and function are important characteristics of a house or car, while the color is pretty much unimportant. In the case of trees, color has a higher importance. This also implies that different distance measures are needed for different "classification schemas".

The information which is created as a new "natural" classification schema is important knowledge which can be added to our relation as a new dimension. The new dimension contains knowledge based on similarities in the chosen distance measure. The "right" choice of the distance measure is very important and carries the implicit assumption that the induced similarities are meaningful for classifications.

If we also deal with some externally provided classifications of our data, then the overlap between the new "natural" classification and the given classification is of interest. Projections onto these dimensions and comparison of their respective distributions can be used to investigate common properties. Iteratively trying to create clusters that are close to the given classification schema is also known as "supervised clustering".

As clusters are identified as distinct groups, the different structural properties among the clusters can be investigated in general. This means that all the different supervised and unsupervised methods can be applied separately to each cluster. For the supervised classification problem this may help in building more accurate models and predictions. For unsupervised structure-finding, significant differences between identified patterns can be investigated. In both cases clustering can give insights into the structural relationships among "natural" classifications.


Algorithms: The most famous algorithm from this group is probably the K-means algorithm, where K is a predefined number of clusters [18]. The algorithm starts out by randomly associating data entities with one of the K clusters. Then the algorithm loops through the following steps until it converges:

1. Calculate the center data-point (mean) for each cluster.

2. For each data record compute the distance measure to each of the K cluster centers. Associate the record with the closest cluster.

Note that the K-means algorithm does not work for nominal data, as `mean' is not defined on nominal or even ordinal data. A variation of the K-means algorithm is the "K-modes" algorithm. Instead of calculating in step 1 the mean for each cluster, the mode of the cluster distribution is used (the mode is defined as the most frequently occurring value). The problem is that for a reasonable mode many data entities distributed over few values are needed.
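A minimal sketch of the K-modes idea for nominal data might look as follows, assuming the records are tuples of nominal values and using the number of mismatching variables as the distance measure (both choices are mine, for illustration).

from collections import Counter
import random

def k_modes(records, k, iterations=20, seed=0):
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in records]       # step 0: random initial clusters
    centers = []
    for _ in range(iterations):
        centers = []
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c] or records
            # step 1: the "center" of a cluster is the per-variable mode of its members
            centers.append(tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members)))
        # step 2: reassign each record to the closest center (number of mismatching values)
        new_assignment = [min(range(k),
                              key=lambda c: sum(v != w for v, w in zip(r, centers[c])))
                          for r in records]
        if new_assignment == assignment:                    # converged
            break
        assignment = new_assignment
    return assignment, centers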

For clustering on both nominal and continuous data the "K-prototypes" algorithm has been proposed, which uses modes for nominal and ordinal variables and means for continuous variables. Note that for defining the prototype (center) we only need to consider variables that are used by our given distance measure. If the distance measure is defined purely on nominal data then the K-modes algorithm is sufficient; the K-means algorithm is sufficient if only continuous variables are evaluated for the distance.

Other methods for similarity clustering include hierarchical clustering, link clustering, nearest-neighbor clustering, fuzzy clustering, and Kohonen's self-organizing maps.

Other clustering: The "K-center" algorithm follows a radically different approach. Instead of dividing the data into similarity groups it tries to separate a group of K data points from the rest of the data (2 clusters). These K records are chosen as the most representative records within the data, which means that the distance from any record to its closest "representative" is minimized. The resulting K centers can then be used for showing the diversity of the data. Also, representative neurons for Radial Basis Functions can be created using this algorithm.

Discussion: Similarity-based clustering has useful applications in providing "natural" and unsupervised classification schemas for our data. Other algorithms help us identify representative points within the data, e.g. the K-center algorithm. In general, clustering methods are mainly based on the distance measure among the variables. Mathematical algorithms use these distances to group data entities into similarity or representative clusters. Therefore there is no direct linkage to most basic techniques, but many similarities can be observed: the field selection for input to the metric is projection; calculating the distances is creating a new meta-dimension; each cluster is a subset; the induced partition is coarsening; and the derived labels on the clusters can act as a new dimension. For "supervised clustering", projection and subsetting are used to relate the obtained clusters to the supervised classification. Furthermore many other methods can be applied and use this new dimension.

3.2.2 Mask Analysis

The aim of this method is the investigation of predictive behavior patterns over an ordered (sometimes also partially ordered) support, most often time. For example, with knowledge of the weather yesterday and today, can we predict the weather tomorrow? Mask analysis essentially tries to model the behavior of discrete variables and even nominal variables in a similar way as differential equations model the space and time behavior of continuous variables [5]. Examples of famous (partial) differential equations are the wave equation (hyperbolic PDE) and the heat equation (parabolic PDE), which are used to model many time, space, and functional relationships.

Whereas differential equations relate the derivatives over some support dimensions (time or space derivatives) to the current function output (variable values), mask analysis looks at "earlier" support instances of its variables (previous data entries). Comparing numerical methods for differential equations to mask analysis is even more striking. These numerical methods also use the previous time instance for first-order differential equations, the previous two instances for second-order equations, etc.

Focusing on predictive patterns among the data entries, it is clear that this method requires an ordered support: the data entries need to have some kind of relationship to each other. The most common ordered support is time: data entry $V = \vec{v}_1$ occurred before $V = \vec{v}_2$; $V = \vec{v}_2$ in turn happened before $V = \vec{v}_3$, etc.

Other than an ordered support, mask analysis has no requirements. Allvariables can be nominal, ordinal or discretized continuous.


Because of the ordered relationship between data records, we don't represent the data in a counts table or probability distribution, as this would ignore the ordering. For mask analysis we start out with the whole data table in its sequential ordering. Then, instead of looking at each data instance (entity) individually, we put a mask over several connected support (time) instances of data. So for some variables we also look at "previous" instances in the data. By doing this we actually create new dimensions, called sampling variables, which represent the state of variables at a previous support (time) instance. One of the main questions in mask analysis is how far "back" to go, called the mask depth, and which variables need to be included as new dimensions. The final purpose is predictability of the current data instances, called generated variables, conditionally on the created new dimensions, called generating variables.

For an example, imagine an original data table with three variables $\{x_1, x_2, x_3\}$. A mask can be represented by a new set of dimensions, e.g.

$$M := \{ x_1^{(t-2)}, x_3^{(t-2)}, x_1^{(t-1)}, x_2^{(t-1)}, x_3^{(t-1)}, x_1^{(t)}, x_2^{(t)}, x_3^{(t)} \}$$

where $(t)$ refers to data in the $t$-th (= current) support instance, $(t-1)$ refers to the previous one, etc. In this new model space of variables we induce a counts table and a probability distribution (section 2.1).
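To illustrate, the following sketch builds the sampling variables defined by such a mask from a sequentially ordered data table; representing the mask as (lag, variable) pairs is my own choice for this example, not part of the formal definition.

from collections import Counter

def apply_mask(table, mask):
    # table: list of records ordered by the support (e.g. time), each a dict variable -> value.
    # mask:  list of (lag, variable) pairs, e.g. for the mask M above:
    #        [(2, 'x1'), (2, 'x3'), (1, 'x1'), (1, 'x2'), (1, 'x3'), (0, 'x1'), (0, 'x2'), (0, 'x3')]
    depth = max(lag for lag, _ in mask)       # how far "back" the mask reaches (mask depth)
    masked = []
    for t in range(depth, len(table)):
        masked.append(tuple(table[t - lag][var] for lag, var in mask))
    return masked

# From the masked records a counts table over the sampling variables can be induced
# exactly as in section 2.1, e.g.:  counts = Counter(apply_mask(table, mask))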

From this probability distribution, a conditional probability distribution of the generated variables given the generating variables, and the corresponding conditional entropy, are derived.

The mask $M$ together with the conditional probability distribution is our model of behavioral structure in the data, called a "behavior system"; the conditional entropy measures the quality of the model, or how uncertain we are about our prediction. As in reconstructability analysis (section 3.3.2) we need to consider the tradeoff between quality (accuracy) of the model and complexity, reflected by the number of new variables.

A "behavior system" can also be seen as describing support-invariant behavior in the data. Independent of the support we have a conditional probability function for predicting the next variable states. For more details and especially a more formal definition of mask analysis refer to [26, pp. 83-174].

Compared to the basic techniques, mask analysis mainly uses new dimensions to add the knowledge of previous data to our current record. Using different masks we add different variables to compare how well they are able to predict. The quality of each behavior system is then obtained by projection and standard entropy measures.

3.3 Unsupervised, Nominal methods

Finally I want to discuss unsupervised methods which only require nominal data without a distance measure. Often we are trapped into thinking about the continuous, ordered cases where we can visualize relationships in 2- or 3-dimensional cartesian graphs. The previous methods should have illustrated this assumption.

With nominal data the idea of a decision surface doesn't make sense (section 3.1). Though aggregation of values is possible via hierarchies (perhaps induced by a fuzzy similarity measure), different values are otherwise unrelated and can therefore only be handled as single points or sets of values.

Different variables can only be aggregated by logic operations. If variable $x_i$ has a value out of a given subset $S_i \subseteq dom(x_i)$ and variable $x_j$ a value in $S_j \subseteq dom(x_j)$, then a third variable $x_k$ has with some probability $p$ a value in $S_k \subseteq dom(x_k)$. This process is known as rule inference and is discussed in section 3.3.5.

In the nominal case we only have count tables, which can be visualized in histograms (though even this could be misleading, as continuous probability distributions are often represented in this way). These count tables can then be used for deriving probability or other evidence distributions. As aggregation of values and variables in the sense of adding and multiplying is very restricted, the use of the first three basic techniques, mentioned in section 2.3, on an induced probability distribution is the main structure-finding approach. With the following method introductions I also hope to make the point that the introduced techniques are, from their theoretical viewpoint, a sufficient description for investigating nominal data.

Before I start with the method descriptions I want to present the Market Basket problem as a good example for the nominal, unsupervised problem domain (though it has a slightly different data structure than discussed). In Market Basket research we want to investigate patterns in the shopping behavior of customers. A set of nominal items is given. Subsets of these items are bought by customers in so-called "transactions". The purpose is to identify rules of the type "A customer purchasing items A, B, and C often also purchases item D" (section 3.3.5). For specific questions like effects of advertising we also might want to specify some of the items A or D a priori.


For example, if a customer buys milk, what else is he likely to buy?

The usefulness of hierarchies also becomes apparent in this example. If all different items in a shop are distinguished (milk from different producers; skim, 1%, 2%, and whole milk) then hardly any general rules can be found. A hierarchy of values (just any type of milk, any kind of bread, etc.) can help in inferring general as well as more specialized patterns. Clearly rule inference is also an unsupervised method, as we don't use any data for supervised classification training.

3.3.1 Analysis of Variance (ANOVA)

Analysis of Variance is a statistical method for modeling the effect of several nominal input variables and their interactions on a continuous, ordered output variable. The F-test is used for testing whether the influence of some input variables, alone or interacting, is significant on the output variable. Least-squares estimates are used to estimate the strength of the effect [8, pp. 108-145], [19, 34, 21, 22].

A good example is the growth of plants. We want to investigate if there is an influence on the growth of plants by growing them on different types of soil, using different fertilizer, etc., and if there are interacting influences among these variables. Here the kind of soil, fertilizer, etc. are the nominal input variables, and the height of the growing plants might be the output variable.

We can also use Analysis of Variance for our purely nominal domain. In this case the modeled continuous output variable corresponds to a probability distribution over the nominal relation. We model the likelihood that specific value combinations appear together.

Analysis of Variance is used to investigate the probability distribution by searching for patterns of values that occur together. The focus is on finding out which values and variables appear together in a random manner (no structure) and which specific values have a strong influence, either in combination or on their own, on the likelihood to appear. In other words, are some dimensions independent from each other or is there a "correlation" effect between the variables? In an example we might want to investigate the influences of a patient's blood type, of a new medication versus its placebo, of the occurrence or non-occurrence of some genes in his DNA, of the climate he lives in, etc., on the patient's probability of healing or not healing.


Let me formally introduce the model for three variables $x_1, x_2, x_3$:

$$f_{ijk} = \mu + A_i + B_j + C_k + AB_{ij} + AC_{ik} + BC_{jk} + ABC_{ijk} + \epsilon_{ijk}$$

where $f_{ijk}$ denotes the output variable (probability) for the $i$-th value of variable $x_1$, the $j$-th value of the second variable $x_2$, and the $k$-th value of $x_3$. Let $\mu$ be the overall mean of the output variable, which would be the random probability ($1/|dom(x_1 \times x_2 \times x_3)|$) in our case. $A_i$ is defined as the influence of the $i$-th value of the first variable, $B_j$ as the influence of the $j$-th value of the second variable, and $C_k$ as the influence of the $k$-th value of the third variable. $AB_{ij}, AC_{ik}, BC_{jk}$ and $ABC_{ijk}$ are the combined effects due to some of the values occurring together. For example, a specific soil type ($x_1 = 1$) and a specific fertilizer ($x_2 = 1$) may have only a very small or even negative influence on the plant growth compared to the average growth ($A_1 < 0, B_1 < 0$), but in combination they have a big positive effect ($AB_{11} \gg 0$).

$\epsilon_{ijk}$ consists of all the effects of other variables which we ignored in our model. It is referred to as a "residual" and it can be interpreted as the random error of the model. In the Analysis of Variance the random errors $\epsilon_{ijk}$ are assumed to be independent, normally distributed random variables with mean 0 and the same variance ($\epsilon_{ijk} \sim N(0, \sigma^2)$) for all $i, j, k$. That means we assume only a linear influence ($A_i, B_j, C_k, AB_{ij}$, etc.) on the system function $f_{ijk}$ and no change in variation. For example, a specific medication might have a positive effect on the health of patients, while placebos might have a smaller effect. Nevertheless, the variability of health over patients taking medication or placebos is assumed to be the same. If this is not the case then we are perhaps missing another variable (belief in the placebo?) which explains and separates an increased variation.

Because of the definition of $\mu$ as the mean value and $\epsilon_{ijk}$ as random variations with mean 0, the following constraints also hold for our model:

$$\sum_i A_i = \sum_j B_j = \sum_k C_k = 0$$

$$\sum_{ij} AB_{ij} = \sum_{ik} AC_{ik} = \sum_{jk} BC_{jk} = 0$$

$$\sum_{ijk} ABC_{ijk} = 0$$

With these assumptions we can compute the following least-squares estimates of the variable effects. Let $f_{ijk}$ be the actual output values (probabilities) for the $i$-th, $j$-th, and $k$-th value of the respective input variables; a "$\cdot$" denotes a subscript over which the average is taken, "$\hat{\ }$" denotes an estimate:

$$\hat{\mu} = f_{\cdot\cdot\cdot}$$
$$\hat{A}_i = f_{i\cdot\cdot} - f_{\cdot\cdot\cdot}$$
$$\hat{B}_j = f_{\cdot j\cdot} - f_{\cdot\cdot\cdot}$$
$$\hat{C}_k = f_{\cdot\cdot k} - f_{\cdot\cdot\cdot}$$
$$\widehat{AB}_{ij} = f_{ij\cdot} - f_{i\cdot\cdot} - f_{\cdot j\cdot} + f_{\cdot\cdot\cdot}$$
$$\widehat{AC}_{ik} = f_{i\cdot k} - f_{i\cdot\cdot} - f_{\cdot\cdot k} + f_{\cdot\cdot\cdot}$$
$$\widehat{BC}_{jk} = f_{\cdot jk} - f_{\cdot j\cdot} - f_{\cdot\cdot k} + f_{\cdot\cdot\cdot}$$
$$\widehat{ABC}_{ijk} = f_{ijk} - f_{ij\cdot} - f_{i\cdot k} - f_{\cdot jk} + f_{i\cdot\cdot} + f_{\cdot j\cdot} + f_{\cdot\cdot k} - f_{\cdot\cdot\cdot}$$

These estimates can be generalized for more variables. The F-ratio test is used to test if interactions between variables like ($AB_{ij}$, for all $i, j$) are significant and if we need to include them in our model. Above I presented a complete model with all possible interactions of three variables, but in reality we often want to represent only the significant aspects. For more details on the F-ratio test see [19, 34, 8].

A random distribution with no structure is equivalent to all variables not having any direct influence on the distribution, so that only the global mean of the model is needed to "explain" everything. If the distribution is not random then the structure can be due to the influence of some single variables and/or to some more complex interactions. These kinds of interactions are the structures we are especially interested in and which we try to capture by doing Analysis of Variance.

Comparing Analysis of Variance to the basic techniques, we recognize that marginal distributions of variables (for example $f_{\cdot\cdot k}$) are used to estimate their influence on the overall probability distribution. First we project onto the variables of interest, then we take the averaged probability of the values of interest. For example, to get the influence of the $i$-th value of variable $x_1$ we project onto variable $x_1$, take the probability for the $i$-th value, which is the sum of all probabilities where the $i$-th value of $x_1$ occurs, and divide it by $|dom(x_2 \times x_3)|$ to obtain the average probability. Then we subtract the global mean $\mu$ and end up with the influence $A_i$.
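As an illustration, the following sketch computes the least-squares effect estimates given above from a three-dimensional probability table; extending it to the remaining interaction terms follows the same pattern, and the function name is my own.

import numpy as np

def anova_effects(f):
    # f[i, j, k] holds the induced probabilities of the three-variable relation.
    mu = f.mean()                                   # overall mean f_...
    A = f.mean(axis=(1, 2)) - mu                    # main effects of variable x1 (f_i.. - f_...)
    B = f.mean(axis=(0, 2)) - mu                    # main effects of variable x2
    C = f.mean(axis=(0, 1)) - mu                    # main effects of variable x3
    # two-variable interaction AB_ij = f_ij. - f_i.. - f_.j. + f_...
    AB = f.mean(axis=2) - f.mean(axis=(1, 2))[:, None] - f.mean(axis=(0, 2))[None, :] + mu
    return mu, A, B, C, AB                          # AC, BC and ABC follow the same pattern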


Some of these issues will be discussed in more detail in the following section. Reconstructability Analysis is basically used for the same purpose but is more precise than ANOVA, as no model assumptions are made about how the data influences the output variable [21, 22]. This is also the reason for this rather short introduction to ANOVA, while the discussion of Reconstructability Analysis will be much more detailed.

3.3.2 Reconstructability Analysis (RA)

Reconstructability Analysis is an approach for inductively modeling relationships and correlations between variables. The aim is to identify strongly related subsets of variables and to represent this knowledge in a simplified model which eliminates the connections between all other "almost" unrelated subsets of variables. Reconstructability Analysis grew out of the GSPS (General Systems Problem Solver) framework and is well documented in [26, pp. 227-281], [14], [28, pp. 270-279], [27], [7].

Informal Introduction: In the original overall relation every possible connection between variable subsets and their effects on the counts is accounted for; every "correlated" behavior between variables can be represented, as all variables are "connected" with each other in this relation. This may be either unclear or self-evident at this point; it just means that every possible value combination over the overall cartesian product space can be assigned an individual count and induced probability measure.

On the other hand, if all the variables were independent of each other we could simplify the overall probability distribution dramatically by just representing each dimension separately with its own distribution; in other words, the whole system could be represented by the projections onto the individual dimensions $x_i$. All subsets of variables which are not "correlated" do not need to appear together in the same relation. So by "decomposing" the overall relation into a set of subrelations we can obtain a simplified model which allows us to see the connections between variables more directly. The less the variables are connected, the simpler our model. We also might wish to ignore slight connections among variables in order to reduce the complexity of the system and to emphasize the strong relations.

Imagine an example where 70% of the cars are red and 30% blue. Independently of the car color, 20% of the cars have a standard shift, the other 80% an automatic one. We could represent this information in an overall relation with probabilities attached:

Color   Shift       Probability
red     standard    14%
red     automatic   56%
blue    standard    6%
blue    automatic   24%
total               100%

From this table the independence between the variables is not obvious. But if we use a model of the two subrelations {Color} and {Shift}, then the independence becomes clear, as this "simpler" representation is able to support the same knowledge. We have the same accuracy as in the above table and the same information is represented. By the way, I already made use of this simpler representation in my verbal description. The following are the projections of our "perfect" model:

Color   Probability
red     70%
blue    30%
total   100%

Shift       Probability
standard    20%
automatic   80%
total       100%
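For this example the "perfect" reconstructability can be checked directly: the unbiased reconstruction of two independent subrelations is just the product of their projections (a small illustrative sketch, not part of the original text).

import numpy as np

# Overall relation over (Color, Shift) from the table above, as a probability table
f = np.array([[0.14, 0.56],    # red:  standard, automatic
              [0.06, 0.24]])   # blue: standard, automatic

# Projections of the model { {Color}, {Shift} }
p_color = f.sum(axis=1)        # (0.70, 0.30)
p_shift = f.sum(axis=0)        # (0.20, 0.80)

# Unbiased (maximum entropy) reconstruction of two independent subrelations:
f_reconstructed = np.outer(p_color, p_shift)
print(np.allclose(f, f_reconstructed))   # True: f is reconstructable from this model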

In reality we work with much larger sets of variables, but the same principle holds. Only the relationships can be much more complex: there could be some connection among variables $x_1, x_2, x_3$, a connection among variables $x_2, x_3, x_4$, and also a connection among variables $x_1, x_2, x_4$, but still no combined effect of all the variables $x_1, x_2, x_3, x_4$. In this case we would replace the latter relation, denoted $\{x_1, x_2, x_3, x_4\}$, with the three less complex relations $\{x_1, x_2, x_3\}$, $\{x_2, x_3, x_4\}$ and $\{x_1, x_2, x_4\}$.

Normal forms in relational databases follow a similar idea of decomposing the overall data into simpler (reconstructable) subrelations. Known independencies among variables are used to simplify the data representation and to get rid of redundancies. Manageability of updating procedures and of the database is the main reason for doing this in Data Warehousing. The difference is that in creating normal forms we create the relational form according to our knowledge of independencies. We recognize that a sale of some clothes has something to do with the salesperson, but we assume that home addresses of salespersons are not individually connected with each sale, only with other salesperson-specific data.

In reconstructability analysis we don't know the independencies in advance. Given our data we try to infer simpler models which are an optimal tradeoff between ignoring slight connections in the data and a higher complexity of the model. In the further discussion of this method I will ignore existing normal or relational forms of the database and concentrate on decomposing one overall relation. This could be just one of the existing relational tables or an "artificial" overall relation constructed by joining the existing relations.

We start out with the whole relation in which all variables are connected by the data entries. We then want to simplify this whole relation by a set of subrelations which connect only a smaller number of variables and therefore reduce the complexity. The definition of the set of subrelations which carries our assumptions about the independencies among the variables is called a "model" of the overall relation, or in this particular case the "reconstruction hypothesis".

In Reconstructability Analysis many different hypotheses, following some specific search paths, are tried in order to find optimal models. To compare the correctness of the hypothesis assumptions, we project the data using the subrelations of this model and then rebuild a relation over the whole domain via unbiased (maximum entropy) reconstruction. Unbiased reconstruction means we obtain again an overall relation by using exactly the information from our projected model without adding external knowledge.

Taking a distance measure between the reconstructed (projected) distribution and the original distribution, we are able to tell the quality of our reconstructability hypothesis. The closer the distance, the more accurate the model and the less information is lost by using that model. Choosing a model is a tradeoff between complexity and accuracy of the model. The original relation is most accurate but so complex that it doesn't give us insights about the internal structure and the connectivity of the variables. The more we ignore weak correlations, the simpler and more understandable the model gets, but we also lose accuracy. Using a separate subrelation for each variable we obtain the simplest model, but in most cases also a very inaccurate and therefore useless one. Because of the two complementary measures, simplicity and accuracy, the model solution set of reconstructability analysis consists of all hypotheses which are not inferior to any other model in a combined (simplicity, accuracy) partial ordering. Models in the solution set are less accurate with increasing simplicity but "optimal" for their respective simplicity level.


To summarize, Reconstructability Analysis is "the problem of breaking down a given overall system into (simpler) subsystems that preserve enough information about the overall system. The principle motivation behind the reconstruction problem is the reduction of complexity in the system involved." [28, pg. 278].

Formal Definition: Let me now define Reconstructability Analysis formally. Define $V = \{x_1, \ldots, x_n\}$ to be the set of all dimensions. The domain of $V$ is the cartesian product of the domains of its dimensions: $X := dom(V) = dom(x_1) \times dom(x_2) \times \ldots \times dom(x_n)$. On this domain the database induces a count table and a probability distribution $f(\vec{x}), \vec{x} \in dom(V)$ (Section 2.1) which represents the information and relationships among the variables; $\mathbb{P}$ denotes the set of all probability distributions defined on $dom(V)$:

$$f : \mathcal{P}(dom(V)) \mapsto [0, 1]$$

$$\mathbb{P} = \{ f \mid f : \mathcal{P}(dom(V)) \mapsto [0, 1] \}$$

A projection of $f$ onto $\{x_1, x_2\}$, as described in Section 2.3.1, is denoted $\pi_{\{x_1, x_2\}}(f)$; $\vec{x} \sqsupseteq \vec{x}'$ means vector inclusion ($\vec{x}'$ is a sub-vector of $\vec{x}$):

$$\pi_{V_i} : \mathbb{P} \mapsto \mathbb{P}', \quad V_i \subseteq V, \quad \mathbb{P}' = \{ f \mid f : \mathcal{P}(dom(V_i)) \mapsto [0, 1] \}$$

$$\pi_{V_i} : f \longmapsto f', \quad f'(\vec{x}') = \sum_{\vec{x} \sqsupseteq \vec{x}'} f(\vec{x}), \quad \vec{x}' \in dom(V_i), \; \vec{x} \in dom(V)$$

A model of $V$, the overall variable set, is defined as a set of subsets of $V$

$$M = \{V_1, \ldots, V_m\}, \quad V_i \subseteq V, \quad i = 1, \ldots, m$$

such that:

1) $\bigcup_{i=1}^{m} V_i = V$ (covering condition)
2) $i \ne j \implies V_i \not\subseteq V_j$ (irredundancy condition)

The projection of $f$ onto a model $M$ is called a "structure system" and consists of the set of distributions:

$$\pi_M(f) = \{ \pi_{V_1}(f), \ldots, \pi_{V_m}(f) \}$$

As a structure system, $\pi_M(f)$ is a simplified representation of the overall system $f$; it corresponds to, and could be projected from, a set of possible overall systems. The set of overall probability distributions which are compatible with a given structure system $\pi_M(f)$ is called its "reconstruction family":

$$E(\pi_M(f)) = \{ f' \in \mathbb{P} \mid \pi_M(f') = \pi_M(f) \}$$

The maximum entropy reconstruction of a structure system $\pi_M(f)$ is the unique overall distribution which can be rebuilt without adding any extra knowledge (unbiased) and is defined as $J(\pi_M(f)) \in E(\pi_M(f))$ such that

$$H(J(\pi_M(f))) = \max_{f' \in E(\pi_M(f))} H(f'), \qquad H(f) = -\sum_{s \in dom(V)} f(s) \log_2 f(s)$$

This maximum entropy reconstruction can be obtained by a series of relational join operations ($i = 1, \ldots, (m-1)$) in which we sequentially add the knowledge of the subrelations to form the overall system.

Join procedure: Let $f^{(i)}_{join}$ be the prior distribution from the $(i-1)$ previous joins and $V^{(i)}_{join}$ the variable set $f^{(i)}_{join}$ is defined on (letting $f^{(1)}_{join} := \pi_{V_1}(f)$, $V^{(1)}_{join} := V_1$). Denote the projection (subrelation, subsystem) whose information is to be added to the join as $f^{(i+1)}_{proj} := \pi_{V_{i+1}}(f)$, with $V^{(i+1)}_{proj} := V_{i+1}$ as its variable set. The new resulting distribution from this join is $f^{(i+1)}_{join}$, defined on the variable set $V^{(i+1)}_{join} = V^{(i+1)}_{proj} \cup V^{(i)}_{join}$. We define three sets of variables:

$$A^{(i)} := \{ x \in V \mid x \in V^{(i+1)}_{proj} \wedge x \notin V^{(i)}_{join} \}$$

$$B^{(i)} := \{ x \in V \mid x \in V^{(i+1)}_{proj} \wedge x \in V^{(i)}_{join} \}$$

$$C^{(i)} := \{ x \in V \mid x \notin V^{(i+1)}_{proj} \wedge x \in V^{(i)}_{join} \}$$

For the join we need to combine the two distributions $f^{(i+1)}_{proj}$ and $f^{(i)}_{join}$ over their common variable set $B^{(i)}$. If $B^{(i)} = \emptyset$, then the distributions are assumed to be independent (by the model) and we obtain $f^{(i+1)}_{join}$ by multiplying $f^{(i+1)}_{proj} \cdot f^{(i)}_{join}$. Let $\vec{a} \in dom(A^{(i)}), \vec{b} \in dom(B^{(i)}), \vec{c} \in dom(C^{(i)})$:

$$f^{(i+1)}_{join}(\vec{a}, \vec{c}) = f^{(i+1)}_{proj}(\vec{a}) \cdot f^{(i)}_{join}(\vec{c})$$

Otherwise we transform $f^{(i)}_{join}$ into a conditional distribution $f^{(i)}_{join}(C^{(i)} \mid B^{(i)})$. For each value $\vec{b} \in dom(B^{(i)})$ we obtain a conditional distribution on $dom(C^{(i)})$


(as described in section 2.3.2) by dividing:

$$f^{(i)}_{join}(\vec{c} \mid \vec{b}) := \frac{f^{(i)}_{join}(\vec{c}, \vec{b})}{f^{(i)}_{join}(\vec{b})}$$

Then we obtain $f^{(i+1)}_{join}$ by:

$$f^{(i+1)}_{join}(\vec{a}, \vec{b}, \vec{c}) = f^{(i+1)}_{proj}(\vec{a}, \vec{b}) \cdot f^{(i)}_{join}(\vec{c} \mid \vec{b})$$

By the way, $C^{(i)} \ne \emptyset$ because of the irredundancy condition; $A^{(i)} = \emptyset$ can be the case in some joins, but this does not influence the described procedure.
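A minimal sketch of one such join step is given below; distributions are represented here as dictionaries mapping value tuples (ordered like the accompanying variable lists) to probabilities, which is purely my own representational choice for illustration.

def join(proj_vars, proj, join_vars, joined):
    # proj   : dict mapping value tuples over proj_vars -> probability (f_proj)
    # joined : dict mapping value tuples over join_vars -> probability (f_join)
    # Returns (new_vars, new_dist) defined over the union of both variable sets.
    B = [v for v in proj_vars if v in join_vars]        # common variables
    A = [v for v in proj_vars if v not in join_vars]
    C = [v for v in join_vars if v not in proj_vars]

    def part(t, vars_, subset):
        # extract the sub-tuple of t (aligned with vars_) belonging to subset
        return tuple(t[vars_.index(v)] for v in subset)

    # marginal of the prior join over B, needed for the conditional f_join(c | b)
    marg_B = {}
    for t, p in joined.items():
        b = part(t, join_vars, B)
        marg_B[b] = marg_B.get(b, 0.0) + p

    new_dist = {}
    for tp, pp in proj.items():
        a, b = part(tp, proj_vars, A), part(tp, proj_vars, B)
        if marg_B.get(b, 0.0) == 0.0:
            continue                                    # no support for this b in the prior join
        for tj, pj in joined.items():
            if part(tj, join_vars, B) != b:
                continue
            c = part(tj, join_vars, C)
            cond = pj / marg_B[b]                       # f_join(c | b); equals f_join(c) when B is empty
            new_dist[a + b + c] = new_dist.get(a + b + c, 0.0) + pp * cond
    return A + B + C, new_dist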

Performing relational joins for $i = 1, \ldots, m-1$ we combine the information of all projections together and in most cases end up with the maximum entropy reconstruction $J(\pi_M(f)) = f^{(m)}_{join}$ of our simplified structure system $\pi_M(f)$.

In some cases, when we have a so-called "loop structure", which means loop dependencies in our model structure, then $f^{(m)}_{join}$ is only an approximation of $J(\pi_M(f))$. As $\pi_M(J(\pi_M(f))) = \pi_M(f)$, we can recognize such a situation by simply projecting $f^{(m)}_{join}$ onto our model. If there are loop dependencies we need to continue joining the projections $f^{(i+1)}_{proj} := \pi_{V_{((i+1) \bmod m)}}$, $i = m, \ldots$ to our already obtained join distribution $f^{(m)}_{join}$ until we reach a close enough approximation of $J(\pi_M(f))$. This can be done following the above discussed procedures. Note that $A^{(i)} = \emptyset$ for $i = m, \ldots$.

$$\Delta_i := \max_{j = (i-m+1), \ldots, i} \left| f^{(j)}_{join} - f^{(j-m)}_{join} \right|, \quad i \ge 2m$$

is a measure for the closeness of the approximation [26, pp. 225-227]:

$$f^{(i)}_{join} - \Delta_i \le J(\pi_M(f)) \le f^{(i)}_{join} + \Delta_i$$

For examples on the described join procedures see also [26, pp. 223-227].

Model Evaluation: If $f = J(\pi_M(f))$, then $f$ is called "reconstructable" from our model $M$. We can represent the data without information loss by using model $M$, and the assumptions made about independencies seem to be correct. A distribution $f$ is regarded as "approximately reconstructable" from a model $M$ if the maximum entropy reconstruction $J(\pi_M(f)) =: f_h$ is sufficiently `close' to $f$ according to some distance measure.


A well-known class of distance measures is the Minkowski class of distances (parameterized by $p \in \{1, 2, 3, \ldots\}$), also known as L-norms:

$$L_p(f, f_h) = \left[ \sum_{s \in dom(V)} |f(s) - f_h(s)|^p \right]^{1/p} \qquad (20)$$

which contains the Hamming distance ($L_1$), the Euclidean distance ($L_2$), and the Max distance ($L_\infty$).

For measuring the accuracy of the model we are more interested in the information loss of the maximum entropy reconstruction compared to the original distribution. The following two non-symmetric measures are derived from information theory: Shannon's cross-entropy, also known as directed divergence [28, pg. 279], [38, pg. 12]:

$$H(f, f_h) = \sum_{s \in dom(V)} f(s) \cdot \log_2\left( \frac{f(s)}{f_h(s)} \right) \qquad (21)$$

and the relative information loss obtained by normalizing Shannon's cross-entropy over the possible information content (Hartley information) [14, pg. 169], [26, pp. 228-229]:

$$D(f, f_h) = \frac{H(f, f_h)}{\log_2(|dom(V)|)} \qquad (22)$$

$0 \le D(f, f_h) \le 1$ characterizes the percentage of information lost when $f_h$ represents distribution $f$.
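A small sketch of these evaluation measures, assuming f and fh are stored as NumPy arrays of probabilities over dom(V) (the function names are mine):

import numpy as np

def cross_entropy(f, fh):
    # Directed divergence H(f, fh) of equation (21).
    mask = f > 0                        # terms with f(s) = 0 contribute nothing
    return np.sum(f[mask] * np.log2(f[mask] / fh[mask]))

def information_loss(f, fh):
    # Relative information loss D(f, fh) of equation (22); f.size = |dom(V)|.
    return cross_entropy(f, fh) / np.log2(f.size)

def minkowski(f, fh, p=2):
    # L_p distance of equation (20).
    return (np.abs(f - fh) ** p).sum() ** (1.0 / p)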

Search through Model Space: Now we know how to evaluate one model. How do we iterate through the model space? "To make the search through the set of reconstruction hypotheses of a given overall system orderly and efficient, it is essential that the set be ordered by a relation of refinement." [14, pp. 167-168]. All possible models can be partially ordered by a complexity-refinement lattice, where refinement means simplification, and coarsening "complexification". The "top" of this lattice (the most complex) is the overall relation, whose structure could be denoted by the model $M = \{V\}$. The "bottom" (the most simple) is the model in which all variables occur in separate subrelations, $M = \{\{x_1\}, \{x_2\}, \ldots, \{x_n\}\}$. In this lattice every model $M$ can be simplified in two ways:

In a "C-refinement" the connection of a variable pair is broken. The two variables $x_i$ and $x_j$ are assumed to be independent and are separated in each subrelation. As an example assume the following model of the variable space $V = \{x_1, x_2, x_3, x_4, x_5, x_6\}$:

$$M = \{ \{x_1, x_2, x_4, x_6\}, \{x_2, x_3, x_4, x_5\}, \{x_1, x_2, x_4, x_5\}, \{x_2, x_4, x_5, x_6\} \}$$

The following C-refinements are possible:

$(x_1, x_2) \quad (x_2, x_3) \quad (x_3, x_4) \quad (x_4, x_5) \quad (x_5, x_6)$
$(x_1, x_4) \quad (x_2, x_4) \quad (x_3, x_5) \quad (x_4, x_6)$
$(x_1, x_5) \quad (x_2, x_5)$
$(x_1, x_6) \quad (x_2, x_6)$

Let's choose to break the connection between the variable pair $(x_4, x_5)$. Every subsystem which doesn't contain this pair is automatically part of our refinement (refined model); in this example, only $\{x_1, x_2, x_4, x_6\}$. All remaining subsystems are replaced by their complete cover of sub-subsystems of complexity one less:

$$\{x_2, x_3, x_4, x_5\} \longrightarrow \{x_2, x_3, x_4\}, \{x_2, x_3, x_5\}, \{x_2, x_4, x_5\}, \{x_3, x_4, x_5\}$$

$$\{x_1, x_2, x_4, x_5\} \longrightarrow \{x_1, x_2, x_4\}, \{x_1, x_2, x_5\}, \{x_1, x_4, x_5\}, \{x_2, x_4, x_5\}$$

$$\{x_2, x_4, x_5, x_6\} \longrightarrow \{x_2, x_4, x_5\}, \{x_2, x_4, x_6\}, \{x_2, x_5, x_6\}, \{x_4, x_5, x_6\}$$

Then all sub-subsystems which contain the chosen variable pair are removed. Furthermore, all sub-subsystems which are redundant or part of any other element in the refinement are removed. In our case only $\{x_2, x_3, x_4\}$, $\{x_2, x_3, x_5\}$, $\{x_1, x_2, x_5\}$ and $\{x_2, x_5, x_6\}$ remain: $\{x_1, x_2, x_4\}$ and $\{x_2, x_4, x_6\}$ are removed because they are already contained in $\{x_1, x_2, x_4, x_6\}$, and the others because they contain the pair $(x_4, x_5)$.

The final refined model after breaking the variable pair $(x_4, x_5)$ is:

$$M' = \{ \{x_1, x_2, x_4, x_6\}, \{x_2, x_3, x_4\}, \{x_2, x_3, x_5\}, \{x_1, x_2, x_5\}, \{x_2, x_5, x_6\} \}$$

The following is the complete C-refinement lattice for an overall relation with only 3 variables:

$$\{ \{x_1, x_2, x_3\} \}$$

$$\{ \{x_1, x_2\}, \{x_3\} \} \qquad \{ \{x_1, x_3\}, \{x_2\} \} \qquad \{ \{x_1\}, \{x_2, x_3\} \}$$

$$\{ \{x_1\}, \{x_2\}, \{x_3\} \}$$


The second form of simplification is called "G-refinement". In this more general refinement (it includes C-refinements indirectly through sequences of several G-refinements) no direct links between variable pairs are broken. Instead one subsystem is replaced with all its sub-subsystems of complexity one less, after removing redundant relations. In this way the combined effect of all variables in that one relation is eliminated.

If, for example, only a specific combination of several variables triggers a medical condition, then this effect can only be represented in a relation containing all these variables together. After a G-refinement all variables still stay connected, but no longer within the same relation, and the model cannot represent the complex interaction of all the variables at once.

As an example let us G-refine {x1, x2, x4, x6} in our model:

M' = { {x1, x2, x4, x6}, {x2, x3, x4}, {x2, x3, x5}, {x1, x2, x5}, {x2, x5, x6} }

First we replace {x1, x2, x4, x6} by its subsystems of complexity one less:

{x1, x2, x4, x6} → {x1, x2, x4}, {x1, x2, x6}, {x1, x4, x6}, {x2, x4, x6}

All sub-subsystems contain non-redundant information, therefore they are all added to our refined model M'':

M'' = { {x2, x3, x4}, {x2, x3, x5}, {x1, x2, x5}, {x2, x5, x6}, {x1, x2, x4}, {x1, x2, x6}, {x1, x4, x6}, {x2, x4, x6} }
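A corresponding sketch of the G-refinement step (again my own illustration, reusing the frozenset representation from the C-refinement sketch above):

```python
from itertools import combinations

def g_refine(model, subsystem):
    """G-refinement: replace one subsystem by all its covers of complexity one less,
    then drop any element that is contained in another (redundant) element."""
    subsystem = frozenset(subsystem)
    covers = {frozenset(c) for c in combinations(sorted(subsystem), len(subsystem) - 1)}
    pool = (model - {subsystem}) | covers
    return {s for s in pool if not any(s < t for t in pool)}

M_prime = {frozenset({"x1", "x2", "x4", "x6"}), frozenset({"x2", "x3", "x4"}),
           frozenset({"x2", "x3", "x5"}), frozenset({"x1", "x2", "x5"}),
           frozenset({"x2", "x5", "x6"})}
print(sorted(sorted(s) for s in g_refine(M_prime, {"x1", "x2", "x4", "x6"})))   # reproduces M''
```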

In theory we now have an algorithm (G-refinements) to enumerate all models and test them sequentially. In practice this is only possible for very small relations, as the number of models "explodes" with an increasing number of variables [14, pg. 168]. For 3 variables there are only 9 possible models, but for 6 variables the number of models is estimated to be about 7 million.

This is the reason for using two different kinds of refinements. The more restrictive C-refinements can be used for a "global search", while the more general G-refinements are used for local adjustments.

The procedure of global and local search is supported by several results which "observed ... that reconstruction hypotheses have a tendency to naturally cluster into good and bad ones, i.e. into hypotheses with small and large distances" [14, pg. 171]. Because of this "natural clustering" it is possible to search the enormous model space more effectively: we only need to look for "interesting" models in the neighborhood of already good models.


One commonly used strategy is to start with the original, most complex model, refine it and test the resulting reconstruction hypotheses. From these models choose either the best, the best k, or the best p percent of the models for further refinement. Iterating this strategy down to the most simple model we get a fairly good idea about the "interesting" models. If we only used C-refinements for the iteration, we can then use G-refinements to improve the "local refinement search" around these models.
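A skeleton of this level-wise strategy might look as follows; this is my own sketch, assuming models are hashable (e.g. frozensets of frozensets as above) and that a refinement generator and a loss function, such as D(f, f_h) from equation (22) evaluated on the maximum-entropy reconstruction, are supplied by the caller:

```python
def search_model_space(top_model, refinements, loss, k=3, max_levels=20):
    """Refine the current frontier of models level by level, evaluate every candidate
    reconstruction hypothesis with `loss`, and keep only the k best models for the
    next level of refinement.  Returns all evaluated models, best first."""
    frontier = [top_model]
    evaluated = [(loss(top_model), top_model)]
    for _ in range(max_levels):
        candidates = {m for parent in frontier for m in refinements(parent)}
        if not candidates:
            break
        ranked = sorted(candidates, key=loss)[:k]
        evaluated.extend((loss(m), m) for m in ranked)
        frontier = ranked
    return sorted(evaluated, key=lambda pair: pair[0])
```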

Mathematical optimization methods and Genetic Algorithms may also be used for searching the model space. For these methods we either define a quality measure as a combination of complexity and accuracy, or we only search one specific complexity level at a time and use the accuracy as the quality measure (or fitness function). For Genetic Algorithms it also needs to be investigated how to encode models as character strings.

Discussion: Reconstructability Analysis is similar to Analysis of Variance (section 3.3.1) in its approach of finding structure within a wide relation. Both methods try to identify the strength of relationships and correlations among variables. While ANOVA assumes a model of interaction effects which are then estimated via statistical inference, RA does not assume an interaction model; instead it starts with an interaction hypothesis about which variables are connected, not about how they are connected.

To estimate an interaction effect via RA, e.g. $ABC_{ijk}$, one reconstruction hypothesis connecting all these variables, e.g. { {x1, x2, x3} }, and one which only contains proper subsets of them, e.g. { {x1, x2}, {x1, x3}, {x2, x3} }, are computed. Then, comparing the probabilities in both unbiased reconstructions, we are able to estimate the interaction effect $ABC_{ijk}$ as the difference of those probabilities. Jones suggests that K-systems should be preferred to ANOVA in many instances. For a more detailed discussion of the comparison between ANOVA and RA see [21, 22]. "We conclude that there are significant differences between statistical and K-systems (RA) interactions, and that these differences are due to the erroneous model and simplifying assumptions of statistical interaction [in ANOVA]." [22, pg. 169].

Comparing RA with the basic techniques we recognize that essentially projection and extension are used. The data is projected into our models and then extended via unbiased join procedures. By searching through model space we look for good models which contain the identified additional structure in their refined subrelations; some variables are linked more directly than in the overall relation, other variables are disconnected altogether.

Another use of reconstructability analysis has been suggested by Klir in 1981 [25]. The "reconstruction principle of inductive inference" expresses that the reconstruction system derived from sampled data is usually a better estimate of the true distribution than the sampled data itself. This hypothesis is supported by several experiments [25, 14, 38] and can be explained as follows: the relative sample size over the domain is larger for any of the projections $\pi_{V_i}$ than for the whole domain:

\[
\frac{\text{samplesize}}{|\mathrm{dom}(V_i)|} > \frac{\text{samplesize}}{|\mathrm{dom}(V_1 \times V_2 \times \ldots \times V_m)|}
\]

Therefore the probability estimates are more significant for the projections than for the overall relation.

If the overall system is approximately reconstructable from our model, then we can obtain a better approximation of the true distribution by using this model. However, as there are already many other methods for improving probability estimates (contingency table analysis, etc.), Pitarelli concludes: "(Reconstructability analysis) is neither the only nor necessarily the best technique for improving an initial relative frequency estimate of a probability distribution defined over a finite product space." [38, pg. 20].

3.3.3 DEEP

Whereas the aim in Reconstructability Analysis is an overall model of the structure among variables, DEEP (Data Exploration through Extension and Projection) [24] is a user-guided approach for investigating strong local structure.

DEEP is a new method within the GSPS framework to explore more "local" structures without reconstructing the whole relation. The user specifies an initial set of "interesting" variables V1. The overall relation is projected onto this variable set, $\pi_{V_1}$, and it can be observed how the data distributes over the corresponding domain dom(V1). Now the user specifies a second set of variables V2 over which our first projection $\pi_{V_1}$ can be spread out as conditional distributions; we extend the first projection to include the second set of variables. This means that for every value $\vec{v}_1 \in \mathrm{dom}(V_1)$ we create a distribution over the variable set V2: the count attached to each $\vec{v}_1$ is partitioned by the values of V2. We can visualize this process in a 2-dimensional table. The count table of $\pi_{V_1}$ is one dimension, the partitioning of these counts by V2 manifests the second dimension. As an example imagine the following fictitious example with V1 = {income} and V2 = {car}:

income   | count | Ford | Dodge | Toyota | GMC |   H    |  K   |   G
high     |   10  |   7  |   3   |    0   |  0  | 0.8813 | 0.30 | 0.88
medium   |   35  |  10  |  10   |   10   |  5  | 1.9502 | 0.39 | 0.97
low      |   20  |   0  |   2   |    5   | 13  | 1.2362 | 0.37 | 0.78
total    |   65  |  17  |  15   |   15   | 18  | 1.9954 | 0.33 | 0.998

The entropy H for each of these conditional distributions ($H(V_2 \mid V_1 = \vec{v}_1)$) is calculated and compared. The conditional distributions with low entropies contain the most structure.

K and G are two other important measures for this method. K is defined as a relative Hartley measure (equation 6, page 14) based on the partitioning of the counts by values of the second variable set V2. In our example the set of 10 data entities with "high income" partitions into two groups, one group driving "Ford", the other "Dodge", therefore K = log2(2) / log2(10) = 0.30. The 35 "medium wages" data entries are partitioned into 4 groups, K = log2(4) / log2(35) = 0.39, while the 20 "low wages" entries fall into 3 partitions, K = log2(3) / log2(20) = 0.37.

G is defined as a relative entropy (equation 10, page 17) over the partition, which means the entropy relative not to the whole domain (of V2) but to the size of the partition. The entropy of the "high wages" group is spread over two clusters, therefore G = H / log2(2) = 0.8813; the entropy of the "medium wages" group originates from 4 clusters, therefore G = H / log2(4) = 0.97.
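The three measures can be reproduced directly from the count rows of the table above; the following is my own minimal sketch (the function names are not from the DEEP implementation):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a conditional distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def deep_measures(counts):
    """H, K and G for one row: K is the relative Hartley measure (clusters vs. entities),
    G the entropy relative to the size of the partition (number of non-empty cells)."""
    clusters = sum(1 for c in counts if c > 0)
    h = entropy(counts)
    k = math.log2(clusters) / math.log2(sum(counts))
    g = h / math.log2(clusters) if clusters > 1 else 0.0
    return h, k, g

rows = {"high": [7, 3, 0, 0], "medium": [10, 10, 10, 5], "low": [0, 2, 5, 13]}
for income, car_counts in rows.items():
    print(income, [round(m, 4) for m in deep_measures(car_counts)])
```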

According to these three measures the user subsets the data to specific values $v_1^{(1)}, \ldots, v_1^{(k_1)}$, called condition $C_1 = (V_1 \in \{v_1^{(1)}, \ldots, v_1^{(k_1)}\})$. We end up with a data set conditionalized on our first variable set. Now we use the projection $\pi_{V_2}$ and calculate entropies for the conditional distributions on a third variable set V3. We continue subsetting the data with a second condition $C_2 = (V_2 \in \{v_2^{(1)}, \ldots, v_2^{(k_2)}\})$. This approach is iterated and we end up with a small subset of data which is highly correlated over the chosen variable sets. The user-guided process can be summarized in an ordered set (vector) of (variable-set, condition) tuples:

\[
((V_1, C_1), (V_2, C_2), \ldots, (V_n, C_n))
\]

As this process is completely user-guided it is very flexible in finding the required structures. It is also possible to predefine some of the subsetting and conditionalizing criteria and run DEEP in an automated fashion.

DEEP is based on an iterative procedure of projecting, conditionalizing, subsetting, and again extending into other dimensions.

3.3.4 Log-linear models

Hierarchical log-linear models are also quite similar to the approach of reconstructability analysis (and to mask analysis [50]). They investigate which combined effects of variables are required for a good approximation of the overall relation, or in short, what structure among the variables is necessary to describe the data sufficiently.

Log-linear models in general start out with count tables (contingency tables). As in RA, the data is then projected onto the hypothesized subrelations, still represented by counts, in contrast to the other measures (mostly probabilities) used in RA. Then the overall relation is rebuilt by maximum likelihood estimates (MLE) via the iterative proportional fitting algorithm (Deming-Stephan algorithm), rather than by the unbiased join procedure of RA. Other reconstruction algorithms are also known, e.g. the Newton-Raphson algorithm [30, pg. 22].
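As an illustration of the fitting step, here is a minimal sketch of iterative proportional fitting for the simplest case, a two-way table under the model with no interaction term (my own code; the general Deming-Stephan procedure handles arbitrary sets of marginal constraints):

```python
def ipf_two_way(table, iterations=25):
    """Fit cell estimates whose row and column margins match the observed margins,
    starting from a uniform table and alternately rescaling rows and columns."""
    nrows, ncols = len(table), len(table[0])
    row_margin = [sum(row) for row in table]
    col_margin = [sum(table[i][j] for i in range(nrows)) for j in range(ncols)]
    fit = [[1.0] * ncols for _ in range(nrows)]
    for _ in range(iterations):
        for i in range(nrows):                                   # match row margins
            s = sum(fit[i])
            fit[i] = [v * row_margin[i] / s for v in fit[i]]
        for j in range(ncols):                                   # match column margins
            s = sum(fit[i][j] for i in range(nrows))
            for i in range(nrows):
                fit[i][j] *= col_margin[j] / s
    return fit

observed = [[10, 20], [30, 40]]                                  # toy contingency table
print(ipf_two_way(observed))                                     # approaches [[12, 18], [28, 42]]
```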

While RA describes its models just by subsets of variables, this method puts the main emphasis on describing the result as a "log-linear model". The cell counts $F_{ij}$ of two variables 1, 2 are expressed in the following way:

\[
F_{ij} = \eta \cdot \tau_i^{(1)} \cdot \tau_j^{(2)} \cdot \tau_{ij}^{(1,2)} \tag{23}
\]

where $\eta$ is the geometric mean and $\tau_i^{(1)}$ the effect of the i-th value of the first (1) variable, and so forth. The name log-linear comes from a simple transformation which is often applied to the above formula:

\[
\log(F_{ij}) = \log(\eta) + \log\!\left(\tau_i^{(1)}\right) + \log\!\left(\tau_j^{(2)}\right) + \log\!\left(\tau_{ij}^{(1,2)}\right) \tag{24}
\]

These models are called "saturated" as they represent the whole relationship between the two variables. The $\eta$ and the $\tau$'s can be calculated from the counts; for details see [30]. A "simplified" model is obtained by ignoring some of the $\tau$ interaction terms and assuming, e.g., $\tau_{ij}^{(1,2)} = 1$. This "unsaturated" model is used for representing the projected and reconstructed data. Note that hierarchical models require that a model containing a high-order tau (e.g. $\tau_{ij}^{(1,2)}$) also contains all of its lower-order taus (e.g. $\tau_i^{(1)}$ and $\tau_j^{(2)}$). High order in this context reflects the number of variables interacting.


Though RA and hierarchical log-linear models are quite similar in what they actually do, their main differences lie in their approaches. RA puts its emphasis on the space of all possible models and the search through that model space for a large number of variables. Log-linear models concentrate more on statistical aspects and on the interactions between a small number of variables. Because of the similarities it is at least interesting to follow the development and history of both methods.

3.3.5 Rule inference

A rule can be defined as a statement "if event X occurs, then event Y is likely to occur", where the events are propositions of the form of variables taking particular values from their state sets (adapted from [41]). In other words, events X and Y can be described as sets of (variable, value) pairs; the purpose is to find these sets X and Y such that X "implies" Y (with high probability), denoted as (X → Y). We need to be careful with this notation as a rule does not necessarily imply causality (section 4).

We already know one kind of rule induction from the classification problem. There we fix Y as the classification variable with one of its values, and we try to find sets X which are good predictors for the right classification. As we "supervise" this process with specific classification data it is called a supervised method. In `general rule finding' we look for regions of high structure anywhere in the relation to obtain a better understanding of the domain (expert knowledge). This is well summarized in Smyth and Goodman [41, pg. 302, 313]: "Classification only derives rules relating to a single `class' attribute, whereas generalized rule induction derives rules relating any or all of the attributes". "The rules produced (...) can be used either as a human aid to understanding the inherent model embodied in data, or as a tentative input set of rules to an expert system."

General rule induction is therefore defined as an unsupervised method, even though this distinction seems somewhat `fuzzy'. Also in general rule induction we sometimes like to specify parts of the event sets X or Y a priori, though not directly for classification purposes but to approach the problem in a more user-guided manner. For simplicity we only deal with Y containing one (variable, value) pair; rules with more "implications" can always be divided into several rules with one implication each.


Formal Definition: Let me introduce some more notation:

\[
X := \{ (X_1, x_1), (X_2, x_2), \ldots, (X_n, x_n) \} = (\vec{X}, \vec{x})
\]
\[
Y := (Y, y)
\]

where $X_1, \ldots, X_n$ and $Y$ are variables, and $x_1, \ldots, x_n$ and $y$ are specific values (or value sets) from their respective domains. $\vec{X}$ and $Y$ can be seen as discrete random variables. We define a rule as:

\[
\text{If } \vec{X} = \vec{x}, \text{ then } Y = y \text{ with transition probability } c \tag{25}
\]

Measures: To search for `interesting' rules we need a preference measure to rank the rules and an algorithm which uses the preference measure to find the `best' rules. In general, the conditional probability, also called "transition probability" or "confidence", is a belief parameter associated with every rule:

\[
c = c(X, Y) := f(Y = y \mid \vec{X} = \vec{x}) \tag{26}
\]

It expresses the percentage of cases for which the rule implication is actually true. Another measure, mostly used for association rules (see below), is "support". It expresses the significance of a rule by measuring the probability that the rule's precondition and conclusion occur together in the data:

\[
s = s(X, Y) := f(Y = y, \vec{X} = \vec{x}) \tag{27}
\]

An interesting information-theory-based measure for general rule induction was introduced in 1988 by Goodman and Smyth [13]. The "J-measure" is a mixture of the probability of X and a special case of Shannon's cross-entropy. As a refresher, cross-entropy, or directed divergence, is defined as (section 3.3.2, [28, pg. 279], [38, pg. 12]):

\[
H(f, f_h) = \sum_{s \in \mathrm{dom}(V)} f(s) \cdot \log_2\!\left( \frac{f(s)}{f_h(s)} \right)
\]

In rule inference we are interested in the distribution of the "implication" variable $Y$, and especially in its two events $y$ and the complement $\bar{y}$. We want to measure the difference between the a priori distribution $f(Y)$, i.e. $f(Y = y)$ and $f(Y \neq y)$, and the a posteriori distribution $f(Y \mid \vec{X})$, i.e. $f(Y = y \mid \vec{X} = \vec{x})$ and $f(Y \neq y \mid \vec{X} = \vec{x})$. The "j-measure" (small j) is defined as "the average mutual information between the events ($y$ and $\bar{y}$) with the expectation taken with respect to the a posteriori probability distribution of $(Y)$." [41, pg. 304]. Denote $f(\bar{y}) = 1 - f(y)$ and $f(\bar{y} \mid \vec{x}) := 1 - f(y \mid \vec{x})$:

\[
\begin{aligned}
j(Y \mid \vec{X} = \vec{x}) &:= f(y \mid \vec{x}) \cdot \log_2\!\left( \frac{f(y \mid \vec{x})}{f(y)} \right) + f(\bar{y} \mid \vec{x}) \cdot \log_2\!\left( \frac{f(\bar{y} \mid \vec{x})}{f(\bar{y})} \right) \\
&= f(y \mid \vec{x}) \cdot \log_2\!\left( \frac{f(y \mid \vec{x})}{f(y)} \right) + \left(1 - f(y \mid \vec{x})\right) \cdot \log_2\!\left( \frac{1 - f(y \mid \vec{x})}{1 - f(y)} \right)
\end{aligned} \tag{28}
\]

This measure is maximized when the "transition" probability $f(Y = y \mid \vec{X} = \vec{x})$ equals 1 (or 0), and minimized (= 0) when the transition probability equals the a priori probability $f(Y = y)$. "In this sense the j-measure is a well-defined measure of how dissimilar our a priori and a posteriori beliefs are about (Y) -- useful rules imply a high degree of dissimilarity." [41, pg. 305].

Summarizing, the j-measure includes two important features. The first is the "goodness of fit" between the rule hypothesis and the data, expressed by maximal values for transition probabilities $f(Y = y \mid \vec{X} = \vec{x})$ close to 1 (or 0 for a negative rule). The second is the amount of "dissimilarity" compared with the unconditionalized distribution. A rule whose confidence is similar to the overall conclusion probability, $f(Y = y \mid \vec{X} = \vec{x}) \approx f(Y = y)$, wouldn't make much sense, even if that probability is close to 100%. As an example, imagine 90% of all customers buy milk; then a rule "buying bread → buying milk with c = 91%" wouldn't be very useful. The implication of buying milk is not given by buying bread, it is just a general pattern.

A third feature is "simplicity", which is combined with the j-measure to form the J-measure. Simplicity is a measure for the complexity of a rule's precondition. The more likely the truth of the precondition, the simpler and more useful the rule. But the likelihood of the precondition is just the probability $f(\vec{X} = \vec{x})$. Therefore the average information content of a rule can be defined as:

\[
J(Y; \vec{X} = \vec{x}) := f(\vec{x}) \cdot j(Y \mid \vec{X} = \vec{x}) \tag{29}
\]
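A minimal sketch of the j- and J-measures for a single candidate rule (my own helper functions; in practice the probabilities would be estimated from the data):

```python
import math

def j_measure(p_y, p_y_given_x):
    """j(Y | X = x): directed divergence between the a posteriori and a priori
    distributions of the binary event (y vs. not-y), equation (28)."""
    def term(post, prior):
        return post * math.log2(post / prior) if post > 0 else 0.0
    return term(p_y_given_x, p_y) + term(1 - p_y_given_x, 1 - p_y)

def J_measure(p_x, p_y, p_y_given_x):
    """J(Y; X = x) = f(x) * j(Y | X = x), equation (29): simplicity times dissimilarity."""
    return p_x * j_measure(p_y, p_y_given_x)

# hypothetical rule: the precondition holds for 20% of the records, the conclusion holds
# a priori for 30% of all records but for 90% of the records satisfying the precondition
print(J_measure(p_x=0.2, p_y=0.3, p_y_given_x=0.9))
```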


ITRULE Algorithm: The ITRULE algorithm [41, pp. 306-308] takes an overall database relation (with discrete dimensions) as input and generates a set of the K most informative rules according to the J-measure. K is a user-defined parameter.

The algorithm starts out with rules that have first-order conditions, i.e. X consists of only one (variable, value) pair ( X = {(X1, x1)} ). It finds K rules, calculates their J-measures, and places these rules in an ordered list. The smallest J-measure defines the "running minimum" $J_{\min}$. Then the J-measure of each new candidate for the solution set is compared with $J_{\min}$; better rules are inserted and $J_{\min}$ is updated. Before continuing the evaluation of first-order rules, it is decided for each rule whether further specialization, meaning adding (variable, value) pairs to the precondition, is worthwhile.

In particular, we can calculate an upper bound for the information content attainable by specialization (see [41] for the derivation):

\[
J_s = \max\!\left\{ f(\vec{x}) \cdot f(y \mid \vec{x}) \cdot \log_2\!\left( \frac{1}{f(y)} \right),\; f(\vec{x}) \cdot \left(1 - f(y \mid \vec{x})\right) \cdot \log_2\!\left( \frac{1}{1 - f(y)} \right) \right\}
\]

If the bound $J_s$ is less than $J_{\min}$, then specialization cannot possibly produce a candidate for the solution set and we back up the search from this rule. Also, if the transition probability $f(y \mid \vec{x}) = 1$ (or $= 0$) then we cannot increase the information content by specializing, as we would need to offset the decrease in simplicity by an increase in goodness-of-fit (which is already maximal). In all other cases we continue to specialize in order to find a better, more specialized rule.
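The pruning test can be sketched as follows (my own code, following the bound quoted above; parameter names are mine):

```python
import math

def specialization_bound(p_x, p_y, p_y_given_x):
    """Upper bound J_s on the J-measure attainable by specializing a rule's precondition."""
    return max(p_x * p_y_given_x * math.log2(1.0 / p_y),
               p_x * (1.0 - p_y_given_x) * math.log2(1.0 / (1.0 - p_y)))

# back up the search below a rule if even its bound cannot beat the running minimum
J_min = 0.25
print(specialization_bound(p_x=0.2, p_y=0.3, p_y_given_x=0.9) < J_min)
```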

The problem with this algorithm is its computational complexity. Let $d_1, d_2, \ldots, d_n$ be the dimensions of the overall relation and denote by $k_i := |\mathrm{dom}(d_i)|$, $i = 1, \ldots, n$, the number of values in the domain of variable $d_i$. Then $k = k_1 + k_2 + \ldots + k_n$ is the number of possible rule conclusions, and $k^2$ is the approximate number of first-order rules ($2k^2$ if we separate transition probabilities close to 1 and to 0). For all these rules J-measures and bounds need to be computed, which can be many if the database is wide and therefore k is large. Furthermore a much greater number of specializations of these rules needs to be evaluated; we presume that altogether this is on the order of $O(k^n)$.

The use of value hierarchies (coarsening, refinement) is an important feature and needs to be considered in further implementations of this algorithm; so far hierarchies are not supported.

A number of other variations of this algorithm are conceivable. Instead of specifying the number K, the user could define a minimum information content $J_{\min}$. Other constraints on variables and values could be user-defined. See also the paragraph `Discussion' on using rule inference in a user-guided manner.

Association Rules: The similar problem of finding "association rules" was introduced by Agrawal et al. in 1993 [4] in the context of market-basket research (section 3.3). In this field the buying behavior of customers is the main interest; we want to investigate which subsets of items are usually bought together in one "transaction". Association rules therefore usually operate on sets of items, but by viewing one (variable, value) pair as one item and one database record as a transaction we can transfer our problem to this methodology. Note that by doing so some restrictions are imposed on the transactions: each transaction has the same number of items, namely exactly one for each variable, and items with different values for the same variable are excluded from being in the same transaction.

An association rule is defined as an expression:

\[
X \rightarrow Y \text{ with confidence } c \text{ and support } s, \tag{30}
\]

where

\[
I := \{ i_1, i_2, \ldots, i_m \}
\]

is the set of all items and

\[
X := \{ i_{x_1}, i_{x_2}, \ldots, i_{x_r} \} \subset I, \quad 1 \le r \le (m - 1)
\]
\[
Y := \{ i_y \} \subset I
\]

are subsets of items. The interpretation of an association rule is that "transactions in the database which contain the items in X also tend to contain the items in Y" [47]. The data D to be investigated consists of a set of transactions $T_i$, which are themselves subsets of items, i.e. all the items a customer bought at the same time:

\[
D := \{ T_1, T_2, \ldots, T_n \}, \quad T_i \subseteq I
\]

As already described, there are two basic measures for evaluating the strength of an association rule. The "confidence" (c) measures the percentage of transactions in which the items in Y were bought whenever the items in X were bought (f represents the frequency distribution induced by the data D):

\[
c = c(X, Y) := f(Y \subseteq T \mid X \subseteq T) \tag{31}
\]


The "support" (s) expresses the significance of the rule as the percentage of all transactions which contain both the items in X and in Y:

\[
s = s(X, Y) = s(X \cup Y) := f((X \cup Y) \subseteq T) \tag{32}
\]

The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and confidence. It is decomposed into 3 parts [47] (a small sketch of the first two steps follows this list):

1. Find all itemsets ("frequent itemsets") which are subsets of more transactions $T_i$ than the user-defined minimum; in other words, find all subsets of I which satisfy a specified minimum support. This can be done with a simple cumulative algorithm because of the nestedness of the subset property: if an itemset fulfills the requirement, so do all of its subsets. By starting with single items and sequentially joining itemsets while checking for minimum support we are able to efficiently determine the frequent itemsets. In this procedure we also automatically obtain the support of all frequent itemsets including their subsets (which are also frequent itemsets). This is used in the next step.

2. Use the frequent itemsets to generate the desired rules. The general idea is that if {i1, i2, i3, i4} is a frequent itemset, then we can determine the confidence of a rule {i1, i2} → {i3, i4} by computing c({i1, i2}, {i3, i4}) = s({i1, i2, i3, i4}) / s({i1, i2}).

3. Prune all uninteresting rules from this set. The criteria for "interesting" must be defined according to the Purpose of Investigation.
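The following is my own minimal sketch of steps 1 and 2 on a toy transaction set; real implementations such as Apriori use far more careful candidate generation and data structures:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: any superset of an infrequent itemset is itself infrequent,
    so candidates of size k are built only from frequent itemsets of size k-1."""
    n = len(transactions)
    support = {}
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        frequent = set()
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                support[c] = s
                frequent.add(c)
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
    return support

def rules_from_itemsets(support, min_confidence):
    """Generate rules X -> Y from the frequent itemsets; confidence = s(X u Y) / s(X)."""
    rules = []
    for itemset, s in support.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                if antecedent in support and s / support[antecedent] >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), s, s / support[antecedent]))
    return rules

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"milk", "butter"}, {"bread", "butter"}, {"milk", "bread"}]
supp = frequent_itemsets(transactions, min_support=0.4)
for x, y, s, c in rules_from_itemsets(supp, min_confidence=0.7):
    print(sorted(x), "->", sorted(y), "support", s, "confidence", round(c, 2))
```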

Using this concept several fast algorithms have been developed. For moredetails and references see one of the many papers on association rules [4, 47,36] etc.

As mentioned earlier, hierarchies of values, or in this case hierarchies defined on the itemsets, are important for finding general patterns. This is especially true here, as we preselect rules according to their support. The name for using hierarchies (taxonomies) in market-basket research is "Generalized Association Rules"; appropriate algorithms have been developed [47].

The confidence and support measures are fairly elementary and not necessarily sufficient for describing interesting rules. Other measures based on statistics, e.g. the chi-square test, have been proposed [36].


Discussion: Most attention in the rule-induction literature is spent on association rules. Though only fairly basic information measures are used, the mining of association rules is easily implemented and offers fast computational algorithms due to the minimum support criterion. The already established use of hierarchies (taxonomies) in the framework of generalized association rules is another advantage.

The disadvantage is that measuring rules by their support and confidence is not always sufficient. The important distinctiveness of the transition probability compared to the a priori probability is not taken into account (though some experiments via the chi-square test have been done [47, 36]). Furthermore it may be difficult to find a lower limit for the support: choosing it too small slows down the algorithm drastically, while choosing it too large we might overlook many interesting patterns. While in general market research we may only be interested in significant overall rules, in other areas like fraud detection we are also interested in small, seldom occurring patterns which are very distinct from the overall behavior.

The J-measure represents a more interesting approach. It combines simplicity (which is similar to the support s) with distinctiveness and confidence in an information-theoretic way. The big problem with this measure is the high computational complexity of its algorithm. Though some of the `specializations' are skipped thanks to the computation of upper bounds, the evaluation and bounding of all first-order rules alone can be exhausting for wide database relations. Also the use of hierarchies is not yet considered. Recalling the complexity of the algorithm, we probably want to use, either way, as few values as possible for each dimension.

In general, when talking about measures we need to consider that one-dimensional evaluations are always tradeoffs between many criteria. We can't expect one measure to be useful for all our problems. Finding big overall patterns and finding structures for classification purposes are already two quite distinct aims. I only discussed two dominant approaches for rule finding, but it should be clear that the "Purpose of Investigation" (section 1.2) should point out the direction of our measures. It may be that several different, complementary measures should be used.

The need for hierarchies for finding general as well as specialized rules has already been pointed out several times. In general their use needs to be considered in combination with the measures and the purpose of investigation.

Guided search for rules is another fashion of rule finding which not only cuts down computational time but can also be very useful to the investigator. In this approach the user interactively selects one or more (variable, value) pairs for the precondition or the conclusion of the rule. The algorithm then only evaluates rules which contain the given pairs.

Similar to the "Meta-query" used in the product KnowledgeMiner [46], we could also define "Meta-rules" by specifying some or all of the variables, while the algorithm searches for appropriate values. For example we might specify (X1 = income) and (Y = car) when we want to know if there are any rules which connect specific income groups to some make of car.

One danger in using rule induction is that the results may be misleading. As will be described in section 4, rules do not necessarily show any causality. Also, depending on the measure, we will find "good" rules which do not even have any correlation between the precondition and the conclusion. I already mentioned this in the context of distinctiveness. For example, if 90% of all customers buy milk and independently 90% buy bread, then the rule `bread → milk' would have a natural support of 0.81 and confidence of 0.90. The same is true for the opposite rule `milk → bread'.

Relationships: Finally I want to relate rule finding to the basic techniques and to other methods. We saw that we are interested in the marginal distribution of one variable (Y), which is just a projection onto this variable. The focus is to find a condition, meaning a subsetting to values X1 = x1, ..., Xn = xn, such that the distinctiveness between the subsetted projection (the a posteriori distribution) and the plain projection (the a priori distribution) is high and the entropy of the a posteriori distribution is low. As we remember from section 2.2, low entropy means that the uncertainty about the conditionalized variable Y is very low, which in turn reflects a high transition probability into one value y ∈ dom(Y).

To summarize, "mining rules" means subsetting and projecting: subsetting on the precondition and projecting onto the conclusion. Probability and entropy measures are then used to evaluate the corresponding rule.

In comparison with reconstructability analysis, rule induction (similar to DEEP) looks more into specific relationships between values, while reconstructability analysis concentrates on general structure between variables.

If we preset the conclusion variable Y then a connection to logistic regression and general linear models becomes apparent. All these methods focus on finding other variables which are good at discriminating one value y ∈ dom(Y) from the other values. The only difference is (as already discussed in section 3.3) that we can use `ordered' aggregation functions like multiplication, addition and other continuous functions for logistic regression etc., while in this case we are `stuck' with logic functions: X1 = x1 ∧ X2 = x2 ∧ ... ∧ Xn = xn.

3.4 Supervised, Nominal methods

A traditional supervised method will be introduced as the last method for the nominal domain. It demonstrates the successful use of Shannon's entropy (though it is not often recognized as such) and shows the potential use of supervised methods for unsupervised structure finding.

Other supervised, nominal methods can be derived from the supervised, continuous domain. One example is the "Logit" model, which can be described as logistic regression for nominal variables. Instead of using a linear combination of the variables it just uses effect variables for each nominal value (similar to ANOVA and log-linear models).

3.4.1 Decision Trees

As the name points out, decision trees are used for decision making, which simply means finding the right classification for a particular situation (entity). They work by coupling simple rules in a tree-like structure. For each new entity the conditions of the single top node are tested first. For example, is variable age < 20, 20 ≤ age < 40, or 40 ≤ age? According to the answer the entity is referred to one of several subnodes of the tree, each testing further, more specialized conditions. This process of testing conditions and referring to subnodes is iterated until we reach a leaf of the tree. Because tree construction aims at leaves that are (almost) only reached by representatives of one class, each leaf can be associated with one of the classification classes. This knowledge is then used for classification.

Decision tree building: The problem is how to construct a decision tree, i.e. how to infer structure and classification rules from a data set. Most of the construction algorithms are referred to as Top-Down Induction of Decision Trees (TDIDT) [45]: induction, because the knowledge is acquired inductively from the data set; top-down, because a candidate rule is chosen first for the single top decision node and then each of its subsets is recursively partitioned. The splitting is terminated if all members of a subset belong to the same class or no further decision criteria are left. Some newer algorithms also stop splitting earlier, referred to as "pre-pruning" the tree, if the classification improvement seems insignificant. Other algorithms replace insignificant subtrees by leaves in a post-processing step, known as "post-pruning".

Rule-Selection: How are rules selected for each node? The most famous approach, used in ID3 [45], is called "information gain" and is based on Shannon's entropy. "Variants of the information gain (...) have become the de facto standard metrics for TDIDT split selection" [32, pg. 263]. Let C be the classification variable and X = {X1, ..., Xr} a set of variables with nominal or discretized values. The variable set X is to be evaluated as a node condition for separating the classification classes c ∈ C. The information gain $G_C(X)$ is then defined as:

\[
G_C(X) := H(C) - H(C \mid X) \tag{33}
\]

From section 2.2 we know that the information gain is 0 if the variables C and X are independent ($H(C) = H(C \mid X)$). The more they are "correlated", the lower the predictive uncertainty $H(C \mid X)$ and the higher the information gain. Several variations of this measure are known [32]. One of them is called "gain ratio" and tries to compensate for the known bias of the gain towards variable sets with larger domains (more values):

\[
GR_C(X) := \frac{H(C) - H(C \mid X)}{H(X)} = \frac{G_C(X)}{H(X)} \tag{34}
\]
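A minimal sketch of split selection with these two measures on a toy training set (my own code and attribute names, not taken from ID3 itself):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H (bits) of a list of nominal values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def information_gain(records, attribute, class_attr="C"):
    """G_C(X) = H(C) - H(C | X) for one candidate attribute (equation 33)."""
    n = len(records)
    h_conditional = 0.0
    for value in {r[attribute] for r in records}:
        part = [r[class_attr] for r in records if r[attribute] == value]
        h_conditional += len(part) / n * entropy(part)
    return entropy([r[class_attr] for r in records]) - h_conditional

def gain_ratio(records, attribute, class_attr="C"):
    """GR_C(X) = G_C(X) / H(X) (equation 34)."""
    h_x = entropy([r[attribute] for r in records])
    return information_gain(records, attribute, class_attr) / h_x if h_x > 0 else 0.0

data = [{"age": "<20",   "income": "low",  "C": "no"},
        {"age": "<20",   "income": "high", "C": "yes"},
        {"age": "20-40", "income": "high", "C": "yes"},
        {"age": ">=40",  "income": "low",  "C": "no"},
        {"age": ">=40",  "income": "high", "C": "yes"}]
print({a: round(gain_ratio(data, a), 3) for a in ["age", "income"]})   # pick the best split
```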

Other measures include the chi-square statistic, Fisher's exact test, the category utility measure, etc. [32].

The important point in selecting rules for decision nodes is that we want to maximize class homogeneity within partitions and maximize class distance between partitions.

Algorithm and Decision Tree Measures: Similar to reconstructability analysis (section 3.3.2), in decision tree methods we need to deal with tradeoffs between accuracy and complexity. A decision tree can be very complex and accurate but then overfit the training data. On the other hand a decision tree can be very simple, but inaccurate. Several measures for comparing different algorithms and the resulting decision trees have been introduced:


1. "accuracy" := the percentage of correct classifications on a cross-validation data set.

2. "complexity" := the total number of leaves of the tree.

3. "efficiency" := the average depth of the tree (top node to leaf). This describes the average cost of using the decision tree for classification.

4. "practicality" := the time spent for tree building, pruning and cross-validation.

Another important issue to be considered is the size of the training data at each decision node. "A fundamental principle of inference is that the degree of confidence with which one is able to choose is directly related to the number of examples" [32, pg. 258]. Therefore inferences made near the leaves of a TDIDT decision tree tend to be (statistically) less significant and reliable than those made near the root. This problem is closely connected with overfitting the training data.

The interpretation of decision trees is also difficult. Though understanding the classification reasons of a decision tree is not as difficult as for MLP neural networks, the captured structures are often not obvious to human users.

Structure finding: The purpose in supervised classification is to find variables tightly connected to the classification variable. For each node in the decision tree it needs to be decided which variables and values (candidates) are most appropriate for this class separation.

In this sense classification can be seen as a directed structure-finding problem related to one variable, the classification variable. But if we can find structure relative to one variable we can use this for finding structures involving other variables. In a user-guided approach these algorithms could furnish the user with knowledge about the most closely classification-related variables for any selected variable. Queries like "which variable is most correlated with this variable?" could be answered with flat decision trees. In this way decision trees could also be used as unsupervised methods.

Comparing TDIDT with the basic techniques, it is obvious that projecting and subsetting are used extensively. For finding rules and conditions to separate the classes we use projections and calculate entropies and conditional entropies on them. Iterating this process with subnodes we continuously subset the data and specialize the search.


4 Dangers in Data-mining

Data-mining methods aim to derive hypotheses from the data (section 1.1). Therefore data-mining results are also only hypotheses and need to be taken with care, as there are several dangers associated with "misinterpreting" a hypothesis.

The general problem of causality, the relationship of wholes and parts,and the limiting factor of a training-set are three of these dangers that willbe discussed in the following sections.

4.1 Causality

In the discussion of rule inference (section 3.3.5) we already noted one problem with data-mining results. Rules, denoted in the form X → Y, often give humans the feeling of some causal relationship, X "implies" Y. But instead everything is just based on some data observations. By using only the support and confidence measures we saw that some "good" results can be triggered by totally independent variables (Milk → Bread with s = 0.81 and c = 0.90). This rule expresses that Milk and Bread have in common that they appear in most transactions (independently), which can still be quite useful information once the rule's hypothesis is carefully evaluated.

Even if we measure correlation between variables, this relationship does not necessarily describe any causality. I want to state an even stronger hypothesis: from observational and historical data it is generally impossible to infer any causal relationships. The only scientific approach for concluding causality is the experimental one. This means first stating a hypothesis, then carrying out experiments with known input variables and observing the responding variables relative to the actively changed inputs.

To support my hypothesis visualize the following example: when we observe in some kind of database that all vegetarians live longer and are less often sick, we could think that becoming vegetarians ourselves would also improve our health and life expectancy. But that may not necessarily be the case. The structure that we found in the database, the rule that all the vegetarians live longer and healthier, doesn't imply any causality; it's just a correlation found in the observations. It may be that all these people are vegetarians because of a totally different attitude towards life. Perhaps they take more care in general about their health, their food, etc., which then is the reason for the longer and healthier life. Perhaps being vegetarian is only one probable effect of this attitude and by no means a cause. On the other hand, if we believe in a causality conclusion from data observations, we might as well conclude that longer life and better health cause (with some probability) being a vegetarian.

We still try to infer causalities from recorded data, but it should be noted that all this is done in the context of huge background knowledge and already known laws about the world. For example, most people would not conclude that better health causes vegetarian eating habits, just because this conflicts with their a priori knowledge (or belief). In other cases we are able to explain some relationships (even if previously unknown) by some external knowledge. This then might justify a causal conclusion.

The whole issue of causality has a much wider context in the philosophyof science, in particular the problem of induction.

4.2 Parts and Wholes

We also need to be careful when making judgments from marginalized and aggregated data. Imagine a "fair" university which accepts the same rate $p_{\mathrm{field}}$ of female applicants and male applicants for each field, for example the best 10% of female applicants and the best 10% of male applicants for education. Let's also say that the total number of female applicants equals the number of male applicants. Assume the following distribution of applicants (with simplified numbers):

Field            | rate (%) | f. applicants | f. accepted | m. applicants | m. accepted
Education        |    10    |     1000      |     100     |      100      |     10
Social Sciences  |    20    |      500      |     100     |      300      |     60
Engineering      |    30    |      200      |      60     |     1300      |    390
total            |          |     1700      |     260     |     1700      |    460

If we look at the aggregated total distribution, the selection of students seems biased: many more male students (460) are accepted than female students (260). The reason is that female students applied more often to fields with a lower acceptance rate (higher competition). But this information is no longer visible in the total (projected) distribution; parts (marginals, projections) do not in general determine wholes (overall distributions). A similar example with patients in a hospital can be found in Glymour et al. [11, pg. 20].


4.3 Training Data

Another precaution needs to be noted in the context of fraud detection with supervised methods. In most cases only a small percentage of fraud is identified and used for supervising a method. The result is that the method will only identify fraud which is similar to the already known fraud. To investigate other fraud schemes we need to use unsupervised methods for finding general patterns, which can then be tested for fraudulent behavior using external knowledge and/or investigations.

4.4 Summary of Dangers

Let's summarize some of the dangers in data-mining (adapted from [11]):

1. Associations in databases may be due in whole or in part to unrecorded common causes and therefore may not indicate any direct causality.

2. Variable values may be the result of feedback mechanisms which are neither shown in the data nor represented by non-recursive models.

3. There might be an (unknown) preselection criterion for an entity being in the examined database. For example, questionnaires are seldom filled out by a random population sample.

4. It needs to be carefully evaluated what a data-mining result really expresses (compare the milk → bread rule and the acceptance-of-students example).

Other precautions and examples can be found in [6, pp. 37-38] and [11, pp. 20-22].

References

[1] Adriaans, P. and Zantinge, D. Data Mining. Addison Wesley Longman, Harlow, England, 1996

[2] Agrawal, R. et al. Modeling Multidimensional Databases. Research Report, IBM Almaden Research Center, San Jose, CA, 1997

[3] Agrawal, S. et al. On the Computation of Multidimensional Aggregates. Proceedings of the 22nd VLDB Conference, Bombay, India, 1996

[4] Agrawal, R. et al. Mining Association Rules between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD Conference, Washington, DC, May 1993

[5] Boyce, W.E. and DiPrima, R.C. Elementary Differential Equations, 5th edition. John Wiley & Sons, Inc., New York, NY, 1986

[6] Cabena, P. et al. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Inc., Upper Saddle River, NJ, 1997

[7] Cavallo, R.E. and Klir, G.J. Reconstructability analysis of multi-dimensional relations: a theoretical basis for computer-aided determination of acceptable system models. International Journal of General Systems, 5 (1979), 143-171

[8] Christensen, R. Analysis of Variance, Design and Regression. Chapman & Hall, London, UK, 1996

[9] Fayyad, U.M. Editorial. Data Mining and Knowledge Discovery, 1 (1997), 5-10

[10] Fisher, R.A. The Statistical Utilization of Multiple Measurements. Annals of Eugenics, 8 (1938), 376-386

[11] Glymour, C. et al. Statistical Themes and Lessons for Data Mining. Data Mining and Knowledge Discovery, 1 (1997), 11-28

[12] Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Inc., 1989

[13] Goodman, R.M. and Smyth, P. An information-theoretic model for rule-based expert systems. 1988 Int. Symposium on Information Theory, Kobe, Japan, 1988

[14] Hai, A. and Klir, G.J. An empirical investigation of reconstructability analysis: probabilistic systems. International Journal of Man-Machine Studies, 22 (Feb 1985), 163-192

[15] Hartley, R.V.L. Transmission of Information. Bell Systems Technical Journal, 7 (July 1928), pp. 535

[16] Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression. Wiley series in probability and mathematical statistics, Wiley & Sons, New York, NY, 1989

[17] Efron, B. The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70 (1975), 892-898

[18] Han, J. Data Mining Techniques and Applications. UCLA short class, 2-5 Feb. 1998

[19] Iverson, G.R. and Norpoth, H. Analysis of Variance. Sage Publications, Inc., Beverly Hills, CA, 1976

[20] Johnson, R.A. and Wichern, D.W. Applied Multivariate Statistical Analysis. Prentice Hall, Inc., Englewood Cliffs, NJ, 1988

[21] Jones, B. K-systems versus classical multivariate systems. International Journal of General Systems, 12 (1986), 1-6

[22] Jones, B. and Gouw, D. The Interaction Concept of K-Systems Theory. International Journal of General Systems, 24 (1996), 163-169

[23] Joslyn, C. Towards General Information Theoretical Representations of Database Problems. Proceedings of the 1997 Conference of the IEEE Society for Systems, Man, and Cybernetics

[24] Joslyn, C. Data Exploration through Extension and Projection. As yet unpublished technical report, 1998

[25] Klir, G.J. On systems methodology and inductive reasoning: the issue of parts and wholes. General Systems Yearbook, 26, 29-38

[26] Klir, G.J. Architecture of Systems Problem Solving. Plenum Press, New York, NY, 1985

[27] Klir, G.J. and Parviz, B. General reconstruction characteristics of probabilistic and possibilistic systems. International Journal of Man-Machine Studies, 25 (Oct. 1986), 367-397

[28] Klir, G.J. and Folger, T. Fuzzy Sets, Uncertainty, and Information. Prentice Hall, Englewood Cliffs, NJ, 1988

[29] Klir, G.J. and Yuan, B. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, Inc., Upper Saddle River, NJ, 1995

[30] Knoke, D. and Burke, P.J. Log-Linear Models. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-020, Sage Publications, Beverly Hills and London, 1980

[31] Lin, T.Y. and Cercone, N. Rough Sets and Data Mining. Kluwer, Norwell, MA, 1997

[32] Martin, J.K. An exact probability metric for decision tree splitting and stopping. Machine Learning, 28 (1997), 257-291

[33] McCulloch, W.S. and Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5 (1943), 115-133

[34] Miller, I., Freund, J.E., and Johnson, R.A. Probability and Statistics for Engineers, 4th edition. Prentice Hall, Englewood Cliffs, NJ, 1990

[35] Mitchell, M. An Introduction to Genetic Algorithms. MIT Press, 1996

[36] Motwani, R., Brin, S., and Silverstein, C. Beyond Market Baskets: Generalizing Association Rules to Correlations. 1997 ACM SIGMOD Conference on Management of Data, 1997, pp. 265-276

[37] Nilsson, N.J. Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York, NY, 1965

[38] Pittarelli, M. A Note on Probability Estimation Using Reconstructability Analysis. International Journal of General Systems, 18 (1990), 11-21

[39] Pittarelli, M. An Algebra for Probabilistic Databases. IEEE Transactions on Knowledge and Data Engineering, 6 (April 1994), 293-303

[40] Shannon, C.E. A Mathematical Theory of Communication. The Bell System Technical Journal, 27 (1948), 379-423

[41] Smyth, P. and Goodman, R.M. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, 4 (Aug. 1992), 301-316

[42] Piatetsky-Shapiro, G. and Frawley, W.J. (eds.). Knowledge Discovery in Databases. AAAI Press / MIT Press, Menlo Park, CA, 1991

[43] Pandya, A.S. and Macy, R.B. Pattern Recognition with Neural Networks in C++. CRC Press, Inc., Boca Raton, FL, 1996

[44] Popper, K.R. The Logic of Scientific Discovery. New York, NY, 1959

[45] Quinlan, J.R. Induction of decision trees. Machine Learning, 1, 81-106

[46] Shen, W-M. et al. An Overview of Database Mining Techniques. http://www.isi.edu/ shen/Tsur.ps

[47] Srikant, R. and Agrawal, R. Mining Generalized Association Rules. Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995

[48] Veelentwurf, L.P.J. Analysis and Applications of Artificial Neural Networks. Prentice Hall, Inc., Hertfordshire, UK, 1995

[49] Wasserman, P.D. Advanced Methods in Neural Computing. Van Nostrand Reinhold, New York, NY, 1993

[50] Zwick, M., Shu, H., and Koch, R. Information-Theoretic Mask Analysis of Rainfall Time Series Data. Advances in System Science and Application, 1995, Special Issue I