

Data mining in medical databases

Anna Szymkowiak Have

LYNGBY 2003

IMM-PHD-2003-121

IMM



TECHNICAL UNIVERSITY OF DENMARK
Informatics and Mathematical Modelling
Richard Petersens Plads, Building 321,
DK-2800 Kongens Lyngby, Denmark

Ph.D. Thesis
In partial fulfillment of the requirements for the degree of Doctor of Philosophy

Title: Data mining in medical databases

Author: Anna Szymkowiak Have


Lyngby 2003-07-31
Copyright © 2003 by Anna Szymkowiak Have
Printed by IMM, Technical University of Denmark

ISSN 0909–3192


Abstract

This Ph.D. thesis focuses on clustering techniques for Knowledge Discovery in Databases. Various data mining tasks relevant for medical applications are described and discussed. A general framework which combines data projection, data mining and interpretation is presented. An overview of various data projection techniques is offered, with the main stress on applied Principal Component Analysis. For clustering purposes, various Generalizable Gaussian Mixture models are presented. Furthermore, the aggregated Markov model, which provides the cluster structure via a probabilistic decomposition of the Gram matrix, is proposed. Other data mining tasks described in this thesis are outlier detection and the imputation of missing data. The thesis presents two outlier detection methods, based on the cumulative distribution and on a specially designated outlier cluster in connection with the Generalizable Gaussian Mixture model. Two models for imputation of missing data, namely the K-nearest neighbor and a Gaussian model, are suggested. With the purpose of interpreting a cluster structure, two techniques are developed. If cluster labels are available, then cluster understanding via the confusion matrix is possible. If the data is unlabeled, then it is possible to generate keywords (in the case of textual data) or key-patterns as an informative representation of the obtained clusters. The methods are applied to simple artificial data sets as well as to collections of textual and medical data.


Resumé

This Ph.D. thesis focuses on clustering techniques for the extraction of knowledge from databases. The thesis presents and discusses various data mining problems of relevance for medical applications. In particular, a general framework is presented which combines data projection, data mining and automatic interpretation. Within data projection, a number of techniques are reviewed, with special emphasis on applied Principal Component Analysis. A number of generalizable Gaussian mixture models are proposed for cluster analysis. In addition, an aggregated Markov model is proposed, which estimates the cluster structure via decomposition of a probability-based Gram matrix. Furthermore, the thesis describes two other data mining problems, namely outlier detection and imputation of missing data. The thesis presents outlier detection methods based partly on cumulative distributions and partly on the introduction of a special outlier cluster in connection with the generalizable Gaussian mixture model. With regard to imputation of missing data, two methods are presented, based on a K-nearest-neighbor or a Gaussian model assumption. Two methods have been developed for automatic interpretation of the cluster structure. When cluster annotations ("labels") are available, the confusion matrix forms the basis of the interpretation. If such annotations are not available, it is possible to generate keywords (in the case of text data) or, more generally, key-patterns, which thus contribute to the interpretation of the clusters. The proposed methods are tested on simple artificial data sets as well as on collections of text and medical data.


Preface

This thesis was prepared at Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial fulfillment of the requirements for acquiring the Ph.D. degree in engineering. The work was funded by the Danish Research Councils through SITE, signal and image processing for telemedicine. The project commenced in July 2000 and was completed in August 2003. Throughout the period the project was supervised by professor Jan Larsen and professor Lars Kai Hansen from IMM, DTU. The thesis reflects the studies done during the Ph.D. project, which relate to clustering data originating from different sources located in, e.g., various databases.

The thesis is printed by IMM, Technical University of Denmark, and is available as a softcopy from http://www.imm.dtu.dk

Publication Note

Parts of the work presented in this thesis have previously been published in journals and in conference proceedings. The following are the paper contributions in the context of this thesis.

A. Szymkowiak, P.A. Philipsen, J. Larsen, L.K. Hansen, E. Thieden and H.C. Wulf, Imputating missing values in diary records of sun-exposure study, in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing XI, editors: D. Miller, T. Adali, J. Larsen, M. Van Hulle, S. Douglas, pages 489-498. Own contribution estimated at 50%. Included in Appendix B.

A. Szymkowiak, J. Larsen, L.K. Hansen, Hierarchical Clustering for Datamining, in Proceedings of KES-2001, Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, pages 261-265. Own contribution approximated at 80%. Included in Appendix C.

J. Larsen, A. Szymkowiak-Have, L.K. Hansen, Probabilistic Hierarchical Clustering with Labeled and Unlabeled Data, in International Journal of Knowledge-Based Intelligent Engineering Systems, volume 6(1), pages 56-62. Contribution approximated at 30%. Included in Appendix D.

A. Szymkowiak-Have, J. Larsen, L.K. Hansen, P.A. Philipsen, E. Thieden and H.C. Wulf, Clustering of Sun Exposure Measurements, in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing XII, editors: D. Miller, T. Adali, J. Larsen, M. Van Hulle, S. Douglas, pages 727-735. Contribution approximated at 50%. Included in Appendix E.

J. Larsen, L.K. Hansen, A. Szymkowiak-Have, T. Christiansen and T. Kolenda, Webmining: Learning from the World Wide Web, in a special issue of Computational Statistics and Data Analysis, volume 38, pages 517-532. Contribution approximated at 10%. Included in Appendix F.

A. Szymkowiak-Have, M. Girolami and J. Larsen, Clustering via Kernel Decomposition, submitted to IEEE Transactions on Neural Networks, not yet published. Contribution approximated at 45%.

Nomenclature

Standard symbols and operators are used consistently throughout the thesis. Symbols and operators are introduced as they are needed. In general, matrices are presented in uppercase bold letters, e.g. X, while vectors are shown in lowercase bold letters, e.g. x. Letters not bolded are scalars, e.g. x.


Acknowledgement

I would like to thank Jan Larsen and Lars Kai Hansen for their supervision, support and answers to all my questions. Also, I would like to thank the rest of the people at the Intelligent Signal Processing group, especially the department secretary Ulla Nørhave for looking after me. Furthermore, I would like to thank the people I got to know during my stay at Paisley University, especially Mark Girolami, Jackie Royal and Leif Azzopardi, for all the discussions and fun both during work time and at the pubs.

Finally, I want to thank my husband Kim for the support and patience he has given me during my studies, and my family and friends for their support, their encouragement, and the carrot in front of my donkey.


Contents

1 Introduction 3

2 Dimensionality reduction and feature selection 9
2.1 Dimensionality reduction methods 10
2.1.1 Random Projection 10
2.1.2 Principal Component Analysis 13
2.1.3 Non-negative Matrix Factorization 14
2.1.4 Evaluation of dimensionality reduction methods 15
2.2 High dimensional data 18
2.3 Selection of number of principal components 19
2.4 Discussion 22

3 Generalizable Gaussian Mixture models 25
3.1 Maximum likelihood 26
3.2 Expectation-Maximization algorithm 27
3.3 Unsupervised Generalizable Gaussian Mixture model 29
3.4 Supervised Generalizable Gaussian Mixture model 33
3.5 Unsupervised/supervised Generalizable Gaussian Mixture model 34
3.6 Outlier & Novelty detection 37
3.6.1 Cumulative distribution 38
3.6.2 Outlier cluster 39
3.6.3 Evaluation of outlier detection methods 41
3.7 Discussion 42

4 Cluster visualization & interpretation 43
4.1 Hierarchical clustering 43
4.1.1 Similarity measures 45
4.1.2 Evaluation of similarity measures 48
4.2 Confusion matrix 50
4.3 Keywords and prototypes 51
4.4 Discussion 52

5 Imputating missing values 55
5.1 K-Nearest Neighbor model 60
5.2 Gaussian model 61
5.3 Discussion 62

6 Probabilistic approach to Kernel Principal Component Analysis 63
6.1 Density Estimation and Decomposition of the Gram Matrix 64
6.1.1 Kernel Decomposition 65
6.1.2 Examples of kernels 67
6.2 Discussion 68

7 Segmentation of textual and medical databases 71
7.1 Data description 71
7.2 Segmentation of textual databases 74
7.2.1 Preprocessing 74
7.2.2 Projection to the latent space 76
7.2.3 Clustering and clusters interpretation 76
7.2.4 Clustering of Email collection with USGGM model 94
7.3 Segmentation of medical database 97
7.3.1 Segmentation of the data from sun-exposure study 97
7.3.2 Imputation of missing values in Sun-Exposure study 107
7.4 Aggregated Markov model for clustering and classification 114
7.4.1 Clustering of the toy data sets 115
7.4.2 Clustering of the textual data collections 119
7.4.3 Classification of the medical data collections 122

8 Conclusion 125

A Equations 129
B Contribution: NNSP 2001 133
C Contribution: KES 2001 145
D Contribution: International KES Journal 2002 151
E Contribution: NNSP 2002 161
F Contribution: special issue CSDA 2002 173


Notation and Symbols

X_{D×N} : observation data matrix, where N denotes the number of examples, n = 1, 2, ..., N, and D is the dimensionality of the feature space, d = 1, 2, ..., D
x_dn : the d'th, n'th element of the data matrix X
x_n : n'th data vector
X_{K×N} : projected data matrix, where K, k = 1, ..., K, is the dimensionality of the projected feature space
z_n : n'th D-dimensional test data vector, where n = 1, 2, ..., N_z
|·| : determinant
||·||_2 : L2 norm (vector length), ||x_n||_2 = sqrt(sum_d x_nd^2)
./ : matrix element-wise division
1 : matrix with all elements equal to 1
p(x) : probability density function of the vector x
p(k|x) : class posterior probability of the class k conditioned on the vector x
p(x|k) : likelihood of the vector x
µ : mean vector, µ = (1/N) sum_n x_n
Σ : covariance matrix, Σ = (1/N) sum_n (x_n − µ)(x_n − µ)^T
θ : vector collecting the parameters of the model

EM : Expectation-Maximization algorithm
FA : Factor Analysis
GGM : Generalizable Gaussian Mixture model
I : identity matrix
ICA : Independent Component Analysis
i.i.d. : independent and identically distributed
KDD : Knowledge Discovery in Databases
KL : Kullback-Leibler divergence
KNN : K-Nearest Neighbor
KPCA : Kernel Principal Component Analysis
LSA : Latent Semantic Analysis
MAR : data missing at random
MCAR : data missing completely at random
ML : Maximum Likelihood methods
NMAR : data not missing at random
NMF : Non-negative Matrix Factorization
PCA : Principal Component Analysis
p.d.f. : probability density function
PLSA : Probabilistic Latent Semantic Analysis
RP : Random Projection
SVD : Singular Value Decomposition
UGGM : unsupervised Generalizable Gaussian Mixture model
USGGM : unsupervised/supervised Generalizable Gaussian Mixture model


C H A P T E R 1

Introduction

Recent progress in data storage and acquisition has resulted in a growing number of enormous databases. The information contained in these databases can be extremely interesting and useful, but the amount is too large for humans to process manually. These databases store information covering every aspect of human activity. In the business world, databases are created for marketing, investment, manufacturing, customers, products or transactions, and such information can be stored both for accounting and for analysis purposes. The domain of scientific research is another area which makes heavy use of databases, storing information about, for example, patients, patient histories, surveys and medical investigations. It is within this domain that the research for this thesis has been performed. Medical databases contain data in a variety of formats: images in the form of X-rays or scans; textual information describing details of diseases, medical histories, psychology reports and medical articles; or various signals like EKG, ECG, EEG, etc. This data does not need to be located on the same system; it may be distributed amongst various disparate computers depending on the source of the data. Such heterogeneous data make information retrieval an even more complex process.

There is a substantial need for signal/image processing and data mining tools in medical information processing systems and in telemedicine that are flexible, effective and user friendly.


Telemedicine is defined as: the investigation, monitoring and management of patients, and the education of patients and staff, using systems which allow ready access to expert advice and patient information no matter where the patient or the relevant information is located. Such systems serve a dual role: they can assist medical professionals in medical research and in clinical diagnosis, and they can be used as information sources for patients. Telemedicine faces a number of basic problems concerning document retrieval, data mining in the databases, and visualization of high-dimensional data.

This project focuses on fact finding in distributed databases by use of advanced signal processing and machine learning methods, which can be applied to extract the information important to medical scientists. Such data mining can also be used by patients to search for analogies, similar diseases or any other relevant information.

Data mining is a new discipline within the field of information retrieval that extracts interesting and useful material from large data sets. In the broader sense, data mining is defined as part of knowledge discovery in databases (KDD) [25, 26, 36] and draws on the fields of statistics, machine learning, pattern recognition and database management. The general approach of the KDD process is presented in figure 1.1.

The KDD process involves the following steps:

1. Selecting the target data (focusing on the part of the data on which to make discoveries). The data may consist of different types of information.

2. Data cleaning and preprocessing, which includes removal of noise, outliers and missing data (if they are not themselves an object of the discovery). See Section 3.6.

3. Transforming the data, which includes data reduction and projection in order to find and investigate useful features and to reduce the dimensionality of the feature space. See Chapter 2.

4. Performing the data mining task, e.g., classification, regression, clustering, as well as outlier detection and imputation of missing data if they were not removed in the data cleaning step. Data mining is performed in order to extract patterns and relationships in the data. See Chapter 3, Chapter 5 and Chapter 6.


5. Interpretation and visualization, which includes interpretation of the mined patterns and visualization of models, patterns and data. See Chapter 4.

Figure 1.1 The steps of the KDD process. In the first step the data is extracted from the source location, then preprocessing is performed and the data is transformed to a form suitable for further processing. In the next step data mining is applied, for example clustering, classification, regression etc. The final part of the KDD process is a meaningful interpretation of the obtained results.


Typically, data mining algorithms assume that the data is already stored in main memory, therefore no database extraction methods are investigated in this work. However, in order to facilitate the extraction of the required information from the databases, the issue of large data sets and the required preprocessing and transformation algorithms is addressed in Chapter 2. This is necessary because the data we are dealing with is observational and has usually been collected for a purpose other than data mining. Some of the preprocessing steps are uniquely connected with the data and, as such, they are described together with the experiments in Chapter 7. The relationships found in the data are presented using models for clustering, classification, outlier detection and imputation, in Chapters 3, 6 and 5, respectively. Finally, techniques for useful summarization, visualization and interpretation of the discovered structures are given in Chapter 4.

The contents of this thesis are organized in the following way:

Chapter 2 gives an overview of the several projection methods implemented: random projection, principal component analysis and non-negative matrix factorization. A literature overview of reduction techniques for huge data sets is also included, as is a discussion of the various ways to select the optimum number of components in the projected data. The chapter concludes with an evaluation of the presented projection methods applied to artificial data sets.

Chapter 3 is dedicated to the various Gaussian Mixture models that were used for classification and clustering of the data, focusing on possible medical databases. The algorithms of the unsupervised, supervised and unsupervised/supervised Generalizable Gaussian Mixture models are described. Two techniques are presented for outlier and novelty detection; the outlier detection methods are compared through a simple example.

Chapter 4 describes the different similarity measures and methods that can be used with agglomerative hierarchical clustering. In order to interpret the resultant clusters, a method for obtaining keywords and prototypes is also explained.

Chapter 5 explains two methods used for the imputation of missing values in the study, namely the Gaussian and K-Nearest Neighbor models.

Chapter 6 defines a different approach to unsupervised clustering by means of Gram matrix decomposition with the aggregated Markov model.


Chapter 7 collates the experimental results on the observational data sets. The first section describes the data sets that have been used. A collection of emails and newsgroups is used as an example of textual data. The next data set originates from a survey undertaken by the Department of Dermatology, Bispebjerg Hospital, University of Copenhagen, Denmark; this study examined the connection between the level of sun exposure of the skin and the risk of developing cancer. The third data set is a dermatological collection of six diagnosed erythemato-squamous diseases. Preprocessing is carried out separately for each of the data sets. Data projection is then performed, except for the aggregated Markov model, where feature space reduction is not required. Since data mining is the next step of the KDD process, the clustering of the text data sets and of the sun-exposure study is implemented using the various models. For visualization and interpretation, hierarchical clustering is applied and keywords are generated. With respect to the data from the sun-exposure study, an investigation of the performance of the imputation models is conducted.

Appendix A contains the definitions of Jensen's and Minkowski's inequalities and the derivation of the Kullback-Leibler similarity measure for Gaussian density functions that is used in agglomerative hierarchical clustering.

Appendices B-F contain reprints of the papers authored or co-authored during the Ph.D. study.


C H A P T E R 2

Dimensionality reduction and feature selection

Data transformation, shown in figure 1.1, is the first step of the KDD process discussed here.

When processing large databases, one faces two major obstacles: numerous samples and the high dimensionality of the feature space. For example, documents are represented by several thousands of words and images are composed of millions of pixels, where each word or pixel is here understood as a feature. Current processing abilities are often not able to handle such high dimensional data, mostly due to numerical difficulties in processing and to requirements on storage and transmission within a reasonable time. To reduce the computational time, it is common practice to project the data onto a smaller, latent space. Moreover, such a space is often beneficial for further investigation due to its noise reduction and feature extraction properties. Smaller dimensions are also advantageous when visualizing and analyzing the data. Thus, in order to extract the desirable information, dimensionality reduction methods are often applied.

This task is non-trivial. The goal is to determine the coordinate system where the mapping will create a low-dimensional, compact representation of the data whilst maximizing the information contained within it.


There are many solutions to this problem. Several techniques for dimensionality reduction have been developed which use both linear and non-linear mappings, among them, for example, low-dimensional projections of the data [10, 45, 40, 49], neural networks [3, 12, 74] and self-organizing maps [48]. One can apply second order methods which use the covariance structure in determining directions. To this family belong the popular Principal Component Analysis [10, 23, 44, 45], which restricts the directions to be orthogonal; Factor Analysis [40, 44, 45], which additionally allows the noise level to differ along the directions; and Independent Component Analysis [49], for which the directions are independent but not necessarily orthogonal. Descriptions of some of the afore-mentioned methods can be found in [7, 10, 14, 40, 42, 49, 55, 66, 84].

Since projection is not the main topic of this thesis, but a preprocessing step in the KDD process, only the few techniques that were investigated alongside the performed research are presented in detail. Three algorithms are described below, namely Random Projection [4, 9, 19], Principal Component Analysis [10, 14, 23, 44, 45, 66] and Non-negative Matrix Factorization [55]. A modified version of Non-negative Matrix Factorization is applied in the decomposition of the Gram matrix in the aggregated Markov model, described in Chapter 6.

2.1 Dimensionality reduction methods

2.1.1 Random Projection

Many research studies have focused on the investigation of the Random Projection method. Many of the details and experiments not included in this work can be found in [4, 9, 19].

The method is simple and it does not require the presence of the data whilst creating the projection matrix, which many find a significant advantage. Random projection is based on a set of vectors which are orthogonalized from a normally distributed random matrix. The orthogonal projection vectors (see footnote 1), satisfying R · R^T = I, preserve the distances among the data points.

Footnote 1: Two vectors are said to be orthogonal if the cosine of the angle between them equals 0, i.e. u · v^T = 0, where v · v^T = 1 and u · u^T = 1.


In this way it is possible to project the data onto a smaller dimensional space while preserving a fair separation of the clusters. Note, however, that R is orthogonal only column-wise, i.e. the identity outcome of the dot product of the projection vectors holds only for quadratic R (projecting onto a space of the same dimension as the original). In the case of projecting onto smaller dimensions, the distances are no longer preserved in an ideal way.

Let the D-dimensional data matrix be defined as X_{D×N}, where N is the number of samples. The data is projected with the help of the projection matrix R to K dimensions (K ≤ D), and the projected data matrix is denoted X_{K×N}. Figure 2.1 presents the algorithm of the Random Projection method.

The Random Projection Algorithm
1. Generate a random matrix R_{K×D} = {r_kd, k = 1...K, d = 1...D} drawn from a normal distribution with zero mean and spherical covariance structure, r_kd ∈ N(0, 1), with K ≤ D.
2. Orthogonalize each of the projection vectors (using, for example, the Gram-Schmidt algorithm; see [56] or most linear algebra books for reference).
3. Project the data onto the new set of directions: X_{K×N} = R_{K×D} · X_{D×N}

Figure 2.1 The Random Projection algorithm.
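As an illustration, a minimal NumPy sketch of the three steps of Figure 2.1 is given below. The function name is illustrative, and the QR factorization is used here only as a convenient stand-in for the Gram-Schmidt orthogonalization of step 2.

```python
import numpy as np

def random_projection(X, K, seed=None):
    """Project a D x N data matrix X onto K <= D dimensions with a
    column-wise orthogonal random projection matrix (Figure 2.1)."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: random matrix with r_kd drawn from N(0, 1).
    R = rng.standard_normal((K, D))
    # Step 2: orthogonalize the K projection vectors (rows of R);
    # the QR factorization of R.T plays the role of Gram-Schmidt.
    Q, _ = np.linalg.qr(R.T)      # Q is D x K with orthonormal columns
    R = Q.T                       # K x D, so that R @ R.T = I_K
    # Step 3: project the data onto the new set of directions.
    return R @ X                  # the projected K x N data matrix
```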

In order to check the success of the random projection method, the separability is investigated with the following toy example. In the example, 200 data points are generated from 2 Gaussian distributions with spherical covariance structure Σ = I and different mode locations, where the distance between the data means ||µ_1 − µ_2||_2 approximately equals 5. Only one dimension is enough to separate the clusters. Originally, the data spans 500 dimensions. The clusters are almost linearly separable, i.e. the cluster overlap is negligible. Also, by using Principal Component Analysis, described later in section 2.1.2 and presented here as a reference, the distance between the clusters is fully preserved.

Figure 2.2 presents the results of the experiment. The left plot shows the distance between the cluster centers. As was expected, the distance decays, and in a small dimensional space it can be expected that the data is poorly separated. This artifact is more significant for large dimensional and more complex data sets, such as sparse vector space representations of textual information.


[Figure 2.2: three panels plotted against the dimension of the reduced space: the distance ||µ_1 − µ_2||_2 (left), the success rate in % (middle) and the time in seconds (right).]

Figure 2.2 Illustration of the Random Projection method with the example of 2 Gaussian distributed clusters. Originally the measured distance between the clusters is close to 6. The experiments are repeated 20 times and the average performance is shown. The left plot shows the distance ||µ_1 − µ_2||_2 between the cluster centers as a function of dimensionality; it decays with the projected dimension, and the error bars, showing the maximum and minimum measured distance over the trials, increase with the dimension. The middle plot shows the success rate of the projection (the trials for which the distance is larger than 3; for smaller values the clusters are assumed not to be separable); the success rate also decays with the reduced dimension, which means that for large reductions there is a considerable chance that the projected data will lose separability. Note that one dimension is enough to linearly separate the data. The time needed to produce the projection matrix grows exponentially with the size of the matrix R (right plot).

The smaller the original dimensionality, the longer the separation is kept when the projection matrix is increased. The success rate (see footnote 2) shown in the middle plot decays significantly with the dimension. The projection is not satisfactory in all cases; the separability in some trials (see footnote 3) is lower due to the random origin of the projection vectors. The number of "unlucky" projections grows significantly with reduced dimension. The time needed to produce the projection matrix grows significantly with the size of the projection matrix R (right plot).

When the projected dimension is large enough, the distortion of the distance is not significant, but then the time spent creating and orthogonalizing the projection matrix is considerable. To summarize, for large dimensional data sets it is not necessarily efficient or satisfactory (see footnote 4) to use random projection without significant separability loss, especially when the dimensionality onto which the data is projected is high.

Footnote 2: The success rate is the percentage of projection runs in which the distance is preserved (in this case the clusters are assumed to lose separability when the distance between them is smaller than 3; originally the distance equals 5). The cluster distance is calculated from the projected known cluster centers.
Footnote 3: One particular run of the experiment is understood to be a trial.
Footnote 4: The benefits of random projection may vary according to the application.

Random Projection is not applied in further experiments, because the dimensionality of the data used in the experiments allowed us to use Principal Component Analysis (section 2.1.2). However, when facing larger databases it could be advisable to use RP as a preprocessing step before applying PCA. In such a case, creating the random projection matrix is less expensive than PCA and, with the projected space large enough, the distances in the data are not distorted significantly.

2.1.2 Principal Component Analysis

Principal Component Analysis (PCA) is probably the most widely used technique for dimensionality reduction. It is similar to random projection in performing a linear mapping that uses orthogonal projection vectors, but in the case of PCA the projections are found by diagonalizing the data covariance structure, i.e. the variance along the new directions is maximized. Thus, the projection of a vector x onto the new latent space spanned by the orthogonal projection vectors u_k can be written as x_k = u_k^T x, where k = 1...K and K ≤ D.

It can be proved that the minimum projection error (in the least squares sense) is achieved when the basis vectors satisfy the eigen-equations Σ · u_k = λ_k u_k [10]. Therefore, the singular value decomposition (SVD) is often performed to find the orthogonal projections. The algorithm for PCA is shown in figure 2.3. Such a method for determining the latent space is applied by Latent Semantic Analysis (LSA), described in section 7.2.2 and introduced in [20]. For a further and more comprehensive analysis of PCA, refer to [10, 23, 44, 45].

The most significant disadvantages of the PCA technique are its complexity and high memory usage. However, a great number of alternative algorithms have been developed that reduce this problem, for example using networks to estimate the first few eigenvalues and corresponding principal directions [3, 12].


Principal Component Analysis Algorithm
1. Create the data matrix X.
2. Determine the covariance Σ = (1/N) Σ_n (x_n − µ)(x_n − µ)^T.
3. Determine the eigenvalues λ_k and eigenvectors u_k of the covariance structure. Since Σ is positive and symmetric, λ is positive and real, satisfying Σ · u_k = λ_k u_k; λ is found by solving the characteristic equation |Σ − λI| = 0.
4. Sort the eigenvalues and corresponding eigenvectors in descending order.
5. Select K < D and project the data onto the selected directions: X_{K×N} = U^T_{D×K} · X_{D×N}

Figure 2.3 The Principal Component Analysis algorithm.
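A minimal NumPy sketch of the algorithm in Figure 2.3 is given below; it is illustrative only. Here the data is centered before projection, which differs from step 5 only by a constant offset.

```python
import numpy as np

def pca_project(X, K):
    """PCA as in Figure 2.3: X is D x N (features x examples), K <= D.
    Returns the K x N projection and the K retained eigenvalues."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Step 2: covariance matrix (D x D).
    Sigma = (Xc @ Xc.T) / N
    # Step 3: eigenvalues and eigenvectors of the symmetric covariance.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # Step 4: sort in descending order of the eigenvalues.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: project onto the K leading directions.
    U = eigvecs[:, :K]            # D x K
    return U.T @ Xc, eigvals[:K]
```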

2.1.3 Non-negative Matrix Factorization

Some medical data, such as images or texts, contain only positive values. This information can be utilized by adding a positivity constraint in the optimization process for finding the projection vectors. One possible choice is Non-Negative Matrix Factorization (NMF), an approach proposed by Lee & Seung in [55]. The technique is closely related to the Probabilistic Latent Semantic Analysis (PLSA) proposed by Hofmann in [42] and to the aggregated Markov model of Saul and Pereira [76]. NMF is based on the decomposition of the data matrix X into two matrix factors W and H so that X_{D×N} ≈ W_{D×K} H_{K×N}. Both W and H are constrained to be positive, i.e. W ≥ 0 and H ≥ 0. The proposed optimization is done in an iterative way, minimizing one of the two suggested objective functions: the Euclidean distance [34]

\[
\|X - WH\|_2^2 = \sum_{d}\sum_{n}\Bigl(x_{dn} - \sum_{k} w_{dk}h_{kn}\Bigr)^2 \tag{2.1}
\]

or the Kullback-Leibler (KL) divergence [69]

\[
D(X\,\|\,WH) = \sum_{d}\sum_{n}\Bigl(x_{dn}\log\frac{x_{dn}}{\sum_{k} w_{dk}h_{kn}} - x_{dn} + \sum_{k} w_{dk}h_{kn}\Bigr) \tag{2.2}
\]

between X and WH. The update rules in the case of the Euclidean distance are shown in matrix notation in equation 2.3:

\[
H^{\mathrm{new}} = H^{\mathrm{old}} \odot \bigl(W^{T}X\bigr)./\bigl(W^{T}WH\bigr), \qquad
W^{\mathrm{new}} = W^{\mathrm{old}} \odot \bigl(XH^{T}\bigr)./\bigl(WHH^{T}\bigr), \tag{2.3}
\]

where ⊙ denotes element-wise multiplication and ./ element-wise division.


Regarding the KL divergence, the update rules take the form:

\[
H^{\mathrm{new}} = H^{\mathrm{old}} \odot \bigl(W^{T}(X./WH)\bigr)./\bigl(W^{T}\mathbf{1}_{D\times N}\bigr), \qquad
W^{\mathrm{new}} = W^{\mathrm{old}} \odot \bigl((X./WH)H^{T}\bigr)./\bigl(\mathbf{1}_{D\times N}H^{T}\bigr). \tag{2.4}
\]

The algorithm is proved to converge; see [55] for further details. The modified NMF update rules are used in the segmentation by the aggregated Markov model, described in detail in Chapter 6.
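For illustration, a minimal NumPy sketch of the multiplicative updates of equation 2.3 follows. The small constant added to the denominators is a common numerical safeguard and is not part of the formulation above.

```python
import numpy as np

def nmf_euclidean(X, K, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative NMF updates minimizing ||X - WH||^2 (Eq. 2.3).
    X must be non-negative with shape D x N; returns W (D x K), H (K x N)."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((D, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # Element-wise multiplicative updates (Lee & Seung [55]).
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```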

2.1.4 Evaluation of dimensionality reduction methods

Several methods have been described in this chapter. A visual comparison of the techniques is usually useful for understanding the differences between them. Here, the comparison of these techniques is presented for artificially created toy data. Three data sets are generated, namely:

3 Gaussians - A linear structure in the form of three ideally separated Gaussian clusters with elliptical covariance structures, see figure 2.4. 1200 data points are generated, 400 points for each cluster. The signal space has 2 dimensions, but the noisy data exists in 20 dimensions. (A small generation sketch is given after this list.)

Rings - A non-linear manifold structure in the form of three uniformly distributed circles centered at the origin of the coordinate system. The problem is separable, but a non-linear decision boundary is needed (figure 2.5). 1200 data points are generated, 400 points for each cluster. The signal space has 2 dimensions, but the data with the noise exists in 20 dimensions.

Text - Discrete data artificially created to resemble discrete text in the form of a preprocessed and normalized term-document matrix (see Section 7.2.1 for an explanation of the text processing method). The data is modeled in the latent space found in the LSI framework (Latent Semantic Indexing) [20]. The data is multi-dimensional (60 features) and forms two perfectly separable clusters. Features are selected so that a part is shared and some of the features are unique to each of the clusters, see figure 2.6. Data vectors are normalized to unit L2 norm length.
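As a rough illustration of how such a toy set can be produced, the sketch below generates three elliptical Gaussian clusters in a 2-dimensional signal space padded with noise dimensions. The particular means, covariance and noise level are hypothetical choices, since the text does not specify them.

```python
import numpy as np

def make_three_gaussians(n_per_cluster=400, d_total=20, seed=0):
    """Toy data resembling the '3 Gaussians' set: three separated elliptical
    Gaussian clusters in a 2-dimensional signal space, embedded in 20
    dimensions; returns a D x N matrix and the cluster labels."""
    rng = np.random.default_rng(seed)
    # Hypothetical cluster means and a shared elliptical covariance.
    means = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.5]])
    cov = np.diag([1.0, 0.25])
    signal = np.vstack([rng.multivariate_normal(m, cov, n_per_cluster)
                        for m in means])
    # Pad the 2 signal dimensions with low-variance noise dimensions.
    noise = 0.3 * rng.standard_normal((signal.shape[0], d_total - 2))
    X = np.hstack([signal, noise]).T       # D x N, as used in the text
    labels = np.repeat(np.arange(len(means)), n_per_cluster)
    return X, labels
```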

Figure 2.7 shows the results of the projection of the 3 Gaussian clusters.


Figure 2.4 Scatter plot of the 3 Gaussian clusters. The first 2 dimensions (out of 20) are shown. The clusters are linearly separable.

Figure 2.5 Scatter plot of the 3 rings. The first 2 dimensions are presented. The clusters can be separated by a non-linear decision surface.

Figure 2.6 Feature plot of the 2 text clusters. On the y-axis are the 1000 examples (500 in each class) and on the x-axis are the 60 features.

Figure 2.7 Results of the projection for the 3 Gaussian clusters (shown in figure 2.4). Three techniques for creating the projection vectors are compared: Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF) and Random Projection (RP). In all cases the separation is preserved.

The problem is simple and linearly separable. Three different techniques for creating the projection vectors are compared here: Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF) and Random Projection (RP). In this case, all the presented methods preserve the linear separability in the data.

Figure 2.8 shows the results of the projection of the rings structure of 3 clusters. As mentioned previously, PCA, NMF and RP are compared. Only Principal Component Analysis preserves the distances amongst the data points well, while the other methods lose the separability.


Figure 2.8 Results of the projection of the structure of 3 clusters in the shape of rings (shown in figure 2.5). Three techniques for creating the projection vectors are compared: Principal Component Analysis, Non-Negative Matrix Factorization and Random Projection. The original data is separable, but a non-linear separation function is needed. PCA does not change the data shape since the data variance in all signal directions is identical. NMF returns a highly skewed result, difficult to separate. A similar result is achieved with RP; however, some of the other trials returned both better and worse results with respect to the possibility of separation.

RP preserves the separation in some of the trials, while in others the separation is lost. This happens due to the random origin of the projection matrix.

In figure 2.9, the results are presented of the projection of the toy example: the vector space representation of two text clusters.

Figure 2.9 Results of the projection of the 2 cluster artificial text example (shown in figure 2.6). Three techniques for creating the projection vectors are compared: Principal Component Analysis, Non-Negative Matrix Factorization and Random Projection. In this case RP is unable to separate the data, while the other methods perform reasonably well.

The data is linearly separable, and this characteristic is preserved by all of the presented methods except Random Projection, which completely blends the two classes.


In this case Random Projection confirms its inability to extract the separating dimensions when processing high-dimensional sparse data.

Of the three presented methods, only Principal Component Analysis has shown robustness on all the applied data sets. In conclusion, PCA is selected for dimensionality reduction in later experiments, being both the simplest method and the best performer of the researched techniques.

2.2 High dimensional data

The methods described earlier can operate on medium sized databases. However, when the dimensionality of the data is much larger, other techniques have to be implemented. In this thesis, only a short review of the projection techniques for huge dimensional databases is presented, since no such databases were available for the experiments presented in Chapter 7.

The simplest, most intuitive way to determine the low-dimensional projections is to reduce the number of points in the data set. Random sampling without replacement creates a subset expected to be a good representation of the data distribution. PCA or any other classical method can then be computed from this subset [64, 88]. This technique obviously introduces an error, so the projection vectors will not be optimal. However, one can hope that, given a sufficient number of data points, the error introduced by this reduction will be negligible.
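A minimal sketch of this subsample-then-project strategy, assuming PCA is the classical method applied to the subset, might look as follows; the function and parameter names are illustrative.

```python
import numpy as np

def pca_on_subsample(X, K, n_sub, seed=None):
    """Estimate K principal directions from a random subsample (drawn
    without replacement from the N columns of the D x N matrix X) and
    apply them to the full data set. Assumes n_sub >= K."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(N, size=min(n_sub, N), replace=False)
    Xs = X[:, idx]
    mu = Xs.mean(axis=1, keepdims=True)
    # Thin SVD of the centered subsample; the left singular vectors
    # estimate the principal directions.
    U, _, _ = np.linalg.svd(Xs - mu, full_matrices=False)
    return U[:, :K].T @ (X - mu)           # K x N projection of all data
```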

When the class structure is available a priori, some sort of simple clustering technique can be performed on the high dimensional data (e.g., K-nearest neighbor [36]). Another approach is local PCA [29, 46]. With this technique, only the eigenvectors that correspond to the largest eigenvalues are extracted for each cluster. This set of projections defines a new space for the data.

Another way to reduce the complexity is to use recursive or iterative methods. In the case of dimensionality reduction, that means computing the most important components incrementally [22, 41, 65, 66].


It is also possible to create a hybrid structure by performing a two step projection, as is suggested in [19]. Random Projection could be used to find the high dimensional basis vectors; the goal is to project the data to a lower dimensional space without causing a significant change in the distances. For the second step, classic methods like Principal Component Analysis, Factor Analysis [40, 44, 45] or Independent Component Analysis [49] can be used so that the important directions are extracted.

2.3 Selection of number of principal components

All the projection techniques presented in this chapter estimate a set of projection directions, but they also leave an open question as to how the optimum dimension of the space should be chosen. It is extremely important to notice that too small a dimension may result in serious information loss, whilst one that is too large will often introduce noise into the projected data. In both cases the consequences are significant, in the form of large errors in information extraction, clustering, density estimation, prediction etc.

Thus, several methods have been developed for selecting the optimum dimension. An overview is presented below of some of the techniques for selecting the number of principal components in the case of PCA (described in section 2.1.2).

The simplest technique to find the effective dimension K is to calculate the eigenvalue curve and base the decision on its shape. If the effective dimensionality of the data is smaller than the actual data size D, the eigenvalue curve will have the shape of the letter "L". Various, mainly ad hoc, rules have been proposed for estimating the turning point of that curve (the turning point that determines the optimal K is usually ambiguous, since the L-curve often has a smooth characteristic), which will minimize the projection error [10, 45]. One way is to use the cumulative percentage of the total variation; here, it is desired that the selected components contribute a significant share, say 80%-90%, of the total variation. A similar way is to select a threshold value below which the components with smaller variance/eigenvalues are discarded, so that only the largest/dominant components contribute to the final result. In this approach, typically the eigenvalues are rejected that are smaller than 75%-90% of the maximum.
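The cumulative-percentage rule can be sketched in a few lines; the 90% default below is only one example from the 80%-90% range mentioned above.

```python
import numpy as np

def choose_k_by_variance(eigvals, fraction=0.90):
    """Smallest K whose leading eigenvalues explain at least `fraction`
    of the total variation (cumulative percentage rule)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)
```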


Another approach is proposed by Girolami in [32]. It suggests using both the ordering information included in the eigenvalues and the variation of the eigenvectors. Thus, instead of looking at the eigenvalue curve, we now investigate the transformed curve of the form

\[
\lambda_k \,\frac{1}{N}\Bigl(\sum_{j} u_{kj}\Bigr)^{2}.
\]

Both the thresholding and the turning point search can be applied. This transformation usually makes the slope of the L-curve steeper, which facilitates the decision.

Cattell in [15] suggests finding the largest eigenvalue beyond the point where the curve becomes a straight line as close to horizontal as possible. A similar technique can be found in [18, 45], but there the results are based on the log-eigenvalue curve. Craddock and Flood in [18] claim that the eigenvalues corresponding to the noise should decay in geometric progression, which will produce a straight horizontal curve in the logarithmic space.

The success of the methods mentioned above is subjective and dependent on the choice of cut-off level. Nevertheless, they are often used due to their simplicity and fairly reliable outcomes.

Additionally, objective rules have been developed for choosing the number of principal components [37, 45]. Usually in such cases the decision is based on the cross-validation error.

For example, a cross-validated choice is proposed in [24, 45, 90]. For an increasing number of retained components, the least squares error is estimated between the original data and the leave-one-out estimate of the data in the reduced space. Thus, the so-called Prediction Sum of Squares is calculated from the following equation:

\[
\mathrm{PRESS}(K) = \sum_{d=1}^{D}\sum_{n=1}^{N}\bigl(x_{dn} - \hat{x}_{dn}\bigr)^{2}, \tag{2.5}
\]

where \(\hat{x}_{dn} = \sum_{k=1}^{K} u_{dk}\,s_k\,v_{nk}\) is computed from the data set diminished by the x_{dn} sample.

The next approach is presented in line with Hansen [37] and Minka [61]. The D-dimensional observation vectors x are assumed here to be a composition of the signal and white noise: x = s + v. If we further assume that both the signal and the noise are normally distributed, then the observed N-point data sample x will also have normal characteristics of the form N(µ_s, Σ_s + Σ_v). Notice that the noise has zero mean: µ_v = 0.


Thus, the density function of the data can be expressed by:

\[
p(x\,|\,\mu_s,\Sigma_s,\Sigma_v) = \frac{1}{\sqrt{|2\pi(\Sigma_s+\Sigma_v)|}}
\exp\Bigl(-\tfrac{1}{2}(x-\mu_s)^{T}(\Sigma_s+\Sigma_v)^{-1}(x-\mu_s)\Bigr). \tag{2.6}
\]

Σ_s is further assumed to be singular, i.e. of rank K ≤ D, and Σ_v = σ²I_D, where I_D is the D × D identity matrix. Then Σ = Σ_s + σ²I_D, and it can be estimated from the data set:

\[
\mu_s = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\Sigma = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_s)(x_n-\mu_s)^{T}. \tag{2.7}
\]

Further, Σ can be decomposed by singular value decomposition, Σ = SΛS^T, where S = {s_d, d = 1...D} collects the eigenvectors and Λ is a diagonal matrix with the eigenvalues λ_k ordered in descending order. Since the variation of the noise is assumed to be significantly lower than that of the signal, the first eigenvalues will represent the signal components while the last ones are responsible for the noise. Thus, by truncation in the eigenvalue space, the noise level can be estimated:

\[
\Sigma_K = S \cdot \mathrm{diag}([\lambda_1, \lambda_2, \cdots, \lambda_K, 0, \cdots, 0]) \cdot S^{T}, \tag{2.8}
\]
\[
\sigma^{2} = \frac{1}{D-K}\,\mathrm{Trace}(\Sigma - \Sigma_K), \tag{2.9}
\]
\[
\Sigma_v = \sigma^{2} I_D, \tag{2.10}
\]

and then the signal covariance is described by:

\[
\Sigma_s = S \cdot \mathrm{diag}([\lambda_1-\sigma^{2}, \lambda_2-\sigma^{2}, \cdots, \lambda_K-\sigma^{2}, 0, \cdots, 0]) \cdot S^{T}. \tag{2.11}
\]

If, in addition to the sample x, N_z points z_n, n = 1, 2, ..., N_z, are observed, forming a test set, then the generalization error (see footnote 7) can be estimated in a cross-validation scheme as the negative log-likelihood over this test set (see footnote 8):

\[
\begin{aligned}
G_z &= -\frac{1}{N_z}\sum_{i=1}^{N_z}\log p(z_i\,|\,\mu_s,\Sigma_s,\Sigma_v) \\
&= -\frac{1}{N_z}\sum_{i=1}^{N_z}\log\Biggl(\frac{\exp\bigl(-\tfrac{1}{2}(z_i-\mu_s)^{T}(\Sigma_s+\Sigma_v)^{-1}(z_i-\mu_s)\bigr)}{\sqrt{|2\pi(\Sigma_s+\Sigma_v)|}}\Biggr) \\
&= \frac{1}{2}\log|2\pi(\Sigma_s+\Sigma_v)| + \frac{1}{2}\mathrm{Trace}\bigl[(\Sigma_s+\Sigma_v)^{-1}\Sigma_z\bigr], \tag{2.12}
\end{aligned}
\]

where Σ_s and Σ_v are the estimated covariance structures of the signal and the noise, respectively, for the K largest eigenvalues, and the covariance of the test set is Σ_z = (1/N_z) Σ_{n=1}^{N_z} (z_n − µ_s)(z_n − µ_s)^T.

Footnote 7: The generalization error is defined as the expected loss on a future independent sample.
Footnote 8: It is also possible to use, for example, the leave-one-out estimate or a predicted generalization error based on the sum of the training error and a penalty term; the penalty is typically dependent on the model complexity, e.g. the AIC or BIC criterion, see [1, 69, 79].

By minimizing the approximated generalization error, the optimum signal space is found.
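A sketch of this procedure, following equations 2.7-2.12 with the noise variance estimated from the discarded eigenvalues, might look as follows; it is an illustration under the stated assumptions, not the implementation used in the thesis.

```python
import numpy as np

def pca_generalization_error(X_train, X_test, K):
    """Approximate generalization error (Eq. 2.12) of a K-dimensional
    signal subspace; data matrices are D x N (features x examples)."""
    D, N = X_train.shape
    mu = X_train.mean(axis=1, keepdims=True)
    Sigma = (X_train - mu) @ (X_train - mu).T / N              # Eq. 2.7
    lam, S = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, S = lam[order], S[:, order]
    sigma2 = lam[K:].mean() if K < D else 0.0                  # Eq. 2.9
    lam_s = np.concatenate([np.maximum(lam[:K] - sigma2, 0.0),  # Eq. 2.11,
                            np.zeros(D - K)])                   # clipped at 0
    C = S @ np.diag(lam_s) @ S.T + sigma2 * np.eye(D)          # Sigma_s + Sigma_v
    Z = X_test - mu
    Sigma_z = Z @ Z.T / X_test.shape[1]
    _, logdet = np.linalg.slogdet(2.0 * np.pi * C)
    return 0.5 * logdet + 0.5 * np.trace(np.linalg.solve(C, Sigma_z))  # Eq. 2.12
```

In practice one would evaluate this error for K = 1, ..., D − 1 on a held-out test set and keep the K giving the smallest value, as in the example of figure 2.10 below.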

Some of the presented approaches are illustrated with an example. For this purpose a Gaussian cluster is generated, where the effective dimension is equal to 3, but the signal, due to the added noise, exists in a 10-dimensional space. In order to determine the effective dimension of the data, the approximated generalization error is calculated according to equation 2.12. The results are presented in figure 2.10 (lower right plot). The optimum dimension that gives a minimum generalization error is equal to 3. Additionally, the Prediction Sum of Squares (equation 2.5) is computed. Here, the curve flattens at the third component, suggesting that dimension for the signal space. For comparison, the eigenvalue curves are also shown. Both the eigenvalues of the data and the scaled version proposed in [32] indicate 3 components.

2.4 Discussion

Choosing the most suitable procedure for the projection should take into consideration the performance of the particular method as well as the dimensionality of the data and the application. For example, on-line learning can require a fast dimensionality reduction method, whilst batch learning can afford a slower but more accurate technique.


Page 39: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

2.4 Discussion 23

1 2 3 4 5 6 7 8 9 100

50

100

150

200

250

300

350

λi

1 3 2 5 7 8 6 4 10 90

0.5

1

1.5

2

2.5

3

3.5

λi (Σ

j=1N u

j)2/N

1 2 3 4 5 6 7 8 9 100

20

40

60

80

Number of principal components used

PR

ES

S /

NM

1 2 3 4 5 6 7 8 9

101

102

103

Number of principal components used

Gen

eral

izat

ion

erro

r

Figure 2.10 Illustration selecting the number of components. For the exam-ple, one Gaussian cluster is generated where the signal space had 3 compo-nents but the noisy data existed in 10 dimensions. Upper left plot present theeigenvalues λk. Here, the first 3 eigenvalues are dominant. A similar resultis obtained for a scaled eigenvalue curve [32] (upper right plot). Lower panelpresents the Predicted Sum of Squares (Eq. 2.5)) and the generalization er-ror is calculated based on Eq. 2.12. In this simple case all methods find thecorrect space of the signal.

Choosing the most suitable procedure for the projection should take into consideration the performance of the particular method as well as the dimensionality of the data and the application. For example, on-line learning can require a fast dimensionality reduction method, whilst batch learning can afford a slower but more accurate technique. Principal Component Analysis is both simple and well suited to the applications presented in this work, where mostly medium-sized databases are available and batch learning is performed. Thus, in further investigations, PCA is used.


C H A P T E R 3

Generalizable Gaussian Mixture models

Following the steps of the KDD process presented in figure 1.1, the transformation task is performed first. Therefore, the preceding Chapter 2 presented dimensionality reduction and feature selection techniques. On such transformed data the essential part of the KDD process may be applied, i.e. data mining. In the next Chapter 4 the methods for interpreting the data mining results are introduced. The results of the full process applied to realistic data are presented in the final Chapter 7.

The general task of data mining is to discover relationships among the data samples. Since this work focuses mostly on segmentation, one way to investigate these connections is to determine the underlying distribution of the data. Mixture models supply such a cluster structure through the estimated multi-modal density.

A mixture model is a particular, very flexible form of the density function. It can be written as a linear combination of the component densities p(x|k) as:

\[
p(x) = \sum_{k=1}^{K} p(x|k)\,P(k). \qquad (3.1)
\]


The coefficients P(k) are called mixing parameters or mixing proportions and they are, in fact, the prior probabilities of the data being generated by component k. P(k) satisfies the following conditions: Σ_{k=1}^{K} P(k) = 1 and 0 ≤ P(k) ≤ 1.

Component densities are, naturally, normalized so that they satisfy the conditions of a probability density function, i.e. ∫ p(x|k) dx = 1. By the Bayes optimal decision rule (assuming a 0/1 loss function [69]) a new point z is assigned to cluster k if

\[
k = \arg\max_k\, p(k|z) = \arg\max_k\, \frac{p(z|k)P(k)}{\sum_k p(z|k)P(k)}, \qquad (3.2)
\]

where Σ_{k=1}^{K} p(k|z) = 1.

Mixture models are widely used, mostly due to their flexibility in modeling unknown distribution shapes. In the data mining area, they are useful for density estimation and, in consequence, for cluster analysis, i.e. the investigation of the group structure in the data (which is the main thread of this work). They can also be used in many other applications, for example neural networks, soft weight sharing or mixtures of experts, descriptions of which can be found in [10].

Provided with a sufficient number of components and correctly selected parameters, the accuracy of mixture models is very high. The estimate is accurate even when the data was generated from a distribution different from the chosen component densities; in such a case, typically, only a larger number of components is needed in the approximation. As the component densities p(x|k) all valid density functions can be used. However, the exponential density family [70], and especially the Gaussian p.d.f., is a good choice due to its simple analytical expressions.
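As a small illustration of equations 3.1 and 3.2 (a sketch only, with arbitrary, hand-set parameter values rather than fitted ones), the mixture density and the Bayes assignment rule for Gaussian components can be written directly:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density."""
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

# arbitrary 2-component, 2-dimensional mixture
P = np.array([0.6, 0.4])                      # mixing proportions P(k)
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def mixture_density(x):
    # p(x) = sum_k p(x|k) P(k)   (Eq. 3.1)
    return sum(P[k] * gauss_pdf(x, mus[k], Sigmas[k]) for k in range(len(P)))

def assign(z):
    # Bayes rule (Eq. 3.2): argmax_k p(z|k)P(k); the denominator p(z) is common to all k
    scores = [P[k] * gauss_pdf(z, mus[k], Sigmas[k]) for k in range(len(P))]
    return int(np.argmax(scores))

z = np.array([2.5, 2.0])
print(mixture_density(z), assign(z))
```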

The learning method for the mixture model is based on maximum likelihood (section 3.1), which leads to the Expectation-Maximization (EM) algorithm (section 3.2).

3.1 Maximum likelihood

When modeling the data distribution with a parametric form of the density function (e.g., a mixture model) the set of optimum parameters has to be determined. The maximum likelihood (ML) estimation is a solid tool for learning those parameters. In this technique, the optimum values of the parameters are found by maximizing the likelihood derived from the training data set [10, 69].

In the parametric form of the density function p(x|θ) the density depends on the set of parameters θ. The training data set consists of N samples D = {x_n, n = 1, 2, ..., N}. The data likelihood L, assuming i.i.d. data points, can be written in the following way:

\[
L(\theta) = p(\mathcal{D}|\theta) = \prod_{n=1}^{N} p(x_n|\theta). \qquad (3.3)
\]

Since we are looking for the set of parameters θ that maximizes the likelihood L(θ), the solution can be found from the differential equations of the form ∂L(θ)/∂θ = 0. Then the optimum θ* is found that maximizes the likelihood L,

\[
\theta^{*} = \arg\max_{\theta}\, L(\theta). \qquad (3.4)
\]

It is often advantageous (for computational reasons) to use the log-likelihood instead of the regular likelihood (equation 3.3) or, equivalently, to minimize the negative log-likelihood.
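For a single Gaussian the maximization can be done in closed form; the short sketch below (an illustration only, with arbitrary toy data) checks numerically that the sample mean and covariance minimize the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=1000)

def neg_log_likelihood(X, mu, Sigma):
    """-log L(theta) for an i.i.d. Gaussian sample (cf. Eq. 3.3)."""
    diff = X - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return 0.5 * (len(X) * logdet + quad.sum())

# closed-form ML estimates: sample mean and sample covariance
mu_ml = X.mean(axis=0)
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)

print(neg_log_likelihood(X, mu_ml, Sigma_ml))
print(neg_log_likelihood(X, mu_ml + 0.5, Sigma_ml))   # perturbing the parameters increases the cost
```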

3.2 Expectation-Maximization algorithm

Even though the maximum likelihood approach supplies the solution to the optimization problem, it does not provide a method for calculating the parameters. The sets of differential equations are highly nonlinear and therefore difficult to solve. However, the ML approach gives the opportunity to derive an iterative algorithm that finds an approximation to the correct solution and at the same time makes the optimization process much simpler. In order to find the maximum likelihood estimate of parametric models the Expectation-Maximization (EM) algorithm can be applied, which was introduced in [21]. An alternative technique, not addressed here, for determining the parameters of the probability density function is the Variational Bayes approach [2, 5, 16].

The EM term was first introduced and formalized by Dempster, Laird and Rubin in 1977 [21]. It is an iterative scheme which involves non-linear function optimization and re-estimation of the parameters, which leads to the maximization of the data likelihood.

Let us write the log-likelihood as a function of some visible variables v and hidden variables h:

\[
L(\theta) = \log p(v|\theta) = \log \int p(h, v|\theta)\,dh. \qquad (3.5)
\]

Now, by introducing an additional function – the distribution over the hidden states q(h) – and by using Jensen's inequality (appendix A), the likelihood given in equation 3.5 can be rewritten in the following way:

\[
L(\theta) = \log \int q(h)\,\frac{p(h,v|\theta)}{q(h)}\,dh \;\ge\; \int q(h)\log\frac{p(h,v|\theta)}{q(h)}\,dh = \mathcal{F}(q(h),\theta), \qquad (3.6)
\]

where

\[
\mathcal{F}(q(h),\theta) = \int q(h)\log p(h,v|\theta)\,dh - \int q(h)\log q(h)\,dh. \qquad (3.7)
\]

The last term, −∫ q(h) log q(h) dh, is in fact the entropy of q. The lower bound on the likelihood, F(q(h), θ), is introduced in equation 3.7. It can be proved [21] that the difference between the likelihood L and the new objective function F is the nonnegative Kullback-Leibler (KL) divergence between the distribution q(h) over the hidden variables and the posterior probability of the hidden states. Note also that maximizing F maximizes L, which is our objective.

Finally, the EM optimization algorithm can be introduced.

• E-step: In the E-step the distribution over the hidden variables is estimated given the data and the current parameters, i.e.

\[
q_k(h) = \arg\max_{q(h)}\, \mathcal{F}(q(h), \theta_{k-1}). \qquad (3.8)
\]

• M-step: The M-step is responsible for modifying the parameters in order to maximize the joint distribution over the data and hidden variables, i.e.

\[
\theta_k = \arg\max_{\theta}\, \mathcal{F}(q_k(h), \theta). \qquad (3.9)
\]

Figure 3.1 Expectation-Maximization algorithm.


When the exact EM steps are applied1 the algorithm maximizes the data likelihood at each step. Even though, usually, only approximate EM steps can be applied, the algorithm converges and still maximizes the data likelihood. The convergence proof is provided by Dempster et al. [21], and it can also be found in [10, 92].

1The exact form of the hidden states distribution is known and the parameters maximize the joint distribution.

3.3 Unsupervised Generalizable Gaussian Mixture model

The Gaussian probability density function is the most frequent choice for the density components. It makes it possible to derive analytical expressions for the parameter updates in the EM algorithm. It also gives a flexible and accurate estimate of the true density, which can be used in discovering the hidden cluster structure in the data. The Gaussian mixture model is applied, for example, in [39, 52, 53, 80].

Let us define the density of the D-dimensional data vector x under the model assumptions by the following equations:

\[
p(x|\theta) = \sum_{k=1}^{K} p(x|\theta_k)\,P(k), \qquad (3.10)
\]

where

\[
p(x|\theta_k) = (2\pi)^{-\frac{D}{2}}\,|\Sigma_k|^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)\right). \qquad (3.11)
\]

The Gaussian components p(x|θ_k) in equation 3.10 are mixed with the proportions P(k), which satisfy the earlier mentioned conditions. θ_k represents the collection of parameters for component k, so that θ_k ≡ {µ_k, Σ_k, P(k)}, where K is the total number of components. The parameters are estimated from the set of observations D = {x_n, n = 1, 2, ..., N} by the Expectation-Maximization algorithm described in section 3.2. The objective of the iterative EM is to minimize the cost function – the negative log-likelihood – of the form:

\[
\begin{aligned}
L(\theta|\mathcal{D}) &= -\log p(\mathcal{D}|\theta) = -\log \prod_{n} p(x_n|\theta) \\
&= -\sum_{n=1}^{N} \log p(x_n|\theta) = -\sum_{n=1}^{N} \log\!\left(\sum_{k=1}^{K} p(x_n|\theta_k)\,P(k)\right). \qquad (3.12)
\end{aligned}
\]

In the E-step the cluster posterior P(k|x_n) is estimated for the fixed set of Gaussian parameters θ_k. The optimum parameters that minimize the cost function are found in the M-step.

Naturally, the cost function is monotonically decreasing, which is ensured by EM; therefore, in order to find the optimum model complexity, a generalization error must be determined. By the complexity of the model the model order is understood, i.e. in the case of the discussed GGM model the number of clusters or, what is strictly connected, the number of parameters in the mixture. The model has optimal complexity if its generalization error is minimal. By definition the generalization error is a measure of how well a model can respond to new data on which it has not been trained but which is related in some way to the training patterns. The ability to generalize is crucial for the decision making ability of the model. For example, the generalization error can simply be calculated from a validation set or approximated from the training set. The rule is then to add to the cost function (equation 3.12) a penalty value that is proportional to the total number of parameters2 (e.g. the Akaike Information Criterion [1] or the Bayesian Information Criterion [79]). The minimum point of this composition (the cost function plus the penalty for the number of parameters in the system) defines the optimum model complexity. Figure 3.2 illustrates this basic idea.

2The simple models that generalize well are preferred.

Figure 3.2 The illustration of the selection of model complexity. The optimum model is at the minimum of the estimated generalization error, which is determined as the sum of the cost function and the penalty value.

The EM algorithm for the Gaussian Mixture model can be implemented in two different ways. The so-called hard assignment algorithm is based on a 0/1 decision function, i.e. each point is uniquely assigned to the cluster with the maximum cluster posterior p(k|x), and the cluster parameters are then estimated from the subset of the data assigned to that cluster. The other method is to use the exact cluster posterior probabilities in the computations (soft assignment). Such a method certainly offers a better density estimation. This technique, however, often returns a high number of components, with some clusters having no members assigned from the actual data. Therefore, for clustering purposes, the hard assignment algorithm is often more useful.

Figure 3.3 presents the algorithm for the unsupervised Generalizable Gaussian Mixture model with soft assignment. The parameters are initialized with the data variance, and randomly selected data points are used as cluster centers.


Calculate the mean and the covariance of the data:
µ_0 = N^{-1} Σ_n x_n,  Σ_0 = N^{-1} Σ_n (x_n − µ_0)(x_n − µ_0)^T

Initialization
1. Initialize µ_k as a random point drawn from the data set.
2. Initialize Σ_k = Σ_0.
3. Initialize P(k) = 1/K.

Optimization
• E-step:
  1. Calculate the likelihood p(x|θ).
  2. Calculate the cost function L(θ).
  3. Compute the posterior p(k|x_n) = p(x_n|k)P(k) / p(x_n).
  Split the data set D into two parts: D_µ (for mean estimation) and D_Σ (for covariance estimation).
• M-step:
  1. Estimate µ_k = Σ_{n∈D_µ} p(k|x_n)·x_n / Σ_{n∈D_µ} p(k|x_n)
  2. Estimate Σ_k = Σ_{n∈D_Σ} p(k|x_n)(x_n − µ_k)(x_n − µ_k)^T / Σ_{n∈D_Σ} p(k|x_n)
  3. Estimate P(k) = N^{-1} Σ_n p(k|x_n)

Figure 3.3 The unsupervised Gaussian Mixture algorithm.
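A compact sketch of the procedure in figure 3.3 is given below. It is an illustration under simplifying assumptions (soft assignment, a fixed random half/half split for the mean and covariance updates, a tiny diagonal regularization, and a simple convergence test); it is not the original implementation, and the toy data and seeds are arbitrary.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Row-wise Gaussian density p(x_n | mu, Sigma)."""
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def uggm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # initialization: random data points as means, data covariance, uniform proportions
    Sigma0 = np.cov(X.T, bias=True)
    mus = X[rng.choice(N, K, replace=False)].copy()
    Sigmas = np.array([Sigma0.copy() for _ in range(K)])
    Pk = np.full(K, 1.0 / K)
    # disjoint halves: one for the mean update, one for the covariance update
    idx = rng.permutation(N)
    Imu, Isig = idx[: N // 2], idx[N // 2:]

    prev = None
    for _ in range(n_iter):
        # E-step: component likelihoods and posteriors p(k|x_n)
        lik = np.column_stack([gauss_pdf(X, mus[k], Sigmas[k]) for k in range(K)])
        px = lik @ Pk
        post = lik * Pk / px[:, None]
        # M-step: means from one half, covariances from the other, proportions from all points
        for k in range(K):
            w_mu, w_sig = post[Imu, k], post[Isig, k]
            mus[k] = (w_mu[:, None] * X[Imu]).sum(axis=0) / w_mu.sum()
            diff = X[Isig] - mus[k]
            Sigmas[k] = (w_sig[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / w_sig.sum()
            Sigmas[k] += 1e-6 * np.eye(D)          # small regularization for numerical stability
        Pk = post.mean(axis=0)
        # early stopping when the hard cluster assignments no longer change
        labels = post.argmax(axis=1)
        if prev is not None and np.array_equal(labels, prev):
            break
        prev = labels
    return mus, Sigmas, Pk, post

# toy data: two Gaussian blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
mus, Sigmas, Pk, post = uggm(X, K=2)
print(Pk, mus)
```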

To obtain the hard assignment algorithm, the cluster parameters µ_k, Σ_k and P(k) are estimated in the M-step based only on the points assigned to the particular cluster, i.e. the cluster posterior p(k|x) is quantized to a binary (0/1) representation.

It is well known that the classic EM algorithm for the Gaussian mixture model overfits easily, i.e. the optimum parameters maximizing the likelihood tend towards Gaussian densities of zero covariance placed over each of the data points3. In order to avoid this artifact, generalization is introduced by dividing the data set into two parts and estimating the mean and the covariance from these disjoint sets. An alternative possibility is Variational Bayes [2] methods; however, they are not addressed in this work. The EM update steps are performed until the algorithm converges, which in practice means that an early stopping criterion is used: the algorithm has converged when no changes in the cluster assignments are observed.

3It can easily be shown that for µ_i = x_i and σ → 0 the negative log-likelihood L → −∞.

In order to achieve a good estimation of the parameters, the number of data samples must outnumber the number of parameters in the model. In the case of the unsupervised Gaussian Mixture model the mean, the covariance and the mixing proportions are estimated, which means that for each cluster there are D + D·(D+1)/2 + 1 parameters to estimate, where D is the data dimension. For example, for 2-dimensional data there are 2 + 2·(2+1)/2 + 1 = 6 parameters to estimate per cluster, while in the case of 10 dimensions this number grows to 66 parameters per cluster. As a result, if the data consists of, e.g., 10 clusters, only 60 points are needed for the estimation in the 2-dimensional case, but for 10-dimensional data no less than 660 points are required to ensure a correct estimate of the parameters. So, the number of data points needed in the estimation grows quadratically with the feature space dimensionality.
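The count can be checked with a one-line helper (purely illustrative):

```python
def params_per_cluster(D):
    # mean (D) + symmetric covariance (D*(D+1)/2) + mixing proportion (1)
    return D + D * (D + 1) // 2 + 1

print(params_per_cluster(2), params_per_cluster(10))            # 6 and 66
print(10 * params_per_cluster(2), 10 * params_per_cluster(10))  # 60 and 660 points for 10 clusters
```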

3.4 Supervised Generalizable Gaussian Mixture model

The supervised Gaussian Mixture model is used in the classification scheme when not only the data points are provided but also the corresponding labels, i.e. the data set consists of pairs D = {{x_n, y_n}, n = 1, 2, ..., N}, where x_n is a D-dimensional data sample and y_n is the corresponding class label, y_n ∈ {1, ..., C}, where C is the total number of observed classes.

The idea is to adapt a separate Gaussian mixture for each of the observed classes. Thus, the unsupervised mixture algorithm presented in figure 3.3 is used for each of the class-separated subsets of the data.

The density of the data point x belonging to the class c is then expressed by the joint probability:

\[
p(x, c|\theta) = \sum_{k=1}^{K} p(x|\theta_k)\,P(k|c)\,P(c). \qquad (3.13)
\]


Similarly to the unsupervised Gaussian mixture, p(x|θ_k) stands for the Gaussian density components (equation 3.10), P(c) is the probability of class c, with Σ_{c=1}^{C} P(c) = 1, and it is calculated from the data simply by counting the members of each class. P(k|c) gives the proportions of the clusters in class c.
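The classification step implied by equation 3.13 can be sketched as follows. This is an illustration only: the per-class mixture parameters below are hand-set toy values (in practice each would be fitted with the unsupervised algorithm of figure 3.3 on the corresponding class subset).

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def sggm_classify(x, class_models, Pc):
    """class_models[c] = (mus, Sigmas, Pk_given_c): one Gaussian mixture per class."""
    joint = []
    for c, (mus, Sigmas, Pk) in enumerate(class_models):
        # p(x, c) = sum_k p(x|theta_k) P(k|c) P(c)   (Eq. 3.13)
        joint.append(Pc[c] * sum(Pk[k] * gauss_pdf(x, mus[k], Sigmas[k]) for k in range(len(Pk))))
    return int(np.argmax(joint))

# hand-set toy parameters: class 0 has two clusters, class 1 has one
class_models = [
    ([np.zeros(2), np.array([0.0, 4.0])], [np.eye(2), np.eye(2)], np.array([0.5, 0.5])),
    ([np.array([5.0, 2.0])], [np.eye(2)], np.array([1.0])),
]
Pc = np.array([0.7, 0.3])
print(sggm_classify(np.array([4.5, 2.2]), class_models, Pc))   # assigned to class 1
```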

3.5 Unsupervised/supervised Generalizable Gaussian Mixture model

For classification purposes both unlabeled and labeled data may be useful. Castelli & Cover [17] prove that labeled samples make an exponential contribution to reducing the probability of error, since they carry important information about the decision boundaries. Thus, unlabeled data alone are insufficient for the classification task due to the lack of this information. In real world applications, however, labels are usually difficult to obtain. For example, in medical diagnoses or for the categories of web-pages it is often not only expensive but also time consuming, and in some applications unrealistic (e.g., labeling of the continuously growing world wide web). Therefore, an excellent idea is to use the available labeled examples together with collected unlabeled data, such as results of medical tests or web-pages. Thus, the unsupervised/supervised Generalizable Gaussian Mixture (USGGM) model is proposed.

In the USGGM the joint input/output density is modeled, as previously, as a Gaussian Mixture [53, 60, 63] in the following manner:

\[
p(y, x|\theta) = \sum_{k=1}^{K} P(y|\theta_k)\,p(x|\theta_k)\,P(k). \qquad (3.14)
\]

p(x|θ_k) are the Gaussian density components defined earlier by equation 3.10, which are mixed with the non-negative proportions P(k). The additional set of parameters in the case of this model, the class cluster posteriors P(y|k), are also non-negative with Σ_{y=1}^{C} P(y|k) = 1. θ collects the parameters of the model, which in the case of the USGGM model for the k-th component is described as θ_k ≡ {P(y|k), µ_k, Σ_k, P(k)}. It is assumed that the joint input/output density factorizes within each component, so p(y, x|k) = P(y|k)p(x|k), and a data point is assigned to the class y = argmax_y P(y|x), where

\[
p(y|x) = \frac{p(y,x)}{p(x)} = \sum_{k=1}^{K} P(y|k)\,p(\theta_k|x) = \sum_{k=1}^{K} \frac{P(y|k)\,p(x|\theta_k)\,P(k)}{p(x)}. \qquad (3.15)
\]

The entire data set consists of the unlabeled examples D_u = {x_n; n = 1 ... N_u} and the labeled examples D_l = {x_n, y_n; n = 1 ... N_l}. The objective is to estimate the joint density parameters θ based on the data set D = D_u ∪ D_l in a way that ensures generalizability. The cost function (the negative log-likelihood) for this model is given by:

\[
L = -\log p(\mathcal{D}|\theta) = -\sum_{n\in\mathcal{D}_l} \log \sum_{k=1}^{K} P(y_n|k)\,p(x_n|k)\,P(k) \;-\; \lambda \sum_{n\in\mathcal{D}_u} \log \sum_{k=1}^{K} p(x_n|k)\,P(k), \qquad (3.16)
\]

where 0 ≤ λ ≤ 1 is called a discount factor. The discount factor is introduced into the cost function in order to control the influence of the unlabeled samples, and it also affects the EM updates. Its value is estimated in a cross-validation scheme. For a small number of labeled examples, EM will converge almost to the unsupervised algorithm, where the labeled points affect only the initialization part and the identification of the components with the class labels. With a growing size of D_l the influence of the unlabeled data set starts to decrease, since the labeled data set provides enough information both for classification and for parameter estimation. Therefore, it is generally expected that λ is close to one with only a few labeled data points and decreases towards zero with an increasing number of labeled samples.

The unsupervised/supervised Gaussian Mixture algorithm is presented in figure 3.4. Similarly to the previously described algorithms, the parameters of the USGGM are optimized by the EM algorithm. To ensure generalization the means and covariances are estimated from disjoint data sets and P(y|k) and P(k) from the whole set. As previously, the cluster covariances Σ_k are initialized with the data covariance, the means µ_k are chosen randomly from the samples, the component proportions P(k) take uniform values and the class posterior probabilities P(y|k) form a table computed from the available labels.


Initialization
1. Choose values for K and 0 ≤ λ ≤ 1.
2. Let i be K different randomly selected indices from {1, 2, ..., N}, and set µ_k = x_{i_k}.
3. Let Σ_0 = N^{-1} Σ_{n∈D} (x_n − µ_0)(x_n − µ_0)^T, where µ_0 = N^{-1} Σ_{n∈D} x_n, and set ∀k: Σ_k = Σ_0.
4. Set ∀k: P(k) = 1/K.
5. Compute the class prior probabilities: P(y) = N_l^{-1} Σ_{n∈D_l} δ(y_n − y), where δ(z) = 1 if z = 0, and zero otherwise. Set ∀k: P(y|k) = P(y).
6. Select a split ratio 0 < γ < 1. Split the unlabeled data set into disjoint sets as D_u = D_{u,1} ∪ D_{u,2}, with |D_{u,1}| = [γN_u] and |D_{u,2}| = N_u − |D_{u,1}|. Do a similar splitting for the labeled data set, D_l = D_{l,1} ∪ D_{l,2}.

Repeat until convergence
1. Compute the posterior component probabilities:
   p(k|x_n) = p(x_n|k)P(k) / Σ_k p(x_n|k)P(k), for all n ∈ D_u,
   and, for all n ∈ D_l, p(k|y_n, x_n) = P(y_n|k)p(x_n|k)P(k) / Σ_k P(y_n|k)p(x_n|k)P(k).
2. For all k, update the means
   µ_k = [Σ_{n∈D_{l,1}} x_n P(k|y_n, x_n) + λ Σ_{n∈D_{u,1}} x_n P(k|x_n)] / [Σ_{n∈D_{l,1}} P(k|y_n, x_n) + λ Σ_{n∈D_{u,1}} P(k|x_n)]
3. For all k, update the covariance matrices
   Σ_k = [Σ_{n∈D_{l,2}} S_{kn} P(k|y_n, x_n) + λ Σ_{n∈D_{u,2}} S_{kn} P(k|x_n)] / [Σ_{n∈D_{l,2}} P(k|y_n, x_n) + λ Σ_{n∈D_{u,2}} P(k|x_n)],
   where S_{kn} = (x_n − µ_k)(x_n − µ_k)^T. Perform a regularization of Σ_k.
4. For all k, update the cluster priors
   P(k) = [Σ_{n∈D_l} P(k|y_n, x_n) + λ Σ_{n∈D_u} P(k|x_n)] / (N_l + λN_u)
5. For all k, update the class cluster posteriors
   P(y|k) = Σ_{n∈D_l} δ(y − y_n) P(k|y_n, x_n) / Σ_{n∈D_l} P(k|y_n, x_n)

Figure 3.4 The USGGM algorithm.
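The role of the discount factor in the update equations can be sketched as follows. This is a simplified illustration only: a single EM pass updating the means, cluster priors and P(y|k), with the covariance update and the D_1/D_2 splitting of figure 3.4 omitted for brevity, and with arbitrary toy data.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def usggm_step(Xl, yl, Xu, mus, Sigmas, Pk, Pyk, lam):
    """One lambda-weighted EM pass (means, cluster priors and class cluster posteriors only)."""
    K, C = len(Pk), Pyk.shape[0]
    lik_l = np.column_stack([gauss_pdf(Xl, mus[k], Sigmas[k]) for k in range(K)])
    lik_u = np.column_stack([gauss_pdf(Xu, mus[k], Sigmas[k]) for k in range(K)])
    # posteriors: p(k|y_n, x_n) for labeled data, p(k|x_n) for unlabeled data
    post_l = Pyk[yl, :] * lik_l * Pk
    post_l /= post_l.sum(axis=1, keepdims=True)
    post_u = lik_u * Pk
    post_u /= post_u.sum(axis=1, keepdims=True)
    # lambda-weighted M-step updates
    for k in range(K):
        num = (post_l[:, k, None] * Xl).sum(axis=0) + lam * (post_u[:, k, None] * Xu).sum(axis=0)
        den = post_l[:, k].sum() + lam * post_u[:, k].sum()
        mus[k] = num / den
    Pk = (post_l.sum(axis=0) + lam * post_u.sum(axis=0)) / (len(Xl) + lam * len(Xu))
    Pyk = np.array([[post_l[yl == y, k].sum() for k in range(K)] for y in range(C)])
    Pyk /= Pyk.sum(axis=0, keepdims=True)
    return mus, Pk, Pyk

rng = np.random.default_rng(0)
Xl = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
yl = np.array([0] * 20 + [1] * 20)
Xu = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
mus = [np.array([0.5, 0.5]), np.array([3.5, 3.5])]
Sigmas = np.array([np.eye(2), np.eye(2)])
mus, Pk, Pyk = usggm_step(Xl, yl, Xu, mus, Sigmas, np.full(2, 0.5), np.full((2, 2), 0.5), lam=0.8)
print(Pk, Pyk)
```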


3.6 Outlier & Novelty detection

Outlier detection can be a part of the data cleaning process (see figure 1.1) as well as of data mining. In the first case the outlier problem is addressed in order to enhance the data model. In the second case the outlier detection is itself considered a data mining task, where the outliers are the final outcomes of the processing. The following outlier detection techniques are based on the probabilistic outcome of the GGM model, and therefore they are introduced in this chapter. This work, however, addresses outlier detection only as a data cleaning problem performed prior to the data mining task.

Even though the methods presented later in this chapter are developed for outlier detection, they can also be used in novelty detection.

There are various origins of outliers. When an outlier sample is the result of an error in the system, it may greatly influence the learning and estimation process. Therefore, it is usually advised to exclude it from the data set before estimation, unless the applied model is designed to deal with outliers. On the other hand, when the outlier is in fact an unusual example, it may itself be a subject of the study. Thus, for example in medical systems, outlier detection techniques can be used to detect, e.g., incorrect diagnoses or abnormal changes in the performed tests, scans or other signals. A novel sample is closely related to an outlier. The difference is that novelty comes from a new data generation mechanism appearing in the observed data after the model was trained and the parameters were obtained. Novel points have low likelihood values in the current model.

Thus, the ability to spot outlier and novel samples is an important task in processing data and in cluster analysis. Many definitions of an outlier can be formulated. Here, an outlier is referred to as an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism [47]. Based on that definition it is assumed that the outlier observations are not produced by the model, or that the survey was possibly harmed by an error. Such an unlikely sample should have a low probability value in the model. In the case of the Gaussian mixture it means that the probability of the outlier data point is low for every cluster. Figure 3.5 illustrates this problem. The outlier point is placed far away from the rest of the data samples. Therefore, it is unlikely to have been generated by any of the presented densities. Such a situation may be an indication of abnormal behavior, an error, or a novel (previously unobserved) data point appearing in the data set. Any of those scenarios is important in data analysis.

Figure 3.5 The illustration of the outlier sample. The outlier observation point (green star) cannot be unambiguously assigned to any of the presented densities, since its probability under each of them is very low.

If the existence of outliers is not taken into consideration during the optimization process, the final result may be greatly affected. For example, figure 3.6 illustrates such an influence in the case of a simple linear model. Including the outlier point in the parameter estimation (blue dashed line) causes a substantial deviation from the correct result (red solid line). Removing this unlikely data point from the training set produces a significantly better estimate (green dashed line).

Similar techniques are used in novelty detection. While in the case of outliers the unlikely data points are removed from the data set to improve generalization, the novelty samples should be included and the model parameters re-estimated.

3.6.1 Cumulative distribution

A simple outlier detection technique is proposed in line with [39]. The method is based on an estimate of the input density p(x), which is available through, e.g., a mixture model.


Figure 3.6 The influence of one outlier sample on the estimation of the function parameters from the data. A simple linear function is used as an example. The linear function estimated from the data set with one outlier sample produces a strongly skewed, incorrect result.

The cumulative distribution Q(t) is defined as the probability of the data that has a likelihood below some threshold t, calculated for all thresholds, i.e. Q(t) = P(x ∈ R), where R = {x_n : p(x_n) < t}. Now a threshold is set on the distribution Q(t), which allows the rejection of low-probability events. Typically a low value is chosen, for example 5%, which means that the rejected data has a likelihood lower than 5% of the maximum likelihood value observed in the data.

Again, the same technique can be used for the detection of novel events in newly observed data.
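A sketch of the idea is given below. It is illustrative only: the density is a single Gaussian fitted to toy data rather than a fitted mixture, and the cut-off is implemented as one possible reading of the low-percentage rule, namely rejecting the points in the lowest 5% tail of the density values.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (500, 2))
X = np.vstack([X, [[6.0, 6.0], [-5.0, 7.0]]])          # two gross outliers appended

# density estimate (here simply one Gaussian; in general a fitted mixture p(x) would be used)
mu, Sigma = X.mean(axis=0), np.cov(X.T, bias=True)
diff = X - mu
dens = np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)) \
       / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

# reject the points whose density lies in the lowest 5% tail of Q(t)
threshold = np.quantile(dens, 0.05)
outliers = np.where(dens < threshold)[0]
print(len(outliers), outliers[-2:])                    # the appended points are among the rejected
```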

3.6.2 Outlier cluster

The other method [54] works as an extension of the earlier presented Generalizable Gaussian Mixture model; however, similar techniques can be developed for an arbitrary mixture model. It simultaneously detects the outliers and estimates the data density in the mixture model framework.

The method involves an additional, especially created, wide outlier cluster, which is placed in the center of the data. Figure 3.7 illustrates this idea.

Figure 3.7 The outlier cluster (dashed line) has a much larger covariance than the data (solid line). Therefore the likelihood shown in the figure is dominated in the center by the data density and on the tails by the outlier density.

Due to the high variance, the density of the outlier cluster takes much lower values than the data density in the center of the distribution but higher values on the tails. Therefore, the unlikely samples fall into the outlier cluster rather than contributing to the process of estimating the parameters of the data density.

The method uses a modified unsupervised GGM in the estimation. The model is a linear combination of K + 1 component densities and their proportions:

\[
p(x|\theta) = \sum_{k=1}^{K+1} p(x|\theta_k)\,P(k). \qquad (3.17)
\]

In the formula the Gaussians (given by equation 3.11) are used as the component densities p(x|θ_k). The parameters are estimated, as in the earlier presented models, via the EM algorithm that minimizes the negative log-likelihood given in equation 3.12.

This modification of the UGGM algorithm, presented in figure 3.3, uses one additional Gaussian cluster with fixed parameters: µ_{K+1} = µ_0 = N^{-1} Σ_n x_n and Σ_{K+1} = cΣ_0 = cN^{-1} Σ_n (x_n − µ_0)(x_n − µ_0)^T, where c is a multiplying constant which determines the width of the outlier cluster and typically takes large values4. The EM updates for µ_k and Σ_k concern only the K clusters describing the data density, while the parameters of the outlier cluster K + 1 remain unchanged. The mixing proportions P(k), which indicate the cluster membership, are updated for all components.

4A value of c = 10 is used in the experiments.

Even though the presented technique is developed for the unsupervised GGM, it can easily be extended to the other discussed models, such as SGGM and USGGM.
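A sketch of the modification is given below (illustrative only, with toy data and a simplified EM loop without the data splitting of figure 3.3): K data components plus one fixed, wide component centered at the data mean; only the K data components' means and covariances are re-estimated, while the mixing proportions of all K+1 components are updated.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def uggm_with_outlier_cluster(X, K, c=10.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu0, Sigma0 = X.mean(axis=0), np.cov(X.T, bias=True)
    mus = [X[i].copy() for i in rng.choice(N, K, replace=False)] + [mu0]   # fixed outlier mean
    Sigmas = [Sigma0.copy() for _ in range(K)] + [c * Sigma0]              # fixed wide covariance
    Pk = np.full(K + 1, 1.0 / (K + 1))
    for _ in range(n_iter):
        lik = np.column_stack([gauss_pdf(X, mus[k], Sigmas[k]) for k in range(K + 1)])
        post = lik * Pk
        post /= post.sum(axis=1, keepdims=True)
        for k in range(K):                               # M-step for the K data components only
            w = post[:, k]
            mus[k] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - mus[k]
            Sigmas[k] = (w[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / w.sum()
            Sigmas[k] += 1e-6 * np.eye(D)
        Pk = post.mean(axis=0)                           # proportions updated for all K+1 components
    return post.argmax(axis=1) == K                      # True where the outlier cluster wins

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(3, 0.3, (200, 2)), [[10.0, -8.0]]])
is_out = uggm_with_outlier_cluster(X, K=2)
print(is_out.sum(), is_out[-1])   # number of flagged points, and the flag for the appended point
```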

3.6.3 Evaluation of outlier detection methods

A 2-dimensional toy data set was generated for the illustration and comparison of the outlier detection techniques presented in sections 3.6.1 and 3.6.2. The results are shown in figure 3.8.


Figure 3.8 Two Gaussian clusters with outliers are used as the data set (left panel); the outliers are marked in red. Method 1 (middle plot) corresponds to the outcome of the cumulative distribution; the samples classified as outliers are marked with green triangles. Here, more samples than expected are detected as outliers. The right panel presents the results of the clustering with the outlier cluster (method 2); the examples classified as outliers are marked with cyan diamonds. Precisely the points that were originally outliers are predicted as outliers.

Two Gaussian distributed spherical clusters were generated. Then 9 additional outlier samples forming a regular grid were included in the data set. The left plot of figure 3.8 shows the created data. Blue points are generated by the Gaussians and red circles are the outliers. In the middle plot the result of the cumulative distribution is presented, and in the right plot the outcome of the outlier cluster technique is shown. The cumulative distribution was used with a threshold of 0.5%, but still many low-probability samples from the Gaussian clusters were classified as outliers. In the case of the outlier cluster method the result is accurate.

3.7 Discussion

Summarizing, mixture models are universal and very flexible approximations of data densities. The significant drawback of the model is its computational complexity. The number of parameters in the model grows quadratically with the data feature dimension, which in turn requires a large number of data points for a good parameter estimation. For high dimensional data it is moreover a time consuming process, since the inversion of the D × D covariance matrix is needed in each training iteration. Therefore, dimensionality reduction techniques are often used in the preprocessing step.

In many applications, finding outliers, i.e. rare events, is more interesting than finding the common cases. They indicate either some unusual behavior or conditions, give information about newly introduced cases, or simply about errors existing in the system. Therefore, the ability to detect outliers is a useful feature of the model.

The outlier cluster was found superior to the cumulative distribution. It does not detect any outlier samples in data where in fact there are no outliers. If the density is estimated accurately then the outcome of this method is also precise. The outlier detection technique is, however, based on the mixture model and as such inherits all its drawbacks.


C H A P T E R 4

Cluster visualization & interpretation

Visualization and result interpretation, the next step in the KDD process (see figure 1.1), follows the transformation part described in Chapter 2 and the modeling part presented in Chapter 3. The following chapter covers the visualization and interpretation possibilities based on the outcome of the Generalizable Gaussian Mixture model. The modeling results for the realistic data sets, together with their interpretation, are presented in Chapter 7.

4.1 Hierarchical clustering

Hierarchical methods for unsupervised and supervised data mining provide a multilevel description of data. This is relevant for many applications related to information extraction, retrieval, navigation and organization; for further reading on this topic refer to, e.g., [13, 28]. Many different approaches to hierarchical analysis have been suggested, from divisive to agglomerative clustering, and recent developments include [11, 27, 58, 69, 87, 89]. In this section, the agglomerative probabilistic clustering method is investigated [52, 53, 80] as an additional level based on the outcome of the Generalizable Gaussian Mixture model.

The agglomerative hierarchical clustering algorithms start with each object as a separate element. These elements are successively combined based on a similarity measure until only one group remains.

Let us start from the outcome of the unsupervised GGM model (described in section 3.3), the probability density components p(x|k) (equation 3.11), where k is the index of the cluster with parameters collected in θ_k, and treat it as the first level j = 1 of the newly developed hierarchy. At this level, K clusters are observed, described by the means µ_k, the covariances Σ_k and the proportion values P(k), k = 1, 2, ..., K. The goal is to group together the clusters that are similar to each other. This similarity can be understood, for example, as the distance between the clusters in Euclidean space, or as similar characteristics in the probability space. Therefore, various similarity measures can be applied, which often leads to different clustering results. A few of the possible choices for the distance measures between the clusters are presented further in this chapter.

The simplest technique for combining the clusters is to merge only two clusters at each level of the hierarchy. Two clusters are merged when the distance between them is the smallest. The procedure is repeated until one cluster, containing all the elements, is reached at the top level. That is, at level j = 1 there are K clusters and a single cluster remains at the final level, so that the full hierarchy contains 2K − 1 clusters in total.

Let p_j(x|k) be the density for the k'th cluster at level j and P_j(k) the corresponding mixing proportion. Then the density model at level j is defined by:

\[
p(x) = \sum_{k=1}^{K-j+1} P_j(k)\,p_j(x|k). \qquad (4.1)
\]

If clusters r and m at level j are merged into ℓ at level j + 1, then

\[
p_{j+1}(x|\ell) = \frac{p_j(x|r)\,P_j(r) + p_j(x|m)\,P_j(m)}{P_j(r) + P_j(m)}, \qquad (4.2)
\]

and

\[
P_{j+1}(\ell) = P_j(r) + P_j(m). \qquad (4.3)
\]


The class posterior

\[
P_j(k|x_n) = \frac{p_j(x_n|k)\,P_j(k)}{p(x_n)} \qquad (4.4)
\]

also propagates in the hierarchy, so that

\[
P_{j+1}(\ell|x_n) = P_j(r|x_n) + P_j(m|x_n). \qquad (4.5)
\]

Once the hierarchy is created, it is possible to determine the class/level membership of new samples. Thus, x_n is assigned to cluster k at level j if

\[
P_j(k|x_n) = \frac{p(x_n|k)\,P(k)}{p(x_n)} > \rho, \qquad (4.6)
\]

where ρ is a threshold value, typically close to 1. This ensures high confidence in the assignment. A data sample "climbing up" the tree structure gains confidence with each additional level. That means that if the sample at level j cannot be assigned unambiguously to any of the clusters, it will surely be possible at one of the higher levels of the hierarchy.

4.1.1 Similarity measures

4.1.1.1 Kullback-Leibler similarity measure

The natural similarity measure between the cluster densities is the symmetric Kullback-Leibler (KL) divergence [10, 69], since it reflects dissimilarity between the densities in the probabilistic space. Symmetric KL for agglomerative hierarchical clustering was introduced and investigated in [52]. It is defined as

\[
D(r,m) = \frac{1}{2}\int p(x|r)\log\frac{p(x|r)}{p(x|m)}\,dx + \frac{1}{2}\int p(x|m)\log\frac{p(x|m)}{p(x|r)}\,dx. \qquad (4.7)
\]

At level j = 1, the KL divergence for the Gaussian clusters can be expressed in the simplified form:

\[
D_1(r,m) = -\frac{D}{2} + \frac{1}{4}\left(\mathrm{Trace}[\Sigma_r^{-1}\Sigma_m] + \mathrm{Trace}[\Sigma_m^{-1}\Sigma_r]\right) + \frac{1}{4}(\mu_r-\mu_m)^T(\Sigma_r^{-1}+\Sigma_m^{-1})(\mu_r-\mu_m). \qquad (4.8)
\]

The derivation of this equation is provided in appendix A.
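A sketch of equation 4.8 and of one merging step (equations 4.2, 4.3 and 4.9) is given below; the cluster parameters are arbitrary illustrative values, not fitted ones.

```python
import numpy as np

def sym_kl(mu_r, Sig_r, mu_m, Sig_m):
    """Symmetric KL divergence between two Gaussian clusters (Eq. 4.8)."""
    D = len(mu_r)
    d = mu_r - mu_m
    term_tr = np.trace(np.linalg.solve(Sig_r, Sig_m)) + np.trace(np.linalg.solve(Sig_m, Sig_r))
    term_mu = d @ (np.linalg.inv(Sig_r) + np.linalg.inv(Sig_m)) @ d
    return -D / 2 + 0.25 * term_tr + 0.25 * term_mu

# three arbitrary clusters with mixing proportions
mus = [np.zeros(2), np.array([0.5, 0.0]), np.array([5.0, 5.0])]
Sigs = [np.eye(2), np.eye(2), 0.5 * np.eye(2)]
P = np.array([0.4, 0.4, 0.2])

K = len(P)
Dmat = np.full((K, K), np.inf)
for r in range(K):
    for m in range(r + 1, K):
        Dmat[r, m] = sym_kl(mus[r], Sigs[r], mus[m], Sigs[m])
r, m = np.unravel_index(np.argmin(Dmat), Dmat.shape)       # the closest pair is merged first
P_merged = P[r] + P[m]                                     # Eq. 4.3
# the distance from the merged cluster to a remaining cluster k follows the recursion of Eq. 4.9
k = [i for i in range(K) if i not in (r, m)][0]
D_new = (P[r] * Dmat[min(r, k), max(r, k)] + P[m] * Dmat[min(m, k), max(m, k)]) / P_merged
print(r, m, P_merged, D_new)
```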


Unfortunately, a significant drawback of KL is that an exact analytical expression can be obtained only for the first level of the hierarchy, while the distances for the next levels have to be approximated [52, 53, 80]. For that approximation a simple combination rule can be used, in which the distances to a new cluster are calculated from the original distances weighted by the mixing proportions P(k). Thus, the distance between the merged cluster ℓ = {r, m} and another cluster k is defined by the following recursive formula:

\[
D_{j+1}(k,\ell) = \frac{P_j(r)\,D_j(r,k) + P_j(m)\,D_j(m,k)}{P_j(r) + P_j(m)}. \qquad (4.9)
\]

4.1.1.2 L2 similarity measure

The L2 distance was presented in [53] as a similarity measure in connection with agglomerative hierarchical clustering.

The L2 distance in the case of density functions is defined as

\[
D_{j+1}(r,m) = \int \left(p_j(x|r) - p_j(x|m)\right)^2 dx, \qquad (4.10)
\]

where r and m index two different clusters. Due to Minkowski's inequality (appendix A), D(r, m) is a distance measure.

Let us define the set of cluster indices I = {1, 2, ..., K} and let I_α and I_β be disjoint subsets of I such that I_α ∩ I_β = ∅, I_α ⊂ I and I_β ⊂ I. I_α and I_β contain the indices of the clusters that constitute clusters r and m at level j, respectively. Then the density of cluster r is given by:

\[
p_j(x|r) = \sum_{i\in I_\alpha} \alpha_i\, p(x|i), \qquad \alpha_i = \frac{P(i)}{\sum_{i\in I_\alpha} P(i)} \qquad (4.11)
\]

for i ∈ I_α, and zero otherwise. The density of cluster m, with weights β_i, is defined in a similar way:

\[
p_j(x|m) = \sum_{i\in I_\beta} \beta_i\, p(x|i), \qquad \beta_i = \frac{P(i)}{\sum_{i\in I_\beta} P(i)}. \qquad (4.12)
\]

The Gaussian integral is given by [91]:

\[
\int p(x|a)\,p(x|b)\,dx = \mathcal{N}(\mu_a-\mu_b,\, \Sigma_a+\Sigma_b), \qquad (4.13)
\]

where \(\mathcal{N}(\mu,\Sigma) = (2\pi)^{-D/2}\,|\Sigma|^{-1/2}\exp(-\frac{1}{2}\mu^{\top}\Sigma^{-1}\mu)\).

If we further define the K-dimensional vectors α = {α_i}, β = {β_i} and the K × K symmetric matrix G = {G_ab}, where G_ab = N(µ_a − µ_b, Σ_a + Σ_b), then the distance D can be rewritten in the simplified notation as:

\[
D(r,m) = (\alpha-\beta)^{\top} G\,(\alpha-\beta). \qquad (4.14)
\]

It is also important to include the priors of the components in the distance measure. Thus, the modified L2 distance is given by:

\[
D_{j+1}(r,m) = \int \left(p_j(x|r)P_j(r) - p_j(x|m)P_j(m)\right)^2 dx, \qquad (4.15)
\]

which can easily be expressed by a modified matrix \(\tilde{G}_{ab} = P(a)P(b)\,G_{ab}\).
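A small sketch of the G matrix and of the L2 distance of equations 4.13–4.15 follows; all parameter values are illustrative.

```python
import numpy as np

def gauss_at(delta, Sigma):
    """N(delta; 0, Sigma) — the Gaussian overlap integral of Eq. 4.13."""
    D = len(delta)
    return np.exp(-0.5 * delta @ np.linalg.solve(Sigma, delta)) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def l2_distance(alpha, beta, mus, Sigmas, priors=None):
    """(alpha - beta)^T G (alpha - beta), Eq. 4.14; if priors are given, the modified
    matrix P(a)P(b) G_ab of Eq. 4.15 is used instead."""
    K = len(mus)
    G = np.array([[gauss_at(mus[a] - mus[b], Sigmas[a] + Sigmas[b]) for b in range(K)]
                  for a in range(K)])
    if priors is not None:
        G = np.outer(priors, priors) * G
    d = alpha - beta
    return d @ G @ d

# toy example: cluster r = {0}, cluster m = {1, 2}
mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([3.5, 0.5])]
Sigmas = [np.eye(2)] * 3
P = np.array([0.5, 0.3, 0.2])
alpha = np.array([1.0, 0.0, 0.0])                       # weights of Eq. 4.11 for cluster r
beta = np.array([0.0, P[1], P[2]]) / (P[1] + P[2])      # weights of Eq. 4.12 for cluster m
print(l2_distance(alpha, beta, mus, Sigmas), l2_distance(alpha, beta, mus, Sigmas, priors=P))
```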

4.1.1.3 Cluster Confusion similarity measure

Another possibility is to use the confusion between the clusters as a similarity measure [53]. When merging two clusters, the similarity E is defined as the probability of misassignment when drawing samples from the two clusters separately:

\[
E(r,m) = \int_{\mathcal{R}_m} p(x|r)P(r)\,dx + \int_{\mathcal{R}_r} p(x|m)P(m)\,dx, \qquad (4.16)
\]

where R_m is the set that contains the vectors x classified as belonging to cluster m, i.e. R_m = {x : m = argmax_j p(j|x)}, and R_r = {x : r = argmax_j p(j|x)}.

In general, E(r, m) cannot be computed analytically, but it can be approximated with arbitrary accuracy by using an auxiliary set of data samples drawn from the estimated model. The cluster confusion similarity measure can be computed using the following algorithm:

1. Randomly select a cluster i with probability P(i).
2. Draw a sample x_n from p(x_n|i).
3. Determine the estimated cluster j = argmax_k p(k|x_n).
4. Estimate E(r, m) as the fraction of samples where (i = r ∧ j = m) or (j = r ∧ i = m).

Figure 4.1 The Cluster Confusion similarity measure algorithm.
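A Monte Carlo sketch of the algorithm in figure 4.1 follows (illustrative only; the mixture parameters and the number of samples are arbitrary).

```python
import numpy as np

def cluster_confusion(mus, Sigmas, P, r, m, n_samples=5000, seed=0):
    """Monte Carlo estimate of the cluster confusion E(r, m) of figure 4.1."""
    rng = np.random.default_rng(seed)
    K = len(P)
    i = rng.choice(K, size=n_samples, p=P)                                   # 1. pick the generating cluster
    X = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in i])    # 2. draw one sample per pick
    dens = np.column_stack([                                                 # 3. classify by maximum posterior
        P[k] * np.exp(-0.5 * np.einsum('ni,ij,nj->n', X - mus[k],
                                       np.linalg.inv(Sigmas[k]), X - mus[k]))
        / np.sqrt(np.linalg.det(2 * np.pi * Sigmas[k]))
        for k in range(K)])
    j = dens.argmax(axis=1)
    # 4. fraction of samples confused between clusters r and m
    return np.mean(((i == r) & (j == m)) | ((i == m) & (j == r)))

mus = [np.zeros(2), np.array([1.5, 0.0]), np.array([8.0, 8.0])]
Sigmas = [np.eye(2)] * 3
P = np.array([0.4, 0.4, 0.2])
print(cluster_confusion(mus, Sigmas, P, 0, 1), cluster_confusion(mus, Sigmas, P, 0, 2))
```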


4.1.1.4 Sample Dependent similarity measure

Instead of constructing a fixed hierarchy, a sample dependent hierarchy can be obtained by merging a number of clusters relevant for a new data sample x_n. This similarity measure was proposed in [53].

Let p(k|x_n), k = 1, ..., K, be the computed posteriors, ranked in descending order, and let the accumulated posterior be stored in A(i) = Σ_{k=1}^{i} p(k|x_n). The sample dependent cluster is then formed by merging the components that are fundamental for this sample, k = 1, 2, ..., M, where M is the smallest i for which A(i) > ρ, with, for example, ρ = 0.9.

4.1.2 Evaluation of similarity measures in hierarchical clustering

To compare the presented similarity measures a 6-cluster toy example was created. All clusters are drawn from normal distributions with spherical covariance structures. In four clusters (1, 2, 3, 4) the covariance is set to the identity matrix, Σ = I, and for two clusters (5, 6), Σ = 0.1 · I. The scatter plot of the data is shown in figure 4.2.

The graphs visualizing hierarchical clustering are called dendrograms. Dendrograms for the different similarity measures are shown in figure 4.3. All the discussed measures recognize the dissimilarity between the narrow and the wide variance clusters. As mentioned earlier, the hierarchies produced by different similarity measures vary.

For the selected test samples, also shown in figure 4.2, the ordered class posterior values are shown in table 4.1. Five test points were especially selected for a good illustration of the Sample Dependent similarity measure. The sample coordinates are shown at the top of table 4.1 and refer to figure 4.2. The threshold value is assumed to be ρ = 0.9. For example, sample x1 gains full confidence after the second hierarchy level, thus the fundamental clusters for that example were clusters number 1 and 4. The situation is similar in the case of sample x5. Regarding point x2, the single cluster no. 6 provides enough confidence. Sample x3 builds a two level hierarchy {1, 3} and sample x4 a three level hierarchy {{6, 3}, 1}. In that way the hierarchy is created independently for each of the samples.


Figure 4.2 The scatter plot of the artificial data generated for the comparison between the similarity measures presented in Chapter 4. The dendrograms for the presented data, built with the KL, L2, modified L2 and Cluster Confusion similarity measures, are presented in figure 4.3. The test samples, marked with red stars, were selected for the illustration of the Sample Dependent similarity measure presented in table 4.1.

x1 = (2, −2)      x2 = (−0.6, 0.5)   x3 = (4, 3.5)     x4 = (0.9, 1)     x5 = (0.4, 0)
p(k|xi)  k        p(k|xi)  k         p(k|xi)  k        p(k|xi)  k        p(k|xi)  k
0.5      1        0.967    6         0.82     1        0.388    6        0.495    5
0.5      4        0.015    2         0.18     3        0.351    3        0.495    6
0.0      2        0.011    3         0.00     4        0.259    1        0.006    1
0.0      3        0.007    5         0.00     2        0.001    2        0.002    3
0.0      5        0.000    4         0.00     5        0.001    4        0.002    4
0.0      6        0.000    1         0.00     6        0.000    5        0.000    2

Table 4.1 The Sample Dependent similarity measure. The test sample coordinates refer to figure 4.2. The ordered class posteriors and the corresponding cluster numbers are shown in the columns. The assumed threshold value is ρ = 0.9. For example, sample x1 gains full confidence already after the first merge, thus the fundamental clusters for that example are a composition of clusters number 1 and 4. For point x2, the single cluster no. 6 provides enough confidence. In the case of sample x4 a 3-level hierarchy is proposed, namely {{6, 3}, 1}.



Figure 4.3 The presented dendrograms correspond to the hierarchical clusterings created with the KL, L2, modified L2 and Cluster Confusion similarity measures for the 6-cluster toy example shown in figure 4.2. The type of measure is shown above each plot. By referring to figure 4.2, the differences among the methods in the ordering of the clusters can be seen.

4.2 Confusion matrix

The important final step of the KDD process (figure 1.1) is an understandable presentation of the obtained results. In the case of supervised learning the confusion matrix is often used as a cluster interpretation. The confusion matrix, by definition, contains information about the actual and predicted classifications done by a classification system. An illustration of this idea is presented in table 4.2. a represents a vector generated by class A, and b one generated by class B. The estimated corresponding clusters are marked by A and B, respectively. P(a|A) is the probability of assignment of a vector that was generated from class A to the corresponding cluster A. P(a|B) denotes the probability of misassignment of the vector a. When the confusion matrix equals the identity matrix, there is no cluster confusion observed.

            Class A    Class B
Cluster A   P(a|A)     P(b|A)
Cluster B   P(a|B)     P(b|B)

Table 4.2 A simple illustration of the confusion matrix idea. Vector a is generated by class A, while vector b is generated by class B. The estimated corresponding clusters are marked by A and B, respectively. P(a|A) is the probability of the assignment of the vector that was generated from class A to the corresponding cluster A. P(a|B) denotes the probability of misassignment of the vector a. When the confusion matrix equals the identity matrix, there is no cluster confusion observed.

The confusion matrix can also be useful in the interpretation when unsupervised learning is used but all the labels are available. In such a case not the class but the cluster structure of the data is the matter of interest, and the number of clusters is typically larger than the number of class labels. The confusion matrix then provides a "supervised" labeling of the clusters.
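A small sketch of building such a matrix from cluster assignments and class labels follows; the assignments and labels below are invented counts used purely for illustration.

```python
import numpy as np

def confusion_matrix(cluster_assignments, class_labels, n_clusters, n_classes):
    """Rows: estimated clusters, columns: true classes; entry [k, c] is the estimated
    probability of assigning a class-c vector to cluster k, so each column sums to one
    (cf. table 4.2)."""
    counts = np.zeros((n_clusters, n_classes))
    for k, c in zip(cluster_assignments, class_labels):
        counts[k, c] += 1
    return counts / counts.sum(axis=0, keepdims=True)

clusters = np.array([0, 0, 0, 1, 1, 1, 1, 0])     # illustrative cluster assignments
labels = np.array([0, 0, 1, 1, 1, 1, 0, 0])       # illustrative true class labels
print(confusion_matrix(clusters, labels, 2, 2))
```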

4.3 Keywords and prototypes

As another method of interpretation, keywords and prototypes are proposed. Here, no class labels are required. The term keywords originates from textual data1, but it can easily be extended to other data types. For example, a keyword can be understood as a typical image or as typical values of the observed features. As the prototype the most representative cluster example is considered, selected from the original documents, which in the GGM model corresponds to the example with the highest density value.

The simplest approach to keyword generation is to use the location parameter (mean) as a representative for each cluster. A more general and accurate technique requires generating a set of the most probable vectors y_i from each of the clusters, found by, e.g., Monte Carlo sampling of the probability density function p(x). The vectors y_i are called typical features. Since in most cases the computations are not performed in the original space2, the clusters are described in the latent space, where the processing was also performed. Therefore, the typical features need to be back-projected to the original space, where an interpretation is possible, i.e. ỹ_i = U · y_i, where U is the projection matrix. Then keywords corresponding to the most fundamental features of the back-projected vectors may be generated:

\[
\mathrm{Keywords}_k = \mathrm{Feature}(\tilde{y}_{di} > \rho). \qquad (4.17)
\]

1A detailed description of the text data type can be found in section 7.1.
2One step of the KDD process is the projection of the data, which is carried out whenever the original data dimensionality is significantly large. The Gaussian Mixture models considered in Chapter 3 require a small (for numerical reasons) dimensionality of the data.

Since the density values for particular clusters may vary greatly, it is impractical to use equation 4.17 directly. Instead, the back-projected vectors are normalized so that the maximum value equals 1 or so that they sum to 1. Then ρ is typically selected as a value close to 1.

In some cases it is more useful to represent clusters by one common example. Thus, the prototype of a given cluster can be found by selecting the example that has the highest density value p(x_n|k). For example, for text data the prototypical documents can be found. It is also possible to generate keywords directly from such a prototype.

The interpretation with keywords and prototypes is possible at each level of the hierarchy. The typical features are drawn from the mixture distribution of the merged clusters and back-projected to the original term space, where the common keyword representation is found. The prototypes must be found separately for each of the component clusters.
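A small sketch of keyword and prototype extraction for one cluster follows. It is an illustration only: the term list, the projection matrix U, the cluster parameters and the simplification of averaging the sampled typical features are all hypothetical choices.

```python
import numpy as np

def cluster_keywords(mu_k, Sigma_k, U, terms, rho=0.8, n_draws=200, seed=0):
    """Keywords for one cluster: draw typical latent features from N(mu_k, Sigma_k),
    back-project with U (latent -> term space), normalize and threshold (cf. Eq. 4.17)."""
    rng = np.random.default_rng(seed)
    Y = rng.multivariate_normal(mu_k, Sigma_k, size=n_draws)   # typical features y_i
    y_back = U @ Y.mean(axis=0)                                # back-projection to term space
    y_back = y_back / y_back.max()                             # normalize so the maximum is 1
    return [terms[d] for d in np.where(y_back > rho)[0]]

def cluster_prototype(dens_k):
    """Prototype: index of the original example with the highest density p(x_n|k)."""
    return int(np.argmax(dens_k))

# illustrative numbers: 6 terms projected onto 2 latent dimensions
terms = ["pain", "fever", "cough", "rash", "sleep", "diet"]
U = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
              [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])             # projection matrix (term x latent)
mu_k, Sigma_k = np.array([1.0, 0.0]), 0.05 * np.eye(2)
print(cluster_keywords(mu_k, Sigma_k, U, terms))               # terms loading on the first latent dimension
```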

4.4 Discussion

Hierarchical clustering supplies a multilevel description of the data. In that way, the data points which cannot be unambiguously assigned to any of the clusters found by the GGM model can gain confidence at higher levels of the hierarchy. Several similarity measures may be used, which result in different structures. However, no method significantly outperformed the others, despite their drawbacks. Therefore, the results from all the similarity measures are presented later in the experimental Chapter 7. It should be noted that the formula for the Kullback-Leibler similarity measure is approximate and that the Cluster Confusion similarity measure is computationally expensive.

Interpretation of the clusters at all hierarchy levels is provided by keywords and prototypes. Additionally, if the data labels are available, it is possible to compute the confusion matrix between the obtained and the original labeling.


C H A P T E R 5

Imputing missing values

The missing data imputation task can be both a subject of the data cleaning process and of the data mining itself, see figure 1.1. In this work the problem is addressed in connection with a particular data set that was collected during a survey and whose data records are therefore partially missing. One of the tasks in connection with this data set was to create models that allow for an accurate prediction of the lacking values.

The sun-exposure data was used in connection with missing values imputation.The experiments can be found in [81]. Since originally, the diary records arecategorical, both nominal and ordinal, coding technique is proposed that con-verts the data to binary vectors. 1-out-of-c coding was used for this purpose.It represent c level categorical variable with a binary c bits vector. For exam-ple, if the variable takes one of the following three states {1, 2, 3}, (c = 3)then the states are coded as shown on the figure 5.1 Missing data appears in

Missing data appears in various applications, both in statistical and in real-life problems. Data may be missing for various reasons. For example, subjects of studies, such as medical patients, often drop out for different reasons before the study is finished.


state   binary representation
1       100
2       010
3       001

Table 5.1 An example of a 3-state categorical variable coded into a binary representation using 1-out-of-c coding.

The questionnaires are not filled out properly because they were forgotten, the questions were skipped accidentally, they were not understood, or simply the subject did not know the answers. Errors can occur in the data storage, the samples may get contaminated, etc. The cause may also be the weather conditions, which did not allow the samples to be collected or the investigation to be performed. Missing data can be detected in the original space, for example by finding the outlier samples with the methods described in Section 3.6.

A lot of research effort has been spent on this subject. One could refer, for example, to the books by Little & Rubin [57] or by Schafer [77]; some theory can as well be found in the articles by Ghahramani [30, 31] or Rubin [73].

In order to define a model for missing data, let us, in line with Rubin [72], decompose the data matrix X = {x_n}_{n=1}^N into an observed and a missing part as follows: X = [Xo, Xm]. One can now introduce the matrix R = {r_n}_{n=1}^N, which is an indicator of the missingness:

r_{dn} = \begin{cases} 1, & x_{dn} \text{ observed} \\ 0, & x_{dn} \text{ missing} \end{cases}

The joint probability distribution of the data generation process and the missingness mechanism can then be decomposed as

p(R, X | ξ, θ) = p(R | X, ξ) p(X | θ),    (5.1)

where θ and ξ are the parameters of the data generation process and of the missing data mechanism, respectively.


The mechanism generating missing data is divided into the following three categories [30, 72]:

MAR (data Missing At Random). The probability of generating a missing value may depend on the observations but not on the missing value itself, i.e. p(R|Xo, Xm, ξ) = p(R|Xo, ξ); see figure 5.1 for an illustration.

MCAR (data Missing Completely At Random) is a special case of MAR. By this definition the missing data is generated independently of Xo and Xm, i.e. p(R|Xo, Xm, ξ) = p(R|ξ); see figure 5.2.

NMAR (data Not Missing At Random). The missing value generation mechanism depends on both the observed and the missing part, i.e. p(R|Xo, Xm, ξ). The data is then said to be censored.

Figure 5.1 An example of the data missing at random (MAR) generating mechanism (scatter plot of x2 versus x1).

Figure 5.2 An example of the data missing completely at random (MCAR) generating mechanism (scatter plot of x2 versus x1).

The majority of the research covers the cases where missing data is of the MCAR or the MAR type. There are no general approaches for the NMAR type.

Let us imagine the following example: a sensor that is unreadable outside a certain range. In such a case, learning the data distribution with the missing values omitted will cause a severe error (the NMAR case). But if the same sensor fails only occasionally (the MAR case), due to reasons other than the measured quantity, the data, however harmed, often supplies enough information to create the imputation system and achieve a good estimate of the data distribution.
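The difference between the two benign mechanisms can be made concrete with a small simulation; the sketch below (my own illustration, with arbitrary numbers and names) masks a variable x2 either completely at random or with a probability that depends only on the fully observed variable x1:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x1 = rng.uniform(0, 10, N)                 # always observed
x2 = 0.3 * x1 + rng.normal(0, 0.3, N)      # variable to be partially masked

# MCAR: the missingness indicator is independent of the data
r_mcar = rng.random(N) < 0.8               # r = 1 means "observed"

# MAR: the probability of observing x2 depends only on the observed x1
r_mar = rng.random(N) < 1.0 / (1.0 + np.exp(x1 - 7.0))

print("fraction observed, MCAR:", r_mcar.mean(), " MAR:", r_mar.mean())
```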


Most of the work in the statistical literature [10, 30, 57, 77] concerns two tasks, namely the imputation1 of the missing values and the estimation of the model parameters. In the estimation, it is common practice to use complete-case analysis (CCA) or available-case analysis (ACA). Case deletion is often used for both tasks in order to force the incomplete data into the complete data format. Omitting the patterns with unknown features (CCA) can be justified whenever a large quantity of data is available2 and the variables are missing completely at random, or just at random, as shown in figures 5.1 and 5.2. Then the accuracy of the estimation of the model parameters will not suffer severely due to the deletion process. However, the estimate may be biased if the missingness process depends on some latent variable. The advantage of this technique is the possibility of using the standard methods for statistical analysis developed for fully observed data. The significant drawback is the loss of information that occurs in the deletion process. When there are not enough data samples, the recommended technique is to use all the data available for the modeling (ACA). This is well founded, since the data vectors with missing values also carry information which may turn out to be crucial for the modeling. Thus, all the cases where the variable of interest occurs are included in the estimation.

Once the parameters are estimated, the imputation can be performed. Naturally, with a properly selected model, the missing values are re-inserted with a minimum error.

Many different methods have been suggested for completing the missing values (e.g., [57]). One way, for multivariate data, is to replace the missing value with a predicted plausible value. The following list presents the techniques most often used in the literature:

Unconditional Mean Imputation replaces the missing value with the mean of all observed cases. It assumes the MCAR generation mechanism, and both the mean and the variance of the completed data are underestimated.

Cold deck uses values from a previous study to replace missing values. It assumes the MCAR case. The variance is underestimated.

1 The imputation is a general term for reconstruction of the missing data by plausible values.
2 Only c.a. 5% of the missing data is an acceptable amount to delete from the data set.


Hot deck replaces the missing value with the value from a similar, fully observed case. It assumes the MAR case and it gives a better variance estimate than the mean and cold deck imputation.

Regression replaces the missing value with a value predicted from the regression on the observed variables. If the regression is linear, the MAR assumption is made. The variance is again underestimated, but less than in the mean imputation.

Substitution replaces the missing study with another fully observed study not included earlier in the data set.

For example, one could use conditional Mean Imputation. Here, both the mean and the variance are calculated based on the complete cases, and the missing values are filled in by means of linear regression conditioned on the observed variables. An example of this technique as a method of estimating missing values in multivariate data can be found in [57].

There are also alternative methods that maintain the full sample size and result in unbiased estimates of the parameters, namely multiple imputation [73, 71] and maximum likelihood estimation based approaches [30, 85, 86]. In the multiple imputation approach, as the name indicates and unlike the earlier mentioned single imputation techniques, the value of the missing variable is re-inserted several times. This technique returns, together with the plausible value, also the uncertainty over the missing variable. In the maximum likelihood based approach the missing values are treated as hidden variables and they are estimated via EM (algorithm 3.1).

Below, two models for estimating missing data are presented. The first method is a simple non-parametric K-Nearest Neighbor (KNN) model. In the second model the imputation is based on the assumption that the data vectors come from a Gaussian distribution. Both the KNN and the Gaussian imputation analysis are based on the article [81] enclosed in appendix B. The experimental results are found in section 7.3.2.


5.1 K-Nearest Neighbor model

In order to determine the nearest neighbors of the investigated data point, a distance measure must be chosen. In the case of binary data the Hamming distance measure is used, which is given by the following formula:

D(x_i, x_j) = \sum_{d=1}^{D} |x_{di} - x_{dj}|,    (5.2)

where x_i and x_j are two binary vectors, d is a bit index and D is the number of bits in a vector. Thus, for example, the distance between the two vectors x_1 = [100] and x_2 = [001] is calculated in the following way: D(x_1, x_2) = \sum_{d=1}^{3} |x_{d1} - x_{d2}| = |1 - 0| + |0 - 0| + |0 - 1| = 2.

When the data has real values, for example the Ln-norm can be applied as a distance measure. The L2-norm between the D-dimensional vectors x_i and x_j is defined as

D(x_i, x_j) = \sqrt{ \sum_{d=1}^{D} (x_{di} - x_{dj})^2 }.    (5.3)

Naturally, any other valid3 distance measure may be used that is suitable for the application, the data type, or the domain in which the calculations are performed. For example, with discrete data the linear cosine inner-product is often used to measure vector dissimilarities, and in the probability domain the KL divergence [69] may be applied.

As the optimum number of nearest neighbors K for the investigated data set, the value is selected for which the minimum imputation error is observed in an experiment performed on fully observed data vectors.

The algorithm for the non-parametric K-Nearest Neighbor model is presented in figure 5.3.

3 The distance metric must satisfy the positivity, symmetry and triangle inequality conditions.


KNN Algorithm:
1. Divide the data set D into two parts. Let the first set, Dm, contain the data vectors in which at least one of the features is missing. The remaining part, where all the vectors are complete, is called Do.
2. For each vector x ∈ Dm:
   • Divide the vector into observed and missing parts as x = [xo, xm].
   • Calculate the distance (Eq. 5.2 or Eq. 5.3) between xo and all the vectors from the set Do. Use only those features which are observed in x.
   • Use the K closest vectors (K nearest neighbors) and perform majority voting (in the discrete case) or take the mean value (for real data) to estimate the missing values.

Figure 5.3 The algorithm for the KNN imputation model.
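A compact sketch of Figure 5.3 is given below (my own illustration, not the thesis implementation; it assumes binary coded data with NaN marking the missing entries, and the variable names are placeholders):

```python
import numpy as np

def knn_impute(X, K=3):
    """Impute NaN entries by a majority vote over the K nearest complete rows,
    using the Hamming distance (Eq. 5.2) on the observed features."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    Do = X[complete]                                  # fully observed vectors
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])                         # observed feature mask
        dist = np.abs(Do[:, obs] - X[i, obs]).sum(axis=1)
        neighbours = Do[np.argsort(dist)[:K]]
        # majority vote over the neighbours for every missing feature
        X[i, ~obs] = (neighbours[:, ~obs].mean(axis=0) >= 0.5).astype(float)
    return X

X = np.array([[1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, np.nan],    # one missing bit
              [0, 1, 0, 1],
              [0, 1, 0, 1]], dtype=float)
print(knn_impute(X, K=3))
```

For real-valued data the Hamming distance would be replaced by the L2-norm of Eq. 5.3 and the vote by an average over the neighbours.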

5.2 Gaussian model

Let us now assume that x comes from a Gaussian distribution with mean µ and covariance Σ. The feature vector x is divided into an observed and a missing part, x = [xo, xm], as was done in the previous sections. Under the Gaussian model assumption, the optimal inference of the missing part is given as the expected value of the missing part given the observed part. That is given by the least-squares linear regression between xm and xo predicted by a Gaussian (for reference see [30]), i.e.,

E(x_m | x_o) = µ_m + Σ_{mo} Σ_{oo}^{-1} (x_o - µ_o),    (5.4)

where

µ = [µ_o, µ_m]  and  Σ = \begin{bmatrix} Σ_{oo} & Σ_{om} \\ Σ_{om}^T & Σ_{mm} \end{bmatrix}.    (5.5)

The Gaussian imputation algorithm is presented in figure 5.4.


GM Algorithm:
1. Divide the data set D into two parts. Let the first set, Dm, contain the data vectors in which at least one of the features is missing. The remaining part, where all the vectors are complete, is called Do.
2. Estimate the mean µ and the covariance matrix Σ from Do, i.e.

   µ = (1/No) \sum_{n ∈ Do} x_n    (5.6)

   Σ = (1/(No - 1)) \sum_{n ∈ Do} (x_n - µ)(x_n - µ)^T    (5.7)

   where No = |Do| is the number of complete vectors.
3. For each vector x ∈ Dm:
   • Divide the vector into two parts x = [xo, xm], where xo holds the observed features and xm the missing ones.
   • Estimate the missing part of the vector using Eq. 5.4:

     x_m = E(x_m | x_o) = µ_m + Σ_{mo} Σ_{oo}^{-1} (x_o - µ_o)    (5.8)

     In the case of binary vectors, the sign of the estimate is used, i.e. x_m = sign[E(x_m | x_o)]. For discrete numbers the estimate may be adequately quantized.

Figure 5.4 The algorithm for the Gaussian imputation model.
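The conditional-mean computation of Eqs. 5.4-5.8 translates almost line by line into code. The sketch below is my own illustration (hypothetical names, NaN markers for missing entries), not the implementation used in the thesis:

```python
import numpy as np

def gaussian_impute(X):
    """Fill NaN entries with E[x_m | x_o] under a Gaussian fitted to the
    complete rows (Eqs. 5.4 and 5.6-5.8)."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    Do = X[complete]
    mu = Do.mean(axis=0)                              # Eq. 5.6
    Sigma = np.cov(Do, rowvar=False)                  # Eq. 5.7 (1/(No-1) factor)
    for i in np.where(~complete)[0]:
        o, m = ~np.isnan(X[i]), np.isnan(X[i])
        Soo = Sigma[np.ix_(o, o)]
        Smo = Sigma[np.ix_(m, o)]
        X[i, m] = mu[m] + Smo @ np.linalg.solve(Soo, X[i, o] - mu[o])  # Eq. 5.8
    return X

# for binary 1-out-of-c coded data, the estimate would afterwards be
# thresholded or its sign taken, as described in step 3 of the algorithm
```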

5.3 Discussion

The missing data problem occurs in many medically related databases. One of the data mining tasks concerns the imputation of such lost data. In this chapter two methods for imputation were presented. The experimental results and the analysis of the performance of both methods are given in section 7.3.2. It is shown there that the Gaussian imputation, for the applied data set, performs slightly better than the KNN model.


C H A P T E R 6

Probabilistic approach to Kernel Principal Component Analysis

Spectral clustering methods, one of which is presented in this chapter, are another class of techniques used in the data mining task, see figure 1.1. In this case the feature dimensionality reduction methods presented in Chapter 2 are not necessary, even when the data dimensionality is large. They can, however, be used prior to the clustering.

Kernel principal component analysis (KPCA) [78], i.e. the decomposition of a Gram matrix, has been shown to be a particularly elegant method for extracting nonlinear features from multivariate data. It has been shown to be a discrete analogue of the Nyström approximation to obtaining the eigenfunctions of a process from a finite sample [88]. The relationship between KPCA and non-parametric orthogonal series density estimation was highlighted in [33], and the relation with spectral clustering has recently been investigated in [8]. The basis functions obtained from KPCA can be viewed as finite sample estimates of the truncated orthogonal series [33]. However, a problem common to orthogonal series density estimation is that the strict non-negativity required of a probability density is not guaranteed when employing these finite order sequences to make point estimates [43]. This is also the case for the KPCA decomposition [33].

This chapter considers the non-parametric estimation of a probability density from a finite sample [43] and relates this to the identification of class structure. The approach is presented in the as yet unpublished article [83].

6.1 Density Estimation and Decomposition of the Gram Matrix

Let us consider the estimation of an unknown probability density function p(x) from a finite sample of N points {x_1, ..., x_N} ~ p(x), where the feature space of x is D-dimensional. Such a sample can be employed to estimate the density in a non-parametric form by using, for example, a Parzen window estimator (refer to [43] for a review). The obtained density estimate is then given by

p(x) = N^{-1} \sum_{n=1}^{N} K_h(x, x_n),    (6.1)

where K_h(x_i, x_j) denotes the window (or kernel) function of width h between the points x_i and x_j, which itself satisfies the requirements of a density function [43]. It is important to note that the pairwise kernel function values K_h(x_i, x_j), or Gram matrix1, provide the necessary information regarding the sample estimate of the underlying probability density function p(x).

For applications of unsupervised kernel methods such as KPCA, the selection of the kernel parameter, in the case of the Gaussian kernel the width h, is often problematic. However, since the kernel matrix can be viewed as the sample density estimate, methods such as leave-one-out cross-validation can be employed to obtain an appropriate value of the kernel width parameter.

1 The N × N Gram matrix G = {g_ij} of the data matrix X = {x_dn} is defined as the matrix of the dot products g_ij = x_i^T x_j, or equivalently G = X^T X.


6.1.1 Kernel Decomposition

The density estimate can be decomposed in the following probabilistic manner:

p(x) = \sum_{n=1}^{N} p(x, x_n)    (6.2)
     = \sum_{n=1}^{N} p(x | x_n) P(x_n)    (6.3)
     = N^{-1} \sum_{n=1}^{N} p(x | x_n)    (6.4)

It is assumed that each sample point is equally probable a priori, i.e. P(x_n) = N^{-1}, which is an outcome of the independent and identically distributed (i.i.d.) assumption. The kernel operation can then be seen (compare equations 6.1 and 6.4) as the conditional density p(x|x_n) = K_h(x, x_n).

By Bayes' theorem [10], a discrete posterior probability can be defined for any point x given each of the N sample points:

p(x_n | x) = p(x | x_n) P(x_n) / \sum_{n'=1}^{N} p(x | x_{n'}) P(x_{n'}) = K(x, x_n) / \sum_{n'=1}^{N} K(x, x_{n'}) ≡ \tilde{K}(x, x_n),    (6.5)

where \tilde{K}(x, x_n) denotes the normalized kernel, such that p(x_n|x) ≥ 0 for all n. It is easy to note that \sum_{n=1}^{N} p(x_n|x) = 1.

If there is an underlying class structure in the density, then the sample posterior probability can be decomposed by introducing a discrete class variable:

p(x_n | x) = \sum_{c=1}^{C} p(x_n, c | x) = \sum_{c=1}^{C} p(x_n | c, x) p(c | x).    (6.6)

If the points have been drawn i.i.d. from the respective C classes forming the distribution, such that x_n ⊥ x | c, then

p(x_n | x) = \sum_{c=1}^{C} p(x_n, c | x) = \sum_{c=1}^{C} p(x_n | c) p(c | x),    (6.7)

where the stochastic constraints \sum_{n=1}^{N} p(x_n | c) = 1 and \sum_{c=1}^{C} p(c | x) = 1 are satisfied.


The decomposition of the posterior probabilities for each point in the available set, p(x_i | x_j) = \sum_{c=1}^{C} p(x_i | c) p(c | x_j) for all i, j = 1, ..., N, is identical to the aggregate Markov model originally proposed by Saul and Pereira [76]. The matrix of posteriors (the elements of the normalized kernel matrix) can be viewed as an estimated state transition matrix for a first-order Markov process. This decomposition then provides class posterior probabilities p(c|x_n) which can be employed for clustering purposes.

A divergence based criterion such as the cross-entropy

\sum_{i=1}^{N} \sum_{j=1}^{N} \tilde{K}(x_i, x_j) \log \left\{ \sum_{c=1}^{C} p(x_i | c) p(c | x_j) \right\}    (6.8)

or a distance based criterion such as the squared error

\sum_{i=1}^{N} \sum_{j=1}^{N} \left\{ \tilde{K}(x_i, x_j) - \sum_{c=1}^{C} p(x_i | c) p(c | x_j) \right\}^2    (6.9)

can be locally optimized by employing the standard non-negative matrix factorization (NMF) multiplicative update equations [55] (also described in section 2.1.3) or, equivalently, the iterative algorithm which performs Probabilistic Latent Semantic Analysis (PLSA) [42]. If the normalized Gram matrix is defined as G_{N×N} with elements p(x_i|x_j) = \tilde{K}(x_i, x_j), then the decomposition of that matrix with the NMF or PLSA algorithms yields G_{N×N} = WH, such that W = p(x_i|c) and H = p(c|x_j) are understood as the required probabilities which satisfy the previously defined stochastic constraints.
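A minimal sketch of this step follows (my own illustration, assuming a Gaussian kernel; the column renormalization at the end is a simple post-hoc way to impose the stochastic constraints, whereas a PLSA-style update would enforce them during the optimization):

```python
import numpy as np

def normalized_gram(X, h):
    """Column-normalized Gaussian-kernel Gram matrix, G[i, j] ~ p(x_i | x_j).
    The kernel constant of Eq. 6.18 cancels in the normalization."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h ** 2))
    return K / K.sum(axis=0, keepdims=True)

def nmf_decompose(G, C, n_iter=500, eps=1e-9):
    """Multiplicative-update NMF for the squared error of Eq. 6.9: G ~ W @ H."""
    rng = np.random.default_rng(0)
    N = G.shape[0]
    W, H = rng.random((N, C)), rng.random((C, N))
    for _ in range(n_iter):
        H *= (W.T @ G) / (W.T @ W @ H + eps)
        W *= (G @ H.T) / (W @ H @ H.T + eps)
    W /= W.sum(axis=0, keepdims=True)     # columns of W ~ p(x_i|c)
    H /= H.sum(axis=0, keepdims=True)     # columns of H ~ p(c|x_j)
    return W, H

# two well-separated blobs; cluster memberships are read from H = p(c|x_j)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
W, H = nmf_decompose(normalized_gram(X, h=1.0), C=2)
print(H.argmax(axis=0))
```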

If a new point z is observed, the estimated decomposition components can, in conjunction with the kernel, provide the required class posterior p(c|z):

p(c | z) = \sum_{n=1}^{N} p(c | x_n) p(x_n | z)    (6.10)
         = \sum_{n=1}^{N} p(c | x_n) \tilde{K}(z, x_n)    (6.11)
         = \sum_{n=1}^{N} p(c | x_n) K(z, x_n) / \sum_{n'=1}^{N} K(z, x_{n'}).    (6.12)

This can be viewed as a form of "kernel" based non-negative matrix factorization, where the 'basis' functions p(c|x_n) define the class structure of the estimated density.

In attempting to identify the model order, a generalization error based on the test sample predictive negative log-likelihood can be employed:

G_z = -N_z^{-1} \sum_{n=1}^{N_z} \log \{ p(z_n) \},    (6.13)

where N_z denotes the number of test points. In the presented model, the likelihood of a test sample p(z) is derived in the following manner:

p(z) = (1/N) \sum_{n=1}^{N} p(z | x_n)    (6.14)
     = (1/N) \sum_{n=1}^{N} \sum_{c=1}^{C} p(z | c) p(c | x_n).    (6.15)

The conditional p(z|c) can be decomposed given the finite set such that

p(z | c) = \sum_{l=1}^{N} p(z | x_l) p(x_l | c),    (6.16)

where p(z|x_l) = K(z, x_l). The unconditional density estimate of a test point, given the current kernel decomposition which assumes a specific class structure in the data, can thus be computed as follows:

p(z) = (1/N) \sum_{n=1}^{N} \sum_{l=1}^{N} \sum_{c=1}^{C} K(z, x_l) p(x_l | c) p(c | x_n).    (6.17)

6.1.2 Examples of kernels

For continuous D-dimensional data a common choice of kernel, for both kernel PCA and density estimation, is the isotropic Gaussian kernel of the form

K_h(x, x_n) = (2π)^{-D/2} h^{-D} \exp \left\{ -\frac{1}{2h^2} ||x - x_n||^2 \right\}.    (6.18)

Of course many other forms of kernel can be employed, though they may not themselves satisfy the requirements of being a density. If we consider, for example, the linear cosine inner-product in the case of discrete data such as vector space representations of text, the kernel is defined as

K(x, x_n) = x^T x_n / (||x|| · ||x_n||).    (6.19)

The decomposition of such a cosine based matrix directly yields the required probabilities.
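For text, the Gram matrix in the Gaussian-kernel sketch above would simply be replaced by a cosine one (illustrative only; X holds the term-frequency vectors as rows):

```python
import numpy as np

def cosine_gram(X, eps=1e-12):
    """Cosine inner-product kernel of Eq. 6.19, column-normalized as in Eq. 6.5."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    K = Xn @ Xn.T
    return K / K.sum(axis=0, keepdims=True)
```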

This interpretation provides a means of spectral clustering which, in the case of continuous data, is linked directly to non-parametric density estimation, and which extends easily to discrete data such as, for example, text. We should also note that the aggregate Markov perspective allows us to take the random walk viewpoint as elaborated in [59], and so a K-connected graph may be employed in defining the kernel similarity K_K(x, x_n). Similarly to the smoothing parameter and the number of clusters, the number of connected points in the graph can also be estimated from the generalization error.

6.2 Discussion

For the case of a Gaussian kernel this interpretation of kernel based clustering enables estimating the kernel width parameter by means of the test predictive likelihood, and as such cross-validation can be employed. In addition, the problem of choosing the number of possible classes, a problem common to all non-parametric clustering methods such as spectral clustering [59, 62], can now be addressed. This overcomes the lack of an objective means of selecting the smoothing parameter in most other forms of spectral clustering models. The proposed method first defines a non-parametric density estimate, and then the inherent class structure is identified by the basis decomposition of the normalized kernel in the form of class conditional posterior probabilities P(x_n|c) and P(c|x_n). Since the projection coefficients are provided, a new, previously unobserved point can be allocated in the structure. Thus, projection of the normalized kernel function of a new point onto the class-conditional basis functions yields the posterior probability of class membership for the new point; something which cannot be achieved by partitioning based methods such as those found in [59] and [62].

A number of points arise from the presented exposition. Firstly, in the case of continuous data, it can be noted that the quality of the clustering is directly related to the quality of the density estimate. Once a density has been estimated, the proposed clustering method attempts to find modes in the density. If the density is poorly estimated, perhaps due to a window smoothing parameter which is too large, then the class structure may be over-smoothed and modes may be lost. In other words, essential class structure may not be identified by the clustering. The same argument applies to a smoothing parameter which is too small, thus causing non-existent structure to be discovered, and to the connectedness of the underlying graph connecting the points under consideration.


C H A P T E R 7

Segmentation of textual and medical databases

In the previous chapters some of the results were already presented, where the discussed techniques were compared on simple data sets. Carefully selected auxiliary data sets were used in order to obtain a good illustration. In some of the cases this allowed deciding which technique to apply in the further investigations. This chapter focuses on the application of the earlier described models to observational data.

7.1 Data description

In this section the applied data sets are presented together with a short description of the accompanying preprocessing procedure. A detailed report of the preprocessing steps is given in sections 7.2.1 and 7.3.1. Four data sets were used in the further investigations.


Email data: is a collection1 of private emails categorized into three groups, namely: conference, job, and spam. The documents are hand-labeled. Since the emails were collected by a university employee, the categories are university related. The collection was preprocessed, the details of the procedure are presented in section 7.2.1, so the final data matrix consists of 1405 documents, each described by 7798 terms. The clustering of the data was performed in the latent space found in the Latent Semantic Indexing framework [20]. This data set is used due to its simplicity and fairly good separation.

Newsgroups: is a collection of 4 selected newsgroups2. The data is used in many publications, starting from the data collector Ken Lang, who published it in [51], and also, for example, in [6, 9, 68]. The original collection consists of 20 different newsgroup categories, each containing around 1000 records. In the experiments only four newsgroups were selected, namely computer graphics, motorcycles, baseball and Christian religion, each of 200 records. The preprocessing steps are identical to those performed on the Email collection, which are described in detail in section 7.2.1. In the preprocessing 2 documents were removed3, resulting in a final number of 798 records and 1367 terms. Also in the case of this data set labels are available.

Sun-exposure study: The data was collected by the Department of Dermatology, Bispebjerg Hospital, University of Copenhagen, Denmark. It concerns a cancer risk study. The data set used in the experiments represents a one year study, or in fact 138 days, collected during the spring, summer and autumn period. The survey was performed on a group of 196 volunteers, resulting in a total number of 24212 collected records. The experiments concern only the one year fraction of the diary database. However, the full study was performed throughout a 3 year survey. This extended data set consists of diary records, detailed UV measurements, a questionnaire about past sun habits and measurements of the skin type. This data is available for future study.

For the survey purposes, a special device was constructed for measuring the sun exposure4 of the subjects. A picture of the device is shown in figure 7.1. Additionally, the subjects were asked to fill out daily diary records about their sun behavior by answering 10 questions, which are listed in table 7.1. Since it is common knowledge that high sun exposure leads to an increased cancer risk, it was expected that a link between these two events could be established.

Figure 7.1 The device measuring the sun radiation the skin is exposed to.

1 The Email database can be obtained at the following location: http://isp.imm.dtu.dk/staff/anna
2 The full data collection consisting of 20 categories is available at e.g., http://kdd.ics.uci.edu
3 See section 7.2.1 for details.
4 UVA and UVB radiation dose was measured every 10 minutes. In the performed experiments only the daily dose was available.

No.  Question                    Answer
1.   Using measuring device      yes/no
2.   Holiday                     yes/no
3.   Abroad                      yes/no
4.   Sun Bathing                 yes/yes-solarium/no
5.   Naked Shoulders             yes/no
6.   On the Beach/Water          yes/no
7.   Using Sun Screen            yes/no
8.   Sun Screen Factor Number    yes-number (26 values)/no
9.   Sunburned                   no/red/hurts/blisters
10.  Size of Sunburn Area        no/little/medium/large

Table 7.1 Questions concerning the daily sun habits in the sun-exposure study.

In the questionnaire, question number seven contains information that is redundant for the investigation, since it is already included in question eight, and it was therefore removed from the data set. Also question number one, which is in fact an indicator of missingness in the data, was not included in the cluster structure investigation. As a result, the records are described by 8 questions, i.e. 8-dimensional categorical vectors. As expected, a lot of data is missing in the survey. Therefore, techniques are investigated to impute the missing values in these records. For the other experiments, concerning data segmentation, the records with missing values were removed. In the diary collection there are 1073 incomplete records, 4171 are missing in the UVA/UVB measurements, and when combining these two data sets 5041 records have missing values. Therefore, only 19171 complete records from 187 subjects were used in the investigation.

Dermatological collection: is a collection5 of erythemato-squamous diseases. Six classes (the diagnoses of erythemato-squamous diseases) are observed: psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. The database contains 358 instances described by 34 attributes, 33 of which are nominal (values 0, 1, 2, 3) and one of which (age) is ordinal. The original data set contains a few missing values that were removed for this investigation. The data set was previously used in [35]. The set is segmented by the aggregated Markov model that is described in Chapter 6.

7.2 Segmentation of textual databases

7.2.1 Preprocessing

In order to obtain the vector space representation of the textual data [75], the documents are transformed into word frequencies. Then, in order to reduce the often large dimension of the feature space, certain preprocessing steps are performed, which are described below. Such processing of textual databases is common in the literature and can be found, for example, in [39, 50, 52].

Each of the text documents is represented by a unique histogram over the word collection. The word collection is the list of all the words that occur in any of the observed documents. For each document, the number of occurrences of each word is recorded, and that creates a unique fingerprint (histogram, feature vector) of that document. The relationships among the words are neglected. Typically, the feature space has an extremely high dimension; one could easily imagine vectors of the size of a full dictionary, which is several thousands of terms. In every language the sentences are built from verbs, substantives, adjectives and adverbs, and also from conjunctions, pronouns, prepositions, etc. The latter do not carry any information that is crucial for clustering, but they occur in the sentences significantly more often than the other words. Therefore, it is an important preprocessing step to remove them. In the information technology area they are called stopwords. A list of 585 stopwords has been used in the experiments. Also words with a very high6 or very low7 frequency of occurrence are erased. In this work the terms that occur less than 2 times and the documents containing less than 2 words are removed from the data set. In order to reduce the dimensionality and compress the information, a stem merging algorithm is applied. In this case it means that the frequencies of words that have the same stem but different suffixes are merged together8. For example, if words like working, worked and works occur in the collection, they are all represented by the single stem word work and the frequencies of the sub-terms are accumulated in the main term. This technique requires a lookup in the dictionary and is therefore time expensive. Summarizing, each text document is converted into and represented by a histogram over the list of selected words. The collection of such histograms is referred to as the term-document matrix (TD).

5 Available at http://ftp.ics.uci.edu/pub/machine-learning-databases

Since the documents have various lengths, the term-document matrix is normalized. Basically, any normalization method can be applied here; however, in the experiments the normalization to the unit sphere was used (unit L2-norm length), i.e. x_i = x_i / ||x_i||_2, where the x_i on the right-hand side are the unnormalized document vectors of the term-document matrix.
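A bare-bones sketch of this pipeline is shown below (my own illustration: the tiny stopword list, the thresholds and the absence of stem merging are simplifications of what was actually used):

```python
import re
from collections import Counter
import numpy as np

STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "for"}   # stand-in list

def term_document_matrix(documents, min_count=2):
    """Bag-of-words histograms with stopword removal, rare-term pruning and
    unit L2-norm normalization of every document."""
    tokenized = [[w for w in re.findall(r"[a-z]+", d.lower()) if w not in STOPWORDS]
                 for d in documents]
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(documents), len(vocab)))
    for n, doc in enumerate(tokenized):
        for w in doc:
            if w in index:
                X[n, index[w]] += 1
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    return X, vocab
```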

For cross-validation purposes the data was randomly divided into a training and a test set. For the Email data, the training set contains 702 vectors, leaving 703 examples for the test set. For the Newsgroups the sets have the same size of 399 samples each.

From both sets, the mean vector calculated from the training set, µ = N^{-1} \sum_n x_n, was subtracted from the data vectors.

In the following sections, the steps of the KDD process presented in Chapter 1 are applied to the observational data sets. The data is preprocessed and then projected to a low-dimensional latent space using Principal Component Analysis; this framework was introduced in [20]. In that space, the density is modeled with the Generalizable Gaussian Mixture model, from which the cluster structure is obtained. The interpretation of the clusters is provided in the form of keywords. Hierarchical clustering is also applied, enabling multiple level classification and interpretation. Some of the results can also be found in the following contributions [52, 53, 80].

6 Those words are not unique for any of the documents.
7 It is generally not recommended to force small clusters.
8 Another technique is also very popular in the literature, so-called suffix stripping, where the recognized word endings are deleted and the remaining identical stems are merged, e.g. the Porter Stemming Algorithm [67].

7.2.2 Projection to the latent space

All the textual data sets are high dimensional. The Email collection after preprocessing is described by 7798 terms and the Newsgroups by 1217. In each case the dimension is too large to be able to use effectively any of the GGM models9 discussed in chapter 3. Therefore, a representation of the data in a reduced space must be obtained. Selected projection methods are described in detail in chapter 2. In the experiments Principal Component Analysis is used. This general technique in the context of text data was originally proposed by Deerwester in [20] and named Latent Semantic Indexing (LSI); as the projection vectors the eigenvectors obtained by singular value decomposition are used. That defines a new latent space where the semantic similarities of the documents can be discovered.
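A minimal sketch of this projection (not the thesis code; it assumes a normalized term-document matrix X with documents as rows) keeps the leading right singular vectors and projects the mean-subtracted documents onto them:

```python
import numpy as np

def lsi_project(X, n_components):
    """Latent Semantic Indexing: project documents (rows of X) onto the
    leading principal directions obtained by singular value decomposition."""
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:n_components].T            # projection vectors (terms x components)
    return (X - mu) @ V, V, mu

# latent, V, mu = lsi_project(X_train, n_components=5)
# test documents are projected as (X_test - mu) @ V
```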

In figure 7.2 the latent space for the investigated collections is shown. In each case (for visualization purposes), 2 suitable principal components are selected. Based on the presented scatter plots it is easy to see that in both data sets there exists a structure in the data.

In order to determine the optimal number of principal components for each collection, the eigenvalue curves are investigated10, as shown in figure 7.3. For the Email collection 5 principal components were selected and for the Newsgroups 4 components are used. In both cases this decision is arguable, since no dominant group of components is observed.

7.2.3 Clustering and cluster interpretation

9 In high dimensions the number of parameters to estimate is also high, so a large number of data points is needed. Additionally, the covariance matrix is large and it needs to be inverted many times in the learning process, which is time consuming. These drawbacks of the GGM models are explained in section 3.3.
10 The attempt at calculating the generalization error (Eq. 2.12) was unsuccessful, since the error curve was monotonically decreasing. The reason may be the shape of the eigenvalue curve (see figure 7.3), where for high principal components large values, in comparison to the first eigenvalues, are still observed.


Figure 7.2 Scatter plots of the training data sets in the latent space for the investigated text collections, the Email and Newsgroups data sets, respectively (principal component 1 versus principal component 2). For good visualization 2 principal components are carefully selected. It is easy to see in both cases that there exists a cluster structure in the data.

Figure 7.3 Eigenvalue curves for the investigated textual collections, the Email and Newsgroups data sets. Additionally, the largest eigenvalues are shown in a close-up. Based on these plots the decision is made about the number of principal components. In the case of the Email collection the 5 largest principal components are used, and for the Newsgroups 4 are selected.

7.2.3.1 Email collection

In the performed experiment for the Email collection the latent space of 5 components was used. At first, the collection was scanned for outliers. For the detection, the UGGM model with an outlier cluster was applied; for details see section 3.6.2. No outlier samples were found in the training set. Thus, the full collection of 702 samples was used in the segmentation process.

Clustering

For clustering, the unsupervised Generalizable Gaussian Mixture model with hard assignment (0/1 decision function) was applied, the details of which can be found in section 3.3. In the particular result presented later in this section, seven clusters11 were obtained. The cluster assignment for the selected components is shown in figure 7.4.

Figure 7.4 Scatter plot of the Email data with the indicated assignment of data points to the clusters and the density contours shown (principal component 1 versus principal component 2). The density was estimated by the UGGM model with hard cluster assignment. 7 clusters, indicated in the legend bar, are observed.

For illustration, the same 2 principal components are displayed as were shown in figure 7.2.

Since labels are available, the confusion matrix described in section 4.2 was also calculated and is presented in figure 7.5. It can be used in further analysis as a supervised indication of the cluster contents.

11 The number of clusters was decided based on the training error with the AIC penalty [1]. Typically, the number of clusters in this model for the Email collection varies in the range between 6 and 11.

Email data set (training set):
Cluster   Conf.   Job    Spam
1         93.7    4.4    0.5
2         0.6     1.5    0.0
3         0.0     0.0    1.5
4         4.6     92.6   0.0
5         0.0     0.0    36.3
6         1.1     1.5    59.1
7         0.0     0.0    2.6

Email test data set:
Cluster   Conf.   Job    Spam
1         95.8    5.9    2.1
2         0.0     1.5    0.0
3         0.0     0.0    1.1
4         2.6     91.9   0.0
5         0.5     0.0    24.7
6         1.0     0.7    70.2
7         0.0     0.0    1.9

Figure 7.5 Confusion matrices for the Email data set (values in percent; each column sums to 100%). The upper table shows the confusion in the training set and the lower one in the test set. From the presented figures it may be concluded that most of the conference emails are accumulated in cluster number 1. Cluster 4 takes the majority of the job emails, and clusters 5 and 6 the majority of the spam. The rest of the clusters (2, 3 and 7) contain only small fractions of the data points.

Based on figure 7.5 it may be concluded that most of the conference emails are accumulated in cluster number 1 (94% of the training data points). Cluster 4 takes the majority of the job emails (93%), and clusters 5 and 6 the majority of the spam (36% and 59%, respectively). The rest of the clusters contain only small fractions of the data points.
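The percentages in figure 7.5 are column-normalized co-occurrence counts between the cluster assignments and the hand labels; a small sketch (with hypothetical names) is:

```python
import numpy as np

def confusion_matrix(cluster_ids, labels, n_clusters, n_labels):
    """Percentage of each labelled class that falls into each cluster."""
    M = np.zeros((n_clusters, n_labels))
    for k, y in zip(cluster_ids, labels):
        M[k, y] += 1
    return 100.0 * M / M.sum(axis=0, keepdims=True)   # each column sums to 100%
```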

A soft version of the GGM model was also applied for comparison. As expected, the outcome is much more complex (the model uses 21 clusters to describe the density p(x)). The scatter plot of the data with the surface plot of the density function is presented in figure 7.6. The density structure does not differ significantly between the soft and the hard assignment (compare figures 7.4 and 7.6). In the soft assignment algorithm the density is described by 21 components, while in the case of the hard assignment only 7 are needed. In the hard GGM algorithm, in the optimization process, the clusters that do not have members assigned are re-initialized or deleted, while in the soft version they are preserved, even when the mixing proportions P(k) are smaller than N^{-1}, which corresponds to one member. In that way, more complex models are possible, leading to a more accurate density estimate, but at the same time allowing clusters without members among the training set. Therefore, for clustering, hard assignment is usually more useful.


Figure 7.6 Scatter plot of the Email data with the indicated cluster structure and the data probability density contours (principal component 1 versus principal component 2). The density is found by the UGGM model with soft cluster assignment; 21 clusters are observed. The observed density structure does not differ significantly from the hard assignment outcome (see figure 7.4).

Hierarchical clustering

All the similarity measures presented in section 4.1 are applied here. At the first hierarchy level (j = 1) 7 clusters are used, obtained from the hard UGGM model. The dendrograms for the KL, Cluster Confusion, L2 and modified L2 similarity measures are presented in figure 7.7. For the Sample Dependent similarity measure, the frequencies with which certain cluster combinations occur in the test set are shown with a bar plot. The most often used combinations are labeled over the corresponding bars. The dendrogram structures describe the consecutive clusters that are merged in the process. Even though the structures vary significantly, it is difficult to select a superior method, since the quality of hierarchical clustering is a subjective issue. Note, however, that even though KL often provides interesting results it is an approximate measure, and that the Cluster Confusion similarity measure requires density sampling and is therefore computationally expensive.


Figure 7.7 The dendrograms for the Email data set, one panel for each of the KL, Cluster Confusion, L2 and modified L2 similarity measures, together with the frequency-of-occurrence bar plot for the Sample Dependent similarity measure (ρ = 0.9). The UGGM with hard cluster assignment resulted in 7 clusters that are used at the first level j = 1. Based on that outcome the presented hierarchies are built with the five similarity measures described in section 4.1. In the case of the Sample Dependent similarity measure the most frequent combinations (e.g. 1, 4, 5, 6, 1+4, 6+5, 5+6) are indicated above the frequency bars. The tree structures describe the consecutive clusters that are merged in the process. The closest clusters include clusters 5 and 6 (spam emails), merged first by all the measures, and clusters 1 and 4, which are found to be closely related in the semantic domain (both are university related).


Once the hierarchy is built, it is possible to determine the cluster assignment of new points. For the majority of the test samples, the first hierarchy level gives enough confidence in the assignment to a cluster. Approximately 76% of the data points can be uniquely assigned to one of the seven basic clusters (the majority being assigned to clusters 1, 4, 5 and 6) with a confidence larger than 90% (ρ = 0.9). Those values are shown in figure 7.7 for the Sample Dependent similarity measure, where the cluster posterior p(k|x) is observed for the first hierarchy level (clusters from 1 to 7). The rest of the samples (24%) need at least 2 clusters to provide a sufficient description. A big fraction of these data points fall into the composition of spam emails (clusters 5 and 6). Such a union is suggested by the KL, Cluster Confusion and Sample Dependent similarity measures. L2 and modified L2 first combine all the clusters of minor importance and then add the two spam clusters to this mixture. Another likely formation is a mixture of job and conference emails (clusters 1 and 4).

Another method for interpretation is to obtain a cluster representation, as for example generating keywords for each cluster or cluster union. Such a technique is fully unsupervised, but it requires an understanding of the context through the provided set of related words.

Keywords assignment

For the Email collection, clustered with the hard UGGM and with the hierarchies shown in figure 7.7, keywords are generated and presented in this section. First, let us introduce the numbering scheme of the clusters in the dendrogram structure. The clusters are marked with consecutive numbers. In the case of the presented Email example there are 7 clusters (numbered 1, 2, ..., 7) at the first hierarchy level j = 1. At each next level a new number is assigned to the new composition. Thus, at level j = 2, the two closest clusters create cluster number 8, at level j = 3 the next 2 clusters result in composition number 9, etc.

The idea is illustrated in the figure to the right. In this simple example 5 clusters are observed at the basic level j = 1. At the second level of the hierarchy, j = 2, clusters 1 and 2 are merged, resulting in composition number 6. On the next level, clusters 3 and 4 create cluster 7. Next, the two new clusters (6 and 7) are merged together and the new composition is assigned number 8. Finally, cluster 9 is the composition of 8 and 5, thus collecting all the basic level clusters: 1, 2, 3, 4 and 5. In total, with K clusters on the basic level j = 1 there are K - 1 hierarchy levels, and thus the total number of clusters in the hierarchy equals 2K - 1.

In order to assign keywords, a set of typical features is generated. Thus, based on the outcome of the UGGM algorithm, i.e. the cluster means, covariances and mixing proportions, a new set of 5000 points is randomly drawn from the modeled data distribution. From this sample, the vectors for which the density value is in the top 20% are selected as the typical features. By lowering this threshold it is possible, in the case of multi-modal densities, to include in the keyword representation also contributions from weaker clusters, i.e. those with wider variance. The typical features are back-projected to the original space and the data mean is added. The histograms reconstructed in that way are no longer positive, due to the restricted number of principal components used in the projection; thus, the negative values are neglected. As keywords, the words from the histograms assigned to the clusters or cluster unions are accepted with a frequency higher than, e.g., 10% of the total weight, i.e. the word is accepted if w_{di} / \sum_{di} w_{di} > 0.9, where w_{di} is the frequency of the d'th word in the i'th typical feature vector (reconstructed histogram).

The keywords for the first level in the hierarchy, with the corresponding mixing proportions P(k), are presented in table 7.2.

Cluster number 1 clearly inherits the conference keywords, like: information, university, conference, call, workshop, etc. Clusters number 2 and number 4 are university job related; thus position, university, science and work appear here. Clusters 3, 5, 6 and 7 are easily recognized by words often used in spam emails; thus terms like list, address, call, remove, profit and free are observed. One can now refer to the confusion matrix presented in figure 7.5 and find similarities in the cluster interpretation.

For each dendrogram shown in figure 7.7, the keywords corresponding to the higher hierarchy levels are given in tables 7.3, 7.4, 7.5 and 7.6. When merging clusters together, the density of the union is expressed by the following equation: p(x) = \sum_{k ∈ I_α} p(x|k) P(k), where I_α collects the indexes of the merged clusters. If one cluster in the union is dominant (has a narrow variance), the main keywords will most likely come from this cluster, since the majority of the typical features will be sampled from its center (the higher probability region). The more comparable the densities are, the more likely it is that the union keywords are a mixture from both component clusters. For example, in table 7.3 the keywords corresponding to the KL similarity measure are presented.


Cluster  P(k)   Keywords
1        .245   information, conference, university, paper, call, neural, research, application, fax, contact, science, topic, workshop, system, computer, invite, internation, network, submission, interest
2        .004   research, university, site, information, science, web, application, position, computer, computation, work, brain, candidate, neural, analysis, interest, year, network
3        .009   succeed, secret, profit, open, recession, million, careful, produce, trump, success, address, letter, small, administration, question
4        .190   university, research, science, position, computation, brain, application, candidate, year, computer, send, department, information, analysis, neuroscience, interest, cognitive
5        .202   free, site, web, remove, click, information, visit, subject, service, adult, internet, list, sit, offer, business, line, time
6        .335   call, remove, free, address, card, list, order
7        .014   call, remove, free, address, list, order, card, day, send, service, information, offer, business, money, mail, succeed, make, company, time, line

Table 7.2 Keywords for the 7 clusters obtained from the UGGM model. The keywords provide the interpretation of the clusters. Cluster number 1 clearly inherits the conference keywords, like: information, university, conference, call, workshop, etc. Clusters number 2 and 4 are university job related; thus position, university, science and work appear here. Clusters 3, 5, 6 and 7 are easily recognized by words often used in spam emails; thus terms like list, address, call, remove, profit and free are observed.

There, cluster number 8 is a composition of clusters 5 and 6 and as such inherits the keywords from both clusters, whose densities are comparable. Cluster number 9, however, which is composed of clusters 1 and 4, is represented by the job keywords, so the keywords come only from cluster number 4, which is dominant. It is possible to include keywords from the other clusters by selecting a larger number of typical features and thereby sampling a larger area of the probability space.


Cluster  Keywords
8        free, call, remove, information, site, subject, list, web, business, offer, message, rep, time, mail, visit, line, day, address, send, year
9        research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
10       research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
11       free, call, remove, information, site, subject, list, web, business, offer, message, rep, time, mail, visit, line, day, address, send, year
12       research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
13       research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network

Table 7.3 Keywords for the hierarchy built with the KL similarity measure. Cluster number 4, containing the job emails, has the narrowest variance and therefore the highest density values. When selecting typical features, most or all of the points are generated by this particular cluster. Therefore, its keywords dominate the keywords from the other clusters in the hierarchy.

A similar interpretation holds for the hierarchy based on the Cluster Confusion similarity measure, presented on figure 7.4. Also here cluster number 4 is merged at the beginning, and thus it dominates the selected typical features.

In the case of the L2 and modified L2 similarity measures, shown in tables 7.5 and 7.6, the clusters are merged successively: first the minor clusters with few members, then the clusters containing the spam emails, and at last the clusters with conference and job emails. That structure, taking the known labeling into consideration, seems to give the best results for the Email collection, both in the hierarchy and in the corresponding keywords. The major difference between the L2 and modified L2 similarity measures is the increased distance from the small clusters to the rest, which arises from including the mixing proportion values in the distance measure.
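As an illustration of how such Gaussian-to-Gaussian dissimilarities can be evaluated, the sketch below computes a squared L2 distance between two mixture components in closed form, optionally weighted by the mixing proportions as in the modified variant discussed above. It is only a sketch under that assumption; the exact definitions used in chapter 4 may differ in detail, and the function names are illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gauss_overlap(m1, S1, m2, S2):
        # Closed-form integral of a product of two Gaussian densities:
        # int N(x; m1, S1) N(x; m2, S2) dx = N(m1; m2, S1 + S2).
        return multivariate_normal.pdf(m1, mean=m2, cov=S1 + S2)

    def l2_dissimilarity(m1, S1, m2, S2, p1=1.0, p2=1.0):
        # Squared L2 distance between two (optionally weighted) Gaussian
        # components, ||p1*N1 - p2*N2||^2. With p1 = p2 = 1 this is a plain
        # L2 measure; inserting the mixing proportions gives a modified
        # variant that pushes small clusters further away (sketch only).
        return (p1 * p1 * gauss_overlap(m1, S1, m1, S1)
                + p2 * p2 * gauss_overlap(m2, S2, m2, S2)
                - 2.0 * p1 * p2 * gauss_overlap(m1, S1, m2, S2))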


Cluster  Keywords
8   free, call, remove, information, site, subject, list, web, business, offer, message, rep, time, mail, visit, line, day, address, send, year
9   research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
10  research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
11  research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
12  research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network
13  research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network

Table 7.4 Keywords for the hierarchy built with the Cluster Confusion similarity measure. Similar to the KL measure, cluster number 4 is merged at the beginning of the hierarchy, and since its density values are dominant it controls the keyword generation process, i.e. all the higher hierarchy level keywords are job related.

Cluster  Keywords
8   call, remove, free, list, message, information, rep
9   call, remove, free, list, message, information, rep
10  call, succeed, address, free, secret, information, day, card, site, make, money, web, offer, number, profit, time, order, business, remove, company
11  free, call, remove, information, site, subject, list, web, business, offer, message, rep, time, mail, visit, line, day, address, send, year
12  call, information, university, conference, neural, fax, application, research, address, model, science, internation, computer, includ, paper, workshop, program, network, work, phone
13  research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network

Table 7.5 Keywords for the hierarchy built with the L2 similarity measure. The clusters are merged successively: first the minor clusters with few members (clusters 8 and 9), then the clusters containing the spam emails (clusters 10 and 11), and at last the clusters with conference (12) and job emails (13).


Cluster  Keywords
8   free, work, year, position, card, program, research, interest, business, address, offer, computation, send, time, start, candidate, order, service, application, subject
9   call, remove, free, list, message, information, rep
10  call, succeed, address, free, secret, information, day, card, site, make, money, web, offer, number, profit, time, order, business, remove, company
11  free, call, remove, information, site, subject, list, web, business, offer, message, rep, time, mail, visit, line, day, address, send, year
12  call, information, university, conference, neural, fax, application, research, address, model, science, internation, computer, includ, paper, workshop, program, network, work, phone
13  research, university, computation, science, application, succeed, position, interest, information, neural, work, cognitive, open, brain, year, neuroscience, model, fax, secret, network

Table 7.6 Keywords for the hierarchy built with the modified L2 similarity measure. The clusters are merged successively: first the minor clusters with few members, then the clusters containing the spam emails, and at last the clusters with conference and job emails. The major difference between this similarity measure and L2 is the increased distance from the small clusters to the rest, which is due to including the mixing proportion value in the distance measure. With respect to the known labeling, the modified L2 similarity measure provides the best results.

7.2.3.2 Newsgroups collection

In the experiments, the term-document matrix of the Newsgroups collection is projected onto the latent space defined by the 4 left eigenvectors associated with the largest eigenvalues obtained from a singular value decomposition of that matrix. The collection is first screened for outliers. As before, the UGGM algorithm with an outlier cluster was applied (for details refer to section 3.6.2). Since no outliers are detected in the training set, the full set of 399 samples is used in the segmentation.
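The projection step can be summarised by a short sketch: the latent representation is obtained from the left singular vectors with the largest singular values, in the spirit of the LSI/PCA projection used here. The term-document matrix below is a random placeholder, not the actual Newsgroups data.

    import numpy as np

    # Placeholder term-document matrix (terms x documents); in the experiments
    # this would be the preprocessed Newsgroups matrix with 399 documents.
    rng = np.random.default_rng(0)
    X = rng.poisson(0.3, size=(1500, 399)).astype(float)

    # Each document vector is normalized to unit L2 length before projection.
    X /= np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)

    # Latent space spanned by the 4 left singular vectors with the largest
    # singular values; Z holds the projected documents (shape 4 x 399).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :4].T @ X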

Clustering

In one particular trial of the hard assignment unsupervised Generalizable Gaussian Mixture model, described in section 3.3, 6 clusters were obtained. The scatter plot of the data assigned to the clusters and the corresponding density contours of p(x) is presented in figure 7.8. Clusters are marked as indicated on the legend bar. The same 2 principal components are shown as were displayed in figure 7.2.


Figure 7.8 Scatter plot of the data clustered with the UGGM model with hard assignment for the Newsgroups collection (axes: Principal Component 1 vs. Principal Component 2). The estimated probability density function p(x) is shown with the contour plot. 6 clusters were found in this particular experiment. The legend bar provides the cluster labeling.

Since the data points are labeled, the confusion matrices for the training and test ensembles are calculated; they are presented in figure 7.9. From these matrices it is possible to determine what kind of documents to expect in the clusters. Thus, for example, cluster 3 contains the majority of the newsgroup documents concerning Christian religion, clusters number 1 and 5 – Baseball, and cluster number 2 – Computer Graphics. Some of the clusters, for example number 1, 3 or 5, are mixed with respect to the original labels.
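A cluster confusion matrix of the kind shown in figure 7.9 can be tabulated directly from the hard cluster assignments and the known labels. The sketch below assumes column-normalized percentages, so each column shows how one newsgroup class distributes over the clusters; the function and variable names are illustrative.

    import numpy as np

    def cluster_confusion(cluster_ids, class_labels, n_clusters, n_classes):
        # Rows: clusters found by the mixture model; columns: known classes.
        # Entries are column-normalized percentages.
        C = np.zeros((n_clusters, n_classes))
        for k, y in zip(cluster_ids, class_labels):
            C[k, y] += 1
        return 100.0 * C / np.maximum(C.sum(axis=0, keepdims=True), 1.0)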

The soft assignment UGGM model was run for comparison. On average, the number of clusters obtained by this algorithm was larger than that obtained from the hard assignment algorithm. In one particular experiment 13 clusters were acquired. The scatter plot of the labeled data and the obtained probability density function p(x) is presented in figure 7.10. As in the case of the Email collection, the density does not differ significantly from that of the hard version of the algorithm. However, it is described with twice as many components, which suggests a better fit to the underlying data distribution.


Training set (entries in %):
Cluster   Comp.   Motor   Baseb.  Christ.
1         4.2     9.1     63.9    6.5
2         86.5    15.9    1.9     2.8
3         2.1     2.3     0.0     88.8
4         3.1     64.8    3.7     0.0
5         4.2     5.7     30.6    0.0
6         0.0     2.3     0.0     1.9

Test set (entries in %):
Cluster   Comp.   Motor   Baseb.  Christ.
1         9.7     6.3     65.9    5.4
2         75.7    12.5    7.7     1.1
3         7.8     1.8     1.1     92.5
4         1.9     71.4    4.4     0.0
5         2.9     3.6     20.9    1.1
6         1.9     4.5     0.0     0.0

Figure 7.9 The confusion matrices of the Newsgroups collection calculated based on the available data labels; the left matrix presents the numbers for the training set and the right matrix for the test set. The data is originally labeled in four groups: Computer Graphics, Motorcycles, Baseball and Christian Religion. The matrices supply the cluster explanation in a supervised way. For example, cluster 4 collects documents concerning motorcycles, cluster number 3 is composed of 89% Christian Religion newsgroups, and the Baseball newsgroups are divided between the two clusters number 1 and 5.

Figure 7.10 Scatter plot of the Newsgroups collection clustered with the soft assignment UGGM model (axes: Principal Component 1 vs. Principal Component 2). The estimated probability density function p(x) is shown with the contour lines. 13 components are acquired as the setting that provides the minimum generalization error. Visually, the density structure is similar to the one obtained by the hard assignment algorithm (figure 7.8).


Hierarchical clustering

The hierarchical structure is built based on the 6 cluster outcome of the hard UGGM model. The five similarity measures described in section 4.1.1 are applied; they are presented in figure 7.11. The KL, Cluster Confusion, L2 and modified L2 similarity measures are visualized with dendrograms, and for the Sample Dependent similarity measure the most frequently occurring combinations are shown. Approximately 70% of the data points are assigned to the first level clusters that come directly from the UGGM model. The confidence level used in the assignment is larger than 90%, i.e. the cluster posterior of the data points is larger than 0.9, p(k|xn) > 0.9. The remaining 30% of the data points are assigned in the hierarchy. The most frequent combinations include the following cluster unions: {1, 4}, {2, 4}, {1, 2, 4}, {2, 3}, {1, 5}, {5, 6}, {4, 6}. Clusters 1, 2 and 4 are likely to merge, since their overlap is significant. This overlap is concluded from the confusion matrices in figure 7.9 and from the Cluster Confusion similarity measure (figure 7.11), which combines these clusters first. With respect to the original labeling, the KL similarity measure performs best for this collection. It preserves the original class division at the third hierarchy level, where 4 clusters corresponding to the 4 classes remain.

Keywords assignment

Keywords corresponding to the dendrogram structures are presented below in tables 7.7, 7.8, 7.9, 7.10 and 7.11.

In table 7.7 the keywords for the first hierarchy level are presented, i.e. for the clusters obtained with the hard assignment UGGM model. Thus, 6 clusters are observed, the scatter plot of which is given in figure 7.8. The table also shows the corresponding mixture proportions.


Figure 7.11 Hierarchical structures for the Newsgroups collection. The UGGM with hard cluster assignment resulted in 6 clusters at the first level, j = 1. The five similarity measures described in section 4.1 are presented: dendrograms for the KL measure (leaf order 1, 5, 6, 4, 2, 3), the Cluster Confusion measure (2, 4, 1, 3, 5, 6), the L2 measure (4, 6, 2, 1, 5, 3) and the modified L2 measure (5, 6, 4, 1, 2, 3), and the frequency of occurrence of the merged cluster combinations for the Sample Dependent similarity measure (ρ = 0.9). The confusion matrix and the scatter plot of the data were shown previously in figures 7.9 and 7.8, respectively. All similarity measures differ significantly.


Cluster  P(k)  Keywords
1   .22  team, game, year, win, run, pitch, hit, write, article, score
2   .26  file, write, article, image, graphic, format, window, program, color, gif, ftp, convert, package, work, read
3   .25  god, write, christian, people, jesus, article, question, faith, truth, life, christ, time, thing, bible, church
4   .16  bike, write, article, motorcycle, dod, dog, good, road
5   .11  write, article, year, game
6   .01  write, article, bike, year, game, team, god, win, good, hit, run, pitch, people, make, time, dod

Table 7.7 Keywords for the 6 clusters obtained from hard assignment UGGMmodel. Keywords provide the fully unsupervised interpretation of the clus-ters. For example, cluster number 1 is represented by such keywords as:team, game, win, pitch, hit which certainly connect with a team game topic.Keywords for cluster number 2 represent computer graphics related topicswith words like: file, image, graphics, format, etc. Cluster number 3 withwords like: god, christian, people, jesus, faith belongs to Christian religionnewsgroup. Keywords for cluster number 4 (bike, motorcycle, dod, road )correspond to the Motorcycle newsgroup.

Cluster number 1 is represented by keywords such as team, game, win, pitch and hit, which certainly connect with a team game topic. Keywords for cluster number 2 represent computer graphics related topics, with words like file, image, graphics and format. Cluster number 3, with words like god, christian, people, jesus and faith, belongs to the Christian religion newsgroup. Keywords for cluster number 4 (bike, motorcycle, dod, road) correspond to the Motorcycle newsgroup. Cluster 5 relates to writing articles about games, and cluster 6 contains a mixture of keywords from different subjects, like bike, game and people.

Cluster  Keywords
7   team, game, year, win, run, pitch, hit, write, article, score
8   team, game, year, win, run, pitch, hit, write, article, score
9   team, game, year, win, run, pitch, hit, write, article, score
10  god, write, christian, people, jesus, article, question, faith, truth, life, christ, time, thing, bible, church
11  god, write, christian, people, jesus, article, question, faith, truth, life, christ, time, thing, bible, church

Table 7.8 Keywords for the clusters in the hierarchy built with the KL similarity measure. The first levels are dominated by cluster 1; the top clusters (10 and 11) inherit keywords from the structure's most dominant cluster – number 3.

In table 7.8 the keywords for the hierarchy built with the KL similarity measure are presented. For this outcome of the UGGM model, cluster number 3 is dominant. This can be seen in all the following tables containing hierarchy keywords, where the top-


level cluster is represented by keywords coming from this particular cluster. For the KL similarity measure, which is superior for this collection, the following clusters are observed at the level where 4 clusters remain: 8 {1, 5, 6}, 4, 2 and 3.

Cluster  Keywords
7   bike, write, article, motorcycle, dod, dog, good, road
8   file, write, article, image, graphic, format, window, program, color, gif, ftp, convert, package, work, read
9   file, write, team, game, article, image, bike, year, win, graphic, run, format, pitch, hit, window, program, color, gif, ftp, convert
10  file, write, team, game, article, image, bike, year, win, graphic, run, format, pitch, hit, window, program, color, gif, ftp, convert
11  god, write, christian, people, jesus, article, question, faith, truth, life, christ, time, thing, bible, church

Table 7.9 Keywords for the clusters in the hierarchy built with the Cluster Confusion similarity measure.

Cluster  Keywords
7   bike, write, article, motorcycle, dod, dog, good, road
8   file, write, article, image, graphic, format, window, program, color, gif, ftp, convert, package, work, read
9   file, write, team, game, article, image, bike, year, win, graphic, run, format, pitch, hit, window, program, color, gif, ftp, convert
10  file, write, team, game, article, image, bike, year, win, graphic, run, format, pitch, hit, window, program, color, gif, ftp, convert
11  god, write, christian, people, jesus, article, question, faith, truth, life, christ, time, thing, bible, church

Table 7.10 Keywords for the clusters in the hierarchy built with the L2 similarity measure.

Cluster  Keywords
7   write, article, year, game
8   write, article, bike, year, game, motorcycle, dod, dog, team, baseball
9   team, game, year, win, run, pitch, hit, write, article, score
10  file, write, team, game, article, image, bike, year, win, graphic, run, format, pitch, hit, window, program, color, gif, ftp, convert
11  god, write, christian, people, jesus, article, question, faith, truth, life, christ, time, thing, bible, church

Table 7.11 Keywords for the clusters in the hierarchy built with the modified L2 similarity measure.


7.2.4 Clustering of the Email collection with the unsupervised/supervised Generalizable Gaussian Mixture model

The Email collection is used to illustrate the USGGM model, described in section 3.5. In these experiments slightly different preprocessing steps are performed than reported earlier, resulting in a total number of 1280 email documents (640 in both the training and the test set) and a term-vector consisting of 1652 words. The difference is in another threshold value set on the term frequencies: words that occur less than 40 times12 are removed, which significantly reduces the length of the term vector. As a result, some of the atypical13 documents, which contained less than 2 words, are also removed. The commonly used framework Latent Semantic Indexing (LSI) [20] is employed, which operates in a latent space of feature vectors. These are found by projecting the term-vectors into a subspace spanned by the left eigenvectors associated with the largest eigenvalues found by a singular value decomposition of the term-document matrix; for reference see chapter 2 with the description of the projection methods. A 5-dimensional subspace is used. The data in the reduced space do not differ significantly from the previous experiments.
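The term-frequency thresholding and the removal of near-empty documents can be sketched as below; the count matrix, the function name and the thresholds are illustrative, and the actual preprocessing pipeline of the thesis may differ in detail.

    import numpy as np

    def prune_terms_and_docs(X, term_names, min_term_count=40, min_doc_terms=2):
        # X: term-document count matrix (terms x documents). First remove terms
        # occurring fewer than `min_term_count` times in the whole collection,
        # then documents left with fewer than `min_doc_terms` distinct terms.
        keep_terms = X.sum(axis=1) >= min_term_count
        X = X[keep_terms]
        keep_docs = (X > 0).sum(axis=0) >= min_doc_terms
        kept_names = [t for t, keep in zip(term_names, keep_terms) if keep]
        return X[:, keep_docs], kept_names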

Figure 7.12 shows the average performance (over 1000 runs) of the USGGMalgorithm.

The algorithm parameter is set to γ = 0.5. The algorithm is run with Nu = 200 unlabeled examples and a variable number of labeled samples. As expected, with few labeled examples available, Nl = 10, 20, the optimal λ is close to one, where all unlabeled data are fully used. The minima in the classification error curves (upper left plot of figure 7.12) are marked with black triangles. As Nl increases, λ decreases, and for Nl = 200 it equals 0.3, indicating the reduced utility of the unlabeled examples. The classification error is reduced by approximately 26% using unlabeled data for Nl = 10, gradually decreasing to 1% when Nl = 200. Thus, with a large set of labeled examples the importance of the unlabeled samples is negligible, and therefore the value of the discount factor λ is not crucial for the level of the error. The classification error for the optimal λ as a function of the size of the labeled data set Nl is shown in the lower plot of figure 7.12. The number of optimal components, selected by

12The threshold value was investigated and, in conclusion, the classification error remains roughly the same for any value below approximately 100 occurrences. In order to reduce the dimensionality of the term vector, the threshold is set to 40 occurrences.

13Atypical documents contain many words, which are rare in the database.


Figure 7.12 Average performance of the USGGM algorithm over 1000 repeated runs using Nu = 200 unlabeled examples and a variable number of labeled examples Nl. The upper left plot shows the performance as a function of the discount factor λ for unlabeled examples (λ = 0 corresponds to no unlabeled data), with the minima marked. The upper right plot shows the AIC criterion as a function of the number of components K, used to select the number of components for the optimal λ as described in section 3.5. The lower plot shows the classification error [%] for the optimal λ as a function of Nl.

the mixture model14, grows with the number of labeled examples Nl. Naturally,the classification error decreases when enlarging the size of the labeled data set(lower plot of figure 7.12).

In figure 7.13, the hierarchies of the individual class dependent densities p(x|y) are presented. In this experiment only the modified L2 dissimilarity is used, since the KL similarity measure is approximate and the Cluster Confusion measure is computationally expensive if little overlap exists, as many ancillary data are required. The modified L2 is computationally inexpensive and basically treats dissimilarity as cluster confusion, while the standard L2 does not incorporate

14The selection was based on the training error with AIC criterion [1].


Figure 7.13 Hierarchical clustering using the USGGM model. Left column is class y = 1 conference, middle column y = 2 jobs, and right column is for y = 3 spam. Upper rows show the dendrogram using the modified L2 dissimilarity for each class, and the lower row the histogram of cluster level assignments for test data.

priors. The lower panel of figure 7.13 shows the cluster level assignment distributions of the test samples. In the case of the conference class, 90% of the data points fall into the first level clusters, obtained directly from the USGGM model. For the job class 74% are classified at the first hierarchy level, and for the spam emails 83%. The rest of the data points are described by the higher hierarchy levels, i.e. by the combined cluster representations.

Typical features, as described in section 4.3, are selected from the high density region and back-projected into the original term-space, providing keywords for each cluster. The keywords are given in table 7.12. The conference class is dominated by clusters 1 and 4, whose keywords are in accordance with the meaning of conference. The job class is split between 2 clusters, namely 2 and 6, and the spam emails are divided between clusters 3 and 5.


y  k  P(k|y)  Keywords
1  1  .7354   information, conference, call, workshop, university
1  3  .0167   remove, address, call, free, business
1  4  .2297   call, conference, workshop, information, submission, paper, web
1  6  .0181   research, position, university, interest, computation, science
2  2  .6078   research, university, position, interest, science, computation, application, information
2  6  .3922   research, position, university, interest
3  3  .6301   remove, call, address, free, day, business
3  5  .3698   free, remove, call

Table 7.12 Keywords for the USGGM model. y = 1 is conference, y = 2is jobs and y = 3 is spam. The conference class is dominated by cluster 1and cluster 4. This has keywords which are in accordance with the meaningof conference. Similarly for the job class which is split between 2 clusters,namely 2 and 6 and for spam emails that are divided between 3 and 5.

7.3 Segmentation of medical database

7.3.1 Segmentation of the data from sun-exposure study

The hard assignment UGGM model, described in section 3.3, is used for clustering the data of the sun-exposure study. Behavioral patterns are investigated, i.e. several consecutive diary records are understood as one data sample. In that way, an approach similar to the one used for processing textual databases may be applied. The results of the experiments presented below can be found in [82].

Preprocessing

Since the technique was developed for textual data, it is necessary to redefine some of the variables. As features, the unique records found in the data are used, and histograms over these records are built. The total number of possible patterns for the 8 questions selected from table 7.1 is 20736. This number is obtained by multiplying the numbers of possible answers, i.e. 20736 = 2 · 2 · 3 · 2 · 2 · 27 · 4 · 4. However, only a small fraction of 423 patterns actually exists in the data. As documents, a sliding time window is used, which is calculated separately for each of the survey participants. The optimal size of the window is an important issue. For example, taking the full set of records belonging to each person will produce


a set of points in the space that will not form any particular structure, since each of them will contain most of the observed patterns. On the other hand, if only one diary record is observed at a time, all the different vectors will be equally distant, so the cluster structure in such a space will be lost as well. In the experiments a window of size 7 is used. This was decided after performing several experiments, taking into account the stationarity of the obtained clustering and the level of computational complexity, which is large for small window sizes. Attempts were made to apply the generalization error (equation 2.12) in the window selection, however without success. The histograms are, at the same time, characteristic of the subjects and of the time windows. Similar to the text data, the histograms create a pattern-window matrix which, for simplicity, is referred to as the term-document matrix. This matrix is normalized to unit L2-norm length.
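The construction of the window "documents" can be sketched as follows, assuming each day's diary record has already been mapped to the index of its unique answer pattern; the names and shapes are illustrative only.

    import numpy as np

    def window_histograms(pattern_ids, n_patterns, window=7):
        # One histogram ("document") per sliding window of `window`
        # consecutive diary records of a single subject.
        docs = []
        for start in range(len(pattern_ids) - window + 1):
            h = np.bincount(pattern_ids[start:start + window], minlength=n_patterns)
            docs.append(h)
        return np.asarray(docs, dtype=float)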

The term-document matrix is formed from the histogram vectors that are obtained by counting the occurrences of every pattern in the window. However, the histograms do not convey time ordering information. Thus, it is possible to additionally include time information by considering the co-occurrence matrix of joint occurrences of neighboring patterns in the window. There are 20736² possible co-occurrences (or, in the case of this particular collection, 423²), but only a small fraction of 1509 combinations is present in the actual data set.
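A corresponding sketch for the co-occurrence part of the feature vector counts consecutive-day pattern pairs inside each window and flattens the count matrix; the names are again illustrative, and only a small subset of the possible pairs would survive the screening described later in this section.

    import numpy as np

    def cooccurrence_features(pattern_ids, n_patterns, window=7):
        # Joint occurrences of neighbouring (consecutive-day) patterns within
        # each sliding window, flattened into one feature vector per window.
        feats = []
        for start in range(len(pattern_ids) - window + 1):
            C = np.zeros((n_patterns, n_patterns))
            days = pattern_ids[start:start + window]
            for a, b in zip(days[:-1], days[1:]):
                C[a, b] += 1
            feats.append(C.ravel())
        return np.asarray(feats)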

An attempt was made to combine the different types of data. While the diary entries are of categorical type, the measured UV radiation is continuous. The UV measurements, in order to match the presented framework, are quantized and expressed by a 4-valued representation. In order to incorporate the time relation between the consecutive records, the co-occurrence of the diary records is introduced. These three data sources can be combined in the presented framework simply by representing each document15 by the histogram of the feature vector from the combined sources. This idea is shown in figure 7.14 below.

15The document in this case corresponds to a window of n consecutive days of the surveybelonging to one subject.


A set of 19171 diary records with corresponding UV measurements is selected for the clustering experiments. The data is complete, i.e. there are no missing records or UV measurements. From this collection, the records belonging to 10 selected subjects are held out for testing. The sliding window of size 7 results in 2738 data vectors, of which 158 are selected as the test set, 2064 (80% of the remaining data) as the training set and 516 as the validation set. Each feature vector consists of the diary histogram, the co-occurrence matrix and the UV histogram.

In line with the KDD process presented in figure 1.1, this particular collection is processed as shown in figure 7.14. In the first step, the data is windowed, creating

Figure 7.14 Framework for data clustering: 1) The data is windowed intoseveral histogram vectors and together with the co-occurrence matrix and theUV histograms form a term-document matrix. 2) Data is normalized andprojected into the latent space found by singular value decomposition. 3) TheGaussian mixture model is used to cluster the data. 4) For interpretation thetypical features are drawn from data distribution and back-projected to theoriginal space where key-patterns are found.

vectors that contain data from consecutive days. The diary histograms, the co-occurrence matrix and the UV radiation histograms are all screened against rare patterns by removing them. The diary histogram is reduced from 423 to 97 patterns16 and, in a similar way, the co-occurrence matrix is reduced from

16Patterns larger than 0.1% of the maximum value were preserved. This equals 96% of the total mass ($\sum_{dn} x_{dn} = 100\%$).


1509 to the 80 most often occurring pairs of patterns. The next step involves normalization of the term-document matrix. Two types of normalization are performed: first, each window histogram is scaled to unit L2-norm length, and then the pattern vectors are scaled to zero mean and unit variance over the training samples.
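The two normalization steps can be written compactly as below, assuming the matrix is stored with patterns as rows and windows as columns and that the training statistics are reused for the test windows. This is a sketch, not the thesis implementation.

    import numpy as np

    def normalize_term_document(X_train, X_test):
        # 1) scale each window histogram (column) to unit L2 length;
        # 2) standardize each pattern (row) to zero mean / unit variance
        #    using statistics computed on the training windows only.
        def unit_length(X):
            return X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)
        X_train, X_test = unit_length(X_train), unit_length(X_test)
        mu = X_train.mean(axis=1, keepdims=True)
        sd = X_train.std(axis=1, keepdims=True) + 1e-12
        return (X_train - mu) / sd, (X_test - mu) / sd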

Projection to the latent space

Each type of data17 is projected separately onto the orthogonal directions found by singular value decomposition (the PCA projection method described in section 2.1.2). Scatter plots of the data in selected principal component spaces are presented in figure 7.15.

(Panels: Diary data, principal components 4 vs. 5; UV measurements, principal components 1 vs. 3; Diary patterns co-occurrence, principal components 7 vs. 8.)

Figure 7.15 Scatter plots of the training data sets in the latent space for the sun-exposure collection. For good visualization 2 principal components were carefully selected. The top panel shows the diary data and UV measurements. The lower plot presents the scatter plot of the projected diary patterns co-occurrence matrix. It is possible to see the existing structure in the data.

17Diary and UV histograms and co-occurrence matrix.


The corresponding eigenvalues are shown in figure 7.16. For both the diary data and the

Figure 7.16 Eigenvalue curves for the sun-exposure collection (diary data, UV measurements and diary patterns co-occurrence). Additionally, the largest eigenvalues are shown in close-up. Based on these plots the decision of the number of principal components is made. 9 eigenvalues were chosen in case of Diary data and Diary patterns co-occurrence. 3-dimensional space was selected for UV measurements.

co-occurrence matrix the 9 largest eigenvalues are selected. In the case of the UV measurements 3 eigenvalues are chosen.

Clustering and cluster interpretation

The clustering is performed on the term-document matrix built from the diary records. Additionally, the cluster structure of the diary records combined with the UV measurements and the co-occurrence matrix is investigated. The segmentation was performed by the unsupervised Gaussian Mixture model with hard cluster assignment.


Since the data is not labeled, only unsupervised cluster interpretation can be applied. Based on the estimated density, the typical features are selected and back-projected to the histogram space, where the mean is added. As previously, in the case of the textual collections, the negative values in the back-projected vectors are neglected. As keywords, in the case of this data set, key-patterns are selected, which correspond to the most probable diary records, UV measurements and co-occurrence couplings.
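The key-pattern extraction can be sketched as follows: typical feature vectors drawn from a cluster's density in the latent space are back-projected through the projection matrix, the removed mean is added, negative entries are neglected and the most probable patterns are reported. All names and shapes below are illustrative.

    import numpy as np

    def key_patterns(typical_latent, U_k, mean_vec, pattern_names, n_keys=5):
        # typical_latent: samples drawn from one cluster's density (n x q);
        # U_k: projection matrix (n_patterns x q); mean_vec: mean removed
        # during normalization. Returns the most probable patterns together
        # with their normalized probabilities.
        back = typical_latent @ U_k.T + mean_vec
        back = np.clip(back, 0.0, None).mean(axis=0)
        probs = back / max(back.sum(), 1e-12)
        top = np.argsort(probs)[::-1][:n_keys]
        return [(pattern_names[i], float(probs[i])) for i in top]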

Furthermore, the used framework makes it possible to describe the behavior of every new person in the experiment in terms of cluster assignment and associated keywords. The confidence of assigning the person to a given cluster k can be expressed by the posterior probability:

$$p(k|\mathrm{Per}) = \frac{1}{N}\sum_{n=1}^{N} p(k|\mathrm{Per}, \mathbf{x}_n)\, p(\mathbf{x}_n), \qquad (7.1)$$

where $\mathbf{x}_n$ is a feature vector of size d and n = 1, 2, ..., N. The number of feature vectors N is different for every person and depends on the number of returned diary records and the window size. In this experiment the selected test set with records from 10 subjects is used.
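Equation (7.1) translates directly into a few lines of code; here `posteriors` would hold p(k|Per, x_n) for the person's N window vectors and `densities` the values p(x_n). The names are illustrative, a sketch only.

    import numpy as np

    def person_cluster_probability(posteriors, densities):
        # posteriors: (N, K) array of p(k|Per, x_n); densities: (N,) array
        # of p(x_n). Returns the K cluster probabilities of equation (7.1).
        N = posteriors.shape[0]
        return (posteriors * densities[:, None]).sum(axis=0) / N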

The importance of the co-occurrence matrix and the UV histograms for the clustering was investigated. The results of the experiments are collected in tables 7.13, 7.14, 7.15 and 7.16, where the key-patterns, the associated probabilities and a description of the clusters are provided. In the first column the cluster number is displayed. The second column contains the most probable patterns for the cluster. The third gives the probabilities for the key-patterns, and the fourth column presents a general interpretation of the cluster based on the observed key-patterns.

In table 7.13 the results of clustering the diary histograms are shown. The presented patterns are equivalent to the set of questions given in section 7.1. Note that 2 questions (number 1 and 7) were a priori excluded from these experiments. For example, pattern 10111 describes the following set of answers: 1. holiday - yes, 2. abroad - no, 3. sun bathing - yes, 4. naked shoulders - yes, 5. on the beach - yes, and the remaining questions 6, 7 and 8 - no; pattern 0 means that all the questions were answered - no; and pattern 1 means 1. holiday - yes and the rest of the questions from 2 to 8 - no. This rule for describing patterns holds as well for tables 7.14, 7.15 and 7.16.


#    Key-Pattern          Probability        Description
1.   10001, 11, 10111     0.33, 0.32, 0.19   holiday, on the beach, sun bathing
2.   0                    0.98               working - no sun
3.   1                    0.9                on holiday - no sun
4.   0, 0001, 1           0.4, 0.27, 0.18    working, naked shoulders - no sun
5.   1, 1101              0.67, 0.17         holiday, naked shoulders
6.   1011, 1001, 10011    0.47, 0.17, 0.16   holiday, sun bathing
7.   11                   0.5                holiday abroad - no sun
8.   10111, 0001, 1001    0.45, 0.17, 0.13   holiday, sun bathing, naked shoulders
9.   0000001              0.05               no sun, sunburned - red
10.  0                    0.99               working - no sun

Table 7.13 Key-patterns for the clustering of diary histograms. In the first column the cluster number is shown. The second column contains the most probable patterns for the cluster. The presented pattern numbers are equivalent to the set of questions given in section 7.1. For example, pattern 10111 gives the following set of answers: holiday - yes, abroad - no, sun bathing - yes, naked shoulders - yes, on the beach - yes, and the remaining questions 6, 7 and 8 - no; pattern 0 means that all the questions were answered - no. The third column gives the probabilities for the key-patterns, and the fourth column presents a general description of the cluster.

#    Key-Pattern               Probability                          Description
1.   1001, 1000, 1UV, 10011    0.31, 0.26, 0.16, 0.11               holiday, naked shoulders, small UV radiation
2.   11, 0001, 0, 2UV, 0UV     0.29, 0.2, 0.17, 0.16, 0.15          holiday abroad, working
3.   1, 11                     0.39, 0.12                           holiday
4.   1011, 2UV, 3UV, 0001      0.31, 0.25, 0.14, 0.13               naked shoulders, high sun radiation
5.   1, 2UV, 3UV, 10001        0.2, 0.17, 0.16, 0.14, 0.12, 0.1     holidays, high UV
6.   1UV, 0UV                  0.14, 0.13                           low UV
7.   3UV, 1001, 2UV, 10011     0.22, 0.22, 0.16, 0.15               holiday, naked shoulders, high UV
8.   0, 0UV                    0.6, 0.4                             no sun

Table 7.14 Key-patterns for clustering the diary histograms combined with the UV histograms. In the first column the cluster number is displayed. The second column contains the most probable patterns. The rule for interpreting the diary key-patterns is given in table 7.13. Patterns corresponding to the UV histograms are marked with the subscript "UV". Four different values of UV are observed: 0 corresponds to very low sun radiation and 3 describes very high radiation. The third column gives the probabilities for the key-patterns, and the fourth column presents a general description of the cluster based on the key-patterns.


Table 7.14 presents the key-patterns for clustering the diary histograms combined with the UV histograms. Eight clusters were found. The diary key-patterns are explained in table 7.13. Patterns corresponding to the UV histograms are marked with the subscript "UV". Four different values of UV, from 0 to 3, are observed: 0 corresponds to very low sun radiation and 3 describes very high radiation. This rule for describing UV-patterns holds as well for table 7.16.

#    Pattern                                  Probability                           Description
1.   1001, 1101-1101                          0.27, 0.13                            holiday, naked shoulders
2.   1001, 1101-1101, 1                       0.26, 0.21, 0.1                       holiday, naked shoulders
3.   0001, 1001-0, 10111, 0-1001              0.17, 0.12, 0.11, 0.1                 working, naked shoulders
4.   11, 11-11                                0.14, 0.11                            holiday, abroad
5.   0001, 1001-0, 1001, 0-1001               0.27, 0.14, 0.13, 0.1                 holiday or working, naked shoulders
6.   1001, 0, 0-1, 1-0, 1, 1-1                0.29, 0.19, 0.16, 0.14, 0.1, 0.09     work - holiday, no sun
7.   10011                                    0.19                                  holiday, on the beach
8.   10001, 1-10011                           0.21, 0.12                            holiday, on the beach
9.   0-0, 0, 0-1, 1-0                         0.36, 0.35, 0.12, 0.12                working - no sun
10.  1001, 1-1101, 1101-1, 1101-1101, 1011    0.26, 0.16, 0.12, 0.12, 0.11          holiday, naked shoulders

Table 7.15 Key-patterns for clustering the diary histograms combined with the co-occurrence matrix. In the first column the cluster number is displayed. The second column contains the most probable patterns for the cluster. The rule for interpreting the diary key-patterns is given in table 7.13. The co-occurring patterns are shown with a dash between them, e.g. "0-1" means that the pattern working is followed by the pattern holiday. The third column gives the probabilities for the key-patterns, and the fourth column presents a general description of the cluster based on the key-patterns.

In table 7.15 the key-patterns for clustering the diary histograms combined with the co-occurrence matrix are presented. The diary key-patterns are explained in table 7.13. The co-occurring patterns are shown with a dash between them, e.g. "0-1" means that the pattern working is followed by the pattern holiday, and pattern "1-10011" means that a holiday without sun was followed by a holiday spent on the beach. This rule for describing co-occurrence patterns holds as well for table 7.16.


#   Pattern                              Probability                           Description
1.  1001, 0001, 1101-1101                0.15, 0.1, 0.09                       naked shoulders
2.  0, 1UV, 0-0, 0001                    0.17, 0.16, 0.14, 0.13                working, low sun radiation
3.  1001-0, 0-1001, 0, 1-0, 0-1, 0UV     0.17, 0.14, 0.14, 0.12, 0.11, 0.1     no sun radiation, holiday-work
4.  1UV, 2UV                             0.12, 0.1                             medium sun exposure
5.  3UV, 11, 11-11                       0.11, 0.11, 0.09                      holiday, high sun radiation
6.  0-0, 0, 0UV                          0.29, 0.27, 0.23                      working, no sun

Table 7.16 Key-patterns for clustering the diary histograms combined with the co-occurrence matrix and the UV histograms. In the first column the cluster number is displayed. The second column contains the most probable patterns for the cluster. The diary key-patterns are explained in table 7.13, the co-occurring patterns in table 7.15 and the UV patterns in table 7.14. The third column gives the probabilities for the key-patterns, and the fourth column presents a general description of the cluster based on the key-patterns. This clustering provides the most compact and explicit result. Cluster no. 5 collects very high sun radiation patterns while cluster no. 2 collects very low ones.

Table 7.16 shows the key-patterns for clustering the diary histograms combined with the co-occurrence matrix and the UV histograms. The diary key-patterns are explained in table 7.13, the co-occurring patterns in table 7.15 and the UV patterns in table 7.14. Both the UV values and the co-occurrence pairs are likely to appear as key-patterns. Cluster no. 5 collects very high sun radiation patterns while cluster no. 2 collects very low ones, and similarly clusters 3 and 1. Cluster 4 corresponds to medium sun radiation behavior. This suggests that joining the time information and the sun exposure measurements is important for the clustering. In conclusion, this clustering gives the most compact and explicit result.

In figure 7.17 the probability of observing certain groups of behaviors in the clusters is presented together with the registered sun exposure values. The clustering was performed using the full pattern/window matrix18, for which the keywords are displayed in table 7.16. Five behaviors are specified: working - no sun exposure, holiday - no sun exposure, sun exposure, which describes mild sun behaviors, often on the beach or with naked shoulders, without sun-screen and without sunburns, and, corresponding to high sun exposure behaviors, using sun-block and sunburns. In the lower panel the measurements of the sun radiation are presented. For example, cluster number 6 groups behaviors marked as working - no sun, and the corresponding UV values are low. An opposite situation occurs for cluster no.

18With all the sources combined, i.e. diary and UV and co-occurrence histograms.


Figure 7.17 The probability of observing certain groups of behaviors in the clusters together with registered sun exposure values. Key-patterns for the clusters are presented in table 7.16. For each cluster the grouped behaviors from the diary records are presented in the upper plot and the corresponding UV radiation is shown in the lower figure. For example, cluster number 6 groups behaviors marked as working - no sun and the corresponding UV values are low. An opposite situation occurs for cluster no. 5, which contains records with reported sunburns, sun exposure and using sun-block, and consequently the observed UV values are high.

5, which contains records with reported sunburns, sun exposure and use of sun-block, and consequently the observed UV values are high.

For the same clustering setting, the cluster probabilities were calculated for the 10 test subjects using equation 7.1; they are shown in figure 7.18. Together with the key-patterns presented in table 7.16, a description of the behavior of particular persons during the whole period of the survey is possible. For all test persons there is a large probability for cluster no. 6, which describes working and no sun exposure. However, some of the periods are described by other patterns. For example, for person no. 251 there is a high probability component for cluster no. 5, describing holidays with high sun radiation. Persons no. 213 and 35 can be well described by clusters 6 (working, no sun) and 1 (naked shoulders), while person no. 23 is described by clusters 6, 1 and 4 (medium sun exposure).


Figure 7.18 Cluster probabilities calculated for the 10 test persons using equation (7.1). The person index is shown on the x-axis and different grey levels correspond to the six clusters. The corresponding key-patterns are given in table 7.16. For all test persons there is a large probability for cluster no. 6, which describes working and no sun exposure. However, some of the periods are described by other behavioral patterns. For example, for person no. 251 there is a high probability component for cluster no. 5, describing holidays with high sun radiation. Persons no. 213 and 35 can be well described by clusters 6 (working, no sun) and 1 (naked shoulders), while person no. 23 by clusters 6, 1 and 4 (medium sun exposure).

7.3.2 Imputation of missing values in the sun-exposure study

Preprocessing

The sun-exposure data was used in connection with missing value imputation in [81]. Since the diary records are originally categorical, both nominal and ordinal, a coding technique is proposed that converts the data to binary vectors. For this purpose 1-out-of-c coding is used: it represents a c level categorical variable with a binary c bit vector. An example of the coding is presented in Chapter 5, table 5.1.

From the questionnaire, given in table 7.1, one question is an ordinal variable with


26 states, namely the Sun Screen Factor Number. Therefore, for simplicity and for dimensionality reduction, it is quantized to 5 levels (no/1–7/8–16/17–35/>35) before the coding is applied. Each diary record, in the final form for missing data analysis, is described by a 17 dimensional binary feature vector.
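As a small illustration of the 1-out-of-c coding, the sketch below encodes the quantized sun-screen factor; the level labels are taken from the quantization above, while the function name is illustrative.

    import numpy as np

    def one_out_of_c(value, levels):
        # A categorical variable with c levels becomes a binary vector of
        # length c containing a single 1.
        code = np.zeros(len(levels), dtype=int)
        code[levels.index(value)] = 1
        return code

    ssf_levels = ["no", "1-7", "8-16", "17-35", ">35"]
    print(one_out_of_c("8-16", ssf_levels))   # -> [0 0 1 0 0]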

Due to the characteristics of the data, three different profiles are taken into consideration. The first, which is called the Complete Diary Profile (CDP), uses all records in the estimation process. The second, the Personal Profile (PP), assumes that all questionnaires from one person have similar characteristics while the characteristics across persons differ. This arises from the expectation that human behavior varies from person to person; thus, the estimation is done from the other complete records of the given person. The third profile is the Day Profile (DP), which assumes that the data vectors for one day are similar, or equivalently belong to one distribution, while the parameters of the distributions vary across days. This is due to the fact that human behavior is influenced by the weather, the day of the week, the temperature, the season of the year, etc. The model using each of the described profiles is called a method. In addition, a Voting procedure is also considered: it compares the proposals from all the above mentioned methods and takes the majority vote among the outcomes. This method is expected to give the best results; however, it is much more computationally expensive since it combines the other three methods.

In order to impute the missing values in the diary records, the models described in chapter 5 are implemented here. To check the performance, the techniques are tested on complete questionnaires. The diary record description is presented in section 7.1 and the necessary preprocessing steps are given in section 7.3.1.
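For concreteness, a minimal sketch of Gaussian-based imputation is given below: a joint Gaussian is fitted to the training records and a missing block is replaced by its conditional mean given the observed entries. This is only an illustration under that assumption and may differ in detail from the chapter 5 model; for a 1-out-of-c block, the continuous estimate can afterwards be mapped back by selecting the largest entry.

    import numpy as np

    def gaussian_impute(x, missing, train):
        # Conditional mean of the missing entries under a joint Gaussian:
        # x_m | x_o ~ N( mu_m + S_mo S_oo^{-1} (x_o - mu_o), . ).
        mu = train.mean(axis=0)
        S = np.cov(train, rowvar=False) + 1e-6 * np.eye(train.shape[1])
        o = ~missing
        cond = mu[missing] + S[np.ix_(missing, o)] @ np.linalg.solve(
            S[np.ix_(o, o)], x[o] - mu[o])
        x_imp = np.asarray(x, dtype=float).copy()
        x_imp[missing] = cond
        return x_imp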

A leave-one-out permutation estimate of the generalization error is performed, where one validation sample and a number of training samples are chosen randomly from the complete data set in 500 repeated permutations. The performance is then an average over the 500 permutations. As an example, when considering the Day Profile, the day number of the validation sample specifies the day number of the training samples, of which there are at most 194 persons to choose from. When the training set size N is smaller than 194, then N randomly chosen samples out of the 194 are selected.

Errors of so-called low concentration are investigated, i.e. only one binary block in the vector, corresponding to one question, is missing at a time. The final error rate is an average over such single errors made in all nine possible blocks.


In the case of the KNN model, the number of nearest neighbors is optimized separately for each profile and for each block, using another set of 500 repeated permutation samples. The optimal K (in the range 1-30) is then found by determining the number which leads to the minimum leave-one-out error.
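A sketch of the corresponding KNN imputation of a single missing block is shown below: the distance is computed on the observed blocks only and the missing 1-out-of-c block is filled by a majority vote over the K nearest complete training records. The names and details are illustrative and need not match the chapter 5 implementation.

    import numpy as np

    def knn_impute_block(x_obs, train, block, K=5):
        # x_obs: record with one missing block; train: complete training
        # records; block: column indices of the missing 1-out-of-c block.
        observed = np.ones(train.shape[1], dtype=bool)
        observed[block] = False
        dist = np.linalg.norm(train[:, observed] - x_obs[observed], axis=1)
        neighbours = np.argsort(dist)[:K]
        votes = train[neighbours][:, block].sum(axis=0)
        imputed = np.zeros(len(block), dtype=int)
        imputed[np.argmax(votes)] = 1   # majority vote picks the modal level
        return imputed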

Figure 7.19 Learning curves for the Gaussian model (left plot) and the K-Nearest Neighbor model (right plot). Four different methods are presented here: Personal Profile, Day Profile, Complete Diary Profile and Voting. Error bars show the standard error over 500 runs. The best results with respect to error rate are observed for the Day Profile when the size of the training set is maximal. The results of the Personal Profile and Voting are similar in the Gaussian model case. The Complete Diary Profile performs significantly worst. For the K-Nearest Neighbor model the size of the training set is not as important for the level of the error rate as it is in the case of the Gaussian model.

Figure 7.19 presents the learning curves for the Gaussian model (left plot) and the K-Nearest Neighbor model (right plot), respectively. All four methods are shown, namely Personal Profile, Day Profile, Complete Diary Profile and Voting. As expected, Voting gives very good results for both the Gaussian and the K-Nearest Neighbor model; however, in both cases the Day Profile outperforms the other methods. The Complete Diary Profile performs significantly worst. For the K-Nearest Neighbor model the size of the training set is not as important for the level of the error rate as it is in the case of the Gaussian model.

The same results, but compared method-wise, are presented in figure 7.20. The Gaussian model performs at least as well as, and in many cases much better than, the K-Nearest Neighbor model. The difference is more significant for larger training data sets.


(Panels: Personal Profile, Day Profile, Complete Diary Profile, Voting; x-axis: size of the training set, y-axis: error rate.)

Figure 7.20 Comparison between GM (light line) and KNN model (dark line) for all the profiles shown separately. Gaussian model performs at least as good and in many cases much better than K-Nearest Neighbor model. The difference is more significant for larger training data sets.

Figures 7.21 and 7.22 present the performance of the imputation for each of the nine blocks separately. Every sub-figure corresponds to one question in the questionnaire. The highest error is made in the imputation of the second block (question no. 2: Holiday). The error rate for this block is so significant that it basically determines the overall error rate of the validation sample. Not surprisingly, the value of this field is best predicted by the Day Profile; this is also why the average error rate observed in figure 7.19 is smallest for the Day Profile. For the rest of the blocks, imputation with the Personal Profile performs most successfully. The situation is similar for the KNN model (figure 7.22).

Table 7.17 presents the error correlation matrices for the Gaussian and the K-Nearest Neighbor models, respectively, for the three methods: Personal, Day and Complete Diary Profile. The entry $E_{ij}$ of the error correlation matrix, which was presented in [38], is defined as $E_{ij} = \mathrm{Prob}\{\text{error in method } i \wedge \text{error in method } j\}$. The error rates are shown for two extreme training set sizes, namely 5 and 100 samples.


Figure 7.21 Learning curves for the GM model shown separately for all 9 blocks. On the x-axes the size of the training set is shown and on the y-axes the error rate. The learning curves for block no. 2 present the highest rate. In this case, the Day Profile (triangles) gives the best results in imputation. In the other cases, the Personal Profile (circles) performs best. Voting is marked with diamonds and the Complete Diary Profile with squares.

Ideally, uncorrelated methods would return errors only on the main diagonal. In such a case the Voting procedure would produce optimal results.

Figures 7.23 and 7.24 show the error rate for 30 validation samples as a function of the training set size. All the methods share the same set of validation samples. It is interesting to see that for some of the validation samples the error depends neither on the method or model used nor on the size of the training set; this phenomenon is even stronger for the KNN model (for example, sample no. 2). In other cases, an increased training set size reduces the error rate (sample no. 20 for the Gaussian model). It can also be seen that for other validation samples the error rate varies from method to method (e.g., samples 12 and 20 for the Gaussian model) and between the models (e.g., sample 2). In such cases, Voting returns the lowest error rate.


Figure 7.22 Learning curves for the KNN model shown separately for all 9 blocks. The x-axes show the size of the training set and the y-axes the error rate. Similarly to the GM model (figure 7.21), the learning curves for block no. 2 show the highest error rate; in this case the Day Profile gives the best imputation results, while in most of the other cases Voting performs best.

Gaussian model
  5 samples                        30 samples
        PP      DP      CDP              PP      DP      CDP
  PP    0.2481  0.0518  0.1701     PP    0.1695  0.0351  0.2096
  DP    0.0518  0.1566  0.1100     DP    0.0351  0.1484  0.1705
  CDP   0.1701  0.1100  0.2634     CDP   0.2096  0.1705  0.2668

K-Nearest Neighbor model
  5 samples                        30 samples
        PP      DP      CDP              PP      DP      CDP
  PP    0.1754  0.0255  0.4093     PP    0.2217  0.0405  0.2294
  DP    0.0255  0.0870  0.1004     DP    0.0405  0.1359  0.1240
  CDP   0.4093  0.1004  0.2024     CDP   0.2294  0.1240  0.2486

Table 7.17 Error correlation tables for the Gaussian and K-Nearest Neighbor models. For each model, the left and right tables present data for a small training set (5 samples) and a large training set (100 samples), respectively. Used abbreviations: PP - Personal Profile, DP - Day Profile, CDP - Complete Diary Profile. Ideally, uncorrelated methods would return errors only on the main diagonal, in which case the Voting procedure would produce optimal results.


Figure 7.23 30 validation samples predicted with the Gaussian models as a function of the size of the training set. The error rate is an average over the 9 observed blocks; 0 corresponds to no error made and 1 to an error made in each of the blocks. An increased training set size reduces the error rate, as for example for sample no. 20, and for other validation samples the error rate varies from method to method (e.g., samples 12 and 20).

Figure 7.24 30 validation samples predicted with the KNN model as a function of the size of the training set. The error rate is an average over the 9 observed blocks; 0 corresponds to no error made and 1 to an error made in each of the blocks. For some of the validation samples the error depends neither on the method or model used nor on the size of the training set, for example sample no. 2.


7.4 Aggregated Markov model for clustering and classification

In the experiments presented in this section the aggregated Markov model is applied. The model is described in detail in Chapter 6. Five data sets are used; the two of them that were not introduced earlier are described below.

Linear structure Five 2-dimensional Gaussian distributed clusters are created with a spherical covariance structure, as shown in figure 7.25 (left plot). The clusters are linearly separable. This artificially created data is used as an illustration of a simple clustering problem.

Figure 7.25 The scatter plots of the artificial data for the 5 Gaussian distributed clusters (left panel) and the 3 cluster ring formations (right panel).

Manifold structure Three clusters are created as shown in the right plot of figure 7.25. The clusters are formed in the shape of rings, all centered at the origin, with radii 2, 5 and 8, respectively. The data span 2 dimensions. This data set is given as an example of a complex, nonlinear, yet separable problem.

Additionally, the performance of the algorithm is investigated on textual data, the Email and Newsgroups collections, and on the small medical data set of six erythemato-squamous dermatological diseases. These databases are described in section 7.1.


Preprocessing

In case of the continuous space collections (Gaussian and Rings clusters) the data vectors are normalized by their maximum value so that they fall in the range between 0 and 1. This step is necessary whenever the features describing the data points differ significantly in value or range. Such normalization is also performed for the dermatological collection, in order to equalize the ranges of the different attributes. Normalization to unit L2-norm length is applied for the discrete domain data sets, i.e. for the Email, Newsgroup and Dermatological collections.

For the Gaussian and Rings clusters the isotropic Gaussian kernel (equation 6.18) is used. For the discrete data sets (Emails, Newsgroups and Dermatological collection) the cosine inner-product (equation 6.19) is applied.
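The preprocessing and Gram matrix construction can be sketched as follows. The kernel expressions below assume the usual isotropic Gaussian kernel and cosine inner-product forms (the exact definitions are those of equations 6.18 and 6.19), and all variable names are illustrative.

```python
import numpy as np

def normalize_continuous(X):
    # Scale each feature by its maximum absolute value so all entries lie in a unit range.
    return X / np.abs(X).max(axis=0, keepdims=True)

def normalize_l2(X):
    # Scale each data vector to unit L2-norm length (used for the discrete collections).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)

def gaussian_gram(X, h):
    # Isotropic Gaussian kernel, assuming the common form exp(-||x_i - x_j||^2 / (2 h^2)).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * h ** 2))

def cosine_gram(X):
    # Cosine inner-product between L2-normalized vectors.
    Xn = normalize_l2(X)
    return Xn @ Xn.T

# Illustrative use on random stand-ins for the real collections.
K_cont = gaussian_gram(normalize_continuous(np.random.randn(100, 2)), h=0.06)
K_disc = cosine_gram(np.random.rand(50, 300))   # e.g. a term-document style matrix
```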

The cluster structure in the data is investigated in the original space, i.e. no projection to a latent space is performed and no dimensionality reduction is applied.

7.4.1 Clustering of the toy data sets

Linear structure

The Gaussian clusters example is a simple separation problem. The model is trained using 500 randomly generated samples, and 2500 validation samples are selected for estimating the generalization error. The aggregated Markov model, as a probabilistic framework, allows new data points not included in the training set to be uniquely mapped into the model. Therefore, it is possible to compute the generalization error and, based on it, to determine the model parameters: the kernel width h for the continuous low-dimensional data, or K in the K-connected graph for discrete high-dimensional data, and the optimum number of classes c.
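For intuition, the sketch below shows one standard way of obtaining such a probabilistic decomposition of a Gram matrix: a PLSA/aggregate-Markov style EM that factorizes the normalized Gram matrix as a sum over classes of p(c) p(i|c) p(j|c) and yields cluster posteriors p(c|z_i). It is a simplified stand-in for the model of Chapter 6, not the thesis implementation, and all names are illustrative.

```python
import numpy as np

def aggregated_markov_em(K, n_clusters, n_iter=200, seed=0):
    """Probabilistic decomposition of a nonnegative Gram matrix K.

    Models the normalized Gram matrix as P(i, j) = sum_c p(c) p(i|c) p(j|c)
    and returns the cluster posteriors p(c|i); a PLSA-style EM sketch only.
    """
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    P = K / K.sum()                                   # empirical joint over sample pairs
    p_c = np.full(n_clusters, 1.0 / n_clusters)       # p(c)
    p_i_c = rng.dirichlet(np.ones(N), n_clusters).T   # p(i|c), shape (N, C)
    p_j_c = rng.dirichlet(np.ones(N), n_clusters).T   # p(j|c), shape (N, C)

    for _ in range(n_iter):
        # E-step: responsibilities p(c|i, j) for every pair (i, j).
        joint = p_i_c[:, None, :] * p_j_c[None, :, :] * p_c[None, None, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-300)
        # M-step: reweight the responsibilities by the observed pair mass P(i, j).
        weighted = resp * P[:, :, None]
        p_c = weighted.sum(axis=(0, 1))
        p_i_c = weighted.sum(axis=1) / (p_c + 1e-300)
        p_j_c = weighted.sum(axis=0) / (p_c + 1e-300)
        p_c = p_c / p_c.sum()

    post = p_i_c * p_c                # p(c|i) proportional to p(i|c) p(c)
    return post / post.sum(axis=1, keepdims=True)
```

In this illustrative form, the kernel width h (or K for the K-connected graph) and the number of clusters c would then be chosen by evaluating the generalization criterion on held-out samples, as described above.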

In 20 experiments different training sets are generated, so the final error is an average over 20 outcomes of the algorithm on the same validation set. The left plot of figure 7.26 presents the generalization error as a function of the kernel smoothing parameter h. For h = 0.06 the error is minimal. For this particular h the model complexity is then investigated (right plot of figure 7.26). Here, the minimal error is obtained for 2, 3, 4 and 5 clusters alike. It can be shown that the generalization error of the discussed model, in the case of fully separable examples, is identical for any number of clusters lower than or equal to the correct number.


Figure 7.26 The generalization error as a function of the kernel smoothing parameter h (left panel) for the 5 Gaussian distributed clusters; the optimum choice is h = 0.06. The right panel presents, for the optimum smoothing parameter, the generalization error as a function of the number of clusters. Here, any cluster number below or equal to 5 may give the minimum error; the error values are shown above the points. The optimum choice is the maximal model, i.e., c = 5 (see the explanation of this choice in the text). The error bars show the standard error values.

When a perfect (0/1-valued) cluster posterior probability p(c|zi) is observed, the sample probability p(zi) is the same for both smaller and larger models. This holds as long as the natural cluster separations are not split, i.e. as long as each sample has a large (close to 1) probability of belonging to one of the clusters, max_c p(c|zi) ≈ 1.

For the Gaussian clusters the cluster posterior p(c|z) is presented in figure 7.27. Naturally, the values are in the 0–1 range, and perfect decision surfaces can be observed. The probabilistic framework simplifies the determination of the decision surfaces. For comparison, figure 7.28 presents the components of the traditional kernel PCA. Here, both positive and negative values are observed, which makes it difficult to determine the optimum decision surface.

Manifold structure

In the case of the Rings structure the clustering problem is no longer simple. The clusters are still separable, but a nonlinear decision boundary is needed. In the training, 600 examples are used and 3000 validation samples are generated for the generalization error. 40 experiments are performed, in which different training sets are selected.



Figure 7.27 The cluster posterior values p(c|z) obtained from the aggregated Markov model for the five Gaussian clusters. The decision surfaces are nonnegative. In this simple example the separation is perfect, with 0/1 class posterior probability values observed; it is simple to form the decision surfaces from the probabilistic outcome.

Figure 7.28 The components of the traditional kernel PCA model for the five Gaussian clusters. The values are both positive and negative, which makes the determination of the decision surface more ambiguous.

The generalization error, shown in figure 7.29, is an average over the errors obtained in each of the 40 runs on the same validation set.


The optimum smoothing parameter (figure 7.29, left plot) equals h = 0.065.

Figure 7.29 The generalization error as a function of the kernel smoothing parameter h for the 3 clusters formed in the shape of rings (left panel); the optimum choice is h = 0.065. The right panel shows the generalization error as a function of the number of classes for the optimum choice of the smoothing parameter. The error bars show the standard error. 2 and 3 clusters provide the minimal error values; as before, the maximal model of 3 clusters is selected (the explanation of this selection can be found on page 116).

The minimum generalization error is obtained for 3 classes. Similarly to the Gaussian clusters example, a smaller model of 2 classes is also plausible, since the generalization error is similar for both 2 and 3 classes.

The class posterior for the Rings data set and the kernel PCA components are presented in figures 7.30 and 7.31, respectively.

Figure 7.30 The cluster posterior values p(c|z) obtained from the aggregated Markov model for the Rings structure of three clusters. The decision surfaces are nonnegative. The separation is perfect, since the class posterior probability values are 0/1, and it is easy to determine the decision surfaces.

Also in this case a perfect (0/1) cluster posterior is observed (figure 7.30), which is the outcome of the aggregated Markov model.

The components of kernel PCA (figure 7.31) provide, as the cluster posterior does, a separation, but with more ambiguity in the selection of the decision surface.

Figure 7.31 The components of the traditional kernel PCA model for the Rings structure of three clusters. The components are both positive and negative, which makes it difficult to determine the appropriate decision surface.

7.4.2 Clustering of the textual data collections

The generalization error for the Email collection is shown in figure 7.32.

Figure 7.32 The left panel presents the mean generalization error as a function of both the number of classes and the number K of active neighbors in the K-connected graph for the Emails collection. Since the differences around the minimum are small, the additional right panel shows the minimal curves. As the optimal model, K = 50 (50-connected graph) with 3 clusters is chosen.


The mean values are averaged over 20 random choices of the training and the test set. For training, 702 samples are reserved and the remaining 703 are used in the calculation of the generalization error. Since the kernel used is the cosine inner-product, the K-connected graph is applied to determine the number of active neighbors in the Gram matrix. For the Email collection, the minimal generalization error is obtained when using a 50-connected graph with a model complexity of 3 clusters. Since the data is not naturally separated, there exists a small confusion among the clusters, and smaller models are not favored as they were in the case of the Rings and Gaussian data sets.
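The K-connected graph step can be illustrated as below: for each sample only the K largest similarities in the Gram matrix are kept active, and the result is symmetrized. This is a plausible reading of the construction (the precise definition is given in Chapter 6), and the function name is illustrative.

```python
import numpy as np

def k_connected_graph(K_gram, k):
    # Keep only the k largest off-diagonal similarities per row; symmetrize so the
    # graph stays undirected (an edge survives if either endpoint selected it).
    N = K_gram.shape[0]
    sparse = np.zeros_like(K_gram)
    for i in range(N):
        row = K_gram[i].copy()
        row[i] = -np.inf                     # ignore self-similarity when ranking
        neighbors = np.argsort(row)[-k:]     # indices of the k most similar samples
        sparse[i, neighbors] = K_gram[i, neighbors]
    sparse = np.maximum(sparse, sparse.T)
    np.fill_diagonal(sparse, K_gram.diagonal())
    return sparse
```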

Figure 7.33 As interpretation, the confusion matrices of the training set (left plot) and of the test set (right plot) are provided. The separation is almost perfect; only a small confusion, especially between conference and job emails, is observed.

Figure 7.33 presents the confusion matrices of the training set (left plot) and the test set (right plot). The figures provide a supervised interpretation of the found clusters: spam emails fall into cluster 1 in almost 100% of the cases, cluster number 2 covers job emails and cluster number 3 conference emails. There is a slight confusion between cluster 2 and cluster 3, which can be expected, since both job and conference emails are university related. The results can be compared with the similar experiment performed with the UGGM model in figure 7.5.
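A confusion matrix of this kind can be computed as sketched below, given hard cluster assignments (e.g. the argmax of p(c|z)) and the known document labels. The label names and arrays are placeholders rather than the actual Email collection.

```python
import numpy as np

def cluster_confusion(labels, clusters, n_classes, n_clusters):
    # Percentage of documents from each class that fall into each cluster
    # (rows: classes, columns: clusters, each row sums to 100).
    counts = np.zeros((n_classes, n_clusters))
    for y, c in zip(labels, clusters):
        counts[y, c] += 1
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

# Hypothetical example: labels 0=spam, 1=job, 2=conference; clusters from the model.
labels   = np.array([0, 0, 1, 1, 2, 2, 2, 1])
clusters = np.array([0, 0, 1, 1, 2, 2, 1, 1])
print(cluster_confusion(labels, clusters, n_classes=3, n_clusters=3))
```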

The generalization error for the Newsgroups collection is shown in figure 7.34. For the training, 400 samples are randomly selected from the set and the rest of the collection is used for the calculation of the generalization error. 40 experiments were performed, and figure 7.34 displays the mean value of the generalization error. For clarity, since the differences at the minimum are small, four selected generalization error curves with the standard error on the error bars are shown in the right plot of figure 7.34. The optimum model has 4 classes, with the connectivity among the 20 closest neighbors retained in the Gram matrix.


Figure 7.34 The left panel presents the mean generalization error as a function of both the cluster number c and K, the number of active neighbors in the K-connected graph, for the Newsgroups collection. Since the differences at the minimum are small, it is difficult to see the placement of the minimum; therefore, the curves for selected K (10, 20, 30, 40) are provided in the right plot. The optimal model complexity is 4 clusters when using the 20-connected graph.


Figure 7.35 presents the confusion matrices of the training set (left plot) and the test set (right plot).

Figure 7.35 The confusion matrices of the training set (left plot) and the test set (right plot). The data is well separated; only a small confusion among the clusters is observed.

The figures provide a supervised interpretation of the discovered clusters. Thus, for example, the Motorcycles newsgroup documents are mostly covered by cluster 1.


Cluster 2 contains Christian Religion, cluster 3 Computer Graphics and cluster 4 Baseball. There is a small confusion among the discovered clusters. The results can be compared with the similar outcome of the UGGM model in figure 7.9.

In order to compare the aggregated Markov model with the classical spectral clustering method presented in [62], the following experiments are performed. For both continuous and discrete data sets, using both the Gaussian kernel and the inner-product, the overall classification performance, measured by the misclassification error, is investigated. It is found that both the aggregated Markov model and the spectral clustering model, for suitably selected model parameters, perform comparably well in terms of misclassification error. However, the spectral clustering model is less sensitive to the choice of the smoothing parameter h. Note that the spectral clustering method does not provide an objective technique for selecting the optimal parameters; thus, the parameters providing the minimum misclassification error are selected as the optimum.
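When clusterings are compared by misclassification error, the unsupervised cluster indices first have to be matched to the known classes. One common way to do this for a small number of clusters, sketched below, is to search over permutations of the cluster labels and keep the assignment with the fewest errors; this is offered as an illustration, not as the exact evaluation protocol used in the thesis.

```python
import numpy as np
from itertools import permutations

def misclassification_error(labels, clusters, n_clusters):
    # Smallest error rate over all one-to-one mappings of clusters to classes.
    best_errors = len(labels)
    for perm in permutations(range(n_clusters)):
        mapped = np.array([perm[c] for c in clusters])
        best_errors = min(best_errors, int(np.sum(mapped != labels)))
    return best_errors / len(labels)

# Hypothetical labels and cluster assignments for six samples and three clusters.
labels   = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([2, 2, 0, 0, 1, 1])
print(misclassification_error(labels, clusters, n_clusters=3))   # 0.0
```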

7.4.3 Classification of the medical data collections

The unsupervised classification task is performed on the dermatological collection. The data, described in detail in section 7.1, consists of 6 classes of erythemato-squamous diseases, namely psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. In the experiments 358 examples described by 34 attributes are used. The minimum of the calculated generalization error indicates a two-class structure, which is the minimal investigated complexity; in other words, the data set is too small to allow correct estimation of the model parameters. Therefore, the 6-class structure is optimized directly and the misclassification error is considered when determining the optimum number of connected neighbors. While the model complexity is determined in a supervised way, the learning itself is performed in an unsupervised manner.

The misclassification error and the confusion matrix are presented in figure 7.36. As the optimum number of active neighbors, any number above K = 100 can be selected; for decomposition purposes K = 300 is chosen. The Gram matrix is then decomposed, resulting in a data separation into 6 clusters. For interpretation purposes the confusion matrix is provided. Here it can be seen that there is a large confusion between two clusters, namely seboreic dermatitis and pityriasis rosea. The rest of the clusters are separated.


Figure 7.36 The left panel presents the misclassification error as a function of the number of neighbors in the K-connected graph for the Dermatological collection; K = 300 is selected. The right plot presents the confusion matrix obtained from the model, where unsupervised learning is performed with a 6-class structure. The classes are marked with abbreviations of the original erythemato-squamous diseases. An almost perfect classification can be seen, except for the two classes seboreic dermatitis and pityriasis rosea, which are confused.


C H A P T E R 8

Conclusion

The focus of this thesis has been on examining various principled approaches to data mining in order to discover inherent latent structure within the data and to provide an intuitive interpretation of any such structure. The analysis was performed on medium-sized databases which came from disparate sources and were heterogeneous in nature.

The general framework known as the Knowledge Discovery in Databases process was applied. The investigated methods include techniques for projection, clustering, structure interpretation, outlier detection and imputation of missing values. In the first phase, projection techniques were investigated with the main focus on Principal Component Analysis, though Random Projection and Non-negative Matrix Factorization were also examined as possible techniques for dimensionality reduction. The Random Projection method did not provide satisfactory results in the case of complex data types such as sparse textual data and when a large dimensionality reduction is required. This is the case for Generalizable Gaussian Mixture models, which require a small number of dimensions in order to estimate the parameters of the data density correctly; any increase in the dimensionality significantly increases the number of parameters to estimate, which in turn requires more data points for a sensible estimate. The Non-negative Matrix Factorization provided a matrix decomposition which was subsequently applied to clustering in conjunction with the aggregated Markov model.



For clustering purposes, the unsupervised and unsupervised/supervised Generalizable Gaussian Mixture models were investigated and applied to the observational data sets. These models return a multi-modal density estimate which provides the cluster structure inherent in the data. In the case of the unsupervised/supervised GGM model, the influence of the unlabeled samples used in the learning process was investigated. It was found that unlabeled examples are important for the estimation whenever the labeled data set is small; this importance diminishes as the size of the labeled set increases.

Another important issue addressed in this thesis was outlier detection. In the KDD process, outliers can play a dual role: they can be considered as noise, in which case their removal improves the model, or they can be the subject of interest themselves and require careful analysis. In this work two outlier detection methods were compared. The first method was based on the cumulative distribution, where the low probability samples were classified as outliers. The second method was based on the Generalizable Gaussian Mixture model, with one cluster designated to collect outliers. The method based on the cumulative distribution requires manual tuning of the rejection threshold, whereas the outlier cluster technique is fully automatic and provides a better outcome whenever the data density is correctly estimated.
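A minimal sketch of the first, cumulative-distribution-based approach is given below for a fitted Gaussian mixture: samples whose density falls in the lowest fraction of the probability mass are flagged as outliers. The mixture parameters, the rejection fraction and the use of an empirical quantile as the threshold are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(X, weights, means, covs):
    # Density of each sample under a fitted Gaussian mixture.
    return sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def cumulative_outliers(X, weights, means, covs, reject_fraction=0.05):
    # Flag samples whose density lies in the lowest reject_fraction tail
    # (the rejection threshold is the manually tuned quantity mentioned above).
    p = mixture_density(X, weights, means, covs)
    return p < np.quantile(p, reject_fraction)

# Illustrative two-component mixture and data with one obvious outlier.
weights = [0.5, 0.5]
means = [np.zeros(2), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0, [[20.0, -20.0]]])
print(np.where(cumulative_outliers(X, weights, means, covs))[0])
```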

Agglomerative hierarchical clustering was the next research area. In this work, hierarchical clustering was an extension of the aforementioned Gaussian Mixture model. Several similarity measures were applied; however, no method significantly outperformed the others, despite their noted drawbacks. For instance, only an approximate formula is available for the higher hierarchy levels with the Kullback-Leibler similarity measure, and the Cluster Confusion similarity measure is computationally expensive, since it requires many ancillary data points to obtain a good dissimilarity estimate, especially when the cluster overlap is small. Agglomerative hierarchical clustering can be employed as an additional clustering level, which was found to be most appropriate when visualizing the data in order to present a multi-level understanding of its inherent clustered nature.

Being able to intuitively understand the results is another important part of the KDD process. For labeled data sets, the confusion matrix between the class and the cluster structure was provided, which enabled the interpretation of the discovered clusters in terms of known labels.


Furthermore, another proposed method for interpretation was to generate meaningful representatives of the clusters: in the case of textual databases keywords were provided, and for other forms of data the most important features of the model were selected.

An attempt was made to combine different data sources and various types of data in the clustering framework. It was shown, in the case of the investigated sun-exposure collection, that clustering benefited from the additional sources of information. However, no objective measure of the importance of the data sources was developed.

The issue of handling missing data was also addressed in this thesis. Similarly to outliers, missing data may be handled in two ways. First, if assumed to be a noise contribution, they can be removed from the data set, as has been done in the majority of the performed experiments; this can be applied whenever the missingness process is believed to be completely at random. Second, the missing data can itself be the subject of the data mining task, and as such imputation techniques can be applied. Two methods were compared in this work, namely K-Nearest Neighbors and a Gaussian model. Even though the experiments were carried out on the binary data set of the sun-exposure study, the results obtained by the Gaussian model were found to be superior.
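For concreteness, the sketch below shows common textbook forms of the two imputation ideas: a Gaussian model that fills missing entries with the conditional mean given the observed entries, and a K-nearest-neighbor rule that copies values from the most similar complete records. These are offered as illustrations only; the estimators actually used in the thesis (including the handling of the binary diary entries, which could for instance be thresholded at 0.5) may differ in detail.

```python
import numpy as np

def gaussian_impute(x, mu, Sigma):
    # Fill NaNs in x with the Gaussian conditional mean E[x_m | x_o].
    m = np.isnan(x)
    o = ~m
    Smo = Sigma[np.ix_(m, o)]
    Soo = Sigma[np.ix_(o, o)]
    x_filled = x.copy()
    x_filled[m] = mu[m] + Smo @ np.linalg.solve(Soo, x[o] - mu[o])
    return x_filled

def knn_impute(x, X_complete, k=5):
    # Fill NaNs in x by averaging the corresponding entries of the k complete
    # records that are closest to x on its observed entries.
    o = ~np.isnan(x)
    dists = np.linalg.norm(X_complete[:, o] - x[o], axis=1)
    nearest = np.argsort(dists)[:k]
    x_filled = x.copy()
    x_filled[~o] = X_complete[nearest][:, ~o].mean(axis=0)
    return x_filled

# Hypothetical usage on a record with one missing value.
mu = np.array([0.5, 0.2, 0.8])
Sigma = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(gaussian_impute(np.array([1.0, np.nan, 0.0]), mu, Sigma))
```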

As an alternative to the Gaussian Mixture model for data segmentation, the aggregated Markov model was proposed. This technique, originating from spectral clustering methods, performs the decomposition of the Gram matrix in a fully probabilistic framework. It provides estimates of the class posterior probabilities of the data points and allows new data points to be assigned within the framework; therefore cross-validation can be applied to select the model parameters, which was not possible with classical spectral clustering techniques. Moreover, data with large-dimensional feature spaces are easily handled by this method, hence no projection techniques are needed, even though they can be applied.

Future work

When working with disparate and heterogeneous data it would be interesting to develop an objective measure to determine whether the combination of sources provides more information than the individual sources.


It would also be interesting to develop a general model in which both continuous and discrete data are combined in an unsupervised and unsupervised/supervised clustering framework.

As a future research task, it would also be interesting to develop more complex imputation methods, for example maximum likelihood and multiple imputation techniques.


A P P E N D I X A

Equations

Jensen’s inequality:

If f is a convex function, i.e. its Hessian is positive semi-definite, H ≥ 0, for all x ∈ R, then for a random variable X it holds that

E[f(X)] \geq f(E[X])    (A.1)

Minkowski’s inequality:

If p > 1, then Minkowski's integral inequality states that

\left[ \int_a^b |f(x) + g(x)|^p \, dx \right]^{1/p} \le \left[ \int_a^b |f(x)|^p \, dx \right]^{1/p} + \left[ \int_a^b |g(x)|^p \, dx \right]^{1/p}    (A.2)

Similarly, if p > 1 and a_k, b_k > 0, then Minkowski's sum inequality states that

\left[ \sum_{k=1}^n (a_k + b_k)^p \right]^{1/p} \le \left[ \sum_{k=1}^n a_k^p \right]^{1/p} + \left[ \sum_{k=1}^n b_k^p \right]^{1/p}    (A.3)

Equality holds iff the sequences a_1, a_2, \ldots and b_1, b_2, \ldots are proportional.


Kullback-Leibler divergence:

Kullback-Leibler divergence [69] is defined as follows:

Div(k, l) = \int p(x|k) \ln \frac{p(x|k)}{p(x|l)} \, dx    (A.4)

A symmetric version is defined as D(k, l) = \frac{1}{2}\big(Div(k, l) + Div(l, k)\big). Thus,

D(k, l) = \frac{1}{2} \int p(x|k) \ln \frac{p(x|k)}{p(x|l)} \, dx + \frac{1}{2} \int p(x|l) \ln \frac{p(x|l)}{p(x|k)} \, dx    (A.5)

where D(k, l) \ge 0 and D(k, l) = D(l, k).

If the density functions p(x|k) and p(x|l) are Gaussians,

p(x|k) = (2\pi)^{-D/2} |\Sigma_k|^{-1/2} \exp\big(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\big)    (A.6)

p(x|l) = (2\pi)^{-D/2} |\Sigma_l|^{-1/2} \exp\big(-\tfrac{1}{2}(x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)\big)    (A.7)

where D is the dimension, then the KL divergence may be simplified in the following way:

\ln \frac{p(x|k)}{p(x|l)} = \frac{1}{2}\ln|\Sigma_l| - \frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \frac{1}{2}(x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)    (A.8)

Using

(x - \mu)^T \Sigma^{-1} (x - \mu) = \mathrm{Trace}\big(\Sigma^{-1}(x - \mu)(x - \mu)^T\big)    (A.9)

and, due to the linearity of the integral,

\int \mathrm{Trace}\big(a f(x)\big) g(x) \, dx = \mathrm{Trace}\Big(a \int f(x) g(x) \, dx\Big)    (A.10)

the first term of D(k, l) becomes

\int p(x|k) \ln \frac{p(x|k)}{p(x|l)} \, dx = \frac{1}{2}\int p(x|k)\ln|\Sigma_l| \, dx - \frac{1}{2}\int p(x|k)\ln|\Sigma_k| \, dx - \frac{1}{2}\int p(x|k)(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) \, dx + \frac{1}{2}\int p(x|k)(x - \mu_l)^T \Sigma_l^{-1}(x - \mu_l) \, dx    (A.11)

The two quadratic terms evaluate to

-\frac{1}{2}\int p(x|k)(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) \, dx = -\frac{1}{2}\mathrm{Trace}\Big(\Sigma_k^{-1}\int (x - \mu_k)(x - \mu_k)^T p(x|k) \, dx\Big) = -\frac{1}{2}\mathrm{Trace}(\Sigma_k^{-1}\Sigma_k) = -\frac{D}{2}    (A.12)

and

\frac{1}{2}\int p(x|k)(x - \mu_l)^T \Sigma_l^{-1}(x - \mu_l) \, dx = \frac{1}{2}\mathrm{Trace}\Big(\Sigma_l^{-1}\int p(x|k)\big((x - \mu_k)(x - \mu_k)^T + (x - \mu_k)(\mu_k - \mu_l)^T + (\mu_k - \mu_l)(x - \mu_k)^T + (\mu_k - \mu_l)(\mu_k - \mu_l)^T\big) dx\Big) = \frac{1}{2}\mathrm{Trace}\big(\Sigma_l^{-1}(\Sigma_k + 0 + 0 + (\mu_k - \mu_l)(\mu_k - \mu_l)^T)\big) = \frac{1}{2}\mathrm{Trace}(\Sigma_l^{-1}\Sigma_k) + \frac{1}{2}(\mu_k - \mu_l)^T \Sigma_l^{-1}(\mu_k - \mu_l)    (A.13)

Collecting terms,

\int p(x|k) \ln \frac{p(x|k)}{p(x|l)} \, dx = \frac{1}{2}\ln|\Sigma_l| - \frac{1}{2}\ln|\Sigma_k| - \frac{D}{2} + \frac{1}{2}\mathrm{Trace}(\Sigma_l^{-1}\Sigma_k) + \frac{1}{2}(\mu_k - \mu_l)^T \Sigma_l^{-1}(\mu_k - \mu_l)    (A.14)

\int p(x|l) \ln \frac{p(x|l)}{p(x|k)} \, dx = \frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}\ln|\Sigma_l| - \frac{D}{2} + \frac{1}{2}\mathrm{Trace}(\Sigma_k^{-1}\Sigma_l) + \frac{1}{2}(\mu_l - \mu_k)^T \Sigma_k^{-1}(\mu_l - \mu_k)    (A.15)

and finally

D(k, l) = \frac{1}{2}\int p(x|k) \ln \frac{p(x|k)}{p(x|l)} \, dx + \frac{1}{2}\int p(x|l) \ln \frac{p(x|l)}{p(x|k)} \, dx = -\frac{D}{2} + \frac{1}{4}\big(\mathrm{Trace}(\Sigma_k^{-1}\Sigma_l) + \mathrm{Trace}(\Sigma_l^{-1}\Sigma_k)\big) + \frac{1}{4}(\mu_k - \mu_l)^T (\Sigma_k^{-1} + \Sigma_l^{-1})(\mu_k - \mu_l)    (A.16)
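As a quick sanity check of the closed form in (A.16), the sketch below compares it with a Monte Carlo estimate of the symmetric divergence for two arbitrary Gaussians; the particular parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 2
mu_k, mu_l = np.array([0.0, 0.0]), np.array([1.0, -0.5])
Sigma_k = np.array([[1.0, 0.2], [0.2, 0.8]])
Sigma_l = np.array([[1.5, -0.3], [-0.3, 1.0]])

# Closed form (A.16).
diff = mu_k - mu_l
closed = (-D / 2
          + 0.25 * (np.trace(np.linalg.solve(Sigma_k, Sigma_l))
                    + np.trace(np.linalg.solve(Sigma_l, Sigma_k)))
          + 0.25 * diff @ np.linalg.solve(Sigma_k, diff)
          + 0.25 * diff @ np.linalg.solve(Sigma_l, diff))

# Monte Carlo estimate of (A.5).
pk = multivariate_normal(mu_k, Sigma_k)
pl = multivariate_normal(mu_l, Sigma_l)
xk, xl = pk.rvs(200000, random_state=1), pl.rvs(200000, random_state=2)
mc = (0.5 * np.mean(pk.logpdf(xk) - pl.logpdf(xk))
      + 0.5 * np.mean(pl.logpdf(xl) - pk.logpdf(xl)))

print(closed, mc)   # the two values should agree to a few decimals
```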


A P P E N D I X B

Contribution: NNSP 2001

This appendix contains the article Imputating missing values in diary records of sun-exposure study, in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing XI, editors: D. Miller, T. Adali, J. Larsen, M. Van Hulle, S. Douglas, pages 489–498. Author list: A. Szymkowiak, P.A. Philipsen, J. Larsen, L.K. Hansen, E. Thieden and H.C. Wulf. Own contribution estimated at 50%.


~e���I�d�"�-���-�Q���(� ���C£����-�C�\�-���Q���*�c�\�$���[�'���4�I�[���4�\�����Q�±¨��C�F�Q�4�c�����c�I���\�"���Q�Q��� �$�(�$��F�4�Q�4��²T���,�4�"�-�-�$�\�$�u�-�\�\�"�T�A�*�Q�Z�I�/�?�\�\�"�$���(���\���[�$� ���\�-�I�*�Q�-�<���_� ���(�� �/�-���q�(���C�Q�-�"�F��F�(�"�$� ³��-���&�\�(�c�M� �\���Q���Z���$��´F�$�$�(���Qµ'�'¶d�\���[�$� �"�

�T�$�$�(�&�(�$����� ·��-�\�-�"���d�4�\�$���������4���4�����(�$���"�b�\�$�u¨��(���Qµ��'�$�u¨��-���4����¨��Z�Q�4�(�C¸�*���&�����"���\�$�I�T�$���Q�[�F�C��� ~=���\�$�I�T�-���\�/���F�(�$�c���*�����\��³$�(�/�(���\�$�`�$�\���'�I�C�\�(���b���:�\�$��*���(�$�u���s�\�$�[¨$�(�'�Aµb�����R�����$���K�-�Q�F�"�����:�s�\��³$�(�u���"�t�\�$�[�\�-�F�����s�\�$�d¨��(���Qµ����[�*��"� �"�O�C�"���F�I���4�Q��¨$�(�[�(�[�$�\�����-�d�-�&�u�(�¦�\�$���-�\�\�"�Z�A�*�\�"��²T���,�-���-�-�K�\�$�A�¹�[� £'�(�$������\�$�Z�d�4�\�$�'�$�M�(�T�$� �"�$�(�O�����Q�[���4���4�����4�&�/�����O�������\�c¨��`�\�$�$�-���c���'����� �(�"�

~=�b�C�"���C�(���F�(�"�<�"���"�,�\�$�`���Q�-�F�-�&���$�*�A���F�4�-�'�\�$�Z¥/�����\�F�I�����[�����4��I���F�$���-�\�(�����\��\�$�/�$�"��¸e�����Q���[�C�Q�\�I�tº�¸e�$�-���\���?�,�$�-� �"�'¨����M�d�����4�R��� �\�$���$�"�O�\�$�Z¥`�����\�\�I���O�[�'���-��"�\�F�$�[�'�Q�(�����:�I�H�'�(���I���\���/���"�H¨$� �����\�Z�����Q�������A�Q���Q�-�:���$���/���Z�s�\��³��(���[�C�\�$���Z�&�*���¨��-�F�K�\�-�\��� �A�H�(�$���I�4���\�(�$���/�?�\�Q���$�/�����(�(�`�����Q�I�*�Q� �"�R�R~e���\�$�,�-�\�\�"�Q�R�c�"�'��¨&�Z��� ·��-�\�-�"��[�C�\�$���$�R�$�"�Z¨��4�4�u�$���C�"�\�\�-�I�*�\���<�4�Q���,�\�-�\��� �Q�:�\�C�\�$�Q�����/¨'�t�Q�$��� ���\�(���t�,�"�$�(�`�"� �"��\�$�b¨��-�?�c�(�[�$�'�A�*�\�(�"�§���-�F�����Q�[�����C�������[�I�\�F�(�$�¦�$�*�A���p�$���c�F�c���(�M�\�Q���(�$� �$�y�\�C�Q�

Page 158: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

142 Contribution: NNSP 2001

» ¼�½\¾(¿$À`Á\Â�ÃFÄ$Å ½\Â�Æ�¾(¿c¾(ÇdÈ$ÁQ¼*É"Â-Æ�È�Â4ÁFÊ�¼"Á\ÇcË�¿$Ì4Â�Í�ÎMÏ$¾(Å(ÂtÃ\Â4É"Â4Á\ÂTÌC¼"Á\Á\Â-ÅIË*½Q¾(¼�¿[Ë�Ç[¼"¿$À�½\Ï$ÂÂ4ÁQÁ\¼"ÁQÃ�¼�Ê�½QÏ�Â�Ç[ÂC½\Ï$¼�Æ$ÃuÆ�¾IÃ?Ê{Ë*É"¼�ÁQÃ/» ¼�½\¾(¿$À�Ê�¼�ÁuÅIË�ÁQÀ�Â[½\ÁQË�¾(¿$¾(¿�ÀqÃ\ÂC½QÃ-Ð�Ñ=¿¹Ë"Æ�Æ�¾ ½\¾(¼"¿Rн\Ï$Â�Ä�Ã\Â�¼�ÊR¼*É"Â4Á\ÅIË�È$È$¾ ¿$ÀZ½\ÁQË�¾(¿$¾(¿�ÀdÃFÂ4½QÃ�Ë"Æ$Æ'¾ ½Q¾ ¼"¿�Ë�Å Å(Ò[¾(Ç[È�ÁQ¼*É"Â-Æ�Ì4¼�Á\ÁQÂ4ÅIË�½\¾(¼�¿�Ë�Ç[¼�¿$À½\Ï$ÂZÇdÂ4½\Ï$¼'Æ$Ã4Ð

ÓOÔ�ÕTÔ�ÓOÔ�Ö�×�Ô`ØÙ Ú\Û�ÜÝ�Þ,ß'à4ß"áFà-âZà-ã"äRà-ã&å�æOÝ�çFÝ�è-é-áFå&à4ã�ê,ë=ì�í&î'ïQá?ð�ä ñ?ïAåcò ïCàCá?ã&ä ã&óuô(á?é�âõä ã&öQé�â`î&ò ï\÷?ï`å"à4÷Fà

ð�ä àTà4ãuøHæ�à-î"î"á?é*à4ö\ß$ê ùtä ãuè�Ý�ú�Ý�ûHéCü à4ã�ê&ÞtÝ�ý�ïAñ?à4í"á?é�à4ã'å`è"Ý*þ,ò ñ=î'ïAö\÷?é-á�ÿzïAå"ñQÝ �Fê��������� �������������������������! "�$#%��&'�( )+*,�� ��$��'�'���-�.0/��1&'�#2� êæ�é-á?ó�à-ã43Tà-í�ôIâuà4ã&ã5 í�6"ò ä ñ=ß&ï\á?ñQê&ç_ã&ö-Ý ê�Ú87$7�9"ê�ð�é�òzÝ":�ê&î"î�Ý$Ú8;$<'='Ú8;�>*ÝÙ ;AÛ�ÜÝ�Þ,ß'à4ß"áFà-âZà-ã"äRà-ã&å�æOÝ$çFÝ�è-é-áFå&à4ã�ê,ë=ædä ?�÷Fí�á?ï`âZé�å"ïQò ñ,ôIé-áA@�ïAà4á?ã"ä ã&óuô áFé-â ä ã"öAé-âCB

î&ò ïQ÷?ï å&àC÷Fà�ê ùTä ã/ý,Ý 5 Ý�DTÝ�Þ,á?ïQä ã&ï\áHà4ã'å`ì'Ý$E�à-ã"ñ=é�ã�ÿ�ïAå"ñQÝ �\ê�F 0#HGI��&'�&8� )��"�IJK�L���� ����-NMAO��� "�8/N�" � ���&8���8�"�PJK����$ ���-Q.0/��'&1��#2� ê<ûKà-âR6"á?ä å�ó�ï-êRæ[þPSýKß&ï�æ�çeý5 á?ïQñ?ñQê�Ú'7�7�>�ê&ð*é-òzÝ&çUTAS�æ[à$V�ä ã&óC@�ïAà4á?ã&ä ã&óuìXW"ñe÷?ïAâ`ñ 5 áFà-ö\÷?ä öAà-òzê�î"î�Ý":�>'=LYXZ*ÝÙ [CÛ�@RÝ'3`Ý'E�à-ã&ñ=ïQã�ê-û�Ý$@�ä ñ=ñ\6�ï\á?ó�à4ã'å 5 Ý4ì"à-ò à-â`é�ã$ê�ë=øRã"ñ=ïAâR6&ò ïHæ�ïQ÷?ß&é�å"ñ�ôIé4áIE�à-ã&å�ü á?ä ÷=÷FïQã

ú�ä ó�ä ÷�D�ïQöAé-ó-ã&ä ÷?ä é-ã�ê ùdä ã�ì�Ý)3Mí&ã"ó"ê0]HÝ�]&à-ò ò ñ=ä å"ï-ê�è"Ý�ì�^4á?ïAã"ñ=ïAãOà4ã'åcû,Ý)3tà4âZâ ÿ�ïAå"ñQÝ �Fê�����������_����&a`A "��b��c�! ���.��(-)����d*��8 ��$��'�8���-e�8� ê'çeø:ø:ø ê�Ú'7�7�;�ê'î"î�Ý"[$[�['=�[�9L;*ÝÙ 94ÛfDTÝ_@�ä ÷=÷?ò ïOà4ã'å�ú/ÝgD�í�6"ä ã�ê .0&1��&'���'&'���$�"� � ��"�(/��'���h`f� &8Ojik���'�'���-ml���&1� ê�è�é-ß&ãn ä ò ï�W[à4ã'ådì�é�ã"ñQê�o�ï\üNpHé4áaV'ê$Ú87$Y�>�ÝÙ ZAÛZú�Ý�D,í6&ä ã$ê ik� �(&'�(GI�(�q�1#%GI��&1��&'�( )+�! "�2�r )��8��1Gs ) �1�q��t.������"��/�� ê<è"Ý n ä ò ï�Wà4ã'å[ì�é�ã&ñQê�o,ï\üNpHé-áuV�ê�Ú87$YX>*ÝÙ :CÛfTtÝ-ý�á?ïQñ=î�ê�ì'Ý�þ�ß&âZà�åMà-ã&åADTÝ$o,ïQí&ã&ïQä ï\áAê�ëeý�áFà-ä ã"ä ã&ó,o,ïAí�áFà-ò�o,ï\÷_üKé-áuV�ñ�ü�ä ÷Fß�ú�ï1v&öAä ïAã*÷

úMà4÷Fà�ê ùyä ã¹è"Ý ú�Ý�ûHéAü à-ã$êtÞtÝ ý�ïAñ?à4í"á?é�à-ã&å¦è"Ýsþ�ò ñ=î�ïQöQ÷?é4ábÿzïCå�ñQÝ �Fê���� ��" �$��w�������������,����! ���#%�&8�( "k*,�8 ������8�'���-x.0/��1&1��#2� ê�ædé4áFó�à4ã43tà4í"ôIâZà-ã"ã 5 í�6&ò ä ñ=ß&ï\á?ñAêç_ã&ö-Ý ê�Ú'7�7�9"ê&ð*é-òzÝ":�ê"î&î�Ý�Ú8;$Y'='Ú'[XZ*ÝÙ >AÛfTtÝ ý�áFïQñ=î�êyDTÝdo,ïQí&ã&ïQä ï\ácà4ã'å¹ì�Ý þ�ß&âZà�å�êuëeøIzZöQä ïQã�÷�æ�ïQ÷?ß"é�å"ñuôIé-á[úMïAà4ò ä ã&ó�üsä ÷?ß

ædä ñ=ñ=ä ã"ó�úMàC÷\à ä ã/ì�í"î�ï\áFð�ä ñ=ïAåP@�ïAàCáFã"ä ã&ó�ê ùMä ã/ÞtÝ-ý�ïAñ?à4í"á?é"ê-ú/ÝCý�é�í�á?ïQ÷u{1VLWTà-ã&åtý,Ý�@�ïAïQãÿzïAå"ñQÝ �Fê)�r� ���" �$��f�(N�r����8�"�|����! "�$#%��&'�( )N*,�� ��$��'�'���-}.0/��1&'�#2� ê&ýKß&ï/æ�çeý5 á?ïQñ?ñQê�Ú'7�7�Z�ê&ð*é-òzÝ)>�ê&î&î�Ý�:$Y�7'=�:$7$:�Ý

Page 159: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

143

0

0.2

0.4

0.6

0.8

1Personal Profile

Validation sample

Trai

ning

set

siz

e

20 40 60 80 100

25

50

75

100 0

0.2

0.4

0.6

0.8

1Day Profile

Validation sample

Trai

ning

set

siz

e

20 40 60 80 100

25

50

75

100

0

0.2

0.4

0.6

0.8

1Complete Diary Profile

Validation sample

Trai

ning

set

siz

e

20 40 60 80 100

25

50

75

100 0

0.2

0.4

0.6

0.8

1Voting

Validation sample

Trai

ning

set

siz

e

20 40 60 80 100

25

50

75

100

0

0.2

0.4

0.6

0.8

1Personal Profile

Validation sample

K

20 40 60 80 100

25

50

75

100 0

0.2

0.4

0.6

0.8

1Day Profile

Validation sample

K

20 40 60 80 100

25

50

75

100

0

0.2

0.4

0.6

0.8

1Complete Diary Profile

Validation sample

K

20 40 60 80 100

25

50

75

100 0

0.2

0.4

0.6

0.8

1Voting

Validation sample

K

20 40 60 80 100

25

50

75

100

~ � �$��u�c�L�|�I�\�a���y�a���u�P�(�$���� �)�1�u�1�L�y��u���"� �1�y� ���u���R�P�����a��A����)���|���������� �$�u�a�|�$����� ��u���f�P�y�����"�$�\�u����������C��� ���u�a���e~�� �����a�1�c�u��8 ¡��1�)�1�"��1��¢�£4���¤�u�a��� ��� ��%�\�1�c�\� ¥1�r������\�)�1¢'� �"¢y¦���� � ���8�u� �$���u�$�C��� �1�1�K§d��y�f���u�������_�u�����u�¤�u���y�a���f�y�\���d�$� ¦$�$� � ���8�a� �����u���f�� �'�1�

Page 160: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial
Page 161: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

A P P E N D I X C

Contribution: KES 2001

This appendix contain the article Hierarchical Clustering for Data mining. ,in International Journal of Knowledge-Based Intelligent Engineering Systems,volume: 6(1), pages: 56-62. Author list: J. Larsen, A. Szymkowiak-Have, L.K.Hansen. Own contribution approximated to 30%.

Page 162: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

146 Contribution: KES 2001

Hierarchical Clustering for DataminingAnna Szymkowiak, Jan Larsen, Lars Kai Hansen

Informatics and Mathematical Modeling Richard Petersens Plads, Build. 321,Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark,

Web: http://eivind.imm.dtu.dk, Email: asz,jl,[email protected]

Abstract. This paper presents hierarchical probabilistic clustering methods for unsu-pervised and supervised learning in datamining applications. The probabilistic clus-tering is based on the previously suggested Generalizable Gaussian Mixture model.A soft version of the Generalizable Gaussian Mixture model is also discussed. Theproposed hierarchical scheme is agglomerative and based on a ¨"© distance metric.Unsupervised and supervised schemes are successfully tested on artificially data andfor segmention of e-mails.

1 Introduction

Hierarchical methods for unsupervised and supervised datamining give multilevel descriptionof data. It is relevant for many applications related to information extraction, retrieval navi-gation and organization, see e.g., [1, 2]. Many different approaches to hierarchical analysisfrom divisive to agglomerative clustering have been suggested and recent developments in-clude [3, 4, 5, 6, 7]. We focus on agglomerative probabilistic clustering from Gaussian densitymixtures. The probabilistic scheme enables automatic detection of the final hierarchy level.In order to provide a meaningful description of the clusters we suggest two interpretationtechniques: 1) listing of prototypical data examples from the cluster, and 2) listing of typicalfeatures associated with the cluster. The Generalizable Gaussian Mixture model (GGM) andthe Soft Generalizable Gaussian mixture model (SGGM) are addressed for supervised andunsupervised learning. Learning from combined sets of labeled and unlabeled data [8, 9] isrelevant in many practical applications due to the fact that labeled examples are hard and/orexpensive to obtain, e.g., in document categorization. This paper, however, does not discusssuch aspects. The GGM and SGGM models estimate parameters of the Gaussian clusters witha modified EM procedure from two disjoint sets of observations that ensures high generaliza-tion ability. The optimum number of clusters in the mixture is determined automatically byminimizing the generalization error [10].

This paper focuses on applications to textmining [8, 10, 11, 12, 13, 14, 15, 16] withthe objective of categorizing text according to topic, spotting new topics or providing short,easy and understandable interpretation of larger text blocks; in a broader sense to createintelligent search engines and to provide understanding of documents or content of web-pages like Yahoo’s ontologies.

2 The Generalizable Gaussian Mixture Model

The first step in our approach for probabilistic clustering is a flexible and universal Gaussianmixture density model, the generalizable Gaussian mixture model (GGM) [10, 17, 18], which

Page 163: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

147

models the density for ª -dimensional feature vectors by:

«)¬�­�®�¯±°² ³ ´�µ¶ ¬�·X® «)¬�­g¸ ·X®u¹�«)¬�­_¸ ·X®�¯ º» ¸ ¼1½0¾ ³ ¸�¿\ÀXÁ ¬ÃÂCº¼ ¬�­rÂ�ij ®UÅ"¾�Æ µ³ ¬�­rÂÇÄ ³ ®U® (1)

where «)¬�­_¸ ·X® are the component Gaussians mixed with the non-negative proportions ¶ ¬�·X® ,È °³ ´�µ ¶ ¬�·X® . Each component · is described by the mean vector Ä

³and the covariance matrix¾ ³ . Parameters are estimated with an iterative modified EM algorithm [10] where means are

estimated on one data set, covariances on an independent set, and ¶ ¬�·X® on the combined set.This prevents notorious overfitting problems with the standard approach [19]. The optimumnumber of clusters/components is chosen by minimizing an approximation of the generaliza-tion error; the AIC criterion, which is the negative log-likelihood plus two times the numberof parameters.

For unsupervised learning parameters are estimated from a training set of feature vectorsÉ ¯ËÊ'­0ÌXÍUÎh¯ º ¹Ï¼sÐ�ÐaÐ�¹ÏÑ�Ò , where Ñ is the number of samples. In supervised learningfor classification from a data set of features and class labels

É ¯ÓÊ'­ Ì ¹ÏÔ Ì Ò , where Ô ÌhÕÊ º ¹Ï¼�¹�ÐaÐ�Ðu¹ÏÖ,Ò we adapt one Gaussian mixture, «0¬�­_¸ ÔX® , for each class separately and classifyby Bayes optimal rule by maximizing «0¬(Ô"¸ ­�®d¯w«)¬�­_¸ Ô�® ¶ ¬(Ô�®U× ÈeØÙ ´�µ «0¬�­g¸ Ô�® ¶ ¬(Ô�® (under 1/0loss). This approach is also referred to as mixture discriminant analysis [20].

The GGM can be implemented using either hard or soft assignments of data to com-ponents in each EM iteration step. In the hard GMM approach each data example is as-signed to a cluster by selecting highest «0¬�·)¸ ­�Ì�®�¯Ú«)¬�­�Ì�¸ ·X® ¶ ¬�·X®U×\«)¬�­)Ì�® . Means and co-variances are estimated by classical empirical estimates from data assigned to each com-ponent. In the soft version (SGGM) e.g., the means are estimated as weighted means Ä

³ ¯È Ì «)¬�·)¸ ­ Ì ®�Û1­ Ì × È Ì «)¬�·0¸ ­ Ì ® .Experiments with the hard/soft versions gave the following conclusions. Per iteration the

algorithms are almost identical, however, SGGM requires typically more iteration to con-verge, which is defined by no changes in assignment of examples to clusters. Learning curve1

experiments indicate that hard GGM has slightly better generalization performance for smallÑ while similar behavior for large Ñ - in particular if clusters are well separated.

3 Hierarchical Clustering

In the suggested agglomerative clustering scheme we start by Ü clusters at level Ý ¯ º asgiven by the optimized GGM model of «)¬�­�® , which in the case of supervised learning is«)¬�­ ®�¯ È ØÙ ´�µ È °�Þ

³ ´�µ «0¬�­g¸ ·¹UÔX® ¶ ¬�·L® ¶ ¬(ÔX® , where Ü Ù is the optimal number of components forclass Ô . At each higher level in the hierarchy two clusters are merged based on a similaritymeasure between pairs of clusters. The procedure is repeated until we reach one cluster atthe top level. That is, at level Ý ¯ º there are Ü clusters and 1 cluster at the final level,Ý ¯h¼ Ü Â º . Let «$ß'¬�­_¸ ·X® be the density for the · ’th cluster at level Ý and ¶ ß'¬�·X® as its mixingproportion, i.e., the density model at level Ý is «)¬�­�®d¯ È ° Æ ßÏà µ

³ ´�µ ¶ ß'¬�·X® «$ß'¬(­d¸ ·X® . If clusters ·and á at level Ý are merged into â at level Ý_ã º then

« ßÏà µ ¬�­d¸ â ®�¯ « ß ¬�­d¸ ·X®0Û ¶ ß ¬�·X® ã « ß ¬(­d¸ á ®�Û ¶ ß ¬ á ®¶ ß'¬�·X® ã ¶ ß'¬ á ®¹ ¶ ßUà µ ¬ â ®�¯ ¶ ß ¬�·X® ã ¶ ß ¬ á ® (2)

The natural distance measure between the cluster densities is the Kullback-Leibler (KL) di-vergence [19], since it reflects dissimilarity between the densities in the probabilistic space.The drawback is that KL only obtains an analytical expression for the first level in the

1Generalization error as as function of number of examples.

Page 164: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

148 Contribution: KES 2001

hierarchy while distances for the subsequently levels have to be approximated [17, 18].Another approach is to base distance measure on the ägå norm for the densities [21], i.e.,æfç�èéÏêRëdìxíCç î�ï1ç�ð_ñ èXë�òcî$ï'ç�ð_ñ êcëUë åó'ô

whereè

andê

index two different clusters. Due toMinkowksi’s inequality

æfç�èéÏêRëis a distance measure. Let õ ì¡ö$÷�éÏøXéaù�ù1ùéÏú�û

be the setof cluster indices and define disjoint subsets õ0ü|ýRõ�þ ìQÿ , õ"ü��eõ and õ�þ��hõ , where õ"ü ,õ þ contain the indices of clusters which constitute clusters

èand

êat level � , respectively.

The density of clusterè

is given by:î ï ç�ð_ñ èLëdì���� �� � � î)ç�ð_ñ �Ãë

,� � ì��Pç���ë������ ���� ��ç���ë

if��� õ ü , and zero otherwise.î�ï'ç�ð_ñ êcë�ì � � ����� � î)ç�ðdñ ��ë

, where� �

obtains a similar definition.According to [21] the Gaussian integral

í|î)ç�ð_ñ �Ãë î)ç�ðgñ ��ë ó8ô ì��Pç � � ò!��"\é# �%$ # " ë, where�Pç �_é�#yë ìqç�ø'&)ë�(*)�+ å ù1ñ #cñ ,-+ å ù/.021)ç!ò3�546# (7, �5�8ø�ë

. Define the vectors 8 ì ö9� � û, : ìwö'� � û

ofdimension

úand the

ú�;,úsymmetric matrix < ì ö9� � " û

with� � " ì=�Pç � � ò>� " é�# � $ # " ë

,then the distance can be then written as

æfç�èéÏêRë ìqç 8 ò : ë 4 < ç 8 ò : ë . Figure 1 illustratesthe hierarchical clustering for Gaussian distributed toy data.

A unique feature of probabilistic clustering is the ability to provide optimal cluster andlevel assignment for new data examples which have not been used for training.

ðis assigned

to clusterè

at level � ifî ï ç�è)ñ ð�ë3?�@

where the threshold@

typically is set to A*B C . The proce-dure ensures that the example is assigned to a wrong cluster with probability 0.1.

Interpretation of clusters is done by generating likely examples from the cluster, seefurther [17]. For the first level in the hierarchy where distributions are Gaussian this isdone by drawing examples from a super-eliptical region around the mean value, i.e.,

ç�ð%ò�EDaë 4 # (7,D ç�ð�òF� D�ëHGconst. For clusters at higher levels in the hierarchy samples are drawn

from each Gaussian cluster with proportions specified by�Pç�èXë

.

−4 −2 0 2 4 6 8−6

−4

−2

0

2

4

6

8

1 µ

12 µ

23 µ

34 µ

4

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

1 4 2 3

Dis

tanc

e

Cluster

Figure 1: Hierarchical clustering example. Left panel is a scatter plot of the data. Clusters 1,2 and 4 have widedistributions while 3 has a narrow one. Since the distance is based on the shape of the distribution and notonly its mean location, clusters 1 and 4 are much closer than any of these to cluster 3. Right panel presents thedendrogram.

4 Experiments

The hierarchical clustering is illustrated for segmentation of e-mails. Define term-vector asa complete set of the unique words occurring in all the emails. An email histogram is thevector containing frequency of occurrence of each word from the term-vector and defines thecontent of the email. The term-document matrix is then the collection of histograms for allemails in the database. After suitable preprocessing2 the term-document matrix contains 1405(702 for training and 703 for testing) e-mail documents, and the term-vector 7798 words. Theemails where annotated into the categories: conference, job and spam. It is possible to model

2Words which are too likely or too unlikely are removed. Further only word stems are kept.

Page 165: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

149

directly from this matrix [8, 15], however we deploy Latent Semantic Indexing (LSI) [22]which operates from a latent space of feature vectors. These are found by projecting term-vectors into a subspace spanned by the left eigenvectors associated with largest singular valueof a singular value decomposition of the term-document matrix. We are currently investigat-ing methods for automatic determination of the subspace dimension based on generalizationconcepts. We found that a I dimensional subspace provides good performance using SGGM.

A typical result of running supervised learning is depicted in Figure 2. Using supervisedlearning provides a better resemblance with the correct categories at the level in the hierarchyas compared with unsupervised learning. However, since labeled examples often are lackingor few the hierarchy provides a good multilevel description of the data with associated inter-pretations. Finding typical features as described on page 3 and back-projecting into originalterm-space provides keywords for each cluster as given in Table 1.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 210

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Cluster

Pro

babi

lity

Confusion at the hierarchy level 1

ConferenceJob Spam

1 11 390

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cluster

Pro

babi

lity

Confusion at the hierarchy level 19

ConferenceJob Spam

19 21 15 16 18 14 17 12 4 3 2 13 9 7 20 10 8 6 5 11 110

4

106

108

1010

1012

1014

Cluster

Dis

tanc

e

1 3 6 9 12 15 18 21 24 27 30 33 36 39 410

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18Distribution of test set emails

Pro

babi

lity

Cluster

Figure 2: Supervised hierarchical clustering. Upper rows show the confusion of clusters with the annotatedemail labels on the training set at the first level and the level where 3 clusters remains, corresponding to the threecategories conference, job and spam. At level 1 clusters 1,11,17,20 have big resemblance with the categories. Inparticular spam are distributed among 3 clusters. At level 19 there is a high resemblance with the categories andthe average probability of erroneous category on the test set is 0.71. The lower left panel shows the dendrogramassociated with the clustering. The lower right panel shows the histogram of cluster assignments for test data,cf. page 3. Clearly some samples obtain a reliable description at the first level (1–21) in the hierarchy, whereasothers are reliable at a higher level (22–41).

5 Conclusions

This paper presented a probabilistic agglomerative hierarchical clustering algorithm basedon the generalizable Gaussian mixture model and a JHK metric in probabilty density space.This leads to a simple algorithm which can be used both for supervised and unsupervisedlearning. In addition, the probabilistic scheme allows for automatic cluster and hierarchy levelassignment for unseen data and further a natural technique for interpretation of the clusters

Page 166: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

150 Contribution: KES 2001

Table 1: Keywords for supervised learning1 research,university,conference 8 neural,model 15 click,remove,hottest,action2 university,neural,research 9 university,interest,computetion 16 free,adult,remove,call3 research,creativity,model 10 research,position,application 17 website,adult,creativity,click4 website,information 11 science,position,fax 18 website,click,remove5 information,program,computation 12 position,fax,website 19 free,call,remove,creativity6 research,science,computer,call 13 research,position,application 20 mac7 website,creativity 14 free,adult,call,website 21 adult,government

1 research,university,conference 11 science,position,fax 39 free,website,cal l,creativity

via prototype examples and features. The algorithm was successfully applied to segmentationof emails.

References

[1] J. Carbonell, Y. Yang and W. Cohen, Special Isssue of Machine Learning on Information Retriceal Intro-duction, Machine Learning 39, (2000) 99–101.

[2] D. Freitag, Machine Learning for Information Extraction in Informal Domains, Machine Learning 39,(2000) 169–202.

[3] C.M. Bishop and M.E. Tipping, A Hierarchical Latent Variable Model for Data Visualisation, IEEE T-PAMI 3, 20 (1998) 281–293.

[4] C. Fraley, Algorithms for Model-Based Hierarchical Clustering, SIAM J. Sci. Comput. 20, 1 (1998) 279–281.

[5] M. Meila and D. Heckerman, An Experimental Comparison of Several Clustering and Initialisation Meth-ods. In: Proc. 14th Conf. on Uncert. in Art. Intel., Morgan Kaufmann, 1998, pp. 386–395.

[6] C. Williams, A MCMC Approach to Hierarchical Mixture Modelling. In: Advances in NIPS 12, 2000, pp.680–686.

[7] N. Vasconcelos and A. Lippmann, Learning Mixture Hierarchies. In: Advances in NIPS 11, 1999, pp.606–612.

[8] K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell Text Classification from Labeled and UnlabeledDocuments using EM, Machine Learning, 39 2–3 (2000) 103–134.

[9] D.J. Miller and H.S. Uyar, A Mixture of Experts Classifier with Learning Based on Both Labelled andUnlabelled Data. In: Advances in NIPS 9, 1997, pp. 571–577.

[10] L.K. Hansen, S. Sigurdsson, T. Kolenda, F.A. Nielsen, U. Kjems and J. Larsen, Modeling Text with Gen-eralizable Gaussian Mixtures. In: Proc. of IEEE ICASSP’2000, vol. 6, 2000, pp. 3494–3497.

[11] C.L. Jr. Isbell and P. Viola, Restructuring Sparse High Dimensional Data for Effective Retrieval. In: Ad-vances in NIPS 11, MIT Press, 1999, pp. 480–486.

[12] T. Kolenda, L.K. Hansen and S. Sigurdsson Indepedent Components in Text. In: Adv. in Indep. Comp.Anal., Springer-Verlag, pp. 241–262, 2001.

[13] T. Honkela, S. Kaski, K. Lagus and T. Kohonen, Websom — self-organizing maps of document collec-tions. In: Proc. of Work. on Self-Organizing Maps, Espoo, Finland, 1997.

[14] E.M. Voorhees, Implementing Agglomerative Hierarchic Clustering Ulgorithms for Use in DocumentRetrieval, Inf. Proc. & Man. 22 6 (1986) 465–476.

[15] A. Vinokourov and M. Girolami, A Probabilistic Framework for the Hierarchic Organization and Classi-fication of Document Collections, submitted for Journal of Intelligent Information Systems, 2001.

[16] A.S. Weigend, E.D. Wiener and J.O. Pedersen Exploiting Hierarchy in Text Categorization, InformationRetrieval, 1 (1999) 193–216.

[17] J. Larsen, L.K. Hansen, A. Szymkowiak, T. Christiansen and T. Kolenda, Webmining: Learning from theWorld Wide Web, Computational Statistics and Data Analysis (2001).

[18] J. Larsen, L.K. Hansen, A. Szymkowiak, T. Christiansen and T. Kolenda, Webmining: Learning from theWorld Wide Web. In: Proc. of Nonlinear Methods and Data Mining, Italy, 2000, pp. 106–125.

[19] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.[20] T. Hastie and R. Tibshirani Discriminant Analysis by Gaussian Mixtures, Jour. Royal Stat. Society - Series

B, 58 1 (1996) 155–176.[21] D. Xu, J.C. Principe, J. Fihser, H.-C. Wu, A Novel Measure for Independent Component Analysis (ICA).

In: Proc. IEEE ICASSP98, vol. 2, 1998, pp. 1161–1164.[22] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, Indexing by Latent Semantic

Analysis, Journ. Amer. Soc. for Inf. Science., 41 (1990) 391–407.

Page 167: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

A P P E N D I X D

Contribution: InternationalKES Journal 2002

This appendix contain the article Probabilistic Hierarchical Clustering with La-beled and Unlabeled Data in Proceedings of KES-2001 Fifth International Con-ference on Knowledge-Based Intelligent Information Engineering Systems &Allied Technologies, pages: 261–265. Author list: A. Szymkowiak, J. Larsen,L.K. Hansen. Own contribution approximated to 80%.

Page 168: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

152 Contribution: International KES Journal 2002

Probabilistic Hierarchical Clustering with Labeled and Unlabeled Data

J. Larsen, A. Szymkowiak, L.K. Hansen

Informatics and Mathematical Modeling, Technical University of DenmarkRichard Petersens Plads, Build. 321, DK-2800 Kongens Lyngby, Denmark

Web: http://eivind.imm.dtu.dk, Emails: jl,asz,[email protected]

Abstract. This paper presents hierarchical proba-bilistic clustering methods for unsupervised and su-pervised learning in datamining applications, wheresupervised learning is performed using both labeledand unlabeled examples. The probabilistic cluster-ing is based on the previously suggested General-izable Gaussian Mixture model and is extended us-ing a modified Expectation Maximization procedurefor learning with both unlabeled and labeled exam-ples. The proposed hierarchical scheme is agglomer-ative and based on probabilistic similarity measures.Here, we compare a LNM dissimilarity measure, er-ror confusion similarity, and accumulated posteriorcluster probability measure. The unsupervised andsupervised schemes are successfully tested on artifi-cially data and for e-mails segmentation.

1 IntroductionHierarchical methods for unsupervised and supervis-ed datamining provide multilevel description of data,which is relevant for many applications related to in-formation extraction, retrieval navigation and organi-zation of information, see e.g., [4, 7]. Many differ-ent approaches to hierarchical analysis from divisiveto agglomerative clustering schemes have been sug-gested, and recent developments include [3, 6, 16,20, 24]. In this paper we focus on agglomerativeprobabilistic clustering from Gaussian density mix-tures based on earlier work [14, 15, 19] but extendedby suggesting and comparing various similarity mea-sures in connection with cluster merging. An advan-tage of using the probabilistic clustering scheme isautomatic detection of the final hierarchy level fornew data not used for training. In order to providea meaningful description of the clusters we suggesttwo interpretation techniques: listing of prototypicaldata examples from the cluster, and listing of typicalfeatures associated with the cluster.

The generalizable Gaussian mixture model (GGM)[8] and the soft generalizable Gaussian mixture model(SGGM) [19] are basic model for supervised and un-supervised learning. We extend this framework to

This research is supported by the Danish Research Councilsthrough Distributed Multimedia Technologies and Applicationsprogram within Center for Multimedia and the Signal and ImageProcessing for Telemedicine (SITE) program.

supervised learning from combined sets of labeledand unlabeled data [9, 17, 18] and present a modi-fied version of the approach in [17] called the unsu-pervised/supervised generalizable Gaussian mixturemodel (USGGM). Supervised learning from combin-ed sets is relevant in many practical applications dueto the fact that labeled examples are hard and/or ex-pensive to obtain, for instance in document catego-rization or medical applications. The models esti-mate parameters of the Gaussian clusters with a mod-ified EM procedure from two disjoint data sets to pre-vent notorious infinite overfit problems and ensuringgood generalization ability. The optimum number ofclusters in the mixture is determined automaticallyby minimizing an estimate of the generalization er-ror [8].

This paper focuses on applications to textmin-ing [8, 11, 12, 13, 18, 22, 21, 23] with the objec-tive of categorizing text according to topic, spottingnew topics or providing short, easy and understand-able interpretation of larger text blocks – in a broadersense to create intelligent search engines and to pro-vide understanding of documents or content of web-pages like Yahoo’s ontologies.

In Section 2, various GGM models for supervisedand unsupervised learning are discussed, in particu-lar we introduce the USGGM algorithm. The hier-archical clustering scheme is discussed in section 3and introduces three similarity measures for clustermerging. Finally, Section 4 provide numerical expe-riments for segmentation of e-mails.

2 The Generalizable GaussianMixture Model

The first step in our approach for probabilistic clus-tering is a flexible and universal extension of Gaus-sian mixture density model, the generalizable Gaus-sian mixture model [8, 14, 15, 19] with the aim ofsupervised learning from unlabeled and labeled data.Define O as the P -dimensional input feature vectorand the associated output, QHRTSVU�W-XYW/Z/Z�Z-W-[�\ , of classlabels, assuming [ mutually exclusive classes. Thejoint input/output density is modeled as the Gaussian

Page 169: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

153

mixture in [17]1]2^ _a` bNc dae7fhgij k*l'm ^ _nc o'e ]2^ b6c o'e m ^ o'e (1)]2^ bNc o'e7f (2)pq c r�s*t j c�uwv'x ynz pr ^ b z|{ j e }%t5~ lj ^ b z�{ j e �where � is the number of components, ]2^ bNc o'e arethe component Gaussians mixed with the non-negati-ve priors m ^ o'e , � g j k*l m ^ o'e7f p

and the class-clusterposteriors m ^ _nc o'e , ����/knl m ^ _nc o'e7f p

. The o ’th Gaus-sian component is described by the mean vector

{ jand the covariance matrix t j

. d is the vector of allmodel parameters, i.e., dH��� m ^ _nc o'ew` { j `-t j ` m ^ o'e��� o�`�_�� . Since the Gaussian mixture is an universalapproximator, the model Eq. (1) is rather flexible.One restriction, however, is that the joint input/outputfor each components is assumed to factorize, i.e.,]2^ _�` bNc o'e7f m ^ _ac o'e ]2^ b6c o'e .

The input density associated with Eq. (1) is givenby ]2^ b6c d��Ye7f �i�wk*l ]2^ _a` b7e7fhgij k*l ]2^ b6c o'e m ^ o'e-` (3)

where d9�E��� { j `-t j ` m ^ o'e6� � o�`�_�� . Assuming a 0/1 loss function the optimal Bayes classification ruleis �_Ef��5� v � m ^ _ac b7e where2

m ^ _nc b%e7f ]2^ _�`-b7e]2^ b%e f gij k*l'm ^ _nc o'e m ^ oac b7e (4)

with m ^ oac b7e7f�]2^ b6c o'e m ^ o9e��-]2^ b7e .Define the data set of unlabeled examples � �Hf��b2�a����f p `-rY`/�/�/�/` ���'� and a set of labeled exam-

ples ��� f���b2�a`�_�������f p `-rY`��/�/�w`-� � � . The objectiveis to estimate d from the combined set � f � �Y� � �with ��f � �n¡ ��� examples ensuring high gener-alizability. If no labeled data are available we canmerely perform unsupervised learning of d�� , how-ever, if a number of labeled data are available, esti-mation from both data sets is possible as ]2^ _nc b7e and]2^ b7e share the model parameters d�� [9]. The neg-ative log-likelihood for the data sets, which are as-sumed to consist of independent examples, is givenby ¢ f z3£ ¤�¥ ]2^ � c dae (5)z i�V¦§7¨ £ ¤�¥ gij k*l m ^ _ � c o'e ]2^ b � c o'e m ^ o'e

zª© i�Y¦§2« £ ¤�¥ gij knl ]2^ b2�nc o'e m ^ o'ewhere ¬�­ © ­ p

is a discount factor. If the modelis unbiased (realizable), the estimation dn� from ei-ther labeled or unlabeled data will result in identical

1In [17] referred to as the generalized mixture model.2The dependence on ® is omitted.

optimal setting and thus© f p

is optimal. On theother hand, in the typical case of a biased mode, it isadvantageous to discount the influence of unlabeleddata [9, 18].

Initialization1. Choose values for � and ¬E­ © ­ p

.2. Let ¯ be � different randomly selected indices

from � p `wrY`/�/�/�w` �|� , and set{ j f>b2° ± .

3. Let t�²�f�� ~ l � �V¦§ ^ b7� z³{ ² ew^ b7� z�{ ² e } ,where

{ ² f´� ~ l � �Y¦§ b7� , and set� o!�t j f�t�² .

4. Set� o5� m ^ o'e7f p � � .

5. Compute class prior probabilities: m ^ _'e�f�|~ l� � �Y¦§2¨µ ^ _�� z _'e , where µ ^ ¶Ye·f pif¶�f ¬ , and zero otherwise. Set

� o5� m ^ _nc o9e2fm ^ _'e .

6. Select a split ratio ¬3¸�¹�¸ p. Split the unla-

beled data set into disjoint sets as � � f � ��º l �� �º » , with c � ��º l c|f½¼ ¹ ���¾ and c � ��º »�c|f� � z c � �º l c . Do similar splitting for the la-beled data set ��� f �¿� º l � �¿� º » .

Repeat until convergence1. Compute posterior component probabilities:]2^ oac b � e3fÀ]2^ b � c o'e m ^ o'e�� � j ]2^ b � c o'e m ^ o'e ,

for all �TÁ � � , and for all �|Á ��� ,]2^ oac _V��` b2�9e7f m ^ _��ac o'e ]2^ b2�nc o'e m ^ o'e� jm ^ _��ac o'e ]2^ b7�ac o'e m ^ o'eYÂ

2. For all o update means{ j fi�V¦�§2¨ Ã Ä b2� m ^ oac _���` b2�9e ¡ © i�V¦�§*« Ã Ä b7� m ^ oac b7�'ei�V¦�§2¨ Ã Ä m ^ oac _���` b2�9e ¡ © i�V¦�§*« Ã Ä m ^ oac b7�'e3. For all o update covariance matricest j fi�V¦§7¨ à Å9Æ

j � m ^ oac _���` b7�'e ¡ © i�Y¦§*« à ŠÆj � m ^ oac b7�'ei�Y¦§7¨ à Šm ^ oac _���` b7�'e ¡ © i�Y¦§*« à Šm ^ oac b7�'e

where Æj �5f�^ b7� zE{ j ew^ b2� zE{ j e } . Perform

a regularization of t j(see text).

4. For all o update cluster priors

m ^ o'e7f i�Y¦§7¨ m ^ oac _ � ` b � e ¡ © i�Y¦§2« m ^ oac b � e� �'¡ © ���5. For all o update class cluster posteriors

m ^ _nc o'e2f i�Y¦§7¨ µ ^ _ z _ � e m ^ oac _ � ` b � ei�Y¦§2¨ m ^ oac _���`-b2�'eFigure 1: The USGGM algorithm.

Page 170: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

154 Contribution: International KES Journal 2002

2.1 The USGGM AlgorithmThe model parameters are estimated with an itera-tive modified EM algorithm [8], where means andcovariance matrices are estimated from independentdata sets, and Ç5È ÉnÊ Ë'Ì , ÇEÈ Ë'Ì from the combined set.This approach prevents overfitting problems with thestandard approach [2]. It is designated the generaliz-able Gaussian mixture model with labeled and unla-beled data (USGGM) and may be viewed as an ex-tension of the EM-I algorithm suggested in [17]. TheGGM can be implemented using either hard or softassignments of data to components in each EM it-eration step. In the hard GGM approach each dataexample is assigned to a cluster by selecting high-est Ç5È ËaÊ Í7Ì . Means and covariances are estimatedby classical empirical estimates from data assignedto each component. In the soft version (SGGM) [19]means and covariances are estimated as weightedquantities, e.g., Î6Ï¿Ð�Ñ>Ò*Ó2È ËaÊ Í Ò Ì�Í Ò9Ô Ñ�Ò*Ó2È ËaÊ Í Ò Ì .GGM provides a biased estimate, which gives bet-ter results for small data sets [19], however, in gen-eral the soft version is preferred. The USGGM al-gorithm is summarized in Fig. 1 and is based on thesoft approach. The main iteration loop is aborted3

when no change in example cluster assignment isnoticed. Labeled examples are assigned to clustersË Ò Ð>ÕÖ ×%ØEÕ�Ù Ï ÇEÈ ËaÊ É Ò�Ú Í Ò Ì , ÛTÜ3Ý¿Þ , and unlabeledto Ë Ò Ð�ÕÖ-×7Ø5Õ�Ù Ï Ç5È ËaÊ Í Ò Ì , ÛßÜ�Ý�à . In contrast toEM algorithms there is no guarantee that each itera-tion leads to improved likelihood, however, practicalexperience indicates that the updating scheme is suf-ficiently robust. Potential poor conditioned covari-ance matrices for clusters where few examples areassigned is avoided by regularizing towards the over-all input covariance matrix á�â (defined in Fig. 1) asá Ï�ã á ÏNäßå á�â . å is selected as the smallest pos-itive number, which ensures that the resulting condi-tion number is smaller than æ Ô È ç èéwÌ , where é is thefloating point machine precision.

Essential algorithm parameters are the numberof components ê and the weighting factor ë . Inprinciple, these parameters should be chosen as tomaximize generalization performance. One methodis to pick ê and ë so that the cross-validation es-timate of the classification error is minimized. Aless computational cumbersome method is to selectê based on the AIC estimate of the generalizationerror [1, 8, 19], which is the negative log-likelihoodplus the number of parameters in the model, ê·È ç�È ç äæ�Ì Ô�ì äHí Ì�î3æ . The only remaining algorithm param-eter to determine is the split ratio ï , which in princi-ple also should be selected to achieve high general-ization performance. Practical simulations show thatï3Ð�ð'ñ ò is a proper choice in most cases.

3Convergence criteria based on changes in the negative log-likelihood can also be formulated.

2.2 Unsupervised GGM ModelIf only input data are available one has to perform un-supervised learning. In this case the object of mod-eling is the input density Eq. 3, which can be trainedusing the SGGM algorithm4 [19].

2.3 Supervised GGM ModelClearly USGGM can be used in the case of no un-labeled examples. Another choice is to use sepa-rate GGM models for the class conditional input den-sities, i.e., Ó2È Í6Ê É9̷дÑ>ó%ôÏwõnö Ó2È Í6Ê É Ú Ë'Ì ÇEÈ ËaÊ É9Ì withÓ2È Í6Ê É Ú Ë'Ì defined by Eq. (2) and where êH÷ is thenumber of components. Using Bayes optimal ruleand assuming a 1/0 loss function, classification isdone by maximizing Ó2È ÉnÊ Í%̽ÐøÓ2È Í6Ê É9Ì Ç5È É9Ì ÔÑ>ù÷ õ*ö Ó2È ÍNÊ É9Ì Ç5È É9Ì . The approach is also referredto as mixture discriminant analysis [10] and seemsmore flexible than the model in Eq. (1). However,it does not use discriminative training, i.e., minimiz-ing the classification error or negative log-likelihoodú ÐÀî Ñ Ò�û ü�×nÓ2È É Ò Ê Í Ò�Úwý Ì , where ý are model pa-rameters. Modeling instead Ó2È Í6Ê É'Ì will provide rea-sonable estimates of Ó2È ÉnÊ Í7Ì in the entire input space,whereas discriminative learning will use the data toobtain relatively better estimates of Ó2È ÉnÊ Í7Ì close tothe decision boundaries. The model in Eq. (1) de-scribes the joint input-class probability Ó2È É Ú Í7Ì�ÐÓ2È ÉnÊ Í%Ì Ó2È Í%Ì and may be interpreted as a partial dis-criminative estimation procedure.

3 Hierarchical ClusteringIn the case of unsupervised learning, i.e., learningÓ2È Í%Ì , hierarchical clustering concerns identifying ahierarchical structure of clusters in the feature spaceÍ . In the suggested agglomerative clustering schemewe start by ê clusters at level þEÐ�æ as given by theoptimized GGM model of Ó2È Í7Ì . At each higher levelin the hierarchy two clusters are merged based ona similarity measure between pairs of clusters. Theprocedure is repeated until we reach one cluster at thetop level. That is, at level þ�Ð�æ there are ê clusters,and one cluster at the final level, þ�Ð>ê .

For supervised learning one can either identifya hierarchical structure common for all classes, i.e.,working from the associated input density Ó2È Í%Ì , oridentifying individual hierarchies for each class byworking from the class conditional input densitiesÓ2È Í6Ê É'Ì . For the model in Eq. (1) Ó2È Í7Ì is given byEq. (3) andÓ2È Í6Ê É'Ì7Ð Ó2È É Ú Í7ÌÇEÈ É'Ì Ð óÿÏwõ*ö Ó2È Í6Ê Ë'Ì�ÇEÈ ËaÊ É9Ì (6)

4The SGGM is similar to USGGM in Fig. 1 and is essentiallyobtained by setting ����� , neglecting steps 5 of the initializationand main iteration loop, and further neglecting sums over labeleddata.

Page 171: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

155

where ��� � � ������� � � � ��� � ����������� � � � ��� � � . Let� � � �� ����� � be the density for the � ’th cluster at level�, and � � � � � � the mixing proportion, which in the

general case both may depend on � . Further, the(class conditional) density model at level

�is� � � � �!�"��#%$ ��&('�*) ' � � � � �+� � � � �� ���� � . If clusters,

and - at level�

are merged into . at level��/10

then� ��&(' � �� ���2.���� (7)� � � �� ��� , �2� � � , �+� / � � � �� ���2-!� � � � -� � �� � � , � � / � � � -3 �+� �� ��&(' � .* � ����� � � , � � / � � � -3 �+�4 (8)

3.1 Level AssignmentA unique feature of probabilistic clustering is the abil-ity to provide optimal cluster and level assignmentfor new data examples, which have not been used fortraining. � is assigned to cluster � at level

�if� � � � �������� � � � �� ���� � ��� � � �� � � � � 576 (9)

where the threshold 6 typically is set to 8 4 9 . Theprocedure ensures that the example is assigned to awrong cluster with probability 0.1.

3.2 Cluster InterpretationInterpretation of clusters is done by generating likelyexamples from the cluster [14, 19] and displayingprototype examples and/or typical features. For thefirst level in the hierarchy in which distributions areGaussian, prototype examples are identified as thosewho has highest density values. For clusters at higherlevels in the hierarchy, prototype samples are drawnfrom each Gaussian cluster with proportions speci-fied by ��� � � or ��� � �+� . Typical features are in thefirst level found by drawing ancillary examples froma super-eliptical region around the mean value, i.e.,� ��:<; � ��=�> $ '� � �3:<; � �@? const., and then listingassociated typical features, e.g., keywords as demon-strated in Sec. 4. At higher levels we proceed as de-scribed above.

3.3 Similarity measuresMany different similarity measures may be appliedin the framework of hierarchical clustering. The nat-ural distance measure between the cluster densities isthe Kullback-Leibler (KL) divergence [2], since it re-flects dissimilarity between the densities in the prob-abilistic space. The drawback is that KL only obtainsan analytic expression for the first level in the hierar-chy, while distances for the subsequent levels haveto be approximated [14, 15]. Consequently, we con-sider three different measures, which express similar-ity in probability space for models of � � ��� or � � �� �+�(cf. Sec. 3) and can be computed exactly at all levels

in the hierarchy5. Fig. 2 illustrates the hierarchicalclustering for Gaussian distributed toy data.

3.3.1 A�B Dissimilarity MeasureThe C D distance for the densities [25] is definedE � , �2-!��� F � � � � �� , �(: � � � �� -!�2� DGIH (10)

where,

and - index two different clusters. Due toMinkowksi’s inequality,

E � , �2-!� is a distance mea-sure, which also will be referred to as dissimilar-ity. Let J"�LK 0 ��MN�POQOPO���R�S be the set of clusterindices and define disjoint subsets J(T!UVJ�WX�ZY ,J[TX\]J and J�W^\_J , where J[T , J+W contain theindices of clusters, which constitute clusters

,and- at level

�, respectively. The density of cluster,

is given by: � � � �� , �3�`��a bdcIe f a � � � .2� , f a ���� .2�2���ga bdcIe ��� .2� if .<hiJ[T , and zero otherwise.� � � �� -!��� �ga bdcdjlk a � � �� .�� , where k a obtains a sim-ilar definition. According to [25], the Gaussian inte-gral is given bym � � �� n � � � �� oP� GpH �<qr� ;ts�:!; uP��> s / > u � , whereqr� ;@��>v���r� Mdw(� $x*y D Od >z ' y D O�{*|N}(�2:�; = > $ ' ; �IMI� .Define the vectors ~��XKdf a S , ���XK k a S of dimen-sion R and the Rr� R symmetric matrix ����Kd� sPu Swith � s*u �<qr� ; s :�; u ��> s / > u � , then the distancecan be then written as

E � , �2-!�l��� ~�:V�%��=���� ~�:�t� . It turns out (see Fig. 2) that it is important toinclude the prior of the component in the dissimi-larity measure. The modified C D is then given by�E � , �2-!��� m � � � � �� , � � � � , ��: � � � � -!� � � � -!�2� DGIH ,which easily can be computed using a modified ma-trix

�� sPu ����� nN�2��� o*�2� s*u .3.4 Cluster Confusion Similarity

MeasureAnother natural principle is based on merging clus-ters, which have the highest confusion. Thus, whenmerging two clusters, the similarity is the probabil-ity of misassignment (PMA) when drawing exam-ples from the two clusters seperately. Let � be anexample from cluster � � denoted by ��h3� � and let-_�r�p�������d| � ��� � ��� be the model estimate of thecluster, then the PMA for all

,���g- is given by:� � , �2-!������� ,����-���� (11)FN�l� � � �� , � ��� , � G � / F���� � � �� -!� ��� -!� G �where ������Kd�3�I-X���p�������d| � ��� � ����S and like-wise for ��� . In general,

� � , �2-!� can not be com-puted analytically, but can be approximated arbitrar-ily accurately by using an ancillary set of data sam-ples drawn from the estimated model. That is, ran-domly select a cluster . with probability ��� .2� , draw asample from � � �� .2� and compute the estimated clus-ter� ���p�������d| � ��� � ��� . Then estimate ��� ,����-��

5In the following sections we omit the possible dependence on� for notational convenience.

Page 172: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

156 Contribution: International KES Journal 2002

as the fraction of samples where � ���g�l v¡v��¢�£ or� ¡����� ��(�g¢!£ .3.5 Sample Dependent Similarity Mea-

sureInstead of constructing a fixed hierarchy for visual-ization and interpretation of new data a sample de-pendent hierarchy can be obtained by merging a num-ber of clusters relevant for a new data sample ¤ . Theidea is based on level assignment described inSec. 3.1. Let ¥�� ¦§ ¤�£ , ¦z�©¨Iª�«NªP¬P¬Q¬*ª�­ , be the com-puted posteriors ranked in descending order and com-pute the accumulated posterior ®�� �d£���¯�°±�²(³[¥�� ¦§ ¤�£ .The sample dependent cluster is then formed by merg-ing the fundamental components ¦©�Z¨Iª�«NªP¬P¬Q¬�ª2¢where ¢X�g´�µ ¶ ° ®v� �d£l·7¸ , with e.g., ¸v��¹ º » .

1

2

3

4

Level ¼�½ modified ¼�½ Error confus.2 5= ¾ 1,4 ¿ 2 3 5= ¾ 1,4 ¿ 2 3 5= ¾ 1,4 ¿ 2 33 6= ¾ 1,2,4 ¿ 3 6= ¾ 1,3,4 ¿ 2 6= ¾ 1,3,4 ¿ 2

Figure 2: Hierarchical 2D clustering example with 4 Gaus-sian clusters. 1 and 4 have wide distributions, 2 more nar-row, and 3 extremely peaked. The priors are À%Á Â�ÃlÄ�ÅpÆ Çfor ÂgÄÉÈQÊ�ËdÊ�Ç and À@Á ÇdÃ�Ä^ÅpÆ È . The table shows theconstruction of higher-level clusters, e.g., the ¼ ½ distancemeasure groups clusters 1 and 4 at level 2, which is due tothe fact that distance is based on the shape of the distribu-tion and not only its mean. This also applies to the otherdissimilarity measures. At level 3, however, the ¼ ½ methodabsorbs cluster 4 into 5 to form cluster 6. The other meth-ods absorbs cluster 3 at this stage. The reason is that theprior of cluster 3 is rather low, which is neglected in the ¼�½method.

4 ExperimentsThe hierarchical clustering is illustrated for segmen-tation of e-mails. Define the term-vector as a com-plete set of the unique words occurring in all theemails. An email histogram is the vector containingfrequency of occurrence of each word from the term-vector and defines the content of the email. The term-document matrix is then the collection of histograms

for all emails in the database. Suitable preprocessingof the data is required for good performance. Thisconcerns: 1) removing words, which are too likely(stop words) or too unlikely6; 2) keeping only wordstems; and 3) normalizing all histogram vectors tounit length7. After preprocessing the term-documentmatrix contains 1280 (640 for training and 640 fortesting) e-mail documents, and the term-vector con-sists of 1652 words. The emails where annotated intothe categories: conference, job and spam. It is possi-ble to model directly from the term-document matrix,see e.g., [18, 22], however, we deploy the commonlyused framework Latent Semantic Indexing (LSI) [5],which operates using a latent space of feature vec-tors. These are found by projecting term-vectors intoa subspace spanned by the left eigenvectors associ-ated with largest singular values of a singular valuedecomposition of the term-document matrix. We arecurrently investigating methods for automatic deter-mination of the subspace dimension based on gener-alization concepts, however, in this work, the num-ber of subspace components is obtained from an ini-tial study of classification error on a cross-validationset. We found that a Ì dimensional subspace pro-vides good performance. Fig. 3 presents a 3D scatter

−0.4−0.2

00.2

0.4 −0.4−0.2

00.2

0.4−0.6

−0.4

−0.2

0

0.2

PC 1 PC 2

PC

3

Figure 3: 3D scatter plot of the data. Three largest out offive principal components are displayed. Light grey color- conference, black - job, dark grey - spam. Data is wellseparated, however, there exists small confusion betweenjob and conference e-mails.

plot of the first 3 feature dimensions, viz. the largestprincipal components. Data seem to be well sepa-rated, however, parts of job and conference e-mailsare mixed. Fig. 4 shows the performance of theUSGGM algorithm, and in Fig. 5 the hierarchicalrepresentations are illustrated.

6A threshold value for unlikely word up to approx. 100 occur-rences has little influence on classification error. In the simulationthe threshold was set to 40 occurrences.

7Another approach is to normalize the vectors to represent esti-mated probabilities, i.e., let vector sum to one. However, extensiveexperiments indicate that this approach give a feature space, whichis not very appropriate for Gaussian mixture models.

Page 173: IMM - DTU Orbitorbit.dtu.dk/files/5257828/imm2821.pdf · This thesis was prepared at the Informatics and Mathematical Modelling (IMM), Technical University of Denmark (DTU), in partial

157

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

7

10

15

20

25

3035404550

λ

Cla

ssifi

catio

n E

rror

[%]

Nl=10

Nl=20

Nl=50

Nl=100

Nl=200

minima

10 20 50 100 2000

5

10

15

20

25

Nl

Cla

ssifi

catio

n er

ror [

%]

2 4 6 8 10 12 14−7

−6.8

−6.6

−6.4

−6.2

−6

−5.8

−5.6

−5.4

−5.2

−5

K

AIC

Nl=10

Nl=20

Nl=50

Nl=100

Nl=200

minima

Figure 4: Average performance of the USGGM algorithmover 1000 repeated runs using Í�Î3Ï©ÐPÑQÑ unlabeled ex-amples and a variable number of labeled examples ÍtÒ .The algorithm parameter is set to Ó3Ï1ÑpÔ Õ . Upper panelshows the performance as a function of the discount fac-tor Ö for unlabeled examples ( Ö�Ï�Ñ corresponds to nounlabeled data). As expected, if few unlabeled examplesare available, Í Ò Ïr×*ÑpØ�ÐPÑ , the optimal Ö is close to one,and all available unlabeled data are fully used. As Í Ò in-creases Ö decreases towards ÑpÔ Ù for Í Ò ÏgÐPÑQÑ , indicatingthe reduced utility of unlabeled examples. The classifica-tion error is reduced approx. ÐPÚdÛ using unlabeled datafor Í Ò Ï1×*Ñ , gradually decreasing to ×PÛ for Í Ò ÏrÐPÑQÑ .The classification error for optimal Ö as a function of Í Ò isshown in the middle panel. The lower panel shows numberof components selected by the AIC criterion for optimal Öas described in Sec. 2.1. As Í Ò increase, also it is advanta-geous to increase the number of components.

Ü Ý Þ@ß Ý+à ÜNá Keywords

1 .7354information, conference, call,workshop, university

3 .0167remove, address, call, free, busi-ness

14 .2297

call, conference, workshop, in-formation, submission, paper,web

6 .0181research, position, university, in-terest, computation, science

2 .6078research, university, position, in-terest, science, computation, ap-plication, information2

6 .3922research, position, university, in-terest

3 .6301remove, call, address, free, day,business3

5 .3698 free, remove, call

Table 1: Keywords for the USGGM model. Ü Ï�× is con-ference, Ü Ï<Ð is jobs and Ü Ï3Ù is spam.

Typical features as described in Sec. 3.2 and back-projecting into original term-space provides keywordsfor each cluster as given in Tab. 1. In Fig. 5 wechoose to illustrate the hierarchies of individual classdependent densities â�ã ä�å æ+ç using the modified è�édissimilarity only. The cluster confusion measureis computational expensive if little overlap exist asmany ancillary data are required. The modified è�éis computational inexpensive and basically treat dis-similarity as the cluster confusion, while the standardèlé do not incorporate priors. The conference classis dominated by cluster 1. This has keywords listedin Tab. 1, which are in accordance with the meaningof conference. The lower left panel shows the clusterlevel assignment distribution of test set emails, whichare classified as conference emails cf. Sec. 3.1. Someobtain significant interpretation at level 1 (clusters 1-6), while others at a high level (cluster 9). Similarcomments can be made for the jobs and spam classes.

For comparison, we further trained an unsupervised SGGM model, and the results for a typical run are presented in Fig. 6. The top row illustrates the hierarchies formed by using the sample dependent, the modified L2 dissimilarity, and the cluster confusion similarity measures. For the sample dependent measure the numbers on top of the bars indicate the most frequent combinations of first level clusters. Clearly there is a significant resemblance between the sample dependent and the cluster confusion similarity hierarchies, e.g., in the higher level clusters formed from the same combinations of first level clusters. However, inspection of the bottom row panels, which show the cluster confusion with the class labels, indicates that the cluster combinations of the sample dependent method are better aligned with the class labels. The modified L2 provides the best alignment of clusters with class labels at level 8 and is in


Figure 5: Hierarchical clustering using the USGGM model. The left column is class y = 1 conference, the middle column y = 2 jobs, and the right column y = 3 spam. The upper row shows the dendrogram using the modified L2 dissimilarity for each class, and the lower row the histogram of cluster level assignments for test data, cf. Sec. 4.

Cluster keywords at the first level of the hierarchy: 1: free, remove; 2: research, university, position; 3: free, information, address, day; 4: call, remove, address; 5: free, service, site, sit, internet; 6: call, remove, address, business; 7: call, remove; 8: free, address, state, quantity; 9: information, conference, workshop; 10: research, university, interest, position.

Figure 6: Unsupervised SGGM modeling of the email data. The upper row shows the hierarchical structure: the left panel illustrates the sample dependent similarity measure (Sec. 3.5), the middle panel the modified L2 dissimilarity measure (Sec. 3.3.1), and the right panel the cluster confusion measure (Sec. 3.4). The lower row shows the confusion of the clusters with the annotated email labels at the first level in the hierarchy (left panel) and at level 8, where 3 clusters remain, for the modified L2 dissimilarity (middle panel) and the cluster confusion measure (right panel). E.g., the black bars are the fraction of conference labeled test set emails ending up in a particular cluster. In addition, keywords for each cluster of the first level are also provided.
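The lower-row bar plots of Fig. 6 amount to a small contingency computation. A minimal sketch, assuming integer class labels and cluster assignments for the test set and at least one document per class:

import numpy as np

def label_cluster_confusion(labels, assignments, n_classes, n_clusters):
    # Fraction of test documents from each annotated class (rows) that end up
    # in each cluster (columns), i.e. the quantity shown by the bar plots.
    counts = np.zeros((n_classes, n_clusters))
    for y, k in zip(labels, assignments):
        counts[y, k] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy example: three classes, four clusters.
labels      = np.array([0, 0, 0, 1, 1, 2, 2, 2])
assignments = np.array([0, 0, 1, 2, 2, 3, 3, 1])
print(label_cluster_confusion(labels, assignments, n_classes=3, n_clusters=4))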


that respect superior to the other methods for the current data set. The keywords for clusters 2, 10 and 9 provide a perfect description of the jobs and conference emails, respectively. Keywords for the other clusters indicate that these mainly belong to the broad spam category.
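To make the comparison of merging criteria concrete, a minimal sketch of agglomeration driven by a closed-form L2-type dissimilarity between prior-weighted Gaussian components. The moment-matched merge used here is one simple choice made for the sketch only; the paper's modified L2, cluster confusion and sample dependent measures are defined in its Secs. 3.3.1, 3.4 and 3.5.

import numpy as np

def gauss_overlap(m1, S1, m2, S2):
    # Closed form of the Gaussian product integral:
    # int N(x; m1, S1) N(x; m2, S2) dx = N(m1; m2, S1 + S2).
    S = S1 + S2
    diff = m1 - m2
    d = len(m1)
    return np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / \
           np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def l2_dissimilarity(m1, S1, m2, S2, p1=1.0, p2=1.0):
    # Squared L2 distance between two prior-weighted Gaussian components;
    # with p1 = p2 = 1 it is the plain L2 distance between the densities.
    return (p1 * p1 * gauss_overlap(m1, S1, m1, S1)
            + p2 * p2 * gauss_overlap(m2, S2, m2, S2)
            - 2 * p1 * p2 * gauss_overlap(m1, S1, m2, S2))

def agglomerate(means, covs, priors):
    # Greedy bottom-up merging: repeatedly fuse the pair of active components
    # with the smallest dissimilarity, using a moment-matched merge.
    means, covs, priors = list(means), list(covs), list(priors)
    active = list(range(len(means)))
    merges = []
    while len(active) > 1:
        d, i, j = min((l2_dissimilarity(means[i], covs[i], means[j], covs[j],
                                        priors[i], priors[j]), i, j)
                      for a, i in enumerate(active) for j in active[a + 1:])
        w = priors[i] + priors[j]
        m = (priors[i] * means[i] + priors[j] * means[j]) / w
        S = (priors[i] * (covs[i] + np.outer(means[i] - m, means[i] - m))
             + priors[j] * (covs[j] + np.outer(means[j] - m, means[j] - m))) / w
        means[i], covs[i], priors[i] = m, S, w
        active.remove(j)
        merges.append((i, j, d))
    return merges

# Toy example: three two-dimensional components.
means  = [np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([5.0, 5.0])]
covs   = [np.eye(2) * 0.2 for _ in range(3)]
priors = [0.4, 0.4, 0.2]
print(agglomerate(means, covs, priors))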

5 Conclusions

This paper presented probabilistic agglomerative hierarchical clustering schemes based on the introduced unsupervised/supervised generalizable Gaussian mixture model (USGGM), which is an extension of [17]. The ability to learn from both labeled and unlabeled examples is important for many real world applications, e.g., text/web mining and medical decision support. The USGGM was successfully tested on a text mining example concerning segmentation of emails.

Using a probabilistic scheme allows for automatic cluster and hierarchy level assignment for unseen data, and further provides a natural technique for interpretation of the clusters via prototype examples and features. In addition, three different similarity measures for cluster merging were presented and compared.
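The automatic assignment of unseen data mentioned above amounts to evaluating the component posteriors p(k|x) proportional to p(x|k)P(k). A minimal numerical sketch, assuming estimated means, covariances and priors:

import numpy as np

def log_gauss(X, mean, cov):
    # Log density of a multivariate Gaussian evaluated at every row of X.
    d = X.shape[1]
    diff = X - mean
    sol = np.linalg.solve(cov, diff.T).T
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + np.sum(diff * sol, axis=1))

def cluster_posteriors(X, means, covs, priors):
    # p(k | x) proportional to p(x | k) P(k), normalized over the components.
    log_p = np.column_stack([np.log(pk) + log_gauss(X, mk, Sk)
                             for mk, Sk, pk in zip(means, covs, priors)])
    log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

# Toy example: two components; unseen points go to argmax_k p(k | x), and at a
# higher hierarchy level the posteriors of merged components are simply summed.
means  = [np.zeros(2), np.array([3.0, 3.0])]
covs   = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
X_new  = np.array([[0.2, -0.1], [2.8, 3.1]])
print(cluster_posteriors(X_new, means, covs, priors))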

References

[1] H. Akaike: "Fitting Autoregressive Models for Prediction," Ann. of the Inst. of Stat. Math., vol. 21, 1969, pp. 243–247.
[2] C.M. Bishop: Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[3] C.M. Bishop and M.E. Tipping: "A Hierarchical Latent Variable Model for Data Visualisation," IEEE T-PAMI, vol. 3, no. 20, 1998, pp. 281–293.
[4] J. Carbonell, Y. Yang and W. Cohen: "Special Issue of Machine Learning on Information Retrieval Introduction," Machine Learning, vol. 39, 2000, pp. 99–101.
[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman: "Indexing by Latent Semantic Analysis," Journ. Amer. Soc. for Inf. Science, vol. 41, 1990, pp. 391–407.
[6] C. Fraley: "Algorithms for Model-Based Hierarchical Clustering," SIAM J. Sci. Comput., vol. 20, no. 1, 1998, pp. 279–281.
[7] D. Freitag: "Machine Learning for Information Extraction in Informal Domains," Machine Learning, vol. 39, 2000, pp. 169–202.
[8] L.K. Hansen, S. Sigurdsson, T. Kolenda, F.A. Nielsen, U. Kjems and J. Larsen: "Modeling Text with Generalizable Gaussian Mixtures," in Proc. of IEEE ICASSP'2000, vol. 6, 2000, pp. 3494–3497.
[9] L.K. Hansen: "Supervised Learning with Labeled and Unlabeled Data," submitted for publication, 2001.
[10] T. Hastie and R. Tibshirani: "Discriminant Analysis by Gaussian Mixtures," Jour. Royal Stat. Society - Series B, vol. 58, no. 1, 1996, pp. 155–176.
[11] T. Honkela, S. Kaski, K. Lagus and T. Kohonen: "Websom - Self-organizing Maps of Document Collections," in Proc. of Work. on Self-Organizing Maps, Espoo, Finland, 1997.
[12] C.L. Isbell Jr. and P. Viola: "Restructuring Sparse High Dimensional Data for Effective Retrieval," in Advances in NIPS 11, MIT Press, 1999, pp. 480–486.
[13] T. Kolenda, L.K. Hansen and S. Sigurdsson: "Independent Components in Text," in Adv. in Indep. Comp. Anal., Springer-Verlag, 2001, pp. 241–262.
[14] J. Larsen, L.K. Hansen, A. Szymkowiak, T. Christiansen and T. Kolenda: "Webmining: Learning from the World Wide Web," Computational Statistics and Data Analysis, 2001.
[15] J. Larsen, L.K. Hansen, A. Szymkowiak, T. Christiansen and T. Kolenda: "Webmining: Learning from the World Wide Web," in Proc. of Nonlinear Methods and Data Mining, Rome, Italy, 2000, pp. 106–125.
[16] M. Meila and D. Heckerman: "An Experimental Comparison of Several Clustering and Initialisation Methods," in Proc. 14th Conf. on Uncert. in Art. Intel., Morgan Kaufmann, 1998, pp. 386–395.
[17] D.J. Miller and H.S. Uyar: "A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data," in Advances in NIPS 9, 1997, pp. 571–577.
[18] K. Nigam, A.K. McCallum, S. Thrun and T. Mitchell: "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, vol. 39, 2000, pp. 103–134.
[19] A. Szymkowiak, J. Larsen and L.K. Hansen: "Hierarchical Clustering for Datamining," in Proc. 5th Int. Conf. on Knowledge-Based Intelligent Information Engineering Systems and Allied Technologies KES'2001, Osaka and Nara, Japan, 6–8 Sept., 2001.
[20] N. Vasconcelos and A. Lippmann: "Learning Mixture Hierarchies," in Advances in NIPS 11, 1999, pp. 606–612.
[21] E.M. Voorhees: "Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval," Inf. Proc. & Man., vol. 22, no. 6, 1986, pp. 465–476.
[22] A. Vinokourov and M. Girolami: "A Probabilistic Framework for the Hierarchic Organization and Classification of Document Collections," submitted to Journal of Intelligent Information Systems, 2001.
[23] A.S. Weigend, E.D. Wiener and J.O. Pedersen: "Exploiting Hierarchy in Text Categorization," Information Retrieval, vol. 1, 1999, pp. 193–216.
[24] C. Williams: "A MCMC Approach to Hierarchical Mixture Modelling," in Advances in NIPS 12, 2000, pp. 680–686.
[25] D. Xu, J.C. Principe, J. Fisher and H.-C. Wu: "A Novel Measure for Independent Component Analysis (ICA)," in Proc. IEEE ICASSP98, vol. 2, 1998, pp. 1161–1164.


A P P E N D I X E

Contribution: NNSP 2002

This appendix contains the article Clustering of Sun Exposure Measurements, in Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII, editors: D. Miller, T. Adali, J. Larsen, M. Van Hulle and S. Douglas, pages: 727–735. Author list: A. Szymkowiak-Have, J. Larsen, L.K. Hansen, P.A. Philipsen, E. Thieden and H.C. Wulf. Own contribution approximately 50%.


Figure: Probability of observing groups of diary behaviors (working - no sun, holiday - no sun, sun exposure, using sun-block, sunburned) and of Pid values (pid 0-3) within each of the six clusters.

Figure: Cluster probabilities for the ten test persons (person index 8, 23, 27, 35, 48, 189, 213, 242, 246, 251) across clusters 1-6.

A P P E N D I X F

Contribution: special issue CSDA 2002

This appendix contains the article Webmining: Learning from the World Wide Web, in a special issue of Computational Statistics and Data Analysis, volume: 38, pages: 517–532. Author list: J. Larsen, L.K. Hansen, A. Szymkowiak-Have, T. Christiansen and T. Kolenda. Own contribution approximately 10%.


List of Figures

1.1 The steps of the KDD process. . . . . . . . . . . . . . . . . . . . . . 5

2.1 The Random Projection algorithm. . . . . . . . . . . . . . . . . . . . 11

2.2 Illustration of the Random Projection. . . . . . . . . . . . . . . . . . 12

2.3 The Principal Component Analysis algorithm. . . . . . . . . . . . . . 14

2.4 Scatter plot of the 3 Gaussian clusters. . . . . . . . . . . . . . . . . . 16

2.5 Scatter plot of the 3 rings. . . . . . . . . . . . . . . . . . . . . . . . . 16

2.6 Feature plot of the 2 text clusters. . . . . . . . . . . . . . . . . . . . . 16

2.7 Results of the projection for the 3 Gaussian clusters. . . . . . . . . . . 16

2.8 Results of the projection of the 3 rings structure. . . . . . . . . . . . . 17

2.9 Results of the projection of the 2 cluster artificial text example. . . . . 17

2.10 Illustration selecting the number of components. . . . . . . . . . . . . 23

3.1 Expectation-Maximization algorithm . . . . . . . . . . . . . . . . . . 28


3.2 The illustration of the selection of model complexity. . . . . . . . . . 31

3.3 The unsupervised Gaussian Mixture algorithm. . . . . . . . . . . . . 32

3.4 The USGGM algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 36

3.5 The illustration of the outlier sample. . . . . . . . . . . . . . . . . . . 38

3.6 The influence of the one outlier sample . . . . . . . . . . . . . . . . . 39

3.7 The outlier detection with Gaussian mixture. . . . . . . . . . . . . . . 40

3.8 The outlier detection example . . . . . . . . . . . . . . . . . . . . . 41

4.1 The Cluster Confusion similarity measure algorithm. . . . . . . . . . 47

4.2 Scatter plot of the data generated to compare similarity measures. . . 49

4.3 Dendrograms for hierarchies created with different similarity measures. 50

5.1 An example for data missing at random . . . . . . . . . . . . . . . . 57

5.2 An example for data missing completely at random . . . . . . . . . . 57

5.3 The algorithm for KNN model for imputation. . . . . . . . . . . . . . 61

5.4 The algorithm for Gaussian model for imputation. . . . . . . . . . . . 62

7.1 The devise measuring sun radiation the skin is exposed on. . . . . . . 73

7.2 Scatter plots of Email and Newsgroups data sets . . . . . . . . . . . . 77

7.3 Eigenvalue curves for Newsgroups data sets . . . . . . . . . . . . . . 77

7.4 Scatter plot of the Email data with clusters assign. and density contours. 78

7.5 Confusion matrix for Email data set. . . . . . . . . . . . . . . . . . . 79

7.6 Scatter plot of the Email data with cluster structure and density contours. 80


7.7 The dendrograms for Email data set. . . . . . . . . . . . . . . . . . . 81

7.8 Scatter plot of the Newsgroups data clustered with hard UGGM model. 88

7.9 The confusion matrix of the Newsgroups collection. . . . . . . . . . . 89

7.10 Scatter plot of the Newsgroups collection data. . . . . . . . . . . . . 89

7.11 Hierarchical structures for Newsgroups collection. . . . . . . . . . . . 91

7.12 Average performance of the USGGM algorithm . . . . . . . . . . . . 95

7.13 Hierarchical clustering using the USGGM model. . . . . . . . . . . . 96

7.14 Framework for sun-exposure data clustering . . . . . . . . . . . . . . 99

7.15 Scatter plots of the training data in the the sun-exposure collection. . . 100

7.16 Eigenvalue curves for the data in sun-exposure collection. . . . . . . . 101

7.17 The probability of observing certain groups of behaviors in the clusters 106

7.18 Cluster probabilities calculated for the 10 test persons . . . . . . . . . 107

7.19 Learning curves for imputation with the GM and the KNN model. . . 109

7.20 Comparison between GM and KNN for all the profiles . . . . . . . . 110

7.21 Learning curves for GM model shown separately for all 9 blocks. . . . 111

7.22 Learning curves for GM model shown separately for all 9 blocks. . . . 112

7.23 30 validation samples predicted with Gaussian models. . . . . . . . . 113

7.24 30 validation samples predicted with KNN model. . . . . . . . . . . . 113

7.25 The scatter plots of the 5 Gaussian clusters. . . . . . . . . . . . . . . 114

7.26 The generalization error for 5 Gaussian clusters. . . . . . . . . . . . . 116

7.27 The cluster posterior values p(c|z) for 5 Gaussian clusters. . . . . . . 117


7.28 Components of the kernel PCA model for 5 Gaussian clusters. . . . . 117

7.29 The generalization error for Rings structure. . . . . . . . . . . . . . . 118

7.30 The cluster posterior values p(c|z) for Rings structure. . . . . . . . . 118

7.31 Components of the traditional kernel PCA model for Rings structure. . 119

7.32 Generalization error for Email collection . . . . . . . . . . . . . . . . 119

7.33 Confusion matrix for Email collection. . . . . . . . . . . . . . . . . . 120

7.34 Generalization error for Newsgroups collection. . . . . . . . . . . . . 121

7.35 Confusion matrix for Newsgroups collection. . . . . . . . . . . . . . 121

7.36 Misclassification error for Dermatological collection. . . . . . . . . . 123


List of Tables

4.1 The illustration of the Sample Dependent similarity measure. . . . . . 49

4.2 A simple illustration of the confusion matrix idea. . . . . . . . . . . . 51

5.1 An example of 1-out-of-c coding. . . . . . . . . . . . . . . . . . . . . 56

7.1 Questions in the sun-exposure study. . . . . . . . . . . . . . . . . . . 73

7.2 Keywords for 7 clusters obtained from UGGM model for Email data. . 84

7.3 Keywords for hierarchy built with KL similarity measure. . . . . . . 85

7.4 Keywords for hierarchy built with Cluster Confusion similarity measure. 86

7.5 Keywords for hierarchy built with L2 similarity measure. . . . . . . 86

7.6 Keywords for hierarchy built with modified L2 similarity measure. . . 87

7.7 Keywords for 6 clusters from the UGGM model for Newsgroups data. 92

7.8 Keywords for KL hierarchy for Newsgroups data. . . . . . . . . . . . 92

7.9 Keywords for Cluster Confusion hierarchy for Newsgroups data. . . . 93


7.10 Keywords for modified L2 hierarchy for Newsgroups data. . . . . . . 93

7.11 Keywords for modified L2 hierarchy for Newsgroups data. . . . . . . 93

7.12 Keywords for the USGGM model. . . . . . . . . . . . . . . . . . . . 97

7.13 Key-patterns for clustering diary histograms. . . . . . . . . . . . . . . 103

7.14 Key-patterns for clustering diary with UV histograms. . . . . . . . . 103

7.15 Key-patterns for clustering diary histograms with co-occurrence matrix. 104

7.16 Key-patterns for diary, UV histograms and co-occurrence matrix. . . 105

7.17 Error correlation table for Gaussian model and KNN model. . . . . . 112


Bibliography

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. 2nd International Symposium on Information Theory, pages 267–281, 1973.

[2] Hagai Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 21–30, 2000.

[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58, 1988.

[4] Weizhu Bao and Shi Jin. The random projection method. In Advances in Scientific Computing, pages 1–11, 2001.

[5] Matthew J. Beal and Zoubin Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics 7. Oxford University Press, 2003.

[6] Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Winter. On feature distributional clustering for text categorization. In W. Bruce Croft, David J. Harper, Donald H. Kraft, and Justin Zobel, editors, Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, pages 146–153, New Orleans, US, 2001. ACM Press, New York, US.


[7] Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

[8] Y. Bengio, P. Vincent, and J-F. Paiement. Learning eigenfunctions of similarity: Linking spectral clustering and kernel PCA. Technical report, Département d'informatique et recherche opérationnelle, Université de Montréal, 2003.

[9] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Knowledge Discovery and Data Mining, pages 245–250, 2001.

[10] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

[11] Christopher M. Bishop and Michael E. Tipping. A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):281–293, 1998.

[12] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291–294, 1988.

[13] Jaime Carbonell, Yiming Yang, and William Cohen. Special issue of machine learning on information retrieval introduction. Machine Learning, 39:99–101, 2000.

[14] Miguel A. Carreira-Perpinan. A review of dimension reduction techniques. Technical Report CS–96–09, Dept. of Computer Science, University of Sheffield, January 1997.

[15] R. B. Cattell. The scree test for the number of factors. J. Multiv. Behav. Res., 1966.

[16] A. Corduneanu and C.M. Bishop. Variational Bayesian model selection for mixture distributions. In T. Richardson and T. Jaakkola, editors, Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics, pages 27–34, 2001.

[17] Vittorio Castelli and Thomas M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16:105–111, 1995.


[18] J.M. Craddock and C.R. Flood. Eigenvectors for representing the 500 mb geopotential surface over the northern hemisphere. Quarterly Journal of the Royal Meteorological Society, 1969.

[19] Sanjoy Dasgupta. Experiments on random projection. In Proc. 16th Conf. Uncertainty in Artificial Intelligence, 2000.

[20] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

[21] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

[22] K.I. Diamantaras and S.Y. Kung. Principal component neural networks: Theory and applications. John Wiley & Sons Inc., New York, 1996.

[23] George H. Dunteman. Principal components analysis. Sage University Paper Series on Quantitative Applications in Social Sciences, 69, 1989.

[24] H. T. Eastment and W. J. Krzanowski. Cross-validatory choice of the number of components from a principal components analysis. Technometrics, 24:73–77, 1982.

[25] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17:37–54, 1996.

[26] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. Knowledge discovery and data mining: Towards a unifying framework. In Knowledge Discovery and Data Mining, pages 82–88, 1996.

[27] Chris Fraley. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20(1):270–281, 1999.

[28] Dayne Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39:169–202, 2000.

[29] K. Fukunaga and D.R. Olsen. An algorithm for finding intrinsic dimensionality of data. In IEEE Transactions on Computers, volume 20, pages 165–171, 1976.


[30] Zoubin Ghahramani and Michael I. Jordan. Learning from incomplete data. Technical Report 108, MIT Center for Biological and Computational Learning, 1994.

[31] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 120–127. Morgan Kaufmann Publishers, Inc., 1994.

[32] Mark Girolami. Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3):780–784, 2002.

[33] Mark Girolami. Orthogonal series density estimation and the kernel eigenvalue problem. Neural Computation, 14(3):669–688, 2002.

[34] Johannes Grabmeier and Andreas Rudolph. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6(4):303–360, October 2002.

[35] H. Altay Guvenir, Gulsen Demiroz, and Nilsel Ilter. Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13(3):147–165, 1998.

[36] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of data mining. The MIT Press, Cambridge, Massachusetts, August 2001.

[37] L.K. Hansen, J. Larsen, F. Nielsen, S. C. Strother, E. Rostrup, R. Savoy, N. Lange, J. Sidtis, C. Svarer, and O. B. Paulson. Generalizable patterns in neuroimaging: How many principal components? NeuroImage, pages 534–544, 1999.

[38] L.K. Hansen, C. Liisberg, and Peter Salamon. Ensemble methods for recognition of handwritten digits. In S.Y. Kung, F. Fallside, J. Aa. Sørensen, and C.A. Kamm, editors, Proceedings of Neural Networks for Signal Processing, pages 540–549, 1992.

[39] L.K. Hansen, S. Sigurdsson, T. Kolenda, F. Nielsen, U. Kjems, and J. Larsen. Modeling text with generalizable Gaussian mixtures. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume VI, pages 3494–3497, 2000.

[40] Harry H. Harman. Modern Factor Analysis. University of Chicago Press, 2nd edition, 1967.


[41] John Hertz, Anders Krogh, Richard Palmer, and Roderick V. Jensen. Introduction to the theory of neural computation. American Journal of Physics, 62(7), 1994.

[42] Thomas Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual ACM Conference on Research and Development in Information Retrieval, pages 50–57, Berkeley, California, August 1999.

[43] A.J. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86:205–224, 1991.

[44] J. Edward Jackson. A user's guide to principal components. Wiley Series in Probability and Mathematical Statistics, 1991.

[45] I.T. Jolliffe. Principal component analysis. Springer Series in Statistics, 1986.

[46] N. Kambhatla and T.K. Leen. Dimension reduction by local principal component analysis. Neural Computation, 9:1493–1516, 1997.

[47] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3–4):237–253, 2000.

[48] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

[49] T. Kolenda. Adaptive tools in virtual environments: Independent component analysis for multimedia. PhD thesis, Department of Mathematical Modelling, Technical University of Denmark, January 2002.

[50] T. Kolenda, L.K. Hansen, J. Larsen, and O. Winther. Independent component analysis for understanding multimedia content. In H. Bourlard, S. Bengio, J. Larsen, T. Adali, and S. Douglas, editors, Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII, pages 757–766, Piscataway, New Jersey, 2002. IEEE Press. Martigny, Valais, Switzerland, Sept. 4–6, 2002.

[51] Ken Lang. NewsWeeder: learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339. Morgan Kaufmann Publishers Inc.: San Mateo, CA, USA, 1995.

[52] J. Larsen, L.K. Hansen, A. Szymkowiak-Have, T. Christiansen, and T. Kolenda. Webmining: Learning from the world wide web. Special issue of Computational Statistics and Data Analysis, 38:517–532, 2002.


[53] J. Larsen, A. Szymkowiak-Have, and L.K. Hansen. Probabilistic hierarchical clustering with labeled and unlabeled data. International Journal of Knowledge-Based Intelligent Engineering Systems, 6(1):56–62, 2002.

[54] Thorbjørn Larsen and Tobias Tobiasen. Automatisk fejldetektering i ukomplette data. Master's thesis, Department of Mathematical Modelling, Technical University of Denmark, 2001.

[55] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2000.

[56] Steven J. Leon. Linear Algebra with Applications. Prentice Hall, 6th edition, January 2002.

[57] Roderick J.A. Little and Donald B. Rubin. Statistical Analysis With Missing Data. Probability and Mathematical Statistics. John Wiley & Sons, 1987.

[58] Marina Meila and David Heckerman. An experimental comparison of several clustering and initialization methods. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 386–395, 1998.

[59] Marina Meila and Jianbo Shi. Learning segmentation by random walks. In Todd Leen, Thomas Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 873–879. MIT Press, 2000.

[60] D.J. Miller and H.S. Uyar. A mixture of experts classifier with learning based on both labeled and unlabeled data. In Advances in Neural Information Processing Systems, volume 9, pages 571–578, 1997.

[61] Thomas P. Minka. Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems, pages 598–604, 2000.

[62] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, pages 849–856, 2001.

[63] Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, and Tom M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.


[64] E. J. Nyström. Über die praktische Auflösung von linearen Integralgleichungen mit Anwendungen auf Randwertaufgaben der Potentialtheorie. Commentationes Physico-Mathematicae, 4(15):1–52, 1928.

[65] E. Oja. Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1(1):61–68, 1989.

[66] M. Partridge and R. Calvo. Fast dimensionality reduction and simple PCA. Intelligent Data Analysis, 2:203–214, 1998.

[67] M. F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.

[68] Bhavani Raskutti, Herman Ferra, and Adam Kowalczyk. Combining clustering and co-training to enhance text classification using unlabelled data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 620–625, 2002.

[69] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[70] Christian P. Robert. The Bayesian Choice: a decision-theoretic motivation. Springer-Verlag, 1994.

[71] D. Rubin. Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6:34–58, 1978.

[72] Donald B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.

[73] Donald B. Rubin. Multiple Imputation for Nonresponse in Surveys. J. Wiley and Sons, New York, July 1987.

[74] D. E. Rumelhart, G. E. Hinton, and J. L. McClelland. A general framework for parallel distributed processing. In D. E. Rumelhart, J. L. McClelland, et al., editors, Parallel Distributed Processing: Volume 1: Foundations, pages 45–76. MIT Press, Cambridge, 1987.

[75] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley, 1989.

[76] Lawrence Saul and Fernando Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Claire Cardie and Ralph Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81–89, Somerset, New Jersey, 1997. Association for Computational Linguistics.

[77] J.L. Schafer. Analysis of Incomplete Multivariate Data, volume 72 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1997.

[78] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[79] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.

[80] A. Szymkowiak, J. Larsen, and L.K. Hansen. Hierarchical clustering for datamining. In Proceedings of KES-2001 Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, pages 261–265, 2001.

[81] A. Szymkowiak, P.A. Philipsen, J. Larsen, L.K. Hansen, E. Thieden, and H.C. Wulf. Imputing missing values in diary records of sun-exposure study. In D. Miller, T. Adali, J. Larsen, M. Van Hulle, and S. Douglas, editors, Proceedings of IEEE Workshop on Neural Networks for Signal Processing XI, pages 489–498, Falmouth, Massachusetts, 2001.

[82] A. Szymkowiak-Have, J. Larsen, L.K. Hansen, P.A. Philipsen, E. Thieden, and H.C. Wulf. Clustering of sun exposure measurements. In D. Miller, T. Adali, J. Larsen, M. Van Hulle, and S. Douglas, editors, Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII, pages 727–735, 2002.

[83] Anna Szymkowiak-Have, Mark Girolami, and Jan Larsen. A probabilistic approach to kernel principal component analysis. To be published, 2003.

[84] Michael Tipping and Christopher Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, September 1997.

[85] Volker Tresp, Subutai Ahmad, and Ralph Neuneier. Training neural networks with deficient data. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 128–135. Morgan Kaufmann Publishers, Inc., 1994.


[86] Volker Tresp, Ralph Neuneier, and Subutai Ahmad. Efficient methods for dealing with missing data in supervised learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 689–696. The MIT Press, 1995.

[87] N. Vasconcelos and A. Lippman. Learning mixture hierarchies. Technical report, MIT Media Laboratory, 1998.

[88] Chris Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Todd Leen, Thomas Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 682–688. MIT Press, 2000.

[89] C. Williams. A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems, volume 12, pages 680–686, 2000.

[90] S. Wold. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics, 20:397–405, 1978.

[91] D. Xu, J. Principe, and J. Fisher. A novel measure for independent component analysis (ICA). In Proceedings of IEEE ICASSP98, volume 2, pages 1161–1164, 1998.

[92] Lei Xu and Michael I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129–151, 1996.
