rezumat_eva_kovacs.pdf

FACULTY OF AUTOMATION AND COMPUTER SCIENCE

COMPUTER SCIENCE DEPARTMENT

ÉVA KOVÁCS

Contributions to the Development of Methods in

Data Mining in Large Databases

PHD THESIS

Abstract

Scientific advisor

Prof. Dr. Ing. IOSIF IGNAT

Cluj-Napoca, 2007



Contents

Chapter 1. Introduction
1.1. Definitions
1.2. Description of the process of data mining
1.3. Preprocessing and transformation of data for data mining
1.4. Operations in data mining
1.5. Clustering methods of the database
1.6. Predictive modeling methods
1.7. Link analysis methods
1.8. Structure of the thesis

Chapter 2. Current State of Data Mining
2.1. Clustering methods
2.2. Classification methods
2.3. Feature selection methods
2.4. Conclusions

Chapter 3. Development of Clustering Methods with Clustering with Prototype Entity Selection
3.1. Clustering with Prototype Entity Selection
3.2. Clustering with Prototype Entity Selection Algorithm
3.3. Optimized CPES using variable radius
3.4. CPES Algorithm with variable radius
3.5. K-means optimized by CPES
3.6. Experimental study
3.7. Results of the experimental study for CPES
3.8. Results of the experimental study for CPES with variable radius
3.9. Results of the experimental study for K-means with CPES
3.10. Analysis of the results
3.11. Conclusions

Chapter 4. Development of Preprocessing Methods by Reduct Equivalent Feature Selection
4.1. Rough Set Theory
4.2. Reduct Equivalent Feature Selection
4.3. Reduct Equivalent Feature Selection Algorithm
4.4. Experimental Study
4.5. Conclusions

Chapter 5. Development of Classification Methods by Reduct Equivalent Rule Induction
5.1. Classification from the perspective of Rough Set Theory
5.2. Reduct Equivalent Rule Induction
5.3. Reduct Equivalent Rule Induction Algorithm
5.4. Experimental Study
5.5. Conclusions

Chapter 6. Final Conclusions
6.1. Future research trends

Bibliography

Annexes



Chapter 1. Introduction

The development of hardware during the last two decades has made it possible to store large amounts of data on computers. The exact volume of these data is impossible to state; all estimates are mere suppositions. Researchers from the University of California, Berkeley have calculated that approximately one exabyte (1 million terabytes) of data is generated and saved each year.

Real-time data mining of databases is one of the main research areas in databases. The volume of databases and their constant growth are an important problem in data mining. Data mining is a new, developing discipline that uses the resources and ideas of several fields. In abstract terms, the role of data mining is to discover new and useful information in databases. Data mining techniques aim to discover models, structures, regularities, etc., in large databases. The models discovered in databases can be characterized according to accuracy, precision, interpretability and expressivity.

Knowledge can be defined in several ways, for example:
- General term used to describe an object, idea, condition, situation or another fact that can be a number, a letter or a symbol. It can be a chart, an image and/or alphanumeric characters. It suggests elements of information that can be processed or produced by a computer.
- Facts, known things, from which one can draw conclusions.

There are several definitions of data mining; we mention only a few of them:
- Data mining consists of the extraction of predictive information hidden in a large database.
- Data mining is the process through which advantageous models are discovered in databases.
- Data mining is a rapidly growing field that combines the methods of databases, statistics, supervised learning and other related fields in order to extract useful information from existing data.
- Data mining is the non-trivial process which identifies new, valid, useful and interpretable models in existing data.

Data mining algorithms can be categorized according to the representation of the models, the input data and the field in which the algorithm is used. The model can be represented by decision trees, regression functions, associations or others. The most common classification of data mining algorithms divides them into four operations: predictive modeling, database clustering, link analysis and deviation detection [6][7][8][9].

Data mining is an iterative process that contains several steps, beginning with the understanding and definition of the problem and ending with the analysis of the results and the application of a strategy for the use of the results [14]. A data mining process is illustrated in Fig. 1.

Fig. 1. The data mining process


Chapter 2. Current State of Data Mining

The second chapter describes the present state of research on data mining and the most recent research in the field. The algorithms of the main methods in data mining are then presented, namely those of database clustering, predictive modeling and feature selection.

Among the database clustering methods analyzed, the partitioning methods K-means, K-modes, K-medoid, CLARA (Clustering LARge Applications) and CLARANS (Clustering Large Applications based upon RANdomized Search) are mentioned. The K-means algorithm is one of the best-known clustering algorithms used in database clustering. A disadvantage is that the user must specify the number of clusters. The method is also inadequate for finding clusters of nonconvex shape or of different sizes, and it is sensitive to noise, which influences the mean value of a cluster. The K-medoid algorithm is also a clustering algorithm used in database clustering. It works efficiently on a small data set but is unable to handle a large one. CLARA is used instead of K-medoid in order to handle a large data set.
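As a minimal sketch of the limitation noted above, the following Python code implements a plain Lloyd-style K-means in which the number of clusters k has to be supplied by the caller. The function name `kmeans`, its parameters, the seed and the toy data are illustrative assumptions and are not taken from the thesis.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length tuples.

    Illustrative sketch only: the caller has to supply k, which is the
    drawback discussed in the text.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumed for illustration): two well-separated groups of three points.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(data, k=2)
```

Calling `kmeans(data, k=3)` on the same data would still return three clusters, which illustrates why a wrongly chosen k is a practical problem: the algorithm has no way of detecting that the data contain only two groups.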

Among the database clustering methods analyzed, SLINK (Single LINKage) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) belong to the category of hierarchical methods, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure) to the category of density-based methods.

Among the predictive modeling methods analyzed, from the category of algorithms based on decision trees we mention ID3, C4.5, CHAID (Chi-squared Automatic Interaction Detection), CART (Classification And Regression Trees) and QUEST (Quick, Unbiased and Efficient Statistical Tree).

The most popular neural network classification algorithm used in predictive modeling is the backpropagation algorithm. It learns on a multilayer feed-forward network. Neural networks are more efficient than decision trees due to the adjustment of their weights. A deficiency of neural networks is that they accept only numeric input; therefore categorical data must be recoded.
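One common way of recoding categorical data as numeric input is one-hot (indicator) encoding. The sketch below is only an illustration of that idea; the thesis does not state which recoding scheme it uses, and the function name `one_hot_encode` and the sample values are assumptions.

```python
def one_hot_encode(values):
    """Recode a list of categorical values as 0/1 indicator vectors."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one position is set per value
        encoded.append(row)
    return categories, encoded

# Hypothetical categorical attribute.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
```

Each category becomes its own numeric input, so no artificial ordering is imposed on the categories, at the cost of one input per distinct value.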

Bayesian classifiers are statistical classifiers and belong to predictive modeling. Bayesian classifiers have demonstrated high precision and speed, and are used for large databases. In practice several discrepancies appear, for example because of the assumption that attributes are independent of one another.

The majority of feature selection methods belong to supervised learning. There are two types of feature selection algorithms: filter-type and wrapper-type algorithms. The RELIEF algorithm and its extensions are representative of the filter type. There are also feature selection methods that successfully use Rough Set Theory.
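For reference, the independence assumption mentioned above is the one made by the naive Bayesian classifier. In standard textbook notation (the symbols $C$ for the class and $x_1, \dots, x_n$ for the attribute values are introduced here and are not taken from the thesis), the posterior probability of a class factorizes as:

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

When attributes are in fact correlated, the product counts the shared evidence more than once, which is one source of the discrepancies observed in practice.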

Chapter 3. Development of Clustering Methods with Clustering with Prototype Entity Selection

This chapter presents three original clustering methods: Clustering with Prototype Entity Selection (or CPES) [16][19], Clustering with Prototype Entity Selection with variable radius [17] and K-means with Clustering with Prototype Entity Selection [18]. First the methods and then the experimental studies are presented, followed by a comparison of the results with other clustering methods from the specialized literature.

The CPES method will be used in data mining both as an independent clustering process and in combination with K-means, a clustering algorithm often used in data mining.


Clustering with Prototype Entity Selection Algorithm

We propose the use of the CPES method as a clustering method for data mining. When using this method the user does not need to specify the number of clusters; the algorithm determines this number itself. The method also ensures optimal clustering. This method will be efficient within the frame of a data mining process, because it is very important for a user that the method used should need only a few input data. In short, the algorithm is presented as follows:

1. Initialization of the constants …, $A$, $r$ and of the fitness function $f_{avd}$
2. Generation of the clusters $cluster(x_i) = x_i$, for $i = 1, \dots, n$
3. repeat
3.1. For each $cluster(x_i)$ a pair $x_j$ is selected
3.2. if $f(cluster(x_i))$ … $f(x_j)$ then
