Serbian Journal of Management 8 (1) (2013) 9 - 24
www.sjm06.com
DOI: 10.5937/sjm8-3226

HIGHLY ROBUST METHODS IN DATA MINING

Jan Kalina*

Institute of Computer Science of the Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Praha 8, Czech Republic

(Received 14 January 2013; accepted 4 February 2013)

Abstract

This paper is devoted to highly robust methods for information extraction from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to the presence of outlying measurements in the observed data is discussed as their major drawback. The paper proposes several new highly robust methods for data mining, based on the idea of implicit weighting of individual data values. In particular, it proposes a novel robust method of hierarchical cluster analysis, which is a popular data mining method of unsupervised learning. Further, a robust method for estimating parameters in logistic regression is proposed, and this idea is extended to a robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to the presence of noise and outlying measurements in the data is discussed, and a method for robust training of neural networks for the task of function approximation, which has the form of a robust estimator in nonlinear regression, is proposed.

Keywords: data mining, robust statistics, high-dimensional data, cluster analysis, logistic regression, neural networks

* Corresponding author: [email protected]

1. STATISTICS VS. DATA MINING

Data mining can be characterized as a process of information or knowledge extraction from data sets, which leads to revealing and investigating systematic associations among individual variables. It contains exploratory data analysis, descriptive modeling, classification, and regression, while classification together with
regression are commonly denoted as
predictive data mining. The concept of data
mining can be described as the analytical
part of the overall process of extracting
useful information from data, which is
traditionally called knowledge discovery
(Fayyad et al., 1996; Nisbet et al., 2009; Martinez et al., 2011). We can say that data
mining has the aim to extract information,
while knowledge discovery goes further and
aims at acquiring knowledge relevant for the
field of expertise; compare Fernandez (2003)
and Zvárová et al. (2009). Actually Soda et
al. (2010) required data mining methods to
be dynamically integrated within knowledge
discovery approaches.
In management applications, data mining
is often performed with the aim to extract
information relevant for making predictions
and/or decision making, which can be
described as selecting an activity or series of
activities among several alternatives.
Decision making integrates uncertainty as
one of the aspects with an influence on the
outcome. The development of computer technology has made it possible to implement partially or fully automatic decision support systems, i.e. complex systems offering assistance with the decision making process and able to compare different alternatives in terms of their risk. Such systems are capable of solving a variety of complex tasks, analyzing different information components, extracting information of different types, and deducing conclusions for management, financial, or econometric applications (Gunasekaran & Ngai, 2012; Brandl et al., 2006), allowing the best available decision to be found within the framework of evidence-based management (Briner, 2009).
Unlike classical statistical procedures, the
data mining methodology does not have the
ambition to generalize its results beyond the
data summarized in a given database. In data
mining, one does not usually assume a
random sample from a certain population
and the data are analyzed and interpreted as
if they constituted the whole population. On
the other hand, the work of a statistician is
often based on survey sampling, which
also requires designing questionnaires, training interviewers, working with databases, aggregating the data, computing descriptive statistics, or estimating non-response. Moreover, a
statistician commonly deals with observing a
smaller number of variables on relatively
small samples, which is another difference
from the data mining context.
Removal of outliers is one of the important steps of validating the plausibility of a model
both in statistics and data mining. A common disadvantage of popular data mining methods (linear regression, classification analysis, machine learning) is their high sensitivity (non-robustness) to the presence of outlying measurements in the data. In response, statisticians have developed robust statistical methodology as an alternative to standard procedures; robust methods are insensitive to the presence of outliers as well as to deviations from standard distributional assumptions. Although the concept of robust
data mining describing a methodology for
data mining based on robust statistics has
been introduced (Shin et al., 2007), robust
methods have not found their way to real
data mining applications yet.
This paper proposes new highly robust
methods suitable for the analysis of data in
management applications. We use the idea of implicit weighting to derive new alternative multivariate statistical methods, which ensures a high breakdown point. This paper
has the following structure. Section 2 recalls
the highly robust least weighted squares method for estimating parameters in a linear regression model. Section 3 proposes a robust cluster analysis method. Section 4 proposes a robust estimation method for logistic regression, which is used to define a robust multinomial logistic classification in Section 5. Section 6 is devoted to robustness aspects of neural networks and finally Section 7 concludes the paper.
2. LEAST WEIGHTED SQUARES
REGRESSION
Because classical statistical methods suffer from the presence of outlying data values (outliers), robust statistical methods have been developed as an alternative approach to data modeling. They originated in the 1960s as a diagnostic tool for classical methods (Stigler, 2010), but have developed into reliable self-standing procedures tailor-made to suppress the effect of data contamination by various kinds of outliers (Hekimoglu et al., 2009). This section describes the least weighted squares estimator, which is one of the promising robust estimators of parameters in the linear regression model. Its idea will be exploited in the following sections.
M-estimators represent the most widely used robust statistical methods (Maronna et al., 2006). However, as the concept of breakdown point has become the most important statistical measure of resistance (insensitivity) against noise or outlying measurements in the data (Davies & Gather, 2005), M-estimators have been criticized for their low breakdown point in linear regression (Salibián-Barrera, 2006). Only recently have highly robust methods, defined as methods with a high breakdown point, been proposed.
The least weighted squares (LWS) estimator proposed by Víšek (2001) is a highly robust statistical tool based on implicit weighting of individual observations. The idea is to assign smaller weights to less reliable observations, without the necessity to specify precisely which observations are outliers and which are not.
In the original proposal, the user has to select the magnitudes of the weights. These are assigned to particular observations only after a permutation, which is automatically determined during the computation of the estimator. More recent versions of the LWS do not require the user to choose the magnitudes of the weights: Čížek (2011) proposed several adaptive versions of the LWS estimator, including a two-stage procedure with data-dependent quantile-based weights. The estimator has a high breakdown point and at the same time a 100 % asymptotic efficiency of the least squares under Gaussian errors. Its relative efficiency compared to maximum likelihood estimators remains high (over 85 %) under various distributional models for moderate samples, as evaluated in a numerical study (Čížek, 2011). Besides, compared to M-estimators, the LWS estimator of regression parameters does not require a simultaneous estimation of the error variance; the need for such estimation is a disadvantage of M-estimators, which lose computational simplicity.
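As an illustration, the implicit weighting described above can be sketched in a few lines of Python. The linearly decreasing weight magnitudes, the random elemental starts, and the concentration loop are assumed choices of this sketch (a simplification in the spirit of the weighted LTS algorithm), not the adaptive weights of Čížek (2011):

```python
import numpy as np

def lws_regression(X, y, n_starts=50, n_conc=30, seed=0):
    """Illustrative sketch of the LWS idea: least squares iteratively
    reweighted through the ranks of squared residuals, so that observations
    with large residuals implicitly receive small weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # add an intercept column
    q = Xd.shape[1]
    # weight magnitudes: 1 for the smallest residual, decreasing to 0
    magnitudes = np.maximum(0.0, np.linspace(1.0, -1.0, n))
    best_beta, best_loss = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=q, replace=False)   # random elemental start
        beta = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)[0]
        for _ in range(n_conc):                      # concentration steps
            r2 = (y - Xd @ beta) ** 2
            w = np.empty(n)
            w[np.argsort(r2)] = magnitudes           # weights via residual ranks
            sw = np.sqrt(w)
            beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
        r2 = (y - Xd @ beta) ** 2
        w = np.empty(n)
        w[np.argsort(r2)] = magnitudes
        loss = np.sum(w * r2)
        if loss < best_loss:
            best_loss, best_beta = loss, beta
    return best_beta
```

Because the weights are assigned through the ranks of the residuals, gross outliers end up with zero weight without the user ever labeling them as outliers.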
The LWS estimator is a weighted analogue of the least trimmed squares (LTS) estimator of Rousseeuw and van Driessen (2006), who considered weights equal to 1 or 0 only. However, the LTS estimator suffers from a high sensitivity to small deviations near the
center of the data, while the LWS estimator possesses also a reasonable local robustness (Víšek, 2001). The computation of the LWS estimator is intensive and an approximative algorithm can be obtained as a weighted version of the LTS algorithm (Rousseeuw & van Driessen, 2006). Diagnostic tools for the LWS estimator in the linear regression context (mainly for econometric applications) were proposed by Kalina (2011) and hypothesis tests concerning the significance of LWS estimators by Kalina (2012a). Econometric applications of robust statistical methods were summarized by Kalina (2012b). A generalization of the LWS estimator for nonlinear regression will be proposed in Section 6.4.
3. ROBUST CLUSTER ANALYSIS
Cluster analysis (clustering) is a commonly used data mining tool in management applications. Examples of using cluster analysis include customer segmentation (differentiation) based on sociodemographic characteristics, life style, size of consumption, experience with particular products, etc. (Mura, 2012). Another application is the task of market segmentation, allowing to position products or to categorize managers on the basis of their style and strategy (Punj & Stewart, 1983; Liang, 2005). All such tasks, solved by grouping units into natural clusters at the explorative stage of information extraction, are vital for a successful marketing or management strategy.
Cluster analysis is a general methodologyaiming at extracting knowledge about themultivariate structure of given data. It solves
the task of unsupervised learning by dividing the data set into several subsets (clusters) without using prior knowledge about the group membership of each observation. It is often used as an exploratory technique and can also be interpreted as a technique for dimension reduction of complex multivariate data. Cluster analysis assumes the data to be fixed (non-random), without the ambition for statistical inference. It comprises a wide variety of methods with numerous possibilities for choosing different parameters and adjusting the whole computation process.
The most common approaches to clustering include agglomerative hierarchical clustering and k-means clustering, which both suffer from the presence of outliers in the data and strongly depend on the choice of the particular method. Some approaches are also sensitive to the initialization of the random algorithm (Vintr et al., 2012). Robust versions of clustering have recently been studied in the context of molecular genetics, but they have not penetrated to management applications yet. A robust hierarchical clustering was proposed by García-Escudero et al. (2009) and a robust version of k-means cluster analysis tailor-made for high-dimensional applications by Gao and Hitchcock (2010). It is also possible, but tedious, to identify and manually remove outlying measurements from multivariate data (Svozil, 2008).
In this paper, we propose a new robust agglomerative hierarchical cluster analysis based on a robust correlation coefficient as a measure of similarity between two clusters. The Pearson correlation coefficient is a common measure of similarity between two clusters (Chae et al., 2008). We propose a robust measure of distance between two clusters based on implicit weighting (Kalina,
2012a), inspired by the LWS regression estimator.
Algorithm 1: (Robust distance between two clusters). Let us assume two disjoint clusters C1 and C2 of p-variate observations. The distance dLWS(C1, C2) between the clusters C1 and C2 is defined by the following algorithm.

1. Select an observation X = (X1, …, Xp)T from C1 and select an observation Y = (Y1, …, Yp)T from C2.

2. Consider the linear regression model

Yi = β0 + β1 Xi + ei,  i = 1, …, p. (1)

3. Compute the LWS estimator in the model (1). The optimal weights determined by the LWS estimator will be denoted by w* = (w1*, …, wp*)T.

4. The LWS-distance between X and Y, denoted by dLWS(X, Y), is defined as

dLWS(X, Y) = [Σi wi*(Xi − X̄w)(Yi − Ȳw)] / {[Σj wj*(Xj − X̄w)²][Σj wj*(Yj − Ȳw)²]}^(1/2), (2)

where all sums extend over 1, …, p, X̄w is the weighted mean of X1, …, Xp and Ȳw is the weighted mean of Y1, …, Yp, both with the optimal weights w*.

5. The distance dLWS(C1, C2) between the clusters C1 and C2 is defined as the maximum of dLWS(X, Y) over all possible observations X coming from C1 and all possible observations Y coming from C2.
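A minimal sketch of steps 2-4 of Algorithm 1 in Python may look as follows. The LWS fit is approximated here by rank-based reweighted least squares with linearly decreasing weight magnitudes, an assumed simplification of the exact LWS computation:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted correlation coefficient of equation (2) for given weights w."""
    xm = np.sum(w * x) / np.sum(w)               # weighted mean of x
    ym = np.sum(w * y) / np.sum(w)               # weighted mean of y
    num = np.sum(w * (x - xm) * (y - ym))
    den = np.sqrt(np.sum(w * (x - xm) ** 2) * np.sum(w * (y - ym) ** 2))
    return num / den

def d_lws(x, y, n_conc=20):
    """Fit y ~ x by an iteratively reweighted approximation of the LWS
    estimator and plug the resulting implicit weights into (2)."""
    p = len(x)
    magnitudes = np.linspace(1.0, 0.0, p)        # decreasing weight magnitudes
    Xd = np.column_stack([np.ones(p), x])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    for _ in range(n_conc):
        r2 = (y - Xd @ beta) ** 2
        w = np.empty(p)
        w[np.argsort(r2)] = magnitudes           # implicit weights via ranks
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
    r2 = (y - Xd @ beta) ** 2
    w = np.empty(p)
    w[np.argsort(r2)] = magnitudes
    return weighted_corr(x, y, w)
```

An observation pair contaminated in a few coordinates still yields a weighted correlation close to that of the uncontaminated coordinates, because the outlying coordinates receive the smallest weights.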
Now we describe the whole procedure of hierarchical bottom-up robust cluster analysis. The robust cluster analysis method, denoted by LWS-CA, is defined in the following way.
Algorithm 2: (Robust cluster analysis LWS-CA).

1. Consider each particular observation as an individual cluster.

2. Search for the pair of clusters C1 and C2 with the minimal value of the robust distance dLWS(C1, C2).

3. Join the pair of clusters selected in step 2 into a single cluster.

4. Repeat steps 2 and 3 until a suitable number of clusters is found, using the gap statistic criterion (Tibshirani et al., 2001).
The LWS-CA method corresponds to a single linkage clustering used together with a robust version of the Pearson correlation coefficient based on implicitly assigned weights. The stopping rule based on the gap statistic compares the inter-cluster variability with the intra-cluster variability in the data. Alternatively, the user may require a fixed value of the final number of resulting clusters before the computation.
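The agglomerative loop of Algorithm 2 can be sketched as follows, using a fixed final number of clusters (the variant mentioned at the end of Section 3) and an arbitrary robust pairwise distance passed in as a function; the cluster-level distance is the maximum over cross-cluster pairs, as in step 5 of Algorithm 1:

```python
import numpy as np

def agglomerate(observations, pair_dist, n_clusters):
    """Bottom-up clustering: repeatedly join the pair of clusters with the
    smallest distance, where the distance between two clusters is the maximum
    of pair_dist over all cross-cluster pairs of observations."""
    clusters = [[i] for i in range(len(observations))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(pair_dist(observations[i], observations[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # join the closest pair
        del clusters[b]
    return clusters
```

Any robust distance, such as one derived from the implicitly weighted correlation (2), can be supplied as `pair_dist`; the loop itself is independent of the choice.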
4. ROBUST LOGISTIC REGRESSION
Logistic regression is a basic tool for modeling the trend of a binary variable depending on one or several regressors (continuous or categorical). From the statistical point of view, it is the most commonly used special case of a generalized linear model. At the same time, it is also commonly used as a method for classification analysis (Agresti, 1990).
The maximum likelihood estimation of parameters in logistic regression is known to be vulnerable to the presence of outliers. Moreover, the logistic regression model does not assume errors in the regressors,
which may be an unrealistic assumption in many applications. Outliers appear in the logistic regression context quite commonly as measurement errors, or can emerge as typing errors both in the regressors and the response (Buonaccorsi, 2010). This section proposes a novel robust estimator for the parameters of logistic regression based on the least weighted squares estimator (Section 2) and derives its breakdown point.
A highly robust approach was pioneered by Christmann (1994), who used the least median of squares (LMS) method for estimating parameters in the logistic regression model and proved the estimator to possess the maximal possible breakdown point. Nevertheless, the low efficiency of the LMS estimator has been reported as unacceptable; see Shertzer and Prager (2002) for a management application. Čížek (2008) advocated highly robust estimators of the parameters of logistic regression, because they have the potential of a relatively low bias without the need for a bias correction, which would be necessary for M-estimators. A nonparametric alternative to the logistic regression is known as the multifactor dimensionality reduction, proposed by Ritchie et al. (2001).
Let us now consider the model of the logistic regression with a binary response Y = (Y1, …, Yn)T. Its values equal to 1 or 0 can be interpreted as a success (or failure, respectively) of a random event. The probability of success for the i-th observation (i = 1, …, n) is modeled as a response of independent variables, which can be either continuous or discrete. The regressors are denoted as Xi = (X1i, …, Xpi)T for i = 1, …, n. The conditional distribution of Yi, assuming fixed values of the regressors, is assumed to be binomial Bi(mi, πi), where mi is a known positive integer and the probability πi depends on regression parameters β1, …, βp for i = 1, …, n through

log[πi/(1 − πi)] = β1X1i + … + βpXpi. (3)

We introduce the notation vi = [miπi(1 − πi)]^(1/2) and

X̃i = viXi  and  Ỹi = vi log[πi/(1 − πi)] (4)

for i = 1, …, n.

Definition 2. We define the least weighted logistic regression (LWLR) estimator in the model (3) in the following way. We consider the transformations (4), where the unknown probabilities πi (i = 1, …, n) are estimated by the maximum likelihood method. The LWLR is defined as the LWS estimator with the adaptive quantile-based weights computed for the data

(X̃1T, Ỹ1)T, …, (X̃nT, Ỹn)T. (5)
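A sketch of the transformations of Definition 2 in Python: the maximum likelihood estimate is computed by a plain Newton-Raphson iteration (an assumed implementation detail), and the transformed data (5), on which the LWS estimator would subsequently be computed, are returned:

```python
import numpy as np

def lwlr_transform(X, y, m, n_newton=25):
    """Estimate pi_i by maximum likelihood in the binomial logistic model (3)
    and form the transformed data of (4)-(5).
    X: n x p design, y: success counts Y_i, m: group sizes m_i."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_newton):                    # Newton-Raphson for the MLE
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = m * pi * (1.0 - pi)                  # binomial variances
        beta = beta + np.linalg.solve(X.T @ (X * W[:, None]),
                                      X.T @ (y - m * pi))
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = np.sqrt(m * pi * (1.0 - pi))             # v_i of equation (4)
    return X * v[:, None], v * np.log(pi / (1.0 - pi)), beta
```

The returned pair (X̃, Ỹ) plays the role of the data (5); running a robust linear estimator on it down-weights observations whose transformed residuals are large.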
The LWLR estimator has a high breakdown point, as evaluated in the following theorem, which can be proven as an analogy of the result of Christmann (1994).

Theorem 1. Assume the values mi to be reasonably large and assume the technical assumptions of Christmann (1994). Then the breakdown point of the LWLR estimator computed with the data-dependent adaptive weights of Čížek (2011) is equal to

(⌊n/2⌋ − p + 2)/n, (6)
where ⌊n/2⌋ denotes the integer part of n/2, defined as the largest integer smaller than or equal to n/2.

Let us now discuss using the robust logistic regression as a classification method into two groups. The observed data are interpreted as a training data set with the aim to learn a classification rule. Let pi, defined by

pi = exp(b1X1i + … + bpXpi)/[1 + exp(b1X1i + … + bpXpi)], (7)

denote the estimated probability that the i-th observation belongs to group 1, where (b1, …, bp)T is the LWLR estimate of (β1, …, βp)T. Theoretical textbooks (e.g. Jaakkola, 2013) commonly recommend using the logistic regression to classify a new observation to group 1 if and only if
pi > 1/2, (8)
which is equivalent to
log[pi/(1 − pi)] > 0. (9)
However, this may be very unsuitable in some situations, especially if a large majority of the training data belongs to one of the two given groups.
A more efficient classification rule is obtained by replacing the rule (9) by the rule pi > c, where the constant c is determined in an optimal way, namely so that it minimizes the total classification error. We point out that this is equivalent to maximizing Youden's index I (Youden, 1950), defined as

I = sensitivity + specificity − 1, (9')

where sensitivity is defined as the probability of a correct classification of a successful observation and specificity is the probability of a correct classification of an unsuccessful observation. The optimization of the threshold c in the classification context is common for neural networks, as will be described in Section 6.2.
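A direct search for the threshold maximizing Youden's index can be sketched as follows (the function name and the search restricted to observed probabilities are assumptions of this sketch):

```python
import numpy as np

def optimal_threshold(p_hat, y_true):
    """Choose the classification threshold c by maximizing Youden's index
    I = sensitivity + specificity - 1 over the observed probabilities."""
    best_c, best_I = 0.5, -np.inf
    for c in np.unique(p_hat):
        pred = p_hat > c
        sens = np.mean(pred[y_true == 1]) if np.any(y_true == 1) else 0.0
        spec = np.mean(~pred[y_true == 0]) if np.any(y_true == 0) else 0.0
        I = sens + spec - 1.0
        if I > best_I:
            best_I, best_c = I, c
    return best_c, best_I
```

Scanning only the observed probabilities suffices, because the index is piecewise constant between consecutive fitted values of pi.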
5. ROBUST MULTINOMIAL
LOGISTIC CLASSIFICATION
ANALYSIS
Classification analysis into several (more than two) groups is a common task in management applications. We extend the robust logistic regression estimator LWLR of Section 4 to a robust multinomial logistic classification analysis into several groups. The new method is insensitive to the presence of contamination in the data.
First, we recall the multinomial logistic regression, which is an extension of the logistic regression (3) to a model with a response taking several different values. We assume a total number of n measurements, denoted as (X11, …, X1n)T, …, (Xp1, …, Xpn)T. Each of them belongs to one of K different groups. The index variable Y = (Y1, …, Yn)T contains the code 1, …, K corresponding to the group membership of each measurement. We consider a model assuming that Y follows a multinomial distribution with K categories.
Definition 3. The data are assumed to form K independent random samples of p-variate data. The multinomial logistic regression model is defined as
log[P(Yi = k)/P(Yi = K)] = βk1X1i + … + βkpXpi,  i = 1, …, n,  k = 1, …, K − 1, (10)
where P(Yi=k) denotes the probability that
the observation Yi belongs to the k-th group.
Here each of the groups 1,…,K has its
own set of regression parameters, expressing
the discrimination between the particular
group and the reference group K. We also
point out that the value

log[P(Yi = k)/P(Yi = l)], (11)

which compares the probability that the observation Yi belongs to the k-th group with the probability that Yi belongs to the l-th group (k = 1, …, K, l = 1, …, K), can be evaluated as

log[P(Yi = k)/P(Yi = l)] = (βk1 − βl1)X1i + … + (βkp − βlp)Xpi. (12)
Our aim is to use the observed data as a
training set to learn a robust classification
rule allowing to assign a new observation
Z=(Z1,…,Zp)T to one of the K given groups.
As a solution, we define the robust
multinomial logistic classification analysis
based on a robust estimation computed
separately in each of the K − 1 particular models (10).
Definition 4. In the model (10), let pk denote the estimate of the probability that the observation Z belongs to the k-th group for k = 1, …, K, obtained by replacing the unknown regression parameters in (10) by their maximum likelihood estimates. Using the notation vki = [mipki(1 − pki)]^(1/2) and

X̃ki = vkiXi  and  Ỹki = vki log[pki/(1 − pki)] (13)

for i = 1, …, n, the robust estimator is obtained as the LWS estimator with adaptive quantile-based weights computed for the data

(X̃k1T, Ỹk1)T, …, (X̃knT, Ỹkn)T. (14)

The robust multinomial logistic classification analysis (RMLCA) assigns a new observation Z to the group k if and only if

pk ≥ pj for all j = 1, …, K, j ≠ k. (15)
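The classification rule (15) can be sketched as follows, assuming the K − 1 submodels of (10) have already been estimated (robustly or otherwise) and their coefficient vectors stacked into a matrix:

```python
import numpy as np

def rmlca_assign(z, B):
    """Assign a new observation z to a group by rule (15).
    B is a (K-1) x p array of estimated parameters of the K-1 binary
    submodels (10), with group K as the reference group."""
    eta = B @ z                                  # log-odds against group K
    num = np.append(np.exp(eta), 1.0)            # reference group has eta = 0
    probs = num / num.sum()                      # multinomial probabilities
    return int(np.argmax(probs)) + 1, probs      # groups coded 1, ..., K
```

The reference group K contributes a unit term because its log-odds against itself are zero, so the probabilities of all K groups sum to one.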
6. NEURAL NETWORKS
6.1. Principles of neural networks
Neural networks represent an important
predictive data mining tool with a high
flexibility and ability to be applied to a wide
variety of complex models (Hastie et al.,
2001). They have become increasingly
popular as an alternative to statistical
approaches for classification (Dutt-
Mazumder et al., 2011). In management
applications, neural networks are commonly
used in risk management, market
segmentation, classification of consuming
spending patterns, sale forecasts, analysis of
financial time series, quality control, or
strategic planning; see the survey by
Hakimpoor et al. (2011) for a full
discussion. However, Krycha and Wagner
(1999) warned that most papers on
management applications ‘do not report
completely how they solved the problems at
hand by the neural network approach’.
Neural networks are biased under the
,)lY(p)kY(p
logi
i
i11lpikpi11ki
i XX...X)lY(p)kY(p
log
.X... pilp (12)
ikiki21
kkiki XvX~,))p1(pm(v
))p1/(plog(vY~ kkkiki
.)Y~,X~(,...,)Y~,X~( Tkn
Tkn
T1k
T1k
presence of outliers, and a robust estimation of their parameters is desirable (Beliakov et al., 2011). In general, neural networks require the same validation steps as statistical methods, because they involve exactly the same sort of assumptions as statistical models (Fernandez, 2003). Actually, neural networks need an evaluation even more than the logistic regression, because more complicated models are more vulnerable to overfitting than simpler models. Unfortunately, a validation is often infeasible for complex neural networks. This leads users to a tendency to believe that neural networks do not require any statistical assumptions or that they are model-free (Rusiecki, 2008), which may lead to a wrong feeling that they do not need any kind of validation.
We mention also some other weak points of training neural networks. They are often called black boxes, because their numerous parameters cannot be interpreted clearly and it is not possible to perform a variable selection by means of testing the statistical significance of individual parameters. Moreover, learning reliable estimates of the parameters requires a very large number of observations, especially for data with a large number of variables. Overfitting is also a common shortcoming of neural networks, which is a consequence of estimating the parameters entirely over the training data without validating on an independent data set.
Different kinds of neural networks can be distinguished according to the form of the output and the choice of the activation function, and different tasks require different kinds of networks. In this work, we consider the two most common kinds of multilayer perceptrons, which are also called multilayer feedforward networks. Their training is most commonly based on the back-propagation algorithm. Section 6.2 discusses neural networks for a supervised classification task, i.e. networks with a binary output, and explains their link to the logistic regression. Section 6.3 discusses neural networks for function approximation, i.e. networks with a continuous output, and studies a robust approach to their fitting. A new robust estimation tool suitable for some types of such neural networks is described in Section 6.4 as the nonlinear least weighted squares estimator.
6.2. Neural networks for classification
We discuss supervised multilayer feedforward neural networks employed for a classification task. The input data come from K independent random samples (groups) of p-dimensional data and the aim is to learn a classification rule allowing to classify a new measurement into one of the groups. In this section, we also describe the connection between a classification neural network and logistic regression.
The network consists of an input and an output layer of neurons and possibly of one or several hidden layers, which are mutually connected by edges. A so-called weight corresponds to each observed variable. This weight can be interpreted as a regression parameter and its value can be any real (also negative) number. In the course of fitting the neural network, the weights connecting each neuron with the neurons of the next layer are optimized. A given activation function is applied to the weighted inputs to determine the output of the neural network.
First, let us consider a neural network constructed for the task of classification into two groups with no hidden layers. The
output of the network is obtained in the form

f(x) = g(wTx + b), (16)

where x ∈ Rp represents the input data, R denotes the set of all real numbers, w is the vector of weights, and b is a constant (intercept). Simple neural networks usually use the activation function g as a binary function or a monotone function, typically a sigmoid function such as the logistic function or the hyperbolic tangent.
If the logistic activation function

g1(x) = 1/(1 + exp(−x)),  x ∈ R, (17)

is applied, the neural network is precisely equal to the model of logistic regression (Dreiseitl & Ohno-Machado, 2002). This special case of a neural network differs from the logistic regression only in the method for estimating the (regression) parameters. Moreover, if the hyperbolic tangent activation function

g2(x) = tanh(x),  x ∈ R, (18)

is applied, the neural network without hidden layers is again equal to the model of the logistic regression. Moreover, it can be easily proven that g2(x) = 2g1(2x) − 1 for any real x.
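The identity linking the two activation functions can be checked numerically:

```python
import numpy as np

def g1(x):
    """Logistic activation (17)."""
    return 1.0 / (1.0 + np.exp(-x))

def g2(x):
    """Hyperbolic tangent activation (18)."""
    return np.tanh(x)

# check the identity g2(x) = 2*g1(2x) - 1 on a grid of points
x = np.linspace(-5.0, 5.0, 101)
assert np.allclose(g2(x), 2.0 * g1(2.0 * x) - 1.0)
```

The identity shows that the two activations parameterize the same family of decision rules, differing only by an affine rescaling of the output.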
An influence of outliers on classification analysis performed by neural networks has been investigated only empirically (e.g. Murtaza et al., 2010). Classification neural networks are sensitive to the presence of outliers, but a robust version has not been proposed. For the classification into two groups, the sensitivity of the neural networks is a consequence of their connection to the
logistic regression. Thanks to the connection between neural networks and logistic regression, the robust logistic regression estimator LWLR (Section 4) represents at the same time a robust method for fitting classification neural networks with no hidden layers.
Moreover, references on the sensitivity analysis of neural networks have not analyzed other situations which negatively influence the training of neural networks (cf. Yeung et al., 2010). An example is multicollinearity, which is a phenomenon as harmful for neural networks as for the logistic regression.
Neural networks for classification into K groups are commonly used with one or more hidden layers. Such networks use a binary decision in each node of the last hidden layer, driven by a sigmoid activation function. Therefore, they are as sensitive to the presence of outliers as the networks for classification into two groups.
6.3. Neural networks for function
approximation
Neural networks are commonly used as a tool for approximating a continuous real function. In management, multilayer feedforward networks are often used mainly for predictions, e.g. for modeling and predicting market response (Gruca et al., 1999) or demand forecasting (Efendigil et al., 2009). Most commonly, they use the identity activation function, i.e. they can be described as

f(x) = wTx + b,  x ∈ Rp. (19)

An alternative approach to function approximation is to use radial basis function neural networks. Now we need to describe
the back-propagation algorithm for function approximation networks, which is at the same time commonly used also for the classification networks of Section 6.2.
The back-propagation algorithm for neural networks employed for function approximation minimizes the total error computed across all data values of the training data set. The algorithm is based on the least squares method, which is optimal for normally distributed random errors in the data (Rusiecki, 2008). After an initialization of the values of the parameters, the forward propagation is a procedure for computing weights for the neurons sequentially in particular hidden layers. This leads to computing the value of the output and consequently the sum of squared residuals computed for the whole training data set. To reduce the sum of squared residuals, the network is sequentially analyzed from the output back to the input. Particular weights for individual neurons are transformed using the optimization method of the steepest gradient. However, there are no diagnostic tools available which would be able to detect substantial information in the residuals, e.g. in the form of their dependence, heteroscedasticity, or systematic trend.
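For the simplest network with no hidden layer and identity activation, the least squares training described above reduces to steepest gradient descent on the sum of squared residuals; a minimal sketch (the learning rate and epoch count are assumed values):

```python
import numpy as np

def train_ls(X, y, lr=0.5, n_epochs=1000):
    """Steepest-descent training of a no-hidden-layer network with identity
    activation, minimizing the mean squared residual over the training set."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_epochs):
        r = y - (X @ w + b)                 # residuals on the training set
        w += lr * (X.T @ r) / n             # gradient step for the weights
        b += lr * r.mean()                  # gradient step for the intercept
    return w, b
```

Because the objective is the plain sum of squares, a single gross outlier in y shifts the limit of this iteration, which is exactly the non-robustness discussed in this section.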
A robust version of fitting multilayer feedforward networks for the task of function approximation for contaminated data has been described only for specific kinds of neural networks. Rusiecki (2008) studied neural networks based on robust multivariate estimation using the minimum covariance determinant estimator. Chen and Jain (1994) investigated M-estimators and Liano (1996) studied the influence function of neural networks for function approximation as a measure of their robustness.
Jeng et al. (2011) and Beliakov et al. (2012) studied neural networks based on nonlinear regression. Instead of estimating the parameters of their neural networks by means of the traditional nonlinear least squares, they performed the LTS estimation. In this paper, we propose an alternative approach based on the LWS estimator. Because such an approach is general and not connected to the context of neural networks, we present the methodology as a self-standing Section 6.4.
6.4. Nonlinear least weighted squares
regression
Let us consider the nonlinear regression model

Yi = f(β1X1i + … + βpXpi) + ei,  i = 1, …, n, (20)

where Y = (Y1, …, Yn)T is a continuous response, (X11, …, X1n)T, …, (Xp1, …, Xpn)T are the regressors, and f is a given nonlinear function. Let u(i)(b) denote the residual corresponding to the i-th observation for a given estimate b = (b1, …, bp)T ∈ Rp of the regression parameters (β1, …, βp)T. We consider the squared residuals arranged in ascending order,

u(1)²(b) ≤ u(2)²(b) ≤ … ≤ u(n)²(b). (21)

We define the least weighted squares estimator of the parameters in the model (20) as

arg min Σi wi u(i)²(b),  with the sum over i = 1, …, n, (22)

where the argument of the minimum is computed over all possible values of b = (b1, …, bp)T and where w1, …, wn are magnitudes of weights determined by the
user, e.g. linearly decreasing or logistic weights (Kalina, 2012a). Now we describe the algorithm for computing the solution of (22) as an adaptation of the LTS algorithm for the linear regression (cf. Rousseeuw & van Driessen, 2006).
Algorithm 3: (Nonlinear least weighted squares estimator).
1. Set the value of the loss function to +∞. Randomly select p points, which uniquely determine the estimate b of the regression parameters β.
2. Evaluate the residuals for all observations. Assign the weights to all observations based on (21).
3. Compare the value of ∑_{i=1}^{n} w_i u_(i)^2(b) computed with the resulting weights with the current value of the loss function. If the current value of the loss function is larger, go to step 4. Otherwise go to step 5.
4. Set the value of the loss function to ∑_{i=1}^{n} w_i u_(i)^2(b) and store the values of the weights. Find the nonlinear regression estimator of β_1, …, β_p by weighted least squares (Seber & Wild, 1989) using these weights. Go back to steps 2 and 3.
5. Repeatedly (10 000 times) perform steps 1 through 4. The output (optimal) weights are those giving the global optimum of the loss function ∑_{i=1}^{n} w_i u_(i)^2(b) over all repetitions of steps 1 through 4.
7. CONCLUSION
This paper fills a gap in robust statistical methodology for data mining by introducing new highly robust methods based on implicit weighting. Robust statistical methods are established as an alternative approach to certain problems of statistical data analysis.
Data mining methods are routinely used by practitioners in everyday management applications. This paper persuades the reader that robustness is a crucial aspect of data mining which has received only little attention so far, although the sensitivity of data mining methods to the presence of outlying observations has been repeatedly reported as a serious problem (Wong, 1997; Yeung et al., 2010). On the other hand, experience shows that overly sophisticated data mining methods are not very recommendable for practical purposes (Hand, 2006), and simple methods are usually preferable.
This paper introduces new tools for robust data mining using the intuitively clear requirement to down-weight less reliable observations. The robust cluster analysis LWS-CA (Section 3), the robust logistic regression LWLR (Section 4), and the robust multinomial logistic classification analysis (Section 5) are examples of newly proposed methods which can be understood as part of the robust data mining framework.
The logistic regression is commonly described as a white-box model (Dreiseitl & Ohno-Machado, 2002), because it offers a simpler interpretation compared to neural networks, which are characterized as a black box. Because robust methods have been mostly studied for continuous data, our proposal of robust logistic regression is one of the pioneering results on robust estimation in the context of discrete data.
In management applications, neural networks are quite commonly used, but still less frequently compared to the logistic regression. Also, neural networks are sensitive to the presence of outlying
measurements. This paper discusses the importance of robust statistical methods for training neural networks (Section 6) and proposes a robust approach to neural networks for function approximation. Such an approach still requires an extensive evaluation with the aim to detect possible overfitting by means of cross-validation or bootstrap methodologies. Moreover, robust neural networks for classification purposes based on robust back-propagation remain an open problem. A classification neural network can be characterized as a generalization of the logistic regression, although the two models do not use the same method for parameter estimation. Another possibility for a robust training of classification neural networks would be to propose a robust back-propagation based on the minimum weighted covariance determinant estimator proposed by Kalina (2012b).
In this paper, we do not discuss robust information extraction from high-dimensional data, which represents a very important and complicated task. It is true that numerous data mining methods suffer from the so-called curse of dimensionality. Neural networks have also not been adapted for high-dimensional data; it can rather be recommended to use alternative approaches for high-dimensional supervised classification (e.g. Bobrowski & Łukaszuk, 2011).
HIGHLY ROBUST METHODS IN DATA MINING

Jan Kalina

Abstract

This paper deals with highly robust methods for the extraction of information from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to the presence of outlying measurements in the underlying data is discussed as a major drawback of the methods considered. The paper proposes several new robust methods for data mining, which are based on the idea of implicit weighting of individual data values. In particular, a new robust method of hierarchical cluster analysis is proposed, cluster analysis being a popular method of data analysis and unsupervised learning. Further, a robust method for estimating parameters in the logistic regression is proposed. This idea is extended to a robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to the presence of noise and outlying measurements in the underlying data is discussed. A method for robust training of neural networks for the task of function approximation, which has the form of a robust estimator in nonlinear regression analysis, is proposed.

Keywords: data mining, robust statistics, high-dimensional data, cluster analysis, logistic regression, neural networks
The research was supported by the grant 13-01930S of the Grant Agency of the Czech Republic.
References
Agresti A. (1990). Categorical data analysis. Wiley, New York.
Beliakov G., Kelarev A., & Yearwood J. (2012). Robust artificial neural networks and outlier detection. Technical report, arxiv.org/pdf/1110.1069.pdf (downloaded November 28, 2012).
Bobrowski L., & Łukaszuk T. (2011). Relaxed linear separability (RLS) approach to feature (gene) subset selection. In Xia X. (Ed.), Selected works in bioinformatics. InTech, Rijeka, 103-118.
Brandl B., Keber C., & Schuster M. G. (2006). An automated econometric decision support system: Forecasts for foreign exchange trades. Central European Journal of Operations Research, 14 (4), 401-415.
Briner R. B., Denyer D., & Rousseau D. M. (2009). Evidence-based management: Concept cleanup time? Academy of Management Perspectives, 23 (4), 19-32.
Buonaccorsi J. P. (2010). Measurement error: Models, methods, and applications. Chapman & Hall/CRC, Boca Raton.
Chae S. S., Kim C., Kim J.-M., & Warde W. D. (2008). Cluster analysis using different correlation coefficients. Statistical Papers, 49 (4), 715-727.
Chen D. S., & Jain R. C. (1994). A robust back propagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5 (3), 467-479.
Christmann A. (1994). Least median of weighted squares in logistic regression with large strata. Biometrika, 81 (2), 413-417.
Čížek P. (2011). Semiparametrically weighted robust estimation of regression models. Computational Statistics & Data Analysis, 55 (1), 774-788.
Čížek P. (2008). Robust and efficient adaptive estimation of binary-choice regression models. Journal of the American Statistical Association, 103 (482), 687-696.
Davies P. L., & Gather U. (2005). Breakdown and groups. Annals of Statistics, 33 (3), 977-1035.
Dreiseitl S., & Ohno-Machado L. (2002). Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics, 35, 352-359.
Dutt-Mazumder A., Button C., Robins A., & Bartlett R. (2011). Neural network modelling and dynamical system theory: Are they relevant to study the governing dynamics of association football players? Sports Medicine, 41 (12), 1003-1017.
Efendigil T., Önüt S., & Kahraman C. (2009). A decision support system for demand forecasting with artificial neural networks and neuro-fuzzy models: A comparative analysis. Expert Systems with Applications, 36 (3), 6697-6707.
Fayyad U., Piatetsky-Shapiro G., & Smyth P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17 (3), 37-54.
Fernandez G. (2003). Data mining using SAS applications. Chapman & Hall/CRC, Boca Raton.
Gao J., & Hitchcock D. B. (2010). James-Stein shrinkage to improve k-means cluster analysis. Computational Statistics & Data Analysis, 54, 2113-2127.
García-Escudero L. A., Gordaliza A., San Martín R., van Aelst S., & Zamar R. (2009). Robust linear clustering. Journal of the Royal Statistical Society B, 71 (1), 301-318.
Gruca T. S., Klemz B. R., & Petersen E. A. F. (1999). Mining sales data using a neural network model of market response. ACM SIGKDD Explorations Newsletter, 1 (1), 39-43.
Gunasekaran A., & Ngai E. W. T. (2012). Decision support systems for logistics and supply chain management. Decision Support Systems and Electronic Commerce, 52 (4), 777-778.
Hakimpoor H., Arshad K. A. B., Tat H. H., Khani N., & Rahmandoust M. (2011). Artificial neural networks' applications in management. World Applied Sciences Journal, 14 (7), 1008-1019.
Hand D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21 (1), 1-15.
Hastie T., Tibshirani R., & Friedman J. (2001). The elements of statistical learning. Springer, New York.
Hekimoglu S., Erenoglu R. C., & Kalina J. (2009). Outlier detection by means of robust regression estimators for use in engineering science. Journal of Zhejiang University Science A, 10 (6), 909-921.
Jaakkola T. S. (2013). Machine learning. http://www.ai.mit.edu/courses/6.867-f04/lectures/lecture-5-ho.pdf (downloaded January 4, 2013).
Jeng J.-T., Chuang C.-T., & Chuang C.-C. (2011). Least trimmed squares based CPBUM neural networks. Proceedings of the International Conference on System Science and Engineering ICSSE 2011, IEEE Computer Society Press, Washington, 187-192.
Kalina J. (2012a). Implicitly weighted methods in robust image analysis. Journal of Mathematical Imaging and Vision, 44 (3), 449-462.
Kalina J. (2012b). On multivariate methods in robust econometrics. Prague Economic Papers, 21 (1), 69-82.
Kalina J. (2011). Some diagnostic tools in robust econometrics. Acta Universitatis Palackianae Olomucensis Facultas Rerum Naturalium Mathematica, 50 (2), 55-67.
Krycha K. A., & Wagner U. (1999). Applications of artificial neural networks in management science: A survey. Journal of Retailing and Consumer Services, 6, 185-203.
Liang K. (2005). Clustering as a basis of hedge fund manager selection. Technical report, University of California, Berkeley, cmfutsarchive/HedgeFunds/hf_managerselection.pdf (downloaded December 20, 2012).
Liano K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7 (1), 246-250.
Maronna R. A., Martin R. D., & Yohai V. J. (2006). Robust statistics: Theory and methods. Wiley, Chichester.
Martinez W. L., Martinez A. R., & Solka J. L. (2011). Exploratory data analysis with MATLAB. Second edition. Chapman & Hall/CRC, London.
Mura L. (2012). Possible applications of the cluster analysis in the managerial business analysis. Information Bulletin of the Czech Statistical Society, 23 (4), 27-40. (In Slovak.)
Murtaza N., Sattar A. R., & Mustafa T. (2010). Enhancing the software effort estimation using outlier elimination methods for agriculture in Pakistan. Pakistan Journal of Life and Social Sciences, 8 (1), 54-58.
Nisbet R., Elder J., & Miner G. (2009). Handbook of statistical analysis and data mining applications. Elsevier, Burlington.
Punj G., & Stewart D. W. (1983). Cluster analysis in marketing research: Review and suggestions for applications. Journal of Marketing Research, 20 (2), 134-148.
Ritchie M. D., Hahn L. W., Roodi N., Bailey L. R., Dupont W. D., Parl F. F., & Moore J. H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics, 69 (1), 138-147.
Rousseeuw P. J., & van Driessen K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12 (1), 29-45.
Rusiecki A. (2008). Robust MCD-based backpropagation learning algorithm. In Rutkowski L., Tadeusiewicz R., Zadeh L., & Zurada J. (Eds.), Artificial Intelligence and Soft Computing. Lecture Notes in Computer Science, 5097, 154-163.
Salibián-Barrera M. (2006). The asymptotics of MM-estimators for linear regression with fixed designs. Metrika, 63, 283-294.
Schäfer J., & Strimmer K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4 (1), Article 32, 1-30.
Seber G. A. F., & Wild C. J. (1989). Nonlinear regression. Wiley, New York.
Shertzer K. W., & Prager M. H. (2002). Least median of squares: A suitable objective function for stock assessment models? Canadian Journal of Fisheries and Aquatic Sciences, 59, 1474-1481.
Shin S., Yang L., Park K., & Choi Y. (2009). Robust data mining: An integrated approach. In Ponce J., & Karahoca A. (Eds.), Data mining and knowledge discovery in real life applications. I-Tech Education and Publishing, New York.
Soda P., Pechenizkiy M., Tortorella F., & Tsymbal A. (2010). Knowledge discovery and computer-based decision support in biomedicine. Artificial Intelligence in Medicine, 50 (1), 1-2.
Stigler S. M. (2010). The changing history of robustness. American Statistician, 64 (4), 277-281.
Svozil D., Kalina J., Omelka M., & Schneider B. (2008). DNA conformations and their sequence preferences. Nucleic Acids Research, 36 (11), 3690-3706.
Tibshirani R., Walther G., & Hastie T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B, 63 (2), 411-423.
Vintr T., Vintrová V., & Řezanková H. (2012). Poisson distribution based initialization for fuzzy clustering. Neural Network World, 22 (2), 139-159.
Víšek J. Á. (2001). Regression with high breakdown point. In Antoch J., & Dohnal G. (Eds.), Proceedings of ROBUST 2000, Summer School of JČMF. JČMF and Czech Statistical Society, Prague, 324-356.
Yeung D. S., Cloete I., Shi D., & Ng W. W. Y. (2010). Sensitivity analysis for neural networks. Springer, New York.
Youden W. J. (1950). Index for rating diagnostic tests. Cancer, 3, 32-35.
Zvárová J., Veselý A., & Vajda I. (2009). Data, information and knowledge. In Berka P., Rauch J., & Zighed D. (Eds.), Data mining and medical knowledge management: Cases and applications standards. IGI Global, Hershey, 1-36.