Serbian Journal of Management 8 (1) (2013) 9 - 24
www.sjm06.com
DOI: 10.5937/sjm8-3226

HIGHLY ROBUST METHODS IN DATA MINING

Jan Kalina*

Institute of Computer Science of the Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Praha 8, Czech Republic

(Received 14 January 2013; accepted 4 February 2013)

Abstract

This paper is devoted to highly robust methods for information extraction from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to the presence of outlying measurements in the observed data is discussed as their major drawback. The paper proposes several new highly robust methods for data mining, based on the idea of implicit weighting of individual data values. In particular, it proposes a novel robust method of hierarchical cluster analysis, which is a popular data mining method of unsupervised learning. Further, a robust method for estimating parameters in logistic regression is proposed, and this idea is extended to a robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to the presence of noise and outlying measurements in the data is discussed, and a method for robust training of neural networks for the task of function approximation, which has the form of a robust estimator in nonlinear regression, is proposed.

Keywords: data mining, robust statistics, high-dimensional data, cluster analysis, logistic regression, neural networks

* Corresponding author: [email protected]

1. STATISTICS VS. DATA MINING

Data mining can be characterized as a process of information or knowledge extraction from data sets, which leads to revealing and investigating systematic associations among individual variables. It contains exploratory data analysis, descriptive modeling, classification, and regression, while classification together with
regression are commonly denoted as
predictive data mining. The concept of data
mining can be described as the analytical
part of the overall process of extracting
useful information from data, which is
traditionally called knowledge discovery
(Fayyad et al., 1996; Nisbet et al., 2009; Martinez et al., 2011). We can say that data
mining has the aim to extract information,
while knowledge discovery goes further and
aims at acquiring knowledge relevant for the
field of expertise; compare Fernandez (2003)
and Zvárová et al. (2009). Actually Soda et
al. (2010) required data mining methods to
be dynamically integrated within knowledge
discovery approaches.
In management applications, data mining
is often performed with the aim to extract
information relevant for making predictions
and/or decision making, which can be
described as selecting an activity or series of
activities among several alternatives.
Decision making integrates uncertainty as
one of the aspects with an influence on the
outcome. The development of computer technology has made it possible to implement partially or fully automatic decision support systems, i.e. complex systems offering assistance with the decision making process and able to compare different alternatives in terms of their risk. Such systems are capable of solving a variety of complex tasks, analyzing different information components, extracting information of different types, and deducing conclusions for management, financial, or econometric applications (Gunasekaran & Ngai, 2012; Brandl et al., 2006), allowing the best available decision to be found within the framework of evidence-based management (Briner, 2009).
Unlike classical statistical procedures, the
data mining methodology does not have the
ambition to generalize its results beyond the
data summarized in a given database. In data
mining, one does not usually assume a
random sample from a certain population
and the data are analyzed and interpreted as
if they constituted the whole population. On
the other hand, the work of a statistician is
often based on survey sampling, which
also requires designing questionnaires, training interviewers, working with databases, aggregating the data, computing descriptive statistics, or estimating non-response. Moreover, a
statistician commonly deals with observing a
smaller number of variables on relatively
small samples, which is another difference
from the data mining context.
Removal of outliers is one of the important steps of validating the plausibility of a model
both in statistics and data mining. A common disadvantage of popular data mining methods (linear regression, classification analysis, machine learning) is their high sensitivity (non-robustness) to the presence of outlying measurements in the data. In response, statisticians have developed robust statistical methodology as an alternative to standard procedures; robust methods are insensitive to the presence of outliers as well as to deviations from standard distributional assumptions. Although the concept of robust
data mining describing a methodology for
data mining based on robust statistics has
been introduced (Shin et al., 2007), robust
methods have not found their way to real
data mining applications yet.
This paper proposes new highly robust
methods suitable for the analysis of data in
management applications. We use the idea of implicit weighting to derive new alternative multivariate statistical methods, which ensures a high breakdown point. This paper
has the following structure. Section 2 recalls
the highly robust least weighted squares method for estimating parameters in a linear regression model. Section 3 proposes a robust cluster analysis method. Section 4 proposes a robust estimation method for logistic regression, which is used to define a robust multinomial logistic classification in Section 5. Section 6 is devoted to robustness aspects of neural networks and finally Section 7 concludes the paper.
2. LEAST WEIGHTED SQUARES
REGRESSION
Because classical statistical methods suffer from the presence of outlying data values (outliers), robust statistical methods have been developed as an alternative approach to data modeling. They originated in the 1960s as a diagnostic tool for classical methods (Stigler, 2010), but have developed into reliable self-standing procedures tailor-made to suppress the effect of data contamination by various kinds of outliers (Hekimoglu et al., 2009). This section describes the least weighted squares estimator, which is one of the promising robust estimators of parameters in the linear regression model. Its idea will be exploited in the following sections.
M-estimators represent the most widely used robust statistical methods (Maronna et al., 2006). However, as the concept of breakdown point has become the most important statistical measure of resistance (insensitivity) against noise or outlying measurements in the data (Davies & Gather, 2005), M-estimators have been criticized for their low breakdown point in linear regression (Salibián-Barrera, 2006). Only recently have highly robust methods, defined as methods with a high breakdown point, been proposed.
The least weighted squares (LWS) estimator proposed by Víšek (2001) is a highly robust statistical tool based on implicit weighting of individual observations. The idea is to assign smaller weights to less reliable observations, without the necessity to specify precisely which observations are outliers and which are not.
In the original proposal, the user has to select the magnitudes of the weights. These are assigned to particular observations only after a permutation, which is automatically determined during the computation of the estimator. More recent versions of the LWS do not require the user to choose the magnitudes of the weights: Čížek (2011) proposed several adaptive versions of the LWS estimator, including a two-stage procedure with data-dependent quantile-based weights. The estimator has a high breakdown point and at the same time a 100 % asymptotic efficiency of the least squares under Gaussian errors. Its relative efficiency compared to maximum likelihood estimators remains high (over 85 %) under various distributional models for moderate samples, as evaluated in a numerical study (Čížek, 2011). Besides, compared to M-estimators, the LWS estimator of regression parameters does not require a simultaneous estimation of the error variance; the need for such estimation is a disadvantage of M-estimators, which lose computational simplicity.
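As an illustration, the implicit weighting described above can be sketched in a few lines of Python. The linearly decreasing weight magnitudes, the random elemental starts, and the concentration loop are assumed choices of this sketch (a simplification in the spirit of the weighted LTS algorithm), not the adaptive weights of Čížek (2011):

```python
import numpy as np

def lws_regression(X, y, n_starts=50, n_conc=30, seed=0):
    """Illustrative sketch of the LWS idea: least squares iteratively
    reweighted through the ranks of squared residuals, so that observations
    with large residuals implicitly receive small weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # add an intercept column
    q = Xd.shape[1]
    # weight magnitudes: 1 for the smallest residual, decreasing to 0
    magnitudes = np.maximum(0.0, np.linspace(1.0, -1.0, n))
    best_beta, best_loss = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=q, replace=False)   # random elemental start
        beta = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)[0]
        for _ in range(n_conc):                      # concentration steps
            r2 = (y - Xd @ beta) ** 2
            w = np.empty(n)
            w[np.argsort(r2)] = magnitudes           # weights via residual ranks
            sw = np.sqrt(w)
            beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
        r2 = (y - Xd @ beta) ** 2
        w = np.empty(n)
        w[np.argsort(r2)] = magnitudes
        loss = np.sum(w * r2)
        if loss < best_loss:
            best_loss, best_beta = loss, beta
    return best_beta
```

Because the weights are assigned through the ranks of the residuals, gross outliers end up with zero weight without the user ever labeling them as outliers.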
The LWS estimator is a weighted analogue of the least trimmed squares (LTS) estimator of Rousseeuw and van Driessen (2006), who considered weights equal to 1 or 0 only. However, the LTS estimator suffers from a high sensitivity to small deviations near the
center of the data, while the LWS estimator possesses also a reasonable local robustness (Víšek, 2001). The computation of the LWS estimator is intensive and an approximative algorithm can be obtained as a weighted version of the LTS algorithm (Rousseeuw & van Driessen, 2006). Diagnostic tools for the LWS estimator in the linear regression context (mainly for econometric applications) were proposed by Kalina (2011) and hypothesis tests concerning the significance of LWS estimators by Kalina (2012a). Econometric applications of robust statistical methods were summarized by Kalina (2012b). A generalization of the LWS estimator for nonlinear regression will be proposed in Section 6.4.
3. ROBUST CLUSTER ANALYSIS
Cluster analysis (clustering) is a commonly used data mining tool in management applications. Examples of using cluster analysis include customer segmentation (differentiation) based on sociodemographic characteristics, life style, size of consumption, experience with particular products, etc. (Mura, 2012). Another application is the task of market segmentation, allowing to position products or to categorize managers on the basis of their style and strategy (Punj & Stewart, 1983; Liang, 2005). All such tasks, solved by grouping units into natural clusters at the explorative stage of information extraction, are vital for a successful marketing or management strategy.
Cluster analysis is a general methodologyaiming at extracting knowledge about themultivariate structure of given data. It solves
the task of unsupervised learning by dividing the data set into several subsets (clusters) without using prior knowledge about the group membership of each observation. It is often used as an exploratory technique and can also be interpreted as a technique for dimension reduction of complex multivariate data. Cluster analysis assumes the data to be fixed (non-random), without the ambition for statistical inference. It comprises a wide variety of methods with numerous possibilities for choosing different parameters and adjusting the whole computation process.
The most common approaches to clustering include agglomerative hierarchical clustering and k-means clustering, which both suffer from the presence of outliers in the data and strongly depend on the choice of the particular method. Some approaches are also sensitive to the initialization of the random algorithm (Vintr et al., 2012). Robust versions of clustering have recently been studied in the context of molecular genetics, but they have not penetrated to management applications yet. A robust hierarchical clustering was proposed by García-Escudero et al. (2009) and a robust version of k-means cluster analysis tailor-made for high-dimensional applications by Gao and Hitchcock (2010). It is also possible, but tedious, to identify and manually remove outlying measurements from multivariate data (Svozil, 2008).
In this paper, we propose a new robust agglomerative hierarchical cluster analysis based on a robust correlation coefficient as a measure of similarity between two clusters. The Pearson correlation coefficient is a common measure of similarity between two clusters (Chae et al., 2008). We propose a robust measure of distance between two clusters based on implicit weighting (Kalina,
2012a), inspired by the LWS regression estimator.
Algorithm 1: (Robust distance between two clusters). Let us assume two disjoint clusters C1 and C2 of p-variate observations. The distance dLWS(C1, C2) between the clusters C1 and C2 is defined by the following algorithm.

1. Select an observation X = (X1, …, Xp)T from C1 and select an observation Y = (Y1, …, Yp)T from C2.

2. Consider the linear regression model

Yi = β0 + β1 Xi + ei,  i = 1, …, p. (1)

3. Compute the LWS estimator in the model (1). The optimal weights determined by the LWS estimator will be denoted by w* = (w1*, …, wp*)T.

4. The LWS-distance between X and Y, denoted by dLWS(X, Y), is defined as

dLWS(X, Y) = [Σi wi*(Xi − X̄w)(Yi − Ȳw)] / {[Σj wj*(Xj − X̄w)²][Σj wj*(Yj − Ȳw)²]}^(1/2), (2)

where all sums extend over 1, …, p, X̄w is the weighted mean of X1, …, Xp and Ȳw is the weighted mean of Y1, …, Yp, both with the optimal weights w*.

5. The distance dLWS(C1, C2) between the clusters C1 and C2 is defined as the maximum of dLWS(X, Y) over all possible observations X coming from C1 and all possible observations Y coming from C2.
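A minimal sketch of steps 2-4 of Algorithm 1 in Python may look as follows. The LWS fit is approximated here by rank-based reweighted least squares with linearly decreasing weight magnitudes, an assumed simplification of the exact LWS computation:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted correlation coefficient of equation (2) for given weights w."""
    xm = np.sum(w * x) / np.sum(w)               # weighted mean of x
    ym = np.sum(w * y) / np.sum(w)               # weighted mean of y
    num = np.sum(w * (x - xm) * (y - ym))
    den = np.sqrt(np.sum(w * (x - xm) ** 2) * np.sum(w * (y - ym) ** 2))
    return num / den

def d_lws(x, y, n_conc=20):
    """Fit y ~ x by an iteratively reweighted approximation of the LWS
    estimator and plug the resulting implicit weights into (2)."""
    p = len(x)
    magnitudes = np.linspace(1.0, 0.0, p)        # decreasing weight magnitudes
    Xd = np.column_stack([np.ones(p), x])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    for _ in range(n_conc):
        r2 = (y - Xd @ beta) ** 2
        w = np.empty(p)
        w[np.argsort(r2)] = magnitudes           # implicit weights via ranks
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
    r2 = (y - Xd @ beta) ** 2
    w = np.empty(p)
    w[np.argsort(r2)] = magnitudes
    return weighted_corr(x, y, w)
```

An observation pair contaminated in a few coordinates still yields a weighted correlation close to that of the uncontaminated coordinates, because the outlying coordinates receive the smallest weights.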
Now we describe the whole procedure of hierarchical bottom-up robust cluster analysis. The robust cluster analysis method, denoted by LWS-CA, is defined in the following way.
Algorithm 2: (Robust cluster analysis LWS-CA).

1. Consider each particular observation as an individual cluster.

2. Search for the pair of clusters C1 and C2 with the minimal value of the robust distance dLWS(C1, C2).

3. Join the pair of clusters selected in step 2 into a single cluster.

4. Repeat steps 2 and 3 until a suitable number of clusters is found, using the gap statistic criterion (Tibshirani et al., 2001).
The LWS-CA method corresponds to a single linkage clustering used together with a robust version of the Pearson correlation coefficient based on implicitly assigned weights. The stopping rule based on the gap statistic compares the inter-cluster variability with the intra-cluster variability in the data. Alternatively, the user may require a fixed value of the final number of resulting clusters before the computation.
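The agglomerative loop of Algorithm 2 can be sketched as follows, using a fixed final number of clusters (the variant mentioned at the end of Section 3) and an arbitrary robust pairwise distance passed in as a function; the cluster-level distance is the maximum over cross-cluster pairs, as in step 5 of Algorithm 1:

```python
import numpy as np

def agglomerate(observations, pair_dist, n_clusters):
    """Bottom-up clustering: repeatedly join the pair of clusters with the
    smallest distance, where the distance between two clusters is the maximum
    of pair_dist over all cross-cluster pairs of observations."""
    clusters = [[i] for i in range(len(observations))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(pair_dist(observations[i], observations[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # join the closest pair
        del clusters[b]
    return clusters
```

Any robust distance, such as one derived from the implicitly weighted correlation (2), can be supplied as `pair_dist`; the loop itself is independent of the choice.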
4. ROBUST LOGISTIC REGRESSION
Logistic regression is a basic tool for modeling the trend of a binary variable depending on one or several regressors (continuous or categorical). From the statistical point of view, it is the most commonly used special case of a generalized linear model. At the same time, it is also commonly used as a method for classification analysis (Agresti, 1990).
The maximum likelihood estimation of parameters in logistic regression is known to be vulnerable to the presence of outliers. Moreover, the logistic regression model does not assume errors in the regressors,
which may be an unrealistic assumption in many applications. Outliers appear in the logistic regression context quite commonly as measurement errors, or can emerge as typing errors both in the regressors and the response (Buonaccorsi, 2010). This section proposes a novel robust estimator for the parameters of logistic regression based on the least weighted squares estimator (Section 2) and derives its breakdown point.
A highly robust approach was pioneered by Christmann (1994), who used the least median of squares (LMS) method for estimating parameters in the logistic regression model and proved the estimator to possess the maximal possible breakdown point. Nevertheless, the low efficiency of the LMS estimator has been reported as unacceptable; see Shertzer and Prager (2002) for a management application. Čížek (2008) advocated highly robust estimators of the parameters of logistic regression, because they have the potential of a relatively low bias without the need for a bias correction, which would be necessary for M-estimators. A nonparametric alternative to the logistic regression is known as the multifactor dimensionality reduction, proposed by Ritchie et al. (2001).
Let us now consider the model of the logistic regression with a binary response Y = (Y1, …, Yn)T. Its values equal to 1 or 0 can be interpreted as a success (or failure, respectively) of a random event. The probability of success for the i-th observation (i = 1, …, n) is modeled as a response of independent variables, which can be either continuous or discrete. The regressors are denoted as Xi = (X1i, …, Xpi)T for i = 1, …, n. The conditional distribution of Yi, assuming fixed values of the regressors, is assumed to be binomial Bi(mi, πi), where mi is a known positive integer and the probability πi depends on regression parameters β1, …, βp for i = 1, …, n through

log[πi/(1 − πi)] = β1X1i + … + βpXpi. (3)

We introduce the notation vi = [miπi(1 − πi)]^(1/2) and

X̃i = viXi  and  Ỹi = vi log[πi/(1 − πi)] (4)

for i = 1, …, n.

Definition 2. We define the least weighted logistic regression (LWLR) estimator in the model (3) in the following way. We consider the transformations (4), where the unknown probabilities πi (i = 1, …, n) are estimated by the maximum likelihood method. The LWLR is defined as the LWS estimator with the adaptive quantile-based weights computed for the data

(X̃1T, Ỹ1)T, …, (X̃nT, Ỹn)T. (5)
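A sketch of the transformations of Definition 2 in Python: the maximum likelihood estimate is computed by a plain Newton-Raphson iteration (an assumed implementation detail), and the transformed data (5), on which the LWS estimator would subsequently be computed, are returned:

```python
import numpy as np

def lwlr_transform(X, y, m, n_newton=25):
    """Estimate pi_i by maximum likelihood in the binomial logistic model (3)
    and form the transformed data of (4)-(5).
    X: n x p design, y: success counts Y_i, m: group sizes m_i."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_newton):                    # Newton-Raphson for the MLE
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = m * pi * (1.0 - pi)                  # binomial variances
        beta = beta + np.linalg.solve(X.T @ (X * W[:, None]),
                                      X.T @ (y - m * pi))
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = np.sqrt(m * pi * (1.0 - pi))             # v_i of equation (4)
    return X * v[:, None], v * np.log(pi / (1.0 - pi)), beta
```

The returned pair (X̃, Ỹ) plays the role of the data (5); running a robust linear estimator on it down-weights observations whose transformed residuals are large.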
The LWLR estimator has a high breakdown point, as evaluated in the following theorem, which can be proven as an analogy of the result of Christmann (1994).

Theorem 1. Assume the values mi to be reasonably large and assume the technical assumptions of Christmann (1994). Then the breakdown point of the LWLR estimator computed with the data-dependent adaptive weights of Čížek (2011) is equal to

(⌊n/2⌋ − p + 2)/n, (6)
where ⌊n/2⌋ denotes the integer part of n/2, defined as the largest integer smaller than or equal to n/2.

Let us now discuss using the robust logistic regression as a classification method into two groups. The observed data are interpreted as a training data set with the aim to learn a classification rule. Let pi, defined by

pi = exp(b1X1i + … + bpXpi)/[1 + exp(b1X1i + … + bpXpi)], (7)

denote the estimated probability that the i-th observation belongs to group 1, where (b1, …, bp)T is the LWLR estimate of (β1, …, βp)T. Theoretical textbooks (e.g. Jaakkola, 2013) commonly recommend using the logistic regression to classify a new observation to group 1 if and only if
pi > 1/2, (8)
which is equivalent to
log[pi/(1 − pi)] > 0. (9)
However, this may be very unsuitable in some situations, especially if a large majority of the training data belongs to one of the two given groups.
A more efficient classification rule is obtained by replacing the rule (9) by the rule pi > c, where the constant c is determined in an optimal way, namely so that it minimizes the total classification error. We point out that this is equivalent to maximizing Youden's index I (Youden, 1950), defined as

I = sensitivity + specificity − 1, (9')

where sensitivity is defined as the probability of a correct classification of a successful observation and specificity is the probability of a correct classification of an unsuccessful observation. The optimization of the threshold c in the classification context is common for neural networks, as will be described in Section 6.2.
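A direct search for the threshold maximizing Youden's index can be sketched as follows (the function name and the search restricted to observed probabilities are assumptions of this sketch):

```python
import numpy as np

def optimal_threshold(p_hat, y_true):
    """Choose the classification threshold c by maximizing Youden's index
    I = sensitivity + specificity - 1 over the observed probabilities."""
    best_c, best_I = 0.5, -np.inf
    for c in np.unique(p_hat):
        pred = p_hat > c
        sens = np.mean(pred[y_true == 1]) if np.any(y_true == 1) else 0.0
        spec = np.mean(~pred[y_true == 0]) if np.any(y_true == 0) else 0.0
        I = sens + spec - 1.0
        if I > best_I:
            best_I, best_c = I, c
    return best_c, best_I
```

Scanning only the observed probabilities suffices, because the index is piecewise constant between consecutive fitted values of pi.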
5. ROBUST MULTINOMIAL
LOGISTIC CLASSIFICATION
ANALYSIS
Classification analysis into several (more than two) groups is a common task in management applications. We extend the robust logistic regression estimator LWLR of Section 4 to a robust multinomial logistic classification analysis into several groups. The new method is insensitive to the presence of contamination in the data.
First, we recall the multinomial logistic regression, which is an extension of the logistic regression (3) to a model with a response taking several different values. We assume a total number of n measurements, denoted as (X11, …, X1n)T, …, (Xp1, …, Xpn)T. Each of them belongs to one of K different groups. The index variable Y = (Y1, …, Yn)T contains the code 1, …, K corresponding to the group membership of each measurement. We consider a model assuming that Y follows a multinomial distribution with K categories.
Definition 3. The data are assumed to form K independent random samples of p-variate data. The multinomial logistic regression model is defined as
log[P(Yi = k)/P(Yi = K)] = βk1X1i + … + βkpXpi,  i = 1, …, n,  k = 1, …, K − 1, (10)
where P(Yi=k) denotes the probability that
the observation Yi belongs to the k-th group.
Here each of the groups 1,…,K has its
own set of regression parameters, expressing
the discrimination between the particular
group and the reference group K. We also
point out that the value

log[P(Yi = k)/P(Yi = l)], (11)

which compares the probability that the observation Yi belongs to the k-th group with the probability that Yi belongs to the l-th group (k = 1, …, K, l = 1, …, K), can be evaluated as

log[P(Yi = k)/P(Yi = l)] = (βk1 − βl1)X1i + … + (βkp − βlp)Xpi. (12)
Our aim is to use the observed data as a
training set to learn a robust classification
rule allowing to assign a new observation
Z=(Z1,…,Zp)T to one of the K given groups.
As a solution, we define the robust
multinomial logistic classification analysis
based on a robust estimation computed
separately in each of the K − 1 particular models (10).
Definition 4. In the model (10), let pk denote the estimate of the probability that the observation Z belongs to the k-th group for k = 1, …, K, obtained by replacing the unknown regression parameters in (10) by their maximum likelihood estimates. Using the notation vki = [mipki(1 − pki)]^(1/2) and

X̃ki = vkiXi  and  Ỹki = vki log[pki/(1 − pki)] (13)

for i = 1, …, n, the robust estimator is obtained as the LWS estimator with adaptive quantile-based weights computed for the data

(X̃k1T, Ỹk1)T, …, (X̃knT, Ỹkn)T. (14)

The robust multinomial logistic classification analysis (RMLCA) assigns a new observation Z to the group k if and only if

pk ≥ pj for all j = 1, …, K, j ≠ k. (15)
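The classification rule (15) can be sketched as follows, assuming the K − 1 submodels of (10) have already been estimated (robustly or otherwise) and their coefficient vectors stacked into a matrix:

```python
import numpy as np

def rmlca_assign(z, B):
    """Assign a new observation z to a group by rule (15).
    B is a (K-1) x p array of estimated parameters of the K-1 binary
    submodels (10), with group K as the reference group."""
    eta = B @ z                                  # log-odds against group K
    num = np.append(np.exp(eta), 1.0)            # reference group has eta = 0
    probs = num / num.sum()                      # multinomial probabilities
    return int(np.argmax(probs)) + 1, probs      # groups coded 1, ..., K
```

The reference group K contributes a unit term because its log-odds against itself are zero, so the probabilities of all K groups sum to one.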
6. NEURAL NETWORKS
6.1. Principles of neural networks
Neural networks represent an important
predictive data mining tool with a high
flexibility and ability to be applied to a wide
variety of complex models (Hastie et al.,
2001). They have become increasingly
popular as an alternative to statistical
approaches for classification (Dutt-
Mazumder et al., 2011). In management
applications, neural networks are commonly
used in risk management, market
segmentation, classification of consuming
spending patterns, sale forecasts, analysis of
financial time series, quality control, or
strategic planning; see the survey by
Hakimpoor et al. (2011) for a full
discussion. However, Krycha and Wagner
(1999) warned that most papers on
management applications ‘do not report
completely how they solved the problems at
hand by the neural network approach’.
Neural networks are biased under the
,)lY(p)kY(p
logi
i
i11lpikpi11ki
i XX...X)lY(p)kY(p
log
.X... pilp (12)
ikiki21
kkiki XvX~,))p1(pm(v
))p1/(plog(vY~ kkkiki
.)Y~,X~(,...,)Y~,X~( Tkn
Tkn
T1k
T1k
presence of outliers, and a robust estimation of their parameters is desirable (Beliakov et al., 2011). In general, neural networks require the same validation steps as statistical methods, because they involve exactly the same sort of assumptions as statistical models (Fernandez, 2003). Actually, neural networks need an evaluation even more than the logistic regression, because more complicated models are more vulnerable to overfitting than simpler models. Unfortunately, a validation is often infeasible for complex neural networks. This leads users to a tendency to believe that neural networks do not require any statistical assumptions or that they are model-free (Rusiecki, 2008), which may lead to a wrong feeling that they do not need any kind of validation.
We mention also some other weak points of training neural networks. They are often called black boxes, because their numerous parameters cannot be interpreted clearly and it is not possible to perform a variable selection by means of testing the statistical significance of individual parameters. Moreover, learning reliable estimates of the parameters requires a very large number of observations, especially for data with a large number of variables. Overfitting is also a common shortcoming of neural networks, which is a consequence of estimating the parameters entirely over the training data without validating on an independent data set.
Different kinds of neural networks can be distinguished according to the form of the output and the choice of the activation function, and different tasks require different kinds of networks. In this work, we consider the two most common kinds of multilayer perceptrons, which are also called multilayer feedforward networks. Their training is most commonly based on the back-propagation algorithm. Section 6.2 discusses neural networks for a supervised classification task, i.e. networks with a binary output, and explains their link to the logistic regression. Section 6.3 discusses neural networks for function approximation, i.e. networks with a continuous output, and studies a robust approach to their fitting. A new robust estimation tool suitable for some types of such neural networks is described in Section 6.4 as the nonlinear least weighted squares estimator.
6.2. Neural networks for classification
We discuss supervised multilayer feedforward neural networks employed for a classification task. The input data come from K independent random samples (groups) of p-dimensional data and the aim is to learn a classification rule allowing to classify a new measurement into one of the groups. In this section, we also describe the connection between a classification neural network and logistic regression.
The network consists of an input and an output layer of neurons and possibly of one or several hidden layers, which are mutually connected by edges. A so-called weight corresponds to each observed variable. This weight can be interpreted as a regression parameter and its value can be any real (also negative) number. In the course of fitting the neural network, the weights connecting each neuron with the neurons of the next layer are optimized. A given activation function is applied to the weighted inputs to determine the output of the neural network.
First, let us consider a neural network constructed for the task of classification into two groups with no hidden layers. The
output of the network is obtained in the form

f(x) = g(wTx + b), (16)

where x ∈ Rp represents the input data, R denotes the set of all real numbers, w is the vector of weights, and b is a constant (intercept). Simple neural networks usually use the activation function g as a binary function or a monotone function, typically a sigmoid function such as the logistic function or the hyperbolic tangent.
If the logistic activation function

g1(x) = 1/(1 + exp(−x)),  x ∈ R, (17)

is applied, the neural network is precisely equal to the model of logistic regression (Dreiseitl & Ohno-Machado, 2002). This special case of a neural network differs from the logistic regression only in the method for estimating the (regression) parameters. Moreover, if the hyperbolic tangent activation function

g2(x) = tanh(x),  x ∈ R, (18)

is applied, the neural network without hidden layers is again equal to the model of the logistic regression. Moreover, it can be easily proven that g2(x) = 2g1(2x) − 1 for any real x.
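The identity linking the two activation functions can be checked numerically:

```python
import numpy as np

def g1(x):
    """Logistic activation (17)."""
    return 1.0 / (1.0 + np.exp(-x))

def g2(x):
    """Hyperbolic tangent activation (18)."""
    return np.tanh(x)

# check the identity g2(x) = 2*g1(2x) - 1 on a grid of points
x = np.linspace(-5.0, 5.0, 101)
assert np.allclose(g2(x), 2.0 * g1(2.0 * x) - 1.0)
```

The identity shows that the two activations parameterize the same family of decision rules, differing only by an affine rescaling of the output.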
An influence of outliers on classification analysis performed by neural networks has been investigated only empirically (e.g. Murtaza et al., 2010). Classification neural networks are sensitive to the presence of outliers, but a robust version has not been proposed. For the classification into two groups, the sensitivity of the neural networks is a consequence of their connection to the
logistic regression. Thanks to the connection between neural networks and logistic regression, the robust logistic regression estimator LWLR (Section 4) represents at the same time a robust method for fitting classification neural networks with no hidden layers.
Moreover, references on the sensitivity analysis of neural networks have not analyzed other situations which negatively influence the training of neural networks (cf. Yeung et al., 2010). An example is multicollinearity, which is a phenomenon as harmful for neural networks as for the logistic regression.
Neural networks for classification into K groups are commonly used with one or more hidden layers. Such networks use a binary decision in each node of the last hidden layer, driven by a sigmoid activation function. Therefore, they are as sensitive to the presence of outliers as the networks for classification into two groups.
6.3. Neural networks for function
approximation
Neural networks are commonly used as a tool for approximating a continuous real function. In management, multilayer feedforward networks are often used mainly for predictions, e.g. for modeling and predicting market response (Gruca et al., 1999) or demand forecasting (Efendigil et al., 2009). Most commonly, they use the identity activation function, i.e. they can be described as

f(x) = wTx + b,  x ∈ Rp. (19)

An alternative approach to function approximation is to use radial basis function neural networks. Now we need to describe
the back-propagation algorithm for function approximation networks, which is at the same time commonly used also for the classification networks of Section 6.2.
The back-propagation algorithm for neural networks employed for function approximation minimizes the total error computed across all data values of the training data set. The algorithm is based on the least squares method, which is optimal for normally distributed random errors in the data (Rusiecki, 2008). After an initialization of the values of the parameters, the forward propagation is a procedure for computing weights for the neurons sequentially in particular hidden layers. This leads to computing the value of the output and consequently the sum of squared residuals computed for the whole training data set. To reduce the sum of squared residuals, the network is sequentially analyzed from the output back to the input. Particular weights for individual neurons are transformed using the optimization method of the steepest gradient. However, there are no diagnostic tools available which would be able to detect substantial information in the residuals, e.g. in the form of their dependence, heteroscedasticity, or systematic trend.
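For the simplest network with no hidden layer and identity activation, the least squares training described above reduces to steepest gradient descent on the sum of squared residuals; a minimal sketch (the learning rate and epoch count are assumed values):

```python
import numpy as np

def train_ls(X, y, lr=0.5, n_epochs=1000):
    """Steepest-descent training of a no-hidden-layer network with identity
    activation, minimizing the mean squared residual over the training set."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_epochs):
        r = y - (X @ w + b)                 # residuals on the training set
        w += lr * (X.T @ r) / n             # gradient step for the weights
        b += lr * r.mean()                  # gradient step for the intercept
    return w, b
```

Because the objective is the plain sum of squares, a single gross outlier in y shifts the limit of this iteration, which is exactly the non-robustness discussed in this section.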
A robust version of fitting multilayer feedforward networks for the task of function approximation for contaminated data has been described only for specific kinds of neural networks. Rusiecki (2008) studied neural networks based on robust multivariate estimation using the minimum covariance determinant estimator. Chen and Jain (1994) investigated M-estimators and Liano (1996) studied the influence function of neural networks for function approximation as a measure of their robustness.
Jeng et al. (2011) and Beliakov et al. (2012) studied neural networks based on nonlinear regression. Instead of estimating the parameters of their neural networks by means of the traditional nonlinear least squares, they performed the LTS estimation. In this paper, we propose an alternative approach based on the LWS estimator. Because such an approach is general and not connected to the context of neural networks, we present the methodology as a self-standing Section 6.4.
6.4. Nonlinear least weighted squares
regression
Let us consider the nonlinear regression model

Yi = f(β1X1i + … + βpXpi) + ei,  i = 1, …, n, (20)

where Y = (Y1, …, Yn)T is a continuous response, (X11, …, X1n)T, …, (Xp1, …, Xpn)T are the regressors, and f is a given nonlinear function. Let u(i)(b) denote the residual corresponding to the i-th observation for a given estimate b = (b1, …, bp)T ∈ Rp of the regression parameters (β1, …, βp)T. We consider the squared residuals arranged in ascending order,

u(1)²(b) ≤ u(2)²(b) ≤ … ≤ u(n)²(b). (21)

We define the least weighted squares estimator of the parameters in the model (20) as

arg min Σi wi u(i)²(b),  with the sum over i = 1, …, n, (22)

where the argument of the minimum is computed over all possible values of b = (b1, …, bp)T and where w1, …, wn are magnitudes of weights determined by the
user, e.g. linearly decreasing or logistic weights (Kalina, 2012a). Now we describe the algorithm for computing the solution of (22) as an adaptation of the LTS algorithm for the linear regression (cf. Rousseeuw & van Driessen, 2006).
Algorithm 3: (Nonlinear least weighted squares estimator).
1. Set the value of the loss function to +∞. Randomly select p points, which uniquely determine the estimate b of the regression parameters β.
2. Evaluate the residuals for all observations. Assign the weights to all observations based on (21).
3. Compare the value of ∑_{i=1}^{n} w_i u_(i)^2(b) computed with the resulting weights with the current value of the loss function. If the current value of the loss function is larger, go to step 4. Otherwise go to step 5.
4. Set the value of the loss function to ∑_{i=1}^{n} w_i u_(i)^2(b) and store the values of the weights. Find the nonlinear regression estimator of β_1, …, β_p by weighted least squares (Seber & Wild, 1989) using these weights. Go back to steps 2 and 3.
5. Repeatedly (10 000 times) perform steps 1 through 4. The output (optimal) weights are those giving the global optimum of the loss function ∑_{i=1}^{n} w_i u_(i)^2(b) over all repetitions of steps 1 through 4.
7. CONCLUSION
This paper fills a gap in robust statistical methodology for data mining by introducing new highly robust methods based on implicit weighting. Robust statistical methods are established as an alternative approach to certain problems of statistical data analysis.
Data mining methods are routinely used by practitioners in everyday management applications. This paper persuades the reader that robustness is a crucial aspect of data mining which has received only little attention so far, although the sensitivity of data mining methods to the presence of outlying observations has been repeatedly reported as a serious problem (Wong, 1997; Yeung et al., 2010). On the other hand, experience shows that overly sophisticated data mining methods are not very recommendable for practical purposes (Hand, 2006), and simple methods are usually preferable.
This paper introduces new tools for robust data mining using the intuitively clear requirement to down-weight less reliable observations. The robust cluster analysis LWS-CA (Section 3), the robust logistic regression LWLR (Section 4), and the robust multinomial logistic classification analysis (Section 5) are examples of newly proposed methods which can be understood as part of the robust data mining framework.
The logistic regression is commonly described as a white-box model (Dreiseitl & Ohno-Machado, 2002), because it offers a simpler interpretation compared to neural networks, which are characterized as a black box. Because robust methods have been mostly studied for continuous data, our proposal of robust logistic regression is one of the pioneering results on robust estimation in the context of discrete data.
In management applications, neural networks are quite commonly used, but still less frequently compared to the logistic regression. Also, neural networks are sensitive to the presence of outlying
measurements. This paper discusses the importance of robust statistical methods for training neural networks (Section 6) and proposes a robust approach to neural networks for function approximation. Such an approach still requires an extensive evaluation with the aim to detect possible overfitting by means of cross-validation or bootstrap methodologies. Moreover, robust neural networks for classification purposes based on robust back-propagation remain an open problem. A classification neural network can be characterized as a generalization of the logistic regression, although the two models do not use the same method for parameter estimation. Another possibility for a robust training of classification neural networks would be to propose a robust back-propagation based on the minimum weighted covariance determinant estimator proposed by Kalina (2012b).
In this paper, we do not discuss robust information extraction from high-dimensional data, which represents a very important and complicated task. It is true that numerous data mining methods suffer from the so-called curse of dimensionality. Neural networks have also not been adapted for high-dimensional data; it can rather be recommended to use alternative approaches for high-dimensional supervised classification (e.g. Bobrowski & Łukaszuk, 2011).
HIGHLY ROBUST METHODS IN DATA MINING

Jan Kalina

Abstract

This paper deals with highly robust methods for the extraction of information from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to the presence of outlying measurements in the underlying data is discussed as a major drawback of the methods considered. The paper proposes several new robust methods for data mining, which are based on the idea of implicit weighting of individual data values. In particular, a new robust method of hierarchical cluster analysis is proposed, cluster analysis being a popular method of data analysis and unsupervised learning. Further, a robust method for estimating parameters in the logistic regression is proposed. This idea is extended to a robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to the presence of noise and outlying measurements in the underlying data is discussed. A method for robust training of neural networks for the task of function approximation, which has the form of a robust estimator in nonlinear regression analysis, is proposed.

Keywords: data mining, robust statistics, high-dimensional data, cluster analysis, logistic regression, neural networks
The research was supported by the grant 13-01930S of the Grant Agency of the Czech Republic.
References
Agresti A. (1990). Categorical data analysis. Wiley, New York.
Beliakov G., Kelarev A., & Yearwood J. (2012). Robust artificial neural networks and outlier detection. Technical report, arxiv.org/pdf/1110.1069.pdf (downloaded November 28, 2012).
Bobrowski L., & Łukaszuk T. (2011). Relaxed linear separability (RLS) approach to feature (gene) subset selection. In Xia X. (Ed.), Selected works in bioinformatics. InTech, Rijeka, 103-118.
Brandl B., Keber C., & Schuster M. G. (2006). An automated econometric decision support system: Forecasts for foreign exchange trades. Central European Journal of Operations Research, 14 (4), 401-415.
Briner R. B., Denyer D., & Rousseau D. M. (2009). Evidence-based management: Concept cleanup time? Academy of Management Perspectives, 23 (4), 19-32.
Buonaccorsi J. P. (2010). Measurement error: Models, methods, and applications. Chapman & Hall/CRC, Boca Raton.
Chae S. S., Kim C., Kim J.-M., & Warde W. D. (2008). Cluster analysis using different correlation coefficients. Statistical Papers, 49 (4), 715-727.
Chen D. S., & Jain R. C. (1994). A robust back propagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5 (3), 467-479.
Christmann A. (1994). Least median of weighted squares in logistic regression with large strata. Biometrika, 81 (2), 413-417.
Čížek P. (2011). Semiparametrically weighted robust estimation of regression models. Computational Statistics & Data Analysis, 55 (1), 774-788.
Čížek P. (2008). Robust and efficient adaptive estimation of binary-choice regression models. Journal of the American Statistical Association, 103 (482), 687-696.
Davies P. L., & Gather U. (2005). Breakdown and groups. Annals of Statistics, 33 (3), 977-1035.
Dreiseitl S., & Ohno-Machado L. (2002). Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics, 35, 352-359.
Dutt-Mazumder A., Button C., Robins A., & Bartlett R. (2011). Neural network modelling and dynamical system theory: Are they relevant to study the governing dynamics of association football players? Sports Medicine, 41 (12), 1003-1017.
Efendigil T., Önüt S., & Kahraman C. (2009). A decision support system for demand forecasting with artificial neural networks and neuro-fuzzy models: A comparative analysis. Expert Systems with Applications, 36 (3), 6697-6707.
Fayyad U., Piatetsky-Shapiro G., & Smyth P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17 (3), 37-54.
Fernandez G. (2003). Data mining using SAS applications. Chapman & Hall/CRC, Boca Raton.
Gao J., & Hitchcock D. B. (2010). James-Stein shrinkage to improve k-means cluster analysis. Computational Statistics & Data Analysis, 54, 2113-2127.
García-Escudero L. A., Gordaliza A., San Martín R., van Aelst S., & Zamar R. (2009). Robust linear clustering. Journal of the Royal Statistical Society B, 71 (1), 301-318.
Gruca T. S., Klemz B. R., & Petersen E. A. F. (1999). Mining sales data using a neural network model of market response. ACM SIGKDD Explorations Newsletter, 1 (1), 39-43.
Gunasekaran A., & Ngai E. W. T. (2012). Decision support systems for logistics and supply chain management. Decision Support Systems and Electronic Commerce, 52 (4), 777-778.
Hakimpoor H., Arshad K. A. B., Tat H. H., Khani N., & Rahmandoust M. (2011). Artificial neural networks' applications in management. World Applied Sciences Journal, 14 (7), 1008-1019.
Hand D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21 (1), 1-15.
Hastie T., Tibshirani R., & Friedman J. (2001). The elements of statistical learning. Springer, New York.
Hekimoglu S., Erenoglu R. C., & Kalina J. (2009). Outlier detection by means of robust regression estimators for use in engineering science. Journal of Zhejiang University Science A, 10 (6), 909-921.
Jaakkola T. S. (2013). Machine learning. http://www.ai.mit.edu/courses/6.867-f04/lectures/lecture-5-ho.pdf (downloaded January 4, 2013).
Jeng J.-T., Chuang C.-T., & Chuang C.-C. (2011). Least trimmed squares based CPBUM neural networks. Proceedings of the International Conference on System Science and Engineering ICSSE 2011, IEEE Computer Society Press, Washington, 187-192.
Kalina J. (2012a). Implicitly weighted methods in robust image analysis. Journal of Mathematical Imaging and Vision, 44 (3), 449-462.
Kalina J. (2012b). On multivariate methods in robust econometrics. Prague Economic Papers, 21 (1), 69-82.
Kalina J. (2011). Some diagnostic tools in robust econometrics. Acta Universitatis Palackianae Olomucensis Facultas Rerum Naturalium Mathematica, 50 (2), 55-67.
Krycha K. A., & Wagner U. (1999). Applications of artificial neural networks in management science: A survey. Journal of Retailing and Consumer Services, 6, 185-203.
Liang K. (2005). Clustering as a basis of hedge fund manager selection. Technical report, University of California, Berkeley, cmfutsarchive/HedgeFunds/hf_managerselection.pdf (downloaded December 20, 2012).
Liano K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7 (1), 246-250.
Maronna R. A., Martin R. D., & Yohai V. J. (2006). Robust statistics: Theory and methods. Wiley, Chichester.
Martinez W. L., Martinez A. R., & Solka J. L. (2011). Exploratory data analysis with MATLAB. Second edition. Chapman & Hall/CRC, London.
Mura L. (2012). Possible applications of the cluster analysis in the managerial business analysis. Information Bulletin of the Czech Statistical Society, 23 (4), 27-40. (In Slovak.)
Murtaza N., Sattar A. R., & Mustafa T. (2010). Enhancing the software effort estimation using outlier elimination methods for agriculture in Pakistan. Pakistan Journal of Life and Social Sciences, 8 (1), 54-58.
Nisbet R., Elder J., & Miner G. (2009). Handbook of statistical analysis and data mining applications. Elsevier, Burlington.
Punj G., & Stewart D. W. (1983). Cluster analysis in marketing research: Review and suggestions for applications. Journal of Marketing Research, 20 (2), 134-148.
Ritchie M. D., Hahn L. W., Roodi N., Bailey L. R., Dupont W. D., Parl F. F., & Moore J. H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics, 69 (1), 138-147.
Rousseeuw P. J., & van Driessen K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12 (1), 29-45.
Rusiecki A. (2008). Robust MCD-based backpropagation learning algorithm. In Rutkowski L., Tadeusiewicz R., Zadeh L., & Zurada J. (Eds.), Artificial Intelligence and Soft Computing. Lecture Notes in Computer Science, 5097, 154-163.
Salibián-Barrera M. (2006). The asymptotics of MM-estimators for linear regression with fixed designs. Metrika, 63, 283-294.
Schäfer J., & Strimmer K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4 (1), Article 32, 1-30.
Seber G. A. F., & Wild C. J. (1989). Nonlinear regression. Wiley, New York.
Shertzer K. W., & Prager M. H. (2002). Least median of squares: A suitable objective function for stock assessment models? Canadian Journal of Fisheries and Aquatic Sciences, 59, 1474-1481.
Shin S., Yang L., Park K., & Choi Y. (2009). Robust data mining: An integrated approach. In Ponce J., & Karahoca A. (Eds.), Data mining and knowledge discovery in real life applications. I-Tech Education and Publishing, New York.
Soda P., Pechenizkiy M., Tortorella F., & Tsymbal A. (2010). Knowledge discovery and computer-based decision support in biomedicine. Artificial Intelligence in Medicine, 50 (1), 1-2.
Stigler S. M. (2010). The changing history of robustness. American Statistician, 64 (4), 277-281.
Svozil D., Kalina J., Omelka M., & Schneider B. (2008). DNA conformations and their sequence preferences. Nucleic Acids Research, 36 (11), 3690-3706.
Tibshirani R., Walther G., & Hastie T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B, 63 (2), 411-423.
Vintr T., Vintrová V., & Řezanková H. (2012). Poisson distribution based initialization for fuzzy clustering. Neural Network World, 22 (2), 139-159.
Víšek J. Á. (2001). Regression with high breakdown point. In Antoch J., & Dohnal G. (Eds.), Proceedings of ROBUST 2000, Summer School of JČMF. JČMF and Czech Statistical Society, Prague, 324-356.
Yeung D. S., Cloete I., Shi D., & Ng W. W. Y. (2010). Sensitivity analysis for neural networks. Springer, New York.
Youden W. J. (1950). Index for rating diagnostic tests. Cancer, 3, 32-35.
Zvárová J., Veselý A., & Vajda I. (2009). Data, information and knowledge. In Berka P., Rauch J., & Zighed D. (Eds.), Data mining and medical knowledge management: Cases and applications standards. IGI Global, Hershey, 1-36.