
Appl Intell (2008) 29: 187–203. DOI 10.1007/s10489-007-0084-9

Estimation of individual prediction reliability using the local sensitivity analysis

Zoran Bosnić · Igor Kononenko

Published online: 3 August 2007
© Springer Science+Business Media, LLC 2007

Abstract For a given prediction model, some predictions may be reliable while others may be unreliable. The average accuracy of the system cannot provide the reliability estimate for a single particular prediction. The measure of individual prediction reliability can be important information in risk-sensitive applications of machine learning (e.g. medicine, engineering, business). We define empirical measures for the estimation of prediction accuracy in regression. The presented measures are based on the sensitivity analysis of regression models. They estimate the reliability of each individual regression prediction, in contrast to the average prediction reliability of the given regression model. We study the empirical sensitivity properties of five regression models (linear regression, locally weighted regression, regression trees, neural networks, and support vector machines) and the relation between the reliability measures and the distribution of learning examples with prediction errors for all five regression models. We show that the suggested methodology is appropriate only for three of the studied models: regression trees, neural networks, and support vector machines, and we test the proposed estimates with these three models. The results of our experiments on 48 data sets indicate significant correlations of the proposed measures with the prediction error.

Z. Bosnić (✉) · I. Kononenko
University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, Ljubljana, Slovenia
e-mail: [email protected]

I. Kononenko
e-mail: [email protected]

1 Introduction

A key issue in determining the quality of a learning algorithm is the measurement of its accuracy. Commonly used accuracy measures, such as the mean squared error (MSE) and the relative mean squared error (RMSE), evaluate model performance by summarizing the error contributions of all test examples. They nevertheless provide no local information about the expected error of an individual prediction for a given unseen example.

Measuring the expected prediction error is very important in risk-sensitive areas where acting upon predictions may have financial or medical consequences (e.g. medical diagnosis, stock market, navigation, control applications). In such areas, appropriate local accuracy measures may provide additional, necessary information about the prediction confidence. For example, in medical diagnosis physicians are not interested only in the average accuracy of the predictor. When a certain patient is analyzed, the physicians expect the system to provide a prediction as well as an estimate of the reliability of that particular prediction. The average accuracy of the model cannot provide information on whether some particular prediction is reliable or not.

The above described challenge is illustrated in Fig. 1, which contrasts the average reliability estimate (e.g. MSE) with reliability estimates of individual predictions. Note that instead of estimating the aggregated accuracy of the whole prediction model, the individual reliability estimates enable the user to make a distinction between better and worse predictions. The latter approach has an additional advantage as well. Namely, the calculation of individual predictions' reliability estimates does not require true label values. Unlike the MSE estimate, which requires a testing data set, the reliability estimates of individual predictions can be calculated for arbitrary unseen examples.

Fig. 1 Reliability estimate for the whole regression model (above) in contrast to reliability estimates for individual predictions (below)

The existing estimates of prediction errors are founded on the quantitative description of the distribution of learning examples in the problem space, in which algorithms usually make the i.i.d. (independent and identically distributed) assumption. Noise in the data and a nonuniform distribution of examples represent a challenge for learning algorithms, leading to different prediction accuracies in various parts of the problem space. Apart from the distribution of learning examples, there are also other causes that influence the inaccuracy of prediction models: their generalization ability, bias, resistance to noise, avoidance of overfitting, etc. Since these aspects cannot be measured quantitatively, they cannot be composed into a common formula and used to estimate the prediction error. Therefore, we focus on an approach which enables us to analyze the local particularities of learning algorithms.

Our method is based on sensitivity analysis [1]. Sensitivity analysis aims at determining how much the variation of the input can influence the output of a system. Our approach is to locally modify the learning set in a controlled manner in order to explore the sensitivity of the regression model in a particular part of the problem space. By doing so, we adapt the reliability estimate to the local particularities of the data distribution and noise. The sensitivity is thus related to changes of the prediction of the regression model when the learning set is slightly changed. Since the true error of an unlabeled example is not known, we use a more appropriate term, saying that we estimate prediction reliability rather than prediction error. This conforms to the definition of reliability as the ability to perform certain tasks conforming to required quality standards [2]; here, the prediction accuracy in regression is considered the required quality standard.

The paper is organized as follows. Section 2 introduces previous work from three related areas which our approach later combines. In Sect. 3 we present the motivation for this research using the minimum description length (MDL) principle formalization. Section 4 defines our sensitivity analysis task and illustrates the expected output of five regression models. In Sect. 5 we define three reliability estimates. These are tested and compared to another reliability estimation approach in Sect. 6. Section 7 provides conclusions and ideas for further work.

2 Related work

Our paper is related to previous work within three different research areas concerning the properties of learning algorithms. These areas are: the use of sensitivity analysis, perturbations of learning data in order to improve accuracy, and the estimation of the prediction reliability for single examples.

Sensitivity analysis Sensitivity analysis is an approach which has been applied to many areas such as statistics and mathematical programming. In the context of analyzing the stability of learning algorithms it has been discussed by Bousquet and Elisseeff [1]. They defined notions of stability for learning algorithms and showed how to derive generalization error bounds based on the empirical error and the leave-one-out error. They also introduced the concept of a β-stable learner as one for which the expected loss function of the learned solution does not change by more than β with small changes in the training set. Bousquet and Elisseeff [3] and Elisseeff and Pontil [4] applied these ideas to several learning models and showed how to obtain bounds on their generalization performance.

In a similar way, Kearns and Ron [5] define the hypothesis stability as a quantity that measures how much the function learned by the algorithm changes when one point in the training set is removed. All mentioned studies focus on the dependence of error bounds either on the VC (Vapnik–Chervonenkis) theory [17] or on the way the learning algorithm searches the space. By proving the theoretical usefulness of the notion of stability, these approaches motivated us to explore the possibilities for empirical estimation of the individual prediction reliability based on the local stability of the model.


Perturbation of learning data The group of approaches from the second related area is more general. These approaches generate perturbations of the initial learning set to improve the accuracy of the final aggregated prediction. Bagging [6] and boosting [7–9] are the most popular in this field. Besides improving the predictive accuracy, pasting [10] also solves the prediction problem for data sets which are too large to fit in memory.

Tibshirani and Knight [11] introduced the covariance inflation criterion (CIC), which they use to improve the learning error by iteratively generating perturbed versions of the learning set. In each iteration they measure the covariance between the input and the predictor response and perform the model selection accordingly. The studies [12] have shown that CIC is a suitable measure for model comparison, even if we do not use cross-validation to estimate the model accuracy.

Elidan et al. [13] introduced a strategy for escaping local maxima that also perturbs the training data instead of perturbing the hypotheses directly. They use reweighting of the training examples to create useful ascent directions in the hypothesis space and look for optimal solutions in the sets of perturbed problems. Their results show that such perturbations allow one to overcome local maxima in several learning scenarios. On both synthetic and real-life data this approach significantly improved models when learning structure from complete data, and when learning both parameters and structure from incomplete data.

All mentioned approaches iteratively modify the learning set of examples in order to generate a series of learning models, on which they perform model selection and choose the most accurate one. Though results in this field show that the perturbation approaches are justified in improving the general hypothesis accuracy score, their application to the estimation or improvement of the accuracy of a single prediction has not yet been studied. In our research we thus focus on this challenge.

Estimation of reliability for single examples Previous studies have referred to the reliability of single predictions with different terms. Gammerman, Vovk, and Vapnik [14] and Saunders, Gammerman and Vovk [15] introduce the notions of confidence and credibility. Confidence indicates a confidence value for predicted classifications, while credibility is an indicator of the reliability of the data upon which the prediction is made. Experiments with their modified Support Vector Machine algorithm showed that they successfully produced the confidence and credibility measures and outperformed other predictive algorithms.

Later, Nouretdinov et al. [16] demonstrate the use of the confidence value in the context of ridge regression. Using residuals of learning examples and a p-value function, they improve ordinary ridge regression with confidence regions. The drawback of these approaches is that the confidence estimations need to be specifically designed for each particular model and cannot be applied to other methods.

The notion of reliability estimation has most frequently appeared together with the notion of transduction, as for example in [14, 15]. Transduction is an inference principle that reasons from particular to particular [17], in contrast to inductive learning, which aims at inferring a general rule from a finite set of data. Transductive methods may therefore use only selected examples of interest and not necessarily the whole input space. This locality enables the use of transductive algorithms for making other inferences besides predictions. We find inferences of reliability measures of special interest.

As an application of the above principle, Kukar and Kononenko [18] proposed a transductive method for the estimation of classification reliability. Their work introduced a set of reliability measures which successfully separate correct and incorrect classifications and are independent of the learning algorithm. Bosnić and Kononenko [19] later adapted the approach to regression. Transductive predictions, introduced by this technique, were used to model the prediction error for each individual example. Initial results were promising and showed the potential for estimating and possibly correcting the prediction error.

Some other related theoretical work should also be mentioned. In the context of co-training, it has been shown that unlabeled data can be used to improve the performance of a predictor [20]. It has also been shown that for every reasonable classifier (i.e., better than random) the performance can be significantly boosted by utilizing only additional unlabeled data [21]. Both of these studies further encourage the utilization of additional learning examples, which we make use of in this research.

The work presented here extends the work of Kukar and Kononenko [18] to regression and places it in the context of sensitivity analysis.

3 MDL based motivation

The dependence between the input data and the prediction cannot be analytically expressed for most prediction models. This is especially true for models which partition the input space and establish local models for each separate partition. In contrast, the Minimum Description Length (MDL) principle [22] offers a formalism based on probability and information theory. In this section we use MDL to introduce the motivation for our research. We show that it is possible to obtain additional information if we expand the learning data set with an additional example. This finding motivates us to use this information for the estimation of the prediction reliability of individual examples.

In the following analysis we start by setting up the basic MDL framework. We introduce the notions of absolute and relative reliability, which we define through the hypothesis' capability to achieve the largest compression of the data. Based on this general definition we express the reliability of a single example. Considering two extreme scenarios, we demonstrate the behavior of the relative reliability. We conclude by showing that expanding the learning set in an appropriate way may result in obtaining extra information.

3.1 MDL preliminaries

Let us denote the learning examples by E, the background knowledge by B and a hypothesis by H. By the definition of MDL, the optimal hypothesis maximizes the conditional probability, which is equal to minimizing the information (minus logarithm of the probability) [22]:

Hopt = arg max_H P(H|E,B) = arg min_H [− log2 P(H|E,B)].   (1)

Using the Bayes theorem, the information − log2 P(H|E,B) can be written as

− log2 P(H|E,B) = − log2 P(E|H,B) − log2 P(H|B) + log2 P(E|B)   (2)

and since − log2 P(E|B) is independent of H, the optimal hypothesis Hopt can be obtained with:

Hopt = arg min_H [− log2 P(E|H,B) − log2 P(H|B)].   (3)

The MDL criterion therefore tries to minimize the complexity of the hypothesis (− log2 P(H|B)) and the error (− log2 P(E|H,B)).

The problem with the above derivation is that in order to implement it one needs the optimal coding which, however, is not computable [22]. Therefore, in practice one introduces a practical coding and the MDL criterion (3) is transformed into

Hopt ∼ arg min_H [I(E|H,B) + I(H|B)].

The term I(H|B) represents the number of bits needed to encode the hypothesis (given the background knowledge) and the term I(E|H,B) represents the number of bits needed to describe the data E, given the hypothesis (and the background knowledge). Note, however, that in this formula the practical (nonoptimal) coding is used. Therefore, the (nearly) optimal hypothesis is described with a short hypothesis that explains a great portion of the data. An example of the relationship between the hypothesis complexity and the prediction error is illustrated in Fig. 2.

Fig. 2 Illustration of the MDL principle: the relation between the model complexity (I(H|B)) and the prediction error (I(E|H,B)). The model at the top has an overly large error, while the model at the bottom has an overly large complexity

Another point of view on the same situation is provided by the notion of compressivity. The hypothesis H is said to be compressive if it allows for a shorter description of the data (in [22] this principle is formally described by the Occam's Razor Theorem):

I(E|H,B) + I(H|B) < I(E|B).   (4)

The term I(E|B) represents the number of bits needed to encode the data (given the background knowledge) without the use of the hypothesis. The smaller the left-hand side, the more compressive the hypothesis H.


3.2 Definition of reliability

For a given set of learning data E and background knowledge B, let the most reliable hypothesis be the one that leads to the largest compression of the data. Put differently, the most reliable hypothesis is the one that minimizes the left-hand side of the inequality (4).

Based on this definition we now introduce the quantitative measures of absolute and relative reliability.

Definition 1 Let H, E and B be defined as above. Hypothesis H has reliability R, defined as

R(H|E,B) = I(E|B) − I(E|H,B) − I(H|B).   (5)

Definition 2 Let H, E and B be defined as above. Hypothesis H has relative reliability Rrel, defined as

Rrel(H|E,B) = 1 − [I(E|H,B) + I(H|B)] / I(E|B).   (6)
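To make the two definitions concrete, consider a small worked instance with hypothetical code lengths (the bit counts below are chosen only for illustration and do not come from the experiments): I(E|B) = 1000, I(H|B) = 150 and I(E|H,B) = 450. Then

```latex
\[
R(H \mid E,B) = 1000 - 450 - 150 = 400 \text{ bits}, \qquad
R_{\mathrm{rel}}(H \mid E,B) = 1 - \frac{450 + 150}{1000} = 0.4 .
\]
```

In words, the hypothesis compresses the data by 400 bits, i.e. by 40% of its hypothesis-free description length; a hypothesis with Rrel ≤ 0 would not be compressive in the sense of (4).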

3.3 Reliability of a single prediction

Using the above definitions we now consider the reliability of a single example. We start by expanding the learning set E = {e1, e2, ..., en}, |E| = n, with an additional learning example en+1. The inequality (4) thus becomes

I(E ∪ {en+1} | H′, B) + I(H′|B) < I(E ∪ {en+1} | B)   (7)

where H′ stands for the new, modified hypothesis, which also covers (predicts) the example en+1. Since the learning sets of H and H′ differ in only one learning example, we may assume that both hypotheses are very similar. For sufficiently large data sets we therefore assume that

I(H|B) ≈ I(H′|B),   (8)

I(ei|H,B) ≈ I(ei|H′,B),   i = 1, ..., n   (9)

and use these simplifications in the following definition.

Definition 3 Let R1(e, H) denote the reliability of the prediction of example e by hypothesis H. We define the reliability of a single prediction as the difference between the reliabilities of hypotheses H′ and H:

R1(en+1, H′) = R(H′ | E ∪ {en+1}, B) − R(H | E, B).   (10)

The first term on the right-hand side can be expanded as in (5). Therefore, by substituting

R(H′ | E ∪ {en+1}, B) = I(E ∪ {en+1} | B) − I(E ∪ {en+1} | H′, B) − I(H′|B)   (11)

and by using the following simplifications

I(E|B) = I(e1|B) + ··· + I(en|B),   (12)

I(E ∪ {en+1} | B) − I(E|B) = I(en+1|B),   (13)

I(E|H,B) = I(e1|H,B) + ··· + I(en|H,B)   (14)

and the simplification from (8), we get the intermediate result

R1(en+1, H′) = I(en+1|B) − I(E ∪ {en+1} | H′, B) + I(E|H,B).   (15)

Plugging (9) and (14) into the latter result, we rewrite the definition of reliability in (10) as

R1(en+1, H′) ≈ I(en+1|B) − I(en+1|H′,B).   (16)

We see that this form expresses the reliability of a single prediction as a difference of information terms which depends only on the expanded hypothesis H′ and not on the original hypothesis H. Furthermore, note that I(en+1|B) is independent of H′ and plays the role of a constant term. This form of the definition is suitable for the analysis of the two extreme scenarios in the following subsection.

3.4 Analysis of reliability in extreme cases

Let us now consider two cases which can occur when expanding the learning set with an additional example, followed by constructing the new hypothesis H′. In the two extreme situations, hypothesis H′ can either contain the maximum information for the prediction of the new example, or none at all. In each of the extreme cases, the relative reliability Rrel changes its value as follows.

1. If hypothesis H′ contains the maximum information for the prediction of the new example, then the term on the right-hand side of (16) achieves its maximum value, thus I(en+1|H′,B) = 0 and

R1(en+1, H′) ≈ I(en+1|B).   (17)

Since Rrel(H|E,B) < 1 (see (6)) and

Rrel(H′ | E ∪ {en+1}, B) = R(H′ | E ∪ {en+1}, B) / I(E ∪ {en+1} | B)
                         = [R(H|E,B) + R1(en+1, H′)] / [I(E|B) + I(en+1|B)],

the comparison of the relative reliabilities of hypotheses H (before adding the new example en+1) and H′ (after adding en+1) gives:

Rrel(H′ | E ∪ {en+1}, B) > Rrel(H | E, B)   (18)

since

[R(H|E,B) + I(en+1|B)] / [I(E|B) + I(en+1|B)] > R(H|E,B) / I(E|B).

Equation (18) shows that the relative reliability of the hypothesis has increased. In other words, an optimal learning algorithm modifies hypothesis H into H′ by including the accurate knowledge for the prediction of the new example. This leads to the conclusion that the former hypothesis was unreliable (i.e. it achieved insufficient compression of the data) in that particular part of the problem space.

2. If hypothesis H′ includes no information for the prediction of the new example, then the term in (16) achieves its minimum, thus:

R1(en+1, H′) ≈ I(en+1|B) − I(en+1|H′,B) ≈ 0.   (19)

Comparing the relative reliabilities of hypotheses H and H′ in this case gives

Rrel(H′ | E ∪ {en+1}, B) < Rrel(H | E, B)   (20)

since R(H′ | E ∪ {en+1}, B) = R(H | E, B) and I(E ∪ {en+1} | B) > I(E|B), meaning that the relative reliability of the hypothesis has decreased. In contrast to the conclusion in the former extreme case, this means that the hypothesis H was more reliable (i.e. it achieved better compression of the data) in that part of the problem space. However, neither H nor H′ covers en+1 (for H it was not available for learning, and H′ considers it a noisy example).

Based on the above findings we now state a result about hypotheses H and H′ for some optimal learning algorithm.

Theorem 1 Let en+1, H, H′ and B be as defined above. If I(en+1|H,B) = 0 for some optimal learning algorithm, then the following is true:

I(en+1|H′,B) = 0.   (21)

Proof Recall the property of the optimal learning algorithm,

H = arg min_T [I(E|T,B) + I(T|B)],   (22)

H′ = arg min_T [I(E ∪ {en+1}|T,B) + I(T|B)].   (23)

Assume that I(en+1|H′,B) > 0. Since I(en+1|H,B) = 0, the optimality of H′ in (23) implies

I(E|H′,B) + I(H′|B) < I(E ∪ {en+1}|H′,B) + I(H′|B) ≤ I(E ∪ {en+1}|H,B) + I(H|B) = I(E|H,B) + I(H|B),

meaning that H is not the optimal hypothesis for E, leading to a contradiction. □

The contrapositive of the above theorem states that if I(en+1|H′,B) > 0, then either the algorithm is suboptimal or the hypothesis H does not contain the maximum knowledge for the prediction of en+1, i.e. I(en+1|H,B) > 0. Taking into account that the choice of the algorithm is fixed for both H and H′, we leave aside the discussion about the algorithm's optimality and focus rather on the use of this result in the sensitivity analysis. The latter result namely states that the most information is gained if we expand the learning set with an example that is not well covered by the initial hypothesis H. That is, by expanding the learning set with such an example, we test the two extreme cases described above: either the optimal learning algorithm will modify the hypothesis to obtain good coverage (case 1), or it will leave the hypothesis unchanged, thereby indicating that the new example is considered noisy (case 2). This finding motivates us to put this extra information to use for estimating the reliability of individual predictions.

4 Sensitivity of regression models

The aim of regression predictors is to model the learning examples by minimizing the error on the learning and test data. By adding or removing an example from the learning set, thus making a minimal change to the input of the learning algorithm, one can expect that the change in the output prediction for the modified example will also be small. Big changes in the output prediction that result from making small changes in the learning data may be a sign of instability in the generated model. The magnitude of the output change may therefore be used as a measure of model instability for a modified example. In our research we focus precisely on using these instability measures and combining them into reliability estimates.

4.1 Modification of input

There are several possible ways to modify the learning data set. The theoretical result from Sect. 3 indicates that for the analysis we should use examples that are not well covered by the hypothesis. We decided to expand the learning set with an additional learning example as follows. Let x represent the example and let y be its label. Therefore, with (x, y) we denote a learning example with a known/assigned label y, and with (x, _) we denote an unseen example with an unknown label. Let (x, _) be the unseen and unlabeled example for which we wish to estimate the reliability of the prediction. The prediction K, which we are estimating, is made by the regression model M, therefore fM(x) = K. Since K was predicted at the beginning of the sensitivity analysis using an unmodified learning set of size n, we will refer to it as the initial prediction. To expand the learning set with the example (x, _), we first label it with

y = K + δ   (24)


Fig. 3 The sensitivity analysis process. The figure illustrates obtaining the initial prediction (phase 1) and the sensitivity model with the sensitivity prediction Kε (phase 2)

where δ denotes some small change. Note that we use the initial prediction K as the central value for y, which is afterwards offset by the term δ (which may be either positive or negative). We define δ to be proportional to the known bounds of the label values. In particular, if the interval of the learning examples' labels is denoted by [a, b] and if ε denotes a value that expresses the relative portion of this interval, then δ = ε(b − a).

After selecting ε and labeling the new example, we expand the learning data set with the example (x, y). Based on this modified learning set with n + 1 examples we build a model, which we refer to as the sensitivity regression model M′. We use M′ to predict (x, _), thus obtaining fM′(x) = Kε. Let us refer to Kε as the sensitivity prediction. The described procedure is illustrated in Fig. 3. By selecting different εk ∈ {ε1, ε2, ..., εm} to obtain y, we in fact iteratively obtain a set of sensitivity predictions

Kε1, K−ε1, Kε2, K−ε2, ..., Kεm, K−εm.   (25)

The obtained sensitivity predictions serve as output values in the sensitivity analysis process. As mentioned above, we use all of the above differences Kε − K to observe the model stability and combine them into different reliability measures. Having introduced the needed terminology, we illustrate in Fig. 4 our expectations of how the differences Kε − K characterize reliable and unreliable predictions of K.
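To make the procedure concrete, the following minimal sketch implements (24)–(25) for a generic regression model. It assumes a scikit-learn-style fit/predict interface and numpy arrays for the learning data; the function names and the model factory are ours, introduced only for illustration, and are not part of the original experimental implementation.

```python
import numpy as np

def sensitivity_predictions(make_model, X, y, x_new, epsilons):
    """Return the initial prediction K and the sensitivity predictions K_eps.

    make_model : factory returning an unfitted regression model with
                 fit(X, y) / predict(X) methods (assumed interface).
    X, y       : learning examples (n x d array) and their labels (n,).
    x_new      : attribute vector of the unseen example (x, _).
    epsilons   : iterable of non-negative relative offsets (eps_1, ..., eps_m).
    """
    # Phase 1: initial prediction K = f_M(x) from the unmodified learning set.
    model = make_model().fit(X, y)
    K = float(model.predict([x_new])[0])

    a, b = float(np.min(y)), float(np.max(y))   # known bounds of the label values
    sens = {}
    for eps in epsilons:
        for signed_eps in (eps, -eps):
            delta = signed_eps * (b - a)        # delta = eps * (b - a)
            # Phase 2: expand the learning set with (x, K + delta)
            # and learn the sensitivity model M'.
            X_mod = np.vstack([X, x_new])
            y_mod = np.append(y, K + delta)
            sens_model = make_model().fit(X_mod, y_mod)
            sens[signed_eps] = float(sens_model.predict([x_new])[0])
    return K, sens
```

For example, `make_model` could be `lambda: DecisionTreeRegressor()` from scikit-learn; the returned dictionary maps each signed ε to the corresponding sensitivity prediction Kε.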

At this point let us remark that in the process of assigning a label one cannot take the (local) bias and variance into account, as one does not have their local estimates for a single new data point. In fact, the local bias and variance are precisely what we expect to capture by locally testing the model sensitivity. In the remainder, we aim to estimate the local variance (reliability measures RE1 and RE2, described later) and the local bias (measure RE3). Before introducing these measures, let us take a brief look at the output characteristics of some common regression models.

Fig. 4 Reliability defined as prediction sensitivity (three examples)

4.2 Sensitivity of the output

Based on the procedure for acquiring a sensitivity prediction, one could intuitively expect the following to hold:

ε1 < ε2 ⇒ Kε1 < Kε2   (26)

and therefore

K−εm < ··· < K−ε1 < K < Kε1 < Kε2 < ··· < Kεm

for εk ∈ {ε1, ε2, ..., εm}, ε1 < ε2 < ··· < εm, k = 1, ..., m. Although this empirically holds for the majority of cases, the relative ordering of the initial and sensitivity predictions depends on the regression model itself.

We tested the proposed technique with five regression predictors: regression trees, locally weighted regression, neural networks, support vector machines and linear regression. These learning algorithms can be divided into simple and complex, according to whether they divide the input space prior to modeling the data. We consider linear regression and locally weighted regression to be simple models, since they model the whole input data at once. In contrast, prior to modeling, the other three models (regression trees, neural networks and support vector machines) perform either a partitioning of the space or some other example selection: regression trees partition according to attribute values, a series of neurons acts like a complex discriminant function which partitions the examples, and the SVM model depends only on the examples at the margin.

Fig. 5 Changes of a regression tree after modifying the learning data set

For the sensitivity analysis approach, the complex algorithms are more interesting than the simple ones. Instead of a slight change in the output of complex models, the additional example may cause a different partitioning of the input space, thus leading to a different hypothesis. This may also result in a big difference between the initial and sensitivity predictions, which indicates that the initial hypothesis for the tested example is unstable or unreliable. We illustrate this phenomenon in Fig. 5 for regression trees on the benchmark data set pollution [24]. Figure 5(a) displays the initial regression tree built on the original learning set. The first possible scenario is that by utilizing an additional example the model changes slightly, as shown in Fig. 5(b). The difference between the initial and the second model is a changed prediction value in the third leaf of the tree (the prediction for the utilized example changes from 922.8 to 924.1). The other alternative is that the utilization of the additional example causes a change in the structure of the hypothesis, as shown in Fig. 5(c). We can observe that the splitting criteria and the prediction values in the subtree on the right-hand side change substantially in this case. Although a positive δ was used, the prediction for the utilized example decreases from 922.8 to 906.4. This example illustrates that the rule (26) does not hold in cases when the structure of the hypothesis changes. We do not consider this a drawback of our approach, but rather an indicator of the hypothesis unreliability, which we try to estimate.

Simple and complex models also differ in that for simple models the sensitivity prediction can be expressed as a functional dependency on the initial prediction and the new example. Let us now take a look at two examples, one for each of the simple models.

Example 1 The prediction for example x using locally weighted regression is:

K = [ Σ_{j=1..N} κ(D(x, x_j)) · C_j ] / [ Σ_{j=1..N} κ(D(x, x_j)) ],

where D represents a distance function, κ a kernel function, C_j the true label of example j, and x the attribute vector of our example. The sensitivity prediction Kε can then be expressed as:

Kε = [ Σ_{j=1..N} κ(D(x, x_j)) · C_j + κ(D(x, x)) · (K + δ) ] / [ Σ_{j=1..N} κ(D(x, x_j)) + κ(D(x, x)) ]
   = [ Σ_{j=1..N} κ(D(x, x_j)) · C_j + κ(0) · (K + δ) ] / [ Σ_{j=1..N} κ(D(x, x_j)) + κ(0) ].
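For illustration, the closed form above can be written down directly in a few lines of code. The Gaussian kernel and the Euclidean distance used below are our own choices for the sketch; any kernel κ and distance D would do.

```python
import numpy as np

def lwr_predict(x, X, C, width=1.0):
    """Locally weighted regression prediction K for a query point x.

    X : (N, d) array of learning examples, C : (N,) array of their labels.
    A Gaussian kernel over the Euclidean distance is assumed for illustration.
    """
    dist = np.linalg.norm(X - x, axis=1)          # D(x, x_j)
    w = np.exp(-0.5 * (dist / width) ** 2)        # kappa(D(x, x_j))
    return np.sum(w * C) / np.sum(w)

def lwr_sensitivity_prediction(x, X, C, delta, width=1.0):
    """Closed-form sensitivity prediction K_eps after adding (x, K + delta)."""
    K = lwr_predict(x, X, C, width)
    dist = np.linalg.norm(X - x, axis=1)
    w = np.exp(-0.5 * (dist / width) ** 2)
    w0 = 1.0                                      # kappa(D(x, x)) = kappa(0)
    return (np.sum(w * C) + w0 * (K + delta)) / (np.sum(w) + w0)
```

Note that no refitting is needed here: the added example only contributes one extra term to the numerator and the denominator, which is exactly why the output of such simple models changes smoothly with δ.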

Example 2 Let us assume the case of linear regression in a two-dimensional space. We are thus trying to model the dependency y = kx + n, where k is the regression line slope and n its intercept. Then

K = kx + n,

k = Σ_k (x_k − x̄)(y_k − ȳ) / Σ_k (x_k − x̄)²
  = [ n Σ_k x_k y_k − Σ_k x_k Σ_k y_k ] / [ n Σ_k x_k² − (Σ_k x_k)² ],

n = ȳ − k x̄,

where, in the sums, n represents the number of learning examples and x̄ (resp. ȳ) denotes the average value. The sensitivity prediction Kε can then be expressed as:

Kε = kε x + nε,

kε = { (n+1) [ Σ_{k=1..n} x_k y_k + x(K + δ) ] − [ Σ_{k=1..n} x_k + x ] [ Σ_{k=1..n} y_k + K + δ ] } / { (n+1) [ Σ_{k=1..n} x_k² + x² ] − ( Σ_{k=1..n} x_k + x )² },

nε = (n ȳ + K + δ) / (n + 1) − kε (n x̄ + x) / (n + 1).

Here x_k, k = 1, ..., n, represent the learning examples and x represents the new example.

For complex models it is not possible to express such a functional dependency.

The magnitude of the change in the model also depends on the size of δ (or ε) used in (24). Using different values of δ, we can depict the model sensitivity in plots, shown in Fig. 6. The plots show the empirical dependency of Kε − K (vertical axis) on ε ∈ [−5, 5] (horizontal axis) for a typical example, using the benchmark data set strikes. It is obvious that the plots for locally weighted regression (Fig. 6(d)) and linear regression (Fig. 6(e)) do not show any local anomalies which could be captured by the reliability measures.

The critical local regions, which emphasize the sensitivity, can be identified for complex models. For regression trees (Fig. 6(a)), the local regions are limited to a single leaf of the tree. For support vector machines (Fig. 6(c)), the critical local region is defined by the hyperplane margin. For neural networks (Fig. 6(b)), the critical local regions are not so obvious: when the label approaches the extreme values, the added example is increasingly treated as an outlier, and the prediction therefore becomes more dependent on the other learning examples.

Based on the above examples and the plots in Fig. 6, we can conclude that the determinism in simple models does not allow us to capture what we defined as unreliable model behavior. The tests performed using linear regression and locally weighted regression also confirmed this conclusion. In Sect. 6 we therefore focus on testing our technique with the complex predictors, i.e. regression trees, neural networks and SVMs.

5 Reliability estimates

We use the differences between the predictions of the initial and sensitivity models as an indicator of the prediction reliability. At this point we combine these differences into three reliability estimates.

The calculation of the sensitivity prediction requires the selection of a particular ε, as defined in (24). To avoid selecting a single ε, we define the measures to use an arbitrary number of sensitivity predictions (obtained by using different ε parameters). In this way we widen the window for observing model instabilities and make the measures robust to local anomalies in the problem space. The number of used ε parameters therefore represents a trade-off between gaining more stable reliability estimates and the total computational time. Since we assume that a zero difference between predictions represents maximum reliability, we also define the reliability measures so that the value 0 indicates the most reliable prediction.

Let us assume we have a set of non-negative ε values E = {ε1, ε2, ..., ε|E|}. We define the estimates as follows:

1. Estimate RE1 (local variance):

RE1 = Σ_{ε∈E} (Kε − K−ε) / |E|.   (27)

In the case of reliable predictions, we expect the change in the sensitivity model for Kε and K−ε to be minimal (0 for the most reliable predictions). We define this reliability measure using both Kε and K−ε to capture the model instabilities regardless of the sign of δ in (24). The measure takes the average of the differences across all values of ε. In Fig. 4, the RE1 estimate represents the width of the whole interval; it therefore corresponds to the local variance.


Fig. 6 Sensitivity of the regression prediction to changes in a single learning example. The plots show the empirical dependency of Kε − K (vertical axis) on ε ∈ [−5, 5] (horizontal axis) for a typical example, using the benchmark data set strikes: (a) regression tree, (b) neural network, (c) SVM, (d) locally weighted regression, (e) linear regression

2. Estimate RE2 (local absolute variance):

RE2 = Σ_{ε∈E} (|Kε − K| + |K−ε − K|) / (2|E|).   (28)

In contrast to RE1, RE2 measures the difference between the predictions of the initial and sensitivity models. The estimate takes the average of |Kε − K| and |K−ε − K| and therefore measures the average change of the prediction using a positive and a negative δ. This measure also takes the average across all ε parameters and is defined as non-negative.

3. Estimate RE3 (local bias):

RE3 = Σ_{ε∈E} [(Kε − K) + (K−ε − K)] / (2|E|).   (29)

We define RE3 in a similar way as RE2. In contrast to RE2, RE3 can be either positive or negative. Its sign carries information about the direction in which the predictor is more unstable. The measure is also averaged across all ε. In Fig. 4, RE3 estimates the skewness (the asymmetry between the left and right subintervals) and therefore corresponds to the local bias.

Note that all three reliability estimates are very similar to the symmetrized formula for computing the numerical derivative K′(ε) [23]. The derivative of a function, i.e. its slope, is an indicator of the function's sensitivity at the given point, which is in accordance with our definition of the reliability estimates.
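Given the initial prediction K and the sensitivity predictions, the three estimates follow directly from (27)–(29). The sketch below assumes the sensitivity predictions are supplied as a mapping from the signed ε to Kε (for instance, as produced by the earlier `sensitivity_predictions` sketch); this data layout is our own choice for illustration.

```python
def reliability_estimates(K, sens, epsilons):
    """Compute RE1, RE2 and RE3 from sensitivity predictions.

    K        : the initial prediction.
    sens     : dict mapping a signed epsilon to the sensitivity prediction K_eps.
    epsilons : the set E of non-negative epsilon values.
    """
    m = len(epsilons)
    re1 = sum(sens[e] - sens[-e] for e in epsilons) / m                          # (27) local variance
    re2 = sum(abs(sens[e] - K) + abs(sens[-e] - K) for e in epsilons) / (2 * m)  # (28) local absolute variance
    re3 = sum((sens[e] - K) + (sens[-e] - K) for e in epsilons) / (2 * m)        # (29) local bias
    return re1, re2, re3
```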

In the experiments we correlate the reliability measures RE1 and RE2 with the absolute value of the prediction error (residual). Since the value of RE3 contains significant information about the direction of the model sensitivity, its value is correlated with the signed value of the prediction error.

6 Experimental results

6.1 Sensitivity-based estimation of prediction error

The reliability estimates were tested on 48 standard benchmark data sets, which are used across the whole machine learning community. Each data set is a regression problem. The application domains vary from medical, ecological and technical to mathematical and physical domains. The number of examples in these domains varies from 20 to over 6500. Most of the data sets are available from the UCI Machine Learning Repository [24] and from the StatLib DataSets Archive [25]. All data sets are available from the authors upon request. A brief description of the data sets is given in Table 1.

As explained in Sect. 4.2, we experimented with five regression models. We present results only for regression trees, neural networks and SVMs, as our technique is inadequate for linear regression and locally weighted regression. Some key properties of the used models are:

• Regression trees: the mean squared error is used as the splitting criterion, the value in the leaves represents the average label of the examples, and the trees are pruned using the m-estimate [26].

• Neural networks: one hidden layer of neurons, the learning rate was 0.5, and the stopping criterion for learning is based on the change of the MSE between two backpropagation iterations.

• Support vector machines: the ε-support vector regression algorithm from the LIBSVM library is used [27], we use the third-degree RBF kernel, and the precision parameter was ε = 0.001.

The testing of the model sensitivity was performed similarly to cross-validation. The data set was divided into ten subsets and the sensitivity predictions were calculated for each of the examples in the excluded subset. This was repeated for all ten data folds, thus obtaining initial and sensitivity predictions for all examples in the testing data set. When calculating a sensitivity prediction for each example, the learning set was expanded with the additional learning example. This change in the learning set was not permanent; the changes were discarded before the calculation of the next sensitivity prediction (the original data set with the excluded subset was used). For the calculation of the reliability estimates, five different values of the ε parameter were used: E = {0.01, 0.1, 0.5, 1.0, 2.0}. For each of the examples in the data set, the reliability estimates were correlated with the prediction error using the Pearson correlation coefficient. The significance of the correlation coefficient was evaluated using the t-test for correlation coefficients. The described testing procedure is illustrated in Fig. 7.
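A minimal sketch of this testing procedure is given below. It reuses the hypothetical helpers from the earlier sketches and SciPy's `pearsonr`, whose two-sided p-value corresponds to the t-test for correlation coefficients; the sign convention for the signed error (actual minus predicted) is our assumption.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold

EPSILONS = (0.01, 0.1, 0.5, 1.0, 2.0)   # the epsilon values used in the experiments

def evaluate_estimates(make_model, X, y, epsilons=EPSILONS, n_folds=10, seed=0):
    """Correlate RE1/RE2/RE3 with the prediction error over a 10-fold split.

    X, y : numpy arrays holding the whole data set.
    Returns a dict mapping the estimate name to (correlation, p-value).
    """
    re1, re2, re3, err, abs_err = [], [], [], [], []
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        for i in test_idx:
            K, sens = sensitivity_predictions(make_model, X_tr, y_tr, X[i], epsilons)
            e1, e2, e3 = reliability_estimates(K, sens, epsilons)
            re1.append(e1); re2.append(e2); re3.append(e3)
            err.append(y[i] - K)            # signed prediction error (assumed convention)
            abs_err.append(abs(y[i] - K))   # absolute prediction error
    # RE1 and RE2 are correlated with the absolute error, RE3 with the signed error.
    return {"RE1": pearsonr(re1, abs_err),
            "RE2": pearsonr(re2, abs_err),
            "RE3": pearsonr(re3, err)}
```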

The summarized results are shown in Table 2. The table displays the percentages of experiments in which we achieved significant results. Detailed results for individual domains are shown in Table 3. The results confirm our expectation that the reliability estimates should positively correlate with the prediction error. We can see that the positive correlations highly outnumber the negative correlations for all regression models and reliability estimates.

The best summarized results were achieved using estimate RE3 (local bias), and the worst, although still good, with RE1 (local variance). The best estimate, RE3, significantly positively correlated with the prediction error in 48% of the tests (negatively in 3%). The result that stands out the most is the performance of RE3 with regression trees. In this case, RE3 significantly positively correlated in 75% of the tests and negatively in none. Analyzing from a different perspective, we can also see that the estimates perform best with regression trees and worst, although still quite well, with SVMs.

The summarized results show the potential of using the proposed reliability estimates for the estimation of the prediction error. We proceed by comparing our results to another approach.

6.2 Density-based estimation of prediction error

A traditional approach to the estimation of prediction confidence/reliability is based on the distribution of learning examples. In this section we use the term subspace to refer to a subset of learning examples which are related by locality with respect to their attribute values. If two subspaces of the same size are compared, the first one having a greater number of learning examples than the other, we refer to the first one as the denser subspace.

The density-based estimation of the prediction error assumes that the error is lower for predictions made in denser problem subspaces and higher for predictions made in sparser subspaces.


Table 1 Basic characteristics of testing data sets

Data set    # examples    # discrete attr.    # continuous attr.

abalone 4177 1 7

audio 200 69 0

auto_price 159 1 14

auto93 93 6 16

autohorse 203 8 17

autompg 398 1 6

balance 625 0 4

baskball 96 0 4

bhouse 506 1 12

bodyfat 252 0 14

brainsize 20 0 8

breasttumor 286 1 8

cholesterol 303 7 6

cleveland 303 7 6

cloud 108 2 4

cos4 1000 0 10

cpu 209 0 6

diabetes 43 0 2

echomonths 130 3 6

elusage 55 1 1

fishcatch 158 2 5

fruitfly 125 2 2

grv 123 0 3

hungarian 294 7 6

lowbwt 189 7 2

mbagrade 61 1 1

meta 528 2 19

pbc 418 8 10

pharynx 195 4 7

photo 858 2 3

places 329 0 8

plasma_carotene 315 3 10

plasma_retinol 315 3 10

pollution 60 0 15

pwlin2 200 0 10

pwlinear 200 0 10

pyrim 74 0 27

quake 2178 0 3

sensory 576 0 11

servo 167 2 2

sleep 58 0 7

stock 950 0 9

strikes 625 0 5

transplant 131 0 2

triazines 186 0 60

tumor 86 0 4

wind 6574 0 11

wpbc 198 0 32


Fig. 7 Illustration of the testing procedure

Table 2 Percentage of significant positive and negative correlations between reliability estimates and prediction error

Model              RE1+   RE1−   RE2+   RE2−   RE3+   RE3−   Avg+   Avg−
Regression tree    31%    4%     52%    2%     75%    0%     53%    2%
Neural network     42%    2%     46%    0%     44%    4%     44%    2%
SVM                35%    4%     35%    4%     25%    4%     32%    4%
Average            36%    3%     44%    2%     48%    3%     43%    3%

This means that we trust a prediction according to the quantity of information available for its calculation. A typical use of this approach is, for example, with decision and regression trees, where we trust each prediction according to the proportion of learning examples that fall in the same leaf of the tree as the predicted example.

We estimated the density using a nonparametric estimate of the probability distribution [28], an approach known as the kernel estimator or Parzen windows. Similarly to the experiments in Sect. 6.1, we correlated the density estimates with the absolute prediction errors (absolute residuals). Given the definition of the density-based error estimation, we expected the error to correlate negatively with the density estimate. For calculating the densities we used a ten-fold cross-validation procedure in a similar manner as in Sect. 6.1. For each testing example the local density was estimated and correlated with its prediction error.
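A minimal sketch of such a density estimate is shown below. The spherical Gaussian kernel and the fixed bandwidth are our own illustrative choices, not necessarily the exact settings used in the experiments.

```python
import numpy as np

def parzen_density(x_query, X_train, bandwidth=1.0):
    """Kernel (Parzen window) density estimate at x_query.

    X_train : (N, d) array of learning examples; a spherical Gaussian
              kernel with the given bandwidth is assumed.
    """
    N, d = X_train.shape
    diff = (X_train - x_query) / bandwidth
    norm = (2.0 * np.pi) ** (d / 2.0) * bandwidth ** d
    return float(np.sum(np.exp(-0.5 * np.sum(diff ** 2, axis=1))) / (N * norm))
```

The density value at each test example can then be correlated with the absolute residual in the same way as the reliability estimates above.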

The summarized results are shown in Table 4. The table displays the percentages of experiments in which we achieved significant results. Detailed results for individual domains are shown in Table 5. The results confirm that the negative correlations outnumber the positive correlations. Nevertheless, the results for the density estimates are worse than those for the estimates RE1, RE2 and RE3. Note that for our measures the desired correlations are positive, while for the density estimates the desired correlations are negative. The number of desired correlations obtained by our approach is greater (43% versus 33%) and the number of undesired correlations is much smaller (3% versus 8%).

The better correlations of the proposed estimates with the prediction error show that they provide more information than the probabilistic distribution of examples. They measure the prediction sensitivity, which also depends on the learning and predictive algorithms themselves. We can conclude that the estimates RE1, RE2 and RE3 are more suitable for the estimation of the prediction error than the density estimates. However, it remains an open question how to adapt the three proposed measures in order to achieve the probabilistic interpretability offered by the density function. We leave this for further work.

7 Conclusion

The paper presents a new method for the reliability estimation of individual predictions. Our method is based on sensitivity analysis, an approach that observes the output response with respect to small changes in the input data set.

Previous work in this field inspired us to modify the learning set by expanding it with an additional example. This is similar to the ideas of Bousquet and Elisseeff [1] and Kearns and Ron [5]. Using the difference in the predictions of the initial and sensitivity models, we compose the reliability estimates RE1 (local variance), RE2 (local absolute variance) and RE3 (local bias). These estimates aim to measure the instabilities of regression models that arise from the learning algorithm itself.


Table 3 Correlation coefficients between reliability estimates and prediction error. Cell shading represents the p-values. The data with significance level α ≤ 0.05 is marked by a light grey (significant positive correlation) or dark grey (significant negative correlation) background


Table 4 Percentage of positive and negative correlations between density estimates and prediction error

Model              Density+   Density−
Regression tree    10%        35%
Neural network     8%         29%
SVM                4%         35%
Average            8%         33%

We explain why this methodology is not appropriate for simple algorithms (linear regression, locally weighted regression) and focus on testing it with three complex regression models: regression trees, neural networks and support vector machines. We expect the proposed approach to perform well with other complex regression models as well, which has to be empirically verified in the future.

Experiments show that the proposed estimates correlate better with the prediction error than common density estimates. The most promising results were achieved using RE3 (local bias), which seems to be a good candidate for a general reliability measure. This estimate has an additional advantage, as it is correlated with the signed (non-absolute) value of the prediction error. It therefore holds potential for the correction of prediction errors.

Compared to a traditional approach, which estimates the reliability based only on the distribution of examples in the problem space, the proposed approach implicitly considers the local particularities of the learning problem, including the learning algorithm's generalization ability, bias, resistance to noise, the amount of noise in the data, avoidance of overfitting, etc. These aspects, most of which cannot be measured quantitatively, are analyzed implicitly by applying the sensitivity analysis approach and thus treating the learning problem as a black box. We can conclude that this is the reason why the proposed sensitivity estimates compare favorably to the density estimates and also achieve better experimental results.


Related work [18] proposed estimates for classification reliability which are based on the change of the posterior class distribution. In contrast to this approach, our estimates are based solely on the outputs given by the prediction system and therefore do not require any estimation of distribution functions. The use of this approach is possible due to the continuous nature of the predicted values in regression. Namely, this enables us to numerically express the difference between two regression predictions, in contrast to classification, where one can only observe whether the predicted class was the same or different.

Besides additional comparisons with other techniques, ideas for further work in this field include improving the interpretability of the proposed estimates. Namely, the values of the proposed estimates are not bounded to any particular interval, and since they are based on the prediction of the dependent variable, their numerical values depend on the domain of that variable. This consequently means that the reliability values for predictions on two different data sets cannot be compared to each other. It would therefore be appropriate to transform the estimates to a common interval with (hopefully) a probabilistic interpretation, e.g. to map them into the interval [0, 1], with the value 0 representing an unreliable prediction and 1 representing the most reliable one. The notion of prediction reliability should also be expanded to confidence intervals, and it should be tested whether the reliability estimates can be used to correct the initial predictions and thus improve their accuracy.

The proposed method was preliminarily tested in a real domain. The data consisted of 1035 breast cancer patients who had surgical treatment for cancer between 1983 and 1987 in the Clinical Center in Ljubljana, Slovenia. The patients were described using standard prognostic factors for breast cancer recurrence. The goal of the research was to predict the time of possible cancer recurrence after the surgical treatment. The research showed that this is a difficult prediction problem, because the possibility of recurrence is continuously present for almost 20 years after the treatment. The bare recurrence predictions were therefore complemented with our reliability estimates, helping the doctors with the additional validation of the predictions' accuracy. The promising preliminary results confirm the usability of our approach.

Acknowledgements We thank Matjaž Kukar and Marko Robnik-Šikonja for their contribution to this study.


Table 5 Correlation coefficients between density estimates and prediction error. Cell shading represents the p-values. The data with significance level α ≤ 0.05 is marked by a light grey (significant positive correlation) or dark grey (significant negative correlation) background

References

1. Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526

2. Crowder MJ, Kimber AC, Smith RL, Sweeting TJ (1991) Statistical concepts in reliability. In: Statistical analysis of reliability data. Chapman & Hall, London, pp 1–11

3. Bousquet O, Elisseeff A (2000) Algorithmic stability and generalization performance. In: NIPS, pp 196–202

4. Bousquet O, Pontil M (2003) Leave-one-out error and stability of learning algorithms with applications. In: Suykens JAK et al (eds) Advances in learning theory: methods, models and applications. IOS Press, Amsterdam

5. Kearns MJ, Ron D (1997) Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In: Computational learning theory, pp 152–162

6. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140


7. Schapire RE (1999) A brief introduction to boosting. In: Proceedings of IJCAI, pp 1401–1406

8. Drucker H (1997) Improving regressors using boosting techniques. In: Machine learning: proceedings of the fourteenth international conference, pp 107–115

9. Ridgeway G, Madigan D, Richardson T (1999) Boosting methodology for regression problems. In: Proceedings of artificial intelligence and statistics, pp 152–161

10. Breiman L (1997) Pasting bites together for prediction in large data sets and on-line. Department of Statistics technical report, University of California, Berkeley

11. Tibshirani R, Knight K (1999) The covariance inflation criterion for adaptive model selection. J Roy Stat Soc Ser B 61:529–546

12. Rosipal R, Girolami M, Trejo L (2000) On kernel principal component regression with covariance inflation criterion for model selection. Technical report, University of Paisley

13. Elidan G, Ninio M, Friedman N, Schuurmans D (2002) Data perturbation for escaping local maxima in learning. In: Proceedings of AAAI/IAAI, pp 132–139

14. Gammerman A, Vovk V, Vapnik V (1998) Learning by transduction. In: Proceedings of the 14th conference on uncertainty in artificial intelligence, Madison, WI, pp 148–155

15. Saunders C, Gammerman A, Vovk V (1999) Transduction with confidence and credibility. In: Proceedings of IJCAI, vol 2, pp 722–726

16. Nouretdinov I, Melluish T, Vovk V (2001) Ridge regression confidence machine. In: Proceedings of the 18th international conference on machine learning. Kaufmann, San Francisco, pp 385–392

17. Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin

18. Kukar M, Kononenko I (2002) Reliable classifications with machine learning. In: Proceedings of machine learning: ECML-2002. Springer, Helsinki, pp 219–231

19. Bosnić Z, Kononenko I, Robnik-Šikonja M, Kukar M (2003) Evaluation of prediction reliability in regression using the transduction principle. In: Proceedings of Eurocon 2003, Ljubljana, pp 99–103

20. Mitchell T (1999) The role of unlabelled data in supervised learning. In: Proceedings of the 6th international colloquium of cognitive science, San Sebastian, Spain

21. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th annual conference on computational learning theory, pp 92–100

22. Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, New York

23. Press WH et al (2002) Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge

24. Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases. Department of Information and Computer Sciences, University of California, Irvine

25. Department of Statistics at Carnegie Mellon University (2005) StatLib—data, software and news from the statistics community

26. Cestnik B, Bratko I (1991) On estimating probabilities in tree pruning. In: Proceedings of the European working session on learning (EWSL-91), Porto, Portugal, pp 138–150

27. Chang C, Lin C (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm

28. Alpaydin E (2004) Introduction to machine learning. MIT Press, Cambridge