Spotting Differences Among Observations

Marko Rak¹, Tim König¹ and Klaus-Dietz Tönnies¹

¹Department of Simulation and Graphics, Otto-von-Guericke University, Magdeburg, Germany
{rak, tim.koenig, toennies}@isg.cs.ovgu.de

Keywords: Density Difference, Kernel Density Estimation, Scale Space, Blob Detection, Affine Shape Adaption

Abstract: Identifying differences among the sample distributions of different observations is an important issue in many fields ranging from medicine over biology and chemistry to physics. We address this issue, providing a general framework to detect difference spots of interest in feature space. Such spots occur not only at various locations; they may also come in various shapes and multiple sizes, even at the same location. We deal with these challenges in a scale-space detection framework based on the density function difference of the observations. Our framework is intended for semi-automatic processing, providing human-interpretable interest spots for further investigation of some kind, e.g., for generating hypotheses about the observations. Such interest spots carry valuable information, which we outline in a number of classification scenarios from the UCI Machine Learning Repository; namely, classification of benign/malign breast cancer, genuine/forged money and normal/spondylolisthetic/disc-herniated vertebral columns. To this end, we establish a simple decision rule on top of our framework, which is based on the detected spots. Results indicate state-of-the-art classification performance, which underpins the importance of the information that is carried by these interest spots.

1 INTRODUCTION

Sooner or later, a large portion of pattern recognition tasks comes down to the question: What makes X different from Y? Some scenarios of that kind are:

• Detection of forged money based on image-derived features: What makes some sort of forgery different from genuine money?
• Comparison of medical data of healthy and non-healthy subjects for disease detection: What makes data of the healthy different from that of the non-healthy?
• Comparison of document data sets for text retrieval purposes: What makes this set of documents different from another set?

Apart from this, spotting differences in two or more observations is of interest in fields of computational biology, chemistry or physics. Looking at it from a general perspective, such questions generalize to

What makes the samples of group X different from the samples of group Y?

This question usually arises when we deal with grouped samples in some feature space. For humans, answering such questions tends to become more challenging with an increasing number of groups, samples and feature space dimensions, up to the point where we miss the forest for the trees. This complexity is not an issue for automatic approaches, which, on the other hand, tend to either overfit or underfit patterns in the data. Therefore, semi-automatic approaches are needed to generate a number of interest spots which are to be looked at in more detail.

We address this issue by a scale-space difference detection framework. Our approach relies on the density difference of group samples in feature space. This enables us to identify spots where one group dominates the other. We draw on kernel density estimators to represent arbitrary density functions. Embedding this into a scale-space representation, we are able to detect spots of different sizes and shapes in feature space in an efficient manner. Our framework:
• applies to d-dimensional feature spaces
• is able to reflect arbitrary density functions
• selects optimal spot locations, sizes and shapes
• is robust to outliers and measurement errors
• produces human-interpretable results

Our presentation is structured as follows. We outline the key idea of our framework in Section 2. The specific parts of our framework are detailed in Section 3, while Section 4 comprises our results on several data sets. In Section 5, we close with a summary of our work, our most important results and an outline of future work.

2 FOUNDATIONS

Searching for differences between the sample distributions of two groups of observations g and h, we, quite naturally, seek spots where the density function f^g(x) of group g dominates the density function f^h(x) of group h, or vice versa. Hence, we try to find positive-/negative-valued spots of the density difference

f^{g−h}(x) = f^g(x) − f^h(x)    (1)

w.r.t. the underlying feature space R^d with x ∈ R^d. Such spots may come in various shapes and sizes. A difference detection framework should be able to deal with these degrees of freedom. Additionally, it must be robust to various sources of error, e.g., from measurement, quantization and outliers.

We propose to superimpose a scale-space representation on the density difference f^{g−h}(x) to achieve the above-mentioned properties. Scale-space frameworks have been shown to robustly handle a wide range of detection tasks for various types of structures, e.g., text strings (Yi and Tian, 2011), persons and animals (Felzenszwalb et al., 2010) in natural scenes, neuron membranes in electron microscopy imaging (Seyedhosseini et al., 2011) or microaneurysms in digital fundus images (Adal et al., 2014). In each of these tasks the function of interest is represented through a grid of values, allowing for an explicit evaluation of the scale space. However, an explicit grid-based approach becomes intractable for higher-dimensional feature spaces.

In what follows, we show how a scale-space representation of f^{g−h}(x) can be obtained from kernel density estimates of f^g(x) and f^h(x) in an implicit fashion, expressing the problem by scale-space kernel density estimators. Note that by the usage of kernel density estimates our work is limited to feature spaces with dense filling. We close with a brief discussion of how this can be used to compare observations among more than two groups.

2.1 Scale Space Representation

First, we establish a family l^{g−h}(x; t) of smoothed versions of the density difference f^{g−h}(x). Scale parameter t ≥ 0 defines the amount of smoothing that is applied to f^{g−h}(x) via convolution with a kernel k_t(x) of bandwidth t, as stated in

l^{g−h}(x; t) = k_t(x) ∗ f^{g−h}(x).    (2)

For a given scale t, spots having a size of about 2√t will be highlighted, while smaller ones will be smoothed out. This leads to an efficient spot detection scheme, which will be discussed in Section 3. Let

l^g(x; t) = k_t(x) ∗ f^g(x)    (3)
l^h(x; t) = k_t(x) ∗ f^h(x)    (4)

be the scale-space representations of the group densities f^g(x) and f^h(x). Looking at Equation 2 more closely, we can rewrite l^{g−h}(x; t) equivalently in terms of l^g(x; t) and l^h(x; t) via Equations 3 and 4. This reads

l^{g−h}(x; t) = k_t(x) ∗ f^{g−h}(x)    (5)
             = k_t(x) ∗ [f^g(x) − f^h(x)]    (6)
             = k_t(x) ∗ f^g(x) − k_t(x) ∗ f^h(x)    (7)
             = l^g(x; t) − l^h(x; t).    (8)

The simple yet powerful relation between the left- and the right-hand side of Equation 8 will allow us to evaluate the scale-space representation l^{g−h}(x; t) implicitly, i.e., using only kernel functions. Of major importance is the choice of the smoothing kernel k_t(x). According to scale-space axioms, k_t(x) should satisfy a number of properties, resulting in the uniform Gaussian kernel of Equation 9 as the unique choice, cf. (Babaud et al., 1986; Yuille and Poggio, 1986).

φ_t(x) = \frac{1}{\sqrt{(2πt)^d}} \exp\left(−\frac{1}{2t} x^T x\right)    (9)

2.2 Kernel Density Estimation

In kernel density estimation, the group density f^g(x) is estimated from its n_g samples by means of a kernel function K_{B_g}(x). Let x_i^g ∈ R^{d×1} with i = 1, ..., n_g be the group samples. Then, the group density estimate is given by

f^g(x) = \frac{1}{n_g} \sum_{i=1}^{n_g} K_{B_g}(x − x_i^g).    (10)

Parameter B_g ∈ R^{d×d} is a symmetric positive definite matrix, which controls the influence of the samples on the density estimate. Informally speaking, K_{B_g}(x) applies a smoothing with bandwidth B_g to the "spiky sample relief" in feature space.

Plugging the kernel density estimator f^g(x) into the scale-space representation l^g(x; t) defines the scale-space kernel density estimator l^g(x; t) to be

l^g(x; t) = k_t(x) ∗ f^g(x).    (11)

Inserting Equation 10 into the above, we can trace the definition of the scale-space density estimator l^g(x; t) down to the sample level via the transformation

l^g(x; t) = k_t(x) ∗ f^g(x)    (12)
          = k_t(x) ∗ \left[ \frac{1}{n_g} \sum_{i=1}^{n_g} K_{B_g}(x − x_i^g) \right]    (13)
          = \frac{1}{n_g} \sum_{i=1}^{n_g} (k_t ∗ K_{B_g})(x − x_i^g).    (14)

Though arbitrary kernels can be used, we choose K_B(x) to be a Gaussian kernel Φ_B(x) due to its convenient algebraic properties. This (potentially non-uniform) kernel is defined as

Φ_B(x) = \frac{1}{\sqrt{\det(2πB)}} \exp\left(−\frac{1}{2} x^T B^{−1} x\right).    (15)

Using the above, the right-hand side of Equation 14 simplifies further because of the Gaussian's cascade convolution property. Eventually, the scale-space kernel density estimator l^g(x; t) is given by Equation 16, where I ∈ R^{d×d} is the identity.

l^g(x; t) = \frac{1}{n_g} \sum_{i=1}^{n_g} Φ_{tI + B_g}(x − x_i^g)    (16)

Using this estimator, the scale-space representation l^g(x; t) of group density f^g(x) and analogously that of group h can be estimated for any (x; t) in an implicit fashion. Consequently, this allows us to estimate the scale-space representation l^{g−h}(x; t) of the density difference f^{g−h}(x) via Equation 7 by means of kernel functions only.
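To make the implicit evaluation concrete, the following is a minimal NumPy sketch of Equations 15, 16 and 8. The function names (gauss_kernel, l_group, l_diff) and the toy data are our own illustrative assumptions, not part of the original framework.

```python
import numpy as np

def gauss_kernel(diffs, B):
    """Gaussian kernel Phi_B (Eq. 15), evaluated at each row of `diffs`."""
    norm = 1.0 / np.sqrt(np.linalg.det(2.0 * np.pi * B))
    quad = np.einsum('ij,jk,ik->i', diffs, np.linalg.inv(B), diffs)
    return norm * np.exp(-0.5 * quad)

def l_group(x, samples, t, B):
    """Scale-space kernel density estimator l^g(x; t) of Eq. (16)."""
    d = samples.shape[1]
    return gauss_kernel(x[None, :] - samples, t * np.eye(d) + B).mean()

def l_diff(x, samples_g, samples_h, t, B_g, B_h):
    """Scale-space density difference l^{g-h}(x; t) = l^g(x; t) - l^h(x; t), Eq. (8)."""
    return l_group(x, samples_g, t, B_g) - l_group(x, samples_h, t, B_h)

# Toy usage: two slightly shifted two-dimensional groups.
rng = np.random.default_rng(0)
g = rng.normal(0.0, 1.0, size=(100, 2))
h = rng.normal(0.5, 1.0, size=(120, 2))
print(l_diff(np.array([0.0, 0.0]), g, h, t=0.5,
             B_g=0.1 * np.eye(2), B_h=0.1 * np.eye(2)))
```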

2.3 Bandwidth Selection

When regarding bandwidth selection in such a scale-space representation, we see that the impact of different choices for the bandwidth matrix B vanishes as scale t increases. This can be seen when comparing the matrices tI + 0 and tI + B, where 0 represents the zero matrix, i.e., no bandwidth selection at all. We observe that relative differences between them become negligible once ‖tI‖ ≫ ‖B‖. This is especially true for large sample sizes, because the bandwidth will then tend towards zero for any reasonable bandwidth selector anyway. Hence, we may actually consider setting B to 0 for certain problems, as we typically search for differences that fall above some lower bound for t.

The literature bears extensive work on bandwidth matrix selection, for example, based on plug-in estimators (Duong and Hazelton, 2003; Wand and Jones, 1994) or biased, unbiased and smoothed cross-validation estimators (Duong and Hazelton, 2005; Sain et al., 1992). All of these integrate well with our scale-space density difference framework. However, in view of the argument above, we propose to compromise between a full bandwidth optimization and having no bandwidth at all. We define B_g = b_g I and use an unbiased least-squares cross-validation to set up the bandwidth estimate for group g. For Gaussian kernels, this leads to the optimization of Equation 17, cf. (Duong and Hazelton, 2005), which we solve by golden section search over b_g.

\underset{B_g}{\arg\min}\; \frac{1}{n_g \sqrt{\det(4πB_g)}} + \frac{1}{n_g(n_g−1)} \sum_{i=1}^{n_g} \sum_{\substack{j=1 \\ j≠i}}^{n_g} (Φ_{2B_g} − 2Φ_{B_g})(x_i^g − x_j^g)    (17)
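As an illustration, a hedged sketch of this bandwidth selection follows. It reuses gauss_kernel from the sketch after Equation 16; the function names (lscv_score, golden_section_search, select_bandwidth), the search interval and the tolerance are our own assumptions.

```python
import numpy as np

def lscv_score(b, samples):
    """Least-squares cross-validation objective of Eq. (17) for B_g = b*I."""
    n, d = samples.shape
    B = b * np.eye(d)
    diffs = samples[:, None, :] - samples[None, :, :]     # pairwise x_i - x_j
    diffs = diffs[~np.eye(n, dtype=bool)].reshape(-1, d)  # drop the i == j terms
    term1 = 1.0 / (n * np.sqrt(np.linalg.det(4.0 * np.pi * B)))
    term2 = (gauss_kernel(diffs, 2.0 * B) - 2.0 * gauss_kernel(diffs, B)).sum()
    return term1 + term2 / (n * (n - 1))

def golden_section_search(f, lo, hi, tol=1e-6):
    """Plain golden section search for a scalar minimum on [lo, hi]."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

def select_bandwidth(samples, b_min=1e-4, b_max=10.0):
    """Scalar bandwidth b_g for group g, minimizing the LSCV objective of Eq. (17)."""
    return golden_section_search(lambda b: lscv_score(b, samples), b_min, b_max)
```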

2.4 Multiple Groups

If differences among more than two groups shall be detected, we can reduce the comparison to a number of two-group problems. We can consider two typical use cases, namely one group vs. another and one group vs. rest. Which of the two is more suitable depends on the specific task at hand. Let us illustrate this using two medical scenarios. Assume we have a number of groups which represent patients having different diseases that are hard to discriminate in differential diagnosis. Then we may consider the second use case, to generate clues on markers that make one disease different from the others. In contrast, if these groups represent stages of a disease, potentially including a healthy control group, then we may consider the first use case, comparing only subsequent stages to give clues on markers of the disease's progress.
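As a small illustration of the one group vs. rest reduction, the following sketch pools all remaining groups into a single "rest" sample set; the function name and the dictionary-based data layout are our own assumptions.

```python
import numpy as np

def one_vs_rest(samples_by_group):
    """Reduce a multi-group comparison to two-group problems: for each group g,
    pair its samples with the pooled samples of all other groups."""
    problems = {}
    for g, X in samples_by_group.items():
        rest = np.vstack([Y for h, Y in samples_by_group.items() if h != g])
        problems[g] = (X, rest)
    return problems
```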

3 METHOD

To identify the positive-/negative-valued spots of a density difference, we apply the concept of blob detection, which is well-known in computer vision, to the scale-space representation derived in Section 2. In scale-space blob detection, some blobness criterion is applied to the scale-space representation, seeking local optima of the function of interest w.r.t. space and scale. This directly leads to an efficient detection scheme that identifies a spot's location and size. The latter corresponds to the detection scale.

In a grid-representable problem we can evaluate blobness densely over the scale-space grid and identify interesting spots directly using the grid neighborhood. This is intractable here, which is why we rely on a more refined three-stage approach. First, we trace the local spatial optima of the density difference through the scales of the scale-space representation. Second, we identify the interesting spots by evaluating their blobness along the dendrogram of optima that was obtained during the first stage. Having selected spots and therefore knowing their locations and sizes, we finally calculate an elliptical shape estimate for each spot in a third stage.

[Figure 1: Detection results for a two-group (red/blue) problem in a two-dimensional feature space (xy-plane) with augmented scale dimension s; (a) "isometric" view, (b) top view. Red squares and blue circles visualize the samples of each group; red/blue paths outline the dendrogram of scale-space density difference optima for the red/blue group dominating the other group; interesting spots of each dendrogram are printed thick; red/blue ellipses characterize the shape of each interest spot.]

Spots obtained in this fashion characterize elliptical regions in feature space, as outlined in Figure 1. The representation of such regions, i.e., location, size and shape, as well as their strength, i.e., their scale-space density difference value, are easily interpretable by humans, which allows looking at them in more detail using some other method. This also marks a limitation of our work, because non-elliptical regions can only be approximated by elliptical ones. We now give a detailed description of the three stages.

3.1 Scale Tracing

Assume we are given an equidistant scale sampling containing non-negative scales t_1, ..., t_n in increasing order, and that we search for spots where group g dominates h. More precisely, we search for the non-negatively valued maxima of l^{g−h}(x; t). The opposite case, i.e., group h dominating g, is equivalent.

Let us further assume that we know the spatial local maxima of the density difference l^{g−h}(x; t_{i−1}) for a certain scale t_{i−1} and we want to estimate those of the current scale t_i. This can be done by taking the previous local maxima as initial points and optimizing each w.r.t. l^{g−h}(x; t_i). In the first scale, we take the samples of group g themselves. As some maxima may have converged to the same location, we merge them together, feeding only unique locations as initial points into the next scale t_{i+1}. We also drop any negatively-valued locations, as these are not of interest to our task. They will not become of interest at any higher scale either, because local extrema will not enhance as scale increases, cf. (Lindeberg, 1998). Since derivatives are simple to evaluate for Gaussian kernels, we can use Newton's method for the spatial optimization. We can assemble the gradient ∂/∂x l^{g−h}(x; t) and the Hessian ∂²/(∂x ∂x^T) l^{g−h}(x; t) sample-wise using

\frac{∂}{∂x} Φ_B(x) = −Φ_B(x) B^{−1} x  and    (18)

\frac{∂^2}{∂x ∂x^T} Φ_B(x) = Φ_B(x) \left( B^{−1} x x^T B^{−1} − B^{−1} \right).    (19)

Iterating this process through all scales, we form a discrete dendrogram of the maxima over scales. A dendrogram branching means that a maximum formed from two (or more) maxima of the preceding scale.
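A hedged sketch of this scale tracing is given below, assembling gradient and Hessian sample-wise via Equations 18 and 19. It reuses gauss_kernel from the sketch after Equation 16; the function names and the fixed number of Newton steps are our own assumptions.

```python
import numpy as np

def grad_hess(x, samples_g, samples_h, t, B_g, B_h):
    """Gradient and Hessian of l^{g-h}(x; t), assembled sample-wise (Eqs. 18, 19)."""
    d = x.shape[0]
    grad, hess = np.zeros(d), np.zeros((d, d))
    for samples, B, sign in ((samples_g, B_g, 1.0), (samples_h, B_h, -1.0)):
        cov = t * np.eye(d) + B
        inv = np.linalg.inv(cov)
        for s in samples:
            diff = x - s
            val = gauss_kernel(diff[None, :], cov)[0]   # gauss_kernel from the earlier sketch
            grad += sign / len(samples) * (-val) * inv @ diff
            hess += sign / len(samples) * val * (np.outer(inv @ diff, inv @ diff) - inv)
    return grad, hess

def trace_maximum(x_prev, samples_g, samples_h, t, B_g, B_h, steps=20):
    """Re-optimize a previous-scale maximum at the current scale t via Newton's method."""
    x = x_prev.copy()
    for _ in range(steps):
        grad, hess = grad_hess(x, samples_g, samples_h, t, B_g, B_h)
        x = x - np.linalg.solve(hess, grad)             # Newton update towards the maximum
    return x
```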

3.2 Spot Detection

The maxima of interest are derived from a scale-normalized blobness criterion c_γ(x; t). Two main criteria, namely the determinant of the Hessian (Bretzner and Lindeberg, 1998) and the trace of the Hessian (Lindeberg, 1998), have been discussed in the literature. We focus on the former, which is given in Equation 20¹, as it has been shown to provide better scale selection properties under affine transformations of the feature space, cf. (Lindeberg, 1998).

c_γ(x; t) = t^{γd} (−1)^d \det\left( \frac{∂^2}{∂x ∂x^T} l^{g−h}(x; t) \right)    (20)
          = t^{γd} c(x; t)    (21)

¹ The factor (−1)^d handles even and odd dimensions consistently.

Because the maxima are already spatially optimal, we can search for spots that maximize c_γ(x; t) w.r.t. the dendrogram neighborhood only. Parameter γ ≥ 0 can be used to introduce a size bias, shifting the detected spot towards smaller or larger scales. The definition of γ highly depends on the type of spot that we are looking for, cf. (Lindeberg, 1996). This is impractical when we seek spots of, for example, small and large skewness or extreme kurtosis at the same time.

Addressing the parameter issue, we search for all spots that maximize c_γ(x; t) locally w.r.t. some γ ∈ [0, ∞). Some dendrogram spot s with scale-space coordinates (x_s; t_s) is locally maximal if there exists a γ-interval such that its blobness c_γ(x_s; t_s) is larger than that of every spot in its dendrogram neighborhood N(s). This leads to a number of inequalities, which can be written as

t_s^{γd} c(x_s; t_s) > t_n^{γd} c(x_n; t_n)  ∀ n ∈ N(s),  or    (22)

γd \log\frac{t_s}{t_n} > \log\frac{c(x_n; t_n)}{c(x_s; t_s)}  ∀ n ∈ N(s).    (23)

The latter can be solved easily for the γ-interval, if any. We can now identify our interest spots by looking for the maxima along the dendrogram that locally maximize the width of the γ-interval. More precisely, let w_γ(x_s; t_s) be the width of the γ-interval for dendrogram spot s; then s is of interest if the dendrogram Laplacian of w_γ(x; t) is negative at (x_s; t_s), or equivalently, if

w_γ(x_s; t_s) > \frac{1}{|N(s)|} \sum_{n ∈ N(s)} w_γ(x_n; t_n).    (24)

Intuitively, a spot is of interest if its γ-interval width is above the neighborhood average. This is the only assumption we can make without imposing limitations on the results. Interest spots identified in this way will be dendrogram segments, each ranging over a number of consecutive scales.
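The γ-interval of a dendrogram spot can be computed directly from the inequalities in Equation 23. The following sketch assumes strictly positive blobness values c at the compared spots; the function name and the data layout are our own.

```python
import numpy as np

def gamma_interval(t_s, c_s, neighbors, d):
    """Interval of gamma >= 0 for which spot s out-scores all dendrogram neighbors
    (Eq. 23); `neighbors` is a list of (t_n, c_n) pairs. Returns (lo, hi) or None."""
    lo, hi = 0.0, np.inf
    for t_n, c_n in neighbors:
        slope = d * np.log(t_s / t_n)      # coefficient of gamma on the left-hand side
        rhs = np.log(c_n / c_s)            # right-hand side of Eq. (23)
        if np.isclose(slope, 0.0):
            if rhs >= 0.0:                 # inequality cannot hold for any gamma
                return None
        elif slope > 0.0:
            lo = max(lo, rhs / slope)      # gamma must exceed rhs / slope
        else:
            hi = min(hi, rhs / slope)      # gamma must stay below rhs / slope
    return (lo, hi) if lo < hi else None

# A spot is then of interest if its interval width (hi - lo) is above the
# average width of its dendrogram neighbors, as in Eq. (24).
```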

3.3 Shape Adaption

Shape estimation can be done in an iterative manner for each interest spot. The iteration alternately updates the current shape estimate based on a measure of anisotropy around the spot and then corrects the bandwidth of the scale-space smoothing kernel according to this estimate, eventually reaching a fixed point. The second moment matrix of the function of interest is typically used as an anisotropy measure, e.g., in (Lindeberg and Garding, 1994) and (Mikolajczyk and Schmid, 2004). Since it requires spatial integration of the scale-space representation around the interest spot, this measure is not feasible here.

We adapted the Hessian-based approach of (Lakemond et al., 2012) to d-dimensional problems. The aim is to make the scale-space representation isotropic around the interest spot, iteratively moving any anisotropy into the symmetric positive definite shape matrix S ∈ R^{d×d} of the smoothing kernel's bandwidth tS. Thus, we lift the problem into a generalized scale-space representation l^{g−h}(x; tS) of non-uniform scale-space kernels, which requires us to replace the definition of φ_t(x) by that of Φ_B(x).

Starting with the isotropic S_1 = I, we decompose the current Hessian via

\frac{∂^2}{∂x ∂x^T} l^{g−h}(\,·\,; tS_i) = V D^2 V^T    (25)

into its eigenvectors in the columns of V and eigenvalues on the diagonal of D^2. We then normalize the latter to unit determinant via

\bar{D} = \frac{D}{\sqrt[d]{\det(D)}}    (26)

to get a relative measure of anisotropy for each of the eigenvector directions. Finally, we move the anisotropy into the shape estimate via

S_{i+1} = \left( V^T \bar{D}^{−\frac{1}{2}} V \right) S_i \left( V \bar{D}^{−\frac{1}{2}} V^T \right)    (27)

and start all over again. The iteration terminates when isotropy is reached, more precisely, when the ratio of the minimal and maximal eigenvalue of the Hessian approaches one, which usually happens within a few iterations.
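A hedged sketch of this fixed-point iteration follows. The callback hessian_at(S), assumed to return the Hessian of l^{g−h} at the spot under the non-uniform bandwidth tS, as well as the tolerance and iteration cap, are our own assumptions.

```python
import numpy as np

def adapt_shape(hessian_at, d, max_iter=10, tol=0.05):
    """Iterative Hessian-based shape adaptation (Eqs. 25-27).
    `hessian_at(S)` is a user-supplied callback returning the Hessian of
    l^{g-h} at the interest spot for the smoothing bandwidth t*S."""
    S = np.eye(d)                                   # S_1 = I, start isotropic
    for _ in range(max_iter):
        H = hessian_at(S)
        eigval, V = np.linalg.eigh(H)               # H = V diag(eigval) V^T, Eq. (25)
        mags = np.abs(eigval)                       # magnitudes: H is negative definite at a maximum
        if mags.min() / mags.max() > 1.0 - tol:     # isotropy reached
            break
        D = np.sqrt(mags)
        D_bar = D / np.prod(D) ** (1.0 / d)         # normalize to unit determinant, Eq. (26)
        M = np.diag(D_bar ** -0.5)
        S = (V.T @ M @ V) @ S @ (V @ M @ V.T)       # move anisotropy into S, Eq. (27)
    return S
```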

4 EXPERIMENTS

We next demonstrate that interest spots carry valuable information about a data set. Due to the lack of data sets that match our particular detection task, a ground truth comparison is impossible. Certainly, artificially constructed problems are an exception. However, the generalizability of results is at least questionable for such problems. Therefore, we chose to benchmark our approach indirectly via a number of classification tasks. The rationale is that results which are comparable to those of well-established classifiers should underpin the importance of the identified interest spots.

We next show how to use these interest spots for classification using a simple decision rule and detail the data sets that were used. We then investigate parameters of our approach and discuss the results of the classification tasks in comparison to decision trees, Fisher's linear discriminant analysis, k-nearest neighbors with optimized k and support vector machines with linear and cubic kernels. All experiments were performed via leave-one-out cross-validation.

[Figure 2: Feature space decision boundaries (black plane curves) obtained from the group likelihood criterion for the two-dimensional two-group problem of Figure 1; (a) "isometric" view, (b) top view. Red squares and blue circles visualize the samples of each group; red/blue paths outline the dendrogram of scale-space density difference optima for the red/blue group dominating the other group; interesting spots of each dendrogram are printed thick; red/blue ellipses characterize the shape of each interest spot.]

[Figure 3: Sample group likelihoods and decision boundary (black diagonal line) for the two-group problem of Figure 1; axes p_red vs. p_blue on logarithmic scales from 10^{−6} to 10^{0}.]

4.1 Decision Rule

To perform classification, we establish a simple decision rule based on interest spots that were detected using the one group vs. rest use case. Therefore, we define a group likelihood criterion as follows. For each group g, having the set of interest spots I^g, we define

p^g(x) = \max_{s ∈ I^g} l^{g−h}(x_s; t_s S_s) \cdot \exp\left( −\frac{1}{2} (x − x_s)^T (t_s S_s)^{−1} (x − x_s) \right).    (28)

This is a quite natural trade-off, where the first factor favors spots s with a high density difference, while the latter factor favors spots with a small Mahalanobis distance to the location x that is investigated. We may also think of p^g(x) as an exponential approximation of the scale-space density difference using interesting spots only. Given this, our decision rule simply takes the group that maximizes the group likelihood for the location of interest x. Figure 2 and Figure 3 illustrate the decision boundary obtained from this rule.
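A minimal sketch of this rule is given below. The per-spot tuple layout (location x_s, bandwidth t_s S_s, density difference value at the spot) and the function names are our own assumptions.

```python
import numpy as np

def group_likelihood(x, spots):
    """p^g(x) of Eq. (28); `spots` holds one (x_s, tS_s, l_s) tuple per interest
    spot, where l_s is the scale-space density difference value at the spot."""
    best = 0.0
    for x_s, tS_s, l_s in spots:
        diff = x - x_s
        maha = diff @ np.linalg.inv(tS_s) @ diff        # squared Mahalanobis distance
        best = max(best, l_s * np.exp(-0.5 * maha))
    return best

def classify(x, spots_per_group):
    """Assign x to the group with the maximal group likelihood."""
    return max(spots_per_group, key=lambda g: group_likelihood(x, spots_per_group[g]))
```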

4.2 Data Sets

We carried out our experiments on three classification data sets taken from the UCI Machine Learning Repository. A brief summary of them is given in Table 1. In the first task, we distinguish between benign and malign breast cancer based on manually graded cytological characteristics, cf. (Wolberg and Mangasarian, 1990). In the second task, we distinguish between genuine and forged money based on wavelet-transform-derived features from photographs of banknote-like specimens, cf. (Glock et al., 2009). In the third task, we differentiate among normal, spondylolisthetic and disc-herniated vertebral columns based on biomechanical attributes derived from the shape and orientation of the pelvis and the lumbar vertebral column, cf. (Berthonnaud et al., 2005).

4.3 Parameter Investigation

Before detailing classification results, we investigate two aspects of our approach. Firstly, we inspect the importance of bandwidth selection, benchmarking no kernel density bandwidth against the least-squares cross-validation technique that we use. Secondly, we determine the influence of the scale sampling rate. For the latter, we space n + 1 scales for various n equidistantly from zero to

t_n = F^{−1}_{χ²}(1 − ε \,|\, d) \, \max_g \left( \sqrt[d]{\det(Σ_g)} \right),    (29)

where F^{−1}_{χ²}(\,·\,|\,d) is the cumulative inverse-χ² distribution with d degrees of freedom and Σ_g is the covariance matrix of group g. Intuitively, t_n captures the extent of the group with the largest variance up to a small ε, i.e., here 1.5 · 10^{−8}.
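A short sketch of this scale sampling follows, using SciPy's χ² quantile function; the function name and the list-of-arrays data layout are our own.

```python
import numpy as np
from scipy.stats import chi2

def scale_samples(groups, n, eps=1.5e-8):
    """Equidistant scales t_0 = 0, ..., t_n with t_n chosen according to Eq. (29);
    `groups` is a list of (n_g, d) sample arrays."""
    d = groups[0].shape[1]
    extent = max(np.linalg.det(np.cov(g, rowvar=False)) ** (1.0 / d) for g in groups)
    t_n = chi2.ppf(1.0 - eps, df=d) * extent
    return np.linspace(0.0, t_n, n + 1)
```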


Table 1: Data sets from UCI Machine Learning Repository

              Breast Cancer (BC)   Banknote Authentication   Vertebral Column
Groups        benign / malign      genuine / forged          normal / spondylolisthetic / herniated discs
Samples       444 / 239            762 / 610                 100 / 150 / 60
Dimensions    10                   4                         6

Table 2: Classification accuracy of our decision rule in ⌊%⌋ for data sets of Table 1, with / without bandwidth selection

                     Scale sampling rate n
                     100      125      150      175      200      225      250      275      300
Breast Cancer        65 / 65  97 / 97  97 / 97  95 / 95  97 / 97  95 / 95  97 / 97  96 / 96  97 / 97
Banknote Authen.     96 / 94  96 / 96  96 / 96  98 / 98  98 / 98  98 / 98  98 / 98  98 / 98  99 / 99
Vertebral Column     87 / 82  88 / 83  88 / 84  88 / 83  88 / 85  88 / 85  88 / 86  88 / 86  88 / 87

To investigate the two aspects, we compare classification accuracies with and without bandwidth selection as well as sampling rates ranging from n = 100 to n = 300 in steps of 25. From the results, which are given in Table 2, we observe that bandwidth selection is almost negligible for the Breast Cancer (BC) and the Banknote Authentication (BA) data set. However, the impact is substantial throughout all scale sampling rates for the Vertebral Column (VC) data set. This may be due to the comparably small number of samples per group for this data set. Regarding the second aspect, we observe that for the BA and VC data sets the classification accuracy increases slightly as the scale sampling rate rises. For the BC data set, accuracy remains relatively stable, except at the lower rates. From the results we conclude that bandwidth selection is a necessary part of interest spot detection. We further recommend n ≥ 200, because accuracy starts to saturate at this point for all data sets. For the remaining experiments we used bandwidth selection and a sampling rate of n = 200.

4.4 Classification Results

A comparison of classification accuracies of our decision rule against the aforementioned classifiers is given in Table 3. For the BC data set we observe that, except for the support vector machine (SVM) with cubic kernel, all approaches were highly accurate, scoring between 94% and 97%, with our decision rule being topmost. Even more similar to each other are the results for the BA data set, where all approaches score between 97% and 99%, with ours lying in the middle of this range. Results are most diverse for the VC data set. Here, the SVM with cubic kernel again performs significantly worse than the rest, which all score between 80% and 85%, while our decision rule peaks at 88%. Other research showed similar scores on the given data sets. For example, the artificial neural networks based on Pareto-differential evolution in (Abbass, 2002) obtained 98% accuracy for the BC data set, while (Rocha Neto et al., 2011) achieved 83% to 85% accuracy on the VC data set with SVMs with different kernels. These results suggest that our interest spots carry information about a data set that is similarly important to the information exploited by the well-established classifiers.

Table 3: Classification accuracies of different classifiers in ⌊%⌋ for data sets of Table 1

                         BC    BA    VC
decision tree            94    98    82
k-nearest neighbors      97    99    80
Fisher's discriminant    96    97    80
linear kernel SVM        96    99    85
cubic kernel SVM         90    98    74
our decision rule        97    98    88

Confusion tables for our decision rule are given in Table 4 for all data sets. As can be seen, our approach gave balanced inter-group results for the BC and the BA data set. We obtained only small inaccuracies for the recall of the benign (96%) and genuine (97%) groups as well as for the precision of the malign (94%) and forged (96%) groups. Results for the VC data set were more diverse. Here, a number of samples with disc herniation were mistaken for being normal, lowering the recall of the herniated group (86%) noticeably. However, more severe inter-group imbalances were caused by the normal samples, which were relatively often mistaken for being spondylolisthetic or herniated discs. Thus, recall for the normal group (76%) and precision for the herniated group (74%) decreased significantly. The latter is to some degree caused by a handful of strong outliers from the normal group that fall into either of the other groups, which can already be seen from the group likelihood plot in Figure 4. This finding was made by others as well, cf. (Rocha Neto and Barreto, 2009).

The other classifiers performed similarly balanced on the BA and BC data sets. Major differences occurred on the VC data set only.


Table 4: Confusion table for predicted/actual groups of our decision rule for data sets of Table 1 (rightmost column: precision, bottom row: recall, both in ⌊%⌋)

(a) Breast Cancer
P \ A    ben.   mal.
ben.     429    4      99
mal.     15     235    94
         96     98     ⌊%⌋

(b) Banknote Authentication
P \ A    gen.   for.
gen.     742    0      100
for.     20     610    96
         97     100    ⌊%⌋

(c) Vertebral Column
P \ A    norm.   spon.   hern.
norm.    76      1       6       91
spon.    10      145     2       92
hern.    14      4       52      74
         76      96      86      ⌊%⌋

[Figure 4: Sample group likelihoods and decision boundary (black diagonal line) for the Vertebral Column data set of Table 1; normal, spondylolisthetic and herniated discs in blue, magenta and red, respectively. Axes: p_normal vs. max(p_spond., p_hern.), both on logarithmic scales from 10^{−30} to 10^{0}.]

A precision/recall comparison of all classifiers on the VC data set is given in Table 5. We observe that the precision of the normal and the herniated group is significantly lower (gap > 12%) than that of the spondylolisthetic group for all classifiers except for our decision rule, for which at least the normal group is predicted with a similar precision. Regarding the recall, we note an even more unbalanced behavior. Here, a strict ordering from spondylolisthetic over normal to herniated discs occurs. The differences between the recall of spondylolisthetic and normal are significant (gap > 16%), and those between normal and herniated are even larger (gap > 18%), among all classifiers that we compared against. The recalls for our decision rule are distributed differently, ordering the herniated before the normal group. Also, the magnitude of the differences is less pronounced (gaps ≈ 10%) for our decision rule. The results of this comparison indicate that the information carried by our interest spots tends to be more balanced among groups than the information carried by the well-established classifiers that we compared against.

Table 5: Classification precision/recall of different classifiers in ⌊%⌋ for the Vertebral Column data set of Table 1

                         norm.     spon.     hern.
decision tree            69 / 83   97 / 95   68 / 50
k-nearest neighbors      70 / 74   96 / 96   58 / 55
Fisher's discriminant    70 / 80   87 / 92   74 / 48
linear kernel SVM        76 / 85   97 / 96   72 / 61
cubic kernel SVM         59 / 82   90 / 91   52 / 18
our decision rule        91 / 76   92 / 96   74 / 86

5 CONCLUSION

We proposed a detection framework that is able to identify differences among the sample distributions of different observations. Potential applications are manifold, touching fields such as medicine, biology, chemistry and physics. Our approach is based on the density function difference of the observations in feature space, seeking to identify spots where one observation dominates the other. Superimposing a scale-space framework on the density difference, we are able to detect interest spots of various locations, sizes and shapes in an efficient manner.

Our framework is intended for semi-automatic processing, providing human-interpretable interest spots for potential further investigation of some kind. We outlined that these interest spots carry valuable information about a data set in a number of classification tasks from the UCI Machine Learning Repository. To this end, we established a simple decision rule on top of our framework. Results indicate state-of-the-art performance of our approach, which underpins the importance of the information that is carried by these interest spots.

In the future, we plan to extend our work to support repetitive features such as angles, which is currently a limitation of our approach. Modifying our notion of distance, we would then be able to cope with problems defined on, e.g., a sphere or torus. Future work may also include the migration of other types of scale-space detectors to density difference problems. This includes the notion of ridges, valleys and zero-crossings, leading to richer sources of information.

ACKNOWLEDGEMENTS

This research was partially funded by the project "Visual Analytics in Public Health" (TO 166/13-2) of the German Research Foundation.

REFERENCES

Abbass, H. A. (2002). An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25:265–281.

Adal, K. M., Sidibe, D., Ali, S., Chaum, E., Karnowski, T. P., and Meriaudeau, F. (2014). Automated detection of microaneurysms using scale-adapted blob analysis and semi-supervised learning. Computer Methods and Programs in Biomedicine, 114:1–10.

Babaud, J., Witkin, A. P., Baudin, M., and Duda, R. O. (1986). Uniqueness of the Gaussian kernel for scale-space filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:26–33.

Berthonnaud, E., Dimnet, J., Roussouly, P., and Labelle, H. (2005). Analysis of the sagittal balance of the spine and pelvis using shape and orientation parameters. Journal of Spinal Disorders and Techniques, 18:40–47.

Bretzner, L. and Lindeberg, T. (1998). Feature tracking with automatic selection of spatial scales. Computer Vision and Image Understanding, 71:385–392.

Duong, T. and Hazelton, M. L. (2003). Plug-in bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics, 15:17–30.

Duong, T. and Hazelton, M. L. (2005). Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32:485–506.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1627–1645.

Glock, S., Gillich, E., Schaede, J., and Lohweg, V. (2009). Feature extraction algorithm for banknote textures based on incomplete shift invariant wavelet packet transform. In Proceedings of the Annual Pattern Recognition Symposium, volume 5748, pages 422–431.

Lakemond, R., Sridharan, S., and Fookes, C. (2012). Hessian-based affine adaptation of salient local image features. Journal of Mathematical Imaging and Vision, 44:150–167.

Lindeberg, T. (1996). Edge detection and ridge detection with automatic scale selection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 465–470.

Lindeberg, T. (1998). Feature detection with automatic scale selection. International Journal of Computer Vision, 30:79–116.

Lindeberg, T. and Garding, J. (1994). Shape-adapted smoothing in estimation of 3-D depth cues from affine distortions of local 2-D brightness structure. In Proceedings of the European Conference on Computer Vision, pages 389–400.

Mikolajczyk, K. and Schmid, C. (2004). Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60:63–86.

Rocha Neto, A. R. and Barreto, G. A. (2009). On the application of ensembles of classifiers to the diagnosis of pathologies of the vertebral column: A comparative analysis. IEEE Latin America Transactions, 7:487–496.

Rocha Neto, A. R., Sousa, R., Barreto, G. A., and Cardoso, J. S. (2011). Diagnostic of pathology on the vertebral column with embedded reject option. In Pattern Recognition and Image Analysis, volume 6669, pages 588–595. Springer Berlin Heidelberg.

Sain, S. R., Baggerly, K. A., and Scott, D. W. (1992). Cross-validation of multivariate densities. Journal of the American Statistical Association, 89:807–817.

Seyedhosseini, M., Kumar, R., Jurrus, E., Giuly, R., Ellisman, M., Pfister, H., and Tasdizen, T. (2011). Detection of neuron membranes in electron microscopy images using multi-scale context and Radon-like features. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention, pages 670–677.

Wand, M. P. and Jones, M. C. (1994). Multivariate plug-in bandwidth selection. Computational Statistics, 9:97–116.

Wolberg, W. and Mangasarian, O. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, pages 9193–9196.

Yi, C. and Tian, Y. (2011). Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20:2594–2605.

Yuille, A. L. and Poggio, T. A. (1986). Scaling theorems for zero crossings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:15–25.