CVPR 2010: Advanced IT in CVPR in a Nutshell: Part 3: Feature Selection
TRANSCRIPT
Tutorial
Advanced Information Theory in CVPR “in a Nutshell”
CVPR, June 13-18, 2010, San Francisco, CA
High-Dimensional Feature Selection: Images, Genes & Graphs
Francisco Escolano
High-Dimensional Feature Selection
Background. Feature selection in CVPR is mainly motivated by: (i) dimensionality reduction for improving the performance of classifiers, and (ii) feature-subset pursuit for a better understanding/description of the patterns at hand (either with generative or discriminative methods). Exploring the domain of images [Zhu et al., 97] and other patterns (e.g. micro-array/gene-expression data) [Wang and Gotoh, 09] implies dealing with thousands of features [Guyon and Elisseeff, 03]. In terms of Information Theory (IT) this task demands pursuing the most informative features, not only individually but also capturing their statistical interactions, beyond taking into account pairs of features. This has been a big challenge since the early 70's due to its intrinsic combinatorial nature.
2/56
Feature Classification
Strong and Weak Relevance
1. Strong relevance: a feature fi is strongly relevant if its removal degrades the performance of the Bayes optimum classifier.
2. Weak relevance: fi is weakly relevant if it is not strongly relevant and there exists a subset of features F′ such that the performance on F′ ∪ {fi} is better than the one on just F′.
Redundancy and Noise
1. Redundant features: those which can be removed because there is another feature or subset of features which already supplies the same information about the class.
2. Noisy features: neither redundant nor informative.
3/56
Feature Classification (2)
Figure: Feature classification (irrelevant, weakly relevant, and strongly relevant features).
4/56
Wrappers and Filters
Wrappers vs Filters
1. Wrapper feature selection (WFS) consists in selecting features according to the classification results that these features yield (e.g. using Cross Validation (CV)). Therefore wrapper feature selection is a classifier-dependent approach.
2. Filter feature selection (FFS) is classifier-independent, as it is based on statistical analysis of the input variables (features), given the classification labels of the samples.
3. Wrappers build classifiers each time a feature set has to be evaluated. This makes them more prone to overfitting than filters.
4. In FFS the classifier itself is built and tested after FS (a toy sketch of both scoring styles follows this list).
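To make the distinction concrete, here is a minimal sketch (not part of the tutorial; it assumes scikit-learn and the function names are illustrative) of how a wrapper scores a candidate subset by training a classifier, whereas a filter scores it from statistics of the inputs and labels alone:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import mutual_info_classif

def wrapper_score(X, y, subset):
    # Wrapper: classifier-dependent; a K-NN classifier is built and
    # cross-validated for every candidate feature subset.
    return cross_val_score(KNeighborsClassifier(), X[:, subset], y, cv=10).mean()

def filter_score(X, y, subset):
    # Filter: classifier-independent; only statistics of the selected
    # input variables given the class labels are used (here, per-feature MI).
    return float(np.sum(mutual_info_classif(X[:, subset], y)))
```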
5/56
FS is a Complex Task
The Complexity of FS
The only way to assure that a feature set is optimal (following what criterion?) is the exhaustive search among feature combinations. The curse of dimensionality limits this search, as the complexity is

$$O(z),\qquad z=\sum_{i=1}^{n}\binom{n}{i}.$$

Filter design is desirable in order to avoid overfitting, but filters must be as multivariate as wrappers. However, this implies designing and estimating a cost function capturing the high-order interactions of many variables. In the following we will focus on FFS.
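For reference, the number of non-empty candidate subsets has a closed form, which makes the combinatorial explosion explicit (for the 48-feature image experiments later in this tutorial this is already about 2.8 × 10^14 subsets):

$$z=\sum_{i=1}^{n}\binom{n}{i}=2^{n}-1.$$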
6/56
Mutual Information Criterion
A Criterion is Needed
The primary problem of feature selection is the criterion which evaluates a feature set. It must decide whether a feature subset is suitable for the classification problem or not. The optimal criterion for such a purpose would be the Bayesian error rate for the subset of selected features:

$$E(\vec{S})=\int_{\vec{S}} p(\vec{S})\left(1-\max_i\, p(c_i\,|\,\vec{S})\right)d\vec{S},$$

where $\vec{S}$ is the vector of selected features and $c_i \in C$ is a class from all the possible classes $C$ existing in the data.
7/56
Mutual Information Criterion (2)
Bounding Bayes Error
An upper bound of the Bayesian error, obtained by Hellman and Raviv (1970), is:

$$E(\vec{S}) \leq \frac{H(C\,|\,\vec{S})}{2}.$$

This bound is related to mutual information (MI), because mutual information can be expressed as

$$I(\vec{S};C)=H(C)-H(C\,|\,\vec{S}),$$

and $H(C)$ is the entropy of the class labels, which does not depend on the feature subspace $\vec{S}$. Thus minimizing $H(C\,|\,\vec{S})$, and hence the bound, amounts to maximizing $I(\vec{S};C)$.
8/56
Mutual Information Criterion (3)
Figure: Redundancy and MI (features x1, x2, x3, x4 and the mutual information I(X;C) with the class C).
9/56
Mutual Information Criterion (4)
MI and KL Divergence
From the definition of MI we have that:
$$I(X;Y)=\sum_{y\in Y}\sum_{x\in X} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} =\sum_{y\in Y}p(y)\sum_{x\in X}p(x\,|\,y)\log\frac{p(x\,|\,y)}{p(x)} = E_Y\!\left(KL\big(p(x\,|\,y)\,\|\,p(x)\big)\right).$$

Then, maximizing MI is equivalent to maximizing the expectation of the KL divergence between the class-conditional densities $p(\vec{S}\,|\,C)$ and the density of the feature subset $p(\vec{S})$.
10/56
Mutual Information Criterion (5)
The Bottleneck
- Estimating mutual information between a high-dimensional continuous set of features and the class labels is not straightforward, due to the entropy estimation.
- Simplifying assumptions: (i) dependency is not informative about the class, (ii) redundancies between features are independent of the class [Vasconcelos & Vasconcelos, 04]. The basic idea is to simplify the computation of the MI, for instance:

$$I(\vec{S}^{*};C)\approx\sum_{x_i\in\vec{S}} I(x_i;C).$$

- More realistic assumptions involving interactions are needed!
11/56
The mRMR Criterion
Definition
Maximize the relevance I(xj; C) of each individual feature xj ∈ F and simultaneously minimize the redundancy between xj and the rest of the selected features xi ∈ S, i ≠ j. This criterion is known as the min-Redundancy Max-Relevance (mRMR) criterion and its formulation for the selection of the m-th feature is [Peng et al., 2005]:

$$\max_{x_j\in\vec{F}-\vec{S}_{m-1}}\left[ I(x_j;C)-\frac{1}{m-1}\sum_{x_i\in\vec{S}_{m-1}} I(x_j;x_i)\right]$$

Thus, second-order interactions are considered!
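A minimal sketch of the greedy mRMR selection (assuming discretized features and scikit-learn's mutual_info_score for the pairwise MI terms; this is an illustration, not the authors' code):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X_disc, y, m):
    """Greedy mRMR: at each step pick argmax_j I(x_j;C) - mean_{i in S} I(x_j;x_i).
    X_disc : (N, F) integer-valued (e.g. binned) feature matrix; y : class labels."""
    F = X_disc.shape[1]
    relevance = np.array([mutual_info_score(y, X_disc[:, j]) for j in range(F)])
    selected, remaining = [], list(range(F))
    while len(selected) < m and remaining:
        if not selected:                       # first feature: max relevance only
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            def score(j):
                red = np.mean([mutual_info_score(X_disc[:, j], X_disc[:, i])
                               for i in selected])
                return relevance[j] - red
            best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```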
12/56
The mRMR Criterion (2)
Properties
- Can be embedded in a forward greedy search, that is, at each iteration of the algorithm the next feature optimizing the criterion is selected.
- Doing so, this criterion is equivalent to a first-order approximation of the Maximum Dependency (MD) criterion:

$$\max_{\vec{S}\subseteq\vec{F}} I(\vec{S};C)$$

Then, the m-th feature is selected according to:

$$\max_{x_j\in\vec{F}-\vec{S}_{m-1}} I(\vec{S}_{m-1},x_j;C)$$
13/56
Efficient MD
Need of an Entropy Estimator
- Whilst in mRMR the mutual information is incrementally estimated between pairs of one-dimensional variables, in MD the estimation of I(S; C) is not trivial because S may consist of a large number of features.
- If we express the MI in terms of the conditional entropy H(S|C) we have:

$$I(\vec{S};C)=H(\vec{S})-H(\vec{S}\,|\,C).$$

To do this, the entropies H(S | C = c), weighted by p(C = c), have to be calculated for every class c. Either way, we must calculate multi-dimensional entropies! (A sketch of this computation follows.)
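A minimal sketch of this decomposition and of the forward MD search (hedged: `entropy` stands for any multidimensional differential-entropy estimator, e.g. the k-NN estimator of Leonenko et al. introduced on the next slides):

```python
import numpy as np

def mi_md(X_S, y, entropy):
    """I(S;C) = H(S) - sum_c p(C=c) H(S | C=c), with S a multidimensional feature subset."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()
    h_cond = sum(p_c * entropy(X_S[y == c]) for c, p_c in zip(classes, priors))
    return entropy(X_S) - h_cond

def md_forward(X, y, entropy, m):
    """Greedy MD: add, at each step, the feature maximizing I(S_{m-1} U {x_j}; C)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m and remaining:
        best = max(remaining, key=lambda j: mi_md(X[:, selected + [j]], y, entropy))
        selected.append(best)
        remaining.remove(best)
    return selected
```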
14/56
Leonenko’s Estimator
Theoretical Bases
Recently, Leonenko et al. published an extensive study [Leonenko et al., 08] of Renyi and Tsallis entropy estimation, also considering the limit α → 1 for obtaining the Shannon entropy. Their construction relies on the integral

$$I_\alpha = E\{f^{\alpha-1}(\vec{X})\} = \int_{\mathbb{R}^d} f^{\alpha}(x)\,dx,$$

where f(·) is the density of a set of N i.i.d. samples X = {X1, X2, ..., XN}. The latter integral is valid for α ≠ 1; however, the limits for α → 1 are also calculated in order to cover the Shannon entropy estimation.
15/56
Leonenko’s Estimator (2)
Entropy Maximization Distributions
- The α-entropy maximizing distributions are only defined for 0 < α < 1, where the entropy Hα is a concave function. The maximizing distributions are defined under some constraints: the uniform distribution maximizes the α-entropy under the constraint that the distribution has a finite support, while for distributions with a given covariance matrix the maximizing distribution is a Student-t, if d/(d + 2) < α < 1, for any number of dimensions d ≥ 1.
- This is a generalization of the property that the Gaussian distribution maximizes the Shannon entropy H.
16/56
Leonenko’s Estimator (3)
Renyi, Tsallis and Shannon Entropy Estimates
For α ≠ 1, the estimate of Iα is:

$$\hat{I}_{N,k,\alpha}=\frac{1}{N}\sum_{i=1}^{N}\left(\zeta_{N,i,k}\right)^{1-\alpha}, \qquad \zeta_{N,i,k}=(N-1)\,C_k\,V_d\left(\rho^{(i)}_{k,N-1}\right)^{d},$$

where $\rho^{(i)}_{k,N-1}$ is the Euclidean distance from $X_i$ to its k-th nearest neighbour among the remaining N − 1 samples, $V_d=\pi^{d/2}/\Gamma(d/2+1)$ is the volume of the unit ball B(0,1) in $\mathbb{R}^d$, and $C_k=\left[\Gamma(k)/\Gamma(k+1-\alpha)\right]^{\frac{1}{1-\alpha}}$.
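A sketch of these estimates as code (my reading of the formulas above, assuming numpy/scipy; α ≠ 1):

```python
import numpy as np
from scipy.special import gamma
from scipy.spatial import cKDTree

def renyi_tsallis_entropy_knn(X, k=5, alpha=0.9):
    """k-NN Renyi and Tsallis entropy estimates for alpha != 1 (after Leonenko et al., 08)."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    dist, _ = cKDTree(X).query(X, k=k + 1)           # column 0 is the point itself
    rho_k = dist[:, -1]                              # distance to the k-th NN
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)        # volume of the unit ball in R^d
    C_k = (gamma(k) / gamma(k + 1 - alpha)) ** (1.0 / (1.0 - alpha))
    zeta = (N - 1) * C_k * V_d * rho_k ** d
    I_hat = np.mean(zeta ** (1.0 - alpha))           # estimate of the integral I_alpha
    H_renyi = np.log(I_hat) / (1.0 - alpha)
    S_tsallis = (1.0 - I_hat) / (alpha - 1.0)
    return H_renyi, S_tsallis
```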
17/56
Leonenko’s Estimator (4)
Renyi, Tsallis and Shannon Entropy Estimates (cont.)
- The estimator $\hat{I}_{N,k,\alpha}$ is asymptotically unbiased, which is to say that $E\,\hat{I}_{N,k,\alpha}\to I_\alpha$ as $N\to\infty$. It is also consistent under mild conditions.
- Given these conditions, the estimated Renyi entropy $H_\alpha$ of f is

$$\hat{H}_{N,k,\alpha}=\frac{\log(\hat{I}_{N,k,\alpha})}{1-\alpha},\quad\alpha\neq 1,$$

and for the Tsallis entropy $S_\alpha=\frac{1}{\alpha-1}\left(1-\int f^{\alpha}(x)\,dx\right)$ the estimate is

$$\hat{S}_{N,k,\alpha}=\frac{1-\hat{I}_{N,k,\alpha}}{\alpha-1},\quad\alpha\neq 1.$$

As $N\to\infty$, both estimators are asymptotically unbiased and consistent.
18/56
Leonenko’s Estimator (5)
Shannon Entropy Estimate
The limit of the Tsallis entropy estimator as α → 1 gives the Shannon entropy estimator:

$$\hat{H}_{N,k,1}=\frac{1}{N}\sum_{i=1}^{N}\log\xi_{N,i,k}, \qquad \xi_{N,i,k}=(N-1)\,e^{-\Psi(k)}\,V_d\left(\rho^{(i)}_{k,N-1}\right)^{d},$$

where $\Psi(k)$ is the digamma function: $\Psi(1)=-\gamma\simeq-0.5772$, $\Psi(k)=-\gamma+A_{k-1}$, with $A_0=0$ and $A_j=\sum_{i=1}^{j}\frac{1}{i}$.
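And the α → 1 limit as a hedged code sketch (it can be sanity-checked against the closed-form Gaussian entropy 0.5·log(2πeσ²)):

```python
import numpy as np
from scipy.special import digamma, gamma
from scipy.spatial import cKDTree

def shannon_entropy_knn(X, k=5):
    """k-NN Shannon entropy estimate H_{N,k,1} = (1/N) sum_i log xi_{N,i,k}."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    dist, _ = cKDTree(X).query(X, k=k + 1)
    rho_k = dist[:, -1]                              # k-th NN distance (self excluded)
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)        # volume of the unit ball in R^d
    xi = (N - 1) * np.exp(-digamma(k)) * V_d * rho_k ** d
    return float(np.mean(np.log(xi)))
```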
19/56
Leonenko’s Estimator (6)
Figure: Comparison of the k-NN (5-NN) estimator vs the k-d partitioning estimator for 1-D to 4-D Gaussian distributions (entropy vs number of samples; reference values for a Gaussian with Σ = I and with the real Σ).
20/56
Reminder: Visual Localization
Figure: 3D+2D map for coarse-to-fine localization
21/56
Image Categorization
Problem Statement
- Design a minimal-complexity and maximal-accuracy classifier based on forward MI-FS for coarse localization [Lozano et al., 09].
- We have a bag of filters (say we have K different filters): Harris detector, Canny, horizontal and vertical gradient, gradient magnitude, 12 colour filters based on Gaussians over the hue range, sparse depth.
- We take the statistics of each filter and compose a histogram with, say, B bins.
- Therefore, the total number of features NF per image is NF = K × (B − 1) (a sketch of this descriptor is given after this list).
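A plausible sketch of the per-image descriptor described above (the B − 1 kept bins per filter are an assumption: since each normalized histogram sums to one, its last bin is redundant):

```python
import numpy as np

def image_feature_vector(filter_responses, B=4):
    """Concatenate a (B-1)-bin histogram summary of each filter response map.
    filter_responses : list of K 2-D arrays (one response map per filter).
    Returns a vector of K * (B - 1) features."""
    feats = []
    for r in filter_responses:
        hist, _ = np.histogram(np.ravel(r), bins=B)
        hist = hist / max(hist.sum(), 1)             # normalize to a distribution
        feats.extend(hist[:-1])                      # drop the redundant last bin
    return np.asarray(feats)
```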
22/56
Image Categorization (2)
Figure: Examples of filter responses. From top to bottom and left to right: input image, depth, vertical gradient, gradient magnitude, and five colour filters.
23/56
Image Categorization (3)
Figure: Greedy feature selection (10-fold CV error (%) vs number of selected features). Left: finding the optimal number of bins B for K = 16 filters (6, 4, and 2 bins). Right: testing whether |C| = 6 is a correct choice (4-bin histograms; 8, 6, 4, and 2 classes).
24/56
Image Categorization (4)
Figure: MD, MmD and mRMR feature selection performance (10-fold CV error and test error (%) vs number of features) on image histogram data with 48 features.
25/56
Image Categorization (5)
C#1 C#2 C#3 C#4 C#5 C#6
C#1 26 0 0 0 0 0
C#2 2/3 63/56 1/4 0/3 0 0
C#3 0 1/0 74/67 1/9 0 0
C#4 4/12 5/6 10/0 96/95 0/2 0
C#5 0 0 0 0 81 0
C#6 0 0 0 0 30/23 78/85
Table: Key question: what is the best classifier after Feature Selection? K-NN / SVM confusion matrix.
26/56
Image Categorization (6)
Figure: Test images and their 1st to 5th nearest neighbours.
27/56
Image Categorization (7)
Figure: Confusion trajectories at coarse (left) and fine (right) localization
28/56
Image Categorization (8)
Figure: Confusion trajectory using only coarse localization
29/56
Image Categorization (9)
Figure: Confusion trajectory after coarse-to-fine localization
30/56
Image Categorization (10)
Results
- With mRMR, the lowest CV error (8.05%) is achieved with a set of 17 features, while with MD a CV error of 4.16% is achieved with a set of 13 features.
- The CV errors yielded by MmD are very similar to those of MD.
- On the other hand, test errors present a slightly different evolution.
- The best error is given by MD, and the worst one is given by mRMR.
- This may be explained by the fact that the entropy estimator we use does not degrade its accuracy as the number of dimensions increases.
31/56
Gene Selection
Microarray Analysis
- The basic setup of a microarray platform is a glass slide which is spotted with DNA fragments or oligonucleotides representing some specific gene-coding regions. Then, purified RNA is labelled fluorescently or radioactively and it is hybridized to the slide.
- The hybridization can be done simultaneously with reference RNA to enable comparison of data from multiple experiments. Then washing is performed and the raw data are obtained by scanning with a laser or by autoradiographic imaging. The obtained expression levels of tens of thousands of genes are numerically quantified and used for statistical studies.
32/56
Gene Selection (2)
Figure: An illustration of the basic idea of a microarray platform: sample RNA and reference cDNA are fluorescently labelled, combined, and hybridized.
33/56
Gene Selection (3)
Figure: MD-FS performance on the NCI microarray data set with 6,380 features (LOOCV error (%) and I(S;C) vs number of features). Lowest LOOCV error with a set of 39 features [Bonev et al., 08].
34/56
Gene Selection (4)
Figure: Performance of FS criteria (MD, MmD, mRMR, and a LOOCV wrapper; % LOOCV error vs number of features) on the NCI microarray. Only the selection of the first 36 features (out of 6,380) is represented.
35/56
Gene Selection (5)
Figure: Feature selection on the NCI DNA microarray data (panels: MD feature selection and mRMR feature selection; x-axis: number of the selected gene, y-axis: class/disease). Features (genes) selected by MD and mRMR are marked with an arrow.
36/56
Gene Selection (6)
Results
- Leukemia: 38 bone marrow samples, 27 Acute Lymphoblastic Leukemia (ALL) and 11 Acute Myeloid Leukemia (AML), over 7,129 probes (features) from 6,817 human genes. A test set of 34 samples is also provided, with 20 ALL and 14 AML. We obtained a 2.94% test error with 7 features, while other authors report the same error selecting 49 features via individualized markers. Other recent works report test errors higher than 5% for the Leukemia data set.
- Central Nervous System Embryonal Tumors: 60 patient samples, 21 survivors and 39 failures. There are 7,129 genes in the data set. We selected 9 genes achieving a 1.67% LOOCV error. The best result in the literature is an 8.33% CV error reported with 100 features.
37/56
Gene Selection (7)
Results (cont)
- Colon: 62 samples collected from colon-cancer patients. Among them, 40 biopsies are from tumors and 22 normal biopsies are from healthy parts of the colons of the same patients. From 6,500 genes, 2,000 were selected for this data set, based on the confidence in the measured expression levels. In the feature selection experiments we performed, 15 of them were selected, resulting in a 0% LOOCV error, while recent works report errors higher than 12%.
- Prostate: 25 tumor and 9 normal samples, with 12,600 genes. Our experiments achieved the best error with only 5 features, with a test error of 5.88%. Other works report test errors higher than 6%.
38/56
Structural Recognition
Reeb Graphs
- Given a surface S and a real function f : S → R, the Reeb graph (RG) represents the topology of S through a graph structure whose nodes correspond to the critical points of f.
- The Extended Reeb Graph (ERG) [Biasotti, 05] is an approximation of the RG obtained by using a fixed number of level sets (63 in this work) that divide the surface into a set of regions; critical regions, rather than critical points, are identified according to the behaviour of f along the level sets; ERG nodes correspond to critical regions, while the arcs are detected by tracking the evolution of the level sets.
39/56
Structural Recognition (2)
Figure: Extended Reeb Graphs. Image courtesy of Daniela Giorgi and Silvia Biasotti. a) Geodesic, b) distance to center, c) sphere.
40/56
Structural Recognition (3)
Catalog of Features
- Subgraph node centrality: quantifies the degree of participation of a node i in a subgraph:

$$C_S(i)=\sum_{k=1}^{n}\phi_k(i)^2 e^{\lambda_k},$$

where n = |V|, λk is the k-th eigenvalue of the adjacency matrix A and φk its corresponding eigenvector.
- Perron-Frobenius eigenvector: φn (the eigenvector corresponding to the largest eigenvalue of A). The components of this vector denote the importance of each node.
- Adjacency spectra: the magnitudes |λk| of the (leading) eigenvalues of A have been experimentally validated for graph embedding [Luo et al., 03].
41/56
Structural Recognition (4)
Catalog of Features (cont.)
- Spectra from Laplacians, both the un-normalized and the normalized one. The Laplacian spectrum plays a fundamental role in the development of regularization kernels for graphs.
- Fiedler vector: the eigenvector corresponding to the first non-trivial eigenvalue of the Laplacian (φ2 in connected graphs). It encodes the connectivity structure of the graph (it is actually the core of graph-cut methods).
- Commute times, coming from either the un-normalized or the normalized Laplacian. They encode the path-length distribution of the graph.
- The Heat-Flow Complexity trace, as seen in the section above (a numpy sketch of several of these features follows this list).
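A numpy sketch of a few of the features listed on this and the previous slide, computed from a symmetric adjacency matrix A (binning into histograms is omitted; this is an illustration, not the authors' implementation):

```python
import numpy as np

def spectral_graph_features(A):
    """Some spectral features of a graph given its symmetric adjacency matrix A."""
    lam, phi = np.linalg.eigh(A)                     # adjacency spectrum (ascending)
    centrality = (phi ** 2) @ np.exp(lam)            # C_S(i) = sum_k phi_k(i)^2 e^{lambda_k}
    perron = phi[:, -1]                              # eigenvector of the largest eigenvalue
    L = np.diag(A.sum(axis=1)) - A                   # un-normalized Laplacian
    mu, psi = np.linalg.eigh(L)
    fiedler = psi[:, 1]                              # first non-trivial eigenvector (connected graph)
    return centrality, perron, np.abs(lam), mu, fiedler
```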
42/56
Structural Recognition (5)
Backwards Filtering
- The entropies H(·) of a set with a large number n of features can be efficiently estimated using the k-NN-based method developed by Leonenko.
- Thus, we take the data set with all its features and determine which feature to discard in order to produce the smallest decrease of I(S_{n−1}; C).
- We then repeat the process on the remaining feature set, until only one feature is left [Bonev et al., 10] (see the sketch after this list).
- Then, the subset yielding the minimum 10-fold CV error is selected.
- Most of the spectral features are histogrammed into a variable number of bins: 9 · 3 · (2 + 4 + 6 + 8) = 540 features.
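A sketch of the backward greedy filter (hedged; `mi_estimator(X_subset, y)` stands for I(S;C) computed with the k-NN entropy estimator, as in the MD sketch earlier in this section):

```python
def backward_elimination(X, y, mi_estimator):
    """Repeatedly drop the feature whose removal least decreases I(S;C);
    the subset with minimum 10-fold CV error is picked afterwards."""
    remaining = list(range(X.shape[1]))
    history = [list(remaining)]
    while len(remaining) > 1:
        drop = max(remaining,
                   key=lambda f: mi_estimator(X[:, [g for g in remaining if g != f]], y))
        remaining.remove(drop)
        history.append(list(remaining))
    return history
```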
43/56
Experimental Results
Elements
- SHREC database (15 classes × 20 objects). Each of the 300 samples is characterized by 540 features and has a class label l ∈ {human, cup, glasses, airplane, chair, octopus, table, hand, fish, bird, spring, armadillo, buste, mechanic, four-leg}.
- The errors are measured by 10-fold cross validation (10-fold CV). MI is maximized as the number of selected features grows, but a high number of features degrades the classification performance.
- However, the MI curve, as well as the selected features, does not depend on the classifier, since MI is a purely information-theoretic measure.
44/56
Structural Recognition (7)
Figure: Classification error vs Mutual Information.
45/56
Structural Recognition (8)
Figure: Feature selection on the 15-class experiment (left) and the feature statistics for the best-error feature set (right).
46/56
Structural Recognition (9)
Analysis
- For the 15-class problem, the minimal error (23.3%) is achieved with a set of 222 features.
- The coloured areas in the plot represent how much a feature is used with respect to the remaining ones (the height on the Y axis is arbitrary).
- For the 15-class experiment, in the feature sets smaller than 100 features, the most important feature is the Fiedler vector, in combination with the remaining features. Commute time is also an important feature. Some features that are not relevant are the node centrality and the complexity flow. Turning our attention to the graph types, all three appear relevant.
47/56
Structural Recognition (10)
Analysis (cont)
- We can see that the four different binnings of the features all have importance for graph characterization.
- These conclusions concerning the relevance of each feature cannot be drawn without performing some additional experiments with different groups of graph classes.
- We perform four different 3-class experiments. The classes share some structural similarities; for example, the 3 classes of the first experiment have a head and limbs.
- Although in each experiment the minimum error is achieved with very different numbers of features, the participation of each feature is highly consistent with the 15-class experiment.
48/56
Structural Recognition (11)
Figure: Feature Selection on 3-class experiments: Human/Armadillo/Four-legged, Aircraft/Fish/Bird.
49/56
Structural Recognition (12)
Figure: Feature Selection on 3-class experiments: Cup/Bust/Mechanic, Chair/Octopus/Table.
50/56
Structural Recognition (14)
Figure: Forward vs backward FS (10-fold CV error (%) vs number of features; forward selection and backward elimination).
51/56
Structural Recognition (15)
Figure: Bootstrapping experiments on forward (left) and backward (right) feature selection, with 25 bootstraps (test error (%) and standard deviation vs number of features; feature coincidences across bootstraps vs number of selected features).
52/56
Structural Recognition (16)
Conclusions
- The main difference among the experiments is that node centrality seems to be more important for discerning among elongated objects with long, sharp characteristics. Although all three graph types are relevant, the sphere graph performs best for blob-shaped objects.
- Bootstrapping shows that the backward method performs better.
- The IT approach not only yields good classification but also reveals the role of each spectral feature in it!
- Given other points of view (size functions, eigenfunctions, ...), it is desirable to exploit them!
53/56
References
- [Zhu et al., 97] Zhu, S.C., Wu, Y.N. and Mumford, D. (1997). Filters, Random Fields And Maximum Entropy (FRAME): towards a unified theory for texture modeling. IJCV, 27(2): 107–126.
- [Wang and Gotoh, 09] Wang, X. and Gotoh, O. (2009). Accurate molecular classification of cancer using simple rules. BMC Medical Genomics, 64(2).
- [Guyon and Elisseeff, 03] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3: 1157–1182.
- [Vasconcelos and Vasconcelos, 04] Vasconcelos, N. and Vasconcelos, M. (2004). Scalable discriminant feature selection for image retrieval and recognition. In CVPR 2004, pages 770–775.
54/56
References (2)
- [Peng et al., 2005] Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on PAMI, 27(8): 1226–1238.
- [Leonenko et al., 08] Leonenko, N., Pronzato, L., and Savani, V. (2008). A class of Renyi information estimators for multidimensional densities. The Annals of Statistics, 36(5): 2153–2182.
- [Lozano et al., 09] Lozano, M.A., Escolano, F., Bonev, B., Suau, P., Aguilar, W., Saez, J.M., Cazorla, M. (2009). Region and constell. based categorization of images with unsupervised graph learning. Image and Vision Computing, 27(7): 960–978.
55/56
References (3)
- [Bonev et al., 08] Bonev, B., Escolano, F., and Cazorla, M. (2008). Feature selection, mutual information, and the classification of high-dimensional patterns. Pattern Analysis and Applications, 11(3-4): 304–319.
- [Biasotti, 05] Biasotti, S. (2005). Topological coding of surfaces with boundary using Reeb graphs. Computer Graphics and Geometry, 7(1): 31–45.
- [Luo et al., 03] Luo, B., Wilson, R., and Hancock, E. (2003). Spectral embedding of graphs. Pattern Recognition, 36(10): 2213–2223.
- [Bonev et al., 10] Bonev, B., Escolano, F., Giorgi, D., Biasotti, S. (2010). High-dimensional spectral feature selection for 3D object recognition based on Reeb graphs. SSPR 2010 (accepted).
56/56