Variable Selection and Weighting by Nearest Neighbor Ensembles
Jan Gertheiss
(joint work with Gerhard Tutz)
Department of Statistics, University of Munich
WNI 2008
Nearest Neighbor Methods: Introduction
- One of the simplest and most intuitive techniques for statistical discrimination (Fix & Hodges, 1951).
- Nonparametric and memory-based.

Given data $(y_i, x_i)$ with categorical response $y_i \in \{1, \ldots, G\}$ and (metric) $p$-dimensional predictor $x_i$:

- Place a new observation with unknown class label into the class of the observation from the training set that is closest to the new observation, with respect to the covariates.
- The closeness, or distance $d$, of two observations is derived from a specific metric in the predictor space.
- Given the Euclidean metric: $d(x_i, x_r) = \sqrt{\sum_{j=1}^{p} |x_{ij} - x_{rj}|^2}$.
Nearest Neighbor Methods: Class Probability Estimates
- Use not only the first but the $k$ nearest neighbors.
- The relative frequency of predictions in favor of category $g$ among these neighbors can be seen as an estimate of the probability of category $g$.
- The estimates $\hat\pi_{ig}$ can take the values $h/k$, $h \in \{0, \ldots, k\}$.
- If the neighbors are weighted with respect to their distance to the observation of interest, $\hat\pi$ can in principle take all values in $[0, 1]$.
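
A minimal R sketch of such a $k$-nearest-neighbor probability estimate (unweighted neighbors, Euclidean metric; the function name and the integer coding of $y$ as $1, \ldots, G$ are assumptions of this example):

    # Class probability estimates from the k nearest neighbors of x0.
    # X: n x p training matrix, y: labels coded 1, ..., G, x0: new observation.
    knn_prob <- function(X, y, x0, k = 5, G = max(y)) {
      d  <- sqrt(colSums((t(X) - x0)^2))  # Euclidean distances to x0
      nb <- order(d)[1:k]                 # indices of the k nearest neighbors
      tabulate(y[nb], nbins = G) / k      # relative class frequencies h/k
    }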
Nearest Neighbor Ensembles: Basic Concept
Final estimation by an ensemble of single predictors:

- Use the ensemble formula for computing the probability that observation $i$ falls into category $g$:

  $\hat\pi_{ig} = \sum_{j=1}^{p} c_j \, \hat\pi_{ig}(j), \quad \text{with } c_j \ge 0 \;\forall j \text{ and } \sum_j c_j = 1,$

  with $k$-nearest-neighbor estimates $\hat\pi_{ig}(j)$ based on predictor $j$ only.
- The weights, or coefficients, $c_1, \ldots, c_p$ need to be determined.
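
In R, this combination is just a weighted sum of the per-predictor probability matrices; a minimal sketch (the function name is illustrative):

    # pi_list[[j]]: n x G matrix with entries pi_hat_ig(j), based on predictor j;
    # w: weight vector (c_1, ..., c_p) with w_j >= 0 and sum(w) = 1.
    combine_ensemble <- function(pi_list, w) {
      Reduce(`+`, Map(`*`, pi_list, w))  # sum_j c_j * pi_hat(j)
    }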
Nearest Neighbor Ensembles: More Flexibility?
- Why not

  $\hat\pi_{ig} = \sum_j c_{gj} \, \hat\pi_{ig}(j), \quad \text{with } c_{gj} \ge 0 \;\forall g, j \text{ and } \sum_j c_{gj} = 1 \;\forall g$?

- It can be shown that the restriction $c_{1j} = \ldots = c_{Gj} = c_j$ is the only possibility to ensure that

  1. $\hat\pi_{ig} \ge 0 \;\forall g$ and
  2. $\sum_g \hat\pi_{ig} = 1$

  for all possible future estimates $\{\hat\pi_{ig}(j)\}$ with

  1. $\hat\pi_{ig}(j) \ge 0 \;\forall g, j$ and
  2. $\sum_g \hat\pi_{ig}(j) = 1 \;\forall j$.
Determination of Weights: Principles
- Given all $\{\hat\pi_{ig}(j)\}$, the matrix $\hat\Pi$ with $(\hat\Pi)_{ig} = \hat\pi_{ig}$ depends on $c = (c_1, \ldots, c_p)^T$.
- Given the training data with predictors $x_1, \ldots, x_n$ and true class labels $y = (y_1, \ldots, y_n)^T$, a previously chosen loss function, or score, $L(y, \hat\Pi)$ is minimized over all possible $c$.

Note: the categorical response $y_i$ is alternatively represented by a vector $z_i = (z_{i1}, \ldots, z_{iG})^T$ of dummy variables

$z_{ig} = \begin{cases} 1, & \text{if } y_i = g, \\ 0, & \text{otherwise.} \end{cases}$
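
In R, this dummy coding is a one-liner; a toy sketch (assuming the $1, \ldots, G$ integer coding of $y$):

    y <- c(2, 1, 3, 2)     # toy labels, G = 3
    Z <- diag(3)[y, ]      # Z[i, g] = 1 iff y[i] = g
    z <- as.vector(t(Z))   # stacked vector z = (z_1^T, ..., z_n^T)^T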
Determination of Weights: Possible Loss Functions
Log Score:

$L(y, \hat\Pi) = \sum_i \sum_g z_{ig} \log(1/\hat\pi_{ig})$

+ likelihood based
− hypersensitive ⇒ inapplicable for nearest neighbor estimates: a single $\hat\pi_{ig} = 0$ for the true class, which nearest neighbor estimates can produce, makes the loss infinite.
Approximate Log Score:

$L(y, \hat\Pi) = \sum_i \sum_g z_{ig} \left( (1 - \hat\pi_{ig}) + \tfrac{1}{2} (1 - \hat\pi_{ig})^2 \right)$

(the second-order Taylor expansion of $\log(1/\hat\pi_{ig})$ around $\hat\pi_{ig} = 1$)

+ hypersensitivity removed
− not "incentive compatible" (Selten, 1998), i.e. the expected loss $E(L) = \sum_{y=1}^{G} \pi_y L(y, \hat\pi_y)$ is not minimized by $\hat\pi_g = \pi_g$.
Determination of Weights: Possible Loss Functions
Quadratic Loss / Brier Score:

$L(y, \hat\Pi) = \sum_i \sum_g (z_{ig} - \hat\pi_{ig})^2$

(introduced by Brier, 1950)

+ not hypersensitive
+ incentive compatible (see e.g. Selten, 1998)
+ also takes into account how the estimated probabilities are distributed over the false classes
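
A toy R computation contrasting the three scores for a single observation with true class 1 (the numbers are invented for illustration); it shows the hypersensitivity of the log score when the estimate for the true class is exactly zero:

    z      <- c(1, 0, 0)      # dummy vector, true class 1
    pi_hat <- c(0, 0.6, 0.4)  # possible as a k-NN estimate
    sum(z * log(1 / pi_hat))                        # log score: Inf
    sum(z * ((1 - pi_hat) + 0.5 * (1 - pi_hat)^2))  # approx. log score: 1.5
    sum((z - pi_hat)^2)                             # Brier score: 1.52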
Determination of Weights: Practical Implementation
1. For each observation $i$, create a matrix $P_i$ of predictions: $(P_i)_{gj} = \hat\pi_{ig}(j)$.
2. Create a vector $z = (z_1^T, \ldots, z_n^T)^T$ and a matrix $P = (P_1^T | \ldots | P_n^T)^T$.
3. The Brier score as a function of $c$ can now be written in matrix notation:

   $L(c) = (z - Pc)^T (z - Pc).$

4. Given the restrictions $c_j \ge 0 \;\forall j$ and $\sum_j c_j = 1$, the weights $c_j$ can be determined using quadratic programming methods, e.g. using the R add-on package quadprog (see the sketch below).

Given the approximate log score, the weights can be determined in a similar way.
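
A minimal sketch of step 4 with quadprog's solve.QP; $P$ and $z$ are assumed to be built as in steps 1 and 2, and the small ridge added to $P^T P$ is an assumption of this sketch, keeping the matrix numerically positive definite as solve.QP requires:

    library(quadprog)

    # Expanding L(c) = (z - Pc)^T (z - Pc) gives the solve.QP form
    # 1/2 c^T D c - d^T c with D = 2 P^T P and d = 2 P^T z
    # (the constant z^T z does not affect the minimizer).
    fit_weights <- function(P, z) {
      p    <- ncol(P)
      Dmat <- 2 * crossprod(P) + 1e-8 * diag(p)  # ridge as numerical safeguard
      dvec <- as.vector(2 * crossprod(P, z))
      Amat <- cbind(rep(1, p), diag(p))          # first column: sum(c) = 1
      bvec <- c(1, rep(0, p))                    # remaining columns: c_j >= 0
      solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution  # meq = 1: equality
    }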
Variable Selection
Variable selection means setting $c_j = 0$ for some $j$.

Thresholding (both rules are sketched below):
- Hard: $c_j = 0$ if $c_j < t$, $c_j$ unchanged otherwise; e.g. $t = 0.25 \max_j \{c_j\}$.
- Soft: $c_j = (c_j - t)_+$.

(followed by rescaling)

Lasso-based approximate solutions:
- If the restrictions are replaced by $\sum_j |c_j| \le s$, a lasso-type problem (Tibshirani, 1996) arises.
- The lasso's typical selection characteristics cause $c_j = 0$ for some $j$.

(followed by rescaling and $c_j = c_j^+$)
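
A short sketch of the two thresholding rules, including the rescaling step so that the retained weights again sum to one (the function names are illustrative):

    # Hard thresholding with t = frac * max_j c_j, then rescaling.
    hard_threshold <- function(w, frac = 0.25) {
      w[w < frac * max(w)] <- 0
      w / sum(w)
    }

    # Soft thresholding w_j = (w_j - t)_+, then rescaling.
    soft_threshold <- function(w, t) {
      w <- pmax(w - t, 0)
      w / sum(w)
    }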
Including Interactions
The matrix $P$ may be augmented by including interactions of predictors:
- Add all predictions $\hat\pi_{ig}(jl)$, resp. $\hat\pi_{ig}(jlm)$, based on two or even three predictors.
- Feasible for small-scale problems only; $P$ has $p + \binom{p}{2} + \ldots$ columns.
Simulation Studies I: Two Classification Problems
There are 10 independent features $x_j$, each uniformly distributed on $[0, 1]$. The two-class, 0/1-coded response $y$ is defined as follows (cf. Hastie et al., 2001):

- as an "easy" problem: $y = I(x_1 > 0.5)$, and
- as a "difficult" problem: $y = I\big(\mathrm{sign}\big(\textstyle\prod_{j=1}^{3} (x_j - 0.5)\big) > 0\big)$.
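
Both problems are straightforward to simulate; a minimal R sketch (the sample size $n$ is an arbitrary choice):

    n <- 200
    X <- matrix(runif(n * 10), n, 10)  # 10 independent U[0, 1] features
    y_easy <- as.integer(X[, 1] > 0.5)
    y_diff <- as.integer(sign((X[, 1] - 0.5) * (X[, 2] - 0.5) * (X[, 3] - 0.5)) > 0)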
Simulation Studies I: Reference Methods
Nearest neighbor methods:
- 3-nearest-neighbor (3NN) based extended forward/backward variable selection, with tuning parameter $S$ as the number of simple forward/backward selection steps that are checked in each iteration.
- Weighted 5-nearest-neighbor (w5NN) prediction; R add-on package kknn.

Some alternative classification tools:
- Linear discriminant analysis (LDA); R add-on package MASS.
- CART (Breiman et al., 1984) and Random Forests (Breiman, 2001); R add-on packages rpart, randomForest.
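
Illustrative calls to these reference classifiers via their R packages (data frames train/test with a factor response class are assumed):

    library(kknn); library(MASS); library(rpart); library(randomForest)

    p_wknn <- kknn(class ~ ., train, test, k = 5)$prob             # weighted 5-NN
    p_lda  <- predict(lda(class ~ ., data = train), test)$posterior
    p_cart <- predict(rpart(class ~ ., data = train), test, type = "prob")
    p_rf   <- predict(randomForest(class ~ ., data = train), test, type = "prob")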
Simulation Studies I: The Easy Problem

Prediction performance on the test set in terms of the Brier score and the number of misclassified observations:

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF.]
Simulation Studies I: The Easy Problem

Variable selection/weighting by nearest neighbor based (extended) forward/backward selection (left) or nearest neighbor ensembles (right):

[Figure: left panel, selected variable numbers (1-10) per simulation realization; right panels, fitted weights by term number under (1) the approximate log score and (2) the quadratic score.]
Simulation Studies I: The Difficult Problem

Prediction performance on the test set in terms of the Brier score and the number of misclassified observations:

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF.]
Simulation Studies I: The Difficult Problem

Variable selection/weighting by nearest neighbor based (extended) forward/backward selection (left) or nearest neighbor ensembles (right):

[Figure: left panel, selected variable numbers (1-10) per simulation realization; right panels, fitted weights by term number under (1) the approximate log score and (2) the quadratic score.]
Simulation Studies II (cf. Hastie & Tibshirani, 1996)
1. 2-Dimensional Gaussian: two Gaussian classes in two dimensions.
2. 2-Dimensional Gaussian with 14 Noise: additionally 14 independent standard normal noise variables.
3. Unstructured: 4 classes, each with 3 spherical bivariate normal subclasses; the means are chosen at random.
4. Unstructured with 8 Noise: augmented with 8 independent standard normal predictors.
5. 4-Dimensional Spheres with 6 Noise: the first 4 predictors in class 1 are independent standard normal, conditioned on radius > 3; class 2 is without restrictions.
6. 10-Dimensional Spheres: all 10 predictors in class 1 are conditioned on $22.4 < \text{radius}^2 < 40$.
7. Constant Class Probabilities: the class probabilities (0.1, 0.2, 0.2, 0.5) are independent of the predictors.
8. Friedman's Example: predictors in class 1 are independent standard normal; in class 2 they are independent normal with mean and variance proportional to $\sqrt{j}$ and $1/\sqrt{j}$, respectively, $j = 1, \ldots, 10$.
Simulation Studies II: Scenarios 1-4

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF in the panels (1) 2 normals, (2) 2 normals with noise, (3) unstructured, and (4) unstructured with noise.]
Simulation Studies II: Scenarios 5-8

[Figure: boxplots of the loss for the same eight methods in the panels (5) 4D sphere in 10 dimensions, (6) 10D sphere in 10 dimensions, (7) constant class probabilities, and (8) Friedman's example.]
Real World Data: Glass Data (R Package mlbench)

Forecast the type of glass (6 classes) on the basis of a chemical analysis given in the form of 9 metric predictors.

- Result for 3NNE-QS on all data:

[Figure: fitted weights by term number.]
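
A minimal sketch for loading the data and producing one random split; the 2/3 training proportion is an assumption of this sketch, as the split sizes are not stated here:

    library(mlbench)
    data(Glass)  # 9 chemical predictors, response Type
    set.seed(1)
    idx   <- sample(nrow(Glass), floor(2 / 3 * nrow(Glass)))
    train <- Glass[idx, ]
    test  <- Glass[-idx, ]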
Real World Data: Glass Data (R Package mlbench)

Forecast the type of glass (6 classes) on the basis of a chemical analysis given in the form of 9 metric predictors.

- Performance over 50 random splits:

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF.]
Summary: Nearest Neighbor Ensembles

- Nonparametric probability estimation by an ensemble, i.e. a weighted average of nearest neighbor estimates.
- Each estimate is based on a single predictor or a very small subset of predictors.
- No black box (in contrast to many other ensemble methods).
- Good performance for small-scale problems, particularly if pure noise variables can be separated from relevant covariates.
- Direct application to high-dimensional problems with interactions is not recommended.
- For microarray data, possibly useful as a nonparametric gene preselection tool.
- May be employed for the automatic choice of the most appropriate metric or the right neighborhood.
- Application to regression problems is possible as well.
References

BREIMAN, L. (2001): Random Forests, Machine Learning 45, 5–32.

BREIMAN, L. et al. (1984): Classification and Regression Trees, Monterey, CA: Wadsworth.

BRIER, G. W. (1950): Verification of forecasts expressed in terms of probability, Monthly Weather Review 78, 1–3.

FIX, E. and J. L. HODGES (1951): Discriminatory analysis - nonparametric discrimination: consistency properties, US Air Force School of Aviation Medicine, Randolph Field, Texas.

GERTHEISS, J. and G. TUTZ (2008): Feature Selection and Weighting by Nearest Neighbor Ensembles, University of Munich, Department of Statistics: Technical Report No. 33.

HASTIE, T. and R. TIBSHIRANI (1996): Discriminant adaptive nearest neighbor classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 607–616.

HASTIE, T. et al. (2001): The Elements of Statistical Learning, New York: Springer.

R Development Core Team (2007): R: A Language and Environment for Statistical Computing, Vienna: R Foundation for Statistical Computing.

SELTEN, R. (1998): Axiomatic characterization of the quadratic scoring rule, Experimental Economics 1, 43–62.

TIBSHIRANI, R. (1996): Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B 58, 267–288.