Variable Selection and Weighting by Nearest Neighbor Ensembles
Jan Gertheiss
(joint work with Gerhard Tutz)
Department of Statistics, University of Munich
WNI 2008
Nearest Neighbor Methods: Introduction
- One of the simplest and most intuitive techniques for statistical discrimination (Fix & Hodges, 1951).
- Nonparametric and memory-based.

Given data $(y_i, x_i)$ with categorical response $y_i \in \{1, \ldots, G\}$ and (metric) $p$-dimensional predictor $x_i$:

- Place a new observation with unknown class label into the class of the observation from the training set that is closest to the new observation, with respect to the covariates.
- The closeness, or distance $d$, of two observations is derived from a specific metric in the predictor space.
- Given the Euclidean metric: $d(x_i, x_r) = \sqrt{\sum_{j=1}^{p} |x_{ij} - x_{rj}|^2}$.
Nearest Neighbor Methods: Class Probability Estimates
- Use not only the first but the $k$ nearest neighbors.
- The relative frequency of predictions in favor of category $g$ among these neighbors can be seen as an estimate of the probability of category $g$.
- The estimates $\hat\pi_{ig}$ can take the values $h/k$, $h \in \{0, \ldots, k\}$.
- If the neighbors are weighted with respect to their distance to the observation of interest, $\hat\pi$ can in principle take all values in $[0, 1]$.
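
A minimal R sketch of such a $k$-nearest-neighbor probability estimate (unweighted neighbors, Euclidean metric; the function name and the integer coding of $y$ as $1, \ldots, G$ are assumptions of this example):

    # Class probability estimates from the k nearest neighbors of x0.
    # X: n x p training matrix, y: labels coded 1, ..., G, x0: new observation.
    knn_prob <- function(X, y, x0, k = 5, G = max(y)) {
      d  <- sqrt(colSums((t(X) - x0)^2))  # Euclidean distances to x0
      nb <- order(d)[1:k]                 # indices of the k nearest neighbors
      tabulate(y[nb], nbins = G) / k      # relative class frequencies h/k
    }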
Nearest Neighbor Ensembles: Basic Concept
Final estimation by an ensemble of single predictors:

- Use the ensemble formula for computing the probability that observation $i$ falls into category $g$:

  $\hat\pi_{ig} = \sum_{j=1}^{p} c_j \, \hat\pi_{ig}(j), \quad \text{with } c_j \ge 0 \;\forall j \text{ and } \sum_j c_j = 1,$

  with $k$-nearest-neighbor estimates $\hat\pi_{ig}(j)$ based on predictor $j$ only.
- The weights, or coefficients, $c_1, \ldots, c_p$ need to be determined.
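
In R, this combination is just a weighted sum of the per-predictor probability matrices; a minimal sketch (the function name is illustrative):

    # pi_list[[j]]: n x G matrix with entries pi_hat_ig(j), based on predictor j;
    # w: weight vector (c_1, ..., c_p) with w_j >= 0 and sum(w) = 1.
    combine_ensemble <- function(pi_list, w) {
      Reduce(`+`, Map(`*`, pi_list, w))  # sum_j c_j * pi_hat(j)
    }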
Nearest Neighbor Ensembles: More Flexibility?
- Why not

  $\hat\pi_{ig} = \sum_j c_{gj} \, \hat\pi_{ig}(j), \quad \text{with } c_{gj} \ge 0 \;\forall g, j \text{ and } \sum_j c_{gj} = 1 \;\forall g$?

- It can be shown that the restriction $c_{1j} = \ldots = c_{Gj} = c_j$ is the only possibility to ensure that

  1. $\hat\pi_{ig} \ge 0 \;\forall g$ and
  2. $\sum_g \hat\pi_{ig} = 1$

  for all possible future estimates $\{\hat\pi_{ig}(j)\}$ with

  1. $\hat\pi_{ig}(j) \ge 0 \;\forall g, j$ and
  2. $\sum_g \hat\pi_{ig}(j) = 1 \;\forall j$.
Determination of Weights: Principles
- Given all $\{\hat\pi_{ig}(j)\}$, the matrix $\hat\Pi$ with $(\hat\Pi)_{ig} = \hat\pi_{ig}$ depends on $c = (c_1, \ldots, c_p)^T$.
- Given the training data with predictors $x_1, \ldots, x_n$ and true class labels $y = (y_1, \ldots, y_n)^T$, a previously chosen loss function, or score, $L(y, \hat\Pi)$ is minimized over all possible $c$.

Note: the categorical response $y_i$ is alternatively represented by a vector $z_i = (z_{i1}, \ldots, z_{iG})^T$ of dummy variables

$z_{ig} = \begin{cases} 1, & \text{if } y_i = g, \\ 0, & \text{otherwise.} \end{cases}$
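
In R, this dummy coding is a one-liner; a toy sketch (assuming the $1, \ldots, G$ integer coding of $y$):

    y <- c(2, 1, 3, 2)     # toy labels, G = 3
    Z <- diag(3)[y, ]      # Z[i, g] = 1 iff y[i] = g
    z <- as.vector(t(Z))   # stacked vector z = (z_1^T, ..., z_n^T)^T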
Determination of Weights: Possible Loss Functions
Log Score:

$L(y, \hat\Pi) = \sum_i \sum_g z_{ig} \log(1/\hat\pi_{ig})$

+ likelihood based
− hypersensitive ⇒ inapplicable for nearest neighbor estimates: a single $\hat\pi_{ig} = 0$ for the true class, which nearest neighbor estimates can produce, makes the loss infinite.
Approximate Log Score:

$L(y, \hat\Pi) = \sum_i \sum_g z_{ig} \left( (1 - \hat\pi_{ig}) + \tfrac{1}{2} (1 - \hat\pi_{ig})^2 \right)$

(the second-order Taylor expansion of $\log(1/\hat\pi_{ig})$ around $\hat\pi_{ig} = 1$)

+ hypersensitivity removed
− not "incentive compatible" (Selten, 1998), i.e. the expected loss $E(L) = \sum_{y=1}^{G} \pi_y L(y, \hat\pi_y)$ is not minimized by $\hat\pi_g = \pi_g$.
Determination of Weights: Possible Loss Functions
Quadratic Loss / Brier Score:

$L(y, \hat\Pi) = \sum_i \sum_g (z_{ig} - \hat\pi_{ig})^2$

(introduced by Brier, 1950)

+ not hypersensitive
+ incentive compatible (see e.g. Selten, 1998)
+ also takes into account how the estimated probabilities are distributed over the false classes
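
A toy R computation contrasting the three scores for a single observation with true class 1 (the numbers are invented for illustration); it shows the hypersensitivity of the log score when the estimate for the true class is exactly zero:

    z      <- c(1, 0, 0)      # dummy vector, true class 1
    pi_hat <- c(0, 0.6, 0.4)  # possible as a k-NN estimate
    sum(z * log(1 / pi_hat))                        # log score: Inf
    sum(z * ((1 - pi_hat) + 0.5 * (1 - pi_hat)^2))  # approx. log score: 1.5
    sum((z - pi_hat)^2)                             # Brier score: 1.52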
Determination of Weights: Practical Implementation
1. For each observation $i$, create a matrix $P_i$ of predictions: $(P_i)_{gj} = \hat\pi_{ig}(j)$.
2. Create a vector $z = (z_1^T, \ldots, z_n^T)^T$ and a matrix $P = (P_1^T | \ldots | P_n^T)^T$.
3. The Brier score as a function of $c$ can now be written in matrix notation:

   $L(c) = (z - Pc)^T (z - Pc).$

4. Given the restrictions $c_j \ge 0 \;\forall j$ and $\sum_j c_j = 1$, the weights $c_j$ can be determined using quadratic programming methods, e.g. using the R add-on package quadprog (see the sketch below).

Given the approximate log score, the weights can be determined in a similar way.
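
A minimal sketch of step 4 with quadprog's solve.QP; $P$ and $z$ are assumed to be built as in steps 1 and 2, and the small ridge added to $P^T P$ is an assumption of this sketch, keeping the matrix numerically positive definite as solve.QP requires:

    library(quadprog)

    # Expanding L(c) = (z - Pc)^T (z - Pc) gives the solve.QP form
    # 1/2 c^T D c - d^T c with D = 2 P^T P and d = 2 P^T z
    # (the constant z^T z does not affect the minimizer).
    fit_weights <- function(P, z) {
      p    <- ncol(P)
      Dmat <- 2 * crossprod(P) + 1e-8 * diag(p)  # ridge as numerical safeguard
      dvec <- as.vector(2 * crossprod(P, z))
      Amat <- cbind(rep(1, p), diag(p))          # first column: sum(c) = 1
      bvec <- c(1, rep(0, p))                    # remaining columns: c_j >= 0
      solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution  # meq = 1: equality
    }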
Variable Selection
Variable selection means setting $c_j = 0$ for some $j$.

Thresholding (both rules are sketched below):
- Hard: $c_j = 0$ if $c_j < t$, $c_j$ unchanged otherwise; e.g. $t = 0.25 \max_j \{c_j\}$.
- Soft: $c_j = (c_j - t)_+$.

(followed by rescaling)

Lasso-based approximate solutions:
- If the restrictions are replaced by $\sum_j |c_j| \le s$, a lasso-type problem (Tibshirani, 1996) arises.
- The lasso's typical selection characteristics cause $c_j = 0$ for some $j$.

(followed by rescaling and $c_j = c_j^+$)
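
A short sketch of the two thresholding rules, including the rescaling step so that the retained weights again sum to one (the function names are illustrative):

    # Hard thresholding with t = frac * max_j c_j, then rescaling.
    hard_threshold <- function(w, frac = 0.25) {
      w[w < frac * max(w)] <- 0
      w / sum(w)
    }

    # Soft thresholding w_j = (w_j - t)_+, then rescaling.
    soft_threshold <- function(w, t) {
      w <- pmax(w - t, 0)
      w / sum(w)
    }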
Including Interactions
The matrix $P$ may be augmented by including interactions of predictors:
- Add all predictions $\hat\pi_{ig}(jl)$, resp. $\hat\pi_{ig}(jlm)$, based on two or even three predictors.
- Feasible for small-scale problems only; $P$ has $p + \binom{p}{2} + \ldots$ columns.
Simulation Studies I: Two Classification Problems
There are 10 independent features $x_j$, each uniformly distributed on $[0, 1]$. The two-class, 0/1-coded response $y$ is defined as follows (cf. Hastie et al., 2001):

- as an "easy" problem: $y = I(x_1 > 0.5)$, and
- as a "difficult" problem: $y = I\big(\mathrm{sign}\big(\textstyle\prod_{j=1}^{3} (x_j - 0.5)\big) > 0\big)$.
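
Both problems are straightforward to simulate; a minimal R sketch (the sample size $n$ is an arbitrary choice):

    n <- 200
    X <- matrix(runif(n * 10), n, 10)  # 10 independent U[0, 1] features
    y_easy <- as.integer(X[, 1] > 0.5)
    y_diff <- as.integer(sign((X[, 1] - 0.5) * (X[, 2] - 0.5) * (X[, 3] - 0.5)) > 0)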
Simulation Studies I: Reference Methods
Nearest neighbor methods:
- 3-nearest-neighbor (3NN) based extended forward/backward variable selection, with tuning parameter $S$ as the number of simple forward/backward selection steps that are checked in each iteration.
- Weighted 5-nearest-neighbor (w5NN) prediction; R add-on package kknn.

Some alternative classification tools:
- Linear discriminant analysis (LDA); R add-on package MASS.
- CART (Breiman et al., 1984) and Random Forests (Breiman, 2001); R add-on packages rpart, randomForest.
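
Illustrative calls to these reference classifiers via their R packages (data frames train/test with a factor response class are assumed):

    library(kknn); library(MASS); library(rpart); library(randomForest)

    p_wknn <- kknn(class ~ ., train, test, k = 5)$prob             # weighted 5-NN
    p_lda  <- predict(lda(class ~ ., data = train), test)$posterior
    p_cart <- predict(rpart(class ~ ., data = train), test, type = "prob")
    p_rf   <- predict(randomForest(class ~ ., data = train), test, type = "prob")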
Simulation Studies I: The Easy Problem

Prediction performance on the test set in terms of the Brier score and the number of misclassified observations:

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF.]
Simulation Studies I: The Easy Problem

Variable selection/weighting by nearest neighbor based (extended) forward/backward selection (left) or nearest neighbor ensembles (right):

[Figure: left panel, selected variable numbers (1-10) per simulation realization; right panels, fitted weights by term number under (1) the approximate log score and (2) the quadratic score.]
Simulation Studies I: The Difficult Problem

Prediction performance on the test set in terms of the Brier score and the number of misclassified observations:

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF.]
Simulation Studies I: The Difficult Problem

Variable selection/weighting by nearest neighbor based (extended) forward/backward selection (left) or nearest neighbor ensembles (right):

[Figure: left panel, selected variable numbers (1-10) per simulation realization; right panels, fitted weights by term number under (1) the approximate log score and (2) the quadratic score.]
Simulation Studies II (cf. Hastie & Tibshirani, 1996)
1. 2-Dimensional Gaussian: two Gaussian classes in two dimensions.
2. 2-Dimensional Gaussian with 14 Noise: additionally 14 independent standard normal noise variables.
3. Unstructured: 4 classes, each with 3 spherical bivariate normal subclasses; the means are chosen at random.
4. Unstructured with 8 Noise: augmented with 8 independent standard normal predictors.
5. 4-Dimensional Spheres with 6 Noise: the first 4 predictors in class 1 are independent standard normal, conditioned on radius > 3; class 2 is without restrictions.
6. 10-Dimensional Spheres: all 10 predictors in class 1 are conditioned on $22.4 < \text{radius}^2 < 40$.
7. Constant Class Probabilities: the class probabilities (0.1, 0.2, 0.2, 0.5) are independent of the predictors.
8. Friedman's Example: predictors in class 1 are independent standard normal; in class 2 they are independent normal with mean and variance proportional to $\sqrt{j}$ and $1/\sqrt{j}$, respectively, $j = 1, \ldots, 10$.
Simulation Studies II: Scenarios 1-4

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF in the panels (1) 2 normals, (2) 2 normals with noise, (3) unstructured, and (4) unstructured with noise.]
Simulation Studies II: Scenarios 5-8

[Figure: boxplots of the loss for the same eight methods in the panels (5) 4D sphere in 10 dimensions, (6) 10D sphere in 10 dimensions, (7) constant class probabilities, and (8) Friedman's example.]
Real World Data: Glass Data (R Package mlbench)

Forecast the type of glass (6 classes) on the basis of a chemical analysis given in the form of 9 metric predictors.

- Result for 3NNE-QS on all data:

[Figure: fitted weights by term number.]
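
A minimal sketch for loading the data and producing one random split; the 2/3 training proportion is an assumption of this sketch, as the split sizes are not stated here:

    library(mlbench)
    data(Glass)  # 9 chemical predictors, response Type
    set.seed(1)
    idx   <- sample(nrow(Glass), floor(2 / 3 * nrow(Glass)))
    train <- Glass[idx, ]
    test  <- Glass[-idx, ]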
Real World Data: Glass Data (R Package mlbench)

Forecast the type of glass (6 classes) on the basis of a chemical analysis given in the form of 9 metric predictors.

- Performance over 50 random splits:

[Figure: boxplots of the loss for 3NN-FS, 3NN-BS, 3NNE-LS, 3NNE-QS, w5NN, LDA, CART, and RF.]
Summary: Nearest Neighbor Ensembles

- Nonparametric probability estimation by an ensemble, i.e. a weighted average of nearest neighbor estimates.
- Each estimate is based on a single predictor or a very small subset of predictors.
- No black box (in contrast to many other ensemble methods).
- Good performance for small-scale problems, particularly if pure noise variables can be separated from relevant covariates.
- Direct application to high-dimensional problems with interactions is not recommended.
- For microarray data, possibly useful as a nonparametric gene preselection tool.
- May be employed for the automatic choice of the most appropriate metric or the right neighborhood.
- Application to regression problems is possible as well.
References

BREIMAN, L. (2001): Random Forests, Machine Learning 45, 5–32.

BREIMAN, L. et al. (1984): Classification and Regression Trees, Monterey, CA: Wadsworth.

BRIER, G. W. (1950): Verification of forecasts expressed in terms of probability, Monthly Weather Review 78, 1–3.

FIX, E. and J. L. HODGES (1951): Discriminatory analysis - nonparametric discrimination: consistency properties, US Air Force School of Aviation Medicine, Randolph Field, Texas.

GERTHEISS, J. and G. TUTZ (2008): Feature Selection and Weighting by Nearest Neighbor Ensembles, University of Munich, Department of Statistics: Technical Report No. 33.

HASTIE, T. and R. TIBSHIRANI (1996): Discriminant adaptive nearest neighbor classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 607–616.

HASTIE, T. et al. (2001): The Elements of Statistical Learning, New York: Springer.

R Development Core Team (2007): R: A Language and Environment for Statistical Computing, Vienna: R Foundation for Statistical Computing.

SELTEN, R. (1998): Axiomatic characterization of the quadratic scoring rule, Experimental Economics 1, 43–62.

TIBSHIRANI, R. (1996): Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B 58, 267–288.