Kernel Density Estimation: Theory and Application in Discriminant Analysis

Thomas Ledl, Universität Wien

TRANSCRIPT

Page 1:

Kernel Density Estimation: Theory and Application in Discriminant Analysis

Thomas Ledl

Universität Wien

Page 2:

Contents:

Introduction
Theory
Aspects of Application
Simulation Study
Summary

Page 3:

Introduction

Page 4:

25 observations: which distribution?

[Figure: 25 observations marked along the axis from 0 to 4.]

Page 5:

[Figure: several candidate density estimates for the 25 observations, each marked with a question mark.]

Page 6:

Kernel density estimator model: K(.) and h to choose
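The estimator's formula is an image in the source slides; the standard univariate form it presents, for observations X_1, ..., X_n, is

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right),

with kernel function K(.) and bandwidth h > 0 left to choose.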

Page 7:

Kernel / bandwidth:

[Figure: four kernel density estimates of the same data, combining a triangular and a gaussian kernel with a „small" h and a „large" h.]
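To make the kernel/bandwidth trade-off concrete, here is a minimal NumPy sketch (an illustration under the slide's setup, not code from the talk) that computes the four estimates of such a 2×2 comparison:

import numpy as np

def kde(grid, data, h, kernel="gaussian"):
    # f_hat(x) = (1/(n*h)) * sum_i K((x - X_i) / h)
    u = (grid[:, None] - data[None, :]) / h
    if kernel == "gaussian":
        k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    else:  # triangular kernel, supported on [-1, 1]
        k = np.clip(1.0 - np.abs(u), 0.0, None)
    return k.sum(axis=1) / (data.size * h)

rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.7, size=25)   # 25 observations, as in the introduction
grid = np.linspace(0.0, 4.0, 200)
estimates = {(kern, h): kde(grid, data, h, kern)
             for kern in ("triangular", "gaussian")
             for h in (0.1, 0.8)}      # "small" vs. "large" bandwidth

A small h traces the individual observations (undersmoothing), a large h blurs them into one broad hump (oversmoothing); the kernel shape matters far less than the bandwidth.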

Page 8:

Question 1:

Which choice of K(.) and h is the best for a descriptive purpose?

Page 9:

Classification:

[Figure: class-wise histograms/densities on a common axis, introducing the classification problem.]

Page 10:

Classification:

Levelplot – LDA (based on the assumption of a multivariate normal distribution):

[Figure: levelplot over variables V1 and V2 with levels 0.34–0.93; the observations of five classes are plotted as the digits 1–5.]

Page 11:

Classification:

[Figure: the same five-class scatterplot over V1 and V2 with levels 0.34–0.93.]

Page 12:

Classification:

Levelplot – KDE classifier:

[Figure: two levelplots over V1 and V2 (levels 0.34–0.93) for the KDE-based classifier, same five-class data.]

Page 13:

Question 2:

How does classification based on KDE perform in more than 2 dimensions?

Page 14:

Theory

Page 15:

Essential issues

Optimization criteria
Improvements of the standard model
Resulting optimal choices of the model parameters K(.) and h


Page 17:

Optimization criteria

Lp-distances:
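The slide's formula is an image; the Lp-distance between the estimate \hat{f} and the true density f is the standard

\| \hat{f} - f \|_p = \left( \int \left| \hat{f}(x) - f(x) \right|^p dx \right)^{1/p}, \qquad p \ge 1.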

Page 18:

[Figure: two densities f(.) and g(.) on the interval −2 to 4.]

Page 19:

[Figure: the two densities overlaid, highlighting the region where they differ.]

Page 20:

[Figure: the discrepancy between the two densities, measured two ways.]

„Integrated absolute error" = IAE
„Integrated squared error" = ISE
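In formulas (standard definitions; the transcript only preserves the names):

\mathrm{IAE} = \int \left| \hat{f}(x) - f(x) \right| dx, \qquad \mathrm{ISE} = \int \left( \hat{f}(x) - f(x) \right)^2 dx.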


Page 22:

Other ideas:

Consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995)
Compare the number and position of modes
Minimization of the maximum vertical distance

Page 23:

Overview of some minimization criteria:

L1-distance = IAE: difficult mathematical tractability
L∞-distance = maximum difference: does not consider the overall fit
„Modern" criteria, which include some measure of the horizontal distances: difficult mathematical tractability
L2-distance = ISE, MISE, AMISE, ...: most commonly used

Page 24:

ISE, MISE, AMISE, ...

[Figure: the criteria MISE, IV, ISB and their asymptotic versions AMISE, AIV, AISB plotted against log10(h), next to an example density.]

ISE is a random variable.
MISE = E(ISE), the expectation of ISE.
AMISE = Taylor approximation of MISE, easier to calculate.
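For reference, the standard relations behind these curves (not printed in the transcript): writing R(g) = \int g(x)^2 dx and \mu_2(K) = \int u^2 K(u) \, du,

\mathrm{MISE}(h) = E\left[ \mathrm{ISE}(h) \right] = \mathrm{IV} + \mathrm{ISB},

\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4}{4} \mu_2(K)^2 R(f''),

where the first term is the asymptotic integrated variance (AIV) and the second the asymptotic integrated squared bias (AISB).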


Page 26:

The AMISE-optimal bandwidth
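The slide's derivation is an image; minimizing the AMISE expression above over h yields the standard result

h_{\mathrm{AMISE}} = \left( \frac{R(K)}{\mu_2(K)^2 \, R(f'') \, n} \right)^{1/5}.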

Page 27:

The AMISE-optimal bandwidth

... is dependent on the kernel function K(.); over all kernels, the AMISE is minimized by the „Epanechnikov kernel":

[Figure: the Epanechnikov kernel, a parabola supported on [−1, 1] with maximum 0.75.]
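For reference, the Epanechnikov kernel is

K(u) = \tfrac{3}{4} \left( 1 - u^2 \right) \quad \text{for } |u| \le 1, \qquad 0 \text{ otherwise}.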

Page 28:

The AMISE-optimal bandwidth

... is also dependent on the unknown density f(.). How to proceed?

Page 29:

Data-driven bandwidth selection methods

Leave-one-out selectors:
  Maximum likelihood cross-validation
  Least-squares cross-validation (Bowman, 1984)

Criteria based on substituting R(f'') in the AMISE formula:
  „Normal rule" („rule of thumb"; Silverman, 1986)
  Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990)
  Smoothed bootstrap


Page 31:

Least-squares cross-validation (LSCV)

The undisputed selector in the 1980s
Gives an unbiased estimator for the ISE
Suffers from more than one local minimizer; no agreement about which one to use
Bad convergence rate for the resulting bandwidth h_opt
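The criterion itself is not preserved in the transcript; its standard form, with \hat{f}_{h,-i} the leave-one-out estimate computed without X_i, is

\mathrm{LSCV}(h) = \int \hat{f}_h(x)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i),

to be minimized over h; its expectation equals \mathrm{MISE}(h) - R(f), which is why minimizing it targets the ISE/MISE.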


Page 33:

Normal rule („rule of thumb")

Assumes f(x) to be N(μ, σ²)
The easiest selector
Often oversmooths the function

The resulting bandwidth is given by:
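The formula is an image in the source; for a Gaussian kernel the normal rule reads

h_{\mathrm{opt}} = \left( \frac{4}{3n} \right)^{1/5} \hat{\sigma} \approx 1.06 \, \hat{\sigma} \, n^{-1/5},

with \hat{\sigma} an estimate of the standard deviation.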


Page 35:

Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990)

Do not substitute R(f'') in the AMISE formula directly, but estimate it via R(f^(IV)), R(f^(IV)) via R(f^(VI)), etc.
Another parameter i to choose (the number of stages to go back); one stage is mostly sufficient
Better rates of convergence
Do not finally circumvent the problem of the unknown density either

Page 36:

The multivariate case

h → H ... the bandwidth matrix
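The multivariate estimator is again an image in the source; its standard form with a symmetric, positive definite bandwidth matrix H is

\hat{f}_H(x) = \frac{1}{n} \sum_{i=1}^{n} |H|^{-1/2} \, K\!\left( H^{-1/2} (x - X_i) \right).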

Page 37:

Issues of generalization in d dimensions

d² bandwidth parameters instead of one
Unstable estimates
Bandwidth selectors are essentially straightforward to generalize
For plug-in methods it is „too difficult" to give succinct expressions for d > 2 dimensions

Page 38:

Aspects of Application

Page 39:

Essential issues

Curse of dimensionality
Connection between goodness-of-fit and optimal classification
Two methods for discriminatory purposes


Page 41:

The „curse of dimensionality"

The data „disappears" into the distribution tails in high dimensions; a good fit in the tails is therefore desired!

[Figure: chart of the probability mass NOT in the „tail" of a multivariate normal density (0%–100%) against the number of dimensions (1–20).]

Page 42:

The „curse of dimensionality"

Much data is necessary to maintain a constant estimation error in high dimensions:

Dimensionality    Required sample size
 1                       4
 2                      19
 3                      67
 4                     223
 5                     768
 6                    2790
 7                   10700
 8                   43700
 9                  187000
10                  842000


Page 44:

Essential issues

AMISE-optimal parameter choice (L2-optimal) vs. optimal classification in high dimensions (L1-optimal, i.e. misclassification rate). The conflict:

Estimation of the tails is important for classification, yet the fit in the tails is worse
Calculation intensive for large n
Many observations required for a reasonable fit


Page 46:

Method 1:

Reduce the data onto a subspace which allows a reasonably accurate estimation but does not destroy too much information („trade-off")
Use the multivariate kernel density concept to estimate the class densities

Page 47:

Method 2:

Use the univariate concept to „normalize" the data nonparametrically (see the sketch below)
Use the classical methods like LDA and QDA for classification
Drawback: calculation intensive
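A minimal sketch of what such a marginal normalization can look like, assuming a Gaussian kernel CDF estimate (function and variable names are mine, not from the talk):

import numpy as np
from scipy.stats import norm

def marginal_normalizer(train_col, h):
    # Estimate the marginal CDF by a kernel CDF estimate F_hat, then
    # map x -> Phi^{-1}(F_hat(x)); if F_hat fits, the result is ~N(0, 1).
    def transform(x):
        u = norm.cdf((np.asarray(x)[:, None] - train_col[None, :]) / h).mean(axis=1)
        u = np.clip(u, 1e-6, 1.0 - 1e-6)   # keep Phi^{-1} finite
        return norm.ppf(u)
    return transform

rng = np.random.default_rng(1)
col = rng.exponential(size=600)            # one skewed marginal, 600 obs/class
z = marginal_normalizer(col, h=0.3)(col)   # approximately standard normal
# Applied per variable and per class; LDA/QDA then runs on the transformed data.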

Page 48:

Method 2:

[Figure: (a) a skewed density f(x) on 0–8; (b) the corresponding CDFs F(x) and G(x), illustrating the normalization transformation t(.).]

Page 49:

Simulation Study

Page 50:

Criticism of former simulation studies

Carried out 20–30 years ago
Outdated parameter selectors
Restriction to uncorrelated normals
Fruitless estimation because of high dimensions
No dimension reduction

Page 51:

The present simulation study

21 datasets
14 estimators
2 error criteria
21 × 14 × 2 = 588 classification scores
Many results


Page 53:

Each dataset has ...

... 2 classes for distinction
... 600 observations per class
... 200 test observations, 100 produced by each class
... therefore dimension 1400 × 10

Page 54:

Univariate prototype distributions:

[Figure: density panels for Normal; Normal with small, medium and large noise; Exponential(1); Bimodal (close); Bimodal (far).]

Page 55:

21 datasets total:

10 datasets having equal covariance matrices
+ 10 datasets having unequal covariance matrices
+ 1 insurance dataset

Dataset Nr.   Abbrev.   Contains
 1            NN1       10 normal distributions with "small noise"
 2            NN2       10 normal distributions with "medium noise"
 3            NN3       10 normal distributions with "large noise"
 4            SkN1      2 skewed (exp-)distributions and 7 normals
 5            SkN2      5 skewed (exp-)distributions and 5 normals
 6            SkN3      7 skewed (exp-)distributions and 3 normals
 7            Bi1       4 normals, 4 skewed and 2 bimodal (close) dist.
 8            Bi2       4 normals, 4 skewed and 2 bimodal (close) dist.
 9            Bi3       8 skewed and 2 bimodal (far) dist.
10            Bi4       8 skewed and 2 bimodal (far) dist.


Page 57:

14 estimators

Classical methods:
LDA and QDA (2) → 2 estimators

Method 1 (multivariate density estimator):
Principal component reduction onto 2, 3, 4 and 5 dimensions (4) × multivariate „normal rule" and multivariate LSCV criterion, resp. (2) → 8 estimators

Method 2 („marginal normalizations"):
Univariate normal rule and Sheather-Jones plug-in (2) × subsequent LDA and QDA (2) → 4 estimators

Page 58:

The present simulation study

21 datasets
14 estimators
2 misclassification criteria
21 × 14 × 2 = 588 classification scores
Many results

Page 59:

Misclassification criteria

The classical misclassification rate („error rate")
The Brier score
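For reference (standard definition; only the name survives in the transcript): with class indicators y_{ik} \in \{0, 1\} and predicted class probabilities \hat{p}_{ik} for test observation i and class k,

\text{Brier score} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k} \left( \hat{p}_{ik} - y_{ik} \right)^2.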


Page 61:

Results

Error rate vs. Brier score

[Figure: scatterplot of Brier score (0.0–0.8) against error rate (0.0–0.6).]

The choice of the misclassification criterion is not essential.

Page 62:

Results

The choice of the multivariate bandwidth parameter (method 1) is not essential in most cases.

[Figure: error rates for method 1, LSCV vs. „normal rule".]

Superiority of LSCV in the case of bimodal distributions having unequal covariance matrices.

Page 63:

Results

The choice of the univariate bandwidth parameter (method 2) is not essential.

[Figure: error rates for method 2, Sheather-Jones selector vs. „normal rule".]

Page 64:

Results

The best trade-off is a projection onto 2–3 dimensions.

[Figure: error rates for subspaces of 2–5 dimensions, separately for the NN-, SkN- and Bi-distributions.]

Page 65:

Results

Equal covariance matrices: method 1 performs worse than LDA.

[Figure: error rates per dataset, LDA (classical) vs. LSCV(3) in method 1.]

Equal covariance matrices: method 2 sometimes improves slightly.

[Figure: error rates for datasets NN1–Bi4, LDA (classical) vs. normal rule in method 2.]

Page 66:

Results

Unequal covariance matrices: method 1 performs quite poorly, though not for skewed distributions.

[Figure: error rates, QDA (classical) vs. LSCV(3) in method 1.]

Unequal covariance matrices: method 2 often brings substantial improvements.

[Figure: error rates, QDA (classical) vs. normal rule in method 2.]

Page 67:

Results

Is the additional calculation time justified?

Required calculation time (from cheap to expensive): LDA, QDA → multivariate „normal rule" → preliminary univariate normalizations, LSCV, Sheather-Jones plug-in

Page 68:

Summary

Page 69:

Summary (1/3) – Classification Performance

Restriction to only a few dimensions
Improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices)
Poor performance of the multivariate kernel density classifier
LDA is undisputed in the case of equal covariance matrices and equal prior probabilities
The additional computation time seems not to be justified


Page 74:

Summary (2/3) – KDE for Data Description

Great variety in error criteria, parameter selection procedures and additional model improvements (three dimensions of choice)
No consensus on a feasible error criterion; nobody knows what is finally optimized („upper bounds" in L1-theory; in L2-theory: ISE → MISE → AMISE; several minima in LSCV, ...)
Different parameter selectors are of varying quality with respect to different underlying densities


Page 78:

Summary (3/3) – Theory vs. Application

Comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification
For discriminatory purposes the issue of estimating log-densities is much more important
Some univariate model improvements are not generalizable
The widely ignored „curse of dimensionality" forces the user to strike a trade-off between necessary dimension reduction and information loss
Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time


Page 83:

The End