
Kernel Density Estimation: Theory and Application in Discriminant Analysis

Thomas Ledl

Universität Wien

Contents:

Introduction

Theory

Application Aspects

Simulation Study

Summary

Introduction

25 observations: Which distribution?

[Figure: 25 observations on the interval 0-4 and several candidate density estimates]

Kernel density estimator model:

K(.) and h to choose
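The estimator's formula is shown only graphically on the slide; in the standard notation, for observations x_1, ..., x_n, kernel K(.) and bandwidth h, it reads

\[
\hat f_h(x) \;=\; \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right).
\]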


Kernel / bandwidth:

[Figure: kernel density estimates of the same data with a triangular and a gaussian kernel, and with a „small" and a „large" bandwidth h]


Question 1:

Which choice of K(.) and h is the best for a descriptive purpose?


Classification:

[Figure: estimated densities motivating the classification problem]


Levelplot – LDA (based on assumption of a multivariate normal distribution):

[Levelplot: five classes (labelled 1-5) plotted against V1 and V2, with contour levels 0.34, 0.48, 0.63, 0.78 and 0.93]


Levelplot – KDE classifier:

[Levelplots: the same five-class data against V1 and V2, now with contours from the kernel density estimates (levels 0.34-0.93)]


Question 2:

How does classification based on KDE perform in more than 2 dimensions?

Theory

Essential issues:

Optimization criteria

Improvements of the standard model

Resulting optimal choices of the model parameters K(.) and h


Optimization criteria

Lp-distances:


[Figure: two densities f(.) and g(.) whose Lp-distance is to be measured]


[Figure: the pointwise discrepancy between f(.) and g(.)]

„Integrated absolute error" = IAE

„Integrated squared error" = ISE
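The transcript does not reproduce the formulas; in the usual notation, for an estimate \hat f of the true density f,

\[
\mathrm{IAE}(\hat f) = \int \bigl|\hat f(x) - f(x)\bigr|\,dx,
\qquad
\mathrm{ISE}(\hat f) = \int \bigl(\hat f(x) - f(x)\bigr)^{2}\,dx.
\]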


Other ideas:

Consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995)

Compare the number and position of modes

Minimization of the maximum vertical distance

Overview of some minimization criteria:

L1-distance = IAE: difficult mathematical tractability

L∞-distance = maximum difference: does not consider the overall fit

„Modern" criteria, which include some measure of the horizontal distances: difficult mathematical tractability

L2-distance = ISE, MISE, AMISE, ...: most commonly used


[Figure: MISE, IV, ISB and their asymptotic counterparts AMISE, AIV, AISB as functions of log10(h) for an example density]

ISE is a random variable.

MISE = E(ISE), the expectation of the ISE.

AMISE = Taylor approximation of the MISE, easier to calculate.
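In the notation of the figure above (IV = integrated variance, ISB = integrated squared bias), the standard decomposition, not spelled out in the transcript, is

\[
\mathrm{MISE}(h) \;=\; E\,\mathrm{ISE}(h) \;=\; \underbrace{\int \mathrm{Var}\,\hat f_h(x)\,dx}_{\mathrm{IV}} \;+\; \underbrace{\int \mathrm{Bias}^{2}\,\hat f_h(x)\,dx}_{\mathrm{ISB}}.
\]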


The AMISE-optimal bandwidth


The AMISE is minimized by a bandwidth that is

dependent on the kernel function K(.)

[Figure: the „Epanechnikov kernel", the AMISE-optimal kernel]

and dependent on the unknown density f(.)
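The formulas appear only as images on the slides; in the standard notation, with R(g) = ∫ g(x)² dx and μ₂(K) = ∫ x² K(x) dx, they read

\[
\mathrm{AMISE}(h) \;=\; \frac{R(K)}{nh} + \tfrac{1}{4}\,h^{4}\,\mu_2(K)^{2}\,R(f''),
\qquad
h_{\mathrm{AMISE}} \;=\; \left(\frac{R(K)}{\mu_2(K)^{2}\,R(f'')\,n}\right)^{1/5}.
\]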

How to proceed?

Data-driven bandwidth selection methods


Leave-one-out selectors:

Maximum likelihood cross-validation

Least-squares cross-validation (Bowman, 1984)

Criteria based on substituting R(f'') in the AMISE formula:

„Normal rule" („rule of thumb"; Silverman, 1986)

Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990)

Smoothed bootstrap


Least squares cross-validation (LSCV)

Undisputed selector in the 1980s

Gives an unbiased estimator of the ISE

Suffers from more than one local minimizer – no agreement about which one to use

Bad convergence rate for the resulting bandwidth h_opt
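For reference (the criterion itself is not shown in the transcript), the quantity minimized over h is usually written as

\[
\mathrm{LSCV}(h) \;=\; \int \hat f_h(x)^{2}\,dx \;-\; \frac{2}{n}\sum_{i=1}^{n}\hat f_{h,-i}(x_i),
\]

where \hat f_{h,-i} denotes the estimate computed without the i-th observation; its expectation equals MISE(h) − ∫ f(x)² dx, so minimizing it targets the (M)ISE up to a constant.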



Normal rule („Rule of thumb“)

Assumes f(x) to be N(μ, σ²)

Easiest selector

Often oversmooths the function

The resulting bandwidth is given by:
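The formula itself appears only as an image; the usual normal-reference bandwidth for a Gaussian kernel is

\[
\hat h \;=\; \left(\frac{4}{3n}\right)^{1/5}\hat\sigma \;\approx\; 1.06\,\hat\sigma\,n^{-1/5},
\]

with \hat\sigma the sample standard deviation (Silverman, 1986, also suggests replacing \hat\sigma by min(\hat\sigma, IQR/1.34) for robustness).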


Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990)

Do not simply substitute R(f'') in the AMISE formula, but estimate it via R(f^(IV)), and R(f^(IV)) via R(f^(VI)), etc.

Another parameter i to choose (the number of stages to go back) – one stage is mostly sufficient

Better rates of convergence

Do not finally circumvent the problem of the unknown density, either


The multivariate case


The scalar bandwidth h becomes H, the bandwidth matrix.
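The multivariate estimator is again shown only as an image; in standard notation, with a symmetric positive definite bandwidth matrix H, it reads

\[
\hat f_H(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} |H|^{-1/2}\,K\!\bigl(H^{-1/2}(x - X_i)\bigr).
\]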

Issues of generalization in d dimensions


d² bandwidth parameters instead of one

Unstable estimates

Bandwidth selectors are essentially straightforward to generalize

For plug-in methods it is „too difficult" to give succinct expressions for d > 2 dimensions

Aspects of Application


Essential issues:

Curse of dimensionality

Connection between goodness-of-fit and optimal classification

Two methods for discriminatory purposes


The „curse of dimensionality“

The data „disappears“ into the distribution tails in high dimensions

⇒ a good fit in the tails is desired!

[Figure: probability mass NOT in the „tail" of a multivariate normal density, plotted for 1 to 20 dimensions (y-axis 0%-100%)]
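As a rough illustration of this effect (the slide's exact definition of the „tail" is not reproduced here), one can compute the probability that a d-dimensional standard normal vector lies within a fixed radius of its mean; since the squared norm is chi-squared distributed, this "non-tail" mass shrinks rapidly with the dimension:

```python
# Minimal sketch: mass of a d-dimensional standard normal inside a ball of
# radius r around the mean. ||X||^2 ~ chi-squared with d degrees of freedom.
from scipy.stats import chi2

r = 2.0  # illustrative radius; any fixed "non-tail" region shows the same trend
for d in (1, 2, 5, 10, 20):
    print(d, round(chi2.cdf(r**2, df=d), 4))
# the printed mass decreases towards 0 as d grows: the data "disappears"
# into the tails in high dimensions
```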


The „curse of dimensionality“

Much data is necessary to maintain a constant estimation error in high dimensions

Dimensionality   Required sample size
1                4
2                19
3                67
4                223
5                768
6                2790
7                10700
8                43700
9                187000
10               842000


Connection between goodness-of-fit and optimal classification:

AMISE-optimal parameter choice is L2-optimal; optimal classification (in high dimensions) corresponds to the L1 criterion, i.e. the misclassification rate

Estimation of the tails is important, yet the fit is worse in the tails

Calculation intensive for large n

Many observations required for a reasonable fit


Method 1:

Reduce the data onto a subspace which allows a reasonably accurate estimation but does not destroy too much information („trade-off")

Use the multivariate kernel density concept to estimate the class densities
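A minimal sketch of how such a classifier might look (the function names, the use of PCA for the projection and scipy's default normal-reference bandwidth are illustrative assumptions, not the exact procedure of the study):

```python
# Hypothetical sketch of Method 1: project onto a few principal components,
# estimate one multivariate kernel density per class and assign each test
# point to the class with the highest estimated density (equal priors assumed).
import numpy as np
from sklearn.decomposition import PCA          # assumed choice of projection
from scipy.stats import gaussian_kde           # normal-reference bandwidth by default

def kde_classify(X_train, y_train, X_test, n_components=3):
    y_train = np.asarray(y_train)
    # dimension reduction: the "trade-off" between estimability and information loss
    pca = PCA(n_components=n_components).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

    # one multivariate KDE per class (gaussian_kde expects data of shape (d, n))
    classes = np.unique(y_train)
    kdes = {c: gaussian_kde(Z_train[y_train == c].T) for c in classes}

    # classify by the largest estimated class density
    dens = np.column_stack([kdes[c](Z_test.T) for c in classes])
    return classes[np.argmax(dens, axis=1)]
```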


Method 2:

Use the univariate concept to „normalize“ the data nonparametrically

Use the classical methods like LDA and QDA for classification

Drawback: calculation intensive
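A minimal sketch of this idea (helper names are illustrative; the exact transformation of the study, e.g. whether it is carried out per class, may differ): each variable is mapped through its kernel-smoothed distribution function and the standard normal quantile function, and classical LDA is applied to the result.

```python
# Hypothetical sketch of Method 2: nonparametric "normalization" of each
# variable, followed by a classical discriminant analysis.
import numpy as np
from scipy.stats import gaussian_kde, norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def marginal_normalize(train_col, col):
    kde = gaussian_kde(train_col)                  # univariate KDE of one variable
    # kernel-smoothed CDF values (this loop is part of what makes the
    # method calculation intensive)
    u = np.array([kde.integrate_box_1d(-np.inf, x) for x in col])
    u = np.clip(u, 1e-6, 1 - 1e-6)                 # keep the probit finite
    return norm.ppf(u)                             # transform towards normality

def method2_classify(X_train, y_train, X_test):
    cols = range(X_train.shape[1])
    Z_train = np.column_stack([marginal_normalize(X_train[:, j], X_train[:, j]) for j in cols])
    Z_test = np.column_stack([marginal_normalize(X_train[:, j], X_test[:, j]) for j in cols])
    return LinearDiscriminantAnalysis().fit(Z_train, y_train).predict(Z_test)
```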


Method 2 – illustration:

[Figure: a) an estimated marginal density f(x); b) the distribution functions F(x) and G(x) defining the normalizing transformation t(x)]

Simulation Study


Criticism on former simulation studies

Carried out 20-30 years ago

Out-dated parameter selectors

Restriction to uncorrelated normals

Fruitless estimation because of high dimensions

No dimension reduction


The present simulation study:

21 datasets

14 estimators

2 error criteria

21 × 14 × 2 = 588 classification scores

Many results


Each dataset has...

...2 classes for distinction

...600 observations per class

...200 test observations, 100 produced by each class

...therefore dimension 1400 × 10

Univariate prototype distributions:

[Figure: densities of the prototypes – Normal, Normal with small noise, Normal with medium noise, Normal with large noise, Exponential(1), Bimodal (close), Bimodal (far)]

10 datasets having equal covariance matrices
+ 10 datasets having unequal covariance matrices
+ 1 insurance dataset
= 21 datasets total

Dataset Nr.   Abbrev.   Contains
1             NN1       10 normal distributions with "small noise"
2             NN2       10 normal distributions with "medium noise"
3             NN3       10 normal distributions with "small noise"
4             SkN1      2 skewed (exp-)distributions and 7 normals
5             SkN2      5 skewed (exp-)distributions and 5 normals
6             SkN3      7 skewed (exp-)distributions and 3 normals
7             Bi1       4 normals, 4 skewed and 2 bimodal (close) distributions
8             Bi2       4 normals, 4 skewed and 2 bimodal (close) distributions
9             Bi3       8 skewed and 2 bimodal (far) distributions
10            Bi4       8 skewed and 2 bimodal (far) distributions


14 estimators:

Classical methods: LDA and QDA (2 estimators)

Method 1 (multivariate density estimator): principal component reduction onto 2, 3, 4 and 5 dimensions (4) × multivariate „normal rule" and multivariate LSCV criterion, resp. (2) = 8 estimators

Method 2 („marginal normalizations"): univariate normal rule and Sheather-Jones plug-in (2) × subsequent LDA and QDA (2) = 4 estimators


Misclassification Criteria

The classical Misclassification rate („Error rate“)

The Brier score
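For reference (no formula is given on the slide), the Brier score over n test observations and K classes, with estimated class probabilities \hat p_{ik} and class indicators y_{ik}, is commonly defined as

\[
\mathrm{BS} \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\bigl(\hat p_{ik} - y_{ik}\bigr)^{2}.
\]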


Results

Error rate vs. Brier score

[Scatterplot: Brier score (0.0-0.8) against error rate (0.0-0.6) for all classification runs]

The choice of the misclassification criterion is not essential


Results: The choice of the multivariate bandwidth parameter (method 1) is not essential in most cases

Error rates for method 1

[Scatterplot: error rates for method 1 – LSCV bandwidth against the multivariate „normal rule"]

LSCV is superior in the case of bimodal distributions with unequal covariance matrices


Results: The choice of the univariate bandwidth parameter (method 2) is not essential

Error rates for method 2

[Scatterplot: error rates for method 2 – Sheather-Jones selector against the „normal rule"]


Results: The best trade-off is a projection onto 2-3 dimensions

Error rate regarding different subspaces

[Line chart: error rate against the number of projection dimensions (2-5) for the NN-, SkN- and Bi-distributions]

Results

Equal covariance matrices: method 1 performs worse than LDA

[Bar chart: error rates per dataset for LDA (classical) and LSCV(3), method 1]

Equal covariance matrices: method 2 sometimes yields slight improvements

[Bar chart: error rates per dataset (NN1-Bi4) for LDA (classical) and the normal rule in method 2]

Results

Unequal covariance matrices: method 1 performs quite poorly, except for skewed distributions

[Bar chart: error rates per dataset for QDA (classical) and LSCV(3), method 1]

Unequal covariance matrices: method 2 often brings substantial improvements

[Bar chart: error rates per dataset for QDA (classical) and the normal rule in method 2]

Is the additional calculation time justified?


Required calculation time

[Chart: calculation time for LDA/QDA, the multivariate „normal rule", and the preliminary univariate normalizations with LSCV and the Sheather-Jones plug-in]

Summary

Summary (1/3) – Classification Performance

Restriction to only a few dimensions

Improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices)

Poor performance of the multivariate kernel density classifier

LDA is undisputed in the case of equal covariance matrices and equal prior probabilities

Additional computation time seems not to be justified


Summary (2/3) – KDE for Data Description

Great variety of error criteria, parameter selection procedures and additional model improvements (three dimensions of choice)

No consensus about a feasible error criterion

Nobody knows what is finally optimized („upper bounds" in L1 theory; in L2 theory ISE vs. MISE vs. AMISE, several minima in LSCV, ...)

Different parameter selectors are of varying quality with respect to different underlying densities


Summary (3/3) – Theory vs. Application

Comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification

For discriminatory purposes the issue of estimating log-densities is much more important

Some univariate model improvements are not generalizable

The – widely ignored – „curse of dimensionality" forces the user to find a trade-off between necessary dimension reduction and information loss

Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time


The End
