The Islamic University of Gaza
Deanery of Higher Studies
Faculty of Science
Department of Mathematics
ON THE KERNEL DENSITY ESTIMATION
Presented By
Ghada M. Abu Nada
Supervised By
Associate Professor: Mohamed I. Riffi
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
AT
ISLAMIC UNIVERSITY OF GAZA
GAZA, PALESTINE
FEBRUARY 2002
© Copyright by Ghada M. Abu Nada, 2000
To my parents
ii
Table of Contents
Table of Contents iii
Abstract v
Acknowledgements vi
1 Introduction 3
  1.1 Histogram 4
    1.1.1 How to construct a relative frequency histogram 4
  1.2 Density Estimation 7
    1.2.1 Properties And Examples Of The Kernel Function 9
2 Univariate Kernel Density Estimation 11
  2.1 Introduction 11
  2.2 The MSE And MISE Criteria 14
  2.3 Asymptotic MSE And MISE Approximations 17
  2.4 Exact MISE Calculation 24
  2.5 Canonical Kernel And Optimal Kernel Theory 27
  2.6 Measuring How Difficult A Density Is To Estimate 32
3 Modifications of the kernel density estimator 36
  3.1 Local Kernel Density Estimator 36
  3.2 Variable Kernel Density Estimator 39
  3.3 Transformation Kernel Density Estimators 40
4 Bandwidth Selection 43
  4.1 Introduction 43
  4.2 Quick And Simple Bandwidth Selectors 44
    4.2.1 Normal Scale Rules 44
    4.2.2 Oversmoothed bandwidth selection rules 45
  4.3 Least Squares Cross-Validation 47
  4.4 Biased Cross-Validation 48
  4.5 Estimation Of Density Functionals 49
  4.6 Plug-In Bandwidth Selection 55
    4.6.1 Direct Plug-In Rules 55
    4.6.2 Solve-The-Equation Rules 58
  4.7 Smoothed Cross-Validation Bandwidth Selection 60
Conclusions 61
Bibliography 63
iv
Abstract
This thesis considers kernel-based density estimators as a fundamental tool for the nonparametric estimation of functions. We analyze the performance of the kernel density estimator using the mean squared error as the criterion for measuring the error incurred in estimation. An asymptotic approximation to the mean integrated squared error is derived. This approximation gives a clearer understanding of the smoothing parameter, often called the bandwidth, of kernel estimators, and it yields a closed-form expression for the bandwidth that asymptotically minimizes the mean integrated squared error. We also study the influence of the kernel and of the true density on the performance of the kernel density estimator.
To raise the efficiency of the basic kernel density estimator and to overcome its limitations, some modified forms of the kernel estimator are introduced.
One of the important issues in kernel smoothing is the choice of the smoothing parameter, so several bandwidth selection methods are studied.
v
Acknowledgements
I would like to thank the Islamic University of Gaza for giving me the opportunity to pursue the Master's degree, and my thanks go to my professors and my colleagues at the Department of Mathematics.
I am grateful to my supervisor Dr. Mohamed Riffi for his guidance and efforts with me during this research. I would also like to thank Dr. Jaser H. Sarsour and Dr. Mohamed Al-Atrash for their useful suggestions and discussions.
My sincere gratitude goes to my family, especially my parents, for their love and support. I am also thankful to all my friends for their kind advice and encouragement.
Finally, I pray to God to accept this work.
vi
1
Preface
Kernel smoothing refers to a general methodology for non-parametric estimation of
functions. It provides us with a class of techniques for performing this estimation.
The basic principle is that local averaging or smoothing is performed with respect to
a kernel function.
In this thesis we deal with one of those techniques, known as kernel density estimation. Suppose that you have a univariate set of data which you want to display graphically; this procedure can then be used effectively.
Several reasons attracted our attention to this subject. Among them is the importance of kernel density estimation techniques for statisticians: it is a simple practical tool for estimating unknown densities in statistical experiments. Moreover, the use of kernel smoothers is not restricted to those already familiar with the topic; it is also of fundamental importance for students and researchers from other disciplines. The same principles can be extended to more complicated problems, leading to many applications in fields like medicine, engineering and economics.
Research on kernel estimators is extensive and dates back many years. The basic principles were introduced by Fix and Hodges (1951) and Akaike (1954).
The main goal of this thesis is to present an investigation of kernel density estimation in a simple way that enables us to use it for smoothing and for learning about unknown densities, both theoretically and practically.
A brief description of the chapters of this thesis now follows.
The first chapter is preparatory. It is an introduction to histograms and how they are used in estimating a probability density function. We then discuss an alternative approach in more detail.
2
The main subject, univariate kernel density estimation, is detailed in Chapter 2. We begin by studying some error criteria for measuring the error when estimating the density. In the second section we derive large sample approximations for the leading variance and bias terms, which allow a better understanding of the role of the bandwidth. In the next section, the exact finite sample performance of the kernel density estimator for a particular kernel and density is analyzed. The rest of the chapter concerns the effect of the shape of the kernel function and measuring how difficult a density is to estimate.
Chapter 3 is about modifications of the kernel density estimator: the local kernel density estimator, the variable kernel density estimator and the transformation kernel density estimator.
Special attention is also given to the important problem of choosing the smoothing parameter of a kernel smoother in Chapter 4. It consists of seven sections and presents two classes of bandwidth selectors: the quick and simple selectors and the hi-tech selectors.
Finally, conclusions are drawn about the subjects studied in this thesis.
Chapter 1
Introduction
One of the main concerns of statisticians and data analysts is to characterize the distribution of the data in some manner, so that they can make their inferences.
The distribution can be modelled as a member of a parametric family, with inferences made about the parameters, or it can be treated nonparametrically. An example of the latter is to use the empirical distribution function of the data, and make inferences based on whether this distribution has some properties of interest.
Accordingly, probability density estimators are generally broken down into two basic classes: parametric and nonparametric estimators.
Parametric estimators assume a functional form of the density parameterized by a finite set of parameters, while nonparametric methods consist of all other types of estimators.
In the present work, we begin with histograms as an introduction to kernel estimators, which are an important example of nonparametric methods for estimating unknown densities.
3
4
1.1 Histogram
Histograms are among the oldest and most widely used methods of data representation. Informally, a histogram is simply a collection of rectangular bins with their bases on the horizontal (real) axis and heights parallel to the vertical axis. It presents the data graphically and helps us use the data to say something about the population from which they were selected.
1.1.1 How to construct a relative frequency histogram
We can, first, summarize the set of data in a table giving the number of occurrences of each possible outcome obtained from repeating a random experiment a number of times, say n.

i) Discrete data. The outcomes of the random experiment are values in a discrete set, that is, a set containing a countable number of points. To construct a relative frequency histogram, we make the height of each rectangle equal to f/n, where f is the number of times the outcome appears in the n trials, with a base of length one centered at the data point. The quantity f/n is called the relative frequency of the outcome. Note that the relative frequency histogram gives an estimate of the probability histogram of the associated random variable.

ii) Continuous data. The theoretical set of outcomes forms an interval or a union of intervals. Given a set of continuous data, we group the data into k class intervals of equal length (also called bins or windows) and construct a frequency table listing a tabulation of the measurements in the various classes and the frequency of each class. A relative frequency histogram is then composed of rectangles, each with area equal to the relative frequency f_i/n of the observations in the corresponding class. A minimal code sketch of this construction is given below.
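The following short Python sketch builds the relative frequency histogram of Eqn. (1.1.1) from a continuous sample; the simulated sample and the number of bins k are arbitrary choices made only for the example.

import numpy as np

def relative_frequency_histogram(data, k):
    """Return bin edges c_0,...,c_k and heights f_i / (n * (c_i - c_{i-1}))."""
    n = len(data)
    edges = np.linspace(min(data), max(data), k + 1)   # k equal-length bins
    counts, _ = np.histogram(data, bins=edges)          # frequencies f_i
    widths = np.diff(edges)                             # c_i - c_{i-1}
    heights = counts / (n * widths)                      # Eqn. (1.1.1)
    return edges, heights

# example: 50 simulated observations, 8 bins (illustrative values)
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=1.0, size=50)
edges, heights = relative_frequency_histogram(sample, k=8)
print((heights * np.diff(edges)).sum())   # total area = 1, so h(x) estimates a density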
5
Definition 1.1.1. The function defined by
\[
h(x) =
\begin{cases}
f_i / \{n(c_i - c_{i-1})\} & c_{i-1} \le x \le c_i;\ i = 1, 2, \dots, k, \\
0 & \text{otherwise},
\end{cases}
\tag{1.1.1}
\]
is called a relative frequency histogram, where f_i is the frequency of the ith bin, c_{i-1} and c_i are the boundaries of this bin, and n is the total number of observations.
Eqn. (1.1.1) gives a simple descriptive estimate of a probability density function of the continuous type. So a histogram can be considered the simplest nonparametric form of what is called a density estimator.
The appearance of the histogram depends both on the choice of the origin of the histogram and on the width of the bins used, where the origin of the histogram is the lower boundary of the first bin.
Figure (1.1.1)
6
Figure (1.1.1) shows four histograms based on the data of 50 birth weights of children having severe idiopathic respiratory syndrome (Van Vliek and Gupta, 1973).
Figures (1.1.1)(a) and (b) show histograms with binwidths b = 0.2 and b = 0.8, respectively. We note that the histogram in (b) is smoother than the one in (a), which has the smaller binwidth.
Figures (1.1.1)(c) and (d) show histograms based on the same binwidth b = 0.4, but with the leftmost bin starting at 0.7 and 0.9, respectively.
The histogram's sensitivity to the choice of origin can be removed by taking the average of a number of shifted histograms, which is called the average shifted histogram (ASH). In contrast to histograms, the appearance of the ASH does not depend on a particular choice of origin, only on the choice of binwidth. The ASH provides one way to approximate the kernel density estimator (KDE), which we study in the following chapter.
However, histograms have drawbacks other than their sensitivity to the choice of origin that make them unsuitable for advanced estimation problems. The histogram is a step function, and it is impractical to estimate all densities with a step function. Another problem is the extension of the histogram to the multivariate case. The main problem in practice, however, is to obtain a sufficiently smooth representation of the data while also retaining its main features. This is guaranteed by the KDE.
We now return to the function in Eqn. (1.1.1) and re-express it in the context of an important subject.
7
1.2 Density Estimation
The main object of density estimation is to construct a smooth nonparametric estimate of an unknown density function from a set of observations drawn from it; that is, we are given a sample of n independent observations x_1, x_2, ..., x_n from a distribution with unknown density function f(x).
A simple development of the estimator in Eqn. (1.1.1) is the running histogram estimator
\[
\hat{f}(x;h) = \frac{1}{2nh}\, n_x, \qquad a \le x \le b, \tag{1.2.1}
\]
where n_x is the number of observations falling in the interval [x−h, x+h], and h is known as the bandwidth.
Eqn. (1.2.1) can be written as
\[
\begin{aligned}
\hat{f}(x) &= \frac{1}{2nh}\,(\text{number of observations falling in } [x-h,\, x+h]) \\
&= \frac{1}{2nh}\sum_{i=1}^{n} I(|x_i - x| \le h) \\
&= \frac{1}{nh}\sum_{i=1}^{n} \frac{1}{2}\, I\!\left(\left|\frac{x_i - x}{h}\right| \le 1\right) \\
&= \frac{1}{nh}\sum_{i=1}^{n} w\!\left(\frac{x_i - x}{h}\right),
\end{aligned}
\tag{1.2.2}
\]
where
\[
I = \begin{cases} 1 & x-h \le x_i \le x+h, \\ 0 & \text{otherwise}, \end{cases}
\]
is the indicator function, and
\[
w\!\left(\frac{x_i - x}{h}\right) = \frac{1}{2}\, I\!\left(\left|\frac{x_i - x}{h}\right| \le 1\right)
= \begin{cases} \tfrac{1}{2} & -1 \le \frac{x_i - x}{h} \le 1, \\ 0 & \text{otherwise}. \end{cases}
\]
8
Generally, the expression in Eqn. (1.2.2) exhibits a function w, centered at the estimation point, that is used to weight nearby data points. We are now ready to introduce the following definition.

Definition 1.2.1. We call the function that is centered at the estimation point and used to weight nearby data points a weight function, or kernel function, and denote it by K(·).

Eqn. (1.2.2) then becomes
\[
\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right). \tag{1.2.3}
\]
In fact, the above form of the kernel function is called the uniform kernel, and it is one of several forms of this function.
9
1.2.1 Properties And Examples Of The Kernel Function
1. The kernel is usually a symmetric probability density function (pdf). Thus we expect it to have the same properties as a pdf, that is,
\[
\int_{\mathbb{R}} K(u)\,du = 1 \quad \text{and} \quad K(u) \ge 0.
\]

2. The shape of \hat{f}(x;h) does not depend upon the choice of origin, but it is affected by the bandwidth h. The theoretical background of this insensitivity is that kernel functions can be re-scaled so that the difference between two kernel density estimates using two different kernels is almost negligible (more details in Section (2.5)).
Some examples of kernel functions:

Kernel                          K(u)
(a) Uniform                     (1/2) I(|u| ≤ 1)
(b) Triangle                    (1 − |u|) I(|u| ≤ 1)
(c) Gaussian                    (2π)^{-1/2} exp(−u²/2)
(d) Triweight (Beta(4,4))       (35/32)(1 − u²)³ I(|u| ≤ 1)
(e) Quartic                     (15/16)(1 − u²)² I(|u| ≤ 1)
(f) Epanechnikov                (3/4)(1 − u²) I(|u| ≤ 1)
(g) Cosinus                     (π/4) cos(πu/2) I(|u| ≤ 1)

Table (1.2.1)
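For reference, the kernels in Table (1.2.1) can be written directly as Python functions. This is only a transcription of the table into code, with a numerical check that each kernel integrates to one; the grid used for the check is an arbitrary choice.

import numpy as np

# K(u) for each kernel in Table (1.2.1); all are symmetric pdfs
kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangle":     lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "gaussian":     lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
    "triweight":    lambda u: (35/32) * (1 - u**2)**3 * (np.abs(u) <= 1),
    "quartic":      lambda u: (15/16) * (1 - u**2)**2 * (np.abs(u) <= 1),
    "epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
    "cosinus":      lambda u: (np.pi/4) * np.cos(np.pi * u / 2) * (np.abs(u) <= 1),
}

# crude Riemann-sum check that each K integrates to one
u = np.linspace(-8.0, 8.0, 160001)
du = u[1] - u[0]
for name, K in kernels.items():
    print(name, (K(u) * du).sum())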
10
Some of the kernels in Table (1.2.1) that will be needed are plotted in Figure (1.2.1).
Figure (1.2.1)
Chapter 2
Univariate Kernel Density Estimation
2.1 Introduction
Among the many tools of data analysis, nonparametric density estimation is especially important, because it provides an effective way of revealing structure in a set of data when parametric methods are inappropriate. This happens when assuming a particular parametric model would mean missing some of the main structure in the data. Moreover, the univariate density estimator is the most straightforward of the kernel estimators, and a good understanding of it makes it easier to move on to extensions to the multivariate case, to more complicated kernel-based methods, or to estimating curves instead of densities.
Definition 2.1.1. Suppose that X_1, ..., X_n is a univariate random sample of continuous type drawn from an unknown distribution with density f(x). The kernel density estimator (KDE), \hat{f}(x;h), for the estimation of the density value f(x) at a point x is defined as
\[
\hat{f}(x;h) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right), \tag{2.1.1}
\]
where K(·) is a kernel function and h is called the bandwidth or window-width.
11
12
Remark 2.1.1.
1. The properties of the kernel function in Section (1.2) show that the kernel function is a pdf. This ensures that \hat{f}(x;h) in Eqn. (2.1.1) is itself a density.
2. The KDE in Eqn. (2.1.1) can be reformulated as
\[
\hat{f}(x;h) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i), \tag{2.1.2}
\]
where K_h(u) = K(u/h)/h.
According to this definition, it is possible to think of the KDE in the following way: imagine the n observations x_1, x_2, ..., x_n plotted on a line; a KDE is obtained by placing a bump at each point and then summing the heights of the bumps at each point on the x-axis. The shape of the bump is determined by the kernel function, and the spread of the bump is determined by the bandwidth h, which is analogous to the binwidth of a histogram. In other words, the value of the kernel estimate at a point x is the average of the n kernel ordinates at this point. A minimal code sketch of this construction is given below.
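A minimal Python sketch of Eqn. (2.1.2), assuming a standard normal kernel; the simulated data, bandwidth and evaluation grid are illustrative only.

import numpy as np

def kde(x, data, h, K=lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate f_hat(x; h) = n^{-1} sum_i K_h(x - X_i)."""
    x = np.atleast_1d(x)[:, None]        # evaluation points as a column
    u = (x - data[None, :]) / h          # (x - X_i)/h for every pair (x, X_i)
    return K(u).mean(axis=1) / h         # average of the n kernel ordinates

rng = np.random.default_rng(1)
data = rng.normal(size=100)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.3))            # one "bump average" per grid point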
Example 2.1.1. Choose the kernel function to be the N(0, 1) density, K(x) = φ(x) = (2π)^{-1/2} exp(−x²/2). Using five observations (for illustration), Figure (2.1.1) shows the resulting kernel density estimate.
13
Remark number (2) and the previous example give an idea about the role of thebandwidth h as a scaling factor which controls the spread of the kernel.
Definition 2.1.2. The bandwidth h is often called the smoothing parameter, since itcontrols the amount of smoothing being applied to the data.
Example 2.1.2. A random sample of size n = 1000 was drawn from the normal mixture density
\[
f_1(x) = \frac{3}{4}\,\phi(x) + \frac{1}{4}\,\phi_{1/3}\!\left(x - \frac{3}{2}\right).
\]
That is, f_1 consists of N(0, 1) observations with probability 3/4 and N(3/2, (1/3)²) observations with probability 1/4.
The kernel density estimates obtained from this sample are shown in the figure below.

Figure (2.1.2)
14
In Figure (2.1.2), the solid line is the estimate and the dashed one is the true density. The kernel used for each estimate is illustrated by small kernels at the base of each figure.
Figure (2.1.2)(a), with h = 0.06, shows an estimate of f that is very rough; it is usually called an undersmoothed estimate.
Figure (2.1.2)(b), with h = 0.54, shows an estimate of f that is too smooth, so that one mode is missed.
Figure (2.1.2)(c), with h = 0.18, is a reasonable estimate of f, which retains the main features of the true f.
2.2 The MSE And MISE Criteria
The important role played by the KDE makes us concerned with its performance: its efficiency and accuracy in estimating the true density. So it is reasonable to investigate some error criteria for measuring the error when estimating the density, first at a single point rather than over the whole real line. We have seen that the bandwidth h is the most important parameter in determining the smoothness of a density estimate. It determines a trade-off between two types of error (as we will see): bias and variance.
Our goal is to choose a good bandwidth, so we need to understand this relationship. These two types of error are the components of the so-called mean squared error.
Definition 2.2.1. The mean squared error (MSE) of an estimator \hat{f}(x;h) of a density f(x) is the expected value of the squared difference between the density estimator and the true density function; it is denoted by E(\hat{f}(x;h) − f(x))².
From its definition, the MSE measures the average squared difference between the density estimator and the true density. In general, any function of the absolute distance |\hat{f}(x;h) − f(x)| (often called a metric) could serve as a measure of the goodness of an estimator, but the MSE has at least two advantages over other metrics. First, it is analytically tractable. Second, it has an interesting decomposition
15
into variance and squared bias, provided that f(x) is not random, as follows:
\[
\begin{aligned}
\mathrm{MSE}\{\hat{f}(x;h)\} &= E\big(\hat{f}(x;h) - f(x)\big)^2 = E\big(f^2(x) - 2f(x)\hat{f}(x;h) + \hat{f}^2(x;h)\big) \\
&= f^2(x) - 2f(x)E\hat{f}(x;h) + E\hat{f}^2(x;h) \\
&= f^2(x) - 2f(x)E\hat{f}(x;h) + \mathrm{Var}\,\hat{f}(x;h) + \big(E\hat{f}(x;h)\big)^2 \\
&= \mathrm{Var}\,\hat{f}(x;h) + \big(E\hat{f}(x;h) - f(x)\big)^2,
\end{aligned}
\tag{2.2.1}
\]
where the bias of an estimator is defined as follows.
Definition 2.2.2. The bias of an estimator \hat{f} of a density f is the difference between the expected value of \hat{f} and f. That is,
\[
\mathrm{bias}(\hat{f}) = E\hat{f} - f.
\]
An estimator whose bias is equal to zero is called unbiased.
In what follows we compute the MSE of an estimator \hat{f}(x;h) of the density function f(x) at a point x ∈ R.
Theorem 2.2.1. Let X be a random variable having density f. Then
\[
\mathrm{MSE}\{\hat{f}(x;h)\} = n^{-1}\left\{\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right\} + \left(\int K_h(x-y)f(y)\,dy - f(x)\right)^{2}. \tag{2.2.2}
\]
Proof. Note that
\[
E\hat{f}(x;h) = E K_h(x - X) = \int K_h(x-y) f(y)\,dy, \tag{2.2.3}
\]
because \hat{f}(x;h) is the sample mean of the independent and identically distributed quantities K_h(x − X_i), so its expectation equals E K_h(x − X) and its variance equals n^{-1}\mathrm{Var}\,K_h(x − X). Therefore, the bias of \hat{f}(x;h) is
\[
E\hat{f}(x;h) - f(x) = \int K_h(x-y)f(y)\,dy - f(x),
\]
and the variance is
\[
\begin{aligned}
\mathrm{Var}\{\hat{f}(x;h)\} &= n^{-1}\,\mathrm{Var}\,K_h(x-X) \\
&= n^{-1}\big(E K_h^2(x-X) - (E K_h(x-X))^2\big) \\
&= n^{-1}\left\{\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right\}.
\end{aligned}
\]
Combining these gives
\[
\mathrm{MSE}\{\hat{f}(x;h)\} = n^{-1}\left\{\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right\} + \left(\int K_h(x-y)f(y)\,dy - f(x)\right)^{2}.
\]
Now, we are interested in considering an error criterion that globally measures the
distance between the estimation of f over the entire real line and f itself.
Definition 2.2.3. An error criterion that measures the distance between \hat{f}(·;h) and f is the integrated squared error (ISE), given by
\[
\mathrm{ISE}\{\hat{f}(\cdot;h)\} = \int \big(\hat{f}(x;h) - f(x)\big)^2\,dx.
\]
However, the ISE is not an appropriate criterion across all data sets, so we prefer to analyze the expected value of this random quantity, the mean integrated squared error.

Definition 2.2.4. The expected value of the ISE is called the mean integrated squared error (MISE). It is given by
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = E\big(\mathrm{ISE}\{\hat{f}(\cdot;h)\}\big) = E\int \big(\hat{f}(x;h) - f(x)\big)^2\,dx.
\]
Theorem 2.2.2. The MISE of an estimator \hat{f}(·;h) of a density f is given by
\[
\begin{aligned}
\mathrm{MISE}\{\hat{f}(\cdot;h)\} ={}& n^{-1}\int\!\!\int K_h^2(x-y)f(y)\,dy\,dx \\
&+ (1 - n^{-1})\int\left(\int K_h(x-y)f(y)\,dy\right)^{2} dx \\
&- 2\int\left\{\int K_h(x-y)f(y)\,dy\right\} f(x)\,dx + \int f^2(x)\,dx.
\end{aligned}
\tag{2.2.6}
\]
17
Proof. The definitions of MISE and MSE and some calculation yield
\[
\begin{aligned}
\mathrm{MISE}\{\hat{f}(\cdot;h)\} &= E\int \big(\hat{f}(x;h) - f(x)\big)^2\,dx = \int E\big(\hat{f}(x;h) - f(x)\big)^2\,dx = \int \mathrm{MSE}\{\hat{f}(x;h)\}\,dx \\
&= n^{-1}\left\{\int\!\!\int K_h^2(x-y)f(y)\,dy\,dx - \int\left(\int K_h(x-y)f(y)\,dy\right)^{2} dx\right\} \\
&\quad + \int\left\{\int K_h(x-y)f(y)\,dy - f(x)\right\}^{2} dx \\
&= n^{-1}\int\!\!\int K_h^2(x-y)f(y)\,dy\,dx + (1 - n^{-1})\int\left(\int K_h(x-y)f(y)\,dy\right)^{2} dx \\
&\quad - 2\int\!\!\int K_h(x-y)f(y)\,dy\, f(x)\,dx + \int f^2(x)\,dx.
\end{aligned}
\]
This theorem shows that MISE\{\hat{f}(\cdot;h)\} depends on h in a relatively complicated way. In practice, we need an expression with a clearer dependence on h.
2.3 Asymptotic MSE And MISE Approximations
Here, we will derive an asymptotic approximation for MISE which depends on h in a
simple way. The simple expressions of these approximations will exhibit the influence
of the bandwidth h as a smoothing parameter.
The rate of convergence (defined below) of the KDE and the MISE-optimal bandwidth can also be obtained from the asymptotic approximation of the MISE.
Before starting our investigation, we introduce some definitions, a theorem, and some assumptions that are needed throughout this work.
Definition 2.3.1. [1]An ultimately monotone function is one that is monotone overboth (−∞,−M) and (M,∞) for some M > 0.
18
Definition 2.3.2. [5] i. A function f is of smaller order than g as x → ∞ if lim_{x→∞} f(x)/g(x) = 0. We indicate this by writing f = o(g) ("f is little oh of g").
ii. Let f(x) and g(x) be positive for x sufficiently large. Then f is of at most the order of g as x → ∞ if there is a positive constant M for which
\[
\frac{f(x)}{g(x)} \le M
\]
for x sufficiently large. We indicate this by writing f = O(g) ("f is big oh of g").

Definition 2.3.3. [3] Given two sequences {a_n} and {b_n} such that b_n ≥ 0 for all n, we write
a_n = O(b_n) (read: "a_n is big oh of b_n")
if there exists a constant M > 0 such that |a_n| ≤ M b_n for all n. We write
a_n = o(b_n) as n → ∞ (read: "a_n is little oh of b_n") if lim_{n→∞} a_n/b_n = 0.

Definition 2.3.4. [1] We say that a_n is asymptotically equivalent to b_n, or simply a_n is asymptotic to b_n, and write a_n ∼ b_n, if and only if lim_{n→∞}(a_n/b_n) = 1.

Definition 2.3.5. [1] If the sequence {a_n} satisfies a_n ∼ C r_n, where r_n is a simple function of n and C is independent of n, then we call r_n the rate of convergence to zero of a_n. It is also common to say that a_n is of order r_n.

Theorem 2.3.1. (Taylor's Theorem) Suppose that f is a real-valued function defined on R and let x ∈ R. Assume that f has p continuous derivatives in an interval (x − δ, x + δ) for some δ > 0 and that the (p + 1)th derivative of f exists. Then for any sequence α_n converging to zero,
\[
f(x + \alpha_n) = \sum_{j=0}^{p} (\alpha_n^{j}/j!)\, f^{(j)}(x) + o(\alpha_n^{p}).
\]
The assumptions that we need are:
1. The density f is such that its second derivative f ′′ is continuous, squared inte-
grable and ultimately monotone.
19
2. The bandwidth h = h_n is a non-random sequence of positive numbers. We also assume that h satisfies
\[
\lim_{n\to\infty} h = 0 \quad \text{and} \quad \lim_{n\to\infty} nh = \infty,
\]
which is equivalent to saying that h approaches zero, but at a rate slower than n^{-1}.

3. The kernel K is a bounded probability density function having finite fourth moment and symmetric about the origin.
Lemma 2.3.2. Let X be a random variable having density f. Then the bias of \hat{f}(x;h) can be expressed as
\[
E\hat{f}(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2). \tag{2.3.1}
\]

Proof. As before,
\[
E\hat{f}(x;h) = \int K_h(x-y)f(y)\,dy = \int \frac{1}{h}K\!\left(\frac{x-y}{h}\right)f(y)\,dy.
\]
Set z = (x − y)/h. Then
\[
E\hat{f}(x;h) = \int K(z) f(x - hz)\,dz.
\]
We expand f(x − hz) in a Taylor series about x: f is a real-valued function defined on R and, by Assumption (1), f has continuous derivatives of order 2; as n → ∞, Assumption (2) implies −zh → 0. Thus
\[
f(x - hz) = \sum_{j=0}^{2} \frac{(-zh)^j}{j!} f^{(j)}(x) + o\big((zh)^2\big)
= f(x) - zh f'(x) + \frac{1}{2}h^2 z^2 f''(x) + o(h^2).
\]
This yields
\[
\begin{aligned}
E\{\hat{f}(x;h)\} &= \int K(z)\left(f(x) - zh f'(x) + \frac{1}{2}h^2 z^2 f''(x) + o(h^2)\right)dz \\
&= f(x) - h f'(x)\int z K(z)\,dz + \frac{1}{2}h^2 f''(x)\int z^2 K(z)\,dz + o(h^2) \\
&= f(x) + \frac{1}{2}h^2 f''(x)\int z^2 K(z)\,dz + o(h^2),
\end{aligned}
\]
where
\[
\int K(z)\,dz = 1, \qquad \int z K(z)\,dz = 0, \qquad \int z^2 K(z)\,dz < \infty
\]
(from the kernel function properties and Assumption (3)). Letting µ_2(K) = ∫ z²K(z) dz, the bias expression can be written as
\[
E\hat{f}(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).
\]
Here the bias is of order h², so \hat{f}(x;h) is asymptotically unbiased. Another important point is that the bias depends on the true density f: it is directly proportional to the second derivative of f. In other words, the above expression gives a relationship between the bias and the curvature of the density f. The bias is large where the curvature of f is high. For many densities this occurs at peaks, where the bias is negative, and in valleys, where the bias is positive.
Lemma 2.3.3. Let X be a random variable having density f. Then
\[
\mathrm{Var}\{\hat{f}(x;h)\} = (nh)^{-1}R(K)f(x) + o\big((nh)^{-1}\big), \tag{2.3.2}
\]
where R(K) = ∫ K²(x) dx.

Proof. We express Var\{\hat{f}(x;h)\} as
\[
\mathrm{Var}\{\hat{f}(x;h)\} = n^{-1}\left[\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right].
\]
Using the Taylor series expansion of f(x − hz) about x, we obtain
\[
\begin{aligned}
\mathrm{Var}\{\hat{f}(x;h)\} &= (nh)^{-1}\int K^2(z)f(x - hz)\,dz - n^{-1}\{E\hat{f}(x;h)\}^2 \\
&= (nh)^{-1}\int K^2(z)\{f(x) + o(1)\}\,dz - n^{-1}\{f(x) + o(1)\}^2 \\
&= (nh)^{-1} f(x)\int K^2(z)\,dz + o\big((nh)^{-1}\big).
\end{aligned}
\]
With the notation R(K) = ∫ K²(x) dx, we can write
\[
\mathrm{Var}\{\hat{f}(x;h)\} = (nh)^{-1}R(K)f(x) + o\big((nh)^{-1}\big).
\]
From Eqn. (2.3.2) the variance is of order (nh)^{-1}, and hence, by Assumption (2), lim_{n→∞}(nh)^{-1} = 0, so Var\,\hat{f} converges to zero.
Theorem 2.3.4. The MISE of the estimator \hat{f} of the unknown density f is given by
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = \mathrm{AMISE}\{\hat{f}(\cdot;h)\} + o\{(nh)^{-1} + h^4\},
\]
where
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'') \tag{2.3.3}
\]
is called the asymptotic MISE of \hat{f}(\cdot;h).

Proof. The definition of the MSE and Lemmas (2.3.2) and (2.3.3) combine to give
\[
\begin{aligned}
\mathrm{MSE}\{\hat{f}(x;h)\} &= (nh)^{-1}R(K)f(x) + o\big((nh)^{-1}\big) + \frac{1}{4}h^4\mu_2^2(K)f''(x)^2 + o(h^4) + h^2\mu_2(K)f''(x)\,o(h^2) \\
&= (nh)^{-1}R(K)f(x) + \frac{1}{4}h^4\mu_2^2(K)f''(x)^2 + o\big((nh)^{-1} + h^4\big).
\end{aligned}
\]
Integrating this expression yields
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'') + o\big((nh)^{-1} + h^4\big),
\]
hence
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'').
\]
22
The asymptotic MISE (AMISE) has some useful advantages. Its simplicity as a mathematical expression makes it convenient for large sample approximations. It also displays an important relationship between bias and variance, known as the variance-bias trade-off, which gives us an understanding of the role of the bandwidth h. More precisely, the integrated squared bias term in the AMISE is asymptotically proportional to h⁴, so in order to decrease it we need to make h small; but if we do that, we increase the integrated variance, since it is inversely proportional to h. Therefore, as n increases, h should vary in such a way that each of the components of the MISE becomes smaller.
Moreover, the following corollary gives a particularly useful expression.
Corollary 2.3.5. The AMISE-optimal bandwidth, h_AMISE, has the closed form
\[
h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5}. \tag{2.3.4}
\]

Proof. Differentiating Eqn. (2.3.3) with respect to h and setting the derivative equal to zero gives
\[
\frac{d}{dh}\big(\mathrm{AMISE}\,\hat{f}\big) = -(nh^2)^{-1}R(K) + h^3\mu_2^2(K)R(f'') = 0,
\]
so
\[
h^5\mu_2^2(K)R(f'') = n^{-1}R(K),
\]
and therefore
\[
h_{\mathrm{AMISE}} = \left(\frac{R(K)}{n\,\mu_2^2(K)R(f'')}\right)^{1/5}.
\]
Examining this expression, we see that h_AMISE depends on the known quantities K and n and is inversely proportional to R(f'')^{1/5}. The quantity R(f'') measures the total curvature of f. So if R(f'') is small, f has little curvature and the bandwidth h will be large; conversely, h will be small if R(f'') is large.
The previous expression for the optimal h could be used to choose a good bandwidth if R(f'') were known, but in practice it is not. We will therefore investigate methods for selecting h based on estimating R(f''), in Chapter 4.
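Eqn. (2.3.4) is straightforward to evaluate once R(K), µ_2(K) and R(f'') are known. The small sketch below does this for the standard normal kernel and an N(0, σ²) target density, for which R(f'') = 3/(8√π σ⁵) (the value used again in Lemma 4.2.1); the chosen n and σ are illustrative only.

import numpy as np

def h_amise(RK, mu2K, Rf2, n):
    """AMISE-optimal bandwidth, Eqn. (2.3.4)."""
    return (RK / (mu2K**2 * Rf2 * n)) ** 0.2

# standard normal kernel: R(K) = 1/(2*sqrt(pi)), mu_2(K) = 1
RK, mu2K = 1 / (2 * np.sqrt(np.pi)), 1.0

# N(0, sigma^2) true density: R(f'') = 3 / (8 * sqrt(pi) * sigma^5)
sigma, n = 1.0, 1000
Rf2 = 3 / (8 * np.sqrt(np.pi) * sigma**5)

print(h_amise(RK, mu2K, Rf2, n))   # equals (4/(3n))^(1/5)*sigma, about 0.266 here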
Corollary 2.3.6.
\[
\inf_{h>0}\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = \frac{5}{4}\big(\mu_2^2(K)R(K)^4R(f'')\big)^{1/5} n^{-4/5}. \tag{2.3.5}
\]

Proof. Substituting Eqn. (2.3.4) into Eqn. (2.3.3) yields
\[
\begin{aligned}
\mathrm{AMISE}\{\hat{f}(\cdot;h_{\mathrm{AMISE}})\} &= (nh_{\mathrm{AMISE}})^{-1}\Big(R(K) + \frac{n}{4}h_{\mathrm{AMISE}}^5\mu_2^2(K)R(f'')\Big) \\
&= \frac{5}{4}(nh_{\mathrm{AMISE}})^{-1}R(K) \qquad \text{(since } n h_{\mathrm{AMISE}}^5\mu_2^2(K)R(f'') = R(K)\text{)} \\
&= \frac{5}{4}\, n^{-4/5}\, R(K)^{4/5}\big(\mu_2^2(K)R(f'')\big)^{1/5}.
\end{aligned}
\]
Since the AMISE is minimized at h_AMISE (f and K are non-negative), taking the infimum over h > 0 gives
\[
\inf_{h>0}\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = \frac{5}{4}\big(\mu_2^2(K)R(K)^4R(f'')\big)^{1/5} n^{-4/5}.
\]
In terms of the MISE itself, and using the asymptotic notation, we can rewrite Eqns. (2.3.4) and (2.3.5) as
\[
h_{\mathrm{MISE}} \sim \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5} \tag{2.3.6}
\]
and
\[
\inf_{h>0}\mathrm{MISE}\{\hat{f}(\cdot;h)\} \sim \frac{5}{4}\big(\mu_2^2(K)R(K)^4R(f'')\big)^{1/5} n^{-4/5}. \tag{2.3.7}
\]
In fact, Eqn. (2.3.5) gives the smallest possible AMISE for estimation of f using the kernel K. As the last expression for the infimum of the MISE shows, the best rate of convergence of the MISE of the kernel estimator is of order n^{-4/5}. This rate is slower than n^{-1}, the typical rate of convergence of the MSE in parametric estimation. For illustration, we introduce the following example; some of the ideas in this example belong to parametric estimation and would need to be studied in detail elsewhere.
Example 2.3.1. Consider a random sample X_1, X_2, ..., X_n from the N(µ, σ²) distribution and the parameter exp(µ). The maximum likelihood estimator of exp(µ) is exp(\bar{X}) [2], where \bar{X} is the sample mean, \bar{X} = n^{-1}\sum_{i=1}^{n}X_i. We are interested in finding MSE\{exp(\bar{X})\}. Given that the moment-generating function of \bar{X} [4] is
\[
M(t) = \exp\{\mu t + (\sigma^2/n)t^2/2\},
\]
we have
\[
\begin{aligned}
\mathrm{MSE}\{\exp(\bar{X})\} &= \mathrm{Var}\{e^{\bar{X}}\} + \big(E e^{\bar{X}} - e^{\mu}\big)^2 \\
&= E\big[e^{2\bar{X}}\big] - \big(E e^{\bar{X}}\big)^2 + \big(E e^{\bar{X}} - e^{\mu}\big)^2 \\
&= E\big[e^{2\bar{X}}\big] - 2e^{\mu}E e^{\bar{X}} + e^{2\mu} \\
&= e^{2\mu + 2\sigma^2/n} - 2e^{\mu}e^{\mu + \sigma^2/(2n)} + e^{2\mu} \\
&= e^{2\mu}\big[e^{2\sigma^2/n} - 2e^{\sigma^2/(2n)} + 1\big].
\end{aligned}
\]
Now, using the power series expansion of the exponential function, e^x = \sum_{k=0}^{\infty} x^k/k!, we get
\[
\begin{aligned}
\mathrm{MSE}\{\exp(\bar{X})\} &= e^{2\mu}\Big(\big[1 + 2\sigma^2 n^{-1} + 4\sigma^4 n^{-2}/2! + \cdots\big] - 2\big[1 + \sigma^2/(2n) + \sigma^4/(8n^2) + \cdots\big] + 1\Big) \\
&= e^{2\mu}\Big(\sigma^2 n^{-1} + \frac{7}{4}\sigma^4 n^{-2} + \cdots\Big) \\
&= \sigma^2 e^{2\mu} n^{-1}\Big(1 + \frac{7}{4}\sigma^2 n^{-1} + \cdots\Big) \\
&\sim \sigma^2 e^{2\mu} n^{-1}.
\end{aligned}
\]
So the rate of convergence of the MSE is of order n^{-1}, which is typical for the MSE in parametric estimation.
2.4 Exact MISE Calculation
We showed in the previous section some advantages of the AMISE\{\hat{f}(\cdot;h)\} formula.
Note that AMISE\{\hat{f}(\cdot;h)\} is only a large sample approximation to MISE\{\hat{f}(\cdot;h)\}, which is given by Eqn. (2.2.6).
In some cases we need to analyze the exact finite sample performance of the kernel density estimator for a particular K and f. To do this, MISE\{\hat{f}(\cdot;h)\} can be computed using Eqn. (2.2.6). In order to avoid dealing with the several integrals there, f and K can be chosen so that Eqn. (2.2.6) can be computed exactly.
Lemma 2.4.1. Let φ_σ(x − µ) denote the N(µ, σ²) density, φ_σ(x − µ) = (2πσ²)^{-1/2} exp{−(x − µ)²/(2σ²)}. Then the algebraic identity
\[
\phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu') = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu')\;\phi_{\sigma\sigma'/(\sigma^2+\sigma'^2)^{1/2}}(x-\mu^*),
\]
where µ* = (σ'²µ + σ²µ')/(σ² + σ'²), is valid.

Proof. We begin with the formula for the N(µ, σ²) density,
\[
\phi_\sigma(x-\mu) = (2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/(2\sigma^2)\}.
\]
Then
\[
\phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu') = (2\pi\sigma\sigma')^{-1}\exp\Big\{-\tfrac{1}{2}\big((x-\mu)^2/\sigma^2 + (x-\mu')^2/\sigma'^2\big)\Big\}.
\]
Expanding the squares, collecting terms, and completing the square in x in the exponent shows that
\[
\frac{(x-\mu)^2}{\sigma^2} + \frac{(x-\mu')^2}{\sigma'^2} = \frac{(x-\mu^*)^2}{\sigma^2\sigma'^2/(\sigma^2+\sigma'^2)} + \frac{(\mu-\mu')^2}{\sigma^2+\sigma'^2},
\]
where µ* = (σ'²µ + σ²µ')/(σ² + σ'²). Rewriting the product accordingly gives
\[
\big(2\pi(\sigma^2+\sigma'^2)\big)^{-1/2}\exp\Big\{\frac{-(\mu-\mu')^2}{2(\sigma^2+\sigma'^2)}\Big\}\cdot\Big(\frac{2\pi\sigma^2\sigma'^2}{\sigma^2+\sigma'^2}\Big)^{-1/2}\exp\Big\{\frac{-(x-\mu^*)^2}{2\sigma^2\sigma'^2/(\sigma^2+\sigma'^2)}\Big\}.
\]
Thus
\[
\phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu') = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu')\;\phi_{\sigma\sigma'/(\sigma^2+\sigma'^2)^{1/2}}(x-\mu^*).
\]
26
Theorem 2.4.2.
\[
\int \phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu')\,dx = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu'), \tag{2.4.1}
\]
where φ_σ(x − µ) = (2πσ²)^{-1/2} exp{−(x − µ)²/(2σ²)}.

Proof. Integrating the identity in the lemma with respect to x gives
\[
\int \phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu')\,dx = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu')\int \phi_{\sigma\sigma'/(\sigma^2+\sigma'^2)^{1/2}}(x-\mu^*)\,dx.
\]
The integral on the right-hand side equals one, since φ_{σσ'/(σ²+σ'²)^{1/2}}(x − µ*) is the density of the N(µ*, σ²σ'²/(σ² + σ'²)) distribution.
Example 2.4.1. Choose K to be the N(0, 1) density and f to be the N(0, σ²) density. Then
\[
K_h(x) = \phi_h(x) \quad \text{and} \quad f(x) = \phi_\sigma(x).
\]
We now compute each term on the right-hand side of Eqn. (2.2.6). First,
\[
\int K^2(x)\,dx = \int \phi^2(x)\,dx = \phi_{\sqrt{2}}(0) = (2\pi^{1/2})^{-1}.
\]
Also,
\[
\int K_h(x-y)f(y)\,dy = \int \phi_h(y-x)\phi_\sigma(y)\,dy = \phi_{(h^2+\sigma^2)^{1/2}}(x).
\]
Therefore the integral in the second term becomes
\[
\int \phi^2_{(h^2+\sigma^2)^{1/2}}(x)\,dx = \phi_{(2h^2+2\sigma^2)^{1/2}}(0) = (2\pi^{1/2})^{-1}(\sigma^2+h^2)^{-1/2}.
\]
The integral in the third term becomes
\[
\int \phi_{(h^2+\sigma^2)^{1/2}}(x)\,\phi_\sigma(x)\,dx = \phi_{(h^2+2\sigma^2)^{1/2}}(0) = \big(2\pi(h^2+2\sigma^2)\big)^{-1/2}.
\]
The last term is
\[
\int f^2(x)\,dx = \int \phi^2_\sigma(x)\,dx = (4\pi\sigma^2)^{-1/2}.
\]
Finally, substituting into Eqn. (2.2.6) yields
\[
2\pi^{1/2}\,\mathrm{MISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1} + (1-n^{-1})(h^2+\sigma^2)^{-1/2} + \sigma^{-1} - 2^{3/2}(2\sigma^2+h^2)^{-1/2}.
\]
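The closed-form expression above can be evaluated directly. Below is a small sketch assuming the same normal kernel and N(0, σ²) density; the values of n and σ and the grid of bandwidths are arbitrary choices for illustration.

import numpy as np

def exact_mise_normal(h, sigma, n):
    """Exact MISE for K = N(0,1) kernel and f = N(0, sigma^2), Example 2.4.1."""
    rhs = (1 / (n * h)
           + (1 - 1 / n) * (h**2 + sigma**2) ** -0.5
           + 1 / sigma
           - 2 ** 1.5 * (2 * sigma**2 + h**2) ** -0.5)
    return rhs / (2 * np.sqrt(np.pi))

h_grid = np.linspace(0.05, 1.0, 200)
mise = exact_mise_normal(h_grid, sigma=1.0, n=100)
print(h_grid[np.argmin(mise)])   # close to the asymptotic value (4/(3n))^(1/5) ≈ 0.42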
27
2.5 Canonical Kernel And Optimal Kernel Theory
In the previous sections we worked with the kernel function K under certain assumptions: it was assumed to be a symmetric and unimodal density.
Besides simplicity, there are other considerations because of which density estimators based on kernels that do not satisfy these assumptions are ignored. Many kernel functions satisfy these assumptions, so one may ask:
Are all kernel functions that satisfy these requirements equally appropriate?
Do they perform the same role with the same degree of effectiveness?
In what follows, we concentrate on investigating the effect of the shape of the kernel function K.
Consider the formula for AMISE\{\hat{f}(\cdot;h)\} in Eqn. (2.3.3). In this formula, the scaling of K is entangled with the bandwidth h, which causes difficulty in optimizing with respect to K. If we choose a re-scaling of K of the form
\[
K_\delta(\cdot) = (1/\delta)K(\cdot/\delta),
\]
the dependence on K and on h can be separated. The following lemma shows how this can be done.
Lemma 2.5.1. R(K_δ) = µ_2^2(K_δ) holds if and only if δ = δ_0 = {R(K)/µ_2^2(K)}^{1/5}.

Proof. (i) Assume that R(K_δ) = µ_2^2(K_δ). From the re-scaling K_δ(·) = (1/δ)K(·/δ) we have R(K_δ) = (1/δ)R(K) and µ_2^2(K_δ) = δ⁴µ_2^2(K), so
\[
(1/\delta)R(K) = \delta^4\mu_2^2(K).
\]
This yields δ⁵ = R(K)/µ_2^2(K), which implies δ = {R(K)/µ_2^2(K)}^{1/5}.
(ii) Now assume that δ = δ_0 = {R(K)/µ_2^2(K)}^{1/5}. Starting with R(K_{δ_0}),
\[
R(K_{\delta_0}) = \delta_0^{-1}R(K) = \{\mu_2^2(K)/R(K)\}^{1/5}R(K) = \mu_2^{2/5}(K)R^{4/5}(K)
= \{R(K)/\mu_2^2(K)\}^{4/5}\mu_2^2(K) = \delta_0^{4}\mu_2^2(K) = \mu_2^2(K_{\delta_0}).
\]
Theorem 2.5.2. Let the re-scaled kernel K_{δ_0} be used in the kernel estimator, where δ_0 = {R(K)/µ_2^2(K)}^{1/5}, so that R(K_{δ_0}) = µ_2^2(K_{δ_0}). Then
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = C(K_{\delta_0})\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\}, \tag{2.5.1}
\]
where C(K) = {R(K)^4\mu_2^2(K)}^{1/5}.

Proof. With the kernel K_{δ_0},
\[
\begin{aligned}
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} &= (nh)^{-1}R(K_{\delta_0}) + \frac{1}{4}h^4\mu_2^2(K_{\delta_0})R(f'') \\
&= R(K_{\delta_0})\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\} \qquad \text{(since } R(K_{\delta_0}) = \mu_2^2(K_{\delta_0})\text{)} \\
&= \{R^4(K_{\delta_0})\mu_2^2(K_{\delta_0})\}^{1/5}\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\} \\
&= C(K_{\delta_0})\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\},
\end{aligned}
\]
where the third equality uses R(K_{δ_0}) = {R^4(K_{δ_0})µ_2^2(K_{δ_0})}^{1/5}, which holds because R(K_{δ_0}) = µ_2^2(K_{δ_0}).
Definition 2.5.1. We say that C(K) is invariant to re-scaling of K if C(K_{δ_1}) = C(K_{δ_2}) for any δ_1, δ_2 > 0.

Remark 2.5.1. C(K) is invariant to re-scaling of K.

Proof. We have to show that C(K_{δ_1}) = C(K_{δ_2}) for any δ_1, δ_2 > 0. Starting with C(K_{δ_1}),
\[
C(K_{\delta_1}) = \{R^4(K_{\delta_1})\mu_2^2(K_{\delta_1})\}^{1/5}
= \{\delta_1^{-4}R^4(K)\cdot\delta_1^{4}\mu_2^2(K)\}^{1/5} = \{R^4(K)\mu_2^2(K)\}^{1/5}
= \{\delta_2^{-4}R^4(K)\cdot\delta_2^{4}\mu_2^2(K)\}^{1/5} = \{R^4(K_{\delta_2})\mu_2^2(K_{\delta_2})\}^{1/5}
= C(K_{\delta_2}).
\]
Definition 2.5.2. For the class {Kδ : δ > 0} of re-scalings of K, the unique memberof that class that separates the dependence of K and h in Eqn.(2.3.3) is called thecanonical kernel, and denoted by Kc = Kδ0 .
Example 2.5.1. For the KDE
\[
\hat{f}(x;h) = n^{-1}\sum_{i=1}^{n}K_h(x - X_i),
\]
take K = φ, the standard normal kernel. We compute C(K_{δ_0}) in the AMISE\{\hat{f}(\cdot;h)\} formula. Using Lemma (2.5.1) we obtain δ_0 = (4π)^{-1/10}, since
\[
R(K) = \int \phi^2(x)\,dx = \phi_{\sqrt{2}}(0) = (4\pi)^{-1/2}
\]
and, for the standard normal kernel, µ_2^2(K) = 1.
The canonical kernel for the class {φ_δ : δ > 0} is
\[
\phi^c(x) = \phi_{(4\pi)^{-1/10}}(x), \quad \text{which implies} \quad C(\phi) = (4\pi)^{-2/5}.
\]
Then
\[
\mathrm{AMISE}\{\hat{f}^c(\cdot;h)\} = (4\pi)^{-2/5}\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\}
\]
for the estimator \hat{f}^c(x;h) = n^{-1}\sum_{i=1}^{n}\phi^c_h(x - X_i).
30
Canonical kernels have a very useful advantage: they enable a pictorial comparison of density estimates based on differently shaped kernels using the same bandwidth h. This is illustrated in Figure (2.5.1).

Figure (2.5.1)

In Figure (2.5.1)(a), the solid curve is the kernel density estimate using the standard normal kernel and the dashed curve uses the Epanechnikov kernel K*; the same bandwidth is used for both.
In Figure (2.5.1)(b), the canonical kernels of the normal and Epanechnikov kernels are used, again with the same bandwidth.
From this illustration we deduce that if we use the same bandwidth with different kernels, very different estimates result, whereas if we use the same bandwidth with different canonical kernels, the estimates are nearly identical.
31
Canonical kernels can also simplify the optimization over the kernel shape. By Eqn. (2.5.1), it is enough to choose K to minimize C(K_{δ_0}) subject to
\[
\int K(x)\,dx = 1, \qquad \int xK(x)\,dx = 0, \qquad \int x^2K(x)\,dx = a^2 < \infty,
\]
and K(x) ≥ 0 for all x. The solution is
\[
K_a(x) = \frac{3}{4}\,\{1 - x^2/(5a^2)\}/(5^{1/2}a)\;\mathbf{1}_{\{|x| < 5^{1/2}a\}},
\]
where a is an arbitrary scale parameter (Hodges and Lehmann, 1956).
If a² = 1/5, then we get the simplest form of K_a,
\[
K^*(x) = \frac{3}{4}(1 - x^2)\,\mathbf{1}_{\{|x| < 1\}}, \qquad \text{where } \mathbf{1}_{\{|x|<1\}} = \begin{cases}1 & |x| < 1, \\ 0 & \text{otherwise.}\end{cases} \tag{2.5.2}
\]
This kernel is called the Epanechnikov kernel (its graph appears in Section (1.2)).
We now introduce the useful ratio (C(K*)/C(K))^{5/4}.

Definition 2.5.3. The ratio (C(K*)/C(K))^{5/4} represents the ratio of sample sizes necessary to obtain the same minimum AMISE (for a given f) when using K* as when using K; it is called the efficiency of K relative to K*.

Example 2.5.2. If the efficiency of K is 0.97, this means that using the optimal kernel K* requires only 97% of the sample size needed with K in order to reach the same minimum AMISE.
The table below shows values of the previous ratio for various popular kernels K.
Kernel {C(K∗)/C(K)}5/4
Epanechnikov 1.000
Biweight 0.994
Triweight 0.987
Normal 0.951
Triangular 0.986
Uniform 0.930
Table (2.5.1)
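The entries of Table (2.5.1) follow directly from C(K) = {R(K)⁴µ_2²(K)}^{1/5}. The sketch below reproduces the value for the normal kernel; it assumes SciPy is available for the numerical integration and is only a check of the table, not part of the original text.

import numpy as np
from scipy.integrate import quad

def C(K, support=(-np.inf, np.inf)):
    """C(K) = {R(K)^4 * mu_2(K)^2}^(1/5)."""
    RK = quad(lambda x: K(x) ** 2, *support)[0]          # R(K)
    mu2 = quad(lambda x: x ** 2 * K(x), *support)[0]      # mu_2(K)
    return (RK ** 4 * mu2 ** 2) ** 0.2

epan = lambda x: 0.75 * (1 - x**2) if abs(x) < 1 else 0.0     # K*
norm = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print((C(epan, (-1, 1)) / C(norm)) ** 1.25)   # about 0.951, the table entry for the normal kernel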
32
2.6 Measuring How Difficult A Density Is To Estimate

We have seen that the KDE is an effective way of estimating many density shapes. In spite of this, one may encounter problems when trying to estimate certain density shapes using the KDE. These difficulties appear because the KDE depends on a single smoothing parameter, as the following example illustrates.
Example 2.6.1. A random sample of size n = 1000 is drawn from the lognormaldensity f(x) = φ(lnx)/x. By using the standard normal kernel, the solid curves infigure (2.6.1) illustrates the kernel estimates of the lognormal density which is showedby the dashed curve.
Figure(2.6.1)
33
In Figure (2.6.1)(a) a small bandwidth, h = 0.05, is used, resulting in an undersmoothed estimate.
In Figure (2.6.1)(b) a relatively large bandwidth, h = 0.45, is used, resulting in an oversmoothed estimate.
In Figure (2.6.1)(c) the performance of the kernel estimator becomes better, with h = 0.15.
A still better result would be desirable, but it does not seem attainable directly with the usual form of the KDE.
Here we try to answer the following question: how well can a particular density be estimated using the kernel density estimator?
We need to describe this degree of difficulty by a quantity that is always valid. To do this, recall the formula (2.3.6) for a symmetric probability density kernel K:
\[
\inf_{h>0}\mathrm{MISE}\{\hat{f}(\cdot;h)\} \sim \frac{5}{4}\,C(K)\,R(f'')^{1/5}\,n^{-4/5}. \tag{2.6.1}
\]
We defined R(f'') = ∫ f''(x)² dx, so the dependence on f is through the second derivative which, as noted in Section (2.3), reflects the curvature of f. Here R(f'') represents the total curvature of f, arising from features such as skewness or modes. Hence we expect a more difficult estimation problem when |f''(x)| takes large values, and vice versa. This gives a way to quantify the degree of difficulty of estimating awkward density shapes.
Here we restrict attention to densities with a continuous, square integrable second derivative over the whole real line.
Definition 2.6.1. A measure of the degree of difficulty of kernel estimation of f is D(f), given by
\[
D(f) = \big(\sigma(f)^5\,R(f'')\big)^{1/4},
\]
where σ(f) is the population standard deviation of f.

It is found (Terrell, 1990) that D(f) is minimal when
\[
f^*(x) = \frac{35}{32}(1 - x^2)^3\,\mathbf{1}_{(|x|<1)},
\]
the Beta(4, 4) density, for which the minimum value of σ(f)^5R(f'') is 35/243 (its graph appears in Section (1.2)). Thus the density f* is the easiest one to estimate.
As in the last section, a useful ratio can be employed here.

Definition 2.6.2. The efficiency of the kernel estimator for estimating a density f, relative to estimating the density f*, is defined to be D(f*)/D(f).
The table below shows the values of D(f ∗)/D(f) for several densities.
Name                              Density                                              D(f*)/D(f)
(a) Beta(4,4)                     (35/32)(1 − x²)³ 1(|x|<1)                            1
(b) Normal                        (2π)^{-1/2} exp{−x²/2}                               0.908
(c) Normal mixture density (1)    (3/4)N(0, 1) + (1/4)N(3/2, (1/3)²)                   0.568
(d) Normal mixture density (2)    (1/2)N(−1, 4/9) + (1/2)N(1, 4/9)                     0.536
(e) Gamma(3)                      Γ(3)^{-1} x² e^{−x} 1{x>0}                           0.327
(f) Normal mixture density (3)    (2/3)N(0, 1) + (1/3)N(0, 1/100)                      0.114
(g) Lognormal                     x^{-1}(2π)^{-1/2} exp{−(ln x)²/2}                    0.053

Table (2.6.1)
Some of the graphs of the densities in table (2.6.1) are introduced in the following
page.
35
Figure (2.6.2) (c)(d)(e)(f)(g)
Chapter 3
Modifications of the kernel density estimator

Among the ways to improve the performance of the KDE and to broaden its field of application are adaptations of its basic form.
In what follows we introduce three types of modification of the kernel density estimator. It should be said that many issues concerning these modifications are still unsettled, so we restrict our view to their definitions and some of their properties.
3.1 Local Kernel Density Estimator
Recall that the basic kernel density estimator has a single smoothing parameter over the whole real line, which makes it inadequate for estimating some density shapes. A natural adaptation is to let h vary with the point x at which f(x) is estimated.

Definition 3.1.1. One modified form of the basic KDE is
\[
\hat{f}_L(x; h(x)) = (n\,h(x))^{-1}\sum_{i=1}^{n}K\!\left(\frac{x - X_i}{h(x)}\right). \tag{3.1.1}
\]
It is called the local kernel density estimator. A minimal code sketch is given below.

The word local is used because \hat{f}_L uses a different basic kernel estimate at each point x at which the density is estimated.
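A minimal Python sketch of Eqn. (3.1.1). The bandwidth function h(x) used here (a fixed-bandwidth pilot estimate, inflated where the pilot density is small) is a purely illustrative choice, not a rule prescribed by the text.

import numpy as np

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def local_kde(x, data, h_of_x):
    """f_L(x; h(x)) = (n h(x))^{-1} sum_i K((x - X_i)/h(x)), Eqn. (3.1.1)."""
    h = h_of_x(x)
    return phi((x - data) / h).mean() / h

rng = np.random.default_rng(2)
data = rng.lognormal(size=500)

pilot = lambda x: phi((x - data) / 0.3).mean() / 0.3           # fixed-h pilot estimate
h_of_x = lambda x: 0.2 / np.sqrt(max(pilot(x), 1e-3))          # larger h where f seems small (illustrative)

for x in (0.5, 1.0, 3.0):
    print(x, local_kde(x, data, h_of_x))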
36
37
Figure (3.1.1) shows the values of \hat{f}_L at two different points v and u. At v, \hat{f}_L(v, h(v)) is formed by averaging the dotted kernels. At u, \hat{f}_L(u, h(u)) is formed by averaging the dashed kernels.

Figure (3.1.1)

Remark 3.1.1. Since h is a function of x, \hat{f}_L need not be a density function.
As we derived the optimal bandwidth for the basic estimator, we do the same for the local KDE.

Theorem 3.1.1. The bandwidth that minimizes the asymptotic MSE at x is
\[
h_{\mathrm{AMSE}}(x) = \left(\frac{R(K)f(x)}{\mu_2^2(K)f''(x)^2\,n}\right)^{1/5}, \qquad \text{provided } f''(x) \neq 0. \tag{3.1.2}
\]

Proof. Minimizing AMSE\{\hat{f}(x;h)\} = (nh)^{-1}R(K)f(x) + \tfrac{1}{4}h^4\mu_2^2(K)f''(x)^2 over h, exactly as in Corollary (2.3.5) with R(K) replaced by R(K)f(x) and R(f'') by f''(x)², gives the stated h_AMSE(x). Substituting this value into AMSE\{\hat{f}(x;h)\}, we get
\[
\begin{aligned}
\mathrm{AMSE}\{\hat{f}_L(x;h)\} &= \frac{1}{nh}\Big(R(K)f(x) + \frac{n}{4}h^5\mu_2^2(K)f''(x)^2\Big) \\
&= \frac{5}{4}\big(R(K)f(x)\big)^{4/5}\big(\mu_2^2(K)f''(x)^2\big)^{1/5}n^{-4/5} \\
&= \frac{5}{4}\big(R(K)^4\mu_2^2(K)\big)^{1/5}\big(f^2(x)f''(x)\big)^{2/5}n^{-4/5},
\end{aligned}
\]
and integrating over all x we get
\[
\mathrm{AMISE}\{\hat{f}_L(\cdot;h(\cdot))\} = \frac{5}{4}\big(R(K)^4\mu_2^2(K)\big)^{1/5}R\big((f^2f'')^{1/5}\big)\,n^{-4/5}. \tag{3.1.3}
\]
If we compare the rates of convergence of AMISE\{\hat{f}_L(\cdot;h(\cdot))\} and AMISE\{\hat{f}(\cdot;h)\} in (3.1.3) and (2.3.5), respectively, we find that they are the same, namely n^{-4/5}. In this sense there is no improvement in the rate from using \hat{f}_L(\cdot;h(\cdot)).
To be accurate, however, the following remark shows that, in spite of the unchanged rate of convergence, there is always some improvement in the constant if h(x) is chosen optimally.

Remark 3.1.2. R((f²f'')^{1/5}) ≤ R^{1/5}(f'') for all f.

Proof. f is a non-negative, real-valued, integrable function. Applying Hölder's inequality with p = 5 and q = 5/4 [2] yields
\[
\begin{aligned}
R\big((f^2f'')^{1/5}\big) &= \int \big(f^2(x)f''(x)\big)^{2/5}dx = \int \big(f^2(x)\big)^{2/5}\big(f''(x)\big)^{2/5}dx \\
&\le \left(\int\big(f^{4/5}(x)\big)^{5/4}dx\right)^{4/5}\left(\int\big(f''(x)^{2/5}\big)^{5}dx\right)^{1/5} \\
&= \left(\int f(x)\,dx\right)^{4/5}\left(\int f''(x)^{2}\,dx\right)^{1/5} = R^{1/5}(f'').
\end{aligned}
\]
39
3.2 Variable Kernel Density Estimator
In this kind of modified estimator, the bandwidth depends on the data points rather than on the point x at which f(x) is estimated. More precisely, h is replaced by a function α (say) of X_i.

Definition 3.2.1. The modified kernel estimator in which the smoothing parameter is defined by the n values α(X_i), i = 1, ..., n, is
\[
\hat{f}_V(x;\alpha) = \frac{1}{n}\sum_{i=1}^{n}\big(\alpha(X_i)\big)^{-1}K\!\left(\frac{x - X_i}{\alpha(X_i)}\right). \tag{3.2.1}
\]
It is called the variable KDE.

We see from this formula that each data point X_i has a kernel centered at it with its own scale parameter α(X_i), which is a function of X_i. This is where the name variable kernel comes from.
Figure (3.2.1) shows that \hat{f}_V(\cdot;\alpha) is formed by averaging these kernels.

Figure (3.2.1)
40
Remark 3.2.1. By its definition, \hat{f}_V(x;\alpha) is itself a probability density.

Here the smoothing depends on the locations of the data points relative to one another. In particular, this construction is concerned with spreading smoothing mass around the data in regions away from the main body of the data. A minimal code sketch appears below.
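A minimal Python sketch of Eqn. (3.2.1). The particular choice of α(X_i), inversely proportional to the square root of a fixed-bandwidth pilot estimate at X_i (an Abramson-type "square-root law"), is one common possibility used here only as an illustration; the text itself does not prescribe a specific α.

import numpy as np

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def variable_kde(x, data, alpha):
    """f_V(x; alpha) = n^{-1} sum_i alpha(X_i)^{-1} K((x - X_i)/alpha(X_i)), Eqn. (3.2.1)."""
    return np.mean(phi((x - data) / alpha) / alpha)

rng = np.random.default_rng(3)
data = rng.lognormal(size=500)

h0 = 0.4
pilot = np.array([phi((xi - data) / h0).mean() / h0 for xi in data])   # fixed-h pilot at each X_i
alpha = h0 / np.sqrt(pilot / np.exp(np.mean(np.log(pilot))))           # square-root-law scaling (illustrative)

for x in (0.5, 1.0, 3.0):
    print(x, variable_kde(x, data, alpha))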
3.3 Transformation Kernel Density Estimators
In some cases we encounter densities that are difficult to estimate. We can get around this problem by transforming the data, estimating on the transformed scale, and then back-transforming. This is carried out as follows.
Let the random sample X_1, X_2, ..., X_n have a density f which is difficult to estimate. Suppose the transformation is given by Y_i = t(X_i), where t is an increasing differentiable function defined on the support of f. The result is a new random sample Y_1, Y_2, ..., Y_n with a new density g (say), which is easier to estimate by the basic kernel density estimator \hat{g} than f is.
From statistical distribution theory [2] we can write
\[
f(x) = g(t(x))\,t'(x).
\]
Replacing g by \hat{g} gives
\[
\hat{f}_T(x;h,t) = \frac{1}{n}\sum_{i=1}^{n}K_h\big(t(x) - t(X_i)\big)\,t'(x). \tag{3.3.1}
\]
By the mean value theorem there exists ξ_i lying between x and X_i such that t(x) − t(X_i) = t'(ξ_i)(x − X_i), so Eqn. (3.3.1) becomes
\[
\hat{f}_T(x;h,t) = \frac{1}{n}\sum_{i=1}^{n}\big(t'(x)/h\big)K\!\left(\frac{t'(\xi_i)(x - X_i)}{h}\right).
\]
41
Example 3.3.1. Assume we want to estimate the lognormal density. From Table (2.6.1), it is the most difficult of the listed densities to estimate by the usual kernel methods. Applying the transformation Y_i = ln X_i, where ln x is an increasing function on its domain, gives a random sample Y_i from the N(0, 1) distribution. We can deduce this from a simple calculation:
\[
\begin{aligned}
f(x) &= g(t(x))\,t'(x), \\
(2\pi)^{-1/2}x^{-1}\exp\{-(\ln x)^2/2\} &= x^{-1}g(\ln x), \\
g(\ln x) &= (2\pi)^{-1/2}\exp\{-(\ln x)^2/2\}, \qquad \text{so } g(y) = (2\pi)^{-1/2}\exp\{-y^2/2\}.
\end{aligned}
\]
In Figure (3.3.1)(a) the solid curve is a kernel estimate of the new normal density (i.e., after transformation), with n = 1000, and the dashed curve is the true normal density.

Figure (3.3.1)

The solid curve in Figure (3.3.1)(b) is the estimate of the lognormal density using the transformation KDE. In other words, if we back-transform the kernel density estimate in Figure (3.3.1)(a) via t^{-1}(x) = e^x, we obtain the estimate in Figure (3.3.1)(b). A minimal code sketch of this example is given below.
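A minimal Python sketch of the transformation estimator of Eqn. (3.3.1) for this example, with t(x) = ln x and a standard normal kernel; the bandwidth is an arbitrary illustrative value.

import numpy as np

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def transformation_kde(x, data, h, t=np.log, t_prime=lambda x: 1.0 / x):
    """f_T(x; h, t) = n^{-1} sum_i K_h(t(x) - t(X_i)) * t'(x), Eqn. (3.3.1)."""
    return phi((t(x) - t(data)) / h).mean() / h * t_prime(x)

rng = np.random.default_rng(4)
data = rng.lognormal(size=1000)              # X_i with density phi(ln x)/x

true = lambda x: phi(np.log(x)) / x
for x in (0.3, 1.0, 3.0):
    print(x, transformation_kde(x, data, h=0.25), true(x))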
42
Choosing the transformation t in the last example seemed simple and direct, but this is not always so, because t depends on the shape of f, such as the number of modes, symmetry, and skewness. Choosing t therefore requires some experience and a good understanding of the properties of the functions used in the transformation.
Chapter 4
Bandwidth Selection
4.1 Introduction
Since we have seen the influence of h on the performance of the kernel density estimator, it is time to concentrate on the specification of the bandwidth h.
In some situations it is enough to choose h by looking at several density estimates over a range of bandwidths and selecting the one that seems most appropriate in some sense; we can begin with a large bandwidth and then decrease the amount of smoothing. This approach is useful especially when there are reasonable expectations about, or background knowledge of, the structure of the data.
In many cases, however, there is no prior information about the structure of the data, or even an intuition about the bandwidth h that would give the best result.

Definition 4.1.1. A method that uses the data X_1, X_2, ..., X_n to produce a bandwidth \hat{h} is called a bandwidth selector.
In general, bandwidth selectors can be divided into two types.
The first consists of selectors with simple formulae, which make it easy to find a bandwidth in many situations, but without any mathematical guarantee that the chosen h is close to optimal. These selectors are called quick and simple selectors. The second type of bandwidth selectors are based on more involved mathematical arguments and require more computational effort; they are called hi-tech selectors.
Throughout this chapter we discuss selectors of both types, with the aim of minimizing MISE\{\hat{f}(\cdot;h)\}.
There remain, however, many unresolved issues in the field of bandwidth selection; it is still a wide-open field for research.
4.2 Quick And Simple Bandwidth Selectors
Here, we introduce the two common ways of finding the quick and easy bandwidth
selector of the kernel density estimator.
Moreover, the rules which we are going to use will be useful in the hi-tech bandwidth
selection.
4.2.1 Normal Scale Rules
Perhaps the simplest way to choose h is to assume that f has a particular parametric form and to use the corresponding optimal bandwidth.
For example, let us begin with the AMISE-optimal bandwidth for a normal density. As shown in Section (2.3), the bandwidth that asymptotically minimizes MISE\{\hat{f}(\cdot;h)\} is
\[
h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5}.
\]

Lemma 4.2.1. If the true density f is normal with variance σ², then
\[
h_{\mathrm{AMISE}} = \left(\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)\,n}\right)^{1/5}\sigma. \tag{4.2.1}
\]
45
Proof. Using the definition of ψ_r in Section (4.5) gives
\[
R(f'') = \int f''(x)^2\,dx = \int f^{(4)}(x)f(x)\,dx = \psi_4,
\]
so in this case r = 4. Theorem (4.6.1) yields
\[
\psi_4 = \frac{(-1)^2\,4!}{(2\sigma)^5\,2!\,\pi^{1/2}} = \frac{3}{8\sigma^5\pi^{1/2}}.
\]
Now, substituting into the formula for h_AMISE gives the result.

To estimate the bandwidth from a sample, replace σ by an estimate \hat{σ}; the normal scale bandwidth selector is then
\[
\hat{h}_{\mathrm{NS}} = \left(\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)\,n}\right)^{1/5}\hat{\sigma}, \tag{4.2.2}
\]
where \hat{σ} is an estimate of the unknown standard deviation σ of f, commonly the sample standard deviation s.
Normal scale bandwidth selectors give reasonable results when the data are close to normal. If the data are far from normality, the normal scale bandwidth tends to be too large, and the selector fails to work well.
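For the standard normal kernel, R(K) = (2√π)^{-1} and µ_2(K) = 1, so Eqn. (4.2.2) reduces to ĥ_NS = (4/(3n))^{1/5} σ̂ ≈ 1.06 σ̂ n^{-1/5}. A minimal sketch, taking σ̂ to be the sample standard deviation s (the simulated data are illustrative only):

import numpy as np

def h_ns(data):
    """Normal scale bandwidth, Eqn. (4.2.2), for the standard normal kernel."""
    n = len(data)
    sigma_hat = np.std(data, ddof=1)          # sample standard deviation s
    return (4 / (3 * n)) ** 0.2 * sigma_hat

rng = np.random.default_rng(5)
print(h_ns(rng.normal(size=1000)))            # roughly 1.06 * n^(-1/5) ≈ 0.27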
4.2.2 Oversmoothed bandwidth selection rules
The idea behind the oversmoothing principle is that there is a simple upper bound for the AMISE-optimal bandwidth for density estimation. The inequality
\[
h_{\mathrm{AMISE}} \le \left(\frac{243\,R(K)}{35\,\mu_2^2(K)\,n}\right)^{1/5}\sigma \tag{4.2.3}
\]
is valid for all densities having standard deviation σ (Terrell, 1990). Replacing σ by s leads to the oversmoothed bandwidth selector based on this bound,
\[
\hat{h}_{\mathrm{OS}} = \left(\frac{243\,R(K)}{35\,\mu_2^2(K)\,n}\right)^{1/5}s,
\]
where s is the sample standard deviation. The oversmoothed bandwidth gives a starting point for a subjective choice of the bandwidth: one can plot an estimate with bandwidth \hat{h}_{OS} and then produce further estimates using suitable fractions of \hat{h}_{OS}.
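A minimal sketch of the oversmoothed selector, again assuming the standard normal kernel (for which it is about 1.144 s n^{-1/5}); the simulated data and the fractions printed are illustrative, echoing Example 4.2.1 below.

import numpy as np

def h_os(data):
    """Oversmoothed bandwidth h_OS for the standard normal kernel (upper-bound rule)."""
    n = len(data)
    s = np.std(data, ddof=1)
    RK, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0
    return (243 * RK / (35 * mu2**2 * n)) ** 0.2 * s     # about 1.144 * s * n^(-1/5)

rng = np.random.default_rng(6)
data = rng.normal(size=1000)
h = h_os(data)
print(h, h / 2, h / 4, h / 8)      # h_OS and the fractions used in Example 4.2.1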
Example 4.2.1. Consider the Old Faithful data set, consisting of 107 eruption times in minutes for the Old Faithful Geyser in Yellowstone National Park (Silverman, 1986).
Figure (4.2.1)(a) shows the density estimate with \hat{h}_{OS} = 0.467, using the normal kernel.

Figure (4.2.1)

Figure (4.2.1)(b) shows the density estimate with bandwidth \hat{h}_{OS}/2. We note that the latter estimate keeps the two modes clearer than the first one does.
Figure (4.2.1)(c) uses bandwidth \hat{h}_{OS}/4; here roughness starts appearing in the estimate, and it becomes more pronounced in Figure (4.2.1)(d), where the bandwidth used is \hat{h}_{OS}/8.
We can choose \hat{h}_{OS}/2, as in Figure (4.2.1)(b), as the most appropriate bandwidth.
4.3 Least Squares Cross-Validation
With this section we start dealing with the second class of bandwidth selectors, labelled hi-tech bandwidth selectors.
This is a family of selectors based on cross-validation; in this section we study one member of the family, called Least Squares Cross-Validation (LSCV).
Its motivation comes from expanding the MISE of \hat{f}(\cdot;h):
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = E\int \hat{f}(x;h)^2dx - 2E\int \hat{f}(x;h)f(x)\,dx + \int f(x)^2dx.
\]
Here ∫ f(x)² dx does not depend on h, so we can equivalently minimize
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} - \int f(x)^2dx = E\left[\int \hat{f}(x;h)^2dx - 2\int \hat{f}(x;h)f(x)\,dx\right].
\]
An unbiased estimator of this quantity is [1]
\[
\mathrm{LSCV}(h) = \int \hat{f}(x;h)^2dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{-i}(X_i;h),
\]
where
\[
\hat{f}_{-i}(x;h) = (n-1)^{-1}\sum_{j\neq i}K_h(x - X_j)
\]
is the density estimate based on all of the data except X_i, and \hat{f}_{-i}(X_i;h) is its value at the data point X_i. It is called the "leave-one-out" density estimator.
48
Choosing the h that minimizes the objective function LSCV(h) yields a bandwidth selector denoted by \hat{h}_{LSCV}.
We can see from the formula for LSCV(h) that it requires a relatively large computational effort. Moreover, "studies have shown that the theoretical and practical performance of this bandwidth selector are somewhat disappointing" (Wand and Jones); this is because \hat{h}_{LSCV} has a high variance.
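For the standard normal kernel the first term of LSCV(h) has a closed form, ∫\hat{f}(x;h)² dx = n^{-2}ΣΣ φ_{√2 h}(X_i − X_j), which follows from Theorem 2.4.2, so the criterion can be evaluated exactly. Below is a brute-force sketch (O(n²) per bandwidth); the simulated data and the grid of candidate bandwidths are arbitrary choices.

import numpy as np

def phi(u, s=1.0):
    return np.exp(-(u / s) ** 2 / 2) / (s * np.sqrt(2 * np.pi))

def lscv(h, data):
    """LSCV(h) = int f_hat^2 - (2/n) sum_i f_hat_{-i}(X_i), normal kernel."""
    n = len(data)
    D = data[:, None] - data[None, :]                    # all pairwise X_i - X_j
    term1 = phi(D, np.sqrt(2) * h).sum() / n**2          # exact integral of f_hat^2
    Kh = phi(D, h)
    loo = (Kh.sum(axis=1) - phi(0.0, h)) / (n - 1)       # leave-one-out estimates at each X_i
    return term1 - 2 * loo.sum() / n

rng = np.random.default_rng(7)
data = rng.normal(size=200)
grid = np.linspace(0.05, 1.0, 60)
print(grid[np.argmin([lscv(h, data) for h in grid])])    # h_LSCV on this grid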
4.4 Biased Cross-Validation
In LSCV we used the exact MISE formula. Biased cross-validation (BCV), by contrast, is based on the asymptotic MISE,
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f''), \tag{4.4.1}
\]
in which the unknown quantity R(f'') is replaced by an appropriate estimator \widetilde{R}(f'') given by [1]
\[
\widetilde{R}(f'') = R\big(\hat{f}''(\cdot;h)\big) - (nh^5)^{-1}R(K'') = n^{-2}\sum\!\!\sum_{i\neq j}(K''_h * K''_h)(X_i - X_j).
\]
Replacing R(f'') by the estimator \widetilde{R}(f'') gives
\[
\mathrm{BCV}(h) = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)\widetilde{R}(f''),
\]
which is the BCV objective function.
It therefore makes sense to choose h to minimize BCV(h), as we did in the previous section with LSCV(h). We denote the bandwidth chosen by this strategy by \hat{h}_{BCV}.
In fact, \hat{h}_{BCV} has an advantage over \hat{h}_{LSCV}: it is more stable, because of its lower asymptotic variance. At the same time there is an increase in bias, with \hat{h}_{BCV} tending to be larger than the MISE-optimal bandwidth.
To close this section, we note that among the selectors based on cross-validation ideas this one is perhaps the most appealing, because it is based on the asymptotic MISE, which is easier to work with, and it already uses the ideas of the DPI methods (Section (4.6)), since it involves estimating the unknown R(f'').
4.5 Estimation Of Density Functionals
In the previous expressions for the optimal bandwidths, it is clear that the bandwidth relies on the integrated squared density derivative R(f''). The problem is that R(f'') is unknown, which prevents direct use of those expressions. The hi-tech univariate bandwidth selectors all have the estimation of integrated squared density derivatives as a component.

Definition 4.5.1. Define µ_j(K) = ∫ x^j K(x) dx to be the jth moment of the kernel K. Then we say that K is a kth-order kernel if
\[
\mu_0(K) = 1; \quad \mu_j(K) = 0 \ \text{for } j = 1, \dots, k-1; \quad \text{and } \mu_k(K) \neq 0.
\]
The general integrated squared density derivative functional is
\[
R(f^{(s)}) = \int f^{(s)}(x)^2dx.
\]
Under sufficient smoothness assumptions on f, integrating this formula by parts yields
\[
R(f^{(s)}) = (-1)^s\int f^{(2s)}(x)f(x)\,dx.
\]
Define
\[
\psi_r = \begin{cases}\displaystyle\int f^{(r)}(x)f(x)\,dx & r \text{ even},\\[4pt] 0 & r \text{ odd}.\end{cases}
\]
Therefore, it is enough to investigate the estimation of ψ_r.
Since ψ_r = E\big(f^{(r)}(X)\big), a natural estimator of ψ_r is
\[
\hat{\psi}_r(g) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}^{(r)}(X_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}L_g^{(r)}(X_i - X_j),
\]
where \hat{f} is a KDE based on a smoothing parameter g and a kernel function L.
We will need the following assumptions when studying \hat{\psi}_r in connection with its asymptotic MSE:
(i) the kernel L is a symmetric kernel of order k, k = 2, 4, ..., possessing r derivatives, such that
\[
(-1)^{r + k/2 + 1}L^{(r)}(0)\,\mu_k(L) > 0;
\]
(ii) the density f has p continuous derivatives that are ultimately monotone, where p > k;
(iii) g = g_n is a positive-valued sequence of bandwidths satisfying
\[
\lim_{n\to\infty} g = 0 \quad \text{and} \quad \lim_{n\to\infty} ng^{2r+1} = \infty.
\]
Our goal is to find a large sample approximation to
\[
\mathrm{MSE}\big(\hat{\psi}_r(g)\big) = \mathrm{Var}\,\hat{\psi}_r(g) + \big(E\hat{\psi}_r(g) - \psi_r\big)^2.
\]
To do this, we begin by rewriting \hat{\psi}_r(g) as
\[
\hat{\psi}_r(g) = n^{-1}L_g^{(r)}(0) + n^{-2}\sum\!\!\sum_{i\neq j}L_g^{(r)}(X_i - X_j).
\]
Then the expectation of \hat{\psi}_r(g) is
\[
E\hat{\psi}_r(g) = n^{-1}L_g^{(r)}(0) + (1 - 1/n)\,E\big(L_g^{(r)}(X_1 - X_2)\big).
\]
We now introduce the following theorems.
Theorem 4.5.1. If the kernel L be as in Assumption (i); then, the bias is
Eψr(g)− ψr = n−1g−r−1L(r)(0) + (k!)−1gkµk(L)ψr+k +O(gk+2)
Proof. At first we find $E\{L_g^{(r)}(X_1 - X_2)\}$; then the result is straightforward. Integration by parts, under the smoothness assumptions on $f$, and then Taylor's theorem yield
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)\} &= \int\!\!\int L_g^{(r)}(x-y)f(x)f(y)\,dx\,dy\\
&= \int\!\!\int L_g(x-y)f(x)f^{(r)}(y)\,dx\,dy &&\text{(integration by parts)}\\
&= \int\!\!\int L(u)f(y+gu)f^{(r)}(y)\,du\,dy &&\text{(substituting } u = (x-y)/g)\\
&= \int\!\!\int L(u)f^{(r)}(y)\Bigl(\sum_{l=0}^{k}(l!)^{-1}(ug)^l f^{(l)}(y) + O(g^{k+1})\Bigr)du\,dy &&\text{(Taylor expansion of } f)\\
&= \int\!\!\int \bigl[L(u)f^{(r)}(y)f(y) + \cdots + L(u)f^{(r)}(y)(k!)^{-1}(ug)^k f^{(k)}(y)\bigr]du\,dy\\
&\qquad + \int f^{(r)}(y)O(g^{k+1})\int L(u)\,du\,dy\\
&= \psi_r + (k!)^{-1}\mu_k(L)g^k\int f^{(r)}(y)f^{(k)}(y)\,dy + O(g^{k+2}) &&\text{(since } L \text{ has order } k \text{; by symmetry of } L \text{ the remainder is } O(g^{k+2}))\\
&= \psi_r + (k!)^{-1}\mu_k(L)g^k\int f^{(r+k)}(y)f(y)\,dy + O(g^{k+2}) &&\text{(integration by parts)}\\
&= \psi_r + (k!)^{-1}\mu_k(L)g^k\psi_{r+k} + O(g^{k+2}) &&\text{(from the definition of } \psi_r).
\end{aligned}$$
So,
$$E\hat{\psi}_r(g) - \psi_r = n^{-1}g^{-r-1}L^{(r)}(0) + (k!)^{-1}\mu_k(L)g^k\psi_{r+k} + O(g^{k+2}).$$
Lemma 4.5.2. Let $X_1, X_2, \ldots, X_n$ be a set of independent and identically distributed random variables and define
$$U = 2n^{-2}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} S(X_i - X_j),$$
where the function $S$ is symmetric about zero. Then
$$\mathrm{Var}(U) = 2n^{-3}(n-1)\mathrm{Var}\{S(X_1 - X_2)\} + 4n^{-3}(n-1)(n-2)\mathrm{Cov}\{S(X_1 - X_2), S(X_2 - X_3)\}.$$
Proof. The elementary properties of the variance of a sum of random variables, together with the fact that the $X_i$ are independent and identically distributed, yield
$$\begin{aligned}
\mathrm{Var}(U) &= 4n^{-4}\,\mathrm{Var}\Bigl\{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} S(X_i - X_j)\Bigr\}\\
&= 4n^{-4}\Bigl[\frac{n(n-1)}{2}\mathrm{Var}\{S(X_1 - X_2)\} + n(n-1)(n-2)\mathrm{Cov}\{S(X_1 - X_2), S(X_2 - X_3)\}\Bigr]\\
&= 2n^{-3}(n-1)\mathrm{Var}\{S(X_1 - X_2)\} + 4n^{-3}(n-1)(n-2)\mathrm{Cov}\{S(X_1 - X_2), S(X_2 - X_3)\}.
\end{aligned}$$
It follows from the previous lemma and the symmetry of $L^{(r)}$ for $r$ even that
$$\mathrm{Var}\{\hat{\psi}_r(g)\} = 2n^{-3}(n-1)\mathrm{Var}\{L_g^{(r)}(X_1 - X_2)\} + 4n^{-3}(n-1)(n-2)\mathrm{Cov}\{L_g^{(r)}(X_1 - X_2), L_g^{(r)}(X_2 - X_3)\}.$$
This leads to the following theorem
Theorem 4.5.3. Let the kernel $L$ satisfy Assumption (i), with $L^{(r)}$ symmetric for $r$ even. Then
$$\mathrm{Var}\{\hat{\psi}_r(g)\} = 2n^{-2}g^{-2r-1}\psi_0 R(L^{(r)}) + 4n^{-1}\Bigl\{\int f^{(r)}(x)^2 f(x)\,dx - \psi_r^2\Bigr\} + o(n^{-2}g^{-2r-1} + n^{-1}).$$
Proof. We prove this theorem in two steps.
Step 1:
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)^2\} &= \int\!\!\int \bigl(L_g^{(r)}(x-y)\bigr)^2 f(x)f(y)\,dx\,dy\\
&= g^{-2r-1}\int\!\!\int L^{(r)}(u)^2 f(y+gu)f(y)\,du\,dy\\
&= g^{-2r-1}\int\!\!\int L^{(r)}(u)^2 [f(y) + o(1)]f(y)\,du\,dy\\
&= g^{-2r-1}\psi_0 R(L^{(r)}) + o(g^{-2r-1})
\end{aligned}$$
and
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)\} &= \int\!\!\int L_g^{(r)}(x-y)f(x)f(y)\,dx\,dy\\
&= \int\!\!\int L_g(x-y)f(x)f^{(r)}(y)\,dx\,dy\\
&= \int\!\!\int L(u)f(y+gu)f^{(r)}(y)\,du\,dy\\
&= \int\!\!\int L(u)f^{(r)}(y)[f(y) + o(1)]\,du\,dy\\
&= \psi_r + o(1).
\end{aligned}$$
Thus
$$\begin{aligned}
\mathrm{Var}\{L_g^{(r)}(X_1 - X_2)\} &= g^{-2r-1}\psi_0 R(L^{(r)}) + o(g^{-2r-1}) - \bigl(\psi_r + o(1)\bigr)^2\\
&= g^{-2r-1}\psi_0 R(L^{(r)}) - \psi_r^2 + o(g^{-2r-1}) + o(1).
\end{aligned}$$
Step 2:
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)L_g^{(r)}(X_2 - X_3)\} &= \int\!\!\int\!\!\int L_g^{(r)}(x-y)L_g^{(r)}(y-z)f(x)f(y)f(z)\,dx\,dy\,dz\\
&= \int\!\!\int\!\!\int L_g(x-y)L_g(y-z)f^{(r)}(x)f^{(r)}(z)f(y)\,dx\,dy\,dz\\
&= \int\!\!\int\!\!\int L(u)L(v)f^{(r)}(y+ug)f(y)f^{(r)}(y-gv)\,du\,dv\,dy\\
&= \int\!\!\int\!\!\int L(u)L(v)[f^{(r)}(y) + o(1)]\,f(y)\,[f^{(r)}(y) + o(1)]\,du\,dv\,dy\\
&= \int f^{(r)}(y)^2 f(y)\,dy + o(1)
\end{aligned}$$
and
$$E\{L_g^{(r)}(X_1 - X_2)\}\,E\{L_g^{(r)}(X_2 - X_3)\} = \psi_r^2 + o(1).$$
Substituting these approximations into Lemma 4.5.2 leads to
$$\mathrm{Var}\{\hat{\psi}_r(g)\} = 2n^{-2}g^{-2r-1}\psi_0 R(L^{(r)}) + 4n^{-1}\Bigl\{\int f^{(r)}(x)^2 f(x)\,dx - \psi_r^2\Bigr\} + o(n^{-2}g^{-2r-1} + n^{-1}).$$
Therefore the asymptotic MSE is
$$\mathrm{MSE}\{\hat{\psi}_r(g)\} \approx \bigl[n^{-1}g^{-r-1}L^{(r)}(0) + (k!)^{-1}g^k\mu_k(L)\psi_{r+k}\bigr]^2 + 2n^{-2}g^{-2r-1}R(L^{(r)})\psi_0 + 4n^{-1}\Bigl\{\int f^{(r)}(x)^2 f(x)\,dx - \psi_r^2\Bigr\}.$$
Remark 4.5.1. 1. We can choose $g$ so that the main bias term vanishes, namely
$$g_{\mathrm{AMSE}} = \left(\frac{k!\,L^{(r)}(0)}{-\mu_k(L)\psi_{r+k}\,n}\right)^{1/(r+k+1)} \qquad (4.5.1)$$
2. We have to investigate the influence of this choice of $g$ on the two main components of the MSE. The order of the squared bias term reduces to $n^{-(2k+4)/(r+k+1)}$. Since $g_{\mathrm{AMSE}} = O(n^{-1/(r+k+1)})$, the orders of the two leading variance terms are $n^{-(2k+1)/(r+k+1)}$ and $n^{-1}$, respectively. Notice that the first of these variance terms dominates the squared bias term, so the rate of convergence of the minimum MSE depends only on the leading variance terms.
3. For $k < r$, we find that
$$\inf_{g>0}\mathrm{MSE}\{\hat{\psi}_r(g)\} \sim 2R(L^{(r)})\psi_0\left(\frac{\mu_k(L)\psi_{r+k}}{-L^{(r)}(0)\,k!}\right)^{(2r+1)/(r+k+1)} n^{-(2k+1)/(r+k+1)};$$
for $k > r$,
$$\inf_{g>0}\mathrm{MSE}\{\hat{\psi}_r(g)\} \sim 4\,\mathrm{Var}\{f^{(r)}(X)\}\,n^{-1};$$
and for $k = r$ the two leading terms above are of the same order, and the leading term of the minimum mean squared error is the sum of those terms.
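A small computational sketch of Eqn. (4.5.1) is given below for the special case of a Gaussian kernel $L$ of order $k = 2$ (an assumption, since the remark is stated for a general kernel); the value supplied for $\psi_{r+k}$ must come from elsewhere, e.g. a normal scale rule.

from math import factorial, sqrt, pi

def gauss_deriv_at_zero(r):
    """phi^{(r)}(0) for the standard normal density; zero for odd r."""
    if r % 2:
        return 0.0
    return (-1)**(r // 2) * factorial(r) / (2**(r // 2) * factorial(r // 2) * sqrt(2 * pi))

def g_amse(r, psi_rk, n, k=2):
    """Eqn. (4.5.1): g = [k! L^{(r)}(0) / (-mu_k(L) psi_{r+k} n)]^{1/(r+k+1)}.

    Sketch for L = standard Gaussian, so mu_2(L) = 1 (only k = 2 handled here).
    """
    if k != 2:
        raise ValueError("only the second-order Gaussian kernel is handled in this sketch")
    num = factorial(k) * gauss_deriv_at_zero(r)
    return (num / (-1.0 * psi_rk * n)) ** (1.0 / (r + k + 1))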
4.6 Plug-In Bandwidth Selection
4.6.1 Direct Plug-In Rules
Recall that we obtained the AMISE-optimal bandwidth to be
$$h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5}.$$
In terms of the $\psi_r$ functionals, it can be written as
$$h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)\psi_4\,n}\right)^{1/5}.$$
Now, the idea here is to estimate the unknown quantities that appear in the formula for the asymptotically optimal bandwidth. Replacing $\psi_4$ in the last formula by $\hat{\psi}_4(g)$ gives a formula for what is called the direct plug-in (DPI) rule,
$$\hat{h}_{\mathrm{DPI}} = \left(\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(g)\,n}\right)^{1/5}.$$
In this rule, hDPI depends on the choice of the pilot bandwidth g.
From Eqn. (4.5.1), the AMSE-optimal bandwidth $g$ of the estimator $\hat{\psi}_4(g)$ is
$$g_{\mathrm{AMSE}} = \left(\frac{2K^{(4)}(0)}{-\mu_2(K)\psi_6\,n}\right)^{1/7},$$
where $K = L$ is the second-order kernel used in $\hat{\psi}_4(g)$.
The same defect appears here, since $g_{\mathrm{AMSE}}$ also depends on an unknown density functional, $\psi_6$.
Repeating the previous process, i.e. estimating $\psi_6$ by a kernel estimator, does not help directly, because its optimal bandwidth depends on $\psi_8$. This process has no end: from Eqn. (4.5.1), the optimal bandwidth for estimating $\psi_r$ depends on $\psi_{r+2}$.
The way out of this problem is to estimate some $\psi_r$ functional with a quick and simple estimate, say the normal scale rule.
Up until the stage at which a $\psi_r$ is estimated by a quick and simple estimate, we obtain a family of direct plug-in bandwidth selectors that depend on the number of stages of functional estimation. If the number of stages is $\ell$, say, we call such a rule an $\ell$-stage direct plug-in bandwidth selector and denote it by $\hat{h}_{\mathrm{DPI},\ell}$.
The following theorem is very useful for computing quantities required for bandwidth selectors, but before we introduce it, we state some facts [1].
Fact [1]:
$$\int \phi_\sigma^{(r)}(x-\mu)\,\phi_{\sigma'}^{(r')}(x-\mu')\,dx = (-1)^r \phi^{(r+r')}_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu').$$
Fact [2]:
$$\phi_\sigma^{(r)}(0) = \begin{cases}(-1)^{r/2}(2\pi)^{-1/2}\,\mathrm{OF}(r)\,\sigma^{-r-1} & r \text{ even},\\ 0 & r \text{ odd},\end{cases}$$
where
$$\mathrm{OF}(r) = (r-1)(r-3)\cdots 1 = \frac{r!}{2^{r/2}(r/2)!}$$
is the odd factorial.
Theorem 4.6.1. If $f$ is a normal density with variance $\sigma^2$ then, for $r$ even,
$$\psi_r = \frac{(-1)^{r/2}\,r!}{(2\sigma)^{r+1}(r/2)!\,\pi^{1/2}}. \qquad (4.6.1)$$
Proof. For $r$ even we have from the definition of $\psi_r$ that
$$\psi_r = \int f^{(r)}(x)f(x)\,dx.$$
Since $f$ is normal, we can write
$$\psi_r = \int \phi_\sigma^{(r)}(x)\phi_\sigma(x)\,dx.$$
From Fact (1), $\psi_r = \phi^{(r)}_{\sqrt{2}\sigma}(0)$ (using that $r$ is even), and from Fact (2) we have
$$\phi^{(r)}_{\sqrt{2}\sigma}(0) = (-1)^{r/2}(2\pi)^{-1/2}\,\mathrm{OF}(r)\,(\sqrt{2}\sigma)^{-r-1} = \frac{(-1)^{r/2}\,r!}{(2\sigma)^{r+1}(r/2)!\,\pi^{1/2}}.$$
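A small sketch (not in the thesis) of the normal scale rule in Eqn. (4.6.1) is given below, with $\sigma$ replaced by an estimate of scale; which scale estimate to use is an assumption left to the reader.

from math import factorial, sqrt, pi

def psi_ns(r, sigma_hat):
    """Normal scale estimate of psi_r from Eqn. (4.6.1), valid for even r."""
    if r % 2:
        raise ValueError("psi_r is used only for even r")
    return ((-1)**(r // 2) * factorial(r)
            / ((2 * sigma_hat)**(r + 1) * factorial(r // 2) * sqrt(pi)))

# e.g. psi_ns(6, s) = -15/(16*sqrt(pi)*s**7) and psi_ns(8, s) = 105/(32*sqrt(pi)*s**9),
# matching the values quoted in Examples 4.6.1 and 4.6.2.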
It is useful to illustrate this procedure by the following example
Example 4.6.1. This example illustrates how to apply the $\ell$-stage plug-in bandwidth selector. Take $\ell = 2$ and use $L = K$, where $K$ is a second-order kernel.
Step 1: Estimate $\psi_8$ using the normal scale estimate $\hat{\psi}_8^{\mathrm{NS}} = 105/(32\pi^{1/2}\hat{\sigma}^9)$, where $\hat{\sigma}$ is an estimate of scale. The formula for $\hat{\psi}_8^{\mathrm{NS}}$ is obtained from Eqn. (4.6.1).
Step 2: Estimate $\psi_6$ using the kernel estimator $\hat{\psi}_6(g_1)$, where
$$g_1 = \left[\frac{-2K^{(6)}(0)}{\mu_2(K)\hat{\psi}_8^{\mathrm{NS}}\,n}\right]^{1/9}.$$
Step 3: Estimate $\psi_4$ using the kernel estimator $\hat{\psi}_4(g_2)$, where
$$g_2 = \left[\frac{-2K^{(4)}(0)}{\mu_2(K)\hat{\psi}_6(g_1)\,n}\right]^{1/7}.$$
Step 4: The selected bandwidth is
$$\hat{h}_{\mathrm{DPI},2} = \left(\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(g_2)\,n}\right)^{1/5}.$$
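The steps of Example 4.6.1 can be put together as a short sketch for a Gaussian kernel $K$ (so $R(K)=1/(2\sqrt{\pi})$, $\mu_2(K)=1$, $K^{(4)}(0)=3/\sqrt{2\pi}$, $K^{(6)}(0)=-15/\sqrt{2\pi}$); the use of the sample standard deviation as the scale estimate $\hat{\sigma}$ is an assumption, and psi_hat and psi_ns refer to the sketches given earlier.

import numpy as np
from math import sqrt, pi

def h_dpi2(x):
    """Two-stage direct plug-in bandwidth of Example 4.6.1, Gaussian kernel (sketch)."""
    n = len(x)
    sigma_hat = np.std(x, ddof=1)                      # assumed scale estimate
    K4_0 = 3.0 / sqrt(2 * pi)                          # phi^{(4)}(0)
    K6_0 = -15.0 / sqrt(2 * pi)                        # phi^{(6)}(0)
    # Step 1: normal scale estimate of psi_8
    psi8 = psi_ns(8, sigma_hat)
    # Step 2: estimate psi_6 with pilot bandwidth g1
    g1 = (-2.0 * K6_0 / (psi8 * n)) ** (1.0 / 9.0)
    psi6 = psi_hat(6, g1, x)
    # Step 3: estimate psi_4 with pilot bandwidth g2
    g2 = (-2.0 * K4_0 / (psi6 * n)) ** (1.0 / 7.0)
    psi4 = psi_hat(4, g2, x)
    # Step 4: plug psi_4 into the AMISE-optimal bandwidth formula
    RK = 1.0 / (2.0 * sqrt(pi))
    return (RK / (psi4 * n)) ** 0.2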
At the end of this section, a reasonable question arises:
How should one choose the value of $\ell$, the number of stages of functional estimation?
Actually, there is no objective method for determining $\ell$. However, on theoretical grounds $\ell$ should preferably be at least 2, with $\ell = 2$ being a common choice (Wand and Jones).
4.6.2 Solve-The-Equation Rules
These methods are similar to the DPI approach in relying on the formula for the AMISE-optimal bandwidth, and they involve another "stage selection" problem, as we will show.
When $h$ is selected according to these rules, it must satisfy the relationship
$$h = \left(\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(\gamma(h))\,n}\right)^{1/5},$$
where the pilot bandwidth of the estimator $\hat{\psi}_4$ is a function $\gamma$ of $h$.
Consider the following lemma
Lemma 4.6.2. The following relationship is valid:
$$g_{\mathrm{AMSE}} = \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\psi_4}{\psi_6}\right)^{1/7} h_{\mathrm{AMISE}}^{5/7},$$
where $K$ and $h$ are, respectively, the kernel and bandwidth used in estimating the density $f$, and $g$ and $L$ are, respectively, the bandwidth and symmetric kernel with $r$ derivatives used in $\hat{\psi}_r$.
Proof. Beginning with the relationship
$$g_{\mathrm{AMSE}} = \left(\frac{2L^{(4)}(0)}{-\mu_2(L)\psi_6\,n}\right)^{1/7}$$
and dividing by $h_{\mathrm{AMISE}}^{5/7}$ gives
$$\begin{aligned}
g_{\mathrm{AMSE}}/h_{\mathrm{AMISE}}^{5/7} &= \left(\frac{2L^{(4)}(0)}{-\mu_2(L)\psi_6\,n}\right)^{1/7}\left(\frac{\mu_2^2(K)\psi_4\,n}{R(K)}\right)^{1/7}\\
&= \left(\frac{2L^{(4)}(0)\mu_2^2(K)\psi_4\,n}{-\mu_2(L)\psi_6 R(K)\,n}\right)^{1/7}\\
&= \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\psi_4}{\psi_6}\right)^{1/7},
\end{aligned}$$
so that
$$g_{\mathrm{AMSE}} = \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\psi_4}{\psi_6}\right)^{1/7} h_{\mathrm{AMISE}}^{5/7}.$$
So we take
$$\gamma(h) = \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\hat{\psi}_4(g_1)}{\hat{\psi}_6(g_2)}\right)^{1/7} h^{5/7},$$
where $\hat{\psi}_4(g_1)$ and $\hat{\psi}_6(g_2)$ are kernel estimates of $\psi_4$ and $\psi_6$, and $g_1$ and $g_2$ are obtained from Eqn. (4.5.1).
Example 4.6.2. A 2-stage solve-the-equation bandwidth selector that uses $L = K$ (denoted by $\hat{h}_{\mathrm{STE},2}$) is given as follows.
Step 1: Estimate $\psi_6$ and $\psi_8$ using $\hat{\psi}_6^{\mathrm{NS}} = -15/(16\pi^{1/2}\hat{\sigma}^7)$ and $\hat{\psi}_8^{\mathrm{NS}} = 105/(32\pi^{1/2}\hat{\sigma}^9)$.
Step 2: Estimate $\psi_4$ and $\psi_6$ using the kernel estimators $\hat{\psi}_4(g_1)$ and $\hat{\psi}_6(g_2)$, where
$$g_1 = \bigl\{-2K^{(4)}(0)/(\mu_2(K)\hat{\psi}_6^{\mathrm{NS}}\,n)\bigr\}^{1/7} \quad\text{and}\quad g_2 = \bigl\{-2K^{(6)}(0)/(\mu_2(K)\hat{\psi}_8^{\mathrm{NS}}\,n)\bigr\}^{1/9}.$$
Step 3: Estimate $\psi_4$ using the kernel estimator $\hat{\psi}_4(\gamma(h))$, where
$$\gamma(h) = \left[\frac{2K^{(4)}(0)\mu_2(K)\hat{\psi}_4(g_1)}{-\hat{\psi}_6(g_2)R(K)}\right]^{1/7} h^{5/7}.$$
Step 4: The selected bandwidth is the solution to the equation
$$h = \left[\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(\gamma(h))\,n}\right]^{1/5}.$$
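Under the same Gaussian-kernel assumptions as before ($\mu_2(K)=1$), the solve-the-equation rule of Example 4.6.2 can be sketched by rewriting Step 4 as a root-finding problem; the bracketing interval passed to brentq is a hypothetical choice, and psi_hat and psi_ns are the earlier sketches.

import numpy as np
from math import sqrt, pi
from scipy.optimize import brentq

def h_ste2(x):
    """Two-stage solve-the-equation bandwidth of Example 4.6.2, Gaussian kernel (sketch)."""
    n = len(x)
    sigma_hat = np.std(x, ddof=1)
    K4_0, K6_0 = 3.0 / sqrt(2*pi), -15.0 / sqrt(2*pi)
    RK = 1.0 / (2.0 * sqrt(pi))
    # Step 1: normal scale estimates of psi_6 and psi_8
    psi6_ns, psi8_ns = psi_ns(6, sigma_hat), psi_ns(8, sigma_hat)
    # Step 2: kernel estimates of psi_4 and psi_6
    g1 = (-2.0 * K4_0 / (psi6_ns * n)) ** (1.0 / 7.0)
    g2 = (-2.0 * K6_0 / (psi8_ns * n)) ** (1.0 / 9.0)
    psi4_g1, psi6_g2 = psi_hat(4, g1, x), psi_hat(6, g2, x)
    # Step 3: pilot bandwidth as a function of h (mu_2(K) = 1 for the Gaussian kernel)
    def gamma(h):
        return (2.0 * K4_0 * psi4_g1 / (-psi6_g2 * RK)) ** (1.0 / 7.0) * h ** (5.0 / 7.0)
    # Step 4: solve h = [R(K) / (mu_2^2(K) psi_4(gamma(h)) n)]^{1/5}
    def root_fn(h):
        return h - (RK / (psi_hat(4, gamma(h), x) * n)) ** 0.2
    return brentq(root_fn, 1e-3 * sigma_hat, 10.0 * sigma_hat)   # hypothetical bracket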
4.7 Smoothed Cross-Validation Bandwidth Selection
Using a kernel estimator with bandwidth $g$ to estimate the integrated squared bias component of $\mathrm{MISE}\{\hat{f}(\cdot;h)\}$ is what plug-in bandwidth selection has in common with this kind of cross-validation selection, namely smoothed cross-validation (SCV) bandwidth selection. So the DPI and SCV methods have the same theoretical properties.
The formula for $\mathrm{MISE}\{\hat{f}(\cdot;h)\}$ in Eqn. (2.2.6) can be written asymptotically as
$$\mathrm{MISE}\{\hat{f}(\cdot;h)\} \approx (nh)^{-1}R(K) + \int\Bigl(\int K_h(x-y)f(y)\,dy - f(x)\Bigr)^2 dx.$$
The second term is the integrated squared bias of $\hat{f}(\cdot;h)$.
The unknown $f$ is replaced by a pilot estimator
$$\hat{f}_L(x;g) = n^{-1}\sum_{i=1}^{n} L_g(x - X_i),$$
where $L_g(x) = L(x/g)/g$, $L$ is a kernel that may be different from $K$, and $g$ is a pilot bandwidth. This gives the smoothed cross-validation objective function SCV, namely
$$\mathrm{SCV}(h) = (nh)^{-1}R(K) + \widehat{\mathrm{ISB}}(h),$$
where
$$\widehat{\mathrm{ISB}}(h) = \int\Bigl\{\int K_h(x-y)\hat{f}_L(y;g)\,dy - \hat{f}_L(x;g)\Bigr\}^2 dx \qquad (4.7.1)$$
is an estimate of the integrated squared bias ISB. From the objective function SCV($h$), we define the bandwidth $\hat{h}_{\mathrm{SCV}}$ to be the largest local minimizer of SCV($h$).
As we see above, SCV is based on the exact integrated squared bias rather than its asymptotic approximation. This may make it more difficult to analyze than DPI.
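When both $K$ and $L$ are Gaussian, the double integral in $\widehat{\mathrm{ISB}}(h)$ has a closed form via $\int\phi_a(x-X_i)\phi_b(x-X_j)\,dx=\phi_{\sqrt{a^2+b^2}}(X_i-X_j)$, which gives the following sketch of SCV($h$); the fixed pilot bandwidth g passed in is an assumption, not a recommendation.

import numpy as np

def _phi(u, s):
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def scv(h, g, x):
    """SCV(h) = (nh)^{-1} R(K) + ISB_hat(h) for Gaussian K and L (sketch)."""
    n = len(x)
    d = x[:, None] - x[None, :]
    # ISB_hat(h) = n^{-2} sum_{i,j} [ phi_{sqrt(2h^2+2g^2)} - 2 phi_{sqrt(h^2+2g^2)}
    #                                 + phi_{sqrt(2g^2)} ](X_i - X_j)
    isb = (_phi(d, np.sqrt(2*h**2 + 2*g**2))
           - 2.0 * _phi(d, np.sqrt(h**2 + 2*g**2))
           + _phi(d, np.sqrt(2*g**2))).sum() / n**2
    return 1.0 / (2 * np.sqrt(np.pi) * n * h) + isb

# h_scv would then be taken as the largest local minimizer of scv(., g, x) over a grid of h values.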
Conclusions
In order to achieve our goal of obtaining, nonparametrically, the best possible smooth estimate of an unknown density, we find that the kernel density estimator is a good alternative that overcomes the deficiencies of the oldest and most popular method, the histogram.
We find that the key to understanding how the kernel density estimator works is to understand the relationship between the two components of the mean integrated squared error. This leads to the most important parameter, the bandwidth $h$. Its closed-form expression shows that the optimal bandwidth tends to zero as $n \to \infty$ and gives the corresponding rate of convergence. We find that the best obtainable rate of convergence of the kernel estimator is of order $n^{-4/5}$. This rate is slower than the parametric rate, which is of order $n^{-1}$. We have studied a form of the asymptotic mean integrated squared error that separates the kernel $K$ and the bandwidth $h$, and we have also seen that using equal bandwidths with different canonical kernels gives the same amount of smoothing. We have also seen how to measure the degree of difficulty of kernel estimation of a given $f$.
The basic KDE has several modifications. We saw that there is always some improvement if $h(x)$ is chosen optimally in the local KDE. The variable KDE allows different degrees of smoothing depending on the relation between data points. The transformation KDE is used when the shape of $f$ is difficult to estimate.
Because of the large influence of the bandwidth $h$ on the smoothing process, we devoted Chapter 4 to studying some types of bandwidth selectors. Although these selectors are divided into quick and simple selectors and hi-tech selectors, we find that the latter type depends on the quick and simple rules.
We have found that the DPI selector provides an explicit formula for an optimal bandwidth value, while the STE selector finds $h$ as the solution of an equation solved iteratively. CV techniques depend on relatively hard calculations of the objective function. So, practically, the DPI and STE methods are more suitable than the CV methods, except for those CV methods that use the ideas of the DPI approach.
As we have seen, the attention in this thesis is concentrated on the univariate KDE. A complete investigation of the extension of the KDE to the multivariate case is still required. The need for nonparametric density estimates for discovering structure in multivariate data is often greater, because parametric estimation becomes more complicated than in the univariate case.
We also suggest an extended study of kernel regression, since it is one of the fundamental subjects in which the techniques of kernel smoothing work very well.
We hope that more attention is given to the modifications of the KDE, but they need a special study of their own, as we have done for the basic estimator.
Finally, our analysis of the performance of the KDE in Chapter 2 assumes that the kernel $K$ is a probability density function; one may relax this restriction and examine the effect on the rates of convergence and other properties of the KDE.
Bibliography
[1] Wand, M. P. and Jones, M. C., Kernel Smoothing, first ed., Chapman & Hall/CRC, London, 1995.
[2] Casella, George and Berger, Roger L., Statistical Inference, Duxbury Press (an imprint of Wadsworth Publishing Company), Belmont, California, 1990.
[3] Apostol, Tom M., Mathematical Analysis (A Modern Approach to Advanced Calculus), second ed., Addison-Wesley Publishing Company, 1977.
[4] Hogg, Robert V. and Tanis, Elliot A., Probability and Statistical Inference, fifth ed., Macmillan Publishing Co., Inc., New York, London, 1997.
[5] Thomas and Finney, Calculus, ninth ed., Addison-Wesley Publishing Company.
[6] Polansky, Alan M., Bandwidth Selection for Kernel Distribution Functions, Northern Illinois University, Division of Statistics, AMS 1991 subject classifications.
[7] Bashtannyk, David M. and Hyndman, Rob J., Bandwidth Selection for Kernel Conditional Density Estimation, August 24, 2000.
[8] Sain, Stephan R. and Scott, David W., On Locally Adaptive Density Estimation, January 8, 1996.
[9] Hengartner, Nicolas W., Asymptotic Unbiased Density Estimator, Yale University, Department of Statistics, New Haven, U.S.A., February 25, 1997.
[10] Baxter, M. J. and Beardah, C. C., Beyond the Histogram: Improved Approaches to Simple Data Display in Archaeology Using Kernel Density Estimates, Department of Mathematics, Statistics and Operational Research, The Nottingham Trent University, Nottingham, United Kingdom.
[11] Walter, Bruce Jonathan, Density Estimation Techniques for Global Illumination, Ph.D. dissertation, Cornell University, August 1998.