The Islamic University of Gaza
Deanery of Higher Studies
Faculty of Science
Department of Mathematics
ON THE KERNEL DENSITY ESTIMATION
Presented By
Ghada M. Abu Nada
Supervised By
Associate Professor: Mohamed I. Riffi
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
AT
ISLAMIC UNIVERSITY OF GAZA
GAZA, PALESTINE
FEBRUARY 2002
© Copyright by Ghada M. Abu Nada, 2000
To my parents
ii
Table of Contents
Table of Contents iii
Abstract v
Acknowledgements vi
1 Introduction 3
  1.1 Histogram 4
    1.1.1 How to construct a relative frequency histogram 4
  1.2 Density Estimation 7
    1.2.1 Properties And Examples Of The Kernel Function 9
2 Univariate Kernel Density Estimation 11
  2.1 Introduction 11
  2.2 The MSE And MISE Criteria 14
  2.3 Asymptotic MSE And MISE Approximations 17
  2.4 Exact MISE Calculation 24
  2.5 Canonical Kernel And Optimal Kernel Theory 27
  2.6 Measuring How Difficult A Density Is To Estimate 32
3 Modifications of the kernel density estimator 36
  3.1 Local Kernel Density Estimator 36
  3.2 Variable Kernel Density Estimator 39
  3.3 Transformation Kernel Density Estimators 40
4 Bandwidth Selection 43
  4.1 Introduction 43
  4.2 Quick And Simple Bandwidth Selectors 44
    4.2.1 Normal Scale Rules 44
    4.2.2 Oversmoothed bandwidth selection rules 45
  4.3 Least Squares Cross-Validation 47
  4.4 Biased Cross-Validation 48
  4.5 Estimation Of Density Functionals 49
  4.6 Plug-In Bandwidth Selection 55
    4.6.1 Direct Plug-In Rules 55
    4.6.2 Solve-The-Equation Rules 58
  4.7 Smoothed Cross-Validation Bandwidth Selection 60
Conclusions 61
Bibliography 63
iv
Abstract
This thesis considers kernel-based density estimators as a fundamental tool for the nonparametric estimation of functions. We analyze the performance of the kernel density estimator using the mean squared error as the criterion for measuring the error incurred in estimation. An asymptotic approximation to the mean integrated squared error is derived. This approximation gives a clearer understanding of the smoothing parameter, often called the bandwidth, of kernel estimators, and it yields a closed-form expression for the bandwidth that asymptotically minimizes the mean integrated squared error. We also study the influence of the kernel and of the true density on the performance of the kernel density estimator.
To raise the efficiency of the basic kernel density estimator and to overcome its limitations, some modified forms of the kernel estimator are introduced.
One of the important issues in kernel smoothing is the choice of the smoothing parameter, so several bandwidth selection methods are studied.
v
Acknowledgements
I would like to thank the Islamic University of Gaza for giving me the opportunity to pursue the Master's degree, and my thanks go to my professors and my colleagues at the Department of Mathematics.
I am grateful to my supervisor Dr. Mohamed Riffi for his guidance and efforts with me during this research. I would also like to thank Dr. Jaser H. Sarsour and Dr. Mohamed Al-Atrash for their useful suggestions and discussions.
My sincere gratitude goes to my family, especially my parents, for their love and support. I am also thankful to all my friends for their kind advice and encouragement.
Finally, I pray to God to accept this work.
vi
1
Preface
Kernel smoothing refers to a general methodology for non-parametric estimation of
functions. It provides us with a class of techniques for performing this estimation.
The basic principle is that local averaging or smoothing is performed with respect to
a kernel function.
In this thesis we deal with one of those techniques, known as kernel density estimation. Suppose that you have a univariate set of data which you want to display graphically; this procedure can then be used effectively.
Several reasons attracted our attention to this subject. Among them is the importance of kernel density estimation techniques for statisticians: it is a simple practical tool for estimating unknown densities in statistical experiments. Moreover, the use of kernel smoothers is not restricted to those already familiar with the topic; it is also of fundamental importance for students and researchers from other disciplines. The same principles can be extended to more complicated problems, leading to many applications in fields like medicine, engineering and economics.
Research on kernel estimators is extensive and dates back many years. The basic principles were introduced by Fix and Hodges (1951) and Akaike (1954).
The main goal of this thesis is to present an investigation of kernel density estimation in a simple way that enables us to use it for smoothing and for learning about unknown densities, both theoretically and practically.
A brief description of the chapters of this thesis now follows.
The first chapter is preparatory. It is an introduction to histograms and how they are used in estimating a probability density function. We then discuss an alternative approach in more detail.
2
The main subject, univariate kernel density estimation, is detailed in Chapter 2. We begin by studying some error criteria for measuring the error when estimating the density. In the second section we derive large sample approximations for the leading variance and bias terms, which allow a better understanding of the role of the bandwidth. In the next section, the exact finite sample performance of the kernel density estimator for a particular kernel and density is analyzed. The rest of the chapter concerns the effect of the shape of the kernel function and measuring how difficult a density is to estimate.
Chapter 3 is about modifications of the kernel density estimator: the local kernel density estimator, the variable kernel density estimator and the transformation kernel density estimator.
Special attention is also given to the important problem of choosing the smoothing parameter of a kernel smoother in Chapter 4. It consists of seven sections and presents two classes of bandwidth selectors: the quick and simple selectors and the hi-tech selectors.
Finally, conclusions are drawn about the subjects studied in this thesis.
Chapter 1
Introduction
One of the main concerns of statisticians and data analysts is to characterize the distribution of the data in some manner, so that they can make their inferences.
The distribution can be modelled as a member of a parametric family, with inferences made about the parameters, or it can be treated nonparametrically. An example of the latter is to use the empirical distribution function of the data, and make inferences based on whether this distribution has some properties of interest.
Accordingly, probability density estimators are generally broken down into two basic classes: parametric and nonparametric estimators.
Parametric estimators assume a functional form of the density parameterized by a finite set of parameters, while nonparametric methods consist of all other types of estimators.
In the present work, we begin with histograms as an introduction to kernel estimators, which are an important example of nonparametric methods for estimating unknown densities.
3
4
1.1 Histogram
Histograms are among the oldest and most widely used methods of data representation. Informally, a histogram is simply a collection of rectangular bins with their bases on the horizontal (real) axis and heights parallel to the vertical axis. It presents the data graphically and helps us use the data to say something about the population from which they were selected.
1.1.1 How to construct a relative frequency histogram
We can, first, summarize the set of data in a table giving the number of occurrences of each possible outcome obtained from repeating a random experiment a number of times, say n.

i) Discrete data. The outcomes of the random experiment are values in a discrete set, that is, a set containing a countable number of points. To construct a relative frequency histogram, we make the height of each rectangle equal to f/n, where f is the number of times the outcome appears in the n trials, with a base of length one centered at the data point. The quantity f/n is called the relative frequency of the outcome. Note that the relative frequency histogram gives an estimate of the probability histogram of the associated random variable.

ii) Continuous data. The theoretical set of outcomes forms an interval or a union of intervals. Given a set of continuous data, we group the data into k class intervals of equal length (also called bins or windows) and construct a frequency table listing a tabulation of the measurements in the various classes and the frequency of each class. A relative frequency histogram is then composed of rectangles, each with area equal to the relative frequency f_i/n of the observations in the corresponding class. A minimal code sketch of this construction is given below.
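The following short Python sketch builds the relative frequency histogram of Eqn. (1.1.1) from a continuous sample; the simulated sample and the number of bins k are arbitrary choices made only for the example.

import numpy as np

def relative_frequency_histogram(data, k):
    """Return bin edges c_0,...,c_k and heights f_i / (n * (c_i - c_{i-1}))."""
    n = len(data)
    edges = np.linspace(min(data), max(data), k + 1)   # k equal-length bins
    counts, _ = np.histogram(data, bins=edges)          # frequencies f_i
    widths = np.diff(edges)                             # c_i - c_{i-1}
    heights = counts / (n * widths)                      # Eqn. (1.1.1)
    return edges, heights

# example: 50 simulated observations, 8 bins (illustrative values)
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=1.0, size=50)
edges, heights = relative_frequency_histogram(sample, k=8)
print((heights * np.diff(edges)).sum())   # total area = 1, so h(x) estimates a density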
5
Definition 1.1.1. The function defined by
\[
h(x) =
\begin{cases}
f_i / \{n(c_i - c_{i-1})\} & c_{i-1} \le x \le c_i;\ i = 1, 2, \dots, k, \\
0 & \text{otherwise},
\end{cases}
\tag{1.1.1}
\]
is called a relative frequency histogram, where f_i is the frequency of the ith bin, c_{i-1} and c_i are the boundaries of this bin, and n is the total number of observations.
Eqn. (1.1.1) gives a simple descriptive estimate of a probability density function of the continuous type. So a histogram can be considered the simplest nonparametric form of what is called a density estimator.
The appearance of the histogram depends both on the choice of the origin of the histogram and on the width of the bins used, where the origin of the histogram is the lower boundary of the first bin.
Figure (1.1.1)
6
Figure (1.1.1) shows four histograms based on the data of 50 birth weights of children having severe idiopathic respiratory syndrome (Van Vliek and Gupta, 1973).
Figures (1.1.1)(a) and (b) show histograms with binwidths b = 0.2 and b = 0.8, respectively. We note that the histogram in (b) is smoother than the one in (a), which has the smaller binwidth.
Figures (1.1.1)(c) and (d) show histograms based on the same binwidth b = 0.4, but with the leftmost bin starting at 0.7 and 0.9, respectively.
The histogram's sensitivity to the choice of origin can be removed by taking the average of a number of shifted histograms, which is called the average shifted histogram (ASH). In contrast to histograms, the appearance of the ASH does not depend on a particular choice of origin, only on the choice of binwidth. The ASH provides one way to approximate the kernel density estimator (KDE), which we study in the following chapter.
However, histograms have drawbacks other than their sensitivity to the choice of origin that make them unsuitable for advanced estimation problems. The histogram is a step function, and it is impractical to estimate all densities with a step function. Another problem is the extension of the histogram to the multivariate case. The main problem in practice, however, is to obtain a sufficiently smooth representation of the data while also retaining its main features. This is guaranteed by the KDE.
We now return to the function in Eqn. (1.1.1) and re-express it in the context of an important subject.
7
1.2 Density Estimation
The main object of density estimation is to construct a smooth nonparametric estimate of an unknown density function from a set of observations drawn from it; that is, we are given a sample of n independent observations x_1, x_2, ..., x_n from a distribution with unknown density function f(x).
A simple development of the estimator in Eqn. (1.1.1) is the running histogram estimator
\[
\hat{f}(x;h) = \frac{1}{2nh}\, n_x, \qquad a \le x \le b, \tag{1.2.1}
\]
where n_x is the number of observations falling in the interval [x−h, x+h], and h is known as the bandwidth.
Eqn. (1.2.1) can be written as
\[
\begin{aligned}
\hat{f}(x) &= \frac{1}{2nh}\,(\text{number of observations falling in } [x-h,\, x+h]) \\
&= \frac{1}{2nh}\sum_{i=1}^{n} I(|x_i - x| \le h) \\
&= \frac{1}{nh}\sum_{i=1}^{n} \frac{1}{2}\, I\!\left(\left|\frac{x_i - x}{h}\right| \le 1\right) \\
&= \frac{1}{nh}\sum_{i=1}^{n} w\!\left(\frac{x_i - x}{h}\right),
\end{aligned}
\tag{1.2.2}
\]
where
\[
I = \begin{cases} 1 & x-h \le x_i \le x+h, \\ 0 & \text{otherwise}, \end{cases}
\]
is the indicator function, and
\[
w\!\left(\frac{x_i - x}{h}\right) = \frac{1}{2}\, I\!\left(\left|\frac{x_i - x}{h}\right| \le 1\right)
= \begin{cases} \tfrac{1}{2} & -1 \le \frac{x_i - x}{h} \le 1, \\ 0 & \text{otherwise}. \end{cases}
\]
8
Generally, the expression in Eqn. (1.2.2) exhibits a function w, centered at the estimation point, that is used to weight nearby data points. We are now ready to introduce the following definition.

Definition 1.2.1. We call the function that is centered at the estimation point and used to weight nearby data points a weight function, or kernel function, and denote it by K(·).

Eqn. (1.2.2) then becomes
\[
\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right). \tag{1.2.3}
\]
In fact, the above form of the kernel function is called the uniform kernel, and it is one of several forms of this function.
9
1.2.1 Properties And Examples Of The Kernel Function
1. The kernel is usually a symmetric probability density function (pdf). Thus we expect it to have the same properties as a pdf, that is,
\[
\int_{\mathbb{R}} K(u)\,du = 1 \quad \text{and} \quad K(u) \ge 0.
\]

2. The shape of \hat{f}(x;h) does not depend upon the choice of origin, but it is affected by the bandwidth h. The theoretical background of this insensitivity is that kernel functions can be re-scaled so that the difference between two kernel density estimates using two different kernels is almost negligible (more details in Section (2.5)).
Some examples of kernel functions:

Kernel                          K(u)
(a) Uniform                     (1/2) I(|u| ≤ 1)
(b) Triangle                    (1 − |u|) I(|u| ≤ 1)
(c) Gaussian                    (2π)^{-1/2} exp(−u²/2)
(d) Triweight (Beta(4,4))       (35/32)(1 − u²)³ I(|u| ≤ 1)
(e) Quartic                     (15/16)(1 − u²)² I(|u| ≤ 1)
(f) Epanechnikov                (3/4)(1 − u²) I(|u| ≤ 1)
(g) Cosinus                     (π/4) cos(πu/2) I(|u| ≤ 1)

Table (1.2.1)
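For reference, the kernels in Table (1.2.1) can be written directly as Python functions. This is only a transcription of the table into code, with a numerical check that each kernel integrates to one; the grid used for the check is an arbitrary choice.

import numpy as np

# K(u) for each kernel in Table (1.2.1); all are symmetric pdfs
kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangle":     lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "gaussian":     lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
    "triweight":    lambda u: (35/32) * (1 - u**2)**3 * (np.abs(u) <= 1),
    "quartic":      lambda u: (15/16) * (1 - u**2)**2 * (np.abs(u) <= 1),
    "epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
    "cosinus":      lambda u: (np.pi/4) * np.cos(np.pi * u / 2) * (np.abs(u) <= 1),
}

# crude Riemann-sum check that each K integrates to one
u = np.linspace(-8.0, 8.0, 160001)
du = u[1] - u[0]
for name, K in kernels.items():
    print(name, (K(u) * du).sum())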
10
Some of the kernels in Table (1.2.1) that will be needed are plotted in Figure (1.2.1).
Figure (1.2.1)
Chapter 2
Univariate Kernel Density Estimation
2.1 Introduction
Among the many tools of data analysis, nonparametric density estimation is especially important, because it provides an effective way of revealing structure in a set of data when parametric methods are inappropriate. This happens when assuming a particular parametric model would mean missing some of the main structure in the data. Moreover, the univariate density estimator is the most straightforward of the kernel estimators, and a good understanding of it makes it easier to move on to extensions to the multivariate case, to more complicated kernel-based methods, or to estimating curves instead of densities.
Definition 2.1.1. Suppose that X_1, ..., X_n is a univariate random sample of continuous type drawn from an unknown distribution with density f(x). The kernel density estimator (KDE), \hat{f}(x;h), for the estimation of the density value f(x) at a point x is defined as
\[
\hat{f}(x;h) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right), \tag{2.1.1}
\]
where K(·) is a kernel function and h is called the bandwidth or window-width.
11
12
Remark 2.1.1.
1. The properties of the kernel function in Section (1.2) show that the kernel function is a pdf. This ensures that \hat{f}(x;h) in Eqn. (2.1.1) is itself a density.
2. The KDE in Eqn. (2.1.1) can be reformulated as
\[
\hat{f}(x;h) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i), \tag{2.1.2}
\]
where K_h(u) = K(u/h)/h.
According to this definition, it is possible to think of the KDE in the following way: imagine the n observations x_1, x_2, ..., x_n plotted on a line; a KDE is obtained by placing a bump at each point and then summing the heights of the bumps at each point on the x-axis. The shape of the bump is determined by the kernel function, and the spread of the bump is determined by the bandwidth h, which is analogous to the binwidth of a histogram. In other words, the value of the kernel estimate at a point x is the average of the n kernel ordinates at this point. A minimal code sketch of this construction is given below.
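A minimal Python sketch of Eqn. (2.1.2), assuming a standard normal kernel; the simulated data, bandwidth and evaluation grid are illustrative only.

import numpy as np

def kde(x, data, h, K=lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate f_hat(x; h) = n^{-1} sum_i K_h(x - X_i)."""
    x = np.atleast_1d(x)[:, None]        # evaluation points as a column
    u = (x - data[None, :]) / h          # (x - X_i)/h for every pair (x, X_i)
    return K(u).mean(axis=1) / h         # average of the n kernel ordinates

rng = np.random.default_rng(1)
data = rng.normal(size=100)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.3))            # one "bump average" per grid point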
Example 2.1.1. Choose the kernel function to be the N(0, 1) density, K(x) = φ(x) = (2π)^{-1/2} exp(−x²/2). Using five observations (for illustration), Figure (2.1.1) shows the resulting kernel density estimate.
13
Remark number (2) and the previous example give an idea about the role of thebandwidth h as a scaling factor which controls the spread of the kernel.
Definition 2.1.2. The bandwidth h is often called the smoothing parameter, since itcontrols the amount of smoothing being applied to the data.
Example 2.1.2. A random sample of size n = 1000 was drawn from the normal mixture density
\[
f_1(x) = \frac{3}{4}\,\phi(x) + \frac{1}{4}\,\phi_{1/3}\!\left(x - \frac{3}{2}\right).
\]
That is, f_1 consists of N(0, 1) observations with probability 3/4 and N(3/2, (1/3)²) observations with probability 1/4.
The kernel density estimates obtained from this sample are shown in the figure below.

Figure (2.1.2)
14
In Figure (2.1.2), the solid line is the estimate and the dashed one is the true density. The kernel used for each estimate is illustrated by small kernels at the base of each figure.
Figure (2.1.2)(a), with h = 0.06, shows an estimate of f that is very rough; it is usually called an undersmoothed estimate.
Figure (2.1.2)(b), with h = 0.54, shows an estimate of f that is too smooth, so that one mode is missed.
Figure (2.1.2)(c), with h = 0.18, is a reasonable estimate of f, which retains the main features of the true f.
2.2 The MSE And MISE Criteria
The important role played by the KDE makes us concerned with its performance: its efficiency and accuracy in estimating the true density. So it is reasonable to investigate some error criteria for measuring the error when estimating the density, first at a single point rather than over the whole real line. We have seen that the bandwidth h is the most important parameter in determining the smoothness of a density estimate. It determines a trade-off between two types of error (as we will see): bias and variance.
Our goal is to choose a good bandwidth, so we need to understand this relationship. These two types of error are the components of the so-called mean squared error.
Definition 2.2.1. The mean squared error (MSE) of an estimator \hat{f}(x;h) of a density f(x) is the expected value of the squared difference between the density estimator and the true density function; it is denoted by E(\hat{f}(x;h) − f(x))².
From its definition, the MSE measures the average squared difference between the density estimator and the true density. In general, any function of the absolute distance |\hat{f}(x;h) − f(x)| (often called a metric) could serve as a measure of the goodness of an estimator, but the MSE has at least two advantages over other metrics. First, it is analytically tractable. Second, it has an interesting decomposition
15
into variance and squared bias, provided that f(x) is not random, as follows:
\[
\begin{aligned}
\mathrm{MSE}\{\hat{f}(x;h)\} &= E\big(\hat{f}(x;h) - f(x)\big)^2 = E\big(f^2(x) - 2f(x)\hat{f}(x;h) + \hat{f}^2(x;h)\big) \\
&= f^2(x) - 2f(x)E\hat{f}(x;h) + E\hat{f}^2(x;h) \\
&= f^2(x) - 2f(x)E\hat{f}(x;h) + \mathrm{Var}\,\hat{f}(x;h) + \big(E\hat{f}(x;h)\big)^2 \\
&= \mathrm{Var}\,\hat{f}(x;h) + \big(E\hat{f}(x;h) - f(x)\big)^2,
\end{aligned}
\tag{2.2.1}
\]
where the bias of an estimator is defined as follows.
Definition 2.2.2. The bias of an estimator \hat{f} of a density f is the difference between the expected value of \hat{f} and f. That is,
\[
\mathrm{bias}(\hat{f}) = E\hat{f} - f.
\]
An estimator whose bias is equal to zero is called unbiased.
In what follows we compute the MSE of an estimator \hat{f}(x;h) of the density function f(x) at a point x ∈ R.
Theorem 2.2.1. Let X be a random variable having density f. Then
\[
\mathrm{MSE}\{\hat{f}(x;h)\} = n^{-1}\left\{\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right\} + \left(\int K_h(x-y)f(y)\,dy - f(x)\right)^{2}. \tag{2.2.2}
\]
Proof. Note that
\[
E\hat{f}(x;h) = E K_h(x - X) = \int K_h(x-y) f(y)\,dy, \tag{2.2.3}
\]
because \hat{f}(x;h) is the sample mean of the independent and identically distributed quantities K_h(x − X_i), so its expectation equals E K_h(x − X) and its variance equals n^{-1}\mathrm{Var}\,K_h(x − X). Therefore, the bias of \hat{f}(x;h) is
\[
E\hat{f}(x;h) - f(x) = \int K_h(x-y)f(y)\,dy - f(x),
\]
and the variance is
\[
\begin{aligned}
\mathrm{Var}\{\hat{f}(x;h)\} &= n^{-1}\,\mathrm{Var}\,K_h(x-X) \\
&= n^{-1}\big(E K_h^2(x-X) - (E K_h(x-X))^2\big) \\
&= n^{-1}\left\{\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right\}.
\end{aligned}
\]
Combining these gives
\[
\mathrm{MSE}\{\hat{f}(x;h)\} = n^{-1}\left\{\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right\} + \left(\int K_h(x-y)f(y)\,dy - f(x)\right)^{2}.
\]
Now, we are interested in considering an error criterion that globally measures the
distance between the estimation of f over the entire real line and f itself.
Definition 2.2.3. An error criterion that measures the distance between \hat{f}(·;h) and f is the integrated squared error (ISE), given by
\[
\mathrm{ISE}\{\hat{f}(\cdot;h)\} = \int \big(\hat{f}(x;h) - f(x)\big)^2\,dx.
\]
However, the ISE is not an appropriate criterion across all data sets, so we prefer to analyze the expected value of this random quantity, the mean integrated squared error.

Definition 2.2.4. The expected value of the ISE is called the mean integrated squared error (MISE). It is given by
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = E\big(\mathrm{ISE}\{\hat{f}(\cdot;h)\}\big) = E\int \big(\hat{f}(x;h) - f(x)\big)^2\,dx.
\]
Theorem 2.2.2. The MISE of an estimator \hat{f}(·;h) of a density f is given by
\[
\begin{aligned}
\mathrm{MISE}\{\hat{f}(\cdot;h)\} ={}& n^{-1}\int\!\!\int K_h^2(x-y)f(y)\,dy\,dx \\
&+ (1 - n^{-1})\int\left(\int K_h(x-y)f(y)\,dy\right)^{2} dx \\
&- 2\int\left\{\int K_h(x-y)f(y)\,dy\right\} f(x)\,dx + \int f^2(x)\,dx.
\end{aligned}
\tag{2.2.6}
\]
17
Proof. The definitions of MISE and MSE and some calculation yield
\[
\begin{aligned}
\mathrm{MISE}\{\hat{f}(\cdot;h)\} &= E\int \big(\hat{f}(x;h) - f(x)\big)^2\,dx = \int E\big(\hat{f}(x;h) - f(x)\big)^2\,dx = \int \mathrm{MSE}\{\hat{f}(x;h)\}\,dx \\
&= n^{-1}\left\{\int\!\!\int K_h^2(x-y)f(y)\,dy\,dx - \int\left(\int K_h(x-y)f(y)\,dy\right)^{2} dx\right\} \\
&\quad + \int\left\{\int K_h(x-y)f(y)\,dy - f(x)\right\}^{2} dx \\
&= n^{-1}\int\!\!\int K_h^2(x-y)f(y)\,dy\,dx + (1 - n^{-1})\int\left(\int K_h(x-y)f(y)\,dy\right)^{2} dx \\
&\quad - 2\int\!\!\int K_h(x-y)f(y)\,dy\, f(x)\,dx + \int f^2(x)\,dx.
\end{aligned}
\]
This theorem shows that MISE\{\hat{f}(\cdot;h)\} depends on h in a relatively complicated way. In practice, we need an expression with a clearer dependence on h.
2.3 Asymptotic MSE And MISE Approximations
Here, we will derive an asymptotic approximation for MISE which depends on h in a
simple way. The simple expressions of these approximations will exhibit the influence
of the bandwidth h as a smoothing parameter.
The rate of convergence (defined below) of the KDE and the MISE-optimal bandwidth can also be obtained from the asymptotic approximation of the MISE.
Before starting our investigation, we introduce some definitions, a theorem, and some assumptions that are needed throughout this work.
Definition 2.3.1. [1]An ultimately monotone function is one that is monotone overboth (−∞,−M) and (M,∞) for some M > 0.
18
Definition 2.3.2. [5] i. A function f is of smaller order than g as x → ∞ if lim_{x→∞} f(x)/g(x) = 0. We indicate this by writing f = o(g) ("f is little oh of g").
ii. Let f(x) and g(x) be positive for x sufficiently large. Then f is of at most the order of g as x → ∞ if there is a positive constant M for which
\[
\frac{f(x)}{g(x)} \le M
\]
for x sufficiently large. We indicate this by writing f = O(g) ("f is big oh of g").

Definition 2.3.3. [3] Given two sequences {a_n} and {b_n} such that b_n ≥ 0 for all n, we write
a_n = O(b_n) (read: "a_n is big oh of b_n")
if there exists a constant M > 0 such that |a_n| ≤ M b_n for all n. We write
a_n = o(b_n) as n → ∞ (read: "a_n is little oh of b_n") if lim_{n→∞} a_n/b_n = 0.

Definition 2.3.4. [1] We say that a_n is asymptotically equivalent to b_n, or simply a_n is asymptotic to b_n, and write a_n ∼ b_n, if and only if lim_{n→∞}(a_n/b_n) = 1.

Definition 2.3.5. [1] If the sequence {a_n} satisfies a_n ∼ C r_n, where r_n is a simple function of n and C is independent of n, then we call r_n the rate of convergence to zero of a_n. It is also common to say that a_n is of order r_n.

Theorem 2.3.1. (Taylor's Theorem) Suppose that f is a real-valued function defined on R and let x ∈ R. Assume that f has p continuous derivatives in an interval (x − δ, x + δ) for some δ > 0 and that the (p + 1)th derivative of f exists. Then for any sequence α_n converging to zero,
\[
f(x + \alpha_n) = \sum_{j=0}^{p} (\alpha_n^{j}/j!)\, f^{(j)}(x) + o(\alpha_n^{p}).
\]
The assumptions that we need are:
1. The density f is such that its second derivative f ′′ is continuous, squared inte-
grable and ultimately monotone.
19
2. The bandwidth h = h_n is a non-random sequence of positive numbers. We also assume that h satisfies
\[
\lim_{n\to\infty} h = 0 \quad \text{and} \quad \lim_{n\to\infty} nh = \infty,
\]
which is equivalent to saying that h approaches zero, but at a rate slower than n^{-1}.

3. The kernel K is a bounded probability density function having finite fourth moment and symmetric about the origin.
Lemma 2.3.2. Let X be a random variable having density f. Then the bias of \hat{f}(x;h) can be expressed as
\[
E\hat{f}(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2). \tag{2.3.1}
\]

Proof. As before,
\[
E\hat{f}(x;h) = \int K_h(x-y)f(y)\,dy = \int \frac{1}{h}K\!\left(\frac{x-y}{h}\right)f(y)\,dy.
\]
Set z = (x − y)/h. Then
\[
E\hat{f}(x;h) = \int K(z) f(x - hz)\,dz.
\]
We expand f(x − hz) in a Taylor series about x: f is a real-valued function defined on R and, by Assumption (1), f has continuous derivatives of order 2; as n → ∞, Assumption (2) implies −zh → 0. Thus
\[
f(x - hz) = \sum_{j=0}^{2} \frac{(-zh)^j}{j!} f^{(j)}(x) + o\big((zh)^2\big)
= f(x) - zh f'(x) + \frac{1}{2}h^2 z^2 f''(x) + o(h^2).
\]
This yields
\[
\begin{aligned}
E\{\hat{f}(x;h)\} &= \int K(z)\left(f(x) - zh f'(x) + \frac{1}{2}h^2 z^2 f''(x) + o(h^2)\right)dz \\
&= f(x) - h f'(x)\int z K(z)\,dz + \frac{1}{2}h^2 f''(x)\int z^2 K(z)\,dz + o(h^2) \\
&= f(x) + \frac{1}{2}h^2 f''(x)\int z^2 K(z)\,dz + o(h^2),
\end{aligned}
\]
where
\[
\int K(z)\,dz = 1, \qquad \int z K(z)\,dz = 0, \qquad \int z^2 K(z)\,dz < \infty
\]
(from the kernel function properties and Assumption (3)). Letting µ_2(K) = ∫ z²K(z) dz, the bias expression can be written as
\[
E\hat{f}(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).
\]
Here the bias is of order h², so \hat{f}(x;h) is asymptotically unbiased. Another important point is that the bias depends on the true density f: it is directly proportional to the second derivative of f. In other words, the above expression gives a relationship between the bias and the curvature of the density f. The bias is large where the curvature of f is high. For many densities this occurs at peaks, where the bias is negative, and in valleys, where the bias is positive.
Lemma 2.3.3. Let X be a random variable having density f. Then
\[
\mathrm{Var}\{\hat{f}(x;h)\} = (nh)^{-1}R(K)f(x) + o\big((nh)^{-1}\big), \tag{2.3.2}
\]
where R(K) = ∫ K²(x) dx.

Proof. We express Var\{\hat{f}(x;h)\} as
\[
\mathrm{Var}\{\hat{f}(x;h)\} = n^{-1}\left[\int K_h^2(x-y)f(y)\,dy - \left(\int K_h(x-y)f(y)\,dy\right)^{2}\right].
\]
Using the Taylor series expansion of f(x − hz) about x, we obtain
\[
\begin{aligned}
\mathrm{Var}\{\hat{f}(x;h)\} &= (nh)^{-1}\int K^2(z)f(x - hz)\,dz - n^{-1}\{E\hat{f}(x;h)\}^2 \\
&= (nh)^{-1}\int K^2(z)\{f(x) + o(1)\}\,dz - n^{-1}\{f(x) + o(1)\}^2 \\
&= (nh)^{-1} f(x)\int K^2(z)\,dz + o\big((nh)^{-1}\big).
\end{aligned}
\]
With the notation R(K) = ∫ K²(x) dx, we can write
\[
\mathrm{Var}\{\hat{f}(x;h)\} = (nh)^{-1}R(K)f(x) + o\big((nh)^{-1}\big).
\]
From Eqn. (2.3.2) the variance is of order (nh)^{-1}, and hence, by Assumption (2), lim_{n→∞}(nh)^{-1} = 0, so Var\,\hat{f} converges to zero.
Theorem 2.3.4. The MISE of the estimator \hat{f} of the unknown density f is given by
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = \mathrm{AMISE}\{\hat{f}(\cdot;h)\} + o\{(nh)^{-1} + h^4\},
\]
where
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'') \tag{2.3.3}
\]
is called the asymptotic MISE of \hat{f}(\cdot;h).

Proof. The definition of the MSE and Lemmas (2.3.2) and (2.3.3) combine to give
\[
\begin{aligned}
\mathrm{MSE}\{\hat{f}(x;h)\} &= (nh)^{-1}R(K)f(x) + o\big((nh)^{-1}\big) + \frac{1}{4}h^4\mu_2^2(K)f''(x)^2 + o(h^4) + h^2\mu_2(K)f''(x)\,o(h^2) \\
&= (nh)^{-1}R(K)f(x) + \frac{1}{4}h^4\mu_2^2(K)f''(x)^2 + o\big((nh)^{-1} + h^4\big).
\end{aligned}
\]
Integrating this expression yields
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'') + o\big((nh)^{-1} + h^4\big),
\]
hence
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'').
\]
22
The asymptotic MISE (AMISE) has some useful advantages. Its simplicity as a mathematical expression makes it convenient for large sample approximations. It also displays an important relationship between bias and variance, known as the variance-bias trade-off, which gives us an understanding of the role of the bandwidth h. More precisely, the integrated squared bias term in the AMISE is asymptotically proportional to h⁴, so in order to decrease it we need to make h small; but if we do that, we increase the integrated variance, since it is inversely proportional to h. Therefore, as n increases, h should vary in such a way that each of the components of the MISE becomes smaller.
Moreover, the following corollary gives a particularly useful expression.
Corollary 2.3.5. The AMISE-optimal bandwidth, h_AMISE, has the closed form
\[
h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5}. \tag{2.3.4}
\]

Proof. Differentiating Eqn. (2.3.3) with respect to h and setting the derivative equal to zero gives
\[
\frac{d}{dh}\big(\mathrm{AMISE}\,\hat{f}\big) = -(nh^2)^{-1}R(K) + h^3\mu_2^2(K)R(f'') = 0,
\]
so
\[
h^5\mu_2^2(K)R(f'') = n^{-1}R(K),
\]
and therefore
\[
h_{\mathrm{AMISE}} = \left(\frac{R(K)}{n\,\mu_2^2(K)R(f'')}\right)^{1/5}.
\]
Examining this expression, we see that h_AMISE depends on the known quantities K and n and is inversely proportional to R(f'')^{1/5}. The quantity R(f'') measures the total curvature of f. So if R(f'') is small, f has little curvature and the bandwidth h will be large; conversely, h will be small if R(f'') is large.
The previous expression for the optimal h could be used to choose a good bandwidth if R(f'') were known, but in practice it is not. We will therefore investigate methods for selecting h based on estimating R(f''), in Chapter 4.
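Eqn. (2.3.4) is straightforward to evaluate once R(K), µ_2(K) and R(f'') are known. The small sketch below does this for the standard normal kernel and an N(0, σ²) target density, for which R(f'') = 3/(8√π σ⁵) (the value used again in Lemma 4.2.1); the chosen n and σ are illustrative only.

import numpy as np

def h_amise(RK, mu2K, Rf2, n):
    """AMISE-optimal bandwidth, Eqn. (2.3.4)."""
    return (RK / (mu2K**2 * Rf2 * n)) ** 0.2

# standard normal kernel: R(K) = 1/(2*sqrt(pi)), mu_2(K) = 1
RK, mu2K = 1 / (2 * np.sqrt(np.pi)), 1.0

# N(0, sigma^2) true density: R(f'') = 3 / (8 * sqrt(pi) * sigma^5)
sigma, n = 1.0, 1000
Rf2 = 3 / (8 * np.sqrt(np.pi) * sigma**5)

print(h_amise(RK, mu2K, Rf2, n))   # equals (4/(3n))^(1/5)*sigma, about 0.266 here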
Corollary 2.3.6.
\[
\inf_{h>0}\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = \frac{5}{4}\big(\mu_2^2(K)R(K)^4R(f'')\big)^{1/5} n^{-4/5}. \tag{2.3.5}
\]

Proof. Substituting Eqn. (2.3.4) into Eqn. (2.3.3) yields
\[
\begin{aligned}
\mathrm{AMISE}\{\hat{f}(\cdot;h_{\mathrm{AMISE}})\} &= (nh_{\mathrm{AMISE}})^{-1}\Big(R(K) + \frac{n}{4}h_{\mathrm{AMISE}}^5\mu_2^2(K)R(f'')\Big) \\
&= \frac{5}{4}(nh_{\mathrm{AMISE}})^{-1}R(K) \qquad \text{(since } n h_{\mathrm{AMISE}}^5\mu_2^2(K)R(f'') = R(K)\text{)} \\
&= \frac{5}{4}\, n^{-4/5}\, R(K)^{4/5}\big(\mu_2^2(K)R(f'')\big)^{1/5}.
\end{aligned}
\]
Since the AMISE is minimized at h_AMISE (f and K are non-negative), taking the infimum over h > 0 gives
\[
\inf_{h>0}\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = \frac{5}{4}\big(\mu_2^2(K)R(K)^4R(f'')\big)^{1/5} n^{-4/5}.
\]
In terms of the MISE itself, and using the asymptotic notation, we can rewrite Eqns. (2.3.4) and (2.3.5) as
\[
h_{\mathrm{MISE}} \sim \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5} \tag{2.3.6}
\]
and
\[
\inf_{h>0}\mathrm{MISE}\{\hat{f}(\cdot;h)\} \sim \frac{5}{4}\big(\mu_2^2(K)R(K)^4R(f'')\big)^{1/5} n^{-4/5}. \tag{2.3.7}
\]
In fact, Eqn. (2.3.5) gives the smallest possible AMISE for estimation of f using the kernel K. As the last expression for the infimum of the MISE shows, the best rate of convergence of the MISE of the kernel estimator is of order n^{-4/5}. This rate is slower than n^{-1}, the typical rate of convergence of the MSE in parametric estimation. For illustration, we introduce the following example; some of the ideas in this example belong to parametric estimation and would need to be studied in detail elsewhere.
Example 2.3.1. Consider a random sample X_1, X_2, ..., X_n from the N(µ, σ²) distribution and the parameter exp(µ). The maximum likelihood estimator of exp(µ) is exp(\bar{X}) [2], where \bar{X} is the sample mean, \bar{X} = n^{-1}\sum_{i=1}^{n}X_i. We are interested in finding MSE\{exp(\bar{X})\}. Given that the moment-generating function of \bar{X} [4] is
\[
M(t) = \exp\{\mu t + (\sigma^2/n)t^2/2\},
\]
we have
\[
\begin{aligned}
\mathrm{MSE}\{\exp(\bar{X})\} &= \mathrm{Var}\{e^{\bar{X}}\} + \big(E e^{\bar{X}} - e^{\mu}\big)^2 \\
&= E\big[e^{2\bar{X}}\big] - \big(E e^{\bar{X}}\big)^2 + \big(E e^{\bar{X}} - e^{\mu}\big)^2 \\
&= E\big[e^{2\bar{X}}\big] - 2e^{\mu}E e^{\bar{X}} + e^{2\mu} \\
&= e^{2\mu + 2\sigma^2/n} - 2e^{\mu}e^{\mu + \sigma^2/(2n)} + e^{2\mu} \\
&= e^{2\mu}\big[e^{2\sigma^2/n} - 2e^{\sigma^2/(2n)} + 1\big].
\end{aligned}
\]
Now, using the power series expansion of the exponential function, e^x = \sum_{k=0}^{\infty} x^k/k!, we get
\[
\begin{aligned}
\mathrm{MSE}\{\exp(\bar{X})\} &= e^{2\mu}\Big(\big[1 + 2\sigma^2 n^{-1} + 4\sigma^4 n^{-2}/2! + \cdots\big] - 2\big[1 + \sigma^2/(2n) + \sigma^4/(8n^2) + \cdots\big] + 1\Big) \\
&= e^{2\mu}\Big(\sigma^2 n^{-1} + \frac{7}{4}\sigma^4 n^{-2} + \cdots\Big) \\
&= \sigma^2 e^{2\mu} n^{-1}\Big(1 + \frac{7}{4}\sigma^2 n^{-1} + \cdots\Big) \\
&\sim \sigma^2 e^{2\mu} n^{-1}.
\end{aligned}
\]
So the rate of convergence of the MSE is of order n^{-1}, which is typical for the MSE in parametric estimation.
2.4 Exact MISE Calculation
We showed in the previous section some advantages of the AMISE\{\hat{f}(\cdot;h)\} formula.
Note that AMISE\{\hat{f}(\cdot;h)\} is only a large sample approximation to MISE\{\hat{f}(\cdot;h)\}, which is given by Eqn. (2.2.6).
In some cases we need to analyze the exact finite sample performance of the kernel density estimator for a particular K and f. To do this, MISE\{\hat{f}(\cdot;h)\} can be computed using Eqn. (2.2.6). In order to avoid dealing with the several integrals there, f and K can be chosen so that Eqn. (2.2.6) can be computed exactly.
Lemma 2.4.1. Let φ_σ(x − µ) denote the N(µ, σ²) density, φ_σ(x − µ) = (2πσ²)^{-1/2} exp{−(x − µ)²/(2σ²)}. Then the algebraic identity
\[
\phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu') = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu')\;\phi_{\sigma\sigma'/(\sigma^2+\sigma'^2)^{1/2}}(x-\mu^*),
\]
where µ* = (σ'²µ + σ²µ')/(σ² + σ'²), is valid.

Proof. We begin with the formula for the N(µ, σ²) density,
\[
\phi_\sigma(x-\mu) = (2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/(2\sigma^2)\}.
\]
Then
\[
\phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu') = (2\pi\sigma\sigma')^{-1}\exp\Big\{-\tfrac{1}{2}\big((x-\mu)^2/\sigma^2 + (x-\mu')^2/\sigma'^2\big)\Big\}.
\]
Expanding the squares, collecting terms, and completing the square in x in the exponent shows that
\[
\frac{(x-\mu)^2}{\sigma^2} + \frac{(x-\mu')^2}{\sigma'^2} = \frac{(x-\mu^*)^2}{\sigma^2\sigma'^2/(\sigma^2+\sigma'^2)} + \frac{(\mu-\mu')^2}{\sigma^2+\sigma'^2},
\]
where µ* = (σ'²µ + σ²µ')/(σ² + σ'²). Rewriting the product accordingly gives
\[
\big(2\pi(\sigma^2+\sigma'^2)\big)^{-1/2}\exp\Big\{\frac{-(\mu-\mu')^2}{2(\sigma^2+\sigma'^2)}\Big\}\cdot\Big(\frac{2\pi\sigma^2\sigma'^2}{\sigma^2+\sigma'^2}\Big)^{-1/2}\exp\Big\{\frac{-(x-\mu^*)^2}{2\sigma^2\sigma'^2/(\sigma^2+\sigma'^2)}\Big\}.
\]
Thus
\[
\phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu') = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu')\;\phi_{\sigma\sigma'/(\sigma^2+\sigma'^2)^{1/2}}(x-\mu^*).
\]
26
Theorem 2.4.2.
\[
\int \phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu')\,dx = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu'), \tag{2.4.1}
\]
where φ_σ(x − µ) = (2πσ²)^{-1/2} exp{−(x − µ)²/(2σ²)}.

Proof. Integrating the identity in the lemma with respect to x gives
\[
\int \phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu')\,dx = \phi_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu')\int \phi_{\sigma\sigma'/(\sigma^2+\sigma'^2)^{1/2}}(x-\mu^*)\,dx.
\]
The integral on the right-hand side equals one, since φ_{σσ'/(σ²+σ'²)^{1/2}}(x − µ*) is the density of the N(µ*, σ²σ'²/(σ² + σ'²)) distribution.
Example 2.4.1. Choose K to be the N(0, 1) density and f to be the N(0, σ²) density. Then
\[
K_h(x) = \phi_h(x) \quad \text{and} \quad f(x) = \phi_\sigma(x).
\]
We now compute each term on the right-hand side of Eqn. (2.2.6). First,
\[
\int K^2(x)\,dx = \int \phi^2(x)\,dx = \phi_{\sqrt{2}}(0) = (2\pi^{1/2})^{-1}.
\]
Also,
\[
\int K_h(x-y)f(y)\,dy = \int \phi_h(y-x)\phi_\sigma(y)\,dy = \phi_{(h^2+\sigma^2)^{1/2}}(x).
\]
Therefore the integral in the second term becomes
\[
\int \phi^2_{(h^2+\sigma^2)^{1/2}}(x)\,dx = \phi_{(2h^2+2\sigma^2)^{1/2}}(0) = (2\pi^{1/2})^{-1}(\sigma^2+h^2)^{-1/2}.
\]
The integral in the third term becomes
\[
\int \phi_{(h^2+\sigma^2)^{1/2}}(x)\,\phi_\sigma(x)\,dx = \phi_{(h^2+2\sigma^2)^{1/2}}(0) = \big(2\pi(h^2+2\sigma^2)\big)^{-1/2}.
\]
The last term is
\[
\int f^2(x)\,dx = \int \phi^2_\sigma(x)\,dx = (4\pi\sigma^2)^{-1/2}.
\]
Finally, substituting into Eqn. (2.2.6) yields
\[
2\pi^{1/2}\,\mathrm{MISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1} + (1-n^{-1})(h^2+\sigma^2)^{-1/2} + \sigma^{-1} - 2^{3/2}(2\sigma^2+h^2)^{-1/2}.
\]
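The closed-form expression above can be evaluated directly. Below is a small sketch assuming the same normal kernel and N(0, σ²) density; the values of n and σ and the grid of bandwidths are arbitrary choices for illustration.

import numpy as np

def exact_mise_normal(h, sigma, n):
    """Exact MISE for K = N(0,1) kernel and f = N(0, sigma^2), Example 2.4.1."""
    rhs = (1 / (n * h)
           + (1 - 1 / n) * (h**2 + sigma**2) ** -0.5
           + 1 / sigma
           - 2 ** 1.5 * (2 * sigma**2 + h**2) ** -0.5)
    return rhs / (2 * np.sqrt(np.pi))

h_grid = np.linspace(0.05, 1.0, 200)
mise = exact_mise_normal(h_grid, sigma=1.0, n=100)
print(h_grid[np.argmin(mise)])   # close to the asymptotic value (4/(3n))^(1/5) ≈ 0.42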
27
2.5 Canonical Kernel And Optimal Kernel Theory
In the previous sections we worked with the kernel function K under certain assumptions: it was assumed to be a symmetric and unimodal density.
Besides simplicity, there are other considerations because of which density estimators based on kernels that do not satisfy these assumptions are ignored. Many kernel functions satisfy these assumptions, so one may ask:
Are all kernel functions that satisfy these requirements equally appropriate?
Do they perform the same role with the same degree of effectiveness?
In what follows, we concentrate on investigating the effect of the shape of the kernel function K.
Consider the formula for AMISE\{\hat{f}(\cdot;h)\} in Eqn. (2.3.3). In this formula, the scaling of K is entangled with the bandwidth h, which causes difficulty in optimizing with respect to K. If we choose a re-scaling of K of the form
\[
K_\delta(\cdot) = (1/\delta)K(\cdot/\delta),
\]
the dependence on K and on h can be separated. The following lemma shows how this can be done.
Lemma 2.5.1. R(K_δ) = µ_2^2(K_δ) holds if and only if δ = δ_0 = {R(K)/µ_2^2(K)}^{1/5}.

Proof. (i) Assume that R(K_δ) = µ_2^2(K_δ). From the re-scaling K_δ(·) = (1/δ)K(·/δ) we have R(K_δ) = (1/δ)R(K) and µ_2^2(K_δ) = δ⁴µ_2^2(K), so
\[
(1/\delta)R(K) = \delta^4\mu_2^2(K).
\]
This yields δ⁵ = R(K)/µ_2^2(K), which implies δ = {R(K)/µ_2^2(K)}^{1/5}.
(ii) Now assume that δ = δ_0 = {R(K)/µ_2^2(K)}^{1/5}. Starting with R(K_{δ_0}),
\[
R(K_{\delta_0}) = \delta_0^{-1}R(K) = \{\mu_2^2(K)/R(K)\}^{1/5}R(K) = \mu_2^{2/5}(K)R^{4/5}(K)
= \{R(K)/\mu_2^2(K)\}^{4/5}\mu_2^2(K) = \delta_0^{4}\mu_2^2(K) = \mu_2^2(K_{\delta_0}).
\]
Theorem 2.5.2. Let the re-scaled kernel K_{δ_0} be used in the kernel estimator, where δ_0 = {R(K)/µ_2^2(K)}^{1/5}, so that R(K_{δ_0}) = µ_2^2(K_{δ_0}). Then
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = C(K_{\delta_0})\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\}, \tag{2.5.1}
\]
where C(K) = {R(K)^4\mu_2^2(K)}^{1/5}.

Proof. With the kernel K_{δ_0},
\[
\begin{aligned}
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} &= (nh)^{-1}R(K_{\delta_0}) + \frac{1}{4}h^4\mu_2^2(K_{\delta_0})R(f'') \\
&= R(K_{\delta_0})\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\} \qquad \text{(since } R(K_{\delta_0}) = \mu_2^2(K_{\delta_0})\text{)} \\
&= \{R^4(K_{\delta_0})\mu_2^2(K_{\delta_0})\}^{1/5}\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\} \\
&= C(K_{\delta_0})\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\},
\end{aligned}
\]
where the third equality uses R(K_{δ_0}) = {R^4(K_{δ_0})µ_2^2(K_{δ_0})}^{1/5}, which holds because R(K_{δ_0}) = µ_2^2(K_{δ_0}).
Definition 2.5.1. We say that C(K) is invariant to re-scaling of K if C(K_{δ_1}) = C(K_{δ_2}) for any δ_1, δ_2 > 0.

Remark 2.5.1. C(K) is invariant to re-scaling of K.

Proof. We have to show that C(K_{δ_1}) = C(K_{δ_2}) for any δ_1, δ_2 > 0. Starting with C(K_{δ_1}),
\[
C(K_{\delta_1}) = \{R^4(K_{\delta_1})\mu_2^2(K_{\delta_1})\}^{1/5}
= \{\delta_1^{-4}R^4(K)\cdot\delta_1^{4}\mu_2^2(K)\}^{1/5} = \{R^4(K)\mu_2^2(K)\}^{1/5}
= \{\delta_2^{-4}R^4(K)\cdot\delta_2^{4}\mu_2^2(K)\}^{1/5} = \{R^4(K_{\delta_2})\mu_2^2(K_{\delta_2})\}^{1/5}
= C(K_{\delta_2}).
\]
Definition 2.5.2. For the class {Kδ : δ > 0} of re-scalings of K, the unique memberof that class that separates the dependence of K and h in Eqn.(2.3.3) is called thecanonical kernel, and denoted by Kc = Kδ0 .
Example 2.5.1. For the KDE
\[
\hat{f}(x;h) = n^{-1}\sum_{i=1}^{n}K_h(x - X_i),
\]
take K = φ, the standard normal kernel. We compute C(K_{δ_0}) in the AMISE\{\hat{f}(\cdot;h)\} formula. Using Lemma (2.5.1) we obtain δ_0 = (4π)^{-1/10}, since
\[
R(K) = \int \phi^2(x)\,dx = \phi_{\sqrt{2}}(0) = (4\pi)^{-1/2}
\]
and, for the standard normal kernel, µ_2^2(K) = 1.
The canonical kernel for the class {φ_δ : δ > 0} is
\[
\phi^c(x) = \phi_{(4\pi)^{-1/10}}(x), \quad \text{which implies} \quad C(\phi) = (4\pi)^{-2/5}.
\]
Then
\[
\mathrm{AMISE}\{\hat{f}^c(\cdot;h)\} = (4\pi)^{-2/5}\Big\{(nh)^{-1} + \frac{1}{4}h^4R(f'')\Big\}
\]
for the estimator \hat{f}^c(x;h) = n^{-1}\sum_{i=1}^{n}\phi^c_h(x - X_i).
30
Canonical kernels have a very useful advantage: they enable a pictorial comparison of density estimates based on differently shaped kernels using the same bandwidth h. This is illustrated in Figure (2.5.1).

Figure (2.5.1)

In Figure (2.5.1)(a), the solid curve is the kernel density estimate using the standard normal kernel and the dashed curve uses the Epanechnikov kernel K*; the same bandwidth is used for both.
In Figure (2.5.1)(b), the canonical kernels of the normal and Epanechnikov kernels are used, again with the same bandwidth.
From this illustration we deduce that if we use the same bandwidth with different kernels, very different estimates result, whereas if we use the same bandwidth with different canonical kernels, the estimates are nearly identical.
31
Canonical kernels can also simplify the optimization over the kernel shape. By Eqn. (2.5.1), it is enough to choose K to minimize C(K_{δ_0}) subject to
\[
\int K(x)\,dx = 1, \qquad \int xK(x)\,dx = 0, \qquad \int x^2K(x)\,dx = a^2 < \infty,
\]
and K(x) ≥ 0 for all x. The solution is
\[
K_a(x) = \frac{3}{4}\,\{1 - x^2/(5a^2)\}/(5^{1/2}a)\;\mathbf{1}_{\{|x| < 5^{1/2}a\}},
\]
where a is an arbitrary scale parameter (Hodges and Lehmann, 1956).
If a² = 1/5, then we get the simplest form of K_a,
\[
K^*(x) = \frac{3}{4}(1 - x^2)\,\mathbf{1}_{\{|x| < 1\}}, \qquad \text{where } \mathbf{1}_{\{|x|<1\}} = \begin{cases}1 & |x| < 1, \\ 0 & \text{otherwise.}\end{cases} \tag{2.5.2}
\]
This kernel is called the Epanechnikov kernel (its graph appears in Section (1.2)).
We now introduce the useful ratio (C(K*)/C(K))^{5/4}.

Definition 2.5.3. The ratio (C(K*)/C(K))^{5/4} represents the ratio of sample sizes necessary to obtain the same minimum AMISE (for a given f) when using K* as when using K; it is called the efficiency of K relative to K*.

Example 2.5.2. If the efficiency of K is 0.97, this means that using the optimal kernel K* requires only 97% of the sample size needed with K in order to reach the same minimum AMISE.
The table below shows values of the previous ratio for various popular kernels K.
Kernel {C(K∗)/C(K)}5/4
Epanechnikov 1.000
Biweight 0.994
Triweight 0.987
Normal 0.951
Triangular 0.986
Uniform 0.930
Table (2.5.1)
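The entries of Table (2.5.1) follow directly from C(K) = {R(K)⁴µ_2²(K)}^{1/5}. The sketch below reproduces the value for the normal kernel; it assumes SciPy is available for the numerical integration and is only a check of the table, not part of the original text.

import numpy as np
from scipy.integrate import quad

def C(K, support=(-np.inf, np.inf)):
    """C(K) = {R(K)^4 * mu_2(K)^2}^(1/5)."""
    RK = quad(lambda x: K(x) ** 2, *support)[0]          # R(K)
    mu2 = quad(lambda x: x ** 2 * K(x), *support)[0]      # mu_2(K)
    return (RK ** 4 * mu2 ** 2) ** 0.2

epan = lambda x: 0.75 * (1 - x**2) if abs(x) < 1 else 0.0     # K*
norm = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print((C(epan, (-1, 1)) / C(norm)) ** 1.25)   # about 0.951, the table entry for the normal kernel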
32
2.6 Measuring How Difficult A Density Is To Estimate

We have seen that the KDE is an effective way of estimating many density shapes. In spite of this, one may encounter problems when trying to estimate certain density shapes using the KDE. These difficulties appear because the KDE depends on a single smoothing parameter, as the following example illustrates.
Example 2.6.1. A random sample of size n = 1000 is drawn from the lognormaldensity f(x) = φ(lnx)/x. By using the standard normal kernel, the solid curves infigure (2.6.1) illustrates the kernel estimates of the lognormal density which is showedby the dashed curve.
Figure(2.6.1)
33
In Figure (2.6.1)(a) a small bandwidth, h = 0.05, is used, resulting in an undersmoothed estimate.
In Figure (2.6.1)(b) a relatively large bandwidth, h = 0.45, is used, resulting in an oversmoothed estimate.
In Figure (2.6.1)(c) the performance of the kernel estimator becomes better, with h = 0.15.
A still better result would be desirable, but it does not seem attainable directly with the usual form of the KDE.
Here we try to answer the following question: how well can a particular density be estimated using the kernel density estimator?
We need to describe this degree of difficulty by a quantity that is always valid. To do this, recall the formula (2.3.6) for a symmetric probability density kernel K:
\[
\inf_{h>0}\mathrm{MISE}\{\hat{f}(\cdot;h)\} \sim \frac{5}{4}\,C(K)\,R(f'')^{1/5}\,n^{-4/5}. \tag{2.6.1}
\]
We defined R(f'') = ∫ f''(x)² dx, so the dependence on f is through the second derivative which, as noted in Section (2.3), reflects the curvature of f. Here R(f'') represents the total curvature of f, arising from features such as skewness or modes. Hence we expect a more difficult estimation problem when |f''(x)| takes large values, and vice versa. This gives a way to quantify the degree of difficulty of estimating awkward density shapes.
Here we restrict attention to densities with a continuous, square integrable second derivative over the whole real line.
Definition 2.6.1. A measure of the degree of difficulty of kernel estimation of f is D(f), given by
\[
D(f) = \big(\sigma(f)^5\,R(f'')\big)^{1/4},
\]
where σ(f) is the population standard deviation of f.

It is found (Terrell, 1990) that D(f) is minimal when
\[
f^*(x) = \frac{35}{32}(1 - x^2)^3\,\mathbf{1}_{(|x|<1)},
\]
the Beta(4, 4) density, for which the minimum value of σ(f)^5R(f'') is 35/243 (its graph appears in Section (1.2)). Thus the density f* is the easiest one to estimate.
As in the last section, a useful ratio can be employed here.

Definition 2.6.2. The efficiency of the kernel estimator for estimating a density f, relative to estimating the density f*, is defined to be D(f*)/D(f).
The table below shows the values of D(f ∗)/D(f) for several densities.
Name                              Density                                              D(f*)/D(f)
(a) Beta(4,4)                     (35/32)(1 − x²)³ 1(|x|<1)                            1
(b) Normal                        (2π)^{-1/2} exp{−x²/2}                               0.908
(c) Normal mixture density (1)    (3/4)N(0, 1) + (1/4)N(3/2, (1/3)²)                   0.568
(d) Normal mixture density (2)    (1/2)N(−1, 4/9) + (1/2)N(1, 4/9)                     0.536
(e) Gamma(3)                      Γ(3)^{-1} x² e^{−x} 1{x>0}                           0.327
(f) Normal mixture density (3)    (2/3)N(0, 1) + (1/3)N(0, 1/100)                      0.114
(g) Lognormal                     x^{-1}(2π)^{-1/2} exp{−(ln x)²/2}                    0.053

Table (2.6.1)
Some of the graphs of the densities in table (2.6.1) are introduced in the following
page.
35
Figure (2.6.2) (c)(d)(e)(f)(g)
Chapter 3
Modifications of the kernel density estimator

Among the ways to improve the performance of the KDE and to broaden its field of application are adaptations of its basic form.
In what follows we introduce three types of modification of the kernel density estimator. It should be said that many issues concerning these modifications are still unsettled, so we restrict our view to their definitions and some of their properties.
3.1 Local Kernel Density Estimator
Recall that the basic kernel density estimator has a single smoothing parameter over the whole real line, which makes it inadequate for estimating some density shapes. A natural adaptation is to let h vary with the point x at which f(x) is estimated.

Definition 3.1.1. One modified form of the basic KDE is
\[
\hat{f}_L(x; h(x)) = (n\,h(x))^{-1}\sum_{i=1}^{n}K\!\left(\frac{x - X_i}{h(x)}\right). \tag{3.1.1}
\]
It is called the local kernel density estimator. A minimal code sketch is given below.

The word local is used because \hat{f}_L uses a different basic kernel estimate at each point x at which the density is estimated.
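A minimal Python sketch of Eqn. (3.1.1). The bandwidth function h(x) used here (a fixed-bandwidth pilot estimate, inflated where the pilot density is small) is a purely illustrative choice, not a rule prescribed by the text.

import numpy as np

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def local_kde(x, data, h_of_x):
    """f_L(x; h(x)) = (n h(x))^{-1} sum_i K((x - X_i)/h(x)), Eqn. (3.1.1)."""
    h = h_of_x(x)
    return phi((x - data) / h).mean() / h

rng = np.random.default_rng(2)
data = rng.lognormal(size=500)

pilot = lambda x: phi((x - data) / 0.3).mean() / 0.3           # fixed-h pilot estimate
h_of_x = lambda x: 0.2 / np.sqrt(max(pilot(x), 1e-3))          # larger h where f seems small (illustrative)

for x in (0.5, 1.0, 3.0):
    print(x, local_kde(x, data, h_of_x))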
36
37
Figure (3.1.1) shows the values of \hat{f}_L at two different points v and u. At v, \hat{f}_L(v, h(v)) is formed by averaging the dotted kernels. At u, \hat{f}_L(u, h(u)) is formed by averaging the dashed kernels.

Figure (3.1.1)

Remark 3.1.1. Since h is a function of x, \hat{f}_L need not be a density function.
As we derived the optimal bandwidth for the basic estimator, we do the same for the local KDE.

Theorem 3.1.1. The bandwidth that minimizes the asymptotic MSE at x is
\[
h_{\mathrm{AMSE}}(x) = \left(\frac{R(K)f(x)}{\mu_2^2(K)f''(x)^2\,n}\right)^{1/5}, \qquad \text{provided } f''(x) \neq 0. \tag{3.1.2}
\]

Proof. Minimizing AMSE\{\hat{f}(x;h)\} = (nh)^{-1}R(K)f(x) + \tfrac{1}{4}h^4\mu_2^2(K)f''(x)^2 over h, exactly as in Corollary (2.3.5) with R(K) replaced by R(K)f(x) and R(f'') by f''(x)², gives the stated h_AMSE(x). Substituting this value into AMSE\{\hat{f}(x;h)\}, we get
\[
\begin{aligned}
\mathrm{AMSE}\{\hat{f}_L(x;h)\} &= \frac{1}{nh}\Big(R(K)f(x) + \frac{n}{4}h^5\mu_2^2(K)f''(x)^2\Big) \\
&= \frac{5}{4}\big(R(K)f(x)\big)^{4/5}\big(\mu_2^2(K)f''(x)^2\big)^{1/5}n^{-4/5} \\
&= \frac{5}{4}\big(R(K)^4\mu_2^2(K)\big)^{1/5}\big(f^2(x)f''(x)\big)^{2/5}n^{-4/5},
\end{aligned}
\]
and integrating over all x we get
\[
\mathrm{AMISE}\{\hat{f}_L(\cdot;h(\cdot))\} = \frac{5}{4}\big(R(K)^4\mu_2^2(K)\big)^{1/5}R\big((f^2f'')^{1/5}\big)\,n^{-4/5}. \tag{3.1.3}
\]
If we compare the rates of convergence of AMISE\{\hat{f}_L(\cdot;h(\cdot))\} and AMISE\{\hat{f}(\cdot;h)\} in (3.1.3) and (2.3.5), respectively, we find that they are the same, namely n^{-4/5}. In this sense there is no improvement in the rate from using \hat{f}_L(\cdot;h(\cdot)).
To be accurate, however, the following remark shows that, in spite of the unchanged rate of convergence, there is always some improvement in the constant if h(x) is chosen optimally.

Remark 3.1.2. R((f²f'')^{1/5}) ≤ R^{1/5}(f'') for all f.

Proof. f is a non-negative, real-valued, integrable function. Applying Hölder's inequality with p = 5 and q = 5/4 [2] yields
\[
\begin{aligned}
R\big((f^2f'')^{1/5}\big) &= \int \big(f^2(x)f''(x)\big)^{2/5}dx = \int \big(f^2(x)\big)^{2/5}\big(f''(x)\big)^{2/5}dx \\
&\le \left(\int\big(f^{4/5}(x)\big)^{5/4}dx\right)^{4/5}\left(\int\big(f''(x)^{2/5}\big)^{5}dx\right)^{1/5} \\
&= \left(\int f(x)\,dx\right)^{4/5}\left(\int f''(x)^{2}\,dx\right)^{1/5} = R^{1/5}(f'').
\end{aligned}
\]
39
3.2 Variable Kernel Density Estimator
In this kind of modified estimator, the bandwidth depends on the data points rather than on the point x at which f(x) is estimated. More precisely, h is replaced by a function α (say) of X_i.

Definition 3.2.1. The modified kernel estimator in which the smoothing parameter is defined by the n values α(X_i), i = 1, ..., n, is
\[
\hat{f}_V(x;\alpha) = \frac{1}{n}\sum_{i=1}^{n}\big(\alpha(X_i)\big)^{-1}K\!\left(\frac{x - X_i}{\alpha(X_i)}\right). \tag{3.2.1}
\]
It is called the variable KDE.

We see from this formula that each data point X_i has a kernel centered at it with its own scale parameter α(X_i), which is a function of X_i. This is where the name variable kernel comes from.
Figure (3.2.1) shows that \hat{f}_V(\cdot;\alpha) is formed by averaging these kernels.

Figure (3.2.1)
40
Remark 3.2.1. By its definition, \hat{f}_V(x;\alpha) is itself a probability density.

Here the smoothing depends on the locations of the data points relative to one another. In particular, this construction is concerned with spreading smoothing mass around the data in regions away from the main body of the data. A minimal code sketch appears below.
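A minimal Python sketch of Eqn. (3.2.1). The particular choice of α(X_i), inversely proportional to the square root of a fixed-bandwidth pilot estimate at X_i (an Abramson-type "square-root law"), is one common possibility used here only as an illustration; the text itself does not prescribe a specific α.

import numpy as np

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def variable_kde(x, data, alpha):
    """f_V(x; alpha) = n^{-1} sum_i alpha(X_i)^{-1} K((x - X_i)/alpha(X_i)), Eqn. (3.2.1)."""
    return np.mean(phi((x - data) / alpha) / alpha)

rng = np.random.default_rng(3)
data = rng.lognormal(size=500)

h0 = 0.4
pilot = np.array([phi((xi - data) / h0).mean() / h0 for xi in data])   # fixed-h pilot at each X_i
alpha = h0 / np.sqrt(pilot / np.exp(np.mean(np.log(pilot))))           # square-root-law scaling (illustrative)

for x in (0.5, 1.0, 3.0):
    print(x, variable_kde(x, data, alpha))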
3.3 Transformation Kernel Density Estimators
In some cases we encounter densities that are difficult to estimate. We can get around this problem by transforming the data, estimating on the transformed scale, and then back-transforming. This is carried out as follows.
Let the random sample X_1, X_2, ..., X_n have a density f which is difficult to estimate. Suppose the transformation is given by Y_i = t(X_i), where t is an increasing differentiable function defined on the support of f. The result is a new random sample Y_1, Y_2, ..., Y_n with a new density g (say), which is easier to estimate by the basic kernel density estimator \hat{g} than f is.
From statistical distribution theory [2] we can write
\[
f(x) = g(t(x))\,t'(x).
\]
Replacing g by \hat{g} gives
\[
\hat{f}_T(x;h,t) = \frac{1}{n}\sum_{i=1}^{n}K_h\big(t(x) - t(X_i)\big)\,t'(x). \tag{3.3.1}
\]
By the mean value theorem there exists ξ_i lying between x and X_i such that t(x) − t(X_i) = t'(ξ_i)(x − X_i), so Eqn. (3.3.1) becomes
\[
\hat{f}_T(x;h,t) = \frac{1}{n}\sum_{i=1}^{n}\big(t'(x)/h\big)K\!\left(\frac{t'(\xi_i)(x - X_i)}{h}\right).
\]
41
Example 3.3.1. Assume we want to estimate the lognormal density. From Table (2.6.1), it is the most difficult of the listed densities to estimate by the usual kernel methods. Applying the transformation Y_i = ln X_i, where ln x is an increasing function on its domain, gives a random sample Y_i from the N(0, 1) distribution. We can deduce this from a simple calculation:
\[
\begin{aligned}
f(x) &= g(t(x))\,t'(x), \\
(2\pi)^{-1/2}x^{-1}\exp\{-(\ln x)^2/2\} &= x^{-1}g(\ln x), \\
g(\ln x) &= (2\pi)^{-1/2}\exp\{-(\ln x)^2/2\}, \qquad \text{so } g(y) = (2\pi)^{-1/2}\exp\{-y^2/2\}.
\end{aligned}
\]
In Figure (3.3.1)(a) the solid curve is a kernel estimate of the new normal density (i.e., after transformation), with n = 1000, and the dashed curve is the true normal density.

Figure (3.3.1)

The solid curve in Figure (3.3.1)(b) is the estimate of the lognormal density using the transformation KDE. In other words, if we back-transform the kernel density estimate in Figure (3.3.1)(a) via t^{-1}(x) = e^x, we obtain the estimate in Figure (3.3.1)(b). A minimal code sketch of this example is given below.
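A minimal Python sketch of the transformation estimator of Eqn. (3.3.1) for this example, with t(x) = ln x and a standard normal kernel; the bandwidth is an arbitrary illustrative value.

import numpy as np

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def transformation_kde(x, data, h, t=np.log, t_prime=lambda x: 1.0 / x):
    """f_T(x; h, t) = n^{-1} sum_i K_h(t(x) - t(X_i)) * t'(x), Eqn. (3.3.1)."""
    return phi((t(x) - t(data)) / h).mean() / h * t_prime(x)

rng = np.random.default_rng(4)
data = rng.lognormal(size=1000)              # X_i with density phi(ln x)/x

true = lambda x: phi(np.log(x)) / x
for x in (0.3, 1.0, 3.0):
    print(x, transformation_kde(x, data, h=0.25), true(x))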
42
Choosing the transformation t in the last example seemed simple and direct, but this is not always so, because t depends on the shape of f, such as the number of modes, symmetry, and skewness. Choosing t therefore requires some experience and a good understanding of the properties of the functions used in the transformation.
Chapter 4
Bandwidth Selection
4.1 Introduction
Since we have seen the influence of h on the performance of the kernel density estimator, it is time to concentrate on the specification of the bandwidth h.
In some situations it is enough to choose h by looking at several density estimates over a range of bandwidths and selecting the one that seems most appropriate in some sense; we can begin with a large bandwidth and then decrease the amount of smoothing. This approach is useful especially when there are reasonable expectations about, or background knowledge of, the structure of the data.
In many cases, however, there is no prior information about the structure of the data, or even an intuition about the bandwidth h that would give the best result.

Definition 4.1.1. A method that uses the data X_1, X_2, ..., X_n to produce a bandwidth \hat{h} is called a bandwidth selector.
In general, bandwidth selectors can be divided into two types.
The first consists of selectors with simple formulae, which make it easy to find a bandwidth in many situations, but without any mathematical guarantee that the chosen h is close to optimal. These selectors are called quick and simple selectors. The second type of bandwidth selectors are based on more involved mathematical arguments and require more computational effort; they are called hi-tech selectors.
Throughout this chapter we discuss selectors of both types, with the aim of minimizing MISE\{\hat{f}(\cdot;h)\}.
There remain, however, many unresolved issues in the field of bandwidth selection; it is still a wide-open field for research.
4.2 Quick And Simple Bandwidth Selectors
Here, we introduce the two common ways of finding the quick and easy bandwidth
selector of the kernel density estimator.
Moreover, the rules which we are going to use will be useful in the hi-tech bandwidth
selection.
4.2.1 Normal Scale Rules
Perhaps the simplest way to choose h is to assume that f has a particular parametric form and to use the corresponding optimal bandwidth.
For example, let us begin with the AMISE-optimal bandwidth for a normal density. As shown in Section (2.3), the bandwidth that asymptotically minimizes MISE\{\hat{f}(\cdot;h)\} is
\[
h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5}.
\]

Lemma 4.2.1. If the true density f is normal with variance σ², then
\[
h_{\mathrm{AMISE}} = \left(\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)\,n}\right)^{1/5}\sigma. \tag{4.2.1}
\]
45
Proof. Using the definition of ψ_r in Section (4.5) gives
\[
R(f'') = \int f''(x)^2\,dx = \int f^{(4)}(x)f(x)\,dx = \psi_4,
\]
so in this case r = 4. Theorem (4.6.1) yields
\[
\psi_4 = \frac{(-1)^2\,4!}{(2\sigma)^5\,2!\,\pi^{1/2}} = \frac{3}{8\sigma^5\pi^{1/2}}.
\]
Now, substituting into the formula for h_AMISE gives the result.

To estimate the bandwidth from a sample, replace σ by an estimate \hat{σ}; the normal scale bandwidth selector is then
\[
\hat{h}_{\mathrm{NS}} = \left(\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)\,n}\right)^{1/5}\hat{\sigma}, \tag{4.2.2}
\]
where \hat{σ} is an estimate of the unknown standard deviation σ of f, commonly the sample standard deviation s.
Normal scale bandwidth selectors give reasonable results when the data are close to normal. If the data are far from normality, the normal scale bandwidth tends to be too large, and the selector fails to work well.
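For the standard normal kernel, R(K) = (2√π)^{-1} and µ_2(K) = 1, so Eqn. (4.2.2) reduces to ĥ_NS = (4/(3n))^{1/5} σ̂ ≈ 1.06 σ̂ n^{-1/5}. A minimal sketch, taking σ̂ to be the sample standard deviation s (the simulated data are illustrative only):

import numpy as np

def h_ns(data):
    """Normal scale bandwidth, Eqn. (4.2.2), for the standard normal kernel."""
    n = len(data)
    sigma_hat = np.std(data, ddof=1)          # sample standard deviation s
    return (4 / (3 * n)) ** 0.2 * sigma_hat

rng = np.random.default_rng(5)
print(h_ns(rng.normal(size=1000)))            # roughly 1.06 * n^(-1/5) ≈ 0.27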
4.2.2 Oversmoothed bandwidth selection rules
The idea behind the oversmoothing principle is that there is a simple upper bound for the AMISE-optimal bandwidth for density estimation. The inequality
\[
h_{\mathrm{AMISE}} \le \left(\frac{243\,R(K)}{35\,\mu_2^2(K)\,n}\right)^{1/5}\sigma \tag{4.2.3}
\]
is valid for all densities having standard deviation σ (Terrell, 1990). Replacing σ by s leads to the oversmoothed bandwidth selector based on this bound,
\[
\hat{h}_{\mathrm{OS}} = \left(\frac{243\,R(K)}{35\,\mu_2^2(K)\,n}\right)^{1/5}s,
\]
where s is the sample standard deviation. The oversmoothed bandwidth gives a starting point for a subjective choice of the bandwidth: one can plot an estimate with bandwidth \hat{h}_{OS} and then produce further estimates using suitable fractions of \hat{h}_{OS}.
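A minimal sketch of the oversmoothed selector, again assuming the standard normal kernel (for which it is about 1.144 s n^{-1/5}); the simulated data and the fractions printed are illustrative, echoing Example 4.2.1 below.

import numpy as np

def h_os(data):
    """Oversmoothed bandwidth h_OS for the standard normal kernel (upper-bound rule)."""
    n = len(data)
    s = np.std(data, ddof=1)
    RK, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0
    return (243 * RK / (35 * mu2**2 * n)) ** 0.2 * s     # about 1.144 * s * n^(-1/5)

rng = np.random.default_rng(6)
data = rng.normal(size=1000)
h = h_os(data)
print(h, h / 2, h / 4, h / 8)      # h_OS and the fractions used in Example 4.2.1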
Example 4.2.1. Consider the Old Faithful data set, consisting of 107 eruption times in minutes for the Old Faithful Geyser in Yellowstone National Park (Silverman, 1986).
Figure (4.2.1)(a) shows the density estimate with \hat{h}_{OS} = 0.467, using the normal kernel.

Figure (4.2.1)

Figure (4.2.1)(b) shows the density estimate with bandwidth \hat{h}_{OS}/2. We note that the latter estimate keeps the two modes clearer than the first one does.
Figure (4.2.1)(c) uses bandwidth \hat{h}_{OS}/4; here roughness starts appearing in the estimate, and it becomes more pronounced in Figure (4.2.1)(d), where the bandwidth used is \hat{h}_{OS}/8.
We can choose \hat{h}_{OS}/2, as in Figure (4.2.1)(b), as the most appropriate bandwidth.
4.3 Least Squares Cross-Validation
With this section we start dealing with the second class of bandwidth selectors, labelled hi-tech bandwidth selectors.
This is a family of selectors based on cross-validation; in this section we study one member of the family, called Least Squares Cross-Validation (LSCV).
Its motivation comes from expanding the MISE of \hat{f}(\cdot;h):
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} = E\int \hat{f}(x;h)^2dx - 2E\int \hat{f}(x;h)f(x)\,dx + \int f(x)^2dx.
\]
Here ∫ f(x)² dx does not depend on h, so we can equivalently minimize
\[
\mathrm{MISE}\{\hat{f}(\cdot;h)\} - \int f(x)^2dx = E\left[\int \hat{f}(x;h)^2dx - 2\int \hat{f}(x;h)f(x)\,dx\right].
\]
An unbiased estimator of this quantity is [1]
\[
\mathrm{LSCV}(h) = \int \hat{f}(x;h)^2dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{-i}(X_i;h),
\]
where
\[
\hat{f}_{-i}(x;h) = (n-1)^{-1}\sum_{j\neq i}K_h(x - X_j)
\]
is the density estimate based on all of the data except X_i, and \hat{f}_{-i}(X_i;h) is its value at the data point X_i. It is called the "leave-one-out" density estimator.
48
Choosing the h that minimizes the objective function LSCV(h) yields a bandwidth selector denoted by \hat{h}_{LSCV}.
We can see from the formula for LSCV(h) that it requires a relatively large computational effort. Moreover, "studies have shown that the theoretical and practical performance of this bandwidth selector are somewhat disappointing" (Wand and Jones); this is because \hat{h}_{LSCV} has a high variance.
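For the standard normal kernel the first term of LSCV(h) has a closed form, ∫\hat{f}(x;h)² dx = n^{-2}ΣΣ φ_{√2 h}(X_i − X_j), which follows from Theorem 2.4.2, so the criterion can be evaluated exactly. Below is a brute-force sketch (O(n²) per bandwidth); the simulated data and the grid of candidate bandwidths are arbitrary choices.

import numpy as np

def phi(u, s=1.0):
    return np.exp(-(u / s) ** 2 / 2) / (s * np.sqrt(2 * np.pi))

def lscv(h, data):
    """LSCV(h) = int f_hat^2 - (2/n) sum_i f_hat_{-i}(X_i), normal kernel."""
    n = len(data)
    D = data[:, None] - data[None, :]                    # all pairwise X_i - X_j
    term1 = phi(D, np.sqrt(2) * h).sum() / n**2          # exact integral of f_hat^2
    Kh = phi(D, h)
    loo = (Kh.sum(axis=1) - phi(0.0, h)) / (n - 1)       # leave-one-out estimates at each X_i
    return term1 - 2 * loo.sum() / n

rng = np.random.default_rng(7)
data = rng.normal(size=200)
grid = np.linspace(0.05, 1.0, 60)
print(grid[np.argmin([lscv(h, data) for h in grid])])    # h_LSCV on this grid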
4.4 Biased Cross-Validation
In LSCV we used the exact MISE formula. Biased cross-validation (BCV), by contrast, is based on the asymptotic MISE,
\[
\mathrm{AMISE}\{\hat{f}(\cdot;h)\} = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f''), \tag{4.4.1}
\]
in which the unknown quantity R(f'') is replaced by an appropriate estimator \widetilde{R}(f'') given by [1]
\[
\widetilde{R}(f'') = R\big(\hat{f}''(\cdot;h)\big) - (nh^5)^{-1}R(K'') = n^{-2}\sum\!\!\sum_{i\neq j}(K''_h * K''_h)(X_i - X_j).
\]
Replacing R(f'') by the estimator \widetilde{R}(f'') gives
\[
\mathrm{BCV}(h) = (nh)^{-1}R(K) + \frac{1}{4}h^4\mu_2^2(K)\widetilde{R}(f''),
\]
which is the BCV objective function.
It therefore makes sense to choose h to minimize BCV(h), as we did in the previous section with LSCV(h). We denote the bandwidth chosen by this strategy by \hat{h}_{BCV}.
In fact, \hat{h}_{BCV} has an advantage over \hat{h}_{LSCV}: it is more stable, because of its lower asymptotic variance. At the same time there is an increase in bias, with \hat{h}_{BCV} tending to be larger than the MISE-optimal bandwidth.
To close this section, we note that among the selectors based on cross-validation ideas this one is perhaps the most appealing, because it is based on the asymptotic MISE, which is easier to work with, and it already uses the ideas of the DPI methods (Section (4.6)), since it involves estimating the unknown R(f'').
4.5 Estimation Of Density Functionals
In the previous expressions for the optimal bandwidths, it is clear that the bandwidth relies on the integrated squared density derivative R(f''). The problem is that R(f'') is unknown, which prevents direct use of those expressions. The hi-tech univariate bandwidth selectors all have the estimation of integrated squared density derivatives as a component.

Definition 4.5.1. Define µ_j(K) = ∫ x^j K(x) dx to be the jth moment of the kernel K. Then we say that K is a kth-order kernel if
\[
\mu_0(K) = 1; \quad \mu_j(K) = 0 \ \text{for } j = 1, \dots, k-1; \quad \text{and } \mu_k(K) \neq 0.
\]
The general integrated squared density derivative functional is
\[
R(f^{(s)}) = \int f^{(s)}(x)^2dx.
\]
Under sufficient smoothness assumptions on f, integrating this formula by parts yields
\[
R(f^{(s)}) = (-1)^s\int f^{(2s)}(x)f(x)\,dx.
\]
Define
\[
\psi_r = \begin{cases}\displaystyle\int f^{(r)}(x)f(x)\,dx & r \text{ even},\\[4pt] 0 & r \text{ odd}.\end{cases}
\]
Therefore, it is enough to investigate the estimation of ψ_r.
Since ψ_r = E\big(f^{(r)}(X)\big), a natural estimator of ψ_r is
\[
\hat{\psi}_r(g) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}^{(r)}(X_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}L_g^{(r)}(X_i - X_j),
\]
where \hat{f} is a KDE based on a smoothing parameter g and a kernel function L.
We will need the following assumptions when studying \hat{\psi}_r in connection with its asymptotic MSE:
(i) the kernel L is a symmetric kernel of order k, k = 2, 4, ..., possessing r derivatives, such that
\[
(-1)^{r + k/2 + 1}L^{(r)}(0)\,\mu_k(L) > 0;
\]
(ii) the density f has p continuous derivatives that are ultimately monotone, where p > k;
(iii) g = g_n is a positive-valued sequence of bandwidths satisfying
\[
\lim_{n\to\infty} g = 0 \quad \text{and} \quad \lim_{n\to\infty} ng^{2r+1} = \infty.
\]
Our goal is to find a large sample approximation to
\[
\mathrm{MSE}\big(\hat{\psi}_r(g)\big) = \mathrm{Var}\,\hat{\psi}_r(g) + \big(E\hat{\psi}_r(g) - \psi_r\big)^2.
\]
To do this, we begin by rewriting \hat{\psi}_r(g) as
\[
\hat{\psi}_r(g) = n^{-1}L_g^{(r)}(0) + n^{-2}\sum\!\!\sum_{i\neq j}L_g^{(r)}(X_i - X_j).
\]
Then the expectation of \hat{\psi}_r(g) is
\[
E\hat{\psi}_r(g) = n^{-1}L_g^{(r)}(0) + (1 - 1/n)\,E\big(L_g^{(r)}(X_1 - X_2)\big).
\]
We now introduce the following theorems.
Theorem 4.5.1. If the kernel L be as in Assumption (i); then, the bias is
Eψr(g)− ψr = n−1g−r−1L(r)(0) + (k!)−1gkµk(L)ψr+k +O(gk+2)
Proof. At first we find $E\{L_g^{(r)}(X_1 - X_2)\}$; then the result is straightforward. Integration by parts, under the smoothness assumptions on $f$, and then Taylor's theorem yield
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)\} &= \int\!\!\int L_g^{(r)}(x-y)f(x)f(y)\,dx\,dy\\
&= \int\!\!\int L_g(x-y)f(x)f^{(r)}(y)\,dx\,dy &&\text{(integration by parts)}\\
&= \int\!\!\int L(u)f(y+gu)f^{(r)}(y)\,du\,dy &&\text{(substituting } u = (x-y)/g)\\
&= \int\!\!\int L(u)f^{(r)}(y)\Bigl(\sum_{l=0}^{k}(l!)^{-1}(ug)^l f^{(l)}(y) + O(g^{k+1})\Bigr)du\,dy &&\text{(Taylor expansion of } f)\\
&= \int\!\!\int \bigl[L(u)f^{(r)}(y)f(y) + \cdots + L(u)f^{(r)}(y)(k!)^{-1}(ug)^k f^{(k)}(y)\bigr]du\,dy\\
&\qquad + \int f^{(r)}(y)O(g^{k+1})\int L(u)\,du\,dy\\
&= \psi_r + (k!)^{-1}\mu_k(L)g^k\int f^{(r)}(y)f^{(k)}(y)\,dy + O(g^{k+2}) &&\text{(since } L \text{ has order } k \text{; by symmetry of } L \text{ the remainder is } O(g^{k+2}))\\
&= \psi_r + (k!)^{-1}\mu_k(L)g^k\int f^{(r+k)}(y)f(y)\,dy + O(g^{k+2}) &&\text{(integration by parts)}\\
&= \psi_r + (k!)^{-1}\mu_k(L)g^k\psi_{r+k} + O(g^{k+2}) &&\text{(from the definition of } \psi_r).
\end{aligned}$$
So,
$$E\hat{\psi}_r(g) - \psi_r = n^{-1}g^{-r-1}L^{(r)}(0) + (k!)^{-1}\mu_k(L)g^k\psi_{r+k} + O(g^{k+2}).$$
Lemma 4.5.2. Let $X_1, X_2, \ldots, X_n$ be a set of independent and identically distributed random variables and define
$$U = 2n^{-2}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} S(X_i - X_j),$$
where the function $S$ is symmetric about zero. Then
$$\mathrm{Var}(U) = 2n^{-3}(n-1)\mathrm{Var}\{S(X_1 - X_2)\} + 4n^{-3}(n-1)(n-2)\mathrm{Cov}\{S(X_1 - X_2), S(X_2 - X_3)\}.$$
Proof. The elementary properties of the variance of a sum of random variables, together with the fact that the $X_i$ are independent and identically distributed, yield
$$\begin{aligned}
\mathrm{Var}(U) &= 4n^{-4}\,\mathrm{Var}\Bigl\{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} S(X_i - X_j)\Bigr\}\\
&= 4n^{-4}\Bigl[\frac{n(n-1)}{2}\mathrm{Var}\{S(X_1 - X_2)\} + n(n-1)(n-2)\mathrm{Cov}\{S(X_1 - X_2), S(X_2 - X_3)\}\Bigr]\\
&= 2n^{-3}(n-1)\mathrm{Var}\{S(X_1 - X_2)\} + 4n^{-3}(n-1)(n-2)\mathrm{Cov}\{S(X_1 - X_2), S(X_2 - X_3)\}.
\end{aligned}$$
It follows from the previous lemma and the symmetry of $L^{(r)}$ for $r$ even that
$$\mathrm{Var}\{\hat{\psi}_r(g)\} = 2n^{-3}(n-1)\mathrm{Var}\{L_g^{(r)}(X_1 - X_2)\} + 4n^{-3}(n-1)(n-2)\mathrm{Cov}\{L_g^{(r)}(X_1 - X_2), L_g^{(r)}(X_2 - X_3)\}.$$
This leads to the following theorem
Theorem 4.5.3. Let the kernel $L$ satisfy Assumption (i), with $L^{(r)}$ symmetric for $r$ even. Then
$$\mathrm{Var}\{\hat{\psi}_r(g)\} = 2n^{-2}g^{-2r-1}\psi_0 R(L^{(r)}) + 4n^{-1}\Bigl\{\int f^{(r)}(x)^2 f(x)\,dx - \psi_r^2\Bigr\} + o(n^{-2}g^{-2r-1} + n^{-1}).$$
Proof. We prove this theorem in two steps.
Step 1:
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)^2\} &= \int\!\!\int \bigl(L_g^{(r)}(x-y)\bigr)^2 f(x)f(y)\,dx\,dy\\
&= g^{-2r-1}\int\!\!\int L^{(r)}(u)^2 f(y+gu)f(y)\,du\,dy\\
&= g^{-2r-1}\int\!\!\int L^{(r)}(u)^2 [f(y) + o(1)]f(y)\,du\,dy\\
&= g^{-2r-1}\psi_0 R(L^{(r)}) + o(g^{-2r-1})
\end{aligned}$$
and
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)\} &= \int\!\!\int L_g^{(r)}(x-y)f(x)f(y)\,dx\,dy\\
&= \int\!\!\int L_g(x-y)f(x)f^{(r)}(y)\,dx\,dy\\
&= \int\!\!\int L(u)f(y+gu)f^{(r)}(y)\,du\,dy\\
&= \int\!\!\int L(u)f^{(r)}(y)[f(y) + o(1)]\,du\,dy\\
&= \psi_r + o(1).
\end{aligned}$$
Thus
$$\begin{aligned}
\mathrm{Var}\{L_g^{(r)}(X_1 - X_2)\} &= g^{-2r-1}\psi_0 R(L^{(r)}) + o(g^{-2r-1}) - \bigl(\psi_r + o(1)\bigr)^2\\
&= g^{-2r-1}\psi_0 R(L^{(r)}) - \psi_r^2 + o(g^{-2r-1}) + o(1).
\end{aligned}$$
Step 2:
$$\begin{aligned}
E\{L_g^{(r)}(X_1 - X_2)L_g^{(r)}(X_2 - X_3)\} &= \int\!\!\int\!\!\int L_g^{(r)}(x-y)L_g^{(r)}(y-z)f(x)f(y)f(z)\,dx\,dy\,dz\\
&= \int\!\!\int\!\!\int L_g(x-y)L_g(y-z)f^{(r)}(x)f^{(r)}(z)f(y)\,dx\,dy\,dz\\
&= \int\!\!\int\!\!\int L(u)L(v)f^{(r)}(y+ug)f(y)f^{(r)}(y-gv)\,du\,dv\,dy\\
&= \int\!\!\int\!\!\int L(u)L(v)[f^{(r)}(y) + o(1)]\,f(y)\,[f^{(r)}(y) + o(1)]\,du\,dv\,dy\\
&= \int f^{(r)}(y)^2 f(y)\,dy + o(1)
\end{aligned}$$
and
$$E\{L_g^{(r)}(X_1 - X_2)\}\,E\{L_g^{(r)}(X_2 - X_3)\} = \psi_r^2 + o(1).$$
Substituting these approximations into Lemma 4.5.2 leads to
$$\mathrm{Var}\{\hat{\psi}_r(g)\} = 2n^{-2}g^{-2r-1}\psi_0 R(L^{(r)}) + 4n^{-1}\Bigl\{\int f^{(r)}(x)^2 f(x)\,dx - \psi_r^2\Bigr\} + o(n^{-2}g^{-2r-1} + n^{-1}).$$
Therefore the asymptotic MSE is
$$\mathrm{MSE}\{\hat{\psi}_r(g)\} \approx \bigl[n^{-1}g^{-r-1}L^{(r)}(0) + (k!)^{-1}g^k\mu_k(L)\psi_{r+k}\bigr]^2 + 2n^{-2}g^{-2r-1}R(L^{(r)})\psi_0 + 4n^{-1}\Bigl\{\int f^{(r)}(x)^2 f(x)\,dx - \psi_r^2\Bigr\}.$$
Remark 4.5.1. 1. We can choose $g$ so that the main bias term vanishes, namely
$$g_{\mathrm{AMSE}} = \left(\frac{k!\,L^{(r)}(0)}{-\mu_k(L)\psi_{r+k}\,n}\right)^{1/(r+k+1)} \qquad (4.5.1)$$
2. We have to investigate the influence of this choice of $g$ on the two main components of the MSE. The order of the squared bias term reduces to $n^{-(2k+4)/(r+k+1)}$. Since $g_{\mathrm{AMSE}} = O(n^{-1/(r+k+1)})$, the orders of the two leading variance terms are $n^{-(2k+1)/(r+k+1)}$ and $n^{-1}$, respectively. Notice that the first of these variance terms dominates the squared bias term, so the rate of convergence of the minimum MSE depends only on the leading variance terms.
3. For $k < r$, we find that
$$\inf_{g>0}\mathrm{MSE}\{\hat{\psi}_r(g)\} \sim 2R(L^{(r)})\psi_0\left(\frac{\mu_k(L)\psi_{r+k}}{-L^{(r)}(0)\,k!}\right)^{(2r+1)/(r+k+1)} n^{-(2k+1)/(r+k+1)};$$
for $k > r$,
$$\inf_{g>0}\mathrm{MSE}\{\hat{\psi}_r(g)\} \sim 4\,\mathrm{Var}\{f^{(r)}(X)\}\,n^{-1};$$
and for $k = r$ the two leading terms above are of the same order, and the leading term of the minimum mean squared error is the sum of those terms.
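A small computational sketch of Eqn. (4.5.1) is given below for the special case of a Gaussian kernel $L$ of order $k = 2$ (an assumption, since the remark is stated for a general kernel); the value supplied for $\psi_{r+k}$ must come from elsewhere, e.g. a normal scale rule.

from math import factorial, sqrt, pi

def gauss_deriv_at_zero(r):
    """phi^{(r)}(0) for the standard normal density; zero for odd r."""
    if r % 2:
        return 0.0
    return (-1)**(r // 2) * factorial(r) / (2**(r // 2) * factorial(r // 2) * sqrt(2 * pi))

def g_amse(r, psi_rk, n, k=2):
    """Eqn. (4.5.1): g = [k! L^{(r)}(0) / (-mu_k(L) psi_{r+k} n)]^{1/(r+k+1)}.

    Sketch for L = standard Gaussian, so mu_2(L) = 1 (only k = 2 handled here).
    """
    if k != 2:
        raise ValueError("only the second-order Gaussian kernel is handled in this sketch")
    num = factorial(k) * gauss_deriv_at_zero(r)
    return (num / (-1.0 * psi_rk * n)) ** (1.0 / (r + k + 1))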
4.6 Plug-In Bandwidth Selection
4.6.1 Direct Plug-In Rules
Recall that we obtained the AMISE-optimal bandwidth to be
$$h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)R(f'')\,n}\right)^{1/5}.$$
In terms of the $\psi_r$ functionals, it can be written as
$$h_{\mathrm{AMISE}} = \left(\frac{R(K)}{\mu_2^2(K)\psi_4\,n}\right)^{1/5}.$$
Now, the idea here is to estimate the unknown quantities that appear in the formula for the asymptotically optimal bandwidth. Replacing $\psi_4$ in the last formula by $\hat{\psi}_4(g)$ gives a formula for what is called the direct plug-in (DPI) rule,
$$\hat{h}_{\mathrm{DPI}} = \left(\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(g)\,n}\right)^{1/5}.$$
In this rule, hDPI depends on the choice of the pilot bandwidth g.
From Eqn. (4.5.1), the AMSE-optimal bandwidth $g$ of the estimator $\hat{\psi}_4(g)$ is
$$g_{\mathrm{AMSE}} = \left(\frac{2K^{(4)}(0)}{-\mu_2(K)\psi_6\,n}\right)^{1/7},$$
where $K = L$ is the second-order kernel used in $\hat{\psi}_4(g)$.
The same defect appears here, since $g_{\mathrm{AMSE}}$ also depends on an unknown density functional, $\psi_6$.
Repeating the previous process, i.e. estimating $\psi_6$ by a kernel estimator, does not help directly, because its optimal bandwidth depends on $\psi_8$. This process has no end: from Eqn. (4.5.1), the optimal bandwidth for estimating $\psi_r$ depends on $\psi_{r+2}$.
The way out of this problem is to estimate some $\psi_r$ functional with a quick and simple estimate, say the normal scale rule.
Up until the stage at which a $\psi_r$ is estimated by a quick and simple estimate, we obtain a family of direct plug-in bandwidth selectors that depend on the number of stages of functional estimation. If the number of stages is $\ell$, say, we call such a rule an $\ell$-stage direct plug-in bandwidth selector and denote it by $\hat{h}_{\mathrm{DPI},\ell}$.
The following theorem is very useful for computing quantities required for bandwidth selectors, but before we introduce it, we state some facts [1].
Fact [1]:
$$\int \phi_\sigma^{(r)}(x-\mu)\,\phi_{\sigma'}^{(r')}(x-\mu')\,dx = (-1)^r \phi^{(r+r')}_{(\sigma^2+\sigma'^2)^{1/2}}(\mu-\mu').$$
Fact [2]:
$$\phi_\sigma^{(r)}(0) = \begin{cases}(-1)^{r/2}(2\pi)^{-1/2}\,\mathrm{OF}(r)\,\sigma^{-r-1} & r \text{ even},\\ 0 & r \text{ odd},\end{cases}$$
where
$$\mathrm{OF}(r) = (r-1)(r-3)\cdots 1 = \frac{r!}{2^{r/2}(r/2)!}$$
is the odd factorial.
Theorem 4.6.1. If $f$ is a normal density with variance $\sigma^2$ then, for $r$ even,
$$\psi_r = \frac{(-1)^{r/2}\,r!}{(2\sigma)^{r+1}(r/2)!\,\pi^{1/2}}. \qquad (4.6.1)$$
Proof. For $r$ even we have from the definition of $\psi_r$ that
$$\psi_r = \int f^{(r)}(x)f(x)\,dx.$$
Since $f$ is normal, we can write
$$\psi_r = \int \phi_\sigma^{(r)}(x)\phi_\sigma(x)\,dx.$$
From Fact (1), $\psi_r = \phi^{(r)}_{\sqrt{2}\sigma}(0)$ (using that $r$ is even), and from Fact (2) we have
$$\phi^{(r)}_{\sqrt{2}\sigma}(0) = (-1)^{r/2}(2\pi)^{-1/2}\,\mathrm{OF}(r)\,(\sqrt{2}\sigma)^{-r-1} = \frac{(-1)^{r/2}\,r!}{(2\sigma)^{r+1}(r/2)!\,\pi^{1/2}}.$$
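A small sketch (not in the thesis) of the normal scale rule in Eqn. (4.6.1) is given below, with $\sigma$ replaced by an estimate of scale; which scale estimate to use is an assumption left to the reader.

from math import factorial, sqrt, pi

def psi_ns(r, sigma_hat):
    """Normal scale estimate of psi_r from Eqn. (4.6.1), valid for even r."""
    if r % 2:
        raise ValueError("psi_r is used only for even r")
    return ((-1)**(r // 2) * factorial(r)
            / ((2 * sigma_hat)**(r + 1) * factorial(r // 2) * sqrt(pi)))

# e.g. psi_ns(6, s) = -15/(16*sqrt(pi)*s**7) and psi_ns(8, s) = 105/(32*sqrt(pi)*s**9),
# matching the values quoted in Examples 4.6.1 and 4.6.2.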
It is useful to illustrate this procedure by the following example
Example 4.6.1. This example illustrates how to apply the $\ell$-stage plug-in bandwidth selector. Take $\ell = 2$ and use $L = K$, where $K$ is a second-order kernel.
Step 1: Estimate $\psi_8$ using the normal scale estimate $\hat{\psi}_8^{\mathrm{NS}} = 105/(32\pi^{1/2}\hat{\sigma}^9)$, where $\hat{\sigma}$ is an estimate of scale. The formula for $\hat{\psi}_8^{\mathrm{NS}}$ is obtained from Eqn. (4.6.1).
Step 2: Estimate $\psi_6$ using the kernel estimator $\hat{\psi}_6(g_1)$, where
$$g_1 = \left[\frac{-2K^{(6)}(0)}{\mu_2(K)\hat{\psi}_8^{\mathrm{NS}}\,n}\right]^{1/9}.$$
Step 3: Estimate $\psi_4$ using the kernel estimator $\hat{\psi}_4(g_2)$, where
$$g_2 = \left[\frac{-2K^{(4)}(0)}{\mu_2(K)\hat{\psi}_6(g_1)\,n}\right]^{1/7}.$$
Step 4: The selected bandwidth is
$$\hat{h}_{\mathrm{DPI},2} = \left(\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(g_2)\,n}\right)^{1/5}.$$
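The steps of Example 4.6.1 can be put together as a short sketch for a Gaussian kernel $K$ (so $R(K)=1/(2\sqrt{\pi})$, $\mu_2(K)=1$, $K^{(4)}(0)=3/\sqrt{2\pi}$, $K^{(6)}(0)=-15/\sqrt{2\pi}$); the use of the sample standard deviation as the scale estimate $\hat{\sigma}$ is an assumption, and psi_hat and psi_ns refer to the sketches given earlier.

import numpy as np
from math import sqrt, pi

def h_dpi2(x):
    """Two-stage direct plug-in bandwidth of Example 4.6.1, Gaussian kernel (sketch)."""
    n = len(x)
    sigma_hat = np.std(x, ddof=1)                      # assumed scale estimate
    K4_0 = 3.0 / sqrt(2 * pi)                          # phi^{(4)}(0)
    K6_0 = -15.0 / sqrt(2 * pi)                        # phi^{(6)}(0)
    # Step 1: normal scale estimate of psi_8
    psi8 = psi_ns(8, sigma_hat)
    # Step 2: estimate psi_6 with pilot bandwidth g1
    g1 = (-2.0 * K6_0 / (psi8 * n)) ** (1.0 / 9.0)
    psi6 = psi_hat(6, g1, x)
    # Step 3: estimate psi_4 with pilot bandwidth g2
    g2 = (-2.0 * K4_0 / (psi6 * n)) ** (1.0 / 7.0)
    psi4 = psi_hat(4, g2, x)
    # Step 4: plug psi_4 into the AMISE-optimal bandwidth formula
    RK = 1.0 / (2.0 * sqrt(pi))
    return (RK / (psi4 * n)) ** 0.2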
At the end of this section, a reasonable question arises:
How should one choose the value of $\ell$, the number of stages of functional estimation?
Actually, there is no objective method for determining $\ell$. However, on theoretical grounds $\ell$ should preferably be at least 2, with $\ell = 2$ being a common choice (Wand and Jones).
4.6.2 Solve-The-Equation Rules
These methods are similar to the DPI approach in relying on the formula for the AMISE-optimal bandwidth, and they involve another "stage selection" problem, as we will show.
When $h$ is selected according to these rules, it must satisfy the relationship
$$h = \left(\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(\gamma(h))\,n}\right)^{1/5},$$
where the pilot bandwidth of the estimator $\hat{\psi}_4$ is a function $\gamma$ of $h$.
Consider the following lemma
Lemma 4.6.2. The following relationship is valid:
$$g_{\mathrm{AMSE}} = \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\psi_4}{\psi_6}\right)^{1/7} h_{\mathrm{AMISE}}^{5/7},$$
where $K$ and $h$ are, respectively, the kernel and bandwidth used in estimating the density $f$, and $g$ and $L$ are, respectively, the bandwidth and symmetric kernel with $r$ derivatives used in $\hat{\psi}_r$.
Proof. Beginning with the relationship
$$g_{\mathrm{AMSE}} = \left(\frac{2L^{(4)}(0)}{-\mu_2(L)\psi_6\,n}\right)^{1/7}$$
and dividing by $h_{\mathrm{AMISE}}^{5/7}$ gives
$$\begin{aligned}
g_{\mathrm{AMSE}}/h_{\mathrm{AMISE}}^{5/7} &= \left(\frac{2L^{(4)}(0)}{-\mu_2(L)\psi_6\,n}\right)^{1/7}\left(\frac{\mu_2^2(K)\psi_4\,n}{R(K)}\right)^{1/7}\\
&= \left(\frac{2L^{(4)}(0)\mu_2^2(K)\psi_4\,n}{-\mu_2(L)\psi_6 R(K)\,n}\right)^{1/7}\\
&= \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\psi_4}{\psi_6}\right)^{1/7},
\end{aligned}$$
so that
$$g_{\mathrm{AMSE}} = \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\psi_4}{\psi_6}\right)^{1/7} h_{\mathrm{AMISE}}^{5/7}.$$
So we take
$$\gamma(h) = \left(\frac{2L^{(4)}(0)\mu_2^2(K)}{R(K)\mu_2(L)}\right)^{1/7}\left(\frac{-\hat{\psi}_4(g_1)}{\hat{\psi}_6(g_2)}\right)^{1/7} h^{5/7},$$
where $\hat{\psi}_4(g_1)$ and $\hat{\psi}_6(g_2)$ are kernel estimates of $\psi_4$ and $\psi_6$, and $g_1$ and $g_2$ are obtained from Eqn. (4.5.1).
Example 4.6.2. A 2-stage solve-the-equation bandwidth selector that uses $L = K$ (denoted by $\hat{h}_{\mathrm{STE},2}$) is given as follows.
Step 1: Estimate $\psi_6$ and $\psi_8$ using $\hat{\psi}_6^{\mathrm{NS}} = -15/(16\pi^{1/2}\hat{\sigma}^7)$ and $\hat{\psi}_8^{\mathrm{NS}} = 105/(32\pi^{1/2}\hat{\sigma}^9)$.
Step 2: Estimate $\psi_4$ and $\psi_6$ using the kernel estimators $\hat{\psi}_4(g_1)$ and $\hat{\psi}_6(g_2)$, where
$$g_1 = \bigl\{-2K^{(4)}(0)/(\mu_2(K)\hat{\psi}_6^{\mathrm{NS}}\,n)\bigr\}^{1/7} \quad\text{and}\quad g_2 = \bigl\{-2K^{(6)}(0)/(\mu_2(K)\hat{\psi}_8^{\mathrm{NS}}\,n)\bigr\}^{1/9}.$$
Step 3: Estimate $\psi_4$ using the kernel estimator $\hat{\psi}_4(\gamma(h))$, where
$$\gamma(h) = \left[\frac{2K^{(4)}(0)\mu_2(K)\hat{\psi}_4(g_1)}{-\hat{\psi}_6(g_2)R(K)}\right]^{1/7} h^{5/7}.$$
Step 4: The selected bandwidth is the solution to the equation
$$h = \left[\frac{R(K)}{\mu_2^2(K)\hat{\psi}_4(\gamma(h))\,n}\right]^{1/5}.$$
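Under the same Gaussian-kernel assumptions as before ($\mu_2(K)=1$), the solve-the-equation rule of Example 4.6.2 can be sketched by rewriting Step 4 as a root-finding problem; the bracketing interval passed to brentq is a hypothetical choice, and psi_hat and psi_ns are the earlier sketches.

import numpy as np
from math import sqrt, pi
from scipy.optimize import brentq

def h_ste2(x):
    """Two-stage solve-the-equation bandwidth of Example 4.6.2, Gaussian kernel (sketch)."""
    n = len(x)
    sigma_hat = np.std(x, ddof=1)
    K4_0, K6_0 = 3.0 / sqrt(2*pi), -15.0 / sqrt(2*pi)
    RK = 1.0 / (2.0 * sqrt(pi))
    # Step 1: normal scale estimates of psi_6 and psi_8
    psi6_ns, psi8_ns = psi_ns(6, sigma_hat), psi_ns(8, sigma_hat)
    # Step 2: kernel estimates of psi_4 and psi_6
    g1 = (-2.0 * K4_0 / (psi6_ns * n)) ** (1.0 / 7.0)
    g2 = (-2.0 * K6_0 / (psi8_ns * n)) ** (1.0 / 9.0)
    psi4_g1, psi6_g2 = psi_hat(4, g1, x), psi_hat(6, g2, x)
    # Step 3: pilot bandwidth as a function of h (mu_2(K) = 1 for the Gaussian kernel)
    def gamma(h):
        return (2.0 * K4_0 * psi4_g1 / (-psi6_g2 * RK)) ** (1.0 / 7.0) * h ** (5.0 / 7.0)
    # Step 4: solve h = [R(K) / (mu_2^2(K) psi_4(gamma(h)) n)]^{1/5}
    def root_fn(h):
        return h - (RK / (psi_hat(4, gamma(h), x) * n)) ** 0.2
    return brentq(root_fn, 1e-3 * sigma_hat, 10.0 * sigma_hat)   # hypothetical bracket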
4.7 Smoothed Cross-Validation Bandwidth Selection
Using a kernel estimator with bandwidth $g$ to estimate the integrated squared bias component of $\mathrm{MISE}\{\hat{f}(\cdot;h)\}$ is what plug-in bandwidth selection has in common with this kind of cross-validation selection, namely smoothed cross-validation (SCV) bandwidth selection. So the DPI and SCV methods have the same theoretical properties.
The formula for $\mathrm{MISE}\{\hat{f}(\cdot;h)\}$ in Eqn. (2.2.6) can be written asymptotically as
$$\mathrm{MISE}\{\hat{f}(\cdot;h)\} \approx (nh)^{-1}R(K) + \int\Bigl(\int K_h(x-y)f(y)\,dy - f(x)\Bigr)^2 dx.$$
The second term is the integrated squared bias of $\hat{f}(\cdot;h)$.
The unknown $f$ is replaced by a pilot estimator
$$\hat{f}_L(x;g) = n^{-1}\sum_{i=1}^{n} L_g(x - X_i),$$
where $L_g(x) = L(x/g)/g$, $L$ is a kernel that may be different from $K$, and $g$ is a pilot bandwidth. This gives the smoothed cross-validation objective function SCV, namely
$$\mathrm{SCV}(h) = (nh)^{-1}R(K) + \widehat{\mathrm{ISB}}(h),$$
where
$$\widehat{\mathrm{ISB}}(h) = \int\Bigl\{\int K_h(x-y)\hat{f}_L(y;g)\,dy - \hat{f}_L(x;g)\Bigr\}^2 dx \qquad (4.7.1)$$
is an estimate of the integrated squared bias ISB. From the objective function SCV($h$), we define the bandwidth $\hat{h}_{\mathrm{SCV}}$ to be the largest local minimizer of SCV($h$).
As we see above, SCV is based on the exact integrated squared bias rather than its asymptotic approximation. This may make it more difficult to analyze than DPI.
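When both $K$ and $L$ are Gaussian, the double integral in $\widehat{\mathrm{ISB}}(h)$ has a closed form via $\int\phi_a(x-X_i)\phi_b(x-X_j)\,dx=\phi_{\sqrt{a^2+b^2}}(X_i-X_j)$, which gives the following sketch of SCV($h$); the fixed pilot bandwidth g passed in is an assumption, not a recommendation.

import numpy as np

def _phi(u, s):
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def scv(h, g, x):
    """SCV(h) = (nh)^{-1} R(K) + ISB_hat(h) for Gaussian K and L (sketch)."""
    n = len(x)
    d = x[:, None] - x[None, :]
    # ISB_hat(h) = n^{-2} sum_{i,j} [ phi_{sqrt(2h^2+2g^2)} - 2 phi_{sqrt(h^2+2g^2)}
    #                                 + phi_{sqrt(2g^2)} ](X_i - X_j)
    isb = (_phi(d, np.sqrt(2*h**2 + 2*g**2))
           - 2.0 * _phi(d, np.sqrt(h**2 + 2*g**2))
           + _phi(d, np.sqrt(2*g**2))).sum() / n**2
    return 1.0 / (2 * np.sqrt(np.pi) * n * h) + isb

# h_scv would then be taken as the largest local minimizer of scv(., g, x) over a grid of h values.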
Conclusions
In order to achieve our goal of obtaining, nonparametrically, the best possible smooth estimate of an unknown density, we find that the kernel density estimator is a good alternative that overcomes the deficiencies of the oldest and most popular method, the histogram.
We find that the key to understanding how the kernel density estimator works is to understand the relationship between the two components of the mean integrated squared error. This leads to the most important parameter, the bandwidth $h$. Its closed-form expression shows that the optimal bandwidth tends to zero as $n \to \infty$ and gives the corresponding rate of convergence. We find that the best obtainable rate of convergence of the kernel estimator is of order $n^{-4/5}$. This rate is slower than the parametric rate, which is of order $n^{-1}$. We have studied a form of the asymptotic mean integrated squared error that separates the kernel $K$ and the bandwidth $h$, and we have also seen that using equal bandwidths with different canonical kernels gives the same amount of smoothing. We have also seen how to measure the degree of difficulty of kernel estimation of a given $f$.
The basic KDE has several modifications. We saw that there is always some improvement if $h(x)$ is chosen optimally in the local KDE. The variable KDE allows different degrees of smoothing depending on the relation between data points. The transformation KDE is used when the shape of $f$ is difficult to estimate.
Because of the large influence of the bandwidth $h$ on the smoothing process, we devoted Chapter 4 to studying some types of bandwidth selectors. Although these selectors are divided into quick and simple selectors and hi-tech selectors, we find that the latter type depends on the quick and simple rules.
We have found that the DPI selector provides an explicit formula for an optimal bandwidth value, while the STE selector finds $h$ as the solution of an equation solved iteratively. CV techniques depend on relatively hard calculations of the objective function. So, practically, the DPI and STE methods are more suitable than the CV methods, except for those CV methods that use the ideas of the DPI approach.
As we have seen, the attention in this thesis is concentrated on the univariate KDE. A complete investigation of the extension of the KDE to the multivariate case is still required. The need for nonparametric density estimates for discovering structure in multivariate data is often greater, because parametric estimation becomes more complicated than in the univariate case.
We also suggest an extended study of kernel regression, since it is one of the fundamental subjects in which the techniques of kernel smoothing work very well.
We hope that more attention is given to the modifications of the KDE, but they need a special study of their own, as we have done for the basic estimator.
Finally, our analysis of the performance of the KDE in Chapter 2 assumes that the kernel $K$ is a probability density function; one may relax this restriction and examine the effect on the rates of convergence and other properties of the KDE.
Bibliography
[1] Wand, M. P. and Jones, M. C., Kernel Smoothing, first ed., Chapman & Hall/CRC, London, 1995.
[2] Casella, George and Berger, Roger L., Statistical Inference, Duxbury Press (an imprint of Wadsworth Publishing Company), Belmont, California, 1990.
[3] Apostol, Tom M., Mathematical Analysis (A Modern Approach to Advanced Calculus), second ed., Addison-Wesley Publishing Company, 1977.
[4] Hogg, Robert V. and Tanis, Elliot A., Probability and Statistical Inference, fifth ed., Macmillan Publishing Co., Inc., New York, London, 1997.
[5] Thomas and Finney, Calculus, ninth ed., Addison-Wesley Publishing Company.
[6] Polansky, Alan M., Bandwidth Selection for Kernel Distribution Functions, Northern Illinois University, Division of Statistics, AMS 1991 subject classifications.
[7] Bashtannyk, David M. and Hyndman, Rob J., Bandwidth Selection for Kernel Conditional Density Estimation, August 24, 2000.
[8] Sain, Stephan R. and Scott, David W., On Locally Adaptive Density Estimation, January 8, 1996.
[9] Hengartner, Nicolas W., Asymptotic Unbiased Density Estimator, Yale University, Department of Statistics, New Haven, U.S.A., February 25, 1997.
[10] Baxter, M. J. and Beardah, C. C., Beyond the Histogram: Improved Approaches to Simple Data Display in Archaeology Using Kernel Density Estimates, Department of Mathematics, Statistics and Operational Research, The Nottingham Trent University, Nottingham, United Kingdom.
[11] Walter, Bruce Jonathan, Density Estimation Techniques for Global Illumination, Ph.D. dissertation, Cornell University, August 1998.