Penalized Exponential Series Estimation of Copula
Densities
Ximing Wu∗
Abstract
The exponential series density estimator is advantageous for copula density estimation as it is strictly positive, explicitly defined on a bounded support, and largely
mitigates the boundary bias problem. However, the selection of basis functions is chal-
lenging and can cause numerical difficulties, especially for high dimensional density
estimations. To avoid the issues associated with basis function selection, we adopt
the strategy of regularization by employing a relatively large basis and penalizing the
roughness of the resulting model, which leads to a penalized maximum likelihood esti-
mator. To further reduce the computational cost, we propose an approximate likelihood
cross validation method for the selection of the smoothing parameter. Our extensive Monte
Carlo simulations demonstrate the effectiveness of the proposed estimator for copula
density estimations.
∗Department of Agricultural Economics, Texas A&M University. Email: [email protected]. I gratefully acknowledge the Supercomputing Facility of Texas A&M University, where all computations in the study were performed.
1 Introduction
This paper proposes a penalized maximum likelihood estimation for copula densities via the
exponential series estimator for multivariate densities introduced in Wu (2010). Consider a
d-dimensional random variable x with joint distribution function F . In his seminal paper,
Sklar (1959) shows that, via a change of variable argument, the joint distribution can be
written as
F (x1, . . . , xd) = C (F1(x1), . . . , Fd(xd)) , (1)
where Fj(xj), j = 1, . . . , d, is the marginal distribution of the jth element of x. The function C(·), the so-called copula function, completely summarizes the dependence structure among the elements of x. When
the margins are continuous, the copula function is unique. Thus a multivariate distribution
can be completely described by its copula and its univariate marginal distributions. Suppose
F is differentiable with density function f . Taking derivatives of both sides of (1) yields
f(x1, · · · , xd) = f1(x1) · · · fd(xd)c(F1(x1), . . . , Fd(xd)), (2)
where fj(·) is the marginal density of xj, j = 1, . . . , d, and c(·) is the copula density function,
which itself is a density function defined on the unit cube [0, 1]d. For a detailed treatment
of the mathematical properties of copulas, see Nelsen (2010).
The copula is a useful device for two reasons. First, it provides a way of studying scale-free
dependence structure. By writing a joint density function as the product of marginal densities
and the copula density, one can separate the influence of marginal densities from that of the
dependence structure. The dependence structure captured by the copula is scale free and
invariant to monotone transformations. In fact, many well known measures of dependence,
including Kendall’s τ and Spearman’s ρ, can be calculated from the copula function alone.
Second, the copula is a starting point for constructing families of multivariate distributions. It
allows us to divide multivariate density estimation into two parts: the univariate density
estimation of the margins and the copula estimation.
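As a concrete illustration of the first point (a standard result added here; see Nelsen (2010) for derivations), both rank correlations mentioned above can be written as functionals of the copula alone:

τ = 4 ∫∫_{[0,1]^2} C(u, v) dC(u, v) − 1,   ρ = 12 ∫∫_{[0,1]^2} C(u, v) du dv − 3.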
Like the usual density functions, the copula densities can be estimated by parametric
or non-parametric methods. The commonly used parametric copulas usually contain one
or two parameters and thus may not be adequate to describe complicated relations among
random variables. In addition, simple copulas sometimes place restrictions on the dependence
structure among variables. For example, the popular Gaussian copula assumes zero tail
dependence among random variables and is therefore not suitable for the study of financial
assets that tend to move together under extreme market conditions. A second limitation
of the parametric approach is that many parametric copulas are only defined for bivariate
variables, and extensions to higher dimensional cases are not available.
Alternatively one can estimate copula densities using nonparametric methods. The kernel
density estimator (KDE) is a popular smoother for density estimation. It is known that the
KDE suffers from boundary bias, which is particularly severe when the derivatives of a
density do not vanish at the boundaries. Unfortunately this poses a considerable difficulty for
copula density estimation, because copula density functions often have nonzero derivatives at
boundaries and corners. For example, the return distributions of US and UK stock markets
tend to move together, especially under extreme market conditions, resulting in spikes in
their copula density function at the two ends of the diagonal of the unit square.1
Like kernel estimation, series estimation is a commonly used nonparametric approach. For density estimation, orthogonal series estimation, or generalized Fourier estimation, is often employed. The series estimator has the advantage of automatic adaptiveness
in the sense that the degree of the series, when selected in an optimal manner, can adapt
to the unknown degree of smoothness of the underlying distribution to obtain the optimal
convergence rate. In contrast, for kernel estimation one may need to use a higher (than second) order kernel to obtain the optimal convergence rate.2 However,
the series density estimators share with higher order kernel estimators the same problem
that they may produce negative density estimates.
Wu (2010) proposes an exponential series estimator (ESE) for multivariate density esti-
1Charpentier et al. (2007) discuss several remedies to mitigate the boundary bias of the KDE along the line of boundary kernel estimators.
2For instance, higher-order kernels are required to obtain a faster-than-n^{-2/5} convergence rate for univariate kernel density estimations.
mations. This method is particularly advantageous for copula density estimation as it is
strictly positive, explicitly defined on a bounded support, and largely mitigates the boundary
bias problem. Numerical evidence in Wu (2010) and Chui and Wu (2009) demonstrates the
effectiveness of this method for copula densities. However, the selection of basis functions
for the multivariate ESEs is challenging and can cause severe numerical difficulties. In this
study, we adopt a regularization approach by employing a relatively large set of basis functions
and penalizing the roughness of the resulting model to balance between the goodness-of-fit
and the simplicity of the model. This approach leads to a penalized maximum likelihood
estimator for copula densities. To further reduce the computational cost, we suggest an ap-
proximate likelihood cross validation method for smoothing parameter selection. Our Monte
Carlo simulations show that the proposed estimator outperforms the conventional kernel
density estimator, sometimes by substantial margins.
The rest of the paper is organized as follows. Section 2 provides brief background on
the exponential series estimation, discussing its information theoretic origin, large sample
properties, extensions to multivariate variables, and smoothing parameter selection. Section
3 proposes the penalized exponential series estimator and presents an approximate likelihood
cross validation method for smoothing parameter selection. Section 4 reports our Monte
Carlo simulations. Some concluding remarks are offered in the last section.
2 Exponential Series Estimator of Copula Density Functions
Wu (2010) proposes a multivariate exponential series estimator, and shows that it is particu-
larly useful for copula density estimations. In this section, we briefly discuss the exponential
series density estimator. We first introduce the idea of maximum entropy density, upon
which the exponential series estimator is based. We then present the exponential series
estimator and discuss its smoothing parameter selection and some practical difficulties in
multivariate cases.
2.1 Maximum Entropy Density
One strategy to obtain strictly positive density estimates using the series method is to
model the log density via a series estimator. This idea is not new; earlier studies on the
approximation of log-densities using polynomials include Neyman (1937) and Good (1963).
Transforming the polynomial estimate of log-density back to its original scale results in a
density estimator in the exponential family. Thus approximating log densities by the series
estimators amounts to estimating densities by sequences of canonical exponential families.
The maximum likelihood estimation (MLE) provides efficient estimates of these exponential
families. Crain (1974) establishes the existence and consistency of the MLE in this case.
This method of density estimation arises naturally according to the principle of maximum
entropy. The information entropy, the central concept of information theory, of a univariate
continuous random variable with density f is defined as
W(f) = −∫ f(x) log f(x) dx.
Suppose that, for a random variable x with an unknown density function f0, one knows
only some of its moments. There may exist an infinite number of distributions satisfying
these given moment conditions. Jaynes (1957) proposes a method of constructing a unique
density estimate based on the moment conditions as follows:

max_f W(f)

subject to the integration-to-unity and side moment conditions:

∫ f(x) dx = 1,   ∫ φ_k(x) f(x) dx = µ_k,   k = 1, . . . , K,

where the φ_k are real-valued, linearly independent functions defined on the support of x.
The solution, obtained by an application of calculus of variations, takes the form
f(x; c) = exp(∑_{k=1}^{K} c_k φ_k(x) − c_0) = exp(c′φ(x) − c_0),   (3)

where φ = (φ_1, . . . , φ_K)′, and c = (c_1, . . . , c_K)′ are the Lagrangian multipliers for the moment conditions. The normalization factor c_0 = log{∫ exp(c′φ(x)) dx} < ∞ ensures the integration to unity condition. Among all distributions satisfying the given moment conditions, the
maximum entropy density is the closest to the uniform distribution defined on the support
of x. Many distributions can be characterized as maximum entropy densities. For example,
the normal distribution is obtained by setting φ1(x) = x and φ2(x) = x2 for x ∈ R, and the
Beta distribution by φ1(x) = ln(x) and φ2(x) = ln(1− x) for x ∈ (0, 1).
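As a quick verification of the first example (added for illustration), write f(x; c) = exp(c_1 x + c_2 x^2 − c_0) with c_2 < 0. Completing the square gives

c_1 x + c_2 x^2 = c_2 (x + c_1/(2c_2))^2 − c_1^2/(4c_2),

so f(x; c) is proportional to exp(−(x − µ)^2/(2σ^2)) with µ = −c_1/(2c_2) and σ^2 = −1/(2c_2), which is indeed a normal density.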
In practice, the population moments are often unknown and therefore replaced by their
sample counterparts. Given an iid random sample X1, . . . , Xn, the maximum entropy density
is then estimated by the MLE based on the sample moments φ̄ = (φ̄_1, . . . , φ̄_K)′, where φ̄_k = (1/n) ∑_{i=1}^{n} φ_k(X_i), k = 1, . . . , K. The log-likelihood function is given by
L = (1/n) ∑_{i=1}^{n} [c′φ(X_i) − log{∫ exp(c′φ(x)) dx}] = c′φ̄ − log{∫ exp(c′φ(x)) dx}.
Denote the MLE solution by f(·; c). Thanks to the canonical exponential form of the maximum entropy density, the sample moments φ̄ are the sufficient statistics of f(·; c). Therefore, we call φ̄ the characterizing moments of the maximum entropy density.
The coefficients of the maximum entropy density generally cannot be obtained analytically and thus must be solved for numerically. Zellner and Highfield (1988) and Wu (2003) discuss the numerical calculation of the maximum entropy density. Define

g(x) = c′φ(x),   µ_g(h) = ∫ h(x) exp(g(x)) dx / ∫ exp(g(x)) dx.   (4)
The score function and the Hessian matrix of the MLE are then given by
S = φ̄ − µ_g(φ),
H = −{µ_g(φφ′) − µ_g(φ)µ_g(φ)′},

where µ_g and g are defined in (4). One can then use Newton's method to solve for c iteratively. The uniqueness of the solution is ensured by the negative definiteness of the Hessian matrix. Therefore, for a maximum entropy density, there exists a one-to-one correspondence between its characterizing moments φ̄ and its coefficients c.
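To make the Newton iteration concrete, the following is a minimal sketch (not the author's code) of the univariate computation, assuming an orthonormal cosine basis on [0, 1] and Gauss-Legendre quadrature for the integrals; the function names, sample size, and tolerance are our own choices.

```python
# A minimal sketch (not the author's code) of Newton's method for a univariate
# maximum entropy density f(x; c) = exp(c'phi(x) - c0) on [0, 1].
import numpy as np

def basis(x, K):
    """Evaluate the K orthonormal cosine basis functions sqrt(2)cos(pi k x) at x."""
    k = np.arange(1, K + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, k))        # shape (len(x), K)

def fit_maxent(sample, K, tol=1e-8, max_iter=100):
    """Solve for the coefficients c that reproduce the sample moments phi_bar."""
    nodes, weights = np.polynomial.legendre.leggauss(64)         # quadrature on [-1, 1]
    x, w = 0.5 * (nodes + 1.0), 0.5 * weights                    # mapped to [0, 1]
    Phi_q = basis(x, K)                                          # basis at quadrature nodes
    phi_bar = basis(np.asarray(sample), K).mean(axis=0)          # sample characterizing moments
    c = np.zeros(K)
    for _ in range(max_iter):
        g = Phi_q @ c
        dens = np.exp(g - g.max())
        dens /= np.sum(w * dens)                                 # normalized density at the nodes
        mu = Phi_q.T @ (w * dens)                                # mu_g(phi)
        mu2 = (Phi_q * (w * dens)[:, None]).T @ Phi_q            # mu_g(phi phi')
        score = phi_bar - mu
        hessian = -(mu2 - np.outer(mu, mu))
        c = c - np.linalg.solve(hessian, score)                  # Newton update
        if np.max(np.abs(score)) < tol:
            break
    return c

# usage: a degree-4 fit to a Beta(2, 5) sample
c_hat = fit_maxent(np.random.default_rng(0).beta(2.0, 5.0, size=200), K=4)
```

In practice one might add step-halving or another safeguard, but for well-conditioned bases the unmodified update typically converges in a handful of iterations.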
2.2 Exponential Series Density Estimator
The maximum entropy density is a useful approach for constructing a density estimate given a set of moment conditions, and it enjoys an appealing information-theoretic interpretation.
On the other hand, like the usual parametric models, this density estimator generally is not
consistent unless the underlying distribution happens to belong to the canonical exponential
family with characterizing moments given by φ. To obtain consistent density estimates,
in principle one can let the number of characterizing moments increase with the sample
size at a proper rate, which effectively transforms the maximum entropy method into a
nonparametric estimator. To stress the nonparametric nature of the said estimator, we call
a maximum entropy density whose number of characterizing moments increases with sample
size an Exponential Series Estimator (ESE).
Moving into the realm of nonparametric estimations inevitably brings new problems.
The paramount issue is the determination of the degree of smoothing, which will be discussed at length below. Another issue that warrants caution is identification. To ensure a one-
to-one correspondence between f(x) and exp(g(x))/∫ exp(g(x)) dx, we need to impose certain restrictions. Two commonly used identification conditions are g(x_0) = 0 for some fixed x_0 and ∫ g(x) dx = 0. When we use orthogonal series as the basis functions, the second condition is
satisfied automatically. Furthermore, since a constant in g(x) is not identified, it is excluded.
Thus throughout the text, we maintain that the usual zero order term φ0(x) = 1 is excluded
from the basis functions for g.
Let x be a random variable defined on [0, 1] with density f_0, and let f be an ESE approximation to f_0. Without loss of generality, let φ = (φ_1, . . . , φ_K)′ be a series of orthonormal basis
functions with respect to the Lebesgue measure on [0, 1]. One can measure the discrepancy
between f0 and f by the Kullback-Leibler Information Criterion (KLIC, also known as the
relative entropy or cross entropy), which is defined as D(f0‖f) =∫f0(x) ln(f0(x)/f(x))dx.3
In an important development, Barron and Sheu (1991) establish that the sequence of f's converges to f_0 in terms of the KLIC. In particular, suppose ∫ {(∂^r/∂x^r) log f_0(x)}^2 dx < ∞; then D(f_0‖f) = O_p(1/K^{2r} + K/n), with K → ∞ and K^3/n → 0 for the power series and K^2/n → 0 for the trigonometric series and the splines.
Wu (2010) extends the ESE to multivariate densities. He uses the tensor product of the
univariate orthogonal basis functions to construct multivariate orthogonal basis functions.
Let x be a d-dimensional random variable defined on [0, 1]d with density f0. A multivariate
ESE for f0 is then constructed as
f(x) = exp(∑_{k_1=1}^{K_1} · · · ∑_{k_d=1}^{K_d} c_{k_1···k_d} φ_{k_1}(x_1) · · · φ_{k_d}(x_d)) / ∫ exp(∑_{k_1=1}^{K_1} · · · ∑_{k_d=1}^{K_d} c_{k_1···k_d} φ_{k_1}(x_1) · · · φ_{k_d}(x_d)) dx_1 · · · dx_d.
Under the assumption that ∫ {∂^r ln f_0(x)/(∂x_1^{r_1} · · · ∂x_d^{r_d})}^2 dx < ∞, where r = ∑_{j=1}^{d} r_j, he shows that the ESE estimates converge to f_0 at rate O_p(∏_{j=1}^{d} K_j^{-2r_j} + (1/n) ∏_{j=1}^{d} K_j) in terms of the
KLIC. Convergence rates in other metrics are also established, and extensive Monte Carlo
simulations demonstrate the effectiveness of the ESE for multivariate density estimations.
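As a small illustration of the tensor-product construction (a sketch under our own conventions, not code from the paper), the bivariate case with a common degree K_1 = K_2 = K and a cosine basis can be evaluated as follows; the placeholder coefficients and the Riemann-sum normalization are only for brevity.

```python
# A minimal sketch (not from the paper) of evaluating a bivariate ESE built from a
# tensor product of univariate cosine basis functions with K1 = K2 = K.
import numpy as np
from itertools import product

def cos_basis(u, k):
    """kth orthonormal cosine basis function on [0, 1]."""
    return np.sqrt(2.0) * np.cos(np.pi * k * u)

def tensor_basis(u, v, K):
    """Columns phi_{k1}(u) phi_{k2}(v) for all 1 <= k1, k2 <= K."""
    cols = [cos_basis(u, k1) * cos_basis(v, k2)
            for k1, k2 in product(range(1, K + 1), repeat=2)]
    return np.column_stack(cols)

def ese_density(u, v, coef, K, n_grid=101):
    """exp(g)/integral of exp(g) over the unit square, with g = coef'phi."""
    g = tensor_basis(u, v, K) @ coef
    s = np.linspace(0.0, 1.0, n_grid)
    gu, gv = np.meshgrid(s, s)
    g_grid = tensor_basis(gu.ravel(), gv.ravel(), K) @ coef
    norm = np.exp(g_grid).mean()              # mean over [0,1]^2 approximates the integral
    return np.exp(g) / norm

coef = 0.1 * np.ones(4)                        # placeholder coefficients for K = 2
vals = ese_density(np.array([0.3]), np.array([0.7]), coef, K=2)
```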
Like the orthogonal series density estimator, the ESE enjoys the automatic adaptiveness
to the unknown smoothness of the underlying distribution. On the other hand, it is strictly
positive, and therefore avoids the negative density estimates that might occur with the
orthogonal series estimators and the higher order kernel estimators. In addition, Wu (2010)
3The KLIC is a pseudo-metric in the sense that D(f||g) = 0 if and only if f = g almost everywhere, whereas it is asymmetric and does not satisfy the triangle inequality.
suggests that the ESE is an appealing estimator for the copula density because it is explicitly
defined on a bounded support and therefore less sensitive to the boundary bias problem. Chui
and Wu (2009) provide further Monte Carlo evidence on this.
2.3 Selection of basis functions
In this subsection we discuss the selection of the degree of basis functions for the ESE, with
a focus on the multivariate case. It is well known that the choice of smoothing parameter
is often the most crucial ingredient of nonparametric estimations. The kernel density esti-
mates can vary substantially with the bandwidth. Similarly, the numerical performance of
orthogonal series density estimations hinges on the degree of basis functions; for example,
a high order power series may oscillate wildly and produce negative density estimates. The
ESE, which can be viewed as the exponential of a series estimator, is no exception.
When a higher-than-desirable number of characterizing moments is used in the estimation,
the density estimates may exhibit spurious bumps and spikes. In addition, a large number of
characterizing moments increases not only the computational cost, but also the probability
that the Hessian matrix used in Newton updating approaches (near) singularity. Therefore,
judicious choice of the degree of basis functions is called for.4
The natural connection between the maximum entropy density and the MLE facilitates
adopting some information criterion for model specification. The Akaike Information Crite-
rion (AIC) and Bayesian Information Criterion (BIC) are two commonly used information
criteria that strive for a balance between the goodness-of-fit and simplicity of statistical
modeling. In the paradigm of nonparametric estimation where an estimator approximates
the unknown underlying process, the AIC is considered optimal in the minimax sense.5 On
4The selection of basis functions is particularly important for the ESEs, compared with the generalized Fourier series density estimation. For the latter, the coefficient for each basis function φ_k is given by ∫ φ_k(x) f_0(x) dx, which can be conveniently estimated by its sample counterpart. Therefore, although the selection of basis functions affects its performance, no numerical difficulties are involved for the generalized Fourier series estimations. In contrast, the coefficients for the ESEs are obtained through an inverse problem which involves all basis functions through Newton's updating; an overlarge basis may render the Hessian matrix near singular and consequently cause numerical difficulties.
5The BIC is consistent if the set of candidates contains the true model. However, in nonparametric estimations generally the true model is assumed unknown and the goal is to arrive at an increasingly better approximation to the underlying model rather than identifying the true model.
the other hand, from a penalized MLE point of view, the difference between the AIC and
the BIC resides in their penalties for roughness or number of parameters. Let L be the log
likelihood and K the number of parameters in a model, which reflects the complexity of the
model and is to be penalized. Both criteria can be written in the form L − λK, where the second term is the roughness penalty and λ determines its strength. For the AIC and the BIC, λ takes the value of 1 and (1/2) ln n, respectively.
The cross validation (CV) provides an alternative method to select smoothing parameters.
Let L_{−i} denote the log likelihood of the ith observation evaluated at a model estimated with the entire sample except the ith observation. The cross validated log likelihood is calculated as L_− ≡ (1/n) ∑_{i=1}^{n} L_{−i}. The likelihood cross validation method minimizes the Kullback-Leibler
loss and therefore is asymptotically equivalent to the AIC approach. (See Hall (1987) for an
in-depth analysis of the likelihood cross validation method.)
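For concreteness, a brute-force version of this procedure is sketched below (not from the paper); `fit` and `logpdf` are hypothetical callables standing in for any density estimator and its log density. The n refits it requires are what motivate the approximation developed in Section 3.2.

```python
# A minimal sketch (not from the paper) of brute-force leave-one-out likelihood cross
# validation; `fit` and `logpdf` are hypothetical stand-ins for any density estimator.
import numpy as np

def loo_cv_loglik(sample, fit, logpdf):
    """Average leave-one-out log likelihood; note that it requires n refits."""
    sample = np.asarray(sample)
    n = len(sample)
    total = 0.0
    for i in range(n):
        model = fit(np.delete(sample, i, axis=0))    # refit without observation i
        total += logpdf(model, sample[i])             # score the held-out observation
    return total / n
```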
Using the information criteria or the CV method to select the smoothing parameter
for the univariate ESE is relatively straightforward. Recall that the selection of the smoothing parameter is equivalent to the selection of basis functions or characterizing moments for the
ESE. Given a reasonably large candidate set of basis functions, one can evaluate all subsets of
the candidate set and select the optimal set of basis functions according to a given selection
criterion. However, the process can be time consuming: if the candidate set contains K basis
functions, then the number of subsets is 2^K. This process can be greatly simplified when
the basis functions have certain natural ordering or hierarchical structure. For example, the
polynomial series and the trigonometric series have an intuitive frequency interpretation
such that the low/high order basis functions capture low/high frequency features of the
underlying process. When this type of series is used, it is a common, and sometimes preferred, practice to use a hierarchical selection approach in the sense that if the kth basis function is selected, all lower order basis functions are included automatically. Clearly, the hierarchical
selection method is a truncation approach. The number of models that need to be estimated
is K, considerably smaller than the 2^K required by complete subset selection.
In principle either the subset selection or truncation method can be used to select the
smoothing parameter for estimating multivariate densities using the ESE. However, the
practical difficulty is that the number of required evaluations increases exponentially with
the dimension of x. For density estimation of a d-dimensional x, we consider the tensor
products of univariate basis functions given by φ_k(x) = ∏_{j=1}^{d} φ_{k_j}(x_j), where the multi-index k = (k_1, . . . , k_d). Denote the size of a multi-index by |k| = ∑_{j=1}^{d} k_j. Suppose the candidate set M_K consists of basis functions whose sizes are no greater than K, i.e.,

M_K = {φ_k : 1 ≤ |k| ≤ K}.
With a slight abuse of notation, let |M_K| denote the number of elements in M_K. One can show that |M_K| = C(K + d, d) − 1, where C(·, ·) denotes the binomial coefficient. Therefore, if the subset selection method is used, it would require estimating 2^{|M_K|} ESE densities, which can be prohibitively expensive. For instance, if d = 2 and K = 4, we would need to estimate 2^{14} ESE densities; the number explodes to 2^{34} if d = 3.
Thus, the subset selection approach is practically infeasible except for the simplest cases.
Now let us consider the truncation method, which is more economical in terms of basis
functions. It seems we can proceed as in the univariate case, estimating the ESE densities
with basis functions Mk, k = 1, . . . , K, and then selecting the optimal set according to
some criterion. There is, however, a key difference between the univariate case and the
multivariate case. For the former, as we raise k from 1 to K, each time we increase the
number of basis functions by one. On the other hand, for the general d-dimensional case,
the number of basis functions added to the candidate set increases with k. To be precise, let
m_K = {φ_k : |k| = K}; then M_K = (m_1, . . . , m_K). The number of additional basis functions incorporated along the stepwise process at each stage is |m_k| = |M_k| − |M_{k−1}| = C(k + d − 1, d − 1) for d ≥ 2 and k ≥ 1. For instance, when d = 2, we have |m_k| = k + 1; when d = 3,
the corresponding step sizes increase to 3, 6, 10, and 15 for k = 1, . . . , 4. Therefore in
the multivariate case, the number of basis functions added at each step of the truncation method rises rapidly with the dimension d, leading to increasingly abrupt expansions in
model complexity. Consequently, the suitability of the simple truncation method to high
dimensional cases is questionable.
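The counting arithmetic above is easy to reproduce; the following sketch (ours, not the paper's) prints |M_K|, the number of candidate subsets, and the stage-wise increments |m_k| for d = 2, 3 and K = 4.

```python
# A small sketch (ours) of the basis-counting arithmetic: |M_K| = C(K+d, d) - 1 candidate
# basis functions, 2^{|M_K|} subsets, and increments |m_k| = C(k+d-1, d-1) per stage.
from math import comb

def n_basis(K, d):
    """Number of tensor-product basis functions with 1 <= |k| <= K."""
    return comb(K + d, d) - 1

def step_size(k, d):
    """Number of basis functions added at stage k of the truncation method."""
    return comb(k + d - 1, d - 1)

for d in (2, 3):
    print(d, n_basis(4, d), 2 ** n_basis(4, d), [step_size(k, d) for k in range(1, 5)])
# d = 2: 14 functions, 2^14 = 16384 subsets, increments [2, 3, 4, 5]
# d = 3: 34 functions, 2^34 subsets, increments [3, 6, 10, 15]
```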
Lastly, in addition to the practical difficulties associated with model specification in the high dimensional case, there exists another potential problem. Recall that the maximum en-
tropy density can produce spurious spikes when it is based on a large number of moments.
Not surprisingly, this problem can be aggravated when using the ESE to estimate multi-
variate densities, whose number of characterizing moments increases rapidly with dimension
and sample size. In addition, the larger is the number of basis functions, the higher is the
probability that the Hessian matrix used in the Newton updating becomes (near) singu-
lar, introducing further complications. To mitigate the aforementioned problems, below we
propose an alternative penalized MLE estimation approach for copula densities using the
ESE.
3 Penalized exponential series estimation
As is discussed above, the ESE for multivariate densities can be challenging due to difficulties
associated with selection of basis functions and numerical issues. In this section, we propose
to use the method of penalized MLE to determine the degree of smoothing: instead of
painstakingly selecting a (small) set of basis functions to model a multivariate density, we
use a relatively large set of basis functions and shrink the coefficients of the resulting model toward zero to penalize its complexity.
3.1 The model
Good and Gaskins (1971) introduce the idea of roughness penalty density estimation. Their
idea is to use as an estimate that density which maximizes a penalized version of the likeli-
hood. The penalized likelihood is defined as
Q = (1/n) ∑_{i=1}^{n} ln f(X_i) − λ J(f),
where J(f) is a roughness penalty for density f and λ is the smoothing/tuning parameter.
The log likelihood term pushes the estimate to adapt to the data, the roughness penalty counteracts by demanding less variation, and the smoothing parameter controls the tradeoff between
the two conflicting goals. Various roughness penalties have been proposed in the literature.
For instance, Good and Gaskins (1971) use J(f) = ∫ (f′(x))^2/f(x) dx, Silverman (1982) sets J(f) = ∫ {(d^3/dx^3) ln f(x)}^2 dx, and Gu and Qiu (1993) propose general quadratic roughness
penalties for the smoothing spline density estimation.
Without loss of generality and for simplicity, we consider a bivariate random variable
(x, y) with a strictly positive and bounded density, f0, defined on the unit square [0, 1]×[0, 1].
Let φk(x, y), 1 ≤ |k| ≤ K, be orthonormal basis functions with respect to the Lebesgue
measure on the unit square. To ease notation, we denote M = |MK |, where M is understood
to be a function of K, which in turn is a function of the sample size. We also change the
multi-index k to a single index k, k = 1, . . . ,M .
We consider approximating f0 by
f(x, y) = exp(∑_{k=1}^{M} c_k φ_k(x, y)) / ∫ exp(∑_{k=1}^{M} c_k φ_k(x, y)) dx dy ≡ exp(g(x, y)) / ∫ exp(g(x, y)) dx dy,   (5)
where g(x, y) = c′φ(x, y) with c = (c1, . . . , cM)′ and φ(x, y) = (φ1(x, y), . . . , φM(x, y))′.
Throughout this section, integration is taken to be on the unit square. For the roughness
penalty, we adopt a quadratic penalty on the log density g. The penalized MLE objective
function is then given by
Q = (1/n) ∑_{i=1}^{n} c′φ(X_i, Y_i) − ln ∫ exp(g(x, y)) dx dy − (λ/2) c′Wc,   (6)
where W is a positive definite weight matrix for the roughness penalty.
Given the smoothing parameter and the roughness penalty, one can use Newton’s method
to solve for c iteratively. The gradient and Hessian of (6) are respectively
S = φ̄ − µ_g(φ) − λWc,
H = −{µ_g(φφ′) − µ_g(φ)µ_g(φ)′} − λW,
where µ_g is given by (4) and g(x, y) = c′φ(x, y). One can establish the existence and uniqueness
of the penalized MLE within the general exponential family under rather mild conditions
(see, e.g., Lemma 2.1 of Gu and Qiu (1993)).
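Relative to the unpenalized Newton iteration sketched in Section 2.1, the only changes are the extra terms −λWc in the score and −λW in the Hessian. A minimal sketch of the modified update (our illustration, with W and lam supplied by the user) is:

```python
# A minimal sketch (ours) of the Newton update for the penalized objective (6); the
# quantities phi_bar, mu, and mu2 are as in the earlier unpenalized sketch.
import numpy as np

def penalized_newton_step(c, phi_bar, mu, mu2, W, lam):
    """One Newton update for the penalized objective (6)."""
    score = phi_bar - mu - lam * (W @ c)                  # penalized gradient
    hessian = -(mu2 - np.outer(mu, mu)) - lam * W          # penalized Hessian
    return c - np.linalg.solve(hessian, score)
```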
To implement the penalized MLE, we must specify several factors: (i) the basis functions
φ; (ii) the weight function W ; and (iii) the smoothing parameter λ. The smoothing param-
eter plays the most crucial role and is discussed at length in the next subsection. As for
the choice of basis functions, commonly used orthogonal series include the Legendre series,
the trigonometric series and the splines. Although there exist subtle differences between
these series (e.g., in terms of their boundary biases), they lead to the same convergence rates
under suitable regularity conditions.
We also need to determine the size/number of basis functions. We stress that for the penal-
ized MLE, the number of basis functions is generally not considered a smoothing parameter.
For instance, in the smoothing spline density estimations, the size of the basis functions can
be as large as the sample size. In practice, often a “sufficiently large” (but smaller than
sample size) basis suffices. The size of the basis in the penalized likelihood estimation is usually considerably larger than what would be selected according to some information criterion, and thus calls for a roughness penalty.6
Next we need to choose the weight matrix W. We consider penalizing the roughness of the log density g(x, y) = c′φ(x, y), whose penalty is given by

J(g) = ∫ {g^{(m)}(x, y)}^2 dx dy = ∫ {c′φ^{(m)}(x, y)}^2 dx dy = c′ {∫ φ^{(m)} φ^{(m)′} dx dy} c,
6See Gu and Wang (2003) for an asymptotic analysis of the size of basis functions in the smoothing splineestimations.
where g^{(m)}(x, y) = ∑_{m_1+m_2≤m} ∂^{m_1+m_2} g(x, y)/(∂x^{m_1} ∂y^{m_2}) with m ≥ 0. Therefore, W is given by the middle
factor of the last equality. Using orthonormal series simplifies the construction of the weight
matrix since it leads to a diagonal weight matrix. When m = 0, W equals the identity
matrix and coefficients for all basis functions are penalized equally; when m ≥ 1, coefficients
for higher order/frequency moments are increasingly penalized, where the rate of increase
rises geometrically with m.7 Popular choices of m include m = 0 and m = 2, corresponding
to the natural splines and cubic splines respectively in the smoothing spline estimations.
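As a small illustration (ours, not the paper's), the diagonal weight matrix for the univariate cosine basis of footnote 7 can be built directly from the (πk)^{2m} weights; a bivariate tensor-product W can be assembled analogously.

```python
# A minimal sketch (ours) of the diagonal penalty weight matrix for the univariate
# cosine basis, whose mth-derivative weight is (pi k)^(2m).
import numpy as np

def cosine_penalty_weights(K, m):
    """Diagonal of W = integral of phi^(m) phi^(m)' dx for the cosine basis on [0, 1]."""
    k = np.arange(1, K + 1)
    return np.diag((np.pi * k) ** (2 * m))

W = cosine_penalty_weights(K=10, m=2)   # m = 2 penalizes the second derivative of g
```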
Since there is a one-to-one correspondence between the characterizing moments and their
coefficients in the ESE density, the penalized MLE can be viewed as a shrinkage estimator
that shrinks the sample characterizing moments towards zero. In addition, this roughness
penalty defines a null space J⊥, on which J(g) = 0. This null space is usually finite-
dimensional to avoid interpolation of the data. When the smoothing parameter λ→∞, the
penalized MLE converges to the MLE on J⊥, which is the smoothest density induced by the
given roughness penalty. For instance, when m = 2, the smoothest function for g is linear in
x; when m = 3, the smoothest function is quadratic in x, leading to the normal distribution
(See Silverman (1982)).
3.2 Selection of Smoothing Parameter
One advantage of using the penalized MLE for model selection is that it avoids the difficult
subset selection or truncation process to determine the optimal degree of smoothing. Instead,
a relatively large number of basis functions is used and the degree of smoothing is determined
by a single continuous parameter λ. In practice, one has to choose λ. The method of cross
validation is commonly used for this purpose. This is a natural choice, because the penalized
MLE does not involve subset selection and therefore the AIC, BIC type criteria that penalize
the number of parameters cannot be easily applied.
The leave-one-out cross validation for linear regressions is quite straightforward. As a
matter of fact, for a sample with n observations, one usually need not estimate n regressions
7For instance, the penalty weight given to a univariate cosine basis function φ_k(x) = √2 cos(πkx), x ∈ [0, 1], is (πk)^{2m}.
because there is an analytical formula to calculate the least squares cross validation result
from the regression on the full sample. This kind of analytical solution, however, generally
does not exist for nonlinear estimations. For the ESE estimation of multivariate densities,
this can pose a practical difficulty due to high computational cost. This is because
the coefficients of the ESE are calculated iteratively through Newton’s updating. For ba-
sis functions of size M , the Hessian matrix has M(M + 1)/2 distinct elements to evaluate,
each requiring multidimensional integration by numerical methods. The computational cost
increases rapidly with the dimension because (i) the number of basis functions increases
with the dimension in nonparametric estimations, and (ii) the multidimensional integration
becomes increasingly expensive with the dimension as well. Thus it is rather expensive to
implement the leave-one-out cross validation for multivariate ESEs, especially for penalized
MLEs that use a large number of basis functions. Therefore we propose a first order approx-
imation to the cross validated log likelihood, which only requires one estimation of the ESE
based on the full sample.
Recall that (5) belongs to the general exponential family and the sample averages φ̄ are the sufficient statistics for the penalized MLE. Denote the sample averages calculated leaving out the ith observation by φ̄_{−i} = (1/(n − 1)) ∑_{j≠i} φ(X_j, Y_j). It follows that φ̄_{−i} are the sufficient statistics for the penalized MLE calculated with the ith observation deleted.
For given basis functions and smoothing parameter, denote by f and f_{−i} the penalized MLE estimates associated with φ̄ and φ̄_{−i} respectively. Let c and H be the estimated coefficients and Hessian matrix of f, and let c_{−i} and H_{−i} be similarly defined. By Taylor's theorem, we have

c_{−i} ≈ c − H^{−1}(φ̄ − φ̄_{−i}).
The normalization factor can be approximated similarly. Define c_0 = ln ∫ exp(g(x, y)) dx dy. Let c_0 and c_{0,−i} be the normalization factors of f and f_{−i} respectively. It follows that

c_{0,−i} ≈ c_0 − µ_g(φ)′ H^{−1}(φ̄ − φ̄_{−i}).
Next let L_{−i} be the log likelihood of the ith observation evaluated at f_{−i}. The cross validated log likelihood can then be approximated as follows:

L_− = (1/n) ∑_{i=1}^{n} L_{−i}(X_i, Y_i)
    = (1/n) ∑_{i=1}^{n} {c′_{−i} φ(X_i, Y_i) − c_{0,−i}}
    ≈ (1/n) ∑_{i=1}^{n} {c − H^{−1}(φ̄ − φ̄_{−i})}′ φ(X_i, Y_i) − (1/n) ∑_{i=1}^{n} {c_0 − µ_g(φ)′ H^{−1}(φ̄ − φ̄_{−i})}
    = (1/n) ∑_{i=1}^{n} {c′φ(X_i, Y_i) − c_0} − (1/n) ∑_{i=1}^{n} φ′(X_i, Y_i) H^{−1}(φ̄ − φ̄_{−i})
    = L − (1/n) ∑_{i=1}^{n} φ′(X_i, Y_i) H^{−1}(φ̄ − φ̄_{−i}),   (7)
where L is the full-sample log likelihood evaluated at the penalized MLE, and the second to last equality follows because ∑_{i=1}^{n}(φ̄ − φ̄_{−i}) = 0. Next let Φ be an n × M matrix with
the ith row being (φ1(Xi, Yi), . . . , φM(Xi, Yi))′. The cross validated log likelihood (7), after
straightforward but tedious algebra, can be written in the following matrix form
L_− ≈ L − (1/(n − 1)) trace[ΦH^{−1}Φ′] + (1/(n(n − 1))) (1′Φ) H^{−1} (Φ′1),   (8)
where 1 is an n × 1 vector of ones.
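A direct transcription of (8) is sketched below (our illustration); it assumes a single penalized fit has already produced the full-sample log likelihood, the basis matrix Φ, and the Hessian H used in the Newton updating.

```python
# A minimal sketch (ours) of the approximate cross validated log likelihood (8).
import numpy as np

def approx_cv_loglik(loglik, Phi, H):
    """First-order approximation (8) computed from one full-sample fit."""
    n = Phi.shape[0]
    Hinv_Phi_t = np.linalg.solve(H, Phi.T)                         # H^{-1} Phi'
    trace_term = np.trace(Phi @ Hinv_Phi_t) / (n - 1)
    ones = np.ones(n)
    quad_term = (ones @ Phi) @ (Hinv_Phi_t @ ones) / (n * (n - 1))
    return loglik - trace_term + quad_term
```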
We then use (8) as the criterion for selecting the smoothing parameter. Since λ is a scalar, we use a simple grid search to locate the solution. As is discussed above, multi-dimensional
numerical integrations are used repeatedly in our estimations. For the calculation of the µ_g's, we use the Smolyak cubature algorithm following Gu and Wang (2003). Smolyak cubatures are
highly accurate for smooth functions. We note that the placement of nodes in Smolyak cu-
batures is dense near the boundaries. Therefore, they are particularly suitable for evaluating
the ESEs of copula densities since they often peak near the boundaries and corners.
4 Monte Carlo Simulations
To investigate the finite sample performance of the proposed estimators, we conduct a
series of Monte Carlo simulations. For the penalized MLE estimator, we penalize the
third order derivatives of the log densities. Denote a bivariate ESE density by f(x, y) =
exp(g(x, y) − c_0), where g(x, y) = ∑_{m=1}^{M} c_m φ_m(x, y). The penalty then takes the form
J(g) = c′ [∫ {(∂^3/∂x^3 + 3 ∂^3/∂x^2∂y + 3 ∂^3/∂x∂y^2 + ∂^3/∂y^3) φ(x, y)} {(∂^3/∂x^3 + 3 ∂^3/∂x^2∂y + 3 ∂^3/∂x∂y^2 + ∂^3/∂y^3) φ(x, y)}′ dx dy] c.
When the smoothing parameter goes to infinity, the penalized MLE converges to the following
smoothest distribution induced by the penalty given above:
f(x, y) = exp(c_1 x + c_2 x^2 + c_3 y + c_4 y^2 + c_5 xy − c_0),   x, y ∈ [0, 1],
which is a truncated bivariate normal density defined on the unit square.8 Alternatively,
one can penalize lower or higher order derivatives of g. We choose the third order derivative
because under this penalty, the smoothest distribution is the simplest one that contains useful
information on the dependence between x and y, which is captured by the sample moment
(1/n) ∑_{i=1}^{n} X_i Y_i. If a lower order derivative is used, the smoothest distribution contains only
moments on the margins and thus is not informative since for copula densities, all margins are
uniform. On the other hand, if higher order derivatives are used, the smoothest distributions
contain higher order information on the dependence between x and y, whose coefficients are
not penalized.9
We consider both the Legendre series and the cosine series, orthonormalized on the unit
8We note that this is different from the Gaussian copula, whose distribution function is given by C(x, y; ρ) = Φ_ρ(Φ^{−1}(x), Φ^{−1}(y)), x, y ∈ [0, 1], where Φ is the standard normal distribution function and Φ_ρ is the standard bivariate normal distribution function with correlation coefficient ρ.
9In contrast to the large literature on the selection of smoothing parameters, theoretical guidance on the specification of penalty forms is scanty. On the other hand, the existing literature suggests that the estimations are usually not sensitive to the form of penalty, which is consistent with our own numerical experiments.
square. The results from the two bases are rather similar; hence, to save space, below
we only report the results on the Legendre series. We consider three sample sizes: 50, 100
and 200. For all three sizes, we find that the Legendre basis functions with degree no larger
than 4 produce satisfactory results.10 The approximate cross validation method described
in the previous section is used to select the smoothing parameter. For comparison, we also
estimate the copula densities using the kernel density estimator. In particular, we use the
product Gaussian kernel and the bandwidth is selected by the method of likelihood cross
validation.
We consider four different copulas: the Gaussian, T , Frank, and Galambos; the first
two belong to the Elliptical class, and the next two to the Archimedean and the extreme
value classes respectively. For each copula, we look at three cases with low, medium and
high dependence respectively. The coefficients for the copulas are selected such that the
low, medium and high dependence cases correspond to a correlation of 0.2, 0.5 and 0.8
respectively. All experiments are repeated 500 times.
For each experiment, let (Xi, Yi), i = 1, . . . , n, be an iid sample generated from a given
copula. Define the pseudo-observations as
X_i^* = (1/(n + 1)) ∑_{j=1}^{n} I(X_j ≤ X_i),   Y_i^* = (1/(n + 1)) ∑_{j=1}^{n} I(Y_j ≤ Y_i),
where the denominator is set to n + 1 to avoid numerical difficulties. Jackel (2002) and
Charpentier et al. (2007) suggest that using the pseudo-observations instead of the true
observations reduces the variation. The intuition is that the above transformation effectively
changes both marginal series (after being sorted in ascending order) to (1/(n + 1), . . . , n/(n + 1)),
which is consistent with the fact that copula densities have uniform margins. We use the
pseudo-observations in all our estimations.
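For completeness, the rank transformation above is straightforward to implement; the following sketch (ours, not the paper's) maps each margin to its ranks divided by n + 1.

```python
# A minimal sketch (ours) of the pseudo-observations: ranks divided by n + 1, so the
# transformed sample lies strictly inside (0, 1), consistent with uniform margins.
import numpy as np

def pseudo_observations(x):
    """x is an (n, d) array with one column per margin."""
    x = np.asarray(x)
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1   # ranks 1, ..., n per column
    return ranks / (n + 1.0)

u = pseudo_observations(np.random.default_rng(1).normal(size=(200, 2)))
```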
To gauge the performance of our estimates, we calculate the mean square errors (MSE)
and mean absolute deviations (MAD) between the estimated densities and the true copula
10Let φ_k be the kth degree Legendre polynomial on [0, 1]. We include in our bivariate density estimations basis functions of the form φ_j(x)φ_k(y), j + k ≤ 4. The size of this basis is 14.
densities, evaluated on a 30 by 30 equally spaced grid on the unit square. Figure 1 reports
the estimation results measured by the MSE. The top, middle and bottom rows correspond
to the sample sizes 200, 100 and 50, and the left, middle and right columns correspond to the
low, medium and high correlation cases. In each plot, the MSEs for the ESEs are represented
by circles connected by solid lines, while those for the KDEs are represented by triangles
connected by dashed lines. The Gaussian, T , Frank and Galambos copulas are labeled as 1, 2, 3
and 4 respectively in each plot. Note that the scales for the plots differ.
In all our experiments, the ESE outperforms the KDE, often considerably. The MSE
increases with the degree of dependence and decreases with the sample size. Averaging
across four copulas, the ratios of MSEs between the ESEs and KDEs are 0.25, 0.49 and 0.77
for the low, medium and high correlation cases respectively. The corresponding ratios for
sample sizes 50, 100 and 200 are respectively 0.62, 0.67 and 0.73.
Figure 2 reports the estimation results in MADs. The overall pictures are similar to
those of the MSEs, but with a larger average performance gap. Averaging across four copulas,
the ratios of MADs between the ESEs and KDEs are 0.32, 0.42 and 0.61 for the low, medium
and high correlation cases respectively. The corresponding ratios for sample sizes 50, 100
and 200 are respectively 0.45, 0.45 and 0.48. Thus our numerical experiments support our
contentions in the previous sections that the ESE provides a useful nonparametric estimator
for the copula densities.
5 Concluding Remarks
We have proposed a penalized maximum likelihood estimator of the exponential series
method for copula density estimation. The exponential series density estimator is strictly
positive and overcomes the boundary bias issue associated with the kernel density estimation.
However, the selection of basis functions for the ESEs is challenging and can cause severe
numerical difficulties, especially for multivariate densities. To avoid the issue of basis func-
tion selection, we adopt the strategy of regularization by employing a relatively large basis
and penalizing the roughness of the resulting model, which leads to a penalized maximum likelihood estimator.
Figure 1: Mean squared errors of estimated copulas. The ESE and the KDE results are represented by circles and triangles respectively. Rows 1-3 correspond to n = 200, 100 and 50 respectively; columns 1-3 correspond to correlation equal to 0.2, 0.5 and 0.8 respectively; in each plot, copulas 1-4 correspond to the Gaussian, T, Frank and Galambos copulas. Note that the scales of the plots differ.
Figure 2: Mean absolute deviation of estimated copulas. The ESE and the KDE results are represented by circles and triangles respectively. Rows 1-3 correspond to n = 200, 100 and 50 respectively; columns 1-3 correspond to correlation equal to 0.2, 0.5 and 0.8 respectively; in each plot, copulas 1-4 correspond to the Gaussian, T, Frank and Galambos copulas. Note that the scales of the plots differ.
To further reduce the computational cost, we propose an approximate
likelihood cross validation method for the selection of the smoothing parameter. Our extensive
Monte Carlo simulations demonstrate the usefulness of the proposed estimator for copula
density estimations. Generalization of the said estimator to nonparametric multivariate re-
gressions and applications in high dimensional analysis, especially in financial econometrics,
may be of interest for future study.
References
Barron, A. and C. Sheu (1991). Approximation of density functions by sequences of expo-
nential families. Annals of Statistics 19, 1347–1369.
Charpentier, A., J. Fermanian, and O. Scaillet (2007). The estimation of copulas: Theory
and practice. In J. Rank (Ed.), Copulas: From theory to Application in Finance. Risk
Publications.
Chui, C. and X. Wu (2009). Exponential series estimation of empirical copulas with ap-
plication to financial returns. In Q. Li and J. Racine (Eds.), Advances in Econometrics,
Volume 25.
Crain, B. (1974). Estimation of distributions using orthogonal expansions. Annals of Statis-
tics 2, 454–463.
Good, I. (1963). Maximum entropy for hypothesis formulation, especially for multidimen-
sional contingency tables. Annals of Mathematical Statistics 34, 911–934.
Good, I. and R. Gaskins (1971). Nonparametric roughness penalties for probability densities. Biometrika 58, 255–277.
Gu, C. and C. Qiu (1993). Smoothing spline density estimation: Theory. Annals of Statis-
tics 21, 217–234.
Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation
and scalable approximation. Statistica Sinica 13, 811–826.
Hall, P. (1987). On Kullback-Leibler loss and density estimation. Annals of Statistics 15,
1491–1519.
Jackel, P. (2002). Monte Carlo Methods in Finance. New York: John Wiley and Sons.
Jaynes, E. (1957). Information theory and statistical mechanics. Physics Review 106, 620–
630.
Nelsen, R. B. (2010). An Introduction to Copulas. Springer.
Neyman, J. (1937). Smooth test for goodness of fit. Skandinavisk Aktuarietidskrift 20, 149–199.
Silverman, B. W. (1982). On the estimation of a probability density function by the maximum
penalized likelihood method. Annals of Statistics 10, 795–810.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8, 229–231.
Wu, X. (2003). Calculation of maximum entropy densities with application to income dis-
tribution. Journal of Econometrics 115, 347–354.
Wu, X. (2010). Exponential series estimator of multivariate densities. Journal of Economet-
rics 156, 354–366.
Zellner, A. and R. Highfield (1988). Calculation of maximum entropy distribution and
approximation of marginal posterior distributions. Journal of Econometrics 37, 195–209.