Penalized Exponential Series Estimation of Copula
Densities
Ximing Wu∗
Abstract
The exponential series density estimator is advantageous for copula density estimation as it is strictly positive, explicitly defined on a bounded support, and largely
mitigates the boundary bias problem. However, the selection of basis functions is chal-
lenging and can cause numerical difficulties, especially for high dimensional density
estimations. To avoid the issues associated with basis function selection, we adopt
the strategy of regularization by employing a relatively large basis and penalizing the
roughness of the resulting model, which leads to a penalized maximum likelihood esti-
mator. To further reduce the computational cost, we propose an approximate likelihood
cross validation method for the selection of the smoothing parameter. Our extensive Monte
Carlo simulations demonstrate the effectiveness of the proposed estimator for copula
density estimations.
∗Department of Agricultural Economics, Texas A&M University. Email: [email protected]. I gratefully acknowledge the Supercomputing Facility of Texas A&M University, where all computations in the study were performed.
1 Introduction
This paper proposes a penalized maximum likelihood estimation for copula densities via the
exponential series estimator for multivariate densities introduced in Wu (2010). Consider a
d-dimensional random variable x with joint distribution function F . In his seminal paper,
Sklar (1959) shows that, via a change of variable argument, the joint distribution can be
written as
F (x1, . . . , xd) = C (F1(x1), . . . , Fd(xd)) , (1)
where Fj(xj), j = 1, . . . , d, is the marginal distribution of the jth element of x. The function C(·), the so-called copula function, completely summarizes the dependence structure among the elements of x. When
the margins are continuous, the copula function is unique. Thus a multivariate distribution
can be completely described by its copula and its univariate marginal distributions. Suppose
F is differentiable with density function f . Taking derivatives of both sides of (1) yields
f(x1, · · · , xd) = f1(x1) · · · fd(xd)c(F1(x1), . . . , Fd(xd)), (2)
where fj(·) is the marginal density of xj, j = 1, . . . , d, and c(·) is the copula density function,
which itself is a density function defined on the unit cube [0, 1]d. For a detailed treatment
of the mathematical properties of copulas, see Nelsen (2010).
The copula is a useful device for two reasons. First, it provides a way of studying scale-free
dependence structure. By writing a joint density function as the product of marginal densities
and the copula density, one can separate the influence of marginal densities from that of the
dependence structure. The dependence structure captured by the copula is scale free and
invariant to monotone transformations. In fact, many well known measures of dependence,
including Kendall’s τ and Spearman’s ρ, can be calculated from the copula function alone.
Second, the copula is a starting point for constructing families of multivariate distributions. It
allows us to divide multivariate density estimation into two parts: the univariate density
estimation of the margins and the copula estimation.
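As a concrete illustration of the first point (a standard result added here; see Nelsen (2010) for derivations), both rank correlations mentioned above can be written as functionals of the copula alone:

τ = 4 ∫∫_{[0,1]^2} C(u, v) dC(u, v) − 1,   ρ = 12 ∫∫_{[0,1]^2} C(u, v) du dv − 3.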
Like the usual density functions, the copula densities can be estimated by parametric
or non-parametric methods. The commonly used parametric copulas usually contain one
or two parameters and thus may not be adequate to describe complicated relations among
random variables. In addition, simple copulas sometimes place restrictions on the dependence
structure among variables. For example, the popular Gaussian copula assumes zero tail
dependence among random variables and is therefore not suitable for the study of financial
assets that tend to move together under extreme market conditions. A second limitation
of the parametric approach is that many parametric copulas are only defined for bivariate
variables, and extensions to higher dimensional cases are not available.
Alternatively one can estimate copula densities using nonparametric methods. The kernel
density estimator (KDE) is a popular smoother for density estimation. It is known that the
KDE suffers from boundary bias, which is particularly severe when the derivatives of a
density do not vanish at the boundaries. Unfortunately this poses a considerable difficulty for
copula density estimation, because copula density functions often have nonzero derivatives at
boundaries and corners. For example, the return distributions of US and UK stock markets
tend to move together, especially under extreme market conditions, resulting in spikes in
their copula density function at the two ends of the diagonal of the unit square.1
Like kernel estimation, series estimation is a commonly used nonparametric approach. For density estimation, orthogonal series estimation, or generalized Fourier estimation, is often employed. The series estimator has the advantage of automatic adaptiveness
in the sense that the degree of the series, when selected in an optimal manner, can adapt
to the unknown degree of smoothness of the underlying distribution to obtain the optimal
convergence rate. In contrast, for kernel estimation one may need to use a higher (than second) order kernel to obtain the optimal convergence rate.2 However,
the series density estimators share with higher order kernel estimators the same problem
that they may produce negative density estimates.
Wu (2010) proposes an exponential series estimator (ESE) for multivariate density esti-
1Charpentier et al. (2007) discuss several remedies to mitigate the boundary bias of the KDE along the line of boundary kernel estimators.
2For instance, higher-order kernels are required to obtain a faster-than-n^{-2/5} convergence rate for univariate kernel density estimations.
mations. This method is particularly advantageous for copula density estimation as it is
strictly positive, explicitly defined on a bounded support, and largely mitigates the boundary
bias problem. Numerical evidence in Wu (2010) and Chui and Wu (2009) demonstrates the
effectiveness of this method for copula densities. However, the selection of basis functions
for the multivariate ESEs is challenging and can cause severe numerical difficulties. In this
study, we adopt a regularization approach by employing a relatively large set of basis functions
and penalizing the roughness of the resulting model to balance between the goodness-of-fit
and the simplicity of the model. This approach leads to a penalized maximum likelihood
estimator for copula densities. To further reduce the computational cost, we suggest an ap-
proximate likelihood cross validation method for smoothing parameter selection. Our Monte
Carlo simulations show that the proposed estimator outperforms the conventional kernel
density estimator, sometimes by substantial margins.
The rest of the paper is organized as follows. Section 2 provides brief background on
the exponential series estimation, discussing its information theoretic origin, large sample
properties, extensions to multivariate variables, and smoothing parameter selection. Section
3 proposes the penalized exponential series estimator and presents an approximate likelihood
cross validation method for smoothing parameter selection. Section 4 reports our Monte
Carlo simulations. Some concluding remarks are offered in the last section.
2 Exponential Series Estimator of Copula Density Functions
Wu (2010) proposes a multivariate exponential series estimator, and shows that it is particu-
larly useful for copula density estimations. In this section, we briefly discuss the exponential
series density estimator. We first introduce the idea of maximum entropy density, upon
which the exponential series estimator is based. We then present the exponential series
estimator and discuss its smoothing parameter selection and some practical difficulties in
multivariate cases.
2.1 Maximum Entropy Density
One strategy to obtain strictly positive density estimates using the series method is to
model the log density via a series estimator. This idea is not new; earlier studies on the
approximation of log-densities using polynomials include Neyman (1937) and Good (1963).
Transforming the polynomial estimate of log-density back to its original scale results in a
density estimator in the exponential family. Thus approximating log densities by the series
estimators amounts to estimating densities by sequences of canonical exponential families.
The maximum likelihood estimation (MLE) provides efficient estimates of these exponential
families. Crain (1974) establishes the existence and consistency of the MLE in this case.
This method of density estimation arises naturally according to the principle of maximum
entropy. The information entropy, the central concept of information theory, of a univariate
continuous random variable with density f is defined as
W(f) = −∫ f(x) log f(x) dx.
Suppose that, for a random variable x with an unknown density function f0, one knows
only some of its moments. There may exist an infinite number of distributions satisfying
these given moment conditions. Jaynes (1957) proposes a method of constructing a unique
density estimate based on the moment conditions as follows:

max_f W(f)

subject to the integration-to-unity and side moment conditions:

∫ f(x) dx = 1,   ∫ φ_k(x) f(x) dx = µ_k,   k = 1, . . . , K,

where the φ_k are real-valued, linearly independent functions defined on the support of x.
The solution, obtained by an application of calculus of variations, takes the form
f(x; c) = exp(∑_{k=1}^{K} c_k φ_k(x) − c_0) = exp(c′φ(x) − c_0),   (3)

where φ = (φ_1, . . . , φ_K)′, and c = (c_1, . . . , c_K)′ are the Lagrangian multipliers for the moment conditions. The normalization factor c_0 = log{∫ exp(c′φ(x)) dx} < ∞ ensures the integration to unity condition. Among all distributions satisfying the given moment conditions, the
maximum entropy density is the closest to the uniform distribution defined on the support
of x. Many distributions can be characterized as maximum entropy densities. For example,
the normal distribution is obtained by setting φ1(x) = x and φ2(x) = x2 for x ∈ R, and the
Beta distribution by φ1(x) = ln(x) and φ2(x) = ln(1− x) for x ∈ (0, 1).
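As a quick verification of the first example (added for illustration), write f(x; c) = exp(c_1 x + c_2 x^2 − c_0) with c_2 < 0. Completing the square gives

c_1 x + c_2 x^2 = c_2 (x + c_1/(2c_2))^2 − c_1^2/(4c_2),

so f(x; c) is proportional to exp(−(x − µ)^2/(2σ^2)) with µ = −c_1/(2c_2) and σ^2 = −1/(2c_2), which is indeed a normal density.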
In practice, the population moments are often unknown and therefore replaced by their
sample counterparts. Given an iid random sample X1, . . . , Xn, the maximum entropy density
is then estimated by the MLE based on the sample moments φ̄ = (φ̄_1, . . . , φ̄_K)′, where φ̄_k = (1/n) ∑_{i=1}^{n} φ_k(X_i), k = 1, . . . , K. The log-likelihood function is given by
L = (1/n) ∑_{i=1}^{n} [c′φ(X_i) − log{∫ exp(c′φ(x)) dx}] = c′φ̄ − log{∫ exp(c′φ(x)) dx}.
Denote the MLE solution by f(·; c). Thanks to the canonical exponential form of the maximum entropy density, the sample moments φ̄ are the sufficient statistics of f(·; c). Therefore, we call φ̄ the characterizing moments of the maximum entropy density.
The coefficients of the maximum entropy density generally cannot be obtained analytically and thus must be solved for numerically. Zellner and Highfield (1988) and Wu (2003) discuss the numerical calculation of the maximum entropy density. Define

g(x) = c′φ(x),   µ_g(h) = ∫ h(x) exp(g(x)) dx / ∫ exp(g(x)) dx.   (4)
The score function and the Hessian matrix of the MLE are then given by
S = φ̄ − µ_g(φ),
H = −{µ_g(φφ′) − µ_g(φ)µ_g(φ)′},

where µ_g and g are defined in (4). One can then use Newton's method to solve for c iteratively. The uniqueness of the solution is ensured by the negative definiteness of the Hessian matrix. Therefore, for a maximum entropy density, there exists a one-to-one correspondence between its characterizing moments φ̄ and its coefficients c.
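To make the Newton iteration concrete, the following is a minimal sketch (not the author's code) of the univariate computation, assuming an orthonormal cosine basis on [0, 1] and Gauss-Legendre quadrature for the integrals; the function names, sample size, and tolerance are our own choices.

```python
# A minimal sketch (not the author's code) of Newton's method for a univariate
# maximum entropy density f(x; c) = exp(c'phi(x) - c0) on [0, 1].
import numpy as np

def basis(x, K):
    """Evaluate the K orthonormal cosine basis functions sqrt(2)cos(pi k x) at x."""
    k = np.arange(1, K + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, k))        # shape (len(x), K)

def fit_maxent(sample, K, tol=1e-8, max_iter=100):
    """Solve for the coefficients c that reproduce the sample moments phi_bar."""
    nodes, weights = np.polynomial.legendre.leggauss(64)         # quadrature on [-1, 1]
    x, w = 0.5 * (nodes + 1.0), 0.5 * weights                    # mapped to [0, 1]
    Phi_q = basis(x, K)                                          # basis at quadrature nodes
    phi_bar = basis(np.asarray(sample), K).mean(axis=0)          # sample characterizing moments
    c = np.zeros(K)
    for _ in range(max_iter):
        g = Phi_q @ c
        dens = np.exp(g - g.max())
        dens /= np.sum(w * dens)                                 # normalized density at the nodes
        mu = Phi_q.T @ (w * dens)                                # mu_g(phi)
        mu2 = (Phi_q * (w * dens)[:, None]).T @ Phi_q            # mu_g(phi phi')
        score = phi_bar - mu
        hessian = -(mu2 - np.outer(mu, mu))
        c = c - np.linalg.solve(hessian, score)                  # Newton update
        if np.max(np.abs(score)) < tol:
            break
    return c

# usage: a degree-4 fit to a Beta(2, 5) sample
c_hat = fit_maxent(np.random.default_rng(0).beta(2.0, 5.0, size=200), K=4)
```

In practice one might add step-halving or another safeguard, but for well-conditioned bases the unmodified update typically converges in a handful of iterations.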
2.2 Exponential Series Density Estimator
The maximum entropy density is a useful approach for constructing a density estimate given a set of moment conditions, and it enjoys an appealing information-theoretic interpretation.
On the other hand, like the usual parametric models, this density estimator generally is not
consistent unless the underlying distribution happens to belong to the canonical exponential
family with characterizing moments given by φ. To obtain consistent density estimates,
in principle one can let the number of characterizing moments increase with the sample
size at a proper rate, which effectively transforms the maximum entropy method into a
nonparametric estimator. To stress the nonparametric nature of the said estimator, we call
a maximum entropy density whose number of characterizing moments increases with sample
size an Exponential Series Estimator (ESE).
Moving into the realm of nonparametric estimations inevitably brings new problems.
The paramount issue is the determination of the degree of smoothing, which will be discussed at length below. Another issue that warrants caution is identification. To ensure a one-
to-one correspondence between f(x) and exp(g(x))/∫ exp(g(x)) dx, we need to impose certain restrictions. Two commonly used identification conditions are g(x_0) = 0 for some fixed x_0 and ∫ g(x) dx = 0. When we use orthogonal series as the basis functions, the second condition is
satisfied automatically. Furthermore, since a constant in g(x) is not identified, it is excluded.
Thus throughout the text, we maintain that the usual zero order term φ0(x) = 1 is excluded
from the basis functions for g.
Let x be a random variable defined on [0, 1] with density f_0, and let f be an ESE approximation to f_0. Without loss of generality, let φ = (φ_1, . . . , φ_K)′ be a series of orthonormal basis
functions with respect to the Lebesgue measure on [0, 1]. One can measure the discrepancy
between f0 and f by the Kullback-Leibler Information Criterion (KLIC, also known as the
relative entropy or cross entropy), which is defined as D(f0‖f) =∫f0(x) ln(f0(x)/f(x))dx.3
In an important development, Barron and Sheu (1991) establish that the sequence of f's converges to f_0 in terms of the KLIC. In particular, suppose ∫ {(∂^r/∂x^r) log f_0(x)}^2 dx < ∞; then D(f_0‖f) = O_p(1/K^{2r} + K/n), with K → ∞ and K^3/n → 0 for the power series and K^2/n → 0 for the trigonometric series and the splines.
Wu (2010) extends the ESE to multivariate densities. He uses the tensor product of the
univariate orthogonal basis functions to construct multivariate orthogonal basis functions.
Let x be a d-dimensional random variable defined on [0, 1]d with density f0. A multivariate
ESE for f0 is then constructed as
f(x) = exp(∑_{k_1=1}^{K_1} · · · ∑_{k_d=1}^{K_d} c_{k_1···k_d} φ_{k_1}(x_1) · · · φ_{k_d}(x_d)) / ∫ exp(∑_{k_1=1}^{K_1} · · · ∑_{k_d=1}^{K_d} c_{k_1···k_d} φ_{k_1}(x_1) · · · φ_{k_d}(x_d)) dx_1 · · · dx_d.
Under the assumption that ∫ {∂^r ln f_0(x)/(∂x_1^{r_1} · · · ∂x_d^{r_d})}^2 dx < ∞, where r = ∑_{j=1}^{d} r_j, he shows that the ESE estimates converge to f_0 at rate O_p(∏_{j=1}^{d} K_j^{-2r_j} + (1/n) ∏_{j=1}^{d} K_j) in terms of the
KLIC. Convergence rates in other metrics are also established, and extensive Monte Carlo
simulations demonstrate the effectiveness of the ESE for multivariate density estimations.
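As a small illustration of the tensor-product construction (a sketch under our own conventions, not code from the paper), the bivariate case with a common degree K_1 = K_2 = K and a cosine basis can be evaluated as follows; the placeholder coefficients and the Riemann-sum normalization are only for brevity.

```python
# A minimal sketch (not from the paper) of evaluating a bivariate ESE built from a
# tensor product of univariate cosine basis functions with K1 = K2 = K.
import numpy as np
from itertools import product

def cos_basis(u, k):
    """kth orthonormal cosine basis function on [0, 1]."""
    return np.sqrt(2.0) * np.cos(np.pi * k * u)

def tensor_basis(u, v, K):
    """Columns phi_{k1}(u) phi_{k2}(v) for all 1 <= k1, k2 <= K."""
    cols = [cos_basis(u, k1) * cos_basis(v, k2)
            for k1, k2 in product(range(1, K + 1), repeat=2)]
    return np.column_stack(cols)

def ese_density(u, v, coef, K, n_grid=101):
    """exp(g)/integral of exp(g) over the unit square, with g = coef'phi."""
    g = tensor_basis(u, v, K) @ coef
    s = np.linspace(0.0, 1.0, n_grid)
    gu, gv = np.meshgrid(s, s)
    g_grid = tensor_basis(gu.ravel(), gv.ravel(), K) @ coef
    norm = np.exp(g_grid).mean()              # mean over [0,1]^2 approximates the integral
    return np.exp(g) / norm

coef = 0.1 * np.ones(4)                        # placeholder coefficients for K = 2
vals = ese_density(np.array([0.3]), np.array([0.7]), coef, K=2)
```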
Like the orthogonal series density estimator, the ESE enjoys the automatic adaptiveness
to the unknown smoothness of the underlying distribution. On the other hand, it is strictly
positive, and therefore avoids the negative density estimates that might occur with the
orthogonal series estimators and the higher order kernel estimators. In addition, Wu (2010)
3The KLIC is a pseudo-metric in the sense that D(f||g) = 0 if and only if f = g almost everywhere, whereas it is asymmetric and does not satisfy the triangle inequality.
suggests that the ESE is an appealing estimator for the copula density because it is explicitly
defined on a bounded support and therefore less sensitive to the boundary bias problem. Chui
and Wu (2009) provide further Monte Carlo evidence on this.
2.3 Selection of basis functions
In this subsection we discuss the selection of the degree of basis functions for the ESE, with
a focus on the multivariate case. It is well known that the choice of smoothing parameter
is often the most crucial ingredient of nonparametric estimations. The kernel density esti-
mates can vary substantially with the bandwidth. Similarly, the numerical performance of
orthogonal series density estimations hinges on the degree of basis functions; for example,
a high order power series may oscillate wildly and produce negative density estimates. The
ESE, which can be viewed as the exponential of a series estimator, is no exception.
When a higher-than-desirable number of characterizing moments is used in the estimation,
the density estimates may exhibit spurious bumps and spikes. In addition, a large number of
characterizing moments increases not only the computational cost, but also the probability
that the Hessian matrix used in Newton updating approaches (near) singularity. Therefore,
judicious choice of the degree of basis functions is called for.4
The natural connection between the maximum entropy density and the MLE facilitates
adopting some information criterion for model specification. The Akaike Information Crite-
rion (AIC) and Bayesian Information Criterion (BIC) are two commonly used information
criteria that strive for a balance between the goodness-of-fit and simplicity of statistical
modeling. In the paradigm of nonparametric estimation where an estimator approximates
the unknown underlying process, the AIC is considered optimal in the minimax sense.5 On
4The selection of basis functions is particularly important for the ESEs, compared with the generalized Fourier series density estimation. For the latter, the coefficient for each basis function φ_k is given by ∫ φ_k(x) f_0(x) dx, which can be conveniently estimated by its sample counterpart. Therefore, although the selection of basis functions affects its performance, no numerical difficulties are involved for the generalized Fourier series estimations. In contrast, the coefficients for the ESEs are obtained through an inverse problem which involves all basis functions through Newton's updating; an overlarge basis may render the Hessian matrix near singular and consequently cause numerical difficulties.
5The BIC is consistent if the set of candidates contains the true model. However, in nonparametric estimations generally the true model is assumed unknown and the goal is to arrive at an increasingly better approximation to the underlying model rather than identifying the true model.
the other hand, from a penalized MLE point of view, the difference between the AIC and
the BIC resides in their penalties for roughness or number of parameters. Let L be the log
likelihood and K the number of parameters in a model, which reflects the complexity of the
model and is to be penalized. Both criteria can be written in the form L − λK, where the second term is the roughness penalty and λ determines its strength. For the AIC and the BIC, λ takes the value of 1 and (1/2) ln n, respectively.
The cross validation (CV) provides an alternative method to select smoothing parameters.
Let L_{−i} denote the log likelihood of the ith observation evaluated at a model estimated with the entire sample except the ith observation. The cross validated log likelihood is calculated as L_− ≡ (1/n) ∑_{i=1}^{n} L_{−i}. The likelihood cross validation method minimizes the Kullback-Leibler
loss and therefore is asymptotically equivalent to the AIC approach. (See Hall (1987) for an
in-depth analysis of the likelihood cross validation method.)
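For concreteness, a brute-force version of this procedure is sketched below (not from the paper); `fit` and `logpdf` are hypothetical callables standing in for any density estimator and its log density. The n refits it requires are what motivate the approximation developed in Section 3.2.

```python
# A minimal sketch (not from the paper) of brute-force leave-one-out likelihood cross
# validation; `fit` and `logpdf` are hypothetical stand-ins for any density estimator.
import numpy as np

def loo_cv_loglik(sample, fit, logpdf):
    """Average leave-one-out log likelihood; note that it requires n refits."""
    sample = np.asarray(sample)
    n = len(sample)
    total = 0.0
    for i in range(n):
        model = fit(np.delete(sample, i, axis=0))    # refit without observation i
        total += logpdf(model, sample[i])             # score the held-out observation
    return total / n
```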
Using the information criteria or the CV method to select the smoothing parameter
for the univariate ESE is relatively straightforward. Recall that the selection of the smoothing parameter is equivalent to the selection of basis functions or characterizing moments for the
ESE. Given a reasonably large candidate set of basis functions, one can evaluate all subsets of
the candidate set and select the optimal set of basis functions according to a given selection
criterion. However, the process can be time consuming: if the candidate set contains K basis
functions, then the number of subsets is 2^K. This process can be greatly simplified when
the basis functions have certain natural ordering or hierarchical structure. For example, the
polynomial series and the trigonometric series have an intuitive frequency interpretation
such that the low/high order basis functions capture low/high frequency features of the
underlying process. When this type of series is used, it is a common, and sometimes preferred, practice to use a hierarchical selection approach in the sense that if the kth basis function is selected, all lower order basis functions are included automatically. Clearly, the hierarchical
selection method is a truncation approach. The number of models that need to be estimated
is K, considerably smaller than the 2^K required by complete subset selection.
In principle either the subset selection or truncation method can be used to select the
smoothing parameter for estimating multivariate densities using the ESE. However, the
practical difficulty is that the number of required evaluations increases exponentially with
the dimension of x. For density estimation of a d-dimensional x, we consider the tensor
products of univariate basis functions given by φ_k(x) = ∏_{j=1}^{d} φ_{k_j}(x_j), where the multi-index k = (k_1, . . . , k_d). Denote the size of a multi-index by |k| = ∑_{j=1}^{d} k_j. Suppose the candidate set M_K consists of basis functions whose sizes are no greater than K, i.e.,

M_K = {φ_k : 1 ≤ |k| ≤ K}.
With a slight abuse of notation, let |M_K| denote the number of elements in M_K. One can show that |M_K| = C(K + d, d) − 1, where C(·, ·) denotes the binomial coefficient. Therefore, if the subset selection method is used, it would require estimating 2^{|M_K|} ESE densities, which can be prohibitively expensive. For instance, if d = 2 and K = 4, we would need to estimate 2^{14} ESE densities; the number explodes to 2^{34} if d = 3.
Thus, the subset selection approach is practically infeasible except for the simplest cases.
Now let us consider the truncation method, which is more economical in terms of basis
functions. It seems we can proceed as in the univariate case, estimating the ESE densities
with basis functions Mk, k = 1, . . . , K, and then selecting the optimal set according to
some criterion. There is, however, a key difference between the univariate case and the
multivariate case. For the former, as we raise k from 1 to K, each time we increase the
number of basis functions by one. On the other hand, for the general d-dimensional case,
the number of basis functions added to the candidate set increases with k. To be precise, let
m_K = {φ_k : |k| = K}; then M_K = (m_1, . . . , m_K). The number of additional basis functions incorporated along the stepwise process at each stage is |m_k| = |M_k| − |M_{k−1}| = C(k + d − 1, d − 1) for d ≥ 2 and k ≥ 1. For instance, when d = 2, we have |m_k| = k + 1; when d = 3,
the corresponding step sizes increase to 3, 6, 10, and 15 for k = 1, . . . , 4. Therefore in
the multivariate case, the number of basis functions added at each step of the truncation method rises rapidly with the dimension d, leading to increasingly abrupt expansions in
model complexity. Consequently, the suitability of the simple truncation method to high
dimensional cases is questionable.
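The counting arithmetic above is easy to reproduce; the following sketch (ours, not the paper's) prints |M_K|, the number of candidate subsets, and the stage-wise increments |m_k| for d = 2, 3 and K = 4.

```python
# A small sketch (ours) of the basis-counting arithmetic: |M_K| = C(K+d, d) - 1 candidate
# basis functions, 2^{|M_K|} subsets, and increments |m_k| = C(k+d-1, d-1) per stage.
from math import comb

def n_basis(K, d):
    """Number of tensor-product basis functions with 1 <= |k| <= K."""
    return comb(K + d, d) - 1

def step_size(k, d):
    """Number of basis functions added at stage k of the truncation method."""
    return comb(k + d - 1, d - 1)

for d in (2, 3):
    print(d, n_basis(4, d), 2 ** n_basis(4, d), [step_size(k, d) for k in range(1, 5)])
# d = 2: 14 functions, 2^14 = 16384 subsets, increments [2, 3, 4, 5]
# d = 3: 34 functions, 2^34 subsets, increments [3, 6, 10, 15]
```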
Lastly, in addition to the practical difficulties associated with model specification in the high dimensional case, there exists another potential problem. Recall that the maximum en-
tropy density can produce spurious spikes when it is based on a large number of moments.
Not surprisingly, this problem can be aggravated when using the ESE to estimate multi-
variate densities, whose number of characterizing moments increases rapidly with dimension
and sample size. In addition, the larger is the number of basis functions, the higher is the
probability that the Hessian matrix used in the Newton updating becomes (near) singu-
lar, introducing further complications. To mitigate the aforementioned problems, below we
propose an alternative penalized MLE estimation approach for copula densities using the
ESE.
3 Penalized exponential series estimation
As is discussed above, the ESE for multivariate densities can be challenging due to difficulties
associated with selection of basis functions and numerical issues. In this section, we propose
to use the method of penalized MLE to determine the degree of smoothing: instead of
painstakingly selecting a (small) set of basis functions to model a multivariate density, we
use a relatively large set of basis functions and shrink the coefficients of the resulting model toward zero to penalize its complexity.
3.1 The model
Good and Gaskins (1971) introduce the idea of roughness penalty density estimation. Their
idea is to use as an estimate that density which maximizes a penalized version of the likeli-
hood. The penalized likelihood is defined as
Q = (1/n) ∑_{i=1}^{n} ln f(X_i) − λ J(f),
where J(f) is a roughness penalty for density f and λ is the smoothing/tuning parameter.
The log likelihood term pushes the estimate to adapt to the data, the roughness penalty counteracts by demanding less variation, and the smoothing parameter controls the tradeoff between
the two conflicting goals. Various roughness penalties have been proposed in the literature.
For instance, Good and Gaskins (1971) use J(f) = ∫ (f′(x))^2/f(x) dx, Silverman (1982) sets J(f) = ∫ {(d^3/dx^3) ln f(x)}^2 dx, and Gu and Qiu (1993) propose general quadratic roughness
penalties for the smoothing spline density estimation.
Without loss of generality and for simplicity, we consider a bivariate random variable
(x, y) with a strictly positive and bounded density, f0, defined on the unit square [0, 1]×[0, 1].
Let φk(x, y), 1 ≤ |k| ≤ K, be orthonormal basis functions with respect to the Lebesgue
measure on the unit square. To ease notation, we denote M = |MK |, where M is understood
to be a function of K, which in turn is a function of the sample size. We also change the
multi-index k to a single index k, k = 1, . . . ,M .
We consider approximating f0 by
f(x, y) = exp(∑_{k=1}^{M} c_k φ_k(x, y)) / ∫ exp(∑_{k=1}^{M} c_k φ_k(x, y)) dx dy ≡ exp(g(x, y)) / ∫ exp(g(x, y)) dx dy,   (5)
where g(x, y) = c′φ(x, y) with c = (c1, . . . , cM)′ and φ(x, y) = (φ1(x, y), . . . , φM(x, y))′.
Throughout this section, integration is taken to be on the unit square. For the roughness
penalty, we adopt a quadratic penalty on the log density g. The penalized MLE objective
function is then given by
Q = (1/n) ∑_{i=1}^{n} c′φ(X_i, Y_i) − ln ∫ exp(g(x, y)) dx dy − (λ/2) c′Wc,   (6)
where W is a positive definite weight matrix for the roughness penalty.
Given the smoothing parameter and the roughness penalty, one can use Newton’s method
to solve for c iteratively. The gradient and Hessian of (6) are respectively
S = φ̄ − µ_g(φ) − λWc,
H = −{µ_g(φφ′) − µ_g(φ)µ_g(φ)′} − λW,
where µ_g is given by (4) and g(x, y) = c′φ(x, y). One can establish the existence and uniqueness
of the penalized MLE within the general exponential family under rather mild conditions
(see, e.g., Lemma 2.1 of Gu and Qiu (1993)).
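Relative to the unpenalized Newton iteration sketched in Section 2.1, the only changes are the extra terms −λWc in the score and −λW in the Hessian. A minimal sketch of the modified update (our illustration, with W and lam supplied by the user) is:

```python
# A minimal sketch (ours) of the Newton update for the penalized objective (6); the
# quantities phi_bar, mu, and mu2 are as in the earlier unpenalized sketch.
import numpy as np

def penalized_newton_step(c, phi_bar, mu, mu2, W, lam):
    """One Newton update for the penalized objective (6)."""
    score = phi_bar - mu - lam * (W @ c)                  # penalized gradient
    hessian = -(mu2 - np.outer(mu, mu)) - lam * W          # penalized Hessian
    return c - np.linalg.solve(hessian, score)
```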
To implement the penalized MLE, we must specify several factors: (i) the basis functions
φ; (ii) the weight function W ; and (iii) the smoothing parameter λ. The smoothing param-
eter plays the most crucial role and is discussed at length in the next subsection. As for
the choice of basis functions, commonly used orthogonal series include the Legendre series,
the trigonometric series and the splines. Although there exist subtle differences between
these series (e.g., in terms of their boundary biases), they lead to the same convergence rates
under suitable regularity conditions.
We also need to determine the size/number of basis functions. We stress that for the penal-
ized MLE, the number of basis functions is generally not considered a smoothing parameter.
For instance, in the smoothing spline density estimations, the size of the basis functions can
be as large as the sample size. In practice, often a “sufficiently large” (but smaller than
sample size) basis suffices. The size of the basis in the penalized likelihood estimation is usually considerably larger than what would be selected according to some information criterion, and thus calls for a roughness penalty.6
Next we need to choose the weight matrix W. We consider penalizing the roughness of the log density g(x, y) = c′φ(x, y), whose penalty is given by

J(g) = ∫ {g^{(m)}(x, y)}^2 dx dy = ∫ {c′φ^{(m)}(x, y)}^2 dx dy = c′ {∫ φ^{(m)} φ^{(m)′} dx dy} c,
6See Gu and Wang (2003) for an asymptotic analysis of the size of basis functions in the smoothing splineestimations.
where g^{(m)}(x, y) = ∑_{m_1+m_2≤m} ∂^{m_1+m_2} g(x, y)/(∂x^{m_1} ∂y^{m_2}) with m ≥ 0. Therefore, W is given by the middle
factor of the last equality. Using orthonormal series simplifies the construction of the weight
matrix since it leads to a diagonal weight matrix. When m = 0, W equals the identity
matrix and coefficients for all basis functions are penalized equally; when m ≥ 1, coefficients
for higher order/frequency moments are increasingly penalized, where the rate of increase
rises geometrically with m.7 Popular choices of m include m = 0 and m = 2, corresponding
to the natural splines and cubic splines respectively in the smoothing spline estimations.
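As a small illustration (ours, not the paper's), the diagonal weight matrix for the univariate cosine basis of footnote 7 can be built directly from the (πk)^{2m} weights; a bivariate tensor-product W can be assembled analogously.

```python
# A minimal sketch (ours) of the diagonal penalty weight matrix for the univariate
# cosine basis, whose mth-derivative weight is (pi k)^(2m).
import numpy as np

def cosine_penalty_weights(K, m):
    """Diagonal of W = integral of phi^(m) phi^(m)' dx for the cosine basis on [0, 1]."""
    k = np.arange(1, K + 1)
    return np.diag((np.pi * k) ** (2 * m))

W = cosine_penalty_weights(K=10, m=2)   # m = 2 penalizes the second derivative of g
```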
Since there is a one-to-one correspondence between the characterizing moments and their
coefficients in the ESE density, the penalized MLE can be viewed as a shrinkage estimator
that shrinks the sample characterizing moments towards zero. In addition, this roughness
penalty defines a null space J⊥, on which J(g) = 0. This null space is usually finite-
dimensional to avoid interpolation of the data. When the smoothing parameter λ→∞, the
penalized MLE converges to the MLE on J⊥, which is the smoothest density induced by the
given roughness penalty. For instance, when m = 2, the smoothest function for g is linear in
x; when m = 3, the smoothest function is quadratic in x, leading to the normal distribution
(See Silverman (1982)).
3.2 Selection of Smoothing Parameter
One advantage of using the penalized MLE for model selection is that it avoids the difficult
subset selection or truncation process to determine the optimal degree of smoothing. Instead,
a relatively large number of basis functions is used and the degree of smoothing is determined
by a single continuous parameter λ. In practice, one has to choose λ. The method of cross
validation is commonly used for this purpose. This is a natural choice, because the penalized
MLE does not involve subset selection and therefore the AIC, BIC type criteria that penalize
the number of parameters cannot be easily applied.
The leave-one-out cross validation for linear regressions is quite straightforward. As a
matter of fact, for a sample with n observations, one usually need not estimate n regressions
7For instance, the penalty weight given to a univariate cosine basis function φ_k(x) = √2 cos(πkx), x ∈ [0, 1], is (πk)^{2m}.
because there is an analytical formula to calculate the least squares cross validation result
from the regression on the full sample. This kind of analytical solution, however, generally
does not exist for nonlinear estimations. For the ESE estimation of multivariate densities,
this can pose a practical difficulty due to high computational cost. This is because
the coefficients of the ESE are calculated iteratively through Newton’s updating. For ba-
sis functions of size M , the Hessian matrix has M(M + 1)/2 distinct elements to evaluate,
each requiring multidimensional integration by numerical methods. The computational cost
increases rapidly with the dimension because (i) the number of basis functions increases
with the dimension in nonparametric estimations, and (ii) the multidimensional integration
becomes increasingly expensive with the dimension as well. Thus it is rather expensive to
implement the leave-one-out cross validation for multivariate ESEs, especially for penalized
MLEs that use a large number of basis functions. Therefore we propose a first order approx-
imation to the cross validated log likelihood, which only requires one estimation of the ESE
based on the full sample.
Recall that (5) belongs to the general exponential family and the sample averages φ̄ are the sufficient statistics for the penalized MLE. Denote the sample averages calculated leaving out the ith observation by φ̄_{−i} = (1/(n − 1)) ∑_{j≠i} φ(X_j, Y_j). It follows that φ̄_{−i} are the sufficient statistics for the penalized MLE calculated with the ith observation deleted.
For given basis functions and smoothing parameter, denote by f and f_{−i} the penalized MLE estimates associated with φ̄ and φ̄_{−i} respectively. Let c and H be the estimated coefficients and Hessian matrix of f, and let c_{−i} and H_{−i} be similarly defined. By Taylor's theorem, we have

c_{−i} ≈ c − H^{−1}(φ̄ − φ̄_{−i}).
The normalization factor can be approximated similarly. Define c_0 = ln ∫ exp(g(x, y)) dx dy. Let c_0 and c_{0,−i} be the normalization factors of f and f_{−i} respectively. It follows that

c_{0,−i} ≈ c_0 − µ_g(φ)′ H^{−1}(φ̄ − φ̄_{−i}).
Next let L_{−i} be the log likelihood of the ith observation evaluated at f_{−i}. The cross validated log likelihood can then be approximated as follows:

L_− = (1/n) ∑_{i=1}^{n} L_{−i}(X_i, Y_i)
    = (1/n) ∑_{i=1}^{n} {c′_{−i} φ(X_i, Y_i) − c_{0,−i}}
    ≈ (1/n) ∑_{i=1}^{n} {c − H^{−1}(φ̄ − φ̄_{−i})}′ φ(X_i, Y_i) − (1/n) ∑_{i=1}^{n} {c_0 − µ_g(φ)′ H^{−1}(φ̄ − φ̄_{−i})}
    = (1/n) ∑_{i=1}^{n} {c′φ(X_i, Y_i) − c_0} − (1/n) ∑_{i=1}^{n} φ′(X_i, Y_i) H^{−1}(φ̄ − φ̄_{−i})
    = L − (1/n) ∑_{i=1}^{n} φ′(X_i, Y_i) H^{−1}(φ̄ − φ̄_{−i}),   (7)
where L is the full-sample log likelihood evaluated at the penalized MLE, and the second to last equality follows because ∑_{i=1}^{n}(φ̄ − φ̄_{−i}) = 0. Next let Φ be an n × M matrix with
the ith row being (φ1(Xi, Yi), . . . , φM(Xi, Yi))′. The cross validated log likelihood (7), after
straightforward but tedious algebra, can be written in the following matrix form
L_− ≈ L − (1/(n − 1)) trace[ΦH^{−1}Φ′] + (1/(n(n − 1))) (1′Φ) H^{−1} (Φ′1),   (8)
where 1 is an n × 1 vector of ones.
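A direct transcription of (8) is sketched below (our illustration); it assumes a single penalized fit has already produced the full-sample log likelihood, the basis matrix Φ, and the Hessian H used in the Newton updating.

```python
# A minimal sketch (ours) of the approximate cross validated log likelihood (8).
import numpy as np

def approx_cv_loglik(loglik, Phi, H):
    """First-order approximation (8) computed from one full-sample fit."""
    n = Phi.shape[0]
    Hinv_Phi_t = np.linalg.solve(H, Phi.T)                         # H^{-1} Phi'
    trace_term = np.trace(Phi @ Hinv_Phi_t) / (n - 1)
    ones = np.ones(n)
    quad_term = (ones @ Phi) @ (Hinv_Phi_t @ ones) / (n * (n - 1))
    return loglik - trace_term + quad_term
```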
We then use (8) as the criterion for selecting the smoothing parameter. Since λ is a scalar, we use a simple grid search to locate the solution. As is discussed above, multi-dimensional
numerical integrations are used repeatedly in our estimations. For the calculation of the µ_g's, we use the Smolyak cubature algorithm following Gu and Wang (2003). Smolyak cubatures are
highly accurate for smooth functions. We note that the placement of nodes in Smolyak cu-
batures is dense near the boundaries. Therefore, they are particularly suitable for evaluating
the ESEs of copula densities since they often peak near the boundaries and corners.
4 Monte Carlo Simulations
To investigate the finite sample performance of the proposed estimators, we conduct a
series of Monte Carlo simulations. For the penalized MLE estimator, we penalize the
third order derivatives of the log densities. Denote a bivariate ESE density by f(x, y) =
exp(g(x, y) − c_0), where g(x, y) = ∑_{m=1}^{M} c_m φ_m(x, y). The penalty then takes the form
J(g) = c′ [∫ {(∂^3/∂x^3 + 3 ∂^3/∂x^2∂y + 3 ∂^3/∂x∂y^2 + ∂^3/∂y^3) φ(x, y)} {(∂^3/∂x^3 + 3 ∂^3/∂x^2∂y + 3 ∂^3/∂x∂y^2 + ∂^3/∂y^3) φ(x, y)}′ dx dy] c.
When the smoothing parameter goes to infinity, the penalized MLE converges to the following
smoothest distribution induced by the penalty given above:
f(x, y) = exp(c_1 x + c_2 x^2 + c_3 y + c_4 y^2 + c_5 xy − c_0),   x, y ∈ [0, 1],
which is a truncated bivariate normal density defined on the unit square.8 Alternatively,
one can penalize lower or higher order derivatives of g. We choose the third order derivative
because under this penalty, the smoothest distribution is the simplest one that contains useful
information on the dependence between x and y, which is captured by the sample moment
(1/n) ∑_{i=1}^{n} X_i Y_i. If a lower order derivative is used, the smoothest distribution contains only
moments on the margins and thus is not informative since for copula densities, all margins are
uniform. On the other hand, if higher order derivatives are used, the smoothest distributions
contain higher order information on the dependence between x and y, whose coefficients are
not penalized.9
We consider both the Legendre series and the cosine series, orthonormalized on the unit
8We note that this is different from the Gaussian copula, whose distribution function is given by C(x, y; ρ) = Φ_ρ(Φ^{−1}(x), Φ^{−1}(y)), x, y ∈ [0, 1], where Φ is the standard normal distribution function and Φ_ρ is the standard bivariate normal distribution function with correlation coefficient ρ.
9In contrast to the large literature on the selection of smoothing parameters, theoretical guidance on the specification of penalty forms is scanty. On the other hand, the existing literature suggests that the estimations are usually not sensitive to the form of penalty, which is consistent with our own numerical experiments.
square. The results from the two bases are rather similar; hence, to save space, below
we only report the results on the Legendre series. We consider three sample sizes: 50, 100
and 200. For all three sizes, we find that the Legendre basis functions with degree no larger
than 4 produce satisfactory results.10 The approximate cross validation method described
in the previous section is used to select the smoothing parameter. For comparison, we also
estimate the copula densities using the kernel density estimator. In particular, we use the
product Gaussian kernel and the bandwidth is selected by the method of likelihood cross
validation.
We consider four different copulas: the Gaussian, T , Frank, and Galambos; the first
two belong to the Elliptical class, and the next two to the Archimedean and the extreme
value classes respectively. For each copula, we look at three cases with low, medium and
high dependence respectively. The coefficients for the copulas are selected such that the
low, medium and high dependence cases correspond to a correlation of 0.2, 0.5 and 0.8
respectively. All experiments are repeated 500 times.
For each experiment, let (Xi, Yi), i = 1, . . . , n, be an iid sample generated from a given
copula. Define the pseudo-observations as
X_i^* = (1/(n + 1)) ∑_{j=1}^{n} I(X_j ≤ X_i),   Y_i^* = (1/(n + 1)) ∑_{j=1}^{n} I(Y_j ≤ Y_i),
where the denominator is set to n + 1 to avoid numerical difficulties. Jackel (2002) and
Charpentier et al. (2007) suggest that using the pseudo-observations instead of the true
observations reduces the variation. The intuition is that the above transformation effectively
changes both marginal series (after being sorted in ascending order) to (1/(n + 1), . . . , n/(n + 1)),
which is consistent with the fact that copula densities have uniform margins. We use the
pseudo-observations in all our estimations.
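For completeness, the rank transformation above is straightforward to implement; the following sketch (ours, not the paper's) maps each margin to its ranks divided by n + 1.

```python
# A minimal sketch (ours) of the pseudo-observations: ranks divided by n + 1, so the
# transformed sample lies strictly inside (0, 1), consistent with uniform margins.
import numpy as np

def pseudo_observations(x):
    """x is an (n, d) array with one column per margin."""
    x = np.asarray(x)
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1   # ranks 1, ..., n per column
    return ranks / (n + 1.0)

u = pseudo_observations(np.random.default_rng(1).normal(size=(200, 2)))
```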
To gauge the performance of our estimates, we calculate the mean square errors (MSE)
and mean absolute deviations (MAD) between the estimated densities and the true copula
10Let φ_k be the kth degree Legendre polynomial on [0, 1]. We include in our bivariate density estimations basis functions of the form φ_j(x)φ_k(y), j + k ≤ 4. The size of this basis is 14.
densities, evaluated on a 30 by 30 equally spaced grid on the unit square. Figure 1 reports
the estimation results measured by the MSE. The top, middle and bottom rows correspond
to the sample sizes 200, 100 and 50, and the left, middle and right columns correspond to the
low, medium and high correlation cases. In each plot, the MSEs for the ESEs are represented
by circles connected by solid lines, while those for the KDEs are represented by triangles
connected by dashed lines. The Gaussian, T , Frank and Galambos copulas are labeled as 1, 2, 3
and 4 respectively in each plot. Note that the scales for the plots differ.
In all our experiments, the ESE outperforms the KDE, often considerably. The MSE
increases with the degree of dependence and decreases with the sample size. Averaging
across four copulas, the ratios of MSEs between the ESEs and KDEs are 0.25, 0.49 and 0.77
for the low, medium and high correlation cases respectively. The corresponding ratios for
sample sizes 50, 100 and 200 are respectively 0.62, 0.67 and 0.73.
Figure 2 reports the estimation results in MADs. The overall pictures are similar to
those of the MSEs, but with a larger average performance gap. Averaging across four copulas,
the ratios of MADs between the ESEs and KDEs are 0.32, 0.42 and 0.61 for the low, medium
and high correlation cases respectively. The corresponding ratios for sample sizes 50, 100
and 200 are respectively 0.45, 0.45 and 0.48. Thus our numerical experiments support our
contentions in the previous sections that the ESE provides a useful nonparametric estimator
for the copula densities.
5 Concluding Remarks
We have proposed a penalized maximum likelihood estimator of the exponential series
method for copula density estimation. The exponential series density estimator is strictly
positive and overcomes the boundary bias issue associated with the kernel density estimation.
However, the selection of basis functions for the ESEs is challenging and can cause severe
numerical difficulties, especially for multivariate densities. To avoid the issue of basis func-
tion selection, we adopt the strategy of regularization by employing a relatively large basis
and penalizing the roughness of the resulting model, which leads to a penalized maximum likelihood estimator.
Figure 1: Mean squared errors of estimated copulas. The ESE and the KDE results are represented by circles and triangles respectively. Rows 1-3 correspond to n = 200, 100 and 50 respectively; columns 1-3 correspond to correlation equal to 0.2, 0.5 and 0.8 respectively; in each plot, copulas 1-4 correspond to the Gaussian, T, Frank and Galambos copulas. Note that the scales of the plots differ.
Figure 2: Mean absolute deviation of estimated copulas. The ESE and the KDE results are represented by circles and triangles respectively. Rows 1-3 correspond to n = 200, 100 and 50 respectively; columns 1-3 correspond to correlation equal to 0.2, 0.5 and 0.8 respectively; in each plot, copulas 1-4 correspond to the Gaussian, T, Frank and Galambos copulas. Note that the scales of the plots differ.
To further reduce the computational cost, we propose an approximate
likelihood cross validation method for the selection of the smoothing parameter. Our extensive
Monte Carlo simulations demonstrate the usefulness of the proposed estimator for copula
density estimations. Generalization of the said estimator to nonparametric multivariate re-
gressions and applications in high dimensional analysis, especially in financial econometrics,
may be of interest for future study.
References
Barron, A. and C. Sheu (1991). Approximation of density functions by sequences of expo-
nential families. Annals of Statistics 19, 1347–1369.
Charpentier, A., J. Fermanian, and O. Scaillet (2007). The estimation of copulas: Theory
and practice. In J. Rank (Ed.), Copulas: From theory to Application in Finance. Risk
Publications.
Chui, C. and X. Wu (2009). Exponential series estimation of empirical copulas with ap-
plication to financial returns. In Q. Li and J. Racine (Eds.), Advances in Econometrics,
Volume 25.
Crain, B. (1974). Estimation of distributions using orthogonal expansions. Annals of Statis-
tics 2, 454–463.
Good, I. (1963). Maximum entropy for hypothesis formulation, especially for multidimen-
sional contingency tables. Annals of Mathematical Statistics 34, 911–934.
Good, I. and R. Gaskins (1971). Nonparametric roughness penalties for probability densities. Biometrika 58, 255–277.
Gu, C. and C. Qiu (1993). Smoothing spline density estimation: Theory. Annals of Statis-
tics 21, 217–234.
Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation
and scalable approximation. Statistica Sinica 13, 811–826.
Hall, P. (1987). On Kullback-Leibler loss and density estimation. Annals of Statistics 15,
1491–1519.
Jackel, P. (2002). Monte Carlo Methods in Finance. New York: John Wiley and Sons.
Jaynes, E. (1957). Information theory and statistical mechanics. Physics Review 106, 620–
630.
Nelsen, R. B. (2010). An Introduction to Copulas. Springer.
Neyman, J. (1937). Smooth test for goodness of fit. Skandinavisk Aktuarietidskrift 20, 149–199.
Silverman, B. W. (1982). On the estimation of a probability density function by the maximum
penalized likelihood method. Annals of Statistics 10, 795–810.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8, 229–231.
Wu, X. (2003). Calculation of maximum entropy densities with application to income dis-
tribution. Journal of Econometrics 115, 347–354.
Wu, X. (2010). Exponential series estimator of multivariate densities. Journal of Economet-
rics 156, 354–366.
Zellner, A. and R. Highfield (1988). Calculation of maximum entropy distribution and
approximation of marginal posterior distributions. Journal of Econometrics 37, 195–209.