
Estimation of non-Gaussian Affine Term Structure Models ∗

Drew D. Creal†

Chicago Booth

Jing Cynthia Wu‡

Chicago Booth

First draft: September 15, 2012

This draft: September 24, 2013

Abstract

We develop a new estimation procedure for non-Gaussian affine term structure models that uses linear regression to construct a concentrated likelihood function. Many parameters that often cause problems for conventional estimation methods are eliminated from the likelihood. Our approach consistently finds the maxima, whereas conventional approaches do not. It is more than 60 times faster, dropping estimation time from several hours to a couple of minutes. Our method also works for models with observable macroeconomic variables or hidden factors, and can allow for restrictions on the key parameters of interest.

Keywords: affine term structure models; multivariate non-central gamma distribution; concentrated log-likelihood; estimation; local maxima.

∗We thank Michael Bauer, Alan Bester, John Cochrane, Rob Engle, Jim Hamilton, Chris Hansen, Ken Singleton and seminar and conference participants at Chicago Booth, NYU Stern, NBER Summer Institute, Chicago Booth Junior Finance Symposium, Bank of Canada, Kansas, and UMass for helpful comments. Both authors gratefully acknowledge financial support from the University of Chicago Booth School of Business. Cynthia Wu also gratefully acknowledges financial support from the IBM Faculty Research Fund at the University of Chicago Booth School of Business.

†The University of Chicago Booth School of Business, 5807 South Woodlawn Avenue, Chicago, IL 60637, USA, [email protected]

‡The University of Chicago Booth School of Business, 5807 South Woodlawn Avenue, Chicago, IL 60637, USA, [email protected]


1 Introduction

We develop a family of discrete-time, non-Gaussian affine term structure models (ATSMs)

whose state vector is closed under admissible affine transformations. The market prices of

risk have an intuitive form analogous to popular Gaussian ATSMs. Our main contribution

is a new estimation procedure that uses linear regression to concentrate many parameters

out of the likelihood function. This dramatically reduces the dimension of the parameter

space. Our approach consistently finds the maxima, whereas conventional approaches do not.

Estimation time also drops from hours to minutes. We can include observable macroeconomic

variables or hidden factors and allow for restrictions on the key parameters of interest.

Our method makes it possible for us to study local maxima and to explain why they exist and

what their economic implications are. Estimating several popular models, we find that Gaussian and

non-Gaussian models fit the cross-section of yields equally well. Differences in the economic

implications between the models come primarily from their time series dynamics. Finally,

we explain where the superior cross sectional information comes from, and demonstrate that

a small number of yields capture a large quantity of information.

ATSMs are popular among policy makers, practitioners, and academic researchers. They

are the workhorse models for pricing bonds, understanding the role of monetary policy,

and determining how macroeconomic shocks impact discount rates; for overviews, see Piazzesi(2010),

Duffee(2012), Gurkaynak and Wright(2012), and Diebold and Rudebusch(2013).

As the literature on ATSMs has developed over the last decade, there is a consensus that

estimation can be challenging. This point was emphasized by, among others, Duffee(2002),

Ang and Piazzesi(2003), Kim and Orphanides(2005), and Hamilton and Wu(2012b). Direct

maximization of the log-likelihood function can be problematic as yields are close to non-

stationary and the no-arbitrage restrictions place complicated non-linear constraints on the

parameters of the model. ATSMs also have many local maxima that carry different eco-

nomic implications. Only recently have reliable and transparent estimation methods been

developed for Gaussian ATSMs; see, Joslin, Singleton, and Zhu(2011), Christensen, Diebold,


and Rudebusch(2011), and Hamilton and Wu(2012b).

Although Gaussian ATSMs have provided important insights, they cannot capture conditional

heteroskedasticity.¹ Introducing positive non-Gaussian state variables serving as

stochastic volatility factors unfortunately further complicates estimation. First, the stochas-

tic volatility factors have more complicated dynamics and are bounded from below by zero.

Second, in models with both types of factors, the Gaussian and non-Gaussian factors are asymmetric

because the non-Gaussian state variables enter both the conditional mean and variance of

the Gaussian state variables. The interaction between these factors creates a plethora of

local maxima that are not present in models with only Gaussian factors.

Much of the literature estimating non-Gaussian ATSMs has been conducted in continuous-

time; see among others Duffie and Kan(1996), Dai and Singleton(2000), Duffee(2002), Cheridito,

Filipovic, and Kimmel(2007), Collin-Dufresne, Goldstein, and Jones(2008), and Aït-Sahalia

and Kimmel(2010). With the exception of a few special sub-classes of models, the

transition densities of the state variables within a continuous-time model are not known

making it necessary to approximate the likelihood function. Le, Singleton, and Dai(2010)

extended the univariate discrete-time, non-Gaussian model of Gourieroux and Jasiak(2006)

to multiple factors, and provided a class of discrete-time models.

One contribution of this paper is to develop a family of discrete-time non-Gaussian mod-

els that encompass any admissible rotation of a multivariate discrete-time Cox, Ingersoll,

and Ross(1985) process, nesting other discrete-time models as one of the rotations.² This

generalization helps to understand identification of the model. We derive the properties of

this class of models and show that all members within this family have a closed-form tran-

sition density. As an immediate result, the likelihood function is exact as well. The market

prices of risk are of the extended affine form of Cheridito, Filipovic, and Kimmel(2007). We

¹ For example, people use Gaussian models to study how macroeconomic fundamentals impact interest rates (Ang and Piazzesi(2003)), the dynamics of term premia across different countries (Wright(2011)), and the impact of quantitative easing by the Federal Reserve when the Fed funds rate is at its zero lower bound (Hamilton and Wu(2012a)).

² In this paper, we do not consider the class of non-Gaussian ATSMs built from the non-central Wishart process of Gourieroux, Jasiak, and Sufana(2009).


demonstrate that they have an intuitive form, illustrating how an agent gets compensated

for facing both Gaussian and stochastic volatility shocks.

Our main contribution is a new estimation procedure based on a concentrated log-

likelihood that dramatically improves estimation. Most of the parameters governing the

conditional mean of the state variables can be concentrated out of the model analytically

using linear regression. This reduces the dimensionality of the optimization problem. More

importantly, these parameters can cause numerical problems as well as extremely slow con-

vergence to the optimum due to the near non-stationarity of interest rates. We also provide

the analytical gradient of the log-likelihood.

Our method outperforms conventional approaches both in terms of stability of convergence

and speed. A Monte Carlo study shows that our method guarantees convergence as

long as the model is locally identified, and it converges to a number of local maxima repeatedly. Aside

from being able to find the global maximum, our method helps us to locate and understand

the economic implications of different local maxima. Conversely, the conventional method of

directly maximizing the original likelihood never converges fully to any of the local maxima,

nor does it converge to the same point twice in repeated trials even when it is initialized

under the same local mode. This makes it difficult for researchers to differentiate between

points near a well-behaved local maximum having the same economic meaning and locations

corresponding to local maxima that are economically different. The median time it takes for

our new procedure is less than 2 minutes for a three factor model with one non-Gaussian

factor, whereas the conventional approach takes over 2 hours.

Using our method, we shed light on how local maxima with different economic impli-

cations are created in non-Gaussian models. In Gaussian models, different rotations of the

factors (such as re-ordering of the factors) result in equivalent global maxima, with identical

economic implications. Researchers impose identifying restrictions to isolate one of these

global maxima. In non-Gaussian models, rotations such as re-ordering factors can have sub-

stantial economic impacts. The non-Gaussian state variables must be positive and enter the


conditional variance. This creates an asymmetry between the Gaussian and non-Gaussian

factors resulting in many local maxima that are not equivalent. Any economic conclusions

drawn from the models must be made with care as they can vary widely for different local

maxima.

We apply our estimation method to several popular models with three and four factors.

We find that Gaussian and non-Gaussian models fit the cross-section of yields almost iden-

tically because (i) the bond loading recursions are identical up to Jensen’s inequality; and

(ii) as the measurement errors in the cross-section of yields are small, an efficient estimator

(like maximum likelihood) will use the inverse of the variance as the weighting matrix to em-

phasize the fit of the cross section. The fact that the cross-sectional fit is the same suggests

that any differences in economic conclusions between Gaussian and non-Gaussian models

must be driven by the differences in their time series properties. Gaussian models are more

flexible in their conditional mean. Consequently, term premia are more flexible in Gaussian

models, especially over long horizons. Non-Gaussian models can track the broad trend in

volatility of yields but they do not capture the detailed variation that reduced-form models

can capture. In our data set, the volatility factor is the level factor, because in the 1970’s,

interest rates were high when volatility was high. Conventional wisdom in finance suggests

that the short rate serves as the volatility factor but we find that estimates of conditional

volatility are more closely related to the long rate.

We also illustrate where the cross section information is coming from and show that a

small number of yields in the cross section contain a large amount of information. This is

because the bond loadings are a high powered polynomial function of the risk-neutral autore-

gressive coefficients. The bond loadings are sensitive to small changes in these parameters,

which causes the parameters to be estimated more precisely. This contrasts with the popular

view that the superior information comes from the large number of cross-sectional yields at

any time period.

This paper continues as follows. In Section 2, we specify a general class of discrete-time,


non-Gaussian affine term structure models. In Section 3, we describe our new approach to

estimation showing how to construct the concentrated likelihood. Section 4 describes the

data and identification of the models. Section 5 studies a three factor non-Gaussian model

in depth. In Section 6, we study several three and four factor Gaussian and non-Gaussian

models. In Section 7, we discuss directions for future research and conclude.

2 Model

In this section, we describe a class of discrete-time, non-Gaussian ATSMs and show that the

state vector is closed under affine transformations. In addition, all models within this family

have closed-form transition densities.

2.1 Factor dynamics

The H×1 vector of non-Gaussian state variables ht+1 captures the volatility. Their stochastic

process is obtained by taking an affine transformation to the exact discrete-time equivalent

of a multivariate Cox, Ingersoll, and Ross(1985) process. The model for ht+1 under the

physical measure P is

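$$h_{t+1} = \mu_h + \Sigma_h w_{t+1} \tag{1}$$

$$w_{i,t+1} \sim \text{Gamma}\left(\nu_{h,i} + z_{i,t+1},\, 1\right), \quad i = 1, \ldots, H \tag{2}$$

$$z_{i,t+1} \sim \text{Poisson}\left(e_i' \Sigma_h^{-1} \Phi_h \Sigma_h w_t\right), \quad i = 1, \ldots, H \tag{3}$$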

where νh = (νh,1, . . . , νh,H) are shape parameters, Φh is a matrix controlling the autocorrela-

tion of ht+1, Σh is a scale matrix, and µh is a vector determining the lower bound of ht+1. We

use ei to denote the i-th column of IH . To guarantee that ht+1 remains positive, all elements

of µh and Σh must be non-negative. To ensure that the mean of the Poisson distribution is

non-negative, $\Sigma_h^{-1}\Phi_h\Sigma_h$ must be non-negative.


The conditional mean of ht+1 can be written in matrix form as

E (ht+1|It) = (IH − Φh)µh + Σhνh + Φhht

where It stands for the agent’s information set at time t. It is a linear function of its own lag

ht, similar to a vector autoregression. The conditional variance V (ht+1|It) is also an affine

function of ht

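$$\Sigma_{h,t}\Sigma_{h,t}' = \Sigma_h\,\text{diag}\left(\nu_h - 2\Sigma_h^{-1}\Phi_h\mu_h\right)\Sigma_h' + \Sigma_h\,\text{diag}\left(2\Sigma_h^{-1}\Phi_h h_t\right)\Sigma_h'$$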

Gourieroux and Jasiak(2006) built the univariate version of this model and Le, Singleton,

and Dai(2010) extended it to (1)-(3) with µh = 0 and Σh diagonal. In this model for ht+1,

shocks are allowed to be correlated through the off-diagonal elements of Σh and the process

may have a negatively correlated drift through the off-diagonal elements of Φh. In Appendix

A.2, we provide the transition density of ht+1 for any admissible rotation.
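The Poisson-gamma mixture in (1)-(3) can be simulated directly, which gives a quick numerical check on the conditional mean formula above. The following is a minimal sketch, assuming NumPy; the function names `simulate_h` and `conditional_mean` and the argument layout are illustrative choices, not part of the paper.

```python
import numpy as np

def simulate_h(mu_h, Sigma_h, Phi_h, nu_h, h0, T, rng):
    """Simulate h_1, ..., h_T from equations (1)-(3), starting at h0.

    Assumes the admissibility conditions in the text hold: mu_h and
    Sigma_h are non-negative and Sigma_h^{-1} Phi_h Sigma_h is non-negative.
    """
    Sigma_inv = np.linalg.inv(Sigma_h)
    K = Sigma_inv @ Phi_h @ Sigma_h      # matrix inside the Poisson mean in (3)
    w = Sigma_inv @ (h0 - mu_h)          # invert (1) to recover w_0 from h_0
    path = np.empty((T, len(mu_h)))
    for t in range(T):
        z = rng.poisson(K @ w)           # equation (3)
        w = rng.gamma(nu_h + z, 1.0)     # equation (2): shape nu + z, scale 1
        path[t] = mu_h + Sigma_h @ w     # equation (1)
    return path

def conditional_mean(mu_h, Sigma_h, Phi_h, nu_h, h_t):
    """E(h_{t+1} | I_t) = (I - Phi_h) mu_h + Sigma_h nu_h + Phi_h h_t."""
    I = np.eye(len(mu_h))
    return (I - Phi_h) @ mu_h + Sigma_h @ nu_h + Phi_h @ h_t
```

Averaging many one-step draws from a fixed `h0` should approach the value returned by `conditional_mean`, and every simulated value stays above the lower bound `mu_h`.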

The G × 1 vector of conditionally Gaussian state variables gt+1 follows a vector autore-

gression with conditional heteroskedasticity

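$$g_{t+1} = \mu_g + \Phi_g g_t + \Phi_{gh} h_t + \Sigma_{gh}\varepsilon_{h,t+1} + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \tag{4}$$

$$\Sigma_{g,t}\Sigma_{g,t}' = \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it},$$

$$\varepsilon_{h,t+1} = h_{t+1} - E\left(h_{t+1}|I_t\right)$$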

where Σi,g are lower triangular for i = 0, . . . , H. The Gaussian factors are functions of the

non-Gaussian state variables through both the autoregressive term Φghht and the covariance

term Σghεh,t+1. The conditional variance of gt+1 is also a function of the non-Gaussian

factors, which introduces conditional heteroskedasticity into bond prices.

A nice property of the model (1)-(4) for the vector $x_{t+1} = (h_{t+1}', g_{t+1}')'$ pricing bonds is

that any admissible affine transformation remains within the same family of distributions.


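Proposition 1 Let $x_t = (h_t', g_t')'$ follow the process of (1)-(4) with parameters $\theta$. Consider

an admissible affine transformation of the form

$$\begin{pmatrix} \tilde{h}_t \\ \tilde{g}_t \end{pmatrix} = \begin{pmatrix} c_h \\ c_g \end{pmatrix} + \begin{pmatrix} C_{hh} & C_{hg} \\ C_{gh} & C_{gg} \end{pmatrix} \begin{pmatrix} h_t \\ g_t \end{pmatrix}.$$

The new process $\tilde{x}_t = (\tilde{h}_t', \tilde{g}_t')'$ remains in the same family of distributions under updated

parameters $\tilde{\theta}$. The parameters $\nu_h$ and $\Sigma_h^{-1}\Phi_h\Sigma_h$ are invariant to rotation. The admissibility

restrictions and the relationship between the new and old parameterizations can be found in

Appendix E.1.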

Proof: See Appendix E.1.

This proposition helps to understand identification in Section 4.2. The admissibility constraints

ensure that the non-Gaussian state variables always remain positive after applying a

transformation from $x_t$ to $\tilde{x}_t$ and that there exists another admissible rotation from $\tilde{x}_t$ back

to $x_t$.

Analogous to the popular class of Gaussian ATSMs, we specify xt = (g′t, h′t)′ to have the

same dynamics under both P and Q. We allow the parameters controlling the conditional

mean to be different under the two probability measures and set the scale parameters Σgh,Σh

and Σi,g for i = 0, . . . , H to be the same. The location parameter µh must also be the same

under both measures to ensure no-arbitrage.

2.2 Stochastic discount factor

In this section, we demonstrate how an agent gets compensated for risk exposure when

holding a zero-coupon bond under stochastic volatility. The detailed derivation of the log

of the stochastic discount factor (SDF) can be found in Appendix B. We decompose the


log-SDF into the risk free rate plus three components describing the risk compensation

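$$m_{t+1} = -r_t - \frac{1}{2}\lambda_{gt}'\lambda_{gt} - \lambda_{gt}'\varepsilon_{g,t+1} - \lambda_{wt}'\varepsilon_{w,t+1} - \lambda_{zt}'\varepsilon_{z,t+1}$$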

where εi,t+1 are standardized shocks with mean zero and identity covariance matrix, and

λit is the price of risk i for each of the three types of shocks in the model. In addition to

the risk-free rt, the agent gets compensated for being exposed to the Gaussian shock εg,t+1

in equation (4), the gamma shock εw,t+1 in equation (2), and the Poisson shock εz,t+1 in

equation (3). The prices of these risks are defined as

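$$\lambda_{gt} = V\left(g_{t+1}|I_t, h_{t+1}, z_{t+1}\right)^{-1/2}\left[E\left(g_{t+1}|I_t, h_{t+1}, z_{t+1}\right) - E^{Q}\left(g_{t+1}|I_t, h_{t+1}, z_{t+1}\right)\right],$$

$$\lambda_{wt} = V\left(w_{t+1}|I_t, z_{t+1}\right)^{-1/2}\left[E\left(w_{t+1}|I_t, z_{t+1}\right) - E^{Q}\left(w_{t+1}|I_t, z_{t+1}\right)\right],$$

$$\lambda_{zt} = V\left(z_{t+1}|I_t\right)^{-1/2}\left[E\left(z_{t+1}|I_t\right) - E^{Q}\left(z_{t+1}|I_t\right)\right].$$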

The market prices of risk have an intuitive form as the Sharpe ratio measuring per unit

risk compensation. Specifically, they are the difference in the conditional means of each

shock under P and Q standardized by a conditional standard deviation. The time-varying

quantities of risk are a feature of non-Gaussian models that are not available in Gaussian

models.

2.3 Bond prices

The price of a zero-coupon bond with maturity n at time t is the expected price of the same

asset at time t+ 1 discounted by the short rate rt under the risk neutral measure

$$P_t^{n} = E_t^{Q}\left[\exp\left(-r_t\right) P_{t+1}^{n-1}\right].$$


The short rate is a linear function of the state vector

rt = δ0 + δ′1,hht + δ′1,ggt.

This leads to bond prices that are an exponentially affine function of the state variables

$$P_t^{n} = \exp\left(\bar{a}_n + \bar{b}_{n,h}' h_t + \bar{b}_{n,g}' g_t\right).$$

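The bond loadings $\bar{a}_n$, $\bar{b}_{n,h}$ and $\bar{b}_{n,g}$ can be expressed recursively in matrix notation as

$$\begin{aligned}
\bar{a}_n ={}& -\delta_0 + \bar{a}_{n-1} + \mu_g^{Q\prime}\bar{b}_{n-1,g} + \left[\mu_h - \Phi_h^{Q}\mu_h + \Sigma_h\nu_h^{Q}\right]'\bar{b}_{n-1,h} + \frac{1}{2}\bar{b}_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'\bar{b}_{n-1,g} \\
& + \mu_h'\Phi_h^{Q\prime}\Sigma_h^{-1\prime}\left(I_H - \left[\text{diag}\left(\iota_H - \Sigma_h'\bar{b}_{n-1,gh}\right)\right]^{-1}\right)\Sigma_h'\bar{b}_{n-1,gh} \\
& - \nu_h^{Q\prime}\left[\log\left(\iota_H - \Sigma_h'\bar{b}_{n-1,gh}\right) + \Sigma_h'\bar{b}_{n-1,gh}\right]
\end{aligned} \tag{5}$$

$$\begin{aligned}
\bar{b}_{n,h} ={}& -\delta_{1,h} + \Phi_{gh}^{Q\prime}\bar{b}_{n-1,g} + \Phi_h^{Q\prime}\bar{b}_{n-1,h} + \frac{1}{2}\left(I_H \otimes \bar{b}_{n-1,g}'\right)\Sigma_g\Sigma_g'\left(\iota_H \otimes \bar{b}_{n-1,g}\right) \\
& - \Phi_h^{Q\prime}\Sigma_h^{-1\prime}\left(I_H - \left[\text{diag}\left(\iota_H - \Sigma_h'\bar{b}_{n-1,gh}\right)\right]^{-1}\right)\Sigma_h'\bar{b}_{n-1,gh}
\end{aligned} \tag{6}$$

$$\bar{b}_{n,g} = -\delta_{1,g} + \Phi_g^{Q\prime}\bar{b}_{n-1,g} \tag{7}$$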

where $\Sigma_g\Sigma_g'$ is a GH × GH block diagonal matrix with diagonal elements $\Sigma_{i,g}\Sigma_{i,g}'$ for i =

1, . . . , H and $\bar{b}_{n-1,gh} = \Sigma_{gh}'\bar{b}_{n-1,g} + \bar{b}_{n-1,h}$. The loadings must satisfy the restriction that the

i-th component of $\Sigma_h'\bar{b}_{n-1,gh}$ is less than 1 for i = 1, . . . , H. The initial loadings at maturity one are

$\bar{a}_1 = -\delta_0$, $\bar{b}_{1,g} = -\delta_{1,g}$ and $\bar{b}_{1,h} = -\delta_{1,h}$. The derivation of these expressions is available in

Appendix C.
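To make the recursion concrete, the sketch below covers only the purely Gaussian special case (H = 0), in which every term involving the non-Gaussian factors drops out of (5) and (7) and equation (6) is not needed. It is an illustration under that simplifying assumption, not the full non-Gaussian recursion; the function names and the use of NumPy are choices made here.

```python
import numpy as np

def gaussian_loadings(delta0, delta1, muQ, PhiQ, Sigma0, max_n):
    """Price loadings for maturities 1..max_n in the Gaussian case (H = 0).

    With no non-Gaussian factors, (5) and (7) reduce to
      a_bar_n = -delta0 + a_bar_{n-1} + muQ' b_bar_{n-1}
                + 0.5 * b_bar_{n-1}' Sigma0 Sigma0' b_bar_{n-1}
      b_bar_n = -delta1 + PhiQ' b_bar_{n-1}
    starting from a_bar_1 = -delta0 and b_bar_1 = -delta1.
    """
    G = len(delta1)
    a_bar = np.empty(max_n)
    b_bar = np.empty((max_n, G))
    a_bar[0] = -delta0
    b_bar[0] = -delta1
    V = Sigma0 @ Sigma0.T
    for n in range(1, max_n):
        b_prev = b_bar[n - 1]
        a_bar[n] = (-delta0 + a_bar[n - 1] + muQ @ b_prev
                    + 0.5 * b_prev @ V @ b_prev)
        b_bar[n] = -delta1 + PhiQ.T @ b_prev
    return a_bar, b_bar

def yield_loadings(a_bar, b_bar):
    """Convert price loadings to yield loadings by dividing by -n."""
    n = np.arange(1, len(a_bar) + 1)
    return -a_bar / n, -b_bar / n[:, None]
```

Because each step multiplies by `PhiQ.T` again, the maturity-n loading is a polynomial of order n - 1 in the risk-neutral autoregressive coefficients.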

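Bond yields $y_t^{n} \equiv -\frac{1}{n}\log\left(P_t^{n}\right)$ are linear in the factors

$$y_t^{n} = a_n + b_{n,h}' h_t + b_{n,g}' g_t$$

with $a_n = -\frac{1}{n}\bar{a}_n$, $b_{n,h} = -\frac{1}{n}\bar{b}_{n,h}$ and $b_{n,g} = -\frac{1}{n}\bar{b}_{n,g}$. Stacking $y_t^{n}$ in order for N different

maturities $n_1, n_2, \ldots, n_N$ gives $Y_t = A + Bx_t$ where $A = (a_{n_1}, \ldots, a_{n_N})'$, $B = (b_{n_1}', \ldots, b_{n_N}')'$.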


If more yields are observed than the number of factors (N > G + H), not all yields can be

priced exactly. We make the standard assumption in the ATSM literature that N1 = G+H

linear combinations of the yields $Y_t^{(1)} = S_{Y_1}Y_t$ are priced without error and the remaining

$N_2 = N - N_1$ linear combinations $Y_t^{(2)} = S_{Y_2}Y_t$ are observed with Gaussian measurement

errors. Given this assumption, the observation equations are

$$Y_t^{(1)} = A_1 + B_1 x_t, \qquad Y_t^{(2)} = A_2 + B_2 x_t + \eta_t, \qquad \eta_t \sim N\left(0, \Omega\right) \tag{8}$$

where $A_1 \equiv S_{Y_1}A$, $A_2 \equiv S_{Y_2}A$, $B_1 \equiv S_{Y_1}B$, and $B_2 \equiv S_{Y_2}B$.

3 Estimation methodology

In this section, we introduce a new estimation method for any identified model. In Section

3.2, we illustrate how the basic idea can be applied to a wide range of ATSMs including

models with observable macroeconomic variables and hidden factors.

3.1 Concentrated likelihood estimation

Given the parameters of the model θ, the likelihood function is

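$$p\left(Y_{1:T}^{(1)}, Y_{1:T}^{(2)}; \theta\right) = |\det(J)|^{-(T-1)} \prod_{t=1}^{T} p\left(Y_t^{(2)}|I_t; \theta\right) \prod_{t=1}^{T} p\left(g_t|h_t, I_{t-1}; \theta\right) p\left(g_0|h_0; \theta\right) \prod_{t=1}^{T}\prod_{i=1}^{H} p\left(h_{it}|I_{t-1}; \theta\right) p\left(h_0; \theta\right) \tag{9}$$

where J is the Jacobian of the transformation from $x_t = (g_t', h_t')'$ to $Y_t^{(1)}$. We have used the

fact that the pricing equation (8) can be inverted

$$x_t = B_1^{-1}\left(Y_t^{(1)} - A_1\right), \qquad g_t = S_g x_t, \qquad h_t = S_h x_t \tag{10}$$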


and the Gaussian and non-Gaussian factors can be selected out by the matrices Sg and Sh.

Expressions for the log-likelihood $\ell(\theta)$ are available in Appendix D.³ Direct maximization

of the log-likelihood is, however, extremely challenging as interest rates are close to

non-stationary, the bond loadings are non-linear functions of the model’s parameters, and the

maximization must impose the condition that ht > 0.
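The inversion in (10) is a single linear solve per period. The sketch below is illustrative only: it assumes the state vector is ordered with the H non-Gaussian factors first, so that the selection matrices reduce to column slices, and the function name `invert_factors` is hypothetical.

```python
import numpy as np

def invert_factors(Y1, A1, B1, H):
    """Recover x_t = B1^{-1} (Y1_t - A1) for every t, as in equation (10).

    Y1 is a (T, N1) array of the yield combinations priced without error.
    Assumes the first H entries of x_t are the non-Gaussian factors h_t
    and the remaining entries are the Gaussian factors g_t.
    """
    X = np.linalg.solve(B1, (Y1 - A1).T).T   # solve instead of forming B1^{-1}
    h, g = X[:, :H], X[:, H:]
    if np.any(h <= 0):
        # the likelihood is only defined when h_t > 0
        raise ValueError("implied non-Gaussian factors must be positive")
    return h, g
```

Raising an error when an implied h_t is not positive is one way to reject inadmissible parameter values; an optimizer wrapper could instead map this case to a very low log-likelihood.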

The key insight to improving estimation methods for this class of models is to recognize

that it is possible to concentrate out a large subset of the parameters from the log-likelihood

by linear regression. Specifically, it is possible to concentrate out the parameters entering

the conditional mean of the Gaussian state variables (µg,Φg,Φgh) as well as the covariance

matrix Ω in (8). The parameter vector can be split into two sub-vectors θ = (θc, θm) which

are the parameters that can be concentrated out θc = (µg,Φg,Φgh,Ω) and the remaining

parameters θm that will be maximized numerically. The method we propose is a result of

the following proposition.

Proposition 2 Let $\theta_c = (\mu_g, \Phi_g, \Phi_{gh}, \Omega)$ and $\theta_m$ contain the remaining parameters of the

model. For the general affine model, the maximum likelihood estimator $\hat{\theta} = (\hat{\theta}_c, \hat{\theta}_m)$ can be

obtained by the following procedure.

(1.) Given $\theta_m$, maximize the conditional log-likelihood to obtain $\hat{\theta}_c(\theta_m) = \arg\max_{\theta_c} \ell(\theta_c, \theta_m)$.

The first-order conditions for this problem can be solved analytically as follows.

(a.) Given $\theta_m$, calculate the bond loadings A and B and the state variables $g_t$ and $h_t$

from $x_t = B_1^{-1}\left(Y_t^{(1)} - A_1\right)$.⁴

³ The stationary distribution p (g0, h0; θ) = p (g0|h0; θ) p (h0; θ) is only known for special sub-classes of the affine family of models. In this paper, we will assume a diffuse initial condition and start from t = 2. When the stationary distribution of the state vector is known, including the initial condition is easy to do after initial estimation of the model. While including the initial conditions enforces stationarity, it also has a potential negative impact on the estimates of the autoregressive parameters as it can increase their downward bias; see, e.g. Bauer, Rudebusch, and Wu(2012).

⁴ During estimation, we impose B1 to be invertible and ht to be positive.


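(b.) Given $g_t$ and $h_t$, calculate $\varepsilon_{h,t+1}$ and $\Sigma_{g,t}$. Run a GLS regression

$$g_{t+1} - \Sigma_{gh}\varepsilon_{h,t+1} = \mu_g + \Phi_g g_t + \Phi_{gh} h_t + \Sigma_{g,t}\varepsilon_{g,t+1}$$

to calculate $\hat{\mu}_g(\theta_m)$, $\hat{\Phi}_g(\theta_m)$, $\hat{\Phi}_{gh}(\theta_m)$.

(c.) Calculate the covariance matrix

$$\hat{\Omega}(\theta_m) = \frac{1}{T-1}\sum_{t=2}^{T}\left(Y_t^{(2)} - A_2 - B_2 x_t\right)\left(Y_t^{(2)} - A_2 - B_2 x_t\right)'$$

(2.) Substitute $\hat{\theta}_c(\theta_m) = \left(\hat{\mu}_g(\theta_m), \hat{\Phi}_g(\theta_m), \hat{\Phi}_{gh}(\theta_m), \hat{\Omega}(\theta_m)\right)$ into the original log-likelihood.

Maximize the concentrated log-likelihood function $\hat{\theta}_m = \arg\max_{\theta_m} \ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)$.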

Proof: See Appendix E.2.

The dimension of the optimization problem decreases dramatically as the concentrated

log-likelihood function $\ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)$ is only a function of $\theta_m$. Restrictions on Ω are also

possible.

The intuition behind this result is that given θm the bond loadings can be calculated and

the factors gt and ht are conditionally observable from (10). Once the factors are observed,

the first-order conditions for the parameters (µg,Φg,Φgh) can be solved analytically in part

(1.b.) because they enter the log-likelihood only in the P dynamics as quadratic functions of

the state variables. Solving the subset of first order conditions for these parameters in terms

of gt and ht is equivalent to running the generalized least squares (GLS) regression defined

by the conditionally Gaussian factor dynamics.
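The GLS regression of step (1.b.) has a closed form once the factors and their conditional covariances are treated as known. The sketch below is one possible implementation, assuming NumPy, a regressor vector ordered as (1, g_t', h_t')', and conditional covariances built from the Σ_{i,g} matrices as in (4). The function names are hypothetical, and the paper's own derivation is in its Appendix E.2.

```python
import numpy as np

def conditional_cov(Sigma0, Sigma_list, h):
    """Sigma_{g,t} Sigma_{g,t}' = Sigma0 Sigma0' + sum_i Sigma_i Sigma_i' h_{it}.

    h is a (T, H) array; returns a (T, G, G) array of covariances.
    """
    V = np.repeat((Sigma0 @ Sigma0.T)[None], len(h), axis=0)
    for i, S in enumerate(Sigma_list):
        V = V + h[:, i, None, None] * (S @ S.T)
    return V

def gls_concentrate(y, Z, V):
    """GLS estimate of beta in y_t = beta z_t + e_t with Var(e_t) = V_t.

    y: (T, G) left-hand side, e.g. g_{t+1} - Sigma_gh eps_{h,t+1}
    Z: (T, k) regressors, e.g. rows (1, g_t', h_t')
    V: (T, G, G) conditional covariance matrices
    Returns beta with shape (G, k), i.e. [mu_g, Phi_g, Phi_gh] side by side.
    """
    T, G = y.shape
    k = Z.shape[1]
    lhs = np.zeros((G * k, G * k))
    rhs = np.zeros(G * k)
    for t in range(T):
        V_inv = np.linalg.inv(V[t])
        lhs += np.kron(np.outer(Z[t], Z[t]), V_inv)   # (z z') kron V^{-1}
        rhs += np.kron(Z[t], V_inv @ y[t])            # z kron (V^{-1} y)
    vec_beta = np.linalg.solve(lhs, rhs)              # column-stacked beta
    return vec_beta.reshape(k, G).T
```

Inside a numerical optimizer over θm, this function would be called at every likelihood evaluation, so the parameters it returns never enter the search directly.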

Of critical importance is the fact that the parameters being concentrated out µg,Φg,Φgh

have the potential to cause problems during estimation. These P parameters govern the time

series dynamics of the state variables. As yields are close to non-stationary, some factors are

also close to non-stationary.

The dimension of θc depends on the number of factors G and H as well as the rotation


of the state vector xt chosen by the researcher. This is because the number of estimable

parameters entering the matrices (µg,Φg,Φgh) depends on the rotation. Making these full

matrices maximizes the number of parameters that can be concentrated out. There are

multiple rotations of xt that accommodate this. If one considers a rotation of the state

vector such that these are not full matrices, this sub-set of the parameters can still be

concentrated out.

In Appendix F, we also derive the analytical gradients of the concentrated log-likelihood.

Our derivation shows how the gradients for affine models can be decomposed into pieces

according to whether a parameter enters the bond loadings, the P dynamics, or both.

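Proposition 3 The gradient of the concentrated log-likelihood $\ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)$ can be decomposed

into three terms:

$$\frac{d\ell\left(\hat{\theta}_c(\theta_m), \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'} = \frac{\partial\ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial\theta_m'} + \frac{\partial\ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial A'}\frac{\partial A(\theta_m)}{\partial\theta_m'} + \frac{\partial\ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial\text{vec}(B')'}\frac{\partial\text{vec}\left(B(\theta_m)'\right)}{\partial\theta_m'}.$$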

The first term is the partial derivative of the P dynamics and Jacobian with respect to θm.

This measures the direct effect parameters have on the log-likelihood through the time series

of the factors. The second and third terms measure the indirect effect parameters have on

the log-likelihood through the bond loadings A and B.

Proof: See Appendix F.2

The expressions for the gradient can be used for other affine models such as models for

defaultable bonds and credit default swaps. Standard errors and other model diagnostics

also benefit from the analytical gradient.


3.1.1 Approximately concentrating out other parameters

It is possible to “approximately” concentrate out other parameters that either govern the P

dynamics of ht such as Φh or the scale parameters Σgh. All our final results are based on the

exact concentrated likelihood. For complicated models, however, the approximation further reduces the dimensionality

of the maximization problem and provides excellent starting values for the full estimation.

The matrix Φh only enters the likelihood through the P dynamics. Using the results in

Appendix H, the non-Gaussian state variables can be written as a VAR(1) with conditionally

heteroskedastic, non-Gaussian shocks. This only requires adding step (d.) to Proposition 2.

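(d.) Given $h_t$, calculate $\hat{\Phi}_h$ from the GLS regression

$$h_{t+1} - \bar{h} = \Phi_h h_t + \sqrt{\text{diag}\left(\Sigma_{h,ii}\, 2\, h_{ii,t}\right)}\,\varepsilon_{h,t+1}, \qquad E\left[\varepsilon_{h,t+1}\right] = 0, \qquad V\left[\varepsilon_{h,t+1}\right] = I_H$$

where $\bar{h}$ is an intercept. This step effectively utilizes a QML-type approximation to the

dynamics of $h_{t+1}|h_t$ by assuming that the non-Gaussian errors are Gaussian.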

The parameter Σgh enters the log-likelihood through both the bond loadings A and B

and the dynamics of gt+1. However, it enters the bond loadings only through the Jensen’s

inequality terms whose impact on the bond loadings is small unless very long maturities

are added. Information on the parameter Σgh accumulates primarily from the dynamics of

gt+1. The approximate concentrated log-likelihood can be calculated by evaluating the bond

loadings in part (1.a.) with Σgh = 0 and replacing the regression in part (1.b.) with

(b.’) Given the history of gt and ht, calculate εh,t+1 and Σg,t. Run the GLS regression

gt+1 = µg + Φggt + Φghht + Σghεh,t+1 + Σg,tεg,t+1

to calculate µg (θm) , Φg (θm) , Φgh (θm) , Σgh (θm).

This step is useful for models with multiple non-Gaussian factors.


3.2 Examples

In this section, we discuss how our approach can be applied to several prominent ATSMs.

Example #1: observable macroeconomic variables

Our approach can be used in models with both yield factors and observable macroeco-

nomic variables. Our procedure works the same as before except the state vector xt now

contains the yield factors as well as the observed macroeconomic factors. For step (1.a.) of

Proposition 2, we use $Y_t^{(1)}$ to back out the latent component of xt. Shocks to macroeconomic

variables may depend on the latent non-Gaussian factors and be heteroskedastic. A descrip-

tion of the model with macroeconomic variables is provided in Appendix G.2. A special case

of this model with no non-Gaussian factors and no feedback from the latent factors to the

macroeconomic variables is Ang and Piazzesi(2003).

More parameters are identified in these models as long as sufficiently many additional

yields in $Y_t^{(2)}$ are available; see, e.g. Hamilton and Wu(2012b) for a discussion of this for

Gaussian ATSMs. Fortunately, many of the parameters that have been added to the model

can be concentrated out of the likelihood function. The parameters concentrated out govern

the time series dynamics of the observed macroeconomic variables, which are often highly

persistent and cause problems during estimation.

Example #2: Gaussian models

Multi-factor Gaussian models are one of the most widely applied tools for conducting

monetary policy and are implemented at central banks around the world. Our approach for

Gaussian models is particularly simple and in our experience these models can be reliably

estimated in a few seconds for a range of rotations with and without restrictions.

A full description of the Gaussian model is provided in Appendix G.3. The parameters

in each of the sub-vectors are $\theta_c = (\mu_g, \Phi_g, \Omega)$ and $\theta_m = \left(\delta_0, \delta_{1g}, \mu_g^{Q}, \Phi_g^{Q}, \Sigma_{0,g}\right)$. In models

with only Gaussian factors, the regression in part (1 b.) in Proposition 2 simplifies to OLS


(1 b.) Given gt, calculate µg (θm) , Φg (θm) by running the OLS regression

gt+1 = µg + Φggt + Σ0,gεg,t+1

All other steps of the estimation procedure are the same as before.
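For the Gaussian case, steps (1.b.) and (1.c.) amount to an ordinary least squares regression and a residual covariance. The following is a minimal sketch assuming NumPy; the function names `ols_var1` and `omega_hat` are hypothetical, and the factors `g` and state vectors `X` are assumed to have been recovered already from the yields priced without error.

```python
import numpy as np

def ols_var1(g):
    """OLS estimates of mu_g and Phi_g in g_{t+1} = mu_g + Phi_g g_t + error.

    g is a (T, G) array of factors. Because the error covariance is
    constant over time, the GLS regression reduces to equation-by-equation OLS.
    """
    Z = np.column_stack([np.ones(len(g) - 1), g[:-1]])   # regressors (1, g_t')
    coef, *_ = np.linalg.lstsq(Z, g[1:], rcond=None)     # shape (1 + G, G)
    mu_g = coef[0]
    Phi_g = coef[1:].T
    return mu_g, Phi_g

def omega_hat(Y2, A2, B2, X):
    """Average outer product of the pricing errors Y2_t - A2 - B2 x_t.

    Y2: (n, N2) yields observed with error, X: (n, G) factors for the
    same n periods (the periods t = 2..T when the first one is dropped).
    """
    resid = Y2 - A2 - X @ B2.T
    return resid.T @ resid / len(resid)
```

Both functions are cheap relative to the bond loading recursion, which is consistent with the short estimation times reported for Gaussian models.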

Example #3: parameter constraints

Researchers often impose restrictions on parameters of an ATSM. Constraints of economic

interest typically center on the relationship between the conditional means of the state vector

xt across the P and Q measures; see Cochrane and Piazzesi(2008) and Bauer(2011). This

allows information from the cross-section of yields to be exploited to estimate the time series

parameters. Constraints can also eliminate parameters that are statistically insignificant;

see, e.g. Ang and Piazzesi(2003) and Kim and Wright(2005). In our approach, a researcher

can impose constraints directly on the P and Q parameters within a non-Gaussian model

and still concentrate out parameters by linear regression.

We denote the penalized or constrained log-likelihood function $\ell_p(\theta)$ as

$$\ell_p(\theta) = \log p\left(Y_{1:T}^{(1)}, Y_{1:T}^{(2)}; \theta\right) + p(\theta),$$

where p(θ) is the penalty term. For example, when the constraints can be written as linear

functions of the ATSM’s parameters such as $\lambda_i = \mu_{g,i} - \mu_{g,i}^{Q} = 0$ and $\Lambda_{ij} = \Phi_{g,ij} - \Phi_{g,ij}^{Q} = 0$,

the penalty term is just a vector of Lagrange multipliers. Concentrating parameters out of

the log-likelihood is equivalent to running a constrained GLS regression. Another attractive

approach for incorporating prior information about the dynamics of the factors is to apply

a shrinkage estimator such as ridge regression to µg,Φg,Φgh, in which case the penalty

term p(θ) is a quadratic function of µg,Φg,Φgh. For example, a researcher may want to

shrink the parameters governing the dynamics of the factors under P toward a unit root in

order to counteract the small-sample downward bias of the autoregressive coefficients; see Bauer,

Rudebusch, and Wu(2012). Alternatively, a researcher can shrink the P parameters toward

their counterparts under the Q measure $\mu_g^{Q}, \Phi_g^{Q}, \Phi_{gh}^{Q}$.
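As an illustration of the shrinkage idea, the sketch below solves a ridge-type regression that pulls the coefficients toward a chosen target. It is a simplified example under assumptions made here: the errors are treated as homoskedastic, the penalty is `lam` times the squared Frobenius distance from the target, only Gaussian factors are present, and the names `ridge_toward_target` and `unit_root_target` are hypothetical. The exact penalty a researcher would use may differ.

```python
import numpy as np

def ridge_toward_target(Y, Z, beta0, lam):
    """Minimize sum_t ||y_t - beta z_t||^2 + lam * ||beta - beta0||_F^2.

    Y: (T, G) left-hand side, Z: (T, k) regressors, beta0: (G, k) target.
    The first-order condition gives
      beta (Z'Z + lam I) = Y'Z + lam beta0.
    """
    k = Z.shape[1]
    lhs = Z.T @ Z + lam * np.eye(k)        # symmetric (k, k)
    rhs = Y.T @ Z + lam * beta0            # (G, k)
    return np.linalg.solve(lhs, rhs.T).T

def unit_root_target(G):
    """Target [0, I_G]: zero intercept and Phi_g equal to the identity."""
    return np.hstack([np.zeros((G, 1)), np.eye(G)])
```

Setting `lam = 0` returns the unrestricted OLS estimate, while a very large `lam` returns the target itself. Replacing `unit_root_target(G)` with the stacked Q-measure parameters would shrink toward the risk-neutral dynamics instead.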

Example #4: “hidden” factors

It has been well known since Litterman and Scheinkman (1991) that three factors explain the majority of variation in the cross-section of yields. Recently, Duffee (2011) argued that more than three factors are needed to explain the time-series dynamics of yields and risk premia. These additional factors are “hidden” from the cross-section of yields because the factors are not priced. The hidden factors are nevertheless part of the P dynamics. For simplicity, we illustrate the basic ideas here for Gaussian models as in Duffee (2011) and leave details of the general non-Gaussian model to Appendix G.4.

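The Gaussian state vector can be separated into sub-vectors $x_{t+1} = \left(g_{1,t+1}', g_{2,t+1}'\right)'$ whose dimensions are $G_1 \times 1$ and $G_2 \times 1$, respectively. The dynamics under the P measure are

$$g_{1,t+1} = \mu_{g,1} + \Phi_{g,11} g_{1t} + \Phi_{g,12} g_{2t} + \varepsilon_{1,t+1}, \qquad \varepsilon_{1,t+1} \sim N\left(0, \Sigma_{0,g}\Sigma_{0,g}'\right) \tag{11}$$

$$g_{2,t+1} = \mu_{g,2} + \Phi_{g,21} g_{1t} + \Phi_{g,22} g_{2t} + \varepsilon_{2,t+1}, \qquad \varepsilon_{2,t+1} \sim N\left(0, I_{G_2}\right) \tag{12}$$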

The dynamics of gt+1 are the same under the Q measure but with the restrictions that

ΦQg,12 = 0 and the last G2 entries of δ1g are zero. These restrictions imply that only g1,t

directly impacts yields as the bond loadings on g2,t are zero by construction.

Given the subset of parameters that enter the bond loadings, the factors that price bonds are conditionally observable through the transformation $g_{1,t} = B_1^{-1}\left(Y^{(1)}_t - A_1\right)$, just as in step (a.) of Proposition 2. We can now treat $g_{1,t+1}$ as the observed data, and (11) is the new observation equation for a linear, Gaussian state space model. The remaining state variables $g_{2t}$ have transition equation (12) and are just serially correlated shocks to the factors $g_{1t}$ that price bonds. In our procedure, step (1.b.) of Proposition 2 is replaced by the Kalman filter, which is equivalent to a GLS regression where the errors are serially correlated. To concentrate the parameters $(\mu_{g,1}, \mu_{g,2}, \Phi_{g,11}, \Phi_{g,21})$ out of the likelihood, we can either place them in the state vector or use the augmented Kalman filter of de Jong (1991); see also Chapter 5 of Durbin and Koopman (2012). The Kalman filter delivers the concentrated log-likelihood $\sum_{t=2}^{T} \log p\left(g_{1t} \mid g_{1,t-1}; \theta_m\right)$ associated with the P dynamics of the model.
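To make the state space form concrete, the following sketch evaluates the Gaussian log-likelihood of the observed factors $g_{1t}$ with the Kalman filter, treating (11) as the observation equation and (12) as the transition equation. It assumes the two shocks are uncorrelated and takes a prior mean `a0` and covariance `P0` for the hidden factor as given; these choices, the function name, and the simulated inputs are assumptions for illustration. It does not concentrate out any parameters, so it is only the filtering step, not the full procedure.

```python
import numpy as np

def hidden_factor_loglik(g1, mu1, mu2, Phi11, Phi12, Phi21, Phi22, Sigma0, a0, P0):
    """Gaussian log-likelihood of g1[1:], conditional on g1[0], by the Kalman filter.

    Observation: g1[t+1] = mu1 + Phi11 g1[t] + Phi12 g2[t] + e1,  e1 ~ N(0, Sigma0 Sigma0')
    State:       g2[t+1] = mu2 + Phi21 g1[t] + Phi22 g2[t] + e2,  e2 ~ N(0, I)
    e1 and e2 are assumed uncorrelated; a0, P0 are the prior moments of g2[0].
    """
    T, G1 = g1.shape
    G2 = len(mu2)
    a, P = np.array(a0, dtype=float), np.array(P0, dtype=float)
    Q1 = Sigma0 @ Sigma0.T
    loglik = 0.0
    for t in range(T - 1):
        v = g1[t + 1] - mu1 - Phi11 @ g1[t] - Phi12 @ a   # one-step forecast error
        F = Phi12 @ P @ Phi12.T + Q1                      # forecast error covariance
        loglik -= 0.5 * (G1 * np.log(2 * np.pi) + np.linalg.slogdet(F)[1]
                         + v @ np.linalg.solve(F, v))
        K = Phi22 @ P @ Phi12.T @ np.linalg.inv(F)        # gain for predicting g2[t+1]
        a = mu2 + Phi21 @ g1[t] + Phi22 @ a + K @ v
        P = Phi22 @ P @ Phi22.T + np.eye(G2) - K @ F @ K.T
    return loglik

# Hypothetical inputs with G1 = 2 priced factors and G2 = 1 hidden factor.
rng = np.random.default_rng(1)
g1 = 0.01 * rng.standard_normal((50, 2)).cumsum(axis=0)
params = dict(mu1=np.zeros(2), mu2=np.zeros(1), Phi11=0.9 * np.eye(2),
              Phi12=np.array([[0.5], [0.2]]) * 1e-3, Phi21=np.zeros((1, 2)),
              Phi22=np.array([[0.8]]), Sigma0=0.01 * np.eye(2),
              a0=np.zeros(1), P0=np.eye(1))
ll = hidden_factor_loglik(g1, **params)
```

When `Phi12` is zero the hidden factor drops out of the observation equation and the value reduces to the likelihood of a Gaussian VAR(1) in $g_{1t}$, which is a convenient sanity check.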

Example #5: observable yield factors

Another special case is the class of models in which the state variables $x_t$ are chosen a priori by the researcher and are therefore observable. In most applications, the factors are the linear combinations of yields priced without error, $x_t = Y^{(1)}_t$. This creates restrictions on the Q parameters within the bond loadings because by construction A1 = SY1A = 0 and B1 = SY1B = I; see, e.g., Joslin, Singleton, and Zhu (2011) and Hamilton and Wu (2012c). For Gaussian models with observable factors, our procedure will coincide with Joslin, Singleton, and Zhu (2011).

Working with observable state variables in non-Gaussian models has some practical difficulties that are not present for Gaussian models. A researcher must know a priori exactly which linear combinations of yields are Gaussian and which are non-Gaussian. This is difficult to define in practice because it is not a priori clear which factor(s) (e.g. level, slope, or curvature) are the stochastic volatility factors.5

5 Collin-Dufresne, Goldstein, and Jones (2008) proposed a novel approach to estimate non-Gaussian ATSMs based on observable factors that are implied by the theoretical properties of continuous-time models. They define the state variables in terms of the slope, curvature, and integrated covariance of the instantaneous short rate. These quantities would be observable if a continuous record of yield data were available. Finding empirical counterparts for these theoretical quantities is non-trivial, especially over long periods of time.

4 Data and identification

4.1 Data

We use the Fama and Bliss (1987) zero coupon bond data available from the Center for Research in Security Prices (CRSP). The data are monthly and span June 1952 through June 2012 for a total of T = 721 observations with maturities of (1, 3, 12, 24, 36, 48, 60) months. For three factor models, the yields measured without error $Y^{(1)}_t$ include the (1, 12, 60) month maturities. In models with four factors, $Y^{(1)}_t$ are the (1, 12, 24, 60) month maturities.

4.2 Identification

Proposition 1 gives guidelines for identification of the model. For the pure Gaussian part, a number of parameters enter the log-likelihood in the same way. This requires: (i) $G$ restrictions on $\mu_g$ and $\mu^Q_g$ to prevent shift; (ii) $(H+1)G(G-1)/2$ restrictions to identify $\Sigma_{i,g}$ from $\Sigma_{i,g}\Sigma_{i,g}'$; (iii) $G$ restrictions between $\Sigma_{i,g}$ and $\delta_{1g}$ to prevent scaling; (iv) $G(G-1)$ restrictions between $\Phi_g$, $\Phi^Q_g$ and $\Sigma_{i,g}$ to prevent rotation.6 For the pure non-Gaussian part, this requires: (i) $H$ restrictions imposed on $\mu_h$ to prevent shift; (ii) $H$ restrictions on $\Sigma_h$ and $\delta_{1h}$ to prevent scaling; (iii) $H(H-1)$ restrictions on $\Sigma_h$, $\Phi_h$, and $\Phi^Q_h$ to prevent rotation. An additional $GH$ restrictions are required on the matrices $\Phi^Q_{gh}$, $\Phi_{gh}$, and $\Sigma_{gh}$ to prevent rotation between the factors.
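The counting rules above are easy to tabulate for a given model size. The helper below is a small sketch that simply evaluates the stated formulas for $G$ Gaussian and $H$ non-Gaussian factors; the function name and the dictionary labels are illustrative, and special cases such as repeated eigenvalues are not handled.

```python
def identification_restrictions(G, H):
    """Number of identifying restrictions implied by the counting rules in the text."""
    gaussian = {
        "shift": G,
        "covariance_factorisation": (H + 1) * G * (G - 1) // 2,
        "scaling": G,
        "rotation": G * (G - 1),
    }
    non_gaussian = {"shift": H, "scaling": H, "rotation": H * (H - 1)}
    cross = G * H  # rotation between Gaussian and non-Gaussian factors
    total = sum(gaussian.values()) + sum(non_gaussian.values()) + cross
    return gaussian, non_gaussian, cross, total

# A1(3) model: two Gaussian factors and one volatility factor.
gaussian_a13, non_gaussian_a13, cross_a13, total_a13 = identification_restrictions(G=2, H=1)
# A0(3) model: three Gaussian factors and no volatility factor.
gaussian_a03, non_gaussian_a03, cross_a03, total_a03 = identification_restrictions(G=3, H=0)
```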

An identification exercise similar to Hamilton and Wu (2012b) indicates that only $\delta_0$ and the three eigenvalues in $\Phi^Q$ are identified from the cross-section. This is because in a just-identified model the univariate cross-sectional regression of $Y^{(2)}_t$ on $x_t$ can only identify four parameters that enter the bond loadings. In both Gaussian and non-Gaussian models, these

parameters are enough to determine the bond loadings B for specific rotations of the state

vector and when Jensen’s inequality terms are small. For Gaussian models, these parameters

also determine A and hence the bond pricing function. For non-Gaussian models, there

are more parameters in Q that determine A and these are identified from the time-series

component of the likelihood.

In our empirical work, we impose the following identifying restrictions. For the Gaussian part, these are: (i) $\mu^Q_g = 0$; (ii) $\Phi^Q_g$ in ordered Jordan form;7 (iii) $\delta_{1g} = \iota$ is a column vector of ones; and (iv) $\Sigma_{i,g}$ is lower triangular. For the non-Gaussian part with $H = 1$, (i) $\mu_h = 0$; (ii) $\delta_{1h} = \pm 1$. For the cross terms, $\Phi^Q_{gh} = 0$. Elements of the vector $\delta_{1h}$ can take either sign, which unlike Gaussian-only models will lead to inequivalent maxima as we explain in Section 5.2. It is also possible to estimate the parameters $\nu^Q_h, \nu_h$ as they are identified regardless of the rotation of the factors, based on Proposition 1.

6 For special cases such as repeated eigenvalues in Jordan form, there are additional restrictions, which we discuss in Section 6.3.

7 For the case where $\Phi^Q_g$ has real distinct eigenvalues, it is a diagonal matrix with diagonal elements in descending order.

5 A three factor model

In this section, we use our new method to estimate a three factor model with one volatility factor, which has been the preferred model by many researchers.8 This is the A1(3) model in the Dai and Singleton (2000) notation. For an A1(3) model, the concentrated likelihood reduces the number of parameters by one-third, from 24 to 16.

We focus on two aspects of estimation: (1) we compare the performance of our estimation method with the conventional approach of directly maximizing the log-likelihood; (2) we discuss why local maxima exist in models with both Gaussian and non-Gaussian factors.

5.1 Performance comparison

To illustrate the mileage we gain from using our method, we compare our approach to the conventional method that does not concentrate out $(\mu_g, \Phi_g, \Phi_{gh})$ or use analytical gradients. We perform a Monte Carlo experiment where we estimate the A1(3) model on the CRSP dataset 100 times from 100 different starting values using both methods.9 We compare our method and the direct approach along two dimensions: convergence and speed. To measure the former, we use the likelihood ratio (two times the difference in log-likelihoods).

8 This model has been widely considered as the benchmark non-Gaussian ATSM; see Dai and Singleton (2000), Cheridito, Filipovic, and Kimmel (2007), Collin-Dufresne, Goldstein, and Jones (2008), and Aït-Sahalia and Kimmel (2010) for examples.

9 To make the comparison as parallel as we can, we write the likelihood function the same way, impose the same identifying restrictions, and use the same scaling and initial values for the parameters except that the conventional method has additional parameters entering the numerical optimizer.

The global solution found by our method has a log-likelihood of 36647.69 (estimates and asymptotic robust standard errors can be found on the right hand side of Table 2). We achieve an identical value for all 17 random starting values that were initialized in this region or mode of the parameter space.10 Seventeen is the number of starting values (about one-sixth of the 100) that fell in this region. Conversely, the conventional method does not find this log-likelihood once, nor does it reproduce the same (incorrect) estimates for each of these 17 starting values. The highest log-likelihood value found by the standard approach is 36645.29, and it is only achieved for one starting value. The difference between the two methods corresponds to a likelihood ratio of 4.8. The null hypothesis that the two likelihood values are statistically the same will be rejected by a $\chi^2$ test, even if our method has 1 more degree of freedom than the conventional method. In short, the conventional method does not achieve the global solution. Second, across these 17 starting values, the conventional method yields log-likelihood values ranging from 36636.82 to 36645.29; the difference between these two numbers is again statistically significant. With our method producing the same number repeatedly, we can conclude that it is a maximum. The fact that the conventional approach does not repeatedly find the same value even when it is initialized in the same region makes it extremely difficult to understand the behavior of the log-likelihood surface and consequently the economic implications of the model.
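The likelihood ratio arithmetic quoted above can be reproduced directly. The sketch below computes the two statistics from the reported log-likelihood values and, as an illustration, converts them to p-values under a $\chi^2$ distribution with one degree of freedom, using the identity that a $\chi^2_1$ variable is a squared standard normal. The helper name `chi2_sf_1df` and the choice of one degree of freedom are assumptions made for this example.

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for X chi-squared with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

llf_concentrated = 36647.69       # global maximum found by the concentrated likelihood
llf_conventional_best = 36645.29  # best value found by the conventional method
llf_conventional_worst = 36636.82 # worst value across the same 17 starting values

# Likelihood ratio: two times the difference in log-likelihoods.
lr_between_methods = 2.0 * (llf_concentrated - llf_conventional_best)
lr_within_conventional = 2.0 * (llf_conventional_best - llf_conventional_worst)

p_between = chi2_sf_1df(lr_between_methods)
p_within = chi2_sf_1df(lr_within_conventional)
print(round(lr_between_methods, 2), round(lr_within_conventional, 2))
```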

An immediate benefit of the stable behavior of our method is that we are able to find that the A1(3) model has 6 local modes, with three well-behaved local maxima and three regions of the log-likelihood that appear to be locally unidentified. The three well-behaved local maxima are listed in Table 1 and we will discuss the properties of the model that create these local modes in Section 5.2. Our method converges 17/100 times to Local 1, 14/100 times to Local 2, and 17/100 times to Local 3. Inspection of the starting values indicates that if our procedure is started in the region of one of the well-behaved local maxima, it converges to that maximum. This is not true when the log-likelihood is maximized directly using the un-concentrated log-likelihood and no analytical gradients. The median likelihood ratio between our procedure and the un-concentrated log-likelihood with no analytical gradient is 29.5, indicating a substantial difference between the two procedures. The conventional method, even if it gets close to a local maximum, always stops before it fully converges. This makes it difficult for researchers to differentiate between points that are near a well-behaved local maximum and have the same economic meaning, and locations corresponding to local maxima that are economically different. The fact that our method always finds the local maximum within the region helps us to uncover the different local maxima and allows us to study their economic implications.

10 We consider two log-likelihoods to be numerically identical if they agree up to 2 decimal places. In practice, the log-likelihood values are identical up to 8 decimal places.
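A multi-start exercise of this kind is usually summarized by grouping the converged log-likelihood values. The sketch below counts runs per mode after rounding to two decimal places, which mirrors the agreement criterion in footnote 10. The list of runs is built from the counts and log-likelihoods reported in the text and Table 1, with hypothetical numerical noise added; the runs that ended in locally unidentified regions are omitted because their values are not reported.

```python
from collections import Counter

def group_by_loglik(logliks, decimals=2):
    """Count how many runs ended at each log-likelihood value, after rounding."""
    return Counter(round(v, decimals) for v in logliks)

# Converged values for the three well-behaved maxima; the 1e-9 offsets are
# hypothetical noise standing in for small differences across runs.
runs = ([36647.69 + 1e-9 * i for i in range(17)]
        + [36442.15 + 1e-9 * i for i in range(14)]
        + [36477.72 + 1e-9 * i for i in range(17)])

modes = group_by_loglik(runs)
best_value, best_count = max(modes.items())  # tuple order: largest log-likelihood wins
```

Rounding can split a mode whose values straddle a rounding boundary, so a tolerance-based clustering may be preferable when runs agree less tightly.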

Estimation time is another important dimension along which we compare our approach

to the conventional method. The median estimation time for our procedure to estimate

from a random starting value is less than 2 minutes, whereas the conventional approach of

directly maximizing the log-likelihood function takes more than 2 hours. To perform our

Monte Carlo study with 100 starting values, it takes our method about 4 hours, whereas it

takes roughly 9 days to complete the same exercise with the conventional method.

In summary, our method addresses all of the following problems with the conventional

method. The conventional method is painfully slow. It does not achieve the global maximum.

And, it is extremely hard to assess convergence behavior and the number of local maxima

because conventional approaches do not repeatedly find the same local maximum even when

started in that region of the parameter space.

5.2 Local maxima

Using our approach simplifies estimation and helps uncover some features of the log-likelihood

surface that may be obscured by directly maximizing the log-likelihood. In this section,

we discuss the characteristics of the model that create local maxima and their economic

consequences.

In Gaussian ATSMs, a change in the sign of $\delta_{1g}$ rotates the factors from $g_t$ to $-g_t$. The rotation of $\delta_{1g}$ is economically irrelevant because the estimated model switches between two global maxima. As a result, researchers need to fix the sign of $\delta_{1g}$ to achieve identification. Unlike in Gaussian models, fixing the sign of $\delta_{1h}$ is not inconsequential. The state variable $h_t$ is positive by definition. Changing the sign of $\delta_{1h}$ does not rotate $h_t$ to $-h_t$. Therefore, there can exist inequivalent local maxima for each combination of different signs of $\delta_{1h}$. For each of the local maxima, the estimated latent factor $h_t$ is different, which changes the conditional variance of $g_t$ and consequently the log-likelihood.

Table 1: Local maxima in the A1(3) model

                 Local 1     Local 2     Local 3
$h_t$            level       slope       curvature
$\Phi^Q_h$       0.9961      0.9512      0.5412
$\Phi^Q_g$       0.9514      0.9992      0.9965
                 0.5358      0.5672      0.9507
$\delta_{1,h}$   1           1           -1
LLF              36647.69    36442.15    36477.72

Maximum likelihood estimates of $\Phi^Q_h$ and $\Phi^Q_g$ with corresponding log-likelihood.

Reordering the eigenvalues in $\Phi^Q$ has completely different implications for non-Gaussian models than for Gaussian models.11 If the eigenvalues are reordered in a multi-factor Gaussian model, it implies equivalent global maxima with the same economic implication. However, with non-Gaussian factors, reordering can yield inequivalent local maxima. Here, we demonstrate the intuition using the A1(3) model, although the basic idea holds for all non-Gaussian models. The factors are labeled as level, slope and curvature, from most persistent to least persistent. Reordering the eigenvalues across $\Phi^Q_g$ and $\Phi^Q_h$ does not generally change the shape of the factors but it does change whether $h_t$ is the level, slope, or curvature. Any change in $h_t$ from one type of factor (level) to another (slope/curvature) implies a different conditional variance for $g_t$, making the likelihood no longer equivalent. More importantly, the economic implications that can be drawn from the model, such as evidence about the expectations hypothesis, term premia, estimates of conditional volatilities, and forecasts, will change. Changing the order of the eigenvalues within $\Phi^Q_g$ and/or $\Phi^Q_h$ results in the factors being reordered within each respective state vector. This results in an equivalent global maximum. The intuition is the same as re-ordering of the factors $g_t$ within a Gaussian ATSM.

11 We collect the autoregressive parameters together in matrices as

$$\Phi = \begin{pmatrix} \Phi_h & 0 \\ \Phi_{gh} & \Phi_g \end{pmatrix}, \qquad \Phi^Q = \begin{pmatrix} \Phi^Q_h & 0 \\ \Phi^Q_{gh} & \Phi^Q_g \end{pmatrix}.$$

In a non-Gaussian ATSM, it is not clear a priori which local maximum created by these

characteristics of the model will be the global maximum. To estimate a non-Gaussian model,

one must intentionally search each region that potentially has a local maximum and compare

their likelihood values. To illustrate this idea, we present different local maxima for the A1(3)

model corresponding to different signs of δ1,h and different orderings of the eigenvalues. We

report ΦQ, δ1,h, and log-likelihood values in Table 1. In the first column, ht is the level factor

and δ1,h is positive. This is the global maximum in this case. In our sample, volatility is high

during episodes where interest rates are high, so the level factor tends to explain the volatility

best and δ1,h is positive. The next two columns present what happens when ht is the slope or

curvature factor. Due to the nature of the data we are using, the likelihood function drops

significantly from the global maximum to these alternative local maxima. In theory, there

are six potentially different local maxima for each combination of eigenvalues and sign of

δ1,h but in practice there are only three well-behaved local maxima with the remaining local

maxima being locally unidentified. In summary, when estimating non-Gaussian models with both Gaussian and non-Gaussian factors, we recommend intentionally searching for each of the local maxima and comparing their log-likelihood values.
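The recommended search can be organized as an explicit enumeration of candidate regions. The sketch below lists the six combinations of which factor plays the role of $h_t$ and the sign of $\delta_{1,h}$, attaches the log-likelihoods reported in Table 1 to the three well-behaved maxima, and picks the largest. In an actual application each candidate would be the starting region for a separate optimization; here the values are simply copied from Table 1, and the variable names are illustrative.

```python
from itertools import product

roles = ["level", "slope", "curvature"]  # which factor acts as the volatility factor h_t
signs = [+1, -1]                         # sign of delta_{1,h}
candidates = list(product(roles, signs)) # six regions to search

# Log-likelihoods of the three well-behaved local maxima (Table 1).
table1_llf = {
    ("level", +1): 36647.69,
    ("slope", +1): 36442.15,
    ("curvature", -1): 36477.72,
}
# Regions without a well-behaved maximum are left without a value.
unresolved = [c for c in candidates if c not in table1_llf]

best = max(table1_llf, key=table1_llf.get)
drops = {c: round(table1_llf[best] - v, 2) for c, v in table1_llf.items()}
```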

6 Model comparison

In this section, we use our methodology to estimate several more ATSMs. We restrict attention to models with at least three factors. We impose the Feller condition $\nu_{h,i} > 1$ and $\nu^Q_{h,i} > 1$ for $i = 1, \ldots, H$ and do not impose restrictions on the covariance matrix $\Omega$.

Table 2: Maximum likelihood estimates for the A0(3) and A1(3) models.

Left panel: Gaussian A0(3) model (G = 3, H = 0), LLF = 37080.94. Entries are estimates with standard errors in parentheses.

$\mu_g$: 6.97e-05 (6.43e-05), -4.85e-05 (4.65e-05), -3.37e-04 (6.48e-05)
$\Phi_g$, row 1: 1.007 (0.011), 0.048 (0.016), 0.067 (0.040)
$\Phi_g$, row 2: -0.012 (0.008), 0.938 (0.016), 0.019 (0.047)
$\Phi_g$, row 3: -0.037 (0.010), -0.059 (0.018), 0.631 (0.051)
$\mu^Q_g$: 0, 0, 0
$\delta_0$: 0.0083 (0.0005)
$\Phi^Q_g$: 0.995 (0.0007), 0.954 (0.003), 0.530 (0.029)
$\Sigma_{0,g}$, row 1: 3.99e-04 (2.52e-05), 0, 0
$\Sigma_{0,g}$, row 2: -3.09e-04 (3.83e-05), 5.09e-04 (3.71e-05), 0
$\Sigma_{0,g}$, row 3: -4.50e-06 (9.60e-06), -2.52e-04 (2.69e-05), 3.78e-04 (2.38e-05)
$\delta_{1,g}$: 1, 1, 1

Right panel: non-Gaussian A1(3) model (G = 2, H = 1), LLF = 36647.69. Entries are estimates with standard errors in parentheses.

$\Sigma_h \nu_h$: 2.97e-05
$\mu_g$: -1.32e-05 (1.10e-04), 3.32e-05 (1.53e-05)
$\nu_h$: 1.934 (0.124)
$\Phi_h$: 0.994 (0.004)
$\Phi_{gh}$: 0.008 (0.039), -0.041 (0.036)
$\Phi_g$, row 1: 0.985 (0.055), 0.066 (0.143)
$\Phi_g$, row 2: -0.073 (0.036), 0.643 (0.087)
$\Sigma_h \nu^Q_h$: 4.09e-05
$\mu^Q_g$: 0, 0
$\delta_0$: -0.0011 (0.0004)
$\nu^Q_h$: 2.637 (0.417)
$\Phi^Q_h$: 0.996 (0.0009)
$\Phi^Q_g$: 0.951 (0.003), 0.536 (0.033)
$\Sigma_h$: 1.55e-05 (1.60e-06)
$\Sigma_{gh}$: -0.893 (0.104), 0.054 (0.100)
$\Sigma_{0,g}$, row 1: 8.20e-09 (5.58e-08), 0
$\Sigma_{0,g}$, row 2: -1.90e-09 (1.23e-07), 1.08e-08 (7.33e-08)
$\Sigma_{1,g}$, row 1: 0.0063 (0.0005), 0
$\Sigma_{1,g}$, row 2: -0.0035 (0.0003), 0.0046 (0.0003)
$\delta_{1,h}$: 1
$\delta_{1,g}$: 1, 1

Maximum likelihood estimates with asymptotic robust standard errors. Left: Gaussian A0(3) model. Right: non-Gaussian A1(3) model. The restrictions $\mu^Q_g = 0$, $\delta_{1,g} = \iota$, and $\delta_{1,h} = 1$ are imposed during estimation.

6.1 Cross section

6.1.1 Three factor models

Besides the A1(3) model demonstrated in Section 5, we estimate the three factor Gaussian model that is popular in the macro-finance literature. As in A1(3) models, the number of parameters that must be maximized numerically in the A0(3) model drops dramatically when using the concentrated likelihood instead of the original likelihood. It drops by more than half, from 22 to 10.12

We report parameter estimates and asymptotic robust standard errors (see Hamilton (1994), equation 5.8.7) for both models in Table 2. An interesting feature of the results is how the estimated values of $\Phi^Q$ are practically identical across both models and consequently so are the bond loadings. This implies that the estimated latent factors (level, slope and curvature) are also identical; see Figure 2 for the A0(3) and A1(3) models. The correlation between each of the respective factors is 0.999. The only noticeable difference is the level of the level factor, which is forced to be positive in the A1(3) model because it is the non-Gaussian state variable.

12 The same improvement is also achieved in Joslin, Singleton, and Zhu (2011).

As both the factors and bond loadings are identical, the cross-sectional component $p\left(Y^{(2)}_t \mid x_t; \theta\right)$ of the likelihood (9) is the same for the A0(3) and A1(3) models even with

Jensen’s inequality taken into account. This implies that when economic conclusions (e.g.

market prices of risk and term premia) differ across Gaussian and non-Gaussian three factor

models these differences are driven primarily by each model’s respective time series proper-

ties. In Section 6.2, we discuss the differences in term premia and conditional volatility in

more detail.

The fact that the A0(3) and A1(3) models fit the cross-section of yields equally well can be explained by two things. First, the bond loading recursions for Gaussian and non-Gaussian models in (6) and (7) are the same up to Jensen's inequality. The Jensen's inequality terms are small empirically. Second, the magnitude of the cross-sectional measurement errors $\Omega$ is much smaller than that of the dynamics $\Sigma_{i,g}\Sigma_{i,g}'$. These matrices are key components of the information matrix, which determines how much emphasis MLE gives to each component. An efficient estimator, such as maximum likelihood, prioritizes the greater information (large $\Omega^{-1}$) in the cross-section and chooses the parameters that enter the bond loadings $B$, i.e. $\Phi^Q$, to match that feature of the data. As a result, the estimated values of $\Phi^Q$ in both models are practically identical; see Table 2. The estimates of $\Phi^Q$ from Gaussian models therefore provide excellent starting values when estimating non-Gaussian models. However, as discussed in Section 5.2, the likelihood still has multiple modes depending on which factors (level, slope, or curvature) are Gaussian and which factors are non-Gaussian. As the cross-sectional component $p\left(Y^{(2)}_t \mid x_t; \theta\right)$ of the likelihood (9) is the same (this can be easily seen from Table 1 across different local maxima), which factors act as volatility is pinned down by the time series portion of the likelihood.

Figure 2: Estimated latent factors for the A0(3) and A1(3) models. Top row: Gaussian A0(3) model (G = 3, H = 0) with, from left to right, the first $g_{1t}$, second $g_{2t}$ and third $g_{3t}$ factors. Bottom row: non-Gaussian A1(3) model (G = 2, H = 1) with, from left to right, the first $h_t$, second $g_{1t}$, and third $g_{2t}$ factors. Horizontal axes run from 1960 to 2010.

6.1.2 Four factor models

Next, we consider two four factor models: the Gaussian A0(4) model and the non-Gaussian A1(4) model. There are a total of 35 parameters in the A0(4) model and only 20 of these parameters enter the numerical optimizer, while there are 39 parameters in the A1(4) model and 15 of these can be concentrated out. When estimating the A1(4) model, we found that $\Phi^Q_g$ had a pair of repeated eigenvalues, requiring the use of the Jordan decomposition for $\Phi^Q_g$.13 We will explain this in more detail in Section 6.3 below.

13 With two repeated eigenvalues in the Gaussian factors, the Jordan decomposition of $\Phi^Q_g$ in the A1(4) model becomes

$$\Phi^Q_g = \begin{pmatrix} \lambda_1 & 1 & 0 \\ 0 & \lambda_1 & 0 \\ 0 & 0 & \lambda_2 \end{pmatrix},$$

where $\lambda_1$ and $\lambda_2$ are the unique eigenvalues.

Table 3: Maximum likelihood estimates for the A0(4) and A1(4) models.

Left panel: Gaussian A0(4) model (G = 4, H = 0), LLF = 37195.84. Entries are estimates with standard errors in parentheses.

$\mu_g$: 3.89e-04 (1.68e-04), -9.73e-04 (2.20e-04), 1.28e-03 (2.66e-04), -8.19e-04 (3.64e-04)
$\Phi_g$, row 1: 1.034 (0.038), 0.087 (0.058), -0.014 (0.016), 0.091 (0.034)
$\Phi_g$, row 2: -0.078 (0.085), 0.834 (0.111), 0.204 (0.120), -0.079 (0.083)
$\Phi_g$, row 3: 0.0890 (0.112), 0.154 (0.177), 0.694 (0.174), 0.1926 (0.148)
$\Phi_g$, row 4: -0.085 (0.073), -0.141 (0.118), -0.040 (0.093), 0.546 (0.124)
$\mu^Q_g$: 0, 0, 0
$\Phi^Q_g$: 0.992 (0.003), 0.960 (0.013), 0.876 (0.033), 0.696 (0.043)
$\Sigma_{0,g}$, row 1: 6.94e-04 (2.29e-04), 0, 0, 0
$\Sigma_{0,g}$, row 2: -1.47e-03 (3.24e-04), 9.77e-04 (4.71e-04), 0, 0
$\Sigma_{0,g}$, row 3: 1.66e-03 (2.24e-04), -1.39e-03 (4.43e-04), 8.74e-04 (3.23e-04), 0
$\Sigma_{0,g}$, row 4: -8.06e-04 (3.47e-04), 5.65e-04 (1.90e-04), -7.18e-04 (3.53e-04), 4.04e-04 (2.91e-05)
$\delta_0$: 3.90e-03 (7.07e-04)
$\delta_{1,g}$: 1, 1, 1, 1

Right panel: non-Gaussian A1(4) model (G = 3, H = 1), LLF = 36729.22. Entries are estimates with standard errors in parentheses.

$\Sigma_h \nu_h$: 5.64e-05
$\mu_g$: 3.07e-04 (3.13e-04), -3.76e-05 (3.39e-05), -3.10e-04 (3.15e-04)
$\nu_h$: 2.6778 (2.2549)
$\Phi_h$: 0.990 (0.008)
$\Phi_{gh}$: 0.004 (0.006), 0.002 (0.003), -0.035 (0.011)
$\Phi_g$, row 1: 0.867 (0.041), 1.170 (1.210), 0.091 (0.224)
$\Phi_g$, row 2: 0.015 (0.003), 0.816 (0.106), 0.0028 (0.017)
$\Phi_g$, row 3: -0.009 (0.039), -0.124 (1.050), 0.651 (0.199)
$\Sigma_h \nu^Q_h$: 2.71e-05
$\mu^Q_g$: 0, 0, 0
$\nu^Q_h$: 1.284 (0.338)
$\Phi^Q_h$: 0.995 (0.002)
$\Phi^Q_g$: 0.912 (0.015), –, 0.702 (0.054)
$\Sigma_h$: 2.11e-05 (6.28e-06)
$\Sigma_{gh}$: 0.925 (0.826), -0.217 (0.057), -1.563 (0.723)
$\Sigma_{0,g}$, row 1: 8.37e-04 (5.71e-04), 0, 0
$\Sigma_{0,g}$, row 2: -9.03e-05 (5.25e-05), 6.85e-13 (2.24e-05), 0
$\Sigma_{0,g}$, row 3: -7.96e-04 (5.51e-04), 9.43e-11 (9.31e-05), 4.04e-10 (3.77e-05)
$\Sigma_{1,g}$, row 1: 1.04e-02 (3.42e-03), 0, 0
$\Sigma_{1,g}$, row 2: -6.75e-04 (2.97e-04), 8.15e-04 (1.01e-04), 0
$\Sigma_{1,g}$, row 3: -9.01e-03 (3.57e-03), 1.12e-03 (7.67e-04), 4.68e-03 (3.59e-04)
$\delta_0$: -4.43e-04 (3.98e-04)
$\delta_{1,h}$: 1
$\delta_{1,g}$: 1, 1, 1

Maximum likelihood estimates with asymptotic standard errors. Left: Gaussian A0(4) model. Right: non-Gaussian A1(4) model. The restrictions $\mu^Q_g = 0$, $\delta_{1,g} = \iota$, and $\delta_{1,h} = 1$ are imposed during estimation.

Parameter estimates and asymptotic robust standard errors for both models are in Table

3. The A1(4) model has three fewer parameters in the conditional mean and six more

parameters in the conditional variance. Output from these two models is comparable just as

it was for the two three factor models. The largest and smallest values in ΦQ are close across

both models. The two middle elements in the matrix ΦQ of the A1(4) model have been

imposed to be the same (repeated real eigenvalues). The estimated value of this parameter

is roughly the average of the two middle estimates in the matrix ΦQ of the A0(4) model.

The overall magnitude of the cross-sectional likelihood $p\left(Y^{(2)}_t \mid x_t; \theta\right)$ is still basically the same, indicating that the primary differences across the models are found in the time series

component of the likelihood. When repeated real eigenvalues are imposed on the A0(4)

model, the estimated log-likelihood is 37194.35. The Gaussian and non-Gaussian models

then have equivalent factors.

6.1.3 Information in the cross section

The previous sections point to the fact that MLE prioritizes the cross section, and the elements of $\Phi^Q$ for both models are estimated with high precision. The primary source of this precision is the high-powered polynomial functions of $\Phi^Q$ in the bond loadings $B$ in (6) and (7). The polynomial functions are sensitive to small changes in these parameters, especially as $\Phi^Q$ gets closer to one. This is why $\Phi^Q$ is estimated precisely, especially for the level factor. Given this argument, a natural question is whether it is necessary to have the whole yield curve or just a handful of yields in order to get a precise description of the cross section.
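A stylized calculation illustrates why the loadings are so sensitive to persistence near one. The sketch below assumes a single factor with Q-persistence `phi`, a unit short-rate loading, and no Jensen's inequality term, so that the loading of the n-period yield is the average of the powers of `phi`. This simplified form is an assumption made for illustration and is not the recursion in (6) and (7); the eigenvalues used are the rounded estimates from Table 2.

```python
import numpy as np

def yield_loading(phi, n):
    """Loading of the n-period yield on a factor with Q-persistence phi."""
    return np.sum(phi ** np.arange(n)) / n

def loading_sensitivity(phi, n):
    """Derivative of yield_loading with respect to phi."""
    j = np.arange(1, n)
    return np.sum(j * phi ** (j - 1)) / n

maturities = [1, 3, 12, 24, 36, 48, 60]   # months, as in the data section
for phi in (0.996, 0.951, 0.536):         # rounded Q eigenvalues from Table 2
    loads = [round(float(yield_loading(phi, n)), 3) for n in maturities]
    print(phi, loads, round(float(loading_sensitivity(phi, 60)), 2))
```

Under these assumptions, the derivative of the 60-month loading is far larger for the most persistent factor than for the least persistent one, which is consistent with the level eigenvalue being the most precisely estimated.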

From the discussion on identification for Gaussian ATSMs in Hamilton and Wu (2012b), we know that only four yields (one in the cross section $Y^{(2)}_t$) are required to estimate Gaussian models with three factors. A similar analysis of identification for AH(3) models indicates that the same is true for any number of volatility factors. Having more than one yield available from the cross-section does not identify more Q parameters. Additional yields only provide over-identifying information. Do they increase the precision of the parameter estimates? If so, by how much?

We run an experiment where we estimate the model using all possible combinations of subsets of yields in $Y^{(2)}_t$. Our goal is to study the incremental information in these yields and we use the size of the standard errors as our proxy. We use the A0(3) model for demonstration. With only one yield included in $Y^{(2)}_t$, the standard errors for the largest eigenvalue in $\Phi^Q_g$ range from 0.00078 to around 0.0028, as opposed to 0.00071 when the model is estimated with all the yields included. Even the largest standard error is still small compared to the point estimate of 0.995 and it is an order of magnitude smaller than the uncertainty of the P counterpart. The smallest standard error is obtained when $Y^{(2)}_t$ includes only the 3 year yield and the largest is when it includes the 3 month yield. To pin down the cross section with greater precision, it helps to spread out the yields between $Y^{(1)}_t$ and $Y^{(2)}_t$. Once we include two yields in $Y^{(2)}_t$, the standard errors are smaller and exhibit less variability. The smallest standard error for the largest eigenvalue is 0.00071 if we include both the 3 month and 3 year yields, which is the same as in Table 2. The overall message is that a handful of yields in the cross section contain a large amount of information because of the high power polynomial. This contrasts with the popular view that the superior information comes from the large number of cross-sectional yields at any time period.

6.2 Time series

The class of ATSMs defines a set of non-nested models, making direct comparison of the models based on likelihood ratio statistics infeasible. However, by any information criterion, Gaussian models are preferred. This is because Gaussian models such as the A0(3) and A0(4) have higher log-likelihood values and fewer parameters than their respective A1(3) and A1(4) counterparts.
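The comparison can be made explicit with standard information criteria. The sketch below evaluates AIC and BIC from the log-likelihoods in Tables 2 and 3 and the total parameter counts quoted in the text (22, 24, 35 and 39). Using T = 721 as the sample size in the BIC penalty is an assumption made for this illustration.

```python
import math

# (log-likelihood, total number of parameters) as reported in the text and tables.
models = {
    "A0(3)": (37080.94, 22),
    "A1(3)": (36647.69, 24),
    "A0(4)": (37195.84, 35),
    "A1(4)": (36729.22, 39),
}
T = 721  # monthly observations; used as the BIC sample size (an assumption)

def aic(llf, k):
    """Akaike information criterion; smaller is better."""
    return 2 * k - 2 * llf

def bic(llf, k, n_obs):
    """Bayesian information criterion; smaller is better."""
    return k * math.log(n_obs) - 2 * llf

criteria = {name: (aic(llf, k), bic(llf, k, T)) for name, (llf, k) in models.items()}
for name, (a, b) in criteria.items():
    print(name, round(a, 2), round(b, 2))
```

Because each Gaussian model has both a higher log-likelihood and fewer parameters than its non-Gaussian counterpart, any criterion of the form "penalty increasing in the parameter count minus twice the log-likelihood" ranks it first.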

In the remainder of this section, we focus on the two aspects of the time series in which Gaussian and non-Gaussian models differ, namely their conditional means and variances under P. Non-Gaussian ATSMs add parameters to the model to make the conditional variance time-varying but impose restrictions on the conditional mean. The restriction that non-Gaussian models impose comes from the fact that the conditional mean of $h_{t+1} \mid I_t$ does not depend on $g_t$ by construction, i.e., the matrices $\Phi$ and $\Phi^Q$ both contain blocks of zeros. The economic implication of this restriction (using the terminology of three factor models) is that the level factor does not depend on the past values of the slope and curvature factors. This is counterintuitive. If the slope is high in the last period, i.e., the long rate is much higher than the short rate (more than explained by compensating for risk), then the market expects the short rate to increase in the future. On average, the next period's short rate or level will increase. A similar point has been made previously by Duffee (2002).

Next, we compare the impact of the restrictions on the conditional mean using the term

premium, which is a popular and important measure for monetary policy. The term premium

measures the additional compensation a risk averse agent needs to hold the risky asset relative

to a risk-neutral agent. The term premium is defined as the difference between the model

implied yield ynt and the average of expected future short rates over the same period

rpnt ≡ ynt − ynt , ynt =1

nEt (rt + rt+1 + . . .+ rt+n−1)

see, e.g. Cochrane and Piazzesi(2008).14 Defining risk compensation in terms of the term

premium has the nice feature that it is invariant to the rotation of the state vector unlike the

market prices of risk. The restrictions on the conditional mean have a large impact on the

flexibility of the term premia for non-Gaussian models. Figure 3 plots the one year and five

year term premia for all four models. At short horizons, the difference in the term premia

between theA0(·) andA1(·) models is small but it grows larger as the horizon increases. Over

14The solution to this expectation is

ynt =1

n

(nδ0 + δ′1

[(n− 1)I + (n− 2)Φ + . . .+ Φ(n−2)

]µ+ δ′1

[I + Φ + Φ2 + . . .+ Φ(n−1)

]xt

).

32

Figure 3: One year and five year term premia

1960 1970 1980 1990 2000 20100

0.5

1

1.5

2

2.5

3

3.5

4

Term premia 1 yr

A

0(3)

A1(3)

1960 1970 1980 1990 2000 20100

0.5

1

1.5

2

2.5

3

3.5

4

Term premia 5 yr

A

0(3)

A1(3)

1960 1970 1980 1990 2000 20100

0.5

1

1.5

2

2.5

3

3.5

4

Term premia 1 yr

A

0(4)

A1(4)

1960 1970 1980 1990 2000 20100

0.5

1

1.5

2

2.5

3

3.5

4

Term premia 5 yr

A

0(4)

A1(4)

Estimated term premia at the one and five year horizons. Top left: 1 year term premia from the A0(3) and

A1(3) models. Top right: 5 year term premia from the A0(3) and A1(3) models. Bottom left: 1 year term

premia from the A0(4) and A1(4) models. Bottom right: 5 year term premia from the A0(4) and A1(4)

models.

longer horizons, the fact that the most persistent of the factors in the A0(·) model reverts to

its unconditional mean faster implies that there is more variation in term premium from one

period to the next. In particular, the A0(3) and A1(3) models have substantially different

implications for the 5 year term premium.
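The average of expected future short rates that enters the term premium can be checked numerically. The sketch below is illustrative only: it assumes a short rate $r_t = \delta_0 + \delta_1' x_t$ and a conditional mean $E_t(x_{t+1}) = \mu + \Phi x_t$, as in footnote 14, and the function names and parameter values are hypothetical rather than taken from the paper's code. It compares the closed-form expression with a brute-force iteration of the conditional mean.

```python
import numpy as np

def avg_expected_short_rate(n, delta0, delta1, mu, Phi, x_t):
    """Closed form of (1/n) E_t(r_t + ... + r_{t+n-1}) from footnote 14."""
    k = len(mu)
    mu_weights = np.zeros((k, k))  # (n-1)I + (n-2)Phi + ... + Phi^(n-2)
    x_weights = np.zeros((k, k))   # I + Phi + ... + Phi^(n-1)
    power = np.eye(k)
    for j in range(n):
        x_weights += power
        mu_weights += (n - 1 - j) * power
        power = power @ Phi
    return (n * delta0 + delta1 @ mu_weights @ mu + delta1 @ x_weights @ x_t) / n

def avg_expected_short_rate_bruteforce(n, delta0, delta1, mu, Phi, x_t):
    """Same quantity, obtained by iterating E_t(x_{t+j}) forward one step at a time."""
    total = 0.0
    x_exp = np.array(x_t, dtype=float)
    for _ in range(n):
        total += delta0 + delta1 @ x_exp
        x_exp = mu + Phi @ x_exp
    return total / n

def term_premium(model_yield, n, delta0, delta1, mu, Phi, x_t):
    """Term premium: model-implied yield minus the average expected short rate."""
    return model_yield - avg_expected_short_rate(n, delta0, delta1, mu, Phi, x_t)
```

Given a model-implied yield, `term_premium` then returns the difference defined in the text; how the yield itself is computed depends on the bond loadings and is not shown here.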

The estimated conditional volatilities of yields from the A0(3) and A1(3) models are in Figure 4, while the estimated conditional volatilities of yields from the four factor models have the same qualitative features and are not shown. To provide a point of comparison, we also plot in these graphs estimates of the conditional volatilities from the multivariate generalized autoregressive score (GAS) model of Creal, Koopman, and Lucas (2011) and Creal, Koopman, and Lucas (2013).$^{15}$ We find that A1(·) models are able to capture the broad trend of volatilities, whereas A0(·) models have constant volatility by construction. However, the estimated volatilities from A1(·) models are much less volatile than the GAS volatilities. A similar observation for the A1(3) and A1(4) models was made by Collin-Dufresne, Goldstein, and Jones (2009) using univariate GARCH models. The intuition is that in A1(·) models the non-Gaussian state variable $h_t$ serves a dual role: it is the level factor as well as the volatility factor. The maximum likelihood estimator chooses the parameter vector θ to fit the conditional mean first before fitting the conditional variance.

15 The generalized autoregressive score model with time-varying covariance matrix is similar to a multivariate GARCH model. To make the volatilities of yields comparable across models, we use a VAR(1) for the conditional mean of yields $Y^{(1)}_t$ and allow the errors to have time-varying volatilities and correlations.

Figure 4: Estimated conditional volatility of yields in the A1(3) model.

[Figure: four panels over 1960 to 2010. Panel titles: "Volatilities: 1 mth", "Volatilities: 1 yr", "Variance ht compared to the 1 mth and 5 yr yield", and "Volatilities: 5 yr". Series in the volatility panels: A0(3), A1(3), GAS model.]

Estimated conditional volatility of the yields measured without error $Y^{(1)}_t$ for the A0(3) and A1(3) models compared with the multivariate generalized autoregressive score model. Top left: volatility of the one month yield. Top right: volatility of the one year yield. Bottom left: variance $h_t$ compared to the 1 month and 5 year yields. Bottom right: volatility of the five year yield.

Table 4: Repeated Eigenvalues

            Local 1     Local 2     Local 3     Repeated
ΦQh         0.9952      0.9952      0.9952      0.9952
ΦQg         0.9130      0.9164      0.9126      0.9121
            0.9112      0.9075      0.9115      –
            0.7021      0.7025      0.7021      0.7021
LLF         36729.22    36729.20    36729.22    36729.22

In the bottom left panel of Figure 4, we also compare the variance factor ht together

with the short rate (1 month yield) and the long rate (5 year). Contrary to the conventional

wisdom that the volatility of interest rates is driven by the short rate, we find that the

variance factor mimics the movement of the long term interest rate more closely with a

correlation between ht and the 5 year yield of 97.6%.

6.3 Repeated eigenvalues

For identification of the parameters of the model, a necessary condition is that B1 is invert-

ible. When ΦQ has repeated eigenvalues, the matrix B1 will be singular and one element

in ΦQ is unidentified. For a matrix with repeated eigenvalues, the Jordan decomposition

imposes the restrictions necessary to obtain identification.

An example of repeated real eigenvalues occurs when we estimate the A1(4) model on

the Fama-Bliss data set. When the A1(4) model is estimated without imposing repeated

eigenvalues, the model produces identical likelihoods for different sets of parameter vectors

θ. We report some of these optima in Table 4. Across the three local maxima, the values of $\Phi^Q_h$, the last eigenvalue of $\Phi^Q_g$, and the log-likelihood function are almost identical. But the first two eigenvalues of $\Phi^Q_g$ vary across different optima. This empirical finding indicates the existence of repeated eigenvalues, so that not all the parameters are identified using the diagonal form of $\Phi^Q$. The last column of Table 4 shows the results when we impose repeated

eigenvalues. The estimates of ΦQh , the last eigenvalue of ΦQg and the likelihood function

have the same values as before. However, the first two eigenvalues of ΦQg are identical by

definition and are equal to the average of the first two eigenvalues in those local maxima.

The log-likelihood value also does not change.
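As a quick arithmetic check against the entries of Table 4, the averages of the first two eigenvalues of $\Phi^Q_g$ at each local maximum can be compared with the repeated eigenvalue 0.9121 in the last column:

```latex
\begin{aligned}
\text{Local 1:}\quad & \tfrac{1}{2}(0.9130 + 0.9112) = 0.91210,\\
\text{Local 2:}\quad & \tfrac{1}{2}(0.9164 + 0.9075) = 0.91195,\\
\text{Local 3:}\quad & \tfrac{1}{2}(0.9126 + 0.9115) = 0.91205.
\end{aligned}
```

Each average agrees with 0.9121 to within about $2 \times 10^{-4}$, which is consistent with the rounding of the reported estimates.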

7 Conclusion

We generalize the class of discrete-time non-Gaussian ATSMs by allowing for any admissible rotation of the non-Gaussian state variables, and we provide a new approach to estimate them. The new estimation approach leverages the fact that many of the parameters (i.e., the P parameters) can be concentrated out of the likelihood function. Our method improves the estimation dramatically by reducing the number of parameters over which the likelihood must be maximized numerically. At the same time, the parameters that are concentrated out are those that can cause numerical problems due to the near unit-root nature of interest rates. We illustrate that our method speeds up estimation by more than 60 times and finds maxima consistently. We also explain why there exist non-equivalent local maxima and describe their different economic implications. Using our new method, we find that Gaussian and non-Gaussian models fit the cross-section of yields equally well. Differences in the economic implications of these models come from their relative fit of the time series. Finally, we explain where the superior cross-sectional information comes from, and demonstrate that it is not necessarily due to observing a large number of cross-sectional yields.

Our method can be used for any rotation of the state variables implied by the identifying restrictions. The methods can be applied to ATSMs that include observable macroeconomic variables or hidden factors, and allow for restrictions on the key parameters of interest. The fact that our method can be implemented successfully without relying on any specific rotation is critical for non-Gaussian models. Unlike Gaussian models where all the factors are symmetric, rotations of the factors in non-Gaussian models do not result in equivalent local maxima.


References

Abramowitz, Milton, and Irene A. Stegun (1964) Handbook of Mathematical Functions. Dover Publications Inc, New York, NY.

Aït-Sahalia, Yacine, and Robert L. Kimmel (2010) “Estimating affine multifactor term structure models using closed-form likelihood expansions.” Journal of Financial Economics 98, 113–144.

Ang, Andrew, and Monika Piazzesi (2003) “A no-arbitrage vector autoregression of term structure dynamics with macroeconomic and latent variables.” Journal of Monetary Economics 50, 745–787.

Bauer, Michael D. (2011) “Bayesian Estimation of Dynamic Term Structure Models under Restrictions on Risk Pricing.” Federal Reserve Bank of San Francisco Working Paper 2011-03.

Bauer, Michael D., Glenn D. Rudebusch, and Jing Cynthia Wu (2012) “Correcting estimation bias in dynamic term structure models.” Journal of Business and Economic Statistics 30, 454–467.

Cheridito, Patrick, Damir Filipovic, and Robert L. Kimmel (2007) “Market price of risk specifications for affine models: theory and evidence.” Journal of Financial Economics 84, 123–170.

Christensen, Jens H.E., Francis X. Diebold, and Glenn D. Rudebusch (2011) “The affine arbitrage-free class of Nelson-Siegel term structure models.” Journal of Econometrics 164, 4–20.

Cochrane, John H., and Monika Piazzesi (2008) “Decomposing the yield curve.” Unpublished manuscript, Booth School of Business, University of Chicago.

Collin-Dufresne, Pierre, Robert S. Goldstein, and Charles Jones (2008) “Identification of maximal affine term structure models.” The Journal of Finance 63, 743–795.

Collin-Dufresne, Pierre, Robert S. Goldstein, and Charles Jones (2009) “Can the volatility of interest rates be extracted from the cross section of bond yields? An investigation of unspanned stochastic volatility.” Journal of Financial Economics 94, 47–66.

Cox, John C., Jonathan E. Ingersoll, and Stephen A. Ross (1985) “A theory of the term structure of interest rates.” Econometrica 53, 385–407.

Creal, Drew D., Siem Jan Koopman, and Andre Lucas (2011) “A dynamic multivariate heavy-tailed model for time-varying volatilities and correlations.” Journal of Business and Economic Statistics 29, 552–563.

Creal, Drew D., Siem Jan Koopman, and Andre Lucas (2013) “Generalized autoregressive score models with applications.” Journal of Applied Econometrics 28, 777–795.

Dai, Qiang, and Kenneth J. Singleton (2000) “Specification analysis of affine term structure models.” The Journal of Finance 55, 1943–1978.

de Jong, Piet (1991) “The diffuse Kalman filter.” The Annals of Statistics 19, 1073–83.

Diebold, Francis X., and Glenn D. Rudebusch (2013) Yield Curve Modeling and Forecasting. Princeton University Press, Princeton, NJ.

Duffee, Gregory R. (2002) “Term premia and interest rate forecasts in affine models.” The Journal of Finance 57, 405–443.

Duffee, Gregory R. (2011) “Information in (and not in) the term structure.” The Review of Financial Studies 24, 2895–2934.

Duffee, Gregory R. (2012) “Bond pricing and the macroeconomy.” Working paper, Johns Hopkins University.

Duffie, Darrell, and Rui Kan (1996) “A yield factor model of interest rates.” Mathematical Finance 6, 379–406.

Durbin, James, and Siem Jan Koopman (2012) Time Series Analysis by State Space Methods. Oxford University Press, Oxford, UK, 2nd edition.

Fama, Eugene F., and Robert R. Bliss (1987) “The information in long maturity forward rates.” American Economic Review 77, 680–692.

Gourieroux, Christian, and Joann Jasiak (2006) “Autoregressive gamma processes.” Journal of Forecasting 25, 129–152.

Gourieroux, Christian, Joann Jasiak, and Razvan Sufana (2009) “The Wishart autoregressive process of multivariate stochastic volatility.” Journal of Econometrics 150, 167–181.

Gourieroux, Christian, and Alain Monfort (1995) Statistics and Econometric Models, volume 1. Cambridge University Press, Cambridge, UK.

Gurkaynak, Refet S., and Jonathan H. Wright (2012) “Macroeconomics and the term structure.” Journal of Economic Literature 50, 331–367.

Hamilton, James D. (1994) Time Series Analysis. Princeton University Press, Princeton, NJ.

Hamilton, James D., and Jing Cynthia Wu (2012a) “The effectiveness of alternative monetary policy tools in a zero lower bound environment.” Journal of Money, Credit, and Banking 44 (s1), 3–46.

Hamilton, James D., and Jing Cynthia Wu (2012b) “Identification and estimation of Gaussian affine term structure models.” Journal of Econometrics 168, 315–331.

Hamilton, James D., and Jing Cynthia Wu (2012c) “Testable implications of affine term structure models.” Journal of Econometrics, forthcoming.

Joslin, Scott, Kenneth J. Singleton, and Haoxiang Zhu (2011) “A new perspective on Gaussian affine term structure models.” The Review of Financial Studies 24, 926–970.

Kim, Don H., and Athanasios Orphanides (2005) “Term structure estimation with survey data on interest rate forecasts.” Federal Reserve Board, Finance and Economics Discussion Series 2005-48.

Kim, Don H., and Jonathan H. Wright (2005) “An arbitrage-free three-factor term structure model and the recent behavior of long-term yields and distant-horizon forward rates.” Federal Reserve Board, Finance and Economics Discussion Series 2005-33.

Le, Anh, Kenneth J. Singleton, and Qiang Dai (2010) “Discrete-time affine term structure models with generalized market prices of risk.” The Review of Financial Studies 23, 2184–2227.

Litterman, Robert, and Jose Scheinkman (1991) “Common factors affecting bond returns.” The Journal of Fixed Income 1, 54–61.

Piazzesi, Monika (2010) “Affine term structure models.” In Handbook of Financial Econometrics, edited by Y. Ait-Sahalia and L. P. Hansen. Elsevier, New York, pages 691–766.

Wright, Jonathan H. (2011) “Term premia and inflation uncertainty: empirical evidence from an international panel dataset.” American Economic Review 101(4), 1514–1534.

Appendix A Distributions

We start by defining several of the distributions found in the paper, which are useful for implementing the

procedures in practice. The notation for these distributions is local to the appendix.

Appendix A.1 Gamma and multivariate gamma distributions

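A univariate gamma r.v. $w_{t+1} \sim \text{Gamma}(\nu_h, \kappa)$ has p.d.f.

$$p(w_{t+1}|\nu_h, \kappa) = \frac{1}{\Gamma(\nu_h)}\, w_{t+1}^{\nu_h - 1}\, \kappa^{-\nu_h} \exp\left(-\frac{w_{t+1}}{\kappa}\right)$$

and Laplace transform $E[\exp(u w_{t+1})] = \left(\frac{1}{1-\kappa u}\right)^{\nu_h}$, which exists only if $\kappa u < 1$. The mean and variance are $E(w_{t+1}) = \nu_h\kappa$ and $V(w_{t+1}) = \nu_h\kappa^2$.

A multivariate gamma random vector $h_{t+1} \sim \text{Mult. Gamma}(\nu_h, \Sigma_h, \mu_h)$ can be obtained by shifting and rotating a vector of uncorrelated gamma r.v.'s. It can be written as $h_{t+1} = \mu_h + \Sigma_h w_{t+1}$, where $w_{t+1}$ is an $H \times 1$ vector with elements $w_{i,t+1} \sim \text{Gamma}(\nu_{h,i}, 1)$ for $i = 1, \ldots, H$. The $H \times 1$ vector of (non-negative) location parameters is $\mu_h$, $\Sigma_h$ is a full rank $H \times H$ matrix of (non-negative) scale parameters, and $\nu_h > 0$ is an $H \times 1$ vector of shape parameters. The p.d.f. of $h_{t+1}$ can be determined by a standard change of variables

$$p(h_{t+1}|\nu_h, \Sigma_h, \mu_h) = |\Sigma_h^{-1}| \prod_{i=1}^{H} \frac{1}{\Gamma(\nu_{h,i})} \left(e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h]\right)^{\nu_{h,i}-1} \exp\left(-e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h]\right)$$

where $e_i$ is an $H \times 1$ unit vector that selects out the $i$-th element of a vector. The mean and variance are $E[h_{t+1}] = \mu_h + \Sigma_h\nu_h$ and $V[h_{t+1}] = \Sigma_h \text{diag}(\nu_h)\Sigma_h'$. The Laplace transform is

$$\begin{aligned}
E[\exp(u'h_{t+1})] &= \int_0^\infty \exp(u'h_{t+1})\, p(h_{t+1}|\nu_h, \Sigma_h, \mu_h)\, dh_{t+1} \\
&= \exp(u'\mu_h) \int_0^\infty \exp(u'\Sigma_h w_{t+1}) \prod_{i=1}^{H} \frac{1}{\Gamma(\nu_{h,i})}\, w_{i,t+1}^{\nu_{h,i}-1} \exp(-w_{i,t+1})\, dw_{t+1} \\
&= \exp(u'\mu_h) \prod_{i=1}^{H} \left(\frac{1}{1 - e_i'\Sigma_h'u}\right)^{\nu_{h,i}} \\
&= \exp\left(u'\mu_h - \sum_{i=1}^{H} \nu_{h,i} \log\left[1 - e_i'\Sigma_h'u\right]\right).
\end{aligned}$$

The Laplace transform exists only if $e_i'\Sigma_h'u < 1$ for $i = 1, \ldots, H$.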
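The relation between the Laplace transform and the stated moments can be illustrated numerically. The following sketch, with hypothetical function names and parameter values, evaluates the log Laplace transform of the multivariate gamma distribution and recovers the mean $\mu_h + \Sigma_h\nu_h$ and variance $\Sigma_h \text{diag}(\nu_h)\Sigma_h'$ from finite differences at $u = 0$. The finite-difference scheme is an illustrative choice, not something used in the paper.

```python
import numpy as np

def log_laplace_mult_gamma(u, nu, Sigma, mu):
    """log E[exp(u'h)] for h = mu + Sigma w with independent w_i ~ Gamma(nu_i, 1)."""
    c = Sigma.T @ u  # elements e_i' Sigma' u
    if np.any(c >= 1.0):
        raise ValueError("Laplace transform does not exist: need e_i' Sigma' u < 1")
    return u @ mu - nu @ np.log(1.0 - c)

def numerical_moments(nu, Sigma, mu, eps=1e-4):
    """Mean and covariance from central finite differences of the log transform at u = 0."""
    H = len(nu)
    f = lambda u: log_laplace_mult_gamma(u, nu, Sigma, mu)
    E = eps * np.eye(H)
    mean = np.array([(f(E[i]) - f(-E[i])) / (2 * eps) for i in range(H)])
    cov = np.array([[(f(E[i] + E[j]) - f(E[i] - E[j])
                      - f(-E[i] + E[j]) + f(-E[i] - E[j])) / (4 * eps ** 2)
                     for j in range(H)] for i in range(H)])
    return mean, cov

# Hypothetical parameter values, chosen only for illustration.
nu = np.array([1.5, 2.0])
Sigma = np.array([[0.5, 0.1], [0.0, 0.4]])
mu = np.array([0.1, 0.2])
mean_fd, cov_fd = numerical_moments(nu, Sigma, mu)
```

Here `mean_fd` and `cov_fd` should be close to `mu + Sigma @ nu` and `Sigma @ np.diag(nu) @ Sigma.T`, respectively.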


Appendix A.2 Multivariate non-central gamma distributions

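An $H \times 1$ non-central gamma (NCG) random vector $h_{t+1} \sim \text{Mult.-N.C.G.}(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h)$ is a Poisson mixture of multivariate gamma r.v.'s

$$\begin{aligned}
h_{t+1} &= \mu_h + \Sigma_h w_{t+1} \\
w_{i,t+1} &\sim \text{Gamma}(\nu_{h,i} + z_{i,t+1}, 1), \qquad i = 1, \ldots, H \\
z_{i,t+1} &\sim \text{Poisson}\left(e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t\right), \qquad i = 1, \ldots, H.
\end{aligned}$$

The process $h_t$ remains positive and well-defined as long as $\mu_h \geq 0$, $\Sigma_h^{-1}\Phi_h\Sigma_h \geq 0$, and the elements of $\Sigma_h$ are non-negative. The conditional mean and variance are

$$\begin{aligned}
E(h_{t+1}|h_t) &= (I_H - \Phi_h)\mu_h + \Sigma_h\nu_h + \Phi_h h_t \\
V(h_{t+1}|h_t) &= \Sigma_h \text{diag}\left(\nu_h - 2\Sigma_h^{-1}\Phi_h\mu_h\right)\Sigma_h' + \Sigma_h \text{diag}\left(2\Sigma_h^{-1}\Phi_h h_t\right)\Sigma_h'.
\end{aligned}$$

A standard multivariate NCG random variable (i.e. the discrete-time CIR process) is obtained by setting $\mu_h = 0$ and letting $\Sigma_h$ be a diagonal matrix. Further properties of the univariate NCG process are described in Gourieroux and Jasiak (2006).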
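The Poisson-mixture representation gives a direct way to simulate one step of the process. The sketch below draws $h_{t+1}$ given $h_t$ and compares sample moments with the conditional mean and variance formulas above. The function names and parameter values are hypothetical, and Monte Carlo is used here only as an illustrative check.

```python
import numpy as np

def simulate_ncg_step(h_t, nu, Phi, Sigma, mu, n_draws, rng):
    """Draw h_{t+1} | h_t from the multivariate NCG distribution (one row per draw)."""
    Sigma_inv = np.linalg.inv(Sigma)
    rate = Sigma_inv @ Phi @ (h_t - mu)  # Poisson intensities, equal to Sigma^{-1} Phi Sigma w_t
    z = rng.poisson(rate, size=(n_draws, len(nu)))
    w = rng.gamma(shape=nu + z, scale=1.0)
    return mu + w @ Sigma.T

def ncg_conditional_moments(h_t, nu, Phi, Sigma, mu):
    """Closed-form conditional mean and variance of h_{t+1} given h_t."""
    Sigma_inv = np.linalg.inv(Sigma)
    mean = (np.eye(len(nu)) - Phi) @ mu + Sigma @ nu + Phi @ h_t
    var = Sigma @ np.diag(nu + 2.0 * Sigma_inv @ Phi @ (h_t - mu)) @ Sigma.T
    return mean, var

# Hypothetical parameters; Phi is built so that Sigma^{-1} Phi Sigma = K is non-negative.
nu = np.array([1.5, 2.0])
Sigma = np.array([[0.5, 0.1], [0.0, 0.4]])
mu = np.array([0.1, 0.2])
K = np.array([[0.9, 0.02], [0.03, 0.8]])
Phi = Sigma @ K @ np.linalg.inv(Sigma)
h_t = mu + Sigma @ np.array([3.0, 4.0])

rng = np.random.default_rng(0)
draws = simulate_ncg_step(h_t, nu, Phi, Sigma, mu, 200_000, rng)
mean_cf, var_cf = ncg_conditional_moments(h_t, nu, Phi, Sigma, mu)
```

The sample mean `draws.mean(axis=0)` and sample covariance `np.cov(draws.T)` should be close to `mean_cf` and `var_cf` up to simulation error.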

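As long as $\Sigma_h$ has full rank, the p.d.f. can be found by integrating out the Poisson r.v.'s

$$\begin{aligned}
p(h_{t+1}|\nu_h, \Phi_h h_t, \Sigma_h, \mu_h) = {} & |\Sigma_h^{-1}| \exp\left(-\sum_{i=1}^{H} \left\{ e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h] + e_i'\Sigma_h^{-1}\Phi_h[h_t - \mu_h] \right\}\right) \\
& \times \prod_{i=1}^{H} \left(e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h]\right)^{\nu_{h,i}-1} \sum_{z_{i,t}=0}^{\infty} \frac{1}{\Gamma(\nu_{h,i} + z_{i,t})} \frac{1}{z_{i,t}!} \left[\left(e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_t - \mu_h]\right)\right]^{z_{i,t}}.
\end{aligned}$$

Using the definition of the modified Bessel function of the first kind,$^{16}$ the p.d.f. can be expressed as

$$\begin{aligned}
p(h_{t+1}|\nu_h, \Phi_h h_t, \Sigma_h, \mu_h) = {} & |\Sigma_h^{-1}| \exp\left(-\sum_{i=1}^{H} \left\{ e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h] + e_i'\Sigma_h^{-1}\Phi_h[h_t - \mu_h] \right\}\right) \\
& \times \prod_{i=1}^{H} \left(e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h]\right)^{\frac{\nu_{h,i}-1}{2}} \left(e_i'\Sigma_h^{-1}\Phi_h[h_t - \mu_h]\right)^{-\frac{\nu_{h,i}-1}{2}} I_{\nu_{h,i}-1}\left(2\sqrt{\left(e_i'\Sigma_h^{-1}[h_{t+1} - \mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_t - \mu_h]\right)}\right).
\end{aligned}$$

16 This is defined as $I_\lambda(x) = \left(\frac{x}{2}\right)^{\lambda} \sum_{z=0}^{\infty} \frac{1}{\Gamma(\lambda + z + 1)\, z!} \left(\frac{x^2}{4}\right)^{z}$, see Abramowitz and Stegun (1964).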
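To make the Poisson-mixture form of the density concrete, the sketch below evaluates it in the univariate case ($H = 1$) with a truncated series computed in logs, and then checks by midpoint quadrature that it integrates to one and has mean $\Sigma_h\nu_h + \Phi_h h_t$ when $\mu_h = 0$. The function names, the truncation length, the grid, and the parameter values are all hypothetical choices made for illustration.

```python
import math

def ncg_logpdf(h_next, h_prev, nu, phi, sigma, mu=0.0, n_terms=120):
    """Log density of a univariate NCG draw via the truncated Poisson-mixture series."""
    w = (h_next - mu) / sigma          # e_i' Sigma_h^{-1} [h_{t+1} - mu_h]
    lam = phi * (h_prev - mu) / sigma  # e_i' Sigma_h^{-1} Phi_h [h_t - mu_h]
    log_terms = [z * math.log(w * lam) - math.lgamma(nu + z) - math.lgamma(z + 1)
                 for z in range(n_terms)]
    top = max(log_terms)               # log-sum-exp for numerical stability
    log_series = top + math.log(sum(math.exp(v - top) for v in log_terms))
    return -math.log(sigma) - w - lam + (nu - 1.0) * math.log(w) + log_series

def integrate_density(h_prev, nu, phi, sigma, upper=30.0, n_grid=3000):
    """Midpoint-rule estimates of the total mass and the mean of h_{t+1}."""
    step = upper / n_grid
    mass = 0.0
    mean = 0.0
    for k in range(n_grid):
        h = (k + 0.5) * step
        p = math.exp(ncg_logpdf(h, h_prev, nu, phi, sigma))
        mass += p * step
        mean += h * p * step
    return mass, mean

# Hypothetical parameter values.
mass, mean = integrate_density(h_prev=3.0, nu=2.0, phi=0.9, sigma=0.5)
```

With these values the conditional mean formula gives $0.5 \times 2 + 0.9 \times 3 = 3.7$, which `mean` should reproduce closely while `mass` stays close to one.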


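The Laplace transform can be derived from the law of iterated expectations

$$\begin{aligned}
E[\exp(u'h_{t+1})] &= E_z\left(E_{h|z}[\exp(u'h_{t+1})]\right) = E_z\left(\exp(u'\mu_h) \prod_{i=1}^{H} \left(\frac{1}{1 - e_i'\Sigma_h'u}\right)^{\nu_{h,i} + z_i}\right) \\
&= \exp(u'\mu_h) \prod_{i=1}^{H} \left(\frac{1}{1 - e_i'\Sigma_h'u}\right)^{\nu_{h,i}} E_z\left(\prod_{i=1}^{H} \left(\frac{1}{1 - e_i'\Sigma_h'u}\right)^{z_i}\right) \\
&= \exp(u'\mu_h) \prod_{i=1}^{H} \left(\frac{1}{1 - e_i'\Sigma_h'u}\right)^{\nu_{h,i}} \prod_{i=1}^{H} \exp\left(\left(e_i'\Sigma_h^{-1}\Phi_h[h_t - \mu_h]\right) \frac{e_i'\Sigma_h'u}{1 - e_i'\Sigma_h'u}\right) \\
&= \exp\left(u'\mu_h + \sum_{i=1}^{H} \frac{e_i'\Sigma_h'u}{1 - e_i'\Sigma_h'u}\, e_i'\Sigma_h^{-1}\Phi_h(h_t - \mu_h) - \sum_{i=1}^{H} \nu_{h,i} \log\left(1 - e_i'\Sigma_h'u\right)\right)
\end{aligned}$$

where $e_i'\Sigma_h'u$ denotes the $i$-th element of the $H \times 1$ vector $\Sigma_h'u$. The Laplace transform exists only if $e_i'\Sigma_h'u < 1$ for $i = 1, \ldots, H$.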

Appendix A.3 Mixture of Gaussian and mult. NCG distributions

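From standard results in statistics, the multivariate ($G \times 1$) Gaussian r.v. $g_{t+1} \sim N(\mu_g, \Sigma_g\Sigma_g')$ has Laplace transform $E[\exp(u'g_{t+1})] = \exp\left(\mu_g'u + \frac{1}{2}u'\Sigma_g\Sigma_g'u\right)$ for any real ($G \times 1$) vector $u$. Consider a $(G + H) \times 1$ vector $x_{t+1} = (h_{t+1}', g_{t+1}')'$ where $h_{t+1}$ is an $H \times 1$ vector having a multivariate NCG distribution $p(h_{t+1}|\nu_h, \Phi_h h_t, \Sigma_h, \mu_h)$ and $g_{t+1}$ is a $G \times 1$ vector of conditionally Gaussian r.v.'s, $g_{t+1} \sim N(\mu_g + \Sigma_{gh}h_{t+1}, \Sigma_g\Sigma_g')$. Let $u = (u_h', u_g')'$ where $u_h$ and $u_g$ are $H \times 1$ and $G \times 1$ vectors, respectively. Using the law of iterated expectations, the Laplace transform is

$$\begin{aligned}
E[\exp(u'x_{t+1})] &= E\left[\exp(u_g'g_{t+1}) \exp(u_h'h_{t+1})\right] \\
&= E_h\left[E_{g|h}\left[\exp(u_g'g_{t+1})\right] \exp(u_h'h_{t+1})\right] \\
&= E_h\left[\exp\left((\mu_g + \Sigma_{gh}h_{t+1})'u_g + \frac{1}{2}u_g'\Sigma_g\Sigma_g'u_g\right) \exp(u_h'h_{t+1})\right] \\
&= \exp\left(u_g'\mu_g + \frac{1}{2}u_g'\Sigma_g\Sigma_g'u_g\right) E_h\left[\exp\left(\left[u_g'\Sigma_{gh} + u_h'\right]h_{t+1}\right)\right] \\
&= \exp\left(u_g'\mu_g + \frac{1}{2}u_g'\Sigma_g\Sigma_g'u_g + u_{gh}'\mu_h - \sum_{i=1}^{H} \nu_{h,i}\log\left(1 - e_i'\Sigma_h'u_{gh}\right) + \sum_{i=1}^{H} \frac{e_i'\Sigma_h'u_{gh}}{1 - e_i'\Sigma_h'u_{gh}}\, e_i'\Sigma_h^{-1}\Phi_h(h_t - \mu_h)\right)
\end{aligned}$$

where $u_{gh} = \Sigma_{gh}'u_g + u_h$ is an $H \times 1$ vector. The Laplace transform exists only if $e_i'\Sigma_h'u_{gh} < 1$ for $i = 1, \ldots, H$. This is the key expression for solving for closed-form zero-coupon bond prices.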


Appendix B Stochastic discount factor

We define the stochastic discount factor as

$$M_{t+1} = \frac{\exp(-r_t)\; p(g_{t+1}|I_t, h_{t+1}, z_{t+1}; \theta, Q)\; p(h_{t+1}|I_t, z_{t+1}; \theta, Q)\; p(z_{t+1}|I_t; \theta, Q)}{p(g_{t+1}|I_t, h_{t+1}, z_{t+1}; \theta, P)\; p(h_{t+1}|I_t, z_{t+1}; \theta, P)\; p(z_{t+1}|I_t; \theta, P)}$$

where the distributions are conditionally Gaussian, conditionally gamma, and Poisson. This is the exact (non-linear) SDF with no approximations, which we use during estimation. For intuition, consider breaking the log-stochastic discount factor $m_{t+1}$ into three terms, one for each of the shocks that the economic agent faces,

$$m_{t+1} = -r_t + m_{g,t+1} + m_{h,t+1} + m_{z,t+1}$$

where $m_{i,t+1}$ is the compensation for risk $i$. Let $\lambda_g = \mu_g - \mu_g^Q$, $\lambda_h = \nu_h - \nu_h^Q$, $\Lambda_g = \Phi_g - \Phi_g^Q$, $\Lambda_h = \Phi_h - \Phi_h^Q$, and $\Lambda_{gh} = \Phi_{gh} - \Phi_{gh}^Q$.

Appendix B.1 Gaussian risks

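Starting with the Gaussian portion, we find

$$m_{g,t+1} = -\frac{1}{2}\lambda_{gt}'\lambda_{gt} - \lambda_{gt}'\epsilon_{g,t+1}$$

where $\epsilon_{g,t+1} = \Sigma_{g,t}^{-1}\varepsilon_{g,t+1}$ is a standard, zero mean Gaussian shock. The price of Gaussian risk is

$$\lambda_{gt} = \Sigma_{g,t}^{-1}\left(\lambda_g + \Lambda_g g_t + \Lambda_{gh}h_t\right) - \Sigma_{gh}\left[\Sigma_h\lambda_h + \Lambda_h(h_t - \mu_h)\right].$$

This is a clear generalization of the expression for Gaussian ATSMs. The key difference is a time-varying quantity of risk $\Sigma_{g,t}$.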


Appendix B.2 Gamma risks

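Recall from the definition of the non-Gaussian process that $w_{t+1} = \Sigma_h^{-1}(h_{t+1} - \mu_h)$. We will write risk compensation in terms of $w_{t+1}$.

$$\begin{aligned}
m_{h,t+1} = {} & \sum_{i=1}^{H} \left\{ -\log\Gamma\left(\nu_{h,i}^Q + z_{i,t+1}\right) + \left(\nu_{h,i}^Q + z_{i,t+1} - 1\right)\log\left(e_i'\Sigma_h^{-1}(h_{t+1} - \mu_h)\right) - e_i'\Sigma_h^{-1}(h_{t+1} - \mu_h) \right\} \\
& + \sum_{i=1}^{H} \left\{ \log\Gamma\left(\nu_{h,i} + z_{i,t+1}\right) - \left(\nu_{h,i} + z_{i,t+1} - 1\right)\log\left(e_i'\Sigma_h^{-1}(h_{t+1} - \mu_h)\right) + e_i'\Sigma_h^{-1}(h_{t+1} - \mu_h) \right\} \\
= {} & \sum_{i=1}^{H} \log\frac{\Gamma(\nu_{h,i} + z_{i,t+1})}{\Gamma(\nu_{h,i}^Q + z_{i,t+1})} - \left(\nu_{h,i} - \nu_{h,i}^Q\right)\log\left(e_i'\Sigma_h^{-1}(h_{t+1} - \mu_h)\right) \\
\approx {} & \sum_{i=1}^{H} \log\left([\nu_{h,i} + z_{i,t+1}]^{\nu_{h,i} - \nu_{h,i}^Q}\right) - \lambda_{h,i}\log(w_{i,t+1}) \\
= {} & \sum_{i=1}^{H} -\lambda_{h,i}\left[\log(w_{i,t+1}) - \log(\nu_{h,i} + z_{i,t+1})\right] \\
= {} & \sum_{i=1}^{H} -\lambda_{h,i}\left[\log\left(1 + \frac{w_{i,t+1} - \nu_{h,i} - z_{i,t+1}}{\nu_{h,i} + z_{i,t+1}}\right)\right] \\
\approx {} & \sum_{i=1}^{H} -\frac{\lambda_{h,i}}{\sqrt{\nu_{h,i} + z_{i,t+1}}}\, \frac{w_{i,t+1} - \nu_{h,i} - z_{i,t+1}}{\sqrt{\nu_{h,i} + z_{i,t+1}}}.
\end{aligned}$$

This implies that the compensation for gamma risks is approximately$^{17}$

$$m_{h,t+1} \approx -\lambda_{wt}'\varepsilon_{w,t+1}$$

where $\varepsilon_{w,t+1,i} = \frac{w_{i,t+1} - \nu_{h,i} - z_{i,t+1}}{\sqrt{\nu_{h,i} + z_{i,t+1}}}$ is a gamma r.v. standardized to have mean zero and variance one. The market price of risk is $\lambda_{wt,i} = \frac{\lambda_{h,i}}{\sqrt{\nu_{h,i} + z_{i,t+1}}}$. We note that $V(w_{it}|z_t) = \nu_{h,i} + z_{i,t}$.

17 Our derivation uses the approximation that $\frac{\Gamma(a+x)}{\Gamma(b+x)} \propto x^{a-b}\left(1 + O\left(\frac{1}{x}\right)\right)$ for large $x$. We also use the fact that $\log(1 + x) \approx x$ for small $x$.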
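The two approximations in footnote 17 are easy to inspect numerically. The sketch below, using only the standard library and hypothetical argument values, measures the gap between $\log\Gamma(a+x) - \log\Gamma(b+x)$ and its leading-order term $(a-b)\log x$, and the gap between $\log(1+x)$ and $x$.

```python
import math

def log_gamma_ratio(a, b, x):
    """Exact log of Gamma(a + x) / Gamma(b + x)."""
    return math.lgamma(a + x) - math.lgamma(b + x)

def leading_order(a, b, x):
    """Leading-order approximation (a - b) * log(x), valid for large x."""
    return (a - b) * math.log(x)

def gamma_ratio_error(a, b, x):
    """Absolute error of the leading-order approximation, in logs."""
    return abs(log_gamma_ratio(a, b, x) - leading_order(a, b, x))

def log1p_error(x):
    """Absolute error of the approximation log(1 + x) ~ x."""
    return abs(math.log1p(x) - x)

# Hypothetical values: the error shrinks roughly like 1/x as x grows.
errors = [gamma_ratio_error(2.5, 1.0, x) for x in (1e2, 1e3, 1e4)]
```

The entries of `errors` should decrease by roughly a factor of ten at each step, in line with the $O(1/x)$ remainder.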


Appendix B.3 Poisson risks

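Consider the non-Gaussian part due to the Poisson distribution

$$\begin{aligned}
m_{z,t+1} = {} & \sum_{i=1}^{H} \Big\{ z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h^Q(h_t - \mu_h)\right) - \log(z_{i,t+1}!) - e_i'\Sigma_h^{-1}\Phi_h^Q(h_t - \mu_h) \\
& \qquad - z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h(h_t - \mu_h)\right) + \log(z_{i,t+1}!) + e_i'\Sigma_h^{-1}\Phi_h(h_t - \mu_h) \Big\} \\
= {} & \sum_{i=1}^{H} z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h^Q(h_t - \mu_h)\right) - z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h(h_t - \mu_h)\right) + e_i'\Sigma_h^{-1}\left(\Phi_h - \Phi_h^Q\right)(h_t - \mu_h) \\
= {} & \sum_{i=1}^{H} z_{i,t+1}\log\left(1 - \frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}\right) + e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t \\
\approx {} & \sum_{i=1}^{H} -z_{i,t+1}\frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t} + e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t \\
= {} & \sum_{i=1}^{H} -\frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}\left(z_{i,t+1} - e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t\right) \\
= {} & \sum_{i=1}^{H} -\frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{\sqrt{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}}\, \varepsilon_{z,t+1,i}.
\end{aligned}$$

The log stochastic discount factor is

$$m_{z,t+1} \approx -\lambda_{zt}'\varepsilon_{z,t+1}$$

where $\varepsilon_{z,t+1,i} = \frac{z_{i,t+1} - e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}{\sqrt{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}}$ is a Poisson r.v. standardized to have mean 0 and variance 1 and $\lambda_{zt,i} = \frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{\sqrt{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}}$.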

Appendix C Bond pricing

Appendix C.1 Bond pricing recursions

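Bond prices can be solved by induction. Guess that bond prices are $P^n_t = \exp\left(a_n + b_{n,h}'h_t + b_{n,g}'g_t\right)$ for some coefficients $a_n$, $b_{n,h}$, and $b_{n,g}$. At maturity $n = 1$, when the payoff is $P^0_{t+1} = 1$, we find

$$P^1_t = E^Q_t\left[\exp(-r_t)P^0_{t+1}\right] = \exp\left(-\delta_0 - \delta_{1,h}'h_t - \delta_{1,g}'g_t\right)$$

such that $a_1 = -\delta_0$, $b_{1,g} = -\delta_{1,g}$ and $b_{1,h} = -\delta_{1,h}$. Next, consider an $n$-period bond whose price in the next period is $P^{n-1}_{t+1}$. We find

$$\begin{aligned}
P^n_t &= E^Q_t\left[\exp(-r_t)P^{n-1}_{t+1}\right] \\
&= E^Q_t\left[\exp\left(-\delta_0 - \delta_{1,h}'h_t - \delta_{1,g}'g_t\right)\exp\left(a_{n-1} + b_{n-1,h}'h_{t+1} + b_{n-1,g}'g_{t+1}\right)\right] \\
&= \exp\left(-\delta_0 - \delta_{1,h}'h_t - \delta_{1,g}'g_t + a_{n-1}\right) E^Q_t\left[\exp\left(b_{n-1,h}'h_{t+1} + b_{n-1,g}'g_{t+1}\right)\right]
\end{aligned}$$

where the expectation is taken with respect to the distribution of the random vector $x_{t+1} = (h_{t+1}', g_{t+1}')'$ under $Q$ such that

$$\begin{aligned}
h_{t+1} &\overset{Q}{\sim} \text{Mult-NCG}\left(\nu_h^Q, \Phi_h^Q h_t, \Sigma_h, \mu_h\right) \\
g_{t+1} &\overset{Q}{\sim} N\left(\mu_g^Q + \Phi_g^Q g_t + \Phi_{gh}^Q h_t + \Sigma_{gh}\left[h_{t+1} - \left((I_H - \Phi_h^Q)\mu_h + \Sigma_h\nu_h^Q + \Phi_h^Q h_t\right)\right],\; \Sigma_{g,t}\Sigma_{g,t}'\right).
\end{aligned}$$

This expectation has the same form as the Laplace transform provided in Appendix A, evaluated under the $Q$ parameters. Using $e_i$ to denote an $H \times 1$ unit vector, we find

$$\begin{aligned}
P^n_t = \exp\Bigg( & -\delta_0 - \delta_{1,g}'g_t - \delta_{1,h}'h_t + a_{n-1} + \frac{1}{2}b_{n-1,g}'\Sigma_{g,t}\Sigma_{g,t}'b_{n-1,g} \\
& + \left[\mu_g^Q + \Phi_g^Q g_t + \Phi_{gh}^Q h_t - \Sigma_{gh}\left((I_H - \Phi_h^Q)\mu_h + \Sigma_h\nu_h^Q + \Phi_h^Q h_t\right)\right]'b_{n-1,g} \\
& + \left[\Sigma_{gh}'b_{n-1,g} + b_{n-1,h}\right]'\mu_h + \sum_{i=1}^{H} \frac{e_i'\Sigma_h'b_{n-1,gh}}{1 - e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q[h_t - \mu_h] - \sum_{i=1}^{H} \nu_{h,i}^Q\log\left(1 - e_i'\Sigma_h'b_{n-1,gh}\right) \Bigg) \\
= \exp\Bigg( & -\delta_0 + a_{n-1} + \mu_g^{Q\prime}b_{n-1,g} + \left[\Sigma_{gh}'b_{n-1,g} + b_{n-1,h}\right]'\mu_h - \left((I_H - \Phi_h^Q)\mu_h + \Sigma_h\nu_h^Q\right)'\Sigma_{gh}'b_{n-1,g} \\
& + \frac{1}{2}b_{n-1,g}'\Sigma_{g,t}\Sigma_{g,t}'b_{n-1,g} - \sum_{i=1}^{H} \nu_{h,i}^Q\log\left(1 - e_i'\Sigma_h'b_{n-1,gh}\right) - \sum_{i=1}^{H} \frac{e_i'\Sigma_h'b_{n-1,gh}}{1 - e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q\mu_h \\
& + \left[b_{n-1,g}'\Phi_g^Q - \delta_{1,g}'\right]g_t + \sum_{i=1}^{H} \frac{e_i'\Sigma_h'b_{n-1,gh}}{1 - e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q h_t + b_{n-1,g}'\left(\Phi_{gh}^Q - \Sigma_{gh}\Phi_h^Q\right)h_t - \delta_{1,h}'h_t \Bigg) \\
= \exp\Bigg( & -\delta_0 + a_{n-1} + \mu_g^{Q\prime}b_{n-1,g} + \mu_h'\left[b_{n-1,h} + \Phi_h^{Q\prime}\Sigma_{gh}'b_{n-1,g}\right] - \nu_h^{Q\prime}\Sigma_h'\Sigma_{gh}'b_{n-1,g} \\
& + \frac{1}{2}b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'b_{n-1,g} - \sum_{i=1}^{H} \nu_{h,i}^Q\log\left(1 - e_i'\Sigma_h'b_{n-1,gh}\right) - \sum_{i=1}^{H} \frac{e_i'\Sigma_h'b_{n-1,gh}}{1 - e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q\mu_h \\
& + \left[b_{n-1,g}'\Phi_g^Q - \delta_{1,g}'\right]g_t \\
& + \left[\sum_{i=1}^{H} \frac{e_i'\Sigma_h'b_{n-1,gh}}{1 - e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q + b_{n-1,g}'\left(\Phi_{gh}^Q - \Sigma_{gh}\Phi_h^Q\right) - \delta_{1,h}' + \frac{1}{2}\left(I_H \otimes b_{n-1,g}\right)'\Sigma_g\Sigma_g'\left(\iota_H \otimes b_{n-1,g}\right)\right]h_t \Bigg)
\end{aligned}$$

where $\Sigma_g\Sigma_g'$ is a $GH \times GH$ block-diagonal matrix with diagonal blocks $\Sigma_{i,g}\Sigma_{i,g}'$ for $i = 1, \ldots, H$. The expression $b_{n-1,gh} = \Sigma_{gh}'b_{n-1,g} + b_{n-1,h}$ is an $H \times 1$ vector. The Laplace transform exists only if $e_i'\Sigma_h'b_{n-1,gh} < 1$ for $i = 1, \ldots, H$.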
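Matching coefficients on the constant, $g_t$, and $h_t$ in the last expression gives recursions for $a_n$, $b_{n,g}$, and $b_{n,h}$. The sketch below is one possible transcription of those recursions under the assumption that the Poisson intensity under $Q$ uses $\Phi_h^Q$; the function name, the argument layout (a list of the $H$ matrices $\Sigma_{i,g}$), and any parameter values are hypothetical and not the authors' implementation.

```python
import numpy as np

def bond_loadings(n_max, delta0, delta1_h, delta1_g, muQ_g, PhiQ_g, PhiQ_gh,
                  PhiQ_h, nuQ_h, mu_h, Sigma_h, Sigma_gh, Sigma0_g, Sigmai_g):
    """Return [(a_n, b_{n,h}, b_{n,g})] for n = 1, ..., n_max.

    Sigmai_g is a list of the H matrices Sigma_{i,g} (each G x G).
    """
    delta1_h = np.asarray(delta1_h, dtype=float)
    delta1_g = np.asarray(delta1_g, dtype=float)
    Sh_inv = np.linalg.inv(Sigma_h)
    a, b_h, b_g = -delta0, -delta1_h, -delta1_g  # initial conditions at n = 1
    out = [(a, b_h, b_g)]
    for _ in range(2, n_max + 1):
        b_gh = Sigma_gh.T @ b_g + b_h
        c = Sigma_h.T @ b_gh                 # elements e_i' Sigma_h' b_{n-1,gh}
        if np.any(c >= 1.0):
            raise ValueError("Laplace transform does not exist")
        ratio = c / (1.0 - c)
        row = ratio @ Sh_inv @ PhiQ_h        # sum_i ratio_i e_i' Sigma_h^{-1} PhiQ_h
        quad = np.array([b_g @ S @ S.T @ b_g for S in Sigmai_g])
        a_new = (-delta0 + a + muQ_g @ b_g
                 + mu_h @ (b_h + PhiQ_h.T @ Sigma_gh.T @ b_g)
                 - nuQ_h @ Sigma_h.T @ Sigma_gh.T @ b_g
                 + 0.5 * b_g @ Sigma0_g @ Sigma0_g.T @ b_g
                 - nuQ_h @ np.log(1.0 - c)
                 - row @ mu_h)
        b_g_new = PhiQ_g.T @ b_g - delta1_g
        b_h_new = (row + (PhiQ_gh - Sigma_gh @ PhiQ_h).T @ b_g
                   - delta1_h + 0.5 * quad)
        a, b_h, b_g = a_new, b_h_new, b_g_new
        out.append((a, b_h, b_g))
    return out
```

Model-implied yields would then follow as $y^n_t = -\left(a_n + b_{n,h}'h_t + b_{n,g}'g_t\right)/n$. In the special case $G = H = 1$ with $\Sigma_{gh} = 0$ and $\Sigma_{1,g} = 0$, the $h$ recursion reduces to the familiar discrete-time CIR form $b_{n,h} = \Phi_h^Q b_{n-1,h} / (1 - \Sigma_h b_{n-1,h}) - \delta_{1,h}$.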

Appendix D Log-likelihood function

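The log-likelihood for the general affine model is given by

$$\begin{aligned}
\ell(\theta) = {} & \text{CONST} - (T-1)\log|\det(B_1)| - \frac{T-1}{2}\log|\Omega| - \frac{1}{2}\sum_{t=2}^{T} \text{tr}\left(\Omega^{-1}\eta_t\eta_t'\right) \\
& - \frac{1}{2}\sum_{t=2}^{T} \log\left|\Sigma_{g,t-1}\Sigma_{g,t-1}'\right| - \frac{1}{2}\sum_{t=2}^{T} \text{tr}\left(\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}\varepsilon_{gt}'\right) \\
& - (T-1)\log|\Sigma_h| - \sum_{t=2}^{T}\sum_{i=1}^{H} e_i'\Sigma_h^{-1}(h_t - \mu_h) - \sum_{t=2}^{T}\sum_{i=1}^{H} e_i'\Sigma_h^{-1}\Phi_h(h_{t-1} - \mu_h) \\
& + \sum_{t=2}^{T}\sum_{i=1}^{H} \frac{(\nu_{h,i} - 1)}{2}\log\left(e_i'\Sigma_h^{-1}[h_t - \mu_h]\right) - \sum_{t=2}^{T}\sum_{i=1}^{H} \frac{(\nu_{h,i} - 1)}{2}\log\left(e_i'\Sigma_h^{-1}\Phi_h[h_{t-1} - \mu_h]\right) \\
& + \sum_{t=2}^{T}\sum_{i=1}^{H} \log I_{\nu_{h,i}-1}\left(2\sqrt{\left(e_i'\Sigma_h^{-1}[h_t - \mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_{t-1} - \mu_h]\right)}\right)
\end{aligned}$$

where $I_\lambda(x)$ is the modified Bessel function of the first kind, see Abramowitz and Stegun (1964). We use $e_i$ to denote the $H \times 1$ unit vector.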

Appendix E Proof of Propositions

Appendix E.1 Proof of Proposition 1

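The necessary admissibility restrictions to keep the non-Gaussian factors positive are

1. $C_{hg} = 0$;

2. $C_{hh}$ is restricted such that all elements of $C_{hh}\Sigma_h$ are non-negative;

3. $c_h$ is restricted such that all elements in $c_h + C_{hh}\mu_h$ are non-negative;

4. $C_{hh}$ and $C_{gg}$ are full rank.

For some values of θ, these restrictions may allow $c_h$ and $C_{hh}$ to be negative.

Under these restrictions, the process $\tilde{x}_{t+1} = (\tilde{h}_{t+1}', \tilde{g}_{t+1}')'$ is a member of the same family of distributions as $x_{t+1} = (h_{t+1}', g_{t+1}')'$, only under new parameters $\tilde{\theta}$. The proof of this proposition is immediate by comparing the Laplace transform of $\tilde{x}_{t+1}$ to the Laplace transform of $x_{t+1}$. The mapping between the new parameters $\tilde{\theta}$ and the original parameters $\theta$ is given by

$$\begin{aligned}
\tilde{\mu}_h &= c_h + C_{hh}\mu_h \\
\tilde{\Phi}_h &= C_{hh}\Phi_h C_{hh}^{-1} \\
\tilde{\Sigma}_h &= C_{hh}\Sigma_h \\
\tilde{\mu}_g &= c_g + C_{gg}\mu_g - C_{gg}\Phi_g C_{gg}^{-1}c_g + C_{gh}\left([I_H - \Phi_h]\mu_h + \Sigma_h\nu_h\right) - \left(C_{gh}\Phi_h - C_{gg}\Phi_g C_{gg}^{-1}C_{gh} + C_{gg}\Phi_{gh}\right)C_{hh}^{-1}c_h \\
\tilde{\Phi}_g &= C_{gg}\Phi_g C_{gg}^{-1} \\
\tilde{\Phi}_{gh} &= \left(C_{gh}\Phi_h - C_{gg}\Phi_g C_{gg}^{-1}C_{gh} + C_{gg}\Phi_{gh}\right)C_{hh}^{-1} \\
\tilde{\Sigma}_{gh} &= \left(C_{gh} + C_{gg}\Sigma_{gh}\right)C_{hh}^{-1} \\
\tilde{\Sigma}_{0,g}\tilde{\Sigma}_{0,g}' &= C_{gg}\Sigma_{0,g}\Sigma_{0,g}'C_{gg}' - \sum_{i=1}^{H} C_{gg}\Sigma_{i,g}\Sigma_{i,g}'C_{gg}'\, e_i'C_{hh}^{-1}c_h \\
\tilde{\Sigma}_{i,g}\tilde{\Sigma}_{i,g}' &= \sum_{j=1}^{H} C_{gg}\Sigma_{j,g}\Sigma_{j,g}'C_{gg}'\, e_j'C_{hh}^{-1}e_i.
\end{aligned}$$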
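For the non-Gaussian block, the mapping can be checked numerically. The sketch below assumes the transformation $\tilde{h}_t = c_h + C_{hh}h_t$ and uses hypothetical names and values. It verifies two implications of the mapping: the matrix $\Sigma_h^{-1}\Phi_h\Sigma_h$ that governs the Poisson intensities is unchanged by the rotation, and the conditional mean of $\tilde{h}_{t+1}$ computed from the new parameters equals $c_h + C_{hh}E(h_{t+1}|h_t)$.

```python
import numpy as np

def rotate_h_params(c_h, C_hh, mu_h, Phi_h, Sigma_h):
    """Map (mu_h, Phi_h, Sigma_h) into the parameters of h~ = c_h + C_hh h."""
    mu_t = c_h + C_hh @ mu_h
    Phi_t = C_hh @ Phi_h @ np.linalg.inv(C_hh)
    Sigma_t = C_hh @ Sigma_h
    return mu_t, Phi_t, Sigma_t

def ncg_conditional_mean(h_t, nu, Phi, Sigma, mu):
    """E(h_{t+1} | h_t) = (I - Phi) mu + Sigma nu + Phi h_t."""
    return (np.eye(len(nu)) - Phi) @ mu + Sigma @ nu + Phi @ h_t

# Hypothetical parameter values.
nu = np.array([1.5, 2.0])
Sigma_h = np.array([[0.5, 0.1], [0.0, 0.4]])
mu_h = np.array([0.1, 0.2])
K = np.array([[0.9, 0.02], [0.03, 0.8]])
Phi_h = Sigma_h @ K @ np.linalg.inv(Sigma_h)
c_h = np.array([0.05, 0.0])
C_hh = np.array([[1.0, 0.5], [0.2, 1.0]])
h_t = np.array([1.2, 0.9])

mu_t, Phi_t, Sigma_t = rotate_h_params(c_h, C_hh, mu_h, Phi_h, Sigma_h)
intensity_original = np.linalg.inv(Sigma_h) @ Phi_h @ Sigma_h
intensity_rotated = np.linalg.inv(Sigma_t) @ Phi_t @ Sigma_t
mean_rotated = ncg_conditional_mean(c_h + C_hh @ h_t, nu, Phi_t, Sigma_t, mu_t)
mean_mapped = c_h + C_hh @ ncg_conditional_mean(h_t, nu, Phi_h, Sigma_h, mu_h)
```

Both pairs, `intensity_original` versus `intensity_rotated` and `mean_rotated` versus `mean_mapped`, should agree up to floating-point error.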

Appendix E.2 Proof of Proposition 2

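The proof that maximizing a concentrated log-likelihood will result in maximization of the original log-likelihood can be found in Property 7.4 of Gourieroux and Monfort (1995). To prove Proposition 2, we only need to show that the first-order conditions for $\theta_c = (\mu_g, \Phi_g, \Phi_{gh}, \Omega)$ can be solved analytically as a function of $\theta_m$ using GLS. Let $\beta' = (\mu_g, \Phi_g, \Phi_{gh})$, $\tilde{g}_t = g_t - \Sigma_{gh}\left[h_t - \left((I - \Phi_h)\mu_h + \Sigma_h\nu_h + \Phi_h h_{t-1}\right)\right]$, and $G_{t-1} = (\iota, g_{t-1}, h_{t-1})$. We use $D_k$ to denote the duplication matrix, and $D^L_k$ to denote the duplication matrix for lower triangular (i.e. not necessarily symmetric) matrices.

The derivatives for the parameters in $\theta_c$ are

$$\begin{aligned}
\frac{\partial \ell}{\partial \text{vec}(\beta)'} &= \text{vec}\left(\sum_{t=2}^{T} G_{t-1}\tilde{g}_t'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right) - \text{vec}\left(\sum_{t=2}^{T} G_{t-1}G_{t-1}'\beta\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right) \\
\frac{\partial \ell}{\partial \text{vech}(\Omega)'} &= -\frac{T-1}{2}\text{vec}\left(\Omega^{-1}\right)'D_G + \frac{1}{2}\sum_{t=2}^{T} \text{vec}\left(\Omega^{-1}\eta_t\eta_t'\Omega^{-1}\right)'D_G.
\end{aligned}$$

These two first order conditions are the same as generalized least squares and can be solved to obtain $\hat{\theta}_c(\theta_m) = \left(\hat{\mu}_g(\theta_m), \hat{\Phi}_g(\theta_m), \hat{\Phi}_{gh}(\theta_m), \hat{\Omega}(\theta_m)\right)$, which concludes the proof of Proposition 2.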

Appendix F Analytical Derivatives

In this appendix, we provide the analytical derivatives.

Appendix F.1 Preliminary lemma

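The following lemma shows that the gradients of the log-likelihood and concentrated log-likelihood are related. This enables us to derive the gradients for the concentrated log-likelihood based on the original log-likelihood, which makes the derivation easier.

Lemma 1 The derivative of the concentrated log-likelihood function $\ell_c(\theta_m) \equiv \ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)$ with respect to $\theta_m$ can be computed as the partial derivative of the log-likelihood function $\ell\left(\hat{\theta}_c, \theta_m\right)$ with respect to $\theta_m$, where $\hat{\theta}_c(\theta_m) = \arg\max_{\theta_c} \ell(\theta_c, \theta_m)$:

$$\frac{d\ell_c(\theta_m)}{d\theta_m'} = \frac{\partial \ell\left(\hat{\theta}_c, \theta_m\right)}{\partial \theta_m'}.$$

Proof:

$$\frac{d\ell_c(\theta_m)}{d\theta_m'} \equiv \frac{d\ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)}{d\theta_m'} = \frac{\partial \ell\left(\hat{\theta}_c, \theta_m\right)}{\partial \theta_m'} + \frac{\partial \ell\left(\hat{\theta}_c, \theta_m\right)}{\partial \theta_c'}\frac{d\hat{\theta}_c(\theta_m)}{d\theta_m'} = \frac{\partial \ell\left(\hat{\theta}_c, \theta_m\right)}{\partial \theta_m'}$$

where $\frac{\partial \ell(\hat{\theta}_c, \theta_m)}{\partial \theta_c'} = 0$ by the definition of $\hat{\theta}_c$.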
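A toy example may help fix ideas for both the concentration step and Lemma 1. The sketch below is not the term structure model: it assumes a hypothetical regression $y_t = \beta\exp(-\theta t) + \text{error}$ with Gaussian errors of variance $\sigma^2$, where $(\beta, \sigma^2)$ play the role of $\theta_c$ and are solved by least squares for each value of the single remaining parameter $\theta$, which plays the role of $\theta_m$. It then compares the total derivative of the concentrated log-likelihood with the partial derivative of the full log-likelihood holding $(\hat{\beta}, \hat{\sigma}^2)$ fixed, both by finite differences.

```python
import numpy as np

def loglik(beta, sigma2, theta, t, y):
    """Gaussian log-likelihood of the toy model y = beta * exp(-theta * t) + error."""
    resid = y - beta * np.exp(-theta * t)
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

def concentrate(theta, t, y):
    """Closed-form maximizers of (beta, sigma2) for a given theta (least squares)."""
    x = np.exp(-theta * t)
    beta = (x @ y) / (x @ x)
    resid = y - beta * x
    return beta, resid @ resid / len(y)

def concentrated_loglik(theta, t, y):
    """Log-likelihood with (beta, sigma2) replaced by their maximizers."""
    beta, sigma2 = concentrate(theta, t, y)
    return loglik(beta, sigma2, theta, t, y)

# Deterministic toy data (hypothetical values, no random draws).
t = np.arange(20) / 5.0
y = 2.0 * np.exp(-0.7 * t) + 0.05 * np.sin(3.0 * t)

theta0, eps = 0.5, 1e-5
total = (concentrated_loglik(theta0 + eps, t, y)
         - concentrated_loglik(theta0 - eps, t, y)) / (2 * eps)
beta_hat, sigma2_hat = concentrate(theta0, t, y)
partial = (loglik(beta_hat, sigma2_hat, theta0 + eps, t, y)
           - loglik(beta_hat, sigma2_hat, theta0 - eps, t, y)) / (2 * eps)
```

The two derivatives `total` and `partial` should coincide up to finite-difference error, which is the content of Lemma 1. A numerical optimizer then only needs to search over `theta`.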

Appendix F.2 Proof of Proposition 3

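Using Lemma 1, we find

$$\frac{d\ell\left(\hat{\theta}_c(\theta_m), \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'} = \frac{d\ell\left(\hat{\theta}_c, \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'}.$$

Applying the chain rule,

$$\frac{d\ell\left(\hat{\theta}_c, \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'} = \frac{\partial \ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial \theta_m'} + \frac{\partial \ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial A'}\frac{\partial A(\theta_m)}{\partial \theta_m'} + \frac{\partial \ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial \text{vec}(B')'}\frac{\partial \text{vec}\left(B(\theta_m)'\right)}{\partial \theta_m'},$$

which concludes the proof.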

Appendix F.3 Gradients

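Given Proposition 3, we can now provide the analytical gradient. Note that $\theta_c = (\mu_g, \Phi_g, \Phi_{gh}, \Omega)$ are optimized by $\hat{\theta}_c = \arg\max_{\theta_c} \ell(\theta_c, \theta_m)$, as detailed in Proposition 2. For convenience, let $\tilde{h}_t = \Sigma_h^{-1}(h_t - \mu_h)$, $\hat{h}_{t-1} = \Sigma_h^{-1}\Phi_h(h_{t-1} - \mu_h)$, and $\bar{h}_{it} = 2\sqrt{\left(e_i'\Sigma_h^{-1}[h_t - \mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_{t-1} - \mu_h]\right)}$.

$$\begin{aligned}
\frac{\partial \ell}{\partial \nu_{h,i}} = {} & -\sum_{t=2}^{T} \varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\Sigma_{gh}\Sigma_h e_i + \frac{1}{2}\sum_{t=2}^{T} \log\left(\frac{e_i'\tilde{h}_t}{e_i'\hat{h}_{t-1}}\right) + \sum_{t=2}^{T} \frac{1}{I_{\nu_{h,i}-1}\left(\bar{h}_{it}\right)}\frac{\partial I_{\nu_{h,i}-1}\left(\bar{h}_{it}\right)}{\partial \nu_{h,i}} \\
\frac{\partial \ell}{\partial \text{vec}(\Phi_h)'} = {} & -\sum_{t=2}^{T} \text{vec}\left(\Sigma_{gh}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}h_{t-1}'\right)' - \sum_{t=2}^{T}\sum_{i=1}^{H} \text{vec}\left((h_{t-1} - \mu_h)e_i'\Sigma_h^{-1}\right)' \\
& - \sum_{t=2}^{T}\sum_{i=1}^{H} \frac{(\nu_{h,i} - 1)}{2e_i'\hat{h}_{t-1}}\text{vec}\left((h_{t-1} - \mu_h)e_i'\Sigma_h^{-1}\right)' \\
& + \sum_{t=2}^{T}\sum_{i=1}^{H} \left[\frac{(\nu_{h,i} - 1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}\left(\bar{h}_{it}\right)}{I_{\nu_{h,i}-1}\left(\bar{h}_{it}\right)}\right]\frac{2e_i'\tilde{h}_t}{\bar{h}_{it}}\text{vec}\left((h_{t-1} - \mu_h)e_i'\Sigma_h^{-1}\right)'
\end{aligned}$$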

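where we have used the fact that $\frac{\partial I_\lambda(x)}{\partial x} = \frac{\lambda}{x}I_\lambda(x) + I_{\lambda+1}(x)$, see Abramowitz and Stegun (1964). The derivative $\frac{\partial I_\lambda(x)}{\partial \lambda}$ is a complicated expression that is easier to compute numerically. The derivatives for the parameters that only enter the bond loadings are calculated in two steps via the chain rule. First, we take derivatives of $\ell$ w.r.t. the loadings $A_1$, $A_2$, $B_1$ and $B_2$. Then, we take derivatives of the bond loadings with respect to the model's parameters inside the bond loadings.

$$\begin{aligned}
\frac{\partial \ell}{\partial \delta_0} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \delta_0} \\
\frac{\partial \ell}{\partial \delta_{1,g}'} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \delta_{1,g}'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \delta_{1,g}'} \\
\frac{\partial \ell}{\partial \delta_{1,h}'} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \delta_{1,h}'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \delta_{1,h}'} \\
\frac{\partial \ell}{\partial \mu_g^{Q\prime}} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \mu_g^{Q\prime}} \\
\frac{\partial \ell}{\partial \text{vec}(\Phi_g^Q)'} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vec}(\Phi_g^Q)'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \text{vec}(\Phi_g^Q)'} \\
\frac{\partial \ell}{\partial \text{vec}(\Phi_{gh}^Q)'} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vec}(\Phi_{gh}^Q)'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \text{vec}(\Phi_{gh}^Q)'} \\
\frac{\partial \ell}{\partial \nu_h^{Q\prime}} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \nu_h^{Q\prime}} \\
\frac{\partial \ell}{\partial \text{vec}(\Phi_h^Q)'} &= \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vec}(\Phi_h^Q)'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \text{vec}(\Phi_h^Q)'}
\end{aligned}$$

The derivatives of the remaining parameters that enter both the loadings and the P dynamics are

$$\begin{aligned}
\frac{d\ell}{d\text{vec}(\Sigma_h)'} &= \frac{\partial \ell}{\partial \text{vec}(\Sigma_h)'} + \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vec}(\Sigma_h)'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \text{vec}(\Sigma_h)'} \\
\frac{d\ell}{d\text{vec}(\Sigma_{gh})'} &= \frac{\partial \ell}{\partial \text{vec}(\Sigma_{gh})'} + \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vec}(\Sigma_{gh})'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \text{vec}(\Sigma_{gh})'} \\
\frac{d\ell}{d\text{vech}(\Sigma_{0,g})'} &= \frac{\partial \ell}{\partial \text{vech}(\Sigma_{0,g})'} + \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vech}(\Sigma_{0,g})'} \\
\frac{d\ell}{d\text{vech}(\Sigma_{i,g})'} &= \frac{\partial \ell}{\partial \text{vech}(\Sigma_{i,g})'} + \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \text{vech}(\Sigma_{i,g})'} + \frac{\partial \ell}{\partial \text{vec}(B')'}\frac{\partial \text{vec}(B')}{\partial \text{vech}(\Sigma_{i,g})'} \\
\frac{d\ell}{d\mu_h'} &= \frac{\partial \ell}{\partial \mu_h'} + \frac{\partial \ell}{\partial A'}\frac{\partial A}{\partial \mu_h'}
\end{aligned}$$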


We need the following derivatives

∂`

∂vec (Σh)′ = −

T∑t=2

vec(

Σ′gh(Σg,t−1Σ′g,t−1

)−1εgtν

′h

)′− (T − 1)vec

(Σ−1h

)′+

T∑t=2

H∑i=1

vec(hte′iΣ−1h

)′+

T∑t=2

H∑i=1

vec(

ˆht−1e′iΣ

−1h

)′+

T∑t=2

H∑i=1

(νh,i − 1)

2e′ihtvec(hte′iΣ−1h

)′−

T∑t=2

H∑i=1

(νh,i − 1)

2e′iˆht−1

vec(

ˆht−1e′iΣ

−1h

)′

−T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2e′iht

hitvec(

ˆht−1e′iΣ

−1h

)′

−T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2e′i

ˆht−1

hitvec(hte′iΣ−1h

)′∂`

∂vec (Σgh)′ =

T∑t=2

vec(εhtε

′gt


)−1)′

∂`

∂vech (Σ0,g)′ =

T∑t=2

vec([(


)−1εgtε

′gt − IG

] (Σg,t−1Σ′g,t−1

)−1Σ0,g

)′DLG

∂`

∂vech (Σi,g)′ =

T∑t=2

vec([(


)−1εgtε

′gt − IG

] (Σg,t−1Σ′g,t−1

)−1Σi,ghi,t−1

)′DLG

∂`

∂µ′h= −

T∑t=2

ε′gt(Σg,t−1Σ′g,t−1

)−1Σgh (IH − Φh)

+

T∑t=2

H∑i=1

e′iΣ−1h +

T∑t=2

H∑i=1

e′iΣ−1h Φh

−T∑t=2

H∑i=1

(νh,i − 1)

2e′ihte′iΣ−1h +

T∑t=2

H∑i=1

(νh,i − 1)

2e′iˆht−1

e′iΣ−1h Φh

−T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2

hit

[e′ihte

′iΣ−1h Φh + e′i

ˆht−1e′iΣ

−1h

]


∂`

∂A′1=

T∑t=2

[−η′tΩ−1B2B

−11 + ε′gt


)−1[(IG − Φg)Sg − (Φgh + Σgh − ΣghΦh)Sh]B−1

1

+1

2vec((


)−1[IG − εgtε′gt


)−1])′ H∑

i=1

vec(Σi,gΣ

′i,g

)ShiB

−11

+

T∑t=2

H∑i=1

e′iΣ−1h ShB

−11 −

T∑t=2

H∑i=1

(νh,i − 1)

2e′ihte′iΣ−1h ShB

−11

+

T∑t=2

H∑i=1

e′iΣ−1h ΦhShB

−11 +

T∑t=2

H∑i=1

(νh,i − 1)

2e′iˆht−1

e′iΣ−1h ΦhShB

−11

−T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2e′i

ˆht−1

hite′iΣ−1h ShB

−11

−T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2e′iht

hite′iΣ−1h ΦhShB

−11

∂`

∂A′2=

T∑t=2

η′tΩ−1

∂`

∂vec (B′1)′ = −(T − 1)vec

(B−1

1

)′+

T∑t=2

[−vec

(xtη′tΩ−1B2B

−11

)′+vec

(xtε′gt


)−1(Sg − ΣghSh)B−1

1

)′−vec

(xt−1ε

′gt


)−1(ΦgSg + [Φgh − ΣghΦh]Sh)B−1

1

)′+

1

2vec((


)−1[IG − εgtε′gt


)−1])′ H∑

i=1

vec(Σi,gΣ

′i,g

) [ShiB

−11 ⊗ x′t−1

]+

T∑t=2

H∑i=1

vec(xte′iΣ−1h ShB

−11

)′ − T∑t=2

H∑i=1

(νh,i − 1)

2e′ihtvec(xte′iΣ−1h ShB

−11

)′+

T∑t=2

H∑i=1

vec(xt−1e′iΣ

−1h ΦhShB

−11

)′+

T∑t=2

H∑i=1

(νh,i − 1)

2e′iˆht−1

vec(xt−1e′iΣ

−1h ΦhShB

−11

)′−

T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2e′i

ˆht−1

hitvec(xte′iΣ−1h ShB

−11

)′

−T∑t=2

H∑i=1

(νh,i − 1)

hit+

Iνh,i

(hit

)Iνh,i−1

(hit

) 2e′iht

hitvec(xt−1e′iΣ

−1h ΦhShB

−11

)′∂`

∂vec (B′2)′ =

T∑t=2

vec(xtη′tΩ−1)′


The derivatives of the bond loadings $A$ and $B$ with respect to each of the parameters can be computed recursively as a function of maturity, along with the loadings $a_n$, $b_{n,g}$ and $b_{n,h}$. The derivatives of the Gaussian loadings $B_g$ and the non-Gaussian loadings $B_h$ have separate recursions. We use $b_{n,g,\psi}$ to denote the derivative of the Gaussian loadings $b_{n,g}$ at maturity $n$ with respect to a parameter $\psi$. All recursions for the derivatives are written assuming that $\psi$ is a full vector or matrix of parameters with no restrictions. In practice, if the matrix has fewer free parameters than entries, the user must multiply the respective recursion by a selection matrix. Let $d_{n-1} = \mathrm{diag}\left(\iota_H - \Sigma_h' b_{n-1,gh}\right)$ be a diagonal $H \times H$ matrix. Let $c_{n-1}' = \left(\nu_h^{\mathbb{Q}\prime} d_{n-1}^{-1} - \mu_h^{\mathbb{Q}\prime} \Phi_h^{\mathbb{Q}\prime} \Sigma_h^{-1\prime} d_{n-1}^{-2}\right)\Sigma_h'$ be a $1 \times H$ vector. The recursions for the derivatives of $A$ as a function of maturity are

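$$
\begin{aligned}
\underset{1\times G}{a_{n,\mu_g^{\mathbb{Q}}}'} &= a_{n-1,\mu_g^{\mathbb{Q}}}' + b_{n-1,g}' \\
\underset{1\times H}{a_{n,\mu_h}'} &= a_{n-1,\mu_h}' + b_{n-1,h}' + b_{n-1,g}'\Sigma_{gh}\Phi_h^{\mathbb{Q}} - b_{n-1,gh}'\Sigma_h d_{n-1}^{-1}\Sigma_h^{-1}\Phi_h^{\mathbb{Q}} \\
\underset{1\times H}{a_{n,\nu_h^{\mathbb{Q}}}'} &= a_{n-1,\nu_h^{\mathbb{Q}}}' - \log\left(\iota_H' - b_{n-1,gh}'\Sigma_h\right) - b_{n-1,g}'\Sigma_{gh}\Sigma_h \\
\underset{1\times G(G+1)/2}{a_{n,\Sigma_{0,g}}'} &= a_{n-1,\Sigma_{0,g}}' + \mathrm{vec}\left(b_{n-1,g}\, b_{n-1,g}'\Sigma_{0,g}\right)' D_{LG} \\
\underset{1\times H}{a_{n,\delta_{1h}}'} &= a_{n-1,\delta_{1h}}' + \mu_h' b_{n-1,h,\delta_{1h}} + c_{n-1}' b_{n-1,h,\delta_{1h}} \\
\underset{1\times GH}{a_{n,\Phi_{gh}^{\mathbb{Q}}}'} &= a_{n-1,\Phi_{gh}^{\mathbb{Q}}}' + \mu_h' b_{n-1,h,\Phi_{gh}^{\mathbb{Q}}} + c_{n-1}' b_{n-1,h,\Phi_{gh}^{\mathbb{Q}}} \\
\underset{1\times G(G+1)/2}{a_{n,\Sigma_{i,g}}'} &= a_{n-1,\Sigma_{i,g}}' + \mu_h' b_{n-1,h,\Sigma_{i,g}} + c_{n-1}' b_{n-1,h,\Sigma_{i,g}}, \qquad i = 1, \ldots, H \\
\underset{1\times H^2}{a_{n,\Phi_h^{\mathbb{Q}}}'} &= a_{n-1,\Phi_h^{\mathbb{Q}}}' + \mu_h' b_{n-1,h,\Phi_h^{\mathbb{Q}}} + c_{n-1}' b_{n-1,h,\Phi_h^{\mathbb{Q}}} \\
&\quad + \mathrm{vec}\left(\left(\Sigma_{gh}' b_{n-1,g} - \Sigma_h^{-1\prime} d_{n-1}^{-1}\Sigma_h' b_{n-1,gh}\right)\mu_h'\right)' \\
\underset{1\times GH}{a_{n,\Sigma_{gh}}'} &= a_{n-1,\Sigma_{gh}}' + \mu_h' b_{n-1,h,\Sigma_{gh}} + c_{n-1}' b_{n-1,h,\Sigma_{gh}} + \mathrm{vec}\left[b_{n-1,g}\left(\mu_h'\Phi_h^{\mathbb{Q}\prime} - \nu_h^{\mathbb{Q}\prime}\Sigma_h' + c_{n-1}'\right)\right]' \\
\underset{1\times G}{a_{n,\delta_{1g}}'} &= a_{n-1,\delta_{1g}}' + \mu_h' b_{n-1,h,\delta_{1g}} + c_{n-1}' b_{n-1,gh,\delta_{1g}} \\
&\quad + \left(\mu_g^{\mathbb{Q}\prime} + \mu_h'\Phi_h^{\mathbb{Q}\prime}\Sigma_{gh}' - \nu_h^{\mathbb{Q}\prime}\Sigma_h'\Sigma_{gh}'\right) b_{n-1,g,\delta_{1g}} + b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'\, b_{n-1,g,\delta_{1g}} \\
\underset{1\times G^2}{a_{n,\Phi_g^{\mathbb{Q}}}'} &= a_{n-1,\Phi_g^{\mathbb{Q}}}' + \mu_h' b_{n-1,h,\Phi_g^{\mathbb{Q}}} + c_{n-1}' b_{n-1,gh,\Phi_g^{\mathbb{Q}}} \\
&\quad + \left(\mu_g^{\mathbb{Q}\prime} + \mu_h'\Phi_h^{\mathbb{Q}\prime}\Sigma_{gh}' - \nu_h^{\mathbb{Q}\prime}\Sigma_h'\Sigma_{gh}'\right) b_{n-1,g,\Phi_g^{\mathbb{Q}}} + b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'\, b_{n-1,g,\Phi_g^{\mathbb{Q}}} \\
\underset{1\times H(H+1)/2}{a_{n,\Sigma_h}'} &= a_{n-1,\Sigma_h}' + \mu_h' b_{n-1,h,\Sigma_h} + c_{n-1}' b_{n-1,h,\Sigma_h} + \mathrm{vec}\left(\Sigma_h^{-1\prime} d_{n-1}^{-1}\Sigma_h' b_{n-1,gh}\,\mu_h'\Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime}\right)' \\
&\quad - \mathrm{vec}\left(\Sigma_{gh}' b_{n-1,g}\,\nu_h'\right)' + \mathrm{vec}\left(b_{n-1,gh}\, c_{n-1}'\Sigma_h^{-1\prime}\right)'
\end{aligned}
$$

The derivative of $A$ with respect to $\delta_0$ is $\iota_N$, and the initial conditions are $b_{1,g,\delta_{1g}} = -I_G$ and $b_{1,h,\delta_{1h}} = -I_H$, with all other initial conditions starting at zero.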

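$$
\begin{aligned}
\underset{G\times G}{b_{n,g,\delta_{1g}}} &= \Phi_g^{\mathbb{Q}\prime} b_{n-1,g,\delta_{1g}} - I_G \\
\underset{G\times G^2}{b_{n,g,\Phi_g^{\mathbb{Q}}}} &= \Phi_g^{\mathbb{Q}\prime} b_{n-1,g,\Phi_g^{\mathbb{Q}}} + \left(I_G \otimes b_{n-1,g}'\right) \\
\underset{H\times H}{b_{n,h,\delta_{1h}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\delta_{1h}} - I_H \\
\underset{H\times GH}{b_{n,h,\Phi_{gh}^{\mathbb{Q}}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Phi_{gh}^{\mathbb{Q}}} + I_H \otimes b_{n-1,g}' \\
\underset{H\times G(G+1)/2}{b_{n,h,\Sigma_{i,g}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Sigma_{i,g}} + e_i\, \mathrm{vec}\left(b_{n-1,g}\, b_{n-1,g}'\Sigma_{i,g}\right)' D_{LG}, \qquad i = 1, \ldots, H \\
\underset{H\times H^2}{b_{n,h,\Phi_h^{\mathbb{Q}}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Phi_h^{\mathbb{Q}}} - I_H \otimes b_{n-1,g}'\Sigma_{gh} + I_H \otimes b_{n-1,gh}'\Sigma_h d_{n-1}^{-1}\Sigma_h^{-1} \\
\underset{H\times GH}{b_{n,h,\Sigma_{gh}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Sigma_{gh}} - \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime}\left(I_H - d_{n-1}^{-2}\right)\Sigma_h' \otimes b_{n-1,g}' \\
\underset{H\times G}{b_{n,h,\delta_{1g}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,gh,\delta_{1g}} + \left(\Phi_{gh}^{\mathbb{Q}} - \Sigma_{gh}\Phi_h^{\mathbb{Q}}\right)' b_{n-1,g,\delta_{1g}} \\
&\quad + \left(I_H \otimes b_{n-1,g}'\right)\Sigma_g\Sigma_g'\left(\iota_H \otimes b_{n-1,g,\delta_{1g}}\right) \\
\underset{H\times G^2}{b_{n,h,\Phi_g^{\mathbb{Q}}}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,gh,\Phi_g^{\mathbb{Q}}} + \left(\Phi_{gh}^{\mathbb{Q}} - \Sigma_{gh}\Phi_h^{\mathbb{Q}}\right)' b_{n-1,g,\Phi_g^{\mathbb{Q}}} \\
&\quad + \left(I_H \otimes b_{n-1,g}'\right)\Sigma_g\Sigma_g'\left(\iota_H \otimes b_{n-1,g,\Phi_g^{\mathbb{Q}}}\right) \\
\underset{H\times H(H+1)/2}{b_{n,h,\Sigma_h}} &= \Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Sigma_h} - \left(\Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} \otimes b_{n-1,gh}'\Sigma_h d_{n-1}^{-1}\Sigma_h^{-1}\right) \\
&\quad + \left(\Phi_h^{\mathbb{Q}\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2} \otimes b_{n-1,gh}'\right)
\end{aligned}
$$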

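where we also need to account for the derivatives of $b_{n,gh} = \Sigma_{gh}' b_{n,g} + b_{n,h}$ as

$$
\begin{aligned}
\underset{H\times G}{b_{n-1,gh,\delta_{1g}}} &= \Sigma_{gh}' b_{n-1,g,\delta_{1g}} + b_{n-1,h,\delta_{1g}} \\
\underset{H\times G^2}{b_{n-1,gh,\Phi_g^{\mathbb{Q}}}} &= \Sigma_{gh}' b_{n-1,g,\Phi_g^{\mathbb{Q}}} + b_{n-1,h,\Phi_g^{\mathbb{Q}}}
\end{aligned}
$$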
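A derivative recursion of this kind can be sanity-checked against finite differences. The sketch below does this for the simplest case, $b_{n,g,\delta_{1g}} = \Phi_g^{\mathbb{Q}\prime} b_{n-1,g,\delta_{1g}} - I_G$. It assumes that the underlying Gaussian loading recursion is $b_{n,g} = \Phi_g^{\mathbb{Q}\prime} b_{n-1,g} - \delta_{1g}$ with $b_{0,g} = 0$, which is the form implied by the derivative recursion and its initial condition rather than a quotation of equation (7). The function names and parameter values are illustrative only.

```python
import numpy as np

def gaussian_loadings(phi_q, delta1, n_max):
    """Assumed loading recursion: b_n = phi_q' b_{n-1} - delta1, with b_0 = 0."""
    b = np.zeros_like(delta1)
    for _ in range(n_max):
        b = phi_q.T @ b - delta1
    return b

def gaussian_loading_derivative(phi_q, n_max):
    """Derivative recursion: db_n = phi_q' db_{n-1} - I_G, starting at db_1 = -I_G."""
    G = phi_q.shape[0]
    db = -np.eye(G)
    for _ in range(n_max - 1):
        db = phi_q.T @ db - np.eye(G)
    return db

def finite_difference(phi_q, delta1, n_max, eps=1e-6):
    """Central-difference Jacobian of b_n with respect to delta1."""
    G = delta1.size
    jac = np.zeros((G, G))
    for j in range(G):
        step = np.zeros(G)
        step[j] = eps
        up = gaussian_loadings(phi_q, delta1 + step, n_max)
        down = gaussian_loadings(phi_q, delta1 - step, n_max)
        jac[:, j] = (up - down) / (2 * eps)
    return jac

# Illustrative (hypothetical) parameter values.
phi_q = np.array([[0.95, 0.02],
                  [0.00, 0.80]])
delta1 = np.array([1.0, 0.5])

analytic = gaussian_loading_derivative(phi_q, n_max=12)
numeric = finite_difference(phi_q, delta1, n_max=12)
max_abs_error = np.max(np.abs(analytic - numeric))
```

Because the assumed loading recursion is linear in $\delta_{1g}$, the two Jacobians should agree up to floating-point rounding. The same pattern can be used to check the non-Gaussian recursions, where agreement is only approximate.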

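Notice that many of the derivatives of the loadings are zero for all maturities. These include $\underset{G\times H}{b_{n,g,\mu_h}}$, $\underset{H\times G}{b_{n,h,\mu_g^{\mathbb{Q}}}}$, $\underset{G\times H}{b_{n,g,\delta_{1h}}}$, $\underset{G\times H^2}{b_{n,g,\Phi_h^{\mathbb{Q}}}}$, $\underset{G\times H^2}{b_{n,g,\Sigma_h}}$, $\underset{G\times G(G+1)/2}{b_{n,g,\Sigma_{0,g}}}$, $\underset{H\times G(G+1)/2}{b_{n,h,\Sigma_{0,g}}}$, $\underset{G\times GH}{b_{n,g,\Phi_{gh}^{\mathbb{Q}}}}$, $\underset{H\times G(G+1)/2}{b_{n,g,\Sigma_{i,g}}}$, $\underset{G\times GH}{b_{n,g,\Sigma_{gh}}}$, $\underset{G\times H}{b_{n,g,\nu_h^{\mathbb{Q}}}}$, and $\underset{H\times H}{b_{n,h,\nu_h^{\mathbb{Q}}}}$.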
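The derivative recursions are stated for unrestricted parameter matrices, with restricted cases handled by a selection matrix. A minimal sketch of that step follows. It assumes the restriction sets some entries of a matrix $\Psi$ to zero and that $\mathrm{vec}(\cdot)$ stacks columns; the helper `selection_matrix` and the numbers are hypothetical, not part of the paper's code.

```python
import numpy as np

def selection_matrix(free_mask):
    """Return S such that vec(Psi) = S @ theta, where theta stacks the free
    entries of Psi in column-major (vec) order and restricted entries are zero."""
    vec_mask = free_mask.flatten(order="F")
    free_idx = np.flatnonzero(vec_mask)
    S = np.zeros((vec_mask.size, free_idx.size))
    S[free_idx, np.arange(free_idx.size)] = 1.0
    return S

# A 2 x 2 matrix whose upper-right entry is restricted to zero.
free_mask = np.array([[True, False],
                      [True, True]])
S = selection_matrix(free_mask)

theta = np.array([0.9, 0.1, 0.7])            # free parameters
psi = (S @ theta).reshape(2, 2, order="F")   # rebuild the restricted matrix

# Chain rule: a 1 x 4 derivative with respect to vec(Psi)' becomes
# a 1 x 3 derivative with respect to theta' after multiplying by S.
grad_full = np.array([[1.0, 2.0, 3.0, 4.0]])
grad_free = grad_full @ S
```

Post-multiplying each derivative recursion by `S` once, at the end, is equivalent to carrying the reduced dimension through the recursion, because `S` does not depend on maturity.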


Appendix G Model appendix

Appendix G.1 General affine model

The general affine model can be summarized as

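$$
\begin{aligned}
h_{t+1} \mid x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g + \Phi_g g_t + \Phi_{gh} h_t + \Sigma_{gh}\varepsilon_{h,t+1} + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \\
h_{t+1} \mid x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h^{\mathbb{Q}}, \Phi_h^{\mathbb{Q}} h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g^{\mathbb{Q}} + \Phi_g^{\mathbb{Q}} g_t + \Phi_{gh}^{\mathbb{Q}} h_t + \Sigma_{gh}\varepsilon_{h,t+1}^{\mathbb{Q}} + \varepsilon_{g,t+1}^{\mathbb{Q}}, \qquad \varepsilon_{g,t+1}^{\mathbb{Q}} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \\
r_t &= \delta_0 + \delta_{1h}' h_t + \delta_{1g}' g_t,
\end{aligned}
$$

where

$$
\begin{aligned}
\Sigma_{g,t}\Sigma_{g,t}' &= \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it}, \\
\varepsilon_{h,t+1} &= h_{t+1} - E\left(h_{t+1} \mid x_t\right), \\
\varepsilon_{h,t+1}^{\mathbb{Q}} &= h_{t+1} - E^{\mathbb{Q}}\left(h_{t+1} \mid x_t\right).
\end{aligned}
$$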
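The time-varying covariance of the Gaussian shocks, $\Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H}\Sigma_{i,g}\Sigma_{i,g}' h_{it}$, is straightforward to assemble numerically. The sketch below is illustrative only: the function name and the parameter values are assumptions, chosen to show that the result stays symmetric and positive semidefinite whenever $h_{it} \geq 0$.

```python
import numpy as np

def gaussian_shock_covariance(sigma0, sigma_list, h):
    """Return Sigma_0 Sigma_0' + sum_i Sigma_i Sigma_i' * h_i."""
    cov = sigma0 @ sigma0.T
    for sigma_i, h_i in zip(sigma_list, h):
        cov = cov + (sigma_i @ sigma_i.T) * h_i
    return cov

# Hypothetical values with G = 2 Gaussian and H = 2 non-Gaussian factors.
sigma0 = np.array([[0.10, 0.00],
                   [0.02, 0.05]])
sigma_list = [np.array([[0.30, 0.00],
                        [0.00, 0.00]]),
              np.array([[0.00, 0.00],
                        [0.10, 0.20]])]
h_t = np.array([0.5, 2.0])

cov_t = gaussian_shock_covariance(sigma0, sigma_list, h_t)
min_eigenvalue = np.linalg.eigvalsh(cov_t).min()
```

Each summand is a positive semidefinite matrix scaled by a non-negative factor, which is why positivity of $h_t$ is enough to keep the conditional covariance valid.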

Appendix G.2 Model with observable macroeconomic factors

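Let $Y_t^{(m)}$ denote the $M \times 1$ vector of observable macroeconomic variables. For simplicity, we assume that the macroeconomic variables are Gaussian. The state vector $g_t = \left(g_t^{(y)}, g_t^{(m)}\right)$ includes both latent yield factors $g_t^{(y)}$ and macroeconomic factors $g_t^{(m)}$. The model can be summarized as

$$
\begin{aligned}
h_{t+1} \mid x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g + \Phi_g g_t + \Phi_{gh} h_t + \Sigma_{gh}\varepsilon_{h,t+1} + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \\
h_{t+1} \mid x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h^{\mathbb{Q}}, \Phi_h^{\mathbb{Q}} h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g^{\mathbb{Q}} + \Phi_g^{\mathbb{Q}} g_t + \Phi_{gh}^{\mathbb{Q}} h_t + \Sigma_{gh}\varepsilon_{h,t+1}^{\mathbb{Q}} + \varepsilon_{g,t+1}^{\mathbb{Q}}, \qquad \varepsilon_{g,t+1}^{\mathbb{Q}} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \\
r_t &= \delta_0 + \delta_{1h}' h_t + \delta_{1g}' g_t,
\end{aligned}
$$

where

$$
\begin{aligned}
\Sigma_{g,t}\Sigma_{g,t}' &= \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it}, \\
\varepsilon_{h,t+1} &= h_{t+1} - E\left(h_{t+1} \mid x_t\right), \\
\varepsilon_{h,t+1}^{\mathbb{Q}} &= h_{t+1} - E^{\mathbb{Q}}\left(h_{t+1} \mid x_t\right).
\end{aligned}
$$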

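The measurement equation relating the factors to the observed yields and macroeconomic variables is

$$
\begin{pmatrix} Y_t^{(1)} \\ Y_t^{(m)} \end{pmatrix}
=
\begin{pmatrix} A_1 \\ 0 \end{pmatrix}
+
\begin{pmatrix} B_{1h} & B_{1g^{(y)}} & B_{1g^{(m)}} \\ 0_{M\times H} & 0_{M\times G} & I_M \end{pmatrix}
\begin{pmatrix} h_t \\ g_t^{(y)} \\ g_t^{(m)} \end{pmatrix}.
$$

The bond loadings for the macroeconomic factors $B_{1g^{(m)}}$ can be calculated jointly with the bond loadings for the latent Gaussian factors.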
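The block structure of this measurement equation can be written down directly. The following sketch uses randomly generated placeholder loadings (not estimates) and hypothetical dimension names; it only illustrates that the lower block passes the macroeconomic factors through unchanged.

```python
import numpy as np

# Hypothetical dimensions: N yields, H non-Gaussian factors,
# G latent Gaussian yield factors, M macroeconomic variables.
N, H, G, M = 4, 1, 2, 2
rng = np.random.default_rng(0)

# Placeholder loadings standing in for A_1, B_1h, B_1g(y), B_1g(m).
A1 = rng.normal(size=N)
B1h = rng.normal(size=(N, H))
B1gy = rng.normal(size=(N, G))
B1gm = rng.normal(size=(N, M))

intercept = np.concatenate([A1, np.zeros(M)])
loadings = np.block([
    [B1h, B1gy, B1gm],
    [np.zeros((M, H)), np.zeros((M, G)), np.eye(M)],
])

# Placeholder state vector (h_t, g_t^(y), g_t^(m)).
h_t = np.array([0.8])
gy_t = np.array([0.1, -0.3])
gm_t = np.array([2.0, 1.5])
state = np.concatenate([h_t, gy_t, gm_t])

observed = intercept + loadings @ state
yields, macro = observed[:N], observed[N:]
```

The identity block means the macroeconomic factors are observed without error in this formulation, while the yields load on all three groups of factors.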

Appendix G.3 Gaussian model

The Gaussian model can be summarized as

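$$
\begin{aligned}
g_{t+1} &= \mu_g + \Phi_g g_t + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{0,g}\Sigma_{0,g}'\right), \\
g_{t+1} &= \mu_g^{\mathbb{Q}} + \Phi_g^{\mathbb{Q}} g_t + \varepsilon_{g,t+1}^{\mathbb{Q}}, \qquad \varepsilon_{g,t+1}^{\mathbb{Q}} \sim N\left(0, \Sigma_{0,g}\Sigma_{0,g}'\right), \\
r_t &= \delta_0 + \delta_{1g}' g_t.
\end{aligned}
$$

The bond loadings recursions (5) and (7) still apply, but with all non-Gaussian parameters set to zero.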

Appendix G.4 “Hidden” factors

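The general affine model with hidden Gaussian factors has three sets of state variables $x_t = \left(h_t', g_{1t}', g_{2t}'\right)'$, where the dimensions of the Gaussian state variables are $G_1 \times 1$ and $G_2 \times 1$, respectively. The model is intentionally parameterized such that only the state variables $h_t$ and $g_{1t}$ are priced by the cross-section. We also define the model such that the non-Gaussian factors only affect the variance of the shocks to $g_{1t}$. The dynamics under the $\mathbb{P}$ measure are

$$
\begin{pmatrix} g_{1,t+1} \\ g_{2,t+1} \end{pmatrix}
=
\begin{pmatrix} \mu_{g,1} \\ \mu_{g,2} \end{pmatrix}
+
\begin{pmatrix} \Phi_{gh,1} & \Phi_{g,11} & \Phi_{g,12} \\ \Phi_{gh,2} & \Phi_{g,21} & \Phi_{g,22} \end{pmatrix}
\begin{pmatrix} h_t \\ g_{1t} \\ g_{2t} \end{pmatrix}
+
\begin{pmatrix} \Sigma_{gh,1} \\ \Sigma_{gh,2} \end{pmatrix} \varepsilon_{h,t+1}
+
\begin{pmatrix} \varepsilon_{g,1,t+1} \\ \varepsilon_{g,2,t+1} \end{pmatrix},
$$

$$
\begin{aligned}
\varepsilon_{g,1,t+1} &\sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \\
\varepsilon_{g,2,t+1} &\sim N\left(0, \Sigma_{\varepsilon g_2}\Sigma_{\varepsilon g_2}'\right), \\
\varepsilon_{h,t+1} &= h_{t+1} - E\left(h_{t+1} \mid x_t\right), \\
\Sigma_{g,t}\Sigma_{g,t}' &= \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it},
\end{aligned}
$$

and

$$
r_t = \delta_0 + \delta_{1h}' h_t + \delta_{1g_1}' g_{1t} + \delta_{1g_2}' g_{2t}.
$$

The dynamics under the $\mathbb{Q}$ measure are the same with the following restrictions: $\delta_{1g_2} = 0$ and $\Phi_{g,12}^{\mathbb{Q}} = 0$. Under these restrictions, the bond loadings on the Gaussian state vector $g_{2t}$ are zero by construction for all maturities. The bond loadings recursions (5), (6), and (7) remain the same as before.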
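The claim that the loadings on $g_{2t}$ vanish at every maturity can be illustrated numerically. The sketch below assumes the Gaussian loading recursion has the linear form $b_{n,g} = \Phi_g^{\mathbb{Q}\prime} b_{n-1,g} - \delta_{1g}$ with $b_{0,g} = 0$, a form inferred from the derivative recursions in the previous appendix rather than copied from equation (7). All parameter values are hypothetical.

```python
import numpy as np

G1, G2 = 2, 1

# Q-dynamics of (g_1, g_2) with the upper-right block Phi^Q_{g,12} set to zero.
phi_q = np.array([[0.95, 0.03, 0.00],
                  [0.01, 0.85, 0.00],
                  [0.20, 0.10, 0.70]])
# Short-rate loadings with delta_{1 g_2} = 0.
delta1 = np.array([1.0, 0.4, 0.0])

def loadings_by_maturity(phi_q, delta1, n_max):
    """Assumed recursion b_n = phi_q' b_{n-1} - delta1, collected for n = 1..n_max."""
    b = np.zeros_like(delta1)
    path = []
    for _ in range(n_max):
        b = phi_q.T @ b - delta1
        path.append(b.copy())
    return np.array(path)

path = loadings_by_maturity(phi_q, delta1, n_max=60)
hidden_loadings = path[:, G1:]    # loadings on g_2 at each maturity
priced_loadings = path[:, :G1]    # loadings on g_1 at each maturity
```

With the zero block, the last $G_2$ rows of $\Phi_g^{\mathbb{Q}\prime}$ involve only the previous $g_2$ loadings, so starting from zero they stay at zero by induction. Note that $g_{2t}$ can still forecast $g_{1t}$ under the $\mathbb{P}$ measure through $\Phi_{g,12}$.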

Appendix H Relationship to continuous-time

Appendix H.1 Mapping between NCG and CIR

To see why the standard NCG process ($\mu_h = 0$, diagonal $\Sigma_h$) converges to the Cox, Ingersoll, and Ross (1985) process in the continuous-time limit, note that the autoregressive gamma process can be written as a non-Gaussian autoregression with conditional heteroskedasticity

$$
w_{i,t+1} = \nu_{h,i}\Sigma_{h,ii} + \Phi_{h,i}' w_t + \sqrt{\nu_{h,i}\Sigma_{h,ii}^2 + 2\Sigma_{h,ii}\Phi_{h,i}' w_t}\;\varepsilon_{h,i,t+1},
$$

where the shock $\varepsilon_{h,i,t+1} = \left(w_{i,t+1} - E_t[w_{i,t+1}]\right)/\sqrt{V_t[w_{i,t+1}]}$ is a non-central gamma random variable that has been standardized to have mean zero and variance one. The shock $\varepsilon_{h,i,t+1}$ is a Poisson sum of gamma random variables. Due to the infinite divisibility of the gamma distribution and the central limit theorem, the standardized variable $\varepsilon_{h,i,t+1}$ tends to a Gaussian random variable as the time between observations $\tau \to 0$.
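The conditional mean and variance in this representation can be checked by simulation in the scalar case. The sketch below assumes the usual Poisson-mixture construction of the autoregressive gamma process: draw $z \sim \text{Poisson}(\Phi_h w_t / \Sigma_h)$, then $w_{t+1} \mid z \sim \text{Gamma}(\nu_h + z, \Sigma_h)$ with $\Sigma_h$ as the scale. That construction is an assumption of this example and is not spelled out in this appendix; the parameter values are arbitrary.

```python
import numpy as np

def simulate_arg_step(w, nu, phi, sigma, size, rng):
    """Draw w_{t+1} given w_t = w using Poisson mixing of gamma variables."""
    z = rng.poisson(phi * w / sigma, size=size)
    return rng.gamma(shape=nu + z, scale=sigma)

rng = np.random.default_rng(0)
nu, phi, sigma, w = 2.0, 0.9, 0.05, 1.0
draws = simulate_arg_step(w, nu, phi, sigma, size=200_000, rng=rng)

# Conditional moments implied by the autoregressive representation.
mean_formula = nu * sigma + phi * w
var_formula = nu * sigma**2 + 2 * sigma * phi * w

# Standardized shock with mean zero and variance one.
eps = (draws - mean_formula) / np.sqrt(var_formula)
sample_mean, sample_var = draws.mean(), draws.var()
```

Under these assumptions the sample moments of `draws` should be close to `mean_formula` and `var_formula`, and `eps` should have mean near zero and variance near one, although it remains skewed for any fixed time step.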

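The continuous-time representation is

$$
dw_{it} = \left(\alpha_i - \kappa_i' w_t\right) dt + \sigma_i \sqrt{w_{it}}\, dW_{it}, \tag{H.1}
$$

with the following mapping:

$$
\Phi_h = I_H - \kappa\tau, \qquad \nu_{h,i} = \frac{2\alpha_i}{\sigma_i^2}, \qquad \Sigma_{h,ii} = \frac{\sigma_i^2 \tau}{2}.
$$

The continuous-time process implies first and second conditional moments

$$
\begin{aligned}
E\left[dw_{it} \mid w_t\right] = \left(\alpha_i - \kappa_i' w_t\right) dt \quad &\Longrightarrow \quad E\left[w_{i,t+\tau} \mid w_t\right] = \alpha_i \tau + \left(e_i' - \kappa_i' \tau\right) w_t, \\
V\left[dw_{it} \mid w_t\right] = \sigma_i^2 w_{it}\, dt \quad &\Longrightarrow \quad V\left[w_{i,t+\tau} \mid w_t\right] = \sigma_i^2 w_{it} \tau,
\end{aligned}
$$

where $e_i$ is the $i$-th unit vector, while the discrete-time process implies

$$
\begin{aligned}
E\left[w_{i,t+\tau} \mid w_t\right] &= \nu_{h,i}\Sigma_{h,ii} + \Phi_{h,i}' w_t = \alpha_i \tau + \left(e_i' - \kappa_i' \tau\right) w_t, \\
V\left[w_{i,t+\tau} \mid w_t\right] &= \nu_{h,i}\Sigma_{h,ii}^2 + 2\Sigma_{h,ii}\Phi_{h,i}' w_t = \frac{\alpha_i \sigma_i^2 \tau^2}{2} + \sigma_i^2 \tau \left(e_i' - \kappa_i' \tau\right) w_t.
\end{aligned}
$$

When $\tau \to 0$, the terms of order $\tau^2$ vanish relative to the terms of order $\tau$, so the discrete-time conditional moments approach their continuous-time counterparts.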
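The convergence can be seen numerically in the scalar case ($H = 1$). The sketch below applies the mapping to hypothetical CIR parameters and compares the discrete-time moments with the first-order continuous-time moments for shrinking time steps. The function names and parameter values are illustrative.

```python
def discrete_moments(alpha, kappa, sigma, tau, w):
    """Conditional mean and variance of the scalar NCG process under the mapping."""
    phi = 1.0 - kappa * tau
    nu = 2.0 * alpha / sigma**2
    big_sigma = sigma**2 * tau / 2.0
    mean = nu * big_sigma + phi * w
    var = nu * big_sigma**2 + 2.0 * big_sigma * phi * w
    return mean, var

def continuous_moments(alpha, kappa, sigma, tau, w):
    """First-order conditional mean and variance of the CIR process over tau."""
    return w + (alpha - kappa * w) * tau, sigma**2 * w * tau

alpha, kappa, sigma, w = 0.02, 0.5, 0.1, 0.04
taus = [1 / 12, 1 / 52, 1 / 252, 1 / 2520]

mean_gaps, var_ratios = [], []
for tau in taus:
    m_d, v_d = discrete_moments(alpha, kappa, sigma, tau, w)
    m_c, v_c = continuous_moments(alpha, kappa, sigma, tau, w)
    mean_gaps.append(abs(m_d - m_c))
    var_ratios.append(v_d / v_c)
```

The means coincide for every step size, while the variance ratio equals $1 - \kappa\tau + \alpha\tau/(2w)$ and therefore tends to one as $\tau$ shrinks.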

Appendix H.2 Bond recursions in continuous time

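Assume the factor dynamics under the $\mathbb{Q}$ measure are

$$
dx_t = \left(\alpha - \kappa x_t\right) dt + \Sigma_t\, dW_t,
$$

where the covariance matrix is time varying and depends only on $h_t$:

$$
\Sigma_t \Sigma_t' = \Sigma_0 \Sigma_0' + \sum_{i=1}^{H} \Sigma_i \Sigma_i' h_{it}.
$$

We impose admissibility restrictions to guarantee positivity of the non-Gaussian factors: $\alpha \geq 0$ and $\kappa_{1,hg} = 0$. The continuous-time versions of the bond recursions in equations (5) to (7) are

$$
\begin{aligned}
\dot{a} &= \alpha' b + \frac{1}{2} b' \Sigma_0 \Sigma_0' b - \delta_0 && \text{(H.2)} \\
\dot{b}_g &= -\kappa_{gg}' b_g - \delta_{1,g} && \text{(H.3)} \\
\dot{b}_h &= -\kappa_{hh}' b_h - \kappa_{gh}' b_g + \frac{1}{2}\left(I_H \otimes b'\right) \Sigma\Sigma' \left(\iota_H \otimes b\right) - \delta_{1,h} && \text{(H.4)}
\end{aligned}
$$

where $\Sigma\Sigma'$ is a block diagonal matrix with diagonal blocks $\Sigma_i \Sigma_i'$ for $i = 1, \ldots, H$.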
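As an illustration of (H.4), consider the special case of a single non-Gaussian factor and no Gaussian factors, with $\Sigma_1\Sigma_1' = \sigma^2$ and $\delta_{1,h} = 1$. The ODE then reduces to the scalar Riccati equation $\dot{b} = -\kappa b + \tfrac{1}{2}\sigma^2 b^2 - 1$ with $b(0) = 0$, whose solution is the negative of the textbook Cox, Ingersoll, and Ross bond loading. The sketch below integrates the ODE with a fourth-order Runge-Kutta scheme and compares it with that closed form. The reduction to the scalar case and all parameter values are assumptions of this example.

```python
import math

def riccati_rhs(b, kappa, sigma, delta1):
    """Right-hand side of the scalar version of (H.4)."""
    return -kappa * b + 0.5 * sigma**2 * b**2 - delta1

def integrate_loading(kappa, sigma, delta1, maturity, steps=1000):
    """Integrate b' = riccati_rhs(b) from b(0) = 0 with classical RK4."""
    b = 0.0
    dt = maturity / steps
    for _ in range(steps):
        k1 = riccati_rhs(b, kappa, sigma, delta1)
        k2 = riccati_rhs(b + 0.5 * dt * k1, kappa, sigma, delta1)
        k3 = riccati_rhs(b + 0.5 * dt * k2, kappa, sigma, delta1)
        k4 = riccati_rhs(b + dt * k3, kappa, sigma, delta1)
        b += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return b

def cir_closed_form(kappa, sigma, maturity):
    """Textbook CIR loading B(tau) in the bond price A(tau) * exp(-B(tau) * r)."""
    gamma = math.sqrt(kappa**2 + 2.0 * sigma**2)
    growth = math.exp(gamma * maturity) - 1.0
    return 2.0 * growth / ((gamma + kappa) * growth + 2.0 * gamma)

kappa, sigma, maturity = 0.5, 0.1, 10.0
b_numeric = integrate_loading(kappa, sigma, 1.0, maturity)
b_closed = -cir_closed_form(kappa, sigma, maturity)
gap = abs(b_numeric - b_closed)
```

In the multi-factor case no closed form is generally available, and the same integrator can be applied to the stacked system (H.2) to (H.4).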
