
FINANCIAL MATHEMATICS TEAM CHALLENGE

BRAZIL

A collection of the four reports from the 2018 Financial Mathematics Team Challenge Brazil


Preamble

One of the key aims of the FMTC-BR is for Brazilian postgraduate students in Financial and Insurance Mathematics to have the opportunity to focus on a topical, industry-relevant research project, while simultaneously developing links with international students and academics in the field. An allied purpose is to bring a variety of international researchers to Brazil to give them a glimpse of the dynamic environment at the School of Applied Mathematics (EMAp) of the Fundacao Getulio Vargas (FGV). The primary goal, however, is for students to learn to work in diverse teams and to be exposed to a healthy dose of fair competition.

Inspired by the success of the FMTC pioneered at the African Institute of Financial Markets and Risk Management (AIFMRM), University of Cape Town, in collaboration with University College London, EMAp/FGV was pleased to hold the first FMTC Brazil in Rio de Janeiro from 8th to 18th August 2018.

The Challenge brought together four teams of Masters and PhD students from Australia, Brazil, Canada and the USA to pursue intensive research in Financial Mathematics. Each team worked on a distinct research project for seven days and then presented their findings on the final two days in extended seminar talks. The teams were mentored by expert academics from Brazil, Canada, South Africa and the USA. Each research problem was proposed by the mentors and selected from topical research areas. In order to prepare each team for the Challenge, initial guidance and preliminary reading was given at the beginning of July. The team recognised for the highest-quality solution was awarded a floating trophy.

The research pursued by the four teams included projects on (a) Solving Challenging PDEs in Finance and Economics using Deep Learning, (b) (Machine) Learning the Greeks, (c) The LIBOR Forward Market Model and Emerging Market Swaption Implied Volatility Term Structures, and on (d) Machine Learning and Stochastic Control in Algorithmic Trading.

Each team of students wrote a report on their findings. The four reports constitute this volume, which will be available to future FMTC participants. It may also be of


use and inspiration to Masters and PhD students in Financial and Insurance Mathematics.

The teams enjoyed a welcoming, well-equipped and motivating work environment. Our gaze is set on organising the next FMTC-BR!

Andrea Macrina, University College London
Rodrigo Targino, Fundacao Getulio Vargas (EMAp)


Contents

1. A. Al-Aradi, A. Correia, D. Naiff, G. Jardim 1
   Solving Nonlinear and High-Dimensional Partial Differential Equations via Deep Learning

2. M. Andrade, P. Casgrain, L. Alvarenga, A. Schmidt
   Machine Learning the Greeks

3. F. Nascimento, L. Farias, H. Yan, R. Riva
   Machine Learning and Stochastic Control in Algorithmic Trading

4. A. Angiuli, C. Antunes, C. Paolucci, A. Sombra
   A Calibration of the Lognormal Forward LIBOR Model

1 Winning team of the first Financial Mathematics Team Challenge Brazil


Solving Nonlinear and High-Dimensional Partial Differential Equations via Deep Learning

TEAM ONE

ALI AL-ARADI, University of Toronto
ADOLFO CORREIA, Instituto de Matematica Pura e Aplicada
DANILO NAIFF, Universidade Federal do Rio de Janeiro
GABRIEL JARDIM, Fundacao Getulio Vargas

Supervisor: YURI SAPORITO, Fundacao Getulio Vargas

EMAp, Fundacao Getulio Vargas, Rio de Janeiro, Brazil


Contents

1 Introduction

2 An Introduction to Partial Differential Equations
  2.1 Overview
  2.2 The Black-Scholes Partial Differential Equation
  2.3 The Fokker-Planck Equation
  2.4 Stochastic Optimal Control and Optimal Stopping
  2.5 Mean Field Games

3 Numerical Methods for PDEs
  3.1 Finite Difference Method
  3.2 Galerkin Methods
  3.3 Finite Element Methods
  3.4 Monte Carlo Methods

4 An Introduction to Deep Learning
  4.1 Neural Networks and Deep Learning
  4.2 Stochastic Gradient Descent
  4.3 Backpropagation
  4.4 Summary
  4.5 The Universal Approximation Theorem
  4.6 Other Topics

5 The Deep Galerkin Method
  5.1 Introduction
  5.2 Mathematical Details
  5.3 A Neural Network Approximation Theorem
  5.4 Implementation Details

6 Implementation of the Deep Galerkin Method
  6.1 How this chapter is organized
  6.2 European Call Options
  6.3 American Put Options


  6.4 Fokker-Planck Equations
  6.5 Stochastic Optimal Control Problems
  6.6 Systemic Risk
  6.7 Mean Field Games
  6.8 Conclusions and Future Work


Chapter 1

Introduction

In this work we present a methodology for numerically solving a wide class of partial differential equations (PDEs) and PDE systems using deep neural networks. The PDEs we consider are related to various applications in quantitative finance including option pricing, optimal investment and the study of mean field games and systemic risk. The numerical method is based on the Deep Galerkin Method (DGM) described in Sirignano and Spiliopoulos (2018) with modifications made depending on the application of interest.

The main idea behind DGM is to represent the unknown function of interest using a deep neural network. Noting that the function must satisfy a known PDE, the network is trained by minimizing losses related to the differential operator, the initial/terminal conditions and the boundary conditions given in the initial value and/or boundary problem. The training data for the neural network consists of different possible inputs to the function and is obtained by sampling randomly from the region on which the PDE is defined. One of the key features of this approach is the fact that, unlike other commonly used numerical approaches such as finite difference methods, it is mesh-free. As such, it does not suffer (as much as other numerical methods) from the curse of dimensionality associated with high-dimensional PDEs and PDE systems.
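To make the training objective concrete, here is a minimal pure-Python sketch of a DGM-style loss, using a toy heat-type equation ∂t u + ½ ∂xx u = 0 with terminal condition u(T, x) = G(x) as a stand-in for the pricing PDEs treated later. A single-hidden-layer tanh "network" with randomly initialized (untrained) weights is used so that the required derivatives can be written out by hand; a real implementation would use a deep network, automatic differentiation, and would minimize this loss by stochastic gradient descent. All parameter choices here are illustrative.

```python
import math
import random

random.seed(0)

# One hidden layer of tanh units, standing in for the deep network in DGM;
# the weights are random placeholders (untrained).
H = 8
w = [random.gauss(0, 1) for _ in range(H)]   # weights on t
v = [random.gauss(0, 1) for _ in range(H)]   # weights on x
b = [random.gauss(0, 1) for _ in range(H)]   # biases
c = [random.gauss(0, 1) for _ in range(H)]   # output weights

def u(t, x):
    return sum(c[j] * math.tanh(w[j] * t + v[j] * x + b[j]) for j in range(H))

def u_t(t, x):
    # d/dt tanh(z) = w * sech^2(z), with sech^2(z) = 1 - tanh^2(z)
    return sum(c[j] * w[j] * (1 - math.tanh(w[j] * t + v[j] * x + b[j]) ** 2)
               for j in range(H))

def u_xx(t, x):
    # d^2/dx^2 tanh(z) = -2 v^2 tanh(z) sech^2(z)
    total = 0.0
    for j in range(H):
        z = math.tanh(w[j] * t + v[j] * x + b[j])
        total += c[j] * v[j] ** 2 * (-2.0) * z * (1 - z ** 2)
    return total

def dgm_loss(n_pts=256, T=1.0, G=lambda x: max(x - 1.0, 0.0)):
    # interior loss: squared residual of u_t + 0.5 * u_xx = 0 at points
    # sampled uniformly from [0, T] x [0, 2] (the mesh-free step)
    interior = 0.0
    for _ in range(n_pts):
        t, x = random.uniform(0, T), random.uniform(0, 2)
        interior += (u_t(t, x) + 0.5 * u_xx(t, x)) ** 2
    # terminal loss: match u(T, x) to the terminal condition G
    terminal = 0.0
    for _ in range(n_pts):
        x = random.uniform(0, 2)
        terminal += (u(T, x) - G(x)) ** 2
    return interior / n_pts + terminal / n_pts

loss = dgm_loss()
```

The random sampling in `dgm_loss` is exactly what replaces the mesh of grid-based methods, which is why the approach scales to high-dimensional state spaces.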

The main goals of this paper are to:

1. Present a brief overview of PDEs and how they arise in quantitative finance along with numerical methods for solving them.

2. Present a brief overview of deep learning; in particular, the notion of neural networks, along with an exposition of how they are trained and used.

3. Discuss the theoretical foundations of DGM, with a focus on the justification of why this method is expected to perform well.


4. Elucidate the features, capabilities and limitations of DGM by analyzing aspects of its implementation for a number of different PDEs and PDE systems.

Figure 1.1: Grid-based finite difference method (left) vs. Deep Galerkin Method (right)

We present the results in a manner that highlights our own learning process, where we show our failures and the steps we took to remedy any issues we faced. The main messages can be distilled into three points:

1. Sampling method matters: DGM is based on random sampling; where and how the sampled random points used for training are chosen is the single most important factor in determining the accuracy of the method.

2. Prior knowledge matters: similar to other numerical methods, having information about the solution that can guide the implementation dramatically improves the results.

3. Training time matters: neural networks sometimes need more time than we afford them, and better results can be obtained simply by letting the algorithm run longer.


Chapter 2

An Introduction to Partial Differential Equations

2.1 Overview

Partial differential equations (PDEs) are ubiquitous in many areas of science, engineering, economics and finance. They are often used to describe natural phenomena and model multidimensional dynamical systems. In the context of finance, finding solutions to PDEs is crucial for problems of derivative pricing, optimal investment, optimal execution, mean field games and many more. In this section, we discuss some introductory aspects of partial differential equations and motivate their importance in quantitative finance with a number of examples.

In short, PDEs describe a relation between a multivariable function and its partial derivatives. There is a great deal of variety in the types of PDEs that one can encounter, both in terms of form and complexity. They can vary in order; they may be linear or nonlinear; they can involve various types of initial/terminal conditions and boundary conditions. In some cases, we encounter systems of coupled PDEs where multiple functions are connected to one another through their partial derivatives. In other cases, we find free boundary problems or variational inequalities where both the function and its domain are unknown and both must be solved for simultaneously.

To express some of the ideas in the last paragraph mathematically, let us provide some definitions. A k-th order partial differential equation is an expression of the form:

    F( D^k u(x), D^(k-1) u(x), ..., Du(x), u(x), x ) = 0,   x ∈ Ω ⊂ R^n

where D^k denotes the collection of all partial derivatives of order k and u : Ω → R is the unknown function we wish to solve for.


PDEs can take one of the following forms:

1. Linear PDE: the derivative coefficients and the source term do not depend on u or any of its derivatives:

       Σ_{|α|≤k} a_α(x) · D^α u = f(x)

   where the left-hand side is linear in the derivatives and f is the source term.

2. Semi-linear PDE: the coefficients of the highest order derivatives do not depend on lower order derivatives:

       Σ_{|α|=k} a_α(x) · D^α u + a_0( D^(k-1)u, ..., Du, u, x ) = 0

3. Quasi-linear PDE: linear in the highest order derivatives, with coefficients that depend on lower order derivatives:

       Σ_{|α|=k} a_α( D^(k-1)u, ..., Du, u, x ) · D^α u + a_0( D^(k-1)u, ..., Du, u, x ) = 0

   where the source term a_0 does not depend on the highest order derivatives.

4. Fully nonlinear PDE: depends nonlinearly on the highest order derivatives.

A system of partial differential equations is a collection of several PDEs involving multiple unknown functions:

    F( D^k u(x), D^(k-1) u(x), ..., Du(x), u(x), x ) = 0,   x ∈ Ω ⊂ R^n

where u : Ω → R^m.

Generally speaking, the PDE forms above are listed in order of increasing difficulty. Furthermore:

• Higher-order PDEs are more difficult to solve than lower-order PDEs;

• Systems of PDEs are more difficult to solve than single PDEs;

• PDEs increase in difficulty with more state variables.


In certain cases, we require the unknown function u to be equal to some known function on the boundary of its domain ∂Ω. Such a condition is known as a boundary condition (or an initial/terminal condition when dealing with a time dimension). This is the case for the PDEs that we investigate in Chapter 5.

Next, we present a number of examples to demonstrate the prevalence of PDEs in financial applications. Further discussion of the basics of PDEs (and more advanced topics) such as well-posedness, existence and uniqueness of solutions, classical and weak solutions, and regularity can be found in Evans (2010).

2.2 The Black-Scholes Partial Differential Equation

One of the most well-known results in quantitative finance is the Black-Scholes equation and the associated Black-Scholes PDE discussed in the seminal work of Black and Scholes (1973). Though they are used to solve for the price of various financial derivatives, for illustrative purposes we begin with a simple variant of this equation relevant for pricing a European-style contingent claim.

2.2.1 European-Style Derivatives

European-style contingent claims are financial instruments written on a source of uncertainty with a payoff that depends on the level of the underlying at a predetermined maturity date. We assume a simple market model known as the Black-Scholes model, wherein a risky asset follows a geometric Brownian motion (GBM) with constant drift and volatility parameters and the short rate of interest is constant. That is, the dynamics of the price processes for a risky asset X = (X_t)_{t≥0} and a riskless bank account B = (B_t)_{t≥0} under the "real-world" probability measure P are given by:

    dX_t / X_t = µ dt + σ dW_t
    dB_t / B_t = r dt

where W = (W_t)_{t≥0} is a P-Brownian motion.

We are interested in pricing a claim written on the asset X with payoff function G(x) and expiration date T. Then, assuming that the claim's price function g(t, x), which determines the value of the claim at time t when the underlying asset is at the level X_t = x, is sufficiently smooth, it can be shown by dynamic hedging and no-arbitrage arguments that g must satisfy the Black-Scholes PDE:


    ∂t g(t, x) + rx · ∂x g(t, x) + ½σ²x² · ∂xx g(t, x) = r · g(t, x)
    g(T, x) = G(x)
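As a quick sanity check, one can verify numerically that the Black-Scholes closed-form call price satisfies the PDE above. The sketch below (parameter values are illustrative) evaluates the PDE residual with central finite differences; the residual should be close to zero at any interior point:

```python
import math

def norm_cdf(y):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def bs_call(t, x, K=100.0, T=1.0, r=0.05, sigma=0.2):
    # Black-Scholes price of a European call with strike K and maturity T
    tau = T - t
    d1 = (math.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return x * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

def pde_residual(t, x, r=0.05, sigma=0.2, h=1e-4):
    # central finite differences for the partial derivatives of g
    g = bs_call
    g_t = (g(t + h, x) - g(t - h, x)) / (2 * h)
    g_x = (g(t, x + h) - g(t, x - h)) / (2 * h)
    g_xx = (g(t, x + h) - 2 * g(t, x) + g(t, x - h)) / h**2
    return g_t + r * x * g_x + 0.5 * sigma**2 * x**2 * g_xx - r * g(t, x)

c0 = bs_call(0.5, 105.0)
res = pde_residual(0.5, 105.0)
```

The same residual construction is what DGM penalizes during training, with the finite differences replaced by automatic differentiation of the network.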

This simple model and the corresponding PDE can be extended in several ways, e.g.

• incorporating additional sources of uncertainty;

• including non-traded processes as underlying sources of uncertainty;

• allowing for richer asset price dynamics, e.g. jumps, stochastic volatility;

• pricing more complex payoff functions, e.g. path-dependent payoffs.

2.2.2 American-Style Derivatives

In contrast to European-style contingent claims, American-style derivatives allow the option holder to exercise the derivative prior to the maturity date and receive the payoff immediately based on the prevailing value of the underlying. This can be described as an optimal stopping problem (more on this topic in Section 2.4).

To describe the problem of pricing an American option, let T[t, T] be the set of admissible stopping times in [t, T] at which the option holder can exercise, and let Q be the risk-neutral measure. Then the price of an American-style contingent claim is given by:

    g(t, x) = sup_{τ ∈ T[t,T]} E^Q[ e^{−r(τ−t)} G(X_τ) | X_t = x ]

Using dynamic programming arguments, it can be shown that optimal stopping problems admit a dynamic programming equation. In this case, the solution of this equation yields the price of the American option. Assuming the same market model as in the previous section, it can be shown that the price function g(t, x) for the American-style option with payoff function G(x), assuming sufficient smoothness, satisfies the variational inequality:

    max{ (∂t + L − r) g, G − g } = 0,   for (t, x) ∈ [0, T] × R

where L = rx · ∂x + ½σ²x² · ∂xx is a differential operator.

The last equation has a simple interpretation. Of the two terms in the curly brackets, one will be equal to zero while the other will be negative. The first term is equal to zero when g(t, x) > G(x), i.e. when the option value is greater than the intrinsic


(early exercise) value, the option is not exercised early and the price function satisfies the usual Black-Scholes PDE. When the second term is equal to zero, we have that g(t, x) = G(x); in other words, the option value is equal to the exercise value (i.e. the option is exercised). As such, the region where g(t, x) > G(x) is referred to as the continuation region and the curve where g(t, x) = G(x) is called the exercise boundary. Notice that it is not possible to have g(t, x) < G(x) since both terms are bounded above by 0.

It is also worth noting that this variational inequality can be written as follows:

    ∂t g + rx · ∂x g + ½σ²x² · ∂xx g − r · g = 0   for (t, x) such that g(t, x) > G(x)
    g(t, x) ≥ G(x)   for (t, x) ∈ [0, T] × R
    g(T, x) = G(x)   for x ∈ R

where we drop the explicit dependence on (t, x) for brevity. The free boundary set in this problem is F = {(t, x) : g(t, x) = G(x)}, which must be determined alongside the unknown price function g. The set F is referred to as the exercise boundary; once the price of the underlying asset hits the boundary, the investor's optimal action is to exercise the option immediately.
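The dynamic programming structure behind this variational inequality is easy to illustrate on a binomial tree, where the comparison max{continuation, intrinsic} is applied at every node. The following sketch (with illustrative parameters) prices an American put this way, alongside its European counterpart:

```python
import math

def put_binomial(S0=100.0, K=100.0, T=1.0, r=0.05, sigma=0.2, N=200, american=True):
    """Price a put on a Cox-Ross-Rubinstein binomial tree. For the American
    put, each node takes max(continuation, intrinsic), a discrete analogue
    of the variational inequality max{(dt + L - r)g, G - g} = 0."""
    dt = T / N
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)      # risk-neutral up-move probability
    disc = math.exp(-r * dt)
    # option values at maturity
    vals = [max(K - S0 * u**j * d**(N - j), 0.0) for j in range(N + 1)]
    # backward induction through the tree
    for i in range(N - 1, -1, -1):
        for j in range(i + 1):
            cont = disc * (p * vals[j + 1] + (1 - p) * vals[j])
            if american:
                vals[j] = max(cont, max(K - S0 * u**j * d**(i - j), 0.0))
            else:
                vals[j] = cont
    return vals[0]

amer = put_binomial(american=True)
euro = put_binomial(american=False)
```

The gap between the two prices is the early-exercise premium; on the tree, the exercise boundary is the set of nodes where the intrinsic value wins the comparison.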

2.3 The Fokker-Planck Equation

We now turn our attention to another application of PDEs in the context of stochastic processes. Suppose we have an Ito process on R^d with time-independent drift and diffusion coefficients:

    dX_t = µ(X_t) dt + σ(X_t) dW_t

and assume that the initial point is a random vector X_0 with distribution given by a probability density function f(x). A natural question to ask is: "what is the probability that the process is in a given region A ⊂ R^d at time t?" This quantity can be computed as an integral of the probability density function of the random vector X_t, denoted by p(t, x):

    P(X_t ∈ A) = ∫_A p(t, x) dx

The Fokker-Planck equation is a partial differential equation that p(t, x) can be shown to satisfy:

    ∂t p(t, x) + Σ_{j=1..d} ∂j( µ_j(x) · p(t, x) ) − ½ Σ_{i,j=1..d} ∂ij( [σ(x)σ(x)ᵀ]_ij · p(t, x) ) = 0,   (t, x) ∈ R+ × R^d
    p(0, x) = f(x),   x ∈ R^d


where ∂j and ∂ij denote first and second order partial differentiation with respect to x_j, and x_i and x_j, respectively. Under certain conditions on the initial distribution f, the above PDE admits a unique solution. Furthermore, the solution satisfies the property that p(t, x) is positive and integrates to 1, as required of a probability density function.

As an example, consider an Ornstein-Uhlenbeck (OU) process X = (X_t)_{t≥0} with a random starting point distributed according to an independent normal random variable with mean 0 and variance v. That is, X satisfies the stochastic differential equation (SDE):

    dX_t = κ(θ − X_t) dt + σ dW_t,   X_0 ∼ N(0, v)

where θ and κ are constants representing the mean reversion level and rate. Then, applying the Fokker-Planck equation with µ(x) = κ(θ − x), the probability density function p(t, x) for the location of the process at time t satisfies the PDE:

    ∂t p − κ · p − κ(x − θ) · ∂x p − ½σ² · ∂xx p = 0,   (t, x) ∈ R+ × R
    p(0, x) = (1/√(2πv)) · e^{−x²/(2v)}

Since the OU process with a fixed starting point is a Gaussian process, using a normally distributed random starting point amounts to combining the conditional distribution of the process with its (conjugate) prior, implying that X_t is normally distributed. We omit the derivation of the exact form of p(t, x) in this case.
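The omitted closed form is Gaussian with mean θ(1 − e^{−κt}) and variance v e^{−2κt} + σ²(1 − e^{−2κt})/(2κ). The sketch below (illustrative parameter values) plugs this density into the Fokker-Planck equation via finite differences and checks that the residual vanishes; note that the drift terms carry minus signs, which follow from applying the product rule to ∂x[κ(θ − x)p]:

```python
import math

KAPPA, THETA, SIG, V = 1.5, 0.3, 0.4, 0.25   # kappa, theta, sigma, v (illustrative)

def p(t, x):
    # Gaussian density with the known OU mean and variance for X0 ~ N(0, v)
    m = THETA * (1.0 - math.exp(-KAPPA * t))
    s = V * math.exp(-2 * KAPPA * t) + SIG**2 * (1 - math.exp(-2 * KAPPA * t)) / (2 * KAPPA)
    return math.exp(-(x - m) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)

def fp_residual(t, x, h=1e-4):
    # central finite differences for the partial derivatives of p
    p_t = (p(t + h, x) - p(t - h, x)) / (2 * h)
    p_x = (p(t, x + h) - p(t, x - h)) / (2 * h)
    p_xx = (p(t, x + h) - 2 * p(t, x) + p(t, x - h)) / h**2
    # drift terms: d/dx[kappa*(theta - x)*p] = -kappa*p - kappa*(x - theta)*p_x
    return p_t - KAPPA * p(t, x) - KAPPA * (x - THETA) * p_x - 0.5 * SIG**2 * p_xx

res = fp_residual(0.7, 0.5)
```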

2.4 Stochastic Optimal Control and Optimal Stopping

Two classes of problems that heavily feature PDEs are stochastic optimal control and optimal stopping problems. In this section we give a brief overview of these problems along with some examples. For a thorough overview, see Touzi (2012), Pham (2009) or Cartea et al. (2015).

In stochastic control problems, a controller attempts to maximize a measure of success, referred to as a performance criterion, which depends on the path of some stochastic process, by taking actions (choosing controls) that influence the dynamics of the process. In optimal stopping problems, the performance criterion depends on a stopping time chosen by the agent; the early exercise of American options discussed earlier in this chapter is an example of such a problem.

To discuss these in concrete terms, let X^u = (X^u_t)_{t≥0} be a controlled Ito process satisfying the stochastic differential equation:

    dX^u_t = µ(t, X^u_t, u_t) dt + σ(t, X^u_t, u_t) dW_t,   X^u_0 = 0


where u = (u_t)_{t≥0} is a control process chosen by the controller from an admissible set A. Notice that the drift and volatility of the process are influenced by the controller's actions. For a given control, the agent's performance criterion is:

    H^u(x) = E[ ∫_0^T F(s, X^u_s, u_s) ds + G(X^u_T) ]

where the integral term is the running reward and G(X^u_T) the terminal reward.

The key to solving optimal control problems and finding the optimal control u* lies in the dynamic programming principle (DPP), which involves embedding the original optimization problem into a larger class of problems indexed by time, with the original problem corresponding to t = 0. This requires us to define:

    H^u(t, x) = E_{t,x}[ ∫_t^T F(s, X^u_s, u_s) ds + G(X^u_T) ]

where E_{t,x}[·] = E[· | X^u_t = x]. The value function is the value of the performance criterion when the agent adopts the optimal control:

    H(t, x) = sup_{u∈A} H^u(t, x)

Assuming enough regularity, the value function can be shown to satisfy a dynamic programming equation (DPE), also called a Hamilton-Jacobi-Bellman (HJB) equation. This is a PDE that can be viewed as an infinitesimal version of the DPP. The HJB equation is given by:

    ∂t H(t, x) + sup_{u∈A} { L^u_t H(t, x) + F(t, x, u) } = 0
    H(T, x) = G(x)

where the differential operator L^u_t is the infinitesimal generator of the controlled process X^u, an analogue of the derivative for stochastic processes, given by:

    L f(t, X_t) = lim_{h↓0} ( E_t[ f(t + h, X_{t+h}) ] − f(t, X_t) ) / h

Broadly speaking, the optimal control is obtained as follows:

1. Solve the first order condition (inner optimization) to obtain the optimal control in terms of the derivatives of the value function, i.e. in feedback form;

2. Substitute the optimal control back into the HJB equation, usually yielding a highly nonlinear PDE, and solve this PDE for the unknown value function;

3. Use the value function to derive an explicit expression for the optimal control.


For optimal stopping problems, the optimization problem can be written as:

    sup_{τ∈T} E[ G(X_τ) ]

where T is the set of admissible stopping times. Similar to the optimal control problem, we can derive a DPE for the optimal stopping problem in the form of a variational inequality, assuming sufficient regularity in the value function H. Namely,

    max{ (∂t + L_t) H, G − H } = 0   on [0, T] × R

The interpretation of this equation was discussed in Section 2.2.2 for American-style derivatives, where we saw how the equation can be viewed as a free boundary problem.

It is possible to extend the problems discussed in this section in many directions by considering multidimensional processes, infinite horizons (for running rewards), incorporating jumps and combining optimal control and stopping in a single problem. This leads to more complex forms of the corresponding dynamic programming equation.

Next, we discuss a number of examples of HJB equations that arise in the context of problems in quantitative finance.

2.4.1 The Merton Problem

In the Merton problem, an agent chooses the proportion of their wealth that they wish to invest in a risky asset and a risk-free asset through time. They seek to maximize the expected utility of terminal wealth at the end of their investment horizon; see Merton (1969) for the investment-consumption problem and Merton (1971) for extensions in a number of directions. Once again, we assume the Black-Scholes market model:

    dS_t / S_t = µ dt + σ dW_t
    dB_t / B_t = r dt

The wealth process X^π = (X^π_t)_{t≥0} of a portfolio that invests a proportion π_t of wealth in the risky asset and the remainder in the risk-free asset satisfies the following SDE:

    dX^π_t = ( π_t(µ − r) + r X^π_t ) dt + σπ_t dW_t

The investor is faced with the following stochastic optimal control problem:

    sup_{π∈A} E[ U(X^π_T) ]


where A is the set of admissible strategies and U(x) is the investor's utility function. The value function is given by:

    H(t, x) = sup_{π∈A} E[ U(X^π_T) | X^π_t = x ]

which satisfies the following HJB equation:

    ∂t H + sup_{π∈A} { ( π(µ − r) + rx ) · ∂x H + ½σ²π² · ∂xx H } = 0
    H(T, x) = U(x)

If we assume an exponential utility function with risk preference parameter γ, that is U(x) = −e^{−γx}, then the value function and the optimal control can be obtained in closed form:

    H(t, x) = −exp[ −xγ e^{r(T−t)} − ½λ²(T−t) ]
    π*_t = ( λ / (γσ) ) e^{−r(T−t)}

where λ = (µ − r)/σ is the market price of risk.
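This closed form can be checked against the HJB equation directly: substituting the optimal control in feedback form, π* = −(µ − r) ∂x H / (σ² ∂xx H), the residual should vanish. A finite-difference sketch with illustrative parameter values:

```python
import math

MU, R, SIG, GAM, T = 0.08, 0.03, 0.25, 2.0, 1.0   # illustrative parameters
LAM = (MU - R) / SIG                              # market price of risk

def H(t, x):
    # closed-form Merton value function under exponential utility
    tau = T - t
    return -math.exp(-x * GAM * math.exp(R * tau) - 0.5 * LAM**2 * tau)

def hjb_residual(t, x, h=1e-4):
    # central finite differences for the partial derivatives of H
    H_t = (H(t + h, x) - H(t - h, x)) / (2 * h)
    H_x = (H(t, x + h) - H(t, x - h)) / (2 * h)
    H_xx = (H(t, x + h) - 2 * H(t, x) + H(t, x - h)) / h**2
    pi = -(MU - R) * H_x / (SIG**2 * H_xx)   # optimal control in feedback form
    return pi, H_t + (pi * (MU - R) + R * x) * H_x + 0.5 * SIG**2 * pi**2 * H_xx

pi_fb, res = hjb_residual(0.4, 1.0)
pi_closed = LAM / (GAM * SIG) * math.exp(-R * (T - 0.4))
```

The feedback control recovered from the derivatives of H agrees with the closed-form expression for π*, illustrating step 3 of the recipe above.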

It is also worthwhile to note that the solution to the Merton problem plays an important role in the substitute hedging and indifference pricing literature; see e.g. Henderson and Hobson (2002) and Henderson and Hobson (2004).

2.4.2 Optimal Execution with Price Impact

Stochastic optimal control, and hence PDEs in the form of HJB equations, feature prominently in the algorithmic trading literature, such as in the classical work of Almgren and Chriss (2001) and, more recently, Cartea and Jaimungal (2015) and Cartea and Jaimungal (2016), to name a few. Here we discuss a simple algorithmic trading problem with an investor that wishes to liquidate an inventory of shares but is subject to price impact effects when trading too quickly. The challenge then involves balancing this effect with the possibility of experiencing a negative market move when trading too slowly.

We begin by describing the dynamics of the main processes underlying the model. The agent controls their (liquidation) trading rate ν_t, which in turn affects their inventory level Q^ν_t via:

    dQ^ν_t = −ν_t dt,   Q^ν_0 = q

Note that negative values of ν indicate that the agent is buying shares. The price of the underlying asset S^ν_t is modeled as a Brownian motion that experiences a


permanent price impact due to the agent's trading activity, entering the drift term linearly in the trading rate:

    dS^ν_t = −b ν_t dt + σ dW_t,   S^ν_0 = S

By selling too quickly, the agent applies downward pressure (linear in the trading rate, with factor b > 0) on the asset price, which is unfavorable to a liquidating agent. Furthermore, placing larger orders also comes at the cost of increased temporary price impact. This is modeled by noting that the cashflow from a particular transaction is based on the execution price Ŝ^ν_t, which is linearly related to the fundamental price (with a factor of k > 0):

    Ŝ^ν_t = S^ν_t − k ν_t

The cash process X^ν_t evolves according to:

    dX^ν_t = Ŝ^ν_t ν_t dt,   X^ν_0 = x

With the model in place, we can consider the agent's performance criterion, which consists of maximizing their terminal cash, with penalties for excess inventory levels both at the terminal date and throughout the liquidation horizon. The performance criterion is:

    H^ν(t, x, S, q) = E_{t,x,S,q}[ X^ν_T + Q^ν_T( S^ν_T − α Q^ν_T ) − φ ∫_t^T (Q^ν_u)² du ]

where the first term is the terminal cash, the second the value of the terminal inventory and the third a running inventory penalty.

where α and φ are preference parameters that control the level of penalty for the terminal and running inventories, respectively. The value function satisfies the HJB equation:

    ( ∂t + ½σ² ∂SS ) H − φq² + sup_ν { ν(S − kν) ∂x H − bν · ∂S H − ν ∂q H } = 0
    H(T, x, S, q) = x + Sq − αq²

Using a carefully chosen ansatz, we can solve for the value function and the optimal control:

    H(t, x, S, q) = x + qS + ( h(t) − b/2 ) q²

    ν*_t = γ · ( ζ e^{γ(T−t)} + e^{−γ(T−t)} ) / ( ζ e^{γ(T−t)} − e^{−γ(T−t)} ) · Q^{ν*}_t

where

    h(t) = √(kφ) · ( 1 + ζ e^{2γ(T−t)} ) / ( 1 − ζ e^{2γ(T−t)} ),
    γ = √(φ/k),
    ζ = ( α − ½b + √(kφ) ) / ( α − ½b − √(kφ) )
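A useful consistency check on these formulas: at t = T the ansatz must match the terminal condition H(T, x, S, q) = x + Sq − αq², which forces h(T) = b/2 − α. The sketch below (parameter values are illustrative) verifies this numerically:

```python
import math

K_IMP, PHI, ALPHA, B, T = 0.01, 0.1, 1.0, 0.002, 1.0   # k, phi, alpha, b (illustrative)

GAM = math.sqrt(PHI / K_IMP)
ROOT = math.sqrt(K_IMP * PHI)
ZETA = (ALPHA - 0.5 * B + ROOT) / (ALPHA - 0.5 * B - ROOT)

def h(t):
    # h(t) from the ansatz H = x + qS + (h(t) - b/2) q^2
    e = math.exp(2 * GAM * (T - t))
    return ROOT * (1 + ZETA * e) / (1 - ZETA * e)

# At t = T the ansatz must equal x + Sq - alpha*q^2, i.e. h(T) = b/2 - alpha.
terminal_gap = h(T) - (0.5 * B - ALPHA)
```

Checks of this sort (terminal and boundary conditions of a proposed ansatz) are also useful diagnostics when validating DGM solutions of HJB equations later in this report.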

For other optimal execution problems, the interested reader is referred to Chapter 6 of Cartea et al. (2015).


2.4.3 Systemic Risk

Yet another application of PDEs in optimal control is the topic of Carmona et al. (2015). The focus of that paper is on systemic risk, the study of instability in the entire market rather than in a single entity, in a setting where a number of banks are borrowing and lending with the central bank with the target of being at or around the average monetary reserve level across the economy. Once a characterization of optimal behavior is obtained, questions surrounding the stability of the system and the possibility of multiple defaults can be addressed. This is an example of a stochastic game, with multiple players determining their preferred course of action based on the actions of others. The objective in stochastic games is usually the determination of Nash equilibria, or sets of strategies where no player has an incentive to change their action.

The main processes underlying this problem are the log-monetary reserves of each bank, denoted $X^{i} = (X^{i}_{t})_{t\geq 0}$ and assumed to satisfy the SDE:
\[
dX^{i}_{t} = \big[a\big(\bar{X}_{t} - X^{i}_{t}\big) + \alpha^{i}_{t}\big]\,dt + \sigma\,dW^{i}_{t}
\]
where $W^{i}_{t} = \rho W^{0}_{t} + \sqrt{1-\rho^{2}}\,\widetilde{W}^{i}_{t}$ are Brownian motions correlated through a common noise process, $\bar{X}_{t}$ is the average log-reserve level and $\alpha^{i}_{t}$ is the rate at which bank $i$ borrows from or lends to the central bank. The interdependence of reserves appears in a number of places: first, the drift contains a mean-reversion term that draws each bank's reserve level to the average with mean-reversion rate $a$; second, the noise terms are driven partially by a common noise process.

The agent's control in this problem is the borrowing/lending rate $\alpha^{i}$. Their aim is to remain close to the average reserve level at all times over some fixed horizon. Thus, they penalize any deviations from this (stochastic) average level in the interim and at the end of the horizon. They also penalize borrowing and lending from the central bank at high rates, as well as borrowing (resp. lending) when their own reserve level is above (resp. below) the average level. Formally, the performance criterion is given by:

\[
J^{i}\big(\alpha^{1},\ldots,\alpha^{N}\big) = \mathbb{E}\Big[\int_{0}^{T} f_{i}\big(\boldsymbol{X}_{t},\alpha^{i}_{t}\big)\,dt + g_{i}\big(X^{i}_{T}\big)\Big]
\]
where the running penalties are:

\[
f_{i}(\boldsymbol{x},\alpha^{i}) = \underbrace{\tfrac{1}{2}\big(\alpha^{i}\big)^{2}}_{\substack{\text{excessive lending}\\ \text{or borrowing}}} \;-\; \underbrace{q\,\alpha^{i}\big(\bar{x} - x^{i}\big)}_{\substack{\text{borrowing/lending in}\\ \text{``the wrong direction''}}} \;+\; \underbrace{\tfrac{\varepsilon}{2}\big(\bar{x} - x^{i}\big)^{2}}_{\substack{\text{deviation from the}\\ \text{average level}}}
\]


and the terminal penalty is:

\[
g_{i}(\boldsymbol{x}) = \underbrace{\tfrac{c}{2}\big(\bar{x} - x^{i}\big)^{2}}_{\substack{\text{deviation from the}\\ \text{average level}}}
\]

where $c$, $q$, and $\varepsilon$ represent the investor's preferences with respect to the various penalties. Notice that the performance criterion for each agent depends on the strategies and reserve levels of all the agents, including themselves. Although the paper discusses multiple approaches to solving the problem (the Pontryagin stochastic maximum principle and an alternative forward-backward SDE approach), we focus on the HJB approach as this leads to a system of nonlinear PDEs. Using the dynamic programming principle, the HJB equation for agent $i$ is:

\[
\partial_{t}V^{i} + \inf_{\alpha^{i}}\Bigg\{\sum_{j=1}^{N}\big[a(\bar{x} - x^{j}) + \alpha^{j}\big]\,\partial_{j}V^{i} + \frac{\sigma^{2}}{2}\sum_{j,k=1}^{N}\big(\rho^{2} + \delta_{jk}(1-\rho^{2})\big)\,\partial_{jk}V^{i} + \frac{(\alpha^{i})^{2}}{2} - q\,\alpha^{i}(\bar{x} - x^{i}) + \frac{\varepsilon}{2}\big(\bar{x} - x^{i}\big)^{2}\Bigg\} = 0
\]
\[
V^{i}(T,\boldsymbol{x}) = \frac{c}{2}\big(\bar{x} - x^{i}\big)^{2}
\]

Remarkably, this system of PDEs can be solved in closed form to obtain the value function and the optimal control for each agent:

\[
V^{i}(t,\boldsymbol{x}) = \frac{\eta(t)}{2}\big(\bar{x} - x^{i}\big)^{2} + \mu(t), \qquad
\alpha^{i,*}_{t} = \Big(q + \Big(1 - \frac{1}{N}\Big)\eta(t)\Big)\big(\bar{X}_{t} - X^{i}_{t}\big)
\]
where
\[
\eta(t) = \frac{-(\varepsilon - q^{2})\big(e^{(\delta^{+}-\delta^{-})(T-t)} - 1\big) - c\big(\delta^{+}e^{(\delta^{+}-\delta^{-})(T-t)} - \delta^{-}\big)}{\big(\delta^{-}e^{(\delta^{+}-\delta^{-})(T-t)} - \delta^{+}\big) - c\big(1 - \tfrac{1}{N^{2}}\big)\big(e^{(\delta^{+}-\delta^{-})(T-t)} - 1\big)}
\]
\[
\mu(t) = \frac{1}{2}\,\sigma^{2}(1-\rho^{2})\Big(1 - \frac{1}{N}\Big)\int_{t}^{T}\eta(s)\,ds, \qquad
\delta^{\pm} = -(a+q) \pm \sqrt{R}, \qquad
R = (a+q)^{2} + \Big(1 - \frac{1}{N^{2}}\Big)\big(\varepsilon - q^{2}\big)
\]
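As a sanity check on this closed form, $\eta$ can be evaluated numerically; the ansatz $V^{i}(t,\boldsymbol{x}) = \tfrac{\eta(t)}{2}(\bar{x}-x^{i})^{2} + \mu(t)$ together with the terminal condition forces $\eta(T) = c$. A minimal Python sketch with illustrative (hypothetical) parameter values chosen so that $\varepsilon > q^{2}$:

```python
import numpy as np

# Illustrative parameters (hypothetical; chosen so that eps > q**2 and R > 0)
a, q, eps, c = 1.0, 0.5, 1.0, 2.0
N, T = 10, 1.0

R = (a + q) ** 2 + (1.0 - 1.0 / N ** 2) * (eps - q ** 2)
delta_p = -(a + q) + np.sqrt(R)
delta_m = -(a + q) - np.sqrt(R)

def eta(t):
    """Closed-form eta(t) from the systemic-risk solution above."""
    e = np.exp((delta_p - delta_m) * (T - t))
    num = -(eps - q ** 2) * (e - 1.0) - c * (delta_p * e - delta_m)
    den = (delta_m * e - delta_p) - c * (1.0 - 1.0 / N ** 2) * (e - 1.0)
    return num / den

# Terminal condition check: V^i(T,x) = (c/2)(xbar - x_i)^2 requires eta(T) = c
print(eta(T))
```

At $t = T$ the exponentials collapse to 1 and the formula reduces algebraically to $\eta(T) = c$, which the printout confirms.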


2.5 Mean Field Games

The final application of PDEs that we will consider is that of mean field games (MFGs). In financial contexts, MFGs are concerned with modeling the behavior of a large number of small interacting market participants. In a sense, an MFG can be viewed as the limiting form of the Nash equilibrium of a finite-player stochastic game (such as the interbank borrowing/lending problem from the previous section) as the number of participants tends to infinity. Though it may appear that this would make the problem more complicated, it often simplifies the underlying control problem. This is because in MFGs agents need not concern themselves with the actions of every other agent; rather, they pay attention only to the aggregate behavior of the other agents (the mean field). In some cases it is also possible to use the limiting solution to obtain approximations for Nash equilibria of finite-player games when direct computation is infeasible. The term "mean field" originates from mean field theory in physics which, similar to the financial context, studies systems composed of large numbers of particles where individual particles have negligible impact on the system. A mean field game typically consists of:

1. An HJB equation describing the optimal control problem of an individual;

2. A Fokker-Planck equation which governs the dynamics of the aggregate behavior of all agents.

Much of the pioneering work in MFGs is due to Huang et al. (2006) and Lasry and Lions (2007), but the focus of our exposition will be on a more recent work by Cardaliaguet and Lehalle (2017). Building on the optimal execution problem discussed earlier in this chapter, Cardaliaguet and Lehalle (2017) propose extensions in a number of directions. First, traders are assumed to be part of a mean field game, and the price of the underlying asset is impacted permanently not only by the actions of the agent but by the aggregate behavior of all agents acting in an optimal manner. In addition to this aggregate permanent impact, an individual trader faces the usual temporary impact of trading too quickly. The other extension is to allow for varying preferences among the traders in the economy. That is, traders may have different tolerance levels for the size of their inventories, both throughout the investment horizon and at its end. Intuitively, this framework can be thought of as the agents attempting to "trade optimally within the crowd."

Proceeding to the mathematical description of the problem, we have the following dynamics for the various agents' inventory and cash processes (indexed by a superscript $a$):


\[
dQ^{a}_{t} = \nu^{a}_{t}\,dt, \qquad Q^{a}_{0} = q^{a}
\]
\[
dX^{a}_{t} = -\nu^{a}_{t}\big(S_{t} + k\nu^{a}_{t}\big)\,dt, \qquad X^{a}_{0} = x^{a}
\]

An important deviation from the previous case is the fact that the permanent price impact is due to the net sum of the trading rates of all agents, denoted by $\mu_{t}$:

dSt = κµt dt+ σ dWt

Also, the value function associated with the optimal control problem for agent $a$ is given by:

\[
H^{a}(t,x,S,q) = \sup_{\nu}\,\mathbb{E}_{t,x,S,q}\Big[\underbrace{X^{a}_{T}}_{\text{terminal cash}} \;+\; \underbrace{Q^{a}_{T}\big(S_{T} - \alpha^{a}Q^{a}_{T}\big)}_{\text{terminal inventory}} \;-\; \underbrace{\phi^{a}\int_{t}^{T}\big(Q^{a}_{u}\big)^{2}\,du}_{\text{running inventory}}\Big]
\]

Notice that each agent $a$ has a different value of $\alpha^{a}$ and $\phi^{a}$, demonstrating their differing preferences. As a consequence, an agent can be represented by their preferences $a = (\alpha^{a}, \phi^{a})$. The HJB equation associated with the agents' control problem is:
\[
\big(\partial_{t} + \tfrac{1}{2}\sigma^{2}\partial_{SS}\big)H^{a} - \phi^{a}q^{2} + \kappa\mu\,\partial_{S}H^{a} + \sup_{\nu}\big(\nu\,\partial_{q} - \nu(S + k\nu)\,\partial_{x}\big)H^{a} = 0
\]
\[
H^{a}(T, x, S, q; \mu) = x + q\big(S - \alpha^{a}q\big)
\]

This can be simplified using an ansatz to:
\[
-\kappa\mu q = \partial_{t}h^{a} - \phi^{a}q^{2} + \sup_{\nu}\big(\nu\,\partial_{q}h^{a} - k\nu^{2}\big)
\]
\[
h^{a}(T, q) = -\alpha^{a}q^{2}
\]

Notice that the PDE above requires agents to know the net trading flow of the mean field $\mu$, but this quantity itself depends on the value function of each agent, which we have yet to solve for. To resolve this issue we first write the optimal control of each agent in feedback form:

\[
\nu^{a}(t,q) = \frac{\partial_{q}h^{a}(t,q)}{2k}
\]

Next, we assume that the distribution of inventories and preferences of agents is captured by a density function $m(t, dq, da)$. With this, the net flow $\mu_{t}$ is simply given by the aggregation of all agents' optimal actions:

\[
\mu_{t} = \int_{(q,a)} \underbrace{\frac{\partial_{q}h^{a}(t,q)}{2k}}_{\substack{\text{trading rate of agent}\\ \text{with inventory } q\\ \text{and preferences } a}}\;\underbrace{m(t, dq, da)}_{\substack{\text{aggregated according to}\\ \text{distribution of agents}}}
\]


In order to compute this quantity at different points in time we need to understand the evolution of the density $m$ through time. This is just an application of the Fokker-Planck equation, as $m$ is the density of a stochastic process (the inventory level). If we assume that the initial density of inventories and preferences is $m_{0}(q, a)$, we can write the Fokker-Planck equation as:

\[
\partial_{t}m + \partial_{q}\Big(m \cdot \underbrace{\frac{\partial_{q}h^{a}(t,q)}{2k}}_{\substack{\text{drift of inventory}\\ \text{process } Q^{a}_{t} \text{ under}\\ \text{optimal controls}}}\Big) = 0
\]
\[
m(0, q, a) = m_{0}(q, a)
\]

The full system for the MFG in the problem of Cardaliaguet and Lehalle (2017) involves the combined HJB and Fokker-Planck equations with the appropriate initial and terminal conditions:

\[
-\kappa\mu q = \partial_{t}h^{a} - \phi^{a}q^{2} + \frac{(\partial_{q}h^{a})^{2}}{4k} \qquad \text{(HJB equation: optimality)}
\]
\[
h^{a}(T, q) = -\alpha^{a}q^{2} \qquad \text{(HJB terminal condition)}
\]
\[
\partial_{t}m + \partial_{q}\Big(m\,\frac{\partial_{q}h^{a}(t,q)}{2k}\Big) = 0 \qquad \text{(FP equation: density flow)}
\]
\[
m(0, q, a) = m_{0}(q, a) \qquad \text{(FP initial condition)}
\]
\[
\mu_{t} = \int_{(q,a)} \frac{\partial_{q}h^{a}(t,q)}{2k}\, m(t, dq, da) \qquad \text{(net trading flow)}
\]

Assuming identical preferences ($\alpha^{a} = \alpha$, $\phi^{a} = \phi$ for all $a$) allows us to find a closed-form solution to this PDE system. The form of the solution is fairly involved, so we refer the interested reader to the details in Cardaliaguet and Lehalle (2017).


Chapter 3

Numerical Methods for PDEs

Although it is sometimes possible to obtain closed-form solutions to PDEs, more often we must resort to numerical methods. In this chapter we discuss some of the approaches taken to solve PDEs numerically. We also touch on some of the difficulties that may arise in these approaches, involving stability and computational cost, especially in higher dimensions. This is by no means a comprehensive overview of a topic to which a vast amount of literature is dedicated. Further details can be found in Burden et al. (2001), Achdou and Pironneau (2005) and Brandimarte (2013).

3.1 Finite Difference Method

It is often the case that differential equations cannot be solved analytically, so one must resort to numerical methods. One of the most popular numerical methods is the finite difference method. As its name suggests, the main idea behind this method is to approximate the differential operators with difference operators and apply them to a discretized version of the unknown function in the differential equation.

3.1.1 Euler’s Method

Arguably, the simplest finite difference method is Euler's method for ordinary differential equations (ODEs). Suppose we have the following initial value problem

y′(t) = f(t)

y(0) = y0

for which we are trying to solve for the function $y(t)$. By the Taylor series expansion, we can write

\[
y(t+h) = y(t) + \frac{y'(t)}{1!}\,h + \frac{y''(t)}{2!}\,h^{2} + \cdots
\]


for any infinitely differentiable real-valued function $y$. If $h$ is small enough, and if the derivatives of $y$ satisfy some regularity conditions, then terms of order $h^{2}$ and higher are negligible and we can make the approximation

y(t+ h) ≈ y(t) + y′(t) · h

As a side note, notice that we can rewrite this equation as

\[
y'(t) \approx \frac{y(t+h) - y(t)}{h}
\]

which closely resembles the definition of a derivative:

\[
y'(t) = \lim_{h \to 0} \frac{y(t+h) - y(t)}{h}.
\]

Returning to the original problem, note that we know the exact value of $y'(t)$, namely $f(t)$, so that we can write

y(t+ h) ≈ y(t) + f(t) · h.

At this point, it is helpful to introduce the notation for the discretization scheme typically used for finite difference methods. Let $(t_{i})$ be the sequence of values assumed by the time variable, such that $t_{0} = 0$ and $t_{i+1} = t_{i} + h$, and let $(y_{i})$ be the sequence of approximations of $y(t)$ such that $y_{i} \approx y(t_{i})$. The expression above can be rewritten as

yi+1 ≈ yi + f(ti) · h,

which allows us to find an approximation $y_{i+1} \approx y(t_{i+1})$ given the value of $y_{i} \approx y(t_{i})$. Using Euler's method, we can find numerical approximations of $y(t)$ for any value of $t > 0$.
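The iteration above can be sketched in a few lines of Python. The right-hand side $f(t) = \cos(t)$ is an illustrative choice (so that the exact solution $y(t) = \sin(t)$ is available for comparison), not part of the original exposition.

```python
import math

def euler(f, y0, t_end, n_steps):
    """Euler's method for y'(t) = f(t), y(0) = y0: advance y_{i+1} = y_i + f(t_i)*h."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += f(t) * h
        t += h
    return y

# Illustrative run: y' = cos(t), y(0) = 0, so the exact answer at t = 1 is sin(1)
approx = euler(math.cos, 0.0, 1.0, 10_000)
print(abs(approx - math.sin(1.0)))  # small: the global error shrinks like O(h)
```

Halving $h$ roughly halves the error, consistent with the method being first order.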

3.1.2 Explicit versus implicit schemes

In the previous section, we developed Euler's method for a simple initial value problem. Suppose one has the slightly different problem where the source term $f$ is now a function of both $t$ and $y$:

y′(t) = f(t, y)

y(0) = y0

A similar argument as before will now lead us to the expression for yi+1

yi+1 ≈ yi + f(ti, yi) · h,


where $y_{i+1}$ is explicitly written as a sum of terms that depend only on quantities known at time $t_{i}$. Schemes such as this are called explicit. Had we used the approximation

y(t− h) ≈ y(t)− y′(t) · h

instead, we would arrive at the slightly different expression for yi+1

yi+1 ≈ yi + f(ti+1, yi+1) · h,

where the term $y_{i+1}$ appears on both sides of the equation and no explicit formula for $y_{i+1}$ is possible in general. Schemes such as this are called implicit. In the general case, each step in time in an implicit method requires solving the expression above for $y_{i+1}$ using a root-finding technique such as Newton's method or another fixed-point iteration method.

Despite being easier to compute, explicit methods are generally known to be numerically unstable for a large range of equations (especially so-called stiff problems), making them unsuitable for many practical situations. Implicit methods, on the other hand, are typically both more computationally intensive and more numerically stable, which makes them more commonly used. An important measure of numerical stability for finite difference methods is A-stability, where one tests the stability of the method on the (linear) test equation $y'(t) = \lambda y(t)$ with $\lambda < 0$. While the implicit Euler method is stable for all values of $h > 0$ and $\lambda < 0$, the explicit Euler method is stable only if $|1 + h\lambda| < 1$, which may require using a small value of $h$ if the absolute value of $\lambda$ is large. Of course, all other things being equal, a small value of $h$ is undesirable since it means a finer grid is required, which in turn makes the numerical method more computationally expensive.
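The stability contrast can be seen directly on the test equation. For the linear case the implicit update can be solved in closed form ($y_{i+1} = y_{i}/(1 - h\lambda)$), so no root finding is needed; the step size below is deliberately chosen to violate the explicit stability condition $|1 + h\lambda| < 1$.

```python
# Explicit vs implicit Euler on the stiff test equation y' = lam*y, y(0) = 1,
# with lam = -50 and h = 0.1, so |1 + h*lam| = 4 > 1 (explicit scheme unstable).
lam, h, n = -50.0, 0.1, 100

y_exp, y_imp = 1.0, 1.0
for _ in range(n):
    y_exp = y_exp * (1.0 + h * lam)   # explicit: y_{i+1} = y_i + h*lam*y_i
    y_imp = y_imp / (1.0 - h * lam)   # implicit: y_{i+1} = y_i + h*lam*y_{i+1}

print(abs(y_exp))  # grows without bound even though the true solution decays
print(abs(y_imp))  # decays toward 0, like the exact solution e^{lam*t}
```

The explicit iterate oscillates with geometrically growing amplitude, while the implicit iterate reproduces the qualitative decay of $e^{\lambda t}$ at any step size.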

3.1.3 Finite difference methods for PDEs

In the previous section, we focused our discussion on methods for numerically solving ODEs. However, finite difference methods can be used to solve PDEs as well, and the concepts presented above carry over. Consider the boundary value problem for the heat equation in one spatial dimension, which describes the dynamics of heat transfer in a rod of length $l$:

\[
\partial_{t}u = \alpha^{2}\,\partial_{xx}u
\]
\[
u(0, x) = u_{0}(x), \qquad u(t, 0) = u(t, l) = 0
\]

We can approximate the differential operators in the equation above using a forward difference operator for the partial derivative in time and a second-order central difference operator for the partial derivative in space. Using the notation


$u_{i,j} \approx u(t_{i}, x_{j})$, with $t_{i+1} = t_{i} + k$ and $x_{j+1} = x_{j} + h$, we can rewrite the equation above as a system of linear equations
\[
\frac{u_{i+1,j} - u_{i,j}}{k} = \alpha^{2}\,\frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{h^{2}},
\]
where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, N$, assuming we use the same number of discrete points in both dimensions. In this two-dimensional example, the points $(t_{i}, x_{j})$ form a two-dimensional grid of size $O(N^{2})$. For a $d$-dimensional problem, a $d$-dimensional grid of size $O(N^{d})$ would be required. In practice, the exponential growth of the grid in the number of dimensions rapidly makes the method unmanageable, even for $d = 4$. This is an important shortcoming of finite difference methods in general.


Figure 3.1: Illustration of finite difference methods for solving PDEs in two (left) and three (right) dimensions. The known function values on the boundaries are combined with finite differences to solve for the value of the function on a grid in the interior of the region where it is defined.

The scheme developed above is known as the forward difference method or FTCS (forward in time, central in space). It is easy to verify that this scheme is explicit in time, since we can write the $u_{i+1,\cdot}$ terms as a linear combination of previously computed $u_{i,\cdot}$ terms. The number of operations necessary to advance one step in time with this method is $O(N)$, so covering the full $O(N^{2})$ space-time grid costs $O(N^{2})$ operations. Unfortunately, this scheme is also known to be unstable if $h$ and $k$ do not satisfy the inequality
\[
\alpha^{2}\,\frac{k}{h^{2}} \leq \frac{1}{2}.
\]
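The FTCS update can be sketched as follows. The initial condition $u_{0}(x) = \sin(\pi x)$ on a rod of length 1 is an illustrative choice, picked because the exact solution $e^{-(\alpha\pi)^{2}t}\sin(\pi x)$ is then known; the step sizes are chosen to respect the stability bound $\alpha^{2}k/h^{2} \leq 1/2$.

```python
import numpy as np

alpha, L, T = 1.0, 1.0, 0.1
N = 50                        # interior spatial points
h = L / (N + 1)
k = 0.4 * h**2 / alpha**2     # satisfies alpha^2 * k / h^2 = 0.4 <= 1/2
steps = int(T / k)

x = np.linspace(0.0, L, N + 2)
u = np.sin(np.pi * x)         # initial condition; u[0] = u[-1] = 0 at all times
lam = alpha**2 * k / h**2
for _ in range(steps):
    # FTCS update on interior points only; boundary values are left untouched
    u[1:-1] = u[1:-1] + lam * (u[:-2] - 2.0 * u[1:-1] + u[2:])

exact = np.exp(-(alpha * np.pi) ** 2 * steps * k) * np.sin(np.pi * x)
print(np.max(np.abs(u - exact)))  # small, since the stability bound is respected
```

Setting `k` just above the bound instead (e.g. `0.6 * h**2`) makes the same loop blow up, illustrating the conditional stability discussed above.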

Alternatively, we could apply the backward difference method or BTCS (backward in time, central in space), using the following equation:


\[
\frac{u_{i+1,j} - u_{i,j}}{k} = \alpha^{2}\,\frac{u_{i+1,j-1} - 2u_{i+1,j} + u_{i+1,j+1}}{h^{2}}.
\]

This scheme is implicit in time since it is not possible to write the $u_{i+1,\cdot}$ terms as a function of just the previously computed $u_{i,\cdot}$ terms. In fact, each step in time requires solving a system of $N$ linear equations (an $N \times N$ coefficient matrix). The number of operations necessary to advance each step in time with this method is $O(N^{3})$ when using methods such as Gaussian elimination to solve the linear system. On the other hand, this scheme is also known to be unconditionally stable, independent of the values of $h$ and $k$.

3.1.4 Higher order methods

All numerical methods for solving PDEs have errors due to many sources of inaccuracy. For instance, rounding error is related to the floating-point representation of real numbers. Another important category of error is truncation error, which can be understood as the error due to truncating the Taylor series expansion. Finite difference methods are usually classified by their respective truncation errors.

All finite difference methods discussed so far are low-order methods. For instance, Euler's method (in both its explicit and implicit varieties) is a first-order method, which means that the global truncation error is proportional to $h$, the discretization granularity. However, a number of alternative methods have lower truncation errors. For example, the fourth-order Runge-Kutta method, with a global truncation error proportional to $h^{4}$, is widely used, being the best-known member of a family of finite difference methods that covers even 14th-order methods. Some Runge-Kutta methods, particularly implicit ones, are especially well suited to solving stiff problems.

3.2 Galerkin methods

In finite difference methods, we approximate the continuous differential operator by a discrete difference operator in order to obtain a numerical approximation of the function that satisfies the PDE. The function's domain (or a portion of it) must also be discretized so that numerical approximations of the solution can be computed at the points of the resulting spatio-temporal grid. The value of the function at off-grid points can then be approximated by techniques such as interpolation.

Galerkin methods take an alternative approach: given a finite set of basis functions on the same domain, the goal is to find a linear combination of the basis functions that approximates the solution of the PDE on the domain of interest. This translates into a variational problem where one is trying to find maxima or minima


of functionals.

More precisely, suppose we are trying to solve the equation $F(x) = y$ for $x$, where $x$ and $y$ are members of function spaces $X$ and $Y$ respectively, and $F : X \to Y$ is a (possibly nonlinear) operator. Suppose also that $\{\phi_{i}\}_{i=1}^{\infty}$ and $\{\psi_{j}\}_{j=1}^{\infty}$ form linearly independent bases for $X$ and $Y$. According to the Galerkin method, an approximation for $x$ is given by

\[
x_{n} = \sum_{i=1}^{n} \alpha_{i}\,\phi_{i}
\]
where the $\alpha_{i}$ coefficients satisfy the equations¹
\[
\Big\langle F\Big(\sum_{i=1}^{n} \alpha_{i}\phi_{i}\Big),\, \psi_{j}\Big\rangle = \langle y, \psi_{j}\rangle, \qquad j = 1, 2, \ldots, n.
\]
Since the inner products above usually involve non-trivial integrals, one should choose the bases carefully to ensure the equations are manageable.
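As a toy illustration of the method, consider $-u''(x) = f(x)$ on $(0, \pi)$ with $u(0) = u(\pi) = 0$, and take the (illustrative) choice $\phi_{i}(x) = \sin(ix)$ for both trial and test functions. Orthogonality of the sines then decouples the Galerkin equations: plugging $u_{n} = \sum_{i} \alpha_{i}\sin(ix)$ into $\langle -u_{n}'', \sin(jx)\rangle = \langle f, \sin(jx)\rangle$ gives $\alpha_{j} = \tfrac{2}{\pi}\,\langle f, \sin(jx)\rangle / j^{2}$. A minimal sketch:

```python
import numpy as np

f = lambda x: np.sin(3 * x)      # illustrative right-hand side; exact u = sin(3x)/9
n = 5                            # number of basis functions
xs = np.linspace(0.0, np.pi, 10_001)
dx = xs[1] - xs[0]

alpha = np.empty(n)
for j in range(1, n + 1):
    # <f, sin(jx)> by simple quadrature (integrand vanishes at both endpoints)
    inner = np.sum(f(xs) * np.sin(j * xs)) * dx
    alpha[j - 1] = (2.0 / np.pi) * inner / j**2

u_n = sum(alpha[j - 1] * np.sin(j * xs) for j in range(1, n + 1))
print(np.max(np.abs(u_n - np.sin(3 * xs) / 9.0)))  # essentially quadrature error
```

Because $f$ lies in the span of the basis, the Galerkin approximation recovers the exact solution up to quadrature error; for general $f$ the error decreases as $n$ grows.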

3.3 Finite Element Methods

Finite element methods can be understood as a special case of Galerkin methods. Notice that in the general case presented above, the approximation $x_{n}$ may not be well posed, in the sense that the system of equations for the $\alpha_{i}$ may have no solution or multiple solutions depending on the value of $n$. Additionally, depending on the choice of the $\phi_{i}$ and $\psi_{j}$, $x_{n}$ may not converge to $x$ as $n \to \infty$. Nevertheless, one can discretize the domain into small enough regions (called elements) so that the approximation is locally satisfactory in each region. Adding consistency constraints on each boundary between regions (as well as the outer boundary conditions given by the problem definition) and solving over the whole domain of interest, one can arrive at a globally satisfactory numerical approximation to the solution of the PDE.

In practice, the domain is typically divided into triangles or quadrilaterals (in the two-dimensional case), tetrahedra (in the three-dimensional case) or more general geometrical shapes in higher dimensions, in a process known as triangulation. Typical choices of the $\phi_{i}$ and $\psi_{j}$ are such that the inner product equations above reduce to a system of algebraic equations for steady-state problems, or to a system of ODEs in the case of time-dependent problems. If the PDE is linear, those systems will be linear

1https://www.encyclopediaofmath.org/index.php/Galerkin_method


as well, and they can be solved using direct methods such as Gaussian elimination or iterative methods such as Jacobi or Gauss-Seidel. If the PDE is nonlinear, one may need to solve systems of nonlinear equations, which are generally more computationally expensive. One of the major advantages of finite element methods over finite difference methods is that finite elements can effortlessly handle complex boundary geometries, which typically arise in physical or engineering problems, whereas this may be very difficult to achieve with finite difference algorithms.

3.4 Monte Carlo Methods

One of the more fascinating aspects of PDEs is how intimately they are related to stochastic processes. This is best exemplified by the Feynman-Kac theorem, which can be viewed in two ways:

• It provides a solution to a certain class of linear PDEs, written in terms of anexpectation involving a related stochastic process;

• It gives a means by which certain expectations can be computed by solvingan associated PDE.

For our purposes, we are interested in the first of these two perspectives.

The theorem is stated as follows: the solution to the partial differential equation
\[
\partial_{t}h + a(t,x)\,\partial_{x}h + \tfrac{1}{2}b(t,x)^{2}\,\partial_{xx}h + g(t,x) = c(t,x)\,h(t,x)
\]
\[
h(T, x) = H(x)
\]

admits a stochastic representation given by

\[
h(t,x) = \mathbb{E}^{\mathbb{P}^{*}}_{t,x}\Big[\int_{t}^{T} e^{-\int_{t}^{u} c(s,X_{s})\,ds}\, g(u, X_{u})\,du + H(X_{T})\, e^{-\int_{t}^{T} c(s,X_{s})\,ds}\Big]
\]
where $\mathbb{E}_{t,x}[\,\cdot\,] = \mathbb{E}[\,\cdot\,|X_{t} = x]$ and the process $X = (X_{t})_{t\geq 0}$ satisfies the SDE
\[
dX_{t} = a(t, X_{t})\,dt + b(t, X_{t})\,dW^{\mathbb{P}^{*}}_{t}
\]
where $W^{\mathbb{P}^{*}} = (W^{\mathbb{P}^{*}}_{t})_{t\geq 0}$ is a standard Brownian motion under the probability measure $\mathbb{P}^{*}$. This representation suggests the use of Monte Carlo methods to solve for the unknown function $h$. Monte Carlo methods are a class of numerical techniques based on simulating random variables, used to solve a range of problems such as numerical integration and optimization.

Returning to the theorem, let us now discuss its statement:


• When confronted with a PDE of the form above, we can define a (fictitious)processX with drift and volatility given by the processes a(t,Xt) and b(t,Xt),respectively.

• Thinking of $c$ as a "discount factor," we then consider the conditional expectation of the discounted terminal condition $H(X_{T})$ and the running term $g(t, X_{t})$, given that the value of $X$ at time $t$ is equal to a known value $x$. Clearly, this conditional expectation is a function of $t$ and $x$; for every value of $t$ and $x$ we have some conditional expectation value.

• This function (the conditional expectation as a function of $t$ and $x$) is precisely the solution to the PDE we started with, and it can be estimated via Monte Carlo simulation of the process $X$.
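The three steps above can be sketched for a special case. Taking $a = 0$, $b(t,x) = \sigma$ constant and $c = g = 0$ with terminal payoff $H(x) = x^{2}$ (an illustrative choice, not from the report) gives the PDE $\partial_{t}h + \tfrac{\sigma^{2}}{2}\partial_{xx}h = 0$, $h(T,x) = x^{2}$, whose exact solution $h(t,x) = x^{2} + \sigma^{2}(T - t)$ lets us check the Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, t, T, x = 0.3, 0.0, 1.0, 1.5
n_paths, n_steps = 200_000, 50
dt = (T - t) / n_steps

# Simulate the (fictitious) process dX = sigma dW starting from X_t = x
X = np.full(n_paths, x)
for _ in range(n_steps):
    X += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

estimate = np.mean(X ** 2)            # h(t,x) ~ E[H(X_T) | X_t = x]
exact = x ** 2 + sigma ** 2 * (T - t)
print(estimate, exact)
```

The Monte Carlo error decays like $O(1/\sqrt{n_{\text{paths}}})$ independently of the spatial dimension, which is exactly why this representation is attractive for the high-dimensional PDEs where grid-based methods break down.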

A class of Monte Carlo methods has also been developed for nonlinear PDEs, but this is beyond the scope of this work.


Chapter 4

An Introduction to Deep Learning

The tremendous strides made in computing power and the explosive growth in data collection and availability in recent decades have coincided with an increased interest in the field of machine learning (ML). This has been reinforced by the success of machine learning in a wide range of applications: image and speech recognition, medical diagnostics, email filtering, fraud detection and many more.


Figure 4.1: Google search frequency for various terms. A value of 100 is the peak popularity for the term; a value of 50 means that the term is half as popular.

As the name suggests, the term machine learning refers to computer algorithms that learn from data. The term "learn" can have several meanings depending on


the context, but the common theme is the following: a computer is faced with a task and an associated performance measure, and its goal is to improve its performance on this task with experience, which comes in the form of examples and data.

ML naturally divides into two main branches. Supervised learning refers to the case where the data points include a label or target, and tasks involve predicting these labels/targets (i.e. classification and regression). Unsupervised learning refers to the case where the dataset does not include such labels and the task involves learning a useful structure that relates the various variables of the input data (e.g. clustering, density estimation). Other branches of ML, including semi-supervised and reinforcement learning, also receive a great deal of research attention at present. For further details the reader is referred to Bishop (2006) or Goodfellow et al. (2016).

An important concept in machine learning is that of generalization, which is related to the notions of underfitting and overfitting. In many ML applications, the goal is to be able to make meaningful statements about data that the algorithm has not encountered, that is, to generalize the model to unseen examples. It is possible to calibrate an assumed model "too well" to the training data, in the sense that the model gives misguided predictions for new data points; this is known as overfitting. The opposite case is underfitting, where the model does not fit the training data sufficiently well and consequently performs poorly on test data too. Striking a balance in the trade-off between underfitting and overfitting, which itself can be viewed as a trade-off between bias and variance, is crucial to the success of an ML algorithm.

On the theoretical side, there are a number of interesting results related to ML. For example, for certain tasks and hypothesized models it may be possible to obtain the minimal sample size needed to ensure that the training error is a faithful representation of the generalization error with high confidence (this is known as Probably Approximately Correct (PAC) learnability). Another result is the no-free-lunch theorem, which implies that there is no universal learner, i.e. every learner has a task on which it fails even though another algorithm can successfully learn the same task. For an excellent exposition of the theoretical aspects of machine learning the reader is referred to Shalev-Shwartz and Ben-David (2014).

4.1 Neural Networks and Deep Learning

Neural networks are machine learning models that have received a great deal of attention in recent years due to their success in a number of different applications. The typical way of motivating the approach behind neural network models is to compare the way they operate to the human brain. The building blocks of the


brain (and neural networks) are basic computing devices called neurons that are connected to one another by a complex communication network. The communication links cause the activation of a neuron to activate other neurons it is connected to. From the perspective of learning, training a neural network can be thought of as determining which neurons "fire" together.

Mathematically, a neural network can be defined as a directed graph with vertices representing neurons and edges representing links. The input to each neuron is a function of a weighted sum of the outputs of all neurons connected to its incoming edges. There are many variants of neural networks which differ in architecture (how the neurons are connected); see Figure 4.2. The simplest of these is the feedforward neural network, also referred to as a multilayer perceptron (MLP).

MLPs can be represented by a directed acyclic graph and as such can be seen as feeding information forward. Usually, networks of this sort are described in terms of layers which are chained together to create the output function, where a layer is a collection of neurons that can be thought of as a unit of computation. In the simplest case, there is a single input layer and a single output layer. In this case, output $j$ (represented by the $j$th neuron in the output layer) is connected to the input vector $\boldsymbol{x}$ via a biased weighted sum and an activation function $\phi_{j}$:
\[
y_{j} = \phi_{j}\Big(b_{j} + \sum_{i=1}^{d} w_{i,j}\,x_{i}\Big)
\]
It is also possible to incorporate additional hidden layers between the input and output layers. For example, with one hidden layer the output would become:

\[
y_{k} = \phi\Bigg(b^{(2)}_{k} + \underbrace{\sum_{j=1}^{m_{2}} w^{(2)}_{j,k}\cdot\underbrace{\psi\Big(b^{(1)}_{j} + \sum_{i=1}^{m_{1}} w^{(1)}_{i,j}\,x_{i}\Big)}_{\text{input layer to hidden layer}}}_{\text{hidden layer to output layer}}\Bigg)
\]

where $\phi, \psi : \mathbb{R} \to \mathbb{R}$ are nonlinear activation functions for each layer and the bracketed superscripts refer to the layer in question. We can visualize the extension of this process to many layers as a repeated composition of functions, e.g.
\[
f(\boldsymbol{x}) = \psi_{d}(\cdots\psi_{2}(\psi_{1}(\boldsymbol{x})))
\]

Here, each layer of the network is represented by a function $\psi_{i}$, incorporating the weighted sums of previous inputs and activations to connected outputs. The number of layers in the graph is referred to as the depth of the neural network.


Figure 4.2: Neural network architectures. Source: "Neural Networks 101" by Paul van der Laken (https://paulvanderlaken.com/2017/10/16/neural-networks-101).



Figure 4.3: Feedforward neural network with one hidden layer.

number of neurons in a layer represents the width of that particular layer; see Figure 4.3.
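As a concrete illustration, the one-hidden-layer forward pass described above can be sketched in a few lines of pure Python. The weights, layer sizes and activation choices below are arbitrary toy values of ours, not taken from the report:

```python
import math

def mlp_forward(x, W1, b1, W2, b2, psi=math.tanh, phi=lambda v: v):
    """One-hidden-layer MLP: y_k = phi(b2_k + sum_j W2[j][k] * psi(b1_j + sum_i W1[i][j] * x_i))."""
    hidden = [psi(b1[j] + sum(W1[i][j] * x[i] for i in range(len(x))))
              for j in range(len(b1))]
    return [phi(b2[k] + sum(W2[j][k] * hidden[j] for j in range(len(hidden))))
            for k in range(len(b2))]

# toy weights: 2 inputs -> 2 hidden neurons -> 1 output
W1 = [[0.5, -0.3],    # W1[i][j]: weight from input i to hidden neuron j
      [0.2, 0.8]]
b1 = [0.0, 0.1]
W2 = [[1.0], [-1.0]]  # W2[j][k]: weight from hidden neuron j to output k
b2 = [0.0]
y = mlp_forward([1.0, 2.0], W1, b1, W2, b2)
```

Adding depth or width amounts to chaining more such layer computations, exactly as in the composition f(x) = ψd(···ψ2(ψ1(x))).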

The terms “deep” neural network and deep learning refer to the use of neural networks with many hidden layers in ML problems. One of the advantages of adding hidden layers is that depth can exponentially reduce the computational cost in some applications and exponentially decrease the amount of training data needed to learn some functions. This is due to the fact that some functions can be represented by much smaller deep networks than by wide shallow networks. This decrease in model size leads to improved statistical efficiency.

It is easy to imagine the tremendous amount of flexibility and complexity that can be achieved by varying the structure of the neural network. One can vary the depth or width of the network, or use different activation functions for each layer or even each neuron. This flexibility can be used to achieve very strong results, but can lead to opacity that prevents us from understanding why those results are being achieved.

Next, we turn to the question of how the parameters of the neural network are estimated. To this end, we must first define a loss function, L(θ; x, y), which determines the performance of a given parameter set θ for the neural network, consisting of the weights and bias terms in each layer. The goal is to find the parameter set that minimizes the loss function. The challenge is that the highly nonlinear nature of neural networks can lead to non-convexities in the loss function. Non-convex optimization problems are non-trivial, and often we cannot guarantee that a candidate solution is a global optimizer.


4.2 Stochastic Gradient Descent

The most commonly used approach for estimating the parameters of a neural network is based on gradient descent, a simple methodology for optimizing a function. Given a function f : Rd → R, we wish to determine the value of x that achieves the minimum value of f. To do this, we begin with an initial guess x0 and compute the gradient of f at this point. This gives the direction in which the largest increase in the function occurs. To minimize the function we move in the opposite direction, i.e. we iterate according to:

xn = xn−1 − η · ∇xf (xn−1)

where η is the step size, known as the learning rate, which can be constant or decaying in n. The algorithm converges to a critical point when the gradient is equal to zero, though it should be noted that this is not necessarily a global minimum. In the context of neural networks, we would compute the derivatives of the loss functional with respect to the parameter set θ (more on this in the next section) and follow the procedure outlined above.
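The iteration above can be sketched in a few lines. The quadratic objective and step size below are illustrative choices of ours, purely to show the mechanics:

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Iterate x_n = x_{n-1} - eta * grad(x_{n-1})."""
    x = x0
    for _ in range(steps):
        x -= eta * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and a unique minimizer at x = 3
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

For this convex objective the iterates contract toward the minimizer at a geometric rate; for a non-convex loss the same iteration may instead stop at a local critical point.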

One difficulty with the use of gradient descent to train neural networks is the computational cost of the procedure when training sets are large. This necessitates the use of an extension of this algorithm known as stochastic gradient descent (SGD). When the loss function we are minimizing is additive, its gradient can be written as:

∇θ L(θ; x, y) = (1/m) Σ_{i=1}^m ∇θ Li(θ; x(i), y(i))

where m is the size of the training set and Li is the per-example loss function. The approach in SGD is to view the gradient as an expectation and approximate it with a random subset of the training set called a mini-batch. That is, for a fixed mini-batch of size m′ the gradient is estimated as:

∇θ L(θ; x, y) ≈ (1/m′) Σ_{i=1}^{m′} ∇θ Li(θ; x(i), y(i))

This is followed by taking the usual step in the direction of steepest descent.
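The mini-batch approximation can be checked empirically on a toy additive loss. The per-example squared loss, the batch size and the synthetic data below are illustrative assumptions of ours:

```python
import random

def full_gradient(theta, ys):
    # exact gradient of L(theta) = (1/m) * sum_i (theta - y_i)^2
    return sum(2.0 * (theta - y) for y in ys) / len(ys)

def minibatch_gradient(theta, ys, m_prime, rng):
    # unbiased estimate of the full gradient from a random mini-batch of size m_prime
    batch = rng.sample(ys, m_prime)
    return sum(2.0 * (theta - y) for y in batch) / m_prime

rng = random.Random(0)
ys = [rng.gauss(1.0, 1.0) for _ in range(10000)]

g_full = full_gradient(0.0, ys)
# averaging many mini-batch estimates recovers the full gradient (law of large numbers)
g_sgd = sum(minibatch_gradient(0.0, ys, 64, rng) for _ in range(500)) / 500
```

Each mini-batch gradient is noisy, but its expectation equals the full gradient, which is what justifies the ≈ in the display above.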

4.3 Backpropagation

The stochastic gradient descent optimization approach described in the previous section requires repeated computation of the gradients of a highly nonlinear function. Backpropagation provides a computationally efficient means by which this


can be achieved. It is based on recursively applying the chain rule and on defining computational graphs to understand which computations can be run in parallel.

As we have seen in previous sections, a feedforward neural network can be thought of as receiving an input x and computing an output y by evaluating a function defined by a sequence of compositions of simple functions. These simple functions can be viewed as operations between nodes in the neural network graph. With this in mind, the derivative of y with respect to x can be computed analytically by repeated applications of the chain rule, given enough information about the operations between nodes. The backpropagation algorithm traverses the graph, repeatedly computing the chain rule until the derivative of the output y with respect to x is represented symbolically via a second computational graph; see Figure 4.4.


Figure 4.4: Visualization of the backpropagation algorithm via computational graphs. The left panel shows the composition of functions connecting input to output; the right panel shows the use of the chain rule to compute the derivative. Source: Goodfellow et al. (2016)

The two main approaches for computing the derivatives in the computational graph are to input a numerical value and compute the derivatives at this value, returning a number, as done in PyTorch (pytorch.org), or to compute the derivatives of a symbolic variable and store the derivative operations in new nodes added to the graph for later use, as done in TensorFlow (tensorflow.org). The advantage of the latter approach is that higher-order derivatives can be computed from this extended graph by running backpropagation again.
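A minimal sketch of this idea on the chain of Figure 4.4 (w → x → y → z): the forward pass caches intermediate values, and the backward pass multiplies the local derivatives from the output back to the input. The choice of sin as the node operation and the finite-difference check are ours, purely for illustration:

```python
import math

f, fprime = math.sin, math.cos   # the node operation and its known local derivative

def forward_and_backward(w):
    # forward pass along the chain w -> x -> y -> z, caching intermediate values
    x = f(w); y = f(x); z = f(y)
    # backward pass: accumulate dz/d(node) by the chain rule, output to input
    dz_dy = fprime(y)
    dz_dx = dz_dy * fprime(x)
    dz_dw = dz_dx * fprime(w)
    return z, dz_dw

z, dz_dw = forward_and_backward(0.7)

# independent check with a central finite difference
h = 1e-6
dz_dw_fd = (forward_and_backward(0.7 + h)[0] - forward_and_backward(0.7 - h)[0]) / (2 * h)
```

Automatic differentiation frameworks generalize exactly this pattern to arbitrary graphs, reusing each cached intermediate value once per outgoing edge.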


The backpropagation algorithm takes at most O(n²) operations for a graph with n nodes, storing at most O(n²) new nodes. In practice, most feedforward neural networks are designed in a chain-like way, which reduces the number of operations and the amount of new storage to O(n), making derivative calculation a relatively cheap operation.

4.4 Summary

In summary, training neural networks is broadly composed of three ingredients:

1. Defining the architecture of the neural network and a loss function, also known as the hyperparameters of the model;

2. Finding the loss minimizer using stochastic gradient descent;

3. Using backpropagation to compute the derivatives of the loss function.

This is presented in more mathematical detail in Figure 4.5.

1. Define the architecture of the neural network by setting its depth (number of layers), width (number of neurons in each layer) and activation functions.

2. Define a loss functional L(θ; x, y), mini-batch size m′ and learning rate η.

3. Minimize the loss functional to determine the optimal θ:

(a) Initialize the parameter set θ0.

(b) Randomly sample a mini-batch of m′ training examples (x(i), y(i)).

(c) Compute the loss functional for the sampled mini-batch: L(θi; x(i), y(i)).

(d) Compute the gradient ∇θ L(θi; x(i), y(i)) using backpropagation.

(e) Use the estimated gradient to update θi based on SGD:

θi+1 = θi − η · ∇θ L(θi; x(i), y(i))

(f) Repeat steps (b)-(e) until ‖θi+1 − θi‖ is small.

Figure 4.5: Parameter estimation procedure for neural networks.
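The procedure of Figure 4.5 can be sketched for the simplest possible model, a one-parameter linear map trained with squared loss. The model, synthetic data and hyperparameter values below are toy assumptions of ours, chosen only so that each step (a)-(f) is visible:

```python
import random

rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x for x in xs]                          # data generated with true parameter 2

theta, eta, m_prime = 0.0, 0.1, 32                  # (a) initialize parameter, set eta and m'
for _ in range(500):
    idx = rng.sample(range(len(xs)), m_prime)       # (b) sample a mini-batch
    # (c)-(d) gradient of the mini-batch loss, with L_i = (theta * x_i - y_i)^2
    grad = sum(2.0 * (theta * xs[i] - ys[i]) * xs[i] for i in idx) / m_prime
    theta -= eta * grad                             # (e) SGD update; (f) repeat
```

For a real network, step (d) would be carried out by backpropagation rather than the hand-computed gradient used here.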


4.5 The Universal Approximation Theorem

An important theoretical result that sheds some light on why neural networks perform well is the universal approximation theorem; see Cybenko (1989) and Hornik (1991). In simple terms, this result states that any continuous function defined on a compact subset of Rn can be approximated arbitrarily well by a feedforward network with a single hidden layer.

Mathematically, the statement of the theorem is as follows: let φ be a nonconstant, bounded, monotonically-increasing continuous function and let Im denote the m-dimensional unit hypercube. Then, given any ε > 0 and any continuous function f defined on Im, there exist N, vi, bi and wi such that the approximation function

F(x) = Σ_{i=1}^N vi φ( wi · x + bi )

satisfies |F(x) − f(x)| < ε for all x ∈ Im.

A remarkable aspect of this result is the fact that the activation function is independent of the function we wish to approximate! However, it should be noted that the theorem makes no statement on the number of neurons needed in the hidden layer to achieve the desired approximation error, nor on whether the estimation of the parameters of this network is even feasible.

4.6 Other Topics

4.6.1 Adaptive Momentum

Recall that the stochastic gradient descent algorithm is parametrized by a learning rate η which determines the step size in the direction of steepest descent given by the gradient vector. In practice, this value should decrease along successive iterations of the SGD algorithm for the network to be properly trained: an appropriately chosen learning rate schedule ensures that the excess error decreases at each iteration. Furthermore, this learning rate schedule can depend on the nature of the problem at hand.

For the reasons discussed in the last paragraph, a number of different algorithms have been developed to find some heuristic capable of guiding the selection of an effective sequence of learning rate parameters. Inspired by physics, many of these algorithms interpret the gradient as a velocity vector, that is, the direction and speed at which the parameters move through the parameter space. Momentum algorithms, for example, calculate the next velocity as a weighted sum of the


gradient from the last iteration and the newly calculated one. This helps minimize instabilities caused by the high sensitivity of the loss function with respect to some directions of the parameter space, at the cost of introducing two new parameters, namely a decay factor and an initialization parameter η0. Assuming these sensitivities are axis-dependent, we can apply different learning rate schedules to each direction and adapt them throughout the training session.

The work of Kingma and Ba (2014) combines the ideas discussed in this section in a single framework referred to as Adaptive Moment Estimation (Adam). The main idea is to increase or decrease the learning rates based on the past magnitudes of the partial derivatives in a particular direction. Adam is regarded as being robust to its hyperparameter values.
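A minimal sketch of the Adam update of Kingma and Ba (2014), with the standard decaying first- and second-moment estimates and their bias corrections. The test objective and the hyperparameter values are illustrative choices of ours:

```python
import math

def adam_minimize(grad, theta0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # decaying average of gradients (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias corrections for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)   # per-direction adapted step
    return theta

# on f(x) = (x - 3)^2 the first step has size ~alpha regardless of the gradient magnitude,
# since m_hat / sqrt(v_hat) normalizes away the scale of g
one_step = adam_minimize(lambda x: 2.0 * (x - 3.0), 0.0, steps=1)
final = adam_minimize(lambda x: 2.0 * (x - 3.0), 0.0, steps=500)
```

The division by the second-moment estimate is what implements the per-direction learning rate adaptation discussed above.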

4.6.2 The Vanishing Gradient Problem

In our analysis of neural networks, we have established that the addition of layers to a network’s architecture can potentially lead to great increases in its performance: increasing the number of layers allows the network to better approximate increasingly complicated functions in a more efficient manner. In a sense, the success of deep learning in current ML applications can be attributed to this notion.

However, this improvement in power can be counterbalanced by the Vanishing Gradient Problem: due to the way gradients are calculated by backpropagation, the deeper a network is, the smaller the derivative of its loss function with respect to weights in early layers becomes. In the limit, depending on the activation function, the gradient can underflow in a manner that causes the weights to not update correctly.

Intuitively, imagine we have a deep feedforward neural network consisting of n layers. At every iteration, each of the network’s weights receives an update that is proportional to the gradient of the error function with respect to the current weights. As these gradients are calculated using the chain rule through backpropagation, the further back a layer is, the more times its gradient is multiplied by already small factors.
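This shrinkage is easy to demonstrate: composing the sigmoid activation n times multiplies the gradient by n factors, each at most 1/4, so the derivative decays geometrically in depth. The depth values below are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, z=0.0):
    """Derivative (by the chain rule) of the sigmoid composed `depth` times, at z."""
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid(z) * (1.0 - sigmoid(z))   # sigma'(z) = sigma(z)(1 - sigma(z)) <= 1/4
        z = sigmoid(z)                            # forward to the next layer
    return grad

shallow = chain_gradient(2)
deep = chain_gradient(50)   # many orders of magnitude smaller than `shallow`
```

With weight matrices in between the picture is more subtle, but the same multiplicative mechanism is what drives gradients in early layers toward zero.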

4.6.3 Long Short-Term Memory and Recurrent Neural Networks

Applications with time or positional dependencies, such as speech recognition and natural language processing, where each layer of the network handles one time/positional step, are particularly prone to the vanishing gradient problem. In particular, the vanishing gradient might mask long-term dependencies between


observation points far apart in time/space.

Colloquially, we could say that the neural network is not able to accurately remember important information from past layers. One way of overcoming this difficulty is to incorporate a notion of memory into the network, training it to learn which inputs from past layers should flow through the current layer and pass on to the next, i.e. how much information should be “remembered” or “forgotten.” This is the intuition behind long short-term memory (LSTM) networks, introduced by Hochreiter and Schmidhuber (1997).

LSTM networks are a class of recurrent neural networks (RNNs) consisting of layers called LSTM units. Each layer is composed of a memory cell, an input gate, an output gate and a forget gate, which regulate the flow of information from one layer to another and allow the network to learn the optimal remembering/forgetting mechanism. Mathematically, some fraction of the gradients from past layers is able to pass through the current layer directly to the next. The magnitude of the gradient that passes through the layer unchanged (relative to the portion that is transformed), as well as the discarded portion, is also learned by the network. This embeds the memory aspect in the architecture of the LSTM, allowing it to circumvent the vanishing gradient problem and learn long-term dependencies; refer to Figure 4.6 for a visual representation of a single LSTM unit.


Figure 4.6: Architecture of an LSTM unit: a new input xt and the output of the last unit yt−1 are combined with past memory information ct−1 to produce a new output yt and store new memory information ct. Source: “A trip down long-short memory lane” by Petar Veličković (https://www.cl.cam.ac.uk/~pv273/slides/LSTMslides.pdf)


Inspired by the effectiveness of LSTM networks, and given the rising importance of deep architectures in modern ML, Srivastava et al. (2015) devised a network that allows gradients from past layers to flow through the current layer. Highway networks use the architecture of LSTMs for problems where the data is not sequential. By adding an “information highway,” which allows gradients from early layers to flow unscathed through intermediate layers to the end of the network, the authors are able to train incredibly deep networks, with depth as high as 100 layers, without vanishing gradient issues.


Chapter 5

The Deep Galerkin Method

5.1 Introduction

We now turn our attention to the application of neural networks to finding solutions to PDEs. As discussed in Chapter 3, numerical methods that are based on grids can fail when the dimensionality of the problem becomes too large. In fact, the number of points in the mesh grows exponentially in the number of dimensions, which can lead to computational intractability. Furthermore, even if we were to assume that the computational cost is manageable, setting up the grid in a way that ensures the stability of the finite difference approach can be cumbersome.

With this motivation, Sirignano and Spiliopoulos (2018) propose a mesh-free method for solving PDEs using neural networks. The Deep Galerkin Method (DGM) approximates the solution to the PDE of interest with a deep neural network. With this parameterization, a loss function is set up to penalize the fitted function’s deviations from the desired differential operator and boundary conditions. The approach takes advantage of the computational graphs and the backpropagation algorithm discussed in the previous chapter to efficiently compute the differential operator, while the boundary conditions are straightforward to evaluate. For the training data, the network uses points randomly sampled from the region where the function is defined, and the optimization is performed using stochastic gradient descent.

The main insight of this approach lies in the fact that the training data consists of randomly sampled points in the function’s domain. By sampling mini-batches from different parts of the domain and processing these small batches sequentially, the neural network “learns” the function without the computational bottleneck present in grid-based methods. This circumvents the curse of dimensionality encountered with the latter approach.


5.2 Mathematical Details

The form of the PDEs of interest is generally described as follows: let u be an unknown function of time and space defined on the region [0, T] × Ω where Ω ⊂ Rd, and assume that u satisfies the PDE:

(∂t + L) u(t, x) = 0,  (t, x) ∈ [0, T] × Ω

u(0, x) = u0(x),  x ∈ Ω  (initial condition)

u(t, x) = g(t, x),  (t, x) ∈ [0, T] × ∂Ω  (boundary condition)

The goal is to approximate u with an approximating function f(t, x; θ) given by a deep neural network with parameter set θ. The loss functional for the associated training problem consists of three parts:

1. A measure of how well the approximation satisfies the differential operator:

‖ (∂t + L) f(t, x; θ) ‖²_{[0,T]×Ω, ν1}

Note: parameterizing f as a neural network means that the differential operator can be computed easily using backpropagation.

2. A measure of how well the approximation satisfies the boundary condition:

‖ f(t, x; θ) − g(t, x) ‖²_{[0,T]×∂Ω, ν2}

3. A measure of how well the approximation satisfies the initial condition:

‖ f(0, x; θ) − u0(x) ‖²_{Ω, ν3}

In all three terms above the error is measured in the L2-norm, i.e. using ‖h(y)‖²_{Y, ν} = ∫_Y |h(y)|² ν(y) dy, with ν(y) a density defined on the region Y.

Combining the three terms above gives us the cost functional associated with training the neural network:

L(θ) = ‖ (∂t + L) f(t, x; θ) ‖²_{[0,T]×Ω, ν1} + ‖ f(t, x; θ) − g(t, x) ‖²_{[0,T]×∂Ω, ν2} + ‖ f(0, x; θ) − u0(x) ‖²_{Ω, ν3}

where the three terms penalize the error in the differential operator, the boundary condition and the initial condition, respectively.


The next step is to minimize the loss functional using stochastic gradient descent. More specifically, we apply the algorithm defined in Figure 5.1. The description given in Figure 5.1 should be thought of as a general outline, as the algorithm should be modified according to the particular nature of the PDE being considered.

1. Initialize the parameter set θ0 and the learning rate αn.

2. Generate random samples from the domain’s interior and time/spatial boundaries, i.e.

• Generate (tn, xn) from [0, T] × Ω according to ν1

• Generate (τn, zn) from [0, T] × ∂Ω according to ν2

• Generate wn from Ω according to ν3

3. Calculate the loss functional for the current mini-batch (the randomly sampled points sn = {(tn, xn), (τn, zn), wn}):

• Compute L1(θn; tn, xn) = ((∂t + L) f(tn, xn; θn))²

• Compute L2(θn; τn, zn) = (f(τn, zn; θn) − g(τn, zn))²

• Compute L3(θn; wn) = (f(0, wn; θn) − u0(wn))²

• Compute L(θn; sn) = L1(θn; tn, xn) + L2(θn; τn, zn) + L3(θn; wn)

4. Take a descent step at the random point sn with Adam-based learning rates:

θn+1 = θn − αn ∇θ L(θn; sn)

5. Repeat steps (2)-(4) until ‖θn+1 − θn‖ is small.

Figure 5.1: Deep Galerkin Method (DGM) algorithm.

It is important to notice that the problem described here is strictly an optimization problem. This is unlike typical machine learning applications, where we are concerned with issues of underfitting, overfitting and generalization. Typically, arriving at a parameter set where the loss function is equal to zero would not be desirable, as it suggests some form of overfitting. However, in this context a neural network that achieves this is the goal, as it would be the solution to the PDE at hand. The only case where generalization becomes relevant is when we are unable to sample points everywhere within the region where the function is defined, e.g. for functions defined on unbounded domains. In this case, we would be interested in examining how well the function satisfies the PDE in those unsampled regions. The results in the next chapter suggest that this generalization is often very poor.
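One way to make the loss construction concrete is to evaluate the three sampled loss terms at the known solution of a simple PDE, where the loss should be numerically zero. The sketch below does this for the 1D heat equation u_t = u_xx on [0, π], using central finite differences for the derivatives; the choice of equation, the uniform sampling measures and the sample counts are our illustrative assumptions, not those of the report:

```python
import math
import random

def u(t, x):
    # exact solution of the heat equation u_t = u_xx on [0, pi],
    # with initial condition u(0, x) = sin(x) and boundary u(t, 0) = u(t, pi) = 0
    return math.exp(-t) * math.sin(x)

def pde_residual(f, t, x, h=1e-3):
    # (d_t - d_xx) f at (t, x), approximated by central finite differences
    f_t = (f(t + h, x) - f(t - h, x)) / (2 * h)
    f_xx = (f(t, x + h) - 2 * f(t, x) + f(t, x - h)) / (h * h)
    return f_t - f_xx

rng = random.Random(0)
T, n = 1.0, 100

# 1. differential-operator term on sampled interior points
interior = sum(pde_residual(u, rng.uniform(0, T), rng.uniform(0, math.pi)) ** 2
               for _ in range(n)) / n
# 2. boundary term (the boundary data g is zero here)
boundary = sum(u(rng.uniform(0, T), rng.choice([0.0, math.pi])) ** 2
               for _ in range(n)) / n
# 3. initial-condition term, with u0(x) = sin(x)
xs0 = [rng.uniform(0, math.pi) for _ in range(n)]
initial = sum((u(0.0, x) - math.sin(x)) ** 2 for x in xs0) / n

loss = interior + boundary + initial   # ~0 at the true solution, up to discretization error
```

Training DGM amounts to replacing the exact solution u above with a neural network f(t, x; θ) and driving this sampled loss toward zero.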


5.3 A Neural Network Approximation Theorem

Theoretical motivation for using neural networks to approximate solutions to PDEs is given by an elegant result in Sirignano and Spiliopoulos (2018), which is similar in spirit to the Universal Approximation Theorem. More specifically, it is shown that deep neural network approximators converge to the solution of a class of quasilinear parabolic PDEs as the number of hidden units tends to infinity. To state the result in more precise mathematical terms, define the following:

• L(θ), the loss functional measuring the neural network’s fit to the differential operator and boundary/initial/terminal conditions;

• Cn, the class of neural networks with n hidden units;

• fn = arg min_{f∈Cn} L(f), the best neural network approximation in Cn to the PDE solution.

The main result is the convergence of the neural network approximators to the true PDE solution:

fn → u as n→∞

Further details, conditions, the statement of the theorem and proofs are found in Section 7 of Sirignano and Spiliopoulos (2018). It should be noted that, similar to the Universal Approximation Theorem, this result does not prescribe a way of designing or estimating the neural network successfully.

5.4 Implementation Details

The architecture adopted by Sirignano and Spiliopoulos (2018) is similar to that of the LSTMs and Highway Networks described in the previous chapter. It consists of three layers, which we refer to as DGM layers: an input layer, a hidden layer and an output layer, though this can easily be extended to allow for additional hidden layers.

From a bird’s-eye perspective, each DGM layer takes as input the original mini-batch inputs x (in our case, the set of randomly sampled time-space points) and the output of the previous DGM layer. This process culminates in a vector-valued output y which consists of the neural network approximation of the desired function u evaluated at the mini-batch points. See Figure 5.2 for a visualization of the overall architecture.



Figure 5.2: Bird’s-eye perspective of overall DGM architecture.

Within a DGM layer, the mini-batch inputs, along with the output of the previous layer, are transformed through a series of operations that closely resemble those in Highway Networks. Below we present the architecture in equations, along with a visual representation of a single DGM layer in Figure 5.3:

S1 = σ( w1 · x + b1 )

Zℓ = σ( u^{z,ℓ} · x + w^{z,ℓ} · Sℓ + b^{z,ℓ} ),  ℓ = 1, ..., L

Gℓ = σ( u^{g,ℓ} · x + w^{g,ℓ} · Sℓ + b^{g,ℓ} ),  ℓ = 1, ..., L

Rℓ = σ( u^{r,ℓ} · x + w^{r,ℓ} · Sℓ + b^{r,ℓ} ),  ℓ = 1, ..., L

Hℓ = σ( u^{h,ℓ} · x + w^{h,ℓ} · (Sℓ ⊙ Rℓ) + b^{h,ℓ} ),  ℓ = 1, ..., L

S^{ℓ+1} = (1 − Gℓ) ⊙ Hℓ + Zℓ ⊙ Sℓ,  ℓ = 1, ..., L

f(t, x; θ) = w · S^{L+1} + b

where ⊙ denotes Hadamard (element-wise) multiplication, L is the total number of layers, σ is an activation function, and the u, w and b terms with various superscripts are the model parameters.

Similar to the intuition for LSTMs, each layer produces weights based on the last layer, determining how much of the information gets passed to the next layer. In Sirignano and Spiliopoulos (2018), the authors also argue that including repeated element-wise multiplication of nonlinear functions helps capture the “sharp turn” features present in more complicated functions. Note that at every iteration the original input enters into the calculations of every intermediate step, thus decreasing the probability of vanishing gradients of the output function with respect to x.
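A scalar sketch of one forward pass through these equations may help fix ideas. In the actual method all quantities are vectors and matrices and the input bundles time and space; here we collapse everything to a single scalar x, and the parameter values are arbitrary toy choices of ours:

```python
import math

def sigma(v):
    return math.tanh(v)

def dgm_forward(x, params):
    """Scalar sketch of the DGM forward pass (vectors/matrices in the real method)."""
    w1, b1, gates, w_out, b_out = params
    S = sigma(w1 * x + b1)                       # S1: initial state from the raw input
    for (uz, wz, bz, ug, wg, bg, ur, wr, br, uh, wh, bh) in gates:
        Z = sigma(uz * x + wz * S + bz)
        G = sigma(ug * x + wg * S + bg)
        R = sigma(ur * x + wr * S + br)
        H = sigma(uh * x + wh * (S * R) + bh)    # S * R stands in for the Hadamard product
        S = (1 - G) * H + Z * S                  # gated state update, as in LSTMs/highway nets
    return w_out * S + b_out                     # linear read-out f(x; theta)

# arbitrary toy parameters: three identical DGM layers
layer = (0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2, 0.1, 0.0, 0.4, 0.2, 0.0)
params = (0.5, 0.0, [layer] * 3, 1.0, 0.0)
y = dgm_forward(2.0, params)
```

Note how the raw input x re-enters every gate in every layer, which is the feature credited in the text with mitigating vanishing gradients with respect to x.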

Compared to a Multilayer Perceptron (MLP), the number of parameters in each hidden layer of the DGM network is roughly eight times larger than the same



Figure 5.3: Operations within a single DGM layer.

number in a usual dense layer: each DGM layer has eight weight matrices and four bias vectors, while an MLP layer has only one weight matrix and one bias vector (assuming the matrix/vector sizes are similar to each other). Thus, the DGM architecture, unlike a deep MLP, is able to handle issues of vanishing gradients while being flexible enough to model complex functions.

Remark on Hessian implementation: second-order differential equations call for the computation of second derivatives. In principle, given a deep neural network f(t, x; θ), the computation of higher-order derivatives by automatic differentiation is possible. However, for x ∈ Rn with n > 1, the computation of those derivatives becomes computationally costly, due to the quadratic number of second derivative terms and the memory-inefficient manner in which the algorithm computes this quantity for larger mini-batches. For this reason, we implement a finite difference method for computing the Hessian, along the lines of the methods discussed in Chapter 3. In particular, for each of the sample points x, we compute the value of the neural net and its gradients at the points x + h ej and x − h ej for each canonical vector ej, where h is the step size, and estimate the Hessian by central finite differences, resulting in a precision of order O(h²). The resulting matrix H is then symmetrized by the transformation 0.5 (H + Hᵀ).
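A sketch of the finite-difference Hessian described above, using the standard central stencils; the symmetrization step mirrors the text (it is a no-op here, by construction), and the cubic test function is an illustrative choice of ours:

```python
def fd_hessian(f, x, h=1e-4):
    """Central finite-difference Hessian of f: R^n -> R at the point x (a list of floats)."""
    n = len(x)

    def at(i, di, j=None, dj=0.0):
        y = list(x)
        y[i] += di
        if j is not None:
            y[j] += dj
        return f(y)

    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # diagonal: (f(x + h e_i) - 2 f(x) + f(x - h e_i)) / h^2
        H[i][i] = (at(i, h) - 2.0 * f(list(x)) + at(i, -h)) / (h * h)
        for j in range(i + 1, n):
            # off-diagonal: four-point central stencil, O(h^2) accurate
            H[i][j] = (at(i, h, j, h) - at(i, h, j, -h)
                       - at(i, -h, j, h) + at(i, -h, j, -h)) / (4.0 * h * h)
            H[j][i] = H[i][j]
    # symmetrize as in the text: 0.5 * (H + H^T)
    return [[0.5 * (H[i][j] + H[j][i]) for j in range(n)] for i in range(n)]

# f(x, y) = x^2 * y has Hessian [[2y, 2x], [2x, 0]]
H = fd_hessian(lambda p: p[0] * p[0] * p[1], [1.0, 2.0])
```

For the d-dimensional PDEs considered later, f would be the trained network evaluated at a fixed time, and the same stencils would be applied coordinate pair by coordinate pair.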


Chapter 6

Implementation of the Deep Galerkin Method

In this chapter we apply the Deep Galerkin Method to solve various PDEs that arise in financial contexts, as discussed in Chapter 2. The application of neural networks to the problem of numerically solving PDEs (and other problems) requires a great deal of experimentation and implementation decisions. Even with the basic strategy of using the DGM method, there are already numerous decisions to make, including:

• the network architecture;

• the size of the neural network to use to achieve a good balance between execution time and accuracy;

• the choice of activation functions and other hyperparameters;

• the random sampling strategy, the selection of optimization and numerical (e.g. differentiation and integration) algorithms, and the training intensity;

• programming environment.

In light of this, our approach was to begin with simpler and more manageable PDEs and then, as stumbling blocks were gradually surpassed, move on to more challenging ones. We present the results of applying DGM to the following problems:

1. European Call Options: We begin with the Black-Scholes PDE, a linear PDE which has a simple analytical solution and is a workhorse model in finance. This also creates the basic setup for the remaining problems.


2. American Put Options: Next, we tackle American options, whose main challenge is the free boundary, which needs to be found as part of the solution of the problem. This requires us to adapt the algorithm (particularly, the loss function) to handle this particular detail of the problem.

3. The Fokker-Planck Equation: Subsequently, we address the Fokker-Planck equation, whose solution is a probability density function with special constraints (such as being positive on its domain and integrating to one) that need to be met by the method.

4. Stochastic Optimal Control Problems: For even more demanding challenges, we focus on HJB equations, which can be highly nonlinear. In particular, we consider two optimal control problems: the Merton problem and the optimal execution problem.

5. Systemic Risk: The systemic risk problem allows us to apply the method to a multidimensional system of HJB equations, which involves multiple variables and equations with a high degree of nonlinearity.

6. Mean Field Games: Lastly, we close our work with mean field games, which are formulated in terms of coupled HJB and Fokker-Planck equations.

The variety of problems to which we manage to successfully apply the method attests to the power and flexibility of the DGM approach.

6.1 How this chapter is organized

Each section in this chapter highlights one of the case studies mentioned in the list above. We begin with the statement of the PDE and its analytical solution and proceed to present (possibly several) attempted numerical solutions based on the DGM approach. The presentation is done in such a way as to highlight the experiential aspect of our implementation. As such, the first solution we present is by no means the best; the hope is to demonstrate the learning process surrounding the DGM and how our solutions improve along the way. Each example is intended to highlight a different challenge faced, usually associated with the difficulty of the problem, which is generally increasing across examples, and a proverbial “moral of the story.”


An important caveat is that, in some cases, we do not tackle the full problem, in the sense that the PDEs given at the start of each section are not always in their primal form. The reason for this is that the PDEs may be too complex to implement in the DGM framework directly. This is especially true of HJB equations, which involve an optimization step as part of the first order condition. In these cases we resort to simplified versions of the PDEs obtained using simplifying ansatzes, but we emphasize that even these can be of significant difficulty.

Remark (a note on implementation): in all the upcoming examples we use the same network architecture used by Sirignano and Spiliopoulos (2018), presented in Chapter 5, initializing the weights with Xavier initialization. The network was trained for a number of iterations (epochs) which may vary by example, with random resampling of points for the interior and terminal conditions every 10 iterations. We also experimented with regular dense feedforward neural networks and managed to have some success fitting the first problem (European options), but we found them to be less likely to fit more irregular functions and more unstable to hyperparameter changes as well.

6.2 European Call Options

1: One-Dimensional Black-Scholes PDE

\[
\partial_t g(t,x) + r x\, \partial_x g(t,x) + \tfrac{1}{2}\sigma^2 x^2\, \partial_{xx} g(t,x) = r\, g(t,x)
\]
\[
g(T, x) = G(x)
\]

Solution:

\[
g(t,x) = x\,\Phi(d_+) - K e^{-r(T-t)}\,\Phi(d_-), \qquad d_\pm = \frac{\ln(x/K) + \left( r \pm \tfrac{1}{2}\sigma^2 \right)(T - t)}{\sigma\sqrt{T - t}}
\]

As a first example of the DGM approach, we trained the network to learn the valueof a European call option. For our experiment, we used the interest rate r = 5%,the volatility σ = 25%, the initial stock price S0 = 50, the maturity time T = 1 andthe option’s strike price K = 50. In Figure 6.2 we present the true and estimatedoption values at different times to maturity.
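The closed-form benchmark against which the network output is compared can be evaluated with the standard library alone; a minimal sketch, with the defaults set to the parameters of the experiment above:

```python
from math import log, sqrt, exp, erf

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bs_call(t, x, K=50.0, r=0.05, sigma=0.25, T=1.0):
    """Black-Scholes price of a European call: the analytical solution
    the DGM network is trained to recover."""
    tau = T - t
    if tau <= 0:
        return max(x - K, 0.0)
    d_plus = (log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d_minus = d_plus - sigma * sqrt(tau)
    return x * norm_cdf(d_plus) - K * exp(-r * tau) * norm_cdf(d_minus)

price = bs_call(0.0, 50.0)  # at-the-money value at t = 0
```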

First, we sampled uniformly on the time domain and according to a lognormaldistribution on the space domain as this is the exact distribution that the stock


prices follow in this model. We also sampled uniformly at the terminal time point.However, we found that this did not yield good results for the estimated function.These sampled points and fits can be seen in the green dots and lines in Figure 6.1and Figure 6.2.

Figure 6.1: Different sampling schemes: lognormal (green), uniform on [0, 1] × [0, 100] (blue) and uniform on [0, 1] × [0, 130] (red)

Since the issues seemed to appear at regions that were not well-sampled we re-turned to the approach of Sirignano and Spiliopoulos (2018) and sampled uni-formly in the region of interest [0, 1] × [0, 100]. This improved the fit, as can beseen in the blue lines of Figure 6.2, however, there were still issues on the right endof the plots with the fitted solution dipping too early.

Finally, we sampled uniformly beyond the region of interest, on [0, 1] × [0, 130], to expose the DGM network to points that lie to the right of the region of interest. This produced the best fit, as can be seen by the red lines in Figure 6.2.
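The three sampling schemes can be sketched as follows (the batch size and seed here are illustrative, not the values used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(42)
S0, r, sigma, T, n = 50.0, 0.05, 0.25, 1.0, 10_000

# Scheme 1: t uniform on [0, T], x lognormal following the GBM stock law
t1 = rng.uniform(0.0, T, n)
z = rng.standard_normal(n)
x1 = S0 * np.exp((r - 0.5 * sigma**2) * t1 + sigma * np.sqrt(t1) * z)

# Scheme 2: uniform on the region of interest [0, T] x [0, 100]
t2, x2 = rng.uniform(0.0, T, n), rng.uniform(0.0, 100.0, n)

# Scheme 3: uniform on the enlarged region [0, T] x [0, 130], so the
# network also sees points to the right of the region of interest
t3, x3 = rng.uniform(0.0, T, n), rng.uniform(0.0, 130.0, n)
```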

Another point worth noting is that the errors are smaller for times closer to maturity. The reason for this behavior may be that the estimation process "draws information" from the terminal condition. Since this term is both explicitly penalized and heavily sampled, the estimated function behaves well in this region. As we move away from this time point, the stabilizing effect diminishes, leading to increased errors.


Figure 6.2: Call Prices as a function of Stock Price: the black dashed line is the true value function, calculated using the Black-Scholes formula; the green, blue and red lines correspond to the three sampling methodologies described above.

Moral: sampling methodology matters!

6.3 American Put Options

2: Black-Scholes PDE with Free Boundary

\[
\partial_t g + r x\, \partial_x g + \tfrac{1}{2}\sigma^2 x^2\, \partial_{xx} g - r\, g = 0 \qquad (t,x) : g(t,x) > G(x)
\]
\[
g(t,x) \ge G(x) \qquad (t,x) \in [0,T] \times \mathbb{R}
\]
\[
g(T,x) = G(x) \qquad x \in \mathbb{R}
\]

where \( G(x) = (K - x)^+ \)

Solution: No analytical solution.


In order to further test the capabilities of DGM nets, we trained the network tolearn the value of American-style put options. This is a step towards increasedcomplexity, compared to the European variant, as the American option PDE for-mulation includes a free boundary condition. We utilize the same parameters asthe European call option case: r = 5%, σ = 25%, S0 = 50, T = 1 and K = 50.

In our first attempt, we trained the network using the method prescribed by Sirignano and Spiliopoulos (2018). Their approach for solving free boundary problems is to sample uniformly over the region of interest (t ∈ [0, 1], S ∈ [0, 100] in our case) and accept/reject the training examples in each batch depending on whether they fall inside or outside the boundary region implied by the last iteration of training. This approach was able to correctly recover option values.

As an alternative approach, we used a different formulation of the loss function that takes the free boundary condition into account instead of the acceptance/rejection methodology. In particular, we applied a loss to all points that violate the condition g(t, x) ≥ G(x) via:

\[
\left\| \max\!\left( -\left( f(t,x;\theta) - (K - x)^+ \right),\, 0 \right) \right\|^2_{[0,T]\times\Omega,\; \nu_1}
\]
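A minimal sketch of this inequality penalty over a batch of points (the function name and batch form are our own):

```python
import numpy as np

def free_boundary_penalty(f_vals, x, K=50.0):
    """Penalty for violating g(t, x) >= (K - x)^+: zero wherever the fitted
    value sits on or above the payoff, quadratic in the shortfall below it."""
    payoff = np.maximum(K - x, 0.0)
    shortfall = np.maximum(-(f_vals - payoff), 0.0)
    return float(np.mean(shortfall**2))

x = np.array([30.0, 40.0, 60.0])
payoff = np.maximum(50.0 - x, 0.0)            # payoffs [20, 10, 0]
ok = free_boundary_penalty(payoff + 1.0, x)   # fitted values above the payoff
bad = free_boundary_penalty(payoff - 2.0, x)  # fitted values below the payoff
```

Points above the payoff contribute nothing, so only violations of the free boundary condition are penalized.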

Figure 6.3 compares the DGM fitted option prices obtained using the alternativeinequality loss for different maturities compared to the finite difference methodapproach. The figure shows that we are successful at replicating the option priceswith this loss function.

Figure 6.4 depicts the absolute error between the estimated put option values and the analytical prices for corresponding European puts given by the Black-Scholes formula. Since the two should be equal in the continuation region, this can be a way of indirectly obtaining the early exercise boundary. The black line is the boundary obtained by the finite difference method, and we see that it is closely matched by our implied exercise boundary. The decrease in the difference between the two option prices below the boundary as time passes reflects the deterioration of the early exercise optionality in the American option.

Moral: loss functions matter!


Figure 6.3: Comparison of American put option prices at various maturities computedusing DGM (red) vs. finite difference methods (blue)

Figure 6.4: Absolute difference between DGM-estimated American put option pricesand analytical solution for corresponding European put options.


6.4 Fokker-Planck Equations

3: Fokker-Planck Equation for OU process with random Gaussian start

\[
\partial_t p + \kappa\, p + \kappa(x - \theta)\,\partial_x p - \tfrac{1}{2}\sigma^2\,\partial_{xx} p = 0 \qquad (t,x) \in \mathbb{R}_+ \times \mathbb{R}
\]
\[
p(0, x) = \frac{1}{\sqrt{2\pi v}}\, e^{-\frac{x^2}{2v}}
\]

Solution: Gaussian density function.

The Fokker-Planck equation introduces a new difficulty in the form of a constraint on the solution. We applied the DGM method to the Fokker-Planck equation for the Ornstein-Uhlenbeck mean-reverting process. If the process begins at a fixed point \(x_0\), i.e. its initial distribution is a Dirac delta at \(x_0\), then the solution of this PDE is known to be the normal distribution

\[
X_T \sim \mathcal{N}\left( x_0\, e^{-\kappa(T-t)} + \theta\left( 1 - e^{-\kappa(T-t)} \right),\; \frac{\sigma^2}{2\kappa}\left( 1 - e^{-2\kappa(T-t)} \right) \right)
\]

Since it is not possible to represent the initial delta directly in numerical form, one would have to approximate it, e.g. with a normal distribution with mean \(x_0\) and a small variance. In the case where the starting point is Gaussian, we use Monte Carlo simulation to determine the distribution at every point in time, but we note that the distribution should remain Gaussian, since we are essentially using a conjugate prior.
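The Monte Carlo reference distribution can be produced with a simple Euler-Maruyama scheme; a sketch, where the step count and seed are illustrative, and we take κ > 0 here so the mean reversion is visible (the experiment below uses κ = 0):

```python
import numpy as np

def simulate_ou(x0_samples, kappa, theta, sigma, T, n_steps=200, seed=1):
    """Euler-Maruyama paths of dX = kappa*(theta - X) dt + sigma dW, used to
    build the simulated histograms the DGM fit is compared against."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0_samples, dtype=float)
    for _ in range(n_steps):
        x = x + kappa * (theta - x) * dt \
              + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Gaussian start (mean 0, standard deviation 0.5)
rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 0.5, size=100_000)
xT = simulate_ou(x0, kappa=1.0, theta=0.5, sigma=2.0, T=1.0)
```

For these parameters the terminal mean should be close to θ(1 − e^{−κT}) ≈ 0.316, which the sample mean reproduces.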

For the DGM algorithm, we used loss function terms for the differential equation itself and the initial condition, and added a penalty to reflect the non-negativity constraint. Though we intended to include another term forcing the integral of the solution to equal one, this proved too computationally expensive, since an integral must be numerically evaluated at each step of the network training phase. For the parameters θ = 0.5, σ = 2, T = 1, κ = 0, Figure 6.5 shows the density estimate p as a function of position x at different time points compared to the simulated distribution. As can be seen from these figures, the fitted distributions had issues around the tails and with the overall height of the fitted curve, i.e. the fitted densities did not integrate to 1. The neural network estimate, while correctly approximating the initial condition, is not able to conserve probability mass and the Gaussian bell shape across time.

To improve on the results, we apply a change of variables:

\[
p(t, x) = \frac{e^{-u(t,x)}}{c(t)}
\]


Figure 6.5: Distribution of Xt at different times. Blue bars correspond to histogramsof simulated values; red lines correspond to the DGM solution of the required Fokker-Planck equation.

where c(t) is a normalizing constant. This amounts to fitting an exponentiated, normalized neural network that is guaranteed to remain positive and integrate to unity. This approach provides an alternative PDE to be solved by the DGM method:

\[
\partial_t u + \kappa(x - \theta)\,\partial_x u - \frac{\sigma^2}{2}\left[ \partial_{xx} u - (\partial_x u)^2 \right] = \kappa + \frac{\int (\partial_t u)\, e^{-u}\, dx}{\int e^{-u}\, dx}
\]
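On a grid, the substitution itself is straightforward to realize; a sketch of recovering a positive, unit-mass density from an unconstrained output, where the quadratic `u` is a hypothetical stand-in for a trained network:

```python
import numpy as np

def trapezoid(y, x):
    # simple trapezoidal quadrature
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def density_from_u(u_vals, x_grid):
    """p = exp(-u)/c: positive by construction, normalized by quadrature."""
    unnorm = np.exp(-u_vals)
    return unnorm / trapezoid(unnorm, x_grid)

x = np.linspace(-10.0, 10.0, 2001)
u = 0.5 * (x - 0.5)**2        # stand-in for a trained network output u(t, .)
p = density_from_u(u, x)
mass = trapezoid(p, x)        # equals 1 up to roundoff, by construction
```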

Notice that the new equation is a non-linear PDE that depends on an integral term. To handle the integral term and avoid the costly operation of numerically integrating at each step, we first sample \(\{t_j\}_{j=1}^{N_t}\) uniformly from \([0, T]\) and \(\{x_k\}_{k=1}^{N_x}\) uniformly from \([x_{\min}, x_{\max}]\); then, for each \(t_j\), we use importance sampling to approximate the expectation term by

\[
I_{t_j} := \sum_{k=1}^{N_x} \left( \partial_t u(t_j, x_k) \right) w(x_k), \qquad w(x_k) = \frac{e^{-u(t_j, x_k)}}{\sum_{k'=1}^{N_x} e^{-u(t_j, x_{k'})}}
\]

Note that since the density of the uniform distribution is constant within the sampling region, the denominator terms in the weights cancel. The L1 loss is then


approximated by:

\[
\frac{1}{N_t}\,\frac{1}{N_x} \sum_{j=1}^{N_t} \sum_{k=1}^{N_x} \left| (\partial_t + \mathcal{L})\, u(t_j, x_k;\, I_{t_j}, \theta) \right|
\]
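The self-normalized weight computation can be sketched as follows, where `u` and `du_dt` are hypothetical stand-ins for the network outputs at a fixed \(t_j\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, 50_000)   # uniform draws on [x_min, x_max]

# Hypothetical stand-ins for the network outputs at a fixed t_j:
u = 0.5 * x**2                       # u(t_j, x_k)
du_dt = np.sin(x)                    # time derivative of u at (t_j, x_k)

# Self-normalized weights: the constant uniform sampling density cancels
# between the numerator and denominator integrals.
w = np.exp(-u) / np.sum(np.exp(-u))
I_t = float(np.sum(du_dt * w))       # estimate of the integral ratio
```

For this odd-in-x stand-in the true ratio is zero by symmetry, and the estimate is correspondingly small.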

Even though the resulting equation is more complex, using this technique to train the network, solving for u(t, x) and transforming back to p(t, x), allowed us to achieve stronger results, as evidenced by the plots in Figure 6.6.

Figure 6.6: Distribution of Xt at different times. Blue bars correspond to histogramsof simulated values; red lines correspond to the DGM solution of the required Fokker-Planck equation using the modified approach.

Notice that the network was able to accurately recover the shape and preserve theprobability mass across time steps.

It is interesting to note that, in this example, the loss of linearity in the PDE was less of an obstacle to solving the problem than imposing the appropriate structure on the desired function.

Moral: prior knowledge matters!


6.5 Stochastic Optimal Control Problems

In this section we tackle a pair of nonlinear HJB equations. The interest is in both the value function and the optimal control. The original form of the HJB equations contains an optimization term (first-order condition) which can be difficult to work with. Here we work with the simplified PDE obtained once the optimal control in feedback form is substituted back in, with an ansatz potentially used to simplify further. Since we are interested in both the value function and the optimal control, and the optimal control is written in terms of derivatives of the value function, a further step of numerically differentiating the DGM output (based on finite differences) is required to obtain the optimal control.
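This final differentiation step can be sketched with central differences (the step size is illustrative, and the quadratic `value` stands in for the trained network):

```python
import numpy as np

def central_diff(f, x, h=1e-4):
    """First and second derivatives of a scalar function f (e.g. the trained
    DGM value function at a fixed t) via central finite differences."""
    fx = (f(x + h) - f(x - h)) / (2.0 * h)
    fxx = (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2
    return fx, fxx

# sanity check against a function with known derivatives
value = lambda x: x**2
fx, fxx = central_diff(value, 3.0)   # should be close to 6 and 2
```

Note that the second derivative amplifies roundoff, which foreshadows the error propagation discussed below when the optimal control divides by it.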

6.5.1 Merton problem

4: Merton Problem - Optimal Investment with Exponential Utility

\[
\partial_t H - \frac{\lambda^2}{2}\,\frac{(\partial_x H)^2}{\partial_{xx} H} + r x\, \partial_x H = 0 \qquad (t,x) \in \mathbb{R}_+ \times \mathbb{R}
\]
\[
H(T, x) = -e^{-\gamma x}
\]

Solution (value function and optimal control):

\[
H(t,x) = -\exp\left[ -x \gamma e^{r(T-t)} - \frac{\lambda^2}{2}(T - t) \right], \qquad \pi^*_t = \frac{\lambda}{\gamma \sigma}\, e^{-r(T-t)}, \qquad \text{where } \lambda = \frac{\mu - r}{\sigma}
\]

In this section, we attempt to solve the HJB equation for the Merton problem with exponential utility. In our first attempts, we found that the second-order derivative appearing in the denominator of the above equation generates large instabilities in the numerical resolution of the problem. Thus, we rewrite the equation by multiplying through by the second-order derivative to obtain:

\[
-\frac{\lambda^2}{2}\,(\partial_x H)^2 + \partial_{xx} H \left( \partial_t H + r x\, \partial_x H \right) = 0
\]

In this formulation, the equation becomes a quasi-linear PDE which was more sta-ble numerically. The equation was solved with parameters r = 0.05, σ = 0.25,µ = 0.2 and γ = 1, with terminal time T = 1, in the region (t, x) ∈ [0, 1]2, withoversampling of 50% in the x-axis.
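As a sanity check on this quasi-linear form (under our reading of the equation, with λ = (µ − r)/σ), plugging the closed-form solution and its exact partial derivatives into the left-hand side should return zero:

```python
from math import exp

# Closed-form H(t,x) = -exp(-x*gamma*e^{r(T-t)} - (lambda^2/2)(T-t)) and its
# exact partials, substituted into the residual
# -(lambda^2/2)(H_x)^2 + H_xx (H_t + r x H_x).
r, sigma, mu, gamma, T = 0.05, 0.25, 0.2, 1.0, 1.0
lam = (mu - r) / sigma

def residual(t, x):
    tau = T - t
    A = -x * gamma * exp(r * tau) - 0.5 * lam**2 * tau
    H = -exp(A)
    A_t = x * gamma * r * exp(r * tau) + 0.5 * lam**2
    A_x = -gamma * exp(r * tau)
    H_t, H_x, H_xx = H * A_t, H * A_x, H * A_x**2
    return -0.5 * lam**2 * H_x**2 + H_xx * (H_t + r * x * H_x)

res = residual(0.3, 0.7)   # vanishes up to roundoff
```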


Figure 6.7: Approximate (red) vs. analytical (blue) value function for the Merton prob-lem.

Figure 6.8: Absolute (left panel) and relative (right panel) error between approximateand analytical solutions of the Merton problem value function.


Figure 6.9: Approximate (red) vs. analytical (blue) optimal control for the Merton prob-lem.

Figure 6.10: Absolute (left panel) and relative (right panel) error between approximateand analytical solutions of the optimal control.


The estimated value function (Figures 6.7 and 6.8) and optimal control (Figure 6.9) are compared with the analytical solution. Examining the plots, we find that the value function is estimated well by the neural network. Notice, however, that at t = 0 the error between the approximate and analytical solutions is larger, though still within an acceptable range. This may once again be due to the fact that the terminal condition has a stabilizing effect on the solution that diminishes as we move away from that time point.

In general, we are interested in the optimal control associated with the HJB equation. In this context, the optimal control involves dividing by the second-order derivative of the value function, which appears to be small in certain regions. This causes a propagation of errors in the computed solution, as seen in Figures 6.9 and 6.10. The approximation appears to be reasonably close at t = 1, but diverges quickly as t goes to 0. Notice that regions with small errors in the value function solution correspond to large errors in the optimal control.

6.5.2 Optimal Execution

5: Optimal Liquidation with Permanent and Temporary Price Impact

\[
\partial_t h(t,q) - \phi q^2 + \frac{1}{4k}\left( b q + \partial_q h(t,q) \right)^2 = 0 \qquad (t,q) \in \mathbb{R}_+ \times \mathbb{R}
\]
\[
h(T, q) = -\alpha q^2
\]

Solution:

\[
h(t,q) = \sqrt{k\phi}\; \frac{1 + \zeta e^{2\gamma(T-t)}}{1 - \zeta e^{2\gamma(T-t)}}\; q^2
\]

where
\[
\gamma = \sqrt{\frac{\phi}{k}}, \qquad \zeta = \frac{\alpha - \tfrac{1}{2} b + \sqrt{k\phi}}{\alpha - \tfrac{1}{2} b - \sqrt{k\phi}}
\]

For the second nonlinear HJB equation, the optimal execution problem was solved with parameters k = 0.01, b = 0.001, φ = 0.1 and α = 0.1, from t = 0 to terminal time T = 1, with q ∈ [0, 5] and oversampling of 50% along the q-axis. The approximation in the plots below shows a good fit to the true value function. The optimal control for this equation depends only on the first derivative of the solution, so errors do not propagate as strongly as in the previous problem; the computed control shows a good fit, worsening as q goes to 0 and t goes to T.
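Evaluating the closed form is direct; a sketch with the parameters above. Note that h is quadratic in q, and that at t = T the coefficient collapses to −(α − b/2), an algebraic consequence of the ζ formula:

```python
from math import sqrt, exp

# Closed-form h(t, q) for the optimal execution problem.
k, b, phi, alpha, T = 0.01, 0.001, 0.1, 0.1, 1.0
gamma = sqrt(phi / k)
zeta = (alpha - 0.5 * b + sqrt(k * phi)) / (alpha - 0.5 * b - sqrt(k * phi))

def h(t, q):
    ratio = (1.0 + zeta * exp(2.0 * gamma * (T - t))) \
          / (1.0 - zeta * exp(2.0 * gamma * (T - t)))
    return sqrt(k * phi) * ratio * q**2
```

Since (1 + ζ)/(1 − ζ) = −(α − b/2)/√(kφ), the terminal value is h(T, q) = −(α − b/2)q², reflecting how the permanent-impact term enters the bookkeeping in this formulation.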


Figure 6.11: Approximate (red) vs. true (blue) value function for the optimal executionproblem.

Figure 6.12: Absolute error between approximate and analytical solutions for the valuefunction of optimal execution problem.


Figure 6.13: Approximate (red) vs. true (blue) optimal trading rate for the optimalexecution problem.

Figure 6.14: Absolute error between approximate and analytical solutions for the optimalcontrol in optimal execution problem.

Moral: going from value function to optimal control is nontrivial!


6.6 Systemic Risk

6: Systemic Risk

\[
\partial_t V^i + \sum_{j=1}^{N} \left[ (a+q)(\bar{x} - x_j) - \partial_j V^j \right] \partial_j V^i + \frac{\sigma^2}{2} \sum_{j,k=1}^{N} \left( \rho^2 + \delta_{jk}(1 - \rho^2) \right) \partial_{jk} V^i + \frac{1}{2}\left( \epsilon - q^2 \right)\left( \bar{x} - x_i \right)^2 + \frac{1}{2}\left( \partial_i V^i \right)^2 = 0
\]
\[
V^i(T, \mathbf{x}) = \frac{c}{2}\left( \bar{x} - x_i \right)^2 \qquad \text{for } i = 1, \dots, N.
\]

Solution:

\[
V^i(t, \mathbf{x}) = \frac{\eta(t)}{2}\left( \bar{x} - x_i \right)^2 + \mu(t), \qquad \alpha^{i,*}_t = \left( q + \left( 1 - \frac{1}{N} \right) \eta(t) \right) \left( \bar{X}_t - X^i_t \right)
\]

where

\[
\eta(t) = \frac{ -\left( \epsilon - q^2 \right)\left( e^{(\delta_+ - \delta_-)(T-t)} - 1 \right) - c\left( \delta_+ e^{(\delta_+ - \delta_-)(T-t)} - \delta_- \right) }{ \left( \delta_- e^{(\delta_+ - \delta_-)(T-t)} - \delta_+ \right) - c\left( 1 - \frac{1}{N^2} \right)\left( e^{(\delta_+ - \delta_-)(T-t)} - 1 \right) }
\]
\[
\mu(t) = \frac{1}{2}\sigma^2\left( 1 - \rho^2 \right)\left( 1 - \frac{1}{N} \right) \int_t^T \eta(s)\, ds
\]
\[
\delta_\pm = -(a + q) \pm \sqrt{R}, \qquad R = (a + q)^2 + \left( 1 - \frac{1}{N^2} \right)\left( \epsilon - q^2 \right)
\]

The systemic risk problem brings our first system of HJB equations (which also happen to be nonlinear). This was solved for the two-player (N = 2) case with correlation ρ = 0.5, σ = 0.1, a = 1, q = 1, ε = 10, c = 1, from t = 0 to terminal time T = 1, with (x1, x2) ∈ [0, 10] × [0, 10], and the results were compared with the analytical solution.

Note that the analytical solution has two symmetries: one between the value functions of the two players, and one around the line x1 = x2. The neural network solution captures both symmetries, fitting the analytical solution for this system. The regions with the largest errors were found along the symmetry axis as t goes to 0, but away from those regions the error in the solution becomes very low. Once again this may be attributed to the influence of the terminal condition.
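As a quick consistency check on the closed-form η(t) (with the two-player parameters above), at t = T it should reduce to c, matching the terminal condition V^i(T, x) = (c/2)(x̄ − x_i)²:

```python
from math import sqrt, exp

# eta(t) from the closed-form solution, two-player parameters as used here.
a, q, eps, c, N, T = 1.0, 1.0, 10.0, 1.0, 2, 1.0
R = (a + q)**2 + (1.0 - 1.0 / N**2) * (eps - q**2)
d_plus, d_minus = -(a + q) + sqrt(R), -(a + q) - sqrt(R)

def eta(t):
    E = exp((d_plus - d_minus) * (T - t))
    num = -(eps - q**2) * (E - 1.0) - c * (d_plus * E - d_minus)
    den = (d_minus * E - d_plus) - c * (1.0 - 1.0 / N**2) * (E - 1.0)
    return num / den

eta_T = eta(T)   # collapses to c at the terminal time
```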


Figure 6.15: Analytical solution to the systemic risk problem.

Figure 6.16: Neural network solution to the systemic risk problem.


Figure 6.17: Absolute error between approximate and analytical solutions for the sys-temic risk problem.

Figure 6.18: Relative error between approximate and analytical solutions for the sys-temic risk problem.


The systemic risk problem was also solved for five players with the same parameters as above, to test the method's capability with higher dimensionality, both in terms of the number of variables and the number of equations in the system. In the figures below, we compare the value function for a player who deviates by ∆x from the initial state x0 = 5. Note that all players have the same value function by symmetry. The plots show that the neural network trained using the DGM approach is beginning to capture the overall shape of the solution, although there is still a fair amount of deviation from the analytical solution. This suggests that more training time, or a better training procedure, should eventually capture the true solution with some degree of accuracy.

Figure 6.19: Analytical solution to the systemic risk problem with five players.

Figure 6.20: Neural network solution to the systemic risk problem with five players.


6.7 Mean Field Games

7: Optimal Liquidation in a Mean Field with Identical Preferences

\[
-\kappa \mu_t q = \partial_t h^a - \phi^a q^2 + \frac{\left( \partial_q h^a \right)^2}{4k} \qquad \text{(HJB equation - optimality)}
\]
\[
H^a(T, x, S, q; \mu) = x + q(S - \alpha^a q) \qquad \text{(HJB terminal condition)}
\]
\[
\partial_t m + \partial_q\left( m \cdot \frac{\partial_q h^a(t, q)}{2k} \right) = 0 \qquad \text{(FP equation - density flow)}
\]
\[
m(0, q, a) = m_0(q, a) \qquad \text{(FP initial condition)}
\]
\[
\mu_t = \int_{(q,a)} \frac{\partial_q h^a(t, q)}{2k}\; m(t, dq, da) \qquad \text{(net trading flow)}
\]

Solution: see Cardaliaguet and Lehalle (2017).

The main challenge with the MFG problem is that it involves both an HJB equation and a Fokker-Planck equation. Furthermore, the density governed by the Fokker-Planck equation must remain positive on its domain and integrate to unity, as we saw previously. A naïve implementation of the MFG problem yields poor results, given that the integral term µt is expensive to compute and the density in the Fokker-Planck equation has constraints that must be satisfied. Using the same exponentiate-and-normalize idea of Section 6.4, we rewrite the density as

\[
m(t, q, a) = \frac{1}{c(t)}\, e^{-u(t,q,a)}
\]

to obtain a PDE for the function u:

\[
-\partial_t u + \frac{1}{2k}\left( -\partial_q u\, \partial_q h^a + \partial_{qq} h^a \right) + \frac{\int (\partial_t u)\, e^{-u}\, dq}{\int e^{-u}\, dq} = 0
\]

Both integral terms are handled by importance sampling as in the Fokker-Planckequation with exponential transformation.

The equation was solved numerically with parameters A, φ, α, k = 1 and terminal time T = 1. The initial mass distribution was a normal distribution with mean E0 = 5 and variance 0.25. Results were calculated for t ∈ [0, 1] and q ∈ [0, 10]. The value function and optimal control, along with the expected value of the mass through time, were compared with the analytical solution (an analytical solution for the probability mass is not available; however, the expected value of this distribution can be computed analytically).


The approximations of the value function and the optimal control were within an acceptable range for the problem, though it should be noted that at t = 0 the approximation diverges as q grows, while still fitting reasonably well. The expected value implied by the fitted density showed a very good fit to the analytical solution. The probability mass could not be compared with an analytical solution, but it is reasonable to believe that it is close to the true solution given the remaining results.

Figure 6.21: Approximate (red) vs. analytical solution for the value function of the MFGproblem.


Figure 6.22: Approximate (red) vs. analytical solution for the optimal control for theMFG problem.

Figure 6.23: Approximate (red) vs. analytical solution for the expected value of thedistribution of agents for the MFG problem.


Figure 6.24: Unnormalized probability mass of inventories for MFG; the curve shifts leftas all traders are liquidating.


6.8 Conclusions and Future Work

The main messages from the implementation of DGM can be distilled into three points:

1. Sampling method matters: similar to choosing a grid in finite differencemethods, where and how the sampled random points used for training arechosen is the single most important factor in determining the quality of theresults.

2. Prior knowledge matters: having some information about the solution candramatically improve the accuracy of the approximations. This proved to beinstrumental in the Fokker-Planck and MFG applications. It also rings truefor finite difference methods and even Monte Carlo methods (a good analogyis the use of control variates).

3. Training time matters: in some cases, including some of our earlier attempts, the loss functions appeared to be decreasing with iterations and the shape of the solutions seemed to be moving in the right direction. As is the case with neural networks, and SGD-based optimization in general, sometimes the answer is to let the optimizer run longer. As a point of reference, Sirignano and Spiliopoulos (2018) ran the algorithm on a supercomputer with a GPU cluster and achieved excellent results in up to 200 dimensions.

The last point regarding runtime is particularly interesting. While finite difference methods are memory-intensive, training the DGM network can take a long time. This hints at the notion known in computer science as the space-time tradeoff. However, it should be noted that the finite difference approach simply will not run in high dimensions, whereas the DGM (when properly executed) will arrive at a solution, though the runtime may be long. It would be interesting to study the space-time tradeoff for numerical methods used to solve PDEs.

As discussed earlier in this work, generalization in our context refers to how wellthe function satisfies the conditions of the PDE for points or regions in the func-tion’s domain that were not sampled in the training phase. Our experiments showthat the DGM method does not generalize well in this sense; the function fits wellon regions that are well-sampled (in a sense, this can be viewed as finding a solu-tion to a similar yet distinct PDE defined over the region where sampling occurs).This is especially problematic for PDEs defined on unbounded domains, since it isimpossible to sample everywhere for these problems using uniform distributions.Even when sampling from distributions with unbounded support, we may under-sample relevant portions of the domain (or oversample regions that are not as rel-evant). Choosing the best distribution to sample from may be part of the problem,


i.e. it may not be clear which is the appropriate distribution to use in the gen-eral case when applying DGM. As such, it would be interesting to explore efficientways of tackling the issue of choosing the sampling distribution. On a related note,one could also explore more efficient methods of random sampling particularly inhigher dimensions, e.g. quasi-Monte Carlo methods, Latin hypercube sampling.

Also, it would be interesting to understand what class of problems DGM can (orcannot) generalize to; a concept we refer to as meta-generalization. Is there an ar-chitecture or training method that yields better results for a wider range of PDEs?

Another potential research question draws inspiration from transfer learning, whereknowledge gained from solving one problem can be applied to solving a differentbut related problem. In the context of PDEs and DGM, does knowing the solutionto a simpler related PDE and using this solution as training data improve the per-formance of the DGM method for a more complex PDE?

Finally, we remark that, above all, neural networks are rarely a "one-size-fits-all" tool. Just as is the case with numerical methods, they need to be modified based on the problem. Continual experimentation and reflection are key to improving results, but a solid understanding of the underlying processes is vital to avoiding "black-box" opacity.


A Note On Performance

In order to get a sense of how sensitive DGM is to the computational environment used to train the neural networks, we benchmarked training times with and without graphics processing units (GPUs). It is well established among machine learning practitioners that GPUs achieve much higher throughput on typical neural network training workloads, owing to parallelization opportunities at the numerical processing level. On the other hand, complex neural network architectures, such as those of LSTM models, may be harder to parallelize. Some disclaimers are relevant at this point: these tests are not meant to be exhaustive or detailed. The goal is only to evaluate how much faster using GPUs can be for the model at hand. Other caveats are that we are using relatively small-scale training sessions and running on a relatively low-performance GPU (GeForce 830M).

First test scenario

Here we start with a DGM network with 3 hidden layers and 50 neurons per layer.At first, we train the network as usual and then with no resampling in the train-ing loop to verify that the resampling procedure is not significantly impacting thetraining times. The numerical values are given in seconds per optimization step.

Seconds / optimization step      CPU      GPU
Regular training                 0.100    0.119
Training without resampling      0.099    0.112

Table 6.1: In-loop resampling impact

Surprisingly, however, we also find that the GPU actually runs slower than the CPU!

Next, we significantly increase the size of the network to check the impact on training times. We train networks with 10 and 20 layers, with 200 neurons per layer in both cases.

Seconds / optimization step      CPU      GPU
10 layers                        5.927    3.873
20 layers                        13.458   8.943

Table 6.2: Network size impact

Now we see that the (regular) training times increase dramatically and that the GPU runs faster than the CPU, as expected. We hypothesize that, given the complexity of the DGM network architecture, the GPU engine implementation is not


able to achieve enough parallelization in the computation graph to run faster thanthe CPU engine implementation when the network is small.

Second test scenario

We begin this section by noting that each hidden layer in the DGM network is roughly eight times bigger than a multilayer perceptron (MLP) layer, since each DGM layer has 8 weight matrices and 4 bias vectors while an MLP layer has only one weight matrix and one bias vector. So here we train an MLP network with 24 hidden layers and 50 neurons per layer (roughly equivalent, in number of parameters, to the 3-layer DGM network above). We also train a bigger MLP network with 80 layers and 200 neurons per layer (roughly equivalent, in number of parameters, to the 10-layer DGM network above).

Seconds / optimization step      CPU      GPU
24 layers                        0.129    0.077
80 layers                        5.617    2.518

Table 6.3: MLP network size impact

From the results above, we verify that the GPU has a clear performance advantage over the CPU, even for small MLP networks.

We also note that, while the CPU training times for the different network architectures (with comparable numbers of parameters) were roughly equivalent, the GPU engine implementation is much more sensitive to the complexity of the network architecture.


Bibliography

Achdou, Y., Pironneau, O., 2005. Computational methods for option pricing. Vol. 30. SIAM.

Almgren, R., Chriss, N., 2001. Optimal execution of portfolio transactions. Journalof Risk 3, 5–40.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Black, F., Scholes, M., 1973. The pricing of options and corporate liabilities. Journalof Political Economy 81 (3), 637–654.

Brandimarte, P., 2013. Numerical methods in finance and economics: a MATLAB-based introduction. John Wiley & Sons.

Burden, R. L., Faires, J. D., Reynolds, A. C., 2001. Numerical analysis. Brooks/colePacific Grove, CA.

Cardaliaguet, P., Lehalle, C.-A., 2017. Mean field game of controls and an applica-tion to trade crowding. Mathematics and Financial Economics, 1–29.

Carmona, R., Sun, L.-H., Fouque, J.-P., 2015. Mean field games and systemic risk.Communications in Mathematical Sciences 14 (4), 911–933.

Cartea, A., Jaimungal, S., 2015. Optimal execution with limit and market orders.Quantitative Finance 15 (8), 1279–1291.

Cartea, A., Jaimungal, S., 2016. Incorporating order-flow into optimal execution.Mathematics and Financial Economics 10 (3), 339–364.

Cartea, A., Jaimungal, S., Penalva, J., 2015. Algorithmic and high-frequency trad-ing. Cambridge University Press.

Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function.Mathematics of control, signals and systems 2 (4), 303–314.

Evans, L. C., 2010. Partial Differential Equations. American Mathematical Society,Providence, R.I.

75

Page 80: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y., 2016. Deep Learning. Vol. 1.MIT press Cambridge.

Henderson, V., Hobson, D., 2002. Substitute hedging. Risk Magazine 15 (5), 71–76.

Henderson, V., Hobson, D., 2004. Utility indifference pricing: An overview. Volumeon Indifference Pricing.

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural computa-tion 9 (8), 1735–1780.

Hornik, K., 1991. Approximation capabilities of multilayer feedforward networks.Neural networks 4 (2), 251–257.

Huang, M., Malhame, R. P., Caines, P. E., 2006. Large population stochastic dy-namic games: closed-loop Mckean-Vlasov systems and the Nash certainty equiv-alence principle. Commun. Inf. Syst. 6 (3), 221–252.

Kingma, D. P., Ba, J., 2014. ADAM: A method for stochastic optimization. arXivpreprint arXiv:1412.6980.

Lasry, J.-M., Lions, P.-L., 2007. Mean field games. Japanese Journal of Mathematics2 (1), 229–260.

Merton, R., 1971. Optimum consumption and portfolio-rules in a continuous-timeframework. Journal of Economic Theory.

Merton, R. C., 1969. Lifetime portfolio selection under uncertainty: Thecontinuous-time case. The Review of Economics and Statistics, 247–257.

Pham, H., 2009. Continuous-Time Stochastic Control and Optimization with Finan-cial Applications. Vol. 61. Springer Science & Business Media.

Shalev-Shwartz, S., Ben-David, S., 2014. Understanding Machine Learning: FromTheory to Algorithms. Cambridge University Press.

Sirignano, J., Spiliopoulos, K., 2018. DGM: A deep learning algorithm for solvingpartial differential equations. arXiv preprint arXiv:1708.07469.

Srivastava, R. K., Greff, K., Schmidhuber, J., 2015. Highway networks. arXivpreprint arXiv:1505.00387.

Touzi, N., 2012. Optimal Stochastic Control, Stochastic Target Problems, and Back-ward SDE. Vol. 29. Springer Science & Business Media.


Machine Learning the Greeks

Team 2

MAURICIO DAROS ANDRADE
Instituto de Matematica Pura e Aplicada

PHILLIPE CASGRAIN
University of Toronto

LUCAS MEIRELES TOMAZ DE ALVARENGA
EMAp, Fundacao Getulio Vargas

AISHAMERIANE VENES SCHMIDT
Universidade Federal de Santa Catarina

Supervisor:
MICHAEL LUDKOVSKI, University of California Santa Barbara

EMAp, Fundacao Getulio Vargas, Rio de Janeiro, Brazil


Contents

Introduction

1 Methodology – The Greeks
  1.1 Delta
  1.2 Theta
  1.3 Gamma

2 Methodology – Gaussian Processes
  2.1 Prior specification: the mean, the covariance functions and their hyperparameters
  2.2 Differentiability
  2.3 Predicting prices and deltas

3 Goodness-of-Fit and No-Arbitrage
  3.1 Goodness-of-Fit Metrics
  3.2 No-Arbitrage Constraints

4 Fitting a GP to Black-Scholes Data
  4.1 One dimensional case
  4.2 Two dimensional case

5 Fitting a GP to Option Price Data
  5.1 The Data
  5.2 Fitting the Gaussian Process
  5.3 Results

Conclusion

A Appendix A – Proofs


We would like to thank our mentor Mike Ludkovski for all of the helpful advice and his guidance throughout this project. We would also like to thank all of the organizers, in particular (and in no order) Andrea Macrina, Rodrigo Targino and Yuri Saporito, for the tremendous support and this wonderful and unique opportunity. Lastly, we would like to thank all of the other participants in the FMTC 2018 for the great fun we've had and the great community we created during this short time.

Aisha, Lucas, Maurício and Phillipe.


Introduction

In finance, hedging refers to the practice of reducing existing market exposure through the construction of an additional portfolio of assets. In practice, a huge amount of weight is placed on being able to correctly hedge risks using whatever means are available. In the case of derivative products, hedging can become particularly complicated, especially when the payoffs of these instruments follow convoluted rules. Because of this, a large body of research has been dedicated to devising ways of hedging the exposure which comes from holding portfolios of derivatives.

The classical approach to hedging derivatives consists of constructing portfolios of simpler assets whose values are related to the payoff of these derivatives. One determines how much of each of these simple assets must be held by computing quantities called Greeks. Simply put, Greeks are measures of the sensitivity of the derivative price with respect to small movements in the corresponding (or underlying) asset.

The computation of these Greeks typically requires a model for the stochastic evolution of the underlying assets and a way to compute the price of the derivative under this model. One can also opt for simpler market models to obtain closed-form expressions for option prices, but at the cost of accuracy. For more complicated market models there exist a variety of methods for computing Greeks, such as those of ?, which make use of Monte Carlo simulations, or the plenitude of other methods outlined in ?, but all of these can become very computationally expensive. Each of these existing methods has the drawback that it must make assumptions about the underlying stochastic behaviour of asset prices, often trading off realism for computational feasibility. Moreover, fitting these models to match observed data is often a very cumbersome process in itself and can be even more difficult than computing the Greeks.

Instead of taking a model-based approach, we propose a model-free and data-oriented method for computing the Greeks of derivatives. This approach leverages the power of the Gaussian process as a model for latent functional relationships


in data. The basic idea behind our approach is to model derivative prices as a function of relevant variables through the use of a Gaussian process. After fitting this model directly to market data, we can then compute the desired Greeks by differentiating the model directly. This approach has a number of advantages. First, there is no need to model the stochastic evolution of the market; instead we model the observed price data directly using a 'non-parametric' approach. Second, the model can provide confidence assessments on the derived Greeks, which can indicate the quality of the model estimates. Last, but not least, this approach is extremely computationally efficient and can go from data to Greeks in a matter of seconds, even when the number of underlying variables is high. This method takes after ?, which uses the Gaussian process to model insurance contract prices directly from data and to infer mortality improvement using the model gradients. The use of Gaussian processes for modelling non-linear relationships in data has also often been applied to solve problems in finance, such as in ?, ?, ?. We make use of the DiceKriging ? R package for all estimation and prediction of Gaussian processes using data. For a comprehensive guide on Gaussian processes, we point the reader to ?.

First, Section 1 covers the basic definitions relating to the Greeks, providing a financial context for the problem we are dealing with. Section 2 gives an overview of the Gaussian process, deriving important properties relating to its differentiability. In Section 3, we test the performance and properties of the Gaussian process when calculating the Greeks of a toy model for option prices. Lastly, in Sections 4 and 5 we apply the model to real options data and compare the model to a benchmark to assess its performance.


Chapter 1

Methodology – The Greeks

Hedging strategies are intended to protect an investor against the market exposure that comes from holding portfolios of derivative instruments. The main idea behind hedging strategies is to hold an additional portfolio of simpler assets which can mimic the behaviour of the original portfolio the investor is trying to hedge, in order to reduce the variation in value of the whole. The Greeks are computed quantities that indicate the sensitivity of a derivative's price to changes in another asset's price. Computed Greeks can be used to construct hedging portfolios for all kinds of derivatives. Assuming that the price of a derivative contract P can be written as a function of some relevant variables x, we define the Greeks to be derivatives of P with respect to these relevant variables. Typically such computations require a model for the underlying asset price, and they can sometimes be very difficult. Before we delve into our Gaussian process methodology for computing Greeks, we give an overview of the classical methods used for their computation.

In the remainder of this chapter, we explore various Greeks through the example of a European option. A European Call option written on an asset with price St, with strike K, pays the owner an amount max(ST − K, 0) at a pre-specified time T. A European Put with the same strike instead pays the owner max(K − ST, 0) at time T. There are of course countless variations of these kinds of options, such as 'American', 'Bermudan' or other exercise styles, or even variations on the payoff with path-wise properties, but we will focus on simple European-style options for our examples.

1.1 Delta

The Greek ∆ represents the "rate of change of the option price with respect to the price of the underlying asset" (?), but it can also be understood as the number of shares needed to perform a simple hedge of a purchased option. For a Put option, the Delta is negative, meaning that a long position in the Put option should be balanced with


a long position in the underlying stock. Visually, one can see the Delta as the slope of the tangent line when plotting the option price against the stock price, as indicated by Figure 1.1. Formally, we have

∆ = ∂P/∂S,  (1.1)

where P denotes the option price and S the price of the underlying asset.

Figure 1.1: Delta is the rate of change in the option price with respect to the price of the underlying asset. Adapted from ?.

Call options have positive Delta values (ranging from 0 to 1), because if the underlying asset price goes up, the Call price will increase as well. Put options, on the other hand, have Deltas in the interval from −1 to 0.

Example 1.1. Delta of a European Stock Put Option
We will now briefly discuss how to employ the Black-Scholes model to value a European Put on a non-dividend-paying stock and how to calculate the Greeks from it, since this will generate the "true" values against which we compare the estimates obtained in the results section.

In this model, the stock price is assumed to be stochastic, with dynamics given by the following stochastic differential equation

dS(t) = µS(t)dt+ σS(t)dW (t) (1.2)

where µ is called the drift, σ the volatility, and the term W(t) is a Brownian Motion responsible for the uncertainty inherent in the stock price. Implicit in this model is that the stock price trajectory is built from increments that are independent and normally distributed (in log terms).
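The dynamics (1.2) can be simulated directly; a minimal sketch (parameter values are illustrative) using the exact log-normal solution S(t) = S(0) exp((µ − σ²/2)t + σW(t)):

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, T, n_steps, n_paths, seed=0):
    """Simulate geometric Brownian motion paths via the exact
    log-normal solution of dS = mu*S dt + sigma*S dW."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # independent Gaussian Brownian increments, one row per path
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * dW
    log_paths = np.cumsum(log_increments, axis=1)
    return s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))

paths = simulate_gbm(s0=40.0, mu=0.06, sigma=0.22, T=1.0, n_steps=252, n_paths=10_000)
```

The sample mean of the terminal prices should be close to S(0)·exp(µT), as the model implies.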

Solving (1.2) provides us with the Black-Scholes result (readers interested in the derivation can see ?), which is going to be used to generate the toy data


prices and Deltas for the examples used in this report. Given the stock price S at time t, we can compute:

d1 = (log(S/K) + (r − q + σ²/2)(T − t)) / (σ√(T − t)),  (1.3)

d2 = d1 − σ√(T − t).  (1.4)

Using the formulas above, one finds that the Put price is given by

BSp = K · exp(−r · (T − t)) · Φ(−d2)− S · exp(−q · (T − t)) · Φ(−d1), (1.5)

and the Delta is equal to

∆ = exp(−q · (T − t)) · (Φ(d1)− 1), (1.6)

where Φ(·) is the standard normal cumulative distribution function.
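The formulas (1.3)-(1.6) translate directly into code; a minimal standard-library sketch (the parameter values are the toy ones used in this chapter):

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_put(S, K, r, q, sigma, tau):
    """Black-Scholes price and Delta of a European put, with tau = T - t."""
    d1 = (log(S / K) + (r - q + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    price = K * exp(-r * tau) * norm_cdf(-d2) - S * exp(-q * tau) * norm_cdf(-d1)
    delta = exp(-q * tau) * (norm_cdf(d1) - 1.0)
    return price, delta

price, delta = bs_put(S=40.0, K=40.0, r=0.06, q=0.0, sigma=0.22, tau=1.0)
```

A quick sanity check is that a finite difference of the price in S reproduces the analytic Delta.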

1.2 Theta

Another Greek of interest is Θ, the derivative of the option price with respect to time. In other words, it measures the change in the value of the option when everything except time is held constant. Formally,

Θ = ∂P/∂t.  (1.7)

Example 1.2. Theta of a European Stock Put Option
For our European Put (taking t = 0, so that T is the time to maturity), we have

Θ = −(S φ(d1) σ exp(−qT)) / (2√T) − qS Φ(−d1) exp(−qT) + rK exp(−rT) Φ(−d2),  (1.8)

where φ(·) and Φ(·) are the standard normal density and cumulative distribution functions, respectively.

An option's price tends to decay as time passes (with everything else remaining constant), due to the decrease in uncertainty regarding the final stock price, meaning that we usually expect negative values for Theta. Unlike the underlying price, the passage of time holds no surprise, so there is no "Θ hedge"; Theta is, however, a common proxy for Gamma, another Greek.


1.3 Gamma

Gamma is the second derivative of the option price with respect to the price of the underlying asset, i.e., it is a metric related to the convexity of a given derivative w.r.t. the underlying asset and reflects the variation of Delta. Larger values of Gamma indicate that Delta is more sensitive to fluctuations in the underlying asset, meaning that in a Delta hedging strategy it would be better to reassess the portfolio position more frequently in order to avoid hedging error.

Γ = ∂²P/∂S².  (1.9)

Example 1.3. Gamma of European Stock Put OptionFor the European Put (or Call) option, Gamma is given by equation (1.10).

Γ = (φ(d1) exp(−qT)) / (Sσ√T).  (1.10)

From (1.10) it is clear that Γ is always positive for the European stock option. Notice that, holding everything except the stock price constant, Gamma is bell-shaped as a function of S, due to the Gaussian density term φ(d1) in the numerator.
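Both properties of (1.10), positivity and the bell shape in S, can be checked numerically; a minimal sketch over an illustrative grid of stock prices:

```python
import numpy as np

def bs_gamma(S, K, r, q, sigma, tau):
    """Black-Scholes Gamma, equation (1.10); identical for calls and puts."""
    d1 = (np.log(S / K) + (r - q + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    phi = np.exp(-0.5 * d1**2) / np.sqrt(2.0 * np.pi)  # standard normal density
    return phi * np.exp(-q * tau) / (S * sigma * np.sqrt(tau))

grid = np.linspace(20.0, 60.0, 81)
gammas = bs_gamma(grid, K=40.0, r=0.06, q=0.0, sigma=0.22, tau=1.0)
```

The maximum of `gammas` sits in the interior of the grid, near the strike, with the curve falling off on both sides.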

An investor would like to maintain zero net total positions in Delta and Gamma, i.e., ideally, one always wants to minimize risk by holding a position that counterbalances market fluctuations. In practice this is not possible, since it would require knowledge of all prices, shocks and changes that could affect the portfolio, not to mention that it would incur high transaction costs. According to ?, when managing a large portfolio depending on a single underlying asset, it is usual to rebalance the portfolio once a day to achieve zero Delta. This is not as easy for Gamma, since it is non-trivial to find other options in the required volume at competitive prices; therefore it is common to simply monitor Gamma.


Chapter 2

Methodology – Gaussian Processes

A common problem in many areas of study, including finance, is to estimate a latent relationship between two variables of interest, x and y, where we typically refer to x as a model input and y as a model output. The challenge is to identify the latent function f which best explains the relationship between x and y. In other words, one has training data D = {(x(i), y(i)) | i = 1, . . . , n} and wishes to use new information (or test data) x∗ to obtain estimates for y∗. In this chapter we go through the basic theory of Gaussian Process (GP) models to emulate f and discuss methods to compute its gradients. We leave the big picture of our methodology to the end of this chapter, where the reader will have a better comprehension of the Gaussian process. In short, the idea is to fit a Gaussian Process to the data and take advantage of the properties of the process in order to calculate the Delta.

In what follows, all bold variables are vectors and X is a collection of vectors. Assume that, for a given stock, the price can be seen as a function of the inputs, f(x), plus a random disturbance. Formally,

y = f(x) + ε (2.1)

where ε ∼ N (0, σ²) is the error term. In order to use our model when new observations arrive (or even to fit a model), one must make some assumptions about the behaviour of the function f. A large number of methods using different assumptions have been developed in past decades and, according to ?, they basically fall into two categories: (1) the analysis is constrained to a fixed class of functions (quadratic, exponential, etc.), which incurs a drastic loss of flexibility; or (2) prior probabilities are assigned to the space of functions in order to, combined with the training data, decide which functions fit the data better (and produce better predictions). Although more flexible, this second approach has a problem related to its dimensionality: it is not possible, within a finite time constraint, to test all possible sets of functions. One possible way to overcome this problem is


by seeing f in (2.1) as a latent factor and fitting a GP to describe its behaviour.

The idea behind GPs is to generalize the Gaussian distribution in order to use it in a function environment: while the Gaussian distribution (in fact, any probability distribution) is used to describe random variables (either in scalar or vector form), one can use stochastic processes to describe the underlying behaviour of functions. This approach is useful because, whilst a function is defined over an infinite number of points, knowledge of the process that governs the function makes trivial the task of evaluating the function at a fixed set of points or even taking new points into consideration (?). Formally, a Gaussian process is defined as in Definition 2.1.

Definition 2.1. Gaussian Process
A Gaussian process (GP) is a collection of random variables (f(x))_{x∈R^D} such that, for every collection x(1), . . . , x(N), the vector (f(x(i)))_{i=1}^{N} is a Gaussian vector.

In order to completely characterize a GP, one only needs to know the mean and covariance functions, denoted by m(x) and k(x,x′) respectively, which we define as follows:

m(x) = E[f(x)],

k(x,x′) = cov(f(x), f(x′)). (2.2)

The notation f(x) ∼ GP(m(x), k(x,x′)) means that f(x) follows a Gaussian process with mean function m(x) and covariance function k(x,x′). Since the GP is a collection of Gaussian random variables, if (f(X(1)), f(X(2))) ∼ N (µ,Σ), then, from the properties of the normal distribution, we automatically have f(X(1)) ∼ N (µ1,Σ11) (where µ1 and Σ11 are the parts of the µ vector and Σ matrix corresponding to f(X(1))). Note that the GP design, through its covariance function, allows us to explicitly compute the correlation structure between entries of a feature, like different asset maturities.

The posterior distribution for f(x∗)|(X,y), where y = f(X) + ε, is obtained by using Bayes' theorem, which combines our prior knowledge (incorporated within the GP and denoted p(f(x∗))) and the likelihood p(y|f(x),x). The prior on f(x∗) is established from information known about the asset behaviour and can be used to include some real-life constraints in the model. This is done through the mean and covariance functions. For both, in our examples, we will show the effect that different prior specifications have on the predictions, both in-sample and out-of-sample. Regarding the prior on the covariances, we will focus on three common specifications that will be


presented to the reader in this chapter.

As said previously, our main interest lies in making forecasts, i.e., we want to be able to predict the price for a given stock, y∗ = f(x∗) + ε. Having the prior and the training data, we are ready to compute the predictive posterior, f(x∗)|y,X, once the test data is available. Given the nature of the GP, f(x)|y,X is also a GP, with f(x)|y,X ∼ GP(m̄(x), k̄(x,x′)), where the new mean and covariance functions, m̄(x) and k̄(x,x′), are functions of the prior and the likelihood information. They represent, respectively, the prediction produced by the model using the data (y,X) and a quality measure of this prediction. Another use of the posterior is to obtain smoothed trajectories for the prices. In fact, once the posterior distribution is obtained, it can also provide credible intervals for predictions.

In order to obtain the conditional distribution f(x∗)|X,y, we can use the joint prior distribution conditioned on the observations, which is also normal. Thus, it is possible to show that (see Appendix A for further details):

f(x)|X,y ∼ GP(m̄(x), k̄(x,x′)), where  (2.3)

m̄(x) = k(x,X)[k(X,X) + Σ]⁻¹ y and  (2.4)

k̄(x,x′) = k(x,x′) − k(x,X)ᵀ[k(X,X) + Σ]⁻¹ k(x′,X).  (2.5)

From (2.4)-(2.5) we can see how the posterior combines prior belief with data information. For example, in the mean function the precision [k(X,X) + Σ]⁻¹ weights the information obtained from y, meaning that less precision (or higher uncertainty) in the prior gives more "weight" to the training set. In a similar way, the posterior covariance function is discounted by a factor reflecting the influence of the observations.
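Equations (2.3)-(2.5) take only a few lines to implement; a minimal numpy sketch with a Gaussian kernel, zero prior mean and Σ = noise·I (the data and hyperparameters are illustrative, not fitted):

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 theta^2)) between rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * theta**2))

def gp_posterior(X, y, X_star, theta=1.0, noise=1e-4):
    """Posterior mean and covariance of f(x*) | X, y, as in (2.4)-(2.5)."""
    K = gaussian_kernel(X, X, theta) + noise * np.eye(len(X))
    K_s = gaussian_kernel(X_star, X, theta)        # k(x*, X)
    K_ss = gaussian_kernel(X_star, X_star, theta)  # k(x*, x*')
    mean = K_s @ np.linalg.solve(K, y)                    # eq. (2.4)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)          # eq. (2.5)
    return mean, cov

# near-noiseless observations of a smooth test function
X = np.linspace(0.0, 5.0, 8)[:, None]
y = np.sin(X).ravel()
X_star = np.array([[2.5]])
mean, cov = gp_posterior(X, y, X_star, theta=1.0)
```

At a held-out input the posterior mean should closely track the true function, with a small posterior variance, exactly the discounting effect described above.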

Example 2.2. Toy example using Black-Scholes prices and Deltas for European-style options
We generated data from the Black-Scholes formula using the following notation and specification:

• S, the price of the stock. In the examples, S = 40;

• K, the strike price of the option. In this example, K ∈ (25, 55);

• r, the risk-free interest rate. Unless specified otherwise, r = 0.06;

• σ = 0.22, the volatility of the underlying asset;

• t and T, the initial and final times. Thus, T − t is the maturity (for simplicity, assume t = 0). In this example, T ∈ {0.5, 1, 3};


Figure 2.1 shows the data generated from the Black-Scholes formula at different strikes and maturities (left), as well as the predictive mean and corresponding credible intervals drawn from a GP using a Gaussian kernel and a linear mean function (right).

(a) Simulated put prices for different maturities and strikes using a Black-Scholes model. (b) Mean and 95% confidence bands for a Gaussian Process with Gaussian kernel and a constant mean function.

Figure 2.1: Toy example for European put prices (put price vs. stock price, maturities 0.5, 1 and 3).

2.1 Prior specification: the mean, the covariance functions and their hyperparameters

The covariance function contains the information about the relationship between different data points, i.e., the kernel choice gives a sense of distance between the data. For instance, two observation points that are near each other have (intuitively) a higher probability of having similar y values than two distant points. Assuming that the mean function is equal to zero, then, for a known covariance function (also known as kernel)¹ k, inference for the GP defined in (2.1) reduces to finding the coefficients of a linear regression model.

In practice, however, one does not know k, so inference for the kernel is necessary. This is achieved by finding the hyperparameters of the kernel equation. In this work we used three different specifications for the kernel: Gaussian (kG(·)), Matern 3/2 (kM3(·)) and Matern 5/2 (kM5(·)), given by equations (2.6)-(2.8), respectively. Figure 2.2 contains draws from the predictive posterior for each kernel, considering the data from Example 2.2.

¹It is required that k satisfies some constraints, which can be difficult to verify. Therefore, the usual procedure is to choose in advance a family of parametric kernels that is known beforehand to be positive definite (?).


kG(|x − y|) := exp(−||x − y||² / (2θ²))   (Gaussian)  (2.6)

kM3(|x − y|) := (1 + √3 |x − y| / θ) exp(−√3 |x − y| / θ)   (Matern 3/2)  (2.7)

kM5(|x − y|) := (1 + √5 |x − y| / θ + 5|x − y|² / (3θ²)) exp(−√5 |x − y| / θ)   (Matern 5/2)  (2.8)
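The three kernels (2.6)-(2.8) can be written as functions of the distance r = |x − y|; a minimal sketch (function names are ours):

```python
import numpy as np

def k_gauss(r, theta):
    """Gaussian kernel (2.6)."""
    return np.exp(-r**2 / (2.0 * theta**2))

def k_matern32(r, theta):
    """Matern 3/2 kernel (2.7): once MS-differentiable sample paths."""
    s = np.sqrt(3.0) * r / theta
    return (1.0 + s) * np.exp(-s)

def k_matern52(r, theta):
    """Matern 5/2 kernel (2.8): twice MS-differentiable sample paths."""
    s = np.sqrt(5.0) * r / theta
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

All three equal 1 at r = 0 and decrease monotonically in the distance, which is the "nearby points are more correlated" behaviour described above.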

Example 2.3. Kernel specification

(a) Gaussian kernel. This kernel is infinitely differentiable, resulting in very small confidence bands when compared to the other kernels. It also produces smoother trajectories that will most often fit real data poorly.

(b) Matern 3/2 kernel. This kernel is once differentiable, resulting in fatter tails and, consequently, larger confidence intervals.

(c) Matern 5/2 kernel. This kernel is twice differentiable and is widely used due to its properties.

Figure 2.2: Toy example for European put prices. Each graph contains the predictive mean and its 95% confidence bands for a GP with a constant mean function, considering different kernels.

Since the Gaussian kernel is infinitely differentiable, one expects a much smoother fit than with the Matern 3/2 and 5/2 kernels, which are only once and twice differentiable (and produce wider credible intervals due to the fatness of their tails). The excess of smoothness of the Gaussian kernel can be an undesirable characteristic because it may not represent the actual properties of the real data; hence the preference for Matern kernels (?). All


three kernels will be generalized later to higher dimensions.

The parameter θ in equations (2.6)-(2.8) is called the length-scale parameter and drives the smoothness of the process for a given kernel. Bigger values of the length-scale are associated with flatter curves, which has a direct impact on whether the estimated trajectories are able to satisfy the monotonicity constraints, and is therefore related to the ∆ estimation. This smoothing effect can be seen as stickiness: a higher θ implies that distant put prices are more correlated than they would be with a smaller θ. The last effect of this parameter can be seen in the predictive mean, where the length-scale acts as the weight between the information from the likelihood and the prior when making predictions.
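The "stickiness" interpretation can be checked directly: under the Gaussian kernel, the implied correlation between two fixed points grows with θ. A tiny sketch (the distance and length-scale values are illustrative):

```python
import numpy as np

def implied_corr(dist, theta):
    """Correlation implied by the Gaussian kernel (2.6) at a given distance."""
    return np.exp(-dist**2 / (2.0 * theta**2))

# correlation between put prices at strikes 10 apart, small vs. large length-scale
low = implied_corr(10.0, theta=5.0)
high = implied_corr(10.0, theta=30.0)
```

With the larger θ the two points are almost perfectly correlated, while with the smaller θ they are nearly decoupled.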

Example 2.4. The length-scale
Figure 2.3 shows a toy example for European put prices. Each graph contains the predictive mean and its 95% confidence bands for a GP with a constant mean function, considering different kernels and varying length-scales. The notation for the length-scale is θ = (θS, θM), where the first coordinate refers to the Strike and the second to the Maturity. Notice how the length-scale impacts the process.

2.2 Differentiability

Delta and Gamma can be computed from the first and second derivatives of a GP. The biggest advantage of this approach is that a Gaussian Process can be well-behaved in many ways, such as having enough derivatives and closed formulas, and, when the derivative exists, it is also a Gaussian Process, leading, in some cases, to closed-form solutions. The computations we are interested in are the mean and the variance of the derivative of the process given the observed data. We start by introducing a notion of differentiability on which we can rely for theorems.

Definition 2.5. Differentiability in GP
A process f(x) is differentiable in the i-th coordinate, in the Mean Squared (MS) sense, if there is another process, denoted by ∂f/∂xi(x), such that

E[((f(x + εei) − f(x))/ε − ∂f/∂xi(x))²] → 0 as ε → 0.  (2.9)

Computationally, MS differentiability allows us to use finite differences to compute the derivatives. More precisely, by taking a sufficiently small ε, one expects the L² norm of the approximation error from replacing ∂f/∂xi(x) by (f(x + εei) − f(x))/ε to be as small as desired.

Theorem 2.6. MS-Differentiability in GP
A process f(x) is MS differentiable in the i-th coordinate if ∂²k/∂xi∂x′i(x, x′) exists.


[Figure 2.3: nine panels of put price versus stock price (30-50), each showing maturities 0.5, 1 and 3. Panels: (a) Gaussian kernel, θ = (10.1, 2.48); (b) Gaussian kernel, θ = (30.3, 7.43); (c) Gaussian kernel, θ = (1.57, 3.18); (d) Matern3/2 kernel, θ = (30.88, 5.00); (e) Matern3/2 kernel, θ = (92.63, 15.00); (f) Matern3/2 kernel, θ = (5.56, 2.24); (g) Matern5/2 kernel, θ = (36.00, 2.34); (h) Matern5/2 kernel, θ = (108.00, 7.02); (i) Matern5/2 kernel, θ = (6.00, 1.53).]

Figure 2.3: Toy example for European put prices. Each graph contains the predictive mean and its 95% confidence bands for a GP with a constant mean function, considering different kernels and varying length-scales. The notation for the length-scale is θ = (θ_S, θ_M), where the first coordinate refers to the strike and the second to the maturity.


Proof. The proof can be seen in ?.

Since we will be working only with the Gaussian, Matern3/2 and Matern5/2 kernels, differentiability is no longer a concern: for all of them, ∂²k/∂x_i∂x'_i(x, x') exists. Proposition 2.7 shows how to calculate the mean and the variance of an MS-differentiable GP.

Proposition 2.7. Given f(x) ∼ GP(m(x), k(x,x')) such that ∂²k/∂x_i∂x'_i(x,x') and ∂m/∂x_i(x) exist, then

\[
\frac{\partial f}{\partial x_i}(x) \sim \mathcal{GP}\left(\frac{\partial m}{\partial x_i}(x),\; \frac{\partial^2 k}{\partial x_i \partial x'_i}(x, x')\right). \qquad (2.10)
\]

Proof. The proof is in Appendix A.

Notice that the derivative of a Gaussian process is again a Gaussian process. We can therefore continue applying Theorem 2.6 to see to what extent the process is MS differentiable. More precisely, if ∂⁴k/∂x_i²∂x'_i²(x,x') exists, then ∂f/∂x_i(x) is MS differentiable, i.e., f(x) is twice MS differentiable. One can keep inspecting further derivatives of k(x,x') to determine how many times f(x) is MS differentiable. For the kernels in (2.6)-(2.8), we can state the following:

1. the Matern3/2 kernel is once MS differentiable²;

2. the Matern5/2 kernel is twice MS differentiable;

3. the Gaussian kernel is infinitely MS differentiable (trivial to show, since it is the kernel of a Gaussian distribution).

Corollary 2.8 (Sample path differentiability). Given f(x) ∼ GP(m(x), k(x,x')) such that ∂²k/∂x_i∂x'_i(x,x') and ∂m/∂x_i(x) exist, then

\[
\frac{\partial f}{\partial x_i}(x)\,\Big|\, X, f(X) \sim \mathcal{GP}\!\left( \frac{\partial k}{\partial x_i}(x, X)\, k(X,X)^{-1} f(X),\;\; \frac{\partial^2 k}{\partial x_i \partial x'_i}(x, x') - \frac{\partial k}{\partial x_i}(x, X)\, k(X,X)^{-1}\, \frac{\partial k}{\partial x'_i}(X, x') \right) \qquad (2.11)
\]

²The general formula for the Matern_ν kernel is

\[
k_\nu(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, d}{\rho}\right)^{\nu} B_\nu\!\left(\frac{\sqrt{2\nu}\, d}{\rho}\right),
\]

where Γ is the gamma function, B_ν(·) is the modified Bessel function of the second kind, and ν and ρ are covariance hyperparameters. It can be shown that a Gaussian process with a Matern kernel is ⌈ν⌉ − 1 times differentiable.


Proof. The proof is in Appendix A.
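As an illustration of the mean term in equation (2.11), here is a hedged sketch of our own (zero mean function, noiseless observations, and the Gaussian sub-kernel of equations (2.19)-(2.20); the toy target `sin` and all variable names are our choices):

```python
import numpy as np

def k(a, b, theta):
    # Gaussian sub-kernel of equation (2.19), evaluated on two grids
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * theta))

def dk_dx(a, b, theta):
    # d/dx k_theta(x, x') = ((x' - x) / theta) k_theta(x, x'), equation (2.20)
    return ((b[None, :] - a[:, None]) / theta) * k(a, b, theta)

theta = 0.5
X = np.linspace(-3.0, 3.0, 15)               # training inputs
fX = np.sin(X)                               # one observed sample of the process
x_star = np.array([0.0, 1.0])                # prediction points

K = k(X, X, theta) + 1e-8 * np.eye(len(X))   # jitter for numerical stability
dmean = dk_dx(x_star, X, theta) @ np.linalg.solve(K, fX)
# For f = sin the true derivative is cos, so dmean should be close to cos(x_star).
```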

It is immediate that (f(x), ∂f/∂x_i(x')) is a (higher-dimensional) Gaussian process; this follows simply from the fact that the L² limit of Gaussian vectors is also a Gaussian vector. One application of this result arises when we are interested in sampling (f(x), ∂f/∂x_i(x'))ᵀ, or its conditioned version. In order to do so, however, it is necessary to know the covariance between f(x) and ∂f/∂x_i(x'), which is derived in the following proposition.

Proposition 2.9 (Covariances in a GP).

\[
\operatorname{cov}\!\left(\frac{\partial f}{\partial x_i}(x),\, f(x')\right) = \frac{\partial k}{\partial x_i}(x, x') \qquad (2.12)
\]

Proof. The proof is in Appendix A.

The mean of (f(x), ∂f/∂x_i(x')) is simply the mean of each process joined together, and the kernel of this Gaussian process is likewise an assembly of the quantities computed above. Also, given the tools shown, one can calculate the kernel and mean of (f(x), ∂f/∂x_i(x)) | X, X', f(X), ∂f/∂x_i(X'). The reader might also be interested in gathering not only the process and its derivative, but also higher-order derivatives. The computation can be of interest, but mathematically one only needs to repeat the propositions above several times. To see this, notice that (∂f/∂x_i(x), f(x')) is a Gaussian process (with (x, x') varying). Hence, we can take the derivative (if the conditions of Theorem 2.6 are satisfied) with respect to coordinate x_j (or x'_j) in order to obtain the distribution of (f(x), ∂f/∂x_i(x'), ∂²f/∂x_j∂x_i(x'')) (or of (f(x), ∂f/∂x_i(x'), ∂f/∂x_j(x''))). Repeating this procedure several times produces the desired higher-order derivatives of the given GP.

Although MS differentiability is a notion that provides us with a useful theorem, one can argue that it is not the notion being sought: in real life, only a single sample of the Gaussian process is observed, and we are interested in working with its derivative. To relieve a reader with such concerns, we present the following theorem:

Theorem 2.10 (Distribution of the derivative of a GP). Let X(t) be a stochastic process defined on [0, 1]. Suppose there are positive constants p_i, r_i and α_i, i = 1, 2, such that

\[
\mathbb{E}\,|X(t+h) - X(t)|^{p_1} \le \frac{\alpha_1 |h|}{|\log |h||^{1+r_1}}
\]

and


\[
\mathbb{E}\,|X(t+h) - 2X(t) + X(t-h)|^{p_2} \le \frac{\alpha_2 |h|^{1+p_2}}{|\log |h||^{1+r_2}}.
\]

Then there is another stochastic process Y(t) such that P(X(t) = Y(t)) = 1 for every t, and Y'(ω)(t) exists and is continuous for every ω in the probability space.

Proof. The proof can be seen in ?.

Note that, in practice, we are only able to observe a finite set of points {x⁽ⁱ⁾}ᵢ₌₁ⁿ of the Gaussian process. If there is a stochastic process Y(x) such that P(f(x) = Y(x)) = 1 for every x, it is not possible to distinguish between f(x) and Y(x), because P(f(x⁽ⁱ⁾) = Y(x⁽ⁱ⁾), i = 1, …, n) = 1. Therefore, for our purposes, f(x) will be differentiable if it satisfies the hypotheses of Theorem 2.10, and it will not be necessary to consider another process Y(x).

Example 2.11 (The mean). The mean function can assume different forms in order to obtain a better fit. In previous examples we assumed m(x) = a, where a is a fixed real number. Assuming that the mean function is zero (or a constant) implies that predictions at x, when using stationary kernels, will shrink towards zero (or towards the constant). One needs to decide case by case whether one behaviour or the other is reasonable, depending on the characteristics of the data. Figure 2.4 shows the effect of changes in the mean function using a Gaussian kernel.

A generic structure for the mean function is given by

\[
m(x) = \beta_0 + \sum_{j=1}^{p} \beta_j h_j(x), \qquad (2.13)
\]

where the h_j(·) are prespecified functions and the β_j need to be estimated. Universal kriging refers to simultaneously fitting the Gaussian process and the estimates β := (β_1, …, β_p)ᵀ. The particular case where the mean function is assumed to be a constant is called ordinary kriging. The following equations are obtained through universal kriging:

\[
\beta := \left(H^T (k(X,X) + \Sigma)^{-1} H\right)^{-1} H^T (k(X,X) + \Sigma)^{-1} y \qquad (2.14)
\]

\[
m_*(X_*) := h(X_*)\,\beta + k_*^T (k(X,X) + \Sigma)^{-1} (y - H\beta) \qquad (2.15)
\]

\[
k_*(X_*, X_*) := k_{**} + \left(h(X_*) - k_*^T (k(X,X) + \Sigma)^{-1} H\right) \left(H^T (k(X,X) + \Sigma)^{-1} H\right)^{-1} \left(h(X_*) - k_*^T (k(X,X) + \Sigma)^{-1} H\right)^T \qquad (2.16)
\]


[Figure 2.4: four panels of put price versus stock price (30-50), for maturities 0.5, 1 and 3, under different mean functions: (a) m(x) = 0; (b) m(x) = const + K(x) + T(x); (c) m(x) = const + K(x)² + T(x); (d) m(x) = const + K(x)² + T(x)² + K(x)T(x) + K(x) + T(x).]

Figure 2.4: Toy example for European put prices. Each graph contains the predictive mean and its 95% confidence bands for a GP with a Gaussian kernel, for different mean functions.

where k_* = (k(X_*, x_i))_{1≤i≤n} and H = (h(x_1), …, h(x_n))ᵀ. Notice that (2.16) reduces to (2.5) when h(·) ≡ 0.

? highlight that imposing too complex a structure on the mean function can lead to overfitting. Cross-validation could be used to counter this but, specifically on our toy example data, cross-validation underperformed in all experiments and was not used in the final results.

2.3 Predicting prices and deltas

In order to estimate Delta, one needs to compute derivatives of the GP. For the Gaussian and Matern kernels we have closed formulas, which are presented below. Observe that since k(x,x') = k(x',x), we have ∂k/∂x_i(x,x') = ∂k/∂x'_i(x',x), and also ∂²k/∂x_i∂x'_i(x,x') = ∂²k/∂x'_i∂x_i(x',x). All three kernels we work with, the Gaussian, Matern3/2 and Matern5/2, are of the form

\[
k(x, x') = \sigma^2 \prod_{i=1}^{D} k_{\theta_i}(x_i, x'_i),
\]

where θ ∈ ℝ^D_{++} is the vector of length-scales (one per coordinate) and σ is the standard deviation. Then, if one wants derivatives of the kernel, one only needs to assemble the sub-kernel derivatives via the following formulas:

\[
\frac{\partial k}{\partial x_i}(x, x') = \frac{\partial k_{\theta_i}}{\partial x}(x_i, x'_i) \prod_{\substack{j = 1, \dots, D \\ j \ne i}} k_{\theta_j}(x_j, x'_j) \qquad (2.17)
\]

\[
\frac{\partial^2 k}{\partial x_i \partial x'_i}(x, x') = \frac{\partial^2 k_{\theta_i}}{\partial x' \partial x}(x_i, x'_i) \prod_{\substack{j = 1, \dots, D \\ j \ne i}} k_{\theta_j}(x_j, x'_j) \qquad (2.18)
\]

The “sub-kernel” k_θ and its derivatives for the Gaussian, Matern3/2 and Matern5/2 kernels are specified in Corollary 2.12.

Corollary 2.12 (Sub-kernels and derivatives for the Gaussian, Matern3/2 and Matern5/2 kernels in a Gaussian process). For the Gaussian kernel, we have

\[
k_\theta(x, x') = e^{-\frac{(x - x')^2}{2\theta}}, \qquad (2.19)
\]

whose first and second derivatives are given by

\[
\frac{\partial k_\theta}{\partial x}(x, x') = \frac{x' - x}{\theta}\, e^{-\frac{(x - x')^2}{2\theta}} \qquad (2.20)
\]

\[
\frac{\partial^2 k_\theta}{\partial x' \partial x}(x, x') = \left( \frac{1}{\theta} - \frac{(x - x')^2}{\theta^2} \right) e^{-\frac{(x - x')^2}{2\theta}}. \qquad (2.21)
\]

The Matern3/2 kernel has a sub-kernel of the form

\[
k_\theta(x, x') = \left( 1 + \frac{\sqrt{3}\,|x - x'|}{\theta} \right) e^{-\frac{\sqrt{3}\,|x - x'|}{\theta}}, \qquad (2.22)
\]

which has first and second derivatives equal to

\[
\frac{\partial k_\theta}{\partial x}(x, x') = -\frac{3(x - x')}{\theta^2}\, e^{-\frac{\sqrt{3}\,|x - x'|}{\theta}} \quad \text{and} \qquad (2.23)
\]

\[
\frac{\partial^2 k_\theta}{\partial x' \partial x}(x, x') = \left( \frac{3}{\theta^2} - \frac{3\sqrt{3}\,|x - x'|}{\theta^3} \right) e^{-\frac{\sqrt{3}\,|x - x'|}{\theta}}, \qquad (2.24)
\]

respectively. Finally, the Matern5/2 sub-kernel is defined as

\[
k_\theta(x, x') = \left( 1 + \frac{\sqrt{5}\,|x - x'|}{\theta} + \frac{5(x - x')^2}{3\theta^2} \right) e^{-\frac{\sqrt{5}\,|x - x'|}{\theta}}. \qquad (2.25)
\]

Its derivatives are given by

\[
\frac{\partial k_\theta}{\partial x}(x, x') = \left( -\frac{5(x - x')}{3\theta^2} - \frac{5\sqrt{5}\,(x - x')|x - x'|}{3\theta^3} \right) e^{-\frac{\sqrt{5}\,|x - x'|}{\theta}} \quad \text{and} \qquad (2.26)
\]

\[
\frac{\partial^2 k_\theta}{\partial x' \partial x}(x, x') = \frac{5}{3} \left( \frac{\sqrt{5}\,|x - x'|}{\theta^3} + \frac{1}{\theta^2} - \frac{5(x - x')^2}{\theta^4} \right) e^{-\frac{\sqrt{5}\,|x - x'|}{\theta}}. \qquad (2.27)
\]

Although it is possible to use the results from Corollary 2.12 to compute the derivatives of the Gaussian process in closed form, the team opted to employ the finite-difference method due to technical and time constraints. Even so, our results performed better, for all three kernels, than applying a simple finite difference to the raw simulated data, as shown in Example 2.13.

Example 2.13 (Derivative from GP versus finite difference directly on data). In this example our aim is to show that differentiating the surface generated by the GP tends to approximate the true derivatives better than naively applying a method like finite differences to the raw data. To do so, we compared the results obtained by differentiating the Gaussian process (using a Gaussian kernel) and by the finite-difference method against the true Delta calculated with equation (1.6). Note that we were able to compute the derivatives with the finite-difference method here only because we are using simulated data; even though it produces passable approximations, this method cannot be employed in real-world applications. We note that we are not using the closed formulas for the derivatives here, due to technical and time constraints; instead, we used finite differences to compute the Gaussian-process derivative. We highlight that this is different from applying finite differences directly to the data.


[Figure 2.5: three panels of Delta error versus stock price (30-45), for maturities 0.5, 1 and 3 and two estimation types (finite difference and Gaussian process): (a) Matern3/2 kernel; (b) Matern5/2 kernel; (c) Gaussian kernel.]

Figure 2.5: Toy example for European put prices. Each graph shows on the y-axis the difference between the true Delta and the estimated value (derivative of the GP versus the finite-difference method). Values outside the x range (30, 50) are out-of-sample predictions, so the estimation error is larger there.


Chapter 3

Goodness-of-Fit and No-Arbitrage

3.1 Goodness-of-Fit Metrics

In order to compare different specifications and methods, one needs to compute goodness-of-fit metrics. We chose root mean squared error (rMSE), bias and coverage as the main summary metrics when comparing with the true values (generated by a Black-Scholes model) in our experiments. The rMSE measures the distance between an estimate and its true value, and thus reflects the accuracy of an estimation method; better estimates have rMSE near or at zero, indicating that the generated values match the values they are trying to predict. The bias, on the other hand, measures how far the sample mean of the generated points is from the true parameter value. The coverage is the percentage of true data points from the testing set that fall into the 95% confidence interval generated by the estimates. Due to the Gaussian nature of our problem, we know that the estimated prices from the GP follow a normal distribution, so one only needs the mean and standard deviation to compute those intervals.

Formally, one can define these quantities (rMSE, bias and coverage) as follows.

Definition 3.1 (Root mean squared error and bias). Let ŷ_1, …, ŷ_n be estimates produced by a stochastic model from some training set X = {x_1, x_2, …, x_p}, and let y_1, …, y_n be the "true" values associated with X. The root mean squared error (rMSE) measures the distance between ŷ_1, …, ŷ_n and y_1, …, y_n, i.e.,

\[
\mathrm{rMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \qquad (3.1)
\]

The bias for the same set of estimates is defined by

\[
\mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i). \qquad (3.2)
\]

Although useful for assessing the quality of the model estimates, both rMSE and bias do not take into account the uncertainty inherent to the stochastic nature of the model. To do so, we calculated the coverage of the estimates, as given in Definition 3.2.

Definition 3.2 (Coverage). Let y_1, …, y_n and ŷ_1, …, ŷ_n be as in Definition 3.1. If ŷ_1, …, ŷ_n are estimates generated by a Gaussian process, it follows from the definition that ŷ_1, …, ŷ_n are jointly normally distributed with mean μ_ŷ and variance σ²_ŷ. The coverage is defined as

\[
\mathrm{Coverage} = \left( \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{y_i \in C} \right) \times 100\%, \qquad (3.3)
\]

where \(\mathbb{I}_{y_i \in C}\) is the indicator function, equal to 1 when y_i lies in the interval \(C = \left( \mu_{\hat{y}} - 1.96\, \frac{\sigma^2_{\hat{y}}}{\sqrt{n}},\ \mu_{\hat{y}} + 1.96\, \frac{\sigma^2_{\hat{y}}}{\sqrt{n}} \right)\).

3.2 No-Arbitrage Constraints

The financial nature of our empirical problem imposes some additional constraints that need to be satisfied so that the prices generated by our model make sense. These additional constraints ensure that the option prices generated by the Gaussian process would not allow for arbitrage, if they were prices listed on an exchange. Although this section does not delve into the derivations and proofs of the no-arbitrage conditions, we refer the reader to ?? for the mathematical details. We review some well-known no-arbitrage constraints that we will use to assess whether our model prices allow arbitrage. The first one is the positivity of the generated prices; that is, option prices should always be non-negative,

\[
P(S_0, K, T, r, \sigma) \ge 0, \qquad (3.4)
\]

which follows from the fact that the payoff of an option is always non-negative, and hence its price should be non-negative as well.

Secondly, we require monotonicity with respect to the underlying asset price. For a put option, this is intuitive because a lower current asset price corresponds to a higher probability that the option ends up in the money. A formal derivation of this requirement can, however, be obtained via a no-arbitrage argument. This requirement translates to

\[
\frac{\partial P}{\partial S}(S, K, T, r, \sigma) < 0, \qquad (3.5)
\]


for all S > 0.

Lastly, the prices generated by our model must be convex in the variable S. This means that we require

\[
\frac{\partial^2 P}{\partial S^2}(S_0, K, T, r, \sigma) \ge 0. \qquad (3.6)
\]

It can be shown that, given a violation of convexity, it is possible to create a portfolio of options that generates arbitrage. Therefore, it is desirable that a model satisfies the conditions (3.4), (3.5) and (3.6) in order to produce a more trustworthy approximation to reality.

The above no-arbitrage constraints ignore the presence of a bid-ask spread and of market frictions. For this reason, we may slightly relax these conditions via a small slack variable ε > 0: rather than verifying the conditions (3.4), (3.5) and (3.6) directly, we can instead verify whether they are satisfied up to a degree ε of error.

The Gaussian process generates prices for many values of S simultaneously. For this reason, we verify all three hypotheses on data generated over a collection of values. More specifically, we generate prices at all values of S present in our training set and verify that the no-arbitrage conditions are met over this entire collection of points. This means that, when reporting results, we refer to the "percentage of paths (generated by the model) for which all test data satisfied the constraints"; we call this method "pathwise". For example, if in a given simulation we report "α% monotonicity", it means that, out of a sample of N paths generated from the GP, in α% of them all testing points satisfied the condition that x_i > x_j implies f(x_i) < f(x_j), for i, j ∈ {1, …, M} (with N, M ∈ ℕ).
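The pathwise checks can be sketched as follows (our own illustration; discrete differences stand in for the derivatives in (3.5) and (3.6), and `eps` plays the role of the slack variable):

```python
import numpy as np

def pathwise_rates(paths, eps=0.0):
    """Fraction of sampled put-price paths (rows, over increasing S) whose
    every point satisfies each no-arbitrage constraint, up to slack eps."""
    diffs  = np.diff(paths, axis=1)              # P(S_{i+1}) - P(S_i)
    second = np.diff(paths, n=2, axis=1)         # discrete curvature
    checks = {
        "positivity":   np.all(paths  >= -eps, axis=1),  # (3.4)
        "monotonicity": np.all(diffs  <=  eps, axis=1),  # (3.5): decreasing in S
        "convexity":    np.all(second >= -eps, axis=1),  # puts are convex in S
    }
    return {name: float(np.mean(ok)) for name, ok in checks.items()}

# Two toy paths: one valid convex decreasing curve, one violating all checks.
good = np.array([5.0, 3.0, 1.5, 0.7, 0.3])
bad  = np.array([5.0, 5.5, 1.0, 2.0, -0.1])
rates = pathwise_rates(np.vstack([good, bad]))
```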


Chapter 4

Fitting a GP to Black-Scholes Data

Having exposed the reader to the basic properties of the Gaussian process in Chapter 2, we can now give a more detailed overview of our methodology in a simulated environment. We start by fitting a Gaussian process to (artificial) option price data with varying maturity, price, strike and so on. Then, given the fitted model, we can apply the whole toolbox developed in Chapter 2 to compute the Greeks of the price process by directly differentiating the GP. Since the Greeks themselves will be Gaussian processes, we can compute their point-wise variance and other statistical properties to assess the quality of these estimates. The advantage of our technique is that it is completely data-driven and does not need to assume a stochastic model for asset prices.

In the sections that follow, we explain the full methodology for fitting Gaussian processes to options data. We cover the many sensitivities of these models to various parameters that can be adjusted by the practitioner. We illustrate these ideas on toy data generated by the Black-Scholes model. All of the examples in this chapter make use of the DiceKriging R package from ? to estimate GP parameters.

4.1 One dimensional case

This section presents results for a Gaussian process fit to the one-dimensional case. Our training data set is composed of the underlying asset price, the put price and the true put Delta. We split our data into a training set (7 observations) and a validation set (300 observations). We fitted the Gaussian process on the training set using maximum likelihood estimation (MLE). In our experiments, we compare the effectiveness of different mean functions and kernels. Table 4.1 shows the hyperparameters for each kernel we tested, using a linear mean function.

In Table 4.2, we compare the quality of each model. All models achieve coverage


         Gauss   Matern3/2  Matern5/2
θ_K      14.94   40.44      45.70
σ²       24.61   43.73      94.85
β_0      32.45   32.15      35.54
β_K      -0.55   -0.51      -0.51
Nugget   0       0          0

Table 4.1: Posterior hyperparameter estimates using the DiceKriging package with MLE as the estimation method.

             RBF        Matern 3/2  Matern 5/2
Price.rMSE   0.0063404  0.1393374   0.0347314
Price.bias   0.0004471  -0.0218995  -0.0049340
Price.cover  1.0000000  1.0000000   1.0000000
Delta.rMSE   0.0027753  0.0516587   0.0145079
Delta.bias   0.0001252  0.0130240   0.0030147
Delta.cover  1.0000000  1.0000000   1.0000000

Table 4.2: Quality of the models using the DiceKriging package with MLE as the estimation method.

of 100%, where coverage is the percentage of data points lying within the 95% confidence bands of the model. The RBF kernel exhibits the best metrics out of all of the models shown.

In contrast, the no-arbitrage conditions are not fulfilled by our models. The RBF and Matern5/2 kernels are more balanced but, to use them in this situation, it would be necessary to create a way to reject samples from the model and recalculate the Gaussian process. Another option is to increase the number of observations, as we can see in Table 4.4. It is important to note that we changed the validation set in this case, because this constraint depends on the range of the information: if one takes enough points with a small distance between them, the constraint will be rejected more often. With this study, we can better understand how the kernels behave.

               RBF     Matern 3/2  Matern 5/2
Monotonicity   0.70    0.00        0.61
Positivity     0.97    0.35        0.92
Convexity      0.86    0.00        0.66
N experiments  200000  200000      200000
Sample size    41      41          41

Table 4.3: Percentage of times that each of the constraints is respected in 200,000 runs of 41 samples, with strikes from 20 (minimum) to 60 (maximum) spaced one by one. The training data had 7 points.


               RBF     Matern3/2  Matern5/2
Monotonicity   1.00    0.25       1.00
Positivity     1.00    0.97       1.00
Convexity      1.00    0.01       1.00
N experiments  200000  200000     200000
Sample size    41      41         41

Table 4.4: Percentage of times that each of the constraints is respected in 200,000 runs of 41 samples, with strikes from 20 (minimum) to 60 (maximum) spaced one by one. The training data had 21 points.

Monotonicity  Positivity  Convexity  N experiments  Sample size  θ_k    θ_t   Train set size
0.30          0.31        0.01       2e+05          31           15.96  5.00  7
0.74          0.56        0.93       2e+05          31           35.60  1.72  13
0.99          1.00        1.00       2e+05          31           55.63  0.95  37
1.00          1.00        1.00       2e+05          31           70.80  0.79  91
1.00          1.00        1.00       2e+05          31           82.85  0.72  187
1.00          1.00        1.00       2e+05          31           91.70  0.72  337
1.00          1.00        1.00       2e+05          31           96.79  0.71  553

Table 4.5: Behaviour of the model as we increase the number of observations used to train the Matern 5/2 model. In each case, we test the model using 25 randomly generated paths from the Gaussian process on 31 different prices. Each row represents an experiment with a training set of a different size. We display the resulting length-scales for each predictor variable, as well as the empirical probability of satisfying each no-arbitrage constraint. We use an approximately exponentially increasing scale for the number of training samples, where the training set is always evenly spaced across the interval [30, 50].

4.2 Two dimensional case

This section presents the results using the Matern5/2 kernel, where our data set is again composed of the underlying price, maturity, put price and the true Black-Scholes Delta value. We split our data into a training set and a validation set, just as before. We compare the in-sample (21 observations) and out-of-sample (1203 observations) results of our model. The Gaussian process is fit on the training set using maximum likelihood estimation (MLE). We also used a linear mean function to compare the kernels, with β₀ = −6.29, β_t = −1.0732, β_s = 0.3807, θ_t = 1.9664, θ_s = 27.4383, σ² = 14.13397 and nugget = 0.00.

In Figure 4.2, we can see how our model predicts the put price. It is already possible to see that our model has problems with the positivity constraint. As discussed in the previous section, it is possible to improve this by increasing the number of observations. Using at least 32 observations, our model satisfies the constraints almost every time, as we can see in Table 4.5.

Another advantage of increasing the number of observations used to train the model is that some of the metrics improve, as we can see in Table 4.6. But



Figure 4.1: Comparison between finite differences and derivatives from a Gaussian process with the Matern5/2 kernel, for different combinations of maturities and stock prices, using data from a Black-Scholes model.

rMSE  eBias  cover  θ_k    θ_t   Train set size
0.14  -0.04  0.79   15.96  5.00  7
0.02  0.00   1.00   35.60  1.72  13
0.01  0.00   1.00   55.63  0.95  37
0.00  0.00   1.00   70.80  0.79  91
0.00  0.00   1.00   82.85  0.72  187
0.00  0.00   1.00   91.70  0.72  337
0.00  0.00   1.00   96.79  0.71  553

Table 4.6: Behaviour of the model, in terms of Delta rMSE, Delta eBias and Delta coverage, as we increase the number of observations used to train the model. Just as before, each row represents a Matern 5/2 model trained on a set of increasing size. For each row, we display various performance metrics for a fixed training-set size. The training sets are generated in the same fashion as in Table 4.5.

as the amount of information is sometimes limited, we chose to keep our model with 21 observations. Thus, in Figure 4.3, we can see that our model shows only a small difference between the predicted Delta and the true Delta inside the region where it was trained. In Figure 4.1, our model gives better estimates than the finite difference.



Figure 4.2: Simulated put prices for different maturities and strikes using a Black Scholes model



Figure 4.3: Difference between the true Delta and the predicted Delta for various maturities and strikes, using a Black-Scholes model.


Chapter 5

Fitting a GP to Option Price Data

In this chapter we apply the Gaussian process (GP) model to estimate call-option Delta by training it on historical options prices. First, we describe the data and provide details on the cleaning and data-processing methodology that was applied before training the GP. Second, we describe the methodology used in fitting the GP, as well as some additional considerations that are needed due to the particular nature of the data. In addition, we discuss some observations that were made regarding model sensitivity and qualitative model properties as some hyperparameters are tweaked in the fitting procedure. Lastly, we provide and discuss the Delta predictions coming from the fitted GP model. These predictions are benchmarked against the Black-Scholes implied Delta for comparison purposes. Just as before, we use the DiceKriging package from ? for fitting the model to data.

5.1 The Data

The options data used in this numerical experiment is obtained from ?, and is freely available online for download. This dataset contains option prices with varying exercise types, for a large number of underlying equities and indices. The data contains market-close prices of these options for each business day during the month of October 2015.

Our objective will be to estimate the Delta and Theta Greeks of European call options on SPX using this historical data. From this data, we only keep options that had non-zero traded volume over the course of the day and a non-zero quoted price. Moreover, we remove any options that had less than a week or more than one year until expiry. In this dataset, we calculate time until expiry in business days between data collection and the exercise date of the option. After filtering this dataset, we select options with strike equal to USD$ 2000. We select this strike since it was one of the most traded strikes over the observed period, and it is close to being at-the-money on average, with the average value of SPX over the


month being roughly USD$ 2024.141. Figure 5.1 gives a visual overview of the data that is used for fitting the Gaussian process.

Figure 5.1: This figure displays the filtered SPX Call option price data for the monthof October 2015. In the above figure, the axis labeled by S represents the price ofthe underlying SPX index, the axis labeled by T represents the remaining time untilmaturity for the option in calendar days, and the vertical axis represents the optionprice. In this figure, each color groups together prices from a business day. Thevertical level of each point represents the observed option midprice at a particularpoint (T, S).

5.2 Fitting the Gaussian Process

From this dataset, we fit the Gaussian process (GP) using both the option bid and ask prices as a function of the midprice of the underlying asset and the remaining time until the option expiry. Since there is a gap between the bid and ask prices, we are required to fit the GP with a non-zero nugget. We choose the nugget at a

34

Page 115: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

particular point to be proportional to the square bid-ask spread at each particulardata point plus a quantity proportional to the variance of days over which thedata was sampled in days. Our choice to make the nugget proportional to thespread comes from the fact that at any particular point in the space of predictors,the variance in data points is usually proportional to the bid ask spread. Second,we add additional noise proportional to the length of the sampling window sincethe variance in option prices and the underlying asset is proportional to the timewindow over which these prices are sampled. Because of this, one would expectpoints from a surface sampled over a period of months to be much noisier thanone sampled within a single day, since if time separating data points is large, theexpected variation in option prices is expected to scale accordingly. We specify thispoint-wise observational noise to be of the form

Noise = β (Spread)2 + γ (Date Range) , (5.1)

where the spread is measured in dollars and the date range as a fraction of a year. We tune the parameters β, γ > 0 to the data. For this particular dataset, we find that β = 2 and γ = 1.5 strike a good balance between fitting prices well and keeping the predicted price surface smooth.
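As an illustration, the point-wise nugget of Eq. (5.1) can be computed as follows. The function name and the convention of converting a business-day range to a year fraction by dividing by 252 are our own assumptions, not taken from the report's code.

```python
import numpy as np

def pointwise_noise(bid, ask, obs_dates, beta=2.0, gamma=1.5):
    """Per-point nugget: beta * spread^2 + gamma * (date range as year fraction).

    `obs_dates` are business-day indices; dividing by 252 converts the
    sampling window to a fraction of a year (our assumption).
    """
    spread = np.asarray(ask) - np.asarray(bid)               # bid-ask spread in dollars
    date_range = (max(obs_dates) - min(obs_dates)) / 252.0   # sampling window, in years
    return beta * spread**2 + gamma * date_range

# Two hypothetical quotes sampled over a 15-business-day window.
noise = pointwise_noise(bid=[9.8, 10.1], ask=[10.2, 10.6], obs_dates=[0, 15])
```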

We fit the data using a Radial Basis Function (RBF) kernel and a constant trend coefficient, estimating the parameters by maximum likelihood. The RBF kernel was chosen for its smoothness properties and through experimentation with other kernels: the smoothest implied GP surfaces were obtained with this kernel, a property we desired in the price fits. We fit the model with a variety of nugget size multipliers and a number of lengthscales, and pick those that give the desired behaviour. In particular, we find that just estimating the lengthscale with a nugget multiplier that is too small causes undesired behaviour in the fits that violates no-arbitrage constraints, such as non-monotonicity and non-convexity in prices. The best remedy for this problem appears to be increasing the nugget variance until the predicted price surface satisfies these constraints.
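A minimal sketch of this fitting step using scikit-learn, assuming its `GaussianProcessRegressor` as a stand-in for whatever GP implementation was actually used. The synthetic (S, T) data, the use of `normalize_y` in place of an explicitly estimated constant trend, and the per-point `alpha` nugget are illustrative choices of ours.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
X = rng.uniform([1900.0, 5.0], [2100.0, 250.0], size=(50, 2))    # (S, T) predictors
y = np.maximum(X[:, 0] - 2000.0, 25.0) + rng.normal(0, 1.0, 50)  # toy option prices
nugget = 0.5 + rng.uniform(0, 0.5, 50)                           # per-point noise variance

gp = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(length_scale=[150.0, 280.0]),
    alpha=nugget,       # heteroscedastic nugget, one value per data point
    normalize_y=True,   # plays the role of a constant trend here
)
gp.fit(X, y)
mean, std = gp.predict(np.array([[2000.0, 30.0]]), return_std=True)
```

In scikit-learn the kernel hyperparameters (the lengthscales and constant factor) are optimized by maximizing the log marginal likelihood during `fit`, mirroring the MLE step described above.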

To ensure the stability of our model, we use a rolling-window methodology in which we train our model on 15 days of data and then use the trained model to predict the option prices out-of-sample during the following day. This training window is rolled over the course of the month to assess the stability of the fits and the variation in price prediction performance. During these tests, we find that the parameter estimates remain relatively stable, with the lengthscales for prices and maturities remaining in the ranges (122, 183) and (270, 286), respectively, with means 153.8 and 278.6. We also test the models out of sample using the mean absolute deviation (MAD) as a performance metric. We find the rolling MAD from the midprice in percentage points to be 4.5% on average, with a standard deviation of 0.035%. The mean posterior GP surface from these trained models crosses the out-of-sample bid-ask spreads in the predicted price data 65% of the time on average. Although the model error seems high, this is expected since there is a significant amount of day-to-day noise in the data.
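The rolling-window test can be sketched as below. Here `fit_model` and `predict` are hypothetical placeholders for the GP fitting and prediction steps, and the toy data at the end is ours.

```python
import numpy as np

def rolling_mad(days, window=15, fit_model=None, predict=None):
    """Train on `window` days, predict the next day, and record the MAD
    from the midprice in percentage points; roll the window forward."""
    mads = []
    for start in range(len(days) - window):
        train = days[start:start + window]
        test = days[start + window]
        model = fit_model(train)
        pred = predict(model, test["X"])
        mads.append(np.mean(np.abs(pred - test["mid"]) / test["mid"]) * 100.0)
    return np.array(mads)

# Toy demo: a "model" that just predicts the training window's average midprice.
days = [{"X": None, "mid": np.array([10.0, 10.5])} for _ in range(17)]
mads = rolling_mad(days, window=15,
                   fit_model=lambda tr: np.mean([d["mid"].mean() for d in tr]),
                   predict=lambda m, X: m)
```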

5.3 Results

Lastly, we fit the model on the entire set of options data in our dataset. In Figure 5.2, we compare the predicted expected price surface to the training set to qualitatively assess the fit.

Figure 5.2: This figure displays the filtered SPX call option price data for the month of October 2015 shown in Figure 5.1, with the same axes. The plot compares the predicted call option price surface to the bid and ask prices in the dataset that the model was trained on. The model surface traverses the bid-ask spread of the training data in most cases.


We find from Figure 5.2 that the model fit appears quite reasonable, based on a visual inspection. Moreover, the in-sample mean absolute deviation from the midprice is 3.9% in percentage points, and the estimated bias is less than 0.03%. Lastly, the surface crosses the bid-ask spreads of around 65% of the training data. The model estimates of the lengthscales for the S and T variables are roughly 163 and 328, respectively. The lengthscale estimated for S lies within the range of the rolling-window fits, whereas the one for T lies outside; one reason for this might be the effect of training on a larger sample with larger variation in the response variable. Using this model, we compute the predicted Delta and Theta surfaces by taking derivatives of the Gaussian process in the appropriate directions. The resulting Delta and Theta surfaces are displayed in Figure 5.3.
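For an RBF kernel, differentiating the posterior mean in closed form gives the Delta (S direction) and Theta (T direction) estimates. The following self-contained numpy sketch, with illustrative data and parameters of our own choosing (not the SPX fit), shows the computation: for $k(x,x') = s^2 \exp(-\sum_j (x_j - x'_j)^2 / (2\ell_j^2))$, one has $\partial k/\partial x_j = -((x_j - x'_j)/\ell_j^2)\, k(x,x')$, so the gradient of $m(x) = k(x,X)K^{-1}y$ follows directly.

```python
import numpy as np

def rbf(A, B, ls, s2=1.0):
    """RBF kernel matrix k(A, B) with per-dimension lengthscales ls."""
    d2 = ((A[:, None, :] - B[None, :, :]) / ls) ** 2
    return s2 * np.exp(-0.5 * d2.sum(-1))

def posterior_mean(x, X, y, ls, s2=1.0, nugget=1e-6):
    """GP posterior mean m(x) = k(x, X) K^{-1} y (zero prior mean)."""
    K = rbf(X, X, ls, s2) + nugget * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return rbf(x[None, :], X, ls, s2)[0] @ alpha

def posterior_mean_grad(x, X, y, ls, s2=1.0, nugget=1e-6):
    """Gradient of the posterior mean: dm/dx_j = dk/dx_j(x, X) K^{-1} y."""
    K = rbf(X, X, ls, s2) + nugget * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    kx = rbf(x[None, :], X, ls, s2)[0]                    # k(x, X_i)
    grad_k = -((x[None, :] - X) / ls ** 2) * kx[:, None]  # dk/dx_j(x, X_i)
    return grad_k.T @ alpha                               # one entry per predictor

# Illustrative (S, T) training points and prices; not the SPX data.
X = np.array([[1950.0, 30.0], [2000.0, 60.0], [2050.0, 90.0]])
y = np.array([10.0, 35.0, 70.0])
ls = np.array([150.0, 280.0])
delta, theta = posterior_mean_grad(np.array([2000.0, 60.0]), X, y, ls)
```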

Figure 5.3: The predicted Delta (left) and Theta (right) surfaces from the model fit to SPX European options. The blue points in each graph are the implied Black-Scholes Deltas and Thetas (respectively), which are used as a benchmark comparison for the estimated surface.

We can see from the left graph in Figure 5.3 that the estimated Delta surface is relatively close in shape to the implied Black-Scholes results that are displayed as points. One consistent pattern is that the surface generated by the GP is almost always below the set of Black-Scholes points; we find that the estimated GP Delta is 6.5% smaller than the corresponding Black-Scholes Delta on average. Overall, the Delta values predicted by the GP seem very reasonable and are within a decent range of the benchmark points. The predicted Delta surface also satisfies the necessary no-arbitrage conditions; that is, it satisfies the monotonicity and convexity requirements expected of any pricing model. In contrast, the predicted GP Theta values are quite different from the benchmark Black-Scholes Thetas. One hypothesis as to why this difference is so large is that the Black-Scholes time to maturity may be computed differently from the method used with our dataset. The Theta values, at the very least, show the correct monotonicity in the time-to-maturity direction. It may also be possible that Black-Scholes implied Theta estimates do not accurately measure the change in value of option prices as they approach their maturity.


Figure 5.4: This figure displays the predicted Delta for the GP model, together with confidence bands, for days to maturity 40 and 120. One prominent feature is that the confidence bands of the Deltas become larger as we approach the left edge of the data, because the model has less data to train on in that region. The mean Delta curves exhibit the convexity and monotonicity properties required of option prices.

Example confidence bands for the predicted Deltas are found in Figure 5.4. We find that these confidence bands around the predicted Deltas cover the Black-Scholes implied values only around 24% of the time, indicating a significant difference between the two sets of predicted values.


Overall, the Gaussian process model performs quite well on the data it was provided. The fits appear to be fairly stable and to match the data well under a number of metrics. Moreover, the GP gives sensible estimates of the Greeks, along with confidence estimates of the values it generates.


Conclusion

Over the course of this paper, we have developed a Gaussian process model for the computation of Greeks. The approach involves fitting a Gaussian process directly to observed option data and then differentiating the model directly to obtain the Greeks. In Chapter 2, we develop the theory required for the differentiation of Gaussian processes and derive the necessary formulas for computing these derivatives. Next, we explain the methodology of the Gaussian process approach in detail and apply it to toy data generated by the Black-Scholes model, in order to test its properties and build intuition for fitting the model to options data. There we list the considerations that must be made when fitting derivatives data, such as no-arbitrage requirements and the consequences of the various available model choices. Lastly, in Chapter 5, we test the model on market data and assess its performance. We find that the model performs well on this example and gives sensible estimates of the Greeks in its particular application.

As with any statistical model, the Gaussian process approach has the potential to be a very powerful tool, but practitioners must be conscious of various caveats. In particular, it is important to perform the necessary gamut of tests on each model, since the performance seen in the example of Chapter 5 may vary across applications. Moreover, an intuition for the particular application of this approach should be developed, as was done in Chapter 2, in order to fully understand the properties of GP models and to be able to predict their behaviour on new data.

There are many future directions for this kind of work. One direction would be to study a restriction of Gaussian processes that is guaranteed to satisfy no-arbitrage properties. We tried to generate processes satisfying such properties using rejection sampling techniques, but found them too sample-inefficient. We believe that the proper way to approach such problems is through the use of MCMC to compute the joint conditional distribution of a Gaussian process and its first two derivatives, with the necessary no-arbitrage restrictions treated as censored observations of these processes.


Another future direction relates to the fitting procedure for the model. In fitting the GP model in Chapter 5, we found that the noise level can vary considerably across data points. Instead of manually specifying how this observational noise scales, an interesting problem would be to try methods that estimate different nugget levels at different points in the data, perhaps controlling the size of these nuggets using tools from the sparse regression literature or other regularization methods.

More generally, non-parametric and data-driven methods such as the one presented in this report have a huge amount of potential and could replace the clunkier methods in use today. Methods of this sort may even be applicable in other areas of finance beyond the estimation of Greeks.


Appendix A

Proofs

This chapter contains the proofs for some of the results used in the report.

Proposition 0.1 (Conditional distribution of Gaussian vectors).

Proof. Gaussian vectors have the useful property of remaining Gaussian after being conditioned on some of their entries. We show this. Let $X \sim \mathcal{N}(0, \Sigma)$ and suppose we have observed $X_2 = x_2$, where

$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{\mathsf{T}} & \Sigma_{22} \end{bmatrix}.$$

Based on this information we can deduce the distribution of $X_1 \mid [X_2 = x_2]$ by the following computation. Define

$$Z = X_1 - \Sigma_{12}\Sigma_{22}^{-1} X_2.$$

Notice that $Z$ is Gaussian and that

$$\mathrm{cov}(Z, X_2) = \mathbb{E}\left[(X_1 - \Sigma_{12}\Sigma_{22}^{-1}X_2)X_2^{\mathsf{T}}\right] = \mathbb{E}[X_1 X_2^{\mathsf{T}}] - \Sigma_{12}\Sigma_{22}^{-1}\mathbb{E}[X_2 X_2^{\mathsf{T}}] = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0.$$

Since uncorrelated jointly Gaussian vectors are independent, $Z$ and $X_2$ are independent. Therefore

$$X_1 = Z + \Sigma_{12}\Sigma_{22}^{-1}X_2 \;\Rightarrow\; X_1 \mid [X_2 = x_2] = Z + \Sigma_{12}\Sigma_{22}^{-1}x_2,$$

from which we deduce that $X_1 \mid [X_2 = x_2]$ is Gaussian with mean $\mathbb{E}[Z] + \Sigma_{12}\Sigma_{22}^{-1}x_2$ and variance $\mathrm{var}(Z)$. Explicitly, one gets

$$X_1 \mid [X_2 = x_2] \sim \mathcal{N}\left(\Sigma_{12}\Sigma_{22}^{-1}x_2,\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^{\mathsf{T}}\right). \qquad \square$$
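A quick Monte Carlo sanity check of this formula (our own illustration, not part of the original proof): build a random covariance, form $Z = X_1 - \Sigma_{12}\Sigma_{22}^{-1}X_2$ on samples, and confirm that $\mathrm{cov}(Z, X_2)$ vanishes while the covariance of $Z$ matches the Schur complement $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^{\mathsf{T}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)   # random symmetric positive-definite covariance
Sigma /= np.abs(Sigma).max()         # rescale for comfortable tolerances
S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

# Conditional covariance predicted by the proposition (Schur complement).
schur = S11 - S12 @ np.linalg.solve(S22, S12.T)

X = rng.multivariate_normal(np.zeros(4), Sigma, size=200_000)
X1, X2 = X[:, :2], X[:, 2:]
Z = X1 - X2 @ np.linalg.solve(S22, S12.T)   # row-wise X1 - Sigma12 Sigma22^{-1} X2
cross_cov = (Z - Z.mean(0)).T @ (X2 - X2.mean(0)) / len(X)
```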

Proof (of Theorem 2.6 - MS-differentiability of a GP).

$$\mathbb{E}\left[\frac{\partial}{\partial x_i} f(x)\right] = \lim_{\varepsilon \to 0} \mathbb{E}\left[\frac{f(x + \varepsilon e_i) - f(x)}{\varepsilon}\right] = \lim_{\varepsilon \to 0} \frac{m(x + \varepsilon e_i) - m(x)}{\varepsilon} = \frac{\partial m}{\partial x_i}(x)$$

and

$$\begin{aligned}
\mathbb{V}\left(\frac{\partial f}{\partial x_i}(x)\right)
&= \lim_{\varepsilon \to 0} \mathbb{V}\left(\frac{f(x + \varepsilon e_i) - f(x)}{\varepsilon}\right) \\
&= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon^2}\left[k(x + \varepsilon e_i, x + \varepsilon e_i) - 2k(x + \varepsilon e_i, x) + k(x, x)\right] \\
&= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left[\frac{k(x + \varepsilon e_i, x + \varepsilon e_i) - k(x + \varepsilon e_i, x)}{\varepsilon} - \frac{k(x + \varepsilon e_i, x) - k(x, x)}{\varepsilon}\right] \\
&= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left[\left(\frac{k(x + \varepsilon e_i, x + \varepsilon e_i) - k(x, x + \varepsilon e_i)}{\varepsilon} - \frac{\partial k}{\partial x_i}(x, x + \varepsilon e_i)\right) + \frac{\partial k}{\partial x_i}(x, x + \varepsilon e_i) - \frac{\partial k}{\partial x_i}(x, x) \right. \\
&\qquad \left. + \left(\frac{\partial k}{\partial x_i}(x, x) - \frac{k(x + \varepsilon e_i, x) - k(x, x)}{\varepsilon}\right)\right] \\
&= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left[o(\varepsilon) + \frac{\partial k}{\partial x_i}(x, x + \varepsilon e_i) - \frac{\partial k}{\partial x_i}(x, x) + o(\varepsilon)\right] \\
&= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left[\frac{\partial k}{\partial x_i}(x, x + \varepsilon e_i) - \frac{\partial k}{\partial x_i}(x, x)\right] \\
&= \frac{\partial^2 k}{\partial x_i\, \partial x_i'}(x, x'). \qquad \square
\end{aligned}$$

Proof (of Corollary 2.8 - Distribution of the derivative of a GP). From (0.1), we have

$$f\left(\begin{bmatrix} x \\ x' \end{bmatrix}\right) \,\Bigg|\, X, f(X) \sim \mathcal{N}\left( k\left(\begin{bmatrix} x \\ x' \end{bmatrix}, X\right) k(X, X)^{-1} f(X),\;\; k\left(\begin{bmatrix} x \\ x' \end{bmatrix}, \begin{bmatrix} x \\ x' \end{bmatrix}\right) - k\left(\begin{bmatrix} x \\ x' \end{bmatrix}, X\right) k(X, X)^{-1} k\left(X, \begin{bmatrix} x \\ x' \end{bmatrix}\right) \right),$$

from which we can extract the kernel $\bar{k}$ and mean $\bar{m}$ of $f(x) \mid X, f(X)$:

$$\bar{k}(x, x') = k(x, x') - k(x, X)\, k(X, X)^{-1} k(X, x') \tag{A.1}$$

$$\bar{m}(x) = k(x, X)\, k(X, X)^{-1} f(X) \tag{A.2}$$

Now we can make explicit the mean and the kernel of $\frac{\partial f}{\partial x_i}(x) \mid X, f(X)$ by taking derivatives:

$$\frac{\partial \bar{m}}{\partial x_i}(x) = \frac{\partial k}{\partial x_i}(x, X)\, k(X, X)^{-1} f(X) \tag{A.3}$$

$$\frac{\partial^2 \bar{k}}{\partial x_i \partial y_i}(x, x') = \frac{\partial^2 k}{\partial x_i \partial y_i}(x, x') - \frac{\partial k}{\partial x_i}(x, X)\, k(X, X)^{-1} \frac{\partial k}{\partial y_i}(X, x') \tag{A.4}$$

$\square$

Proof (of Proposition 2.9 - Covariances in a GP).

$$\mathrm{cov}\left(\frac{\partial f}{\partial x_i}(x), f(x')\right) = \lim_{\varepsilon \to 0} \mathrm{cov}\left(\frac{f(x + \varepsilon e_i) - f(x)}{\varepsilon}, f(x')\right) = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left(k(x + \varepsilon e_i, x') - k(x, x')\right) = \frac{\partial k}{\partial x_i}(x, x'). \qquad \square$$


Machine Learning and Stochastic Control in Algorithmic Trading

TEAM LATENT SURFERS

FELIPE NASCIMENTO, IMPA
LUCAS FARIAS, EMAp/FGV
HONGXUAN YAN, University of Sydney
RAUL RIVA, EPGE/FGV

Supervisor: SEBASTIAN JAIMUNGAL, University of Toronto

EMAp, Fundacao Getulio Vargas, Rio de Janeiro, Brazil


Contents

1 Introduction
2 Trader's Optimal Control Problem
3 Hawkes Processes
3.1 Definitions
3.2 Estimation
4 Hidden Markov Model
4.1 General Framework
4.2 Estimated Transition and Emission Probabilities
4.3 Computing Intensities
5 Revisiting The Control Problem
5.1 Analytical Solution
5.2 Filtering
6 Role For Future Research
7 Conclusion


Abstract

This project investigates how to generate profits from short-term prediction of asset prices (short-term alpha) by maximizing a performance measure. This leads to a statistical arbitrage problem that is solved by convex analysis methods. The trader controls his trading intensity and uses price information to decide how to trade. We investigate two stochastic market models: a multivariate Hawkes process and a Hidden Markov Model (HMM). The models are applied to and estimated from equity and FX market data. The maximum likelihood (ML) approach is adopted to estimate the unknown parameters of the multivariate Hawkes process, while the expectation-maximization (EM) algorithm, combined with the forward-backward message-passing algorithm, is adopted for the HMM. Under the HMM approach, forward recursion and an SDE are applied to filter the latent states. Under our calibration, cross-excitation is not necessarily symmetric between assets. Our results also show that limiting the state space for the observations in the HMM might hurt the estimation of midprice change intensities.


1 Introduction

Algorithmic and High-Frequency Trading covers a diverse set of topics, but these largely fall into three categories: optimal execution (selling or acquiring a large number of assets while optimizing some performance criterion), statistical arbitrage (making profits from historical patterns such as mean-reversion, i.e., short-term “alpha”) and market making (optimally providing liquidity to markets).

This report investigates how to estimate cross-exciting (multivariate Hawkes) processes and Hidden Markov Models using data on multiple assets (from both equity and FX markets), and how to integrate these techniques into an algorithmic trading setting.

Electronic trading in financial markets is now the norm, and machine learning techniques are increasingly useful for understanding how the data evolve. Augmented machine learning methods also allow one to estimate models efficiently and potentially provide more accurate predictions.

Recently, the mathematical finance literature has discussed several aspects of this topic. Cartea and Jaimungal (2016a) assume a general stochastic process for order flow and provide an explicit closed-form expression for the optimal execution strategy when permanent and temporary impact are linear in the overall market rate of trading and the investor's execution rate.

Casgrain and Jaimungal (2018c) investigate the optimal trading strategy for a single asset when there are latent alpha components in the asset price dynamics, and where the trader uses price information to learn about the latent factors. Prices can diffuse as well as jump. The trader's goal is to trade optimally subject to this model uncertainty and to end the trading horizon with zero inventory. By treating the trader's problem as a continuous-time control problem in which information is partially obscured, they obtain a closed-form strategy, up to the computation of an expectation specific to the trader's prior assumptions on the model dynamics. The optimal trading strategy they find can be computed with ease for a wide variety of models, and they demonstrate its performance by comparing it, in simulation, with approaches that do not make use of learning.

Our project assumes that assets' midprices are combinations of counting processes for up and down movements with corresponding stochastic intensities. We investigate two different models for the intensity. One is based on a multivariate Hawkes process with exponential decay; the other is based on a Hidden Markov Model, i.e., the intensities are functions of a latent state for each asset, modulated by a finite-state hidden Markov chain. Afterward, we propose a method for integrating these models into the trader's optimal control problem.


Hawkes Processes (HP) are a type of point process and, according to Laub et al. (2015a), point processes gained a significant amount of attention in the field of statistics during the 1950s and 1960s. First, Cox (1955) introduced the notion of a doubly stochastic Poisson process (often called the Cox process), and Bartlett (1963a), Bartlett (1963b) and Bartlett (1964) investigated statistical methods for point processes based on their power spectral densities. Lewis (1964) formulated a point process model (for computer failure patterns) which was a step in the direction of the HP. The activity culminated in the significant monograph of Cox and Lewis (1966) on time series analysis. Modern researchers appreciate this text as an important development of point process theory, since it canvassed the wide range of applications of point processes, as pointed out by Cox and Lewis (1966).

It was in this context that Hawkes (1971a) set out to bring Bartlett's spectral analysis approach to a new type of process: a self-exciting point process. The process Hawkes described was a one-dimensional point process (though originally specified for t ∈ R as opposed to t ∈ [0, ∞)). Many years later, Bremaud and Massoulie (1996) generalized the HP to its nonlinear form.

Bacry et al. (2015), Hasbrouck (1991) and Engle and Russell (1998) were the first to advocate that the modeling of financial data at the transaction level could be advantageously done within the framework of continuous-time point processes. Since then, point process applications in finance have been an ongoing, very active topic in the econometric literature. Bowsher (2007) recognized the flexibility and the advantages of using the class of multivariate counting processes that can be specified by a conditional intensity vector. More specifically, he introduced a bivariate Hawkes process in order to model the joint dynamics of trades and mid-price changes on the NYSE.

Another competing approach is the Hidden Markov Model, which is useful for capturing the stochastic nature of many economic and financial variables. The basic theory of HMMs was published in the series of classical papers by Baum and Petrie (1966), Baum and Eagon (1967), Baum and Sell (1968), Baum et al. (1970) and Baum (1972). Since then, many problems in finance have been solved with this technique, such as parameter estimation for the term structure of interest rates, option pricing and credit risk modeling, as gathered by Mamon and Elliott (2007). Besides these, Cartea and Jaimungal (2011) employ a hidden Markov model to examine how the intraday dynamics of the stock market have changed and how to use this information to develop trading strategies at high frequencies. More specifically, they show how to employ their model to profit from the bid-ask spread.

The organization of this paper is as follows. In the second section, we pose the trader's optimal control problem. In the two following sections, we discuss the use and estimation of Hawkes processes and Hidden Markov Models. Section 5 revisits the control problem. Sections 6 and 7 present directions for future research and a conclusion.

2 Trader’s Optimal Control Problem

Our project aims to aid the decision process of a trader. We work with high-frequency data, from both equity and FX markets, and estimate the parameters of a multivariate Hawkes model for asset midprice dynamics. We also deploy an HMM technique to account for hidden factors. Afterward, we present a way to relate these two techniques to the problem of a trader who aims to optimize a performance criterion by optimally controlling his trading intensity.

Let $S^i = (S^i_t)_{t \ge 0}$ denote the midprice process of asset $i$, and let $N^{i,\pm} = (N^{i,\pm}_t)_{t \ge 0}$ denote the counting processes for up and down movements of the midprice, with corresponding intensities $\lambda^{i,\pm} = (\lambda^{i,\pm}_t)_{t \ge 0}$, $i \in \mathcal{M} = \{1, \cdots, m\}$. Hence,

$$S^i_t = N^{i,+}_t - N^{i,-}_t. \tag{1}$$

Denote by $S_t$ the $m$-dimensional column vector of midprices at time $t$. This is observed by a trader who has an inventory $Q_0$ at $t = 0$. The trader wishes to maximize a performance measure by choosing the intensity of his market orders. Without any loss of generality, we assume that this trader wishes to sell assets.

The trader is not able to choose when to trade; he can only choose the intensity of such market orders. The motivation for this setup is that if market orders were predictable, they could be exploited by high-frequency traders.

For each asset $i$, the trader can choose a corresponding vector-valued intensity $(\nu^{i,+}_t, \nu^{i,-}_t)_{t \ge 0}$ for buy and sell orders, respectively. The controlled counting processes for these orders are denoted by $(M^{\nu,+}_t, M^{\nu,-}_t)$. He wishes to trade between $t = 0$ and $t = T$.

Under these definitions, the trader optimizes

$$H^\nu = \mathbb{E}\left[ -\int_0^T S_{u^-}^{\mathsf{T}} \, dM^{\nu,+}_u + \int_0^T S_{u^-}^{\mathsf{T}} \, dM^{\nu,-}_u - a \int_0^T \|\nu_u\|^2 \, du + S_T^{\mathsf{T}} Q^\nu_T - \alpha \|Q^\nu_T\|^2 \right],$$


where

$$Q^\nu_u = Q_0 + \int_0^u \nu^+_s \, ds - \int_0^u \nu^-_s \, ds$$

is the expected inventory and $\nu_u = \nu^+_u - \nu^-_u$ is an $m$-dimensional column vector of market order intensity differences.

The first two terms are instant cashflows, integrated over the whole trading horizon. The third term is a penalty on the trading speed, with constant $a > 0$. The fourth term is the terminal cash position of the trader. The last term is a penalty on the size of the final position. We do not impose that the trader hold a zero position at $t = T$; however, that can be achieved by letting $\alpha \to \infty$.

This can be viewed as a statistical arbitrage problem or an optimal execution problem, depending on the perspective one takes, because the agent controls the speed of trading.

It can be shown, by integrating by parts, that the performance criterion can be written as

$$H^\nu = S_0^{\mathsf{T}} Q_0 - \alpha \|Q_0\|^2 + \mathbb{E}\left[ \int_0^T (\lambda^+_t - \lambda^-_t)^{\mathsf{T}} Q^\nu_t \, dt - \int_0^T \left( a \|\nu_t\|^2 + 2\alpha\, \nu_t^{\mathsf{T}} Q^\nu_t \right) dt \right],$$

where $\lambda^+_t - \lambda^-_t$ is the vector of midprice counting process intensity differences.

This formulation is interesting because it shows that the performance measure is strictly concave over the set of admissible strategies. The direct implication is that, at the optimal strategy, the Gâteaux derivative should be zero in all directions $\omega$:

$$\langle \mathcal{D} H^\nu, \omega \rangle = \mathbb{E}\left[ \int_0^T \omega_t^{\mathsf{T}} \left( \int_t^T \mathbb{E}_t[\lambda^+_u - \lambda^-_u] \, du - 2a\,\nu_t - 2\alpha\, Q^\nu_t \right) dt \right] = 0.$$

The optimal intensity $\nu$ and the implied expected inventory $Q^\nu_t$ must satisfy the FBSDE and terminal condition:

$$(-2a)\, d\nu_t = \mathbb{E}_t[\lambda^+_t - \lambda^-_t]\, dt + dM_t \tag{2}$$

$$\nu_T = -\frac{\alpha}{a}\, Q^\nu_T \tag{3}$$


The key to solving this problem is computing the expectation in Equation (2). The next two sections explore two competing approaches to modeling the intensities of the midprice counting processes.
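As a hedged illustration: when the expected intensity difference is a deterministic function $f(u)$, the martingale term drops out and the first-order condition reduces to the feedback rule $\nu_t = \left(\int_t^T f(u)\,du - 2\alpha Q_t\right) / (2a)$, which can be integrated forward with Euler steps. The discretization and parameter values below are our own, not from the report.

```python
import numpy as np

def optimal_schedule(f, Q0, a, alpha, T=1.0, n=1000):
    """Integrate Q' = nu with the feedback rule
    nu_t = (int_t^T f(u) du - 2*alpha*Q_t) / (2*a)  (deterministic-intensity case)."""
    t = np.linspace(0.0, T, n + 1)
    dt = t[1] - t[0]
    F = dt * np.cumsum(f(t)[::-1])[::-1]   # tail integral: F[k] ~ int_{t_k}^T f(u) du
    Q = np.empty(n + 1)
    nu = np.empty(n + 1)
    Q[0] = Q0
    for k in range(n):
        nu[k] = (F[k] - 2 * alpha * Q[k]) / (2 * a)
        Q[k + 1] = Q[k] + nu[k] * dt       # Euler step on the expected inventory
    nu[n] = (F[n] - 2 * alpha * Q[n]) / (2 * a)
    return t, nu, Q

# With zero intensity difference, the trader simply liquidates the inventory.
t, nu, Q = optimal_schedule(lambda t: np.zeros_like(t), Q0=1.0, a=0.1, alpha=5.0)
```

Note that the rule reproduces the terminal condition (3) automatically: at $t = T$ the tail integral vanishes and $\nu_T = -(\alpha/a) Q_T$.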

3 Hawkes Processes

3.1 Definitions

Hawkes (1971a) proposed the so-called Hawkes process to describe a one-dimensional point process.

Definition 3.1 (Hawkes process). Consider a counting process $(N_t)_{t \ge 0}$ with associated history $(\mathcal{F}_t)_{t \ge 0}$, such that

$$P(N(t+h) - N(t) = k \mid \mathcal{F}_t) = \begin{cases} \lambda_t h + o(h), & k = 1 \\ o(h), & k > 1 \\ 1 - \lambda_t h + o(h), & k = 0 \end{cases}$$

First, for the univariate case, we model the intensity $\lambda_t$ of the counting process by the particular form of Hawkes process that satisfies the following SDE (Carlsson et al., 2007a):

$$d\lambda_t = \kappa_t(\theta_t - \lambda_t)\, dt + \eta_t\, dN_t.$$

For constant parameters $\kappa$, $\theta$ and $\eta$, the solution for $\lambda_t$ can be written

$$\lambda_t = e^{-\kappa t}(\lambda_0 - \theta) + \theta + \int_0^t e^{-\kappa(t-u)}\, \eta \, dN_u.$$
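A univariate path with these exponential-decay dynamics can be simulated by Ogata's thinning algorithm, sketched below with illustrative parameters (this code is our own, not from the report). Between events the intensity only decays when $\lambda_0 \ge \theta$, so the current intensity is a valid thinning bound.

```python
import numpy as np

def simulate_hawkes(theta, kappa, eta, T, lam0=None, seed=0):
    """Ogata thinning for an exponential-kernel Hawkes process on [0, T]."""
    rng = np.random.default_rng(seed)
    lam0 = theta if lam0 is None else lam0
    events = []
    t = 0.0

    def lam(s):
        # lambda_s = theta + (lam0 - theta) e^{-kappa s} + eta * sum_i e^{-kappa (s - t_i)}
        excess = sum(np.exp(-kappa * (s - ti)) for ti in events)
        return theta + (lam0 - theta) * np.exp(-kappa * s) + eta * excess

    while t < T:
        M = lam(t) + 1e-12                  # bound: intensity decays until the next event
        t += rng.exponential(1.0 / M)       # propose next candidate time
        if t >= T:
            break
        if rng.uniform() <= lam(t) / M:     # accept with probability lambda_t / M
            events.append(t)
    return np.array(events)

# Branching ratio eta/kappa = 0.5 < 1, so the process is stationary.
events = simulate_hawkes(theta=1.0, kappa=2.0, eta=1.0, T=50.0)
```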

Following a similar idea, we have the multivariate Hawkes process:

Definition 3.2 (multivariate Hawkes process). The vector of intensities and counting processes

$$\lambda_t = \left( (\lambda^{1,+}_t, \lambda^{1,-}_t), (\lambda^{2,+}_t, \lambda^{2,-}_t), \cdots, (\lambda^{m,+}_t, \lambda^{m,-}_t) \right)$$

and

$$N_t = \left( (N^{1,+}_t, N^{1,-}_t), (N^{2,+}_t, N^{2,-}_t), \cdots, (N^{m,+}_t, N^{m,-}_t) \right)$$

satisfy the SDE

$$\lambda_t = \lambda_0 + \int_0^t \kappa_u(\theta_u - \lambda_u)\, du + \int_0^t \eta_u \, dN_u, \tag{4}$$

where $(\kappa_t)_{t \ge 0}$ ($2m \times 2m$), $(\theta_t)_{t \ge 0}$ ($2m \times 1$) and $(\eta_t)_{t \ge 0}$ ($2m \times 2m$) are deterministic functions of time.

Daley and Vere-Jones (2007) provided the Hawkes process likelihood.


Theorem 3.3 (likelihood for Hawkes processes). Let $N_t$ be a regular point process on $[0, T]$ for some finite positive $T$, and let $t_1, \cdots, t_k$ denote a realisation of $N_t$ over $[0, T]$. Then the likelihood $L$ of $N_t$ is expressible in the form

$$L = \left[ \prod_{i=1}^{k} \lambda_{t_i} \right] \exp\left( -\int_0^T \lambda_u \, du \right).$$

This is an interesting way to model price jumps in two different senses. First, the Hawkes model has a self-excitation feature. For example, when $N^{i,+}_t$ jumps, it increases its own intensity, meaning that more jumps become more likely. This is different from a Poisson process, whose intensity is constant through time. Hawkes processes never “forget” that they have jumped.

The second sense in which multivariate Hawkes processes are interesting for modeling price jumps is that they may share a cross-excitation feature, provided that the matrix $\eta_u$ is not diagonal. This captures a kind of cointegration between assets, though not in the classical time-series sense. When one asset moves, it modifies not only the intensity of its own counting process but also the intensities of the counting processes of the other assets, and this applies both to go-up processes like $N^{i,+}_t$ and go-down processes like $N^{i,-}_t$.
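For the univariate exponential-decay case with $\lambda_0 = \theta$, the log of this likelihood has a closed-form compensator, $\int_0^T \lambda_u\, du = \theta T + (\eta/\kappa) \sum_i (1 - e^{-\kappa(T - t_i)})$, which the following sketch (our own illustration) evaluates. The resulting function can be negated and passed to a numerical optimizer such as `scipy.optimize.minimize` for maximum likelihood estimation.

```python
import numpy as np

def hawkes_loglik(params, events, T):
    """Log-likelihood of a univariate exponential Hawkes process (lambda_0 = theta)."""
    theta, kappa, eta = params
    events = np.asarray(events, dtype=float)
    loglam = 0.0
    for i, ti in enumerate(events):
        # Intensity at event ti, summing excitation from all earlier events.
        lam = theta + eta * np.sum(np.exp(-kappa * (ti - events[:i])))
        loglam += np.log(lam)
    # Closed-form compensator integral for the exponential kernel.
    compensator = theta * T + (eta / kappa) * np.sum(1.0 - np.exp(-kappa * (T - events)))
    return loglam - compensator

ll = hawkes_loglik((1.0, 2.0, 1.0), events=[0.5, 0.9, 2.0], T=3.0)
```

With $\eta = 0$ the formula collapses to the Poisson log-likelihood $k \log\theta - \theta T$, a useful sanity check.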

3.2 Estimation

One can understand the parameters of this model in the following way. The parameter $\lambda_0$ is a baseline quantity for the intensity, from which it will evolve. Next, $\theta_u$ is the long-run value that the intensity reverts to; we take it as a constant function of time. On the other hand, $\kappa_u$ is related to the half-life of the process. We drop the time index $u$, as we will also assume that $\kappa$ is a constant matrix. Moreover, we will assume that $\kappa$ is diagonal, so the intensity decay of a given point process is affected only by itself.

The most interesting parameter is probably $\eta_u$. We make no assumptions regarding symmetry, but we will assume that this matrix is again constant through time. It generates the cross-excitation feature of Hawkes processes: when an event occurs and a process jumps, the diagonal elements of $\eta$ measure self-excitation while the off-diagonal elements measure cross-excitation.

Parameter estimation was performed by maximum likelihood. From the data, one can obtain the series of up and down movements of the asset midprices and proceed to likelihood optimization.
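As a concrete sketch of what this optimization evaluates, the snippet below computes the log-likelihood of a univariate Hawkes process with exponential kernel, using the standard O(n) recursion for the excitation sum. It assumes λ_0 = θ (as in the estimation below) and scalar parameters, whereas our estimation uses the multivariate analogue; the function name and interface are illustrative, not the team's actual code.

```python
import math

def hawkes_loglik(times, T, theta, kappa, eta):
    """Log-likelihood of a univariate Hawkes process observed on [0, T] with
    intensity lambda_t = theta + sum_{t_i < t} eta * exp(-kappa * (t - t_i))."""
    ll, A, prev = 0.0, 0.0, None
    for t in times:
        if prev is not None:
            # recursion: A_i = exp(-kappa (t_i - t_{i-1})) (1 + A_{i-1})
            A = math.exp(-kappa * (t - prev)) * (1.0 + A)
        ll += math.log(theta + eta * A)  # sum of log-intensities at event times
        prev = t
    # compensator: integral of lambda_s over [0, T]
    comp = theta * T + (eta / kappa) * sum(1.0 - math.exp(-kappa * (T - ti)) for ti in times)
    return ll - comp
```

Maximizing this function numerically over (θ, κ, η) yields the estimates; when η = 0 it reduces to the Poisson log-likelihood n log θ − θT, a useful sanity check.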

In order to estimate such a model and study its implications, we use two different datasets. One of them is the main dataset for this project; the second was used as an exercise, to check whether the same relations would be observed for the Hawkes processes.

The first dataset is high-frequency data for one stock and one related ETF: INTC (Intel Corporation) and SMH (VanEck Vectors Semiconductor ETF), both traded on NASDAQ. We used data for the 21 trading days of March 2018. Therefore, in the notation above, we have m = 2, resulting in a total of 24 parameters.¹

This is a vast amount of data, as the timestamps are in nanoseconds. Due to computational limitations, we used 12 minutes of data every day, from 10:00am to 10:12am.² Figures 1 and 2 portray one day of data.

In fact, as a first try, we used only one hour of data (from 10am to 11am) for only one day (March 1st, 2018), to get an idea of what to look for. One might expect special cross-excitation between these two assets, since the ETF tracks an index of stocks related to the semiconductor business.

Figure 3 illustrates³ the counting processes for the midprices we are modeling.

It is possible to see from the plot that there are far more jumps for SMH than for INTC, at least for this window of data.

We can also see from the plot that the go-up and go-down processes of the same asset follow a close path. This indicates that there may be cross-excitation not only among different assets but also between the ups and downs of the same asset. Our methodology allows for this type of relationship.

¹ We assumed that λ_0 = θ, so the baseline quantity coincides with the long-run mean. This reduces the number of parameters and is, after all, a sensible assumption.

² Trading starts at 9:30. ³ Best seen in color.

Figure 1: SMH high-frequency midprice data for March 1st, 2018

Figure 2: INTC high-frequency midprice data for March 1st, 2018

Data suggests that these processes share some driving forces. The results of this first try, with only one day of data but more than 3 × 10⁵ data points, show some interesting features in Tables 1 and 2. The long-run mean θ is higher for the SMH processes than for the INTC ones, although the values are quite similar within a given asset. The parameter κ is higher for INTC, meaning that perturbations to its processes' intensities fade away faster than for SMH, at least in this sample. Indeed, the somewhat large values of κ indicate that perturbations are in general dissipated very fast.

Figure 3: Counting processes for both INTC and SMH

Regarding the cross-excitation parameters in η, we see that the values on the main diagonal are bigger than the other values. That is somewhat expected, due to the fact that they control self-excitation directly. It seems sensible that movements of a given asset's midprice induce more volatility on its own midprice than on the midprices of other assets.

Another interesting aspect of these parameters is that INTC movements tend to generate much excitation in the intensity of SMH movements of the same sign, but not the other way around. Although not trivial, this effect could be anticipated once one realizes that INTC is a big-company stock whose midprice is probably correlated with the semiconductor business, something that directly affects the stocks in the SMH ETF.

At a second stage, we estimated the model with all 21 trading days of March 2018. The results in Tables 3 and 4, although quantitatively a bit different, share the main features observed in the first estimation.

The values of θ are similar to the ones found in the first estimation. In the same fashion, the new κ values remain similar to what we had found previously; it is worth noting that they are a bit lower now, but still of the same order of magnitude.

Process          θ       κ
N_t^{+,SMH}      2.11    1808.07
N_t^{-,SMH}      2.09    1526.74
N_t^{+,INTC}     0.47    1997.92
N_t^{-,INTC}     0.46    2089.25

Table 1: Optimal θ and κ for one day of INTC and SMH data

Process          N_t^{+,SMH}   N_t^{-,SMH}   N_t^{+,INTC}   N_t^{-,INTC}
N_t^{+,SMH}      527.23        127.16        403.80         11.55
N_t^{-,SMH}      126.80        423.59        5.01           361.84
N_t^{+,INTC}     62.44         0.58          662.23         187.39
N_t^{-,INTC}     1.84          60.58         175.45         753.64

Table 2: Optimal η for one day of INTC and SMH data

Process          θ       κ
N_t^{+,SMH}      1.60    1457.18
N_t^{-,SMH}      1.58    1432.75
N_t^{+,INTC}     0.46    1738.46
N_t^{-,INTC}     0.46    1772.56

Table 3: Optimal θ and κ for all days (12 minutes each day) of INTC and SMH data

Process          N_t^{+,SMH}   N_t^{-,SMH}   N_t^{+,INTC}   N_t^{-,INTC}
N_t^{+,SMH}      371.33        102.99        272.40         4.91
N_t^{-,SMH}      107.23        362.58        7.50           284.21
N_t^{+,INTC}     52.55         1.05          550.47         217.20
N_t^{-,INTC}     1.35          54.30         207.29         568.14

Table 4: Optimal η for all days (12 minutes each day) of INTC and SMH data

The newly estimated matrix η displays, as before, values on the main diagonal that are bigger than the other values. It is also possible to notice, as before, the same effect of INTC jumps on SMH but not the other way around.

The second dataset used to estimate this model and compare results is EURUSD and GBPUSD FX data from February 2017. This is also high-frequency data. An interesting feature of this dataset is that it comes from a 24-hour market. We used 5 days of data (February 5th to 9th) and 5 minutes per day, between 10:00am and 10:05am. Although this might seem a small amount of data at first sight, it represents roughly 5 × 10⁵ data points.

Figure 4: GBPUSD exchange rate on February 6th, 2017

Figures 4 and 5 portray one day of FX data. Estimation results are shown in the next tables.

Process          θ       κ
N_t^{+,GBP}      2.43    238.58
N_t^{-,GBP}      2.36    261.73
N_t^{+,EUR}      2.32    176.94
N_t^{-,EUR}      2.23    180.20

Table 5: Optimal θ and κ for FX data (5 days, 5 minutes per day)

The results show that the long-run average is basically the same across both assets and both processes. Compared to the equity case, the κ values are one order of magnitude smaller, indicating that these processes have intensities that decay more slowly than the ones in the INTC and SMH data.

Figure 5: EURUSD exchange rate on February 6th, 2017

Process          N_t^{+,GBP}   N_t^{-,GBP}   N_t^{+,EUR}   N_t^{-,EUR}
N_t^{+,GBP}      104.24        47.59         15.91         2.18
N_t^{-,GBP}      50.60         115.32        2.00          17.81
N_t^{+,EUR}      13.64         1.11          71.19         41.51
N_t^{-,EUR}      0.74          14.89         43.79         70.34

Table 6: Optimal η for FX data (5 days, 5 minutes per day)

Regarding the η matrix, the main diagonal entries are again bigger than the other ones, as before. However, in this case, cross-excitation is not as asymmetric as in the previous case: there seem to be no major one-sided drivers of cross-excitation.

4 Hidden Markov Model

4.1 General Framework

The Hidden Markov Model (HMM) is one of the most popular approaches for temporal event modeling, as highlighted by Lin et al. (2016). It uses a sequence of latent states with the Markov property to model the dynamics of temporal events. Each event in the HMM is associated with a latent state that determines the conditional probability of observable events.

The state of an event is independent of all but its most recent predecessor's state (i.e., Markovianity), following a transition probability distribution. The HMM consists of four components:

• an initial state probability distribution

• a finite latent state space

• a state transition matrix, governing the sequence of latent states

• an emission probability distribution for each latent state.

We only observe the outcomes governed by the conditional distributions generated by the latent states. The latent states are interpreted as different regimes, in which the structure of the market can potentially be different. A possible motivation for this approach is the fact that certain pieces of news can introduce, for instance, extra volatility during the trading period. However, one cannot observe how these regimes evolve; one can only observe, in this case, the asset midprice changes ∆S_t^i.

As a simplification of this setup, we make the important hypothesis that midprices can move at most one tick per data period. This implies that, for any asset, there are only three possible observable outcomes: the price can move up by one tick, move down by one tick, or remain the same.

According to the midprice definition in Equation 1, the changes in midprice are given by

∆S_t^i = ∆N_t^{i,+} − ∆N_t^{i,−}

and, by Definition 3.1,

P(∆N_t^{i,±} = 1) ≈ λ_t^{i,±} ∆t.

Within this approach, the intensity is assumed to be driven by an HMM with a hidden Markov chain Z = (Z_t)_{t≥0} with generator matrix G and Z_t ∈ R = {1, 2, ..., K}, such that λ_t^{i,±} = Λ^{i,±}(t, Z_t), where the Λ^{i,±}(t, z) are deterministic functions of time and regime for each asset i. The generator matrix G ∈ R^{K×K} has off-diagonal entries G_{ij} ≥ 0 for i ≠ j and diagonal entries G_{ii} = −∑_{j≠i} G_{ij}. The matrix C is defined so that P(Z_{t+∆t} = j | Z_t = i) = (e^{G∆t})_{i,j} = C_{i,j}. In a discrete-time approximation, this can be viewed as the graphical model in Figure 6, where the emission probabilities ψ are, in our case, regarded as conditional probabilities over the observable data.

Hence, given the emission probabilities, we can estimate the intensities λ⁺ and λ⁻. In this setup, these intensities are assumed constant within the same regime.

Figure 6: Directed graph representation of the price movement intensity.

This Hidden Markov Model assumes that the observation at time t was generated by some process whose state Z_t is hidden from the observer. Second, it assumes that the state of this hidden process satisfies the Markov property: given the value of Z_{t−1}, the current state Z_t is independent of all the states prior to t. The outputs also satisfy a Markov property with respect to the states: given Z_t, ∆S_t is independent of the states and observations at all other time indices. In our case, we further assume that there are only three types of observations for a given asset i, i.e., ∆S_t^i ∈ {−1, 0, 1}. These Markov properties mean that the joint distribution of a sequence of states and observations can be factored in the following way:

P(Z_{1:T}, ∆S_{1:T}) = P(Z_1) P(∆S_1 | Z_1) ∏_{t=2}^{T} P(Z_t | Z_{t−1}) P(∆S_t | Z_t).
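This factorization can be transcribed directly into code (the names pi, C and psi are generic placeholders for the initial, transition and emission probabilities):

```python
import math

def joint_log_prob(states, obs, pi, C, psi):
    """log P(Z_{1:T}, dS_{1:T}) = log P(Z_1) + log P(dS_1|Z_1)
    + sum_t [log P(Z_t|Z_{t-1}) + log P(dS_t|Z_t)].
    pi[i]: initial state probs; C[i][j]: P(next = j | current = i);
    psi[d][i]: P(observation d | state i)."""
    lp = math.log(pi[states[0]]) + math.log(psi[obs[0]][states[0]])
    for t in range(1, len(states)):
        lp += math.log(C[states[t - 1]][states[t]]) + math.log(psi[obs[t]][states[t]])
    return lp
```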

In order to estimate the unknown parameters θ, we adopt expectation-maximization (EM) combined with the forward-backward message-passing algorithm (Ghahramani, 2001). Maximizing the log-likelihood in Equation 5 directly is often difficult because the log of the sum can potentially couple all of the parameters of the model. (We have dropped the superscript (t) in Equation 5 by evaluating the log-likelihood for a single observation, again for notational convenience.)

L(θ) = log P(∆S | θ) = log ∑_Z P(∆S, Z | θ)    (5)

We can simplify the problem of maximizing L with respect to θ by making use of the following insight. Any distribution Q(Z) over the hidden variables defines a lower bound LB on L:

log ∑_Z P(∆S, Z | θ) = log ∑_Z Q(Z) [P(∆S, Z | θ) / Q(Z)]
                     ≥ ∑_Z Q(Z) log [P(∆S, Z | θ) / Q(Z)]
                     = ∑_Z Q(Z) log P(∆S, Z | θ) − ∑_Z Q(Z) log Q(Z)
                     = LB(Q, θ)

The Expectation-Maximization (EM) algorithm alternates between maximizing LB with respect to Q and θ, respectively, holding the other fixed, starting from some initial parameters θ_0:

E step: Q_{k+1} ← argmax_Q LB(Q, θ_k)

M step: θ_{k+1} ← argmax_θ LB(Q_{k+1}, θ)

It is easy to show that the maximum in the E step is obtained by setting Q_{k+1}(Z) = P(Z | ∆S, θ_k), at which point the bound becomes an equality: LB(Q_{k+1}, θ_k) = L(θ_k). The maximum in the M step is obtained by maximizing

∑_Z Q(Z) log P(∆S, Z | θ) = ∑_Z P(Z | ∆S, θ_k) log P(∆S, Z | θ).

In our case, the log probability of the hidden variables and observations is

log P(Z_{1:T}, ∆S_{1:T}) = log P(Z_1) + ∑_{t=1}^{T} log P(∆S_t | Z_t) + ∑_{t=2}^{T} log P(Z_t | Z_{t−1})    (6)

Let us represent the K-valued discrete state Z_t using K-dimensional unit column vectors. Each of the terms in Equation 6 can then be decomposed into summations over Z. Hence, the transition probability is

P(Z_t | Z_{t−1}) = ∏_{i=1}^{K} ∏_{j=1}^{K} (C_{ij})^{Z_{t−1;i} Z_{t;j}}

where C_{ij} is the probability of transitioning from state i to state j, arranged in a K × K matrix C. Then

log P(Z_t | Z_{t−1}) = ∑_{i=1}^{K} ∑_{j=1}^{K} Z_{t−1;i} Z_{t;j} log C_{ij} = Z_{t−1}^⊤ (log C) Z_t.


If the initial state probabilities are arranged in a vector π, then

log P(Z_1) = Z_1^⊤ log π.

Since ∆S_t is a discrete variable which can take on D = 3 values ({−1, 0, 1}), we again represent it using D-dimensional unit vectors and obtain

log P(∆S_t | Z_t) = ∆S_t^⊤ (log ψ) Z_t

where ψ is a D × K emission probability matrix. The parameter set for the HMM is θ = {C, π, ψ}. Since the state variables are hidden, we cannot compute Equation 6 directly. The EM algorithm, which in the case of HMMs is known as the Baum-Welch algorithm, allows us to circumvent this problem by computing the expectation of Equation 6 under the posterior distribution of the hidden states given the observations. We denote the expected value of some quantity f(Z) with respect to the posterior distribution of Z by E f(Z):

E f(Z) = ∫_Z f(Z) P(Z | ∆S, θ) dZ

The M step in this setting is straightforward: we take derivatives of Equation 6 with respect to the parameters, set them to zero, and solve subject to the sum-to-one constraints that ensure valid transition, emission and initial state probabilities. For discrete ∆S_t coded in the same way as Z_t, the M step is:

C_{ij} ← ∑_{t=2}^{T} E(1_{Z_t=i, Z_{t−1}=j}) / ∑_{t=2}^{T} E(1_{Z_{t−1}=j})

π_i ← E(1_{Z_1=i})

ψ_{d;i} ← ∑_{t=1}^{T} ∆S_{t;d} E(1_{Z_t=i}) / ∑_{t=1}^{T} E(1_{Z_t=i})

The necessary expectations are computed using the forward-backward algorithm, an instance of belief propagation applied to the Bayesian network corresponding to a hidden Markov model. The forward pass recursively computes α_t, defined as the joint probability of Z_t and the sequence of observations ∆S_1 to ∆S_t:

α_t = P(Z_t, ∆S_{1:t})
    = [ ∑_{Z_{t−1}} P(Z_{t−1}, ∆S_{1:t−1}) P(Z_t | Z_{t−1}) ] P(∆S_t | Z_t)
    = [ ∑_{Z_{t−1}} α_{t−1} P(Z_t | Z_{t−1}) ] P(∆S_t | Z_t)


The backward pass computes the conditional probability of the observations ∆S_{t+1} to ∆S_T given Z_t:

β_t = P(∆S_{t+1:T} | Z_t)
    = ∑_{Z_{t+1}} P(∆S_{t+2:T} | Z_{t+1}) P(Z_{t+1} | Z_t) P(∆S_{t+1} | Z_{t+1})
    = ∑_{Z_{t+1}} β_{t+1} P(Z_{t+1} | Z_t) P(∆S_{t+1} | Z_{t+1})

From these it is easy to compute the expectations needed for EM:

E(1_{Z_t=i}) = α_{t;i} β_{t;i} / ∑_j α_{t;j} β_{t;j}

E(1_{Z_t=i, Z_{t−1}=j}) = α_{t−1;j} Φ_{ij} P(∆S_t | Z_t = i) β_{t;i} / ∑_{k,l} α_{t−1;k} Φ_{lk} P(∆S_t | Z_t = l) β_{t;l}

where Φ_{ij} = P(Z_t = i | Z_{t−1} = j) denotes the transition probability.
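The α and β recursions above can be sketched directly in a few lines (an unscaled version with illustrative names; for long sequences one would rescale or work in log space to avoid underflow). Here psi[d][i] denotes P(observation d | state i) and C[i][j] denotes P(next = j | current = i):

```python
def forward_backward(obs, pi, C, psi):
    """Return (alpha, beta, gamma) for a discrete HMM, where
    gamma[t][i] = E[1_{Z_t = i}] = P(Z_t = i | observations)."""
    K, T = len(pi), len(obs)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[0.0] * K for _ in range(T)]
    for i in range(K):                       # forward initialization
        alpha[0][i] = pi[i] * psi[obs[0]][i]
    for t in range(1, T):                    # forward pass
        for j in range(K):
            alpha[t][j] = psi[obs[t]][j] * sum(alpha[t-1][i] * C[i][j] for i in range(K))
    for i in range(K):                       # backward initialization
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        for i in range(K):
            beta[t][i] = sum(C[i][j] * psi[obs[t+1]][j] * beta[t+1][j] for j in range(K))
    gamma = []
    for t in range(T):                       # posterior state marginals
        z = sum(alpha[t][i] * beta[t][i] for i in range(K))
        gamma.append([alpha[t][i] * beta[t][i] / z for i in range(K)])
    return alpha, beta, gamma
```

A useful invariant for testing: ∑_i α_{t;i} β_{t;i} equals the likelihood of the observations for every t.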

4.2 Estimated Transition and Emission Probabilities

We estimated transition and emission probabilities for both two and three latent regimes. In each case, we used the same datasets described in the previous section. The only modification is that, for INTC and SMH, we converted the timestamps to seconds and used the average midprice within a given second as the current midprice. For the FX data, we did the same, but with 100-millisecond bins.

This section analyzes the results on transition and emission probabilities, starting with the SMH and INTC data. Since we have two assets and three possible outcomes for each asset at each point in time, there are nine observable states.

            Regime 1   Regime 2
Regime 1    0.87       0.13
Regime 2    0.21       0.79

Table 7: Transition probabilities for 2 regimes, using INTC and SMH data

From Table 7, one can see that the two-regime setup generates a transition probability matrix implying that both regimes are persistent. For example, conditional on being in regime 1, there is an 87% chance of staying in the same regime in the next period.
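One implication worth making explicit: persistence determines the long-run fraction of time spent in each regime, i.e., the stationary distribution of the estimated chain. A quick sketch using the Table 7 values (the helper name is ours, not from the report):

```python
def stationary_2state(p11, p22):
    """Stationary distribution (pi1, pi2) of a 2-state Markov chain with
    staying probabilities p11 and p22: pi solves pi = pi @ P."""
    p12, p21 = 1.0 - p11, 1.0 - p22
    pi1 = p21 / (p12 + p21)
    return pi1, 1.0 - pi1

pi1, pi2 = stationary_2state(0.87, 0.79)  # Table 7 estimates
```

With these estimates, the chain spends roughly 62% of the time in regime 1 and 38% in regime 2.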


Regime 1    INTC, -   INTC, 0   INTC, +
SMH, -      0.16      0.14      0.02
SMH, 0      0.09      0.20      0.09
SMH, +      0.02      0.12      0.16

Table 8: Emission probabilities for regime 1 (2-regime setup), using INTC and SMH data

Regime 2    INTC, -   INTC, 0   INTC, +
SMH, -      0.00      0.20      0.00
SMH, 0      0.00      0.58      0.00
SMH, +      0.00      0.22      0.00

Table 9: Emission probabilities for regime 2 (2-regime setup), using INTC and SMH data

Tables 8 and 9 portray interesting results. Regime 1 seems to be much more volatile than regime 2. In regime 2, INTC never moves; in fact, there is more than a 50% chance that nothing happens at all within this regime, whereas the corresponding figure for regime 1 is around 20%. In regime 2, these two assets display very different volatilities.

Turning to the 3-regime case, similar results are obtained, but with some important differences. The transition probabilities in Table 10 describe the evolution of a Markov chain that is also persistent, as before: the elements on the main diagonal are bigger than the off-diagonal elements. Regimes 1 and 2 seem to be more connected to one another than to regime 3, which is highly persistent.

            Regime 1   Regime 2   Regime 3
Regime 1    0.55       0.27       0.18
Regime 2    0.52       0.48       0.00
Regime 3    0.00       0.18       0.82

Table 10: Transition probabilities for 3 regimes, using INTC and SMH data

As a matter of fact, the emission probabilities of regime 3 are very similar to those of regime 2 in the 2-regime setup: it is more likely that nothing happens than that something does, and under this regime INTC never moves. In regime 2 of the 3-regime setup, there is zero probability of both assets keeping their midprices: something happens with probability one.


Regime 1    INTC, -   INTC, 0   INTC, +
SMH, -      0.00      0.15      0.03
SMH, 0      0.07      0.40      0.09
SMH, +      0.01      0.19      0.06

Table 11: Emission probabilities for regime 1 (3-regime setup), using INTC and SMH data

Regime 2    INTC, -   INTC, 0   INTC, +
SMH, -      0.32      0.13      0.01
SMH, 0      0.11      0.00      0.09
SMH, +      0.02      0.06      0.26

Table 12: Emission probabilities for regime 2 (3-regime setup), using INTC and SMH data

Regime 3    INTC, -   INTC, 0   INTC, +
SMH, -      0.00      0.20      0.00
SMH, 0      0.00      0.58      0.00
SMH, +      0.00      0.22      0.00

Table 13: Emission probabilities for regime 3, using INTC and SMH data

Comparing this approach to the Hawkes calibration, a parallel result emerges. In the Hawkes model, INTC could cross-excite SMH, but the reverse effect was weaker, so SMH could be seen as the more responsive process. In some sense, this also appears here, in the regimes where INTC never moves, although in an even starker contrast.

Now we analyze the HMM estimates for the FX data. For this dataset, one period corresponds to 100 milliseconds. Table 14 shows that the states are also persistent when compared to the equity data; actually, regime 2 is even more persistent. The next tables show the emission probabilities for this case.

Transition Probabilities (FX)   Regime 1   Regime 2
Regime 1                        0.86       0.14
Regime 2                        0.12       0.88

Table 14: Transition probabilities for 2 regimes, using FX data

Once more, these states display different behavior regarding emission probabilities. The values are somewhat well balanced in regime 1, but not so much in regime 2. In this second regime, the probability of two non-zero movements happening is zero; actually, the probability of both assets keeping their prices is higher than 70%. The remaining probabilities, however, are well balanced.

Regime 1    EUR, -   EUR, 0   EUR, +
GBP, -      0.19     0.10     0.10
GBP, 0      0.09     0.04     0.10
GBP, +      0.10     0.09     0.19

Table 15: Emission probabilities for regime 1 (2-regime setup), using FX data

Regime 2    EUR, -   EUR, 0   EUR, +
GBP, -      0.00     0.05     0.00
GBP, 0      0.08     0.73     0.09
GBP, +      0.00     0.05     0.00

Table 16: Emission probabilities for regime 2 (2-regime setup), using FX data

In the 3-regime setup, regimes are still very persistent: the entries on the main diagonal of the transition matrix are bigger than the off-diagonal values. There is again one regime of lower volatility in asset prices, now regime 1. Its emission probabilities look similar to those of the second regime in the 2-regime setup; the probability that nothing moves is even greater, above 80%.

Regime 1    EUR, -   EUR, 0   EUR, +
GBP, -      0.00     0.02     0.00
GBP, 0      0.06     0.81     0.08
GBP, +      0.00     0.03     0.00

Table 17: Emission probabilities for regime 1 (3-regime setup), using FX data

Regime 2    EUR, -   EUR, 0   EUR, +
GBP, -      0.28     0.02     0.14
GBP, 0      0.05     0.00     0.08
GBP, +      0.13     0.02     0.28

Table 18: Emission probabilities for regime 2 (3-regime setup), using FX data

On the other hand, regime 2 can be seen as a high-volatility regime. There is zero probability that both assets stay at the same price, a figure similar to the one in regime 1 of the 2-regime case. The marginal probabilities of no movement are also very low, supporting the classification of this regime as a high-volatility one. Regime 3 seems balanced all around, although the highest figure appears in the cell corresponding to no movement at all.

Regime 3    EUR, -   EUR, 0   EUR, +
GBP, -      0.03     0.19     0.02
GBP, 0      0.15     0.24     0.13
GBP, +      0.03     0.18     0.03

Table 19: Emission probabilities for regime 3 (3-regime setup), using FX data

4.3 Computing Intensities

Once we have computed the emission probabilities, we propose a method to derive the intensities λ_t implied by the Hidden Markov Model. Unlike the first method, using Hawkes processes, we are not modeling the midprice counting process intensities directly. We assume that, given a regime, λ_t remains constant over time. This means that, conditional on a regime, the counting processes are Poisson processes:

N_{t+∆t} − N_t | N_t, Z_t ∼ Poisson(λ_{Z_t} ∆t)

However, since the trader cannot observe which regime the market is in, the intensities of midprice jumps can still be seen as stochastic. Because we work with two assets at a time and each of them has two associated counting processes, we have four intensity values per regime. Ideally, the values should be similar to the θ from the Hawkes process method.

Because the counting processes follow Poisson processes within a given regime, we can in principle derive closed forms for the probabilities of events such as (∆S^{SMH} = 1, ∆S^{INTC} = 0), which coincide with the emission probabilities. These probabilities depend on λ. We find λ by minimizing the sum of squared errors between the estimated emission probabilities and the theoretical ones.
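For intuition, the probability of each truncated one-period net change ∆S = ∆N⁺ − ∆N⁻ under independent Poisson up and down counts can be computed by a direct double sum. This is a simplified sketch of the theoretical probabilities being matched (per asset, ignoring the at-most-3-events correction); the function names are illustrative.

```python
from math import exp, factorial

def pois_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def sign_change_probs(lam_up, lam_dn, dt=1.0, nmax=12):
    """P(truncated net midprice change = -1, 0, +1) over a period of length dt,
    assuming independent Poisson(lam_up*dt) up-ticks and Poisson(lam_dn*dt)
    down-ticks, with net moves beyond one tick truncated to +/-1."""
    p = {-1: 0.0, 0: 0.0, 1: 0.0}
    for u in range(nmax + 1):
        for d in range(nmax + 1):
            net = max(-1, min(1, u - d))
            p[net] += pois_pmf(u, lam_up * dt) * pois_pmf(d, lam_dn * dt)
    return p
```

Fitting λ then amounts to minimizing the squared distance between probabilities of this kind and the estimated emission probabilities.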

Tables 20 and 21 show the results of the matched intensities. The values for regime 2 in the 2-regime setup and for regime 3 in the 3-regime setup for INTC are exactly zero. This is somewhat expected, given that these are regimes in which INTC never moves.

The intensity values for INTC, when positive, are in line with the ones found for θ^{INTC} in the Hawkes process method. However, the values for SMH are far from satisfactory: from the Hawkes process, we were expecting values around 2.


Calibrated Intensities   λ^{+,INTC}   λ^{-,INTC}   λ^{+,SMH}   λ^{-,SMH}
Regime 1                 0.48         0.49         0.64        0.65
Regime 2                 0.00         0.00         0.32        0.29

Table 20: Calibrated intensities for INTC and SMH in the 2-regime setup

Calibrated Intensities   λ^{+,INTC}   λ^{-,INTC}   λ^{+,SMH}   λ^{-,SMH}
Regime 1                 0.22         0.13         0.40        0.30
Regime 2                 0.40         1.52         0.45        1.64
Regime 3                 0.00         0.00         0.33        0.30

Table 21: Calibrated intensities for INTC and SMH in the 3-regime setup

This is probably related to the truncation we introduced to simplify the observable state space. Between two consecutive timestamps, prices might have changed by more than one tick. Positive movements larger than 1 were mapped into 1, and negative movements lower than −1 were mapped into −1. This implies that the values 1 and −1 are overrepresented in the sample.

As an approximation, we assume that at most 3 events (total ups and downs) can happen per timestamp. This generates a combinatorics problem that we solve in order to correct the probabilities of observing −1, 0 and 1 under the assumption of at most 3 movements. The results in the previous tables were generated under this assumption.

We repeated the exercise with the FX data. The values are in general lower than what we expected from the Hawkes calibration (around 2), suggesting that our truncation hypothesis might be hurting this methodology. For the 2-regime setup, the calibrated intensities for the second regime are especially low. This is exactly the regime in which there was a high probability that nothing would move and prices would stay still.

Calibrated Lambdas   λ^{+,EUR}   λ^{-,EUR}   λ^{+,GBP}   λ^{-,GBP}
Regime 1             1.06        1.02        1.02        1.00
Regime 2             0.11        0.10        0.07        0.06

Table 22: Calibrated intensities for FX data in the 2-regime setup


Calibrated Lambdas   λ^{+,EUR}   λ^{-,EUR}   λ^{+,GBP}   λ^{-,GBP}
Regime 1             0.09        0.07        0.03        0.02
Regime 2             1.36        1.16        1.14        1.05
Regime 3             0.31        0.36        0.49        0.50

Table 23: Calibrated intensities for FX data in the 3-regime setup

5 Revisiting The Control Problem

Now we revisit the trader's control problem. We are able to solve for the optimal strategy directly; this solution is unique, since the functional that describes the performance criterion is strictly concave.

The optimal strategy is the solution of the FBSDE in Equations 2 and 3 and is given by:

ν*_t = − (1/(aα + T − t)) Q^ν_t + ∫_t^T ((aα + T − t)/(aα + T − s)) E[λ^+_s − λ^−_s | F_t] ds

The equation above has an interesting interpretation. The first term is a fraction of the inventory at time t. The baseline rate that multiplies the inventory gets bigger as time evolves, but this evolution is deterministic. As time approaches T, the penalty for holding terminal inventory kicks in and the trader tries to close out the position.

The second term corrects for statistical arbitrage opportunities. This correction is due to the evolution of the filtration and the resulting refinement of its projection of the midprice counting process intensities.

If one is concerned with directly using this approach to trade, computing the expectation under the integral is a must. Notice that

λ^+_s − λ^−_s = P λ_s

where P is an m × 2m matrix of ones, zeros and negative ones which, for m = 2, assumes the form

P = [ 1  −1  0   0
      0   0  1  −1 ]

By the linearity of expectation, it is enough to compute the expected value of λ.


5.1 Analytical Solution

Under the first approach, we model the intensity of midprice jumps using Hawkes processes. From there, we can derive a closed-form solution for the expectation E_t[λ_u] = E[λ_u | F_t], where the filtration considered is always the natural one generated by the midprices.

Let us define g_{t,u} := E_t[λ_u] = E[λ_u | F_t], so that g_{t,t} = λ_t.

In order to handle ∫_t^u η dN_s, we introduce the compensator term ∫_t^u η λ_s ds, which from Equation 4 yields:

λ_u = λ_t + ∫_t^u κ(θ − λ_s) ds + ∫_t^u η dÑ_s + ∫_t^u η λ_s ds

where dÑ_s := dN_s − λ_s ds denotes the compensated jump increment.

So, we can rewrite:

E_t[λ_u] = g_{t,u} = λ_t + ∫_t^u κ(θ − E_t[λ_s]) ds + ∫_t^u η E_t[λ_s] ds,

since E_t[∫_t^u η dÑ_s] = 0, because Ñ is a martingale. Hence, we have:

g_{t,u} = λ_t + ∫_t^u (κθ − (κ − η) g_{t,s}) ds

Taking the derivative with respect to u, we have:

ġ_{t,u} := ∂g_{t,u}/∂u = ∂/∂u ( ∫_t^u (κθ − (κ − η) g_{t,s}) ds )

⇒ ġ_{t,u} = κθ − (κ − η) g_{t,u},   g_{t,t} = λ_t

In order to solve the ODE above, we first set ġ^h_{t,u} = −(κ − η) g^h_{t,u}, where g^h_{t,u} is the homogeneous solution. Therefore g^h_{t,u} = e^{−(κ−η)(u−t)} B, where B is an unknown constant vector given the information up to instant t. For the particular solution, we further set g^p_{t,u} = A ⇒ 0 = κθ − (κ − η)A, and consequently A = (κ − η)^{−1} κθ, provided that (κ − η) is a non-singular matrix, which is verified under the set of estimated parameters.

Based on the initial condition g_{t,t} = λ_t:

g_{t,t} = (κ − η)^{−1} κθ + B = λ_t ⇒ B = λ_t − (κ − η)^{−1} κθ

Notice that B is F_t-measurable. Finally:

g_{t,u} = E[λ_u | F_t] = e^{−(κ−η)(u−t)} (λ_t − (κ − η)^{−1} κθ) + (κ − η)^{−1} κθ
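As a sanity check on this closed form, the scalar (one-dimensional) case can be verified against a direct Euler integration of the ODE ġ = κθ − (κ − η)g. The snippet below is an illustrative check with made-up parameter values, not part of the estimation pipeline.

```python
import math

def expected_intensity(lam_t, h, kappa, theta, eta):
    """Closed form for E[lambda_{t+h} | F_t] in the scalar case:
    g = exp(-(kappa - eta) h) (lam_t - m) + m, with m = kappa*theta/(kappa - eta)."""
    m = kappa * theta / (kappa - eta)
    return math.exp(-(kappa - eta) * h) * (lam_t - m) + m

def expected_intensity_euler(lam_t, h, kappa, theta, eta, n=200000):
    """Euler scheme for g' = kappa*theta - (kappa - eta) g, g(0) = lam_t."""
    g, ds = lam_t, h / n
    for _ in range(n):
        g += (kappa * theta - (kappa - eta) * g) * ds
    return g
```

Note that as u − t grows, the expectation converges to the modified long-run mean (κ − η)^{−1} κθ, which exceeds θ whenever η > 0: self-excitation raises the average intensity above the baseline.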


5.2 Filtering

Under the HMM approach, one cannot derive a closed-form solution for the expectation under the integral sign, since there are latent states. However, the only uncertainty the trader faces concerns the latent states: given the latent state, the values of λ are fixed and known to the trader.

The following expression gives E[λ^+_u − λ^−_u | F_t] in the HMM approach:

E[λ^+_u − λ^−_u | F_t] = ∑_{j=1}^{K} ∑_{s=1}^{K} π^j_t [C^{(u−t)}]_{j,s} (λ^{+,s}_t − λ^{−,s}_t)

where

π^j_t = E[1_{Z_t=j} | F_t],  ∀ j = 1, ..., K,

is the forward filter with initial condition π_0.

We now explain how to implement the forward filter. First, the filter process π can be calculated by forward recursion from the HMM.

Algorithm 1 Compute the posterior distribution π by forward recursion

1: Initialize π by setting π^j_1 = ψ_j(∆S_1) π^j_0 for each regime j
2: Update π by π^j_t = ψ_j(∆S_t) ∑_{i=1}^{K} C_{ij} π^i_{t−1}
3: Normalize π by π^j_t := π^j_t / ∑_{k=1}^{K} π^k_t

However, this discrete approach is not accurate enough in practice, so we adopt a better approach, given by

π^j_t = Γ^j_t / ∑_{k=1}^{K} Γ^k_t

where Γ^j_t solves the SDE

dΓ^j_t / Γ^j_t = (λ^{+,j}_{t−} − 1)(dN^+_t − dt) + (λ^{−,j}_{t−} − 1)(dN^−_t − dt) + ∑_{i=1}^{K} (Γ^i_t / Γ^j_t) C_{ji} dt    (7)

where t− denotes the left limit. In order to solve Equation 7, we first simplify it by removing the term A = (λ^{+,j}_{t−} − 1)(dN^+_t − dt) + (λ^{−,j}_{t−} − 1)(dN^−_t − dt); we then have

dl^j_t = ∑_{i=1}^{K} C_{ji} l^i_t dt


where l^j_t is the simplified Γ^j_t. This linear ODE has the solution

l_t = e^{C(t−t_k)} l_{t_k}.    (8)

In order to correct for the omitted term A, we write, for t ≥ t_k,

Γ_t = e^{C(t−t_k)} (Γ^1_{t_k} y^1_t, ..., Γ^K_{t_k} y^K_t)^⊤    (9)

with y_{t_k} = 1. Taking the derivative of Equation 9, we have

dΓ_t = C Γ_t dt + e^{C(t−t_k)} (Γ^1_{t_k} dy^1_t, ..., Γ^K_{t_k} dy^K_t)^⊤

Comparing with the Equation 7, we know that

eC(t−tk)

Γ1tkdy1t

· · ·Γjtkdy

jt

=

[(λ+,1t− − 1)(dN+

t − dt) + (λ−,1t− − 1)(dN−t − dt)]

Γ1t−

· · ·[(λ+,jt− − 1)(dN+

t − dt) + (λ−,jt− − 1)(dN−t − dt)]

Γjt−

then we have

dyjt = yjt

[(λ+,jt− − 1)(dN+

t − dt) + (λ−,jt− − 1)(dN−t − dt)]

by solving the equation, we have

yjt = (λ+j)N+t (λ−j)N

−t e−(λ+j−1)(t−tk)e−(λ−j−1)(t−tk)

Hence, the Equation 9 gives us the final approximate solution

Γt = eC(t−tk)

Γ1tk

(λ+j)N+t (λ−j)N

−t e−(λ+1−1)(t−tk)e−(λ−1−1)(t−tk)

· · ·Γjtk(λ+j)N

+t (λ−j)N

−t e−(λ+j−1)(t−tk)e−(λ−j−1)(t−tk)

The next values show simulated data for standard Poisson processes for 3 differentintensities. We implemented the forward filter on this data showcase it.

26

Page 154: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

0 50 100 150 200 250 300

02

46

8

Time

∆N

Figure 7: Simulated Poisson processes for different intensities (regimes)

0 50 100 150 200 250 300Time points

Reg

imes

with

hig

hest

pro

babi

lity

12

3

Figure 8: The filter can trace back the correct hidden regime with good accuracy

27

Page 155: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

0 50 100 200 300

0.25

0.30

0.35

0.40

0.45

0.50

Regime 1

Time

prob

abili

ty

0 50 100 200 300

0.25

0.30

0.35

0.40

0.45

0.50

Regime 2

Time

prob

abili

ty

0 50 100 200 300

0.25

0.30

0.35

0.40

0.45

Regime 3

Time

prob

abili

ty

Figure 9: Probability of being on a given regime in a given point in time

6 Role For Future Research

There are some gaps that could be filled in this work in an interesting way. Themain motivation to propose two competing models for asset midprices is to under-stand which one can generate better trading strategies, as well to see the quantita-tive differences implied by these models. An obvious follow-up would be testingboth strategies on real data and compare their performances. There’s nothing onthe derivation of the optimal strategy that assumes a Hawkes model or a HiddenMarkov Model, for instance.

A natural next step in terms of the Hidden Markov Model setup would be imple-menting a forward filter on true data, which is a key element to compute the ex-pectation under the integral sign in the optimal strategy derivation. But this wouldbe useful only if the matched lambdas were accurate. It’s not clear at this stage ifthe truncation hypothesis should be abandoned once for all or if one should comeup with a better way to match them with emission probabilities.

Finally, this exercise was done for two assets at a time. There’s no obvious reasonother than simplicity to work with only two assets. In fact, if the trader adopts thestatistical arbitrage point of view of this approach and is agnostic regarding theclass of assets he wants to trade, the number m of assets should be fairly large tocapture profitable co-movements of asset prices.

28

Page 156: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

7 Conclusion

The introduction of stochastic control and machine learning techniques in the con-text of automatic trading can aid decision process and enrich the way we modelfinancial data through time. In this spirit, we proposed a way to model the stochas-tic control problem faced by a trader and offered two competing ways of modelingasset midprices.

Both the Hawkes model and the HMM model were calibrated using two differen-t datasets. We used tick data for INTC (Intel Corporation) and SMH (an ETF ofsemiconductor-business stocks) and also tick data on FX markets for GBPUSD andEURUSD. We found some interesting features of this data through this exercise,such as asymmetric cross-excitation in the case for INTC and SMH for the Hawkesmodel.

We also showcased the optimal strategy in its closed form, including the analyticalsolution for the expectation of midprice counting processes intensities at a futurepoint in time u conditional on the filtration at t. Further work should look intoimplementing the forward filter and use the calibration given in the HMM sectionand contrast it against the Hawkes-like solution. Another interesting extensionwould be allowing a bigger number of assets and a bigger number of observablestates along the HMM framework.

29

Page 157: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Bibliography

Almgren, R., 2012. Optimal trading with stochastic liquidity and volatility. SIAMJournal on Financial Mathematics 3 (1), 163–181.

Almgren, R., Chriss, N., 2001. Optimal execution of portfolio transactions. Journalof Risk 3, 5–40.

Almgren, R., Thum, C., Hauptmann, E., Li, H., 2005. Equity market impact. RISK-LONDON-RISK MAGAZINE LIMITED- 18 (7), 57.

Bacry, E., Mastromatteo, I., Muzy, J.-F., 2015. Hawkes processes in finance. MarketMicrostructure and Liquidity 1 (01), 1550005.

Bank, P., Soner, H. M., Voß, M., 2017a. Hedging with temporary price impact. Math-ematics and financial economics 11 (2), 215–239.

Bank, P., Soner, H. M., Voss, M., 2017b. Hedging with temporary price impact.Mathematics and financial economics 11 (2), 215–239.

Bank, P., Soner, H. M., Voss, M., 2017c. Hedging with temporary price impact.Mathematics and financial economics 11 (2), 215–239.

Bartlett, M., 1963a. The spectral analysis of point processes. Journal of the RoyalStatistical Society. Series B (Methodological), 264–296.

Bartlett, M., 1964. The spectral analysis of two-dimensional point processes.Biometrika 51 (3/4), 299–311.

Bartlett, M. S., 1963b. Statistical estimation of density functions. Sankhya: The In-dian Journal of Statistics, Series A, 245–254.

Baum, L., 1972. An inequality and associated maximization technique in statisticalestimation of probabilistic functions of a markov process. Inequalities 3, 1–8.

Baum, L. E., Eagon, J. A., 1967. An inequality with applications to statistical estima-tion for probabilistic functions of markov processes and to a model for ecology.Bulletin of the American Mathematical Society 73 (3), 360–363.

30

Page 158: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Baum, L. E., Petrie, T., 1966. Statistical inference for probabilistic functions of finitestate markov chains. The annals of mathematical statistics 37 (6), 1554–1563.

Baum, L. E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique oc-curring in the statistical analysis of probabilistic functions of markov chains. Theannals of mathematical statistics 41 (1), 164–171.

Baum, L. E., Sell, G., 1968. Growth transformations for functions on manifolds.Pacific Journal of Mathematics 27 (2), 211–227.

Bertsimas, D., Lo, A. W., 1998. Optimal control of execution costs. Journal of Finan-cial Markets 1 (1), 1–50.

Bowsher, C. G., 2007. Modelling security market events in continuous time: Inten-sity based, multivariate point process models. Journal of Econometrics 141 (2),876–912.

Bremaud, P., Massoulie, L., 1996. Stability of nonlinear hawkes processes. The An-nals of Probability, 1563–1588.

Carlsson, J., Foo, M.-C., Lee, H.-H., Shek, H., 2007a. High frequency trade predic-tion with bivariate hawkes process. Working paper, Stanford University.

Carlsson, J., Foo, M.-C., Lee, H.-H., Shek, H., 2007b. High frequency trade predic-tion with bivariate hawkes process. Working paper, Stanford University.

Carlsson, J., Foo, M.-C., Lee, H.-H., Shek, H., 2007c. High frequency trade predic-tion with bivariate hawkes process. Working paper, Stanford University.

Cartea, A., Jaimungal, S., 2011. Modeling asset prices for algorithmic and high-frequency trading ssrn working paper, 24.

Cartea, A., Jaimungal, S., 2016a. Incorporating order-flow into optimal execution.Mathematics and Financial Economics 10 (3), 339–364.

Cartea, A., Jaimungal, S., 2016b. Incorporating order-flow into optimal execution.Mathematics and Financial Economics 10 (3), 339–364.

Casgrain, P., Jaimungal, S., 2018a. Algorithmic trading with partial information: Amean field game approach. arXiv preprint arXiv:1803.04094.

Casgrain, P., Jaimungal, S., 2018b. Algorithmic trading with partial information: Amean field game approach. arXiv preprint arXiv:1803.04094.

Casgrain, P., Jaimungal, S., 2018c. Trading algorithms with learning in latent alphamodels. arXiv preprint arXiv:1806.04472.

31

Page 159: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Casgrain, P., Jaimungal, S., 2018d. Trading algorithms with learning in latent alphamodels. arXiv preprint arXiv:1806.04472.

Chen, Y., 2016. Likelihood function for multivariate hawkes processes.

Colaneri, K., Eksi, Z., Frey, R., Szolgyenyi, M., 2016a. Shall i sell or shall i wait?optimal liquidation under partial information with price impact. arXiv preprintarXiv:1606.05079.

Colaneri, K., Eksi, Z., Frey, R., Szolgyenyi, M., 2016b. Shall i sell or shall i wait?optimal liquidation under partial information with price impact. arXiv preprintarXiv:1606.05079.

Colaneri, K., Eksi, Z., Frey, R., Szolgyenyi, M., 2016c. Shall i sell or shall i wait?optimal liquidation under partial information with price impact. arXiv preprintarXiv:1606.05079.

Cox, D., Lewis, P., 1966. The statistical analysis of series of events.

Cox, D. R., 1955. Some statistical methods connected with series of events. Journalof the Royal Statistical Society. Series B (Methodological), 129–164.

Daley, D. J., Vere-Jones, D., 2007. An introduction to the theory of point processes:volume II: general theory and structure. Springer Science & Business Media.

Engle, R. F., Russell, J. R., 1998. Autoregressive conditional duration: a new modelfor irregularly spaced transaction data. Econometrica, 1127–1162.

Frei, C., Westray, N., 2015. Optimal execution of a vwap order: a stochastic controlapproach. Mathematical Finance 25 (3), 612–639.

Ghahramani, Z., 2001. An introduction to hidden markov models and bayesiannetworks. International journal of pattern recognition and artificial intelligence15 (01), 9–42.

Hasbrouck, J., 1991. Measuring the information content of stock trades. The Journalof Finance 46 (1), 179–207.

Hawkes, A. G., 1971a. Spectra of some self-exciting and mutually exciting pointprocesses. Biometrika 58 (1), 83–90.

Hawkes, A. G., 1971b. Spectra of some self-exciting and mutually exciting pointprocesses. Biometrika 58 (1), 83–90.

Laub, P. J., Taimre, T., Pollett, P. K., 2015a. Hawkes processes. arXiv preprint arX-iv:1507.02822.

32

Page 160: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Laub, P. J., Taimre, T., Pollett, P. K., 2015b. Hawkes processes. arXiv preprint arX-iv:1507.02822.

Lewis, P., 1964. Journal of the Royal Statistical Society. Series B (Methodological),398.

Lin, P., Guo, T., Wang, Y., Chen, F., et al., 2016. Infinite hidden semi-markov mod-ulated interaction point process. In: Advances in Neural Information ProcessingSystems. pp. 3900–3908.

Mamon, R. S., Elliott, R. J., 2007. Hidden markov models in finance. Vol. 460.Springer.

33

Page 161: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

A Calibration of theLognormal ForwardLIBOR Model

TEAM 4

ANDREA ANGIULI, University of California, Santa BarbaraCAMILLA ANTUNES, EMAp - FGVCIRO PAOLUCCI, IMPAANTONIO SOMBRA, EMAp - FGV

Supervisor:THOMAS A. MCWALTER, University of Cape Town

EMAp, Fundacao Getulio Vargas, Rio de Janeiro, Brazil

Page 162: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Contents

Introduction 3

1 Definitions 61 Zero-Coupon Bond and Spot Interest Rate . . . . . . . . . . . . . . . . 62 Forward Rate Agreements and Forward Interest Rates . . . . . . . . . 73 Caps and Floors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Interest Rate Swap and Forward Rate Swap . . . . . . . . . . . . . . . 105 Swaption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 The Forward-Libor Model (LFM) 131 A Review of the Forward-Libor Model (LFM) . . . . . . . . . . . . . . 142 Swaption Pricing with LFM . . . . . . . . . . . . . . . . . . . . . . . . 163 The Hull-White formula . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Data 19

4 Calculating the Caplet Volatilities 221 Calculating the Forward Rate . . . . . . . . . . . . . . . . . . . . . . . 222 Caplet Stripping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5 Calibration 271 The Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.1 Instantaneous Volatilities of the Caplets σi(t) . . . . . . . . . . 281.2 The Correlation Matrix ρij . . . . . . . . . . . . . . . . . . . . . 28

2 Calibration in 2 Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.1 Calibration of σi(t) . . . . . . . . . . . . . . . . . . . . . . . . . 292.2 Calibration of ρij . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Calibration in 1 Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.1 1-step Basic Calibration . . . . . . . . . . . . . . . . . . . . . . 333.2 Stripping Free 1-step Calibration . . . . . . . . . . . . . . . . . 34

4 Calibration Results and Comparisons . . . . . . . . . . . . . . . . . . 35

Conclusion 41

2

Page 163: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Introduction

The money market is where financial institutions borrow and lend money withshort maturities, up to 1-year. This market provides short-term liquidity to the fi-nancial system: institutions resort to the money market to lend out their surplusand borrow money to cover any possible deficits. Some of the indexes quoted inthe money market are the LIBOR (London Inter-Bank Offered Rate) in the devel-oped markets and the JIBAR (Johannesburg Inter-Bank Agreed Rates) in the SouthAfrican market.The fixed income market is the market where long-term instruments, with matu-rities greater than 1-year, are negotiated. In this market we can trade financial in-struments that pay interest, like bonds issued by governments or companies. Otherinstruments negotiated in the fixed income market are interest rate derivatives likeforwards, futures, options and swaps.At the interest rate market, banks, brokerage houses and other institutions can ne-gotiate forward interest rates. Two parties can negotiate (buy/sell) a forward in-terest rate, through a Forward Rate Agreement (FRA), or they can exchange cashflows periodically by an Interest Rate Swap (IRS). They can also trade derivativesinstruments in this market. Two of the most important derivatives in the inter-est rate market are Caps/Floors and Swaptions. The former, caps and floors, arebaskets of caplets and floorlets, which are calls and puts on a FRA. The latter, theswaptions, are options to enter into a IRS in a future time. Banks and other entitiesuse these instruments as basis for pricing more complex financial instruments, ei-ther on short or long terms. The pension and savings industry offer products withup to 60 years or longer to maturity. 
This explains why it is essential to have amodel that can price these instruments as accurately as possible, even for longermaturities, based on market prices for liquid interest rate instruments (which ingeneral have short maturities).There are two well established models usually used to price these instruments:the Lognormal Forward-LIBOR Model (LFM) and the Lognormal Forward-SwapModel (LSM). Both are applications of Black’s 1976 model to price, respectively,caps/floors and swaptions. Our project is based on the LFM model.A theoretical incompatibility exists between LFM and LSM models, since forward-LIBOR rates and forward-swap rates are not both log-normal distributed under the

3

Page 164: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

same measure. From a theoretical point of view, we cannot value swaptions underthe LFM model applying the Black formula (the LSM approach). However, VanAppel and McWalter (2018) and Brigo and Mercurio (2007) show with numericalresults that the distribution of forward-swap rates under the LFM model is almostlog-normal from a practical point of view.Brigo and Mercurio (2007) show how to calibrate the LFM model to price swaptionsusing different structures for the volatility and correlations of the forward ratestogether with different approximations of the swaption volatility.In our project, we show different ways to calibrate the LFM model. We considerthe parametrization of the volatilities of the forward rates suggested by Brigo andMercurio (2007) and the full rank parametrization for the instantaneous correlationof the Brownian Motion defining the dynamics of the forward rates proposed byJoshi (2008). Following Van Appel and McWalter (2018), we consider the Hull-White formula for the approximation of the swaption volatility (Hull and White(2000)), also derived independently by Andersen and Andreasen (2000) and Jackeland Rebonato (2003).Using this setting, we show how to price swaption under theLFM model.In chapter 1, we define some concepts from the financial market. We start withtwo basic definitions: the Zero-Coupon Bond and the Spot Interest Rate. Then, wepresent the forward interest rate and a contract to negotiate this rate: the ForwardRate Agreement (FRA). Then, we define caps and floors, two derivative instru-ments based on this contract. We also discuss the Interest Rate Swap (IRS), whichleads us to define the forward swap rate. Finally, we present the definition of swap-tions, which are options to enter into an IRS.In chapter 2, we review the Forward-Libor Model and discuss how to price swap-tions under this model. 
We also present the Hull-White formula, which we used toapproximate the swaption volatility.In chapter 3, we introduce the data used in our project. In particular, we receivedthree data sets: Historical Nominal Swap Curves, Historical Caps/Floors Volatili-ties and Historical At-The-Money Swaption Volatility Surfaces. We use the first set,the Historical Nominal Swap Curves, to obtain the forward rates; the second one,the Historical Caps/Floors Volatilities, is used to extract the values of the capletvolatilities using the stripping procedure which is discussed in detail. Once weobtain the caplet volatilities, we proceed to the calibration of the model based onthe data given in the third set, the Historical At-The-Money Swaption VolatilitySurfaces.In chapter 4 we discuss in detail the caplet stripping. This procedure is used to ob-tain caplet volatilities from the historical cap volatilities. This technique consists tocompute cap values by the cap volatilities, and then derive caplet volatilities. Theuse of cap volatilities is justified by the fact that caplet volatilities are not observed,since these instruments are not negotiated on the market.In chapter 5, we discuss calibration of the LFM model. As a first approach, we

4

Page 165: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

calibrated the four-parameter structure suggested for the forward rate volatility.We approximate the volatilities of the caplets by an integral of the parametrizedfunction and minimize the squared difference between them and the volatilitiesobtained from the caplet stripping. We use the four parameters obtained to cali-brate the two-parameter structure for the instantaneous correlations of the forwardrates. To do that, we use the Hull-White formula to approximate the swaptionsvolatilities and we minimize the squared difference between them and the inputdata.As a second approach, we did a six-parameter calibration. We consider the volatil-ities of caplets and swaptions simultaneously, using as starting point the valuesobtained in the first calibration.Finally, we suggested a new calibration technique. The two previous procedureare based on caplet stripping. However, there are different ways to infer the capletvolatilities and the choice of the technique for the caplet stripping may influenceon the optimization. We propose a stripping-free calibration. This approach isimplemented doing a six-parameter calibration using the cap volatilities instead ofthe caplet volatilities. In this way, we try to avoid the dependence of the error fromthe caplet stripping choice.In the last chapter, we present the numerical results obtained for each calibrationprocedure. We compare them showing the benefits that each method brings andwe discuss which method is preferable.A future development of this project would be to investigate the stripping free cali-bration further. Indeed, this method depends on an hyperparameter and more testswould make the comparison with the other procedures more accurate. Further, weshould consider more dates from our data to explore the dependences on time.

5

Page 166: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Chapter 1

Definitions

We start by defining some concepts from the interest-rate market and making clearthe notation used in this project. We begin with a basic instrument, the Zero-Coupon Bond, and then we go for more sophisticated ones.

1 Zero-Coupon Bond and Spot Interest Rate

As pointed out by Brigo and Mercurio (2007) the zero-coupon bond prices are themost basic quantities in interest rate theory. In fact, we can define all interest ratesin terms of them. However, it is important to note that we do not observe zero-coupon bond prices: financial market provides interest rates prices. So, the zero-coupon bond is only a theoretical instrument.

Definition 1.1 (Zero-Coupon Bond). A T -maturity zero-coupon bond is a contractthat guarantees its holder the payment of one unit of currency at time T , with nointermediate payments. The contract value at time t < T is denoted by P (t, T ).Clearly, P (T, T ) = 1 for all T .

The zero-coupon bond is similar to a discount factor, since it gives the present valueof a payment of one unit of currency at time T . Indeed, for a deterministic rate r,the discount factor and the value of a zero-coupon are the same, but this equalityis not necessarily valid if r is stochastic. Despite this, Brigo and Mercurio (2007)show that P (t, T ) can be viewed as the expected value of the random discountfactor (that depends on a stochastic r) under a particular probability measure.In our project, we consider the value P (t, T ) as a discount factor that allows us tocompute the present value, at time t, of an amount paid at time T .Two aspects are important when dealing with interest rates: the day-count conven-tion and the compounding type.To simplify, we do not use an official day-count convention, but we consider a 365-day year, and we allow payments to be made even on weekends and holidays.

6

Page 167: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Regarding the compound type, we consider simply-compounded spot interest rates,that we define as follows.

Definition 1.2 (Simply-Compounded Spot Rates). The simply-compounded spotinterest rate prevailing at time t for the maturity T is the constant rate at whichan investment has to be made to produce an amount of one unit of currency atmaturity, starting from P (t, T ) units of currency at time t, when accrual occursproportionally to the investment time. It is given by

L(t, T ) =1− P (t, T )

τ(t, T )P (t, T ). (1.1)

2 Forward Rate Agreements and Forward Interest Rates

In this section we will define a contract called a Forward Rate Agreement. It willlead us to define a simply-compounded forward Interest Rate that will be the ref-erence rate in the LFM model.

Definition 2.1 (Forward Rates). Forward rates are interest rates that can be lockedin today for a future investment. They are characterized by three time instants,namely the time t at which the rate is considered, its expiry T and its maturity S,with t ≤ T ≤ S.

In other words, it is possible, at time t, to obtain a rate for some transaction relatedto a time interval in the future. Indeed, it is even possible to trade it. For example,suppose that a company has some obligations related to a time interval in the futurethat is a function of a floating interest rate. Assume also that the company wants toavoid the risk of a large increase in the rate, and would like to minimize its effects.Then, it would want to fix today a limit for the floating rate, ensuring a ceiling fortheir obligation. This leads us to the following definition.

Definition 2.2 (Forward Rate Agreement). A Forward Rate Agreement (FRA) isa contract between two parties which agree to buy or sell a forward interest rateduring a time interval in the future.

Thus, the company can agree with the other party a payment based on a fixedrate, limiting the value of the obligation in the future. On the other hand, thecompany receives from the other party a payment based on the floating rate. So,they exchange a variable and unlimited obligation to a limited one, transferring therisk of a large increase in rate to the other party.There are three instants that define a FRA: t ,T and S, with t ≤ T ≤ S. First, t is thetime which we consider the rate; then, T is the expiry time, when the forward rateis reset, and S is the maturity, when the payment is made.At the maturity S, a payment based on a fixed rate K is exchanged against a float-ing rate reset at T with maturity S, that is, L(T, S).

7

Page 168: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

Let’s consider a FRA expiring at Ti with maturity Ti+1. At the maturity, the fixedparty pays

τiNK,

and the floating one pays

τiNL(Ti, Ti+1),

where N is the notional of the contract and τi = Ti+1 − Ti.Using Equation 1.1, we have that the payoff of the FRA at time Ti+1 is

PayoffFRA = τiN(K − L(Ti, Ti+1))

= τiN

(K − 1− P (Ti, Ti+1)

τiP (Ti, Ti+1)

)= N

(τiK −

1

P (Ti, Ti+1)+ 1

).

We can then compute the value of the contract at time t, discounting its payoff. Itfollows that

V FRA(t) = P (t, Ti+1)PayoffFRA

= P (t, Ti+1)N

(τiK −

1

P (Ti, Ti+1)+ 1

)= N

(τiKP (t, Ti+1)−

P (t, Ti+1)

P (Ti, Ti+1)+ P (t, Ti+1)

)= N (τiKP (t, Ti+1)− P (t, Ti) + P (t, Ti+1)) . (1.2)

The last equality is motivated by the fact that the value of the amount1

P (Ti, Ti+1)

at time Ti is P (Ti, Ti+1)1

P (Ti, Ti+1)= 1. Thus, its value is P (t, Ti) at time t.

There is only one fixed rate K that makes this contract fair at time t, i.e., so thatV FRA(t) = 0. This rate defines the simply-compounded forward interest rate.

Definition 2.3 (Simply-Compounded Forward Interest Rate). The simply-compoundedforward interest rate at time t for the expiry T > t and maturity S > T is denotedby F (t, T, S) and is defined by

F (t, T, S) =1

τ(T, S)

(P (t, T )

P (t, S)− 1

). (1.3)

8

Page 169: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

In this project, we use the notation Fi(t) = F (t, Ti, Ti+1) for the forward rate fixedat t for the period [Ti, Ti+1], with t ≤ Ti < Ti+1. When t = Ti, the forward ratecoincides with the spot rate, i.e., Fi(Ti) = L(Ti, Ti+1), and we say that Ti is the resettime of Fi(t).

3 Caps and Floors

At this point we explore derivatives that are traded in the interest-rate market,beginning with caps and floors. In order to do this we have to define caplets andfloorlets.

Definition 3.1 (Caplet/Floorlet). A caplet (floorlet) is a call (put) option on a for-ward interest-rate.

Remember that a FRA involves an exchange of a floating rate for a constant rate(strike), and the holder is obliged to settle the contract. However, the purchaserof a caplet has the option to settle or not, and will do so only if the floating rateis greater than the strike. Now, we want to define the value of a caplet. Supposea caplet on a forward rate Fi(t) = F (t, Ti, Ti+1), t ∈ [0, Ti], and strike K. The fairvalue of a caplet is the expected value of its discounted payoff. Its value at time t isgiven by

V caplet(t) = EQ [τiP (t, Ti)N(Fi−1(Ti−1)−K)+], (1.4)

where N is the notional of the contract, τi = Ti+1 − Ti and Q is the risk-neutralmeasure.Similarly, the purchaser of a floorlet has the right to settle the contract or not, andhe will do so if the floating rate is less than the strike. The value at time t of afloorlet on a forward rate Fi(t) = F (t, Ti, Ti+1), t ∈ [0, Ti], and strike K is definedby

V floorlet(t) = EQ [τiP (t, Ti)N(K − Fi−1(Ti−1))+], (1.5)

where N , τi and Q are defined as previously.Since the concepts of caplets and floorlets are similar, we only discuss caplets fromthis point on, and the definitions and formulas for floorlets can be made similarly.So, we proceed with the following definition.

Definition 3.2 (Caps). An interest-rate cap is simply an adjacent series of interest-rate caplets.

Suppose we are holders of a cap on a forward rate Fi(t) = F (t, Ti, Ti+1), with resettimes Tα, Tα+1, ..., Tβ−1. At each reset time, we have the option to exercise thecorresponding caplet or not.As a set of caplets, the fair value of a cap at time t ∈ [0, Tα], is given by

9

Page 170: FINANCIAL MATHEMATICS TEAM CHALLENGE BRAZIL · Each team of students wrote a report on their findings. The four reports constitute ... 2.4 Stochastic Optimal Control and Optimal

V cap(t) =

β∑i=α+1

V caplet(i)(t) =

β∑i=α+1

E[τiP (t, Ti)N(Fi−1(Ti−1)−K)+

]. (1.6)

We can see that, valuing a cap consists in splitting the cash flow into caplets, whichcan be valued separately, and summing all their values. This means that we canconsider the effect of each forward rate separately. In this way, we do not have toworry about the correlation of the forward rates to value this instrument. We willsee later that this fact does not occur when we value swaptions.

4 Interest Rate Swap and Forward Rate Swap

Previously we discussed a contract where two parties agree to exchange a fixedinterest rate and a floating one, called Forward Rate Agreement (FRA). Now, wedefine a generalization of this contract, the Interest Rate Swap, which will lead usto the definition of the Forward Swap Rate.

Definition 4.1 (Interest Rate Swap). An interest rate swap (IRS) is an agreementbetween two counterparties to exchange cash flows periodically, whereby the setof cash flows is calculated using a fixed rate on one side, and a floating referencerate on the other.

The IRS is a contract that involves fixed and floating payments. Indeed, let Ti ∈Tα+1, Tα+2, ..., Tβ be a set of payment dates, with Ti < Ti+1, the fixed party pays

τiNK, (1.7)

and the floating one pays

τiNL(Ti−1, Ti) (1.8)

where N is the notional of the contract and K is the fixed interest rate. The IRS is called a Payer IRS (PFS) if we pay the fixed leg and receive the floating one; in the other case, it is called a Receiver IRS (RFS). We can think of an IRS as a set of FRAs and value it as a function of them. Let T_α, T_{α+1}, ..., T_{β−1} be the reset dates of the floating rate and T_{α+1}, T_{α+2}, ..., T_β the payment dates. Then the value of a RFS is given by

\[
V^{\mathrm{RFS}}(t) = \sum_{i=\alpha+1}^{\beta} V^{\mathrm{FRA}(i)}(t)
= \sum_{i=\alpha+1}^{\beta} \tau_i\, N\, P(t,T_i)\left(K - F_{i-1}(t)\right). \tag{1.9}
\]


Finally, we can define the forward swap rate using the value of a RFS.

Definition 4.2 (Forward Swap Rate). The forward swap rate is the fixed rate that makes the RFS a fair contract at the present time, i.e., the value of K for which V^{RFS}(t) = 0. It is given by

\[
S_{\alpha,\beta}(t) = \frac{P(t,T_\alpha)-P(t,T_\beta)}{\displaystyle\sum_{i=\alpha+1}^{\beta}\tau_i\,P(t,T_i)}. \tag{1.10}
\]

Indeed, applying the definition of the forward interest rate (1.3) to Equation 1.9, we have

\[
\begin{aligned}
V^{\mathrm{RFS}}(t) &= \sum_{i=\alpha+1}^{\beta} \tau_i\, N\, P(t,T_i)\left(K - F_{i-1}(t)\right)\\
&= \sum_{i=\alpha+1}^{\beta} \tau_i\, N\, P(t,T_i)\left(K - \frac{1}{\tau_i}\left(\frac{P(t,T_{i-1})}{P(t,T_i)}-1\right)\right)\\
&= \sum_{i=\alpha+1}^{\beta} \tau_i\, N\, P(t,T_i)\,K - \sum_{i=\alpha+1}^{\beta} N\left(P(t,T_{i-1})-P(t,T_i)\right)\\
&= N\left(K\sum_{i=\alpha+1}^{\beta}\tau_i\,P(t,T_i) - \left(P(t,T_\alpha)-P(t,T_\beta)\right)\right).
\end{aligned}
\]

It is easy to see that the value of K such that V RFS(t) = 0 is Sα,β given by (1.10).
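The derivation above is easy to check numerically. The following minimal sketch (our own illustration, assuming a hypothetical flat 8% continuously compounded curve with quarterly tenors) computes the forward swap rate of Equation (1.10) directly from zero-coupon bond prices:

```python
import math

def swap_rate(P, tau):
    """Forward swap rate S_{alpha,beta}(t) of Eq. (1.10).

    P   : zero-coupon bond prices [P(t,T_alpha), ..., P(t,T_beta)]
    tau : year fractions [tau_{alpha+1}, ..., tau_beta]
    """
    annuity = sum(t * p for t, p in zip(tau, P[1:]))  # sum of tau_i * P(t,T_i)
    return (P[0] - P[-1]) / annuity

# Illustrative flat 8% NACC curve, quarterly tenors over one year
times = [0.25 * k for k in range(5)]            # T_alpha = 0, ..., T_beta = 1
P = [math.exp(-0.08 * T) for T in times]
tau = [0.25] * 4
S = swap_rate(P, tau)                           # close to the quarterly simple rate
```

On a flat continuously compounded curve the result is close to the simply compounded quarterly rate, (e^{0.02} − 1)/0.25 ≈ 0.0808, as expected.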

5 Swaption

The other interest rate derivatives involved in our project are swaptions. Our goal is to calibrate the LFM model to price this product, as we will see in the next chapters.

Definition 5.1 (Swaptions). A swaption is an option to enter into a swap at a futuretime.

Consider a swap with payment dates T_{α+1}, T_{α+2}, ..., T_β and reset dates T_α, T_{α+1}, ..., T_{β−1}. If we hold a swaption on this contract, we have the right, but not the obligation, to enter into the swap at the swaption maturity, T_α. There is only one decision to be made and, once taken, it is valid for all cash flows in [T_α, T_β]. For this reason, it differs from a cap, which requires a decision at each reset date.


Since the decision is valid for all cash flows, to value a swaption we cannot separate it into single payments as we did for caps. All the payments must be considered together, so that at time t the discounted payoff of a swaption is given by

\[
\mathrm{Payoff}^{\mathrm{Swaption}}(t) = P(t,T_\alpha)\, N \left(\sum_{i=\alpha+1}^{\beta} \tau_i\, P(T_\alpha,T_i)\left(F_{i-1}(T_\alpha)-K\right)\right)^{\!+}. \tag{1.11}
\]

This equation leads to the following conclusions. First, each term of the sum looks like the payoff of a caplet, so that the sum resembles the payoff of a cap. The difference is the operator (·)^+ = max(·, 0): for caps this operator appears inside the sum, while for swaptions it is outside. As pointed out by Brigo and Mercurio (2007), the max operator is not distributive with respect to the sum, but it is a piecewise linear and convex function, and so we have the following relation:

\[
\left(\sum_{i=\alpha+1}^{\beta} \tau_i\, P(T_\alpha,T_i)\left(F_{i-1}(T_\alpha)-K\right)\right)^{\!+}
\le \sum_{i=\alpha+1}^{\beta} \tau_i\, P(T_\alpha,T_i)\left(F_{i-1}(T_\alpha)-K\right)^{+}.
\]

At time T_α, a swaption therefore has a lower value than the corresponding cap (a cap with the same "period of life"). This makes sense from a practical point of view, since the holder of a cap has more rights than the holder of a swaption, being able to make decisions throughout the cash-flow period and better manage the payments. The second thing we can conclude from (1.11) is that we cannot consider each forward rate separately to price a swaption; we need to consider the effect of all of them together. Unlike caps, we have to consider the correlations between the forward rates to price a swaption. This leads us to add more parameters to our model, as we will see later.


Chapter 2

The Forward-Libor Model (LFM)

In the previous chapter, we introduced the two main financial instruments in the interest-rate options market: caps/floors and swaptions. There is a well-established model for each of these derivatives: the lognormal forward-LIBOR model (LFM) for caps/floors and the lognormal forward-swap model (LSM) for swaptions. Under these models, and some assumptions on the interest-rate distribution, we can apply the Black (1976) formula to price the corresponding option. Each model provides an easy way to compute the value of one product at a time, but we cannot use both models at the same time to value our caps and swaptions: the forward-LIBOR rates and the forward-swap rates cannot both be lognormally distributed under the same measure. To see this, we can rewrite the forward swap rate given by (1.10) as

\[
S_{\alpha,\beta}(t) = \frac{1-\displaystyle\prod_{j=\alpha+1}^{\beta}\frac{1}{1+\tau_j F_j(t)}}{\displaystyle\sum_{i=\alpha+1}^{\beta}\tau_i\prod_{j=\alpha+1}^{i}\frac{1}{1+\tau_j F_j(t)}}. \tag{2.1}
\]

Indeed, from Equation 1.3, we have

\[
\frac{P(t,T_k)}{P(t,T_\alpha)} = \prod_{j=\alpha+1}^{k}\frac{P(t,T_j)}{P(t,T_{j-1})} = \prod_{j=\alpha+1}^{k}\frac{1}{1+\tau_j F_j(t)},
\]

for k > α. Dividing both the numerator and the denominator of Equation 1.10 by P(t, T_α), we obtain Equation 2.1. Under the LFM, the forward-LIBOR rates are lognormally distributed. Then there is no reason to believe that the distribution of S_{α,β} is lognormal, and we cannot use the Black formula to price swaptions. Despite this theoretical incompatibility, Brigo and Mercurio (2007) and Van Appel and McWalter (2018) show that the distribution of S_{α,β} is not far from lognormal


under the LFM model. It follows that we can use the parameters of the LFM to approximate the variance of the forward-swap rate, and input this value into the Black formula (the other variables being known) to price swaptions. In this chapter, we review the LFM and discuss how to value swaptions under this model, using an approximation of the swap-rate volatility suggested by Hull and White (2000).

1 A Review of the Forward-Libor Model (LFM)

Following Van Appel and McWalter (2018), suppose the bond tenor¹ dates are given by 0 < T_1 < ... < T_M < T_{M+1}. The forward-LIBOR rate from T_i to T_{i+1} follows Equation 1.3. We denote by Q^{m+1} the T_{m+1}-forward measure associated with the zero-coupon bond P(t, T_{m+1}), for 0 ≤ m ≤ M. The dynamics of the forward rate F_i(t) under Q^{m+1} are given by

\[
dF_i(t) = \mu_i(t)\,dt + \sigma_i(t)\,F_i(t)\,dW_i^{m+1}(t) \tag{2.2}
\]

for t ≤ Ti, where

\[
\mu_i(t) =
\begin{cases}
\displaystyle -\sigma_i(t)F_i(t)\sum_{j=i+1}^{m}\frac{\rho_{i,j}\,\tau_j\,\sigma_j(t)F_j(t)}{1+\tau_j F_j(t)}, & \text{for } i<m,\\[2mm]
0, & \text{for } i=m,\\[2mm]
\displaystyle \sigma_i(t)F_i(t)\sum_{j=m+1}^{i}\frac{\rho_{i,j}\,\tau_j\,\sigma_j(t)F_j(t)}{1+\tau_j F_j(t)}, & \text{for } i>m.
\end{cases} \tag{2.3}
\]

In the equation above, σ_i(t) is the instantaneous volatility of F_i(t), and W_i^{m+1}(t) is the i-th component of an M-dimensional Q^{m+1} standard Brownian motion W^{m+1}(t) with instantaneous correlations given by

\[
d\langle W_i^{m+1}, W_j^{m+1}\rangle_t = \rho_{i,j}\,dt. \tag{2.4}
\]

In particular, we have µ_i(t) = 0 when i = m. In that case the process (2.2) is a martingale and F_i(t) is lognormally distributed. It follows that we can price caps and floors under the LFM using the Black formula, given by

\[
\mathrm{Black}(F,v,K,\xi) = \xi\left[F\,\phi\!\left(\xi d_1(F,v,K)\right) - K\,\phi\!\left(\xi d_2(F,v,K)\right)\right], \tag{2.5}
\]

where ξ is equal to 1 for caps and −1 for floors, φ(·) is the standard normal cumulative distribution function, and

1Time-to-Maturity of a bond.


\[
d_1(F,v,K) = \frac{\log\!\left(\dfrac{F}{K}\right) + \dfrac{v^2}{2}}{v} \tag{2.6}
\]

and

\[
d_2(F,v,K) = \frac{\log\!\left(\dfrac{F}{K}\right) - \dfrac{v^2}{2}}{v}. \tag{2.7}
\]

To compute these values, we need the volatility of F. We define the average percentage variance of the forward rate F_i(t) over [0, T_i] as

\[
v^2_{F_i} := \frac{1}{T_i}\,\mathrm{Var}^{i+1}\!\left[\int_0^{T_i}\frac{dF_i(t)}{F_i(t)}\right]
= \frac{1}{T_i}\,\mathrm{Var}^{i+1}\!\left[\int_0^{T_i} d\log F_i(t)\right]
= \frac{1}{T_i}\int_0^{T_i}\sigma_i^2(t)\,dt, \tag{2.8}
\]

and the value of a cap, given by Equation 1.6, can be computed as

\[
\begin{aligned}
V^{\mathrm{cap}}(t) &= \sum_{i=\alpha}^{\beta-1} V^{\mathrm{caplet}(i)}(t)
= \sum_{i=\alpha}^{\beta-1}\mathbb{E}^{i+1}\!\left[\tau_i\,Z(t,T_i)\,N\left(F_i(T_i)-K\right)^{+}\right]\\
&= \sum_{i=\alpha}^{\beta-1}\tau_i\,Z(t,T_i)\,N\,\mathbb{E}^{i+1}\!\left[\left(F_i(T_i)-K\right)^{+}\right]
= \sum_{i=\alpha}^{\beta-1}\tau_i\,Z(t,T_i)\,N\,\mathrm{Black}\!\left(F_i(0),\,v_{F_i}\sqrt{T_i},\,K,\,1\right). \tag{2.9}
\end{aligned}
\]

As we mentioned in Chapter 1, the valuation of a cap does not require the correlations between the forward rates, since each caplet can be considered separately. However, we cannot use the same approach for swaptions under the LFM, since we cannot consider the forward rates and the swap rate to be lognormally distributed at the same time. As explained before, this theoretical incompatibility can be bypassed. In the next section, we discuss how to calculate the value of a swaption under the LFM.


2 Swaption Pricing with LFM

Consider a swaption with payment dates T_{α+1}, T_{α+2}, ..., T_β, where T_i < T_{i+1}. As shown in Equation 2.1, the forward-swap rate is a weighted combination of forward rates. Since the F_i(t) are lognormally distributed, there is no reason to believe that S_{α,β}(t) is too. However, Van Appel and McWalter (2018) and Brigo and Mercurio (2007) show that the distribution of S_{α,β}(t) is almost lognormal under the LFM, so we can use the Black formula to price swaptions. First of all, we need to calculate the average percentage variance of the forward-swap rate, v²_{S_{α,β}}, in terms of the parameters of our model. Starting from the definition, we have

\[
\begin{aligned}
v^2_{S_{\alpha,\beta}} &:= \frac{1}{T_\alpha}\,\mathrm{Var}^{\alpha+1}\!\left[\int_0^{T_\alpha}\frac{dS_{\alpha,\beta}(t)}{S_{\alpha,\beta}(t)}\right]
= \frac{1}{T_\alpha}\,\mathrm{Var}^{\alpha+1}\!\left[\int_0^{T_\alpha} d\log S_{\alpha,\beta}(t)\right]\\
&= \frac{1}{T_\alpha}\left(\mathbb{E}^{\alpha+1}\!\left[\int_0^{T_\alpha} d\langle\log S_{\alpha,\beta}\rangle_t\right] + \psi(T_\alpha)\right), \tag{2.10}
\end{aligned}
\]

where

\[
\psi(T_\alpha) = \left(\log S_{\alpha,\beta}(0)\right)^2 - \left(\mathbb{E}^{\alpha+1}\!\left[\log S_{\alpha,\beta}(T_\alpha)\right]\right)^2
+ 2\,\mathbb{E}^{\alpha+1}\!\left[\int_0^{T_\alpha}\log S_{\alpha,\beta}(t)\,d\log S_{\alpha,\beta}(t)\right]. \tag{2.11}
\]

Since S_{α,β} is not a martingale under Q^{α+1}, ψ(T_α) is different from zero. Nevertheless, Van Appel and McWalter (2018) give numerical evidence that it is close enough to zero, and that it is reasonable to assume that S_{α,β} is lognormally distributed (and thus a Q^{α+1}-martingale). In this way, we can approximate the average percentage variance of the forward-swap rate by

\[
v^2_{S_{\alpha,\beta}} \approx \frac{1}{T_\alpha}\,\mathbb{E}^{\alpha+1}\!\left[\int_0^{T_\alpha} d\langle\log S_{\alpha,\beta}\rangle_t\right] =: \left(v^{LN}_{S_{\alpha,\beta}}\right)^2. \tag{2.12}
\]

We can value swaptions under the LFM by plugging v²_{S_{α,β}} into Equation 2.5. To do this, we need to compute the quadratic variation of log S_{α,β}(t). White and Iwashita (2014) present three different approaches to approximate the expectation in (2.12). In the next section, we present one of them, the Hull–White formula, which is the one we used in our project.


3 The Hull-White formula

Rebonato (1998) proposed the following reformulation of the forward-swap rate given by Equation 1.10:

\[
S_{\alpha,\beta}(t) = \sum_{i=\alpha}^{\beta-1} w_i(t)\,F_i(t), \tag{2.13}
\]

with weights defined by

\[
w_i(t) = \frac{\tau_i\displaystyle\prod_{j=\alpha}^{i}\frac{1}{1+\tau_j F_j(t)}}{\displaystyle\sum_{k=\alpha}^{\beta-1}\tau_k\prod_{j=\alpha}^{k}\frac{1}{1+\tau_j F_j(t)}}. \tag{2.14}
\]

This reformulation is useful because we can obtain the dynamics of S_{α,β}(t) in terms of the F_i(t) under the LFM and derive the quadratic variation d⟨log S_{α,β}(t)⟩_t. In his formula, Rebonato (1998) assumes that w_i(t) and F_i(t) are independent and freezes the weights at time t = 0. Building on this reformulation, a more sophisticated formula was proposed by Hull and White (2000), and derived independently by Andersen and Andreasen (2000) and Jackel and Rebonato (2003). Hull and White (2000) drop the frozen-weights assumption, producing the following LFM forward-swap rate dynamics:

\[
dS_{\alpha,\beta}(t) = \sum_{h=\alpha}^{\beta-1}\bar{w}_h(t)\,dF_h(t) = \sum_{h=\alpha}^{\beta-1}\bar{w}_h(t)\,\sigma_h(t)\,F_h(t)\,dW_h^{m+1}(t), \tag{2.15}
\]

where

\[
\bar{w}_h(t) = w_h(t) + \sum_{i=\alpha}^{\beta-1} F_i(t)\,\frac{\partial w_i(t)}{\partial F_h} \tag{2.16}
\]

and

\[
\frac{\partial w_i(t)}{\partial F_h} = \frac{w_i(t)\,\tau_h}{1+\tau_h F_h(t)}\left[\frac{\displaystyle\sum_{k=h}^{\beta-1}\tau_k\prod_{j=\alpha}^{k}\frac{1}{1+\tau_j F_j(t)}}{\displaystyle\sum_{k=\alpha}^{\beta-1}\tau_k\prod_{j=\alpha}^{k}\frac{1}{1+\tau_j F_j(t)}} - \mathbb{1}_{\{i\geq h\}}\right]. \tag{2.17}
\]

Considering the dynamics (2.15), the quadratic variation of log S_{α,β}(t) over [0, T_α] is given by

\[
\int_0^{T_\alpha} d\langle\log S_{\alpha,\beta}\rangle_t = \int_0^{T_\alpha}\sum_{h,j=\alpha}^{\beta-1} G_{h,j}(t)\,\sigma_h(t)\,\sigma_j(t)\,\rho_{h,j}\,dt, \tag{2.18}
\]

where


\[
G_{h,j}(t) = \frac{\bar{w}_h(t)\,\bar{w}_j(t)\,F_h(t)\,F_j(t)}{S^2_{\alpha,\beta}(t)}. \tag{2.19}
\]

Hull and White (2000) assume that

Gh,j(t) = Gh,j(0), (2.20)

for all t ∈ [0, T_α] and h, j ∈ {α, α+1, ..., β−1}. This assumption can be justified by the fact that the expectation of (2.19) is approximately equal to G_{h,j}(0) (for more details, see Jackel and Rebonato (2003)). Indeed, Van Appel and McWalter (2018) show with a numerical test that it is a reasonable assumption. By applying (2.18) to (2.12), Hull and White (2000) derive the following approximation of the swap-rate volatility:

\[
\left(v^{LN}_{S_{\alpha,\beta}}\right)^2 \approx \frac{1}{T_\alpha}\sum_{h,j=\alpha}^{\beta-1} G_{h,j}(0)\,\rho_{h,j}\int_0^{T_\alpha}\sigma_h(t)\,\sigma_j(t)\,dt, \tag{2.21}
\]

where ρ_{h,j} is the instantaneous correlation of W_h^{α+1} and W_j^{α+1}. Once we have the approximation of the swap-rate volatility, we can price swaptions under the LFM by using Black (1976) as follows:

\[
V^{\mathrm{swaption}}(t) = N\sum_{i=\alpha}^{\beta-1}\tau_i\,P(t,T_{i+1})\;\mathrm{Black}\!\left(S_{\alpha,\beta}(t),\,v^{LN}_{S_{\alpha,\beta}}\sqrt{T_\alpha},\,K,\,\xi\right), \tag{2.22}
\]

where t ∈ [0, T_α], ξ is equal to 1 for a payer swaption and −1 for a receiver swaption, and Black is defined in (2.5).
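The chain (2.14), (2.16)–(2.17), (2.19) and (2.21) can be sketched in code. The following is our own illustration, under the simplifying assumption (not from the report's implementation) that each σ_h(t) is constant in time, so that ∫₀^{T_α} σ_h(t)σ_j(t) dt = σ_h σ_j T_α:

```python
import math

def swap_weights(F, tau):
    """Rebonato weights w_i(0) of Eq. (2.14); F, tau indexed alpha..beta-1."""
    d, acc = [], 1.0
    for f, t in zip(F, tau):
        acc /= (1.0 + t * f)          # prod_{j<=k} 1/(1+tau_j F_j)
        d.append(acc)
    denom = sum(t * p for t, p in zip(tau, d))
    return [t * p / denom for t, p in zip(tau, d)], d, denom

def adjusted_weights(F, tau):
    """Hull-White adjusted weights bar-w_h(0) of Eqs. (2.16)-(2.17)."""
    w, d, denom = swap_weights(F, tau)
    n, wbar = len(F), None
    wbar = list(w)
    for h in range(n):
        tail = sum(tau[k] * d[k] for k in range(h, n)) / denom
        for i in range(n):
            dw_dFh = (w[i] * tau[h] / (1.0 + tau[h] * F[h])
                      * (tail - (1.0 if i >= h else 0.0)))
            wbar[h] += F[i] * dw_dFh
    return wbar

def hull_white_vol(F, tau, sigma, rho, T_alpha):
    """Approximate swap-rate volatility v^LN of Eq. (2.21), assuming each
    sigma_h constant in time (so the integral is sigma_h*sigma_j*T_alpha)."""
    w, _, _ = swap_weights(F, tau)
    wbar = adjusted_weights(F, tau)
    S = sum(wi * fi for wi, fi in zip(w, F))      # Eq. (2.13) at t = 0
    var = 0.0
    for h in range(len(F)):
        for j in range(len(F)):
            G = wbar[h] * wbar[j] * F[h] * F[j] / (S * S)   # Eq. (2.19)
            var += G * rho[h][j] * sigma[h] * sigma[j] * T_alpha
    return math.sqrt(var / T_alpha)

# Illustrative flat data: 1y-into-1y quarterly swaption
F = [0.08] * 4
tau = [0.25] * 4
sigma = [0.2] * 4                       # assumed constant caplet vols
rho = [[1.0] * 4 for _ in range(4)]     # assumed perfect correlation
v_ln = hull_white_vol(F, tau, sigma, rho, 1.0)  # close to 0.2 for flat inputs
```

With flat forwards, flat volatilities and perfect correlation, the basket collapses and the approximation returns a value very close to the common caplet volatility, which is a useful sanity check.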


Chapter 3

Data

In this chapter we describe the data used in this project. We received three main sets of tables: the SASC, SACF and SASV pivot tables, containing the South African Swap Curve values, the South African Cap/Floor volatilities and the South African Swap Values (swaption volatilities), respectively.

The first table is the SASC Pivot Table, the South African Swap Curve. A part of this table can be seen in Table 3.1.

Row Labels   0            1            7            14           ...
02/jan/04    0,082456707  0,082217699  0,080800895  0,079272799  ...
05/jan/04    0,082455850  0,082217699  0,080806772  0,079290708  ...
06/jan/04    0,082455519  0,082217699  0,080809347  0,079300445  ...
07/jan/04    0,082446187  0,082217699  0,080863701  0,079406911  ...
08/jan/04    0,082451247  0,082217699  0,080834282  0,079349733  ...
09/jan/04    0,082456231  0,082217699  0,080805255  0,079292955  ...
12/jan/04    0,082455964  0,082217699  0,080806761  0,079295549  ...
13/jan/04    0,082448872  0,082217699  0,080848375  0,079379045  ...
14/jan/04    0,082439882  0,082217699  0,080900792  0,079482092  ...
15/jan/04    0,082441899  0,082217699  0,08088931   0,079461453  ...
...          ...          ...          ...          ...          ...

Table 3.1: SASC Pivot Table

This table shows the values of the NACC rates, the Nominal Annual Compounded Continuously rates; these values represent the spot rates at a given date. Each line of the table shows the values of the spot rates for the periods given in the column headers¹. In this project, we use only the values in the last line, in other words, only the values relative to March 29, 2018.

1The values in the column titles do not follow a pattern.


The second set of tables is the SACF Pivot Data, the South African Cap and Floor implied volatilities. For each date, we have a table with the implied volatilities of the caps that start on that day and have different durations, expressed in years. Using the date chosen earlier (March 29, 2018), we can see the cap implied volatilities in Table 3.2.

Caps Vols 03/29/2018
1    7,42
2    11,68
3    14,19
4    15,52
5    16,23
6    16,69
7    17,21
8    17,68
9    18,07
10   18,45

Table 3.2: Caps Implied Volatilities

In Figure 3.1 it is possible to see a visualization of the implied volatilities. For each cap, we have a horizontal line at the value of its implied volatility.

Figure 3.1: The implied volatilities calculated at March 29, 2018

Finally, the last set of tables is the SASV Pivot Data, the South African Swap Values. For each date, the table shows the implied volatility values associated with 14 terms ranging from 0.25 to 10 years and 10 tenors ranging from 1 to 10 years. Some of the values can be seen in Table 3.3.


Terms \ Tenors   1      2      3      ...   9      10
0,25             7,36   9,63   10,21  ...   14,61  14,46
0,5              9,63   12,19  12,64  ...   15,58  15,44
0,75             11,36  13,73  14     ...   16,28  16,14
1                12,85  14,95  15,09  ...   16,82  16,69
1,5              15,18  15,95  16,08  ...   16,91  16,8
2                16,97  16,64  16,78  ...   16,93  16,89
3                16,93  17,53  17,36  ...   17,18  17,27
4                17,06  17,58  17,46  ...   17,32  17,25
5                16,61  17,25  17,31  ...   17,33  17,09
6                17,8   18,03  18,05  ...   18,05  17,92
7                18,93  18,74  18,92  ...   18,77  18,82
8                19,55  19,73  19,32  ...   19,5   19,68
9                20,89  19,98  20,19  ...   20,39  20,67
10               19,92  20,6   21,39  ...   21,32  21,64

Table 3.3: SASV Pivot Table


Chapter 4

Calculating the Caplet Volatilities

In this chapter we discuss how to obtain the values of the forward rates and the caplet volatilities from our data. In particular, we show how to extract the values of the forward rates from the spot rate curve (SASC Pivot, Table 3.1). Then, we introduce caplet stripping (see White and Iwashita (2014)) to obtain the caplet volatilities from the cap volatilities given in the SACF Pivot table (Table 3.2).

1 Calculating the Forward Rate

First, we show how to obtain the forward rates F_i(t) using the information in the spot rate curve, observed in the market and given by the SASC Pivot table. We start by applying a linear interpolation to the spot rate curve in order to obtain values for every quarter of a year, i.e., for t = 0.25, 0.5, ..., 9.75, 10. We can see the interpolated curve in Figure 4.1.

Figure 4.1: The Swap Curve and the interpolated points.

To proceed with the computation of Fi(t), we need to introduce the discount factor.

Definition 1.1 (Discount factor). The discount factor (also known as discount rate)


is the amount of interest used to bring a value from the future to a correspondingamount in the present.

As pointed out in the first chapter, the discount factor is related to the price at time t of a zero-coupon bond with maturity T, that is, P(t, T). From now on, we will use the standard notation for the discount factor between times t and T:

\[
Z(t,T) = e^{-r_T(T-t)}, \tag{4.1}
\]

where r_T is the rate for maturity T. We are interested in the values of the discount factor at time t = 0, that is,

Z(0, T ) = e−rTT , (4.2)

for times T = 0.25, 0.5, ..., 9.75, 10. Now we are able to compute the values of the forward rates F_i(0) = f(0, T_i, T_{i+1}) by using Equation 4.3:

\[
F(t,T,S) = \frac{1}{\tau(T,S)}\left(\frac{Z(t,T)}{Z(t,S)}-1\right), \tag{4.3}
\]

where we used P(t, T) = Z(t, T) because of the relation between zero-coupon bonds and discount factors. We compute F_i(0) for i = 1, ..., 39, since we consider a time interval of 10 years split into quarters. Using the data presented in Chapter 3, Figure 4.2 was constructed. In this plot we can see the behavior of the forward rates for each of the periods we are analyzing.
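The computation of Equations (4.1)–(4.3) on the quarterly grid can be sketched as follows. This is our own illustration: a hypothetical flat 8% NACC curve stands in for the interpolated SASC data.

```python
import math

def discount(r, T):
    # Z(0,T) = exp(-r_T * T), Eq. (4.2); r is the NACC spot rate for maturity T
    return math.exp(-r * T)

def forward_rate(Z_T, Z_S, tau):
    # Simply compounded forward rate F(0,T,S), Eq. (4.3), with tau = S - T
    return (Z_T / Z_S - 1.0) / tau

# Hypothetical flat 8% NACC curve on the quarterly grid out to 10 years
grid = [0.25 * k for k in range(41)]        # 0, 0.25, ..., 10
rates = [0.08] * len(grid)                  # placeholder for the interpolated SASC curve
Z = [discount(r, T) for r, T in zip(rates, grid)]
F = [forward_rate(Z[i], Z[i + 1], 0.25) for i in range(40)]
```

On the flat curve every quarterly forward equals (e^{0.02} − 1)/0.25; with real SASC rates the list F reproduces the forwards plotted in Figure 4.2.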

Figure 4.2: Forward rates calculated at March 29, 2018.

2 Caplet Stripping

In this section, we introduce the caplet stripping procedure, which consists in inferring the caplet volatilities from the cap volatilities. We will use the following notation:


1. σ^{cap}_T(t): the volatility of a cap starting at year t and ending at year T;

2. σ^{cap}_T = σ^{cap}_T(0);

3. σ^{caplet}_{i,j}: the volatility of a caplet starting at t_i and ending at t_j.

Caplets and floorlets are not traded in the market, so we cannot directly observe their values. We need a way to infer the caplet volatilities starting from the prices of caps (which we do observe). This procedure is called caplet stripping. There are different ways to apply it (see White and Iwashita (2014)); in this section we describe the bootstrapping procedure, which is the one implemented in our project. To infer the caplet volatilities from the cap volatilities, we perform the following steps:

1. Get the cap volatilities from the market;

2. Compute the cap values using Equation 2.9, V^{cap}(t) = \sum_i \tau_i\,Z(t,T_i)\,N\,\mathrm{Black}(F_i(0), v_{F_i}\sqrt{T_i}, K, 1);

3. Calculate the caplet volatilities using the bootstrapping procedure.

The first step gets the implied volatilities for caps with maturities of 1, 2, ..., 10 years from the data in the SACF table (Table 3.2). In step two, we obtain the cap values via the Black formula; to do that, we need the implied volatility obtained in the previous step and the strike. In particular, we consider the at-the-money strike, the most liquid in the market. This leads us to the following definition.

Definition 2.1. Consider a cap with payment times T_{α+1}, ..., T_β, associated year fractions τ_{α+1}, ..., τ_β and strike K. The cap is said to be at-the-money (ATM) if and only if

\[
K_{\mathrm{ATM}} = \frac{\displaystyle\sum_{i=\alpha}^{\beta-1}\tau_i\,Z(t,T_{i+1})\,F_i(t)}{\displaystyle\sum_{i=\alpha}^{\beta-1}\tau_i\,Z(t,T_{i+1})}. \tag{4.4}
\]

K_{ATM} is the strike that equals the swap rate for the period covered by the cap. Using the values of the forward rates and the discount factors computed in the last section, we can compute the value of the strike for each of the maturities analyzed, i.e., n = 1, 2, ..., 10 years. Since we are evaluating at the initial time, t_0 = 0, and all


the intervals have the same length, Equation 4.4 can be written as

\[
K_n = K^{\mathrm{ATM}}_n = \frac{\displaystyle\sum_{i=\alpha}^{\beta-1} Z(0,T_{i+1})\,F_i(0)}{\displaystyle\sum_{i=\alpha}^{\beta-1} Z(0,T_{i+1})}. \tag{4.5}
\]

In particular, each caplet covers a quarter, so for each year n we have α = 0 and β = 4n. As we will see later, the value of K_1 is not required, so we calculate K_n starting from n = 2, obtaining:

\[
K_2 = \frac{\displaystyle\sum_{i=0}^{7} Z(0,t_{i+1})\,F_i(0)}{\displaystyle\sum_{i=0}^{7} Z(0,t_{i+1})}, \qquad
K_3 = \frac{\displaystyle\sum_{i=0}^{11} Z(0,t_{i+1})\,F_i(0)}{\displaystyle\sum_{i=0}^{11} Z(0,t_{i+1})}, \qquad \ldots
\]
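The ATM strike of Equation (4.5) is a discount-weighted average of the forwards. A minimal sketch (ours, again assuming a hypothetical flat 8% NACC curve) for K_2:

```python
import math

def atm_strike(F, Z):
    """ATM strike K_n of Eq. (4.5): Z holds Z(0, t_{i+1}) and F holds F_i(0)
    for i = 0, ..., 4n - 1 (quarterly caplets covering n years)."""
    return sum(z * f for z, f in zip(Z, F)) / sum(Z)

# Hypothetical flat 8% NACC curve, quarterly grid covering two years
Z = [math.exp(-0.08 * 0.25 * (i + 1)) for i in range(8)]  # Z(0,t_1), ..., Z(0,t_8)
F = [(math.exp(0.08 * 0.25) - 1) / 0.25] * 8              # flat forwards
K2 = atm_strike(F, Z)
```

On a flat curve the weighted average collapses and K_2 equals the common forward rate, a convenient sanity check before feeding in the real curve.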

Now that we have the cap volatilities (from step one) and the at-the-money strikes K^{ATM}_2, K^{ATM}_3, ..., K^{ATM}_{10}, we can use Equation 2.9 to compute the cap values:

\[
V^{\mathrm{cap}_i}(t) = \mathrm{Black}\!\left(F_i(0),\, v_{\mathrm{cap}_i}\sqrt{T_i},\, K^{\mathrm{ATM}}_i,\, 1\right).
\]

In the last step, we apply the bootstrapping procedure, which gives us the caplet volatilities. To illustrate this procedure, we show how to compute the first seven caplet volatilities. We assume that the caplet volatilities are constant within each year. Then the first three caplets have volatilities given by

\[
\sigma^{\mathrm{caplet}}_{j,j+1} = \sigma^{\mathrm{cap}}_1,
\]

for j ∈ {1, 2, 3}. In order to compute the next four, we consider the value of a cap in terms of the values of its caplets:

\[
V_{\mathrm{cap}_2}(K_2,\sigma^{\mathrm{cap}}_2) = \sum_{i=1}^{7} V_{\mathrm{caplet}_i}(K_2,\sigma^{\mathrm{caplet}}_{i,i+1}).
\]

Since we already know the volatilities of the first three caplets, we can split our sum into two terms:

\[
V_{\mathrm{cap}_2}(K_2,\sigma^{\mathrm{cap}}_2) = \sum_{j=1}^{3} V_{\mathrm{caplet}_j}(K_2,\sigma^{\mathrm{cap}}_1) + \sum_{j=4}^{7} V_{\mathrm{caplet}_j}(K_2,\sigma_2).
\]

The second term depends on σ_2, the common volatility of the next four caplets. We know the left-hand side of the equation from step two, and we know all the variables needed to


compute the first sum on the right-hand side. As σ_2 is the only unknown variable, we can define a function X(σ_2) given by

\[
X(\sigma_2) = V_{\mathrm{cap}_2}(K_2,\sigma^{\mathrm{cap}}_2) - \sum_{j=1}^{3} V_{\mathrm{caplet}_j}(K_2,\sigma^{\mathrm{cap}}_1) - \sum_{j=4}^{7} V_{\mathrm{caplet}_j}(K_2,\sigma_2).
\]

By minimizing X(σ2), we obtain

\[
\sigma^*_2 = \arg\min_{\sigma_2}\left|X(\sigma_2)\right|^2,
\]

which represents the volatility of caplets j = 4, 5, 6, 7. Applying this procedure recursively, we can obtain the volatilities of all the caplets. Figure 4.3 shows the implied volatilities of all the caplets; each color represents a different caplet, and we can observe that within each year the implied volatilities are the same.
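The year-by-year bootstrap can be sketched as below. This is our own illustration: it prices the caplets with the Black formula of Equation (2.9) and drives the residual X(σ) to zero by bisection (a swapped-in root-finding technique; any one-dimensional solver would do). Caplets are represented as hypothetical (F_i, T_i, Z_i, τ_i, σ_i) tuples.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_caplet(F, K, sigma, T, Z, tau, N=1.0):
    # Caplet value tau * Z * N * Black(F, sigma*sqrt(T), K, 1), cf. Eq. (2.9)
    v = sigma * math.sqrt(T)
    d1 = (math.log(F / K) + 0.5 * v * v) / v
    d2 = d1 - v
    return tau * Z * N * (F * norm_cdf(d1) - K * norm_cdf(d2))

def strip_year(cap_value, known, unknown, K, lo=1e-4, hi=2.0):
    """Find the common volatility of the caplets in `unknown` so that all
    caplet values add up to the cap value (the zero of X(sigma)).
    known/unknown: lists of (F_i, T_i, Z_i, tau_i, sigma_i-or-None) tuples."""
    base = sum(black_caplet(F, K, s, T, Z, tau) for F, T, Z, tau, s in known)
    def X(sig):
        return cap_value - base - sum(
            black_caplet(F, K, sig, T, Z, tau) for F, T, Z, tau, _ in unknown)
    for _ in range(100):                 # bisection on the monotone residual
        mid = 0.5 * (lo + hi)
        if X(mid) > 0:                   # cap still worth more -> raise the vol
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling strip_year with the 2-year cap value, the three caplets already assigned σ^{cap}_1, and the four unknown caplets returns σ*_2; repeating the call year by year reproduces the recursion described above.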

Figure 4.3: Implied volatilities for the caplets after the caplet stripping.


Chapter 5

Calibration

In this chapter, we present the calibration problem for the LIBOR model. As seen in Chapter 2, this model describes the dynamics of the forward rates through a particular structure of the instantaneous volatility and correlation. We present our choices for these structures and the parameters to be calibrated from our data. We show three ways to do the calibration: one based on two steps and two based on a single step. The first approach consists of two steps: the first optimization step finds the best parameters for the instantaneous volatility structure of the forward rates, and the second step optimizes the correlation-matrix parameters (ρ∞ and ζ) based on our data. This procedure is essentially done before attempting any 1-step optimization because it is less complex and produces results which can be used as initial guesses for the 1-step calibration procedure. Afterwards, we present the first 1-step calibration procedure, which consists of a single joint optimization of all the parameters defining the instantaneous volatility and the correlation matrix of the forward rates. This procedure is more complex because it involves the optimization of six parameters at the same time; in order to avoid local minima, we use the values obtained from the 2-step calibration as initial guesses. Finally, we close the chapter by presenting a new 1-step calibration method. Unlike all previous methods, this calibration does not require any caplet stripping, so it is independent of the influence that a particular choice of stripping procedure may have.

1 The Problem Setup

As seen in Chapter 2, the LFM describes the dynamics of the forward rates F_i(t) = f(t, T_i, T_{i+1}) by

\[
dF_i(t) = \mu_i(t)\,dt + \sigma_i(t)\,F_i(t)\,dW_i^{m+1}(t)
\]

for t ≤ Ti, where


\[
\mu_i(t) =
\begin{cases}
\displaystyle -\sigma_i(t)F_i(t)\sum_{j=i+1}^{m}\frac{\rho_{ij}\,\tau_j\,\sigma_j(t)F_j(t)}{1+\tau_j F_j(t)}, & \text{for } i<m,\\[2mm]
0, & \text{for } i=m,\\[2mm]
\displaystyle \sigma_i(t)F_i(t)\sum_{j=m+1}^{i}\frac{\rho_{ij}\,\tau_j\,\sigma_j(t)F_j(t)}{1+\tau_j F_j(t)}, & \text{for } i>m,
\end{cases}
\]

where σ_i(t) is the instantaneous volatility of F_i(t), and W_i^{m+1}(t) is the i-th component of an M-dimensional Q^{m+1} standard Brownian motion W^{m+1}(t) with instantaneous correlations given by

\[
d\langle W_i^{m+1}, W_j^{m+1}\rangle_t = \rho_{ij}\,dt.
\]

In this section, we introduce the parameterizations used for the instantaneous volatilities of the forward rates, σ_i(t), and for the correlations of the Brownian motions, ρ_{ij}.

1.1 Instantaneous Volatilities of the Caplets σi(t)

In our project we consider the parameterization of the instantaneous volatilities proposed by Brigo and Mercurio (2007), given by:

σi(t, θ) = (a+ b(Ti − t)) · exp(−c(Ti − t)) + d, (5.1)

where θ = [a, b, c, d] are the parameters to be calibrated based on our data. Each σ_i(t, θ) is the instantaneous volatility of the forward rate f(t, T_i, T_{i+1}).
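Equation (5.1) is a one-liner in code. The sketch below is our own illustration; the numeric θ values are the calibrated parameters θ*₁ reported later in this chapter.

```python
import math

def sigma_inst(t, T_i, theta):
    """Instantaneous caplet volatility of Eq. (5.1): theta = [a, b, c, d]."""
    a, b, c, d = theta
    x = T_i - t                      # time to the rate's expiry
    return (a + b * x) * math.exp(-c * x) + d

# Calibrated parameters theta_1^* reported in Section 2.1 of this chapter
theta = [-0.17040129, 0.03999407, 0.8479712, 0.21416224]
vol_today = sigma_inst(0.0, 5.0, theta)   # vol of the 5y forward rate seen today
```

Two structural properties are easy to read off: at expiry (t = T_i) the volatility is a + d, and far from expiry it decays to the long-run level d.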

1.2 The Correlation Matrix ρij

Inspired by Van Appel and McWalter (2018), we use the parametrization of the instantaneous correlations given in Joshi (2008). The correlation between W_i^m and W_j^m is defined by

\[
\rho_{ij} = \exp\!\left[-\frac{|i-j|}{M-1}\left(-\log\rho_\infty + \zeta\,\frac{M-i-j+1}{M-2}\right)\right], \tag{5.2}
\]

with 0 ≤ ζ ≤ −log ρ∞, and the tenor indexes i, j at most M, the number of tenors we have. In our project, we consider data for 10 years. We work with quarterly tenors, giving a value of M equal to 40.
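Building the full 40 × 40 correlation matrix from Equation (5.2) is straightforward; the sketch below (our own, using the initial-guess values ζ = 0.4 and ρ∞ = 0.2 quoted later in this chapter) fills it in:

```python
import math

def rho(i, j, M, rho_inf, zeta):
    """Instantaneous correlation of Eq. (5.2) (Joshi (2008) parameterization)."""
    return math.exp(-abs(i - j) / (M - 1)
                    * (-math.log(rho_inf) + zeta * (M - i - j + 1) / (M - 2)))

M = 40   # quarterly tenors over 10 years
corr = [[rho(i, j, M, 0.2, 0.4) for j in range(1, M + 1)]
        for i in range(1, M + 1)]
```

The matrix is symmetric with unit diagonal, and for the most distant pair (i = 1, j = M) the exponent reduces to log ρ∞, so the correlation decays exactly to ρ∞ = 0.2.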

2 Calibration in 2 Steps

The first calibration we implemented consists of two parts. The first part is the calibration of the structure of the instantaneous volatility of the forward


rates, and the second part is the calibration of the structure of the correlation matrix ρ_{ij}. The second step is always done using the optimal set of parameters found in the first step.

2.1 Calibration of σi(t)

The first step concerns the structure of the instantaneous volatility σ_i(t). We use Equation 5.1 to find approximate values for the caplet volatilities and compare them with the caplet volatilities obtained by the stripping procedure, in order to find the best values of a, b, c, d. To make our implementation computationally efficient, we make a piecewise-constant assumption for σ_i(t); this also makes our model more suitable for possible extensions based on Monte Carlo simulations. We calculate a matrix of mean squared volatilities σ_{ij}, given by

\[
\sigma^2_{ij}(\theta) := \frac{1}{\tau_j}\int_{T_j}^{T_{j+1}}\sigma^2_i(t,\theta)\,dt, \tag{5.3}
\]

where τ_j is equal to T_{j+1} − T_j. The volatility of the i-th caplet (defined on the interval [T_i, T_{i+1}]) is given by

\[
\sigma^2_i(\theta) := \frac{1}{T_i}\int_0^{T_i}\sigma^2_i(t,\theta)\,dt, \tag{5.4}
\]

where σ_i(t) is the instantaneous volatility of the forward rate f(t, T_i, T_{i+1}). The volatility of the i-th caplet can be expressed in terms of σ²_{ij} as follows:

\[
\sigma^2_i(\theta) = \frac{1}{T_i}\int_0^{T_i}\sigma^2_i(t,\theta)\,dt
= \frac{1}{T_i}\sum_{j=0}^{i-1}\int_{T_j}^{T_{j+1}}\sigma^2_i(t,\theta)\,dt
= \frac{\tau}{T_i}\sum_{j=0}^{i-1}\sigma^2_{ij}(\theta)
= \frac{1}{i}\sum_{j=0}^{i-1}\sigma^2_{ij}(\theta), \tag{5.5}
\]

where τ = 1/4, since each interval has length one quarter (so T_i = iτ).

In this way we can define the following objective function:

\[
\mathrm{obj}_1(\theta) = \sum_{i=1}^{39}\left|\sigma_i(\theta) - \sigma^{\mathrm{caplet}}_i\right|^2, \tag{5.6}
\]

where θ = [a, b, c, d], σ_i(θ) is the square root of Equation 5.5, and σ^{caplet}_i are the caplet volatilities obtained from the real data via the caplet stripping procedure. We have values for 39 caplets, from time T_1 to T_39, since we are using 10 years of quarterly data.


The best set of parameters obtained for this calibration is given by

\[
\theta^*_1 = \arg\min_\theta \sum_{i=1}^{39}\left|\sigma_i(\theta)-\sigma^{\mathrm{caplet}}_i\right|^2. \tag{5.7}
\]

We minimize the objective function iteratively using Newton–Raphson with a tolerance of 10⁻¹⁰ and initial guesses given by Van Appel and McWalter (2018):

a0 = 0.02 , b0 = 0.2 , c0 = 0.95 , d0 = 0.08. (5.8)

Figure 5.1 shows the caplet volatilities obtained by the caplet stripping method applied to the data (the stair-step function), compared with the approximate values calculated using the parameterization (5.1) with the optimal values obtained by the calibration:

θ∗1 = [−0.17040129, 0.03999407, 0.8479712, 0.21416224].

Figure 5.1: The caplet volatilities obtained from the caplet stripping method compared with the values obtained using the parameterization formula with optimal parameters.

2.2 Calibration of ρij

Once the calibration of θ = [a, b, c, d] is completed, we can proceed with the second step: the optimization of the parameters ρ∞ and ζ that define our matrix ρ_{ij}. To do that, we specify the following objective function:

\[
\mathrm{obj}_2(\zeta,\rho_\infty) = \sum_{j=1}^{14}\sum_{i=1}^{10}\left|\bar{v}_{ij}(\zeta,\rho_\infty) - v_{ij}\right|^2, \tag{5.9}
\]

where v_{ij} are the swaption implied volatilities from the SASV table (Table 3.3) with term T_i and tenor j, and \bar{v}_{ij} is the volatility computed by the Hull–White formula


approximation:

\left( v^{LNS}_{\alpha,\beta} \right)^2 \approx \frac{1}{T_\alpha} \sum_{h,j=\alpha}^{\beta-1} G_{h,j}(0)\, \rho_{hj} \int_0^{T_\alpha} \sigma_h(t)\, \sigma_j(t)\, dt ,

where G_{h,j}(t) is a function of the forward rates (see Chapter 2). Notice that we can write v_ij = v_ij(ζ, ρ_∞), since here we are assuming θ = θ∗. We start by considering the following initial guesses, given by Van Appel and McWalter (2018), for the parameters:

ζ = 0.4 , ρ∞ = 0.2. (5.10)

To find the best values for our parameters ζ and ρ_∞, we minimize the objective function using Sequential Least SQuares Programming (SLSQP)¹, an iterative method for constrained nonlinear optimization. Our constraints are that the volatilities must be positive and 0 ≤ ζ ≤ −log ρ_∞. The results obtained for this calibration are:

ζ∗_2 = 0.78479326 , ρ∗_{∞,2} = 0.24233398.
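A minimal sketch of this constrained SLSQP step is given below. For illustration only, we assume a two-parameter exponential correlation ρ_hj = ρ_∞ + (1 − ρ_∞)e^{−ζ|T_h − T_j|} and a toy stand-in for the Hull-White volatility approximation; the constraint 0 ≤ ζ ≤ −log ρ_∞ is the one stated above:

```python
import numpy as np
from scipy.optimize import minimize

def rho(zeta, rho_inf, Ti, Tj):
    """Assumed two-parameter exponential correlation (the exact full-rank form
    used in the report is given in its Chapter 2)."""
    return rho_inf + (1.0 - rho_inf) * np.exp(-zeta * abs(Ti - Tj))

def obj2(params, model_vol, market_vols):
    """Objective (5.9): squared distance between model and market swaption vols;
    model_vol(zeta, rho_inf, i, j) stands in for the Hull-White approximation."""
    zeta, rho_inf = params
    return sum((model_vol(zeta, rho_inf, i, j) - v) ** 2
               for (i, j), v in market_vols.items())

# The constraint 0 <= zeta <= -log(rho_inf), written as g(x) >= 0 for SLSQP:
cons = [{"type": "ineq", "fun": lambda x: x[0]},
        {"type": "ineq", "fun": lambda x: -np.log(x[1]) - x[0]}]

# Toy market grid and toy model, for illustration only:
market = {(i, j): 0.2 for i in range(1, 3) for j in range(1, 3)}
toy_model = lambda z, r, i, j: 0.25 * rho(z, r, float(i), float(j))

res = minimize(obj2, x0=[0.4, 0.2], args=(toy_model, market), method="SLSQP",
               constraints=cons, bounds=[(0.0, None), (1e-6, 0.999)])
```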

Figure 5.2 shows the errors of the swaption volatilities calculated using the parameters ζ∗_2 and ρ∗_{∞,2} for each term, together with the total error. The total error can be understood as the sum of the squared errors of the swaption volatilities over all terms, that is, the value of the objective function obj_2.

Figure 5.2: The sum of squared errors for the swaption volatilities in the 2nd step of the 2-step calibration, for each term.

¹All the optimization was done using the scipy.optimize Python package.


Figure 5.3 illustrates the volatility surface of the swaptions given by the data, and Figure 5.4 shows the surface of volatilities approximated from the parameterization. Notice that the plot of the approximated volatilities is smoother than the real data. This is because, once the right parameters are obtained, it is possible to approximate all the volatilities, while our data is a limited sample.

Figure 5.3: Swaption volatility surface from data.

Figure 5.4: Swaption volatility surface, approximated in the 2nd step of the 2-step calibration.


3 Calibration in 1 Step

The 1-step calibration procedure consists of optimizing all our parameters at the same time. Since this method gives more flexibility in the choice of the parameters, it is a more complex optimization problem that demands more time and care. Good initial guesses are important both for avoiding local minima and for reducing the convergence time. In this section, we present two versions of the 1-step calibration; the second version is our contribution to this project. That procedure does not require any caplet stripping, making the error independent of the particular choice of stripping technique.

3.1 1-step Basic Calibration

In the previous section, we presented two independent optimizations to estimate our parameters (θ = [a, b, c, d] and [ρ_∞, ζ]). In this section, we calibrate all the parameters at once. To do that, we consider the volatilities of the caplets and the swaptions simultaneously. We use the values θ∗_1, ζ∗_2 and ρ∗_{∞,2} obtained in the 2-step calibration as initial guesses, and we define a new objective function as the sum of the two errors:

obj_3(\theta, \zeta, \rho_\infty) = \sum_{i=1}^{39} \left| \sigma_i(\theta) - \sigma^{caplet}_i \right|^2 + w_1 \cdot \sum_{j=1}^{14} \sum_{i=1}^{10} \left| v_{ij}(\theta, \zeta, \rho_\infty) - v_{ij} \right|^2 . (5.11)

The first term is the same used in the calibration of σ_i(t); it compares the caplet volatilities. The second term represents the error obtained by comparing the swaption volatilities obtained by (2.21) with our data. Differently from the last section, v_ij depends on θ, since we are no longer assuming θ = θ∗. The value w_1 is a weight chosen empirically to balance the effects of the two errors (the caplet and swaption errors). Since the number of swaption volatilities is bigger than the number of caplet volatilities, we choose w_1 < 1. The optimization is done using the SLSQP algorithm. The results obtained for this calibration with weight w_1 = 0.7 are:

θ∗_3 = [−0.16703656, 0.00202674, 0.42493084, 0.24455927],

ζ∗_3 = 1.27419386 , ρ∗_{∞,3} = 0.01139352.
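The weighted objective can be sketched as below; the *_model callables are hypothetical stand-ins for Equation 5.5 and the Hull-White approximation, not the report's actual implementations:

```python
def obj3(theta, zeta, rho_inf, w1,
         caplet_model, caplet_data, swaption_model, swaption_data):
    """Weighted 1-step objective (5.11): caplet error plus w1 times swaption error.

    caplet_data is a list of stripped caplet vols indexed by expiry; swaption_data
    maps (term, tenor) pairs to market swaption vols."""
    caplet_err = sum((caplet_model(theta, i) - v) ** 2
                     for i, v in enumerate(caplet_data, start=1))
    swaption_err = sum((swaption_model(theta, zeta, rho_inf, i, j) - v) ** 2
                       for (i, j), v in swaption_data.items())
    return caplet_err + w1 * swaption_err
```

Choosing w1 < 1 down-weights the (more numerous) swaption terms, as discussed in the text.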

Figure 5.5 displays the error in the swaption volatilities, for each term, when we use the parameters θ∗_3, ζ∗_3 and ρ∗_{∞,3}. It also shows the total error, the sum of all the swaption volatility errors.


Figure 5.5: The sum of squared errors for the swaption volatilities in the 1-step calibration with w_1 = 0.7.

3.2 Stripping Free 1-step Calibration

All the calibrations above depend on the caplet stripping procedure shown in the previous chapter. Since there are different techniques for caplet stripping, removing the influence of this choice may give better results. In this section, we introduce the calibration method we designed during the FMTC. This calibration is independent of the caplet stripping. The objective function is specified as follows:

obj_4(\theta, \zeta, \rho_\infty) = \sum_{i=1}^{10} \left| \sigma^{cap}_i(\theta) - \sigma^{cap}_i \right|^2 + w_2 \cdot \sum_{j=1}^{14} \sum_{i=1}^{10} \left| v_{ij}(\theta, \zeta, \rho_\infty) - v_{ij} \right|^2 . (5.12)

While the second term does not change with respect to the previous calibration, the first term now considers the cap volatilities rather than the caplet volatilities. We compare the cap volatilities σ^{cap}_i obtained from the data with σ^{cap}_i(θ), the estimated value of the cap volatilities given the set of parameters. At each step k of our calibration, we have to calculate σ^{cap}_i(θ_k) as follows:

1. Calculate the caplet volatility σi(θ) for each caplet using (5.5).

2. Calculate the value of each caplet, V^{caplet}_i(σ^{caplet}_i(θ)), using the Black formula.

3. Calculate the value of the caps V^{cap}_i(σ^{cap}) using Equation 1.6. Each cap value is obtained as the sum of the values of its composing caplets, all priced with the same volatility σ^{cap}. The first caplet is discarded, since there is no interest in contracts for the first quarter.


4. Calculate the volatilities of the caps σ^{cap}_i by solving the minimization problem:

   \sigma^{cap}_i = \arg\min_{\sigma}\, \Big| V^{cap}_i(\sigma) - \sum_{j : caplet_j \in cap_i} V^{caplet}_j\big(\sigma^{caplet}_j(\theta)\big) \Big|^2 .

We apply Newton-Raphson to obtain σ^{cap}_i. We need to redo this procedure at each step of our calibration in order to evaluate the objective function. We start from θ = θ∗_2, the optimal value obtained by the 2-step calibration, and, as before, we optimize this function using the SLSQP algorithm. Notice that the optimization of objective function obj_4 has another optimization process nested inside it to find the cap volatilities, which makes this stripping-free algorithm a much more computationally demanding problem. The best parameters obtained using this calibration, for w_2 = 0.1, are:

θ∗_4 = [−0.17385352, −0.09175855, 1.28137408, 0.2190851],

ζ∗_4 = 0.82971211 , ρ∗_{∞,4} = 0.0220863.
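Steps 1-4 above hinge on inverting the Black cap price for a single flat volatility. A self-contained sketch is given below; the helper names are ours, and brentq is used as a robust stand-in for the Newton-Raphson root search described in the text:

```python
from math import log, sqrt
from scipy.stats import norm
from scipy.optimize import brentq

def black_caplet(F, K, T, tau, df, sigma):
    """Black (1976) caplet price: forward F, strike K, expiry T, accrual tau,
    discount factor df to the payment date, volatility sigma."""
    if sigma * sqrt(T) < 1e-12:
        return df * tau * max(F - K, 0.0)
    d1 = (log(F / K) + 0.5 * sigma ** 2 * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return df * tau * (F * norm.cdf(d1) - K * norm.cdf(d2))

def implied_cap_vol(caplets, target_price):
    """Flat volatility that reprices the cap as the sum of its caplets; brentq
    is a robust stand-in for the Newton-Raphson step described above.
    `caplets` is a list of (F, K, T, tau, df) tuples."""
    gap = lambda s: sum(black_caplet(F, K, T, tau, df, s)
                        for F, K, T, tau, df in caplets) - target_price
    return brentq(gap, 1e-6, 5.0)
```

Since the at-the-money cap price is strictly increasing in the flat volatility, the bracketing root search is well posed.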

Figure 5.6 shows the total error for the swaption volatilities when we use the parameters θ∗_4, ζ∗_4 and ρ∗_{∞,4}.

Figure 5.6: The sum of squared errors for the swaption volatilities in the stripping-free 1-step calibration with w_2 = 0.1.

4 Calibration Results and Comparisons

In this section, we present plots and comparative analyses of the different calibration procedures. To analyze and compare our calibrations, we define the following metrics:


1. relative error of the cap volatilities for term i:

   \frac{1}{\sigma^{cap}_i} \left| \sigma_i(\theta) - \sigma^{cap}_i \right|^2 ;

2. total relative error of the cap volatilities:

   \sum_{i=1}^{10} \frac{1}{\sigma^{cap}_i} \left| \sigma_i(\theta) - \sigma^{cap}_i \right|^2 ;

3. sum of squared errors of the swaption volatilities relative to term i:

   \sum_{j=1}^{14} \left| v_{ij}(\theta, \zeta, \rho_\infty) - v_{ij} \right|^2 ;

4. total difference of the swaption volatilities error:

   \sum_{j=1}^{14} \sum_{i=1}^{10} \left| v_{ij}(\theta, \zeta, \rho_\infty) - v_{ij} \right|^2 ;

5. relative error for the swaption volatility v_{ij}(\theta, \zeta, \rho_\infty):

   \frac{v_{ij}(\theta, \zeta, \rho_\infty) - v_{ij}}{v_{ij}(\theta, \zeta, \rho_\infty)} .
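The per-term and total error metrics can be computed directly from the volatility grids; a short sketch follows (array and function names are ours, assuming NumPy arrays of matching shape):

```python
import numpy as np

def cap_vol_relative_errors(model_cap_vols, market_cap_vols):
    """Metric 1 per term and metric 2 (total): |model - market|^2 / market."""
    per_term = np.abs(model_cap_vols - market_cap_vols) ** 2 / market_cap_vols
    return per_term, per_term.sum()

def swaption_sq_errors(model_swap_vols, market_swap_vols):
    """Metrics 3 (per term, summed over tenors) and 4 (grand total); the vol
    grids are (terms x tenors) arrays."""
    sq = (model_swap_vols - market_swap_vols) ** 2
    return sq.sum(axis=1), sq.sum()
```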

These metrics are justified since caps and swaptions are the only tradable market instruments in our procedures. Our ultimate goal when calibrating the LIBOR forward model is to better price these instruments, so it makes sense to define these metrics rather than indirect ones, such as the sum of the caplet volatility errors.

We start our analysis with Figure 5.7, which shows the sum of squared errors of the swaption volatilities relative to each term T_i obtained by each of the calibrations presented in this chapter. At the bottom of the figure, we present the total difference of the swaption volatilities. In Figure 5.8, we present the relative error of the cap volatilities for each term T_i obtained by each calibration. At the top of the figure we find the total relative error for the cap volatilities.

Observing the plots, we can see that the 2-step calibration achieves the best approximation of the cap volatilities, since it has the smallest total relative cap error. However, it underperformed remarkably on the swaption volatility error metric, with almost 10 times the error of the other calibrations. The basic 1-step calibration gives the best performance in terms of the total difference of the swaption volatility error, but it underperformed the other calibrations by almost 10 times in terms of the relative error of the cap volatilities.


Figure 5.7: Comparison of the total swaption volatility errors for the three calibrations.

Figure 5.8: Comparison of the relative cap volatility errors for the three calibrations.

To conclude, we consider the stripping-free calibration procedure developed during the FMTC competition. It attains promising and balanced results. The total relative error of the cap volatilities of the stripping-free 1-step calibration has the same order as the one obtained by the 2-step optimization, which gives the best performance on this metric. Furthermore, the total difference of the swaption volatilities for the stripping-free technique has the same order as the one obtained by the 1-step stripping-based procedure, making them the best performing procedures on this metric. In particular, the stripping-free 1-step procedure achieves a total relative error of the cap volatilities much smaller than the one obtained by the basic 1-step procedure.

We have shown that the stripping-free 1-step calibration inherits the best features of the 2-step calibration and of the 1-step calibration based on caplet stripping. Indeed, it reproduces results similar to the 2-step and 1-step stripping-based calibrations in terms of, respectively, the total relative error of the cap volatilities and the total difference of the swaption volatilities.

For the sake of completeness, Figures 5.9, 5.10 and 5.11 show the surfaces of relative errors for the swaption volatilities related to the 2-step calibration, the 1-step calibration and the stripping-free 1-step calibration, respectively. Figure 5.12 shows the caplet approximations for each calibration.

Figure 5.9: Relative swaption error surface for the 2-step calibration.


Figure 5.10: Relative swaption error surface for the 1-step calibration.

Figure 5.11: Relative swaption error surface for the stripping-free 1-step calibration.


Figure 5.12: Caplet approximations for each calibration.


Conclusion

In our project we studied the lognormal Forward LIBOR Model (LFM). Assuming a lognormal distribution of the forward rates, this model allows practitioners to price caps/floors using Black's formula. However, an important drawback of this model is its theoretical incompatibility with the lognormal Forward Swap Model, which makes Black's formula inapplicable to pricing swaptions. Numerical evidence presented in Van Appel and McWalter (2018) and Brigo and Mercurio (2007) shows that, even if the distribution of the forward swap rate is not lognormal under the LFM, it may be well approximated by a lognormal distribution. Allowing this approximation, Black's formula can be used to price swaptions under the LFM. In particular, we approximate the forward swap rate volatilities using the formula presented by Hull and White (2000) and derived independently by Andersen and Andreasen (2000).

In the LFM, the dynamics of the forward LIBOR rates are represented by a system of stochastic differential equations in which the volatilities and the instantaneous correlations of the Brownian motions follow given structures. To use this model in practice, it is essential to calibrate the parameters of these structures using market data. In particular, our data are the Historical Nominal Swap Curve, the Historical Caps/Floors Volatilities and the Historical At-The-Money Swaption Volatility Surfaces. Starting from these data, it was possible to generate the forward rates and the at-the-money strikes for the caps, which are necessary to proceed with the caplet stripping (see White and Iwashita (2014)). This procedure allows the derivation of the implied volatilities of the caplets from the volatilities of the caps. The technique is necessary since caplet prices are not observed in the market.

Once these preliminary steps were accomplished, we were able to focus on different calibration procedures.
We considered the parametrization of the volatilities of the forward rates suggested by Brigo and Mercurio (2007) and the full-rank parametrization, proposed by Joshi (2008), of the instantaneous correlation of the Brownian motions defining the dynamics of the forward rates. Following Van Appel and McWalter (2018), we consider the Hull-White formula for the approximation of the swaption volatilities (Hull and White (2000)).

The first approach treats separately the parameters governing the volatility structures and the ones relative to the instantaneous correlation matrix. First, we


obtain the four parameters of the volatility curves by an optimization procedure based on the caplet volatilities resulting from the caplet stripping. We approximate the volatilities of the caplets by an integral of the parametrization function and minimize the squared difference between them and the values from the caplet stripping. In the second step, the two parameters defining the instantaneous correlation were deduced by an optimization procedure on the swaption volatilities. We used the four parameters obtained in the first optimization, together with the Hull-White formula, to approximate the swaption volatilities; then we minimized the squared difference between them and the values from our data.

Using the results of the first calibration as an initial guess, we defined a second calibration procedure which considered all six parameters at once. In particular, the objective function to optimize consists of two terms: the first is obtained by summing the squared errors relative to the caplet volatilities and the second by summing the squared errors relative to the swaption volatilities. The results obtained with this procedure are finally used as the initial guess for the last calibration procedure. This last procedure is our personal contribution to the problem. It is similar to the second one in considering all the parameters at once, but it differs in the first term of the objective function. Indeed, the sum of the squared errors relative to the caplet volatilities is replaced by the sum of the squared errors relative to the cap volatilities. This represents a remarkable difference from the previous one, since it means the calibration procedure is unaffected by the error introduced by the technique chosen for the caplet stripping.

We define five different metrics to compare all the calibration procedures implemented in this project. They are based on cap and swaption volatilities, since our goal is to better price these instruments.
The first approach, the 2-step calibration, shows the best result for the approximation of the cap volatilities, since it presents the smallest total relative cap error. However, this calibration presents the biggest error when approximating the swaption volatilities. The basic 1-step calibration, which is based on caplet volatilities, presents the best results when approximating the swaption volatilities, but has the biggest relative error when approximating the cap volatilities.

To conclude, we presented the results of the stripping-free 1-step calibration developed during the FMTC competition. It shows promising and balanced results. The error obtained by approximating the cap volatilities has the same order as the error obtained by the 2-step calibration, which is the best performing in this regard. Regarding the swaption volatilities, it shows results similar to those obtained by the basic 1-step calibration, which is the best performing on this side. The stripping-free procedure thus inherits the best features from the 2-step and the basic 1-step calibrations, which makes it the most efficient in terms of both the cap and swaption volatility approximations. These results suggest that the stripping-free calibration may be a good alternative and encourage a more careful study. In particular, we should further investigate the effect of the balance between the cap volatilities and


the swaption volatilities in the objective function. In the future, it would be interesting to analyze more dates in order to study the behavior of this procedure over time.


Bibliography

Andersen, L., Andreasen, J., 2000. Volatility skews and extensions of the LIBOR market model. Applied Mathematical Finance 7 (1), 1–32.

Black, F., 1976. The pricing of commodity contracts. Journal of Financial Economics 3 (1–2), 167–179.

Brigo, D., Mercurio, F., 2007. Interest Rate Models - Theory and Practice: With Smile, Inflation and Credit, 2nd Edition. Springer Finance. Springer.

Hull, J., White, A., 2000. Forward rate volatilities, swap rate volatilities, and the implementation of the LIBOR market model. Journal of Fixed Income 10, 46–62.

Jäckel, P., Rebonato, R., 2003. The link between caplet and swaption volatilities in a Brace-Gatarek-Musiela/Jamshidian framework: Approximate solutions and empirical evidence. Journal of Computational Finance 6 (4), 41–59.

Joshi, M. S., 2008. The Concepts and Practice of Mathematical Finance, 2nd Edition.Cambridge University Press.

Rebonato, R., 1998. Interest-Rate Option Models, 2nd Edition. Wiley Series in Financial Engineering. Wiley.

Van Appel, J., McWalter, T. A., 2018. Efficient long-dated swaption volatility approximation in the forward-LIBOR model. International Journal of Theoretical and Applied Finance 21 (4), 1850020:1–26.

White, R., Iwashita, Y., 2014. Eight ways to strip your caplets: An introduction to caplet stripping. OpenGamma Quantitative Research.
