MARKOV PROCESS REGRESSION A DISSERTATION SUBMITTED TO THE DEPARTMENT OF MANAGEMENT SCIENCE AND ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Michael G. Traverso June 2014


MARKOV PROCESS REGRESSION

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF MANAGEMENT

SCIENCE AND ENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Michael G. Traverso

June 2014


http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/jp891mj8064

© 2014 by Michael Gary Traverso. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.



I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ronald Howard, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Samuel Chiu

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ross Shachter

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.




Abstract

Regression analysis, the process of estimating the relationship between some dependent and independent variables from empirical data, is widely used in many fields, including medicine, economics, and machine learning. While many different approaches to regression exist, two important distinctions are whether it is assumed that the relationship has a specific parametric form (e.g., linear) and whether the resultant prediction is deterministic (e.g., when minimizing the sum of squares) or probabilistic. When no specific parametric form is assumed and the prediction is probabilistic, the regression is referred to as nonparametric and Bayesian, respectively.

In this dissertation, a broad family of nonparametric Bayesian regression models is introduced, where the prior is assumed to be a Markov process. In comparison to the more common Gaussian process prior, this choice has several advantages, such as performance that is not dependent on the form of the likelihood functions and the ability to enforce monotonicity of a relationship a priori.

A primary contribution of this dissertation is an algorithm for efficiently updating Markov process priors from experimental data. While this algorithm is based on the forward-backward algorithm for discrete hidden Markov models, extension to the continuous case is not trivial. First, since the transition densities for Markov processes cannot be calculated analytically, simulation methods are used instead. Second, existing simulation methods can be extremely inefficient for calculating these updated densities. To circumvent this issue, a new simulation technique is developed that significantly improves the efficiency of these calculations.

Lastly, some benefits that Markov process regression has over existing regression methods are illustrated through applied problems. In particular, applications in medicine, consumer science, and manufacturing are presented.


Acknowledgements

I owe thanks to many people for support throughout my doctoral degree. My advisor, Ron Howard, has been invaluable. He has given me space to choose my own research direction while still providing big-picture insight, such as ways to make this dissertation more accessible to an audience. I am grateful as well to Ross Shachter for teaching some of the concepts that led to this dissertation and for helping me navigate the logistics of completing a PhD. Sam Chiu participated in helpful conversations regarding my thesis as well as sitting on my committee, and the staff of the MS&E department, in particular Lori Cottle, have been excellent resources as well. I would also like to express much gratitude to Ali Abbas, without whom I would likely never have gone to graduate school.

I would like to acknowledge Marc Melcher for lending his expertise regarding the kidney transplant application as well as for sitting on my defense committee, and Jaydeep Karandikar for his assistance with the machining application.

I owe many thanks to the students in the decision analysis unit for many useful conversations as well as the camaraderie. Owen Liu, Noah Burbank, Lai Mudchanatongsuk, and Brad Powley stand out for their friendship and insight.

Lastly, I would like to thank my parents and Allison Zincke-Robles for their support throughout these years.


Contents

Abstract ........................................................................................................................ iv

Acknowledgements ...................................................................................................... vi

1 Introduction ............................................................................................................ 1

2 Existing Regression Methods ................................................................................. 5

2.1 Classical and Bayesian Parametric Regression ................................................... 6

2.2 Classical Nonparametric Regression .................................................................. 10

2.3 Bayesian Nonparametric Regression .................................................................. 12

3 Mathematical Preliminaries ................................................................................ 18

3.1 Markov Chains ....................................................................................................... 19

3.2 Markov Processes .................................................................................................. 20

3.2.1 Standard Brownian Motion .................................................................... 20

3.2.2 One Dimensional Diffusion Processes ................................................. 21

3.2.3 Multidimensional Diffusion Processes ................................................. 22

3.2.4 Jump Diffusion Processes ....................................................................... 22


3.2.5 Killed Diffusion Processes ..................................................................... 23

3.3 Transition Probabilities for Markov Processes ................................................. 23

3.4 Hidden Markov Models and the Forward-Backward Algorithm .................... 27

4 Mathematical Development for Markov Process Regression .......................... 30

4.1 Inference via Brute Force Simulation ................................................................. 32

4.2 Forward-Backward Algorithm when the Transition Density is known ......... 33

4.3 Intuition behind Adjoint Processes ..................................................................... 36

4.4 Formal Construction of Adjoint Processes ........................................................ 38

4.4.1 Adjoint Chains for Discrete Time Markov Chains ............................. 40

4.4.2 Adjoint Chains for Continuous Time Markov Chains ........................ 43

4.4.3 Adjoint Processes for Diffusions ........................................................... 45

4.4.4 Adjoint Processes for Jump Diffusions ................................................ 48

4.5 Inference via Particle Filtering ............................................................................ 51

5 Choosing a Suitable Prior .................................................................................... 54

5.1 One Dimensional Diffusion Process Priors ....................................................... 54

5.2 Modeling Smoothness, Monotonicity, etc. ........................................................ 57

5.3 Modeling Discontinuities and Cusps .................................................................. 61

5.4 Curves with Multiple Dependent Variables ....................................................... 63

6 Applications ............................................................................................................ 66

6.1 Estimating Wait Times in Pooled Kidney Exchanges ...................................... 67

6.1.1 Introduction to Paired and Pooled Kidney Donation .......................... 68

6.1.2 Modeling the Relationship between Compatibility and Wait Time . 71

6.1.3 Methodology for Estimating Wait Times under each Model ............ 73


6.1.4 Cross-validation Methodology and Results ......................................... 75

6.2 Estimating Price-Demand Curves ....................................................................... 79

6.2.1 Input Data ................................................................................................. 81

6.2.2 Modeling the Effect of Price on Demand ............................................. 81

6.2.3 Experimental Methodology .................................................................... 86

6.2.4 Results and Analysis ............................................................................... 87

6.3 Stability Limit Prediction in Milling Operations .............................................. 91

6.3.1 Introduction to Stability in Machining .................................................. 92

6.3.2 Modeling the Stability Limit .................................................................. 93

6.3.3 Design of Experiments ............................................................................ 96

6.3.4 Results and Analysis ............................................................................. 101

7 Conclusions ......................................................................................................... 108

Bibliography .............................................................................................................. 112


List of Tables

6.1. Summary statistics from the validation testing for demand estimation ......... 91

6.2. Input parameters for the cost calculations and reference stability limit ......... 97

6.3. Input parameters for the Markov model for stability limit prediction .......... 102


List of Figures

2.1. Classical parametric regression for the Ford Escort dataset ............................. 7

2.2. Bayesian parametric regression for the Ford Escort dataset ............................ 9

2.3. Classical nonparametric regression for the Ford Escort dataset ..................... 11

2.4. Gaussian process regression for the Ford Escort dataset ................................ 14

2.5. Markov process regression for the Ford Escort dataset .................................. 14

3.1. Sample paths of standard Brownian Motion .................................................. 21

3.2. Bayesian network for a hidden Markov model. ............................................. 28

5.1. Example of a one dimensional Markov process prior .................................... 57

5.2. Sample paths of a geometric Brownian motion and its integral ..................... 59

5.3. Sample paths of a Brownian motion with jumps and its integral. .................. 63

6.1. Predicted transplant waiting times under each model .................................... 78

6.2. Validation results for transplant waiting times ............................................... 78

6.3. Implications for prospective transplant recipients .......................................... 79

6.4. Sample paths for demand prediction under the Markov process model. ....... 83

6.5. Sample paths for demand prediction under the Gaussian process model. ..... 84


6.6. Comparison of trained and untrained Gaussian process priors ...................... 85

6.7. Comparison of updated Markov and Gaussian predictions ............................ 88

6.8. Comparison of updated trained and untrained Gaussian predictions ............. 89

6.9. Example of a stability limit ............................................................................ 92

6.10. Tool path for milling the pocket feature ........................................................ 98

6.11. Cost of machining as a function of spindle speed and axial depth ................. 98

6.12. Reference stability limit. ............................................................................... 99

6.13. Prior sample paths of the third order Markov process model ..................... 102

6.14. Prior sample paths of the Brownian motion model ..................................... 103

6.15. Results of value based design of experiments ............................................. 104

6.16. Results of statistical design of experiments. ................................................ 105


Chapter 1

Introduction

Regression analysis, the process of estimating how a number of independent variables affect one or more dependent variables from possibly noisy measurements, is widely used in many fields, including manufacturing, medicine, economics, and machine learning. While many different approaches to regression exist, two important distinctions are whether or not it is assumed that the relationship between the variables has a specific form (e.g., linear) and whether the resultant prediction is deterministic (e.g., when minimizing the sum of squares) or probabilistic.

While models that assume a specific parametric form for the relationship and produce a deterministic prediction are the easiest to work with mathematically, there can be advantages to nonparametric regression models and to those which produce probabilistic predictions, which are also referred to as Bayesian models. In particular, nonparametric models may be preferred when the underlying relationship is either complicated or not yet understood, since in these cases it may be difficult to choose a suitable parametric model. Bayesian models are useful because they allow one to make a decision based on all possible choices of the underlying relationship and not just the most likely relationship. In addition, Bayesian models can be used to optimize design of experiments directly in terms of the expected value (such as profit) added. This is not possible with non-Bayesian models.

While there are a wide variety of regression models which are either both nonparametric and non-Bayesian or both parametric and Bayesian, there are considerably fewer nonparametric Bayesian regression models. One commonly used Bayesian nonparametric model is the family of Gaussian processes. While in certain circumstances Gaussian processes have attractive computational properties, the implicit assumptions may sometimes be disadvantageous. For example, the relationship may be known beforehand to be monotonic, convex, etc. These properties cannot be modeled using a Gaussian process. In addition, the computational properties of Gaussian processes are most advantageous only when the likelihood functions, which describe what is learned from an experiment, take certain forms. If the likelihood function cannot be well approximated by one of these choices, the benefits of using a Gaussian process model are greatly diminished.

This dissertation introduces a broad alternative family of nonparametric Bayesian regression models. In particular, this family assumes that the prior over possible curves is described by a Markov process as opposed to a Gaussian process. Although Markov process models are generally not analytically tractable, the resultant predictions can be calculated efficiently via simulation using extensions of existing algorithms for discrete hidden Markov models. Substantially greater modeling flexibility can be achieved with this approach, including enforcing a priori assumptions that a given relationship is monotonic, convex, etc. In addition, the computational requirements of this model are not dependent on the likelihood functions, which can be advantageous depending on what is learned through experimentation. While Markov process models will not be the best choice for every problem, their properties are advantageous over existing approaches in a variety of circumstances.

The remainder of this dissertation is structured as follows. Chapter 2 discusses many existing methods of regression, how they relate to each other, and how they relate to Markov process regression. Chapter 3 introduces concepts from the existing theory of Markov chains and processes needed to derive the algorithms for Markov process regression that follow. Chapter 4 details the primary mathematical contributions of this work; in particular, an algorithm is presented for efficiently calculating updated predictions when the prior is given by a Markov process. This algorithm is based on the forward-backward algorithm for discrete Markov chains. Chapter 5 discusses the input parameters for a Markov model and how these parameters can be chosen to model various behaviors such as monotonicity and cusps. This chapter is less technical than Chapter 4 and is intended for a broader audience, including those interested in applying the algorithm without needing to understand its derivation. Lastly, Chapter 6 demonstrates applications of Markov process regression in medicine, consumer science, and manufacturing. In particular, the focus is on quantifying the value that can be gained by using a Markov model as opposed to existing approaches such as parametric regression or Gaussian process regression. The substantial modeling flexibility afforded by Markov process regression is also demonstrated.


Chapter 2

Existing Regression Methods

In this chapter, a variety of methods for predicting relationships, which are described by curves when there is a single independent variable, are discussed. These methods can be classified by whether they are classical, meaning they predict a single most likely curve, or Bayesian, meaning a probability distribution over all possible curves is produced. Methods can also be classified by whether or not the curve is assumed a priori to belong to some parametric family. Special focus is placed on Gaussian process regression, as it is both Bayesian and nonparametric, as is the method presented in the remainder of this work.

To illustrate the differences between these various regression methods, a simple example is used. In particular, consider the problem of predicting the relationship between mileage on a used car and its selling price. In any regression, although often not explicitly stated, the resultant curve is not intended to represent a deterministic value of the output variable, e.g., selling price. Instead, the curve represents a parameter of the distribution describing the output variable, in this case the mean selling price as a function of mileage. A publicly available data set which gives the selling price for 1997 Ford Escorts that were sold in 1999 in a certain geographic location is used in this example (Hope College, 1999).

2.1 Classical and Bayesian Parametric Regression

Various forms of parametric regression have been in use since the invention of ordinary least squares in the early 1800s. In parametric regression, the curve being predicted is assumed a priori to belong to a parametric family (e.g., linear, exponential, polynomial). In classical (non-Bayesian) parametric regression, the goal is then to find which curve in that family best explains the observed data. This is done by minimizing, over the parameters of the family, some scoring function, such as the sum of the squared differences between the observed data and the values predicted under the parametric model. The optimal parameters can then be found either analytically or using other optimization techniques.
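As a concrete sketch of this procedure, a line can be fit by least squares in closed form via the normal equations. The (mileage, price) values below are invented for illustration only; the actual Ford Escort data set is not reproduced here.

```python
import numpy as np

# Hypothetical (mileage, price) observations -- NOT the actual Ford Escort data.
x = np.array([5.0, 20.0, 35.0, 60.0, 80.0])   # mileage (thousands of miles)
y = np.array([9.0, 8.2, 7.1, 5.9, 4.8])       # selling price (thousands of $)

# Ordinary least squares: minimize the sum of squared residuals over
# (intercept, slope). lstsq solves the normal equations for us.
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
```

The same coefficients could be obtained analytically from the closed-form normal-equations solution; `lstsq` simply performs that minimization numerically.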

The prediction under an ordinary least squares regression for the Ford Escort data set is shown in Figure 2.1. As can be seen, the linearity assumption is likely not representative of the true relationship between mileage and mean selling price. This is one possible drawback of parametric regression.


Figure 2.1. Prediction for the mean selling price for 1997 Ford Escorts sold in 1999 using ordinary least squares regression. The linearity assumption is likely not representative of the true relationship between mileage and mean selling price.

Parametric regression can also be performed within the context of Bayesian inference. In Bayesian parametric regression, the goal is to assign a probability to each curve in the parametric family as opposed to simply finding the most likely curve. Classical parametric regression can thus be thought of as a special case of Bayesian parametric regression in which one is only interested in the mode of the posterior probability distribution.

A Bayesian regression model is specified by the choice of parametric family for the curve being predicted, a prior probability distribution over the parameters of this family, and a likelihood function which scores each choice of parameters based on the observed data. One commonly used likelihood function assumes that the observed data are normally distributed around the curve; the mode of such a regression then minimizes the sum of squared differences, as is common in classical parametric regression. However, in many cases a Gaussian likelihood may not be suitable, and in general the likelihood functions should be chosen to best reflect what information is gained from the experimental data.

For certain choices of model, the updated probability distribution over the parameters given the observed data can be calculated analytically, but in many cases it must be calculated numerically or by simulation. While not the most computationally efficient method, a simple, general, and illustrative way to calculate the updated distribution by simulation is the following. Take N samples from the prior distribution over parameters; for each, generate the resultant curve, which will be referred to as a sample path, and assign each of these curves a weight of 1/N. For each sample path, calculate the likelihood of the observed data and multiply this by its prior weight to obtain an updated weight. Lastly, renormalize these updated weights and build a weighted histogram over the parameter space to approximate the updated distribution.
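The simulation procedure just described is, in effect, importance sampling. A minimal sketch for a hypothetical one-parameter model y = a·x with a Gaussian likelihood follows; the data, prior, and noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and model: y = a*x + noise, prior a ~ Normal(0, 1).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 2.1, 2.9])
sigma = 0.5                                # assumed noise std dev in the likelihood

n = 10_000
a = rng.normal(0.0, 1.0, size=n)           # N samples from the prior over the parameter
w = np.full(n, 1.0 / n)                    # equal prior weights of 1/N

# Multiply each prior weight by the likelihood of the observed data.
# Constant factors of the Gaussian density cancel after renormalization,
# and subtracting the max log-likelihood avoids numerical underflow.
resid = y[None, :] - a[:, None] * x[None, :]
log_lik = -0.5 * np.sum((resid / sigma) ** 2, axis=1)
w = w * np.exp(log_lik - log_lik.max())

w /= w.sum()                               # renormalize the updated weights
posterior_mean = np.sum(w * a)             # one summary of the weighted approximation
```

A weighted histogram of `a` with weights `w` would approximate the full updated distribution over the parameter.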

The prediction under a Bayesian linear regression for the Ford Escort data set is shown in Figure 2.2, where the likelihood was assumed to be normally distributed. As in the classical parametric case, the prediction for the mean parameter is shown, as opposed to the predicted distribution of selling price. One benefit of Bayesian regression is that it captures how much uncertainty there is in the prediction of the mean selling price for a given mileage. In addition, value of information calculations can be made with Bayesian models but not with classical models. For example, suppose that our decision maker can join a website for $100 that would give him historical selling prices for another ten 1997 Ford Escorts. The question of whether or not this information is worth this price is straightforward to answer with a Bayesian model but cannot be answered with a classical model.

An additional weakness of parametric models is evident in the Bayesian setting. A reasonable expectation would be that there should be less uncertainty about the curve at mileages where a lot of data is available. This is often not the case for parametric models. For example, if there were twenty data points split evenly between 0 miles and 60,000 miles, the least amount of uncertainty would occur near 30,000 miles. This is because uncertainty is captured as a distribution over the parameters of the model, and the behavior of sample paths is determined globally by these parameters. This limits the ways in which uncertainty can vary locally with mileage.

Figure 2.2. Prediction for the mean selling price for 1997 Ford Escorts sold in 1999 using Bayesian linear regression. The median and the 5%, 25%, 75%, and 95% quantiles are shown. A Bayesian model captures how much uncertainty there is in the prediction of mean selling price for a given mileage. In addition, value of information calculations can be performed.


2.2 Classical Nonparametric Regression

In parametric regression it is assumed that the curve of interest belongs to a family parameterized by a finite number of variables. In many cases, however, one may not have a good guess of which parametric form would reasonably approximate this curve. In addition, no parametric family can approximate all curves with an arbitrary level of accuracy. This may be problematic if a highly accurate prediction is desired. When it is not assumed that the curve belongs to some known parametric family, the regression is referred to as nonparametric. This terminology, however, does not mean that a nonparametric model has no input parameters.

There are many ways to construct a nonparametric regression; three common approaches are discussed here. In the first, referred to as kernel regression, the estimated curve at each point is taken as a weighted average of all the data points, where data closer to the point in question are assigned higher weights than those far away. Sometimes nonlinear schemes are also used. In a second approach, termed regularization, a scoring function such as the sum of squared differences is minimized, but with an additional penalty term to discourage unwanted behavior. For example, to dampen oscillatory behavior, a penalty term proportional to the average of the squared second derivative of the curve may be added. Splines result from regularization-based approaches. A third approach is to treat the curve of interest as an expansion over an orthonormal basis (e.g., a Fourier expansion) and then to truncate this expansion via some scoring rule.
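The first approach can be sketched concretely as a Nadaraya-Watson estimator with Gaussian weights. The data below are synthetic and the bandwidth is chosen by hand, purely for illustration.

```python
import numpy as np

def nadaraya_watson(x_query, x_data, y_data, bandwidth):
    """Kernel regression: a weighted average of the data, with Gaussian
    weights that decay with distance from each query point."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_data[None, :]) / bandwidth) ** 2)
    return (w @ y_data) / w.sum(axis=1)

# Synthetic noisy samples of a smooth underlying curve.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.size)

# Estimate the curve on a grid of query points.
grid = np.linspace(0.5, 5.5, 11)
y_hat = nadaraya_watson(grid, x, y, bandwidth=0.3)
```

The bandwidth controls the bias-variance trade-off: smaller values track the data more closely, larger values smooth more aggressively.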


The prediction under a classical nonparametric regression for the Ford Escort data set is shown in Figure 2.3. In particular, a smoothing spline was used, which results from the regularization approach with a penalty term proportional to the average of the squared second derivative. In this case, a reasonable assumption is that mean selling price should decrease monotonically with mileage. Monotonicity is not enforced by this penalty term, but it could be by instead using, for example, a penalty based on how much the logarithm of the derivative changes. As in the parametric case, the primary disadvantage of using classical regression is that only a single most likely curve is predicted, as opposed to assigning a probability distribution over all possible curves. Thus uncertainty in the prediction is not captured and value of information cannot be calculated.

Figure 2.3. Prediction for the mean selling price for 1997 Ford Escorts sold in 1999 using a smoothing spline.

This type of regression is nonparametric. Intuitively, one would expect the mean selling price to monotonically

decrease with mileage, but this is not enforced with this model.


2.3 Bayesian Nonparametric Regression

One approach to Bayesian nonparametric regression is Gaussian process regression,

which has gained widespread use in fields ranging from geostatistics to machine

learning. The theoretical construction for Gaussian process regression is also quite

similar to that which follows in the remainder of this work for Markov process

regression. In particular, the primary idea is that a stochastic process implicitly

defines a probability distribution over a space of curves, such as all continuous curves.

In Gaussian process regression, this stochastic process is assumed to be a Gaussian

process, which has the property that any finite set of points selected on the curve is jointly distributed as a multivariate Gaussian. Thus, a Gaussian process prior is specified by its

mean at each point, and its covariance between each pair of points.

A primary benefit of the Gaussian assumption is that if the likelihoods are Gaussian, the updated prediction is again a Gaussian process and can be calculated analytically. Given any number of observations with Gaussian likelihoods, one can calculate the updated joint distribution, i.e., the updated mean vector and covariance matrix, for any number of points of interest using the standard conditioning formulae for multivariate normal distributions. While some other likelihoods can be used, these usually require some type of approximation method or simulation such as expectation propagation, Laplace approximations (Rue et al., 2009), variational Bayes, or Markov chain Monte Carlo (Diggle et al., 1998).
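A minimal sketch of this conditioning step, assuming a zero prior mean and a squared-exponential covariance (both illustrative choices, not prescribed by the text):

```python
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential covariance between the point sets a and b."""
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=0.1):
    """Updated mean and covariance at x_new via the standard conditioning
    formulae for multivariate normals (zero prior mean assumed)."""
    K = rbf(x_obs, x_obs) + noise ** 2 * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_new)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    cov = rbf(x_new, x_new) - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

x_obs = np.array([0.0, 1.0])
y_obs = np.array([1.0, 2.0])
mean, cov = gp_posterior(x_obs, y_obs, np.array([0.0, 10.0]), noise=1e-3)
```

Near an observation the posterior mean interpolates the data and the variance collapses; far from all observations the prediction reverts to the prior mean and variance, which is the behavior shown in Figure 2.4.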

Gaussian process regression is closely related to the three approaches to classical

nonparametric regression described in Section 2.2, under particular linearity and


Gaussianity assumptions (Rasmussen and Williams, 2006). As the conditional mean

of a multivariate Gaussian random variable is linear in the variables conditioned on,

the updated mean function of a Gaussian process is a weighted sum of the observed

variables as in linear kernel regression. In addition, a Gaussian process is precisely

one for which the coefficients in an orthonormal basis expansion are jointly normally

distributed. Lastly, Gaussian processes are closely related to the regularization

approach where the penalty term is a linear combination of the integral (with respect

to some measure) of the function squared, the integral of its first derivative squared,

etc. These relationships provide insight into what can and cannot be modeled using a

Gaussian process.

The prediction under a Gaussian process regression for the Ford Escort data set is shown in Figure 2.4. The particular process used is closely related to the smoothing spline shown in Figure 2.3. Unlike in the parametric case, the uncertainty is greatest in areas where there are no observations, as should be expected. A downside of this approach is that the predicted mean sales price as a function of mileage is non-monotonic. In general, no Gaussian process can be used to enforce monotonicity or restrictions on higher order derivatives such as convexity. The prediction under a Markov process regression, which is the focus of the remainder of this thesis, is shown in Figure 2.5. This particular process enforces monotonicity. The additional knowledge of monotonicity also leads to less uncertainty in the prediction, such as near 60,000 miles.


Figure 2.4. Prediction for the mean selling price for 1997 Ford Escorts sold in 1999 using Gaussian process

regression. The median, 5%, 25%, 75% and 95% quantiles are shown. Unlike in the parametric case, uncertainty

in the prediction is greater in areas where there are no observations. Monotonicity in the prediction cannot be

enforced for any Gaussian process.

Figure 2.5. Prediction for the mean selling price for 1997 Ford Escorts sold in 1999 using Markov process

regression. The median, 5%, 25%, 75% and 95% quantiles are shown. This particular choice of Markov process

enforces monotonicity of the relationship. Knowledge of monotonicity also reduces uncertainty in the prediction,

such as near 60,000 miles.


When compared to Markov process regression, a primary benefit of a Gaussian

process model is its analytic tractability for Gaussian likelihoods. This can lead to

quicker computation times as Markov process regression requires calculations to be

performed by simulation instead of analytically. In addition, under a Gaussian process

model, one can calculate an updated joint distribution for any number of points while

for the Markov process model only marginal distributions can be calculated. Lastly,

while Gaussian process regression can be used to predict higher dimensional objects

such as surfaces, the algorithms used to update a Markov process can only be

implemented when a single independent variable is involved (although multiple

dependent variables are allowed).

Thus, if a given application can be reasonably approximated using a Gaussian

process model, it is unlikely that there would be a benefit to using a Markov process

approach. However, just as many phenomena are not well represented by Gaussian

random variables, many real world experiments are not well approximated by

Gaussian likelihoods and prior knowledge about a curve may not be well represented

by a Gaussian process. As illustrated above, a Gaussian process model is not

appropriate if one wishes to assume that the underlying relationship is monotonic,

convex, etc. In addition, the likelihoods may be poorly represented by a normal

distribution. As a simple example, one may learn from an experiment that a curve is

above or below a given point. In this case, the likelihood would be an indicator

function which is poorly approximated by a Gaussian. The updated prediction would

likely need to be calculated using computationally expensive Markov chain Monte

Carlo methods. In addition, Gaussian processes act as linear smoothers, meaning that


if all the observed data points are shifted and scaled, the updated distribution is shifted

and scaled by the same amount. This assumption is quite restrictive and can be

unfavorable when one has prior information which is inhomogeneous in the dependent

and independent variables. Lastly, Markov process priors can be used to model

relationships with cusps or other singularities. These cannot be modeled with

Gaussian process priors.

It should be noted that a number of other approaches to enforcing monotonicity in

Bayesian nonparametric regression have been proposed. These approaches roughly

fall into two categories. In the first, one assumes that the curve of interest is piecewise

constant or a piecewise polynomial. Thus, the prediction is described by a distribution

over the number of piecewise segments, the location where each segment ends, and

the polynomial coefficients for each segment. Monotonicity can then be enforced by

putting certain constraints on the polynomial coefficients of each segment (Brezger

and Steiner, 2008). In the second approach, the derivative of the logarithm of the

curve is assumed to follow a Brownian motion (Shively et al., 2009). This approach is

in fact one specific instance of Markov process regression. To the author’s

knowledge, all previously proposed Bayesian nonparametric regression methods

which enforce monotonicity have relied on Markov chain Monte Carlo methods, which are significantly more computationally expensive than the algorithms described in this thesis because they require simulating a high dimensional joint distribution.

The remainder of this work focuses on Markov process regression. As mentioned,

analytic solutions do not generally exist and instead the updated distributions must be

calculated through simulation. This approach allows for the use of arbitrary likelihood


functions, can be used to enforce monotonicity and other constraints on derivatives,

and has more flexibility in modeling inhomogeneous prior information than Gaussian

process regression.


Chapter 3

Mathematical Preliminaries

In this chapter, some mathematical preliminaries regarding Markov chains and

processes are presented. The results from existing theory presented are built upon in

Chapter 4 to produce algorithms for efficiently performing regression when the prior

distribution over paths is characterized by a Markov process.

In particular, Section 3.1 presents an overview of the simpler but analogous case of

(discrete state) Markov chains. Section 3.2 presents an introduction to various classes

of Markov processes and Section 3.3 presents some well-known results on the

transition density of a Markov process which are used heavily in Chapter 4. Section

3.4 presents an algorithm for calculating updated marginal distributions of discrete

time, discrete state Markov chains. This algorithm is adapted to continuous time,

continuous state Markov processes in Chapter 4 where it is used to perform Bayesian

nonparametric regression.


3.1 Markov Chains

Throughout this thesis, continuous state, continuous time Markov processes are used

extensively. As a preliminary to these, consider first the simpler case of discrete time,

discrete state Markov chains.

A discrete time Markov chain is a sequence of random states where the probability

of being in each state in the next time step, given the present and all past states,

depends only on the present state. Consequently, a Markov chain defined over $m$ states can be specified through an $m$-by-$m$ matrix $P$, where $P_{ij}$ is the probability that the Markov chain will be in state $j$ in the next time step given that it is currently in state $i$. The matrix $P$ is referred to as the transition matrix, and a number of calculations regarding the Markov chain involve matrix multiplication by $P$. If the current distribution of states is given by a probability (row) vector $\pi$, where $\pi_i$ is the probability that the chain is currently in state $i$, the probability of being in state $j$ in $n$ time steps is given by $[\pi P^n]_j$. Right multiplication by $P$ can be used to calculate the expectation of a functional at some later time. For example, suppose you get a reward of $r_j$ if you are at state $j$ in $n$ time steps from now. Your expected reward given that you are currently in state $i$ is then given by $[P^n r]_i$.
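Both interpretations of multiplication by the transition matrix can be sketched numerically; the two-state matrix and reward vector below are illustrative values, not from the text:

```python
import numpy as np

# Transition matrix: P[i, j] = p(next state j | current state i)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

pi0 = np.array([1.0, 0.0])        # currently in state 0 with certainty

# Left multiplication evolves the state distribution: pi_n = pi_0 P^n
pi3 = pi0 @ np.linalg.matrix_power(P, 3)

# Right multiplication computes expectations: reward r_j for ending in
# state j after 3 steps, so entry i of P^3 r is E[reward | start in i]
r = np.array([0.0, 10.0])
expected = np.linalg.matrix_power(P, 3) @ r
```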

Discrete state Markov chains can also be defined with time treated as continuous,

although the construction is slightly more complicated. Again, their defining property

is that given the present and all past states, the future states depend only on the present

state. In continuous time, an $m$-state Markov chain is specified through an $m$-by-$m$ rate matrix $Q$, where for small $\Delta t$, $Q_{ij}\,\Delta t$ is the probability that the Markov chain transitions from state $i$ to state $j$ in $\Delta t$ units of time. The diagonal of $Q$ is then defined such that the rows of $Q$ sum to 0. For a continuous time Markov chain, the probability of being in state $j$ at time $t$ given that it was in state $i$ at time 0 is given by $[e^{Qt}]_{ij}$, where $e^{Qt}$ is understood as the matrix exponential. Left and right multiplication by $e^{Qt}$ have the same interpretations as left and right multiplication by $P$ in the discrete time case.
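The matrix exponential can be evaluated numerically, e.g., with `scipy.linalg.expm`; the two-state rate matrix below is an illustrative choice:

```python
import numpy as np
from scipy.linalg import expm

# Rate matrix Q: off-diagonal Q[i, j] is the rate of transitions i -> j;
# the diagonal makes each row sum to zero
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])

t = 0.5
Pt = expm(Q * t)   # Pt[i, j] = p(in state j at time t | in state i at time 0)
```

Each row of the resulting matrix is a valid probability distribution, and as $t$ grows the rows converge to the chain's stationary distribution.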

3.2 Markov Processes

The concept of Markov chains can also be extended to the case where both time and

the state space are continuous, in which case it is referred to as a Markov process.

Again, the defining property is that given the present and all past states, the future

states depend only on the present state.

3.2.1 Standard Brownian Motion

Perhaps the simplest example of a Markov process is a Brownian motion, which is

characterized by having continuous sample paths and dynamics that do not depend on

either time or state. A consequence of time and space invariance is that the increments

of Brownian motion are normally distributed. A Brownian motion is referred to as a standard Brownian motion, denoted $W(t)$, when its increments, $W(t) - W(s)$, have mean 0 and variance $t - s$ for $t > s$. Sample paths of a standard Brownian motion are shown in Figure 3.1.


Figure 3.1. Five sample paths generated from a standard Brownian motion. Sample paths are continuous

everywhere, but differentiable nowhere.
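Sample paths like those in Figure 3.1 can be simulated by accumulating independent Gaussian increments on a time grid (a standard discretization, sketched here with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def brownian_paths(n_paths, n_steps, T=1.0):
    """Standard Brownian motion: W(0) = 0 and independent increments
    W(t + dt) - W(t) ~ Normal(0, dt)."""
    dt = T / n_steps
    steps = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    return np.hstack([np.zeros((n_paths, 1)), np.cumsum(steps, axis=1)])

W = brownian_paths(5, 1000)   # five sample paths on [0, 1]
```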

3.2.2 One Dimensional Diffusion Processes

More generally, any Markov process which has continuous sample paths is referred to

as a diffusion process. In one dimension a diffusion process, denoted $X(t)$, is specified through its drift field $\mu(x, t)$ and its diffusion field $\sigma^2(x, t)$. It is intuitively useful to think of $\mu(x, t)$ as the average and $\sigma^2(x, t)$ as the variance of the slope of the sample paths of $X(t)$ at time $t$ given that $X(t) = x$. This is not their rigorous definition, however, as sample paths of diffusion processes are not differentiable. More rigorously, given that $X(t) = x$, $X(t + \Delta t)$ has mean $x + \mu(x, t)\,\Delta t$ and variance $\sigma^2(x, t)\,\Delta t$ as $\Delta t \to 0$.
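This small-$\Delta t$ characterization translates directly into the Euler-Maruyama scheme, a standard way (not specific to this thesis) to simulate an approximate sample path from the drift and diffusion fields:

```python
import numpy as np

rng = np.random.default_rng(1)

def euler_maruyama(mu, sigma2, x0, T=1.0, n_steps=1000):
    """Approximate a sample path of a 1-D diffusion with drift mu(x, t)
    and diffusion (variance rate) sigma2(x, t)."""
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        # Conditional on x[k], the next value has mean x + mu*dt and
        # variance sigma2*dt, matching the definition above
        x[k + 1] = (x[k] + mu(x[k], t) * dt
                    + np.sqrt(sigma2(x[k], t) * dt) * rng.standard_normal())
    return x

# Illustrative mean-reverting example: drift -2x, constant diffusion 0.5
path = euler_maruyama(lambda x, t: -2.0 * x, lambda x, t: 0.5, x0=3.0)
```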


3.2.3 Multidimensional Diffusion Processes

Diffusion processes can also be defined in $d$ dimensions, and again the process is specified entirely by its drift and diffusion fields. In the multidimensional case the drift field $\mu(x, t)$ is now a $d$-dimensional vector field and the diffusion field $\Sigma(x, t)$ is now a field of $d$-by-$d$ matrices. A process $X(t)$ having drift field $\mu(x, t)$ and diffusion field $\Sigma(x, t)$ satisfies that as $\Delta t \to 0$, $X(t + \Delta t)$ given $X(t) = x$ has mean $x + \mu(x, t)\,\Delta t$ and covariance $\Sigma(x, t)\,\Delta t$. Again, although not rigorously true, it is intuitively useful to think of drift as the mean and diffusion as the covariance of the velocity of $X(t)$ conditioned on its current state. For notational convenience, the same notation is used for the state space of a general diffusion process regardless of its dimensionality.

3.2.4 Jump Diffusion Processes

Markov processes which do not have continuous sample paths are referred to as jump

diffusion processes. The discussion that follows is restricted to jump diffusions that

generate finitely many jumps in a given interval. In general, a jump diffusion process

may also generate infinitely many infinitesimal jumps but this case does not seem

particularly relevant to curve prediction.

In specifying such a jump diffusion process, in addition to the drift and diffusion fields, a jump intensity field $\lambda(x, t)$ and a field of jump densities $\eta(y; x, t)$ (defined over the state space) are needed. The interpretation of the jump intensity field $\lambda(x, t)$ is that given that $X(t) = x$, the probability of a jump occurring in $[t, t + \Delta t)$ is equal to $\lambda(x, t)\,\Delta t$ as $\Delta t \to 0$. If such a jump does occur, the process is distributed according to $\eta(y; x, t)$ after the jump. It should be noted that the assumption of a density $\eta(y; x, t)$ is purely for notational convenience, and more generally jump distributions with discrete components can also be used.
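The two extra fields slot naturally into the Euler-type scheme: in each small interval a jump fires with probability roughly $\lambda(x, t)\,\Delta t$, otherwise the path diffuses. A sketch under illustrative parameter choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def jump_diffusion_path(mu, sigma2, lam, jump_sampler, x0, T=1.0, n_steps=2000):
    """Euler scheme for a jump diffusion: in each small interval a jump
    occurs with probability lam(x, t) * dt; otherwise the path diffuses."""
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        if rng.random() < lam(x[k], t) * dt:
            x[k + 1] = jump_sampler(x[k], t)    # draw the post-jump state
        else:
            x[k + 1] = (x[k] + mu(x[k], t) * dt
                        + np.sqrt(sigma2(x[k], t) * dt) * rng.standard_normal())
    return x

# Brownian motion with rate-3 jumps that displace the path by Normal(1, 0.1)
path = jump_diffusion_path(lambda x, t: 0.0, lambda x, t: 1.0,
                           lambda x, t: 3.0,
                           lambda x, t: x + rng.normal(1.0, 0.1),
                           x0=0.0)
```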

3.2.5 Killed Diffusion Processes

Throughout Chapter 4, Markov processes defined over a state space that includes one additional absorbing “cemetery” point, denoted $\partial$, are also used. This type of process is referred to as a killed (jump) diffusion process, as this is equivalent to saying a sample path no longer exists once it transitions into the cemetery state. In addition to the parameters defined above, a killed diffusion also requires a killing field $k(x, t)$ to be specified. The killing field $k(x, t)$ is defined as $P(X(t + \Delta t) = \partial \mid X(t) = x) / \Delta t$ in the limit as $\Delta t \to 0$, and thus has the interpretation of a hazard rate.

3.3 Transition Probabilities for Markov Processes

For Markov chains the transition probability, the probability that the chain is in state $j$ at time $t$ given it was in state $i$ at time $s$ for $t > s$, could be calculated explicitly and is given by $[P^{t-s}]_{ij}$ and $[e^{Q(t-s)}]_{ij}$ in discrete and continuous time, respectively.

Evolution of states can be calculated through left multiplication by the transition

operator and forward expectations can be calculated through right multiplication by

the transition operator. Replacing the summations in matrix multiplication with


integrals, these relationships carry over straightforwardly to continuous state spaces

(Equations 3.1a-b).

$p(X(t) = y) = \displaystyle\int p(X(s) = x)\, p(X(t) = y \mid X(s) = x)\, dx$ (3.1a)

$E[f(X(t)) \mid X(s) = x] = \displaystyle\int p(X(t) = y \mid X(s) = x)\, f(y)\, dy$ (3.1b)

Unlike for Markov chains, it is not in general possible to calculate the transition density of a Markov process, $p(X(t) = y \mid X(s) = x)$, explicitly from the drift and diffusion fields. However, under suitable conditions, the transition density is given in terms of the drift and diffusion fields by partial differential equations (PDEs) known as the Kolmogorov forward and backward equations.

Let $f(y, t)$ be defined by Kolmogorov’s forward equation (Equation 3.2) and $g(x, s)$ be defined by Kolmogorov’s backward equation (Equation 3.3). If Equation 3.2 has a unique solution, then $f(y, t) = p(X(t) = y \mid X(s) = x)$. If Equation 3.3 has a unique solution, then $g(x, s) = p(X(t) = y \mid X(s) = x)$. Proofs of these statements can be found in Theorems 5.4 and 5.3 of Chapter 6 of (Friedman, 1975), respectively. Conditions for existence and uniqueness of the solutions to Equations 3.2-3.3 are given in Theorem 2.4.6 of (Stroock, 2008). In particular, if $\mu(x, t)$ and $\sigma^2(x, t)$ have bounded first derivatives and are twice continuously differentiable, Equations 3.2-3.3 have unique solutions. In Equations 3.2-3.3, $A$ denotes the infinitesimal generator of $X(t)$, $A^*$ denotes its adjoint, and $\delta(\cdot)$ denotes the Dirac delta function.


$\dfrac{\partial f}{\partial t}(y, t) = (A^* f)(y, t)$ (3.2)

$f(y, s) = \delta(y - x)$

$\dfrac{\partial g}{\partial s}(x, s) = -(A g)(x, s)$ (3.3)

$g(x, t) = \delta(x - y)$

The infinitesimal generator and its adjoint, $A$ and $A^*$, are differential operators that act upon functions of the state. The adjoint of an operator is an extension of the concept of the transpose of a matrix and has similar properties: for any operator $A$, there is a unique adjoint operator $A^*$ such that $\langle A f, g \rangle = \langle f, A^* g \rangle$. For one dimensional diffusion processes, $A$ and $A^*$ are defined in terms of the drift and diffusion fields by Equations 3.4a-b. In multiple dimensions, $A$ and $A^*$ are instead given by Equations 3.5a-b.

$(A g)(x) = \mu(x, t)\,\dfrac{\partial g}{\partial x}(x) + \dfrac{1}{2}\,\sigma^2(x, t)\,\dfrac{\partial^2 g}{\partial x^2}(x)$ (3.4a)

$(A^* f)(y) = -\dfrac{\partial}{\partial y}\left[\mu(y, t)\, f(y)\right] + \dfrac{1}{2}\,\dfrac{\partial^2}{\partial y^2}\left[\sigma^2(y, t)\, f(y)\right]$ (3.4b)

$(A g)(x) = \displaystyle\sum_i \left[\mu_i(x, t)\,\dfrac{\partial g}{\partial x_i}(x)\right] + \dfrac{1}{2} \sum_i \sum_j \left[\Sigma_{ij}(x, t)\,\dfrac{\partial^2 g}{\partial x_i \partial x_j}(x)\right]$ (3.5a)

$(A^* f)(y) = -\displaystyle\sum_i \dfrac{\partial}{\partial y_i}\left[\mu_i(y, t)\, f(y)\right] + \dfrac{1}{2} \sum_i \sum_j \dfrac{\partial^2}{\partial y_i \partial y_j}\left[\Sigma_{ij}(y, t)\, f(y)\right]$ (3.5b)


Under suitable conditions, Kolmogorov’s forward and backward equations can also be extended to jump diffusions. For notational convenience, the infinitesimal generator of a jump diffusion is written as the sum of the generator of its diffusion component, $A_D$, and the generator of its jump component, $A_J$. Replacing $A$ and $A^*$ in Equations 3.2-3.3 with $(A_D + A_J)$ and $(A_D^* + A_J^*)$, respectively, yields the Kolmogorov equations for jump diffusions. The operators $A_J$ and $A_J^*$ are defined in Equations 3.6a-b.

When a jump portion is involved, the Kolmogorov equations are no longer PDEs but rather partial integro-differential equations (PIDEs). The existence-uniqueness theory of PIDEs is significantly less developed than for PDEs, and thus precise conditions under which the transition densities are given by the solutions to the Kolmogorov equations are not presented. For discussion and some results on existence-uniqueness theory for these equations, see (Meyer-Brandis, 2007).

$(A_J g)(x) = \lambda(x, t) \displaystyle\int \eta(y; x, t)\, g(y)\, dy - \lambda(x, t)\, g(x)$ (3.6a)

$(A_J^* f)(y) = \displaystyle\int \lambda(x, t)\, \eta(y; x, t)\, f(x)\, dx - \lambda(y, t)\, f(y)$ (3.6b)

Lastly, the Kolmogorov backward equation can be extended to killed (jump) diffusions by defining $A$ instead by Equation 3.7, in which case it is known as the Feynman-Kac formula. Note that the killing field satisfies $k(x, t) \geq 0$ everywhere.

$(A g)(x) = \left[\displaystyle\sum_i \left[\mu_i(x, t)\,\dfrac{\partial}{\partial x_i}\right] + \dfrac{1}{2} \sum_i \sum_j \left[\Sigma_{ij}(x, t)\,\dfrac{\partial^2}{\partial x_i \partial x_j}\right] - k(x, t)\right] g(x)$ (3.7)


3.4 Hidden Markov Models and the Forward-

Backward Algorithm

While the previous three sections provided a general overview of Markov chains and

processes, inferring which state a Markov chain or process is in at a given time from

observed data has not yet been discussed. A computationally tractable algorithm for

solving the inference problem for Markov chains is presented in this section. This

algorithm scales linearly with the length of the chain as opposed to exponentially for a

brute force approach (Rabiner and Juang, 1986). A computationally tractable

extension of this algorithm to Markov processes is given in Chapter 4 and is one of the

primary contributions of this thesis.

Here, as well as in Chapter 4, it is assumed that the observations and the

underlying chain or process form a hidden Markov model. This is essentially two

separate assumptions. First, the underlying chain or process is assumed to be Markov.

Secondly, an observation of the chain or process at time $t$ depends only on the state of the chain or process at $t$ and not on its state at any other time. These conditions are

illustrated graphically in the Bayesian network shown in Figure 3.2. It should be

noted that this algorithm can be generalized somewhat to the more general case of

Bayesian networks given by polytrees (Kschischang et al., 2001) but these do not

seem particularly useful for curve prediction and hence the more general algorithm is

omitted.


Figure 3.2. Bayesian network for a hidden Markov model.

When these conditions are satisfied, the probability that the chain is in each state at a given time given the set of all observations (denoted $D$) can be calculated efficiently via the forward-backward algorithm (Stratonovich, 1960). The underlying idea of the forward-backward algorithm is that the updated probability that the chain is in state $x_t$ at time $t$ given all observations, $p(x_t \mid D)$, can be expressed using Bayes’ rule as Equation 3.8. The advantage of this form is that $p(x_t \mid \{y_0, \ldots, y_t\})$ can be calculated iteratively starting from $t = 0$ and increasing $t$, and $p(\{y_{t+1}, \ldots, y_T\} \mid x_t)$ can be calculated iteratively starting from $t = T$ and decreasing $t$, where $T$ denotes the latest time step being considered. Hence, $p(x_t \mid \{y_0, \ldots, y_t\})$ is referred to as the forward message and $p(\{y_{t+1}, \ldots, y_T\} \mid x_t)$ is referred to as the backward message.

Each iteration consists of a transition of one time step and an update from the observation at that time step (Equations 3.9a-b and 3.10a-b). The denominator $p(\{y_{t+1}, \ldots, y_T\} \mid \{y_0, \ldots, y_t\})$ is a normalizing constant. The initial conditions to these two recursions are $p(x_0)$ and $p(y_T \mid x_T)$, respectively, and are defined when specifying the Markov chain and the likelihood functions. In Equation 3.9b, the denominator is again a normalizing constant; in practice, the forward and backward messages may also be rescaled at each step for numerical stability.

$p(x_t \mid D) = \dfrac{p(x_t \mid \{y_0, \ldots, y_t\})\, p(\{y_{t+1}, \ldots, y_T\} \mid x_t)}{p(\{y_{t+1}, \ldots, y_T\} \mid \{y_0, \ldots, y_t\})}$ (3.8)

$p(x_t \mid \{y_0, \ldots, y_{t-1}\}) = \displaystyle\sum_{x_{t-1}} p(x_t \mid x_{t-1})\, p(x_{t-1} \mid \{y_0, \ldots, y_{t-1}\})$ (3.9a)

$p(x_t \mid \{y_0, \ldots, y_t\}) = \dfrac{p(y_t \mid x_t)\, p(x_t \mid \{y_0, \ldots, y_{t-1}\})}{p(y_t \mid \{y_0, \ldots, y_{t-1}\})}$ (3.9b)

$p(\{y_t, \ldots, y_T\} \mid x_t) = p(y_t \mid x_t)\, p(\{y_{t+1}, \ldots, y_T\} \mid x_t)$ (3.10a)

$p(\{y_t, \ldots, y_T\} \mid x_{t-1}) = \displaystyle\sum_{x_t} p(x_t \mid x_{t-1})\, p(\{y_t, \ldots, y_T\} \mid x_t)$ (3.10b)


Chapter 4

Mathematical Development for Markov

Process Regression

This chapter presents the mathematics for Markov process regression. The underlying

idea is similar to that of Gaussian process regression in that a stochastic process is

chosen to describe the prior state of information which is then rigorously updated via

Bayes’ formula from the observations.

The primary contribution of this chapter is an algorithm for efficiently calculating

updated marginal distributions when the prior is given by a Markov process. This

approach can be significantly more efficient than Markov chain Monte Carlo, as it

does not require one to simulate a high dimensional joint distribution. In the case of

discrete Markov chains, the updated marginal distributions can be calculated

efficiently using the forward-backward algorithm (Section 3.4). However, it is not

trivial to extend this algorithm to the case of continuous state Markov processes. The


transition densities for Markov processes cannot be calculated analytically and

consequently simulation methods, and in particular particle filters, are used instead.

While it is efficient to simulate the forward message, direct simulation of the

backward message is extremely inefficient. A primary contribution of this work is a

method for calculating backward messages efficiently instead as a forward message of

a transformed process, termed the adjoint process.

This chapter is structured as follows. Section 4.1 presents a brute force simulation

and discusses the computational inefficiencies of such an approach. This serves as

motivation for using the forward-backward algorithm instead. Section 4.2 generalizes

the forward-backward algorithm to the continuous case. These equations, however,

are given in terms of the transition density which in most cases cannot be analytically

calculated. Thus, the remainder of this chapter focuses on a method of efficiently

simulating solutions to these equations. In particular, Section 4.3 provides an intuitive

introduction to the computational inefficiencies in directly simulating the backward

message and how they can be circumvented using adjoint processes. Section 4.4

details the formal construction of adjoint processes. Section 4.5 then presents an

efficient particle filtering algorithm for calculating updated predictions. This approach

is significantly more efficient than direct simulation due to the use of adjoint

processes.


4.1 Inference via Brute Force Simulation

One approach that can be used for inference in the nonparametric case is to use the

same methodology described in Section 2.1 for parametric modeling. One first

simulates the prior distribution by generating $n$ sample paths from the stochastic process used to define the prior and assigns each path a prior weight of $1/n$. Then, to update the prediction, one applies Bayes’ rule on a path by path basis: for each observation, multiply the weight of each path by the likelihood of that observation given the path, and renormalize the path weights such that their sum is 1.

While this approach has the advantage of being straightforward, it has the

disadvantage of being computationally inefficient. In particular the number of prior

paths one would need to generate to make an updated prediction of a given resolution

increases exponentially with the number of observations.

As an illustration, suppose the prior distribution is described by a standard Brownian motion with initial condition $W(0) = 0$, and suppose one observes that the path lies in a specified interval at each of three successive times, where each interval is such that a path satisfying the previous observations has conditional probability $\Phi(1) - \Phi(0)$ of satisfying the next, with $\Phi(\cdot)$ denoting the standard normal CDF. The probability of a prior path satisfying the first, the first two, and all three observations is then approximately $(\Phi(1) - \Phi(0))$, $(\Phi(1) - \Phi(0))^2$, and $(\Phi(1) - \Phi(0))^3$, respectively. Thus, to get even a very rough updated prediction containing 100 sample paths, one would need to generate on average around $100/(\Phi(1) - \Phi(0))^k$ prior sample paths for $k = 1, 2, 3$ observations, i.e., roughly 300, 900, and 2,500 paths, respectively.
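The brute force procedure can be sketched as follows; the observation times, intervals, and path counts are illustrative values chosen for the sketch, not the ones used in the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Prior: Brownian-motion sample paths on a grid over [0, 3], each with
# equal prior weight 1/n
n_paths, n_steps = 20_000, 60
dt = 3.0 / n_steps
grid = np.linspace(0.0, 3.0, n_steps + 1)
paths = np.hstack([np.zeros((n_paths, 1)),
                   np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)),
                             axis=1)])

# Interval observations: Bayes' rule path by path -- multiply each weight
# by the likelihood (an indicator function here), then renormalize
weights = np.ones(n_paths)
for t_obs, lo, hi in [(1.0, 0.0, 1.0), (2.0, 0.5, 1.5)]:
    idx = np.argmin(np.abs(grid - t_obs))
    weights *= (paths[:, idx] >= lo) & (paths[:, idx] <= hi)
weights /= weights.sum()

# The count of paths with nonzero weight shrinks geometrically with each
# observation, which is the inefficiency described above
n_surviving = np.count_nonzero(weights)
```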


In general, as the information gained from the data becomes large this algorithm

becomes highly inefficient. In the remainder of this chapter, an algorithm for which

the required computation scales linearly instead of exponentially with the number of

observations is presented. This algorithm is an extension of the forward-backward

algorithm for the discrete case (see Section 3.4) and consequently requires that the

stochastic process describing the prior state of information is Markov and that each

observation depends on the curve at only one point.

4.2 Forward-Backward Algorithm when the

Transition Density is known

In most past applications of the forward-backward algorithm, it has been the case that

the state space is treated as discrete, time is treated as discrete, or both. As regression

treats both time and state as continuous there are some additional challenges in

implementing the algorithm (Koller and Friedman, 2009). If the transition densities

can be calculated analytically, the forward-backward algorithm can be implemented in

much the same way as in the discrete case. However, it is not generally possible to

calculate the transition density analytically.

Equation 3.8 carries over straightforwardly to the case of Markov processes (Equation 4.1). Here, $X(t)$ denotes the Markov process representing the prior distribution, $y(t)$ denotes the observations of the curve at $t$ (if any), and $D$ denotes the set of all observations.


P(x(t) = s | D) = P(x(t) = s | {y(τ) : τ ≤ t}) P({y(τ) : τ > t} | x(t) = s) / P({y(τ) : τ > t} | {y(τ) : τ ≤ t})    (4.1)

As illustrated in Section 3.4, the advantage of Equation 4.1 is that both terms in the numerator can be calculated in an iterative fashion and the denominator is simply a normalizing constant. Calculating these two terms is similar to the discrete case and essentially involves transitioning to an observation, updating from it, transitioning to the next observation, updating from it, etc. As in the discrete case, P(x(t) = s | {y(τ) : τ ≤ t}) is referred to as the forward message and P({y(τ) : τ > t} | x(t) = s) is referred to as the backward message.

More formally, let t_1 < t_2 < … < t_n denote the instances of the independent variable at which there are observations. Let t_min and t_max denote the minimum and maximum values of t being considered, respectively, and use as convention t_0 = t_min and t_{n+1} = t_max. Noting that for t ∈ [t_k, t_{k+1}), P(x(t) = s | {y(τ) : τ ≤ t}) = P(x(t) = s | {y(τ) : τ ≤ t_k}), the forward message can be calculated recursively (with increasing k) as follows in Equations 4.2a-b. The initial condition P(x(t_min) = s) is specified as part of the prior.

For t ∈ (t_k, t_{k+1}],
P(x(t) = s | {y(τ) : τ ≤ t_k}) = ∫ P(x(t_k) = s' | {y(τ) : τ ≤ t_k}) P(x(t) = s | x(t_k) = s') ds'    (4.2a)

P(x(t_k) = s | {y(τ) : τ ≤ t_k}) ∝ P(y(t_k) | x(t_k) = s) P(x(t_k) = s | {y(τ) : τ ≤ t_{k−1}})    (4.2b)

Similarly, noting that for t ∈ [t_k, t_{k+1}), P({y(τ) : τ > t} | x(t) = s) = P({y(τ) : τ ≥ t_{k+1}} | x(t) = s), the backward message can be calculated recursively (with decreasing k) as follows in Equations 4.3a-b. For t ∈ [t_n, t_max], {y(τ) : τ > t} is an empty set and therefore the likelihood P({y(τ) : τ > t} | x(t) = s) = 1 everywhere. Thus, the initial condition can be taken as P({y(τ) : τ > t_max} | x(t_max) = s) = 1.

P({y(τ) : τ ≥ t_k} | x(t_k) = s) = P(y(t_k) | x(t_k) = s) P({y(τ) : τ > t_k} | x(t_k) = s)    (4.3a)

For t ∈ [t_{k−1}, t_k),
P({y(τ) : τ ≥ t_k} | x(t) = s) = ∫ P(x(t_k) = s' | x(t) = s) P({y(τ) : τ ≥ t_k} | x(t_k) = s') ds'    (4.3b)

As in the discrete case, the updated marginal distributions can be calculated

quickly using Equations 4.1-4.3 if the transition density and the likelihood of each

observation are known. While the likelihood functions are specified explicitly, the

transition density is determined by the prior and thus is specified through the prior

drift, diffusion, and jump fields. In a very limited number of circumstances, such as

when the prior process is Gaussian, the transition density can be calculated


analytically from these prior parameter fields (Rasmussen and Williams, 2006).

Otherwise, the forward and backward messages must be calculated numerically.
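When the transition density is available in closed form, the recursions of Equations 4.2-4.3 can be implemented directly on a discretized state grid, much as in the discrete case. The following sketch assumes a Brownian-motion prior (so the transition density is Gaussian), Gaussian observation noise, a flat initial distribution, and a particular grid; all of these are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

def transition_matrix(grid, dt, sigma=1.0):
    """Row i holds the (grid-normalized) transition density
    p(x(t + dt) = grid[j] | x(t) = grid[i]) for a Brownian-motion prior."""
    d = grid[None, :] - grid[:, None]
    K = np.exp(-d**2 / (2.0 * sigma**2 * dt))
    return K / K.sum(axis=1, keepdims=True)

def forward_backward(grid, K, obs, obs_noise):
    lik = [np.exp(-(grid - y)**2 / (2.0 * obs_noise**2)) for y in obs]
    # Forward messages (Eq. 4.2a-b): transition, then update at each observation.
    alpha = np.full(len(grid), 1.0 / len(grid))   # flat initial distribution
    alphas = []
    for l in lik:
        alpha = (alpha @ K) * l
        alpha /= alpha.sum()
        alphas.append(alpha)
    # Backward messages (Eq. 4.3a-b): initial condition 1, recurse backwards.
    beta = np.ones(len(grid))
    betas = [beta]
    for l in reversed(lik[1:]):
        beta = K @ (l * beta)
        beta /= beta.max()        # rescale only for numerical stability
        betas.insert(0, beta)
    # Smoothed marginals (Eq. 4.1): pointwise product, normalized on the grid.
    return [a * b / (a * b).sum() for a, b in zip(alphas, betas)]

grid = np.linspace(-5.0, 5.0, 201)
K = transition_matrix(grid, dt=0.1)
post = forward_backward(grid, K, obs=[0.5, 0.7, 1.1], obs_noise=0.3)
```

The required computation is linear in the number of observations, as each observation contributes one transition step and one pointwise update.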

4.3 Intuition behind Adjoint Processes

In the remainder of this chapter, a method for simulating the forward and backward

messages via particle filters is presented. While forward messages are straightforward

to simulate using a particle filter, the direct approach to simulating the backward

message is much less efficient. A primary contribution of this work is a method for

calculating backward messages efficiently instead as a forward message of a

transformed process, which will be referred to as the adjoint process.

At first glance, the computational differences between simulating backward and forward messages may seem counterintuitive since the calculations are nearly identical when

the transition density is known. To understand these differences, it is useful to

consider an analogy. Calculating the forward message is analogous to predicting the

location of a bottle one year after it is dropped in the ocean at a particular location.

The distribution of terminal locations can be estimated straightforwardly by dropping

10,000 bottles at this location and building a histogram of their locations after one

year. Calculating the backward message is analogous to the inverse problem of

predicting the starting location of a bottle that is found in a particular location after

one year. To simulate this directly, one could choose 10,000 points in the ocean, drop

10,000 bottles at each point and count the number of bottles from each starting point

that end up in this location after one year. While this is already considerably less


efficient, suppose that in addition the terminal location is only a square meter wide.

Then one may need to drop a million bottles from each starting point to get even a

single bottle at this location. As simulating the backward message requires multiple

starting points and may require a large number of sample paths from each one, it can

be highly inefficient.
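The cost asymmetry in this analogy can be made concrete with a small Monte Carlo experiment. The random-walk dynamics, the path counts, and the narrow target interval below are arbitrary illustrative choices standing in for the ocean currents and the one-square-meter terminal location.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, n_paths = 50, 10_000

# Forward message: drop 10,000 "bottles" at x = 0 and histogram where they end up.
end = rng.normal(0.0, 1.0, size=(n_paths, steps)).sum(axis=1)
forward_hist, _ = np.histogram(end, bins=50, range=(-25.0, 25.0), density=True)

# Naive backward simulation: from each candidate starting point, count the
# paths that land in a narrow target interval -- almost every sample is wasted.
target_lo, target_hi = -0.5, 0.5
starts = np.linspace(-20.0, 20.0, 41)
hit_rates = []
for s in starts:
    e = s + rng.normal(0.0, 1.0, size=(1000, steps)).sum(axis=1)
    hit_rates.append(np.mean((e > target_lo) & (e < target_hi)))
hit_rate = float(np.mean(hit_rates))
```

One forward pass already yields a usable histogram, while the naive backward pass spends 41,000 paths and keeps only the small fraction that happen to hit the target.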

One alternative approach is to instead calculate the forward and backward message

by solving the Kolmogorov equations (see Section 3.3) using numerical techniques for

partial differential equations. This is sometimes a viable option but can also lead to

numerical issues such as when the domain is unbounded, as is the case in every

application in Chapter 6. Alternatively, existence-uniqueness results for the

Kolmogorov equations can be used to construct a new process such that the backward

message for the original process can be simulated efficiently instead as the forward

message of the new process. This process will be referred to as the adjoint process.

Returning to the bottle in the ocean analogy, this is akin to the following. If the

local currents of the ocean are known, the distribution of possible starting locations

can be written as the solution to some equation from fluid dynamics. In some cases,

solving this equation to a high degree of accuracy may not be computationally

feasible. However, suppose that there exists an alternate ocean current such that the

probability of a bottle floating from point B to point A under the alternate current is

equal to the probability of a bottle floating from point A to point B under the original

current. The distribution of possible starting locations for a given terminal location

under the original current can then be simulated by dropping bottles at this location

and histogramming their terminal locations under the alternate current.


The alternate current which produces this result, however, is not entirely intuitive.

As an example, consider the border of a natural harbor. As the water outside of the

harbor is more turbulent, it is more likely that a bottle from outside of the harbor will

enter it than it is that a bottle from inside of the harbor will exit it, even if there is no

directed current. Consequently, the alternate current should be such that a bottle

dropped inside of the harbor is more likely to exit the harbor than a bottle dropped

outside of the harbor is to enter it. This requires that the alternate current be directed

away from the harbor, even though the original current is undirected. The

construction of the adjoint process, which is analogous to this alternate current, is

formalized in the following section.

4.4 Formal Construction of Adjoint Processes

Formalizing the intuition of the previous section, a process x̃(t), running backward with respect to x(t), is sought such that the backward message for x(t) is proportional to the forward message for x̃(t). A unique characterization for such a process, which will be referred to as the adjoint process of x(t), is given in Definition 4.1. Under mild conditions given later in this section, the adjoint process exists.


Definition 4.1 Let x(t) be a Markov process. Then its adjoint process, denoted x̃(t), is the unique killed Markov process satisfying Equation 4.4a with killing field k̃(s, t) satisfying Equation 4.4b.

exp(∫_{t_1}^{t_2} β(τ) dτ) P(x̃(t_1) = s_1 | x̃(t_2) = s_2) = P(x(t_2) = s_2 | x(t_1) = s_1)    (4.4a)

inf_{s, t} {k̃(s, t)} = 0    (4.4b)

In this definition, Equation 4.4a alone enforces that the forward message for x̃(t), P(x̃(t) = s | {y(τ) : τ > t}), is proportional to the backward message for x(t), P({y(τ) : τ > t} | x(t) = s). To see this, substitute Equation 4.4a into Equation 4.3b and compare the recursions of Equations 4.3a-b with those of Equations 4.2a-b. As the backward message is propagated by multiplying by the transition operator, and the forward message is propagated by multiplying by the adjoint of the transition operator (equivalently, left multiplying by the transition operator in the discrete case), the term adjoint process seems suitable.

Equation 4.4b makes the characterization of x̃(t) unique. A constant could be added to β(t), in which case Equation 4.4a would still be satisfied with the same constant added to k̃(s, t). From a simulation standpoint, uniformly killing paths is never efficient and thus the requirement of Equation 4.4b is a good choice to impose uniqueness.

The remainder of this section formalizes the construction of adjoint processes for

Markov chains and processes. Sections 4.4.1 and 4.4.2 present adjoint chains for


discrete and continuous time Markov chains, respectively. The motivation for

including these sections is that constructing adjoint chains in these cases should serve

as an intuitive introduction to their construction in the more difficult cases of

diffusions and jump diffusions. The primary purpose of this section, the construction

of adjoint processes for diffusions and jump diffusions is presented in Sections 4.4.3

and 4.4.4, respectively. Proofs for all four cases are provided.

4.4.1 Adjoint Chains for Discrete Time Markov Chains

In the case of a discrete time Markov chain, constructing the adjoint chain is relatively

straightforward. For a Markov chain with n states, its adjoint Markov chain is characterized by Definition 4.2, where the convention that indices i and j run over only the n original states and not over the cemetery state Δ is taken throughout Section 4.4.

Definition 4.2 Let X_t be a discrete time Markov chain. Then its adjoint chain, denoted X̃_t, is the unique chain over the n original states and an additional absorbing state Δ satisfying Equation 4.5a and having transition matrix P̃ satisfying Equation 4.5b.

λ^(t_2 − t_1) P(X̃_{t_1} = i | X̃_{t_2} = j) = P(X_{t_2} = j | X_{t_1} = i)    (4.5a)

min_i {P̃_iΔ} = 0    (4.5b)


These are precisely the same conditions as Equations 4.4a-b, except that the state space is now assumed discrete. Equation 4.5a can be rewritten in terms of the transition matrices P and P̃ as Equation 4.5c.

λ^t [P̃^t]_{ji} = [P^t]_{ij}    (4.5c)

It is clear that taking P̃ = Pᵀ and λ = 1 satisfies Equation 4.5c, but unless each of the columns of P sums to 1, Pᵀ is not a valid transition matrix and thus does not properly define an X̃_t that can then be simulated. In general, unless each of the columns of P sums to 1, there exists no valid Markov chain over the n original states which satisfies Equation 4.5a. However, under mild conditions, such a Markov chain exists on the state space which includes the additional absorbing state Δ. Thus, in general X̃_t is a killed Markov chain.

The process for constructing a valid P̃ is straightforward. Take the transpose of P, divide by a λ large enough such that all of the row sums are less than 1, and add transitions from each state to Δ such that each row sums to 1. To satisfy Equation 4.5b (and minimize killing), λ is taken as equal to the maximum column sum of P. This construction is formalized in Proposition 4.1.


Proposition 4.1 Let X_t be a discrete time Markov chain with transition matrix P which has bounded column sums. Let λ be defined by Equation 4.6a and let X̃_t be a discrete time Markov chain with transition matrix P̃ defined by Equations 4.6b-e. Then X̃_t is the adjoint chain of X_t.

λ = max_j {∑_i P_ij}    (4.6a)

P̃_ij = P_ji / λ    (4.6b)

P̃_iΔ = 1 − ∑_j P̃_ij    (4.6c)

P̃_ΔΔ = 1    (4.6d)

P̃_Δj = 0    (4.6e)

Proof

From Equation 4.6c, all of the row sums equal 1, and λ is defined specifically such that min_i {P̃_iΔ} = 0. Thus, P̃ is a valid transition matrix and Equation 4.5b is satisfied.

Equation 4.5c can be proved by induction. Equation 4.6b implies that Equation 4.5c holds for t = 1. Using Equation 4.6e in the first equality, 4.6b in the second equality, and induction in the third, it follows that

λ^(t+1) [P̃^(t+1)]_{ji} = λ^(t+1) ∑_k [P̃^t]_{jk} P̃_{ki}
    = λ^t ∑_k [P̃^t]_{jk} P_{ik}
    = ∑_k [P^t]_{kj} P_{ik}
    = [P^(t+1)]_{ij}
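The construction of Proposition 4.1 is easy to check numerically. The sketch below (assuming the notation above: λ the maximum column sum of P, and a final index playing the role of the cemetery state Δ) builds P̃ for a random chain and verifies Equation 4.5c for a modest power t.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # a valid transition matrix

lam = P.sum(axis=0).max()                    # maximum column sum (Eq. 4.6a)
Pt = np.zeros((n + 1, n + 1))                # last index is the cemetery state
Pt[:n, :n] = P.T / lam                       # Eq. 4.6b
Pt[:n, n] = 1.0 - Pt[:n, :n].sum(axis=1)     # route the deficit into the cemetery
Pt[n, n] = 1.0                               # cemetery state is absorbing

t = 5
lhs = lam**t * np.linalg.matrix_power(Pt, t)[:n, :n].T
rhs = np.linalg.matrix_power(P, t)           # Eq. 4.5c: the two should agree
```

Because the cemetery row of P̃ is zero on the original states, killed paths never return, which is why the restricted block of P̃^t behaves exactly like (Pᵀ/λ)^t.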


4.4.2 Adjoint Chains for Continuous Time Markov Chains

In continuous time, the adjoint Markov chain can be determined in much the same

way as in the discrete time case. The primary difference is that a continuous time

Markov chain is specified by its rate matrix instead of a transition matrix. For

continuous time Markov chains, the adjoint Markov chain is characterized by

Definition 4.3.

Definition 4.3 Let x(t) be a continuous time Markov chain. Then its adjoint chain, denoted x̃(t), is the unique chain over the n original states plus an absorbing state Δ satisfying Equation 4.7a and having rate matrix Q̃ satisfying Equation 4.7b.

exp(β(t_2 − t_1)) P(x̃(t_1) = i | x̃(t_2) = j) = P(x(t_2) = j | x(t_1) = i)    (4.7a)

min_i {Q̃_iΔ} = 0    (4.7b)

Equivalently, Equation 4.7a can be written in terms of the rate matrices Q and Q̃ as Equation 4.7c. As in the discrete case, it is not in general possible to construct a rate matrix over the original state space which is a valid rate matrix and also satisfies Equation 4.7c. Again, under mild conditions, a killed Markov chain with rate matrix satisfying Equation 4.7c exists. This construction is formalized in Proposition 4.2.

exp(βt) [exp(Q̃t)]_{ji} = [exp(Qt)]_{ij}    (4.7c)


Proposition 4.2 Let x(t) be a continuous time Markov chain with rate matrix Q which has bounded column sums. Let β be defined by Equation 4.8a and let x̃(t) be a continuous time Markov chain with rate matrix Q̃ defined by Equations 4.8b-f. Then x̃(t) is the adjoint chain of x(t).

β = max_j {∑_i Q_ij}    (4.8a)

Q̃_ij = Q_ji    (i ≠ j)    (4.8b)

Q̃_ii = Q_ii − β    (4.8c)

Q̃_iΔ = −∑_j Q̃_ij    (4.8d)

Q̃_Δj = 0    (4.8e)

Q̃_ΔΔ = 0    (4.8f)

Proof

From Equation 4.8d, all of the row sums equal 0, and β is defined specifically such that min_i {Q̃_iΔ} = 0. Hence, Q̃ is a valid rate matrix and Equation 4.7b is satisfied.

Equation 4.7c can be shown to hold by rewriting the matrix exponential as its Taylor series, then recognizing that the k-th term in each summation is equal. First, let I denote the identity matrix and rewrite Equation 4.7c as

[exp((Q̃ + βI)t)]_{ji} = [exp(Qt)]_{ij}

Writing the matrix exponential in terms of its power series gives

[exp((Q̃ + βI)t)]_{ji} = ∑_k (t^k / k!) [(Q̃ + βI)^k]_{ji}

[exp(Qt)]_{ij} = ∑_k (t^k / k!) [Q^k]_{ij}

Lastly, by induction, the k-th terms in each summation are equal. From Equations 4.8b-c,

[(Q̃ + βI)]_{ji} = [Q]_{ij}

Using Equation 4.8e in the first equality, 4.8b-c in the second, and induction in the third, it follows that

[(Q̃ + βI)^(k+1)]_{ji} = ∑_m [(Q̃ + βI)^k]_{jm} [(Q̃ + βI)]_{mi}
    = ∑_m [(Q̃ + βI)^k]_{jm} [Q]_{im}
    = ∑_m [Q^k]_{mj} [Q]_{im}
    = [Q^(k+1)]_{ij}

Thus, the two summations are equal and Equation 4.7c holds.
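Proposition 4.2 can be checked numerically in the same way as the discrete case. The sketch below assumes the notation above (β the maximum column sum of Q, a final index for the cemetery state) and uses a plain Taylor-series matrix exponential, which is adequate for these small norms.

```python
import numpy as np

def expm(A, terms=80):
    """Matrix exponential via its Taylor series (adequate for these small norms)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

Q = np.array([[-1.0, 0.6, 0.4],
              [0.2, -0.5, 0.3],
              [0.7, 0.1, -0.8]])              # a valid rate matrix (rows sum to 0)
n = len(Q)

beta = Q.sum(axis=0).max()                    # maximum column sum (Eq. 4.8a)
Qt = np.zeros((n + 1, n + 1))                 # last index is the cemetery state
Qt[:n, :n] = Q.T                              # off-diagonal entries: Qt_ij = Q_ji
np.fill_diagonal(Qt[:n, :n], np.diag(Q) - beta)   # diagonal: Qt_ii = Q_ii - beta
Qt[:n, n] = -Qt[:n, :n].sum(axis=1)           # killing rates into the cemetery

t = 0.7
lhs = np.exp(beta * t) * expm(Qt * t)[:n, :n].T
rhs = expm(Q * t)                             # Eq. 4.7c: the two should agree
```

As in the discrete case, the zero cemetery row makes the original-state block of exp(Q̃t) equal to exp((Qᵀ − βI)t), from which Equation 4.7c follows.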

4.4.3 Adjoint Processes for Diffusions

In this section, adjoint processes for diffusions are constructed. These results, along

with those of Section 4.4.4, are a primary contribution of this thesis. The underlying idea is essentially identical to that for Markov chains. An operator is sought which both satisfies the characterization of Equations 4.4a-b and specifies a valid Markov process.


Again, this requires one to enlarge the state space by adding an absorbing state. Thus,

the adjoint process is in general a killed diffusion (Section 3.2.3). For diffusion processes, the original process is specified by its infinitesimal generator A and the infinitesimal generator of the adjoint process is denoted Ã. The construction of the adjoint process is formalized in Theorem 4.1.

Theorem 4.1 Let x(t) be a diffusion process with infinitesimal generator A defined by Equation 3.5a for which the drift field μ(s, t) and the diffusion field D(s, t) have bounded derivatives and are twice continuously differentiable. Let β be defined by Equation 4.9a and let x̃(t) be a killed diffusion with infinitesimal generator à defined by Equations 4.9b-e. Then x̃(t) is the adjoint process of x(t).

β = sup_{s, t} { (1/2) ∑_i ∑_j [∂²D_ij(s, t) / ∂s_i ∂s_j] − ∑_i [∂μ_i(s, t) / ∂s_i] }    (4.9a)

D̃(s, t) = D(s, t)    (4.9b)

μ̃_i(s, t) = −μ_i(s, t) + ∑_j [∂D_ij(s, t) / ∂s_j]    (4.9c)

k̃(s, t) = ∑_i [∂μ_i(s, t) / ∂s_i] − (1/2) ∑_i ∑_j [∂²D_ij(s, t) / ∂s_i ∂s_j] + β    (4.9d)

(Ã f)(s, t) = [∑_i μ̃_i(s, t) (∂/∂s_i) + (1/2) ∑_i ∑_j D̃_ij(s, t) (∂²/∂s_i ∂s_j) − k̃(s, t)] f(s, t)    (4.9e)

Page 60: MARKOV PROCESS REGRESSION A DISSERTATION SUBMITTED …jp891mj8064... · Sam Chiu participated in helpful conversations regarding my thesis as well as sitting on my committee, and

47

Proof

Let p(s, t) denote the solution to the forward equation (Equation 3.2) using A as the infinitesimal generator. Let p̃(s, t) denote the solution to the Feynman-Kac formula (Equation 3.7) using à as the infinitesimal generator. The initial condition for p(s, t) is precisely the terminal condition for p̃(s, t); p(s, t_1) = p̃(s, t_1) = δ(s − s_1). In addition, à is defined such that à = A* − β, where A* denotes the formal adjoint of A. As exp(∫_{t_1}^{t} β(τ) dτ) is a constant and can be factored out, if unique solutions to these equations exist it implies that

exp(∫_{t_1}^{t} β(τ) dτ) p̃(s, t) = p(s, t)

From the assumptions on μ(s, t) and D(s, t), the forward equation has a unique solution (Theorem 2.4.6 in Stroock, 2008). From Chapter 6, Theorem 5.4 of (Friedman, 1975), if a unique solution exists it is the transition density for x(t). Therefore, p(s, t) = P(x(t) = s | x(t_1) = s_1). Similarly, from the assumptions on μ̃(s, t) and D̃(s, t), the Feynman-Kac formula has a unique solution (Theorem 2.4.6 in Stroock, 2008). From Chapter 6, Theorem 5.3 of (Friedman, 1975), if a unique solution exists it is the transition density for x̃(t). Therefore, p̃(s, t) = P(x̃(t_1) = s_1 | x̃(t) = s). Thus,

exp(∫_{t_1}^{t_2} β(τ) dτ) P(x̃(t_1) = s_1 | x̃(t_2) = s_2) = P(x(t_2) = s_2 | x(t_1) = s_1)


4.4.4 Adjoint Processes for Jump Diffusions

The above reasoning can also be applied to jump diffusions. However, in the case of

jump diffusions the forward and backward equations are no longer given by partial

differential equations but rather partial integro-differential equations. The existence-

uniqueness theory for PIDE’s is significantly less developed than for PDE’s although

some sufficient existence-uniqueness results have been proven (Meyer-Brandis, 2007).

As it is an open question, in this proof it is assumed that the parameter fields are

defined such that unique solutions to the Kolmogorov equations exist. The

construction of an adjoint process satisfying Equation 4.4 is presented in Theorem 4.2

and under the existence-uniqueness assumptions the proof of this theorem follows the

same reasoning as Theorem 4.1. It is also instructive to compare the construction of

the adjoint of the jump portion, J̃, to that of the construction of Q̃ for continuous time Markov chains. Recognizing λ(s, t) ν(s' | s, t) as analogous to Q_ij, with some manipulation one can see that the construction of J̃ is a straightforward generalization of Proposition 4.2 to the continuous case.

Theorem 4.2 Let x(t) be a jump diffusion with infinitesimal generator A + J defined by Equations 3.5a and 3.6a for which μ(s, t), D(s, t), λ(s, t), and ν(s' | s, t) are defined such that there exists a unique solution to its Kolmogorov forward equation. Let β be defined by Equation 4.10d and let x̃(t) be a killed jump diffusion with infinitesimal generator à + J̃, where J̃ is defined by Equations 4.10a-c and à is defined by Equations 4.10e-h. Then x̃(t) is the adjoint process of x(t).

λ̃(s, t) = ∫ λ(s', t) ν(s | s', t) ds'    (4.10a)

ν̃(s' | s, t) = λ(s', t) ν(s | s', t) / λ̃(s, t)    (4.10b)

(J̃ f)(s, t) = λ̃(s, t) ∫ [f(s', t) − f(s, t)] ν̃(s' | s, t) ds'    (4.10c)

β = sup_{s, t} { (1/2) ∑_i ∑_j [∂²D_ij(s, t) / ∂s_i ∂s_j] − ∑_i [∂μ_i(s, t) / ∂s_i] + λ̃(s, t) − λ(s, t) }    (4.10d)

D̃(s, t) = D(s, t)    (4.10e)

μ̃_i(s, t) = −μ_i(s, t) + ∑_j [∂D_ij(s, t) / ∂s_j]    (4.10f)

k̃(s, t) = ∑_i [∂μ_i(s, t) / ∂s_i] − (1/2) ∑_i ∑_j [∂²D_ij(s, t) / ∂s_i ∂s_j] + λ(s, t) − λ̃(s, t) + β    (4.10g)

(Ã f)(s, t) = [∑_i μ̃_i(s, t) (∂/∂s_i) + (1/2) ∑_i ∑_j D̃_ij(s, t) (∂²/∂s_i ∂s_j) − k̃(s, t)] f(s, t)    (4.10h)

Proof

Let p(s, t) denote the solution to the forward equation (Equation 3.2) with A + J as the infinitesimal generator. Let p̃(s, t) denote the solution to the Feynman-Kac formula (Equation 3.7) using à + J̃ as the infinitesimal generator. The initial condition for p(s, t) is precisely the terminal condition for p̃(s, t); p(s, t_1) = p̃(s, t_1) = δ(s − s_1). In addition, à + J̃ is defined such that à + J̃ = (A + J)* − β, where (A + J)* denotes the formal adjoint of A + J. As exp(∫_{t_1}^{t} β(τ) dτ) is a constant and can be factored out, and since it is assumed that there exists a unique solution to these equations, it implies that

exp(∫_{t_1}^{t} β(τ) dτ) p̃(s, t) = p(s, t)

Since it is assumed that a unique solution exists, p(s, t) is equal to the transition density for x(t), that is, p(s, t) = P(x(t) = s | x(t_1) = s_1). Similarly, existence and uniqueness implies that p̃(s, t) is the transition density for x̃(t), that is, p̃(s, t) = P(x̃(t_1) = s_1 | x̃(t) = s). Thus,

exp(∫_{t_1}^{t_2} β(τ) dτ) P(x̃(t_1) = s_1 | x̃(t_2) = s_2) = P(x(t_2) = s_2 | x(t_1) = s_1)

In order for the existence-uniqueness assumption to hold, at minimum the drift and diffusion fields must satisfy the same conditions needed for diffusions, in that μ(s, t) and D(s, t) must have bounded derivatives and be twice continuously differentiable. Another clear requirement is that λ̃(s, t) = ∫ λ(s', t) ν(s | s', t) ds' must be bounded, as otherwise k̃(s, t) is not well defined. If these requirements hold and the parameter fields defining the jump portion are reasonably well behaved, it seems reasonable to expect the existence-uniqueness assumptions of Theorem 4.2 to hold but again, to the author's knowledge, precise conditions for existence and uniqueness have not been proven.


4.5 Inference via Particle Filtering

Particle filtering algorithms can be used to simulate the forward and backward

messages straightforwardly using the original process and its adjoint process,

respectively. Algorithm 4.1 calculates the forward message P(x(t) = s | {y(τ) : τ ≤ t}) iteratively (with increasing t) as in Equations 4.2a-b. Algorithm 4.2 calculates the backward message P({y(τ) : τ > t} | x(t) = s) iteratively (with decreasing t) as in Equations 4.3a-b, where Definition 4.1 is used to instead simulate this as a forward message for x̃(t). Pointwise multiplication of the forward and backward messages yields the updated marginal distributions of x(t) after normalization (Equation 4.1). For ease in exposition, these particle filtering

algorithms are presented at a high level neglecting the specific methods for

propagating sample paths, resampling schemes, and density estimation techniques.

While Algorithms 4.1 and 4.2 are very similar, there are a few differences worth

noting. First, since {y(τ) : τ > t} is empty for t ∈ [t_n, t_max], one does not need to numerically calculate the backward message in this region as it is equal to 1 everywhere. Secondly, the adjoint process is in general a killed process and thus one may wish to resample at indices other than where observations occur. Lastly, as the backward message is a likelihood and not a probability distribution, it may not be normalizable and thus cannot be directly sampled from. This happens, for example, when the likelihood is a function of only one component of a multidimensional x(t).

This problem can be circumvented by using importance sampling. An attractive

option is to multiply by the forward message during the initial sampling and when


resampling, as this results in sampling from the updated distribution P(x(t) = s | D). Importance sampling can also be advantageous in the forward message if P(x(t_min) = s) is not normalizable, such as if it is assumed uniform over an unbounded state space.

(1) Generate N samples from the distribution P(x(t_min) = s). Assign each sample a weight of 1/N.

(2) Using each sample as an initial condition, generate a sample path of x(t) in [t_min, t_1].

(3) Estimate P(x(t) = s | {y(τ) : τ ≤ t}) in [t_min, t_1) via density estimation of the sample paths using their current weights.

(4) Update the weight of each sample path by multiplying by the likelihood P(y(t_1) | x(t_1) = s). Renormalize such that the path weights sum to 1. Resample if desired.

(5) Repeat steps (2)-(4) for [t_1, t_2], [t_2, t_3], …, [t_{n−1}, t_n], [t_n, t_max].

Algorithm 4.1. High level algorithm for calculating the forward message, P(x(t) = s | {y(τ) : τ ≤ t}), via particle filtering.

(1) Set P({y(τ) : τ > t} | x(t) = s) = 1 everywhere in [t_n, t_max].

(2) Normalize P(y(t_n) | x(t_n) = s) and generate N samples from this distribution. Assign each sample a weight of 1/N.

(3) Using each sample as an initial condition, generate a sample path of x̃(t) in [t_{n−1}, t_n]. Resample concurrently as desired.

(4) Estimate P(x̃(t) = s | {y(τ) : τ > t}) in [t_{n−1}, t_n) via density estimation of the sample paths using their current weights.

(5) Update the weight of each sample path by multiplying by the likelihood P(y(t_{n−1}) | x̃(t_{n−1}) = s). Renormalize such that the weights sum to 1. Resample if desired.

(6) Repeat steps (3)-(5) for [t_{n−2}, t_{n−1}], [t_{n−3}, t_{n−2}], …, [t_1, t_2], [t_min, t_1].

(7) Multiply the estimate for P(x̃(t) = s | {y(τ) : τ > t}) by exp(∫_t^{t_max} β(τ) dτ) to obtain the estimate for P({y(τ) : τ > t} | x(t) = s).

Algorithm 4.2. High level algorithm for calculating the backward message, P({y(τ) : τ > t} | x(t) = s), via simulation of the forward message of the adjoint, P(x̃(t) = s | {y(τ) : τ > t}).
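A minimal sketch of Algorithm 4.1 for a one-dimensional diffusion prior follows. The Euler-Maruyama propagation, the Gaussian likelihood, multinomial resampling at every observation, and all numerical values are illustrative assumptions; the density-estimation step (3) is omitted, with the weighted particles at the final observation returned instead.

```python
import numpy as np

rng = np.random.default_rng(2)

def propagate(x, t0, t1, mu, sigma, dt=0.01):
    """Step (2): advance every sample path from t0 to t1 by Euler-Maruyama."""
    t = t0
    while t < t1 - 1e-12:
        h = min(dt, t1 - t)
        x = x + mu(x, t) * h + sigma(x, t) * np.sqrt(h) * rng.normal(size=x.shape)
        t += h
    return x

def forward_message(obs_times, obs_vals, mu, sigma, n=2000, obs_noise=0.2):
    x = rng.normal(0.0, 1.0, size=n)          # step (1): sample the prior at t_min
    w = np.full(n, 1.0 / n)
    t_prev = 0.0
    for tk, yk in zip(obs_times, obs_vals):
        x = propagate(x, t_prev, tk, mu, sigma)              # step (2)
        w = w * np.exp(-(x - yk)**2 / (2.0 * obs_noise**2))  # step (4): likelihood
        w /= w.sum()
        idx = rng.choice(n, size=n, p=w)      # resampling (optional in step (4))
        x, w = x[idx], np.full(n, 1.0 / n)
        t_prev = tk
    return x, w          # weighted particles approximating p(x(t_n) | data)

mu = lambda s, t: -s                          # mean-reverting prior drift (assumed)
sigma = lambda s, t: 0.5 + 0.0 * s
x, w = forward_message([0.5, 1.0, 1.5], [1.0, 1.2, 1.1], mu, sigma)
post_mean = float(np.average(x, weights=w))
```

Algorithm 4.2 has the same shape, with the adjoint fields in place of mu and sigma, the observations visited in reverse order, and the killing field thinning paths during propagation.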


In conclusion, a computationally tractable algorithm for updating priors defined by

Markov processes was derived in this chapter. While this algorithm is a generalization

of the forward-backward algorithm for discrete hidden Markov models, the extension

to the continuous case is nontrivial for two reasons. First, as the transition densities

for Markov processes cannot in most cases be calculated analytically, simulation

techniques must be used instead. Secondly, existing techniques for simulating the

backward message are highly inefficient. A primary contribution of this chapter is a

computationally efficient method for calculating backward messages instead as a

forward message of the adjoint process. Using this approach, Markov process priors

can be updated using Algorithms 4.1-4.2.


Chapter 5

Choosing a Suitable Prior

In the previous chapter, the mathematics of performing Markov process regression

was presented. Since the algorithm itself can be automated it is perhaps more

important, from a practitioner’s viewpoint, to understand how to properly choose the

inputs than it is to understand the algorithm in great detail. Thus, this chapter instead

focuses on how to choose a prior in a practical application to reflect known behavior

about the curve of interest. As this chapter is intended for a much more general

audience than the previous one, some mathematical subtleties are ignored.

5.1 One Dimensional Diffusion Process Priors

To understand when it is suitable to use a one dimensional diffusion process as a prior,

it is important to first understand the properties of their sample paths. In particular,

the paths of a one dimensional diffusion process are continuous everywhere almost


surely but nowhere differentiable. In addition, the sample paths of all one dimensional

diffusion processes are almost surely not monotonic, not convex, etc. (Steele, 2001).

Thus in many cases, a one dimensional diffusion process is not the right choice for a

prior. However, if a one dimensional prior is suitable, there are fewer parameter fields to assess than in the more general cases that follow. In addition, these

parameter fields have straightforward interpretations.

Assessing the parameter fields μ(s, t) and σ²(s, t) can be roughly¹ thought of as specifying a normal distribution for the slope of the curve at each point {t, s} given that the curve goes through the point {t, s}. The drift field μ(s, t) can be interpreted as the expected slope of the curve at the point {t, s} given that the curve goes through the point {t, s}. The diffusion field σ²(s, t) can be thought of as the variance of the slope of the curve at the point {t, s} given that the curve goes through the point {t, s}, and 1 / σ²(s, t) can be interpreted as a measure of confidence in the assessment of μ(s, t).

For example, if one chooses σ²(s, t) = 0, then all sample paths which travel through {t, s} have a slope of exactly μ(s, t) at this point. On the other hand, if σ²(s, t) is chosen as large, the sample paths exhibit a much wider range of behavior at {t, s}. If little is known about the curve, a good choice may be a constant μ(s, t) and σ²(s, t) everywhere. However, if certain behavior is to be expected, these

parameter fields can provide significant modeling flexibility. Lastly, the distribution at t_min must be specified. In many cases, an uninformative distribution is desirable; however, a proper distribution must be used so that it can be sampled from in Algorithm 4.1. Thus, in practice a proper distribution with a large variance may be used instead. If desired, importance sampling can be used to reduce any bias introduced by the choice for this distribution.

¹ More accurately, the term slope here should be interpreted as (x(t + h) − x(t)) / h for some small but non-infinitesimal h. This expression is distributed approximately normally with mean μ(s, t) and standard deviation σ(s, t) / √h. As h → 0, this term is not well defined (hence the nondifferentiability of paths), but to think of μ and σ as the mean and standard deviation of the slope is useful intuitively.
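The footnote's interpretation of μ and σ can be checked by simulation: over a small window h, simulated finely within the window, the realized slope (x(t + h) − x(t)) / h is approximately normal with mean μ and standard deviation σ / √h. The constant fields below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(3)
m_field, s_field = 1.5, 0.8        # constant drift field and diffusion (sd) field
h, substeps, n = 0.05, 100, 50_000

dt = h / substeps
x = np.zeros(n)
for _ in range(substeps):          # simulate finely across the small window h
    x = x + m_field * dt + s_field * np.sqrt(dt) * rng.normal(size=n)
slope = x / h                      # realized slope over the window

mean_slope = float(slope.mean())   # ~ m_field
sd_slope = float(slope.std())      # ~ s_field / sqrt(h)
```

Note that the standard deviation of the slope grows like 1/√h as the window shrinks, which is exactly why the limit h → 0 is not well defined.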

In terms of assessing the parameter fields from a decision maker, it is useful to visualize the choices of μ(s, t) and σ²(s, t) as a vector field. One such choice is that the slope of each vector is given by μ(s, t) and the length of each vector is given by 1 / σ²(s, t). This choice gives an intuitive idea of what the resultant sample paths and prior marginal distributions will look like. Using this idea, a decision maker could build their assessment of μ(s, t) and σ²(s, t) using a graphical interface to manipulate vector fields, which could then be interpolated to create an assessment of μ(s, t) and σ²(s, t) over the entire space.

For example, Figure 5.1 shows one possible choice of prior parameter fields for

predicting the relationship between mileage and the mean selling price for a used car,

as introduced in Chapter 2. The resultant prior marginal distributions are also shown.

This prior assumes that the price of a zero mileage vehicle is known fairly accurately,

that there is a large drop in mean selling prices for the first few thousand miles, and a

slow but relatively predictable drop for low mileage (less than around 60 thousand

miles) vehicles. For high mileage vehicles, the drop in selling price is assumed to be

quicker but less certain, reflecting both an increase in average maintenance costs and

an increase in the uncertainty in these costs. Although monotonicity cannot be

enforced, a large negative drift is used to ensure that the sample paths are unlikely to

ever rise above the cost of a zero mileage vehicle.
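To make the role of the parameter fields concrete, the following sketch draws prior sample paths by the Euler-Maruyama method (the same propagation scheme used later in Chapter 6). The drift and diffusion fields `mu` and `sig` here are simple illustrative stand-ins chosen by us, not the fields shown in Figure 5.1.

```python
import numpy as np

def sample_paths(mu, sig, x0, t_grid, n_paths, rng):
    """Draw diffusion sample paths by Euler-Maruyama.

    mu(x, t): drift field; sig(x, t): diffusion field (both vectorized in x).
    Returns an array of shape (len(t_grid), n_paths).
    """
    xs = np.full(n_paths, float(x0))
    out = [xs.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        xs = xs + mu(xs, t0) * dt \
                + sig(xs, t0) * np.sqrt(dt) * rng.standard_normal(n_paths)
        out.append(xs.copy())
    return np.array(out)

# Illustrative fields: a steep, proportional early drop that levels off,
# with more uncertainty at higher prices (hypothetical numbers).
mu = lambda x, t: -0.12 * x          # drift pulls the price down proportionally
sig = lambda x, t: 0.05 * x + 0.01   # diffusion grows with the price level

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 201)      # e.g. mileage in units of 10k miles
paths = sample_paths(mu, sig, 20.0, t, 5, rng)
print(paths.shape)  # (201, 5)
```

Visualizing a handful of such paths alongside the vector field is exactly the graphical assessment aid described above.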


Figure 5.1. Five prior sample paths (left) and the resultant marginal distributions (right) for one possible prior for

predicting the relationship between mileage and the mean selling price for a used car, as introduced in Chapter 2.

The superimposed vector field graphically illustrates the parameter fields, where μ(x, t) and 1/σ(x, t) are equal to

the slope and length of each vector, respectively.

5.2 Modeling Smoothness, Monotonicity, etc.

In many circumstances, the properties of the sample paths of one dimensional

diffusion processes may not accurately represent the properties of the curve one is

interested in predicting. Some common examples include when the curve can be

assumed a priori to be differentiable, monotonic, or convex. However, by considering

multidimensional diffusions one can define processes such that these conditions hold.

As an example, consider a two dimensional diffusion process for which the drift

and diffusion fields are of the form of Equations 5.1a-b, where a subscript of i refers

to the i-th component of the diffusion. Since the drift and diffusion of the second

component depend only on the second component, X₂(t) is a one-dimensional

diffusion with drift μ₂(x₂, t) and diffusion σ₂(x₂, t). Since X₁(t) has no diffusion

and a drift equal to the second component, X₁(t) is the integral of X₂(t).


μ(x, t) = [ x₂ ; μ₂(x₂, t) ]  (5.1a)

σ(x, t) = [ 0 ; σ₂(x₂, t) ]  (5.1b)

Since all paths of X₂(t) are continuous, all paths of X₁(t) are differentiable. In

addition, if X₂(t) is positive everywhere then X₁(t) is monotonically increasing

(Figure 5.2). Thus a two dimensional diffusion of this form can be used as a prior

distribution which assumes a priori that the curve of interest, represented by X₁(t), is

differentiable and/or monotonic. In terms of assessment, the field μ₂(x₂, t) now has

the interpretation as the expectation of the second derivative of X₁(t) given that at

time t the slope of the curve is equal to x₂ (see footnote 1 in Section 5.1). Again,

σ₂(x₂, t) can be interpreted in terms of the decision maker’s confidence in their

assessment of μ₂(x₂, t).

It is important to note that even though one may only be interested in the updated

prediction of X₁(t) from observations of X₁(t) alone, the forward and backward

messages must still be jointly estimated over both dimensions and not simply over x₁.

This is because {X₁(s), s < t} and {X₁(s), s > t} are independent given both X₁(t) and

X₂(t) but not given X₁(t) alone. Thus, to correctly calculate the updated marginal for

X₁(t) one must multiply the two dimensional forward message by the two

dimensional backward message and then marginalize over x₂, as opposed to simply

multiplying the forward message over x₁ by the backward message over x₁.

Computation times scale exponentially with the number of dimensions, and therefore

there is a computational cost to enforcing monotonicity or differentiability.


Figure 5.2. Five sample paths of a standard geometric Brownian motion (left), defined by μ(x, t) = 0 and

σ(x, t) = x, and the integral of each of these sample paths (right). As the sample paths of geometric Brownian

motion are continuous and positive, the sample paths of the integral of a geometric Brownian motion are

differentiable and monotonically increasing.
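A short simulation makes the monotonicity argument concrete. This is an illustrative sketch (our code, with a driftless geometric Brownian motion assumed for the second component): because X₂ stays strictly positive under the exact log-normal update, every Euler increment of its integral X₁ is nonnegative.

```python
import numpy as np

def integrated_gbm(x1_0, x2_0, sigma, t_grid, n_paths, rng):
    """Simulate dX1 = X2 dt (no diffusion) with X2 a driftless geometric
    Brownian motion. X2 stays strictly positive (exact log-normal update),
    so every path of X1 is differentiable and monotonically increasing."""
    x1 = np.full(n_paths, float(x1_0))
    x2 = np.full(n_paths, float(x2_0))
    path = [x1.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        x1 = x1 + x2 * dt                       # Euler step for the integral
        z = rng.standard_normal(n_paths)
        x2 = x2 * np.exp(-0.5 * sigma**2 * dt + sigma * np.sqrt(dt) * z)
        path.append(x1.copy())
    return np.array(path)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 101)
x1 = integrated_gbm(0.0, 1.0, 0.5, t, 5, rng)
# Monotonicity: every path is nondecreasing along the time axis.
print(bool(np.all(np.diff(x1, axis=0) >= 0)))  # True
```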

Consider now the more general case of two dimensional diffusion processes with

drift and diffusion fields defined in Equations 5.2a-b.

μ(x, t) = [ x₂ ; μ₂(x, t) ]  (5.2a)

σ(x, t) = [ 0 ; σ₂(x, t) ]  (5.2b)

Again, the paths of X₁(t) are differentiable and are monotonic if X₂(t) is positive

everywhere. Now, however, the present dynamics depend on both the current state and

its derivative. In this case, μ₂(x, t) can be interpreted as the expected second

derivative of X₁(t) at {x₁, t} given that it goes through the point {x₁, t} and has slope

x₂ at this point. Again, σ₂(x, t) can be interpreted as the decision maker’s

confidence in their assessment of μ₂(x, t).


In general, any two dimensional diffusion process where the paths of X₁(t) are

differentiable must have drift and diffusion of the form of Equations 5.3a-b. Here,

X₁(t) is also monotonic provided the range of f is positive.

μ(x, t) = [ f(x, t) ; μ₂(x, t) ]  (5.3a)

σ(x, t) = [ 0 ; σ₂(x, t) ]  (5.3b)

In Equations 5.3a-b, the process X₂(t) no longer has the interpretation as the

derivative of X₁(t), and hence it is not as easy to give an intuitive meaning to the

parameter fields. There are, however, some cases of interest which are of the form of

Equations 5.3a-b, such as an exponentially smoothed diffusion process (defined by

Equations 5.4a-b, where α > 0 is the smoothing rate).

μ(x, t) = [ α(x₂ − x₁) ; μ₂(x₂, t) ]  (5.4a)

σ(x, t) = [ 0 ; σ₂(x₂, t) ]  (5.4b)

Higher dimensional analogues of Equations 5.1-5.3 can easily be established but

are not presented here. In general, for the paths of X₁(t) to be (n − 1) times

differentiable one must use an n-dimensional diffusion process. Enforcing a

constraint on the (n − 1)th derivative requires that derivative be continuous, and

hence an n-dimensional diffusion process. Thus, to enforce convexity, a three

dimensional diffusion process is required. Although difficulty in elicitation of high

dimensional diffusion processes can be alleviated by choosing simpler forms such as


Equation 5.1, the computational complexity of Algorithms 4.1-4.2 still increases

exponentially with the number of dimensions.

5.3 Modeling Discontinuities and Cusps

In some cases, one may wish to assume a priori that the curve of interest is continuous

or differentiable everywhere except at a countable number of points. While no

diffusion process satisfies these properties, one can construct a suitable prior

distribution using jump diffusions.

To construct a prior which is continuous except at a countable number of points, a

first order jump diffusion process can be used (see Section 3.4). A one dimensional

jump diffusion requires one to specify μ(x, t), σ(x, t), λ(x, t), and ν(y | x, t). Here,

μ(x, t) and σ(x, t) should be assessed as in Section 5.1. The jump intensity λ(x, t) can

be interpreted² as the probability (per unit t) of there being a discontinuity at {x, t}

given that X(t) = x. The field ν(y | x, t) should be assessed as the PDF of X(t)

immediately after the jump given that X(t) = x and there is a discontinuity at {x, t}.

From a practical standpoint it is useful to choose a suitable family of distributions for

ν(y | x, t) (e.g., normal) so that one needs only to specify the parameters of this

distribution (e.g., mean and variance) for each {x, t}.

Analogously to Section 5.2, one can specify priors for which the sample paths are

differentiable except at a countable number of points by using multidimensional jump

2 The rigorous meaning is that given that X(t) = x, in the limit as h → 0 the probability of there being a

discontinuity in [t, t + h) is equal to λ(x, t)h.


diffusions. This is useful from a modeling standpoint, for example when one is interested

in predicting a curve which is known to have cusps.

The forms of Equations 5.1-5.3 carry over identically to the case of jump

diffusions, but now the parameters controlling the behavior at the discontinuities in

X₂(t) must also be specified. For the case of Equation 5.1, one would need to specify

λ₂(x₂, t) and ν₂(y | x₂, t). Here λ₂(x₂, t) has the interpretation of the probability (per unit

t) of X₁(t) being nondifferentiable at t given that the slope of the curve at t equals x₂.

In addition, ν₂(y | x₂, t) would have the interpretation of the PDF of the slope of the

curve just after t given that the curve is nondifferentiable at t and its slope going into t

equals x₂. For the case of Equations 5.2-5.3, one would have to specify λ₂(x, t) and ν₂(y | x, t),

and hence the behavior at discontinuities would depend both on the current state and

the slope going into the discontinuity.

An example of a process which can be used for modeling a curve with cusps is

shown in Figure 5.3. Higher dimensional processes can be used to model curves

which are several times differentiable except at a countable number of points. As a general

rule, for the paths of X₁(t) to be (n − 1) times differentiable except at a countable

number of points, an n-dimensional jump diffusion process must be used.
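The cusp construction can be sketched in a few lines (our illustrative code, with arbitrary jump intensity, jump size, and diffusion values): jumps in the second component X₂ become slope discontinuities, i.e. cusps, in its integral X₁.

```python
import numpy as np

def integrated_jump_bm(t_grid, lam, jump_scale, diff, rng):
    """One path of X2 = Brownian motion with Poisson jumps, and X1 = its
    integral. A jump in X2 (rate lam per unit t) leaves X1 continuous but
    changes its slope discontinuously, producing a cusp."""
    x1, x2 = 0.0, 0.0
    xs1, xs2 = [x1], [x2]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        x1 += x2 * dt                               # integrate the slope
        x2 += diff * np.sqrt(dt) * rng.standard_normal()
        if rng.random() < lam * dt:                 # at most one jump per small step
            x2 += jump_scale * rng.standard_normal()
        xs1.append(x1)
        xs2.append(x2)
    return np.array(xs1), np.array(xs2)

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 1001)
x1, x2 = integrated_jump_bm(t, lam=5.0, jump_scale=3.0, diff=1.0, rng=rng)
print(len(x1) == len(t))  # True
```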


Figure 5.3. Five sample paths of a Brownian motion with jumps (left), and the integral of each of these sample

paths (right). As the sample paths of the jump Brownian motion are piecewise continuous, the sample paths of its

integral are piecewise differentiable.

5.4 Curves with Multiple Dependent Variables

One may also be interested in predicting multiple related functions of a given variable.

For example, one may wish to predict both the average birth weight and average birth

height of an infant as a function of the length of pregnancy. Even when both

processes are assumed first order, the assessment of the parameter fields can already be

quite complicated, as it involves assessing general two dimensional drift and

diffusion fields (Equations 5.5a-b).

μ(x, t) = [ μ₁(x, t) ; μ₂(x, t) ]  (5.5a)

σ(x, t) = [ σ₁₁(x, t)  σ₁₂(x, t) ; σ₂₁(x, t)  σ₂₂(x, t) ]  (5.5b)

As can be seen from these equations, when predicting multiple variables the drift

fields (e.g., the expected change in birth weight and height per day of pregnancy) can

now be specified as a function of the independent variable as well as each dependent

Page 77: MARKOV PROCESS REGRESSION A DISSERTATION SUBMITTED …jp891mj8064... · Sam Chiu participated in helpful conversations regarding my thesis as well as sitting on my committee, and

64

variable (e.g., length of pregnancy, current weight, and current height). Specifying the

diffusion field essentially involves specifying a covariance matrix for each point and

can be somewhat complicated. This is because now there is the possibility that two

variables are dynamically related (e.g., birth weight being likely to increase on a day

on which height increases significantly).

While it is beyond the scope of this work, a more in depth explanation of methods

to assess covariance matrices can be found in (Shachter and Kenley, 1989). The linear

coefficients b(i, j) in this construction specify how knowing the change in variable i

updates the decision maker’s belief about the expected change in variable j, and the

conditional variances v(j) can be interpreted as the decision maker’s confidence in

their assessment of the expected change in variable j conditional on all of variable j’s

parents. For example, for birth height and weight, the linear coefficient would be

positive and chosen as the expected change in weight per change in height. Knowing

the change in birth weight on a given day would increase one’s confidence in their

ability to guess the change in birth height. However, it would not result in perfect

confidence in this guess. This would be captured by having a nonzero conditional

variance.
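For the two-variable case, this construction can be written out directly. The numbers below are hypothetical, and the names (v_w for the variance of the daily weight change, b for the linear coefficient, v_h_given_w for the conditional variance of the height change) are ours, not the source's; the sanity checks show that the implied regression coefficient and conditional variance recover the assessed inputs.

```python
import numpy as np

# Assessed inputs (hypothetical): weight-change variance, linear
# coefficient of height change on weight change, conditional variance.
v_w, b, v_h_given_w = 4.0, 0.5, 1.0

# Covariance matrix implied by the linear-Gaussian construction:
# h = b*w + eps, with Var(w) = v_w and Var(eps) = v_h_given_w.
cov = np.array([
    [v_w,     b * v_w],
    [b * v_w, b**2 * v_w + v_h_given_w],
])

# Sanity checks: recover the coefficient and the conditional variance.
print(cov[0, 1] / cov[0, 0])                    # 0.5  (the coefficient b)
print(cov[1, 1] - cov[0, 1]**2 / cov[0, 0])     # 1.0  (v_h_given_w)
```

The same recursion extends to more variables by ordering them so that parents precede children, as in the cited reference.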

Of course, one can couple higher order diffusion processes and jump diffusions as

well by adding additional dimensions. Given that the specification of the parameters

is already quite complicated in the general case of two coupled one dimensional

diffusions, one would likely have to make simplifying assumptions about the drift and

diffusion fields in order to make specification of these parameter fields feasible. Again, the number of

Page 78: MARKOV PROCESS REGRESSION A DISSERTATION SUBMITTED …jp891mj8064... · Sam Chiu participated in helpful conversations regarding my thesis as well as sitting on my committee, and

65

dimensions is also limited from a computational standpoint, since computation time

increases exponentially with the number of dimensions.

Page 79: MARKOV PROCESS REGRESSION A DISSERTATION SUBMITTED …jp891mj8064... · Sam Chiu participated in helpful conversations regarding my thesis as well as sitting on my committee, and

66

Chapter 6

Applications

In this chapter, some benefits that Markov process regression has over existing

regression methods are highlighted through applied problems in medicine, consumer

science, and manufacturing. In the first application, Markov process regression is

used to improve the prediction of waiting times for kidney transplants in pooled

exchanges. This section presents a fully realistic application of Markov process

regression which could be used today to better inform important health decisions. In

the second application, the problem of optimally pricing goods to maximize profit is

examined. Primarily, the motivation for this section is to illustrate the real world

benefits of using a Markov process prior to impose monotonicity, as this cannot be

accomplished with the more common Gaussian process prior. While this is

representative of a real world decision situation and historical sales data is used in this

application, a number of assumptions are made to simplify the exposition. Lastly,

Markov process regression is used to predict vibrational stability in machining. The

Page 80: MARKOV PROCESS REGRESSION A DISSERTATION SUBMITTED …jp891mj8064... · Sam Chiu participated in helpful conversations regarding my thesis as well as sitting on my committee, and

67

purpose of this section is to illustrate the immense modeling flexibility allowed by

Markov process priors which is not possible with other regression methods. In this

chapter, notation that is common in each application is used and, where noted, conflicts

with the notation of previous chapters.

Throughout this chapter, Algorithms 4.1 and 4.2 are implemented as follows.

Except when transition densities have closed form solutions, sample paths

are propagated using the Euler-Maruyama method, a generalization of the

Euler method for ordinary differential equations. When calculating the forward

message, resampling is performed at every observation. When calculating the

backward message, resampling is performed at every time step and the path weights

are importance sampled according to the forward message to avoid numerical issues

that may result when the backward message is non-normalizable. Histograms are used

to estimate the density of both the forward and backward messages. These methods

were chosen for their simplicity and ease in replication. However, more complicated

simulation methods, resampling schemes, and density estimation techniques would

likely have computational advantages.
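As a rough sketch of the forward-message computation just described, the following bootstrap-style particle filter (our code) propagates particles by Euler-Maruyama, reweights them by the observation likelihood, and resamples at every observation. Two simplifications here are ours: a Gaussian observation model and plain multinomial resampling.

```python
import numpy as np

def forward_message(obs_t, obs_y, mu, sig, x0_sample, n_particles, n_sub,
                    obs_sd, rng):
    """Particle approximation of the forward message: propagate particles
    by Euler-Maruyama between observations, weight by the likelihood at
    each observation, then resample so weights reset to uniform."""
    xs = x0_sample(n_particles)
    t = 0.0
    for t_obs, y in zip(obs_t, obs_y):
        dt = (t_obs - t) / n_sub
        for k in range(n_sub):                    # Euler-Maruyama substeps
            tk = t + k * dt
            xs = xs + mu(xs, tk) * dt \
                    + sig(xs, tk) * np.sqrt(dt) * rng.standard_normal(n_particles)
        t = t_obs
        w = np.exp(-0.5 * ((y - xs) / obs_sd) ** 2)   # Gaussian obs model (assumed)
        w /= w.sum()
        xs = rng.choice(xs, size=n_particles, p=w)    # resample at the observation
    return xs  # particles approximating the forward message at the last observation

rng = np.random.default_rng(3)
mu = lambda x, t: 0.0 * x            # standard Brownian motion prior
sig = lambda x, t: np.ones_like(x)
xs = forward_message([0.5, 1.0], [1.0, 1.2], mu, sig,
                     lambda n: rng.standard_normal(n), 2000, 20, 0.2, rng)
print(xs.shape)  # (2000,)
```

A histogram of `xs` is then the density estimate of the forward message at that point, as described above.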

6.1 Estimating Wait Times in Pooled Kidney

Exchanges

In this section, Markov process regression is used to improve predictions of wait times

for kidney transplant patients in paired or pooled exchanges. An accurate prediction

of this wait time is important for a patient to know, as actions can be taken to reduce


one’s wait time, such as by accepting kidneys from older donors or finding an easier-to-

match donor to enter the exchange with. This section presents a fully realistic

application of Markov process regression which could be used today to better inform

important health decisions.

When compared to the (classical and parametric) proportional hazards model

(Cox, 1972) which is commonly used in wait time estimation, Markov process

regression has a number of advantages. Since Markov process regression is Bayesian,

censored data, which arises when a person left the program before receiving a kidney

or was still waiting when the data was collected, is straightforward to incorporate. In

addition, the parametric form of the proportional hazards model is a poor global fit to

the data and leads to subpar estimates of waiting times for some candidate recipients.

The predictive performance of the Markov process model vs. the proportional hazards

model is quantified using a leave-one-out cross-validated likelihood approach. This

entails training each model from all data points except for one and then calculating the

likelihood of each model for the remaining data point. This process is then repeated

for each data point.

6.1.1 Introduction to Paired and Pooled Kidney Donation

Those in need of kidney transplants have a few options. One option is to receive a

compatible kidney from a recently deceased organ donor. These types of donations

are managed by UNOS, the United Network for Organ Sharing (www.unos.org).

However, due to the volume of patients waiting for available organs, there can be an

extremely long wait time before a patient receives one. A second option is to receive


a kidney donation from a live patient, often a relative or close friend, who is a match.

If the candidate recipient and donor are not a match, another alternative is to enter into

a pooled kidney exchange.

In pooled kidney exchanges a group of candidate recipient-donor pairs are chosen

such that each recipient is able to receive a kidney from a donor in this pool. In the

simplest case with two recipient-donor pairs (say A and B), the kidney of donor A is

transplanted to recipient B and the kidney of donor B is transplanted to recipient A.

Larger cycles of donations also happen. There are a number of special cases which

arise as well. For example, an altruistic donor may join a pooled donation without

expecting a kidney in return (Rees et al., 2009) or a recipient may receive a kidney

with the expectation that the associated donor gives their kidney at a later time

(Melcher et al., 2012). The methods for allocating patients to pools are quite

complicated (Dickerson, Procaccia, and Sandholm, 2012) and aim to optimize metrics

such as average waiting time or throughput.

Two particularly important factors driving this waiting time for a given donor-

recipient pair are how many donors in the exchange the recipient is able to receive a

kidney from and how many recipients in the exchange the donor can give a kidney to.

As either increases, their wait time should be expected to decrease. Both of these

factors are often pooled into a single compatibility metric, termed “pair match power”

(PMP). While the precise definition may vary depending on the source, here PMP is

taken as the fraction of donors in the exchange that the recipient can receive a kidney

from multiplied by the fraction of recipients in the exchange the donor can give a

kidney to, as in (Veale and Hil, 2011). This definition is useful because PMP then has


the interpretation as the probability that both a randomly selected donor in the

exchange can give a kidney to this recipient and a randomly selected recipient in the

exchange can receive a kidney from this donor. Factors determining donor-recipient

compatibility include the donor blood type, the recipient blood type, the recipient’s

antibodies, and to a lesser extent the donor’s antigens (based on data from the National

Kidney Registry and the Alliance for Paired Donation, two major pooled kidney donation

programs in the United States).
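The PMP definition above translates directly into code. The toy compatibility matrix below is hypothetical, and the function name is ours; for a real exchange the matrix would come from blood-type and antibody screening.

```python
import numpy as np

# compat[i, j] = True if donor j can give a kidney to recipient i.
# Toy 4-pair exchange (hypothetical compatibilities).
compat = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
], dtype=bool)

def pair_match_power(compat, pair):
    """PMP of `pair`: the fraction of donors its recipient can accept,
    times the fraction of recipients its donor can serve."""
    frac_donors = compat[pair, :].mean()       # donors compatible with this recipient
    frac_recipients = compat[:, pair].mean()   # recipients compatible with this donor
    return frac_donors * frac_recipients

print(pair_match_power(compat, 0))  # 0.25
```

The product form is what gives PMP its probability interpretation noted above.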

Accurate knowledge of the relationship between PMP and wait time can be

important to a prospective recipient, as there are steps one can take to reduce their wait

time. For example, many recipients screen prospective kidneys based on their size,

how complicated their artery structures are, and the age of the donors, among other

factors (Liu et al., 2014). However, if a recipient knows they are likely to wait a

significant amount of time for any kidney, it may be in their best interest to lower their

screening standards. Another option may be to find a more highly demanded donor,

such as one with an O blood type. In addition, knowing the relationship between PMP

and wait time is important to those setting the policies of the kidney exchange.

An exchange may make policy changes, for example, with the specific goal of

reducing wait times for those with very low PMP. By examining the relationship

between PMP and wait time before and after such a change was made, a policy maker

can determine whether or not the desired effect has been achieved.


6.1.2 Modeling the Relationship between Compatibility and Wait Time

By estimating the relationship between PMP and wait time from existing data, donor-

recipient pairs can be given a better idea of what their individual wait time might be.

A commonly used assumption (Chen et al., 2012), which is made here, is that suitable

matches for a given donor-recipient pair arrive as a Poisson process (and hence the

waiting time follows an exponential distribution). The rate parameter of the Poisson

process (denoted λ(PMP)) is then modeled as a function of the donor-recipient pair’s PMP,

and the curve λ(PMP) can then be estimated via regression on existing data³.

Poisson assumption seems reasonable as there is no reason to expect any memory in

the process, it is possible that this is not representative of real world behavior. For

example, hidden factors influencing wait time may lead to an empirical variance

which is greater than that of the exponential distribution. In addition, queuing effects

may yield non-exponential wait times. The following analysis could be carried out

using a more complicated arrival time distribution but would likely require some

additional estimation of parameters.

While at first glance it might seem feasible to construct a simple and accurate

theoretical model of the effect of PMP on wait time, it turns out that this is a

complicated relationship. In the highly idealized case where a donor-recipient pair

will receive a transplant the next time a match enters the exchange, if the rate of

arrivals for an arbitrary new pair entering the exchange is ν, the rate at which a

3 This definition for λ(PMP) is specific to Section 6.1. It is commonly used in this application but conflicts with its

use in the remainder of this thesis.


compatible match will be found is ν·PMP. However, as the size of pools

increases or the number of altruistic donors increases, pairs with a lower PMP are more

likely to receive transplants (Rees et al., 2009). For example, if only very large pools

were formed and always involved an altruistic donor, this would effectively allow all

interested parties to join the pool and λ(PMP) would be close to constant. While it is

possible that a theoretical model could be developed that accurately estimated wait

times, it would likely be very complicated due to the optimization step in determining

which pools are formed.

Thus, the alternative of estimating λ(PMP) empirically is an attractive option. A

commonly used (classical parametric) model for predicting λ(PMP) is given in Equation

6.1. This model is equivalent to a proportional hazards model with constant hazard

rate where the covariate is taken as ln(PMP). This is also equivalent to saying that

the curve λ(PMP) is a straight line in ln(PMP)-ln(λ) space with slope β

and y-intercept ln(λ₀).

λ(PMP) = λ₀·PMP^β  (6.1)

The alternative model being proposed here is to use a (Bayesian nonparametric)

Markov process prior defined by the integral of a geometric Brownian motion (see

Figure 5.2) in ln(PMP)-ln(λ) space. This model assigns the highest

probabilities to curves satisfying Equation 6.1 but also assigns nonzero probability to

any differentiable monotonically decreasing curve. In particular, the more a curve


deviates from Equation 6.1, the less probability is assigned to it⁴. The amount by

which deviation is penalized is controlled by the percentage volatility parameter σ. As

σ → 0, this model reduces to the proportional hazards model, and as σ increases,

deviation from the proportional hazards assumption is penalized less.

6.1.3 Methodology for Estimating Wait Times under each Model

For the Markov process model, the prior for the arrival rate λ(PMP) was assumed

to be the (negative of the) integral of a geometric Brownian motion with respect to

(negative) ln(PMP). To be in line with the notation of the previous chapters, let

X₁(t) denote ln(λ(PMP)) and t denote −ln(PMP). Then, this model is

described by Equations 6.2a-b. The percentage volatility σ was chosen as a fixed constant.

The motivation for defining t as −ln(PMP) instead of PMP is that a

significant amount of data is available near PMP = 1 and less data is available at

lower PMP. Having a substantial amount of data near PMP = 1, which for this choice

corresponds to t = 0, effectively suppresses any dependence on the prior

distribution of X(0). Although consequently it is not particularly sensitive to these

choices, the prior distribution of X(0) used in the validation was that X₁(0) and

X₂(0) are independent with distributions given by Equations 6.3a and 6.3b

respectively.

4 For ease in exposition, some mathematical nuances are ignored here.


μ(x, t) = [ −x₂ ; 0 ]  (6.2a)

σ(x, t) = [ 0 ; σx₂ ]  (6.2b)

( ) ( ) (6.3a)

( ) ( ) (6.3b)

The likelihood functions for observations follow explicitly from the assumption

that arrivals of a suitable match for a given donor-recipient pair follow a Poisson

process and hence have exponentially distributed waiting times. It is necessary,

however, to distinguish between those people who actually received kidneys and

censored data, which arises when a person left the program before receiving a kidney

or was still waiting when the data was collected. For censored data, only a lower

bound on the time they would have waited for a kidney is known. Let tᵢ and PMPᵢ

denote the observed time waiting for a kidney and the PMP for donor-recipient pair i,

respectively. The likelihood functions if they received a kidney and for the censored

data points are then given by Equations 6.4a and 6.4b, respectively. These likelihoods

can of course be translated into equations in terms of X(t) and t using the definitions

above.

L(tᵢ | λ(PMPᵢ)) = λ(PMPᵢ) exp(−λ(PMPᵢ)tᵢ)  (6.4a)

L(tᵢ | λ(PMPᵢ)) = exp(−λ(PMPᵢ)tᵢ)  (6.4b)
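These two likelihoods combine into a single log-likelihood that handles observed and censored waits uniformly, since the survival term is common to both cases. The sketch below is our illustration with toy numbers.

```python
import numpy as np

def log_likelihood(t_wait, rate, censored):
    """Exponential waiting-time log-likelihood (cf. Eqs. 6.4a-b):
    density  rate * exp(-rate * t)  for pairs that received a kidney,
    survival exp(-rate * t)         for censored pairs."""
    t_wait = np.asarray(t_wait, float)
    rate = np.asarray(rate, float)
    censored = np.asarray(censored, bool)
    ll = -rate * t_wait                        # survival term, common to both cases
    ll = ll + np.where(censored, 0.0, np.log(rate))
    return float(ll.sum())

# Toy data: two observed transplants and one censored wait (illustrative).
val = log_likelihood([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [False, False, True])
print(round(val, 4))  # -3.0 of survival terms plus 2*log(0.5)
```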

The proportional hazards model was implemented using the Statistics Toolbox in

MATLAB R2007a. A maximum likelihood approach was then used to find the base


rate under the Poisson arrival assumption (λ₀ in Equation 6.1) instead of allowing for

a time varying base rate. The resultant base rate is given in Equation 6.5, where β is

found in the proportional hazards step and N_rec denotes the number of donor-

recipient pairs in the dataset which actually received kidneys.

λ₀ = N_rec / Σᵢ tᵢ PMPᵢ^β  (6.5a)
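The base-rate estimate is a one-liner: the number of observed transplants divided by the total rate-weighted exposure, with censored pairs included in the exposure sum. The function name and toy data below are ours.

```python
import numpy as np

def base_rate_mle(t_wait, pmp, censored, beta):
    """MLE of the base rate lambda0 under rate lambda0 * pmp**beta with
    exponential waits: observed transplants divided by the exposure sum,
    which runs over all pairs, censored included."""
    t_wait = np.asarray(t_wait, float)
    pmp = np.asarray(pmp, float)
    censored = np.asarray(censored, bool)
    n_received = int((~censored).sum())
    return n_received / float(np.sum(t_wait * pmp ** beta))

# Toy data: 2 observed transplants over 12 units of weighted exposure.
print(round(base_rate_mle([2.0, 4.0, 6.0], [1.0, 1.0, 1.0],
                          [False, False, True], 1.0), 4))  # 0.1667
```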

6.1.4 Cross-validation Methodology and Results

To evaluate the performance of each model, leave-one-out cross-validation was

performed on a set of historical waiting time data for patients in kidney pool

exchanges (National Kidney Registry, 2013). This dataset consists of 829 waiting

times, with 549 donor-recipient pairs receiving kidneys and the remaining 280 being

censored.

The validation scheme consisted of training each model on 828 of the data points,

validating on the remaining data point, and repeating this process for each of the 829

data points. The metric used in each case was the log-likelihood of each model given

the validation data point. When compared to standard likelihood tests, cross-

validated likelihood penalizes models which are overfit (Stone, 1977) and allows one

to see which data points contribute most to the overall likelihood of each model. A

metric such as the average absolute difference between the predicted average wait

time and the observed wait time would be easier to interpret. However, a metric of

this sort cannot be evaluated at censored data points. Simply throwing out these data

points would result in a bias towards models that underestimate wait times. This is


because donor-recipient pairs with longer wait times are more likely to be censored

and hence not be counted in the validation. In implementing the particle filtering

algorithms (Algorithms 4.1-4.2) in this validation testing and in the figures which follow,

10⁵ paths were used and the state space was discretized into a ( ) grid.

The updated prediction for the mean waiting time as a function of PMP, 1/λ(PMP),

from all 829 data points under each model is shown in Figure 6.1. Figure 6.2

shows the ratio between the predicted mean wait times under the Markov model vs.

under the proportional hazards model. The cumulative cross-validated relative log-

likelihood, or evidence, for the Markov model over the proportional hazards model is

likelihood, or evidence, for the Markov model over the proportional hazards model is

also plotted. Overall, the cross-validation showed that the data was times

more likely to be drawn from the Markov model than from the proportional hazards

model. This number is found by summing the cross-validated relative log-likelihood

over all 829 data points and exponentiating.
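The evidence computation just described is a sum and an exponential. With hypothetical per-point cross-validated log-likelihoods (the numbers below are ours, purely for illustration):

```python
import numpy as np

# Per-point cross-validated log-likelihoods under each model (toy numbers).
ll_markov = np.array([-1.1, -0.9, -1.3, -0.7])
ll_prop_haz = np.array([-1.4, -1.0, -1.2, -1.1])

relative_ll = ll_markov - ll_prop_haz     # pointwise evidence for the Markov model
evidence = np.exp(relative_ll.sum())      # overall likelihood ratio
print(round(float(evidence), 3))          # exp(0.7)
```

Plotting the running cumulative sum of `relative_ll` against PMP is what Figure 6.2 does to show which regions of the data drive the preference.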

As can be seen in Figure 6.1, the primary difference between the predictions under

the proportional hazards model and the Markov process model is that the Markov

process model predicts step like behavior near particular values of PMP. Consequently,

there are a number of intervals for which the Markov model predicts both longer and

shorter wait times than estimated under the proportional hazards model. Although the

predictions often differ substantially for low PMP, there is little evidence for either

model over the other in this region (Figure 6.2). This is because there are fewer

observations in this region and many are censored, which is less informative. In the


region where , however, the Markov process model is significantly

preferred to the proportional hazards model.

In conclusion, an accurate prediction of wait times, such as that provided by the

Markov process model, can be essential to a prospective recipient’s decision-making

process. For example, consider a donor-recipient pair with that is

considering accepting kidneys from donors older than the standard screening age of

40. As 40% of prospective donors are over 40 (Liu et al., 2014), this pair could raise

its effective to 0.12 by raising the cutoff age such that only the oldest 20% were

screened out. Under the Markov process model, this may be an attractive option as

this almost cuts the pair’s expected waiting time in half. If one were to use the

proportional hazards model instead, this option would probably not look as attractive

as only a modest decrease in waiting time would be predicted (Figure 6.3). As the

proportional hazards model is not supported by the data, use of the proportional

hazards model could in this case lead someone to make a decision which significantly

reduces their probability of receiving a kidney. In addition, those setting the policies

of the exchange may be interested in knowing that there is such a sharp increase in waiting

time for a small change in and may wish to adjust their policies to alleviate this.


Figure 6.1. The median, 5%, 25%, 75% and 95% quantiles for the updated prediction of the mean wait time

( ( )) under the Markov process model in log-log space. measures compatibility of a donor-recipient pair.

The prediction under the proportional hazards model is shown in red. Green ‘X’s denote wait times where

recipients received a kidney and magenta ‘X’s denote censored data.

Figure 6.2. Comparison of expected wait times under the Markov model vs. under the proportional hazards model

(PH) and the cumulative relative log-likelihood (evidence) for the Markov model over the proportional hazards

model. is plotted on a logarithmic scale. Note that the individual instances of are not unique and the

relative log-likelihood shown is summed over repeated instances at each . The Markov model is strongly

preferred for .


Figure 6.3. Predicted wait time densities for and under both the Markov process model

(MP) and the proportional hazards model (PH). Under the Markov process model, the expected wait time for

is approximately half as much as for while under the proportional hazards model there

is only a modest decrease in expected wait time. From Figure 6.2, the Markov process model is much more

strongly supported by the data.

6.2 Estimating Price-Demand Curves

This section illustrates the real world benefits of enforcing monotonicity in the prior.

To demonstrate this, a common and important application of regression – predicting

the effect that pricing has on the demand for a good – was performed using both a

Markov process prior with monotonicity enforced and a (warped) Gaussian process

prior. This application is particularly well suited for the comparison because after

transforming the state space the likelihoods are Gaussian and hence no approximation

is necessary to calculate the updated Gaussian distribution. In addition, the parameters

of each model were chosen such that the two models are similar with the exception of

the monotonicity constraint.


To quantify the performance of each approach, cross-validation was performed

using historical store level scanner data from a major grocery store chain. In

particular, once each prior is updated from the training set, the price which maximizes

revenue under each model was used as the pricing decision for validation of that

model. The realized weekly revenue under each model was taken as the demand

averaged over all weeks in the validation set in which the good was sold at this price.

As opposed to a statistical performance metric, this approach allows one to compare

these methods directly in terms of their effect on overall revenue, which is much more

useful from a decision maker’s point of view.

Although this application is meant to be representative of a real world decision

situation, a number of assumptions were made to simplify this exposition. First, in

many circumstances it is likely that one wishes to maximize profit and not revenue

and hence would also need to factor (variable) cost of goods into the pricing decision.

Secondly, as the dataset consists of almost nine years of data, there are likely some

changes in demand behavior over time which are ignored here. Thirdly, the pricing

decision is likely influenced by other factors, such as the pricing of complementary

goods which are also ignored here. These effects could be captured using a similar but

more complicated methodology.


6.2.1 Input Data

The data used in this application is store-aggregated weekly scanner data from

Dominick’s, a large retail grocery chain in the Chicago area, which is openly available

(James M. Kilts Center, 2013). In particular, the product category of bottled juices

was used in this application. For each product in this category (tracked by unique item

number, which may correspond to one or more UPC) and each store, the demand and

price during each week were tracked. The 93 combinations of items and stores with

the most overall demand were selected for validation.

6.2.2 Modeling the Effect of Price on Demand

Demand was modeled by a lognormal distribution with an unknown log-mean

parameter ( ) which depends on price (Equation 6.6). The curve ( ) was then

estimated via regression under a Markov model, two untrained Gaussian models, and

two trained Gaussian models. As ( ) is not particularly easy to interpret, regression

of the median parameter of the lognormal distribution, which is the exponential of the

log-mean parameter, is instead presented. For simplicity, it is assumed that the log-variance parameter

does not depend on price5. The lognormal distribution was chosen as it is both

commonly used in demand modeling and yields analytical results for Gaussian process

regression. The resulting analysis could also be carried out with other likelihoods.

demand ~ Lognormal(μ(p), σ²) (6.6)

5 These definitions for ( ) and are specific to Section 6.2. This notation is commonly used in this application but

conflicts with its use in the remainder of this thesis.
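A minimal sketch of the demand model described above, written with the conventional names mu(p) for the log-mean and sigma² for the log-variance (the thesis's own symbols were lost in extraction):

```python
import math
import random

def sample_demand(mu_p, sigma2, rng):
    """Draw one weekly demand: log-demand ~ Normal(mu(p), sigma^2)."""
    return math.exp(rng.gauss(mu_p, math.sqrt(sigma2)))

def median_demand(mu_p):
    """Median of the lognormal is the exponential of its log-mean."""
    return math.exp(mu_p)
```

Because the median depends only on the log-mean, regressing the median-demand curve is equivalent to regressing mu(p) itself.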


In both the Markov and Gaussian models, relatively simple choices for the prior

processes were made. In addition, the parameters of each model were chosen such

that the two models are similar with the exception of the monotonicity constraint. In

each test, the minimum price considered was taken as the minimum price appearing in

the training set times 0.8, and the maximum price considered was taken as the

maximum price appearing in the training set divided by 0.8. The updated prediction

under each model was calculated at a set of 101 evenly spaced points along this

interval. In each test, the log-variance parameter in the likelihood, , was estimated

by averaging the sample log-variance at each price point weighted by the number of

instances of that price point in the training set.
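That weighted estimate can be sketched as follows (the function name is hypothetical, and the use of the n − 1 sample variance is an assumption):

```python
import math
from collections import defaultdict

def estimate_log_variance(prices, demands):
    """Average of the sample variance of log-demand at each distinct
    price, weighted by the number of observations at that price."""
    groups = defaultdict(list)
    for p, d in zip(prices, demands):
        groups[p].append(math.log(d))
    acc, total_w = 0.0, 0
    for logs in groups.values():
        if len(logs) < 2:
            continue  # sample variance is undefined for a single point
        m = sum(logs) / len(logs)
        var = sum((x - m) ** 2 for x in logs) / (len(logs) - 1)
        acc += len(logs) * var
        total_w += len(logs)
    return acc / total_w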

The prior for ( ) was first modeled as a 2nd-order Markov process where it was

assumed that in log price-log demand space, the prior for ( ) was the negative of the

integral of a geometric Brownian motion (see Figure 5.2) with percentage volatility .

This is equivalent to saying that the elasticity of demand is equal to the negative of a

geometric Brownian motion. Figure 6.4 shows sample paths of the prior process in

both log-log and price demand space.

In log-log space, the initial distribution at the lowest price point being considered

for ( ) was assumed to be normally distributed and its derivative was assumed to be

lognormally distributed and independent of ( ). The parameters of the initial

distribution were estimated by matching the mean and variance of these distributions

to the results of a linear regression of the training set (in log-log space). The

percentage volatility was chosen as 1.
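As an illustrative sketch of this prior (hypothetical parameter names; an Euler discretization with the exact GBM multiplicative update):

```python
import math
import random

def sample_prior_path(log_p_grid, mu0, elasticity0, vol=1.0, seed=0):
    """One sample path of the prior for the log-median curve: the
    elasticity follows a driftless geometric Brownian motion (so it
    stays positive), and the curve decreases by elasticity * d(log p),
    i.e. it is the negative of the integral of the GBM."""
    rng = random.Random(seed)
    mu, elast = mu0, elasticity0
    path = [mu]
    for i in range(1, len(log_p_grid)):
        dt = log_p_grid[i] - log_p_grid[i - 1]
        mu -= elast * dt
        elast *= math.exp(-0.5 * vol ** 2 * dt
                          + vol * rng.gauss(0.0, math.sqrt(dt)))
        path.append(mu)
    return path
```

Because the elasticity stays positive, every sample path is monotonically decreasing in log price, which is exactly the monotonicity this section sets out to enforce.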


Figure 6.4. Five sample paths of the Markov process prior for the median parameter ( ) in log-log space (left)

and price-demand space (right). The parameters for the initial distribution were fitted from the training data and

the percentage volatility was chosen as 1.

The prior for the median-demand curve ( ) was also modeled using a Gaussian

process in log price – log demand space. It was assumed that this process has an

affine mean function and a Matérn covariance function with (in log-log

space). This is a commonly used covariance function and has sample paths that are

exactly once differentiable like the sample paths of the Markov process defined above.

The parameters of the mean function were found using a linear regression of the

training set (in log-log space). The height of the covariance function was chosen such

that at the lowest price point being considered, the Gaussian and Markov priors have

the same distribution. Two choices of the length-scale parameter of the covariance

function were used: 1/5 of the price interval being considered and 1/2 of the price

interval being considered (in log-log space). For this covariance function, the

correlation between two points separated by one length scale is approximately 0.48 and

the correlation between two points separated by two length scales is approximately 0.14.

Figure 6.5 shows sample paths from the two Gaussian process priors with parameters

fit using the same training data as for the Markov process prior in Figure 6.4. As can


be seen in these figures, a smaller length-scale parameter results in sample paths that

change more rapidly. These calculations were performed using the open source

software GPML for Matlab.
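The quoted correlations are consistent with a Matérn covariance with ν = 3/2 (an inference from the "exactly once differentiable" sample paths; the stated value was lost in extraction), which can be checked directly:

```python
import math

def matern32_corr(r, length_scale):
    """Matern nu = 3/2 correlation, normalized to unit height:
    k(r) = (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)."""
    a = math.sqrt(3.0) * r / length_scale
    return (1.0 + a) * math.exp(-a)
```

Evaluated at separations of one and two length scales, this gives roughly 0.483 and 0.140, matching the 0.48 and 0.14 quoted above.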

Figure 6.5. Five sample paths of the untrained Gaussian process priors for the median parameter ( ) in log-log

space (left) and price-demand space (right). The parameters for the initial distribution were fitted from the same

training data as for the Markov process prior in Figure 6.4.

Lastly, Gaussian process regression was repeated with the parameters for the

mean, covariance and likelihood functions trained explicitly to maximize the model

likelihood given the training set. A conjugate gradient descent method was used. The

parameters for the untrained Gaussian processes were used as the initial condition and

the optimization was terminated after 100 function evaluations. Figure 6.6 shows


sample paths of the trained Gaussian process prior using the same training set as in

Figure 6.4 and Figure 6.5. As illustrated here, training often resulted in overfitting. In

particular, training using a length scale of 1/5 of the price interval as the initial condition often

resulted in a very low length-scale parameter, and training using a length scale of 1/2 of the price interval often resulted

in a very low height of the covariance function. These calculations were performed

using the open source software GPML for Matlab.
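The quantity maximized in that training step is the Gaussian process log marginal likelihood; a minimal sketch of it (not the GPML code):

```python
import numpy as np

def gp_log_marginal_likelihood(y, mean, K, sigma2):
    """log p(y) = -1/2 r^T C^-1 r - 1/2 log|C| - n/2 log(2 pi), with
    residual r = y - mean and C = K + sigma2 * I.  Hyperparameter
    training searches for the mean, covariance, and noise parameters
    that maximize this quantity over the training set."""
    n = len(y)
    r = y - mean
    C = K + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)          # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return float(-0.5 * r @ alpha
                 - np.sum(np.log(np.diag(L)))
                 - 0.5 * n * np.log(2.0 * np.pi))
```

With only 40 training weeks, the maximizer of this quantity can sit at extreme hyperparameter values, which is the overfitting visible in Figure 6.6.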

Figure 6.6. Five sample paths of the trained Gaussian process prior for the median parameter ( ) in log-log

space (left) and price-demand space (right) using the same training data as for the Markov prior and untrained

Gaussian priors shown above.


6.2.3 Experimental Methodology

For each item-store combination, “reverse” 10-fold validation was performed,

meaning that each of the ten folds was used individually as a training set (40 weeks of

data) with the remaining nine folds (360 weeks of data) used as the validation set.

This approach was chosen for two reasons. First, prior information (e.g., enforcing

monotonicity a priori) becomes less relevant as the training set size increases and

therefore the differences between the two choices of prior are less discernible.

Secondly, since a pricing decision can only be validated at prices which are in the

validation set, a large validation set is advantageous as under each model one can

select from a larger range of price points. The 10-fold validation was repeated ten times for

each item-store combination to minimize any variation due to the selection of folds.

Thus, a total of 9,300 validation tests were run, ten 10-fold validations for each of the

93 item-store combinations. In implementing the particle filtering algorithms

(Algorithms 4.1-4.2) in this validation testing, 10^4 paths were used and the state space

was discretized into a (log-demand by log-price by elasticity) grid.

In producing the following figures, 10^5 paths were used and the state space was

discretized into a grid.
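The splitting scheme can be sketched as follows; the interleaved fold assignment and function name are illustrative assumptions (the thesis presumably assigns folds at random):

```python
def reverse_kfold(n_weeks, k=10):
    """'Reverse' k-fold validation: each fold serves alone as the
    training set while the other k - 1 folds form the validation set."""
    folds = [list(range(i, n_weeks, k)) for i in range(k)]
    for i, train in enumerate(folds):
        valid = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, valid
```

For 400 weeks of data this yields ten splits, each with 40 training weeks and 360 validation weeks, matching the counts above.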

In each case the performance metric, realized revenue, was calculated as

follows. Only prices which appeared in the validation set at least three times were

considered admissible pricing alternatives in order to diminish any effect of outliers.

The expected revenue generated at each pricing alternative was then calculated

assuming lognormal demand and the prediction for ( ) under each model. For each

model, the admissible pricing alternative which maximized the expected revenue was


then chosen as the pricing decision. Realized revenue under each model was then

taken as revenue generated under that model’s pricing decision, averaged over all

instances of that price in the validation set.
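The selection and scoring steps above can be sketched as follows (hypothetical function names; the lognormal mean formula exp(mu + sigma²/2) is the standard one):

```python
import math
from collections import Counter

def pricing_decision(mu_hat, sigma2, validation_prices, min_count=3):
    """Among admissible prices (those appearing at least min_count
    times in the validation set), pick the one maximizing expected
    revenue p * E[demand], with E[demand] = exp(mu(p) + sigma^2 / 2)."""
    counts = Counter(validation_prices)
    admissible = [p for p, c in counts.items() if c >= min_count]
    return max(admissible,
               key=lambda p: p * math.exp(mu_hat(p) + sigma2 / 2.0))

def realized_revenue(price, validation_prices, validation_demands):
    """Average revenue over validation weeks sold at the chosen price."""
    revenues = [price * d for p, d in
                zip(validation_prices, validation_demands) if p == price]
    return sum(revenues) / len(revenues)
```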

6.2.4 Results and Analysis

The performance of each approach varies significantly with the training data. When

the monotonicity of demand is well represented in the training data, the three

approaches produce approximately the same updated predictions and consequently the

same optimal pricing decisions. However, due to the inherent noise in the data, in a

number of sets the data did not show a clear monotonic trend. In these cases, the

updated prediction under each model can vary significantly and the mean function of

the Gaussian models may not even be monotonic. Again due to the inherent noise in

demand, this sometimes results in higher realized revenue for the Markov model and

sometimes lower realized revenue. Examples of each of these cases are shown in

Figure 6.7. In many cases, training the Gaussian models resulted in clear overfitting

(Figure 6.8), possibly due to the small training set size. Regardless, the trained models

performed fairly well in terms of realized revenue.


Figure 6.7. Comparison of the updated distributions and realized revenue under the Markov and untrained

Gaussian models. For each model, the median, 5%, 25%, 75% and 95% quantiles of the updated distribution for

the median parameter ( ) are shown. The ‘+’ indicate the location of training data and the ‘X’ indicates the

optimal pricing decision and realized demand. Top: Data is well representative of a monotonic trend so all three

models produce roughly the same results. Middle: Markov model outperforms. Bottom: Gaussian models

outperform.


Figure 6.8. Comparison of the updated distributions and realized revenue under the untrained and trained Gaussian

models. The median, 5%, 25%, 75% and 95% quantiles of the updated distribution for the median parameter ( ) are shown. The ‘+’ indicate the location of training data and the ‘X’ indicates the optimal pricing decision and

realized demand. Training the parameters often results in overfitting.

Results from the validation testing are summarized in Table 6.1. Of the test runs,

75.67% and 89.03% of the tests had the same pricing decision under the Gaussian

process with length scales of 1/5 and 1/2 of the price interval, respectively, as under the

Markov model. Of the test runs where the pricing decisions differed between models,

the Gaussian process with length scales of 1/5 and 1/2 of the price interval resulted in a

decrease in average revenue of 21.94% and 10.44%, respectively, when compared to

the Markov model. Over all tests, the Gaussian process with length scales of 1/5 and 1/2


of the price interval resulted in a decrease in average revenue of 5.30% and 1.14%,

respectively, when compared to the Markov model.

For this data set, the Gaussian process with a large length-scale parameter

performed very well. The benefit of a large length-scale parameter in this case is that

a large number of paths may be (approximately) monotonic. However, such a process

would not model a curve with rapidly changing behavior particularly well. For

example, if the price elasticity of demand changed abruptly with price, a smaller

length-scale parameter would likely yield a better prediction. As illustrated here, it is

also not necessarily easy to find an optimal length scale parameter through likelihood

maximization due to the possibility of overfitting. While under a Gaussian model one

faces the tradeoff between modeling monotonicity and rapidly changing behavior,

rapidly changing behavior can be captured in this Markov model, in addition to

monotonicity, by increasing .

From this analysis, one can see that there can be real world benefits to using a

prior which assumes a priori that a curve of interest is monotonic. If the observed

data happens to reveal the monotonic behavior of the underlying relationship, which is

likely to occur when data is abundant, the prior assumptions regarding the relationship

may be less important. However, when this is not the case, a prediction under a

monotonicity assumption may differ significantly from a prediction without this

assumption. Not enforcing monotonicity a priori may therefore negatively impact

one’s bottom line, in this case in terms of realized revenue.


Table 6.1. Summary statistics from the validation testing.

                                     Markov    Gaussian;       Trained    Gaussian;       Trained
                                               1/5 interval               1/2 interval
  Realized revenue ($/item/store)    $374.07   $354.23         $359.06    $369.81         $367.23
  % change vs. Markov                -         -5.30%          -4.01%     -1.14%          -1.83%
  % different pricing alternative    -         24.33%          20.41%     11.97%          14.00%
  % change given different
    pricing alternative              -         -21.94%         -23.52%    -10.44%         -14.60%

6.3 Stability Limit Prediction in Milling Operations

In this section, the primary motivation is to illustrate the modeling flexibility inherent

to Markov process regression. To demonstrate this, a Markov process is used to

model a particularly poorly behaved curve, a stability limit in machining, which is

piecewise concave and has a number of cusps (Figure 6.9). While stability limit

prediction is a real world problem, the goal here is more to illustrate what is possible

using a Markov model and not necessarily to provide a fully genuine solution to

stability limit prediction. For a more applied treatment which is validated from real

life stability testing, see (Traverso et al., 2010).


Figure 6.9. Example of a machining stability limit. If a cut is made with parameters in the unstable region,

vibrations can grow uncontrollably and produce a poor cut or damage the tool.

6.3.1 Introduction to Stability in Machining

While some vibration naturally occurs when machining a piece of metal, for certain

choices of cutting parameters these vibrations can grow uncontrollably. This is

referred to as instability and can result in both a poor cut (in terms of surface location

error) and damage to the tool which can be quite costly. For a given tool and work

piece combination, two important parameters in determining whether instability will

result are the (axial) depth of the cut and the spindle speed (the speed at which the tool

is spinning). As seen in Figure 6.9, the precise relationship between these parameters

and stability is quite complicated.

While a somewhat accurate prediction for the stability limit can be made using a

deterministic algorithm such as in (Altintas and Budak, 1995), this requires knowledge

of the frequency response function for the particular tool-holder-spindle-machine

combination. The response function can be measured via impact testing, but this can

be costly, time consuming, and may need to be repeated if the tooling setup is changed


even slightly. Consequently, on many production floors these measurements are not

made (Schmitz and Smith, 2009). An alternative approach, which is used here, is to

infer the stability limit by performing a number of test cuts and seeing whether the

result is stable or unstable.

6.3.2 Modeling the Stability Limit

A lot can be said a priori about the stability limit – it is piecewise concave with

concavity that tends to decrease with spindle speed and has cusps which generally

occur at higher axial depths as spindle speed increases (see Figure 6.9). Here the

stability limit is modeled based on these qualitative features without regard to the

dynamical theory governing stability. It is quite possible that the following model

could be improved if stability theory was considered. For example, due to the nature

of resonance, cusps are expected to be harmonically spaced. Consequently,

constructing a Markov process with respect to inverse spindle speed may lead to a

more straightforward specification of parameters in terms of the underlying theory.

There are a number of reasons this application is well suited to a Bayesian

nonparametric approach. First, since the relationship is quite complicated, it is difficult to

think of a suitable choice for a parametric model. Secondly, a classical regression

method would have difficulty in updating from the indicator likelihoods. Lastly, since

a Bayesian method is used, the parameters at which to perform test cuts can be

selected to maximize the value of the information gained through experimentation.

In contrast to a Markov process prior, a Gaussian process prior would not be able

to model the piecewise concavity or the cusps and the assumption of linear filtering


limits the qualitative knowledge which can be incorporated into the prior. In addition,

as the likelihood functions are indicator functions (one learns that the stability limit is

above or below the given test point), a Gaussian process approach would be difficult

to implement and would likely require some approximation method.

In order to enforce both piecewise concavity and the cusp behavior, a third order

jump diffusion process is required (see Sections 5.2-5.3). As per the notation of

Chapter 5, let denote spindle speed and let ( ), ( ), and ( ) denote the

stability limit, its derivative, and its second derivative6, respectively.

The second derivative of the sample paths, ( ), is assumed to follow a geometric

Brownian motion with negative drift. This implies that ( ) is (piecewise) concave

everywhere and that, on average, concavity decreases with increasing spindle speed. To account for

the cusps, the first derivative, ( ), is modeled as having a jump component. The

probability of a jump occurring in a given interval was assumed to follow a power law

in both and , where the exponent in is negative and the exponent in is

positive. This implies that on average, cusps occur at higher axial depths as spindle

speed increases and that cusps have some degree of sharpness (they are more likely to

occur when the first derivative is large). In addition, it is assumed that cusps only

occur when the first derivative, ( ), is positive. To reduce the number of

parameters, when a cusp occurs the first derivative, ( ), simply changes sign and

the second derivative, ( ), remains the same. The parameter fields for this

model (see Chapter 5) can thus be expressed as Equations 6.7a-d.

6 Due to the jump component in ( ), the second derivative of ( ) is not well defined at the cusps, but is indeed

equal to ( ) at all other points.


( ) [ ] (6.7a)

( ) [ ] (6.7b)

( ) { (6.7c)

( ) ( ( )) ( ) ( ) (6.7d)

Thus, this model has five open parameters. The parameter controls the average

rate at which concavity decays and can be thought of as how much the true change

in concavity is expected to deviate from . As in Chapter 5, 1/ can thus be thought

of as one’s confidence in . The parameters , , and control the behavior at the

cusps. The parameter controls the sharpness of the cusps, which indirectly controls

how high (in terms of axial depth) the cusps occur. The parameter controls how

much the cusps grow (in terms of axial depth) as spindle speed increases. A good

choice for depends somewhat on and , but for a fixed and a decrease in

results in both a larger mean and a larger variance (less confidence) in the axial depth

of the cusps. Unfortunately, as ( ) has the interpretation of a hazard rate, it is

difficult to express the distribution of the axial depth of cusps directly in terms of ,

, and . Consequently, there is not a well-defined confidence parameter – all three

parameters can affect both the expected axial depth of a cusp and the variation in this.
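As an illustrative sketch of this prior process (all names and constants here are placeholders, since the thesis's symbols and fitted values are not reproduced above):

```python
import math
import random

def sample_stability_limit(s_grid, limit, slope, curv, drift=-0.5,
                           vol=0.3, c=200.0, a=1.0, b=2.0, seed=0):
    """Euler sketch of the third-order jump diffusion: curv > 0 follows
    a geometric Brownian motion with negative drift and -curv acts as
    the second derivative, so the limit is piecewise concave.  While
    the slope is positive, a cusp (sign flip of the slope) occurs with
    hazard rate c * s**(-a) * slope**b: the negative exponent in
    spindle speed s pushes cusps to higher axial depths as s grows,
    and the positive exponent in the slope makes the cusps sharp."""
    rng = random.Random(seed)
    path = [limit]
    for i in range(1, len(s_grid)):
        s = s_grid[i - 1]
        dt = s_grid[i] - s
        limit += slope * dt
        slope -= curv * dt                  # second derivative is -curv
        curv *= math.exp((drift - 0.5 * vol ** 2) * dt
                         + vol * rng.gauss(0.0, math.sqrt(dt)))
        if slope > 0 and rng.random() < c * s ** (-a) * slope ** b * dt:
            slope = -slope                  # cusp: slope changes sign
        path.append(limit)
    return path
```

Between cusps each path is concave, and the sign flip of the slope reproduces the sharp peaks visible in Figure 6.9.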


6.3.3 Design of Experiments

The prior distribution was updated using two different approaches to design of

experiments. In the first, parameters at which to perform test cuts were chosen to

maximize the value of information for a particular decision situation. This requires

the specification of both a cost of machining and a reference stability limit, which is

used in lieu of real world stability testing, to determine whether a given test cut is

stable or unstable. This is the approach to experimental design one should use in a

real world decision situation. The second approach is more representative of a

statistical experimental design. Although not particularly important to a real world

decision situation, this is included as it gives a better idea of how well the model globally

predicts the curve.

For the value based design of experiments, the decision situation was chosen as

similar to that in (Traverso et al., 2010). In particular, one is interested in milling a

pocket feature of dimensions 150 mm by 100 mm by 25 mm into a piece of 6061-T6

aluminum, a material common in aircraft and watercraft construction, using a TiCN-coated

carbide tool. Additional parameters regarding the tooling setup are given in

Table 6.2 and the tool path is shown in Figure 6.10.

The cost of machining was assumed to be $2 per minute of machining (Figure

6.11). The specifics of the machining time calculation are omitted for brevity as they

can be found in an introductory textbook on machining. In short, the speed of

movement along the tool path is constant for a given set of parameters. The step

behavior in the cost function results from the number of passes which must be made to

complete the pocket. For example, at an axial depth of 5 mm, five passes are required


to machine the pocket but at a depth of 4.9 mm, six passes are required. For

simplicity, it was assumed that an unstable cut results in an infinite cost, effectively

forcing one to only machine the feature at a stable test cut. The reference stability

limit, used to determine whether the test cuts were stable or not, was generated using

Altintas and Budak’s algorithm and the parameters of the tooling setup given in Table

6.2. The resultant stability limit is shown in Figure 6.12.
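The step behavior of the cost can be sketched as follows; pass_length_mm and the linear feed model are illustrative placeholders, since the thesis omits the exact machining-time formula:

```python
import math

def machining_cost(spindle_speed_rpm, axial_depth_mm,
                   pocket_depth_mm=25.0, pass_length_mm=1500.0,
                   n_teeth=4, feed_per_tooth_mm=0.15,
                   dollars_per_min=2.0):
    """Step-shaped cost of milling the pocket: the number of passes is
    the ceiling of pocket depth over axial depth (e.g. 5 passes at a
    5 mm depth but 6 passes at 4.9 mm), and the feed speed in mm/min
    scales linearly with spindle speed via teeth and feed per tooth."""
    passes = math.ceil(pocket_depth_mm / axial_depth_mm)
    feed_mm_per_min = n_teeth * feed_per_tooth_mm * spindle_speed_rpm
    minutes = passes * pass_length_mm / feed_mm_per_min
    return dollars_per_min * minutes
```

The ceiling in the pass count is what produces the steps in Figure 6.11, while the feed term produces the smooth decrease with spindle speed.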

Table 6.2. Input parameters used for the cost calculations and generation of the true stability limit.

  Parameter                      Value   Units
  Tool diameter                  6.35    mm
  Radial depth of cut            6.35    mm
  Number of teeth                4       N/A
  Feed per tooth                 0.15    mm
  Helix angle                    30      deg
  Tangential coefficient         613     N/mm^2
  Normal coefficient             149     N/mm^2
  Tangential edge coefficient    7.0     N/mm
  Normal edge coefficient        6.0     N/mm
  Stiffness                      20      MN/m
  Viscous damping coefficient    0.05    N/A
  Natural frequency              2667    Hz


Figure 6.10. Tool path for milling the pocket feature. This path is repeated in each pass until the desired depth of

25 mm is reached.

Figure 6.11. Cost of machining as a function of spindle speed and axial depth. Increasing spindle speed results in

faster movement along the tool path, while increased axial depth results in a shorter tool path as fewer passes may be

required to mill the pocket. This results in the step behavior of the cost function.


Figure 6.12. Reference stability limit generated using the algorithm in (Altintas and Budak, 1995) and using the

parameters given in Table 6.2. In lieu of real world stability testing, stability of test cuts was determined by

comparing the test parameters to the reference stability limit.

For the value-based design of experiments, a greedy heuristic was used to make the value of information calculation for a sequence of experiments computationally tractable. In other words, each test point was selected without regard to what might be learned in future testing, as if it were the last test prior to milling the pocket. With the assumption of an infinite cost of instability, one always chooses to machine the pocket at a point which has previously been tested and is known to be stable. Under these two assumptions, when selecting the next test point, the cost of machining equals the cost at that point if the result is stable, and equals the cost of machining prior to performing the test if the result is unstable. The probability that a given test cut will be stable (given the current information state) is the complementary CDF of ( ), and thus the expected cost of machining after performing a test at any spindle speed-axial depth combination can be easily calculated. Nine tests were performed, each chosen to minimize this expected cost of machining. For tests prior to obtaining the first stable result, the cost of machining given an unstable cut was taken to be the maximum cost over the domain of parameters being considered.
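A minimal sketch of this greedy selection rule follows; the candidate list and numbers below are hypothetical, and in practice each stability probability would come from the complementary CDF of the updated stability-limit prediction:

```python
def select_next_test(candidates, cost, p_stable, current_cost):
    """Greedy value-of-information choice of the next test cut.

    candidates:   list of (spindle_speed, axial_depth) pairs
    cost[i]:      machining cost if candidate i proves stable and is used
    p_stable[i]:  probability candidate i is stable under the current
                  information state (complementary CDF of the prediction)
    current_cost: cost at the best point already known to be stable
    """
    def expected_cost(i):
        # Stable -> machine at the tested point; unstable -> fall back
        # to the best previously verified point.
        return p_stable[i] * cost[i] + (1.0 - p_stable[i]) * current_cost

    best = min(range(len(candidates)), key=expected_cost)
    return candidates[best], expected_cost(best)

# A deep, risky cut can beat a shallow, safe one if its payoff is large.
point, ec = select_next_test(
    candidates=[(9000, 4.0), (12000, 6.0)],
    cost=[10.0, 5.0], p_stable=[0.9, 0.5], current_cost=20.0)
```

Note that a candidate whose cost exceeds the current best can never lower the expected cost, so the rule naturally favors points that would actually improve the machining plan.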

For the statistical design of experiments, a bisection methodology was used. The spindle speed of the first test cut was selected to bisect the spindle speed domain, and the axial depth of this cut was taken as the median of the prediction of the stability limit, ( ), at this spindle speed. Tests were then performed at the spindle speeds which bisected each of the two resulting regions, with the axial depth of each test cut again taken as the median of the stability limit prediction at that spindle speed. Tests were performed sequentially, in that when selecting the axial depth of the third point, the distribution updated from the second point was used. The order in which these bisections were performed was randomly selected. This bisection process was performed a total of four times, for a total of 15 test points. For both the value-based and statistical designs of experiments, the MATLAB Tensor Toolbox was used for its support of sparse three-dimensional arrays (Bader and Kolda, 2012). In implementing the particle filtering algorithms (Algorithms 4.1-4.2), 10^6 paths were used and the state space was discretized into a ( ) grid.
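The bisection schedule over spindle speeds can be sketched as follows. The 5,000-20,000 RPM range here is a placeholder, and the axial depth at each speed (the median of the current stability-limit prediction) is omitted since it requires the full model:

```python
def bisection_layers(lo, hi, n_layers):
    """Spindle speeds visited by successive bisection of [lo, hi].

    Layer 1 is the midpoint of the whole domain; layer k bisects each of
    the 2**(k-1) subintervals created so far, so n_layers layers visit
    2**n_layers - 1 speeds in total.
    """
    layers, intervals = [], [(lo, hi)]
    for _ in range(n_layers):
        layers.append([(a + b) / 2.0 for a, b in intervals])
        intervals = [piece for a, b in intervals
                     for piece in ((a, (a + b) / 2.0), ((a + b) / 2.0, b))]
    return layers

layers = bisection_layers(5000.0, 20000.0, 4)
# Four layers of bisection give 1 + 2 + 4 + 8 = 15 test speeds.
```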


6.3.4 Results and Analysis

All parameters used in the following testing are given in Table 6.3. Figure 6.13 shows sample paths for this choice of parameters, as well as the effect that varying the parameters can have on the sample paths. The parameters , , , , and were chosen by simply examining sample paths from a number of parameter choices and selecting one whose sample paths seemed representative of stability limit behavior. A relatively uninformative choice was made for the prior distribution at the lowest spindle speed (denoted ); in addition, this choice has very little effect on the prediction once a few tests have been performed. ( ), ( ), and ( ) were assumed mutually independent.

The sequence of updated predictions, expressed in terms of the probability that a given set of parameters will produce a stable cut, is given for the value-based and statistical designs of experiments in Figures 6.15 and 6.16, respectively. This is the marginal complementary CDF for ( ). The sequence of updated predictions under a Brownian motion prior is also shown for reference. This prior does not model the piecewise convexity, the cusp behavior, or the dependence of behavior on spindle speed. The variance parameter of the Brownian motion was taken as , again by inspection of sample paths. To suppress dependence on the initial distribution, it was assumed a priori under this model that the stability limit was indeed within the axial depths being considered (1 mm to 10 mm). Sample paths of this process are shown in Figure 6.14.
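Sampling such a Brownian motion prior is straightforward; a minimal sketch follows, where the per-step variance and the 5 mm starting depth are placeholders rather than the values used in the study:

```python
import numpy as np

def brownian_paths(n_paths, n_steps, step_var, y0, seed=0):
    """Sample paths of a Brownian motion prior for the stability limit.

    step_var is the variance accumulated per discretization step of the
    spindle speed domain; y0 is the axial depth at the lowest speed.
    Returns an (n_paths, n_steps + 1) array of axial depths.
    """
    rng = np.random.default_rng(seed)
    increments = rng.normal(0.0, np.sqrt(step_var), size=(n_paths, n_steps))
    paths = np.cumsum(increments, axis=1) + y0
    return np.hstack([np.full((n_paths, 1), y0), paths])

paths = brownian_paths(n_paths=5, n_steps=200, step_var=0.01, y0=5.0)
```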


Table 6.3. Input parameters for the Markov model. All parameters are defined with respect to spindle speed ( ) in units of RPM.

[Parameter symbols, values, and distributions in this table were lost in text extraction and are not recoverable.]

Figure 6.13. Five prior paths generated under various choices of parameters. Top left: baseline parameters shown in Table 6.3. Top right: taken at of baseline, which results in larger/sharper cusps. Bottom left: magnitude of reduced, which results in less cusp growth as spindle speed increases. Bottom right: magnitude of increased, which results in less variation in the sharpness of cusps.

Figure 6.14. Five sample paths of the Brownian motion used for comparison to the Markov prior defined above. This prior does not model piecewise convexity, cusp behavior, or any dependence on spindle speed.

Figure 6.15. Updated predictions for the stability limit after 3, 6, and 9 tests under the value-based design of experiments. Left: jump model described above. Right: Brownian motion prior. The 'X' marks indicate parameters at which testing has been performed.

Figure 6.16. Updated predictions for the stability limit after 2, 3, and 4 layers of bisection under the statistical design of experiments (3, 7, and 15 experiments, respectively). Left: jump model described above. Right: Brownian motion prior. The 'X' marks indicate parameters at which testing has been performed.

From the value-based design of experiments, one can see that either model performs fairly well. Both find the location of the last cusp, which has the lowest cost, within a few experiments. While there is in fact a small stable region which requires only four passes (where the axial depth is greater than 6.25 mm), neither model finds it within the nine experiments because this region is so small. Comparing the two models, one sees that the jump model finds the location of the last cusp slightly more quickly than the Brownian model. In addition, the prediction for the last cusp is markedly better for the jump model. The Brownian model underestimates the axial depth of the final cusp and overestimates the axial depth of the stability limit as one moves away from this cusp.

From the statistical design of experiments, one can see that although the jump model does a fairly good job of predicting the stability limit at higher spindle speeds, it underestimates both the size and the number of cusps at lower spindle speeds. This implies that the magnitude of was likely chosen too small and was likely chosen too large. In addition, this model appears to make overconfident predictions. Thus, it would be useful to add uncertainty to the model by increasing , reducing (and adjusting accordingly), or changing the form of the jump distribution to allow uncertainty in ( ) and ( ) after a jump.

In summary, the application presented here illustrates the broad flexibility that is possible with Markov priors and not possible with existing regression methods. First, as with any curve that exhibits complex behavior, stability limit prediction would be a difficult problem to approach using parametric techniques. Second, both value-based and statistical designs of experiments require a Bayesian methodology. In addition, a classical regression may run into problems due to the nature of what is learned from an experiment. Lastly, a Gaussian process prior would not be able to model the piecewise convexity, the cusps, or the spindle speed-dependent behavior.

Although significantly more complex behavior can be modeled using Markov process priors than with other methods, this complexity invariably leads to more model parameters to define. As can be seen in this application, these parameters may not always have a straightforward interpretation and may therefore be difficult to choose well. Consequently, care should be taken to define parameters as intuitively as possible, so that they can be assessed by individuals intimately familiar with the behavior of the curve being modeled or can be estimated from available data.


Chapter 7

Conclusions

In this dissertation a broad new family of regression models was introduced, specifically those that utilize priors defined as Markov processes. Although the updated predictions under these models cannot in general be calculated analytically, they can be calculated efficiently via simulation. The mathematics for doing so generalizes a well-known algorithm for discrete hidden Markov models.
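The discrete analogue being generalized is the forward (filtering) recursion for hidden Markov models (Rabiner and Juang, 1986); a compact sketch, with a hypothetical two-state example:

```python
import numpy as np

def forward_filter(transition, emission_liks, prior):
    """Forward algorithm for a discrete hidden Markov model.

    transition[i, j]:    P(next state j | current state i)
    emission_liks[t, i]: likelihood of the observation at time t in state i
    prior[i]:            distribution over the initial state
    Returns the filtered state distributions, one row per time step.
    """
    alpha = prior * emission_liks[0]
    alpha = alpha / alpha.sum()
    filtered = [alpha]
    for lik in emission_liks[1:]:
        alpha = (alpha @ transition) * lik  # propagate, then reweight
        alpha = alpha / alpha.sum()         # normalize to a distribution
        filtered.append(alpha)
    return np.vstack(filtered)

T = np.array([[0.9, 0.1], [0.1, 0.9]])
liks = np.array([[1.0, 0.0], [0.5, 0.5]])  # first observation pins state 0
out = forward_filter(T, liks, np.array([0.5, 0.5]))
```

In the continuous time/state setting, the matrix product is replaced by propagation of the state density under the Markov process, which is what the simulation-based algorithms of this dissertation carry out.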

While Markov process regression will not be the best choice for every problem, it has properties which are advantageous over existing approaches in a variety of circumstances. For one, it does not assume the underlying relationship is of a known parametric form. This may be advantageous if the precise relationship between variables is complex or not yet understood. In addition, these models allow one to reason probabilistically about the predicted relationship, which is highly beneficial in many decision situations, such as when choosing information gathering activities. When compared to other Bayesian nonparametric models, and in particular Gaussian process regression, Markov process regression provides greater modeling flexibility, such as the ability to enforce monotonicity, convexity, etc., as well as to incorporate inhomogeneous prior information. Lastly, unlike many algorithms for Gaussian process regression, the performance of this algorithm does not depend on what is learned from observations, as encoded through the likelihood functions.

While this work provides some baseline results on using Markov processes for regression analysis, there are a number of directions for future work. One primary limitation of the algorithm presented here is that it is only valid when considering a single independent variable, and it is not trivial to extend it to the multidimensional case, as the resultant Bayesian network is no longer a tree. A similar algorithm for calculating updated distributions for general Bayesian networks, loopy belief propagation, has been developed for the discrete time/state case and could possibly be generalized to the continuous time/state case using results from this work. While loopy belief propagation is not guaranteed to converge to the posterior distribution, the technique sometimes works in the discrete case and may find applications in the continuous case as well.

Another area where future work may provide value is parameter estimation. Throughout this work, open parameters in the prior have generally been kept to a minimum and chosen either by inspection or by comparing the likelihoods for a few parameter choices. In addition, likelihood functions have been defined implicitly through model assumptions, and parameters of the likelihoods have been estimated in an ad hoc fashion. For a discrete hidden Markov model, maximum likelihood estimates for both the prior and likelihood function parameters can be obtained simultaneously from an observed data set using the Baum-Welch algorithm, which is an expectation-maximization algorithm. It may be possible to extend this algorithm to Markov processes using Girsanov's theorem, which gives the relative likelihood that an arbitrary path is generated under one Markov process versus another. Parameter estimation techniques could greatly improve Markov process regression, as they would remove guesswork from the analyst and allow for more complicated Markov models with a larger number of open parameters.
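For concreteness, in the diffusion case with a common diffusion coefficient, the relative likelihood referred to above has an explicit form. For two diffusions $dX_t = \mu_k(X_t)\,dt + \sigma(X_t)\,dW_t$, $k = 1, 2$, Girsanov's theorem gives, under standard regularity conditions, the likelihood ratio of an observed path $X$ on $[0, T]$ as (see, e.g., Steele, 2001):

$$\frac{dP_1}{dP_2}(X) = \exp\!\left(\int_0^T \frac{\mu_1(X_t) - \mu_2(X_t)}{\sigma^2(X_t)}\,dX_t \;-\; \frac{1}{2}\int_0^T \frac{\mu_1(X_t)^2 - \mu_2(X_t)^2}{\sigma^2(X_t)}\,dt\right)$$

This ratio would supply the path-likelihood ingredient that a Baum-Welch-style extension to Markov processes would need.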

In summary, this dissertation has extended Bayesian nonparametric regression by introducing a second broad family of prior distributions which can be used in addition to Gaussian processes. While there are a number of areas in which these methods could be improved, this dissertation will hopefully lay the groundwork for future exploration into regression using Markov process priors.


Bibliography

Y. Altintas and E. Budak. Analytical Prediction of Stability Lobes in Milling. Annals of the CIRP, 44(1):357-362, 1995.

Brett W. Bader, Tamara G. Kolda, and others. MATLAB Tensor Toolbox Version 2.5, January 2012. URL: http://www.sandia.gov/~tgkolda/TensorToolbox/.

Andreas Brezger and Winfried J. Steiner. Monotonic Regression Based on Bayesian P-Splines: An Application to Estimating Price Response Functions from Store-Level Scanner Data. Journal of Business and Economic Statistics, 26(1):90-104, 2008.

Y. Chen, Y. Li, J. D. Kalbfleisch, Y. Zhou, A. Leichtman, and P. X. K. Song. Graph-Based Optimization Algorithm and Software on Kidney Exchanges. IEEE Transactions on Biomedical Engineering, 59(7):1985-1991, 2012.

D. R. Cox. Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society, Series B, 34:187-220, 1972.

P. J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-Based Geostatistics (with Discussion). Applied Statistics, 47:299-350, 1998.

John P. Dickerson, Ariel D. Procaccia, and Tuomas Sandholm. Optimizing Kidney Exchange with Transplant Chains: Theory and Reality. In International Conference on Autonomous Agents and Multi-Agent Systems, 2012.

Avner Friedman. Stochastic Differential Equations and Applications, Volume 1. Academic Press, 1975. ISBN 0122682017.

Hope College. Ford Escort. [Data file]. Retrieved from http://www.math.hope.edu/swanson/statlabs/data.html, 1999.

James M. Kilts Center, University of Chicago Booth School of Business. Dominick's Database. [Data file and codebook]. Retrieved from http://research.chicagobooth.edu/kilts/marketing-databases/dominicks, 2013.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge, MA, USA, 2009. ISBN 0262013193.

F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor Graphs and the Sum-Product Algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.

W. Liu, E. Treat, J. Veale, J. Milner, and M. L. Melcher. Match Offer Failures and Information Sharing in Kidney Paired Donations. Working paper, 2014.

M. L. Melcher, D. B. Leeser, H. A. Gritsch, J. Milner, S. Kapur, S. Busque, …, J. L. Veale. Chain Transplantation: Initial Experience of a Large Multicenter Program. American Journal of Transplantation, 12(9):2429-2436, 2012.

Thilo Meyer-Brandis. Stochastic Feynman–Kac Equations Associated to Lévy–Itô Diffusions. Stochastic Analysis and Applications, 25(5):913-932, 2007.

National Kidney Registry. [Historical pooled exchange kidney transplant wait times for patients registered between January 1, 2010 and April 22, 2013]. Unpublished raw data, 2013.

L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, 3(1):4-16, 1986.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2006. ISBN 026218253X.

M. A. Rees, J. E. Kopke, R. P. Pelletier, D. L. Segev, M. E. Rutter, A. J. Fabrega, …, R. A. Montgomery. A Nonsimultaneous, Extended, Altruistic-Donor Chain. The New England Journal of Medicine, 360:1096-1101, 2009.

H. Rue, S. Martino, and N. Chopin. Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society, Series B, 71(2):319-392, 2009.

Tony L. Schmitz and K. Scott Smith. Machining Dynamics: Frequency Response to Improved Productivity. Springer, 2008. ISBN 0387096450.

Ross Shachter and C. Robert Kenley. Gaussian Influence Diagrams. Management Science, 35(5):527-550, 1989.

Thomas S. Shively, Thomas W. Sager, and Stephen G. Walker. A Bayesian Approach to Non-Parametric Monotone Function Estimation. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 71(1):159-175, 2009.

J. Michael Steele. Stochastic Calculus and Financial Applications. Volume 45 of Applications of Mathematics: Stochastic Modeling and Applied Probability. Springer, 2001. ISBN 0387950168.

M. Stone. On Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion. Journal of the Royal Statistical Society, Series B, 39:44-47, 1977.

R. L. Stratonovich. Conditional Markov Processes. Theory of Probability and Its Applications, 5(2):156-178, 1960.

Daniel W. Stroock. Partial Differential Equations for Probabilists. Cambridge Studies in Advanced Mathematics (Book 112). Cambridge University Press, 2008. ISBN 0521886511.

M. Traverso, R. Zapata, J. Karandikar, T. Schmitz, and A. Abbas. A Sequential Greedy Search Algorithm with Bayesian Updating for Testing in High-Speed Milling Operations. Proceedings of the ASME 2010 International Manufacturing Science and Engineering Conference, Paper No. MSEC2010-34048, Erie, Pennsylvania, October 12-15, 2010.

J. Veale and G. Hil. The National Kidney Registry: 175 Transplants in One Year. Clinical Transplantation, 255-278, 2011.