
The Econometrics of Duration Data and

of Point Processes

Michel Mouchart¹

Institut de statistique

Université catholique de Louvain (B)

8th September 2004

¹ Preliminary version. Not to be quoted. Comments welcome.


Contents

Preface
Notation
1 Introduction
  1.1 Basic Idea
  1.2 An Overview of Topics
    1.2.1 Some Examples in Economics
    1.2.2 Economic Motivations
    1.2.3 Specific Features of Duration Data
  1.3 Modelling Point Processes
    1.3.1 Introduction
    1.3.2 Discrete State Space Stochastic Processes
    1.3.3 Conditional Modelling
    1.3.4 Length-biased Sampling: Censoring and Truncation
    1.3.5 Extensions
    1.3.6 Modelling Strategy
    1.3.7 Two Levels of Modelling
2 Simple Point Processes
  2.1 Introduction
  2.2 Distribution and Survivor Functions
    2.2.1 General Definitions and Properties
    2.2.2 (Absolutely) Continuous Case
    2.2.3 Discrete Case
    2.2.4 Defective Distributions
    2.2.5 Remarks
  2.3 Truncated Distributions and Hazard Functions
    2.3.1 Motivations
    2.3.2 Integrated Hazard Functions
    2.3.3 Hazard Function (or: Age-specific Failure Rate)
    2.3.4 Truncated Survivor Function
    2.3.5 Truncated Expected Duration (Expected Residual Life)
  2.4 Temporal Dependence
    2.4.1 Basic Concepts
    2.4.2 Structural Instability of Temporal Independence
  2.5 Some Useful Distributions for Duration Data
    2.5.1 Exponential Distribution
    2.5.2 Gamma Distribution
    2.5.3 Weibull Distribution
    2.5.4 Gompertz-Makeham Distribution
    2.5.5 Log-Normal Distribution
    2.5.6 Log-Logistic Distribution
    2.5.7 Inverse Gaussian Distribution
    2.5.8 Piecewise Constant Hazard Rates
  2.6 Increasing Failure Rate
  2.7 Derived Distributions
    2.7.1 Basic Ideas
    2.7.2 Transformation of the Hazard Function: Proportional Hazard Function
    2.7.3 Transformation of the Time
  2.8 Conditional Models for Simple Point Processes
    2.8.1 Introduction
    2.8.2 Proportional Hazard Model
    2.8.3 Accelerated Time
    2.8.4 Comparing Proportional Hazard and Accelerated Time
    2.8.5 Other Conditional Models
3 Multivariate Durations
  3.1 Introduction
  3.2 Joint and Marginal Distributions
  3.3 Conditional Distributions in the Bivariate Case
    3.3.1 T1 given T2 ≥ t2
    3.3.2 T1 given T2 = t2
    3.3.3 Distribution of M = min{T1, T2} = T1 ∧ T2
    3.3.4 T1 given M ≥ m
    3.3.5 T1 given M = m
  3.4 Conditional Distributions for p ≥ 2
    3.4.1 Distribution of M = min{T1, T2, ..., Tp} = ∧(1≤i≤p) Ti
    3.4.2 Conditional Distributions of the Ti's
    3.4.3 Tj given M ≥ m
    3.4.4 Tj given M = m
  3.5 Multivariate Distributions of Durations
    3.5.1 Constructing Multivariate Distributions
    3.5.2 Examples of Multivariate Duration Distributions
4 One Transition and Several Exits
  4.1 Introduction
  4.2 Modelling Multiple Exits
  4.3 Competing Risks Models: an Introduction
  4.4 The Case of 2 Exits (p = 2)
  4.5 General Case (p ≥ 2)
  4.6 Identifiability of the Competing Risks Model
  4.7 Examples
5 Transition Models
  5.1 Introduction
  5.2 Basic Assumptions
  5.3 Counting Processes
    5.3.1 Motivation
    5.3.2 Definition of a Counting Process
    5.3.3 Modelling a Univariate Counting Process
  5.4 Representation of a Point Process through a Counting Process
  5.5 Markov Processes
    5.5.1 Basic Ideas
    5.5.2 Characterization of a Markovian Point Process
    5.5.3 Homogeneous Markov Processes
    5.5.4 Stationary HMPP
    5.5.5 Some Examples of Markovian Processes
  5.6 Semi-Markov Processes
  5.7 More General Processes: Some Hints
6 Problems of Partial Observability
  6.1 Introduction
    6.1.1 Incomplete Data in Duration Analysis
    6.1.2 "Incomplete" Data: a General Framework
  6.2 Censored Data
    6.2.1 Censored and Truncated Data: General Ideas
    6.2.2 Modelling Right-censored Data
    6.2.3 Interval-censored Data
  6.3 Aggregation and Heterogeneity
    6.3.1 Introduction
  6.4 Endogenous Selection of the Sample
    6.4.1 Introduction
7 Inference: Sampling Methods
  7.1 Parametric Models
    7.1.1 In Marginal Models
    7.1.2 In Conditional Models
  7.2 Non-parametric and Semi-parametric Models
    7.2.1 Introduction
    7.2.2 Non-parametric Estimation of the Survivor Function
    7.2.3 Semi-parametric Proportional Hazard Model (Cox Model)
8 Inference: Bayesian Methods
  8.1 Bayesian Inference: General Principles
  8.2 Nonparametric Duration Models without Censoring
    8.2.1 A Nonparametric Bayesian Model
    8.2.2 Some Properties of the Dirichlet Process
  8.3 Nonparametric Duration Models with Censored Observations
  8.4 Heterogeneity and Mixtures of Dirichlet Processes
    8.4.1 Introduction
    8.4.2 A Simple Model
    8.4.3 A Particular Case
  8.5 Semiparametric Model with Proportional Hazard
9 Tools
  9.1 Mathematical Analysis
    9.1.1 One-sided Limits and Continuity
    9.1.2 Directional Derivatives
    9.1.3 Integration and Differentiation
    9.1.4 Gamma Function and Associated Integrals
  9.2 Probability Theory
    9.2.1 Some Basic Theorems
    9.2.2 Stieltjes Integral
    9.2.3 Expectation of Non-negative Random Variables
    9.2.4 Conditional Expectation
  9.3 Stochastic Processes
    9.3.1 General Stochastic Processes
    9.3.2 Counting Processes
  9.4 Statistics: General Issues
    9.4.1 On Conditional Models and Exogeneity
    9.4.2 Statistical Inference
References


List of Figures

1.1 A typical trajectory in the general case
2.1 A typical trajectory of a "Death process"
2.2 Distribution Function and Survivor Function in the Continuous Case
2.3 Distribution Function and Survivor Function in the Discrete Case
2.4 Typical Hazard Function for Human Life
2.5 Typical Hazard Function for Unemployment
2.6 Typical Hazard Function and Survivor Function for Companies' Life
2.7 Hazard functions in the two-components example (h1 = .5, h2 = 1 and h1 = .5, h2 = 2)
2.8 Functions characterizing the Exponential Distribution (h = 0.5)
2.9 Functions characteristic of the Gamma Distribution (α = 1; β = 0.5, 1, 2)
2.10 Hazard functions of the Weibull Distribution (α = 1; β = .5, 1.5, 3)
2.11 Hazard functions of the Log-Normal Distribution
2.12 Functions characteristic of the Log-Logistic Distribution (α = 1; β = 0.5, 1.5, 3)
2.13 Q-Q plot for the Accelerated Time Model
4.1 Trajectory of a process with one transition and multiple exits
4.2 The case τ1 ≤ τ2 (A = 1)
4.3 The case τ1 > τ2 (A = 0)
5.1 A typical trajectory in the general case
5.2 A typical pathology excluded by the CADLAG hypothesis
5.3 A typical trajectory of a counting process
5.4 Typical trajectory of a "Death process", or of a simple counting process
6.1 Censored and uncensored data
7.1 Curved exponential family: the case of exponential censored data
9.1 Characteristic Functions of Intervals
9.2 E[X]: discrete case
9.3 Example 1
9.4 Example 2


List of Tables

2.1 Main analytical properties of some distributions on IR+
3.1 Summary of the properties of distributions conditional on equalities
7.1 Numerical behaviour of the Kaplan-Meier estimator


Preface

Duration data basically deal with measurements of spells, i.e. the lengths of the time intervals between two successive realizations of a well-defined event, often a change of state of an individual.

The main objective of this textbook is to present the basic tools for modelling duration data, rather than to survey the state of the art at a given date. This text is accordingly a first textbook on duration data. Particular emphasis is placed on the acquisition of the analytical tools that will help readers to study further material in this field, not only in the traditional sources of econometrics but also in biometrics, biostatistics, reliability theory, etc.; indeed, these fields, under the heading of "failure data", have produced many earlier original contributions and are still very active. This text thus tends to favour cross-fertilization between different fields of application of statistical methods.

The general approach of this textbook is to embed the modelling of duration data into the modelling of point processes, i.e. of stochastic processes in continuous time with finite state space.

The main readership is graduate students in econometrics or in statistics.

The basic prerequisite for this text is a basic course in econometrics and in statistics, along with a basic course in calculus. Complements are included, and clearly separated, at the end of the volume, with a double objective. Firstly, this text is intended to be reasonably self-contained for a heterogeneous audience. Experience shows that graduate programs in econometrics, or in economics, are attended by students with widely varying backgrounds; this may be viewed as a positive feature, provided due allowance is made for this heterogeneity. Such review material will thus be superfluous for some readers, or students, but unfamiliar to others; the separation hopefully spares the reader some unnecessary frustration. Secondly, a particular effort has been made to focus attention on what is genuinely specific to the field of duration data; for this purpose it seemed suitable to separate physically, in the text, the material pertaining to general tools of statistical methods or of mathematical analysis from the core subject matter of duration data. In so doing, the reader, and the teacher, may organize more efficiently the learning of the very topic of this textbook.

The specification of the main readership and of the prerequisites also determines the style of exposition. Thus, several analytical details are given even though they are likely to be superfluous for a mathematically trained reader: this text also plays a role of (soft) mathematical training for some readers, with the hope of not irritating others. For instance, in the first chapters particular attention is devoted to detailing and explaining some analytical aspects that sometimes frustrate the non-mathematically oriented economist; this concerns, in particular, the one-sided continuity of the trajectories of the processes and of the distribution and survivor functions. In later chapters, the reader is assumed to have assimilated these issues and fewer details are given.


Notation

IR : set of the real numbers (also: (−∞, +∞))

IN : set of the natural integers (= {0, 1, 2, ...})

ZZ : set of the integers (= {0, ±1, ±2, ...})

ai ↑ a : monotone increasing convergence to a from below (in IR), i.e. ai−1 < ai and ai → a

ai ↓ a : monotone decreasing convergence to a from above (in IR), i.e. ai−1 > ai and ai → a

f(a+) = lim(x↓a) f(x) : right-limit (f : IR → IR)

f(a−) = lim(x↑a) f(x) : left-limit (f : IR → IR)

1I : characteristic (indicator) function of a set

IP : probability (of an event)

IE : mathematical expectation (of a random variable or of a random vector)

⊥⊥ : stochastic independence

[a, b) = {x ∈ IR | a ≤ x < b} : interval closed on the left and open on the right

The end of a proof or of a remark is denoted by the symbol □.


Chapter 1

Introduction

1.1 Basic Idea

The modelling of many economic processes exhibits the following characteristics.

Firstly, the variable of interest (or endogenous variable) is discrete, or even qualitative; one then says that the state space is discrete.

Secondly, the state of the system may change at any time, making the time of change a continuous random variable. This feature is particularly relevant for modelling phenomena where there is no natural time unit for the period within which individuals make decisions.

Thus the basic data may be viewed as a realization of a stochastic process X(t), discrete in the state variable and continuous in time t, where X(t) represents the state of the individual (or of the system) at time t.

1.2 An Overview of Topics

1.2.1 Some Examples in Economics

Example 1. The labor market.
An individual may be in any one of a finite collection of "states", namely: employed, unemployed with unemployment benefit, unemployed without unemployment benefit, early retired, etc. Typically, a change of state may occur "at any time". Data may be available in the form of spells of unemployment for individuals included in a survey, or in the form of counts of individuals being in one of the above-mentioned states.


Example 2. The financial market.
A company may be in any one of the following states: solvent, bankrupt. Available data may come from a survey on the age of companies (including, or not, already bankrupt companies) or from time series on the number of bankruptcies.

Example 3. Marital status.
An individual may be in any one of the following states: single, married, divorced.

Example 4. Durable goods.
Modelling the length of life and the time of renewal of durable goods may involve two types of consideration: technological (physical breakdown) or economic (economic obsolescence).

Example 5. Economics of education.
For a given individual, one may consider the states: high school, undergraduate, graduate. Note that in this case the states are ordered. For an adult having completed his studies, one may consider the duration at each stage of his studies.

Other examples include: fidelity to a brand, duration of strikes, type of housing (including homeless people).

Remarks

(i) It should be clear that most of these examples are naturally oriented towards the analysis of micro-data.

(ii) Typically, the set of the time index, t, is the positive part of the real line, i.e. t ≥ 0.

(iii) In examples 1 and 3, there is a possibility of several changes over time: one then meets "multivariate durations". This is in contrast with example 2, where only one change of state is possible (one may then speak of a "death process").

(iv) In example 2 there are only 2 states, while in examples 1 and 3 there may be more than 2 states: one then speaks of "competing risks".

1.2.2 Economic Motivations

In example 1 (the labor market):
(i) Interesting questions may be: How much time do individuals spend in unemployment? How does this feature change with the business cycle? How does the duration vary across individuals?

(ii) Why are these questions interesting?
a) The welfare of an individual may depend more on the duration of unemployment than on the mere fact of being unemployed. Thus the average duration may be a more relevant statistic than the number of unemployed, because the latter depends simultaneously on the number of occurrences of unemployment spells and on the average duration of unemployment. This is also an instance of the importance of distinguishing stocks and flows: the same stock may be obtained with long durations and a small inflow, or with short durations and a larger inflow.

b) In the economic theory of job search, the reservation wage depends on the duration of unemployment; thus the length of unemployment spells plays a critical role.

In example 2 (the financial market):
Questions of interest may be: What is the average length of life of a company? Does it depend on the sector? Does the ("instantaneous") probability of bankruptcy depend on the age of the company, on the sector, on the business cycle?

1.2.3 Specific Features of Duration Data

a) Durations are non-negative random variables.
Therefore, the Normal distribution is no longer a natural reference distribution.

b) Data are typically "incomplete".


b1) For example, in some surveys unemployment is measured only for individuals unemployed at the time of the survey. Therefore, apart from a structural model (more or less) based on economic analysis, there is a need for a reporting model taking into account the fact that the available data are incomplete in comparison with what economic analysis purports to explain. Incompleteness may bear either on the duration itself (truncation or censoring: this is known as length-biased sampling) or on some unobservable covariates (or "explanatory" variables), leading to problems of so-called "heterogeneity".

b2) A situation often met in econometrics is the following: various sources of differently incomplete data are available. For example, data may be simultaneously available on:

(i) the stock of unemployed individuals, with or without (some) individual characteristics;

(ii) individual biographies;
(iii) the distribution of individual characteristics.

Such a situation calls for a structural model allowing one to handle these different sets of data coherently. It also points to the need for a clear understanding of the distinction between the structural form and the reduced form of a model, and eventually for a clear concept of exogeneity.

c) Specific forms of modelling durations.
Distribution functions are often replaced by their complement, called the "survivor" function, and, more importantly, so-called "hazard" functions are often preferred to distribution (or density) functions in the description or the specification of the probability law of a duration. Indeed, both the problem of structural modelling and that of incomplete data push toward modelling the probability law of a duration conditionally on the system being in the same state for a given time, rather than unconditionally.

1.3 Modelling Point Processes

1.3.1 Introduction

In this section, an overview of the main problems to be faced is presented.

Motivation:
Essentially, point processes serve as a reference framework. Consider, for a given individual:
- a discrete set of states (i.e. two states, or finitely or countably many);
- changes of state at "any" time (i.e. continuous time).

Different (and complementary) approaches:
- Discrete-state, continuous-time stochastic processes.
- Counting processes: "count", as time elapses, the number of times a specified event has been realized (thus, in this case, the state space is IN).
- Random measures (more general): a model for indistinguishable points distributed "randomly" in some space; alternatively, a random distribution of (non-negative) masses on a general space.

This section:
- these lecture notes mainly handle the duration-data aspect;
- this section surveys the main themes to be considered in these lecture notes.

1.3.2 Discrete State Space Stochastic Processes

In general

In general, a stochastic process X may be viewed as an infinite family of random variables: X = (X(t) : t ∈ I), where I is an infinite set indexing the family X. A natural way of describing the distribution of a stochastic process is to specify its finite-dimensional distributions; these are the multivariate distributions of (X(t1), X(t2), ..., X(tk)) for any finite k and any {t1, t2, ..., tk} ⊂ I.

A point process is a stochastic process (X(t) : t ∈ I) such that, for each t, X(t) is a discrete random variable and t is a continuous time index. Instead of specifying all the finite-dimensional distributions of a point process, one may specify some of its most relevant properties, viz. its counting properties or its interval properties.

Counting properties
For any time t, one may specify the probability, for one individual, of being in state j and, for n i.i.d. individuals, specify the distribution of the number of individuals being in state j. For a given time interval (t1, t2), one may also specify the distribution of the number of entrances into state j, either for an individual or for n i.i.d. individuals. These distributions characterize the counting properties.
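As a hypothetical numerical illustration of a counting property (not taken from the text), one may simulate n i.i.d. individuals whose sojourn in an initial state E0 has an exponential length, and count how many are still in E0 at time t; the count should be close to n·S(t). All names and parameter values below are illustrative:

```python
import math
import random

def count_in_initial_state(n, rate, t, rng):
    """For n i.i.d. individuals leaving state E0 after an Exponential(rate)
    sojourn, count how many are still in E0 at time t."""
    return sum(1 for _ in range(n) if rng.expovariate(rate) > t)

rng = random.Random(0)           # fixed seed for reproducibility
n, rate, t = 10_000, 0.5, 1.0
observed = count_in_initial_state(n, rate, t, rng)
expected = n * math.exp(-rate * t)   # n * S(t)
print(observed, round(expected))
```

The observed count fluctuates around n·S(t) ≈ 6065 here, which is exactly the link between counting properties and the survivor function discussed below.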

Interval properties
Another way of looking at a point process is to specify the distribution of the length of time between two changes of state. Such distributions characterize the interval properties.

a) Duration: the length of a time period spent in a same state or, equivalently, between two consecutive changes of state; the probability law of the duration is the main interval property. Clearly, a duration is a non-negative random variable.

b) More specifically, let X = (Xt) be a stochastic process with discrete state space and continuous time. A duration, Tj, is the length of a time interval during which Xt stays in the same state. Figure 1.1 illustrates the general case, where the τj's are the stochastic times of change of state; they are defined recursively as follows:

τj = g(X) = inf{ t > τj−1 | Xt ≠ Xτj−1 },   τ0 = 0,   X0 = E0

Tj = τj − τj−1

c) In the simplest case of only 2 states (let these states be E0 and E1), the event {T = t0} means: Xt = E0 for t < t0 and Xt = E1 for t ≥ t0, and t0 is the time of the change of state. If there are only two states and only one transition is possible (a death process), one may also speak of a "failure time". This is the main concern of chapters 2 and 3. Chapters 5 and 6 will consider some extensions.

d) In order to define a duration unambiguously, the time origin (t = 0) and the time scale (unit of time) should be defined carefully.
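The recursion defining the τj's and Tj's can be mimicked on a recorded step trajectory. The sketch below is illustrative only (it assumes, hypothetically, that the trajectory is recorded as time-ordered (time, state) pairs, one per change point); it recovers the change times τj and the durations Tj = τj − τj−1:

```python
def durations_from_path(path):
    """path: time-ordered list of (time, state) pairs of a step trajectory,
    with path[0] = (0, E0).  Returns (taus, durations), where each tau_j is
    the first recorded time at which the state differs from the previous
    state, and T_j = tau_j - tau_{j-1} with tau_0 = 0."""
    taus = [t for (t0, s0), (t, s) in zip(path, path[1:]) if s != s0]
    durations = [tau - prev for prev, tau in zip([0.0] + taus[:-1], taus)]
    return taus, durations

path = [(0.0, "E0"), (1.5, "E1"), (2.0, "E2"), (3.5, "E3")]
print(durations_from_path(path))  # ([1.5, 2.0, 3.5], [1.5, 0.5, 1.5])
```

This mirrors the trajectory of Figure 1.1: three changes of state produce three change times and three durations.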

1.3.3 Conditional Modelling

Basic idea: the introduction of "exogenous" variables.

1.3.4 Length-biased Sampling: Censoring and Truncation

Some Examples


[Figure 1.1: A typical trajectory in the general case — state Xt (taking values E0, E1, E2, E3) plotted against time t, with change times τ1, τ2, τ3 and durations T1, T2, T3.]


Suppose that, in a survey, a worker is asked: "When and for how long have you been unemployed in the last 12 months?". If he is still unemployed at the time of the interview, he will be unable to answer the question "how long will your spell of unemployment be?".

Conclusions
(i) At the level of data analysis: structural modelling of duration, from an economic point of view, is not enough; one should also introduce a "reporting" process taking into account the fact that the available data are "incomplete".

(ii) At the level of experimental planning: every effort should be made to avoid length-biased sampling; thus biographical data on complete professional lives are "ideal", although not often available.

(iii) We shall distinguish:
structural modelling: chapters 2 to 5
sampling and inference: chapters 6 to 9
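The "reporting" process for right censoring can be sketched in a few lines: instead of the true duration T, one observes min(T, C) together with an indicator of whether the spell actually ended before the censoring time C (here, the time of the survey interview). This is a hypothetical sketch, not the text's notation:

```python
def right_censor(durations, censor_times):
    """Return the observed (duration, uncensored?) pairs when each true
    duration T is only seen up to a censoring time C: we record min(T, C)
    and whether the spell actually ended (T <= C)."""
    return [(min(t, c), t <= c) for t, c in zip(durations, censor_times)]

# True unemployment spells vs. a common interview date 5 periods in:
spells = [2.0, 7.5, 4.0]
interview = [5.0, 5.0, 5.0]
print(right_censor(spells, interview))
# [(2.0, True), (5.0, False), (4.0, True)]
```

The second individual is still unemployed at the interview: only the lower bound 5.0 on his spell is observed, which is precisely the incompleteness a reporting model must account for.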

1.3.5 Extensions

A-Multivariate duration

Basic idea:
In many situations it may be appropriate to recognize explicitly that more than one change of state may take place in a given realization of the process. Thus:

(i) In a simple duration model, only one transition is possible; thus there is an initial state and a final state.

Vocabulary: one speaks of "survival" analysis in a death process, or of "failure time" for the length of time before a unique point event is realized.

Examples: the lifetime of machine components in industrial reliability, the survival time of patients in clinical trials, the time taken by a subject to complete a specified task in psychological experimentation.

(ii) In multivariate duration models, several transitions, possibly among the same states, are possible.

Examples: changes of residence in demographic studies, durations of unemployment or of strikes in economics.

B-Competing risks


Basic idea
When there are more than two states, one among several possible ideas consists of modelling as if there were several "competing" causes of change of state (or of failure).

Examples
- in clinical trials: several causes of death (in which case the state "dead" is distinguished according to the cause of death)
- in industrial life testing: several faults causing failure
- in the labor market: an individual may leave the state "employed" toward one of the states "unemployed", "partially employed" or "early retired".

1.3.6 Modelling Strategy

Basically: parametric, non-parametric and semi-parametric models.

In parametric models, the parameter space is a (subset of a) Euclidean space. For instance, an i.i.d. normal model.

In non-parametric models, the parameter space is a (subset of a) functional space, for instance the set of all distribution functions on the real line.

In semi-parametric models, the parameter space is a (subset of a) product of a Euclidean space and of a functional space; for instance, in semi-parametric regressions the parameter could be the variance of the residual distribution (assumed to be normal) and an arbitrary (regression) function.

1.3.7 Two levels of Modelling

First aspect:

Chapters 2 to 5: structural modelling, without regard to the availability of data.

Chapters 6 to 8: sampling and reporting modelling, along with problems of inference.

Second aspect:

marginal modelling (the process generating T)

conditional modelling (the process generating T conditionally on some (typically exogenous) Z)


Chapter 2

Simple point process

2.1 Introduction

We first consider a simple death process, i.e. a point process characterized by two states (E0, E1) and a unique transition (from E0 to E1). The trajectories of such processes are completely characterized by a (univariate) duration. Modelling such simple trajectories is the object of this chapter. Later chapters consider extensions to processes with more than two states but only one transition (Chapter 4) and to processes with multiple transitions (Chapter 5).

Let T be a "duration", i.e. the length of a time period spent in a fixed state (also a "failure time" when the change of state is a "failure" that can occur at most once and the time origin is fixed at t = 0). In the simple case of a "death process", the state space has only 2 states and the trajectories have at most a unique transition; let:

X = (Xt : t ≥ 0)   with   Xt ∈ {E0, E1} and X0 = E0

be the underlying stochastic process. Assuming the trajectories to be a.s. CADLAG (that is: for any t, X(t−) = lim_{u↑t} Xu exists, X(t+) = lim_{u↓t} Xu exists, and X(t+) = X(t)), duration T may then be defined as follows (see Figure 2.1):

T = g(X) = τ − 0 = inf{t | Xt = E1}

Thus, in this simple case, τ, the instant of the (unique) transition, and T, the duration, are the same random variable.


Figure 2.1: A typical trajectory of a "Death process" (the process stays in E0 on [0, τ) and in E1 from τ on; T = τ − 0).

Thus, by definition:

Xt = E0   for t < τ
Xt = E1   for t ≥ τ

The hypothesis that the trajectories of Xt are CADLAG (a.s.) means:

Xτ− = E0

Xτ+ = Xτ = E1

If we interpret E0 and E1 as "alive" and "dead" respectively, the event {T = t} means "life up to time t− and death exactly at time t".

Clearly, P(T ≥ 0) = 1 and, in most models, we shall also assume that P(T = 0) = 0 (but see (i) in Remarks 2.2.5). In structural models, T is often a continuous r.v., but the empirical distribution function is a discrete-time process, and non-parametric methods are often based on (functional) transformations of the empirical distribution function, considered as a best estimate of the "true" distribution function; this is, in particular, motivated by the Glivenko-Cantelli theorem (see, for instance, Chapter 9). Furthermore, in some applications, institutional rules may give a strictly positive probability to an event, or a transition, being realized at a fixed date. Therefore, in this chapter we shall explicitly consider both continuous and discrete duration models.


2.2 Distribution and Survivor Functions

2.2.1 General Definitions and Properties

Introduction

We first recall the general definition of the distribution function and of its complement, the survivor function. Next, we give more details for the continuous and the discrete case, particularly from the point of view of the continuity of these functions.

Distribution function

Definition

FT (t) = P (T ≤ t)

Meaning

Probability of a duration at most equal to t, or of failure (or death) not later than t, i.e. of not surviving after time t.

Characteristic properties

(i) FT(t) ∈ [0, 1]

(ii) FT(t+) = FT(t), i.e. right-continuous

(iii) t1 > t2 ⇒ FT(t1) ≥ FT(t2), i.e. monotone non-decreasing

(iv) FT(0−) = 0,   lim_{t→∞} FT(t) = 1

Note:

“Characteristic" means: to any function satisfying those 4 properties,one may associate a unique probability measure P on IR+ such thatFT (t) = P (T ≤ t)


Survivor function

Two different versions of the survivor function, a right-continuous one and a left-continuous one, are met in the literature and should be distinguished, at least in the non-continuous case.

Definition

F T (t) = P (T > t) = 1 − FT (t)

ST (t) = P (T ≥ t) = 1 − FT (t) + P (T = t)

Meaning:

F̄T(t) is the probability of failure (strictly) later than t; equivalently, the probability of surviving at t.

ST(t) is the probability of failure not before t; equivalently, the probability of surviving at time t−.

Characteristic Properties

The two versions of the survivor function enjoy the same properties, except for the direction of the possible discontinuities.

(i) ST(t) ∈ [0, 1],   F̄T(t) ∈ [0, 1]

(ii) ST(t−) = ST(t), i.e. left-continuous;
F̄T(t+) = F̄T(t), i.e. right-continuous (as is FT)

(iii) t1 < t2 ⇒ ST(t1) ≥ ST(t2) and F̄T(t1) ≥ F̄T(t2)

(iv) ST(0) = F̄T(0−) = 1;   lim_{t→∞} ST(t) = ST(∞−) = 0;   lim_{t→∞} F̄T(t) = F̄T(∞−) = 0

Relationships

F̄T(t−) = ST(t) = ST(t−) ≥ ST(t+) = F̄T(t+) = F̄T(t)


2.2.2 (Absolutely) Continuous Case

Motivation
As a point process is defined as a stochastic process in continuous time, almost all theoretical distributions used for simple point processes are of the continuous type.

The continuous (or diffuse) case is characterized by

P (T = t) = 0 ∀ t

or, equivalently :

ST(t) = F̄T(t) ∀ t

The absolutely continuous case furthermore requires that FT (or F̄T, or ST) be differentiable. Hence the following definition.

Definition

∃ fT : IR+ → IR+ such that:

FT(t) = ∫_0^t fT(u) du

equivalently: ST(t) = ∫_t^∞ fT(u) du = 1 − ∫_0^t fT(u) du

equivalently: fT(t) = dFT(t)/dt = −dST(t)/dt

The function fT (t) is called a “density" function.

Meaning

fT(t) = lim_{∆↓0} (1/∆) P[t < T ≤ t + ∆]

Thus, the density function may be interpreted as an "instantaneous probability" of failure (or of dying). In the continuous case, there exists a value of t such that FT(t) = ST(t) = 0.5; that value is the median of the distribution (see Figure 2.2).


Figure 2.2: Distribution Function and Survivor Function in the Continuous Case (FT(t) increases from 0 to 1, ST(t) decreases from 1 to 0; they cross at the median, where both equal 1/2).

2.2.3 Discrete Case

Motivation
In statistical applications, the empirical process is discrete in nature. Thus, for finite sample sizes, the empirical distribution is discrete, even if the theoretical distribution is continuous. As most non- or semi-parametric procedures are based on the empirical process, a precise description of the discrete case is called for. Furthermore, a theoretical continuous distribution may be viewed as an idealisation of actually discrete measurements. Thus some models specify, from the start, discrete distributions.

Definition

∃ (fj, aj), j ∈ J ⊆ IN, with:

fj > 0,   ∑_{j∈J} fj = 1

aj ≥ 0,   aj+1 > aj

such that:


FT(t) = ∑_{j∈J} fj 1I{aj ≤ t} = ∑_{j | aj ≤ t} fj

F̄T(t) = ∑_{j∈J} fj 1I{aj > t} = ∑_{j | aj > t} fj = 1 − ∑_{j | aj ≤ t} fj

ST(t) = ∑_{j∈J} fj 1I{aj ≥ t} = ∑_{j | aj ≥ t} fj = 1 − ∑_{j | aj < t} fj

equivalently:

fj = FT(aj) − FT(aj−) = FT(aj) − FT(aj−1)
   = F̄T(aj−) − F̄T(aj) = F̄T(aj−1) − F̄T(aj)
   = ST(aj) − ST(aj+) = ST(aj) − ST(aj+1)

Note in particular that:

FT(aj) = ∑_{i ≤ j} fi

F̄T(aj) = 1 − ∑_{i ≤ j} fi = ∑_{i > j} fi

Note also that:

FT(aj+) = FT(aj) > FT(aj−) = FT(aj) − fj = FT(aj−1)

F̄T(aj+) = F̄T(aj) = ST(aj+) < F̄T(aj−) = F̄T(aj) + fj = F̄T(aj−1)
ST(aj+) = ST(aj+1) = ST(aj) − fj < ST(aj−) = ST(aj)
ST(aj) − F̄T(aj) = fj
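As a quick sanity check on these definitions, here is a minimal sketch (the support points aj and masses fj are illustrative, not from the text) computing FT, F̄T and ST for a discrete duration:

```python
def make_discrete(a, f):
    """Return F_T (P(T <= t)), Fbar_T (P(T > t)) and S_T (P(T >= t))
    for increasing support points a and probability masses f."""
    assert abs(sum(f) - 1.0) < 1e-9          # non-defective distribution
    def F(t):     # right-continuous step function, jump f_j at a_j
        return sum(fj for aj, fj in zip(a, f) if aj <= t)
    def Fbar(t):  # right-continuous
        return sum(fj for aj, fj in zip(a, f) if aj > t)
    def S(t):     # left-continuous
        return sum(fj for aj, fj in zip(a, f) if aj >= t)
    return F, Fbar, S

a = [1.0, 2.0, 3.0]
f = [0.2, 0.5, 0.3]
F, Fbar, S = make_discrete(a, f)
```

At a support point aj one recovers ST(aj) − F̄T(aj) = fj, the jump noted above.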

Meaning

The event {T = aj} means "life up to time aj− and death at time aj", and that event has probability fj.

Continuity Property

The CADLAG property of both FT and F̄T results from the corresponding property of the characteristic (indicator) functions that compound them (see Figure 2.3; see also Section 9.1), whereas ST is CAGLAD (left-continuous with right limits).


Figure 2.3: Distribution Function and Survivor Function in the Discrete Case (FT(t), F̄T(t) and ST(t) are step functions, with jumps f1, f2, f3 at the support points a1, a2, a3).


2.2.4 Defective distributions

Motivation and definition

Usual distributions, as characterized in Section 2.2.1, assume zero probability for the event "infinite duration", i.e.:

lim_{t→∞} FT(t) = FT(∞−) = 1

lim_{t→∞} F̄T(t) = F̄T(∞−) = 0

lim_{t→∞} ST(t) = ST(∞−) = 0

In several applications, however, the end of the disease, or the failure of a machine, is never observed for some individuals. This suggests giving positive probability (or positive frequency, in empirical distribution functions) to the event "infinite duration". Formally, this is achieved through so-called "defective distributions", obtained by defining the distribution and survivor functions on the compactified (positive) real line, i.e. by adding +∞ to IR+:

ĪR+ = IR+ ∪ {+∞}

along with the following properties:

lim_{t→∞} FT(t) = FT(∞−) < 1,   FT(∞) = 1,   FT(∞) − FT(∞−) = P(T = +∞)

lim_{t→∞} F̄T(t) = F̄T(∞−) = P(T = +∞) > 0,   F̄T(∞) = 0

lim_{t→∞} ST(t) = ST(∞−) = ST(∞) = P(T = +∞) > 0

Thus both ST and F̄T are undefined to the right of +∞ and, at infinity, ST is left-continuous. Note, however, that in ĪR+, +∞ is an isolated point, given that any bounded neighborhood of +∞ has empty intersection with IR+.

Construction of Defective distributions

Simple ways for building defective distributions include the following ones:

(i) Empirical estimators: for instance, a standard estimator (Kaplan-Meier) of the survivor function with censored data may lead to a defective distribution (see Section 7.2.2).


(ii) Mixture distributions:

P = (1 − a) P1 + a P2

where P1 is a non-defective distribution, P2 is a point mass on +∞, and a = P(T = +∞).

(iii) Limit procedures: for instance, let T be distributed as a log-normal distribution (see e.g. Section 2.5.5) with parameters µ and σ² = n; the limit when n → +∞ is a defective distribution putting probability .5 on 0 and .5 on +∞.
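The mixture construction (ii) is easy to sketch numerically; a hedged example with P1 exponential (the rate and the defect a are illustrative choices, not from the text):

```python
import math

def defective_survivor(t, rate=1.0, a=0.2):
    """S_T(t) for P = (1 - a)*Exp(rate) + a*(point mass at +infinity):
    a fraction a of the units never "fails", so S_T(t) -> a as t -> infinity."""
    return (1.0 - a) * math.exp(-rate * t) + a

S_at_0 = defective_survivor(0.0)   # everyone "alive" at t = 0
S_tail = defective_survivor(1e6)   # approaches the defect P(T = +infinity)
```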

2.2.5 Remarks

(i) Probability on 0

Many models are such that P(T = 0) = 0; this means that FT(0) = FT(0−) = 0 or, equivalently, ST(0) = F̄T(0) = 1. The plausibility of such an assumption crucially depends on the specification of the time origin or on the definition of the state space.

For instance, in a survey conducted to assess how students enter the labor market, the time origin (i.e. the calendar time for which t = 0) for a given individual can be specified to be the time of her (his) graduation. If T is the duration of unemployment after leaving school, the event {T = 0} does realize for a student who has found a job before graduating and eventually starts working as early as time t = 0. Modelling such a situation would require P(T = 0) > 0. If, however, T is defined as the duration of the unemployment of those who have failed to find a job before graduating, the time origin may be defined as the time of entrance into the state of unemployment, making the property P(T = 0) = 0 acceptable.

In other words, whether or not P(T = 0) = 0 is essentially a matter of modelling a given situation; the usual duration distributions display the property P(T = 0) = 0 and may therefore be interpreted as distributions conditional on T > 0, leaving the possibility of modelling separately the probability of the complementary event {T = 0}.


When P (T = 0) > 0, one has:

FT (0−) = 0 FT (0) = P (T = 0) > 0

F T (0−) = 1 F T (0) = 1 − P (T = 0) < 1

ST (0) = 1 ST (0+) = 1 − P (T = 0) < 1

In the sequel, we always assume P(T = 0) = 0, unless explicitly mentioned otherwise; i.e. we consider that modelling P(T = 0) is an issue outside the scope of the main theme.

(ii) One-sided continuity

Note that, in general:

monotone + right-continuous ⇒ CADLAG
monotone + left-continuous ⇒ CAGLAD

(iii) Structure of a probability measure on IR+

In general, any probability measure P on IR+ may be decomposed into three components as follows:

P = a1 Pd + a2 Pac + a3 Ps,   aj ≥ 0,   ∑_{1≤j≤3} aj = 1

where:

Pd: discrete

Pac: absolutely continuous (with a density)

Ps: continuous singular (with nowhere a density)

We shall always assume that there is no singular part (i.e. a3 = 0).

2.3 Truncated Distributions and Hazard Functions

2.3.1 Motivations

(i) Problem of time dependence. Consider the question: what is the "instantaneous" probability of dying at time t, given you are still living at time t−


? In other words: this is the "instantaneous" probability of leaving state E0 at time t, given one has been in state E0 from time 0 to t−. The question is: "how far does it depend on t?". More generally, this is the problem of the probability law of the duration T, conditionally on T ≥ t. (Remember that the event {T ≥ t} means "still alive at time t−".) Such conditional distributions are "truncated" distributions.

(ii) the preceding questions are often so “natural" that modelling those trun-cated distributions may be economically more meaningful than modellingthe untruncated distributions. For instance, in many job search models, thereservation wage, at a given instant, is made a function of the duration ofunemployment up to that instant.

(iii) For the probabilistic approach to a simple point process, the history of the process up to time t is perfectly summarized either by {T ≤ t}, in which case X(t) = E1, or by {T > t}, in which case X(t) = E0. Thus the distribution of T truncated at T ≥ t models the short-term dynamics of the process.

(iv) Censoring (see Section 6.2.2) makes truncated distributions particularly useful.

2.3.2 Integrated Hazard Functions

Definition

HT(t) = ∫_[0,t] (1/ST(u)) dFT(u) = ∫_[0,t] (1/F̄T(u−)) dFT(u)    (2.1)

Properties

(i) HT : IR+ → IR+

(ii) HT (t+) = HT (t) i.e. right-continuous

(iii) t1 < t2 ⇒ HT (t1) ≤ HT (t2) i.e. monotone non-decreasing

(iv) HT(0−) = 0,   HT(∞) = +∞ and HT(∞−) ≤ ∞

Thus, properties (ii) and (iii) imply that HT is a CADLAG function.

Meaning and motivation


- This is a simple transformation of the survivor function and, as we shall see later, a useful tool for characterizing some distributions of duration.
- Doob-Meyer decomposition (see Chapter 5).

Remarks

(i) Note first that:

HT (0) > 0 ⇔ P (T = 0) > 0

(ii) For some problems it may be of interest to define, in the general case, different alternative versions of the integrated (or cumulative) hazard function, namely:

H^R_T(t) = −ln F̄T(t)    H^L_T(t) = −ln ST(t)    (2.2)

Clearly, H^R_T (respectively H^L_T) is right-continuous (respectively left-continuous), and H^L_T ≤ H^R_T; furthermore, in the continuous case, H^R_T = HT = H^L_T.

In order to distinguish the two forms in the general case, HT in (2.1) is often referred to as the "predictable form", in reference to the Doob-Meyer decomposition, whereas H^R_T in (2.2) is often referred to as the "exponential form", for obvious reasons.

(iii) In this note, we privilege the right-continuous versions of the functions. Similarly to (2.2), a left-continuous version of the predictable form may be obtained by integrating over [0, t) instead of [0, t], but we shall not make use of such a definition; in case of necessity, one may use HT(t−).

2.3.3 Hazard Function (or: age-specific failure rate)

(Absolutely) Continuous Case

Definition

Remember that, in the absolutely continuous case, F̄T(u) = ST(u) and there exists a density function fT(t) such that:

fT(t) = −dST(t)/dt = dFT(t)/dt


Therefore:

HT(t) = ∫_0^t (fT(u)/ST(u)) du = −∫_0^t (1/ST(u)) dST(u) = −ln ST(t)

Thus, in the continuous case, there exists a function hT(t) such that:

hT(t) = dHT(t)/dt = −d ln ST(t)/dt = fT(t)/ST(t)

Meaning:

The function hT(t) may be viewed as an "instantaneous probability" of leaving the present state, conditionally on having been in the initial state for a time t; indeed:

hT(t) = lim_{∆↓0} (1/∆) P[t < T ≤ t + ∆ | T > t]

Thus, hT(t) is also called an "age-specific failure rate" or an "age-specific death rate".

Properties

(i) hT : IR+ → IR+

(ii) ∫_0^∞ hT(u) du = ∞

and hT is not necessarily monotone.

Relationships:

HT(t) = −ln ST(t)

is equivalent to:

ST(t) = exp{−HT(t)}

Furthermore:

ST(t) = exp{−∫_0^t hT(u) du}

HT(t) = ∫_0^t hT(u) du

fT(t) = hT(t) exp{−∫_0^t hT(u) du}

FT(t) = 1 − exp{−∫_0^t hT(u) du}


Discrete Case

Definition

Remember that, in the discrete case [(aj, fj) : 1 ≤ j ≤ k], for any (integrable) function g(u), we have:

∫_[0,t] g(u) dF(u) = ∑_{j | aj ≤ t} g(aj) fj = ∑_j g(aj) fj 1I{aj ≤ t}

Therefore:

HT(t) = ∑_{j | aj ≤ t} fj/ST(aj) = ∑_{j | aj ≤ t} fj/(F̄T(aj) + fj)

equivalently:

HT(t) = ∑_{j | aj ≤ t} fj/(fj + fj+1 + . . .)

So, the discrete version of the (instantaneous) hazard function is derived from the jumps of HT(t):

hj = HT(aj) − HT(aj−) = fj/(fj + fj+1 + fj+2 + . . .) = fj/ST(aj) = fj/(F̄T(aj) + fj)

Thus we obtain:

HT(t) = ∑_{j | aj ≤ t} hj = ∑_j hj 1I{aj ≤ t}

In particular:
(i) h1 = f1.
(ii) (non-defective distributions) When sup aj = ak < +∞, then hk = 1.

Meaning

Remember that the event {T = aj} means: life until instant aj− and death just at aj. Thus, hj may be interpreted as the probability of dying exactly at time aj, given one is still alive at time aj−.

Relationships


(i) We may write the survivor functions as follows:

ST(t) = ∏_{j | aj < t} (1 − hj) = ∏_j (1 − hj)^{1I{aj < t}}

F̄T(t) = ∏_{j | aj ≤ t} (1 − hj) = ∏_j (1 − hj)^{1I{aj ≤ t}}

Indeed, let us recall the familiar identity:

c0 − c1 = c0 (1 − c1/c0)

c0 − (c1 + c2) = c0 (1 − c1/c0)(1 − c2/(c0 − c1))

. . .

c0 − ∑_{1≤j≤k} cj = c0 ∏_{1≤j≤k} (1 − cj/(c0 − ∑_{1≤m≤j−1} cm))

and apply it to:

ST(t) = 1 − ∑_{j | aj < t} fj

      = ∏_{j | aj < t} [1 − fj/(1 − ∑_{1≤m≤j−1} fm)]

      = ∏_{j | aj < t} (1 − fj/(F̄T(aj) + fj)) = ∏_{j | aj < t} (1 − fj/ST(aj))

      = ∏_{j | aj < t} (1 − hj)

similarly:

F̄T(t) = ∏_{j | aj ≤ t} (1 − hj)

We also obtain the following relationships:

−ln F̄T(t) = −∑_{j | aj ≤ t} ln(1 − hj) = −∑_j 1I{aj ≤ t} ln(1 − hj)


≈ ∑_{j | aj ≤ t} hj = HT(t)   if the hj are "small", i.e. −ln(1 − hj) ≈ hj

Thus, in the discrete case, HT(t) is approximately equal to −ln F̄T(t) if all the hj are small, while in the continuous case HT(t) is exactly equal to −ln F̄T(t) = −ln ST(t).

(ii) From

ST(aj) = ∏_{1≤i≤j−1} (1 − hi),

we also obtain:

fj = hj ∏_{1≤i≤j−1} (1 − hi).

Note also that:

1 − hj = 1 − fj/ST(aj) = ST(aj+1)/ST(aj) = F̄T(aj)/F̄T(aj−1)    (2.3)
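A minimal numerical check of these discrete-case relations (the masses fj are illustrative): h1 = f1, hk = 1 for a non-defective distribution with finite support, and the fj are recovered from fj = hj ∏_{i<j}(1 − hi).

```python
f = [0.2, 0.5, 0.3]            # masses at support points a_1 < a_2 < a_3

# S(a_j) = f_j + f_{j+1} + ... , then h_j = f_j / S(a_j)
S, tail = [], 1.0
for fj in f:
    S.append(tail)
    tail -= fj
h = [fj / Sj for fj, Sj in zip(f, S)]

# reconstruct S(a_j) = prod_{i<j}(1 - h_i) and f_j = h_j * S(a_j)
S_rec, prod = [], 1.0
for hj in h:
    S_rec.append(prod)
    prod *= 1.0 - hj
f_rec = [hj * Sj for hj, Sj in zip(h, S_rec)]
```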

Monotone hazard function

In the sequel, a recurrent theme is to specify whether a hazard function is monotone or not. In particular, a hazard function which is monotone non-decreasing is said to enjoy the IFR (Increasing Failure Rate) property.

2.3.4 Truncated Survivor Function

Definition

In general:

F̄T(t | t0) = P(T > t + t0 | T > t0) = P(T > t + t0)/P(T > t0) = F̄T(t + t0)/F̄T(t0)    (2.4)

ST(t | t0) = P(T ≥ t + t0 | T ≥ t0) = P(T ≥ t + t0)/P(T ≥ t0) = ST(t + t0)/ST(t0)    (2.5)

In the absolutely continuous case:


ST(t | t0) = exp[HT(t0) − HT(t + t0)] = exp[−∫_{t0}^{t+t0} hT(u) du]

In the discrete case [(aj , fj) : 1 ≤ j ≤ k], from (2.3), we obtain:

1 − hj = F T (aj | aj−1) = ST (aj+1 | aj) (2.6)

Properties

(i) ST(· | ·) : IR+ × IR+ → [0, 1]

(ii) ST(0 | t0) = 1 ∀ t0, and lim_{t→∞} ST(t | t0) = 0 ∀ t0

(iii) monotone non-increasing in t, for any t0

(iv) left-continuous in t, for any t0

Note that:

ST(t | 0) = ST(t) ∀ t

is always true, whereas:

F̄T(t | 0) = F̄T(t) ∀ t

requires P(T = 0) = 0.

Remark:

Truncated survivor functions are particularly useful when dealing with grouped duration data, i.e. data for which we only know that the duration lies in a given interval. Consider indeed a given grid on the time axis:

0 < t1 < t2 < . . . < tI

and define: cj = tj − tj−1, with t0 = 0


Note that:

c1 = t1,   tk = ∑_{1≤j≤k} cj,   tj = tj−1 + cj

and that:

{tk−1 ≤ T < tk} = [∩_{1≤j≤k−1} {T ≥ tj}] ∩ {T < tk}

Then we may write:

P[tk−1 ≤ T < tk] = P(T < tk | T ≥ tk−1) · P(T ≥ tk−1 | T ≥ tk−2) · · · P(T ≥ t1 | T ≥ 0)

                = [1 − ST(ck | tk−1)] ∏_{1≤j≤k−1} ST(cj | tj−1)

2.3.5 Truncated Expected Duration (expected residual life)

Definition

rT(t) = E[T − t | T ≥ t] = E[T | T ≥ t] − t = (1/ST(t)) ∫_t^∞ ST(u) du

Properties:

(i) rT : IR+ → IR+

(ii) rT (0) = E(T ) > 0

Relationships

In the absolutely continuous case, we have:

ST(t) = (rT(0)/rT(t)) exp[−∫_0^t (1/rT(u)) du]
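The definition of rT can be checked numerically in the exponential case, where rT(t) = 1/α for every t (see Section 2.5.1); a sketch with a truncated numerical integral (α = 0.8 is illustrative):

```python
import math

ALPHA = 0.8

def S(t):                           # exponential survivor
    return math.exp(-ALPHA * t)

def r(t, upper=60.0, n=20000):
    """r_T(t) = (1/S_T(t)) * integral_t^inf S_T(u) du,
    with the tail truncated at `upper` (negligible for this rate)."""
    du = (upper - t) / n
    integral = sum(0.5 * (S(t + i * du) + S(t + (i + 1) * du)) * du
                   for i in range(n))
    return integral / S(t)
```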

2.4 Temporal dependence

2.4.1 Basic concepts

a) Basic question


Figure 2.4: Typical Hazard Function for Human Life (h(t) plotted against age, with local peaks around 18-22 and 40 years).

How does hT(t) depend on t? Remember that, in the continuous case:

hT(t) = lim_{∆↓0} (1/∆) P[t < T ≤ t + ∆ | T > t]

This question is best appreciated through the next examples.

b) Some Examples

(i) Length of human life (Figure 2.4)
A typical hazard function for human life is reproduced in Figure 2.4. At first, h is sharply decreasing, reflecting childhood mortality related to a natural selection process; the peak around 18-22 years is partly due to motorcycle accidents, that around 40 years to the first heart attacks; finally, the function becomes increasing due to advanced aging.

(ii) Unemployment duration (see Figure 2.5)
A typical hazard function for unemployment is given in Figure 2.5, with a first period of actively looking for a job and, consequently, an increasing probability of leaving that state, and thereafter a period of discouragement and, consequently, a decreasing probability of leaving that state.

Figure 2.5: Typical Hazard Function for unemployment (increasing during the most active period of job search, then decreasing during the period of discouragement).

(iii) Length of life of companies (Figure 2.6)
A typical hazard function for the length of life of companies is given in Figure 2.6: here, the survivor function is more or less exponentially decreasing, so that the hazard function is roughly constant.

c) Definition

Duration T has the property of "temporal independence" iff:

hT(t) = h ∀ t
⇔ ST(t | t0) = ST(t) ∀ t, t0
⇔ rT(t) = r ∀ t
⇔ T ∼ Exp(h)

Remark: ST(t | t0) = ST(t) ⇔ ST(t + t0) = ST(t) ST(t0), i.e. ln ST(t) is linear in t.
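The memorylessness characterization — ST(t | t0) = ST(t) for every t0 — is immediate to check for the exponential distribution; a small sketch (h = 0.5 is an illustrative value):

```python
import math

H = 0.5

def S(t):                    # exponential survivor, S(t) = exp(-H*t)
    return math.exp(-H * t)

def S_trunc(t, t0):          # S_T(t | t0) = S_T(t + t0) / S_T(t0)
    return S(t + t0) / S(t0)

# the truncated survivor does not depend on the truncation point t0
vals = [S_trunc(2.0, t0) for t0 in (0.0, 1.0, 5.0, 10.0)]
```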


Figure 2.6: Typical Hazard Function and Survivor Function for Companies' Life (S(t) roughly exponentially decreasing, h(t) roughly constant).


d) Interpretation

The process underlying the duration T has "no memory": in particular, the truncated survivor function does not depend on t0. That property is characteristic of, and characterizes, the exponential distribution.

This property requires the continuity of the distribution of T. Indeed, in the discrete case [(aj, fj) : 1 ≤ j ≤ k], the hj cannot all be equal, given that h1 = f1 and hk = 1.

2.4.2 Structural instability of temporal independence

In a sense, the property of temporal independence is a limit property: it is indeed typically unstable, as shown by the next example.

An example in Operations Management.
Consider a machine with two components in parallel and let:

Tj : duration of component j

T = max{T1, T2} : duration of the machine

Assume furthermore that: T1 ⊥⊥ T2

Clearly:

P[max(T1, T2) ≤ t] = P[{T1 ≤ t} ∩ {T2 ≤ t}] = P[T1 ≤ t] · P[T2 ≤ t]

therefore:

FT(t) = FT1(t) FT2(t)

Now, let us assume:

Tj ∼ Exp(hj)


Figure 2.7: Hazard functions in the two-components example (h1 = .5, h2 = 1 and h1 = .5, h2 = 2).

Therefore, we successively obtain:

FT(t) = [1 − e^{−h1 t}] · [1 − e^{−h2 t}]

fT(t) = h1 e^{−h1 t} [1 − e^{−h2 t}] + h2 e^{−h2 t} [1 − e^{−h1 t}]
      = h1 e^{−h1 t} + h2 e^{−h2 t} − (h1 + h2) e^{−(h1+h2) t}

ST(t) = 1 − FT(t) = e^{−h1 t} + e^{−h2 t} − e^{−(h1+h2) t}

hT(t) = [h1 e^{−h1 t} + h2 e^{−h2 t} − (h1 + h2) e^{−(h1+h2) t}] / [e^{−h1 t} + e^{−h2 t} − e^{−(h1+h2) t}]
      = [h1 e^{h2 t} + h2 e^{h1 t} − (h1 + h2)] / [e^{h1 t} + e^{h2 t} − 1]

It may be checked that hT(0) = 0, hT(∞) = min(h1, h2), and that hT is unimodal, i.e. first increasing and then decreasing (see Figure 2.7).
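These limiting values are easy to confirm numerically; a sketch of the two-component hazard (the parameter values follow the first pair of Figure 2.7, h1 = .5, h2 = 1):

```python
import math

def h_parallel(t, h1, h2):
    """Hazard of T = max(T1, T2), with T1 ~ Exp(h1) independent of T2 ~ Exp(h2)."""
    S = math.exp(-h1 * t) + math.exp(-h2 * t) - math.exp(-(h1 + h2) * t)
    f = (h1 * math.exp(-h1 * t) + h2 * math.exp(-h2 * t)
         - (h1 + h2) * math.exp(-(h1 + h2) * t))
    return f / S

h1, h2 = 0.5, 1.0
```

h_parallel(0, ...) is 0 and the hazard approaches min(h1, h2) for large t; it is therefore not constant, so the exponential character of the components does not carry over to the machine.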

When h1 = h2 = h, we obtain:


hT(t) = h (e^{−ht} − e^{−2ht}) / (e^{−ht} − (1/2) e^{−2ht}) = h (e^{ht} − 1) / (e^{ht} − 1/2)

It may be checked that hT(0) = 0 and that h′T(t) > 0 for all t, i.e. hT is monotone increasing but bounded: hT(∞) = h (see Figure 2.8).

Figure 2.8: Hazard Function when h1 = h2 = .5 (missing figure)

In Chapter 6.3, we shall see that aggregation over heterogeneous individuals, each with an exponential duration, also destroys the exponential character of the aggregate duration.

2.5 Some Useful Distributions for Duration Data

Preliminary Remarks

(i) In principle, any distribution of a non-negative random variable may be used; in practice, the most useful ones are given below. Table 2.1 summarizes their main analytical properties.

(ii) As a general remark, let us note the following simple fact.

Let:

U ∼ FU such that supp(FU) = {u | fU(u) > 0} = IR,   T = e^U,   U = ln T

Then supp(FT) = IR+

Example: U ∼ N(µ, σ²) and T = e^U ⇒ T ∼ log-normal


Name                 | Survivor ST(t)       | Density fT(t)                                     | Hazard hT(t)
Exponential          | e^{−αt}              | α e^{−αt}                                         | α
Gamma                | * (1)                | α (αt)^{β−1} e^{−αt} / Γ(β)                       | *
Weibull              | exp[−(αt)^β]         | α β (αt)^{β−1} exp[−(αt)^β]                       | α β (αt)^{β−1}
Gompertz-Makeham     |                      |                                                   | γ + β e^{αt}
Compound exponential | α^β / (t + α)^β      | β α^β / (t + α)^{β+1}                             | β / (t + α)
Log-normal           | *                    | (σt)^{−1} (2π)^{−1/2} exp[−(1/(2σ²))(log t − µ)²] | * (2)
Log-logistic         | [1 + (αt)^β]^{−1}    | β α^β t^{β−1} [1 + (αt)^β]^{−2}                   | β α^β t^{β−1} [1 + (αt)^β]^{−1}

* no analytical form
(1) incomplete gamma function
(2) non-monotonic

Table 2.1: Main analytical properties of some distributions on IR+

Furthermore, we also have:

ST (t) = SU(ln t)

fT (t) = (1/t)fU(ln t)

hT (t) = (1/t)hU(ln t)

HT (t) = − lnSU(ln t) = HU(ln t)

(iii) In this section, we only consider continuous (non-defective) distributions. Thus, ST(t) = F̄T(t).


2.5.1 Exponential distribution

Definition

T ∼ Exp(α), α > 0, if and only if:

F_T(t) = 1 - e^{-αt}    S_T(t) = e^{-αt}    f_T(t) = α e^{-αt}

h_T(t) = α (no temporal dependence)    H_T(t) = αt

r_T(t) = α^{-1}  (*)

S_T(t | t0) = S_T(t)  (**)

Indeed:

(*)  ∫_t^∞ S_T(u) du = ∫_t^∞ e^{-αu} du = (1/α) ∫_{αt}^∞ e^{-v} dv = -(1/α) [e^{-v}]_{αt}^∞ = -(1/α)[0 - e^{-αt}] = e^{-αt}/α

⇒ r_T(t) = [1/S_T(t)] ∫_t^∞ S_T(u) du = e^{αt} · (1/α) e^{-αt} = 1/α

(**) S_T(t + t0) = S_T(t) · S_T(t0)

⇒ S_T(t | t0) = S_T(t) S_T(t0) / S_T(t0) = S_T(t)

A useful property is given by the next lemma.

Lemma. T ∼ Exp(h) ⇒ cov(T, ln T) = E(T)

Figure 2.8 represents those functions graphically.

Moments

E[T^r] = Γ(r + 1) / α^r

Therefore:

E(T) = r_T(0) = 1/α


Figure 2.8: Functions characterizing the Exponential Distribution (h = 0.5)

Furthermore:

V(T) = 1/α²  ⇒  σ_T = 1/α

implying:

CV(T) = σ_T / E(T) = 1

Empirical test of exponentiality

The fact that the integrated hazard function is linear in t, H_T(t) = - ln S_T(t) = αt, suggests a natural empirical test of exponentiality: examine whether the plot of (t, - ln Ŝ_T(t)) is close to a straight line through the origin (with positive slope), where Ŝ_T(t) is the empirical survivor function.
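This diagnostic can be sketched numerically (a minimal illustration, not part of the text: the seed, sample size, rate α = 0.5, and the plotting positions i/(n+1) are all illustrative assumptions). A regression through the origin of -ln Ŝ_T(t) on t should then recover the rate α:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
t = rng.exponential(scale=1 / alpha, size=20000)  # Exp(alpha) sample

# Empirical survivor function evaluated at the ordered observations;
# the plotting position i/(n+1) avoids S_hat = 0 at the largest point.
ts = np.sort(t)
n = ts.size
S_hat = 1.0 - np.arange(1, n + 1) / (n + 1)

# Under exponentiality, -ln S_T(t) = alpha * t: fit a line through the origin.
H_hat = -np.log(S_hat)
slope = np.sum(H_hat * ts) / np.sum(ts ** 2)
```

A plot of the points (ts, H_hat) hugging the line of slope α is the graphical counterpart of this computation.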

Remark

When modelling duration, the exponential distribution often plays a role similar to that of the normal distribution for real-valued random variables: a distribution of reference. Thus, when specifying a model, if an assumption of exponentiality does not appear to be suitable, one may look either for a structural explanation (in the spirit of Section 2.4.2) of the temporal dependence, or for a transformation of the time that would, for instance, render the (instantaneous) hazard function flat. In particular, the integral transform theorem (Theorem 9.2.1) provides such a transformation: the integrated hazard function H(T) of any continuous random variable has an exponential distribution with unit parameter. Conversely, several of the distributions to be examined may be obtained from simple transformations of an exponential variable. Two points should be stressed: firstly, those transformations are generally parametric, thereby introducing families richer than the exponential one; secondly, several of those families may also be viewed as embedding the exponential distribution into a larger one, one member of which is the exponential: this provides easy tests of exponentiality, although against specific alternatives.
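The unit-exponential property of the integrated hazard can be illustrated by simulation (a sketch only; the Weibull parametrization S(t) = exp(-αt^β), the parameter values and the sample size are illustrative assumptions). If T has integrated hazard H_T, then H_T(T) should behave like an Exp(1) variable:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.7

# Simulate T with S_T(t) = exp(-alpha t^beta) by inverse transform.
u = rng.uniform(size=50000)
t = (-np.log(u) / alpha) ** (1 / beta)

# Integrated hazard evaluated at T: H_T(T) = alpha T^beta.
e = alpha * t ** beta
# e should behave like an Exp(1) sample: mean and variance both close to 1.
```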

2.5.2 Gamma distribution

Definition

T ∼ Ga(α, β), α > 0, β > 0 (where α is a scale parameter and β is a shape parameter) if and only if:

f_T(t) = α(αt)^{β-1} e^{-αt} / Γ(β) = α^β t^{β-1} e^{-αt} / Γ(β)

F_T(t) = Γ(αt | β) / Γ(β) = I_β(αt)

S_T(t) = 1 - I_β(αt)

where Γ(β) and Γ(t | β) (alternatively, I_β(t)) are the complete and the incomplete gamma functions respectively:

Γ(β) = ∫_0^∞ u^{β-1} e^{-u} du

Γ(t | β) = ∫_0^t u^{β-1} e^{-u} du = Γ(β) I_β(t)

Hazard function

h_T(t) = t^{β-1} e^{-αt} / ∫_t^∞ u^{β-1} e^{-αu} du

Under the change of variable v = u - t, we obtain:

1/h_T(t) = ∫_0^∞ (1 + v/t)^{β-1} e^{-αv} dv

from which we conclude that the gamma distribution is asymptotically exponential in the following sense:

lim_{t→∞} h_T(t) = α

Furthermore:

β > 1 ⇒ h_T(t) increasing, with h_T(0) = 0
β = 1 ⇒ h_T(t) = α for all t
β < 1 ⇒ h_T(t) decreasing, with h_T(0) = ∞
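These three regimes can be checked numerically by evaluating the hazard from its defining ratio (a numpy-only sketch; the parameter values, the grid and the truncation point of the tail integral are illustrative assumptions):

```python
import numpy as np

def gamma_hazard(t_grid, alpha, beta, upper=200.0, n=200_001):
    # h_T(t) = t^{beta-1} e^{-alpha t} / \int_t^inf u^{beta-1} e^{-alpha u} du,
    # with the tail integral approximated by the trapezoidal rule on [t, upper].
    u = np.linspace(t_grid.min(), upper, n)
    g = u ** (beta - 1) * np.exp(-alpha * u)
    du = u[1] - u[0]
    seg = (g[:-1] + g[1:]) * du / 2                       # panel integrals
    tail = np.concatenate((np.cumsum(seg[::-1])[::-1], [0.0]))
    f = t_grid ** (beta - 1) * np.exp(-alpha * t_grid)
    return f / np.interp(t_grid, u, tail)

t = np.linspace(0.5, 20, 50)
h_inc = gamma_hazard(t, alpha=1.0, beta=2.0)  # beta > 1: increases towards alpha
h_dec = gamma_hazard(t, alpha=1.0, beta=0.5)  # beta < 1: decreases towards alpha
```

For β = 2 and α = 1 the hazard is available in closed form, h_T(t) = t/(t + 1), which provides a direct check of the numerical tail integration.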

Moments and parametrization

E[T^r] = Γ(r + β) / [Γ(β) α^r]

E(T) = β/α    V(T) = β/α²

Furthermore:

CV(T) = (√β / α) · (α / β) = 1/√β

Remarks

(i) T ∼ Exp(α) ⇐⇒ T ∼ Ga(α, 1).

(ii) Clearly, α is a scale parameter. As may be seen from Figure 2.9, for a given value of E(T), the parameter β may be adjusted so as to control the shape of the distribution.

(iii) The particular case where β is a positive integer, β = k ∈ IN+, is known as an "Erlang distribution" and may be obtained as the distribution of a sum of k independent exponentially distributed random variables with common parameter α.


Figure 2.9: Functions characteristic of the Gamma Distribution (α = 1; β = 0.5, 1, 2): panels for S_T(t), f_T(t) and h_T(t), with curves for β < 1, β = 1 and β > 1 (missing figure)


Practically

The distribution Ga(α, β):
(i) is more flexible than the exponential distribution;
(ii) is asymptotically exponential as far as the hazard function is concerned;
(iii) always has a monotone (increasing or decreasing) hazard function;
(iv) is not too convenient because of the incomplete gamma function.

2.5.3 Weibull distribution

Definition

Two slightly different parametrizations are frequently met in the literature.

First Parametrization

T ∼ W*(α*, β) ⇔ T^β ∼ Exp(α*^β), α*, β > 0

Equivalently:

T0 ∼ Exp(α*^β), T = T0^{1/β}, T0 = T^β

Therefore:

f_T(t) = α* β (α* t)^{β-1} e^{-(α* t)^β}

S_T(t) = exp[-(α* t)^β]

h_T(t) = α* β (α* t)^{β-1}

H_T(t) = (α* t)^β

In this parametrization, α* is a scale parameter.

Second Parametrization

Let us reparametrize the Weibull distribution as follows:

α = α*^β, i.e. α* = α^{1/β}

By so doing, we obtain:

T ∼ W(α, β) ⇔ T^β ∼ Exp(α), α, β > 0

f_T(t) = α β t^{β-1} e^{-α t^β}

S_T(t) = exp(-α t^β)

h_T(t) = α β t^{β-1}

H_T(t) = α t^β


Note that: T ∼ Exp(α) ⇔ T ∼ W(α, 1)

Moments

E[T^r] = Γ(r/β + 1) / α*^r = Γ(r/β + 1) / α^{r/β}

E[T] = Γ(1/β + 1) / α* = Γ(1/β + 1) / α^{1/β}

V[T] = [Γ(2/β + 1) - Γ(1/β + 1)²] / α*² = [Γ(2/β + 1) - Γ(1/β + 1)²] / α^{2/β}

Behavior of the hazard rate

We shall analyze the behavior of the hazard rate in the second parametrization only and consider its first two derivatives:

(d/dt) h_T(t) = α β (β - 1) t^{β-2}

Therefore:

β < 1 ⇒ h'_T(t) < 0
β > 1 ⇒ h'_T(t) > 0

Next, we have:

(d²/dt²) h_T(t) = α β (β - 1)(β - 2) t^{β-3}

Therefore:

0 < β < 1 ⇒ h''_T(t) > 0
1 < β < 2 ⇒ h''_T(t) < 0
β > 2     ⇒ h''_T(t) > 0

Figure 2.10 shows some typical functions characterizing the Weibull case.

Empirical test for the Weibull distribution

Notice that:

ln H_T(t) = ln[- ln S_T(t)] = ln α + β ln t

This is a linear function of ln t. Thus, an empirical test for the Weibull distribution is to check whether the plot of (ln t, ln[- ln Ŝ_T(t)]) gives approximately a straight line.
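This Weibull plot can be sketched by simulation (the seed, sample size, parameter values, and plotting positions are illustrative assumptions): the fitted slope should recover β and the intercept ln α.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 1.5, 0.8

# Simulate W(alpha, beta) by inverse transform: S_T(t) = exp(-alpha t^beta).
u = rng.uniform(size=20000)
t = np.sort((-np.log(u) / alpha) ** (1 / beta))
S_hat = 1.0 - np.arange(1, t.size + 1) / (t.size + 1)  # empirical survivor

# ln[-ln S_T(t)] = ln(alpha) + beta ln(t): fit a straight line in ln t.
slope, intercept = np.polyfit(np.log(t), np.log(-np.log(S_hat)), 1)
```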


Practically

The Weibull distribution:

(i) displays simple analytical forms of f_T(t), S_T(t), and h_T(t), but is not a member of the exponential family when β is unknown;
(ii) has some flexibility in the form of h_T(t), which is, nevertheless, always monotone;
(iii) is not asymptotically exponential.

2.5.4 Gompertz-Makeham distribution

This distribution is largely used in demography and in actuarial sciences. It may be viewed as an example of a distribution where the hazard function is specified in such a way that the corresponding integrated hazard function, i.e. its integral on the interval [0, t], is analytically convenient; the survivor and the density functions are then derived from it. Therefore, the parameters, first introduced in the hazard function, may be directly interpreted in terms of the properties they induce on the hazard function. In this case, the hazard function is an exponential function of time.

Definition

T ∼ GM(α, β, γ), α ∈ IR, (β, γ) ∈ IR²_+, β + γ > 0

if and only if:

h_T(t) = γ + β e^{αt}

H_T(t) = γt + β α^{-1} (e^{αt} - 1)

S_T(t) = exp{-[γt + β α^{-1} (e^{αt} - 1)]}

f_T(t) = (γ + β e^{αt}) exp{-[γt + β α^{-1} (e^{αt} - 1)]}


Figure 2.10: Hazard functions of the Weibull Distribution (α = 1; β = 0.5, 1.5, 3): panels for S_T(t), f_T(t) and h_T(t), with curves for β < 1, β = 1, 1 < β < 2, β = 2 and β > 2 (missing figure)


The pattern of the hazard function, as displayed in Figure 2.12, is obtained from the following characteristics:

(d/dt) h_T(t) = α β e^{αt}

(d²/dt²) h_T(t) = α² β e^{αt}

h_T(0) = γ + β

h_T(∞) = ∞ if α > 0
       = γ + β if α = 0
       = γ if α < 0

Thus, the hazard function is always monotone (indeed, α · (d/dt) h_T(t) ≥ 0, so that h_T increases or decreases according to the sign of α) and convex (indeed, (d²/dt²) h_T(t) ≥ 0).

(Figure 2.12: missing)Figure 2.12. Hazard Functions of the Gompertz-Makeham Distribution

Remarks

(i) Note the following identification problems at the frontier of the parameter space (of the hazard function):

T ∼ GM(0, β, γ) if and only if T ∼ Exp(β + γ)

T ∼ GM(α, 0, γ) if and only if T ∼ Exp(γ)

(ii) The Gompertz (1779-1865) distribution corresponds to the case γ = 0. Makeham (1860) added the constant term. When α > 0, the Gompertz-Makeham distribution allows one to describe situations where the hazard rate increases rapidly with t, as is the case for the length of human life.

(iii) Note also that the parameter β > 0 may be reparametrized into a real-valued parameter: λ = ln β, i.e. β = e^λ (see, for instance, the program SURVCALC).

(iv) Under the heading of the Gompertz distribution, some works also use the specification:

h_T(t) = e^{λ + γt}, (λ, γ) ∈ IR²
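The closed form of the integrated hazard given above can be checked mechanically (a sketch; the parameter values and the evaluation point are illustrative assumptions) by integrating the hazard numerically:

```python
import numpy as np

alpha, beta, gamma = 0.8, 0.2, 0.1   # GM(alpha, beta, gamma), illustrative values

def H_closed(t):
    # H_T(t) = gamma t + (beta / alpha) (e^{alpha t} - 1)
    return gamma * t + (beta / alpha) * (np.exp(alpha * t) - 1)

# Compare with a trapezoidal integration of h_T(u) = gamma + beta e^{alpha u}.
t = 2.5
u = np.linspace(0.0, t, 100_001)
h = gamma + beta * np.exp(alpha * u)
H_num = np.sum((h[:-1] + h[1:]) / 2) * (u[1] - u[0])
```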


2.5.5 Log-Normal distribution

Motivation:

One interest of this specification is that it provides a unimodal (and hence non-monotonic) hazard function.

Definition

T ∼ LN(µ, σ) ⇐⇒ ln T ∼ N(µ, σ)

⇔ f_T(t) = [1/(σt√(2π))] exp{-(1/(2σ²))(ln t - µ)²} = (1/(σt)) ϕ((ln t - µ)/σ)

S_T(t) = 1 - Φ((ln t - µ)/σ)

h_T(t) = (1/(σt)) · ϕ((ln t - µ)/σ) / [1 - Φ((ln t - µ)/σ)]

where the last ratio is the inverse of Mill's ratio, and ϕ and Φ denote the density function and the distribution function of a standardized normally distributed random variable, respectively.

Note that:

lim_{t→∞} h_T(t) = 0    lim_{t→0} h_T(t) = 0

This implies that the log-normal distribution always has a unimodal, and therefore non-monotonic, hazard function, as shown in Figure 2.11.

Moments

E[T^r] = exp{µr + r²σ²/2}

E[T] = exp{µ + σ²/2}


Figure 2.11: Hazard functions of the Log-Normal Distribution (curves for σ = 1 and σ < 1)

V[T] = e^{2µ + σ²} [e^{σ²} - 1]

2.5.6 Log-Logistic distribution

Definition

Let us first recall the density (ψ) and the distribution (Ψ) functions of the standard logistic distribution:

X ∼ L(0, 1)

if and only if:

ψ(x) = e^{-x} / (1 + e^{-x})² = Ψ(x)[1 - Ψ(x)]

Ψ(x) = 1 / (1 + e^{-x}) = e^x / (1 + e^x)

1 - Ψ(x) = e^{-x} / (1 + e^{-x}) = 1 / (1 + e^x)

Taking advantage of the stability of the logistic distribution under linear transformations, the log-logistic distribution is defined similarly to the log-normal distribution, namely:

T ∼ LL(µ, σ) ⇔ (ln T - µ)/σ ∼ L(0, 1)

Alternatively:

f_T(t) = (1/(σt)) ψ((ln t - µ)/σ) = (1/(σt)) · exp[-(ln t - µ)/σ] / {1 + exp[-(ln t - µ)/σ]}²

S_T(t) = 1 - Ψ((ln t - µ)/σ) = 1 / {1 + exp[(ln t - µ)/σ]} = exp[-(ln t - µ)/σ] / {1 + exp[-(ln t - µ)/σ]}

h_T(t) = (1/(σt)) · ψ((ln t - µ)/σ) / [1 - Ψ((ln t - µ)/σ)] = (1/(σt)) Ψ((ln t - µ)/σ) = (1/(σt)) · 1 / {1 + exp[-(ln t - µ)/σ]}

It is sometimes useful to reparametrize the log-logistic distribution as follows:

α = e^{-µ/σ}, β = 1/σ, i.e. σ = 1/β, µ = -(ln α)/β

We then obtain:

f_T(t) = β α t^{β-1} / (α t^β + 1)²

S_T(t) = 1 / (α t^β + 1)

h_T(t) = β α t^{β-1} / (α t^β + 1) = α β / (α t + t^{1-β})

H_T(t) = ln(α t^β + 1)
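The internal consistency of these reparametrized forms is easily verified numerically (a sketch; the parameter values and grid are illustrative): h_T = f_T / S_T and S_T = exp(-H_T) must hold identically.

```python
import numpy as np

alpha, beta = 1.0, 1.5
t = np.linspace(0.1, 10, 200)

f = beta * alpha * t ** (beta - 1) / (alpha * t ** beta + 1) ** 2   # density
S = 1.0 / (alpha * t ** beta + 1)                                   # survivor
h = beta * alpha * t ** (beta - 1) / (alpha * t ** beta + 1)        # hazard
H = np.log(alpha * t ** beta + 1)                                   # integrated hazard
```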

Behavior of the hazard rate

We notice that, for any value of α or of β, h_T(∞-) = 0. Furthermore:

h'_T(t) = α β [(β - 1) t^{-β} - α] / [α t + t^{1-β}]²

If β = 1:

h'_T(t) < 0


Figure 2.12: Functions characteristic of the Log-Logistic Distribution (α = 1; β = 0.5, 1.5, 3) (missing figure)


h_T(0) = α, h_T(∞-) = 0

h_T(t) = α / (α t + 1): convex, monotonically decreasing to 0.

If β > 1:

h_T(0) = 0, argsup h_T(t) = [(β - 1)/α]^{1/β} > 0

If β < 1:

h'_T(t) < 0, h_T(0+) = ∞, h_T(∞-) = 0: monotonically decreasing from ∞ to 0.

Remark

It is sometimes useful to further reparametrize α into α* = e^{-µ} = α^{1/β}, i.e. µ = - ln α*, making α* a scale parameter, as in the first parametrization of the Weibull distribution. Figure 2.12 gives typical forms of the hazard function. It may be noticed that these hazard functions are not always monotonic: for β ≤ 1 the hazard decreases monotonically, whereas for β > 1 it is unimodal.

2.5.7 Inverse Gaussian distribution

Motivation

Definition

(Figure 2.16: missing)
Figure 2.16: Hazard Functions of the Inverse-Gaussian Distribution

2.5.8 Piecewise constant hazard rates

Let us consider a finite partition of the positive real line, defined by means of k fixed endpoints, 0 = a_0 < a_1 < · · · < a_j < · · · < a_k < a_{k+1} = ∞, and assume the hazard rate to be constant inside each of the so-defined intervals:

h_T(t) = Σ_{1≤j≤k+1} α_j 1I_{(a_{j-1}, a_j]}(t)    (2.7)

For obvious reasons, the hazard rate (2.7) has also been called "piecewise exponential". The integrated hazard function is therefore:

H_T(t) = ∫_0^t h_T(u) du = Σ_i [α_i (a_i - a_{i-1}) 1I{a_i ≤ t} + α_{i+1} (t - a_i) 1I{a_i < t ≤ a_{i+1}}]    (2.8)

and the survivor function:

S_T(t) = exp{-H_T(t)}
       = exp{-α_{i+1} (t - a_i) 1I{a_i < t ≤ a_{i+1}}} × Π_i exp{-α_i (a_i - a_{i-1}) 1I{a_i ≤ t}}    (2.9)

Note that:

S_T(a_k + ∆) = exp{-α_{k+1} ∆} Π_{1≤i≤k} exp{-α_i (a_i - a_{i-1})}
             → 0 (when ∆ → ∞) only if α_{k+1} > 0    (2.10)

In other words, the piecewise-constant hazard rate specification provides a non-defective distribution only if α_{k+1} > 0.

This is a flexible specification for a continuous distribution function, as it introduces an arbitrary number k + 1 of parameters α_i. Making k increase with the sample size, i.e. letting k become k(n), is a way of introducing non-parametric methods by means of an increasing parametrization. In some cases, one can also consider 2k + 1 parameters [(a_i, α_i) : i = 1, · · · , k; α_{k+1}]; in such a case, the points of discontinuity of h_T(t) also become unknown parameters.
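A direct implementation of (2.7)-(2.9) may be sketched as follows (the cutpoints and rates below are illustrative assumptions):

```python
import numpy as np

def piecewise_H(t, cuts, rates):
    # Integrated hazard for a piecewise-constant rate:
    # cuts  = [a_1, ..., a_k] with 0 = a_0 < a_1 < ... < a_k < a_{k+1} = inf,
    # rates = [alpha_1, ..., alpha_{k+1}], alpha_j applying on (a_{j-1}, a_j].
    edges = [0.0] + list(cuts) + [np.inf]
    H = 0.0
    for j in range(len(rates)):
        # time spent by [0, t] inside the j-th interval
        H += rates[j] * max(0.0, min(t, edges[j + 1]) - edges[j])
    return H

# Rates 0.5 on (0, 1], 1.0 on (1, 2], 0.2 beyond 2:
H = piecewise_H(1.5, cuts=[1.0, 2.0], rates=[0.5, 1.0, 0.2])
S = np.exp(-H)   # survivor S_T(1.5) = exp(-H_T(1.5))
```

Here H_T(1.5) = 0.5 · 1 + 1.0 · 0.5 = 1.0, so S_T(1.5) = e^{-1}.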

2.6 Increasing Failure Rate

Up to now we have analysed the hazard functions of given (and well-known) distributions. An alternative modelling strategy was suggested when introducing the Gompertz-Makeham distribution, namely to start by modelling the hazard function and to deduce the form of the distribution and of the density functions. This approach is widely used in reliability theory, which is mainly interested in the failure mechanism of some equipment. The simplest, and widely used, property is the monotonicity of the hazard rate, and more particularly the case of a monotonically increasing hazard rate, usually referred to as the property of Increasing Failure Rate (IFR).

The simplest case of an IFR is the linear one:

h_T(t) = at + b, a, b > 0,

which implies:

S_T(t) = exp[-((1/2) a t² + b t)]

and which may be obtained as T = min{X_1, X_2} with X_1 ⊥⊥ X_2, X_1 ∼ Exp(b) and X_2 ∼ W(a/2, 2).
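This minimum construction rests on the fact that hazards of independent minima add; it can be checked by simulation (a sketch; seed, parameters, sample size and evaluation point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 1.0, 0.5
n = 200_000

x1 = rng.exponential(scale=1 / b, size=n)     # constant hazard b
u = rng.uniform(size=n)
x2 = np.sqrt(-np.log(u) / (a / 2))            # W(a/2, 2): hazard a t
t = np.minimum(x1, x2)                        # hazard a t + b

t0 = 1.2
S_emp = np.mean(t > t0)
S_theory = np.exp(-(0.5 * a * t0 ** 2 + b * t0))
```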

Another simple, and interesting, case is:

h_T(t) = ln(at + b), a > 0, b ≥ 1

in which case we obtain:

S_T(t) = e^t b^{b/a} (at + b)^{-(at+b)/a}

This case implies a decreasing rate of increase, namely:

(d/dt) h_T(t) = a / (at + b)

This section provides another argument for insisting that modelling a simple point process is often more naturally based on properties of the hazard rate than on properties of the density or of the distribution function of the duration. We shall come back to this idea in Chapter 5.

2.7 Derived Distributions

2.7.1 Basic ideas

(i) From any of the distributions presented above, one may "derive" other distributions, either by transforming the random variable (here, the time, or duration) or by transforming one of the functions characterizing those distributions. As the only requirements for a function to be used as a hazard function are that it be non-negative with a diverging integral, and because the hazard function is often structurally meaningful, one natural way of transforming a distribution into another one is to transform its hazard function and to derive therefrom the other characteristics of the transformed distribution.

(ii) In general, a so-called "baseline" duration T_0 is taken as a starting point, with a distribution characterized by a parameter α, i.e. by the functions S_0(t | α), f_0(t | α), h_0(t | α) or H_0(t | α). The derived distribution introduces a further parameter β and becomes the distribution of the duration T, characterized by the parameter θ = (α, β), i.e. by the functions S_T(t | θ), f_T(t | θ), h_T(t | θ) and H_T(t | θ). In many applications, the baseline duration T_0 may be interpreted as the duration for individuals who have not been subjected to a "treatment", while T is interpreted as the duration for those who have been. In this case, the parameter β describes the action of the "treatment". In the next chapter, the "treatment" is typically the effect of exogenous variables.

(iii) In some cases, the baseline distribution is completely known, making α a known constant (possibly a value associated with a "representative" individual), whereas in other cases α represents an unknown parameter.

2.7.2 Transformation of the Hazard Function: Proportional Hazard Function

The simplest transformation of a non-negative function is the homothetictransformation. This produces the proportional hazard model, the hazardfunction of which is constructed as follows.

h_T(t | θ) = β h_0(t | α)

where h_0(t | α) is a "baseline" hazard function with parameter α, and θ = (α, β) is the new parameter, with β > 0.

Equivalently, the proportional hazard model may also be characterized, in terms of the baseline distribution, as follows:

H_T(t | θ) = ∫_0^t h_T(u | θ) du = β H_0(t | α)

S_T(t | θ) = e^{-β H_0(t | α)} = [S_0(t | α)]^β

f_T(t | θ) = h_T(t | θ) S_T(t | θ)
           = β h_0(t | α) [S_0(t | α)]^β
           = β f_0(t | α) [S_0(t | α)]^{β-1}

Meaning: if h_0(t | α) represents the hazard function for the individuals "without treatment", h_T(t | θ) specifies that the effect of the "treatment" is to multiply h_0(t | α) by some (unknown) constant β. Whether β < 1 or β > 1 determines whether the treatment decreases or increases the hazard rate.

2.7.3 Transformation of the Time

c1) A basic lemma

Let t → k(t, β) = k_β(t) be strictly increasing, i.e. (∂/∂t) k(t, β) > 0 for all β, and let T_0 be a "baseline" time with:

T = k_β(T_0), T_0 = k_β^{-1}(T)


Then:

S_T(t | θ) = S_0(k_β^{-1}(t) | α)

f_T(t | θ) = [(d/dt) k_β^{-1}(t)] f_0(k_β^{-1}(t) | α)

h_T(t | θ) = [(d/dt) k_β^{-1}(t)] f_0(k_β^{-1}(t) | α) S_0(k_β^{-1}(t) | α)^{-1}
           = [(d/dt) k_β^{-1}(t)] h_0(k_β^{-1}(t) | α)

H_T(t | θ) = H_0(k_β^{-1}(t) | α)

Remark

In this lemma, "strictly increasing" may be replaced by "strictly monotone", provided the derivative with respect to t is taken in absolute value.

c2) Exponential baseline time

In the particular case where the baseline time is distributed exponentially, i.e.:

h_0(t | α) = α

we have:

h_T(t | θ) = α [(d/dt) k_β^{-1}(t)]

Hence, the behavior of h_T(t | θ) is completely determined by the behavior of k(t, β); in particular:

[(d/dt) h_T(t | θ)] · [(d²/dt²) k_β^{-1}(t)] > 0

Therefore, h_T(t | θ) is increasing when k_β^{-1}(t) is convex and decreasing when k_β^{-1}(t) is concave. This feature allows one to construct, in principle, any arbitrary hazard function.

c3) Translation transformation

The simplest transformation is the translation:

T = T_0 + β, T_0 = T - β, where β > 0


This transformation converts a distribution on the positive real line into a distribution on the interval [β, ∞), the support of which now depends on the parameter β. Note that, in some cases, a negative value of β may also be contemplated. The transformed distribution can be characterized as follows:

S_T(t | θ) = S_0(t - β | α)

f_T(t | θ) = f_0(t - β | α)

h_T(t | θ) = f_0(t - β | α) S_0(t - β | α)^{-1} = h_0(t - β | α)

H_T(t | θ) = H_0(t - β | α)

An example is the Generalized (or translated) Weibull distribution:

T ∼ GW(α, β, γ) ⇔ T = T_0 + γ, T_0 ∼ W(α, β)

Thus:

S_T(t) = 1, t ≤ γ
       = exp[-α(t - γ)^β], t ≥ γ

c4) Scale transformation: accelerated time

Let us consider the homothetic transformation:

T = β^{-1} T_0, T_0 = β T, where β > 0

The transformed distribution can be characterized as follows:

S_T(t | θ) = S_0(βt | α)

f_T(t | θ) = β f_0(βt | α)

h_T(t | θ) = β h_0(βt | α)

H_T(t | θ) = H_0(βt | α)

This transformation is also called an "accelerated time" model: the homothetic transformation may indeed be viewed as a change of scale of the time variable. If β > 1, the "treatment" effect consists of "accelerating" time, whereas if β < 1, the "treatment" effect consists of "decelerating" time.


c5) Power transformation

Let us consider the power transformation:

T = T_0^{1/β}, T_0 = T^β, where β > 0

The transformed distribution can be characterized as follows:

S_T(t | θ) = S_0(t^β | α)

f_T(t | θ) = β t^{β-1} f_0(t^β | α)

h_T(t | θ) = β t^{β-1} h_0(t^β | α)

H_T(t | θ) = H_0(t^β | α)

As seen earlier, if the distribution of T_0 is exponential, the distribution of T is Weibull: this is an example of the use of the power transformation.
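The exponential-to-Weibull example can be sketched by simulation (seed, sample size, parameter values, and evaluation point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 0.7

t0 = rng.exponential(scale=1 / alpha, size=100_000)  # baseline T_0 ~ Exp(alpha)
t = t0 ** (1 / beta)                                 # power transform T = T_0^{1/beta}

# S_T(t) = S_0(t^beta) = exp(-alpha t^beta): the Weibull W(alpha, beta).
x = 0.8
S_emp = np.mean(t > x)
S_theory = np.exp(-alpha * x ** beta)
```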

c6) Log transformation

Let us consider the log transformation:

T = ln T_0, T_0 = e^T

The transformed distribution can be characterized as follows:

S_T(t | α) = S_0(e^t | α)

f_T(t | α) = e^t f_0(e^t | α)

h_T(t | α) = e^t h_0(e^t | α)

H_T(t | α) = H_0(e^t | α)

Example: exponential baseline distribution. Let:

T_0 ∼ Exp(α): f_0(t | α) = α e^{-αt}, S_0(t | α) = e^{-αt}, h_0(t | α) = α

Then the transformed time follows a double exponential distribution:

f_T(t | α) = α e^{t - α e^t}

S_T(t | α) = e^{-α e^t}

h_T(t | α) = α e^t


2.8 Conditional models for simple point process

2.8.1 Introduction

a) Two levels of analysis

(i) descriptive (or exploratory) data analysis: covariates may be used to take observable factors of heterogeneity into account by carrying out separate analyses.

Example 1
Let us consider a sample of 100 students and plot the histogram of their weight. Let:

Y : weight
Z : sex, i.e. Z = 0 for female, Z = 1 for male

It seems rather natural to partition the sample into two subsamples: one for the female students (Z = 0) and another for the male students (Z = 1). By so doing, we obtain two "simpler" histograms; in other words, the factor sex "explains" part of the complexity of the observed phenomenon.

Example 2
Let us consider a sample of 100 workers and measure the following variables:

T  : duration of unemployment
Z1 : sex (2 levels)
Z2 : level of education (3 levels)
Z3 : sector of activity (5 levels)

Total: 2 · 3 · 5 = 30 levels

This would suggest 30 separate analyses, which is clearly unsuitable with only 100 observations: the way those Z-variables influence the distribution of T should therefore be parametrized.

(ii) structural modelling: the parameter of interest may be such that the (marginal) process generating some covariates is noninformative for the parameter of interest which, at the same time, is a function of a parameter sufficient to parametrize the process conditional on those covariates. Those covariates are then called "exogenous variables" and will generally be denoted by "Z", whereas the other variables are called "endogenous", because the model describes the way they are generated, and are denoted by "Y" (or "T", in the case of a duration variable). For more details, see Section 9.4.1 on exogeneity. In such a case, it is admissible to specify only the process conditional on those exogenous variables, leaving the marginal process generating them virtually unspecified. In other words, for the parameter of interest, p(t | z, θ) is as informative as p(t, z | θ). According to a general principle of parsimony, the conditional model is therefore preferred.

b) Two simple ideas

(i) In general, a natural way of specifying conditional models is to make the parameters of a (marginal) distribution depend on the conditioning variable. Thus, in F_T(t | θ), one would transform θ into g(z, θ), where g is a known function. For example, Y ∼ N(µ, σ²) could be transformed into (Y | Z) ∼ N(α + βZ, σ²). Similarly, T ∼ Exp(θ) could be transformed into (T | Z) ∼ Exp[g(Z, θ)] where, e.g., g(Z, θ) = exp(-Z'θ).

(ii) In modelling individual data (and, in particular, duration data), a frequently used strategy consists of starting with a so-called "baseline" distribution for an individual of reference, i.e. either a "non-treatment" individual (e.g. an individual for which Z = 0) or a "representative" individual (e.g. an individual for which Z = E(Z)), and thereafter modelling, in the spirit of Section 2.7, what makes the other individuals different from that individual of reference. Typical examples are:

proportional hazard: the effect of the regressors (Z) is to multiply the baseline hazard function by a scale factor;

accelerated life: the effect of the regressors (Z) is to rescale the time variable.

c) Distinguish time-varying covariates from time-constant covariates

The covariates may represent:

- environmental characteristics, such as macro-economic variables;
- individual characteristics, such as sex, education, age, and so on;
- "treatment" variables, such as the presence or absence of social benefits.

Some variables may also represent the interaction effect of two covariates. In the sequel, it will prove important to distinguish the covariates that are time-invariant from those that are not.


d) Interpretation of the parameters

Most models for duration data take the hazard function, rather than the regression function (as is usual in many econometric models), as the object of interest. Because a hazard function is necessarily non-negative, it is typically non-linear in the exogenous variables. Therefore, the partial derivatives of interest are not constant but are functions of the values of the covariates. This feature clearly makes interpreting the coefficients more difficult.

e) Parameters of interest and semi-parametric modelling

Let us reconsider the approach of "derived" distributions of Section 2.7, where a baseline hazard function h_0(t | α) is modified into h_T(t | θ), with θ an enlarged parameter: θ = (α, β). In conditional models, h_0(t | α) is often interpreted as the hazard function of the duration for a "representative" individual, and it is therefore natural to modify β into g(z, β), which now models the effect of the exogenous variables Z on the baseline distribution of the duration. In such a case, β is an economically meaningful parameter, of finite dimension and of interest, whereas α is actually a nuisance parameter about which economic theory has not much to say. This may cause specification errors on h_0(t | α), as well as a scarcity of prior information on α.

In such a case, modelling often relies on one of the following two extreme structures:
(i) h_0(t | α) is specified in the simplest way, such as h_0(t | α) = h_0(t), i.e. completely known, or h_0(t | α) = α, i.e. the baseline distribution is exponential and therefore depends on only one unknown parameter;
(ii) h_0(t | α) is specified in the most general way: h_0(t | α) = α(t), where α is now a functional parameter (i.e. α is a non-negative function whose integral on the positive real line diverges). This is a "semi-parametric model" with parameter θ = (α, β), where α takes its value in a functional space whereas β takes its value in a (finite-dimensional) Euclidean space. This approach is particularly attractive in situations where economic theory does not give much information on the structure of h_0(t | α).

2.8.2 Proportional hazard Model

a) Definition

In the proportional hazard model, the effect of the exogenous variable is specified as multiplying a baseline hazard function by a function that depends on the exogenous variable only; it is accordingly defined as follows:

hT(t | z, θ) = h0(t | α) g(z, β),   θ = (α, β)

where h0(t | α) is a baseline hazard function and g is a known non-negative function. Let us define the other characteristics of the baseline distribution along with their usual relationships:

h0(t | α) baseline hazard function

S0(t | α) baseline survivor function

f0(t | α) baseline density function

H0(t | α) baseline integrated hazard function

where, as usual in the continuous case:

h0(t | α) = f0(t | α) / S0(t | α)

H0(t | α) = − ln S0(t | α) = ∫_0^t h0(u | α) du

Therefore, the proportional hazard model is equivalently characterized as follows:

HT(t | z, θ) = g(z, β) ∫_0^t h0(u | α) du = g(z, β) H0(t | α)

ST(t | z, θ) = exp{ −g(z, β) ∫_0^t h0(u | α) du }
            = exp{ −g(z, β) H0(t | α) }
            = [S0(t | α)]^{g(z,β)}

fT(t | z, θ) = hT(t | z, θ) · ST(t | z, θ) in general
            = g(z, β) · h0(t | α) [S0(t | α)]^{g(z,β)}
            = g(z, β) · f0(t | α) [S0(t | α)]^{g(z,β)−1}
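The relationships above can be checked numerically. The following sketch uses an illustrative Weibull baseline and a log-linear g (both are assumptions for the example, not taken from the text) and verifies that ST = S0^g and fT = hT · ST = g · f0 · S0^{g−1}:

```python
import math

# Illustrative baseline: Weibull with H0(t) = lam * t**gam (assumption)
lam, gam = 0.5, 1.3

def H0(t): return lam * t**gam                 # baseline integrated hazard
def h0(t): return lam * gam * t**(gam - 1.0)   # baseline hazard
def S0(t): return math.exp(-H0(t))             # baseline survivor

def g(z, beta): return math.exp(z * beta)      # log-linear effect (scalar z)

# Proportional hazard model characteristics
def h_T(t, z, beta): return h0(t) * g(z, beta)
def H_T(t, z, beta): return g(z, beta) * H0(t)
def S_T(t, z, beta): return math.exp(-H_T(t, z, beta))
def f_T(t, z, beta): return h_T(t, z, beta) * S_T(t, z, beta)

t, z, beta = 1.7, 0.8, -0.4
# S_T equals the baseline survivor raised to the power g(z, beta)
assert abs(S_T(t, z, beta) - S0(t)**g(z, beta)) < 1e-12
# f_T equals g * f0 * S0**(g - 1), with f0 = h0 * S0
f0 = h0(t) * S0(t)
assert abs(f_T(t, z, beta) - g(z, beta) * f0 * S0(t)**(g(z, beta) - 1.0)) < 1e-12
```

Any other baseline with a valid integrated hazard would pass the same checks, since the identities are algebraic.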


Notice that the density of a proportional hazard model admits the following factorization:

fT(t | z, θ) = f1(α) f2(β) f3(α, β)

Particular case

When T0 ∼ Exp(α):

hT(t | z, θ) = α g(z, β)

HT(t | z, θ) = α t g(z, β)

ST(t | z, θ) = exp{−α t g(z, β)}

fT(t | z, θ) = α g(z, β) exp{−α t g(z, β)}

b) Identification

The problem of identifying separately the functions g and h0 arises from the simple remark that for any k > 0: g · h0 = (g k) · (k^{−1} h0). A rather natural solution consists of defining an individual of reference, i.e. a particular value z0 of Z, for which g(z0, β) = 1 for any value of β and, consequently, hT(t | z0, θ) = h0(t | α).

Two typical normalizations are:

(i) g(0, β) = 1 when Z = 0 has meaning

(ii) g(mγ, β) = 1, where mγ = E(Z | γ) is the mathematical expectation in the exogenous process generating (Z | γ): in this case, h0(t | α) is the hazard function of the "average individual".

c) A remark:

In the proportional hazard model, the effect of the exogenous variable z may be evaluated as follows:

∂/∂z hT(t | z, θ) = h0(t | α) · ∂/∂z g(z, β)

and therefore, for any value of z, the sign of ∂/∂z hT is the same as that of ∂/∂z g. Furthermore:

∂/∂z ln hT(t | z, θ) = ∂/∂z ln g(z, β)

depends on z and β only and is therefore independent of t; furthermore, the log-derivative of hT does not depend on the baseline distribution.


d) A particular case: Cox model

Definition

The function g(z, β) should clearly be non-negative. An easy way to obtain that property without restriction on β is the log-linear specification, viz.:

g(z, β) = exp(z′β),   β ∈ IR^k   (i.e. no non-negativity restriction)

This implies:

hT(t | z, θ) = h0(t | α) exp(z′β)

HT(t | z, θ) = H0(t | α) exp(z′β)

ST(t | z, θ) = exp{−H0(t | α) exp(z′β)} = [S0(t | α)]^{exp(z′β)}

fT(t | z, θ) = h0(t | α) exp[z′β − H0(t | α) exp(z′β)]

Interesting properties of the Cox model include the following:

∂/∂z ln hT(t | z, θ) = ∂/∂z ln g(z, β) = β

i.e. a constant proportional effect of z on the instantaneous conditional probability of leaving state E0.

Regression representation of the Cox model

Let us define:

εt = ln HT(t | z, θ) = ln H0(t | α) + z′β

In view of Theorem 9.2.1, the (conditional) distribution of εt is a standard double-exponential one:

P[εt > a | z, θ] = exp{−exp(a)},

independently of z or of θ, with expectation E[εt | z, θ] = −γ ≈ −0.5772, where γ is Euler's constant. We may now write:

ln H0(t | α) = − z′β + εt

This is a non-linear regression unless α is known, with a fixed (non-normal) residual distribution. When the baseline distribution is exponential, so that H0(t | α) = α t, we obtain:

ln t = − ln α − z′β + εt, (2.11)
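A quick simulation can illustrate the regression representation (2.11). Since HT(T | z, θ) is a unit exponential variable, εt = ln HT(T | z, θ) has mean −γ. The parameter values below are illustrative assumptions, not taken from the text:

```python
import math, random

random.seed(0)
alpha, beta, z = 2.0, 0.7, 1.0   # illustrative values (assumption)
gamma_euler = 0.5772156649       # Euler's constant

n = 200_000
logs = []
for _ in range(n):
    e = random.expovariate(1.0)            # E ~ Exp(1), so H_T(T) = E
    t = e / (alpha * math.exp(z * beta))   # exponential-baseline Cox duration
    logs.append(math.log(t))

mean_log_t = sum(logs) / n
# ln t = -ln(alpha) - z'beta + eps_t with E[eps_t] = -gamma
expected = -math.log(alpha) - z * beta - gamma_euler
assert abs(mean_log_t - expected) < 0.02
```

The check confirms that a least-squares fit of ln t on z would estimate −β up to the Gumbel-distributed (hence non-normal) residual.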


now a linear regression model with a fixed (non-normal) residual distribution.

Remark

Some authors give a slightly different regression representation of this model, namely:

ε∗t = − ln HT(t | z, θ) = − ln H0(t | α) − z′β

in which case:

− ln H0(t | α) = z′β + ε∗t

and

P[ε∗t ≤ a | z, θ] = exp{−exp(−a)}.

2.8.3 Accelerated time

a) Basic idea:

In the accelerated time model, also known as the accelerated life model, the effect of the exogenous variable is specified as modifying the time scale, exactly as in Section 2.7.3.c4; it is accordingly defined as follows:

T = [g(z, β)]^{−1} T0,   equivalently   T0 = g(z, β) T

where g(z, β) > 0, or, equivalently:

hT (t | z, θ) = g(z, β) · h0(t · g(z, β) | α)

HT (t | z, θ) = H0(t · g(z, β) | α)

ST (t | z, θ) = S0(t · g(z, β) | α)

fT (t | z, θ) = g(z, β) · f0(t · g(z, β) | α)

with, as usual, θ = (α, β) and g ≥ 0. This specification is particularly attractive when the baseline distribution admits a scale parameter.

Notice that the density of an accelerated time model admits the following factorization:

fT(t | z, θ) = f1(β) f2(α, β)

Also, in the particular case where T0 ∼ Exp(α), we have:

hT (t | z, θ) = α g(z, β),

i.e. the behaviour of hT is totally determined by g(z, β), and the model is the same as a proportional hazard model.
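The coincidence with the proportional hazard model in the exponential case can be verified by simulation. The sketch below (illustrative parameter values, assumed for the example) draws T = T0 / g(z, β) with T0 ∼ Exp(α) and checks that the survivor function is exp{−α t g(z, β)}, exactly the PH form:

```python
import math, random

random.seed(1)
alpha = 1.5
beta, z = 0.6, 1.0
g = math.exp(z * beta)     # illustrative log-linear g(z, beta) (assumption)

# Accelerated time: T = T0 / g with T0 ~ Exp(alpha)
n = 100_000
ts = [random.expovariate(alpha) / g for _ in range(n)]

t0 = 0.4
emp_surv = sum(1 for t in ts if t >= t0) / n
# S_T(t | z) = S0(t * g) = exp(-alpha * t * g): here AT coincides with PH
assert abs(emp_surv - math.exp(-alpha * t0 * g)) < 0.01
```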


b) A remark:

In the accelerated time model, the effect of the exogenous variable z may be evaluated as follows:

∂/∂z ln hT(t | z, θ) = ∂/∂z ln g(z, β) + ∂/∂z ln h0(t g(z, β) | α)

 = ∂/∂z ln g(z, β) × [ 1 + ( t g(z, β) / h0(t g(z, β) | α) ) · ∂h0(t g(z, β) | α)/∂(t g(z, β)) ]

Exercise. Evaluate ∂/∂z ln hT(t | z, θ) in the particular case where g(z, β) = exp(z′β).

c) Empirical test for the accelerated time model

Let us consider the quantile functions, i.e. the inverses of the survivor (rather than, as more usually, of the distribution) functions:

qT(p | z, θ) = S_T^{−1}(p | z, θ),   0 ≤ p ≤ 1

q0(p | α) = S_0^{−1}(p | α),   0 ≤ p ≤ 1

Because of the strict monotonicity (in the continuous case) of the survivor functions, we have:

q0(p | α) = g(z, β) · qT(p | z, θ)

In the (q0(p | α), qT(p | z, θ))-space, this gives, for a fixed value of z, a straight line through the origin, the gradient of which is given by g(z, β). This feature suggests that an easy empirical test for the accelerated time model may be obtained through an examination of the so-called "Q-Q plot" (i.e. plot of the two quantiles) for a fixed value of Z and a fixed (typically, estimated) value of θ = (α, β); see e.g. Figure 2.13.

d) Regression representation of the accelerated time model

A substantial advantage of the accelerated time model is to allow for an easy representation in terms of a regression model; indeed this model may also be written, in logarithmic terms, as follows:

lnT = lnT0 − ln g(z, β)


[Figure 2.13 (not reproduced): Q-Q plot for the accelerated time model, plotting q0(p | α) on the vertical axis against qT(p | z, θ) on the horizontal axis; the plotted points lie on a straight line through the origin with slope g(z, β).]

Figure 2.13: Q-Q plot for the Accelerated Time Model.


If we define:

µ0 = E[ln T0 | α]

ε = ln T0 − µ0

we may also write:

ln T = µ0 − ln g(z, β) + ε

In particular:

(i) if ln T0 ∼ N(µ, σ²), i.e. T0 ∼ LN(µ, σ²), then ε ∼ N(0, σ²). Thus we obtain a normal regression model (if there is no censoring).

(ii) if g(z, β) = exp(z′β), we obtain a linear regression model:

lnT = µ0 − z′β + ε.

If furthermore T0 ∼ Exp(α), we have µ0 = E[ln T0 | α] = − ln α − γ, where γ ≈ 0.5772 is Euler's constant, and we eventually recover (2.11) with εt = ε − γ.

Exercise

(i) Check that, when g(z, β) = exp(z′β), identification requires that there be no constant in z′β.

(ii) Check that, for recovering (2.11) when furthermore T0 ∼ Exp(α), the distributions of the residual terms indeed coincide, in view of the distribution of ln T0.
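As part (ii) of the exercise suggests, the mean of ln T0 under an exponential distribution involves Euler's constant: E[ln T0] = − ln α − γ. A quick numerical check (illustrative α, assumed for the example):

```python
import math, random

random.seed(2)
alpha = 3.0                      # illustrative rate parameter (assumption)
gamma_euler = 0.5772156649       # Euler's constant

n = 200_000
mean_log = sum(math.log(random.expovariate(alpha)) for _ in range(n)) / n

# E[ln T0] = -ln(alpha) - gamma: mu0 absorbs Euler's constant
assert abs(mean_log - (-math.log(alpha) - gamma_euler)) < 0.02
```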

2.8.4 Comparing Proportional Hazard and Accelerated Time

The integrated hazard functions of the two models are:

HPH(t | z, θ) = g(z, β) H0(t | α) (2.12)

HAT(t | z, θ) = H0(t g(z, β) | α) (2.13)

Thus, the action of the exogenous variables z is to introduce a scale factor in the ordinate of the baseline integrated hazard function in the case of the PH model, whereas the AT model introduces a scale factor in the time axis. This is useful for a graphical comparison of the two models.


Particular case: Weibull baseline

In the particular case of a Weibull baseline distribution, namely H0(t | α) = λ t^γ where α = (λ, γ), along with a log-linear action of the exogenous variable, namely g(z, β) = exp(β′z), we obtain:

HPH(t | z, θ) = exp(β′_PH z) λ t^γ (2.14)

HAT(t | z, θ) = λ [t exp(β′_AT z)]^γ (2.15)

These two models therefore become identical under the reparametrization βPH = γ βAT. In such a case, the PH model may be estimated as an AT model, i.e. a nonlinear regression model, followed by a suitable reparametrization, when there are no censored data.
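The reparametrization equating (2.14) and (2.15) can be verified directly; the Weibull parameters below are illustrative assumptions:

```python
import math

lam, gam = 0.8, 2.0         # Weibull baseline H0(t) = lam * t**gam (illustrative)
beta_AT = 0.5               # illustrative AT coefficient (scalar z)
beta_PH = gam * beta_AT     # the reparametrization equating the two models

def H_PH(t, z): return math.exp(beta_PH * z) * lam * t**gam
def H_AT(t, z): return lam * (t * math.exp(beta_AT * z))**gam

for t in (0.3, 1.0, 2.5):
    for z in (-1.0, 0.0, 1.7):
        assert abs(H_PH(t, z) - H_AT(t, z)) < 1e-10
```

Indeed H_AT = λ t^γ exp(γ β′_AT z), which coincides with H_PH once βPH = γ βAT.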

2.8.5 Other conditional models

Combining proportional hazard and accelerated time, Chen and Jewell (2001) have proposed the following model:

hT(t | z, θ) = h0[t exp(β′_1 z) | α] exp(β′_2 z),   θ = (α, β1, β2) (2.16)

This model encompasses the proportional hazard and the accelerated time models as follows:

PH: β1 = 0

AT: β1 = β2

To be checked: identifiability and interpretability of the 3 parameters.
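The encompassing property of the hybrid hazard (2.16) can be illustrated with a small sketch (illustrative Weibull baseline and scalar covariate, assumed for the example):

```python
import math

lam, gam = 1.0, 1.8         # illustrative Weibull baseline hazard (assumption)

def h0(t): return lam * gam * t**(gam - 1.0)

def h_hybrid(t, z, b1, b2):
    # h_T(t | z) = h0(t * exp(b1'z)) * exp(b2'z), as in (2.16)
    return h0(t * math.exp(b1 * z)) * math.exp(b2 * z)

t, z, b = 1.3, 0.9, 0.4
# beta1 = 0 recovers the proportional hazard model
assert abs(h_hybrid(t, z, 0.0, b) - h0(t) * math.exp(b * z)) < 1e-12
# beta1 = beta2 recovers the accelerated time model
g = math.exp(b * z)
assert abs(h_hybrid(t, z, b, b) - g * h0(t * g)) < 1e-12
```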


Chapter 3

Multivariate durations

3.1 Introduction

Up to now, we have considered only simple point processes; these are processes with only 2 states and 1 transition; thus, one state is an "initial" one and the other one is an "absorbing" one. Therefore a trajectory of such a process is completely described by a unique random variable: the duration of the stay in the initial state. This chapter is a kind of technical detour: it introduces multivariate duration distributions because they will appear, under different forms, in most of the ensuing chapters. We first sketch some examples that motivate various aspects of the specification and of the analysis of multivariate distributions.

Example 1. Multivariate simple processes.

Let us consider a population of young individuals and define the following bivariate point process:

Xt = (Yt, Zt)

Yt = family status ∈ {unmarried, married}

Zt = professional status ∈ {employed, unemployed}

Assuming, for expository purposes, that the problem of interest is the interaction between the first exit from the unmarried status, considered as an initial state, and the first exit from the unemployed status, considered as an initial state, we shall be interested in a bivariate distribution of durations, namely:


T = (T1, T2)

T1 = date of first marriage

T2 = date of starting the first job

Example 2. Multiple exits models.

In the next chapter, we introduce a first generalization of simple point processes, namely processes with one transition and k exits. This is the case, for instance, when one wants to distinguish several ways of leaving unemployment, such as: full-time job, part-time job, reschooling or disease. The economic rationale for such an interest is that one should suspect that these different exits may be explained by different economic mechanisms. One way of dealing with such cases is to identify each exit with a different event and to model the date of leaving the initial state as the date of the event which occurs first. This leads to introducing a vector τ = (τ1, τ2, · · · , τp) of (latent) durations along with the interpretation that the observed duration is equal to the minimum of the τj's.

Example 3. Multiple transitions processes.

Finally, general point processes are characterized by the possibility of multiple transitions among states, which may be two or more. The trajectories of such processes are characterized, among other things, by a vector of durations of the stays in the visited states.

In this chapter, we show how to characterize and how to analyse a multivariate distribution of durations, without considering applications to specific models. This chapter is mainly an extension of Chapter 2. The main difficulty to be dealt with is a precise treatment of conditional distributions.

For the sake of exposition, this chapter concentrates mainly on bivariate distributions. The extension to higher-dimensional distributions will be either sketchy or confined to specific issues. Similarly, we mainly consider continuous distributions, although we shall also deal with some questions related to discrete or to mixed distributions.

3.2 Joint and Marginal Distributions

The characterization of the joint and of the marginal distributions of multivariate durations does not raise particular problems with respect to what has been presented in Chapter 2. A main objective of this section is to familiarize the reader with some notation, in particular concerning partial derivatives. The presentation in this section is accordingly rather sketchy. For the sake of simplicity we concentrate on a bivariate duration T = (T1, T2), the extension to the higher-dimensional case being rather straightforward.

The joint survivor function is defined, and denoted, as:

S1,2(t1, t2) = P(T1 ≥ t1, T2 ≥ t2)

and the joint integrated hazard function is defined, and denoted, as:

H1,2(t1, t2) = − ln S1,2(t1, t2)

In the absolutely continuous case, the joint density function is defined, and denoted, as:

f1,2(t1, t2) = ∂²/∂t1 ∂t2 S1,2(t1, t2) = D1 D2 [S1,2(t1, t2)]

where Dj means (partial) differentiation with respect to tj (for more detail, see Section 9.1). The joint hazard rate (joint instantaneous hazard function) is defined, and denoted, as:

h1,2(t1, t2) = ∂²/∂t1 ∂t2 H1,2(t1, t2) = D1 D2 [H1,2(t1, t2)] = f1,2(t1, t2) / S1,2(t1, t2)

The marginal distributions are characterized exactly as in Chapter 2. We recall here some of those characteristics along with their relationships with the joint distribution, using the notation e1 = (1, 0)′, e2 = (0, 1)′.

The marginal survivor function is defined, and denoted, as :

Sj(t) = P (Tj ≥ t) = S1,2(t ej)

and the marginal integrated hazard function is defined, and denoted, as :

Hj(t) = − lnSj(t) = − lnS1,2(t ej)


In the absolutely continuous case, i.e. when S1,2(t ej) is differentiable, the marginal density function is defined, and denoted, as:

fj(t) = lim_{∆↓0} (1/∆) P[t ≤ Tj ≤ t + ∆] = − (d/dt) S1,2(t ej) = − (d/dt) Sj(t)

and the marginal hazard rate (marginal instantaneous hazard function) is defined, and denoted, as:

hj(t) = lim_{∆↓0} (1/∆) P[t ≤ Tj ≤ t + ∆ | Tj ≥ t] = (d/dt) Hj(t) = fj(t) / Sj(t)

As far as joint and marginal distributions are concerned, the extension to the higher-dimensional case, i.e. p > 2, is rather straightforward and will not be detailed.
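The joint/marginal relations above can be checked on a closed-form example. The sketch below uses the bivariate survivor S1,2(t1, t2) = (1 + θ(t1 + t2))^{−1/θ}, a Clayton-type form chosen purely for illustration (it is not taken from the text), and verifies f1,2 = D1 D2[S1,2] by finite differences, as well as Sj(t) = S1,2(t ej):

```python
# Clayton-type bivariate survivor, chosen for illustration (assumption)
th = 0.5

def S12(t1, t2): return (1.0 + th * (t1 + t2)) ** (-1.0 / th)

def S1(t): return S12(t, 0.0)          # marginal survivor: S1(t) = S12(t e1)

# joint density f12 = D1 D2 [S12], here by central finite differences
def f12_num(t1, t2, eps=1e-5):
    return (S12(t1+eps, t2+eps) - S12(t1+eps, t2-eps)
            - S12(t1-eps, t2+eps) + S12(t1-eps, t2-eps)) / (4 * eps * eps)

t1, t2 = 0.7, 1.2
s = 1.0 + th * (t1 + t2)
f12_exact = (1.0 + th) * s ** (-1.0/th - 2.0)   # analytic mixed derivative
assert abs(f12_num(t1, t2) - f12_exact) < 1e-5
assert abs(S1(0.7) - (1.0 + th * 0.7) ** (-1.0/th)) < 1e-12
```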

3.3 Conditional Distributions in the Bivariate Case

In the sequel, we shall meet conditional distributions of, say, T1, given conditions either on T2 or on M = min(T1, T2), also written as M = T1 ∧ T2. These conditions will appear either as equalities or as inequalities. For later reference, we give some details on these conditional distributions and at the same time establish the relevant notation. In general, the first line gives a definition whereas the subsequent lines give alternative notations or relationships of potential usefulness in computations. In view of future uses, we also take care to first give formulae valid for non-continuous distributions before particularizing the general formulae to the absolutely continuous case. We first handle, with some detail, the bivariate case and next sketch the extension to higher-dimensional cases.


3.3.1 T1 given T2 ≥ t2

The conditional survivor function of (T1 | T2 ≥ t2) is defined, and denoted, as:

S≥_{1|2}(t1 | t2) = P(T1 ≥ t1 | T2 ≥ t2) = S1,2(t1, t2) / S2(t2)

The conditional integrated hazard function is defined, and denoted, as:

H≥_{1|2}(t1 | t2) = − ln S≥_{1|2}(t1 | t2) = ln S2(t2) − ln S1,2(t1, t2)

When S1,2(t1, t2) is differentiable in t1, we may define the conditional density function of (T1 | T2 ≥ t2):

f≥_{1|2}(t1 | t2) = f_{T1}(t1 | T2 ≥ t2)
 = lim_{ε↓0} (1/ε) P[t1 ≤ T1 < t1 + ε | T2 ≥ t2]
 = − D1[S1,2(t1, t2)] / S2(t2)
 = − S≥_{1|2}(t1 | t2) D1[ln S1,2(t1, t2)]

where the last equality makes use of the identity Dj[S1,2] = S1,2 Dj[ln S1,2].

The conditional hazard rate of (T1 | T2 ≥ t2) is defined, and denoted, as:

h≥_{1|2}(t1 | t2) = lim_{ε↓0} (1/ε) P[t1 ≤ T1 < t1 + ε | T1 ≥ t1, T2 ≥ t2]
 = D1[H≥_{1|2}(t1 | t2)]
 = − D1[ln S1,2(t1, t2)] = − D1[S1,2(t1, t2)] / S1,2(t1, t2)
 = f≥_{1|2}(t1 | t2) / S≥_{1|2}(t1 | t2)

It is easily checked that (T1 | T2 ≥ 0) ∼ T1; in particular: f≥_{1|2}(t1 | 0) = f1(t1).


3.3.2 T1 given T2 = t2

The basic property defining the conditional survivor function of (T1 | T2 = t2), denoted as

S=_{1|2}(t1 | t2) = P(T1 ≥ t1 | T2 = t2),

is given by the identity (actually defining a conditional probability):

S1,2(t1, t2) = ∫_{[t2,∞)} S=_{1|2}(t1 | v) dF2(v) (3.1)

When F2 is absolutely continuous, i.e. dF2(v) = f2(v) dv, S1,2(t1, t2) is differentiable in t2 and its derivative is deduced from the fundamental theorem of calculus, namely:

S1,2(t1, t2) = − ∫_{[t2,∞)} D2[S1,2(t1, v)] dv

from which we conclude:

S=_{1|2}(t1 | t2) = − D2[S1,2(t1, t2)] / f2(t2) = ( S1,2(t1, t2) / f2(t2) ) h≥_{2|1}(t2 | t1)

and similarly for the conditional integrated hazard function:

H=_{1|2}(t1 | t2) = − ln S=_{1|2}(t1 | t2) = ln f2(t2) − ln( − D2[S1,2(t1, t2)] ) (3.2)

If, furthermore, S1,2(t1, t2), and therefore S=_{1|2}(t1 | t2), are also differentiable in t1, we may define the conditional density function of (T1 | T2 = t2):

f=_{1|2}(t1 | t2) = − D1[S=_{1|2}(t1 | t2)] = D1 D2[S1,2(t1, t2)] / f2(t2) = f1,2(t1, t2) / f2(t2) (3.3)

and similarly for the conditional hazard rate of (T1 | T2 = t2):

h=_{1|2}(t1 | t2) = lim_{ε↓0} (1/ε) P[t1 ≤ T1 < t1 + ε | T1 ≥ t1, T2 = t2]
 = f=_{1|2}(t1 | t2) / S=_{1|2}(t1 | t2) = − D1[ln( − D2[S1,2(t1, t2)] )] (3.4)

where the last equality comes from the differentiation, with respect to t1, of (3.2). From D2[S1,2] = S1,2 D2[ln S1,2], we derive:

h=_{1|2}(t1 | t2) = − D1[ ln S1,2(t1, t2) + ln( − D2[ln S1,2(t1, t2)] ) ]

and obtain the following relationship:

h=_{1|2}(t1 | t2) = h≥_{1|2}(t1 | t2) − D1[ln h≥_{2|1}(t2 | t1)]. (3.5)

3.3.3 Distribution of M = min{T1, T2} = T1 ∧ T2

From the equality among events {M ≥ m} = {T1 ≥ m, T2 ≥ m}, we derive that the survivor function of M, SM(m) = P(M ≥ m), is given by:

SM(m) = S1,2(m, m) = S1,2(m e+)

where e+ = (1, 1)′. The integrated hazard function of M is therefore given by:

HM(m) = − ln S1,2(m, m) = − ln S1,2(m e+)

In general, one has dSM(m) = dS1,2(m e+). When S1,2 is differentiable in the direction of e+, i.e. De+[S1,2] exists (see Section 9.1.2), we have a density function of M:

fM(m) = lim_{ε↓0} (1/ε) P[m ≤ M ≤ m + ε]
 = − (d/dm) SM(m)
 = − De+[S1,2(m e+)]

where De+ is a directional derivative (in the direction e+, see Section 9.1.2). If, furthermore, S1,2 has both partial derivatives:

fM(m) = − ( D1[S1,2(m, m)] + D2[S1,2(m, m)] ) (3.6)

When De+[S1,2] exists, we have the hazard rate of M:

hM(m) = lim_{ε↓0} (1/ε) P[m ≤ M ≤ m + ε | M ≥ m]
 = fM(m) / SM(m)
 = − (d/dm)[ln SM(m)]
 = − De+[ln S1,2(m e+)]

If, furthermore, S1,2 has both partial derivatives, we get the relationship:

hM(m) = h≥_{1|2}(m | m) + h≥_{2|1}(m | m).
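The decomposition of the hazard of the minimum can be checked numerically on the same Clayton-type survivor used earlier, S1,2 = (1 + θ(t1 + t2))^{−1/θ} (an illustrative assumption, not from the text); by symmetry the two conditional hazards coincide there:

```python
import math

th = 0.5   # Clayton-type survivor S12 = (1 + th*(t1+t2))**(-1/th), illustrative

def S12(t1, t2): return (1.0 + th * (t1 + t2)) ** (-1.0 / th)

def hM_num(m, eps=1e-6):
    # hM(m) = -d/dm ln S12(m, m)  (directional derivative along e+)
    return -(math.log(S12(m+eps, m+eps)) - math.log(S12(m-eps, m-eps))) / (2*eps)

def h1_given2(m, eps=1e-6):
    # h>=_{1|2}(m|m) = -D1 ln S12(t1, t2) evaluated at (m, m)
    return -(math.log(S12(m+eps, m)) - math.log(S12(m-eps, m))) / (2*eps)

m = 0.8
# hM(m) = h>=_{1|2}(m|m) + h>=_{2|1}(m|m); the two terms are equal by symmetry
assert abs(hM_num(m) - 2.0 * h1_given2(m)) < 1e-6
assert abs(hM_num(m) - 2.0 / (1.0 + 2.0 * th * m)) < 1e-6
```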

3.3.4 T1 given M ≥ m

Note first that M ≥ m clearly implies T1 ≥ m. Therefore, we obtain the conditional survivor function of (T1 | M ≥ m),

S≥_{1|M}(t1 | m) = P(T1 ≥ t1 | M ≥ m),

as follows:

S≥_{1|M}(t1 | m) = S1,2(t1, m) / S1,2(m, m)   if t1 ≥ m
 = 1   if t1 < m

When S1,2 is differentiable in t1, we have the conditional density function of (T1 | M ≥ m):

f≥_{1|M}(t1 | m) = f_{T1}(t1 | M ≥ m)
 = lim_{ε↓0} (1/ε) P[t1 ≤ T1 ≤ t1 + ε | M ≥ m]
 = − D1[S≥_{1|M}(t1 | m)] 1I_{t1 ≥ m}
 = − (1 / S1,2(m, m)) D1[S1,2(t1, m)] 1I_{t1 ≥ m}

The conditional hazard rate of (T1 | M ≥ m) is accordingly:

h≥_{1|M}(t1 | m) = h_{T1}(t1 | M ≥ m)
 = lim_{ε↓0} (1/ε) P[t1 ≤ T1 ≤ t1 + ε | T1 ≥ t1, M ≥ m] 1I_{t1 ≥ m}
 = ( f≥_{1|M}(t1 | m) / S≥_{1|M}(t1 | m) ) 1I_{t1 ≥ m}
 = − D1[ln S1,2(t1, m)] 1I_{t1 ≥ m}

Note the relationship:

h≥_{1|M}(t1 | m) = h≥_{1|2}(t1 | m) 1I_{t1 ≥ m} (3.7)

which may also be viewed as the consequence, for t1 ≥ m, of the equality among events:

{T1 ≥ t1, M ≥ m} = {T1 ≥ t1, T2 ≥ m}

3.3.5 T1 given M = m

Note first the equality among the following events:

{T1 ≥ t1, M ≥ m} = {T1 ≥ (t1 ∨ m), T2 ≥ m}

(where t1 ∨ m = max(t1, m)), or equivalently:

{T1 ≥ t1, M ≥ m} = {T1 ≥ t1, T2 ≥ m}   if t1 ≥ m
 = {T1 ≥ m, T2 ≥ m} = {M ≥ m}   if t1 < m

Therefore:

P(T1 ≥ t1, M ≥ m) = S_{1,M}(t1, m)
 = S1,2(t1, m) 1I_{t1 ≥ m} + S1,2(m, m) 1I_{t1 < m}
 = S1,2(t1, m) 1I_{t1 ≥ m} + SM(m) 1I_{t1 < m}


Again, the basic property defining the conditional survivor function of (T1 | M = m), denoted as

S=_{1|M}(t1 | m) = P(T1 ≥ t1 | M = m),

is given by the identity (actually defining a conditional probability):

P(T1 ≥ t1, M ≥ m) = ∫_{[m,∞)} S=_{1|M}(t1 | u) dFM(u) (3.8)

We first notice that:

S=_{1|M}(t1 | m) = 1   if t1 < m

S_{1,M}(t1, m) = S1,2(t1, m) = ∫_{[m,∞)} S=_{1|M}(t1 | u) dFM(u)   if t1 ≥ m

When SM(m) is differentiable, so that dFM(m) = fM(m) dm, we obtain successively:

S=_{1|M}(t1 | m) = − D2[S1,2(t1, m)] / fM(m)   if t1 ≥ m
 = ( S1,2(t1, m) / fM(m) ) h≥_{2|1}(m | t1)   if t1 ≥ m (3.9)

Thus, in the absolutely continuous case, the conditional integrated hazard function of (T1 | M = m) may be written, from (3.6), as:

H=_{1|M}(t1 | m) = [ ln( −( D1[S1,2(m, m)] + D2[S1,2(m, m)] ) ) − ln( − D2[S1,2(t1, m)] ) ] 1I_{t1 ≥ m} (3.10)

The conditional density function of (T1 | M = m), defined and denoted as:

f=_{1|M}(t1 | m) = f_{T1}(t1 | M = m) = lim_{ε↓0} (1/ε) P[t1 ≤ T1 ≤ t1 + ε | M = m]

may be evaluated as follows:

f=_{1|M}(t1 | m) = − D1[S=_{1|M}(t1 | m)] 1I_{t1 ≥ m}

 = ( D1 D2[S1,2(t1, m)] / fM(m) ) 1I_{t1 ≥ m} = ( f1,2(t1, m) / fM(m) ) 1I_{t1 ≥ m}

 = − (1 / fM(m)) [ D1[S1,2(t1, m)] h≥_{2|1}(m | t1) + S1,2(t1, m) D1[h≥_{2|1}(m | t1)] ] 1I_{t1 ≥ m}

 = − ( S1,2(t1, m) / fM(m) ) [ D1[ln S1,2(t1, m)] h≥_{2|1}(m | t1) + D1[h≥_{2|1}(m | t1)] ] 1I_{t1 ≥ m}

 = ( S1,2(t1, m) / fM(m) ) [ h≥_{1|2}(t1 | m) h≥_{2|1}(m | t1) − D1[h≥_{2|1}(m | t1)] ] 1I_{t1 ≥ m}

 = ( S1,2(t1, m) h≥_{2|1}(m | t1) / fM(m) ) [ h≥_{1|2}(t1 | m) − D1[ln h≥_{2|1}(m | t1)] ] 1I_{t1 ≥ m}

(where D1 stands for D_{t1}). From (3.9) and (3.5), we obtain:

f=_{1|M}(t1 | m) = S=_{1|M}(t1 | m) h=_{1|2}(t1 | m) 1I_{t1 ≥ m} (3.11)

Note that, from (3.1) and (3.8), we also have, for t1 ≥ m:

S=_{1|M}(t1 | m) fM(m) = S=_{1|2}(t1 | m) f2(m) = S1,2(t1, m) h≥_{2|1}(m | t1)

Therefore, the conditional hazard rate of (T1 | M = m), defined and denoted as:

h=_{1|M}(t1 | m) = lim_{ε↓0} (1/ε) P[t1 ≤ T1 ≤ t1 + ε | T1 ≥ t1, M = m] = D1[H=_{1|M}(t1 | m)],

may be evaluated, from (3.11) and (3.4), as follows:

h=_{1|M}(t1 | m) = f=_{1|M}(t1 | m) / S=_{1|M}(t1 | m)
 = h=_{1|2}(t1 | m) 1I_{t1 ≥ m}
 = − D1[ ln( − D2[S1,2(t1, m)] ) ] 1I_{t1 ≥ m} (3.12)

Table 3.1 conveniently summarizes the main properties and relationships of the distributions conditional on equalities.


T1 | T2 = t2 (for all t1, t2):

  survivor function:  S=_{1|2}(t1 | t2) = S1,2(t1, t2) h≥_{2|1}(t2 | t1) / f2(t2)

  density function:   f=_{1|2}(t1 | t2) = f1,2(t1, t2) / f2(t2)

  hazard rate:        h=_{1|2}(t1 | t2) = h≥_{1|2}(t1 | t2) − D1[ln h≥_{2|1}(t2 | t1)]

T1 | M = m (first entry: t1 ≥ m; second entry: t1 < m):

  survivor function:  S=_{1|M}(t1 | m) = S1,2(t1, m) h≥_{2|1}(m | t1) / fM(m);   1

  density function:   f=_{1|M}(t1 | m) = h=_{1|2}(t1 | m) S=_{1|M}(t1 | m);   0

  hazard rate:        h=_{1|M}(t1 | m) = h=_{1|2}(t1 | m);   0

Table 3.1: Summary of the properties of distributions conditional on equalities


When T1 and T2 are independent, we get:

S1,2(t1, t2) = S1(t1) S2(t2)

f1,2(t1, t2) = f1(t1) f2(t2)

H1,2(t1, t2) = H1(t1) + H2(t2)

h1,2(t1, t2) = h1(t1) h2(t2)

S≥_{1|2}(t1 | t2) = S=_{1|2}(t1 | t2) = S1(t1)

f≥_{1|2}(t1 | t2) = f=_{1|2}(t1 | t2) = f1(t1)

h≥_{1|2}(t1 | t2) = h=_{1|2}(t1 | t2) = h1(t1)

furthermore:

SM(m) = S1(m) S2(m)

fM(m) = f2(m) S1(m) + f1(m) S2(m)

hM(m) = h1(m) + h2(m)

S≥_{1|M}(t1 | m) = S1(t1) / S1(m)   if t1 ≥ m   (= 1 if t1 < m)

f≥_{1|M}(t1 | m) = ( f1(t1) / S1(m) ) 1I_{t1 ≥ m}

h≥_{1|M}(t1 | m) = h1(t1) 1I_{t1 ≥ m}

S=_{1|M}(t1 | m) = ( S1(t1) / S1(m) ) · h2(m) / ( h1(m) + h2(m) )   if t1 ≥ m   (= 1 if t1 < m)

f=_{1|M}(t1 | m) = ( f1(t1) / S1(m) ) · h2(m) / ( h1(m) + h2(m) ) 1I_{t1 ≥ m}

h=_{1|M}(t1 | m) = h1(t1) 1I_{t1 ≥ m}
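Two of the independence-case formulas can be checked by simulation with independent exponential durations (illustrative rates, assumed for the example): the conditional survivor S≥_{1|M}(t1 | m) = S1(t1)/S1(m), and the weight h2/(h1 + h2) appearing in S=_{1|M}, which is the probability that T2 is the minimum:

```python
import math, random

random.seed(3)
l1, l2 = 1.0, 2.0          # independent exponential hazards (illustrative)
n = 200_000
pairs = [(random.expovariate(l1), random.expovariate(l2)) for _ in range(n)]

m, t1 = 0.3, 0.8
# S>=_{1|M}(t1 | m) = S1(t1)/S1(m) = exp(-l1*(t1 - m)) under independence
num = sum(1 for a, b in pairs if a >= t1 and min(a, b) >= m)
den = sum(1 for a, b in pairs if min(a, b) >= m)
assert abs(num / den - math.exp(-l1 * (t1 - m))) < 0.02

# the weight h2/(h1 + h2) in S=_{1|M} is P(T2 is the minimum)
p2min = sum(1 for a, b in pairs if b < a) / n
assert abs(p2min - l2 / (l1 + l2)) < 0.01
```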


3.4 Conditional Distributions for p ≥ 2

In this section, we briefly extend the main concepts previously explained for the case p = 2. The formulae become somewhat easier under the following notational conventions:

T = (T1, T2, · · · , Tp)

t = (t1, t2, · · · , tp)

S∗(t) = P(Ti ≥ ti : 1 ≤ i ≤ p)

j̄ = {1, 2, · · · , p} \ {j}

e+ = Σ_{1≤i≤p} ei = (1, 1, · · · , 1)′ ∈ IR^p

where ei is the i-th column of Ip, the p × p unit matrix (when ambiguity is possible, one might write e+^{(p)} instead of e+).

3.4.1 Distribution of M = min{T1, T2, · · · , Tp} = ∧_{1≤i≤p} Ti

This extension from p = 2 to p > 2 is straightforward after remarking that

{M ≥ m} = {Ti ≥ m : 1 ≤ i ≤ p}.

Thus, the survivor function of M is, in general:

SM(m) = P(M ≥ m) = S∗(m, m, · · · , m) = S∗(m e+).

When S∗ is differentiable in the direction e+, the density function of M is given by:

fM(m) = lim_{ε↓0} (1/ε) P[m ≤ M ≤ m + ε]
 = − (d/dm) SM(m)
 = − De+[S∗(m e+)]


When the partial derivatives Dj[S∗] exist, we also have:

fM(m) = − Σ_{1≤j≤p} Dj[S∗(m e+)]

Similarly, for the hazard rate of M:

hM(m) = lim_{ε↓0} (1/ε) P[m ≤ M ≤ m + ε | M ≥ m]
 = fM(m) / SM(m)
 = − (d/dm)[ln SM(m)]
 = − Σ_{1≤j≤p} Dj[ln S∗(m e+)]

3.4.2 Conditional Distributions of the Ti's

Let u ∈ IR_+^{p−1} and v ∈ IR_+, and denote the conditional survivor function of (Ti : i ∈ j̄ | Tj = v) by S=_{j̄|j}(u | v); this is the function defined by the identity:

S∗(u, v) = ∫_{[v,∞)} S=_{j̄|j}(u | s) dFj(s)

When Fj(t) has a density fj(t), the fundamental theorem of calculus gives:

S=_{j̄|j}(u | v) = − Dj[S∗(u∗, v∗)] / fj(v)

where (u∗, v∗) stands for the reordering of the coordinates of (u, v) into the natural order.

We shall have a particular interest in the case u = t e+ ∈ IR_+^{p−1} and v = t. It will be convenient to use the notational convention:

S=_{j̄|j}(t e+ | t) = S=_{j̄|j}(t)

Thus, when Fj(t) has a density fj(t), we obtain:

S=_{j̄|j}(t) = − Dj[S∗(t e+)] / fj(t)


3.4.3 Tj given M ≥ m

Remember that {M ≥ m} ⊂ {Tj ≥ m}; therefore {M ≥ m} = {Tj ≥ m, M ≥ m}. Thus, the conditional survivor function of (Tj | M ≥ m), defined and denoted as:

S≥_{j|M}(tj | m) = P(Tj ≥ tj | M ≥ m),

may be evaluated as follows:

S≥_{j|M}(tj | m) = 1   if tj < m

 = P(Tj ≥ tj, M ≥ m) / P(M ≥ m) = S∗(m e+ + (tj − m) ej) / S∗(m e+)   if tj ≥ m

Thus the conditional integrated hazard function of (Tj | M ≥ m) may be written as:

H≥_{j|M}(tj | m) = [ ln SM(m) − ln S∗(m e+ + (tj − m) ej) ] 1I_{tj ≥ m}

and the conditional hazard rate of (Tj | M ≥ m), defined and denoted as:

h≥_{j|M}(tj | m) = h_{Tj}(tj | M ≥ m) = lim_{ε↓0} (1/ε) P[tj ≤ Tj ≤ tj + ε | Tj ≥ tj, M ≥ m],

may be evaluated as follows:

h≥_{j|M}(tj | m) = − Dj[ln S∗(m e+ + (tj − m) ej)] 1I_{tj ≥ m}

In the particular case where tj = m = t, we use the convenient notational convention h≥_{j|M}(t | t) = h≥_{j|M}(t); in such a case:

h≥_{j|M}(t) = lim_{ε↓0} (1/ε) [ S∗(t e+) − S∗(t e+ + ε ej) ] / S∗(t e+)

Thus, when S∗(t e+) is differentiable in tj, we get:

h≥_{j|M}(t) = − Dj[ln S∗(t e+)]

The following relationship should be noticed:

hM(m) = − Σ_{1≤j≤p} Dj[ln S∗(m e+)] = Σ_{1≤j≤p} h≥_{j|M}(m)


3.4.4 Tj given M = m

3.5 Multivariate Distributions of Durations

3.5.1 Constructing Multivariate Distributions

In this section, we sketch some procedures for building multivariate duration distributions. Clearly, what is at stake is obtaining multivariate distributions the components of which are not mutually independent.

A first procedure has already been introduced for building conditional models: it consists in making the parameter of a univariate distribution depend on other durations. More explicitly:

• consider a given order of the coordinates and associate to each coordinate $T_j$ a family of distributions parametrized by, say, $\alpha_j$; for the sake of simplicity, let the respective densities be written as $f_j(t_j \,|\, \alpha_j)$.

• transform each of those univariate distributions into a distribution conditional on the anterior coordinates by making $\alpha_j$ a function of $(T_1, T_2, \cdots, T_{j-1})$:

\[
f_j(t_j \,|\, t_1, t_2, \cdots, t_{j-1}) = f_j\big(t_j \,|\, \alpha_j(t_1, t_2, \cdots, t_{j-1})\big)
\]

A useful property of such procedures is that the "marginal-conditional" decomposition is variation-free: the only restriction on the sequence of functions $\alpha_j(\cdot)$ is that they be well defined on the support of the previously specified distributions and take values in an adequate parameter space.
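The sequential construction above can be sketched numerically. In the sketch below, both families are exponential and the link function $\alpha_2(t_1) = b_0 + b_1 t_1$ is an arbitrary assumption chosen for illustration, not taken from the text:

```python
import random

# Sketch of the "marginal-conditional" construction with p = 2:
# T1 is drawn from its marginal distribution, then T2 is drawn from a
# distribution whose parameter depends on the realized t1.
# Exponential families and the link alpha_2(t1) = b0 + b1*t1 are
# illustrative assumptions.
def draw_pair(rng, a1=1.0, b0=0.5, b1=2.0):
    t1 = rng.expovariate(a1)            # T1 ~ Exp(a1)
    alpha2 = b0 + b1 * t1               # any positive function of t1 works
    t2 = rng.expovariate(alpha2)        # T2 | T1 = t1 ~ Exp(alpha_2(t1))
    return t1, t2

rng = random.Random(0)
sample = [draw_pair(rng) for _ in range(50_000)]

# The construction induces dependence: a large t1 raises alpha_2 and
# therefore tends to shorten T2.
mean_t2_low = (sum(t2 for t1, t2 in sample if t1 < 0.5)
               / sum(1 for t1, _ in sample if t1 < 0.5))
mean_t2_high = (sum(t2 for t1, t2 in sample if t1 >= 0.5)
                / sum(1 for t1, _ in sample if t1 >= 0.5))
```

The only requirement on the link function, as stated above, is that it be well defined on the support of the earlier coordinates and positive-valued here.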

A second procedure consists in introducing an auxiliary latent variable, say $\eta$, conditionally on which the durations would have a "simple" distribution, typically with mutually independent components. For instance, in the bivariate case, one would specify $T_1 \perp\!\!\!\perp T_2 \,|\, \eta$, but $T_1$ and $T_2$ would not be independent after integrating out $\eta$, unless $\eta$ had zero variance. As an example of such a procedure, consider:

\[
h_{j|\eta}(t_j \,|\, \eta) = \eta\, h_j(t_j) \qquad j = 1, 2
\]

Page 99: UCLouvain · 2012. 9. 13. · Contents Preface 1 Notation 3 1 Introduction 4 1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 An Overview of Topics

CHAPTER 3. MULTIVARIATE DURATIONS 90

This specification implies :

\[
S_{1,2}(t_1, t_2) = \int f_\eta(\eta)\, \exp\Big\{ -\eta \Big[ \int_0^{t_1} h_1(u)\, du + \int_0^{t_2} h_2(v)\, dv \Big] \Big\}\, d\eta
\]

In several instances, $\eta$ may be interpreted as a latent factor of heterogeneity that makes the durations $T_j$ dependent. In particular, for any $n$, when the $(T_i \,|\, \eta)$'s, $i = 1, \cdots, n$, are i.i.d., the $T_i$'s are exchangeable.
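A small simulation illustrates this mechanism. Conditionally on $\eta$, the two durations below are independent exponentials with hazards $\eta h_1$ and $\eta h_2$; unconditionally, the shared $\eta$ makes them positively dependent. The gamma frailty and all parameter values are illustrative assumptions:

```python
import random

# Frailty sketch: conditionally on eta, T1 and T2 are independent
# exponentials with hazards eta*h1 and eta*h2 (h1 = h2 = 1 here);
# unconditionally, the shared eta induces positive dependence.
# Gamma frailty and the parameter values are illustrative assumptions.
rng = random.Random(42)
shape, rate = 6.0, 5.0                         # eta ~ Gamma(shape, rate)
n = 100_000
pairs = []
for _ in range(n):
    eta = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes (shape, scale)
    pairs.append((rng.expovariate(eta), rng.expovariate(eta)))

m1 = sum(t1 for t1, _ in pairs) / n
m2 = sum(t2 for _, t2 in pairs) / n
# Positive sample covariance reveals the dependence created by eta.
cov = sum((t1 - m1) * (t2 - m2) for t1, t2 in pairs) / n
```

With these (assumed) values, $E[T_i] = E[1/\eta] = \text{rate}/(\text{shape}-1) = 1$, and the covariance is strictly positive, as the variation of $\eta$ requires.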

3.5.2 Examples of Multivariate Durations Distributions

Bivariate extensions of the exponential distribution

These examples admit the bivariate distribution with independent exponential components as a particular case.

(i)
\[
S_{1,2}(t_1, t_2) = \exp\big[\, 1 - \alpha_1 t_1 - \alpha_2 t_2 - \exp\{\alpha_{12}(\alpha_1 t_1 + \alpha_2 t_2)\} \,\big],
\qquad \alpha_1, \alpha_2 > 0, \quad \alpha_{12} > -1
\]

Remark: $T_1$ and $T_2$ are not independent once $\alpha_{12} \neq 0$. For details, see Example 1 in Section 4.7.

(ii)
\[
S_{1,2}(t_1, t_2) = \exp\big[ -\big( \lambda_1 t_1 + \lambda_2 t_2 + \lambda_3 \max(t_1, t_2) \big) \big]
\]

Remark: $T_1$ and $T_2$ are not independent once $\lambda_3 \neq 0$. For details, see Example 2 in Section 4.7.

Examples of mixture distributions

Gamma mixture of exponentials:

\[
(T_i \,|\, \eta) \sim \text{ind. } Exp(\alpha_i \eta), \quad \text{i.e. } h_{T_i}(t_i \,|\, \eta, \alpha_i) = \alpha_i \eta \quad (i = 1, 2)
\]
\[
\eta \sim \Gamma(\beta, \gamma), \quad \text{i.e. } f_\eta(\eta) = \frac{\gamma^\beta}{\Gamma(\beta)}\, \eta^{\beta - 1} \exp\{-\gamma \eta\}
\]

Therefore:

\[
f_{1,2}(t_1, t_2) = \int_0^\infty \alpha_1 \alpha_2\, \eta^2 \exp\{-\eta(\alpha_1 t_1 + \alpha_2 t_2)\}\,
\frac{\gamma^\beta}{\Gamma(\beta)}\, \eta^{\beta - 1} \exp\{-\gamma \eta\}\, d\eta
= \frac{\alpha_1 \alpha_2\, \gamma^\beta\, \beta(\beta + 1)}{[\gamma + \alpha_1 t_1 + \alpha_2 t_2]^{\beta + 2}}
\tag{3.13}
\]


Chapter 4

One transition and several exits

4.1 Introduction

In this chapter we consider problems met in modelling data of the following type. For a given individual, we observe the age at the time of death, say $T$, and the cause of death, say $A$, among a finite set of possible causes, say $E = \{E_1, E_2, \ldots, E_p\}$.

For statistical purposes, models of this class typically consider a finite number of precisely defined causes along with a "residual" cause that gathers all other possible causes; this residual cause is often treated as a censoring state.

At first glance, the data have the form $(T, A) \in I\!\!R_+ \times E$ if $A$ is an element of $E$, i.e. a single cause is given for each observed individual, or the form $(T, A) \in I\!\!R_+ \times 2^E$ if $A$ is a subset of $E$, which is the case when multiple causes are allowed. Multiple causes arise in situations where several causes may actually act simultaneously, or in situations where the available information does not enable us to distinguish one cause from several possible ones. Notice that whether the range of $A$ is $E$ or $2^E$ is mathematically irrelevant as long as it is finite: we shall always stick to $p$ possible exits; when building the model, however, the cases of multiple causes, or of imperfectly identified causes, raise substantially different problems.

Summarizing, we consider in this chapter a point process characterized by:

- p+1 states (i.e. one initial state, E0, and p exit states)



Figure 4.1: Trajectory of a process with one transition and multiple exits

- a unique transition

A typical trajectory of such a process is represented in Figure 4.1.

Let us define:

(i) the duration of the spell in the initial state:

\[
T = \inf\{ t \,|\, X(t) \neq E_0 \} \in I\!\!R_+
\]

(ii) the label of the exit, in one of two different possible codings:

\[
K \in \{1, 2, \ldots, p\} \quad \text{with } K = j \iff X(T) = E_j
\]
\[
A = (A_1, \ldots, A_p) \in \mathcal{A}_p = \Big\{ A \in \{0, 1\}^p \,\Big|\, \sum_j A_j = 1 \Big\}
\quad \text{where } A_j = 1\!\mathrm{I}_{\{K = j\}} = 1\!\mathrm{I}_{\{X(T) = E_j\}}.
\]

The data accordingly have the following form, for individuals $i = 1, \ldots, n$:

\[
Y_i = (T_i, A_i) = (T_i, K_i)
\]
\[
Z_i \ \text{: covariates } (\in I\!\!R^k)
\]
\[
X_i = (Y_i, Z_i)
\]


4.2 Modelling multiple exits

In the general case, the data density, or the likelihood function, of the data $(T, A)$ is based on the specification of

\[
P(T > t, A = a) = \bar{F}_{T,A}(t, a) \tag{4.1}
\]

which is called a sub-distribution (in the form of a survivor function) because

\[
\bar{F}_{T,A}(0, a) = P(A = a) < 1 \tag{4.2}
\]

whereas the survivor function of $T$, i.e. $\bar{F}_T(t)$, would satisfy

\[
\bar{F}_T(0) = P(T > 0) = 1 \tag{4.3}
\]

as soon as $P(T = 0) = 0$, which we will assume hereafter. We shall also assume that $P(T < \infty) = 1$. Clearly

\[
\bar{F}_T(t) = \sum_{a \in \mathcal{A}_p} \bar{F}_{T,A}(t, a) \tag{4.4}
\]

Furthermore,

\[
\bar{F}_{T,A}(t, a) = P(T > t \,|\, A = a)\, P(A = a) \tag{4.5}
\]

i.e. a sub-distribution may be decomposed into the product of a (conditional) distribution of a duration and a probability of the exit state, but many models are based on the dual decomposition, namely

\[
\bar{F}_{T,A}(t, a) = E\big[ P(A = a \,|\, T)\, 1\!\mathrm{I}_{\{T > t\}} \big]
= \int_{(t, \infty]} P(A = a \,|\, T = s)\, dF_T(s) \tag{4.6}
\]

Page 103: UCLouvain · 2012. 9. 13. · Contents Preface 1 Notation 3 1 Introduction 4 1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 An Overview of Topics

CHAPTER 4. ONE TRANSITION AND SEVERAL EXITS 94

where $F_T(s) = 1 - \bar{F}_T(s) = P(T \leq s)$ is the distribution function of the observed duration $T$.

When $\bar{F}_{T,A}(t, a)$ is differentiable in $t$, the data density $L_{T,A}(t, a)$ is obtained as follows:

\[
L_{T,A}(t, a) = -\,\frac{d}{dt} \bar{F}_{T,A}(t, a) \tag{4.7}
\]

Furthermore, if the marginal distribution of $T$ admits a density, say $f_T(t)$, i.e. $dF_T(t) = f_T(t)\, dt$, we obtain

\[
\bar{F}_{T,A}(t, a) = \int_t^\infty P(A = a \,|\, T = s)\, f_T(s)\, ds \tag{4.8}
\]

and therefore

\[
L_{T,A}(t, a) = P(A = a \,|\, T = t)\, f_T(t) \tag{4.9}
\]

A natural specification of the sub-distribution $\bar{F}_{T,A}(t, a)$, and consequently of the likelihood function, may accordingly also be based on specifying the probability of the causes conditionally on the observed duration, together with the (marginal) law of the observed duration.

Alternatively, the data density, for one observation, may also be factorized as:

\[
L_{T,A}(t, a) = P(A = a)\, f_{T|A}(t \,|\, A = a) \tag{4.10}
\]

4.3 Competing risks models: an Introduction

The competing risks model is a class of models originally developed for the case of a single observed cause, i.e. $K \in E$. Formal generalizations may be


obtained by redefining the state space as the power set of $E$, but the specification problems thereby raised are beyond the scope of this presentation.

The object of this chapter is to give a brief and simple survey of some problems arising when modelling competing risks. Particular attention is paid to accommodating general distributions rather than sticking to the absolutely continuous case. The rationale is twofold. Firstly, (well chosen) general tools may be simpler to use than tools specific to the absolutely continuous case. Secondly, the rise of semi-parametric and non-parametric models leads to constructing procedures based, from the start, on the empirical process, which is discrete in nature even when the theoretical distribution is assumedly continuous. Furthermore, data are typically observable on a discrete scale only, and model-builders may want to take this feature into account explicitly, particularly when ties are apparent in the data. Bridging discrete and continuous distributions in a systematic way becomes, in these cases, of primary interest. Particular attention is also paid to identification problems, a crucial issue when modelling. Here also we endeavour to present the problem in a general framework, not only for the sake of generality but also for the simplicity of the basic argument.

An example from the job market

$E_0$: the initial state (the jobless)
$E_1, E_2, \ldots, E_p$: possible states of exit.

Basic idea: to each exit is associated a specific event, the realization of which determines the particular exit.

For instance:
$E_1$ full-time job. Event: job searched, job found and job matching.
$E_2$ part-time job. Event: as for the full-time job but with different probabilities.
$E_3$ disease. Event: occurrence of a disease.

Let $\tau = (\tau_1, \ldots, \tau_p)$, where $\tau_j$ is the latent (i.e. non-observable) duration associated with exit $E_j$. When the process starts at $t = 0$, $\tau_j$ is also the date of realization of the event which causes exit $E_j$. The competing risks model is accordingly based on the following interpretation of the data:


\[
T = \min_j \{\tau_j\} \tag{4.11}
\]
\[
K = \min\{ j \,|\, \tau_j = T \} \tag{4.12}
\]

When the probability of ties is strictly positive, i.e. $P[\tau_i = \tau_j] > 0$ for some $i \neq j$, the definition of $K$ requires some implicit ordering of the states such that, in case of ties for the minimum, i.e. $\tau_i = \tau_j < \tau_k\ \forall k \notin \{i, j\}$, the observed cause corresponds to the one with the lowest label. When the probability of ties is always zero, i.e. $P[\tau_i = \tau_j] = 0\ \forall i \neq j$, $K$ is also equal to:

\[
K = \arg\min_j \{\tau_j\} \tag{4.13}
\]

Equivalently, $K$ may be recoded into:

\[
A = (A_1, \ldots, A_p) \quad \text{where } A_j = 1\!\mathrm{I}_{\{K = j\}}.
\]
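The interpretation (4.11)-(4.12) translates directly into a data-generating sketch. Independent exponential latent durations are an illustrative assumption (ties then occur with probability zero, so the tie-breaking rule is innocuous):

```python
import random

# Data-generating sketch of the competing-risks interpretation: draw the
# latent vector tau, then observe T = min tau_j and the label K of (4.12)
# (lowest index attaining the minimum).  Independent exponential latent
# durations, and the rate values, are illustrative assumptions.
def observe(taus):
    t = min(taus)
    k = min(j for j, tau in enumerate(taus, start=1) if tau == t)
    return t, k

rng = random.Random(7)
rates = [1.0, 2.0, 3.0]                        # latent hazards (arbitrary)
data = [observe([rng.expovariate(r) for r in rates]) for _ in range(60_000)]

# In this exponential special case, P(K = j) = rates[j-1] / sum(rates).
share_1 = sum(1 for _, k in data if k == 1) / len(data)
```

Only the pair $(T, K)$ is retained as data; the full latent vector $\tau$ is discarded, which is the source of the identification problem discussed in Section 4.6.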


4.4 The case of 2 exits (p = 2)

Motivation

• simplest case: the reader may thus skip this section and proceed directly to the next one, which treats the general case ($p \geq 2$)

• it better captures the role of possible dependence among the latent durations

• censoring

Let us evaluate the distribution of $(T, A)$ where:

\[
T = \min\{\tau_1, \tau_2\} \tag{4.14}
\]
\[
A = 1\!\mathrm{I}_{\{\tau_1 \leq \tau_2\}} \tag{4.15}
\]

When using this case for handling the problem of censoring, $E_2$ will be interpreted as the "censored" exit.

Likelihood function: We distinguish two steps:

i) $P(T \leq t, A = a)$
ii) $\dfrac{d}{dt} P(T \leq t, A = a)$

Notation:

\[
\bar{F}^{=}_{1|2}(t_1 | t_2) = P(\tau_1 > t_1 \,|\, \tau_2 = t_2) \qquad
S^{=}_{1|2}(t_1 | t_2) = P(\tau_1 \geq t_1 \,|\, \tau_2 = t_2)
\]
\[
F^{=}_{1|2}(t_1 | t_2) = P(\tau_1 \leq t_1 \,|\, \tau_2 = t_2)
\]

Evaluation of $P[T \leq t, A = a]$.

Case $A = 1$ (Figure 4.2):

\[
P[T \leq t, A = 1] = P\big[\{\tau_1 \leq t\} \cap \{\tau_1 \leq \tau_2\}\big]
\]


Figure 4.2: The case τ1 ≤ τ2 (A = 1)

\[
P[T \leq t, A = 1] = \int_0^t S^{=}_{2|1}(r | r)\, F_1(dr)
= \int_0^\infty F^{=}_{1|2}(t \wedge s \,|\, s)\, F_2(ds)
= \int_0^t F^{=}_{1|2}(s | s)\, F_2(ds) + \int_t^\infty F^{=}_{1|2}(t | s)\, F_2(ds)
\]

Case $A = 0$ (Figure 4.3):

\[
P[T \leq t, A = 0] = P\big[\{\tau_1 > \tau_2\} \cap \{\tau_2 \leq t\}\big]
= \int_0^t \bar{F}^{=}_{1|2}(s | s)\, F_2(ds)
= \int_0^\infty F^{=}_{2|1}(t \wedge r \,|\, r)\, F_1(dr)
\]

Therefore:

\[
P(T \leq t, A = a) = \Big[ \int_0^t S^{=}_{2|1}(r | r)\, F_1(dr) \Big]^a
\Big[ \int_0^t \bar{F}^{=}_{1|2}(s | s)\, F_2(ds) \Big]^{1-a}
\]

When $F_1$ and $F_2$ admit densities (i.e. $F_j(dt) = f_j(t)\, dt$), so that $\bar{F}^{=} = S^{=}$:

\[
f_{T,A}(t, a) = \frac{d}{dt} P[T \leq t, A = a] = l(t, a)
= \big[ S^{=}_{2|1}(t|t)\, f_1(t) \big]^a \big[ S^{=}_{1|2}(t|t)\, f_2(t) \big]^{1-a}
\cdot 1\!\mathrm{I}_{\{0,1\} \times I\!\!R_+}(a, t) \tag{4.16}
\]


Figure 4.3: The case τ1 > τ2 (A = 0)

Remark: This is the Radon-Nikodym derivative of $P$ with respect to the product of the Lebesgue measure on $I\!\!R_+$ (for $T$) and the counting measure on $E$ (for $A$).

As worked out in Chapter 3, the conditional survival function $S^{=}_{1|2}(r|t)$ is defined by means of the identity:

\[
S_{1,2}(r, s) = P(\tau_1 \geq r, \tau_2 \geq s) = \int_s^\infty S^{=}_{1|2}(r|t)\, dF_2(t) \tag{4.17}
\]

Thus, when the distribution of $\tau_2$ admits a density, $dF_2(t) = f_2(t)\, dt$, we obtain:

\[
f_2(s)\, S^{=}_{1|2}(r|s) = -\,\frac{\partial}{\partial \tau_2} S_{1,2}(r, s) = -\,D_2\big(S_{1,2}(r, s)\big)
\]

When $S_{1,2}(r, s)$ is furthermore differentiable on the main diagonal, i.e. on the set $\{\tau_1 = \tau_2\}$, we may write:

\[
f_2(t)\, S^{=}_{1|2}(t|t) = -\,D_2\big(S_{1,2}(t, t)\big) = -\,S_{1,2}(t, t)\, D_2\big[\ln S_{1,2}(t, t)\big] \tag{4.18}
\]

If we recall that:

\[
S_T(t) = P(T \geq t) = P(\tau_1 \geq t, \tau_2 \geq t) = S_{1,2}(t, t)
\]

and if we simplify the notation as follows:

\[
S^{=}_{1|2}(t|t) = S^{=}_{1|2}(t) \qquad S^{=}_{1|T}(t|t) = S^{=}_{1|T}(t)
\]


and make use of (3.9), we may also write:

\[
-\,D_2\big[S_{1,2}(t, t)\big] = f_2(t)\, S^{=}_{1|2}(t) = f_T(t)\, S^{=}_{1|T}(t) = S_T(t)\, h^{\geq}_{2|T}(t) \tag{4.19}
\]

where the conditional hazard function $h^{\geq}_{2|T}(t)$ is a short-hand notation for $h^{\geq}_{2|T}(t|t)$, defined as (see Section 3.4.3):

\[
h^{\geq}_{2|T}(t) = -\,D_2\big[\ln S_{1,2}(t, t)\big]
= \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon}\, P(t \leq \tau_2 \leq t + \varepsilon \,|\, T \geq t)
= h^{\geq}_{2|1}(t|t) \ \text{(from (3.7))}
= h^{\geq}_{2|1}(t) \ \text{(notation)}
\]

Note also that:

\[
S^{=}_{1|2}(t) = P(\tau_1 \geq t \,|\, \tau_2 = t) = P(\tau_1 \geq t \,|\, T = t)
\]

whereas:

\[
\bar{F}^{=}_{1|2}(t) = P(\tau_1 > t \,|\, \tau_2 = t) = P(\tau_1 > t \,|\, T = t) = P(\tau_1 > t, A = 0 \,|\, T = t)
\]

Thus, when $S_{1,2}(r, s)$ is continuously differentiable, the data density (4.16) may also be written as:

\[
f_{T,A}(t, a) = S_T(t)\, h^{\geq}_{1|2}(t)^a\, h^{\geq}_{2|1}(t)^{1-a} \tag{4.20}
\]

Suppose now $\tau_1 \perp\!\!\!\perp \tau_2$. In this case:

\[
S^{=}_{1|2}(t|t) = S_1(t) \qquad S^{=}_{2|1}(t|t) = S_2(t)
\]
\[
l(t, a) = \big[ S_2(t)\, f_1(t) \big]^a \big[ S_1(t)\, f_2(t) \big]^{1-a}
= S_2(t)\, S_1(t)\, h_1(t)^a\, h_2(t)^{1-a}
= S_T(t)\, h_1(t)^a\, h_2(t)^{1-a}
\]


4.5 General case (p ≥ 2)

Let:

\[
\tau = (\tau_1, \ldots, \tau_p)
\]
\[
S_*(t_1, \cdots, t_p) = P(\tau_1 \geq t_1, \cdots, \tau_p \geq t_p)
\]
\[
\bar{F}_*(t_1, \cdots, t_p) = P(\tau_1 > t_1, \cdots, \tau_p > t_p)
\]
\[
S_j(t) = P(\tau_j \geq t) = S_*(0, \ldots, 0, t, 0, \cdots, 0) = S_*(t e_j)
\]

When $S_*(t_1, \ldots, t_p)$ is everywhere continuously differentiable, so that $S_* = \bar{F}_*$, one may define:

\[
f_j(t) = -\,\frac{d}{dt} S_j(t)
\]
\[
h_j(t) = -\,\frac{d}{dt} \ln S_j(t) = \frac{f_j(t)}{S_j(t)}
= \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon}\, P\big[\, t \leq \tau_j < t + \varepsilon \,\big|\, \tau_j \geq t \,\big]
\]

The data are:

\[
T = \min\{ \tau_j \,|\, 1 \leq j \leq p \} \ \text{: the observed duration}
\]
\[
K = \min\{ j \,|\, \tau_j = T \} \ \text{: the label of the exit state, or equivalently:}
\]
\[
A = (A_1, \cdots, A_p) \in \mathcal{A}_p \qquad A_j = 1\!\mathrm{I}_{\{K = j\}}
\]

Note that if $P(\tau_i = \tau_j) = 0\ \forall i \neq j$, we also have:

\[
K = \arg\min_j (\tau_j)
\]


Let also:

\[
S_T(t) = P(T \geq t)
\]
\[
f_T(t) = -\,\frac{d}{dt} S_T(t)
\]
\[
h_T(t) = -\,\frac{d}{dt} \ln S_T(t) = \frac{f_T(t)}{S_T(t)}
= \lim_{\Delta \downarrow 0} \frac{1}{\Delta}\, P(t \leq T < t + \Delta \,|\, T \geq t)
\]

Now remember, from Chapter 3 (where the minimum $M$ is now denoted $T$):

\[
S_T(t) = S_*(t, \ldots, t) = S_*(te_+)
\]
\[
f_T(t) = -\sum_j \frac{\partial}{\partial \tau_j} S_*(te_+) = -\sum_j D_j\big(S_*(te_+)\big)
\]
\[
h_T(t) = -\sum_j \frac{\partial}{\partial \tau_j} \ln S_*(te_+) = -\sum_j D_j\big(\ln S_*(te_+)\big)
= \sum_j h^{\geq}_{j|T}(t)
\]
\[
S^{=}_{j|j}(t|t) = \frac{h^{\geq}_{j|T}(t)\, S_T(t)}{f_j(t)}
= -\,\frac{D_j\big(S_*(te_+)\big)}{f_j(t)}
\]

where:

\[
h^{\geq}_{j|T}(t) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon}\, P\big[\, t \leq \tau_j < t + \varepsilon \,\big|\, T \geq t \,\big]
= -\,\frac{\partial}{\partial \tau_j} \ln S_*(te_+) = -\,D_j\big[\ln S_*(te_+)\big]
\]

We are now ready to evaluate the likelihood function:


\[
l(t, j) = h^{\geq}_{j|T}(t)\, S_T(t) = f_j(t)\, S^{=}_{j|j}(t|t)
\]

Alternatively, we can also write:

\[
l(t, a) = \prod_j \big[ f_j(t)\, S^{=}_{j|j}(t|t) \big]^{a_j}
= \prod_j \big[ h^{\geq}_{j|T}(t)\, S_T(t) \big]^{a_j}
= S_T(t) \prod_j \big( h^{\geq}_{j|T}(t) \big)^{a_j}
\]

where $a = (a_1, \ldots, a_p)'$.

Particular case: mutually independent latent durations ($\perp\!\!\!\perp_j \tau_j$)

This is:

\[
S_*(t_1, \ldots, t_p) = \prod_j S_j(t_j)
\]

Therefore:

\[
S_T(t) = \prod_j S_j(t) \qquad h^{\geq}_{j|T}(t) = h_j(t)
\]

Then:

\[
h_T(t) = \sum_j h_j(t)
\]
\[
l(t, a) = S_T(t) \prod_j h_j(t)^{a_j}
\]


A remark on the evaluation of $S^{=}_{j|j}(t|t)$ (in the independent case):

\[
S^{=}_{j|j}(t) = \frac{h^{\geq}_{j|T}(t)\, S_T(t)}{f_j(t)}
= \frac{h_j(t)}{f_j(t)} \prod_l S_l(t)
= \underbrace{\frac{h_j(t)\, S_j(t)}{f_j(t)}}_{\equiv\, 1} \prod_{l \neq j} S_l(t)
= \prod_{l \neq j} S_l(t)
\]

Therefore:

\[
l(t, e_j) = h_j(t) \prod_l S_l(t) = h_j(t)\, S_T(t) = f_j(t) \prod_{l \neq j} S_l(t)
\]

4.6 Identifiability of the competing risks model

Basic Idea

If we only observe

\[
T = \min_j \{\tau_j\} \qquad K = \arg\min_j \{\tau_j\},
\]

one should not hope to identify whether the $\tau_j$'s are independent or not. More precisely, we have the following theorem.

Theorem

Let:


\[
\mathcal{S} = \big\{ S_*(t_1, \ldots, t_p) \ \text{continuously differentiable} \big\}
\]
\[
\mathcal{S}_I = \Big\{ S_*(t_1, \cdots, t_p) \in \mathcal{S} \ \Big|\ S_*(t_1, \cdots, t_p) = \prod_j S_j(t_j) \Big\}
\]
\[
T = \min(\tau_j) \qquad K = \arg\min\{\tau_j\}
\]
\[
l_*(t, j) \ \text{: likelihood relative to } S_* \qquad
l_I(t, j) \ \text{: likelihood relative to } S_I
\]

Then:

\[
\forall S_* \in \mathcal{S} \quad \exists!\, S_I \in \mathcal{S}_I \ \text{such that} \ l_*(t, j) = l_I(t, j)
\]

In particular:

\[
h^{\geq}_{j|T,*}(t) = h_{j,I}(t) \qquad \forall j,\ \forall t
\]

4.7 Examples

In the following examples, p = 2.

Example 1. A continuously differentiable example

\[
S_*(t_1, t_2) = \exp\big[\, 1 - \alpha_1 t_1 - \alpha_2 t_2 - \exp\{\alpha_{12}(\alpha_1 t_1 + \alpha_2 t_2)\} \,\big]
\qquad \alpha_1, \alpha_2 > 0, \quad \alpha_{12} > -1
\]

Remark: $\tau_1$ and $\tau_2$ are not independent once $\alpha_{12} \neq 0$.


It may be checked that:

\[
h^{\geq}_{j|T,*}(t) = -\,D_j \ln S_*(t, t) = \alpha_j \big\{ 1 + \alpha_{12} \exp[\alpha_{12}(\alpha_1 + \alpha_2)t] \big\}
\]
\[
h_{j,*}(t) = -\,D_j \ln S_*(t e_j) = \alpha_j \big\{ 1 + \alpha_{12} \exp(\alpha_j \alpha_{12} t) \big\}
\]

Thus:

\[
S_{j,*}(t) = \exp\big\{ 1 - \alpha_j t - \exp(\alpha_{12} \alpha_j t) \big\}
\]

It may be checked that:

\[
S_*(t_1, t_2) \neq S_{1,*}(t_1)\, S_{2,*}(t_2)
\]

unless $\alpha_{12} = 0$. Therefore:

\[
l_*(t, j) = \alpha_j \big\{ 1 + \alpha_{12} \exp[\alpha_{12}(\alpha_1 + \alpha_2)t] \big\}
\exp\big\{ 1 - (\alpha_1 + \alpha_2)t - \exp[\alpha_{12}(\alpha_1 + \alpha_2)t] \big\}
\]

Consider now $S_I$ defined as follows:

\[
h_{j,I}(t) = \alpha_j \big\{ 1 + \alpha_{12} \exp[\alpha_{12}(\alpha_1 + \alpha_2)t] \big\}
\]
\[
S_{j,I}(t) = \exp\Big\{ \frac{\alpha_j}{\alpha_1 + \alpha_2} - \alpha_j t
- \frac{\alpha_j}{\alpha_1 + \alpha_2} \exp[\alpha_{12}(\alpha_1 + \alpha_2)t] \Big\}
\]
\[
S_{1,2,I}(t_1, t_2) = S_{1,I}(t_1)\, S_{2,I}(t_2)
\]

It may then be checked that:

\[
l_*(t, j) = l_I(t, j)
\]


Example 2. An absolutely continuous but not differentiable example

Johnson and Kotz (1972, p. 265) attribute the following example to Freund. Let the latent duration distribution be

\[
S_*(t_1, t_2) = [1 + \alpha_1(t_1 - t_2)] \exp[-(\alpha_1 + \alpha_2)t_1]\, 1\!\mathrm{I}_{\{t_1 > t_2\}}
+ [1 + \alpha_2(t_2 - t_1)] \exp[-(\alpha_1 + \alpha_2)t_2]\, 1\!\mathrm{I}_{\{t_1 \leq t_2\}}
\]

This survivor function is differentiable outside the main diagonal, i.e.,

\[
D_1 D_2 S_*(t_1, t_2) = f(t_1, t_2)
= \alpha_1(\alpha_1 + \alpha_2) \exp[-(\alpha_1 + \alpha_2)t_1]\, 1\!\mathrm{I}_{\{t_1 > t_2\}}
+ \alpha_2(\alpha_1 + \alpha_2) \exp[-(\alpha_1 + \alpha_2)t_2]\, 1\!\mathrm{I}_{\{t_1 < t_2\}}
\]

and it may be shown that $f(t_1, t_2)$ is a density, even if it is not continuous on the main diagonal. Therefore, in this model,

\[
P(\tau_1 = \tau_2) = 0
\]

We deduce from this that, for $a = 1, 2$,

\[
S_T(t) = \exp[-(\alpha_1 + \alpha_2)t]
\]
\[
S_a(t) = (1 + \alpha_a t) \exp[-(\alpha_1 + \alpha_2)t]
\]

By integration of the density, we obtain

\[
S(t, a) = \Big( 1 - \frac{\alpha_a}{\alpha_1 + \alpha_2} \Big) \exp[-(\alpha_1 + \alpha_2)t]
\]

We obtain the equivalent independent risks model as follows:

\[
dH^I_a(t) = -\,S_T(t)^{-1}\, dS(t, a) = (\alpha_1 + \alpha_2 - \alpha_a)\, dt
\]


Therefore

\[
S^I_1(t) = \exp[-\alpha_2 t] \qquad S^I_2(t) = \exp[-\alpha_1 t]
\]

Example 3. A distribution-free example with positive probability of ties

The following example illustrates the construction of an associated model with independent latent durations having the same sub-distributions, in a case where there is a positive probability of ties.

Let $(X_1, X_2, X_3)$ be three independent non-negative random variables with arbitrary distributions. Consider the following two competing risks models:

Model 1:
\[
\tau_1 = X_1 \wedge X_3 \qquad \tau_2 = X_2 \wedge X_3
\]
\[
T = \tau_1 \wedge \tau_2 \qquad A = 1\!\mathrm{I}_{\{\tau_1 \leq \tau_2\}}
\]

Model 2:
\[
\tau^*_1 = X_1 \wedge X_3 \qquad \tau^*_2 = X_2
\]
\[
T^* = \tau^*_1 \wedge \tau^*_2 \qquad A^* = 1\!\mathrm{I}_{\{\tau^*_1 \leq \tau^*_2\}}
\]

where, for instance, $X_1 \wedge X_3$ denotes $\min\{X_1, X_3\}$. Clearly $\tau_1 \not\perp\!\!\!\perp \tau_2$, whereas $\tau^*_1 \perp\!\!\!\perp \tau^*_2$. Let us show that the associated sub-distributions are identical.

(i)
\[
\{T > t, A = 1\} = \{X_1 \wedge X_3 > t,\ X_1 \wedge X_3 \leq X_2 \wedge X_3\}
= \{X_1 \wedge X_3 > t,\ X_1 \wedge X_3 \leq X_2\}
= \{\tau^*_1 > t,\ \tau^*_1 \leq \tau^*_2\} = \{T^* > t, A^* = 1\}
\]

(ii)
\[
\{T > t, A = 0\} = \{X_2 \wedge X_3 > t,\ X_1 \wedge X_3 > X_2 \wedge X_3\}
= \{X_2 \wedge X_3 > t,\ X_2 < X_1 \wedge X_3\}
= \{X_2 > t,\ X_2 < X_1 \wedge X_3\}
= \{\tau^*_2 > t,\ \tau^*_2 < \tau^*_1\} = \{T^* > t, A^* = 0\}
\]
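The set identities above can be confirmed by simulation: the two models generate the same observed sub-distributions. Taking the $X_i$ independent exponential anticipates the Marshall-Olkin special case discussed below; the rate values are arbitrary choices for the sketch:

```python
import math
import random

# Simulation check that Models 1 and 2 generate the same law of (T, A).
# The X_i are independent exponentials with rates l1, l2, l3 (arbitrary),
# so the closed form for P(T > t0, A = 1) of the exponential special case
# can be used as a common benchmark.
rng = random.Random(11)
l1, l2, l3 = 1.0, 0.5, 0.7

def draw(model):
    x1, x2, x3 = (rng.expovariate(l) for l in (l1, l2, l3))
    tau1 = min(x1, x3)
    tau2 = min(x2, x3) if model == 1 else x2    # Model 2: tau2* = X2
    return min(tau1, tau2), int(tau1 <= tau2)

def sub_survivor(model, t0, n=100_000):
    hits = 0
    for _ in range(n):
        t, a = draw(model)
        hits += (t > t0 and a == 1)
    return hits / n                             # empirical P(T > t0, A = 1)

t0 = 0.4
s1 = sub_survivor(1, t0)
s2 = sub_survivor(2, t0)
lam = l1 + l2 + l3
closed = (l1 + l3) / lam * math.exp(-lam * t0)  # exponential special case
```

Both empirical sub-survivors agree with each other and with the closed form, even though the latent durations are dependent in Model 1 and independent in Model 2.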


This example was introduced by Marshall and Olkin (1967) for the particular case where the $X_i$'s are independently distributed and $X_i$ has an exponential distribution with parameter $\lambda_i$ for $i = 1, 2, 3$. In that particular case,

\[
S_*(t_1, t_2) = \exp[-\lambda_1 t_1 - \lambda_2 t_2 - \lambda_3 \max(t_1, t_2)]
\]

and therefore

\[
S_T(t) = \exp[-\lambda_+ t] \qquad \text{where } \lambda_+ = \lambda_1 + \lambda_2 + \lambda_3
\]

Note that $S_*$ is continuous but not continuously differentiable (on the main diagonal); in particular

\[
P(\tau_1 = \tau_2) = \lambda_3\, \lambda_+^{-1}
\]

which is strictly positive unless $\lambda_3 = 0$; also, $\tau_1$ and $\tau_2$ are not independent unless $\lambda_3 = 0$. The sub-distributions are obtained as follows:

\[
S(t, a) = (\lambda_1 + \lambda_3)\, \lambda_+^{-1} \exp[-\lambda_+ t] \qquad a = 1
\]
\[
\phantom{S(t, a)} = \lambda_2\, \lambda_+^{-1} \exp[-\lambda_+ t] \qquad a = 0
\]

From (4.3) we obtain the compensators

\[
H^I_a(t^*) = (\lambda_1 + \lambda_3)\, t^* \qquad a = 1
\]
\[
\phantom{H^I_a(t^*)} = \lambda_2\, t^* \qquad a = 0
\]

and derive the survivor functions of the independent equivalent model

\[
S^I_a(t^*) = \exp[-(\lambda_1 + \lambda_3)\, t^*] \qquad a = 1
\]
\[
\phantom{S^I_a(t^*)} = \exp[-\lambda_2\, t^*] \qquad a = 0
\]


Thus

\[
S^I_*(t^*_1, t^*_2) = \exp[-(\lambda_1 + \lambda_3)\, t^*_1 - \lambda_2\, t^*_2]
\]

Clearly

\[
S^I_T(t) = S^I_*(t, t) = S_T(t) = \exp[-\lambda_+ t]
\]

Note also that

\[
P(A = a \,|\, \tau_a = t) = \exp[-\lambda_2 t] \qquad a = 1
\]
\[
\phantom{P(A = a \,|\, \tau_a = t)} = \frac{\lambda_2}{\lambda_2 + \lambda_3} \exp[-\lambda_1 t] \qquad a = 0
\]

In the particular case of exponential distributions, $T$ and $A$ are independent. Indeed

\[
P(A = a \,|\, T > t) = \frac{\lambda_1 + \lambda_3}{\lambda_+} \qquad a = 1
\]
\[
\phantom{P(A = a \,|\, T > t)} = \frac{\lambda_2}{\lambda_+} \qquad a = 0
\]


Chapter 5

Transitions models

5.1 Introduction

In this chapter we consider some basic issues in modelling "general" point processes. Thus, let $X_t$ be a stochastic process with finite state space and continuous time:

\[
X = \{X_t : t \geq 0\} \qquad X_t \in E = \{E_0, E_1, \ldots, E_p\}.
\]

By the "general" case, it is specifically meant that more than one transition is possible. For this reason, such models are also called "transition models". Figure 5.1 illustrates the general case, where the $\tau_j$'s are the stochastic times of change of state, i.e.:

\[
\tau_j = g(X) = \inf\{ t > \tau_{j-1} \,|\, X_t \neq X_{\tau_{j-1}} \}.
\]

Thus we also have a sequence of durations $T_j$, defined as the length of the time interval between two transitions, i.e. during which $X_t$ stays in the same state:

\[
T_j = \tau_j - \tau_{j-1}.
\]
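A trajectory of such a process can be sketched by alternating holding durations and jumps. The exponential holding times, the uniform choice of the next state, and the particular rates below are all assumptions made for the illustration; state 3 is treated as absorbing:

```python
import random

# Minimal sketch of a trajectory ((tau_k, X(tau_k)) : k): a finite-state
# jump process with exponential holding times (rate depending on the
# current state) and a uniformly chosen next state.  All numerical choices
# are illustrative assumptions; state 3 (standing for E_3) is absorbing.
rng = random.Random(5)
p = 3
hold_rate = {0: 1.0, 1: 2.0, 2: 0.5}           # hazard of leaving each state

def simulate_trajectory():
    t, state = 0.0, 0
    path = [(0.0, 0)]                           # (tau_0, X(tau_0)) = (0, E_0)
    while state != p:                           # stop at the absorbing state
        t += rng.expovariate(hold_rate[state])  # holding duration T_k
        state = rng.choice([s for s in range(p + 1) if s != state])
        path.append((t, state))                 # instant of transition, new state
    return path

path = simulate_trajectory()
```

The output is exactly the countable representation $((\tau_k, X(\tau_k)) : k)$ of the trajectory that is discussed in Section 5.2.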

Such models require data in the form of (individual) trajectories, either complete, i.e. the complete "life" of the process until it enters an "absorbing" state (possibly called a state of "death"), or incomplete but bearing on a "substantial" period of time. Such data are often referred to as "biographical data", and this field is accordingly also known as the "analysis of biographies".



Figure 5.1: A typical trajectory in the general case.

Typical sources of such data include:
(i) data from "retrospective" interviews, which involve specific problems of memory bias;
(ii) administrative data, which involve problems caused by the requirement of protecting privacy (problems of confidentiality), or by the fact that such data are typically not designed, at the origin, for the needs of econometric modelling.

When modelling transition processes, one should first be aware of the "regularity" conditions that make a (general) point process economically meaningful and mathematically manageable. Such conditions ensure the possibility of discrete representations of the trajectories, in spite of the fact that the process is defined in continuous time. This is the object of the next section. In Section 5.3, we introduce the representation of point processes by means of a multivariate counting process, a powerful tool for modelling and for operating with point processes. The last sections of the chapter introduce markovianity hypotheses that considerably improve the manageability of general point processes.

5.2 Basic Assumptions

Let us first consider a finite state space with $p + 1$ elements:

\[
E = \{E_0, E_1, \ldots, E_p\}
\]

This space is "abstract" in the sense that, in general, no mathematical structure is embodied (for instance, no addition or ratio of states is given meaning), although in some cases an ordering of the states might be introduced. For instance, $E_0$ is typically an "initial" state and $E_p$ might be a "final" or an "absorbing" state.

Next, the time space is a (nondegenerate) interval of the positive part of the real line:

\[
t \in I \qquad \text{where } I = [0, a] \ \text{or} \ I = I\!\!R_+ = [0, \infty)
\]

A trajectory, $x$, is a function defined on $I$ with values, indifferently written as $x_t$ or as $x(t)$, in $E$:

\[
x : I \to E : \forall t \in I \mapsto x_t \in E. \tag{5.1}
\]

The space of (all) trajectories is therefore:

\[
x \in E^I
\]

We may now define a (general) point process as follows. Let $(\Omega, \mathcal{A}, P)$ be a probability space; it may be heuristically interpreted as representing an "economic environment". A point process, $X$, is a (measurable) function defined on $\Omega$ and valued in the space of trajectories (endowed with the $\sigma$-field generated by the cylinders):

\[
X : \Omega \to E^I \qquad \forall \omega \in \Omega \mapsto X(\omega) = (X_t(\omega) : t \in I) \in E^I \tag{5.2}
\]

The measurability of $X$ only ensures that its probability law is well defined and may be derived from the probability measure $P$. But this is not enough to make the point process $X$ meaningful or operational. More specifically, we are interested in functions of the trajectories, such as, for instance:


(i) the first access time to a given state:

\[
A_{1,j} = \inf\{ t \,|\, X_t(\omega) = E_j \} \tag{5.3}
\]

(ii) the (total) duration of permanence in a given state:

\[
M_j = \int_{t \in I} 1\!\mathrm{I}_{\{X_t(\omega) = E_j\}}\, dt \tag{5.4}
\]

and, more generally:

(iii) the instants of transition:

\[
\tau_0 = 0 \qquad X(0) = E_0
\]
\[
\tau_1 = \inf\{ t \,|\, X(t) \neq E_0 \} \quad \cdots
\]
\[
\tau_k = \inf\{ t > \tau_{k-1} \,|\, X(t) \neq X(\tau_{k-1}) \} \quad \cdots
\]

(iv) the durations in a same state:

\[
T_0 = 0 \quad \cdots \quad T_k = \tau_k - \tau_{k-1} \quad \cdots
\]

with the obvious relationship:

\[
\tau_k = \sum_{0 \leq i \leq k} T_i \tag{5.5}
\]
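The functionals (i)-(iv) are straightforward to compute from the countable representation $((\tau_k, X(\tau_k)) : k)$ of a CADLAG piecewise-constant trajectory. The sample trajectory below is a made-up illustration, with integer labels $j$ standing for the states $E_j$ and the last state occupied up to a fixed horizon:

```python
# Computing the functionals (i)-(iv) from the countable representation
# ((tau_k, X(tau_k)) : k) of a piecewise-constant trajectory.  The sample
# trajectory and the horizon are made-up values for the illustration.
path = [(0.0, 0), (1.2, 2), (2.0, 1), (3.5, 0), (5.0, 3)]   # (tau_k, X(tau_k))
horizon = 6.0

taus = [tk for tk, _ in path]                                # instants of transition
durations = [taus[k] - taus[k - 1] for k in range(1, len(taus))]  # T_k

def first_access(path, j):
    """A_{1,j} = inf{t : X_t = E_j} (None if E_j is never visited)."""
    return next((tk for tk, s in path if s == j), None)

def occupancy(path, j, horizon):
    """M_j = total time spent in E_j up to the horizon."""
    ends = [tk for tk, _ in path[1:]] + [horizon]
    return sum(end - tk for (tk, s), end in zip(path, ends) if s == j)
```

By construction, summing the durations $T_k$ recovers the last transition instant $\tau_K$, the discrete counterpart of (5.5).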

Consider, however, the apparently simple case:

\[
X_t \sim \text{i.i.d.} \qquad P(X_t = E_j) = (p + 1)^{-1} \qquad j = 0, 1, \cdots, p.
\]

The probability law of such a process is perfectly defined: indeed, any of its finite-dimensional distributions is well defined and even very simple; nevertheless, its trajectories are pathological with probability one. For instance, on any nondegenerate interval of $I\!\!R_+$ the number of transitions is uncountable with probability one: indeed, between two different real numbers there is an uncountable infinity of real numbers. These pathologies of the trajectories make any of the above functions of interest nonmeasurable, and consequently not endowable with a probability law derived from the original probability measure $P$. Note also that these problems are not purely mathematical: they are of major concern for modelling; indeed, what would be, for instance, the economic meaning of a model describing an individual behaviour in such a way that the number of transitions on the labour market would be almost surely uncountable within any non-degenerate interval of time?

Clearly there is a need for some regularity conditions on the trajectories $X(\omega)$. Experience shows that a suitable condition is the following.

Fundamental Hypothesis. With probability one, the trajectories are CADLAG (Continue à Droite avec Limite à Gauche, or RCLL: Right-Continuous with Left Limits) functions of time.

This hypothesis may also be written as follows: with probability one and for any $t$,

\[
X(t^+) = X(t) \quad \text{and} \quad X(t^-) \ \text{exists}.
\]

There is a jump at $t$ when $X(t^-) \neq X(t)$; otherwise, the trajectory is continuous at $t$. It means, in particular, that when there is a jump at $t$, the value of $X(t)$ is unambiguously the value at its right-hand side. Thus the CADLAG hypothesis excludes a behaviour such as that illustrated in Figure 5.2. This hypothesis will be maintained throughout this monograph.

Remark. As suggested before, analytical operations in the time domain, such as the integration in (5.4), are a major concern in continuous-time processes. Thus a slightly different approach is provided by defining a point process, rather than by (5.2), as follows:

\[
X : \Omega \times T \longrightarrow E \tag{5.6}
\]

assumed to be measurable with respect to $\mathcal{A} \otimes \mathcal{B}$, where $\mathcal{B}$ denotes the Borel sets of $T$.

The fundamental, or CADLAG, assumption has properties crucial for the manageability of the resulting models, in particular:

• The number of transitions is at most countable, and the transitions form a well-ordered set. We may therefore write, for instance: $\{\tau_k : k \in I\!\!N\}$ or $\{T_k : k \in I\!\!N\}$.


Figure 5.2: A typical pathology excluded by the CADLAG hypothesis.

• The functions of $X$, $\tau_k$ and $T_k$, defined above are (jointly) measurable; their probability distributions are therefore well defined.

• $P[\tau_k > \tau_{k-1}] = 1$.

• $P[T_k > 0] = 1$.

As a consequence, the complete trajectories, defined on an uncountable time space, admit equivalent countable representations:

\[
(X_t : t \in I) \iff \big( (\tau_k, X(\tau_k)) : k \in I\!\!N \big)
\iff \Big( \big( T_k, X\big( \textstyle\sum_{0 \leq i \leq k} T_i \big) \big) : k \in I\!\!N \Big),
\]

where $\iff$ means equality of information (more formally, equality of the corresponding $\sigma$-fields).

When deriving the likelihood function of an observation, we also need the description of a trajectory on an interval $[0, t]$, where $t$ is a fixed number. Defining $K = \max\{ j \,|\, \tau_j \leq t \}$ (thus $K$ should actually be written as $K(t)$), we then obtain:


(Xα : α ∈ [0, t]) ⇐⇒ ((τ1, X(τ1)), · · · , (τK, X(τK)), 1I{τK+1 > t})   (5.7)

               ⇐⇒ ((T1, X(T1)), · · · , (TK, X(∑0≤i≤K Ti)), 1I{∑0≤i≤K+1 Ti > t})   (5.8)

Note that when X(τK) is an absorbing state, τK+1 may be defined as equal to ∞, i.e. the information 1I{τK+1 > t} becomes empty.

Warning. Dealing with stochastic processes in continuous time requires considerable mathematical sophistication, beyond the scope of these notes. The basic difficulty may be approached as follows. In continuous time, it is necessary but not sufficient that, for each t, X(t) be a well-defined random variable: the trajectory (Xt(ω) : t ∈ I), where I is an interval of IR+, should also be a.s. well behaved; this was the rationale for the CADLAG condition. More generally, pointwise reasoning on the time space (i.e. "for every t") is full of pitfalls, because the measurability requirements should be viewed on the product space I × Ω, where Ω is the fundamental universe of the underlying probability space. Thus, in many places, the reader will find, and possibly consider as frustrating, expressions like "under some regularity conditions" or "it may be proved that ..."; these statements should be interpreted as warnings that the issue is not trivial and that, although the underlying conditions are likely to be satisfied in "usual" situations, care is called for. The main concern of this text is to help the reader understand the basic ideas; for technical issues, we recommend consulting a knowledgeable probabilist.

5.3 Counting Processes

5.3.1 Motivation

One difficulty when dealing with point processes is that the state space has no mathematical structure: numbers are easier to deal with than abstract states. One possible way of describing a trajectory of a point process is to make a list of all possible transitions and then to count the number of each such transition as time proceeds. This is precisely the technique we now introduce. The benefits are multiple. This approach allows one to draw on the results of the theory of counting processes, a widely developed theory


with a wide range of applications. These results will eventually supply useful guidelines for modelling complex point processes.

5.3.2 Definition of a counting process

Heuristically, a counting process, denoted N(t), counts the number of realizations of a well-specified event during the time interval [0, t]. Formally, a counting process is a continuous-time stochastic process valued in IN:

N : Ω → IN^I, or (N(t) ∈ IN : t ∈ I),

where I is an interval of the positive part of the real line, I = [0, a] or I = IR+, such that:

• N(t) ∈ IN = {0, 1, 2, · · ·}

• The trajectories are almost surely CADLAG, i.e. with probability 1:

N(t−) = lim u↑t N(u) exists,   N(t+) = lim u↓t N(u) = N(t)

• N(0) = 0

• N(t) − N(t−) ∈ {0, 1}

Thus a counting process is a process whose trajectories are non-decreasing with unit jumps: the realization of two events at the same time is therefore excluded. Figure 5.3 displays a typical trajectory of a counting process.

The CADLAG property of the trajectories of a counting process ensures that these trajectories may be represented, without loss of information, by the (countable) sequence of the instants of the jumps:

τ0 = 0,   τk = inf{t > τk−1 | N(t) − N(t−) = 1}

or, equivalently, by the sequence of the durations between two jumps:

T0 = 0,   Tk = τk − τk−1

Because the jumps of N(t) are only of unit height, we also have :

N(τk) = k (5.9)


Figure 5.3: A typical trajectory of a counting process.

i.e. the state N(t) ∈ IN is given by the label of the most recent transition time τk.
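As a concrete illustration of these equivalent representations, the following sketch (with hypothetical jump instants) converts between the jump times τk, the durations Tk and the counting function N(t), and exhibits the property N(τk) = k of (5.9):

```python
# Illustrative sketch: equivalence between the jump instants tau_k, the
# durations T_k and the counting function N(t). Jump times are hypothetical.

def durations_from_jumps(taus):
    """T_k = tau_k - tau_{k-1}, with tau_0 = 0."""
    out, prev = [], 0.0
    for tau in taus:
        out.append(tau - prev)
        prev = tau
    return out

def N(t, taus):
    """N(t) = number of jumps in [0, t], cf. (5.10)."""
    return sum(1 for tau in taus if tau <= t)

taus = [0.7, 1.9, 3.2]                 # hypothetical jump instants
Ts = durations_from_jumps(taus)        # durations between jumps
labels = [N(tau, taus) for tau in taus]   # N(tau_k) = k, as in (5.9)
```

Either list (jump times or durations) carries the full information of the trajectory, as stated above.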

Remark. From (5.9), the counting process N(t) may also be represented by its sequence of instants of jumps:

N(t) = ∑k≥1 1I[0,t](τk)   (5.10)

and, from (5.10), a counting process may also be approached as a randommeasure, through the following equivalent representation:

N(t) = ∑τk≤t δτk   (5.11)

where δτk stands for the probability measure giving unit mass to the point τk.


5.3.3 Modelling a univariate counting Process

Introduction

Let us first notice that the simple, or "death", process considered in Chapter 2 is equivalent to a counting process with a unique jump: this corresponds to representing, at any time t, the states E0 and E1 by 0 and 1 respectively. We now go one step further and consider the modelling of a univariate counting process with more than one jump.

From (5.9), the law of a counting process is equivalently specified in terms of the distribution of the sequence of instants of jumps or of the sequence of time intervals between two consecutive jumps (the durations). This suggests a first approach to specifying the law of a counting process: specify either one of the following sequences of conditional survivor functions:

Sτk(t | τ1, · · · , τk−1)

STk(t | T1, · · · , Tk−1)

These sequences are obviously related through τk = ∑1≤i≤k Ti and:

Sτk(t | τ1, · · · , τk−1) = STk(t − ∑1≤i≤k−1 Ti | T1, · · · , Tk−1)

STk(t | T1, · · · , Tk−1) = Sτk(t + τk−1 | τ1, · · · , τk−1)

In particular, if the sequence of durations is i.i.d., the process is called arenewal process.

When deriving the likelihood function relative to the observation of a counting process on an interval [0, t], where t is a fixed number (> 0), one may denote by K the number of jumps up to time t, N(t) = K, and represent the trajectory on the interval [0, t] as (see also (5.8)):

(T1, T2, · · · , TK : ∑1≤k≤K Tk ≤ t < ∑1≤k≤K+1 Tk);   (5.12)

the data density may accordingly be written as:

l(t) = [∏1≤k≤K fT,k(Tk | T1, · · · , Tk−1)] ST,K+1(t − τK | T1, · · · , TK) 1I{τK ≤ t < τK+1}   (5.13)


or, in terms of the hazard rates:

l(t) = ∏1≤k≤K hT,k(Tk | T1, · · · , Tk−1) exp[−∫_0^{Tk} hT,k(u | T1, · · · , Tk−1) du]

       × exp[−∫_0^{t−τK} hT,K+1(u | T1, · · · , TK) du]   (5.14)
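As a hedged illustration of (5.13)-(5.14), the sketch below evaluates the log of the data density for a renewal process with i.i.d. exponential durations, an assumed toy specification in which f, S and h have closed forms; with a constant hazard λ, the expression collapses to K ln λ − λt:

```python
import math

# Sketch: log data density (5.13)/(5.14) for a renewal process with i.i.d.
# Exp(lam) durations (toy specification): f(u) = lam*exp(-lam*u),
# S(u) = exp(-lam*u), so the hazard is constant, h(u) = lam.

def loglik_renewal_exp(durations, t, lam):
    """Completed spells contribute f(T_k); the last, censored spell
    contributes S(t - tau_K), as in (5.13)."""
    tau_K = sum(durations)              # instant of the K-th jump
    ll = sum(math.log(lam) - lam * d for d in durations)
    ll += -lam * (t - tau_K)            # censored spell on (tau_K, t]
    return ll

durs = [0.4, 1.1, 0.7]                  # hypothetical observed durations
ll = loglik_renewal_exp(durs, 3.0, 2.0)
# With constant hazard, this equals K*log(lam) - lam*t, cf. (5.14).
```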

The Doob-Meyer Decomposition

Another, and also natural, idea for modelling a counting process is to rely on its dynamic properties, i.e. to model the law of the future conditional on the past of the process, by decomposing the process as a sum of two processes: one that depends only on its past, and another that is "innovative" in a sense to be made precise. This is the basic idea underlying the celebrated Doob-Meyer decomposition.

We first rewrite the history (5.12) of the process up to time t in σ-algebraic terms, to be denoted FN(t), as follows:

FN(t) = σ(N(u) : 0 ≤ u ≤ t)
      = σ(τ1 < τ2 < · · · < τK ≤ t, 1I{τK+1 > t})
      = σ(T1, T2, · · · , TK | ∑1≤i≤K Ti ≤ t, 1I{TK+1 > t − ∑1≤i≤K Ti})

Thus, as t increases, FN(t) represents an increasing structure of information;this is a filtration :

t′ > t =⇒ FN(t′) ⊃ FN(t)

Furthermore, under rather general regularity conditions, this filtration is continuous to the right:

FN(t+) := lim u↓t FN(u) = FN(t)

The left limit

FN(t−) := lim u↑t FN(u)

will be called the (past) history of the process.

The Doob-Meyer decomposition may be motivated by recalling a basic feature of an autoregressive model in discrete time. There, we consider a sequence of random variables yt with t ∈ IN or t ∈ ZZ. At time t, the


past history of the process, FY (t−), may be represented as FY (t−) = σ(yt−1, yt−2, · · · ). When we write an autoregressive model of order p, i.e. AR(p):

yt = α1 yt−1 + α2 yt−2 + · · · + αp yt−p + εt εt ∼ i.i.d.

we have actually decomposed the process yt as the sum of two processes :

yt = E[yt|FY (t−)] + εt

where the first term :

E[yt|FY (t−)] = α1 yt−1 + α2 yt−2 + · · · + αp yt−p

is the "projection on the past" of the process whereas the second term εt is the"innovation" of the process. This is exactly the basic idea of the celebratedDoob-Meyer decomposition that adapts such a decomposition to continuoustime processes.

Doob-Meyer decomposition. Any counting process N(t) may be decomposed into:

N(t) = HN(t) + M(t)

where HN(t), the compensator, is the projection of N(t) on its past, and the residual term M(t) = N(t) − HN(t) is a martingale adapted to FN(t), the history of N(t), i.e.:

E[M(t) |FN(s)] = M(s) ∀s ≤ t

This decomposition is almost surely unique. The process HN(t) has non-decreasing CADLAG trajectories (with probability one) and starts at 0: HN(0) = 0 (almost surely). Notice that the counting process N(t) is integer-valued whereas HN(t) and M(t) are real-valued, but their sum is integer-valued.

Under (several) conditions of regularity and smoothness, we may define hN(t), the (stochastic) intensity of the counting process N, as follows:

hN(t) = lim ε↓0 (1/ε) P[N(t + ε) − N(t) = 1 | FN(t−)]


Heuristically, the intensity of the process measures the propensity of the process to jump at time t, as a function of, i.e. conditionally on, its history up to time t−. As a matter of fact, this is the only information to be retained from the past. When the intensity of the process is well defined, its cumulative version provides an integral representation of the compensator of the counting process:

HN(t) = ∫[0, t] hN(u) du

Remark. Exactly as for the density of a probability measure, the existence of the stochastic intensity requires some continuity conditions, but its "integrated" version, the compensator, always exists, as a result of the Doob-Meyer decomposition of a counting process.

Under suitable regularity conditions, this decomposition may also be written in differential terms:

dN(t) = hN (t) dt + dM(t)

where dM(t), called the "innovation", is a centered process, i.e.

E[dM(t) |FN(t−)] = E[dM(t)] = 0

Similarities with an AR(1) process in discrete time may be illustrative. Indeed, the model:

yt = α yt−1 + εt εt ∼ i.i.d.

may be represented as :

yt = α y0 + (α − 1) ∑i=1^{t−1} yt−i + ∑i=1^{t} εi

The correspondence with the Doob-Meyer decomposition may be viewed as follows:

N(t) → yt
HN(t) → α y0 + (α − 1) ∑i=1^{t−1} yt−i
M(t) → ∑i=1^{t} εi


The AR(1) process may also be written as :

∆yt = (α − 1)yt−1 + εt

and the correspondence with the Doob-Meyer decomposition, in differential terms, may be viewed as follows:

dN(t) → ∆yt
hN(t) dt → E[∆yt | FY (t−)] = (α − 1) yt−1
dM(t) → εt

This analogy clearly shows that the martingale term M(t) does not correspond to the (instantaneous) innovation but to the cumulated innovations: its differential, when it exists, represents the (instantaneous) innovation.
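This correspondence can be checked numerically; the sketch below (with illustrative parameter values) simulates an AR(1) path and verifies that yt is exactly the sum of the "compensator" term α y0 + (α − 1) ∑i=1^{t−1} yt−i and the cumulated innovations ∑i=1^t εi:

```python
import random

# Numerical check of the AR(1) analogy:
# y_t = alpha*y_0 + (alpha - 1)*sum_{i=1}^{t-1} y_{t-i} + sum_{i=1}^{t} eps_i.
# alpha, y_0 and the horizon T are illustrative choices.
random.seed(0)
alpha, T = 0.6, 50
y, eps = [1.0], [0.0]                   # y[0] = y_0; eps[0] unused
for _ in range(T):
    e = random.gauss(0.0, 1.0)
    eps.append(e)
    y.append(alpha * y[-1] + e)

H_T = alpha * y[0] + (alpha - 1) * sum(y[1:T])   # "compensator" part
M_T = sum(eps[1:T + 1])                          # cumulated innovations
```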

In the particular case of a counting process, HN(t) and M(t) equivalently characterize the law of the process in the following sense. In differential terms, we have:

V [hN(t) |FN(t−)] = 0

E[dM(t) |FN(t−)] = 0

which implies that the process is locally poissonian in the sense that :

E[dN(t) |FN(t−)] = V [dN(t) |FN(t−)] = hN(t) dt

As a consequence, the specification of the law of a counting process boils down to the specification of its compensator (which always exists in an almost surely unique version) or of its stochastic intensity (which exists under some regularity conditions).

Particular case: a Poisson process. This simple case helps to understand the extent to which the compensator, or the stochastic intensity, completely characterizes the law of a counting process. Remember that a Poisson stochastic process has independent increments (i.e. Ns − Nr ⊥⊥ Nu − Nt for any r < s < t < u) and each increment Ns − Nr is distributed as a Poisson variable with parameter µ(]r, s]), where µ is an arbitrary measure on IR+. If the measure µ is diffuse with a density m(t) with respect to the Lebesgue measure on IR+, it may be checked that:

HN(t) = µ([0, t]) hN(t) = m(t)


Figure 5.4: Typical trajectory of a "Death process", or of a simple counting process.

The Poisson process is homogeneous when the measure µ is invariant under translations on IR+ and, therefore, proportional to the Lebesgue measure on IR+: µ(]r, s]) = λ (s − r), where λ is an arbitrary (strictly) positive constant. In such a case, it may be checked that:

HN(t) = λ t hN(t) = λ

This particular case illustrates the fact that the problem of the existence of a stochastic intensity is of the same nature as that of the existence of a probability density.
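The homogeneous case can be checked by simulation; the sketch below (rate, horizon and replication count are illustrative) generates Poisson trajectories from i.i.d. Exp(λ) durations and verifies that the empirical mean of N(t) is close to the compensator HN(t) = λt:

```python
import random

# Monte Carlo sketch: for a homogeneous Poisson process, E[N(t)] = H_N(t)
# = lam * t. lam, t and the number of replications are illustrative.
random.seed(1)
lam, t, n_paths = 2.0, 5.0, 20000

def n_jumps(lam, t, rng=random):
    """Simulate N(t) by accumulating i.i.d. Exp(lam) durations up to t."""
    total, k = 0.0, 0
    while True:
        total += rng.expovariate(lam)
        if total > t:
            return k
        k += 1

mean_N = sum(n_jumps(lam, t) for _ in range(n_paths)) / n_paths
```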

Modelling a counting process through the Doob-Meyer Decomposition

Consider now a simple process, i.e. a counting process with a unique jump. As mentioned before, its trajectories are equivalent to those of a death process, after identifying the state Ej of the death process with the value j of the counting process, for j = 0, 1. This is illustrated in Figure 5.4.

The stochastic intensity of the simple counting process, hN(t), is a stochastic process determined by the past history of the process, FN(t−), and by the


hazard rate of the duration before the first jump, namely hT (t), as follows:

hN(t) = hT (t) 1I[0,T](t) = hT (t) [1 − N(t−)]

where the second factor of the right-hand side, [1 − N(t−)], indicates that the (unique) jump is still possible at time t, i.e. that the process is "at risk" for the (only possible) transition from E0 to E1; this factor makes precisely the difference between the hazard rate of the duration, a deterministic function of the time t, and the stochastic intensity of the process, a stochastic process.

Thus, for a general (univariate) counting process, the stochastic intensity is written as:

hN(t) = ∑k≥1 hT,k(t − τk−1 | T1, T2, · · · , Tk−1) 1I{τk−1 < t ≤ τk}   (5.15)

where, in the left-hand side of

hT,k(t − τk−1 | T1, T2, · · · , Tk−1) 1I{τk−1 < t ≤ τk} = hT,k(t − τk−1 | FN(t−)),   (5.16)

the first factor represents the hazard rate of the k-th duration conditional on the past history of the process up to time t−, whereas the second factor means that the process is "at risk" for the k-th jump at time t. In particular, at the random point τk, we have:

hN (τk) = hT,k(Tk | T1, T2, · · · , Tk−1)

Inserting (5.15) into (5.14), we may write the data density relative to a trajectory on an interval [0, t] as follows:

l(t) = ∏1≤k≤K hN(τk) × exp −[ ∑1≤k≤K ∫_{τk−1}^{τk} hT,k(u − τk−1 | T1, · · · , Tk−1) du

                              + ∫_{τK}^{t} hT,K+1(u − τK | T1, · · · , TK) du ]

     = ∏1≤k≤K hN(τk) × exp −[ ∫_0^t hN(u) du ]

     = ∏1≤k≤K hN(τk) × exp[−HN(t)]   (5.17)


or, equivalently:

ln l(t) = ∫_0^t ln hN(u) dN(u) − ∫_0^t hN(u) du   (5.18)

where use has been made of the fact that

∑k|τk≤t ln hN(τk) = ∑1≤k≤K ln hN(τk) = ∫_0^t ln hN(u) dN(u)   (5.19)

once the counting process N(t) is viewed as a random measure, see (5.11).
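For a constant intensity hN(u) = λ, the integral with respect to dN in (5.18) picks up ln λ at each of the K jumps and the compensator term is λt, so ln l(t) = K ln λ − λt; maximizing in λ yields the estimator λ̂ = K/t. A minimal sketch, with hypothetical jump instants:

```python
import math

# Sketch of (5.18) with constant intensity h_N(u) = lam:
# ln l(t) = N(t)*log(lam) - lam*t. The observed jump instants are hypothetical.

def loglik_constant(jump_times, t, lam):
    return len(jump_times) * math.log(lam) - lam * t

taus, t = [0.3, 1.2, 2.4, 2.9], 4.0
lam_hat = len(taus) / t        # maximiser of lam -> K*log(lam) - lam*t
```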

5.4 Representation of a Point Process through a Counting Process

Let us now come back to a point process, X(t), with state space E = {E0, E1, · · · , Ep}, and describe the set, C, of possible transitions:

C = {(Ei, Ej) ∈ E² | Ei ≠ Ej}

Thus, given that a "transition" from one state into itself is not allowed, wehave: #(C) = p(p+ 1). For any c ∈ C, let us define the counting process:

N c(t) = the number of transitions c in [0, t]

Finally, we define the p(p+ 1)-dimensional vector counting process :

N(t) = [N c(t)] c ∈ C

Clearly, the information provided by the point process X(t) is equivalent to the information provided by the multivariate counting process N(t); more precisely, they generate the same σ-fields and we have:

Proposition X(t) ⇐⇒ N(t)

Therefore, when referring to the trajectory or to the history of a point process, we shall use indifferently FX(t) or FN(t).


Let us now consider a particular transition from Ei to Ej (i ≠ j), c = (i, j) ∈ C, and its associated counting process N c(t). The stochastic intensity of that transition may be written as:

hij(t) = qij(t) Yi(t−) (5.20)

where

Yi(t−) = 1I{X(t−) = Ei}   (5.21)

indicates whether the individual is "at risk" for a transition starting from Ei, and qij(t) represents the intensity of the transition from i to j, conditionally on the past of the process:

qij(t) = lim ε↓0 (1/ε) P[X(t + ε) = Ej | X(t) = Ei, FN(t−)]   (5.22)

Heuristically, qij(t) may be viewed as an "instantaneous probability" of the transition from i to j, conditionally on being in Ei at the instant t and conditionally on the past of the process; thus qij(t) is actually a compact (but possibly misleading!) way of writing a function of (Xu : 0 ≤ u < t).

Using (5.18) for each possible transition, we may now write the log-data density relative to a trajectory on an interval [0, t] as follows:

ln l(t) = ∑(i,j)∈C [ ∫_0^t ln hij(u) dNij(u) − ∫_0^t hij(u) du ]   (5.23)

5.5 Markov Processes

5.5.1 Basic Ideas

A basic issue in the dynamic modelling of a point process is the extent to which the law of the future of the process depends on the history, namely FN(t−). Without further restriction, the structure of this history becomes more and more complex as time proceeds, and modelling becomes intractable. The most frequent restriction is to assume that the process satisfies a Markovian condition. In this section we examine the definition of such a condition and its implications for modelling. In the next sections, we consider substantially more general conditions, yet leading to manageable models.

Remember that a point process is a process in continuous time, but the law of the process is characterized, in a unique way, by the (projective) system of all its finite-dimensional distributions. It is convenient to express the condition of markovianity as a restriction on that characterization.


5.5.2 Characterization of a Markovian Point Process

Definition. The point process (X(t) : t ∈ I) is markovian if and only if

∀n, ∀ t1 < t2 < · · · < tn : X(tn) ⊥⊥ (X(t1), X(t2), · · · , X(tn−1)) | X(tn−1)   (5.24)

This means that the law of X(tn) conditionally on any finite collection ofpast realizations depends only on the most recent one.

Under regularity conditions, the law of a markovian point process is char-acterized by the matrix of transition probabilities, defined as follows:

P (s, t) = [pi,j(s, t)] = [P (X(t) = Ej |X(s) = Ei)] s < t

The CADLAG property of the trajectories of X(t) implies that:

lim t↓s pi,j(s, t) = 1I{i = j}

or, in matrix terms:

lim t↓s P(s, t) = I

The matrix of intensities Q(t) = [qi,j(t)] takes the following form:

qi,j(t) = lim ε↓0 (1/ε) P[X(t + ε) = Ej | X(t) = Ei, F(t−)],   j ≠ i, by definition

        = lim ε↓0 (1/ε) P[X(t + ε) = Ej | X(t) = Ei],   j ≠ i, by markovianity

        = lim ε↓0 (1/ε) pi,j(t, t + ε),   j ≠ i

        = (d/dt) pi,j(s, t)|t=s,   j ≠ i   (5.25)

qi,i(t) = − ∑j≠i qi,j(t)   (5.26)

The definition (5.26) of the diagonal terms ensures that the sum of each row vanishes:

Q(t) e+ = 0   (5.27)

where e+ denotes the vector of ones.

Remember that the stochastic intensity, at time t, of a counting process, or of a point process, is non-zero only if the jump, or the transition, is "at risk",


i.e. is possible. Thus we use again a binary variable, Yi(t−), indicating that a transition from Ei is "at risk", and write the stochastic intensity of the transition from Ei to Ej as before, in (5.20):

hij(t) = qi,j(t) Yi(t−)

with the only difference that now qi,j(t) depends on F(t−) only through the value of t, thanks to the Markovian property; i.e. qi,j(t) is a deterministic function of time.

Taking into account that dNij(u) = 0 when Yi(u−) = 0, i.e. when a transition exiting from Ei is not at risk, we derive from (5.23) the log-data density for a Markovian process observed on an interval [0, t]:

ln l(t) = ∑(i,j)∈C [ ∫_0^t ln(qij(u)) dNij(u) − ∫_0^t qij(u) Yi(u−) du ]   (5.28)

The probability of visiting state Ei at time t:

πi(t) = P [X(t) = Ei] (5.29)

may be evaluated, in the case of two states E0 and E1, as follows:

π0(t) = exp{−∫_0^t [q01(u) + q10(u)] du} {π0(0) + ∫_0^t q10(s) exp(∫_0^s [q01(u) + q10(u)] du) ds}   (5.30)

π1(t) = 1 − π0(t)   (5.31)

5.5.3 Homogeneous Markov Process

The condition of homogeneity means that the probabilities of transition remain invariant under a translation of time. Formally, we have:

Definition. A markovian point process is homogeneous if:

pij(s, t) = pij(0, t − s) =def pij(t − s)

The matrix of transition probabilities is accordingly written as P(t) instead of P(s, t). The condition of CADLAG trajectories implies that:

lim t↓0 pij(t) = 1I{i = j}

and therefore:

lim t↓0 P(t) = I

Similarly, the intensities of the transitions no longer depend on t:

qij(t) = (d/dt) pij(t)|t=0 = qij,   j ≠ i,   and therefore Q(t) = Q

Therefore, the stochastic intensity hij(t) = qi,j Yi(t−) depends on t only through Yi(t−), i.e. through the fact that the transition from i to j is at risk. The structure of such processes is eventually very simple, as the memory of such a process reduces to the state where the process actually is.

Under suitable smoothness conditions, it may be shown that, in a homogeneous markovian point process, the two matrices P(t) and Q are related by the following differential equation:

(d/dt) P(t) = Q P(t)   (5.32)

Let us define the exponential function with a matrix argument in terms of the series expansion of the exponential function, i.e.

e^{tQ} = ∑k≥0 Q^k t^k / k!

The solution of the differential equation (5.32) may then be written as:

P(t) = e^{tQ}   (5.33)

which implies that the transition probabilities P(t) depend on Q and t only.

Exercise. Check that (5.33) is indeed a solution to the differential equation(5.32).

From (5.28), the log-data density for a homogeneous Markovian process observed on an interval [0, t] may be written as:

ln l(t) = ∑(i,j)∈C [ ∫_0^t ln(qij) dNij(u) − ∫_0^t qij Yi(u−) du ]

        = ∑(i,j)∈C [ Nij(t) ln(qij) − qij Tit ] + const.   (5.34)


where Tit denotes the total duration in state Ei during the interval [0, t]:

Tit = ∫_0^t Yi(u−) du = ∫_0^t 1I{X(u−) = Ei} du   (5.35)
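Maximizing (5.34) with respect to qij gives Nij(t)/qij − Tit = 0, i.e. the maximum-likelihood estimator q̂ij = Nij(t)/Tit. The sketch below, with a hypothetical trajectory coded as (state entered, entry time) pairs, computes these sufficient statistics:

```python
# Sketch: sufficient statistics N_ij(t) and T_it of (5.34) from one observed
# trajectory, and the resulting MLE q_ij = N_ij(t) / T_it. Data are hypothetical.
path = [(0, 0.0), (1, 0.8), (0, 2.5), (1, 3.1)]   # (state entered, entry time)
t_obs = 4.0

N, T = {}, {}
for (si, ti), (sj, tj) in zip(path, path[1:]):
    N[(si, sj)] = N.get((si, sj), 0) + 1          # transition counts N_ij(t)
    T[si] = T.get(si, 0.0) + (tj - ti)            # time spent in state si
last_state, last_time = path[-1]
T[last_state] = T.get(last_state, 0.0) + (t_obs - last_time)

q_hat = {c: N[c] / T[c[0]] for c in N}            # MLE per observed transition
```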

Properties of a Homogeneous Markovian point Process (HMPP)

Remember that the trajectories of a point process may be represented by the countable sequence ((Tk, X(τk)) : k ∈ IN) where

τk = ∑1≤i≤k Ti

It is convenient to simplify the notation as follows:

Xk = X(τk)

The properties of a Homogeneous Markovian point Process may now be described as follows:

1. The process ((Tk, Xk) : k ∈ IN) is markovian, i.e.

(Tk, Xk) ⊥⊥ (T1^{k−1}, X1^{k−1}) | (Tk−1, Xk−1)

where T1^{k−1} = (T1, T2, · · · , Tk−1) and similarly for X1^{k−1}.

2. At the k-th episode, the joint law of the duration Tk and the state of exit Xk depends only on the state where the process is:

(Tk, Xk) ⊥⊥ Tk−1 | Xk−1

3. Within each episode, the duration Tk and the state of exit Xk are independent:

Tk ⊥⊥ Xk | Xk−1

4. The process (Xk : k ∈ IN) is markovian, and its law is characterized by:

P[Xk = Ej | Xk−1 = Ei] = qij / ∑j′≠i qij′,   i ≠ j
                       = 0,   i = j   (5.36)

5. The durations are exponentially distributed:

(Tk | Xk−1 = Ei) ∼ Exp(λi),   where λi = ∑j≠i qij

Note that the condition Xk−1 = Ei is equivalent to Yi(τk−1−) = 1.
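Properties 4 and 5 directly suggest a simulation scheme for an HMPP: in state Ei, draw an Exp(λi) duration, then draw the exit state with probabilities qij/λi. A minimal sketch, with an illustrative intensity matrix:

```python
import random

# Sketch: simulate an HMPP via its embedded chain (property 4) and exponential
# durations (property 5). The off-diagonal intensities Q are illustrative.
random.seed(2)
Q = {0: {1: 0.5, 2: 0.5}, 1: {0: 1.0}, 2: {0: 2.0}}

def simulate_hmpp(Q, x0, t_end, rng=random):
    t, x, path = 0.0, x0, [(x0, 0.0)]
    while True:
        lam_i = sum(Q[x].values())            # lambda_i = sum_{j != i} q_ij
        t += rng.expovariate(lam_i)           # T_k | X_{k-1} = i ~ Exp(lambda_i)
        if t > t_end:
            return path
        states, weights = zip(*Q[x].items())
        x = rng.choices(states, weights=weights)[0]   # exit state prop. to q_ij
        path.append((x, t))

path = simulate_hmpp(Q, 0, 100.0)
```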


5.5.4 Stationary HMPP

In a Homogeneous Markov Point Process, the probabilities of transitions do not depend on the time origin. Let us, however, look at the (marginal) probability that, at time t, the process is in state Ej:

πj(t) = P [X(t) = Ej] π(t) = (π0(t), π1(t), · · · , πp(t))′ ∈ Sp (5.37)

Thus, π(0) is accordingly called the "initial distribution". Now, (5.37) maybe evaluated as follows:

πj(t) = ∑0≤i≤p πi(0) P(X(t) = Ej | X(0) = Ei),   i.e.   π(t)′ = π(0)′ P(t)   (5.38)

Thus, π(t) depends on both π(0) and P(t). A natural question is: under which conditions does the influence of the initial distribution tend to vanish? This is the object of the next definition.

Definition. A Homogeneous Markov Point Process is stationary if there exists a probability vector π = (π0, π1, · · · , πp)′ with ∑ πi = 1 such that:

lim t↑∞ pij(t) = πj,   i.e.   lim t↑∞ P(t) = e+ π′,   with e+ = (1, · · · , 1)′ ∈ IR^{p+1}

In such a case, π is called the stationary distribution and we have :

π′ = π′ P (t) ∀t ∈ IR+ (5.39)

π′Q = 0 (5.40)

5.5.5 Some examples of Markovian processes

Let us briefly sketch two simple examples of Markovian processes. Each one has two states, to be interpreted, for illustrative purposes, as i = 0 corresponding to unemployment and i = 1 corresponding to employment. The first example is a homogeneous process whereas the second one is not. These examples illustrate how the matrix of intensities completely characterizes the law of a Markovian process.

Example 1. Two states. Homogeneous

Let the matrix of intensities be:

Q = [ −α    α ]  =  [  α ] (−1  1)   (5.41)
    [  β   −β ]     [ −β ]


i.e. α represents the intensity of the transition from unemployment to employment, whereas β represents the intensity of the reverse transition. We may then solve (5.32), using (5.33), and obtain the matrix of transition probabilities:

P(t) = 1/(α + β) [ β + α e^{−(α+β)t}     α (1 − e^{−(α+β)t}) ]
                 [ β (1 − e^{−(α+β)t})    α + β e^{−(α+β)t}  ]   (5.42)

Exercise. In order to check (5.42), for the case (5.41), check that:

Q² = −(α + β) Q,  and therefore  Q^k = −(α + β)^{−1} [−(α + β)]^k Q,  k ≥ 1,

P(t) = I − [1/(α + β) (e^{−(α+β)t} − 1)] Q

We may now evaluate the (marginal) probability that the individual is unemployed, p0(t), or employed, p1(t), at the instant t:

p0(t) = β/(α + β) + [p0 − β/(α + β)] e^{−(α+β)t}

p1(t) = α/(α + β) + [1 − p0 − α/(α + β)] e^{−(α+β)t}

where p0 = π0(0) is the probability that the individual is unemployed at the initial time t = 0. We may now derive the limit of this distribution, i.e. the stationary distribution:

π0 = lim t→∞ π0(t) = β/(α + β)

π1 = lim t→∞ π1(t) = α/(α + β)

Exercise Check (5.39) and (5.40) in this example.

Finally, the durations are i.i.d. exponential:

(Tk | Xk−1 = 0) ∼ Exp(α),   (Tk | Xk−1 = 1) ∼ Exp(β)
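These closed-form results can be checked numerically against (5.33): the sketch below (with illustrative values of α and β) approximates e^{tQ} by its truncated power series and compares it with the expression P(t) = I − [(e^{−(α+β)t} − 1)/(α + β)] Q from the exercise above:

```python
import math

# Numerical check of P(t) = exp(tQ) for the two-state intensity matrix (5.41).
# alpha, beta and t are illustrative values.
alpha, beta = 0.5, 1.5
Q = [[-alpha, alpha], [beta, -beta]]
I2 = [[1.0, 0.0], [0.0, 1.0]]

def mat_mult(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expm(Q, t, terms=60):
    """Truncated series sum_{k>=0} (tQ)^k / k! (adequate for moderate t)."""
    P, term = [row[:] for row in I2], [row[:] for row in I2]
    for k in range(1, terms):
        term = mat_mult(term, [[q * t / k for q in row] for row in Q])
        P = [[P[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return P

t = 1.0
P1 = expm(Q, t)
f = (math.exp(-(alpha + beta) * t) - 1.0) / (alpha + beta)
closed = [[I2[i][j] - f * Q[i][j] for j in range(2)] for i in range(2)]
```

As t grows, the rows of expm(Q, t) converge to the stationary distribution solving π′Q = 0.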


Example 2. Two states. Non-Homogeneous

That the Markov process is not homogeneous means that the intensities qij(t) are actually functions of time. A typical use of non-homogeneous Markov processes arises with time-dependent exogenous variables; indeed, qij(t) is a compact notation for a function of time t, of the exogenous variables and of the past history FN(t−) (which may be summarized by X(t−) = Ei in the Markovian case). The simplest example is the following. Suppose we want to introduce the idea that the age of the individual, say z(t), affects the intensity of a transition from i to j, and write z(t) = t + A0, where A0 is the age of the individual at t = 0; in such a case, we obtain:

qij(t) = fij(t + A0)

and, with a log-linear specification:

qij(t) = exp{αij + βij (t + A0)}

i.e. a function of t and of the explanatory variable A0.

Let us now consider a trajectory starting in state X0 and observed during an interval [0, t], and denote by N+(t) the total number of transitions during that interval:

N+(t) = ∑(ij)∈C Nij(t)

That trajectory may be represented as [X0, ((Tk, X(τk)) : 1 ≤ k ≤ N+(t))], where Tk represents the duration of the k-th episode and X(τk) is the state entered at the end of the k-th episode. The age at the k-th transition is:

Ak = A0 + τk

Assuming that z(t) is the only explanatory variable, the hazard rate of


the k-th duration may be evaluated as follows:

hTk(u | X(τk−1) = Ei, τk−1, A0)

= lim δ↓0 (1/δ) P[u < Tk ≤ u + δ | Tk > u, X(τk−1) = Ei, τk−1, A0]   (by definition)

= lim δ↓0 (1/δ) ∑j≠i P[u < Tk ≤ u + δ, X(τk−1 + u + δ) = Ej | Tk > u, X(τk−1) = Ei, τk−1, A0]

= ∑j≠i lim δ↓0 (1/δ) P[X(τk−1 + u + δ) = Ej | Tk > u, X(τk−1) = Ei, τk−1, A0]   (by markovianity)

= ∑j≠i qij(A0 + τk−1 + u)

= ∑j≠i exp{αij + βij (u + Ak−1)}   (5.43)

We may now write the log-data density for a trajectory observed during an interval [0, t], using (5.28):

ln l(t) = ∑(ij)∈C ∫_0^t [αij + βij (u + A0)] dNij(u) − ∑(ij)∈C ∫_0^t exp{αij + βij (u + A0)} Yi(u−) du

= ∑(ij)∈C [αij + βij A0] Nij(t) + ∑(ij)∈C βij Wij(t)

− ∑(ij)∈C exp{αij − ln βij} { ∑1≤k≤N+(t) 1I{X(τk−1) = Ei} [e^{βij Ak} − e^{βij Ak−1}]

+ 1I{X(τN+(t)) = Ei} [e^{βij (A0 + t)} − e^{βij A_{N+(t)}}] }   (5.44)

where

Wij(t) = ∑1≤k≤N+(t) τk 1I{X(τk−1) = Ei, X(τk) = Ej}

Finally, the probability that the individual is unemployed, p0(t), or employed, p1(t), at the instant t (t > 0) may be written as:

p0(t) = exp{−∫_0^t [q01(s) + q10(s)] ds} {p0 + ∫_0^t q10(s) exp(∫_0^s [q01(u) + q10(u)] du) ds}

p1(t) = 1 − p0(t)

5.6 Semi-Markov Processes

For many applications, the class of Markov point processes is too narrow: there is a need for a more general class. We start again from a point process (X(t) : t ∈ I) with the same countable representation as before ((Xk, Tk) : k ∈ IN).

Definition The point process (X(t) : t ∈ I) is semi-markovian if and only if

(Tk, Xk) ⊥⊥ (T_1^{k−1}, X_1^{k−1}) | X_{k−1}          (5.45)

Its probability law is therefore characterized by the subdistribution

p^k_{ij}(t) = P[Xk = Ej, Tk ≥ t | X_{k−1} = Ei]          (5.46)

Note that the semi-markovian condition (5.45) is weaker than, i.e. is implied by, the markovian condition (5.24).

Definition The semi-markovian process is homogeneous if this subdistribution is the same for each episode, i.e.:

p^k_{ij}(t) = pij(t)          (5.47)

Properties of a Homogeneous semi-Markovian Point Process

1. The process (Xk : k ∈ IN) is markovian, and its law is characterized by:

   P[Xk = Ej | X_{k−1} = Ei] = pij(0)    i ≠ j

2. ⊥⊥_k Tk | (Xk : k ∈ IN)

Page 147: UCLouvain · 2012. 9. 13. · Contents Preface 1 Notation 3 1 Introduction 4 1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 An Overview of Topics

CHAPTER 5. TRANSITIONS MODELS 138

3. Tk ⊥⊥ (Xk : k ∈ IN) | X_{k−1}; furthermore, (Tk | X_{k−1} = Ei) ∼ F^(i), where F^(i) is an arbitrary distribution. The stochastic intensity hi(t) is therefore not necessarily constant within each episode.

Therefore the memory of such a process is limited to the state where the process presently is and to how long it has been in that state.

5.7 More general Processes: Some Hints

In semi-markovian processes and in markovian processes, the memory, at the start of each episode k, is limited to the state where the process is, namely X_{k−1}. Therefore, modelling each episode boils down to modelling a process with one transition and multiple exits, as done in Chapter 4. But in several applications, such strong limitations on the structure of the memory may seem too restrictive; widening the scope of such models is the object of this section.

The problem may be viewed as follows. In general, the history of the process up to time t has the following structure, in view of (5.7) and (5.8):

F_X(t−) = σ{(τ1, X(τ1)), · · · , (τk, X(τk)), 1I{τ_{k+1} > t}}          (5.48)

        = σ{T1, X(T1), · · · , Tk, X(∑_{0≤i≤k} Ti), 1I{∑_{0≤i≤k+1} Ti > t}}          (5.49)

where, let us remember, k = k(t) = max{j | τj ≤ t}. Thus, without restricting assumptions, the stochastic intensity h_X(t) will depend on 2 k(t) + 1 random variables:

h_X(t) = f(τ1, X(τ1), · · · , τk, X(τk), 1I{τ_{k+1} > t})

and would therefore become quickly unmanageable as time proceeds, i.e. as t increases. Thus the role of the markovian property (of order 1) is to reduce that number to 1, namely X_{k−1} = X(τ_{k−1}), and in the semi-markovian case, to 2, namely X_{k−1} and t − τ_{k−1}.

A less stringent requirement on the memory of the process is to require that the stochastic intensity depends on a fixed number, say r, of variables, where r is independent of t. More specifically, let us consider a vector of r components W(t) = (W1(t), · · · , Wr(t)), where each component Wj(t) depends on the past of the process only, and such that the stochastic intensity depends on W(t) only:

h_X(t) = f(W1(t), · · · , Wr(t))

Remark. In slightly more abstract terms, one is looking for a family of σ-fields W(t) such that:

• W(t) is adapted to F_X(t−), i.e. W(t) ⊂ F_X(t−)

• W(t) is "predictively sufficient", i.e. Xt ⊥⊥ F_X(t−) | W(t)

Note however that the family W(t) is not a filtration and does not represent an accumulation of information.


Chapter 6

Problems of Partial Observability

6.1 Introduction

6.1.1 Incomplete data in Duration Analysis

Incomplete observability of the durations

In demography, an "ideal" data base can be obtained as follows: draw "randomly" n individuals born in the same year and already dead, and measure the length of their life. In such a case, life duration is completely observed for each individual in the sample. The vast majority of research in demography is, however, based on far less complete data.

In a survey on unemployment, the complete duration of unemployment is unknown for the individuals unemployed at the time of the survey; indeed, for those individuals, it is only known that their unemployment spell is not yet completed. This is an example of length-biased sampling.

Incomplete observability of the exogenous variables

Often, many individual characteristics are not available although they are deemed important to "explain" duration behaviour: this creates the problem of heterogeneity.

Sampling selection



6.1.2 “Incomplete" data: a general framework

Model 1

(i) Model

• Structural model

M^{θ,z}_η = {p(η | z, θ) : θ ∈ Θ}

where η is a possibly non-observable variable and θ is a structural parameter.

•• Reporting model

M^{η,λ}_S = {p(s | η, z, λ) : λ ∈ Λ}

where S is an observable variable denoting the nature of the observation to be reported and λ is a (typically nuisance) parameter. Possibly λ = l(θ), or λ known.

••• Observability model

Y = f(η, S)

Thus, S may be viewed as a "signaling" variable, determining which part of η is actually observed.

•••• Data = (Y, S)

(ii) Likelihood functions

latent likelihood

L(θ) : η ↦ L(θ) = p(η | z, θ)

observable likelihood

L*(θ, λ) : (S, Y) ↦ L*(θ, λ) = p(S, Y | θ, z, λ)


where:

p(S, Y | z, θ, λ) = E[ p(η | z, θ) p(S | z, η, λ) | Y, S, z, θ, λ ]    (the integrand being p(S, η | z, θ, λ))

                  = ∫_{η | f(η,S) = Y} p(η | z, θ) p(S | z, η, λ) dη

Remark

L ∈ IR^Θ_+          L* ∈ IR^{Θ×Λ}_+

Model 2

(i) Model

• Structural model

M^{θ,z}_η = {p(η | z, θ) : θ ∈ Θ}

where η is a possibly non-observable variable and θ is a structural parameter.

••• Observability model

Y = f(η)

In this model, the observation is a deterministic function of the latent variable η.

•••• Data = (Y)

(ii) Example:

η = (η1, η2)    latent durations
Y = η1 ∧ η2 = min{η1, η2}
S = 1I{η1 ≤ η2} = 1I{Y = η1}


[Figure 6.1 plots, for five individuals i against time, the observation window running from the start to the end of the survey: spells η2 = T2 and η3 = T3 are completely observed, whereas η1, η4 and η5 are only incompletely observed.]

Figure 6.1: Censored and uncensored data

6.2 Censored data

6.2.1 Censored and Truncated Data: general ideas

Let us consider Figure 6.1

6.2.2 Modelling right censored data

Latent variables

η    latent duration
ζ    latent censoring
Y = (T, A)    observation, with T = η ∧ ζ and A = 1I{η ≤ ζ} = 1I{T = η}

Hypotheses: (η, ζ) continuous. Therefore T ∈ IR+ and A ∈ {0, 1}.

Thus we meet exactly the same structure as in a competing risks model, as seen in Sections 4.3 and 4.4, with two exits corresponding to the latent durations η and ζ. In the case of censoring, it is generally assumed that η and ζ are independently distributed and that the parameters characterizing the distribution of the censoring variable (ζ) are nuisance parameters. It is therefore appropriate to factorize the data density into the terms based on the distribution of the latent duration η and the terms based on the distribution of the censoring variable ζ. Using the same derivation as in Sections 4.3 and 4.4, the density of the observations may be written, in general, as:

fT,A(t, a) = [S_{ζ|η}(t) fη(t)]^a · [S_{η|ζ}(t) fζ(t)]^{1−a} · 1I_{{0,1}×IR+}(a, t)

           = [fη(t)]^a [S_{η|ζ}(t)]^{1−a} [S_{ζ|η}(t)]^a [fζ(t)]^{1−a}

and, when η ⊥⊥ ζ:

fT,A(t, a) = [Sζ(t) fη(t)]^a · [Sη(t) fζ(t)]^{1−a} · 1I_{{0,1}×IR+}(a, t)

           = [fη(t)]^a [Sη(t)]^{1−a} [Sζ(t)]^a [fζ(t)]^{1−a}
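As a quick numerical illustration (a sketch, with exponential specifications for both latent variables chosen purely for convenience), the regrouping of the observed-data density into an η-term and a ζ-term can be checked directly:

```python
import math

# Hypothetical exponential specifications: f(t) = r*exp(-r*t), S(t) = exp(-r*t)
theta, omega = 0.8, 0.3   # rates of the latent duration eta and censoring zeta

f_eta = lambda t: theta * math.exp(-theta * t)
S_eta = lambda t: math.exp(-theta * t)
f_zeta = lambda t: omega * math.exp(-omega * t)
S_zeta = lambda t: math.exp(-omega * t)

def density_obs(t, a):
    """Density of (T, A) under eta independent of zeta."""
    return (S_zeta(t) * f_eta(t)) ** a * (S_eta(t) * f_zeta(t)) ** (1 - a)

def density_factorized(t, a):
    """Same density, regrouped into an eta-term and a zeta-term."""
    eta_term = f_eta(t) ** a * S_eta(t) ** (1 - a)      # carries theta only
    zeta_term = S_zeta(t) ** a * f_zeta(t) ** (1 - a)   # carries omega only
    return eta_term * zeta_term

for t in (0.5, 1.0, 2.5):
    for a in (0, 1):
        assert abs(density_obs(t, a) - density_factorized(t, a)) < 1e-12
```

The η-term is exactly the factor retained below as the relevant likelihood when the censoring parameters are nuisance parameters.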


6.2.3 Interval censored data

Basic idea

It is not unusual that the precise date of the event of interest is not known with precision: only an interval is known. Two different cases arise:

- fixed endpoints
- random endpoints

Fixed endpoints

Let us consider k fixed endpoints: 0 = t0 < t1 < · · · < tj < · · · < tk < t_{k+1} = ∞. Each observation is characterized by its membership to one of these intervals and may accordingly be coded disjunctively:

A = (A1, · · · , Aj, · · · , A_{k+1})′    where Aj = 1I{t_{j−1} < T ≤ tj}

This makes the observations distributed as a generalized Bernoulli:

P(A = a) = ∏_j θj^{aj}    where θj = S(t_{j−1}) − S(tj)
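With an exponential survivor function, for instance (a sketch; the rate and the endpoints below are arbitrary choices), the cell probabilities θj are obtained directly from S and form a probability vector:

```python
import math

rate = 0.5
S = lambda t: math.exp(-rate * t)       # exponential survivor function
endpoints = [0.0, 1.0, 2.0, 4.0]        # t_0 < t_1 < ... < t_k ; t_{k+1} = infinity

# theta_j = S(t_{j-1}) - S(t_j); the last cell (t_k, infinity] gets S(t_k)
theta = [S(a) - S(b) for a, b in zip(endpoints, endpoints[1:])]
theta.append(S(endpoints[-1]))

assert all(p > 0 for p in theta)
assert abs(sum(theta) - 1.0) < 1e-12    # a generalized-Bernoulli probability vector
```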

Random endpoints

Here, each individual observation is characterized by a pair of random variables: L, the left limit, and R, the right limit, of the interval.

6.3 Aggregation and Heterogeneity

6.3.1 Introduction

Motivation

Aggregating over heterogeneous individuals may create complicated structures of the hazard function. The analytical aspect is shown, for the general case, in the next lemma, and an example illustrates a simple application of it. It is then shown that, in particular, aggregation destroys the exponentiality of a duration.

A Basic lemma


Let

(T | Z) ∼ F^Z_T    i.e. P(T ≤ t | Z) = F^Z_T(t)

Z ∼ F_Z            i.e. P(Z ≤ z) = F_Z(z)

Then

fT(t) = ∫ f^z_T(t | z) dF_Z(z)

ST(t) = ∫ S^z_T(t | z) dF_Z(z)

hT(t) = fT(t)/ST(t) = ∫ f^z_T(t | z) dF_Z(z) / ∫ S^z_T(t | z) dF_Z(z)

      = ∫ hT(t | z) · [ S^z_T(t | z) / ∫ S^z_T(t | z) dF_Z(z) ] · dF_Z(z)

This lemma may be interpreted as follows: aggregating over heterogeneous individuals - characterized by z - produces a duration distribution whose hazard function hT(t) is a weighted average of the individual hazard functions hT(t | z). This possibly complicated weighting scheme may eventually account for the complex hazard functions to be expected when analyzing aggregated data. A simple example illustrates this point.

Example

Let:

Z = 0 for individuals with low educational level
  = 1 for individuals with high educational level

T = duration of first unemployment

P(Z = z) = θ^z (1 − θ)^{1−z}    z = 0, 1    θ = P(Z = 1)

(T | Z = 0) ∼ F^0_T
(T | Z = 1) ∼ F^1_T

and denote:

f^z_T(t) = f_{T|Z}(t | Z = z)    S^z_T(t) = S_{T|Z}(t | Z = z)    z = 0, 1;

we then obtain:

fT(t) = θ f^1_T(t) + (1 − θ) f^0_T(t)

ST(t) = θ S^1_T(t) + (1 − θ) S^0_T(t)

hT(t) = fT(t)/ST(t) = [θ f^1_T(t) + (1 − θ) f^0_T(t)] / [θ S^1_T(t) + (1 − θ) S^0_T(t)]

      = h^1_T(t) · θ S^1_T(t) / [θ S^1_T(t) + (1 − θ) S^0_T(t)] + h^0_T(t) · (1 − θ) S^0_T(t) / [θ S^1_T(t) + (1 − θ) S^0_T(t)]
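To make the weighting scheme concrete (a sketch with exponential conditional distributions; the rates and θ below are arbitrary choices), one can check numerically that the aggregate hazard equals the weighted average of the two group hazards, and that it decreases over time:

```python
import math

theta = 0.4            # P(Z = 1), share of high-education individuals
h0, h1 = 1.0, 0.25     # group-specific (constant) exponential hazards

def aggregate_hazard(t):
    """h_T(t) computed directly as f_T(t) / S_T(t)."""
    f = theta * h1 * math.exp(-h1 * t) + (1 - theta) * h0 * math.exp(-h0 * t)
    S = theta * math.exp(-h1 * t) + (1 - theta) * math.exp(-h0 * t)
    return f / S

def weighted_hazard(t):
    """Same quantity as a weighted average of h1 and h0."""
    S1, S0 = math.exp(-h1 * t), math.exp(-h0 * t)
    denom = theta * S1 + (1 - theta) * S0
    return h1 * theta * S1 / denom + h0 * (1 - theta) * S0 / denom

grid = [0.1 * j for j in range(1, 60)]
for t in grid:
    assert abs(aggregate_hazard(t) - weighted_hazard(t)) < 1e-12

# the aggregate hazard decreases: "movers" (hazard h0) exit first
hazards = [aggregate_hazard(t) for t in grid]
assert all(a > b for a, b in zip(hazards, hazards[1:]))
```

The monotone decrease observed here is exactly the content of the "mover-stayer" lemma below.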

Lemma ("mover-stayer" lemma)

If (T | Z) ∼ exp(h0(Z)) and Z ∼ F_Z (arbitrary),

then hT(t) is monotone decreasing.

Indeed, we successively obtain:

ST(t) = ∫_0^∞ ST(t | z) dF_Z(z) = ∫_0^∞ exp{−t h0(z)} dF_Z(z)

fT(t) = −(d/dt) ST(t) = ∫_0^∞ h0(z) exp{−t h0(z)} dF_Z(z)

hT(t) = fT(t)/ST(t) = ∫_0^∞ h0(z) exp{−t h0(z)} dF_Z(z) / ∫_0^∞ exp{−t h0(z)} dF_Z(z)

One may check that:

(d/dt) hT(t) < 0    ∀t, ∀F_Z(z), ∀h0(z)

(indeed, hT(t) = E[h0(Z) | T > t], and −(d/dt) hT(t) equals the variance of h0(Z) given T > t).

This lemma may be interpreted as follows. Individuals are characterized by their value of z. Large values of h0(Z) represent so-called "mover" individuals: they will leave first, while individuals with small values of h0(Z) are "stayer" individuals: they will leave (in probability) later. This explains why hT(t) is decreasing: at each t it is determined by the remaining individuals, whose values of h0(Z) are decreasing. This lemma also shows that although every individual has an exponential duration, the aggregate does not have an exponential duration, except in the case where every individual would happen to be identical, a not-to-be-expected event.

In the simple case where h0(z) = z, this lemma gives an important property of compound exponential distributions. As a simple example, assume that h0(z) = z and that the (unobservable) variable z is distributed according to a gamma distribution: z ∼ γ(α, β). Then we obtain:

fT(t) = (α^β / Γ(β)) ∫_0^∞ z exp(−tz) · z^{β−1} exp(−αz) dz

      = β α^β / (α + t)^{β+1} = (β/α) (1 + t/α)^{−(β+1)}

Thus T follows a Pareto distribution, a fat-tailed distribution for low values of β. In particular, the r-th moment exists only for r < β.
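This compounding result is easy to check by simulation (a sketch; the parameter values are arbitrary): drawing z from a gamma distribution and then T | z from an exponential should reproduce the Pareto survivor ST(t) = (1 + t/α)^{−β}:

```python
import math
import random

random.seed(0)
alpha, beta = 2.0, 3.0
n = 200_000

# z ~ gamma(shape=beta, rate=alpha); then (T | z) ~ exponential with hazard z
draws = []
for _ in range(n):
    z = random.gammavariate(beta, 1.0 / alpha)   # shape beta, scale 1/alpha
    draws.append(random.expovariate(z))

def S_pareto(t):
    """Survivor function of the compound: (1 + t/alpha)^(-beta)."""
    return (1.0 + t / alpha) ** (-beta)

for t in (0.5, 1.0, 2.0):
    empirical = sum(d > t for d in draws) / n
    assert abs(empirical - S_pareto(t)) < 0.01
```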

6.4 Endogenous Selection of the Sample

6.4.1 Introduction

Many, if not most, data sets used in microeconometrics have been gathered for administrative purposes rather than for statistical analysis. The selection of the individuals in such databases does not therefore correspond to a properly controlled sampling design. For instance, databases of individual data managed by a service in charge of evaluating entitlement to, and distributing, unemployment compensation contain only individuals who have experienced unemployment at least once. When modelling transitions on the labour market, these data are likely to be subject to selection bias.


Chapter 7

Inference: Sampling methods

7.1 Parametric models

7.1.1 In Marginal Models

Basic model

Let

η = (η1 . . . ηn)′    latent durations

ζ = (ζ1 . . . ζn)′    latent censorings

ξ = (ξ1 . . . ξn)′    ξi = (ηi, ζi)′    (n × 2)

T = (T1 . . . Tn)′    Ti = ηi ∧ ζi    observed durations

A = (A1 . . . An)′    Ai = 1I{ηi ≤ ζi} = 1I{Ti = ηi}

X = (X1 . . . Xn)′ = (T, A)    Xi = (Ti, Ai)′    complete data    (n × 2)

λ    a sufficient parametrisation for the process generating ξ


Assume

A.1 (independent sampling): ⊥⊥_i ξi | λ

A.2 (independent censoring): ηi ⊥⊥ ζi | λ

A.3 (definition of θ, a sufficient parametrisation for η): ηi ⊥⊥ λ | θ

A.4 (definition of ω, a sufficient parametrization for ζ): ζi ⊥⊥ λ | ω

A.5 (bayesian cut): θ ⊥⊥ ω

A.6 θ is the only parameter of interest

Latent likelihood

Lemma Under (A.1) to (A.5) we have:

(i) η ⊥⊥ ζ | λ
(ii) (η, θ) ⊥⊥ (ζ, ω)

Complete Latent Likelihood

L**(λ) = ∏_i fξ(ξi | λ) = ∏_i fη(ηi | θ) · ∏_i fζ(ζi | ω) = L*_1(θ) · L*_2(ω)


Under (A.6), the Relevant Latent Likelihood is

L*_1(θ) = ∏_i fη(ηi | θ) = fη(η | θ)

Actual Likelihood

Complete Actual Likelihood

L(λ) = (d/dt) P[T ≤ t, A | λ]

     = ∏_i fη(Ti | θ)^{Ai} Sη(Ti | θ)^{1−Ai} · ∏_i fζ(Ti | ω)^{1−Ai} Sζ(Ti | ω)^{Ai}

     = L1(θ) L2(ω)

Under (A.6), the Relevant Actual Likelihood is:

L1(θ) = ∏_i fη(Ti | θ)^{Ai} Sη(Ti | θ)^{1−Ai} = ∏_i hη(Ti | θ)^{Ai} Sη(Ti | θ)

because fη = hη · Sη. We shall also use, in logarithmic terms:

L(θ) = ln L1(θ)

     = ∑_i Ai ln fη(Ti | θ) + ∑_i (1 − Ai) ln Sη(Ti | θ)

     = ∑_i Ai ln hη(Ti | θ) + ∑_i ln Sη(Ti | θ)

     = ∑_i Ai ln hη(Ti | θ) − ∑_i Hη(Ti | θ)

because Hη = − ln Sη. It is useful to keep in mind those different forms of the likelihood function. Indeed, as seen in Section 2.5, useful distributions exhibit different levels of analytical complexity in either fη, Sη, hη or Hη.
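As an illustration of the equivalence of these forms (a sketch using a Weibull specification, chosen because all four functions fη, Sη, hη, Hη have simple closed forms; the censoring mechanism is an arbitrary choice), the three log-likelihood expressions coincide on simulated right-censored data:

```python
import math
import random

random.seed(1)
shape, scale = 1.5, 2.0   # Weibull parameters playing the role of theta

# Weibull building blocks: f = h * S, H = -ln S
h = lambda t: (shape / scale) * (t / scale) ** (shape - 1)
H = lambda t: (t / scale) ** shape
S = lambda t: math.exp(-H(t))
f = lambda t: h(t) * S(t)

# simulated right-censored sample
data = []
for _ in range(500):
    eta = random.weibullvariate(scale, shape)   # latent duration
    zeta = random.expovariate(0.4)              # latent censoring
    data.append((min(eta, zeta), 1 if eta <= zeta else 0))

L_fS = sum(a * math.log(f(t)) + (1 - a) * math.log(S(t)) for t, a in data)
L_hS = sum(a * math.log(h(t)) + math.log(S(t)) for t, a in data)
L_hH = sum(a * math.log(h(t)) - H(t) for t, a in data)

assert abs(L_fS - L_hS) < 1e-8 and abs(L_fS - L_hH) < 1e-8
```

In practice one picks whichever form is analytically cheapest for the distribution at hand.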

Lemma    η ⊥⊥ ζ | λ ⇒ η ⊥⊥ ζ | T, A, λ

Corollary

Let P^{θ,ω} ≪ P0 ∀(θ, ω) and p* = dP^{θ,ω}/dP0. Then

p*(T, A | η, λ) = p^{θ,1}(T, A) L2(ω)

p*(T, A | ζ, λ) = p^{θ,2}(T, A) L1(θ)

Note that p*(T, A | η, λ) depends only on η for the uncensored data (Ai = 1), whereas p*(T, A | ζ, λ) depends only on ζ for the censored data (Ai = 0).

Exponential case

Consider:

fη(ηi | θ) = θ e^{−θηi}
Sη(ηi | θ) = e^{−θηi}
hη(ηi | θ) = θ
Hη(ηi | θ) = θηi

and p(ζ | λ) is left unspecified.

Remark: The latent process generating η is a member of the exponential family, and ∑_i ηi = η+ is a minimal sufficient complete statistic in the latent process.

Actual likelihood:

P(T, A | θ) = ∏_i θ^{Ai} e^{−θTi} = θ^{A+} e^{−θT+}

L(θ) = ∑_i Ai ln θ − ∑_i Ti θ = (ln θ) A+ − θ T+


[Figure 7.1 sketches, with axes θ and ln θ, the one-dimensional curve along which the canonical parameter (θ, ln θ) of the censored exponential model lives.]

Figure 7.1: Curved exponential family: the case of exponential censored data

where:

A+ = ∑_i Ai          T+ = ∑_i Ti

Thus, A+ represents the number of uncensored observations whereas T+ represents the "total time at risk".

Remark: There is only one parameter, θ ∈ IR+, and the bivariate statistic (A+, T+) is minimal sufficient but not complete. This is an example of a "curved exponential family" with canonical parameter (θ, ln θ). The score is accordingly:

S(θ) = (d/dθ) L(θ) = A+/θ − T+


and the statistical information is:

J(θ) = −(d²/dθ²) L(θ) = A+/θ²

Therefore the maximum likelihood estimator of θ is:

θML = A+/T+
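A small simulation illustrates this "exits over total time at risk" estimator (a sketch; the exponential censoring mechanism with rate 1 is an arbitrary choice, since p(ζ | λ) is left unspecified in the text):

```python
import random

random.seed(2)
theta = 2.0          # true hazard of the latent duration eta
n = 50_000

A_plus, T_plus = 0, 0.0
for _ in range(n):
    eta = random.expovariate(theta)    # latent duration
    zeta = random.expovariate(1.0)     # latent censoring
    T = min(eta, zeta)                 # observed duration
    A = 1 if eta <= zeta else 0        # uncensored indicator
    A_plus += A
    T_plus += T

theta_ML = A_plus / T_plus             # number of exits / total time at risk
assert abs(theta_ML - theta) < 0.1
```

Note that the naive estimator n/T+, which ignores censoring, would overestimate θ here, as the comparison at the end of this subsection makes explicit.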

Remember:

√n (θML,n − θ) →^L N[0, [I(θ)]^{−1}]

where

I(θ) = V[(d/dθ) L(θ) | θ] = E[J(θ) | θ] = −E[(d²/dθ dθ′) L(θ) | θ] = E[A+ | θ]/θ²

Remark:

E[Ai | λ] = P[Ai = 1 | λ]
          = P[ηi ≤ ζi | λ]
          = E[ P[ηi ≤ ζi | ζi, λ] | λ ]
          = E[ Fη(ζi | θ) | λ ]
          = E[1 − e^{−θζi} | λ] = 1 − E[e^{−θζi} | λ]

because ηi ⊥⊥ ζi | λ and ηi ⊥⊥ λ | θ; therefore:

P(Ai = 1 | θ) = 1 − E[e^{−θζi} | θ]

where:

E[e^{−θζi} | θ] = ∫ e^{−θζi} dFζ(ζi)


because (A.4) and (A.5) imply that θ ⊥⊥ (ζi, ω) and consequently θ ⊥⊥ ζi. In the above derivations, it is accordingly important to distinguish E(Ai | θ) = P(Ai = 1 | θ) from E(Ai | θ, ζi) = P[ηi ≤ ζi | ζi, λ]. Thus, A+/θ² is the "statistical information" of observation A+, whereas

I(θ) = [n − ∑_{i=1}^n E[e^{−θζi} | θ]] / θ² = (n/θ²) [1 − E(e^{−θζi} | θ)]

is the Fisher information on θ.

In practice, the general result:

√n [I(θ)]^{1/2} (θML,n − θ) → N(0, 1)

is used by estimating I(θ) by:

I(θML,n) = A+/θ²_{ML,n}

Let us compare with the uncensored case:

A+ → n > A+

L(θ) = (ln θ) A+ − θ T+ → n ln θ − θ T+

θML = A+/T+ → θ*ML = n/T+ > A+/T+

In other words:

θML / θ*ML = A+/n ≤ 1,    with equality ⇐⇒ A+ = n

Proportional Hazard

Let us consider the model:

hη(ti | θ) = θ h0(ti)

Hη(ti | θ) = θ H0(ti)

where h0(t) is assumed to be known. Remember that:

[Hη(Ti | θ) | θ] ∼ Exp(1)

which implies:

[H0(Ti) | θ] ∼ Exp(θ)

Formally, we fall back to the simple exponential case by a simple change of coordinates, Ti → H0(Ti); more explicitly:

L(θ) = ∑_i Ai ln hη(Ti | θ) − ∑_i Hη(Ti | θ)

     = (ln θ) A+ + ∑_i Ai ln h0(Ti) − θ H0+

where:

H0+ = ∑_i H0(Ti)

Therefore:

(d/dθ) L(θ) = A+/θ − H0+

θML = A+/H0+

In epidemiology, h0(t) is often viewed as an age-specific mortality rate for a "standard" population. The ratio A+ (H0+)^{−1} is accordingly called a "standardized mortality rate".
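The change of coordinates Ti → H0(Ti) is easy to exercise numerically (a sketch; the Weibull-type baseline h0(t) = 2t, i.e. H0(t) = t², and the censoring scheme are arbitrary choices):

```python
import math
import random

random.seed(3)
theta = 1.5          # true proportionality factor
n = 50_000

H0 = lambda t: t * t          # baseline integrated hazard, since h0(t) = 2t

A_plus, H0_plus = 0, 0.0
for _ in range(n):
    # H0(eta) ~ Exp(theta)  =>  eta = sqrt(E / theta) with E ~ Exp(1)
    eta = math.sqrt(random.expovariate(theta))
    zeta = random.expovariate(1.0)      # arbitrary censoring mechanism
    T = min(eta, zeta)
    A_plus += 1 if eta <= zeta else 0
    H0_plus += H0(T)                    # total baseline time at risk

theta_ML = A_plus / H0_plus             # the "standardized mortality rate"
assert abs(theta_ML - theta) < 0.1
```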

7.1.2 In conditional models

General case

Let


• θ = (α, β) ∈ Θα × Θβ ⊂ IR^{kα} × IR^{kβ},    kα, kβ < ∞

• Yi = (Ti, Ai)    Y = (Y1 . . . Yn)
  Xi = (Yi, Zi)    X = (X1 . . . Xn)

Definition of λ and θ:

Z ⊥⊥ θ | λ    i.e. λ = "θZ"    P(Z | θ, λ) = P(Z | λ)

Y ⊥⊥ λ | Z, θ    i.e. θ = "θ_{Y/Z}"    P(Y | Z, θ, λ) = P(Y | Z, θ)

Hypotheses

cut:    λ ⊥⊥ θ

sift:   ⊥⊥_i Yi | Z, θ    and    Yi ⊥⊥ Z | Zi, θ

Therefore:

P(Y | z, θ) = ∏_i P(Yi | zi, θ)

L(θ) = ∑_i Ai ln hη(Ti | zi, θ) + ∑_i ln Sη(Ti | zi, θ)

     = ∑_i Ai ln hη(Ti | zi, θ) − ∑_i Hη(Ti | zi, θ)

     = ∑_i Ai ln fη(Ti | zi, θ) + ∑_i (1 − Ai) ln Sη(Ti | zi, θ)

Score and statistical information

S(θ) = (d/dθ) L(θ) = ∑_i [Ai / hη(Ti | zi, θ)] (d/dθ) hη(Ti | zi, θ) − ∑_i (d/dθ) Hη(Ti | zi, θ)

     = ∑_i Ai (d/dθ) ln hη(Ti | zi, θ) − ∑_i (d/dθ) Hη(Ti | zi, θ)

J(θ) = −(d²/dθ dθ′) L(θ) = ∑_i Ai [hη(Ti | zi, θ)]^{−2} (d/dθ) hη(Ti | zi, θ) (d/dθ′) hη(Ti | zi, θ)

       − ∑_i Ai [hη(Ti | zi, θ)]^{−1} (d²/dθ dθ′) hη(Ti | zi, θ)

       + ∑_i (d²/dθ dθ′) Hη(Ti | zi, θ)

Exponential case

Let, with θ ≡ β:

hη(ti | zi, β) = g(zi, β)    g(·) ≥ 0
Hη(ti | zi, β) = ti g(zi, β)
Sη(ti | zi, β) = exp{−ti g(zi, β)}
fη(ti | zi, β) = g(zi, β) exp{−ti g(zi, β)}

This case may be viewed as a proportional hazard model as well as an accelerated time model with, under both interpretations, h0(t | α) = 1, i.e. T0 ∼ Exp(1). Therefore:

L(β) = ∑_i Ai ln g(zi, β) − ∑_i Ti g(zi, β)

S(β) = (d/dβ) L(β) = ∑_i [Ai / g(zi, β)] (d/dβ) g(zi, β) − ∑_i Ti (d/dβ) g(zi, β)

     = ∑_i Ai (d/dβ) ln g(zi, β) − ∑_i Ti (d/dβ) g(zi, β)    β, zi ∈ IR^k

Particular case

g(z, β) = e^{z′β}

If zi = (1, zi1, · · · , zik)′ and β = (β0, β1, · · · , βk)′, then e^{β0} represents, in the proportional hazard interpretation, the base-line hazard rate corresponding to the individual characterized by zi = (1, 0, · · · , 0)′.

In this case, we have:

ln g(z, β) = z′β

and this implies:

L(β) = β′ ∑_i Ai zi − ∑_i Ti e^{z′iβ}

S(β) = ∑_{1≤i≤n} Ai zi − ∑_i Ti e^{z′iβ} zi    ∈ IR^k

I(β) = V[S(β) | zi, β] = −E[(d/dβ′) S(β) | zi, β] = ∑_i E(Ti | β, zi) e^{z′iβ} zi z′i

Given that the distribution of the censoring variable is left unspecified, and that its parameters are considered as nuisance parameters, E(Ti | β, zi) will be approximated by:

E[Ti | β, zi, ci] = E[ E(Ti | Ai, ci, zi, β) | β, ci, zi ]

                 = E[ (e^{−z′iβ})^{Ai} ci^{1−Ai} | β, zi, ci ]

                 = E[ Ai e^{−z′iβ} + (1 − Ai) ci | β, ci, zi ]

                 = e^{−z′iβ} γi + (1 − γi) ci

where γi = P[Ai = 1 | β, ci, zi] = 1 − e^{−ci e^{z′iβ}} and ci is the observed value of the censoring variable.

Proportional Hazard model

Let

hη(t | z, θ) = g(z, β) h0(t | α)

Hη(t | z, θ) = g(z, β) H0(t | α)

Sη(t | z, θ) = exp{−Hη(t | z, θ)} = exp{−g(z, β) ∫_0^t h0(u | α) du}

fη(t | z, θ) = g(z, β) h0(t | α) · exp{−g(z, β) ∫_0^t h0(u | α) du}

where, as usual, θ = (α, β).

Therefore, the log-likelihood function may be written as:

L(θ) = ∑_i Ai ln hη(Ti | zi, θ) − ∑_i Hη(Ti | zi, θ)

     = ∑_i Ai ln g(zi, β) + ∑_i Ai ln h0(Ti | α) − ∑_i g(zi, β) H0(Ti | α)

     = f1(α) + f2(β) + f3(α, β)

Particular case: the log-linear specification

g(z, β) = e^{z′β}

In this case, the log-likelihood function becomes:

L(θ) = β′ ∑_i Ai zi + ∑_i Ai ln h0(Ti | α) − ∑_i e^{z′iβ} H0(Ti | α)


Accelerated Life model

Let (see section 2.3.3)

hη(t|z, θ) = g(z, β)h0(t.g(z, β)|α)

Hη(t|z, θ) = H0(t.g(z, β)|α)

Sη(t|z, θ) = S0(t.g(z, β)|α)

fη(t|z, θ) = g(z, β)f0(t.g(z, β)|α)

where θ = (α, β)

For an arbitrary family of baseline distributions, the log-likehood functionmay be written as:

L(θ) =∑

iAi[ln g(zi, β) + lnh0(Ti.g(zi, β)|α)] −∑

iH0(Ti.g(zi, β)|α)

When the baseline distribution is exponential:

h0(ti | α) = α

we obtain:

L(θ) = ∑_i Ai [ln g(zi, β) + ln α] − α ∑_i Ti g(zi, β)

In the particular case where g(zi, β) = e^{z′iβ}, we have:

L(θ) = (ln α) ∑_i Ai + β′ ∑_i Ai zi − α ∑_i Ti e^{z′iβ}


7.2 Non-parametric and Semi-parametric models

7.2.1 Introduction

Parametric model: θ ∈ Θ ⊂ IR^p
Example: (T | θ) ∼ exp(θ), Θ = IR+

Non-parametric model: Θ a functional space
Example: (x | θ) ∼ θ, Θ = {θ | θ is a continuous distribution on IR}

Semi-parametric model: Θ = Θ1 × Θ2 with Θ1 ⊂ IR^p and Θ2 a functional space.

Example 1

yi = z′iβ + εi    (ε | θ) ∼ θ2    β, zi ∈ IR^k

Θ1 = IR^k × IR+

Θ2 = {θ2 : distribution on IR | ∫ x dθ2 = 0, ∫ x² dθ2 = σ² < ∞}

Typically, θ1 is a parameter of interest whereas θ2 may be a nuisance parameter. However, a functional defined on Θ2, namely σ² = ∫ u² θ2(du), may also be of interest.

Example 2

yi = α(zi) + εi    εi ∼ N(0, σ²)

α ∈ Θ2 = {α : IR^k → IR | α "reasonably smooth"}
σ² ∈ Θ1 = IR+
Θ = Θ1 × Θ2

In this example, the functional parameter α would typically be a parameter of interest.

7.2.2 Non-parametric estimation of the Survivor function

Remember:

(i) in the discrete case:

ST(t) = ∏_{j | aj < t} (1 − hj)          FT(t) = ∏_{j | aj ≤ t} (1 − hj)

(ii) the empirical process is discrete by nature, even when the true distribution is continuous;

(iii) Glivenko-Cantelli (Theorem 9.2.2).

Kaplan-Meier estimator

Objective: estimate ST(t) taking censoring into account.

Basic idea: estimate the empirical survivor function in its product form and adjust the estimation of the (discrete) hazard rates to the censoring.

Let:

Yi = (Ti, Ai)    with Ti = min(ηi, ζi) and Ai = 1I{Ti = ηi}

Ti → T(1) < T(2) < . . . < T(n)    (order statistics)
Ai → A′1, A′2, . . . , A′n    (the correspondingly reordered censoring indicators)

R(t) = ∑_i 1I{T(i) ≥ t}

Thus R(t) represents the number of individuals at risk at time t, that is, those who are neither "dead" nor censored at time t−.

B(T(i)) = ∑_j A′j 1I{T(j) = T(i)}

Thus B(T(i)) represents the number of "deads" (i.e. exited without censoring) at the (observed) time T(i). Then, taking censoring into account, the (instantaneous) hazard function at (observed) time T(i) is estimated as:


hj → h(T(i)) = B(T(i)) / R(T(i))          (7.1)

SKM(t) = ∏_{T(i) < t} [1 − h(T(i))]          (7.2)

FKM(t) = ∏_{T(i) ≤ t} [1 − h(T(i))]          (7.3)

Remarks

1. If at T(i) there are only censored data, we have B(T(i)) = 0 and therefore FKM and SKM are continuous at T(i).

2. If the largest observation is a censored one, FKM(t) and SKM(t) are strictly positive and continuous at T(n):

FKM(t) = FKM(T(n−1)) > 0    ∀t > T(n−1)    (if T(n−1) is not censored)

SKM(t) = SKM(T(n)) > 0    ∀t > T(n)

therefore lim_{t→∞} FKM(t) > 0, a defective distribution. A natural interpretation of this occurrence, in the case of a life duration, is the following: if the largest observation does not correspond to a "dead", there is no empirical reason not to believe that such a life could possibly be infinite.

If one is willing to avoid defective distributions, one may modify the Kaplan-Meier estimator as follows:

F^m_KM(t) = ∏_{T(i) ≤ t} [1 − h(T(i))] 1I{t ≤ max_{i: Ai=1} Ti} = FKM(t) 1I{t ≤ max_{i: Ai=1} Ti}

where max_{i: Ai=1} Ti represents the largest uncensored duration.

3. If there are no ex-aequo at T(i), then:

B(T(i)) = A′i          R(T(i)) = n − i + 1

FKM(t) = ∏_{T(i) ≤ t} (1 − A′i / (n − i + 1))

In many data sets, ties are observed as a matter of fact. They call for two remarks: (i) even if Fη and Fζ are continuous, P(η = ζ) > 0 is possible when η ⊥⊥/ ζ (Marshall-Olkin); (ii) rounding problem: although theoretical models assume continuous time, actual measurements are discrete in nature. We have just seen that the Kaplan-Meier estimator accommodates ties. When the rounding problem is too severe, because spells are actually observed through intervals, truncated survivor functions may be used for an explicit modelling.

4. If, at the largest observation, there is a tie with censored and uncensored data, the distribution is again defective; FKM and SKM are both discontinuous at T(n), with:

FKM(T(n)) = FKM(∞−) > 0

SKM(T(n)) > SKM(∞−) > 0

Table 7.1 illustrates the behaviour of the Kaplan-Meier estimator in five different situations, each with n = 6 and therefore T(6+1) = ∞ (= T(5+1) in Example 5). The first example has no censored data; Examples 2, 3 and 4 have censored data in various places, and Example 5 has a tie with censored and uncensored data at the largest observation.

7.2.3 Semi-parametric proportional hazard model (Cox model)

Remember that in θ = (α, β), α is a sufficient parameter for the baseline distribution whereas β is introduced for describing the action of the exogenous variables. The semi-parametric version of the proportional hazard model takes the form:

hT(t | z, θ) = α(t) e^{z′β}

where α(t) = h0(t), the hazard function of the baseline distribution, is a


T(i)   A′i   hi    1−hi   FKM(t), T(i) ≤ t < T(i+1)   SKM(t), T(i−1) < t ≤ T(i)

Example 1
1      1     1/6   5/6    5/6     1
2      1     1/5   4/5    4/6     5/6
3      1     1/4   3/4    3/6     4/6
4      1     1/3   2/3    2/6     3/6
5      1     1/2   1/2    1/6     2/6
6      1     1     0      0       1/6     SKM(6+) = 0

Example 2
1      1     1/6   5/6    5/6     1
2      0     0     1      5/6     5/6
3      0     0     1      5/6     5/6
4      1     1/3   2/3    5/9     5/6
5      0     0     1      5/9     5/9
6      1     1     0      0       5/9     SKM(6+) = 0

Example 3
1      1     1/6   5/6    5/6     1
2      0     0     1      5/6     5/6
3      0     0     1      5/6     5/6
4      1     1/3   2/3    5/9     5/6
5      1     1/2   1/2    5/18    5/9
6      0     0     1      5/18    5/18    SKM(6+) = 5/18

Example 4
1      0     0     1      1       1
2      1     1/5   4/5    4/5     1
3      0     0     1      4/5     4/5
4      1     1/3   2/3    8/15    4/5
5      0     0     1      8/15    8/15
6      1     1     0      0       8/15    SKM(6+) = 0

Example 5 (tie at i = 5)
1      1     1/6   5/6    5/6     1
2      0     0     1      5/6     5/6
3      0     0     1      5/6     5/6
4      1     1/3   2/3    5/9     5/6
5      0,1   1/2   1/2    5/18    5/9     SKM(5+) = 5/18

Table 7.1: Numerical behaviour of the Kaplan-Meier estimator


functional parameter. Thus the parameter space takes the following form:

θ = (α, β) ∈ Θα × Θβ

Θα = { α : IR+ → IR+ | α is continuous and ∫0^∞ α(t) dt = ∞ }

Θβ = IRk

Typically, the functional parameter α is a nuisance parameter whereas the euclidean parameter β is the parameter of interest. It is therefore of interest to try to separate inferences on α and β. A natural idea is to look for a statistic W = f(Y) such that the likelihood function LY|Z(α, β) factorizes as follows:

LY |Z(α, β) = LW |Z(β) LY |W,Z(α, β)

In such a case, the inference on β would be made simpler by considering only the partial likelihood LW|Z(β) instead of LY|Z(α, β); a heuristic argument in favour of this simplification is that the information on β contained in LY|W,Z(α, β) is likely to be "eaten up" by the functional parameter α. This simplified estimator may now be built as follows. Let:

Yi = (Ti, Ai)

Similarly to the Kaplan-Meier estimator, let us reorder the sample according to the observed durations:

Ti −→ T(1) < T(2) < · · · < T(n)

Ai −→ A′(1), A′(2), . . . , A′(n)

and let us also define:

R(t) = Σ_{1≤i≤n} 1I{Ti ≥ t}

ℛ(t) = {k : T(k) ≥ t} = {i : Ti ≥ t}


Thus R(t) represents the number of individuals at risk at time t and ℛ(t) represents the set of such individuals. Notation will be usefully simplified as follows:

R(i) = R(T(i))

ℛ(i) = ℛ(T(i))

Let us now represent the sample (T1, . . . , Tn) by its order statistic O = (T(1), . . . , T(n)) and its rank statistic R = (R1, . . . , Rn), where Ri is the rank of the ith observation in the order statistic. Giving the rank statistic the role of W, as above, we may write the likelihood function of the rank statistic as follows:

L(β) = ∏_{1≤i≤n} [ e^{z′(i)β} / Σ_{k∈ℛ(i)} e^{z′(k)β} ]^{A′(i)}

where A+ = Σi Ai is the number of uncensored observations; only those A+ observations contribute a non-trivial factor to the product. The (partial) likelihood estimator of β is accordingly defined as:

β̂ = arg maxβ L(β)

Note that the likelihood LY|Z(α, β) has been decomposed into:

LY|Z(α, β) = LR|Z(β) LO|R,Z(α, β)
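To fix ideas, the partial likelihood can be evaluated numerically. The sketch below is illustrative (not from the original text): it computes log L(β) for a sample of durations with covariate vectors z and censoring indicators, assuming distinct durations.

```python
import math

def cox_partial_loglik(beta, durations, uncensored, z):
    """log L(beta) = sum over uncensored spells i of
       z_i'beta - log( sum_{k in R(i)} exp(z_k'beta) ),
    where R(i) is the risk set at the i-th ordered duration.
    Assumes distinct durations (no ties)."""
    lin = lambda i: sum(b * x for b, x in zip(beta, z[i]))
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    ll = 0.0
    for pos, i in enumerate(order):
        if uncensored[i]:
            risk_set = order[pos:]  # individuals with T_k >= T_(i)
            ll += lin(i) - math.log(sum(math.exp(lin(k)) for k in risk_set))
    return ll
```

With β = 0 every individual in the risk set is equally likely to fail, so each uncensored spell contributes −log |ℛ(i)|, whatever the covariates; the baseline hazard α never enters the computation.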

A slightly different argument for justifying the use of the partial likelihood is the following. Let us reparametrize the functional parameter α into its corresponding integrated hazard function:

α(t) −→ Hα(t) = ∫0^t α(u) du

The corresponding parameter space becomes:

Hα ∈ Θα = { H : IR+ −→ IR+ | H CADLAG, monotone non-decreasing, H(0) = 0, H(∞−) = ∞ }   (7.4)

This is a multiplicative group and the rank statistic is the maximal invariantstatistic of the corresponding group on the sample space.


Chapter 8

Inference: Bayesian Methods

8.1 Bayesian Inference: general principles

In this chapter, we review the main results obtained up to now in the field of Bayesian inference, more particularly for nonparametric and semiparametric models.

Whereas a statistical model, in sampling theory, is a family of sampling distributions indexed by a parameter, a Bayesian model is characterized by a unique probability measure on the product space parameter × observation. This probability is obtained by endowing a sampling-theory model with a probability measure on the parameter space, called the prior probability, and by treating the sampling model as a conditional probability on the sample space given the parameter. Bayesian methods aim at analysing posterior and predictive distributions. The former are distributions of the parameters conditionally on the observations; the latter average the sampling distributions using the prior probability as weight function.

8.2 Nonparametric duration models without censoring.

8.2.1 A nonparametric Bayesian model.

Even though earlier Bayesian papers have discussed nonparametric methods, Ferguson's (1973, 1974) papers have been most influential in motivating new contributions over the last ten years. In this section we summarize the basic


results obtained in that direction, limiting the presentation to the particular case of duration models, viz. models for non-negative observations, although the basic model is considerably more general.

In the case of duration models, the sample space is IR+, the positive part of the real line, endowed with its Borel sets. In the case of nonparametric models, the parameter space is an appropriate subset of the set of all probability measures on the sample space, or of some of their transformations, such as the survivor function or the hazard functions.

Thus, the basic nonparametric Bayesian model for analyzing duration data may be described as follows. The sampling process is I.I.D. and the sample space, for a sample of size one, is (IR+, B+). For a given sample size n, the data are therefore constituted as t1, · · · , tn where each ti (1 ≤ i ≤ n) is independently generated by a common sampling probability, say Φ, an element of the set of all probability measures on (IR+, B+). Thus one may simply write Φ(A) instead of P(t ∈ A|Φ) for any A ∈ B+.

For the prior specification, note that the prior probability makes Φ a random probability measure on (IR+, B+); it should therefore be viewed as a stochastic process whose trajectories are probability measures on (IR+, B+). Following Ferguson (1973, 1974), a workable choice, which is also natural conjugate to the empirical process, is the Dirichlet process. The basic intuition and the main features of this specification may be approached through the finite case.

Thus, let us assume, as a first step in the presentation, that the sampling probability Φ is restricted to give positive probabilities to a fixed finite set, say (a1, · · · , ak). Thus Φ is characterized by a point θ of the simplex Sk of IRk: θ = (θ1, · · · , θk) ∈ IRk+ such that θ1 + · · · + θk = 1, with the interpretation that θj = Φ({aj}), 1 ≤ j ≤ k. An observed duration t may be represented by a vector of binary variables x = (x1, · · · , xk) with xj = 1I{t=aj}, and the sampling probability may then be written as:

p(x|θ) = ∏_j θj^{xj}.   (8.1)

A natural conjugate prior distribution for the sampling process is providedby the Dirichlet distribution whose density may be written as:

p(θ) = fDi(θ|n0, p0) = Γ(n0) ∏_j [ θj^{n0 p0j − 1} / Γ(n0 p0j) ] 1I{θ ∈ Sk}   (8.2)

where (n0, p0) ∈ IR+ × Sk are natural parameters for a Dirichlet distribution. If we have n I.I.D. observations of x, i.e. xi = (xi1, · · · , xij, · · · , xik) for


i = 1, · · · , n, we may form the vector of proportions p = (p1, · · · , pk) with pj = n−1 Σ_{1≤i≤n} xij. The latter, multiplied by n, constitutes a sufficient statistic distributed as a multinomial. The natural conjugate property implies that the posterior distribution of θ is again Dirichlet with parameters:

n∗ = n0 + n (8.3)

p∗ = (n0 p0 + n p) / (n0 + n).   (8.4)
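The updating rules (2.3)–(2.4) are straightforward to implement. The helper below is an illustrative sketch (the function name and interface are not from the original text); it updates the prior (n0, p0) with the observed counts n p̄:

```python
def dirichlet_update(n0, p0, counts):
    """Posterior parameters of the conjugate Dirichlet distribution:
       n* = n0 + n,   p* = (n0 p0 + n pbar) / (n0 + n),
    where counts[j] = n * pbar_j is the number of observations at a_j."""
    n = sum(counts)
    n_star = n0 + n
    p_star = [(n0 * p0j + cj) / n_star for p0j, cj in zip(p0, counts)]
    return n_star, p_star
```

For instance, dirichlet_update(2.0, [0.5, 0.5], [3, 1]) returns (6.0, [2/3, 1/3]): the prior guess is pulled towards the empirical proportions (3/4, 1/4) with weight n/(n0 + n).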

Suppose now that, instead of restricting Φ to have a fixed finite support {a1, · · · , ak}, we consider a finite fixed partition of IR+, namely B1, · · · , Bj, · · · , Bk, and restrict the parameter of interest to be θ = (θj) with θj = P(t ∈ Bj|Φ) = Φ(Bj). It is then natural to retain from an I.I.D. sample of durations t1, · · · , tn the proportions of observations in each Bj, i.e. pj = n−1 Σ_{1≤i≤n} 1I{ti ∈ Bj}. If the prior specification of Φ is such that it implies a Dirichlet distribution on θ, the analysis of the case of finite support may be exactly repeated without any change. The transition from the Dirichlet distribution to the Dirichlet process is achieved by switching from a fixed to an arbitrary partition and by replacing the prior parameter p0 = (p01, · · · , p0k) ∈ Sk by a probability measure P0 on the sample space (IR+, B+). More specifically, the sampling probability Φ on (IR+, B+) is said to be distributed as a Dirichlet process with parameter (n0, P0), denoted by Φ ∼ Di(n0, P0), if for any measurable partition (B1, · · · , Bk) of IR+, the random vector (Φ(B1), · · · , Φ(Bk)) is distributed as a Dirichlet distribution with parameters n0 and (P0(B1), · · · , P0(Bk)). From the properties of the Dirichlet distribution it may be verified that the system (Φ(B1), · · · , Φ(Bk)), defined for any finite partition of IR+, induces a projective system and therefore, by the Kolmogorov theorem, uniquely characterizes the law of the process generating Φ.

Let us write Pn for the empirical process, viz:

Pn(B) = (1/n) Σ_{1≤i≤n} 1I{ti ∈ B} = (1/n) Σ_{1≤i≤n} δti(B),   B ∈ B+   (8.5)

where δti is the unit mass at ti. The statistic Pn is sufficient for I.I.D. sampling and the posterior probability of Φ is again a Dirichlet process with parameters:

n∗ = n0 + n (8.6)


P∗ = (n0 P0 + n Pn) / (n0 + n).   (8.7)

In most applications, the location parameter of the prior distribution, P0, is a continuous probability measure whereas the empirical process, Pn, is discrete. Thus P∗, the location parameter of the posterior distribution, is typically a mixed probability measure on IR+ smoothing Pn.

The description of this Bayesian model is now completed by describing the predictive process through the following sequence.

(i) the first observation t1 is predictively distributed as P0

(ii) the observation ti+1, conditionally on (t1, · · · , ti), is predictively distributed according to (n0 + i)^{−1}(n0 P0 + i Pi), where Pi is the empirical process of (t1, · · · , ti), namely:

Pi(B) = (1/i) Σ_{1≤j≤i} 1I{tj ∈ B},   B ∈ B+.   (8.8)

An important feature of this predictive process is to generate ties. More specifically, one has, for instance:

P(t2 = t1) = 1/(n0 + 1),   (8.9)

when P0 is continuous. As a consequence, the predictive distribution of t1, · · · , tn may also be characterized globally in an alternative way, based on the following remark. The information contained in (t1, · · · , tn) is equivalently described by (Cn, (t(j))1≤j≤p), where (t(j))1≤j≤p is the vector of distinct values taken by (t1, · · · , tn), the values being ordered according to their order of appearance in (t1, · · · , tn), and Cn is a partition of {1, 2, · · · , n} into p non-empty elements (1 ≤ p ≤ n), namely Cn = {Ij : 1 ≤ j ≤ p}, where Ij is the non-empty subset of {1, · · · , n} of the indices i for which ti = t(j). Note that p is a function of Cn, namely p = |Cn|, where |Cn| is the cardinality (the number of elements) of Cn.

Therefore the distribution of (t1, · · · , tn) is equivalently described by the distribution of Cn, running over all partitions of {1, · · · , n} into 1, 2, · · · , n elements, and the distribution of (t(j))1≤j≤p, conditionally on Cn.

The marginal distribution of Cn is somewhat involved and is given with some details in Blackwell and MacQueen (1973), Antoniak (1974), Yamato (1984) and Rolin (1992b). The distribution of (t(j))1≤j≤p, conditionally on Cn, may be described more easily. It depends essentially on p, the number


of distinct values, and each t(j) is otherwise I.I.D. with distribution P0 when the latter is continuous.
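The predictive sequence (i)–(ii), and the way it generates ties, can be simulated directly. The sketch below is illustrative (not from the original text): at each step, with probability n0/(n0 + i) a fresh value is drawn from P0, otherwise a uniformly chosen earlier observation is repeated, which is exactly the Blackwell–MacQueen urn.

```python
import random

def dp_predictive_sample(n, n0, draw_p0, seed=0):
    """Sequential draws from the predictive distributions of a
    Dirichlet process Di(n0, P0):
        t_{i+1} ~ (n0 P0 + i P_i) / (n0 + i)."""
    rng = random.Random(seed)
    ts = []
    for i in range(n):
        if rng.random() < n0 / (n0 + i):
            ts.append(draw_p0(rng))    # new value from the prior guess P0
        else:
            ts.append(rng.choice(ts))  # repeat an earlier observation: a tie
    return ts
```

With P0 continuous (here a uniform draw), every repeated value is an exact tie and P(t2 = t1) = 1/(n0 + 1), as in (2.9); small n0 produces many ties, while large n0 reproduces nearly I.I.D. sampling from P0.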

8.2.2 Some properties of the Dirichlet process.

The previous section suggests that the structure of the nonparametric Bayesian model under a Dirichlet prior specification is simple and provides a workable approach to the evaluation of the posterior distribution as well as of the predictive distribution. In this section, some properties of the model reinforce this idea, but other properties shed light on some subtle aspects of the Dirichlet process and require that it be handled with special care.

Moments.

Let Φ be a random probability on (IR+, B+) distributed as a Dirichlet process with parameter (a, M); thus, in the context of section 2.1, a may be taken as n0 or n∗ and M as P0 or P∗. From the definition of the Dirichlet process it should be clear that, for any B ∈ B+, the random variable Φ(B) follows a beta distribution with parameters (aM(B), aM(Bc)); in particular one has:

E[Φ(B)] = M(B) (8.10)

V[Φ(B)] = M(B)M(Bc) / (a + 1).   (8.11)

Thus, P0 may be interpreted as a "prior guess" on Φ and n0 as a measure of prior precision. Similarly, P∗ is the posterior expectation of the process and n∗ characterizes its posterior precision. Consequently, P∗ may be viewed as a natural Bayesian estimator of Φ, built as a convex combination of the prior guess P0 and the empirical process Pn, and convergent insofar as it has the same asymptotic behavior as Pn.
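Formulae (2.10)–(2.11) give the marginal moments of Φ(B) directly. A small illustrative helper (not from the original text) makes the role of the precision parameter explicit:

```python
def dp_marginal_moments(a, m_B):
    """Phi(B) ~ Beta(a M(B), a M(B^c)), hence
       E[Phi(B)] = M(B)  and  V[Phi(B)] = M(B)(1 - M(B)) / (a + 1)."""
    return m_B, m_B * (1.0 - m_B) / (a + 1.0)
```

The precision a only scales the variance: dp_marginal_moments(3.0, 0.25) gives mean 0.25 and variance 0.25 · 0.75 / 4 = 0.046875, and the variance shrinks to 0 as a → ∞, i.e. as the prior guess becomes dogmatic.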

The trajectories of the Dirichlet process.

Proper understanding of Bayesian nonparametric models under a Dirichlet prior specification requires a serious analysis of the trajectories of the process, both their structure and their support. Furthermore, knowledge of the trajectories is crucial for designing efficient numerical procedures of simulation and for analytical derivations; details on these issues are given in Florens and Rolin (1994). In particular, it is shown there how to estimate moments under a


Dirichlet prior specification. The relationship between classical bootstrapping and simulation of the posterior distribution under a noninformative prior specification is also discussed.

When a random probability Φ on (IR+, B+) is distributed as a Dirichlet process with parameters (a, M), the structure and the support of its trajectories depend crucially on the location parameter M. To show this, let us decompose M into its purely discrete and continuous parts, i.e., let:

S = {x ∈ IR+ : M({x}) > 0} = {aj : j ∈ I}

(where I ⊂ IN is a finite or countable set)

ad = aM(S);   ac = aM(Sc)

Md(B) = M(B|S);   Mc(B) = M(B|Sc)   ∀ B ∈ B+.

Clearly, one has:

aM = ac Mc + ad Md.   (8.12)

Let us now do the same decomposition on Φ, namely let

α = Φ(S)

Φd(B) = Φ(B|S);   Φc(B) = Φ(B|Sc),

so that:

Φ = (1 − α)Φc + αΦd.   (8.13)

It has been shown in Rolin (1992a) that

(i) α,Φc,Φd are independent

(ii) α has a beta distribution with parameter (ad, ac)

(iii) Φc is distributed as a Dirichlet process with parameter (ac,Mc)

and

(iv) Φd is distributed as a Dirichlet process with parameter (ad,Md)


Since Md has a countable support S, Φd has the same support with probability 1 and {Φd({aj}) : j ∈ I} has a countable Dirichlet distribution with parameter {aM({aj}) : j ∈ I}. In particular, Φd({aj}) has a beta distribution with parameters (aM({aj}), ad − aM({aj})). Furthermore, the support of this Dirichlet process is the set of all probabilities with support S. Now, Ferguson (1973) has shown that the trajectories of Φc are almost surely discrete. In other words, Φc may be represented as:

Φc = Σ_{1≤j<∞} γj δτj   (8.14)

almost surely (and not only in distribution). A first description of the distribution of (τj)1≤j<∞ and (γj)1≤j<∞ has been provided by Ferguson (1973):

(i) (τj)1≤j<∞ is an infinite I.I.D. sample from Mc

(ii) the sequence (γj)1≤j<∞ is normalized and decreasing, i.e. such that Σj γj = 1 and 0 < γj+1 < γj with probability one, and may be represented as follows:

γj = (Σ_{1≤l<∞} Jl)^{−1} Jj   (8.15)

where the sequence (Jj)1≤j<∞ is Markovian and decreasing (Jj+1 < Jj) and its distribution is as follows:

P[J1 ≤ t] = exp{ −ac ∫t^∞ u^{−1} e^{−u} du },   t ≥ 0   (8.16)

P[Jj+1 ≤ t | J1, · · · , Jj] = exp{ −ac ∫t^{Jj} u^{−1} e^{−u} du },   0 ≤ t ≤ Jj   (8.17)

(iii) furthermore, the sequences (τj)1≤j<∞ and (Jj)1≤j<∞ are independent and, eventually, so are the sequences (τj)1≤j<∞ and (γj)1≤j<∞.

A second description has been obtained in Rolin (1992b) and Sethuraman (1994), where it is shown that there exists an infinite permutation of IN such that, keeping the same representation of Φc as in (2.14), (i) and the second part of (iii) remain valid and (ii) becomes:


(ii*) the sequence (γj)1≤j<∞ may be represented as follows:

γj = βj ∏_{1≤ℓ≤j−1} (1 − βℓ)   (8.18)

where (βj)1≤j<∞ is an infinite I.I.D. sample of the beta distribution with parameter (1, ac).

It should be remarked that (i) the distribution of (τj) (of (γj)) depends only on Mc (on ac); (ii) even though Mc is continuous, the trajectories of Φc are almost surely discrete, but the infinitely many jumps are randomly located and the support of Φc is almost surely dense in the support of Mc. In other words, for any set B such that Mc(B) > 0, Φc(B) is almost surely a strictly positive random variable. Furthermore, any probability absolutely continuous with respect to Mc is in the pointwise support of the Dirichlet process (recall that pointwise convergence is defined by Mn(B) → M(B) ∀ B ∈ B+).
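Representation (ii*) is the stick-breaking construction and is the standard way to simulate trajectories of Φc. A truncated sketch (illustrative, not from the original text):

```python
import random

def stick_breaking_weights(a_c, n_atoms, seed=1):
    """gamma_j = beta_j * prod_{l<j} (1 - beta_l),
    with beta_j i.i.d. Beta(1, a_c); truncated after n_atoms terms."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        b = rng.betavariate(1.0, a_c)
        weights.append(b * remaining)  # piece broken off the remaining stick
        remaining *= 1.0 - b
    return weights
```

Pairing these weights with an I.I.D. sample (τj) from Mc gives a (truncated) trajectory Σj γj δτj; the leftover mass (1 − β1) · · · (1 − βn) is geometrically small for moderate truncation levels.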

Let us now consider the consequences of a Dirichlet prior specification for the Bayesian model. When P0 is a discrete probability measure, with jumps at a fixed set S = {aj : j ∈ I} ⊂ IR+ with I ⊂ IN, the sampling probabilities Φ will be almost surely discrete with the same support S, both a priori and a posteriori, for almost any observation t1, · · · , tn, because P∗ will have the same support as P0. Furthermore, the support of the Dirichlet process is the set of all probabilities on S, both a priori and a posteriori. Also the sequence of the predictive distributions generating t1 and (ti+1|t1, · · · , ti) will all be probabilities on S. Note that in such a case, the model is unable to handle an observation falling outside S.

When P0 is continuous, the prior trajectories are characterized as in (2.14). For the posterior trajectories, let us remark that P∗c = P0 and P∗d = Pn. Therefore, if (t(j))1≤j≤p is the set of distinct values taken by (t1, . . . , tn) and if nj is the number of ti's that are equal to t(j), 1 ≤ j ≤ p, Pn may be written as:

Pn = Σ_{1≤j≤p} (nj/n) δt(j)

Therefore, according to (2.12), Φ may be represented a posteriori as

Φ = (1 − αn)Φc + αnΦd

where:

αn = Σ_{1≤j≤p} Φ({t(j)})


and

Φd = αn^{−1} Σ_{1≤j≤p} Φ({t(j)}) δt(j) = Σ_{1≤j≤p} Φd({t(j)}) δt(j).

In this representation, conditionally on (t1, . . . , tn),

(i) αn, Φc, Φd are independent.

(ii) αn has a beta distribution with parameter (n, n0)

(iii) Φc is a Dirichlet process with parameter (n0, P0)

(iv) Φd is a Dirichlet process with parameter (n, Pn)

or equivalently,

(iv*) {Φd({t(j)}) : 1 ≤ j ≤ p} has a Dirichlet distribution with parameter (nj : 1 ≤ j ≤ p).

It should be remarked that: (i) the normalized part of Φ outside the observations, Φc, is not revised by the observations, i.e. it is distributed as Φ a priori; (ii) the normalized part of Φ at the observations, Φd, has a distribution independent of the prior distribution, i.e. it does not depend on (n0, P0). Finally, Φ is a convex combination of Φc and Φd, the coefficient of which (αn) has a distribution depending only on the prior precision and the sample size, i.e. on (n0, n).

"Non-informative" prior specification raises problems. A natural suggestion has indeed been to consider letting n0 → 0. In that case, the behaviour of the posterior process is rather natural: the posterior process tends to a Dirichlet process with parameter (n, Pn). Note however the discontinuity at n0 = 0, where the prior distribution has the pathological representation of a random jump process: Φ = δt with t ∼ P0.

Uses and Extensions of Dirichlet process in duration models.

When modelling duration data, it is rather natural to give structure to one of the following transformations of the sampling probability Φ: either the survivor function


Σ(t) = Φ((t,∞)) (8.19)

or the cumulative hazard function, defined in either one of the following two non-equivalent ways (denoted Λ and Λ̃ respectively):

Λ(t) = − ln Σ(t)   (8.20)

Λ̃(t) = ∫[0,t] [Σ(u−)]^{−1} Φ(du)   (8.21)

Historically, Bayesian statistical analysis of duration models has used definition (2.20) to define neutral to the right processes (Doksum (1974)) and to model proportional hazards models (see section 5). Definition (2.21), as underlined by a referee, has a better probabilistic meaning in relation with the Doob-Meyer decomposition (Λ̃(t) is a previsible process and 1I{ti≤t} − Λ̃(t) is a martingale). It serves as a cornerstone of martingale estimators (see, e.g., Fleming and Harrington (1991) and Andersen, Borgan, Gill and Keiding (1993)). Its utility in Bayesian analysis has been shown in the definition of Beta processes introduced by Hjort (1990). Due to some confusion arising in the literature, we present the relations existing between the two definitions.

Because Σ(t) is non-increasing and right-continuous, with Σ(0) ≤ 1 and Σ(∞) = 0, Λ(t) and Λ̃(t) are both non-decreasing and right-continuous, with 0 ≤ Λ̃(0) ≤ Λ(0) and Λ̃(∞) ≤ Λ(∞) = ∞. Therefore, both Λ(t) and Λ̃(t) can be viewed as the cumulative distribution function of a σ-finite measure on IR+. Let us consider the decomposition of Λ and Λ̃ into discrete and continuous parts:

Λ(t) = Λc(t) + Λd(t)   (8.22)

Λ̃(t) = Λ̃c(t) + Λ̃d(t)   (8.23)

We note that the continuous parts always coincide:

Λc(t) = Λ̃c(t)   ∀ t,

but the discrete parts are different, since:

Λd(t) = Σ_{0≤s≤t} ln [Σ(s−)/Σ(s)]   (8.24)
      = − Σ_{0≤s≤t} ln [1 − (Λ̃(s) − Λ̃(s−))]


Λ̃d(t) = Σ_{0≤s≤t} [1 − Σ(s)/Σ(s−)]   (8.25)
      = Σ_{0≤s≤t} [1 − exp{−(Λ(s) − Λ(s−))}]

Thus, from the logarithmic inequality ln x ≥ 1 − x^{−1}, we conclude that Λ(t) ≥ Λ̃(t) ∀ t and, in terms of the associated σ-finite measures, Λ(B) ≥ Λ̃(B) for any Borel set B of IR+. Definition (2.20) has the advantage of being easily inverted; indeed:

Σ(t) = e^{−Λ(t)}   (8.26)

But inversion of definition (2.21) is more complicated, since:

Σ(t) = e^{−Λ̃c(t)} ∏_{0≤s≤t} [1 − (Λ̃(s) − Λ̃(s−))]   (8.27)

Formula (2.27) is a simple expression of the so-called product-limit integral in the case of a mixed cumulative hazard function.

The process Λ̃(t) has some nicer properties than the process Λ(t). One of them, as seen in the second expression of (2.25), relies on the fact that Λ̃(t) has jumps of size smaller than one.
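For a purely discrete Φ (so that the continuous part of the cumulative hazard vanishes), formula (2.27) can be checked mechanically: the hazard jumps are ΔΛ̃(a) = Φ({a})/Σ(a−) and the product of the factors 1 − ΔΛ̃ rebuilds the survivor function. A sketch (illustrative, not from the original text):

```python
from fractions import Fraction as F

def survivor_from_hazard_jumps(atoms):
    """atoms: list of (a_j, Phi({a_j})) in increasing order, masses summing to 1.
    Returns [(a_j, Sigma(a_j))] via the product-limit formula (2.27)
    with no continuous part."""
    s_left = F(1)            # Sigma(a_j -): mass not yet spent
    sigma, out = F(1), []
    for a, p in atoms:
        jump = p / s_left    # Delta Lambda~(a_j), always <= 1
        sigma *= (1 - jump)  # product-limit step
        s_left -= p
        out.append((a, sigma))
    return out
```

For Φ with masses 1/2, 1/4, 1/4 at the points 1, 2, 3 this returns Σ(1) = 1/2, Σ(2) = 1/4, Σ(3) = 0, which is exactly Φ((t,∞)); note that each hazard jump is indeed of size at most one, the property singled out above for Λ̃.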

Once Φ is distributed as a Dirichlet process with parameter (a, M), the law of the survivor process Σ, which is purely discrete in view of (2.13) and (2.14), is easily characterized. Indeed, for any k and any ordered k-tuple t1 < t2 < · · · < tk, the joint distribution of the random vector (Σ(t1), · · · , Σ(tk)) is obtained from the joint distribution of (1 − Σ(t1) = Φ((0, t1]), Σ(t1) − Σ(t2) = Φ((t1, t2]), · · · , Σ(tk−1) − Σ(tk) = Φ((tk−1, tk]), Σ(tk) = Φ((tk,∞))), which is a Dirichlet distribution with parameter (n0[1 − S0(t1)], n0[S0(t1) − S0(t2)], · · · , n0[S0(tk−1) − S0(tk)], n0S0(tk)), where S0 is the survivor function associated with P0. For the laws of Λ and Λ̃, note first that both Λ and Λ̃ have independent (non-negative) increments. For the case of Λ, this is a direct consequence of a basic property of the Dirichlet process. Indeed, for any k and any ordered k-tuple t1 < t2 < · · · < tk, the Dirichlet process makes Σ(tj+1)[Σ(tj)]^{−1} independent of (Σ(t1), Σ(t2), · · · , Σ(tj)) and consequently of the ratios Σ(ti+1)[Σ(ti)]^{−1} ∀ i < j; on taking logarithms one obtains the independence of the increments of Λ, in view of definition (2.20). The same property holds for Λ̃(t) because, from (2.24) and (2.25), Λ̃(t) is a bijective transformation of Λ(t) (same locations of jumps and bijective transformation of jump heights) and eventually keeps the same independence properties. Furthermore, as Σ(tj+1)[Σ(tj)]^{−1} has a beta distribution with


parameters (aM((tj+1,∞)), aM((tj, tj+1])) (see e.g. Rolin (1992a)), the distribution of each increment of Λ is log-beta with the same parameters.

A generalization of the Dirichlet process may be obtained by relaxing the log-beta distribution property of the increments of Λ. More specifically, a positive right-continuous stochastic process indexed by t ≥ 0, with independent non-negative increments, is called a Levy process, and the associated random σ-finite measure is called a purely random measure. From (2.20), assuming Λ to be a Levy process is equivalent to assuming that Σ is neutral to the right, i.e. for any k and any ordered k-uple 0 = t0 < t1 < t2 < . . . < tk, the random variables Σ(tj+1)[Σ(tj)]^{−1} are mutually independent (for more details see e.g. Doksum (1974)). It may be shown (see e.g. Ferguson and Klass (1972)) that the continuous part of any Levy process is deterministic and that its discrete part is also a Levy process. Furthermore, Λ is a Levy process if and only if Λ̃ is a Levy process, in which case they share the same deterministic continuous part but have different discrete parts.

An example of such a use of a Levy process, as a tool for modellingneutral to the right processes, is given by the Beta processes introduced byHjort (1990).

To describe this extension, let us first assume that M is supported by a fixed finite ordered set (a1, ..., ak). Then, with a0 = 0:

ΔΛ̃(aj) = Λ̃(aj) − Λ̃(aj−1) = 1 − Σ(aj)/Σ(aj−1)   (8.28)

Whence the ΔΛ̃(aj), 1 ≤ j ≤ k, are independent and have beta distributions with parameters (aM((aj−1, aj]), aM((aj,∞])). It follows that:

E[ΔΛ̃(aj)] = L(aj) = 1 − M((aj,∞]) / M((aj−1,∞])   (8.29)

Note that ΔΛ̃(aj) is the conditional probability P(ti = aj | ti ≥ aj) (given Σ). One could then consider events of the form {ti = aj} as binomial trials to which one then assigns probabilities from a beta distribution. Indeed, ΔΛ̃(aj) has a beta distribution with parameter (c(aj)L(aj), c(aj)[1 − L(aj)]), where c(aj) = aM((aj−1,∞]).

Now, if the function c is arbitrary, Λ̃ is said to be a discrete Beta process with parameters c and L.

In the general case, if Φ is distributed as a Dirichlet process with parameter (a, M), Λ̃ will be said to be distributed as a Beta process with parameter (c, L), denoted by Λ̃ ∼ Be(c, L), where:


c(t) = aM([t,∞]) (8.30)

and

L(t) = ∫[0,t] M([u,∞])^{−1} M(du)   (8.31)

i.e. L is the cumulative hazard function of the probability M (see (2.21)). Therefore, if a priori Φ ∼ Di(n0, P0), or equivalently Λ̃ ∼ Be(c0, L0), then a posteriori Φ|t1, t2, ..., tn ∼ Di(n∗, P∗) or equivalently Λ̃|t1, t2, ..., tn ∼ Be(c∗, L∗).

Now, clearly

c∗(t) = n∗P∗([t,∞]) = c0(t) + nPn([t,∞]) (8.32)

and, since (2.31) entails that n0P0(du) = c0(u)L0(du):

L∗(t) = ∫[0,t] P∗([u,∞])^{−1} P∗(du)
      = ∫[0,t] [c0(u)L0(du) + nPn(du)] / [c0(u) + nPn([u,∞])]   (8.33)

Hjort (1990) has shown that, taking c0 to be an arbitrary positive measurable function on IR+, Λ̃ ∼ Be(c0, L0) is precisely defined as a prior specification and that, a posteriori, Λ̃|t1, t2, ...tn ∼ Be(c∗, L∗) where c∗ and L∗ are defined by the last members of (2.32) and (2.33). Therefore, Beta processes form a natural conjugate family larger than the Dirichlet processes (n0, a real number, is replaced by a measurable function c0) and smaller than the neutral to the right processes or, equivalently, the Levy processes.

8.3 Nonparametric duration models with censored observations.

One of the main features of duration data sets is the presence of censored observations. In this section, we want to show the extensions of the results presented in section 2 to non-stochastically right-censored durations. Let us first recall that, in this case, the sample is generated as follows: for any i = 1, · · · , n, ci is a fixed duration, τi is a latent duration independently and identically distributed from an unknown probability Φ, and we observe:


ti = min(τi, ci)

di = 1I{τi ≤ ci}   (8.1)

The unknown functional parameter is endowed with a prior probability (here, a Dirichlet process) and we want to analyse the posterior probability of Φ given (ti, di)1≤i≤n.

The main results of this analysis are the following:

i) the family of Dirichlet processes is not closed under such a sampling. In other words, the posterior probability deduced from a Dirichlet prior and a censored sample is not a Dirichlet process. However, the class of neutral to the right processes is closed for the Bayesian inference, as shown by Ferguson and Phadia (1979).

ii) Elements of this class are not as simple as Dirichlet processes, but some of their characteristics can be obtained analytically. This is in particular the case for the expectation of the survivor function. An application of this computation is the posterior expected survivor function constructed through a Dirichlet prior (see Susarla and Van Ryzin (1976, 1978), Blum and Susarla (1977)).

iii) Beta processes, which constitute a strict subclass of neutral to the right processes, are still closed for sampling with right-censoring, as shown by Hjort (1990). This implies in particular that the posterior of a Dirichlet process prior is a Beta process.

In order to point out the essential elements of this inference procedure, we will start with the treatment of the "finite" case in which the support of Φ is finite. If we assume that the values ci are also elements of this set, we may describe the sample using different counting statistics.

Let {a1, · · · , ak} be the support of Φ (with a1 < a2 < · · · < ak) and let n be the sample size. For any j = 1, · · · , k we define:

ej = Σ_{1≤i≤n} 1I{ti = aj, di = 1}   (8.2)

the number of non-censored durations equal to aj, and

hj = Σ_{1≤i≤n} 1I{ti = aj}   (8.3)


the number of censored or non censored durations equal to aj

nj =∑

j≤`≤k

h` =∑

1≤i≤n

1Iti≥aj

the number of individuals at risk in aj .

Note that, in particular, n1 = n = ∑_{1≤j≤k} hj and that hj − ej is the number of censored durations at aj. Furthermore, the statistic (hj, ej)_{1≤j≤k} is obviously sufficient.
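As a concrete illustration, the three counting statistics can be computed directly from a sample of pairs (ti, di); the following Python sketch uses our own (hypothetical) function and variable names:

```python
def counting_stats(t, d, a):
    # e_j: non-censored durations equal to a_j            (8.2)
    # h_j: censored or non-censored durations equal to a_j (8.3)
    # n_j: individuals at risk at a_j, i.e. with t_i >= a_j
    e = [sum(1 for ti, di in zip(t, d) if ti == aj and di == 1) for aj in a]
    h = [sum(1 for ti in t if ti == aj) for aj in a]
    n_risk = [sum(1 for ti in t if ti >= aj) for aj in a]
    return e, h, n_risk

t = [1, 1, 2, 3, 3, 3]   # observed durations
d = [1, 0, 1, 1, 1, 0]   # d_i = 1: non-censored, d_i = 0: censored
e, h, n_risk = counting_stats(t, d, a=[1, 2, 3])
# n_1 = n, and h_j - e_j counts the censored durations at a_j
```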

The probability Φ on {a1, · · · , ak} may be described in different ways:

i) the sequence (θj)_{1≤j≤k}, with θj > 0 and ∑_{1≤j≤k} θj = 1, gives the probabilities of the aj ’s, i.e. θj = Φ({aj}), from which we can construct the survivor function

Σ(t−) = ∑_{aj≥t} θj = ∑_j θj 1I{aj ≥ t}    (8.4)

ii) Φ is also characterized by the sequence of hazard rates (λj)_{1≤j≤k} defined by

λj = Λ({aj}) = θj / Σj    where Σj = Σ(aj−).    (8.5)

In particular, the well-known product formula connects θj and Σj to the λj :

θj = λj ∏_{ℓ<j} (1 − λℓ),    Σj = ∏_{ℓ<j} (1 − λℓ).    (8.6)
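The correspondence between the probabilities θj and the hazard rates λj in (8.5)-(8.6) is easy to check numerically; a minimal sketch (function names and toy values are ours):

```python
def hazards_from_probs(theta):
    # lambda_j = theta_j / Sigma_j, with Sigma_j = sum_{l >= j} theta_l   (8.5)
    sigma = [sum(theta[j:]) for j in range(len(theta))]
    return [tj / sj for tj, sj in zip(theta, sigma)], sigma

def probs_from_hazards(lam):
    # product formula (8.6): theta_j = lambda_j * prod_{l < j} (1 - lambda_l)
    theta, surv = [], 1.0
    for lj in lam:
        theta.append(lj * surv)
        surv *= 1.0 - lj
    return theta

theta = [0.2, 0.3, 0.5]
lam, sigma = hazards_from_probs(theta)
theta_back = probs_from_hazards(lam)   # recovers theta; note lambda_k = 1
```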

The likelihood of the sufficient statistics may be written in terms of the probabilities or in terms of the hazard rates:

ℓ((hj, ej)_{1≤j≤k} | (θj)_{1≤j≤k}) = ∏_{1≤j≤k} θj^{ej} Σ_{j+1}^{hj−ej}    (8.7)

or

ℓ((hj, ej)_{1≤j≤k} | (λj)_{1≤j≤k}) = ∏_{1≤j≤k−1} λj^{ej} (1 − λj)^{nj−ej}.    (8.8)
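The hazard-rate form (8.8) of the likelihood translates directly into a log-likelihood; a hedged sketch (the function name and the toy numbers are ours):

```python
import math

def loglik_hazard(lam, e, h):
    # (8.8): sum_{j < k} [ e_j log(lambda_j) + (n_j - e_j) log(1 - lambda_j) ],
    # with n_j = sum_{l >= j} h_l the number at risk at a_j;
    # the last hazard rate lambda_k = 1 is excluded from the product.
    k = len(lam)
    n_risk = [sum(h[j:]) for j in range(k)]
    return sum(e[j] * math.log(lam[j]) + (n_risk[j] - e[j]) * math.log(1.0 - lam[j])
               for j in range(k - 1))

ll = loglik_hazard(lam=[0.25, 0.5, 1.0], e=[1, 1, 2], h=[2, 1, 3])
```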

Let us now consider the prior probability. With the same notation as in section 2.1, a Dirichlet prior distribution on (θj) has density


m((θj)_{1≤j≤k}) ∝ ∏_{1≤j≤k} θj^{n0 p0j − 1}    (8.9)

with respect to the uniform density restricted to the simplex in IR^k. Under the change of variables underlying (3.5) and (3.6), the prior density defined in (3.9) may be rewritten in terms of the (λj)’s as

m((λj)_{1≤j≤k}) ∝ ∏_{1≤j≤k−1} λj^{n0 p0j − 1} (1 − λj)^{n0 S0,j+1 − 1}    (8.10)

with respect to the uniform measure restricted to the set of hazard rates in [0, 1]^{k−1}. In (3.10), S0 is the prior expected survivor function

S0(t−) = ∑_{1≤j≤k} p0j 1I{aj ≥ t}    and    S0j = S0(aj−).    (8.11)

Thus the parameters (λ1, · · · , λk−1) are independently Beta distributed with parameters (n0 p0j , n0 S0,j+1), 1 ≤ j ≤ k − 1. Note the restrictions among the parameters of the independent distributions of the λj implied by (3.11). According to (2.29), an interesting property is the following:

E(λj) = n0 p0j / (n0 p0j + n0 S0,j+1) = p0j / S0j = λ0j .    (8.12)

In other words, the prior expected hazard rates are the hazard rates of theprior expected distribution.

The posterior probability on the θj ’s has the density

m((θj)_{1≤j≤k} | (hj, ej)_{1≤j≤k}) ∝ ∏_j θj^{n0 p0j + ej − 1} ( ∑_{aj′ ≥ aj+1} θj′ )^{hj − ej}    (8.13)

∝ ∏_j θj^{n0 p0j + ej − 1} Σ_{j+1}^{hj − ej} ,    (8.14)

which is clearly not a Dirichlet distribution; this makes explicit the lack of closedness of the class of Dirichlet distributions.

The posterior density on the λj ’s is equal to:

m((λj)_{1≤j≤k−1} | (hj, ej)_{1≤j≤k}) ∝ ∏_{1≤j≤k−1} λj^{n0 p0j + ej − 1} (1 − λj)^{n0 S0,j+1 + nj − ej − 1}    (8.15)


It follows from (3.15) that, a posteriori, the λj (1 ≤ j ≤ k − 1) are still independent. Each hazard rate λj follows a posteriori a Beta distribution with parameters n0 p0j + ej and n0 S0,j+1 + nj − ej , but the previously noted restriction on the parameters of the independent distributions of the λj no longer holds. Thus, the sequence of k − 1 independent Beta probabilities looks like the prior specification but, apart from the case of non-censored data, these distributions cannot be derived from a Dirichlet posterior on (θj)_{1≤j≤k}. If, however, we compute the survivor function from the usual product formula (3.6) and exploit the posterior mutual independence of the λj ’s, we find that:

Snj = E(Σj | (hj, ej), 1 ≤ j ≤ k) = E( ∏_{ℓ<j} (1 − λℓ) | (hj, ej), 1 ≤ j ≤ k )

= ∏_{ℓ<j} ( 1 − E(λℓ | (hj, ej), 1 ≤ j ≤ k) )

= ∏_{ℓ<j} ( 1 − (n0 p0ℓ + eℓ) / (n0 S0ℓ + nℓ) )    (8.16)

The Bayesian estimation Sn(t) is deduced from the Snj using the property that the survivor function is constant between the jumps at the aj and is right-continuous at the jumps (and therefore everywhere).
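Formula (8.16) is easy to implement; the sketch below (our own notation) also illustrates the well-known fact that, as n0 → 0, the posterior expected survivor function approaches the Kaplan-Meier product-limit estimator:

```python
def posterior_survivor(e, h, n0, p0):
    # S_nj = prod_{l < j} [ 1 - (n0*p0_l + e_l) / (n0*S0_l + n_l) ]   (8.16)
    # with S0_l = sum_{m >= l} p0_m and n_l = sum_{m >= l} h_m.
    k = len(e)
    s0 = [sum(p0[j:]) for j in range(k)]
    n_risk = [sum(h[j:]) for j in range(k)]
    out, s = [], 1.0
    for j in range(k):
        out.append(s)                # S_n1 = 1; S_nj = E(Sigma_j | data)
        s *= 1.0 - (n0 * p0[j] + e[j]) / (n0 * s0[j] + n_risk[j])
    out.append(s)                    # expected survivor beyond a_k
    return out

# with a nearly non-informative prior the result is close to Kaplan-Meier
s = posterior_survivor(e=[1, 1, 2], h=[2, 1, 3], n0=1e-9, p0=[1/3, 1/3, 1/3])
```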

The same computation shows that this analysis goes through if, instead of a Dirichlet prior, we specify Λ to be a discrete Beta process. Indeed, in this case, a priori the λj ’s are independently Beta distributed with parameters (c0j λ0j , c0j (1 − λ0j)), where the c0j , 1 ≤ j ≤ k − 1, are arbitrary constants. Hence, a posteriori, the λj ’s are independently Beta distributed with parameters (c∗j λ∗j , c∗j (1 − λ∗j)) where

c∗j = c0j + nj    and    λ∗j = (c0j λ0j + ej) / (c0j + nj).

These relations, in the finite case, extend formulas (2.32) and (2.33) to the case of censoring.
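The discrete Beta process update above can be sketched in a few lines (function name and toy values are ours):

```python
def beta_process_update(c0, lam0, e, h):
    # c*_j = c0_j + n_j,  lam*_j = (c0_j lam0_j + e_j) / (c0_j + n_j),
    # with n_j = sum_{l >= j} h_l the number at risk at a_j.
    k = len(c0)
    n_risk = [sum(h[j:]) for j in range(k)]
    c_star = [c0[j] + n_risk[j] for j in range(k)]
    lam_star = [(c0[j] * lam0[j] + e[j]) / c_star[j] for j in range(k)]
    return c_star, lam_star

c_star, lam_star = beta_process_update(c0=[2.0, 2.0], lam0=[0.1, 0.2],
                                       e=[1, 1], h=[2, 2])
```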

Let us consider the general case in which the functional parameter Φ is not constrained by a finite support condition and is distributed as a neutral to the right prior process. Its associated integrated hazard function Λ(t) = − ln Σ(t) (where Σ(t) = Φ((t,∞))) is a Levy process. Such a process has been defined in section 2.2.3. Let us recall that a Levy process is increasing with independent increments. For any sequence t1 < t2 < · · · < tk, the random variables rj = Λ(tj) − Λ(tj−1) are independently distributed with densities mj(rj), 1 ≤ j ≤ k. In the case of a Dirichlet prior distribution, we have seen that the rj are log-beta distributed. It is sufficient to prove that for any


observation ti (censored or not), the posterior measure given ti is still a Levy process. In what follows, we give a heuristic argument suggesting a clue to a more formal proof.

First let us assume that t is non-censored. In order to compute the posterior distribution of Λ restricted to any finite family of increments, we just have to consider the marginalized likelihood given this family. This likelihood reduces to the probability of the interval (tℓ, tℓ+1] to which the observed t belongs. Then the posterior probability of (rj)_{1≤j≤k} is proportional to:

m((rj), 1 ≤ j ≤ k | tℓ < t ≤ tℓ+1) ∝ [ ∏_{1≤j≤k} mj(rj) ] (Σ(tℓ) − Σ(tℓ+1))

= [ ∏_{1≤j≤k} mj(rj) ] [ exp(− ∑_{1≤j≤ℓ} rj) ] [ 1 − exp(−rℓ+1) ]

= [ ∏_{1≤j≤ℓ} mj(rj) exp(−rj) ] [ mℓ+1(rℓ+1) (1 − exp(−rℓ+1)) ] [ ∏_{ℓ+2≤j≤k} mj(rj) ]    (8.17)

Thus, the rj ’s are a posteriori independent. The posterior densities of the first ℓ increments are proportional to mj(rj) exp(−rj), the posterior density of rℓ+1 is proportional to mℓ+1(rℓ+1)(1 − exp(−rℓ+1)), and the distribution of the last increments is identical a priori and a posteriori.

An identical computation can be done in the case of a censored observation t ∈ (tℓ, tℓ+1]. The marginalized likelihood on the finite family of increments is equal to

Σ(tℓ) = exp(− ∑_{1≤j≤ℓ} rj)

and the posterior distribution of the rj ’s is identical to the previous one except for rℓ+1, which is left unrevised by the observation.

Given a prior distribution on Λ(t), one may derive the prior distribution of any sequence of increments and compute the posterior distribution by applying the previous arguments sequentially over the sample.

In the special case of a Beta prior specification, i.e. L ∼ Be(c0, L0), Hjort (1990) has shown that, a posteriori, L | (ti, di)_{1≤i≤n} ∼ Be(c∗, L∗) as in the case of no censoring, where

c∗(t) = c0(t) + ∑_{1≤i≤n} 1I{ti ≥ t}    (8.18)

and

L∗(t) = ∫_{[0,t]} [ c0(s) L0(ds) + ∑_{1≤i≤n} di δti(ds) ] / [ c0(s) + ∑_{1≤i≤n} 1I{ti ≥ s} ]    (8.19)

For a Dirichlet process prior, the posterior is therefore characterized by(3.18) and (3.19) where c0(t) = n0S0(t−).

In particular (Susarla and Van Ryzin (1976)), the posterior expectation of the survivor function is given by

E(Σ(t) | (ti, di)_{1≤i≤n}) = [ (n0 S0(t) + n(t)) / (n0 + n) ] ∏_{1≤j≤ℓ} [ n0 S0(uj) + n(uj) + fj ] / [ n0 S0(uj) + n(uj) ],    t ∈ [uℓ, uℓ+1)    (8.20)

where u1, u2, · · · , uk are the distinct observed values of the sample, n(t) = ∑_{1≤i≤n} 1I{ti > t} is the number of individuals at risk after t, fj is the number of censored observations at uj, and S0 is the prior survivor function. Let us remark that, in the absence of censoring, (3.20) reduces to the usual result given in section 2.

A more detailed characterization of the posterior distribution for a Dirichlet process prior with censored observations will be given in a future paper, providing easy simulation for analyzing posterior distributions of functionals of the survivor function.

For general Beta processes, and more generally Levy processes, no easy-to-simulate description of the trajectories is available. Simulation techniques must rely on more complicated schemes simulating probabilities of intervals, as in Damien, Laud and Smith (1995) and (1996). However, this method does not lead to simulations of the distributions of functionals. In the case of non-informative priors, Lo (1987) and (1993) provides a Bayesian bootstrap for censored durations.


8.4 Heterogeneity and Mixture of Dirichlet Processes.

8.4.1 Introduction

The last two sections of this paper are devoted to the study of models conditional on a variable, or a function of variables, describing individual heterogeneity. In other words, the first step in the specification consists in describing the law of the duration ti conditionally on θi, a variable representing the individual characteristics of the i-th individual; i.e., one needs to specify

Σ(t|θi) = P (ti > t|θi,Φ)

Two characteristics have to be taken into account: on the one hand, the observability or not of θi, and on the other hand, the class of conditional models to be considered (usually a proportional hazards or an accelerated lifetime model).

First of all, it is usual to consider θi as a function of observed explanatory variables zi and of an unknown structural parameter β (the same for each individual). This is known as observed heterogeneity, and the zi ’s are also known as treatment variables, covariates, and so on. The econometric literature has also emphasized the interest of considering θi as an unobserved realization of a random variable (see e.g. Heckman (1981), Heckman and Singer (1982), (1984a) and (1984b), and Lancaster (1990)). The main reference on identification problems in such models is Elbers and Ridder (1982). A motivation for this model is the following: suppose that each individual has a duration generated by the exponential law with parameter λθi. Given θi, the hazard rate of the i-th individual is time independent, but if θi is random, the marginal distribution of ti has a decreasing hazard rate. This is known as "spurious time dependence". The same heterogeneity argument is also used to explain the U-shaped observed hazard rates in reliability theory.

Finally, two types of conditional models for duration data are generally considered: accelerated lifetime models and proportional hazards models. In the first type of model, the observed lifetime ti is written as θiτi, where τi is a basic lifetime and θi appears as an acceleration factor; therefore,

Σ(t|θi) = Σ(θi^{−1} t)

where Σ is the survivor function of τi.


In the proportional hazards model, the observed lifetime has, conditionally on θi, a survivor function given by

Σ(t|θi) = Σ(t)^{θi}

or equivalently has an integrated hazard function given by

Λ(t|θi) = θi Λ(t)

In this paper, we shall only consider two combinations of conditional model and heterogeneity. In this section, we analyse the accelerated lifetime model with non-observable heterogeneity. In the next section, we discuss the proportional hazards model with observable explanatory variables.

The interest in the accelerated lifetime model with unobserved hetero-geneity lies in the fact that integration of the unobservable variable producesa "smoothing" of the trajectories of the Dirichlet process and furnishes anestimation of the density and of the hazard rate. This is a Bayesian versionof Kernel estimators (see, e.g., Lo and Weng (1989)). In the proportionalhazards model, however, the discrete character of the Dirichlet process ispreserved by integration since the locations of the jumps are unchanged.Moreover, the neutral to the right property is lost and this makes the com-putation of the posterior much more difficult. However such a model maysometimes be useful and may be considered as a byproduct of the computa-tions of the next section.

In the proportional hazards model with θi = a(β, zi), the marginal likelihood on β is easy to obtain and provides alternatives to Cox’s famous partial likelihood. On the contrary, the semiparametric analysis of the accelerated lifetime model with θi = a(β, zi) and a Dirichlet process prior for Φ, i.e. Φ ∼ Di(n0P0), has little interest. Indeed, as long as all the observations are distinct, the posterior distribution of β is identical to the posterior distribution obtained in the parametric model where the τi ’s are i.i.d. P0 (see e.g. Bunke (1981)).

8.4.2 A simple model.

Let us consider n observed durations (t1, · · · , ti, · · · , tn) along with the following multiplicative decomposition:

ti = θiτi 1 ≤ i ≤ n (8.1)


Both the θi ’s and the τi ’s are unobservable and I.I.D. For identifiability, we assume that the distribution of the θi ’s is known and is characterized by a density function. In contrast, the distribution of the τi ’s is unknown and assumed to be a priori distributed as a Dirichlet process.

In the context of the heterogeneity problem, model (4.1) may be interpreted as follows. The duration τi of each individual i is assumed to be generated by the same unknown distribution but accelerated by a factor θi reflecting unobservable individual characteristics. A simple solution to the standard identification problem in accelerated time models is provided by the assumption that the distribution of the factor θi is known.

It is interesting to notice that, because the ti ’s can be represented as the product of two quantities, their sampling distribution is a.s. smooth, i.e. admits a density a.s., in spite of the Dirichlet specification. In other words, the introduction of θi acts similarly to a smoothing kernel, a main difference being a multiplicative rather than an additive convolution. The multiplicative model is indeed more natural than the additive one when dealing with non-negative random variables. One of the results of this model is to produce a Bayesian analogue of kernel estimators.

Let us now be more specific about the basic assumptions underlying (4.1):

(A.1) (θi)_{1≤i≤n} are I.I.D.; their common distribution, denoted by Q, is known, supported by IR+, and admits a known density q.

(A.2) (τi)_{1≤i≤n} are I.I.D.; their common distribution, denoted by Γ, is unknown and distributed a priori as a Dirichlet process with parameters (n0, G0), where G0 is a probability measure supported by IR+.

(A.3) The θi ’s and the τi ’s are jointly independent in the sampling, i.e.:

(θi)_{1≤i≤n} ⊥⊥ (τi)_{1≤i≤n} | Γ    (8.2)

These assumptions imply that the ti ’s are I.I.D. in the sampling; their common distribution, denoted Φ, is the multiplicative convolution of Q and Γ and will be denoted Q·Γ. More precisely:

Φ([0, t]) = Q·Γ([0, t]) = ∫_{IR+} Γ([0, t/θ]) Q(dθ) = ∫_{IR+} Q([0, t/τ ]) Γ(dτ)    (8.3)

Because Q admits a density q, (4.3) may be rewritten as:

Φ([0, t]) = ∫_0^t ∫_{IR+} (1/τ) q(u/τ) Γ(dτ) du    (8.4)


Thus Φ is dominated by Lebesgue measure and admits a density ϕ defined as

ϕ(t) = ∫_{IR+} (1/τ) q(t/τ) Γ(dτ)    (8.5)

When Γ ∼ Di(n0, G0), the distribution of Φ derived from (4.3) is a particular case of a mixture of Dirichlet processes. One may easily check that

E[Φ] = Q·E(Γ) = Q·G0    (8.6)

(d/dt) E[Φ([0, t])] = ∫_{IR+} (1/τ) q(t/τ) G0(dτ)    (8.7)

When G0 is a continuous probability measure, the results given in section 2.2.2 imply that ϕ(t) may be represented as

ϕ(t) = ∑_{1≤k<∞} γk (1/σk) q(t/σk)    (8.8)

and may accordingly be easily simulated using (2.18) and the fact that (σk)_{1≤k<∞} is an infinite I.I.D. sample from G0.
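Assuming the weights in (2.18) are given by the usual Sethuraman stick-breaking construction, a trajectory of the density (8.8) can be simulated as follows (the choices of q and G0 below are illustrative only):

```python
import math, random

def stick_breaking(n0, K, rng):
    # gamma_k = V_k prod_{l < k} (1 - V_l), with V_k ~ Beta(1, n0); truncated at K
    w, stick = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, n0)
        w.append(stick * v)
        stick *= 1.0 - v
    return w

rng = random.Random(0)
n0 = 2.0
gammas = stick_breaking(n0, K=2000, rng=rng)
sigmas = [rng.expovariate(1.0) for _ in gammas]     # i.i.d. draws from G0 = Exp(1)
q = lambda x: math.exp(-x)                          # assumed density of theta: Exp(1)
phi = lambda t: sum(g / s * q(t / s) for g, s in zip(gammas, sigmas))
value = phi(1.0)                                    # one point of a sampled trajectory
```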

The posterior distribution of Φ is somewhat involved. Let us look at its expectation, evaluating it first conditionally on (τ1, · · · , τn):

E[Φ | t1, · · · , tn, τ1, · · · , τn] = [n0/(n0 + n)] (Q·G0) + [n/(n0 + n)] (Q·Gn)    (8.9)

where Gn is the empirical distribution of the τi ’s:

Gn = n^{−1} ∑_{1≤i≤n} δτi    (8.10)

In order to integrate the τi ’s out, it is convenient to represent (τ1, · · · , τn) through three components, namely: (1) p, the number of distinct values among the τi ’s; (2) a partition Cn of {1, · · · , n} into p elements (Ij)_{1≤j≤p}, where each Ij is a set of indices corresponding to identical values of τi (thus Cn represents the configuration of ties among the τi ’s); and finally (3) the p-vector of distinct values of the τi ’s, (τ(j))_{1≤j≤p}. Therefore Gn may also be written as


Gn = (1/n) ∑_{1≤j≤p} nj δτ(j)    (8.11)

where nj = |Ij | and consequently ∑_{1≤j≤p} nj = n. Note also that

ti = τ(j) θi    ∀ i ∈ Ij    (8.12)

where this expression is based on the configuration Cn. Let us now denote by G^j_n the conditional probability of τ(j) given t1, · · · , tn and Cn. Thus

G^j_n(A) = E[1I_A(τ(j)) | t1, · · · , tn, Cn]    (8.13)

It may be shown, and it is intuitively rather natural, that G^j_n depends only on those ti ’s for which i ∈ Ij, and it may be evaluated easily using standard techniques for evaluating posterior distributions; this is so, in particular, because the θi ’s are I.I.D. with known distribution Q and the τ(j) ’s are a priori independently distributed according to G0. An example of the evaluation of G^j_n is given later on. If we first integrate (τ(j))_{1≤j≤p} out from (4.9), conditionally on the configuration Cn, we obtain

E[Φ | t1, t2, . . . , tn, Cn] = [n0/(n0 + n)] (Q·G0) + [1/(n0 + n)] ∑_{1≤j≤p} nj (Q·G^j_n)    (8.14)

Just as above, the properties of the trajectories of the Dirichlet process may be used to generate the posterior distribution of the density ϕ(t) given (t1, t2, . . . , tn) and the configuration of ties Cn. Indeed,

ϕ(t) = (1 − αn) ∑_{1≤k<∞} γk (1/σk) q(t/σk) + αn ∑_{1≤j≤p} β(j) (1/σ(j)) q(t/σ(j))    (8.15)

where αn has a beta distribution with parameters (n, n0), (β(j))_{1≤j≤p} has a Dirichlet distribution with parameters (nj)_{1≤j≤p}, and the σ(j) are independently generated from G^j_n, 1 ≤ j ≤ p.

Since the configuration Cn is unknown, we may, at least formally, integrate Cn out from (4.14) conditionally on t1, t2, . . . , tn, to obtain

E[Φ | t1, t2, . . . , tn] = [n0/(n0 + n)] (Q·G0) + [1/(n0 + n)] ∑_{Cn∈𝒞n} P (Cn | t1, . . . , tn) ∑_{1≤j≤p} nj (Q·G^j_n)    (8.16)


where 𝒞n is the set of all partitions of {1, 2, . . . , n}.

To simulate ϕ(t) conditionally on (t1, t2, . . . , tn), we may use formula (4.15), provided we first generate Cn conditionally on (t1, t2, . . . , tn).

Unfortunately, an exact evaluation of (4.16) is close to impossible for reasonably large sample sizes n. Indeed, the evaluation of P (Cn | t1, · · · , tn) is rather involved, in view of the analysis given in section 2.3 and of the fact that |𝒞n| increases dramatically with n.

Different strategies can be envisaged for facing this difficulty; two of these are the following. A first strategy consists in arbitrarily selecting a configuration Cn of ties. This leaves the strict Bayesian framework insofar as one conditions on non-available information. Although general theorems may ensure the convergence of E(Φ | t1, · · · , tn), the convergence of E(Φ | t1, · · · , tn, Cn) requires specific hypotheses for arbitrary choices of Cn.

One way of selecting a configuration of ties may rely on clustering methods. Indeed, as discussed in section 2.3, given p, (τ(j))_{1≤j≤p} is an I.I.D. sample from G0, and given (τ(j))_{1≤j≤p} and Cn, (ti)_{i∈Ij} is an I.I.D. sample from (1/τ(j)) q(t/τ(j)) for 1 ≤ j ≤ p. A clustering analysis might accordingly estimate p and (Ij)_{1≤j≤p} (and, as a by-product, (τ(j))_{1≤j≤p}).

Note that p, being the cardinality of Cn, is a random variable indexed by

n. It is known (see e.g. Korwar and Hollander (1973) and Rolin (1992b)) that, as n → ∞, pn/(n0 ln n) → 1 a.s., so for large n we can choose pn = n0 ln n. The random sequence Cn should however not be expected to converge; note in particular that the cardinality of its range space is n!. But an estimator Ĉn of Cn (for instance, obtained by clustering) may nevertheless be appropriate if one can show that plugging Ĉn into (4.14) provides a convergent procedure. As illustrated in the particular case presented in subsection 4.3, a possible avenue is to look for a sufficient summary yn = fn(t1, . . . , tn, Cn) of (t1, t2, . . . , tn, Cn) such that

Φ⊥⊥(t1, t2, . . . , tn, Cn) | yn (8.17)

and then to look for an estimator ŷn of yn such that replacing yn by ŷn would still provide a consistent procedure for evaluating E[Φ | t1, t2, . . . , tn, Cn] = E[Φ | yn].
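The asymptotic behaviour pn/(n0 ln n) → 1 mentioned above can be checked by simulating the number of distinct values in a Dirichlet process sample through the Chinese-restaurant scheme (a sketch; the sample size and the seed are arbitrary):

```python
import math, random

def num_distinct(n, n0, rng):
    # In a Dirichlet process sample with precision n0, observation i + 1
    # takes a new value with probability n0 / (n0 + i).
    p = 0
    for i in range(n):
        if rng.random() < n0 / (n0 + i):
            p += 1
    return p

rng = random.Random(1)
n, n0 = 200000, 2.0
p_n = num_distinct(n, n0, rng)
ratio = p_n / (n0 * math.log(n))   # should be close to 1 for large n
```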

Another kind of strategy (proposed and largely used by Escobar (1994) and Escobar and West (1994)) relies on simulation methods (post-data sampling). Here, Gibbs sampling is a natural candidate for generating (τi)_{1≤i≤n} conditionally on (ti)_{1≤i≤n}. This joint distribution is rather complicated and almost unmanageable if n is large, but the full conditional distributions are rather simple. Indeed, conditionally on (τi′)_{i′≠i} and (ti′)_{i′≠i} (see formula


(2.24)), the distribution of τi is given by

G^i_0 = [n0/(n0 + n − 1)] G0 + [1/(n0 + n − 1)] ∑_{i′≠i} δτi′    (8.18)

and the conditional distribution of ti given τi has density:

h(ti | τi) = (1/τi) q(ti/τi)    (8.19)

Then, by Bayes’ theorem, the conditional distribution of τi given ti (and (τi′)_{i′≠i} and (ti′)_{i′≠i}) is deduced from (4.18) and (4.19), i.e.,

P [τi ∈ A | τi′ , i′ ≠ i; ti, 1 ≤ i ≤ n] = [ n0 h(ti) G^{ti}_0(A) + ∑_{i′≠i} (1/τi′) q(ti/τi′) δτi′(A) ] / [ n0 h(ti) + ∑_{i′≠i} (1/τi′) q(ti/τi′) ]    (8.20)

where G^{ti}_0 is the conditional distribution of τi given ti computed with G0 as a prior distribution for τi. The predictive density h(ti) is the density of Q·G0 (formula (4.7)).

Starting from an initial value of (τi)_{1≤i≤n}, the Gibbs sampling procedure generates the τi sequentially from (4.20), to finally obtain a draw of (τi)_{1≤i≤n} conditional on (ti)_{1≤i≤n}. From this, a realization of the posterior distribution of Γ or of Φ may be computed.

A review of many techniques for smooth density estimation in the Bayesian framework is provided by Hjort (1996).

8.4.3 A particular case.

Let us give a simple example of such computations.

Let Q be the inverse gamma distribution with parameters (θ0, ν0) ∈ IR^2_+, the density of which is given by

q(θ) = [θ0^{ν0} / Γ(ν0)] θ^{−(ν0+1)} e^{−θ0/θ}    (8.21)

and let G0 be the gamma distribution with parameters (τ0, µ0) ∈ IR^2_+, whose density is given by

g0(τ) = [1/Γ(µ0)] τ0^{−µ0} τ^{µ0−1} e^{−τ/τ0}    (8.22)


Thus the distribution of ti conditionally on τi is an inverse-gamma distribution with parameters (θ0τi, ν0), since its density is given by (1/τi) q(t/τi).

Simple computations show that the distribution of τi conditionally on ti, denoted above as G^{ti}_0, is the gamma distribution with parameters [ (1/τ0 + θ0/ti)^{−1}, µ0 + ν0 ]. Furthermore, the distribution of ti, denoted above as Q·G0, is the Fisher distribution with parameters (θ0τ0, µ0, ν0), the density of which is given by

g(t) = [ Γ(µ0 + ν0) / (Γ(µ0) Γ(ν0)) ] (θ0τ0)^{ν0} t^{µ0−1} / (θ0τ0 + t)^{µ0+ν0}    (8.23)

Now, conditionally on τ(j) and Cn, the ti ’s for i ∈ Ij are independently distributed following an inverse gamma distribution with parameters (θ0τ(j), ν0). Therefore, the likelihood is proportional to

∏_{i∈Ij} τ(j)^{ν0} e^{−θ0 τ(j)/ti} = τ(j)^{nj ν0} exp( −θ0 τ(j) ∑_{i∈Ij} ti^{−1} )

Therefore, by Bayes’ theorem, the distribution of τ(j) conditionally on (ti, 1 ≤ i ≤ n) and Cn, denoted before as G^j_n, is the gamma distribution with parameters

[ ( 1/τ0 + θ0 ∑_{i∈Ij} 1/ti )^{−1}, µ0 + nj ν0 ]

Finally, by the same computation as before, Q·G^j_n is the Fisher distribution with parameters

[ ( 1/(θ0τ0) + ∑_{i∈Ij} 1/ti )^{−1}, µ0 + nj ν0, ν0 ]

This implies that the posterior expectation of Γ conditionally on Cn has a density given by a convex combination of gamma densities, and that the posterior expectation of Φ conditionally on Cn has a density given by a convex combination of Fisher densities. Notice that, in such a case, (t1, t2, . . . , tn, Cn) has been reduced, by sufficiency, to yn = ((nj, ∑_{i∈Ij} ti^{−1}), 1 ≤ j ≤ p). This reduction has been made possible because the mixed distribution, i.e. the distribution of (ti | τi), is a member of the exponential family.


8.5 Semiparametric Model with Proportional Hazards.

The last section of this article is devoted to a Bayesian treatment of the semiparametric analysis of duration models conditional on observed explanatory variables. We restrict attention to fixed explanatory variables (i.e. depending only on the individuals, not on time) acting through a proportional hazards model. Moreover, for the sake of exposition, we assume that the data are observed without censoring, and we do not introduce an unobserved heterogeneity component.

The sample is now defined by the sequence (ti, zi), i = 1, · · · , n, where ti is a duration and zi a vector of explanatory variables. The observations are assumed to be independent, and the sampling process is characterized by the distribution of ti conditionally on zi. This conditional probability Φi may be characterized by its survivor function Σi, which satisfies:

Σi(t) = Σ(t)^{a(β,zi)}.    (8.1)

Here Σ is the unknown survivor function of a baseline probability Φ, and a(β, zi) is a known positive function of an unknown vector of parameters β and of the explanatory variables zi. A common choice for a is a(β, zi) = exp(β′zi).

The specification (5.1) is equivalent to:

Λi(t) = a(β, zi) Λ(t)    (8.2)

where Λi(t) is the conditional integrated hazard associated to Σi (i.e. Λi(t) = − ln Σi(t)) and Λ(t) the integrated hazard associated to the baseline survivor function Σ(t). Relation (5.2) justifies the name "proportional hazards model".

The sampling model is then indexed by a functional parameter (the baseline probability Φ, or equivalently Σ or Λ) and by a vector of parameters β. The Bayesian specification is completed by the choice of a prior distribution. Given β, Φ is endowed with a Dirichlet process with parameters (n0, Φ0) that are possibly dependent on β. The vector β has a prior density m(β).

Another model has been used by Kalbfleisch (1978), specifying a Gammaprocess on Λ, i.e. a Levy process with gamma distributed increments.

This section is essentially devoted to the computation of the posterior distribution of β. This means that the functional parameter is treated as a nuisance parameter and integrated out analytically. However, the posterior


distribution of β must be treated numerically (in order to compute, for example, its moments or its marginal densities). There is no suitable choice of m(β) that would simplify the posterior computations, and this density will be left unspecified.

In order to simplify the computations, we assume that the observed sample contains no ties and, therefore, has n distinct durations. This assumption may easily be checked, and the following derivations may be extended to samples with ties (see Ruggiero (1989)). Under the independence property, one may assume without loss of generality that t1 < t2 < · · · < tn.

Let us start with the joint sampling survivor function

∏_{1≤i≤n} Σi(ti) = ∏_{1≤i≤n} Σ(ti)^{a(β,zi)} = exp( − ∑_{1≤i≤n} a(β, zi) Λ(ti) ) = exp( − ∑_{1≤i≤n} a(β, zi) ∑_{1≤j≤i} γj )    (8.3)

where γj = Λ(tj) − Λ(tj−1) for j > 1 and γ1 = Λ(t1); therefore:

∏_{1≤i≤n} Σi(ti) = ∏_{1≤i≤n} exp(−γi Ai(β))    (8.4)

where Ai(β) = ∑_{i≤j≤n} a(β, zj). Note that the joint sampling survivor function depends on the functional parameter Λ through the sequence of the γi only. As a consequence, integrating out the functional parameter Λ may be performed by integrating the γi ’s out of (5.4). Furthermore, the Dirichlet prior on Φ implies that the γi are a priori independent with log-beta prior density

m(γi) = [ Γ(n0Σ0(ti−1)) / ( Γ(n0Σ0(ti)) Γ(n0Σ0(ti−1) − n0Σ0(ti)) ) ] (e^{−γi})^{n0Σ0(ti)} (1 − e^{−γi})^{n0(Σ0(ti−1)−Σ0(ti))−1}    (8.5)
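The coefficients Ai(β) = ∑_{j≥i} a(β, zj) appearing above can be computed with a reversed cumulative sum; a sketch using the common loglinear choice a(β, z) = exp(β′z) (function name and toy data are ours):

```python
import math

def A_coeffs(beta, z):
    # a_i = exp(beta' z_i); A_i = sum_{j >= i} a_j over the ordered sample
    a = [math.exp(sum(b * zk for b, zk in zip(beta, zi))) for zi in z]
    A, acc = [0.0] * len(a), 0.0
    for i in range(len(a) - 1, -1, -1):
        acc += a[i]
        A[i] = acc
    return A

A = A_coeffs(beta=[0.0], z=[[1.0], [2.0], [3.0]])   # beta = 0: a_i = 1, A_i = n - i
A2 = A_coeffs(beta=[1.0], z=[[0.0], [0.0]])
```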

After integration the joint survivor function of the ti’s is therefore equal to


S(t1, · · · , tn | β) = ∏_{1≤i≤n} ∫_0^∞ m(γi) exp(−γi Ai(β)) dγi

= [ Γ(n0) / Γ(n0Σ0(tn)) ] ∏_{1≤i≤n} Γ(n0Σ0(ti) + Ai(β)) / Γ(n0Σ0(ti−1) + Ai(β))

= [ Γ(n0) / Γ(n0 + A1(β)) ] ∏_{1≤i≤n} Γ(n0Σ0(ti) + Ai(β)) / Γ(n0Σ0(ti) + Ai+1(β))    (8.6)

with the conventions t0 = 0, Σ0(t0) = 1 and An+1(β) = 0.

The corresponding density is given by

ℓ(t1, · · · , tn | β) = (−1)^n [ ∂^n / ∂t1 · · · ∂tn ] S(t1, · · · , tn | β)

= (−1)^n [ Γ(n0) / Γ(n0 + A1) ] ∏_{1≤i≤n} n0 Σ′0i [ Γ′(n0Σ0i + Ai) Γ(n0Σ0i + Ai+1) − Γ′(n0Σ0i + Ai+1) Γ(n0Σ0i + Ai) ] [ Γ(n0Σ0i + Ai+1) ]^{−2}    (8.7)

where Γ′ is the derivative of the gamma function and Σ0i = Σ0(ti). The posterior density of β follows from Bayes’ rule:

m(β | t1, · · · , tn) ∝ m(β) ℓ(t1, · · · , tn | β).    (8.8)

We have computed the joint marginalized survivor function on the (open) subset of (IR+)^n in which all the durations are distinct, and this function is differentiable on this subset. However, the marginalized survivor function is not differentiable on the subsets defined by configurations of ties, and the application of Bayes’ theorem becomes more complicated if the sample has ties (see Ruggiero (1989) for a general description of this computation). A complete analysis of the marginalized density in the case of n distinct values (computation of the score and second derivative) can be found in Hakizamungu (1992).


The marginal posterior probability of Φ given β cannot be computed using the technique of section 2 because, in the proportional hazards model, there is in general no transformation of the ti whose distribution is the baseline probability (see Kalbfleisch (1978) and Hakizamungu (1992)).

A last natural question concerns the relation between the Bayesian marginalized likelihood (8.7) and the Cox marginalized likelihood. Note first that the word "marginalized" has a different meaning in the two approaches: in the Bayesian analysis the marginalization is obtained through an integration of the nuisance parameter Φ using a prior probability, whereas in the Cox analysis the marginalization is realized on the rank statistic in the sampling distribution. Nevertheless, some connection between the two results might be expected in the case of a "non-informative" prior measure on Φ (or on Γ).

A natural way is to compute the posterior distribution of β with a uniform prior measure on β, i.e. m(β) = 1 in (8.8), and to take its limit as n_0 → 0. The result becomes:

$$
m(\beta \mid t_1, \cdots, t_n) \;\propto\; \frac{1}{\Gamma(A_1)} \prod_{i=1}^{n} \frac{\Gamma'(A_i)\, \Gamma(A_{i+1}) - \Gamma'(A_{i+1})\, \Gamma(A_i)}{[\Gamma(A_{i+1})]^{2}} \tag{8.9}
$$

Note that in (8.9) the individual durations t_i disappear and the rank statistic becomes sufficient, but this posterior density of β remains rather different from the Cox marginalized likelihood.


Chapter 9

Tools

9.1 Mathematical Analysis

9.1.1 One-sided Limit and Continuity

Let f : IR → IR

The function f is right-continuous at the point x if f(x) and its right limit f(x+) = lim_{a↓x} f(a) exist and f(x) = f(x+); similarly, the function f is left-continuous at the point x if f(x) and its left limit f(x−) = lim_{a↑x} f(a) exist and f(x) = f(x−).

The function f is CADLAG (or RCLL: right-continuous with left limits) at the point x if f(x), f(x−) and f(x+) exist and f(x) = f(x+); similarly, the function f is CAGLAD (or LCRL: left-continuous with right limits) at the point x if f(x), f(x−) and f(x+) exist and f(x) = f(x−). The function f is continuous at the point x if it is both CADLAG and CAGLAD, i.e. f(x), f(x−) and f(x+) exist and f(x) = f(x+) = f(x−).

Lemma
If f is monotone (increasing or decreasing), its number of points of discontinuity is at most countable and f has both left and right limits everywhere.

Exercise
With the help of Figure 9.1, check that the following functions of t, for a fixed value of a, 1I_{[a,+∞)}(t) = 1I_{t ≥ a} and 1I_{(−∞,a)}(t) = 1I_{t < a}, are CADLAG, whereas 1I_{(a,+∞)}(t) = 1I_{t > a} and 1I_{(−∞,a]}(t) = 1I_{t ≤ a} are CAGLAD.

[Figure 9.1: Characteristic Functions of Intervals — graphs of 1I_{t ≥ a}, 1I_{t < a} and 1I_{t > a} as functions of t]

Conclude that the distribution function F_X(x) = P(X ≤ x) and the survivor function F̄_X(x) = P(X > x) are CADLAG, whereas S_X(x) = P(X ≥ x) is CAGLAD.
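These one-sided continuity properties can be checked numerically; the sketch below (plain Python, with the jump point a = 1 chosen arbitrarily for illustration) evaluates the indicators of the exercise just before, at, and just after the jump.

```python
a = 1.0  # arbitrary jump point

ind_ge = lambda t: 1.0 if t >= a else 0.0  # 1I_{t >= a}: CADLAG
ind_gt = lambda t: 1.0 if t > a else 0.0   # 1I_{t > a}:  CAGLAD

eps = 1e-9
# CADLAG: the value at a agrees with the right limit, not the left limit.
assert ind_ge(a) == ind_ge(a + eps) == 1.0 and ind_ge(a - eps) == 0.0
# CAGLAD: the value at a agrees with the left limit, not the right limit.
assert ind_gt(a) == ind_gt(a - eps) == 0.0 and ind_gt(a + eps) == 1.0
```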

9.1.2 Directional derivatives

Let

f : IRn → IR


D_i f(x) = ∂f(x)/∂x_i

$$
D[f(x)] = \begin{pmatrix} D_1 f(x) \\ \vdots \\ D_i f(x) \\ \vdots \\ D_n f(x) \end{pmatrix} = Df(x)
$$

Let u ∈ IRn

Definition: directional derivative of f at the point x, in the direction of u

$$
D_u f(x) = \lim_{\Delta \to 0} \frac{f(x + \Delta\, u) - f(x)}{\Delta}
$$

Notice: if u is the i-th column of the identity matrix I(n), i.e. u = e_i, then D_{e_i} f(x) = D_i f(x).

Theorem 9.1.1

$$
D_u f(x) = u'\, D[f(x)] = \sum_{1 \le i \le n} u_i\, D_i f(x)
$$
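Theorem 9.1.1 can be illustrated with a finite-difference computation; the test function f(x_1, x_2) = x_1² + sin(x_2) and the direction u below are arbitrary choices for the sketch.

```python
import math

def f(x1, x2):
    return x1 ** 2 + math.sin(x2)   # arbitrary smooth test function

x = (1.0, 0.5)
u = (0.6, 0.8)                      # arbitrary direction
eps = 1e-6

# Forward-difference approximation of D_u f(x).
dir_deriv = (f(x[0] + eps * u[0], x[1] + eps * u[1]) - f(*x)) / eps

# Analytic gradient D[f(x)] = (2 x1, cos(x2))'.
grad = (2 * x[0], math.cos(x[1]))

# Theorem 9.1.1: D_u f(x) = u' D[f(x)].
assert abs(dir_deriv - (u[0] * grad[0] + u[1] * grad[1])) < 1e-4
```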

9.1.3 Integration and differentiation

Theorem 9.1.2 (Basic theorem of calculus)
Let f : IR → IR be continuous and let F : IR → IR be defined as follows:

$$
F(x) = a \int_b^x f(u)\, du + (1 - a) \int_x^c f(u)\, du
$$

then:

$$
F'(x) = a\, f(x) - (1 - a)\, f(x) \tag{9.1}
$$

Similarly:

$$
D_1 D_2 \left[ \int_0^{x_1} \int_0^{x_2} f(u, v)\, du\, dv \right] = f(x_1, x_2),
$$

provided f(u, v) is continuous (and integrable).


9.1.4 Gamma function and associated integrals

$$
\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du, \qquad x > 0
$$

In particular:

Γ(x) = (x − 1) Γ(x − 1), x > 1
Γ(n) = (n − 1)!, n ∈ IN_+
Γ(2) = Γ(1) = 0! = 1
Γ(1/2) = √π

$$
\int_0^\infty x^{n-1} \exp\{-a\, x^p\}\, dx = |p|^{-1}\, a^{-n/p}\, \Gamma\!\left(\frac{n}{p}\right), \qquad p \ne 0,\; a > 0,\; \frac{n}{p} > 0
$$
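These identities are easy to verify numerically; the sketch below uses Python's `math.gamma` together with a crude midpoint rule for the last integral (the values n = 3, a = 2, p = 1.5 are arbitrary test points).

```python
import math

# Gamma identities from the text.
assert math.isclose(math.gamma(5), math.factorial(4))        # Gamma(n) = (n-1)!
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))     # Gamma(1/2) = sqrt(pi)
assert math.isclose(math.gamma(3.5), 2.5 * math.gamma(2.5))  # recurrence

# Midpoint-rule check of  int_0^inf x^{n-1} exp(-a x^p) dx = p^{-1} a^{-n/p} Gamma(n/p).
def lhs(n, a, p, upper=50.0, steps=200_000):
    h = upper / steps
    return h * sum(((k + 0.5) * h) ** (n - 1) * math.exp(-a * ((k + 0.5) * h) ** p)
                   for k in range(steps))

n, a, p = 3.0, 2.0, 1.5  # arbitrary test point with p > 0
rhs = (1 / p) * a ** (-n / p) * math.gamma(n / p)
assert math.isclose(lhs(n, a, p), rhs, rel_tol=1e-4)
```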

9.2 Probability Theory

9.2.1 Some basic theorems

Theorem 9.2.1 (The integral transform theorem)
Let:

X ∼ F_X, with F_X(x) = P(X ≤ x) continuous and strictly increasing from 0 to 1,
Y = F_X(X) (∈ [0, 1]),
Z = S_X(X) (∈ [0, 1]),
V = − ln S_X(X) (∈ IR_+),
W = ln[− ln S_X(X)] (∈ IR),

then:

Y ∼ U(0, 1), F_Y(a) = a,
Z ∼ U(0, 1), F_Z(a) = a,
V ∼ Exp(1), F_V(a) = 1 − e^{−a},
W ∼ EExp(1), F_W(a) = 1 − e^{−e^a},

where EExp(1) stands for the standard double exponential distribution, a skew distribution with E[W] = −0.5772 (minus Euler's constant) and Me(W) = ln[− ln(0.5)] ≈ −0.3665. More generally, the double exponential distribution with parameter α, or type I extreme value distribution, EExp(α), is the distribution of a random variable, say U, representable as the logarithm of a variable distributed as Exp(α); thus:

f_U(a) = α e^{a − α e^a},
F_U(a) = P(U ≤ a) = 1 − e^{−α e^a}.

Furthermore, E[V] = Var[V] = 1.
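A small simulation makes the integral transform theorem concrete; the sketch below takes X ∼ Exp(2) (an arbitrary choice) and checks the moments of Y = F_X(X) and V = −ln S_X(X).

```python
import math
import random

random.seed(0)

lam = 2.0   # arbitrary: X ~ Exp(2), F(x) = 1 - exp(-2x), S(x) = exp(-2x)
xs = [random.expovariate(lam) for _ in range(100_000)]
ys = [1 - math.exp(-lam * x) for x in xs]   # Y = F_X(X)
vs = [lam * x for x in xs]                  # V = -ln S_X(X) = lam * X

mean_y = sum(ys) / len(ys)
mean_v = sum(vs) / len(vs)
var_v = sum((v - mean_v) ** 2 for v in vs) / len(vs)

assert abs(mean_y - 0.5) < 0.01   # Y ~ U(0,1): E[Y] = 1/2
assert abs(mean_v - 1.0) < 0.02   # V ~ Exp(1): E[V] = 1
assert abs(var_v - 1.0) < 0.05    # V ~ Exp(1): Var[V] = 1
```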

Theorem 9.2.2 (Glivenko–Cantelli theorem)
Let:

X_i ∼ i.i.d.(F), 1 ≤ i ≤ n,
F_n(x) = (1/n) Σ_{1≤i≤n} 1I_{X_i ≤ x}, x ∈ IR,

then:

P[ sup_{x∈IR} |F_n(x) − F(x)| → 0 ] = 1,

i.e. under i.i.d. sampling the empirical distribution function F_n converges uniformly a.s. to F.
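The uniform convergence asserted by the Glivenko–Cantelli theorem can be watched at work in a simulation; the Exp(1) distribution and the sample sizes below are arbitrary choices for the sketch.

```python
import math
import random

random.seed(1)

def sup_dist(n):
    # Kolmogorov distance sup_x |F_n(x) - F(x)| for n Exp(1) draws;
    # for sorted data the sup is attained just before or after a jump of F_n.
    xs = sorted(random.expovariate(1.0) for _ in range(n))
    return max(max(abs((i + 1) / n - (1 - math.exp(-x))),
                   abs(i / n - (1 - math.exp(-x))))
               for i, x in enumerate(xs))

d_small, d_large = sup_dist(100), sup_dist(100_000)
assert d_large < d_small   # the sup distance shrinks as n grows
```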

9.2.2 Stieltjes Integral

When making the components of a mathematical expectation explicit, we shall often use a formulation of the following type:

$$
E[h(Z)] = \int h\, dF_Z = \int h(z)\, dF_Z(z) \tag{9.2}
$$


where F_Z is the distribution function of Z: F_Z(z) = P(Z ≤ z). This is an example of a Stieltjes integral, obtained heuristically by replacing, in the Riemann integral, the increments dz by the increments dF_Z(z) of the distribution function. Without entering into technical considerations, one may consider (9.2) as a short-hand notation encompassing, among others, the absolutely continuous case, dF_Z(z) = f(z) dz, for which we have:

$$
E[h(Z)] = \int h(z)\, f(z)\, dz
$$

and the discrete case, F_Z(z) = Σ_i f_i 1I_{a_i ≤ z}, for which we have:

$$
E[h(Z)] = \sum_i f_i\, h(a_i)
$$

This concept is readily extended to the case where F is replaced by a non-negative, non-decreasing and CADLAG function G, possibly decomposable as follows:

G = G_d + G_ac + G_s, where

G_d(x) = Σ_i g_i 1I_{a_i ≤ x}, g_i > 0, a_i < a_{i+1} (discrete component),
dG_ac(x) = g(x) dx, g(x) ≥ 0 (absolutely continuous component),
G_s : continuous, nowhere differentiable (singular component).

In the sequel, we always assume that the singular component vanishes. The Stieltjes integral of h with respect to G over a set E is then defined as:

$$
\int_E h(x)\, dG(x) = \int_E h(x)\, g(x)\, dx + \sum_i h(a_i)\, g_i\, 1{\rm I}_{a_i \in E}
$$

Note that this integral vanishes when G is constant over E. Integrability of h with respect to G is obtained when h is continuous, or when h is monotone and G is continuous, but it fails when h and G have common points of right discontinuity. If we now define:

$$
H(x) = \int_a^x h(u)\, dG(u), \tag{9.3}
$$

then H is continuous at every point of continuity of G, and H is differentiable at every point where h is continuous and G is differentiable with derivative g(x), in which case we have (an extended version of the basic theorem of calculus):

$$
D[H(x)] = h(x)\, g(x) \tag{9.4}
$$


9.2.3 Expectation of non-negative random variables

Lemma
Let X be such that P(X ≥ 0) = 1 and let F̄(x) = P(X > x) = 1 − F(x). Then:

$$
E[X] = \int_0^\infty \bar F(x)\, dx
$$

Proof:
(i) Absolutely continuous case: suppose there exists f such that dF̄(x)/dx = −f(x), i.e.

$$
\bar F(x) = 1 - \int_0^x f(u)\, du = \int_x^\infty f(u)\, du.
$$

Then:

$$
E[X] = \int_0^\infty x\, f(x)\, dx = -\int_0^\infty x\, d\bar F(x)
     = \left[ -x\, \bar F(x) \right]_0^\infty + \int_0^\infty \bar F(x)\, dx,
$$

where the bracketed term is 0 when lim_{x→+∞} x F̄(x) = 0, whereas E[X] = ∞ when lim_{x→+∞} x F̄(x) = ∞.

(ii) Discrete case: F̄(x) = Σ_j f_j 1I_{a_j > x} with a_0 = 0. Then (see Figure 9.2):

$$
\begin{aligned}
\int_0^\infty \bar F(x)\, dx
&= \sum_{1 \le j < \infty} (a_j - a_{j-1})\, \bar F(a_{j-1}) \\
&= \sum_{1 \le j < \infty} a_j\, \bar F(a_{j-1}) - \sum_{1 \le j < \infty} a_j\, \bar F(a_j) \qquad \text{(because } a_0 = 0\text{)} \\
&= \sum_{1 \le j < \infty} a_j\, [\bar F(a_{j-1}) - \bar F(a_j)] \\
&= \sum_{1 \le j < \infty} a_j\, f_j = E[X]
\end{aligned}
$$


[Figure 9.2: E[X], discrete case — two step-function plots of F̄(t), illustrating Σ_{1≤j} (a_j − a_{j−1}) F̄(a_{j−1}) and Σ_{1≤j} a_j [F̄(a_{j−1}) − F̄(a_j)]]

Remark. A shorter proof makes use of the general properties of the Stieltjes integral:

$$
E[X] = \int_0^\infty x\, dF(x)
     = \int_0^\infty \left[ \int_0^x du \right] dF(x)
     = \int_0^\infty \left[ \int_x^\infty dF(u) \right] dx
     = \int_0^\infty \bar F(x)\, dx
$$

Corollary
E(X) < ∞ ⇒ lim_{x→∞} x F̄(x) = 0.

Remark. That the implication of this corollary may not be reversed can be seen from the following example:

Page 217: UCLouvain · 2012. 9. 13. · Contents Preface 1 Notation 3 1 Introduction 4 1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 An Overview of Topics

CHAPTER 9. TOOLS 208

F̄(x) = e (x ln x)^{−1} for x ≥ e, and F̄(x) = 1 for 0 ≤ x ≤ e,

where it may be checked that E[X] = ∞ while it is also true that lim_{x→∞} x F̄(x) = 0.

Example 1

a_j    f_j    F̄(a_j)
0      0      1.00
1      .50    .50
2      .25    .25
3      .25    .00
Total  1.00

E[X] = 1 · (1/2) + 2 · (1/4) + 3 · (1/4) = .5 + .5 + .75 = 1.75

∫_0^∞ F̄(x) dx = [1 − 0] · 1 + [2 − 1] · (1/2) + [3 − 2] · (1/4)
             = 1 · (1 − 1/2) + 2 · (1/2 − 1/4) + 3 · (1/4) = 1.75

Example 2

a_j    f_j    F̄(a_j)
0      .50    .50
1      .25    .25
2      .25    .00
Total  1.00

Remark: F̄(0) = 1/2, F̄(0−) = 1.

E[X] = 0 · (1/2) + 1 · (1/4) + 2 · (1/4) = .25 + .5 = .75

∫_0^∞ F̄(x) dx = 1 · (1/2) + (2 − 1) · (1/4) = .5 + .25 = .75
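The lemma and Example 2 can be replayed numerically; the sketch below compares the direct computation of E[X] with a Riemann-sum approximation of the survivor-function integral.

```python
# Example 2: P(X=0) = 1/2, P(X=1) = 1/4, P(X=2) = 1/4.
atoms = [(0, 0.5), (1, 0.25), (2, 0.25)]

mean_direct = sum(a * p for a, p in atoms)   # E[X] computed directly

def surv(x):
    # Survivor function F-bar(x) = P(X > x).
    return sum(p for a, p in atoms if a > x)

# Riemann sum of int_0^2 F-bar(x) dx (F-bar vanishes beyond x = 2).
h = 1e-4
mean_integral = h * sum(surv(k * h) for k in range(int(2 / h)))

assert abs(mean_direct - 0.75) < 1e-12
assert abs(mean_integral - mean_direct) < 1e-3
```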


[Figure 9.3: Example 1 — F̄(t) as a step function with jumps 1/2, 1/4, 1/4 at t = 1, 2, 3]

[Figure 9.4: Example 2 — F̄(t) as a step function with jumps 1/2, 1/4, 1/4 at t = 0, 1, 2]


9.2.4 Conditional expectation

Let us consider a random vector X = (Y, Z) and let A be a fixed event in the range space of X. The conditional probability of A given Z is denoted P(A | Z) and is defined as the (a.s. unique) function of Z such that, for any event B in the range space of Z, we have:

$$
P(A \cap B) = \int_{z \in B} P(A \mid Z = z)\, dF_Z(z) = E[\,P(A \mid Z)\, 1{\rm I}_{Z \in B}\,] \tag{9.5}
$$

Similarly, let k be a function defined on the range space of X. The conditional expectation of k(Y, Z) given Z is denoted E[k(Y, Z) | Z] and is defined as the (a.s. unique) function of Z such that, for any function h defined on the range space of Z, we have:

$$
E[k(Y, Z)\, h(Z)] = \int E[k(Y, Z) \mid Z = z]\, h(z)\, dF_Z(z) = E\big[\,E[k(Y, Z) \mid Z]\, h(Z)\,\big] \tag{9.6}
$$

Particular case
Consider now A = {0 ≤ Y ≤ y} and B = {0 ≤ Z ≤ z}; suppose that P[Y ≤ y, Z ≤ z] = F_{Y,Z}(y, z) is differentiable in z, i.e. D_2 F_{Y,Z}(y, z) is well defined, and that F_Z is absolutely continuous, i.e. dF_Z(z) = f_Z(z) dz. Then, from (9.4) and (9.5), we get:

$$
P[Y \le y \mid Z = z] = \frac{D_2 F_{Y,Z}(y, z)}{f_Z(z)} \tag{9.7}
$$

Note that if f_Z(z) = 0, E[k(Y, Z) | Z = z] is not defined.
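Formula (9.7) can be checked on a tractable example. In the sketch below, Z ∼ U(0, 1) and Y | Z = z ∼ Exp(z + 1) are arbitrary choices (so f_Z = 1 and the conditional cdf is 1 − e^{−(z+1)y}); F_{Y,Z} is evaluated by a midpoint rule and D_2 by a central difference.

```python
import math

def F(y, z, steps=20_000):
    # F_{Y,Z}(y, z) = int_0^z (1 - exp(-(u+1) y)) du, midpoint rule.
    h = z / steps
    return h * sum(1 - math.exp(-(((k + 0.5) * h) + 1) * y) for k in range(steps))

y, z, eps = 0.7, 0.4, 1e-5  # arbitrary evaluation point
d2F = (F(y, z + eps) - F(y, z - eps)) / (2 * eps)   # D_2 F_{Y,Z}(y, z)

# With f_Z = 1, (9.7) should recover the conditional cdf 1 - exp(-(z+1) y).
assert abs(d2F - (1 - math.exp(-(z + 1) * y))) < 1e-5
```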


9.3 Stochastic Processes

9.3.1 General Stochastic Processes

9.3.2 Counting Processes

9.4 Statistics: General issues

9.4.1 On Conditional Models and Exogeneity

9.4.2 Statistical Inference

Notations

In this section we consider, under suitable regularity conditions, a statistical model written in terms of densities:

[R_X, X, p(x | θ) : θ ∈ Θ], Θ ⊂ IR^k,

along with its log-likelihood, its score, its statistical information and its Fisher information matrix:

$$
\begin{aligned}
l_\theta(X) &= L(\theta) = \ln p(X \mid \theta) & (9.8) \\
s_\theta(X) &= S(\theta) = \frac{d}{d\theta} L(\theta) = \left[ \frac{\partial}{\partial \theta_i} \ln p(X \mid \theta) \right] & (9.9) \\
j_\theta(X) &= J(\theta) = -\frac{d^2}{d\theta\, d\theta'}\, l_\theta(X) = -\left[ \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \ln p(X \mid \theta) \right] & (9.10) \\
I_X(\theta) &= V(s_\theta(X) \mid \theta) = E(j_\theta(X) \mid \theta) & (9.11) \\
&= -E\left[ \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \ln p(X \mid \theta) \,\Big|\, \theta \right] & (9.12)
\end{aligned}
$$

Under i.i.d. sampling we have, introducing a subscript n to denote the sample size:

$$
\begin{aligned}
L_n(\theta) &= \sum_{1 \le k \le n} l_\theta(X_k) & (9.13) \\
S_n(\theta) &= \sum_{1 \le k \le n} s_\theta(X_k) & (9.14) \\
J_n(\theta) &= \sum_{1 \le k \le n} j_\theta(X_k) & (9.15) \\
I_n(\theta) &= V(S_n(\theta)) = n\, I_{X_1}(\theta) & (9.16)
\end{aligned}
$$


Convergence in distribution

Theorem 9.4.1 (The δ-method)
Let T_n = f_n(X_1, ..., X_n) be a statistic. If

• a_n(T_n − b_n) →_d Y ∈ IR^p,
• a_n ↑ +∞, b_n → b,
• g : IR^p → IR^q is continuously differentiable,
• ∇g(y) = [∂g_i(y)/∂y_j] = dg/dy′ : p × q,

then:

$$
a_n\,[g(T_n) - g(b_n)] \;\xrightarrow{d}\; \nabla g(b)'\, Y
$$

In particular, if √n (T_n − θ) →_d N(0, Σ), then:

$$
\sqrt{n}\,[g(T_n) - g(\theta)] \;\xrightarrow{d}\; N\big(0,\; \nabla g(\theta)'\, \Sigma\, \nabla g(\theta)\big)
$$
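The normal case of the δ-method can be illustrated by simulation; in the sketch below T_n is the mean of n Exp(1) draws, g(t) = ln t, so the limiting variance is g′(1)² · 1 = 1 (the sample sizes are arbitrary choices).

```python
import math
import random

random.seed(3)

n, reps = 400, 4000
vals = []
for _ in range(reps):
    t = sum(random.expovariate(1.0) for _ in range(n)) / n  # T_n
    vals.append(math.sqrt(n) * math.log(t))                 # sqrt(n)[g(T_n) - g(1)], g = ln

m = sum(vals) / reps
var = sum((v - m) ** 2 for v in vals) / reps

# Limiting variance grad g(1)' * Sigma * grad g(1) = 1^2 * 1 = 1.
assert abs(var - 1.0) < 0.12
```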

Maximum Likelihood Estimation

The maximum likelihood estimator (m.l.e.) of θ is defined as:

$$
\hat\theta = \arg\sup_{\theta \in \Theta} L(\theta) = \arg\sup_{\theta \in \Theta} p(X \mid \theta) \tag{9.17}
$$

Two simple iterative methods to evaluate an m.l.e., starting from an initial estimator θ̂_{n,0}, are:

• Newton–Raphson:

$$
\hat\theta_{n,k+1} = \hat\theta_{n,k} + [J_n(\hat\theta_{n,k})]^{-1}\, S_n(\hat\theta_{n,k})
$$

• Scoring:

$$
\hat\theta_{n,k+1} = \hat\theta_{n,k} + [I_n(\hat\theta_{n,k})]^{-1}\, S_n(\hat\theta_{n,k})
$$
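Both iterations can be sketched on a toy model. For an Exp(θ) sample, J_n(θ) = n/θ² does not depend on the data and equals I_n(θ), so Newton–Raphson and scoring coincide, and the m.l.e. has the closed form n/Σx_i against which the iteration can be checked (the true rate 3 and starting value below are arbitrary).

```python
import random

random.seed(2)
xs = [random.expovariate(3.0) for _ in range(5000)]
n, s = len(xs), sum(xs)

def score(t):
    # S_n(theta) = n/theta - sum_i x_i for the Exp(theta) log-likelihood.
    return n / t - s

def info(t):
    # J_n(theta) = n/theta^2; here observed and Fisher information coincide.
    return n / t ** 2

theta = 1.0  # initial estimator theta_{n,0}
for _ in range(25):  # Newton-Raphson (= scoring in this model)
    theta += score(theta) / info(theta)

assert abs(theta - n / s) < 1e-8  # converges to the closed-form m.l.e. n / sum x_i
```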


Asymptotic tests

Consider a hypothesis test,

H_0 : θ ∈ Θ_0 ⊂ Θ against H_1 : θ ∉ Θ_0,

where Θ_0 is specified either in the form

H_0 : g(θ) = 0, where g : Θ → IR^r (r ≤ k),

or in the form

H_0 : θ = h(α), where h : IR^p → Θ (p ≤ k),

i.e. Θ_0 = g^{−1}(0) = Im(h).

Let θ̂_0 be the m.l.e. of θ under H_0 and θ̂ the unconstrained m.l.e. of θ:

$$
\hat\theta_0 = \arg\sup_{\theta \in \Theta_0} p(X \mid \theta) \tag{9.18}
$$

$$
\hat\theta = \arg\sup_{\theta \in \Theta} p(X \mid \theta) \tag{9.19}
$$

Note that θ̂_0 = h(α̂_0), where:

$$
\hat\alpha_0 = \arg\sup_{\alpha \in {\rm IR}^p} p(X \mid h(\alpha)) \tag{9.20}
$$

Three standard ways of building a test statistic are the following.

Likelihood Ratio Test:

$$
L = -2 \ln \frac{p(X \mid \hat\theta_0)}{p(X \mid \hat\theta)} \tag{9.21}
$$

Wald Test:

$$
W = (\hat\theta_0 - \hat\theta)'\, I_X(\hat\theta)\, (\hat\theta_0 - \hat\theta) \tag{9.22}
$$

Rao or Lagrange Multiplier Test:

$$
R = S(\hat\theta_0)'\, [I_X(\hat\theta_0)]^{-1}\, S(\hat\theta_0) \tag{9.23}
$$


For these three statistics, the critical (or rejection) region corresponds to large values of the statistic. The level of these tests is therefore obtained from the survivor function (i.e. 1 − the distribution function) under the null hypothesis and, under suitable regularity conditions, their asymptotic distribution under the null hypothesis is χ² with l − l_0 degrees of freedom, where l (resp. l_0) is the (vector space) dimension of Θ (resp. Θ_0). (Note: the vector space dimension of Θ is the dimension of the smallest affine space containing Θ.)
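The χ² approximation can be illustrated for the likelihood ratio statistic in an Exp(θ) model with H_0 : θ = 1; under H_0, L should be approximately χ²₁, whose mean is 1 (the sample sizes below are arbitrary choices).

```python
import math
import random

random.seed(4)

def lr_stat(n):
    # L = -2 [ l(theta0) - l(theta_hat) ] for an Exp(theta) sample, H0: theta0 = 1.
    xs = [random.expovariate(1.0) for _ in range(n)]
    s = sum(xs)
    th = n / s                      # unconstrained m.l.e.
    l0 = -s                         # log-lik at theta0 = 1: n ln 1 - s
    l1 = n * math.log(th) - th * s  # log-lik at theta_hat
    return -2 * (l0 - l1)

stats = [lr_stat(200) for _ in range(2000)]
mean = sum(stats) / len(stats)
assert abs(mean - 1.0) < 0.15   # E[chi^2 with 1 d.f.] = 1
```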


References

Andersen P.K., O. Borgan, R.D. Gill and N. Keiding (1993), Statistical Models Based on Counting Processes, New York: Springer.

Andersen P.K. and N. Keiding (1995), Survival Analysis, Research Report, Department of Biostatistics, University of Copenhagen.

Antoniak C.E. (1974), Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems, Annals of Statistics, 2, 1152-1174.

Bailey, Norman T.J. (1964), The Elements of Stochastic Processes with Applications to the Natural Sciences, New York: Wiley and Sons.

Barra, J.-R. (1981), Mathematical Basis of Statistics, New York: Academic Press.

Basawa, I.V. and B.L.S. Prakasa Rao (1980), Statistical Inference in Stochastic Processes, London: Academic Press.

Blackwell D. and J.B. MacQueen (1973), Ferguson distributions via Polya urn schemes, Annals of Statistics, 1, 353-355.

Blum K. and V. Susarla (1977), On the posterior distribution of a Dirichlet process given randomly right censored data, Stochastic Processes and Applications, 5, 207-211.

Bouleau, Nicolas (1988), Processus Stochastiques et Applications, Paris: Hermann.

Breiman, Leo (1968), Probability, Reading, Mass.: Addison-Wesley.

Bunke O. (1981), Bayesian Estimators in Semiparametric Problems, Preprint 102, Humboldt University, Berlin.


Chen, Y.Q. and N.P. Jewell (2001), On a general class of semiparametric hazards regression models, Biometrika, 88, 687-702.

Chow, Y.S. and H. Teicher (1978), Probability Theory: Independence, Interchangeability, Martingales, New York: Springer Verlag.

Chung, K.L. (1960), Markov Processes with Stationary Transition Probabilities, Heidelberg: Springer Verlag.

Chung, K.L. (1968), A Course in Probability Theory, New York: Harcourt, Brace and World Inc.

Cox, D.R. and H.D. Miller (1965), The Theory of Stochastic Processes, London: Methuen.

Cox, D.R. and D. Oakes (1984), Analysis of Survival Data, London: Chapman and Hall.

Damien P., P.W. Laud and A.F.M. Smith (1995), Approximate random variate generation from infinitely divisible distributions with applications to Bayesian inference, Journal of the Royal Statistical Society, Series B, 57(3), 547-563.

Damien P., P.W. Laud and A.F.M. Smith (1996), Implementation of Bayesian non-parametric inference based on Beta processes, Scandinavian Journal of Statistics, 23, 27-36.

Doksum K.A. (1974), Tailfree and Neutral Random Probabilities and their Posterior Distributions, Annals of Probability, 2, 183-201.

Droesbeke, J.J., Fichet, B. and P. Tassi (editors) (1989), Analyse Statistique des Données de Survie, Paris: Economica.

Doob, J.L. (1953), Stochastic Processes, New York: Wiley and Sons.

Elbers C. and G. Ridder (1982), True and Spurious Dependence: The Identifiability of the Proportional Hazards Model, Review of Economic Studies, 49, 403-409.

Erkanli A., P. Muller and M. West (1993), Curve Fitting using Dirichlet Process Mixtures, Technical Report, ISDS, Duke University.

Escobar M.D. (1994), Estimating Normal Means with a Dirichlet Process Prior, Journal of the American Statistical Association, 89, 268-277.


Escobar M.D. and M. West (1994), Bayesian Prediction and Density Estimation, Mimeo, July 1994.

Escobar M.D. and M. West (1995), Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association, 90, 577-588.

Feller, W. (1970), An Introduction to Probability Theory and Its Applications, Vol. 1, New York: Wiley and Sons.

Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. 2, New York: Wiley and Sons.

Ferguson T.S. (1973), A Bayesian Analysis of Some Nonparametric Problems, Annals of Statistics, 1, 209-230.

Ferguson T.S. (1974), Prior Distributions on Spaces of Probability Measures, Annals of Statistics, 2, 615-629.

Ferguson T.S. and M.J. Klass (1972), A Representation of Independent Processes without Gaussian Components, Annals of Mathematical Statistics, 43, 1634-1643.

Ferguson T.S. and E.G. Phadia (1979), Bayesian Nonparametric Estimation Based on Censored Data, Annals of Statistics, 7, 163-186.

Fleming, T.R. and D.P. Harrington (1991), Counting Processes and Survival Analysis, New York: Wiley.

Florens J.P., D. Fougère, T. Kamionka and M. Mouchart (1994), La modélisation économétrique des transitions individuelles sur le marché du travail, Economie et Prévision, 116, 181-217.

Florens J.P., D. Fougère and M. Mouchart (1995), Duration Models, in Econometrics of Panel Data, Matyas and Sevestre (eds), 2nd ed., Dordrecht: Kluwer Academic Publishers, 491-536.

Florens, J.-P., M. Mouchart and J.-M. Rolin (1990), Elements of Bayesian Statistics, New York: Marcel Dekker.

Florens J.P., M. Mouchart and J.M. Rolin (1992), Bayesian Analysis of Mixtures: Some Results on Exact Estimability and Identification, in Bayesian Statistics 4, Bernardo J.M., J.O. Berger, A.P. Dawid and A.F.M. Smith (eds.), Oxford Science Publications, 127-145.


Florens J.P., M. Mouchart and J.M. Rolin (1999), Semi- and Non-parametric Bayesian Analysis of Duration Models: A Survey, International Statistical Review, 67, 187-210.

Hakizamungu J. (1992), Modèles de Durée: Approche Bayesienne Semiparamétrique, Ph.D. Dissertation, Faculté des Sciences, Université catholique de Louvain, Louvain-la-Neuve, Belgium.

Heckman J.J. (1981), Heterogeneity and State Dependence, in Studies in Labor Markets, S. Rosen (ed.), University of Chicago Press, 91-139.

Heckman J.J. and B. Singer (1982), The Identification Problem in Econometric Models for Duration Data, in Advances in Econometrics, W. Hildenbrand (ed.), Cambridge University Press, 39-77.

Heckman J.J. and B. Singer (1984a), Econometric Duration Analysis, Journal of Econometrics, 24, 63-132.

Heckman J.J. and B. Singer (1984b), The Identifiability of the Proportional Hazards Model, Review of Economic Studies, 51, 231-243.

Hjort N. (1990), Nonparametric Bayes estimators based on beta processes in models for life history data, Annals of Statistics, 18, 1259-1294.

Hjort N. (1996), Bayesian approaches to non- and semiparametric density estimation, in Bayesian Statistics 5, Bernardo J.M., J.O. Berger, A.P. Dawid and A.F.M. Smith (eds.), Oxford University Press, 223-253.

Hougaard, Philip (2002), Analysis of Interval Censored Survival Data, Draft notes prepared for the ISCB pre-conference course, September 9, 2002, Dijon, France.

Kalbfleisch J.D. (1978), Nonparametric Bayesian Analysis of Survival Time Data, Journal of the Royal Statistical Society, Series B, 214-221.

Kalbfleisch, J.D. and R.L. Prentice (1980), The Statistical Analysis of Failure Time Data, New York: Wiley.

Karlin, S. and H.M. Taylor (1975), A First Course in Stochastic Processes, 2nd edition, New York: Academic Press.

Karlin, S. and H.M. Taylor (1981), A Second Course in Stochastic Processes, 2nd edition, New York: Academic Press.


Kemeny, J.G. and J.L. Snell (1960), Finite Markov Chains, New York: Springer Verlag.

Kendall, M.G. and A. Stuart (1979), The Advanced Theory of Statistics, Vol. 2, Inference and Relationships, Fourth Edition, London: Griffin.

Kiefer, N.M. (1988), Economic Duration Data and Hazard Functions, Journal of Economic Literature, XXVI, 646-679.

Kijima, Masaaki (1997), Markov Processes for Stochastic Modelling, London: Chapman and Hall.

Klein, J.P. and M.L. Moeschberger (1997), Survival Analysis, New York: Springer.

Korwar, R.M. and M. Hollander (1973), Contributions to the Theory of Dirichlet Processes, The Annals of Probability, 1, 705-711.

Lahiri P. and P. Dong Ho (1988), Nonparametric Bayes and Empirical Bayes Estimation of the Residual Survival Function at Age t, Communications in Statistics, Theory and Methods, 4085-4098.

Lancaster T. (1990), The Econometric Analysis of Transition Data, Econometric Society Monograph, Cambridge University Press.

Lawless, J.F. (1982), Statistical Models and Methods for Lifetime Data, New York: Wiley.

Lo A.Y. (1987), A large sample study for the Bayesian bootstrap, Annals of Statistics, 15, 360-375.

Lo A.Y. (1993), A Bayesian bootstrap for censored data, Annals of Statistics, 21, 100-123.

Lo A.Y. and C.S. Weng (1989), On a Class of Bayesian Nonparametric Estimates: II. Hazard Rates Estimates, Annals of the Institute of Statistical Mathematics, 227-245.

McEachern S. (1992), Estimating normal means with a conjugate style Dirichlet process prior, Technical Report, Department of Statistics, Ohio State University.

Métivier, M. (1972), Notions Fondamentales de la Théorie des Probabilités, 2nd edition, Paris: Dunod.


Mouchart, M. and J.-M. Rolin (2002), Competing Risks Models: Problems of Modelling and of Identification, in Life Tables, Modelling Survival and Death, edited by G. Wunsch, M. Mouchart and J. Duchêne, Dordrecht: Kluwer Academic Publishers, 245-267.

Parzen, Emanuel (1963), Stochastic Processes, San Francisco: Holden Day Inc.

Phadia E. and V. Susarla (1983), Nonparametric Bayesian Estimation of a Survival with Dependent Censoring Mechanism, Annals of the Institute of Statistical Mathematics, 389-400.

Prakasa Rao, B.L.S. (1992), Identifiability in Stochastic Models: Characterization of Probability Distributions, Boston: Academic Press.

Robert, Christian (1996), Méthodes de Monte Carlo par Chaînes de Markov, Paris: Economica.

Rolin J.M. (1983), Nonparametric Bayesian Statistics: A Stochastic Process Approach, in Specifying Statistical Models, From Parametric to Nonparametric Using Bayesian Approaches, Lecture Notes in Statistics, Berlin: Springer Verlag, 108-133.

Rolin J.M. (1992a), Some Useful Properties of the Dirichlet Process, CORE Discussion Paper 9207, Université catholique de Louvain, Louvain-la-Neuve, Belgium.

Rolin J.M. (1992b), On the Distribution of Jumps of the Dirichlet Process, CORE Discussion Paper 9259, Université catholique de Louvain, Louvain-la-Neuve, Belgium.

Rolin J.M. (1998), Bayesian Survival Analysis, in Encyclopedia of Biostatistics, New York: Wiley and Sons, 1, 271-286.

Ruggiero M. (1989), Analyse semi-paramétrique des modèles de durées: L'apport des méthodes bayésiennes, Thèse de doctorat, Université d'Aix-Marseille II, France.

Ruggiero M. (1994), Bayesian semiparametric estimation of proportional hazard models, Journal of Econometrics, 62, 277-300.

Sethuraman J. (1994), A Constructive Definition of the Dirichlet Prior, Statistica Sinica, 4, 639-650.


Susarla V. and J. Van Ryzin (1976), Nonparametric Bayesian Estimation of Survival Curves from Incomplete Observations, Journal of the American Statistical Association, 71, 897-902.

Susarla, V. and J. Van Ryzin (1978a), Empirical Bayes Estimation of a Distribution (Survival) Function from Right-Censored Observations, Annals of Statistics, 6, 740-754.

Susarla, V. and J. Van Ryzin (1978b), Large sample theory for a Bayesian nonparametric survival curve estimator based on censored samples, Annals of Statistics, 6, 755-768.

Takács, Lajos (1960), Stochastic Processes, Problems and Solutions, London: Methuen.

Taylor, H.M. and S. Karlin (1998), An Introduction to Stochastic Modeling, Third edition, San Diego: Academic Press.

Tsai W.Y. (1986), Estimation of Survival Curves from Dependent Censorship Models via a Generalized Self-Consistent Property with Nonparametric Bayesian Estimation Application, Annals of Statistics, 14, 238-249.

West M. (1990), Bayesian Kernel Density Estimation, Technical Report, Department of Statistics, Ohio State University.

West M. (1992), Hyperparameter Estimation in Dirichlet Process Mixture Models, Technical Report, ISDS, Duke University.

Wunsch, G., M. Mouchart and J. Duchêne (editors) (2002), Life Tables, Modelling Survival and Death, Dordrecht: Kluwer Academic Publishers.

Yamato H. (1984), Properties of samples from distributions chosen from a Dirichlet process, Bulletin of Informatics and Cybernetics, 21, 77-83.