Petra E. Todd, Fall 2016 - University of...
Chapter 1
Discrete choice modeling
Consider the classical regression model:
y = xβ + ε
E(ε|x) = 0
ε ∼ N(0, σ²)
This model is untenable if y is discrete. To model discrete choices, we need to think of the ingredients that give rise to choices.
For example, suppose we want to forecast the demand for a new good and we observe consumption data on old goods, x_1, ..., x_I. Each good might represent a transportation mode or a type of car.
Assume that people choose the good that yields the highest utility. When we have a new good, we need a way of putting it on a common basis with the old goods.
The earliest literature on discrete choice was developed in psychometrics, where researchers were concerned with modeling choice behavior.
There are two predominant modeling approaches:
(1) Luce (1959) model ⟺ McFadden conditional logit model
(2) Thurstone (1927) ⟺ Quandt multivariate probit/normal model
The Luce-McFadden model
• widely used in economics
• easy to compute
• identifiability of parameters well understood
• very restrictive substitution possibilities among goods
The Thurstone-Quandt model
• very general substitution possibilities
• allows for more general forms of heterogeneity
• more difficult to compute
• identifiability less easily established
1.1 Luce Model/McFadden conditional logit model
Some references: Manski and McFadden; Greene; Amemiya
Notation:
X : universe of objects of choice
S : universe of attributes of people
B : feasible choice set
Behavior rule mapping attributes into choices:
h(s, B) = x
We might assume that there is a distribution of choice rules because
• in observing behavior we lose some information governing choices
• there can be random variation in choices due to unmeasured psychological factors
Define P(x|s, B) = Pr(h ∈ H such that h(s, B) = x)
This represents the probability that an individual drawn randomly from the population with attributes s and alternative set B chooses x.
Luce Axioms:
Maintain some restrictions on P(x|s, B) and derive implications for the functional form of P.
Axiom #1: Independence of Irrelevant Alternatives (IIA)
For x, y ∈ B, s ∈ S:

P(x|s, {x, y}) / P(y|s, {x, y}) = P(x|s, B) / P(y|s, B)

where B is a larger choice set containing x and y.
This means that other choices are irrelevant for the relative probability ratio. This is not necessarily a reasonable assumption. For example, suppose the choice is a career decision and the possible choices are economist (E), policeman (P), and fireman (F). IIA implies
P(E|s, {E, F}) / P(F|s, {E, F}) = P(E|s, {E, F, P}) / P(F|s, {E, F, P})
This example is due to Stern. Does this restriction seem reasonable?
The restriction imposed on the substitution patterns is also known as the Red Bus-Blue Bus problem, which was first discussed by Debreu. Suppose the possible choices are modes of transportation: car (c), red bus (RB), and blue bus (BB). IIA implies:
P(c|s, {c, RB}) / P(RB|s, {c, RB}) = P(c|s, {c, RB, BB}) / P(RB|s, {c, RB, BB})
Axiom 2: Eliminate 0 probability choices
Pr(y|s, B) > 0 for all y ∈ B
Let’s analyze the implications of the above axioms:
Define P_xy = P(x|s, {x, y}) and adopt the convention P_xx = 1/2.
By IIA,

P(y|s, B) = (P_yx / P_xy) P(x|s, B)

Σ_{y∈B} P(y|s, B) = 1 ⟹ P(x|s, B) = 1 / [Σ_{y∈B} P_yx / P_xy]
Furthermore, we have from IIA

P(y|s, B) = (P_yz / P_zy) P(z|s, B)

P(x|s, B) = (P_xz / P_zx) P(z|s, B)

P(y|s, B) = (P_yx / P_xy) P(x|s, B)

which implies

P_yx / P_xy = P(y|s, B) / P(x|s, B) = (P_yz / P_zy) / (P_xz / P_zx)
Define V(s, x, z) = ln(P_xz / P_zx) and V(s, y, z) = ln(P_yz / P_zy). Then we can write

P_yx / P_xy = e^{V(s,y,z)} / e^{V(s,x,z)}
Axiom 3: Separability assumption
Assume that V(s, x, z) = v(s, x) − v(s, z). V can be interpreted as a utility indicator of relative tastes.
Then,
P(x|s, B) = 1 / [Σ_{y∈B} P_yx / P_xy] = 1 / [Σ_{y∈B} e^{v(s,y)−v(s,z)} / e^{v(s,x)−v(s,z)}] = e^{v(s,x)} / Σ_{y∈B} e^{v(s,y)}
Thus, you can derive the multivariate logistic specification from the three Luce Axioms.
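The closed form at the end of this derivation is easy to compute. A minimal Python sketch (the utility values v(s, y) below are hypothetical numbers for illustration, not from the text):

```python
import math

def luce_probs(v):
    """Choice probabilities P(x|s,B) = e^{v(s,x)} / sum_{y in B} e^{v(s,y)}.

    Subtracting max(v) before exponentiating avoids overflow and leaves the
    probabilities unchanged, since the logit depends only on utility
    differences.
    """
    m = max(v)
    e = [math.exp(vi - m) for vi in v]
    s = sum(e)
    return [ei / s for ei in e]

# Three goods with hypothetical utility indices v(s, y)
p = luce_probs([1.0, 0.5, -0.2])
```

Note that the ratio p[0]/p[1] is the same whether or not the third good is in the choice set, which is exactly the IIA axiom.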
1.2 Random Utility Models
Now, we link to more familiar models in economics. Marschak (1959) established the link between the Luce model and so-called random utility models (RUMs). These kinds of models were proposed by Thurstone (1927, 1930s).
Assume utility from choosing alternative j is
u(s, x_j) = v(s, x_j) + ε(s, x_j)

where v(s, x_j) is the nonstochastic component and ε(s, x_j) the stochastic component. The second component ε(s, x_j) reflects stochastic, idiosyncratic tastes.
We assume that alternative j is chosen when it yields maximal utility within the set of all alternatives:
Pr(j is maximal in set B) = Pr(u(s, x_j) ≥ u(s, x_l) ∀ l ≠ j)
= Pr(v(s, x_j) + ε(s, x_j) ≥ v(s, x_l) + ε(s, x_l) ∀ l ≠ j)
Specify a cdf F(ε_1, ..., ε_J) and denote v_l = v(s, x_l), ε_l = ε(s, x_l). Then,

Pr(v_j − v_l ≥ ε_l − ε_j ∀ l ≠ j) = Pr(v_j − v_l + ε_j ≥ ε_l ∀ l ≠ j)

= ∫_{−∞}^{∞} F_j(v_j − v_1 + ε_j, ..., v_j − v_{j−1} + ε_j, ε_j, v_j − v_{j+1} + ε_j, ..., v_j − v_J + ε_j) dε_j

where F_j is the derivative of F with respect to the jth argument.
For example, in the bivariate case

Pr(x_1 > x_2) = ∫_{−∞}^{∞} ∫_{−∞}^{x_1} f(x_1, x_2) dx_2 dx_1 = ∫_{−∞}^{∞} F_1(x_1, x_1) dx_1
If the ε's are independent, then

F(ε_1, ..., ε_J) = Π_{i=1}^{J} F_i(ε_i)

so the above expression equals

∫_{−∞}^{∞} [Π_{l=1, l≠j}^{J} F_l(v_j − v_l + ε_j)] f_j(ε_j) dε_j

where f_j = F_j′ is the density of ε_j.
Two Choice (Binary) Example (J=2)
Pr(choose 1|x, B) = P(v_1 + ε_1 > v_2 + ε_2)
= P(ε_2 − ε_1 < v_1 − v_2)
= P(ε_2 < v_1 − v_2 + ε_1)
= ∫_{−∞}^{∞} F_1(ε_1, v_1 − v_2 + ε_1) dε_1
If ε_1, ε_2 are normal, then ε_2 − ε_1 is normal, so Pr(ε_2 − ε_1 < v_1 − v_2) is a normal probability.
Suppose ε_1, ε_2 are distributed type I extreme value (also called double exponential). The cdf is

P(ε < c) = exp(−e^{−c+α})
In that case, it can be shown that ε_2 − ε_1 is logistic. This implies that

Pr(V_1 + ε_1 > V_2 + ε_2) = Pr(ε_2 − ε_1 < V_1 − V_2)
= Λ(V_1 − V_2)
= e^{V_1−V_2} / (1 + e^{V_1−V_2})
= e^{V_1} / (e^{V_1} + e^{V_2}).
• Assuming that the errors follow a Weibull distribution yields the same logit model as derived from the Luce Axioms (result due to Marschak, 1959).

• It turns out that type I extreme value is sufficient but not necessary; some other distributions also generate a logit. But Yellott (1977) showed that if we require "invariance under uniform expansions of the choice set," then only the double exponential gives a logit. Example: coffee, tea, milk. Invariance requires that the choice probabilities stay the same if we double the choice set (e.g., replace each of coffee, tea, and milk with two identical copies).
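The claim that the difference of two type I extreme value draws is logistic can be checked by simulation. A minimal sketch (with α = 0; the cutoff, sample size, and seed are arbitrary illustrative choices):

```python
import math
import random

random.seed(12345)

def extreme_value_draw():
    """Type I extreme value draw via the inverse cdf: F(e) = exp(-exp(-e))."""
    return -math.log(-math.log(random.random()))

# Monte Carlo estimate of Pr(eps2 - eps1 < c), to compare with the logistic cdf
c, R = 0.7, 200_000
hits = sum(1 for _ in range(R) if extreme_value_draw() - extreme_value_draw() < c)
logistic_cdf = 1.0 / (1.0 + math.exp(-c))
```

With this many draws, hits/R should agree with the logistic cdf Λ(c) = 1/(1 + e^{−c}) to within about 0.01.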
1.3 Solving the forecasting problem
Let xj be the set of characteristics associated with choice j.
Usually, we will assume V(s, x_j) = θ(s)′x_j, where we let the parameters vary with s to allow that individuals may differ in how they value characteristics of goods (e.g., larger families might prefer larger cars).
Pr(j|s, B) = e^{θ(s)′x_j} / Σ_{l=1}^{J} e^{θ(s)′x_l}
1.3.1 Model Parameter Estimation
Given a parametric specification for Pr(j|s, B), we can solve for the model parameters by maximum likelihood:
D_ij = 1 if individual i chooses j, = 0 otherwise.
max_{θ(s)} Π_{i=1}^{N} [e^{θ(s)′x_1} / Σ_{l=1}^{J} e^{θ(s)′x_l}]^{D_{i1}} ··· [e^{θ(s)′x_J} / Σ_{l=1}^{J} e^{θ(s)′x_l}]^{D_{iJ}}
Suppose we consider adding a new good J+1 to the choice set. Let x_{J+1} be the vector of characteristics for the new good and B′ = B ∪ {J+1} the augmented choice set.

Given the parameters θ(s) estimated by MLE using data on choices of existing goods, we can construct a forecast of the probability that the individual chooses the new good:
Pr(J+1|s, B′) = e^{θ(s)′x_{J+1}} / Σ_{l=1}^{J+1} e^{θ(s)′x_l}
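Under this specification the forecast is mechanical once θ(s) is in hand. A sketch with hypothetical values for θ(s) and the characteristics vectors:

```python
import math

def choice_probs(theta, X):
    """Logit probabilities Pr(j) = e^{theta'x_j} / sum_l e^{theta'x_l}."""
    v = [sum(t * x for t, x in zip(theta, xj)) for xj in X]
    m = max(v)
    e = [math.exp(vi - m) for vi in v]
    s = sum(e)
    return [ei / s for ei in e]

theta_hat = [0.8, -0.3]               # hypothetical MLE estimates of theta(s)
X_old = [[1.0, 2.0], [0.5, 1.0]]      # characteristics of existing goods
x_new = [1.2, 0.5]                    # characteristics of the new good J+1

p_old = choice_probs(theta_hat, X_old)
p_aug = choice_probs(theta_hat, X_old + [x_new])
forecast_new = p_aug[-1]              # Pr(J+1 | s, B')
```

By IIA, adding the new good leaves the relative odds of the old goods unchanged, which is precisely the property the Debreu criticism targets.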
1.3.2 Debreu criticism of Luce’s model
Suppose the (J+1)th alternative is identical to the first alternative. Then
Pr(choose 1 or J+1|s, B′) = 2e^{θ(s)′x_{J+1}} / Σ_{l=1}^{J+1} e^{θ(s)′x_l}
• This means that introducing an identical good into the choice set changes the probability of making that choice, which is not a desirable result.

• The result comes from making an iid assumption on the error term associated with the new alternative:
U_1 = V_1 + ε_1, U_{J+1} = V_{J+1} + ε_{J+1}
• If the goods are identical, then we would expect the error terms to be perfectly correlated, not iid. The iid assumption is at odds with the notion of introducing a close (or identical) alternative.
1.4 Verifying that a probability choice system (PCS) is consistent with a random utility model (RUM)
What are criteria for a good probabilistic choice system?
1. has flexible functional form
2. is computationally practical
3. allows for flexibility in representing substitution patterns among choices
4. is consistent with a RUM (has a structural interpretation)
How do you verify whether a PCS is consistent with a RUM?
• as shown earlier, we can start with a RUM and derive the PCS:

U_i = V(s, x_i) + ε(s, x_i)

and solve the integral for

Pr(U_i > U_l for all l ≠ i) = Pr(V_i + ε_i > V_l + ε_l for all l ≠ i)
• In some cases, we can start with a PCS and verify that it is consistent with a RUM using sufficient conditions.
1.4.1 Daly-Zachary-Williams Theorem
Daly-Zachary (1976), Williams (1977)
Provides a set of conditions that make it easy to derive a PCS from a RUM within a class of models called generalized extreme value (GEV) models.
Define G = G(Y1, ..., YJ)
If G satisfies the following conditions:
1. G is nonnegative, defined on Y_1, ..., Y_J ≥ 0

2. G is homogeneous of degree one in its arguments

3. lim_{Y_i→∞} G(Y_1, ..., Y_i, ..., Y_J) = ∞, for all i = 1, ..., J

4. the kth mixed partial derivative ∂^k G / (∂Y_1 ··· ∂Y_k) is nonnegative if k is odd, and nonpositive if k is even
Then, for a RUM with U_i = V_i + ε_i and F(ε_1, ..., ε_J) = exp(−G(e^{−ε_1}, ..., e^{−ε_J})), the PCS is given by (with Y_i = e^{V_i})

P_i = ∂ ln G / ∂ ln Y_i = e^{V_i} G_i(e^{V_1}, ..., e^{V_J}) / G(e^{V_1}, ..., e^{V_J})

where G_i denotes the derivative of G with respect to its ith argument.
Remark: McFadden shows that under certain conditions on the form of V_i (it satisfies the AIRUM form), the DZW theorem can be seen as a form of Roy's identity.
Let’s apply the result:
(a) Multinomial logit model
F(ε_1, ..., ε_J) = e^{−e^{−ε_1}} ··· e^{−e^{−ε_J}} = e^{−Σ_{j=1}^{J} e^{−ε_j}}
Here,

G(e^{V_1}, ..., e^{V_J}) = Σ_{j=1}^{J} e^{V_j}, which satisfies the DZW conditions.

Applying DZW, we get the multinomial logit (MNL) model:

P_i = ∂ ln G / ∂V_i = e^{V_i} / Σ_{j=1}^{J} e^{V_j}
(b) Nested logit model (this model addresses the IIA criticism to a limited extent)
G(e^{V_1}, ..., e^{V_J}) = Σ_{m=1}^{M} a_m [Σ_{i∈B_m} e^{V_i/(1−σ_m)}]^{1−σ_m}
The idea is to first divide the goods into M branches: first choose a branch and then choose a good within the branch. This allows for correlation between the errors within a given branch.
Define B_m as a subset of the choice set:

B_m ⊆ {1, ..., J}, ∪_{m=1}^{M} B_m = B
Calculate the equation for P_i using the DZW result:

P_i = ∂ ln G / ∂V_i = [Σ_{m s.t. i∈B_m} a_m e^{V_i/(1−σ_m)} (Σ_{j∈B_m} e^{V_j/(1−σ_m)})^{−σ_m}] / [Σ_{m=1}^{M} a_m (Σ_{j∈B_m} e^{V_j/(1−σ_m)})^{1−σ_m}]

If we multiply by [Σ_{j∈B_m} e^{V_j/(1−σ_m)}]^{−1} [Σ_{j∈B_m} e^{V_j/(1−σ_m)}] (= 1), we
get

P_i = Σ_{m=1}^{M} Pr(i|B_m) Pr(B_m),

where

Pr(i|B_m) = e^{V_i/(1−σ_m)} / Σ_{j∈B_m} e^{V_j/(1−σ_m)} if i ∈ B_m, = 0 otherwise

Pr(B_m) = a_m [Σ_{j∈B_m} e^{V_j/(1−σ_m)}]^{1−σ_m} / Σ_{m′=1}^{M} a_{m′} [Σ_{j∈B_{m′}} e^{V_j/(1−σ_{m′})}]^{1−σ_{m′}}
How does the nested logit model accommodate the Red Bus-Blue Bus problem?
Suppose we have a nested set-up in which one branch contains only choice 1 and the other branch contains choices 2 and 3 (a two-branch tree: {1} and {2, 3}).
G = Y_1 + [Y_2^{1/(1−σ)} + Y_3^{1/(1−σ)}]^{1−σ}, Y_i = e^{V_i}
Pr(1|123) = ∂ ln G / ∂V_1 = e^{V_1} / (e^{V_1} + [e^{V_2/(1−σ)} + e^{V_3/(1−σ)}]^{1−σ})

Pr(2|123) = e^{V_2/(1−σ)} [e^{V_2/(1−σ)} + e^{V_3/(1−σ)}]^{−σ} / (e^{V_1} + [e^{V_2/(1−σ)} + e^{V_3/(1−σ)}]^{1−σ})
As V_3 → −∞, we get the logit

e^{V_2} / (e^{V_1} + e^{V_2})

As V_1 → −∞,

Pr(2|123) = e^{V_2/(1−σ)} / (e^{V_2/(1−σ)} + e^{V_3/(1−σ)})
What role does σ play? σ is the degree of substitutability parameter.
Recall that

F(ε_1, ε_2, ε_3) = exp(−G(e^{−ε_1}, e^{−ε_2}, e^{−ε_3}))
Here,

σ = cov(ε_2, ε_3) / √(var(ε_2) var(ε_3))

Thus, −1 ≤ σ ≤ 1 (a correlation coefficient).
It turns out that we also require σ > 0 for the DZW conditions to be satisfied, which is unfortunate because we cannot allow the ε's to be negatively correlated.
It can be shown that

lim_{σ→1} Pr(1|123) = e^{V_1} / (e^{V_1} + max(e^{V_2}, e^{V_3})).
What happens if V2 = V3?
Then,

Pr(2|123) = e^{V_2/(1−σ)} [2e^{V_2/(1−σ)}]^{−σ} / (e^{V_1} + [2e^{V_2/(1−σ)}]^{1−σ})
= 2^{−σ} e^{V_2} / (e^{V_1} + 2^{1−σ} e^{V_2})

lim_{σ→1} Pr(2|123) = (1/2) · e^{V_2} / (e^{V_1} + e^{V_2}),
This means that introducing an identical alternative cuts the probability of choosing good 2 by one half. This solves the Red Bus-Blue Bus problem.
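The limiting behavior just derived can be checked numerically. A sketch of the two-branch nested logit ({1} versus {2, 3}), written with log-sum-exp so that V/(1−σ) does not overflow as σ → 1:

```python
import math

def nested_probs(v1, v2, v3, sigma):
    """Three-good nested logit: good 1 alone, goods 2 and 3 share a nest.

    G = Y1 + (Y2^{1/(1-sigma)} + Y3^{1/(1-sigma)})^{1-sigma}, Yi = exp(Vi).
    Probabilities are computed via Pr(branch) * Pr(good | branch), using
    log-sum-exp so that large V/(1-sigma) values do not overflow.
    """
    a, b = v2 / (1.0 - sigma), v3 / (1.0 - sigma)
    m = max(a, b)
    # inclusive value of the {2,3} nest: (1-sigma) * log(e^a + e^b)
    inc = (1.0 - sigma) * (m + math.log(math.exp(a - m) + math.exp(b - m)))
    p1 = 1.0 / (1.0 + math.exp(inc - v1))          # Pr(1) = e^{v1}/(e^{v1}+e^{inc})
    p2_in_nest = 1.0 / (1.0 + math.exp(b - a))     # Pr(2 | nest chosen)
    p2 = (1.0 - p1) * p2_in_nest
    return p1, p2, 1.0 - p1 - p2
```

Setting σ = 0 collapses the nest to the plain logit, while σ near 1 with V_2 = V_3 reproduces the halving result above.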
Remark:
The nested logit can be expanded to accommodate multiple levels. For example, with three levels:

G = Σ_{q=1}^{Q} a_q Σ_{m∈Q_q} a_m [Σ_{i∈B_m} Y_i^{1/(1−σ_m)}]^{1−σ_m}
1.4.2 Application: Choice of transportation mode
Let's see how the model can be applied to the choice of transportation mode. This example is taken from Amemiya.
Suppose we assume that a person first chooses a neighborhood in which to live and then chooses a mode of transportation, given the set of choices available within that neighborhood.
Neighborhood m
Transportation mode t
P (m) : probability of residing in neighborhood m
P (i|Bm) : probability of choosing ith mode, given live in neighborhood m
Not all modes may be available in each neighborhood. Let T_m denote the number of modes available in neighborhood m.
P_{m,t} = e^{v(m,t)/(1−σ_m)} [Σ_{t′=1}^{T_m} e^{v(m,t′)/(1−σ_m)}]^{−σ_m} / Σ_{j=1}^{M} [Σ_{t′=1}^{T_j} e^{v(j,t′)/(1−σ_j)}]^{1−σ_j}

P_{t|m} = e^{v(m,t)/(1−σ_m)} / Σ_{t′=1}^{T_m} e^{v(m,t′)/(1−σ_m)}

P_m = [Σ_{t′=1}^{T_m} e^{v(m,t′)/(1−σ_m)}]^{1−σ_m} / Σ_{j=1}^{M} [Σ_{t′=1}^{T_j} e^{v(j,t′)/(1−σ_j)}]^{1−σ_j}

so that P_{m,t} = P_{t|m} P_m.
A standard type of utility function that people might use:

z_t : transportation mode characteristics
x_i : person characteristics
y_m : neighborhood characteristics

V(m, t) = z_t′γ + x_i′β_{mt} + y_m′α
Could also include interaction terms between x and z and y.
Estimate by MLE:

Π_{m=1}^{M} Π_{i=1}^{n} P_{m,t}^{D_{tm}}

where D_{tm} = 1 if person i chooses mode t and neighborhood m, and D_{tm} = 0 otherwise.
Chapter 2
Multinomial Probit Model
Main reference:
Bunch (1979), "Estimability in the Multinomial Probit Model," Transportation Research
Applications of this model appear in Albright, Lerman and Manski (1978) and Hausman and Wise (1978). Also, see Amemiya.
2.1 MNP Model
Consider a random utility model where j denotes the alternative,
U_j = Z_jβ + η_j = Z_jβ̄ + Z_j(β − β̄) + η_j

This is an example of a random coefficients model, because the parameter β (which captures how people value the characteristics of the goods) is permitted to vary across people.
Assume there are J alternatives, K characteristics of the alternatives, and that β ∼ N(β̄, Σ_β).
Define Z (J×K) and β̄ (K×1), so that

Zβ̄ = [z_1′β̄, ..., z_J′β̄]′, η = [η_1, ..., η_J]′
We will assume that η ∼ N(0, Σ_η). Then,
Pr(alternative j selected) = Pr(U_j > U_i for all i ≠ j)

= ∫_{U_j=−∞}^{∞} ∫_{U_1=−∞}^{U_j} ··· ∫_{U_J=−∞}^{U_j} φ(U|V_µ, Σ_µ) dU_J ··· dU_1 dU_j
Remarks:
• Here, φ is a J-dimensional multivariate normal density with mean V_µ and variance Σ_µ.
• Unlike the MNL, no closed form expression exists for the integral in the normal case. The integrals are therefore usually evaluated using simulation methods (as will be discussed later).
How many parameters are there?
β̄ : K parameters

Σ_β : K×K symmetric matrix, (K² − K)/2 + K = K(K+1)/2 parameters

Σ_η : J(J+1)/2 parameters
When a person chooses alternative j, all we know is relative utility, not absolute utility. This suggests that not all the parameters in the model will be identified. We will therefore require some normalizations.
2.2 Identification
What does it mean to say a parameter is identified in a model? If a parameter or set of parameters is not identified, then a model with one parameterization is observationally equivalent to a model with a different parameterization.
Example: Binary probit model (with fixed β)
Consider the case where a person is making a choice between 1 and 2. Let D denote the choice and let X_i denote characteristics of the individual.
Pr(D = 1|X_i) = Pr(V_1 + ε_1 > V_2 + ε_2)
= Pr(X_iβ_1 + ε_1 > X_iβ_2 + ε_2)
= Pr(X_i(β_1 − β_2) > ε_2 − ε_1)
= Pr(X_i(β_1 − β_2)/σ > (ε_2 − ε_1)/σ)
= Φ(X_iβ/σ), where β = β_1 − β_2
Φ(X_iβ/σ) is observationally equivalent to Φ(X_iβ*/σ*) for β/σ = β*/σ*, so β is not separately identified relative to σ.
But the ratio is identified. Suppose

Φ(X_iβ/σ) = Φ(X_iβ*/σ*)

Applying Φ^{−1} to both sides,

Φ^{−1}Φ(X_iβ/σ) = Φ^{−1}Φ(X_iβ*/σ*)

Σ_{i=1}^{n} X_i′X_i (β/σ) = Σ_{i=1}^{n} X_i′X_i (β*/σ*)

which implies

β/σ = β*/σ*
The set {b : b = βγ, γ any positive scalar} is identified.
Note that we assumed that Σ_{i=1}^{n} X_i′X_i is invertible. We would say that "β is identified up to scale, and the sign is identified" (because σ must be positive).
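The observational equivalence of (β, σ) and (kβ, kσ) for any k > 0 can be seen numerically. A sketch with arbitrary illustrative values:

```python
import math

def probit_prob(x, beta, sigma):
    """Pr(D = 1 | x) = Phi(x * beta / sigma) for scalar x."""
    z = x * beta / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# (beta, sigma) and (2*beta, 2*sigma) imply identical choice probabilities,
# so only the ratio beta/sigma is identified from choice data.
xs = [-1.5, -0.2, 0.0, 0.7, 2.3]
same = all(abs(probit_prob(x, 0.5, 1.0) - probit_prob(x, 1.0, 2.0)) < 1e-15
           for x in xs)
```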
Definition of Identification
For a parameter space B and a random vector Z, identification means selecting a subset of B that characterizes some aspect of the probability distribution of Z.
Now, return to identification in the MNP model.
Pr(j selected|V_µ, Σ_µ) = Pr(U_i − U_j < 0 for all i ≠ j)

Define the contrast matrix ∆_j, of dimension (J−1)×J, whose rows each contain a 1 in one column i ≠ j and a −1 in column j, so that

∆_jU = [U_1 − U_j, ..., U_J − U_j]′ (with the jth contrast omitted)
Pr(j selected|V_µ, Σ_µ) = Pr(∆_jU < 0|V_µ, Σ_µ) = Φ(0|V_Z, Σ_Z)

where

V_Z is the mean of ∆_jU, which is ∆_jZβ̄, of dimension (J−1)×1

Σ_Z is the variance, which is ∆_jZΣ_βZ′∆_j′ + ∆_jΣ_η∆_j′, of dimension (J−1)×(J−1)
Here, we reduced the dimension of the integration problem by one. This representation shows that all the information is in the contrasts, and we therefore cannot identify all of the components of V_µ and Σ_µ.
Now define ∆̄_j as ∆_j with the Jth column removed, and choose J as the "reference alternative" with corresponding contrast matrix ∆_J. We can verify that

∆_j = ∆̄_j ∆_J

For example, with three goods,

[1 −1; 0 −1] [1 0 −1; 0 1 −1] = [1 −1 0; 0 −1 1], i.e. ∆̄_2 ∆_3 = ∆_2
Therefore, we can write:

V_Z = ∆_jZβ̄

Σ_Z = ∆_jZΣ_βZ′∆_j′ + ∆̄_j∆_JΣ_η∆_J′∆̄_j′

The last expression, C_j = ∆̄_j∆_JΣ_η∆_J′∆̄_j′, is (J−1)×(J−1), with J(J−1)/2 total parameters.
- Because the original model can always be expressed in terms of a model with (β̄, Σ_β, C_j), it follows that some of the parameters in the original model are not identified.
- How many parameters are identified?

Original model: K + K(K+1)/2 + J(J+1)/2
Now: K + K(K+1)/2 + J(J−1)/2

So J(J+1)/2 − J(J−1)/2 = J parameters are not identified.
It turns out that one additional parameter is not identified, because evaluating

Φ(0|kV_Z, k²Σ_Z)

gives the same result as evaluating

Φ(0|V_Z, Σ_Z),

so we can eliminate one more parameter through a suitable choice of k. Thus, J+1 parameters from the original model are not identified.
Illustration of how to impose normalization
J = 3

Σ_η = [σ_11 σ_12 σ_13; σ_21 σ_22 σ_23; σ_31 σ_32 σ_33]
Set 3 as the reference alternative:

C_3 = ∆_3Σ_η∆_3′ = [1 0 −1; 0 1 −1] Σ_η [1 0 −1; 0 1 −1]′

= [σ_11 − 2σ_31 + σ_33, σ_21 − σ_31 − σ_32 + σ_33; σ_21 − σ_31 − σ_32 + σ_33, σ_22 − 2σ_32 + σ_33]
Only combinations of parameters are identified. For example,

∆̄_2∆_3Σ_η∆_3′∆̄_2′ = [1 −1; 0 −1] [σ_11 − 2σ_31 + σ_33, σ_21 − σ_31 − σ_32 + σ_33; σ_21 − σ_31 − σ_32 + σ_33, σ_22 − 2σ_32 + σ_33] [1 0; −1 −1]
Normalization approach of Albright, Lerman and Manski (1978):

- need J+1 restrictions on the VCV matrix

- fix J parameters by setting the last row and last column of Σ_η to 0

- fix the scale by constraining the diagonal elements of Σ_η so that trace(Σ_η) equals the variance of a standard log Weibull. (They compare estimates of an MNL with those of an MNP and of an independent MNP.)
2.3 How do we solve the forecasting problem?
Suppose we have two goods and add a third.
Pr(choose 1|Z_1, Z_2) = Pr(U_1 − U_2 ≥ 0) = Pr((Z_1 − Z_2)β̄ ≥ w_2 − w_1)

where

w_1 = Z_1(β − β̄) + η_1
w_2 = Z_2(β − β̄) + η_2

Because w_2 − w_1 is normally distributed,

Pr(choose 1|Z_1, Z_2) = ∫_{−∞}^{(Z_1−Z_2)β̄ / [σ_11 + σ_22 − 2σ_12 + (Z_2−Z_1)Σ_β(Z_2−Z_1)′]^{1/2}} (1/√(2π)) e^{−t²/2} dt
Now add a third good:

U_3 = Z_3β̄ + Z_3(β − β̄) + η_3

The problem is that we don't know the correlation of η_3 with the other errors.

Suppose we knew that η_3 = 0 (i.e. only preference heterogeneity). Then

Pr(1 is chosen) = ∫_{−∞}^{a} ∫_{−∞}^{b} BVN dt_1 dt_2

where BVN denotes the bivariate normal density and
a = (Z_1 − Z_2)β̄ / [σ_11 + σ_22 − 2σ_12 + (Z_2 − Z_1)Σ_β(Z_2 − Z_1)′]^{1/2}

b = (Z_1 − Z_3)β̄ / [σ_11 + (Z_3 − Z_1)Σ_β(Z_3 − Z_1)′]^{1/2}
We could also solve the forecasting problem if we make an assumption like η_2 = η_3.
2.4 Application from Hausman and Wise (1978)
They develop a model of commuting choices for workers in Washington, D.C.
Choice of
(a) driving car
(b) sharing ride
(c) riding bus
Utility model for person i choosing mode j
U_ij = β_{i1} log x_{1ij} + β_{i2} log x_{2ij} + β_{i3} (x_{3ij}/x_{4i}) + ε_{ij}

x_{1ij} = time in vehicle
x_{2ij} = time out of vehicle
x_{3ij} = cost of mode j for person i
x_{4i} = income of person i
Chapter 3
Simulation Methods for MNP Models
MNP models tend to be difficult to estimate because of the high dimensional integrals that need to be evaluated at each stage of evaluating the likelihood.
There are a variety of simulation methods that have been developed for estimating these types of models. Two popular methods are:
• Simulated method of moments (SMOM)
• Simulated maximum likelihood (SMLE)
References :
Lerman and Manski (1981), in Structural Analysis of Discrete Data
McFadden (1989) Econometrica
Ruud (1982) Journal of Econometrics
Hajivassiliou and Ruud, Ch. 20, Handbook of Econometrics; Stern (1992) Econometrica
Stern (1997) survey article in Journal of Economic Literature
Gourieroux, Christian and Alain Monfort (2002): Simulation-Based Econometric Methods
Geweke, John and Michael Keane (2001): "Computationally intensive methods for integration in econometrics," Handbook of Econometrics, Vol. 5.
3.1 Early simulation method: crude frequency method
Consider a model with only taste heterogeneity but no preference heterogeneity:
U_j = Z_jβ + η_j, β fixed, η_j ∼ N(0, Ω), J choices
Let
P_ij = probability that person i chooses j
Y_ij = 1 if i chooses j, = 0 otherwise
The likelihood can be written as

L = Π_{i=1}^{N} Π_{j=1}^{J} (P_ij)^{Y_ij}

log L = Σ_{i=1}^{N} Σ_{j=1}^{J} Y_ij log P_ij = Σ_{i=1}^{N} log P_ik

where N denotes the number of individuals and k is the alternative chosen by individual i.
Simulation Algorithm:

1. For a given (β, Ω), generate R Monte Carlo draws of the vector of residuals η.

2. Let

P̂_k = (1/R) Σ_{r=1}^{R} 1(U_k^r = max{U_1^r, ..., U_J^r})

P̂_k is a frequency simulator of the proportion of times that k is chosen given (β, Ω).

3. Maximize Σ_{i=1}^{N} log P̂_ik over alternative values for (β, Ω).
Lerman and Manski found that this procedure performs poorly and requires a large number of draws to be accurate, particularly when P is close to 0 or 1.
The variance of the frequency simulator is (1/R²) · R · Var(1(·)) = P_k(1 − P_k)/R.
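A sketch of the frequency simulator in steps 1-2 above, here with iid N(0, 1) errors purely for illustration (in the MNP application η would be drawn from N(0, Ω)):

```python
import random

random.seed(7)

def frequency_simulator(V, R):
    """Crude frequency simulator: fraction of R Monte Carlo draws in which
    each alternative attains the maximal utility U_j = V_j + eta_j."""
    J = len(V)
    counts = [0] * J
    for _ in range(R):
        U = [v + random.gauss(0.0, 1.0) for v in V]
        counts[max(range(J), key=lambda j: U[j])] += 1
    return [c / R for c in counts]

# Symmetric case: each of the three alternatives should be chosen about 1/3
# of the time, with simulation standard error sqrt(P_k(1 - P_k)/R) ~ 0.003.
P_hat = frequency_simulator([0.0, 0.0, 0.0], 30_000)
```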
McFadden (1989) provided some insights into how to improve the simulation method. He showed that simulation is a viable method of estimation even for a small number of draws, provided that
(a) an unbiased simulator is used
(b) the functions that are simulated appear linearly in the conditions defining the estimator
(c) the same set of random draws is used to simulate the model at different parameter values
Condition (b) is violated for the crude frequency simulator, which enters the likelihood through log P̂_ik.
3.2 Simulated method of moments
Consider the model we had before, but with only preference heterogeneity and no taste heterogeneity:

U_ij = Z_ijβ = Z_ijβ̄ + Z_ijε_i, β = β̄ + ε_i
Define
P_ij(γ) = Pr(i chooses j|Z_i, β̄, Σ_ε), where γ collects the parameters (β̄, Σ_ε)
Y_ij = 1 if i chooses j, = 0 otherwise
The likelihood is given by:

log L = (1/N_0) Σ_{i=1}^{N} Σ_{j=1}^{J} Y_ij ln P_ij(γ), N_0 = NJ

∂ log L / ∂γ = (1/N_0) Σ_{i=1}^{N} Σ_{j=1}^{J} Y_ij [∂P_ij/∂γ] / P_ij(γ) = 0 (*)
Now use the fact that

Σ_{j=1}^{J} P_ij(γ) = 1

which implies

Σ_{j=1}^{J} ∂P_ij/∂γ = 0 and Σ_{j=1}^{J} ([∂P_ij/∂γ] / P_ij) P_ij = 0.

We can therefore rewrite (*) as

(1/N_0) Σ_{i=1}^{N} Σ_{j=1}^{J} (Y_ij − P_ij) [∂P_ij/∂γ] / P_ij(γ) = 0
Note that E(Y_ij) = P_ij. Letting W_ij = [∂P_ij/∂γ] / P_ij(γ) and ε_ij = Y_ij − P_ij, we have

(1/N_0) Σ_{i=1}^{N} Σ_{j=1}^{J} ε_ij W_ij = 0,

which is like a moment condition using W_ij as an instrument. So far, P_ij is still a (J−1)-dimensional integral.
The method of simulated moments solves the first-order conditions.
Model:

U_ij = Z_ijβ̄ + Z_ijε_i
U_i = Z_iβ̄ + Z_iΓe_i

U_i is J×1
Z_i is J×K
e_i is K×1
Γ is K×K
ΓΓ′ = Σ_ε (Cholesky decomposition)
e_i ∼ N(0, I_K)
Simulation algorithm:
(i) Generate e_i for each i such that the e_i are iid across persons and distributed N(0, I_K). In total, generate NKR draws, where N is the sample size, K is the number of characteristics, and R is the number of Monte Carlo draws.
(ii) Fix the matrix Γ, obtain η_ij = Z_ijΓe_i, and form the vector [Z_{i1}Γe_i, ..., Z_{iJ}Γe_i]′ for each person.
(iii) Fix β̄ and generate U_ij = Z_ijβ̄ + η_ij for all i.

(iv) Find the relative frequency with which the ith person chooses alternative j across the Monte Carlo draws:

P̂_ij(γ) = (1/R) Σ_{r=1}^{R} 1(U_ij^r > U_im^r for all m ≠ j)

P̂_ij(γ) is a simulator for P_ij(γ). Stack to get P̂_i(γ).
(v) To get P̂_ij(γ) for different values of γ, repeat steps (i) through (iv) using the same random draws e_i from step (i).
(vi) Obtain a numerical simulator for ∂P_ij/∂γ: for small h,

∂P̂_ij(γ)/∂γ_m ≈ [P̂_ij(γ + he_m) − P̂_ij(γ − he_m)] / (2h)

where e_m is a vector with a 1 in the mth place and zeros elsewhere.
Now we have the ingredients for evaluating the moment conditions. One can use a method such as Gauss-Newton to solve them.
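The numerical derivative in step (vi) is a standard central difference. A sketch, with a simple quadratic standing in for the simulated P̂_ij (in the actual SMM application, condition (c) requires using the same draws e_i at γ + h and γ − h):

```python
def central_diff(f, gamma, m, h=1e-5):
    """Central finite-difference approximation to df/dgamma_m at gamma."""
    up = list(gamma); up[m] += h
    dn = list(gamma); dn[m] -= h
    return (f(up) - f(dn)) / (2.0 * h)

# Stand-in objective: f(g) = g0^2 + 3*g1, so df/dg0 = 2*g0 and df/dg1 = 3
f = lambda g: g[0] ** 2 + 3.0 * g[1]
d0 = central_diff(f, [2.0, 1.0], 0)
d1 = central_diff(f, [2.0, 1.0], 1)
```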
Limitations of Simulation Methods
1. The P̂_ij(γ) simulator is not smooth due to the presence of the indicator function (this causes difficulties in deriving the asymptotic distribution; one needs the methods developed by Pakes and Pollard (1989) for nondifferentiable functions)
2. P̂_ij cannot equal 0 (this causes problems in the denominator when P_ij is close to 0)
3. Simulating a small Pij may require a large number of draws
Refinement: "Smoothed Simulated Method of Moments" by S. Stern (1992, Econometrica)
This replaces the indicator function with a smooth function, which is differentiable:

P̂_ij(γ) = (1/R) Σ_{r=1}^{R} K(U_ij^r − U_im^r)

where K(·) is a smooth kernel.
The smoothed function goes smoothly from 0 to 1 around 0 instead of changing abruptly at 0.
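One illustrative choice of smooth stand-in for the indicator is a logistic cdf with a bandwidth h (the particular kernel here is an assumption for illustration; Stern's paper develops the appropriate smoothers):

```python
import math

def smooth_indicator(d, h):
    """Smooth approximation to 1(d > 0): a logistic cdf with bandwidth h.

    For h > 0 this is everywhere differentiable in d, and as h -> 0 it
    approaches the hard indicator function.
    """
    return 1.0 / (1.0 + math.exp(-d / h))
```

Replacing the indicator by smooth_indicator(U_ij − U_im, h) inside the frequency average makes the simulator differentiable in γ.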
How does simulation affect the asymptotic distribution?
Define
W_i = [∂P_i/∂γ] / P_i(γ).
The variance of the method of moments estimator is given by

D(γ) = plim[(1/N) Σ_{i=1}^{N} W_i′ ∂P_i/∂γ]^{−1} · plim[(1/N) Σ_{i=1}^{N} (1 + 1/R) P_i(1 − P_i) W_i′W_i] · plim[(1/N) Σ_{i=1}^{N} W_i′ ∂P_i/∂γ]^{−1}

where the term 1/R adjusts for the simulation error.
Remarks:
• An advantage of simulated method of moments over simulated maximum likelihood is that SMM does not require that the number of draws R go to infinity, and in fact it has been shown to perform well even for a modest number of draws.
• Simulated maximum likelihood (where the simulator is plugged directly into the likelihood instead of working with the FOC) does require that the number of draws go to infinity.
Chapter 4
Nonrandom sampling
References: Amemiya (Chapters 9 and 10), Manski and McFadden (Chapters 1 and 2), Manski and Lerman (1978, Econometrica)
Different Types of Sampling
• random sampling
• censored sampling
• truncated sampling
• other nonrandom
– exogenous stratified sampling
– endogenous stratified (choice-based) sampling
- For example, in administering surveys, there is often oversampling of high density population areas (often done to reduce sampling costs or to increase representation of certain groups). The sampling could be considered exogenous or endogenous, depending on the phenomenon being studied.
- Most estimators are developed assuming random sampling, and adjustments usually need to be made to accommodate nonrandom sampling situations.
- Datasets usually provide sampling weights that are required to adjust for the nonrandom sampling.
4.1 Some examples of choice-based sampling
• Suppose we want to study transportation mode choices and we gather data at the train station, the subway station, and at car checkpoints (e.g. toll booths).
• Study the effectiveness of a new treatment for heart attacks by following people who have had the treatment along with a control group of people who have not had the treatment.
• Evaluate the effects of a job training program using data on a group ofprogram participants and a group of nonparticipants. Participants areoften oversampled relative to their frequency in a random population.
In all of these cases, we observe characteristics of populations conditional on the choices they made. Let's look at how weights can be used to adjust for choice-based sampling:
Let P_i = P(i|Z_i) in a random sample, and let P*_i be the corresponding probability in a choice-based sample.

Under choice-based sampling, sampling is random within the i partitions of the data,

P(Z|i) = P*(Z|i)

but

P(Z) ≠ P*(Z).
Suppose we want to recover P(i|Z) from choice-based sampled data. We observe:

P*(i|Z), P*(Z), P*(i)
and we assume that
P*(Z|i) = P(Z|i)
We will apply Bayes' rule:

P(A|B) = P(A, B)/P(B) = P(B|A)P(A)/P(B)
By Bayes' rule, write

P*(i|Z) = P*(Z|i)P*(i) / P*(Z)

P(i|Z) = P(Z|i)P(i) / P(Z)

⟹ P(i|Z) = [P*(i|Z)P*(Z) / P*(i)] · P(i) / P(Z)

Using P(Z) = Σ_j P(Z|j)P(j),

P(i|Z) = [P*(i|Z)P*(Z) / P*(i)] P(i) / Σ_j P*(Z|j)P(j)

= [P*(i|Z)P*(Z) / P*(i)] P(i) / Σ_j [P*(j|Z)P*(Z) / P*(j)] P(j)

= P*(i|Z) [P(i)/P*(i)] / Σ_j P*(j|Z) [P(j)/P*(j)]
This shows that to recover P(i|Z) from choice-based sampled data, we need to know P(j) and P*(j) for all choices j. P*(j) can be estimated from the sample, but P(j) requires outside information (usually from some other dataset). We need to form the weights P(i)/P*(i).
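The last formula can be applied directly once the population shares P(j) and sample shares P*(j) are known. A sketch with hypothetical shares:

```python
def reweight(p_star_given_z, pop_shares, sample_shares):
    """Recover P(i|Z) from choice-based P*(i|Z) via weights w_i = P(i)/P*(i)."""
    w = [p / ps for p, ps in zip(pop_shares, sample_shares)]
    num = [pz * wi for pz, wi in zip(p_star_given_z, w)]
    s = sum(num)                      # denominator: sum_j P*(j|Z) P(j)/P*(j)
    return [n / s for n in num]

# Hypothetical example: choice 1 is oversampled (40% of the sample vs 10%
# of the population), so its conditional probability is deflated.
p_star = [0.5, 0.5]                   # P*(i|Z) observed in the sample
pop, samp = [0.1, 0.9], [0.4, 0.6]    # P(j) from outside data, P*(j) from sample
p_random = reweight(p_star, pop, samp)
```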
4.1.1 Manski and Lerman (1978)
Consider how to introduce weights to adjust for choice-based sampling in a standard probit model.
Binary choice model
D_i = 1 if x_iβ + ε_i > 0, = 0 otherwise
Denote the choice-based sampling weights

w_1i = P(D_i = 1) / P*(D_i = 1)

w_0i = P(D_i = 0) / P*(D_i = 0)
Likelihood:

L = Π_{i=1}^{n} [1 − Φ(−x_iβ)]^{w_1i D_i} [Φ(−x_iβ)]^{w_0i(1−D_i)}

In logs,

log L = (1/N) Σ_{i=1}^{n} { w_1i D_i log[1 − Φ(−x_iβ)] + w_0i (1 − D_i) log[Φ(−x_iβ)] }
As N gets large, (1/N) Σ_i D_i converges to P*(D_i = 1) and (1/N) Σ_i (1 − D_i) to P*(D_i = 0), so that, incorporating the weights, log L converges to E(log L).
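A sketch of the weighted log-likelihood for scalar x_i (the data points and weights below are hypothetical):

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def weighted_probit_loglik(D, X, beta, w1, w0):
    """Manski-Lerman weighted probit average log-likelihood, scalar x_i."""
    ll = 0.0
    for d, x in zip(D, X):
        p1 = 1.0 - Phi(-x * beta)      # Pr(D = 1 | x)
        ll += w1 * d * math.log(p1) + w0 * (1 - d) * math.log(1.0 - p1)
    return ll / len(D)

# With weights equal to 1 this reduces to the ordinary probit log-likelihood
D, X = [1, 0, 1], [0.5, -1.0, 2.0]
ll_unweighted = weighted_probit_loglik(D, X, 1.0, 1.0, 1.0)
```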
Chapter 5
Limited Dependent Variable Models
main reference: Amemiya
In discrete choice models, we observe which choice is made from a set of alternatives, along with some observable covariates of the individual and/or of the choices.
We next consider a class of models where the outcome variables are partially observed, called limited dependent variable models.
5.1 Type I Tobit Model
In Tobin's (1958) example, expenditure on a durable good (e.g. a refrigerator) is only observed if the good is purchased. Some people chose not to buy the good, but might have bought it had the price been lower.
Define a latent variable that represents the desired expenditure amount on a good:
y∗i = xiβ + ui
We observe
y_i = y*_i if y*_i ≥ y_0
= 0 if y*_i < y_0

where y_0 is the minimum price available in the market. Or, we can write

y_i = 1(y*_i ≥ y_0) y*_i.
Estimation
We can assume a parametric specification on the error term:
u_i ∼ N(0, σ_u²), so y*_i|x_i ∼ N(x_iβ, σ_u²)

D_i = 1 if y*_i ≥ y_0, = 0 otherwise
The likelihood can be written as

L = Π_{i=1}^{n} [Pr(y*_i < y_0)]^{1−D_i} [Pr(y*_i ≥ y_0) f(y*_i|y*_i ≥ y_0)]^{D_i}.

Pr(y*_i < y_0) = Pr((x_iβ + u_i)/σ_u < y_0/σ_u) = Φ((y_0 − x_iβ)/σ_u)

f(y*_i|y*_i ≥ y_0) = [(1/σ_u) φ((y*_i − x_iβ)/σ_u)] / [1 − Φ((y_0 − x_iβ)/σ_u)]

Now, note that the likelihood can be written as:
L = Π_{D_i=0} Φ((y_0 − x_iβ)/σ_u) · Π_{D_i=1} [1 − Φ((y_0 − x_iβ)/σ_u)] · Π_{D_i=1} [(1/σ_u) φ((y*_i − x_iβ)/σ_u)] / [1 − Φ((y_0 − x_iβ)/σ_u)]

Remarks
(i) The first two products contain the information that you would get from whether y*_i ≥ y_0 (i.e. whether the good was bought or not). This information alone could be used to estimate β up to scale by a binary probit.
(ii) With the additional information on expenditure (y*_i), one can get a more efficient estimate of β/σ_u and an estimate of σ_u, which was not identified from the purchase information alone.
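The censored likelihood above can be coded directly. A sketch for scalar x_i, evaluated on data simulated from the model (the parameter values and seed are illustrative):

```python
import math
import random

random.seed(3)

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(y, x, beta, sigma, y0=0.0):
    """Type I Tobit log-likelihood: censored observations contribute
    log Phi((y0 - x_i*beta)/sigma); uncensored ones the log normal density."""
    ll = 0.0
    for yi, xi in zip(y, x):
        if yi <= y0:    # censored: only know that y* fell below y0
            # clamp guards against Phi underflowing to 0 far in the tail
            ll += math.log(max(Phi((y0 - xi * beta) / sigma), 1e-300))
        else:           # expenditure observed
            ll += math.log(norm_pdf((yi - xi * beta) / sigma) / sigma)
    return ll

# Simulate from the model with beta = 1, sigma = 1, y0 = 0
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
ystar = [xi + random.gauss(0.0, 1.0) for xi in x]
y = [yi if yi >= 0.0 else 0.0 for yi in ystar]
```

The log-likelihood at the true (β, σ_u) = (1, 1) should exceed its value at distant parameters such as β = 3, which is what an MLE routine exploits.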
5.1.1 Truncated Type I Tobit Model
Suppose you only observe expenditure for the cases when the good was purchased. That is, you see y*_i only when y*_i ≥ y_0.
For example, maybe you gather your data at the appliance store, interviewing people who purchase the good.
In that case, you have less information and the likelihood is:
L = Π_{D_i=1} [(1/σ_u) φ((y*_i − x_iβ)/σ_u)] / [1 − Φ((y_0 − x_iβ)/σ_u)]
)It is still possible to estimate the model parameters using the truncated data.
5.1.2 Different Ways of Estimating Tobit Models
(a) MLE, or
(b) if censored, a two-step approach, or
(c) Nonlinear least squares
Two Step Approach (due to Heckman, 1974, 1976)
First, estimate a probit model using only the data on whether the good is purchased; this yields an estimate of β/σ_u.
Then estimate an OLS model on the observations for which y*_i is observed. For example, with y_0 = 0:

E(y_i|x_iβ + u_i ≥ 0) = x_iβ + E(u_i|x_iβ + u_i ≥ 0) = x_iβ + σ_u E(u_i/σ_u | u_i/σ_u ≥ −x_iβ/σ_u)

where ε = u_i/σ_u ∼ N(0, 1).
We will use some properties of truncated standard normal random variables:

E(ε|ε ≥ c) = φ(c) / (1 − Φ(c)) = φ(c) / Φ(−c)

E(ε|ε < c) = −φ(c) / Φ(c)
Define λ(c) = E(ε|ε ≥ c). λ(c) is also known as the mill’s ratio. It comesup sometimes in actuarial analysis in modeling the expected length of lifegiven an individual has survived to some age c.
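The two truncated-mean formulas are easy to verify by simulation. A quick Monte Carlo sketch (the truncation point c = 0.7 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of the truncated-mean formulas for eps ~ N(0,1):
# E(eps | eps >= c) = phi(c)/(1 - Phi(c)) and E(eps | eps < c) = -phi(c)/Phi(c).
rng = np.random.default_rng(1)
eps = rng.normal(size=2_000_000)
c = 0.7  # arbitrary truncation point

lam_above = norm.pdf(c) / (1 - norm.cdf(c))   # the Mills ratio lambda(c)
lam_below = -norm.pdf(c) / norm.cdf(c)

mc_above = eps[eps >= c].mean()               # sample mean of the upper tail
mc_below = eps[eps < c].mean()                # sample mean of the lower tail
```

With two million draws, the sample means of the two tails match the analytic expressions to about two decimal places.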
Applying these properties:

E( ui/σu | ui/σu ≥ −xiβ/σu ) = φ( −xiβ/σu ) / [ 1 − Φ( −xiβ/σu ) ] = φ( xiβ/σu ) / Φ( xiβ/σu )
Two-Step Estimation Procedure

(i) Estimate β/σu by probit.

(ii) Form λ̂i = φ( xiβ̂/σu ) / Φ( xiβ̂/σu ).

(iii) Regress

yi = xiβ + σu λ̂i + (vi + εi)

where vi = σu(λi − λ̂i) and εi = ui − E(ui | ui ≥ −xiβ).

Note that εi has conditional mean equal to zero by construction. Here, vi is the additional error due to the fact that λi is estimated.

The composite error term (vi + εi) will be heteroskedastic, which implies that the regression could be weighted to improve efficiency.

We need to account for the fact that λi is estimated when computing standard errors for β. This is called an “estimated regressor” problem (also sometimes called the “Durbin problem”).
Two ways of doing this
5.2. TYPE II TOBIT MODEL 41
- Delta method

- GMM (Newey, Economics Letters, 1984): set up moment conditions for the OLS estimation and for the probit estimation, with an optimal weighting matrix. Estimating the two jointly gives corrected standard errors.

(The correction for the estimated regressor is incorporated into Stata.)
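Standard-error corrections aside, the two-step point estimates themselves are simple to compute. A sketch on simulated data (all names and true values are illustrative; the probit step is coded by hand rather than with a packaged routine):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated Type I Tobit data with y0 = 0; true values (beta = (0.5, 1.0),
# sigma_u = 1.5) are illustrative.
rng = np.random.default_rng(2)
n = 20000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = x @ np.array([0.5, 1.0]) + 1.5 * rng.normal(size=n)
d = y_star >= 0
y = np.where(d, y_star, 0.0)

# Step 1: probit of D on x estimates gamma = beta / sigma_u
def probit_nll(g):
    q = 2 * d.astype(float) - 1
    return -np.sum(norm.logcdf(q * (x @ g)))
gamma_hat = minimize(probit_nll, np.zeros(2), method="BFGS").x

# Step 2: OLS of y on (x, mills_i) over the uncensored sample, where
# mills_i = phi(x*gamma)/Phi(x*gamma); the Mills-ratio coefficient
# estimates sigma_u
xg = x[d] @ gamma_hat
mills = norm.pdf(xg) / norm.cdf(xg)
Z = np.column_stack([x[d], mills])
coef, *_ = np.linalg.lstsq(Z, y[d], rcond=None)
beta_hat, sigma_hat = coef[:2], coef[2]
```

The OLS standard errors from step 2 are not valid as printed; they would need the delta-method or GMM correction described above.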
Nonlinear least squares approach
Suppose you use all the data, including the zeros. Then

E(yi|xi) = Pr(y∗i < 0) · 0 + Pr(y∗i ≥ 0) E(y∗i | y∗i ≥ 0)

= Φ( xiβ/σu ) [ xiβ + σu φ( xiβ/σu ) / Φ( xiβ/σu ) ]

One could estimate this by OLS, replacing Φ and φ/Φ with their estimates, or one could do nonlinear least squares.
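The nonlinear-least-squares route can be sketched as follows on simulated data (the parameter values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Fit the censored mean E(y|x) = Phi(xb/s)*xb + s*phi(xb/s) (which equals
# Phi(xb/s)*[xb + s*(phi/Phi)(xb/s)]) to ALL observations, zeros included.
# Simulated data; true values (0.5, 1.0, 1.5) are illustrative.
rng = np.random.default_rng(3)
n = 20000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(x @ np.array([0.5, 1.0]) + 1.5 * rng.normal(size=n), 0.0)

def cond_mean(theta):
    b, s = theta[:2], np.exp(theta[2])   # exp() keeps sigma > 0
    z = (x @ b) / s
    return norm.cdf(z) * (x @ b) + s * norm.pdf(z)

theta_hat = minimize(lambda t: np.sum((y - cond_mean(t)) ** 2),
                     x0=np.zeros(3), method="BFGS").x
beta_hat, sigma_hat = theta_hat[:2], np.exp(theta_hat[2])
```

Minimizing the sum of squared deviations from this conditional mean is consistent for (β, σu), though less efficient than the MLE.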
5.2 Type II Tobit Model
In the previous model, the outcome was observed based on whether the outcome itself exceeded a threshold value.

With Type II Tobit, we have one equation that describes the outcome and another equation that describes when the outcome is observed, called the switching equation:

y∗1i = x1iβ1 + u1i

y∗2i = x2iβ2 + u2i

y2i = y∗2i if y∗1i ≥ 0

= 0 else

Here, we observe x1i and x2i. y∗1i is not directly observed, although we observe whether y∗1i ≥ 0. y2i is only observed if y∗1i ≥ 0.
42 CHAPTER 5. LIMITED DEPENDENT VARIABLE MODELS
5.2.1 The likelihood:
The likelihood for this Type II Tobit model is:

L = Π_1 [ Pr(y∗1i ≥ 0) f(y2i | y∗1i ≥ 0) ] · Π_0 Pr(y∗1i < 0)

where Π_1 denotes the product over observations with y∗1i ≥ 0, Π_0 the product over the rest, and

f(y2i | y∗1i ≥ 0) = [ ∫_0^∞ f(y∗1i, y∗2i) dy∗1i ] / [ ∫_0^∞ f(y∗1i) dy∗1i ]

= f(y∗2i) [ ∫_0^∞ f(y∗1i | y∗2i) dy∗1i ] / [ ∫_0^∞ f(y∗1i) dy∗1i ],

where f(y∗2i) and f(y∗1i) are the marginal densities.
Assume

y∗1i ∼ N(x1iβ1, σ1²)

y∗2i ∼ N(x2iβ2, σ2²)

We will use the following properties of conditional normal densities:

y∗1i | y∗2i ∼ N( x1iβ1 + (σ12/σ2²)(y∗2i − x2iβ2), σ1² − σ12²/σ2² )

y∗2i | y∗1i ∼ N( x2iβ2 + (σ12/σ1²)(y∗1i − x1iβ1), σ2² − σ12²/σ1² )
Estimation by MLE:

L = Π_0 [ 1 − Φ( x1iβ1/σ1 ) ] · Π_1 Φ( x1iβ1/σ1 ) f(y∗2i) [ ∫_0^∞ f(y∗1i | y∗2i) dy∗1i ] / [ ∫_0^∞ f(y∗1i) dy∗1i ]

= Π_0 [ 1 − Φ( x1iβ1/σ1 ) ] × Π_1 (1/σ2) φ( (y2i − x2iβ2)/σ2 ) [ 1 − Φ( −( x1iβ1 + (σ12/σ2²)(y2i − x2iβ2) ) / σ∗ ) ]

where

σ∗ = √( σ1² − σ12²/σ2² )

The likelihood depends on σ1 only through β1/σ1 and σ12/σ1, so we cannot separately identify σ1, and WLOG we can impose a normalization such as σ1 = 1.
Estimation by Two-Step Approach

Use the data on y2i for observations with y∗1i > 0:

E(y2i | y∗1i > 0) = x2iβ2 + E(u2i | x1iβ1 + u1i > 0)

= x2iβ2 + σ2 E( u2i/σ2 | u1i/σ1 > −x1iβ1/σ1 )

= x2iβ2 + (σ12/σ1) E( u1i/σ1 | u1i/σ1 > −x1iβ1/σ1 )

= x2iβ2 + α φ( −x1iβ1/σ1 ) / [ 1 − Φ( −x1iβ1/σ1 ) ]

= x2iβ2 + α λ( −x1iβ1/σ1 )

where α ≡ σ12/σ1 and λ(c) = φ(c)/(1 − Φ(c)) is the Mills ratio defined earlier.
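The two-step estimator just derived can be sketched on simulated data. Here σ1 = 1 is imposed, the third regressor appears only in the selection equation (an illustrative exclusion restriction), and all names and parameter values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulate a Type II Tobit (sample selection) model with sigma_1 = 1.
rng = np.random.default_rng(4)
n = 20000
x1 = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
x2 = x1[:, :2]                      # outcome eq. excludes the third regressor
beta1 = np.array([0.2, 1.0, 0.8])   # selection equation
beta2 = np.array([1.0, 0.5])        # outcome equation
rho, sigma2 = 0.6, 1.2              # corr(u1,u2) and sd(u2); sigma_12 = 0.72
u1 = rng.normal(size=n)
u2 = sigma2 * (rho * u1 + np.sqrt(1 - rho**2) * rng.normal(size=n))
select = x1 @ beta1 + u1 >= 0
y2 = np.where(select, x2 @ beta2 + u2, 0.0)

# Step 1: probit for Pr(y1* >= 0 | x1) estimates beta1 / sigma_1
def probit_nll(g):
    q = 2 * select.astype(float) - 1
    return -np.sum(norm.logcdf(q * (x1 @ g)))
g_hat = minimize(probit_nll, np.zeros(3), method="BFGS").x

# Step 2: OLS of y2 on (x2, inverse Mills ratio) over the selected sample;
# the Mills-ratio coefficient estimates alpha = sigma_12 / sigma_1
xg = x1[select] @ g_hat
mills = norm.pdf(xg) / norm.cdf(xg)
Z = np.column_stack([x2[select], mills])
coef, *_ = np.linalg.lstsq(Z, y2[select], rcond=None)
beta2_hat, alpha_hat = coef[:2], coef[2]
```

As in the Type I case, the second-step standard errors would need to be corrected for the estimated regressor.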
5.2.2 Examples
Estimating the effect of school quality on school-going behavior and on test scores in a developing country

Suppose we wish to assess how improving the quality of schools in a country would affect school-going behavior and test scores.

Quality might be measured in terms of school infrastructure (e.g. type of roof, whether there is a bathroom), textbooks, or teacher characteristics (experience, degrees).

Suppose the tests are given in school and we only observe test scores for children who go to school. However, we observe for all children whether they go to school.

y2i = student’s test score

y1i = index representing parents’ propensity to enroll the child in school

x1i and x2i might both include school quality along with other covariates. For example, distance to school could be a factor influencing whether a family sends their child to school (i.e. included in x1i but not in x2i).
Gronau’s Female Labor Supply Model

max_{C,L} U(C, L) subject to

C = wH + A

H = 1 − L

w = wage rate (given)

Pc = 1

L = time spent at home for child care

First-order condition:

MRS = (∂U/∂L) / (∂U/∂C) = w when L < 1

(∂U/∂L) / (∂U/∂C) > w if L = 1

Define the reservation wage wR = MRS |H=0.

We don’t observe the reservation wage directly, so model

woi = xiβ + ui (wage offer)

wRi = ziγ + vi (reservation wage)

We only observe the wage offer if the offer exceeds the reservation wage:

wi = woi if woi > wRi

= 0 else.

This model fits within the Tobit Type II framework if we set

y∗1i = xiβ − ziγ + ui − vi = woi − wRi

y∗2i = woi

Gronau did not develop a model to explain the choice of H (hours worked).
5.3 Type III Tobit: Heckman’s Model (Incorporates Choice of H)

This model adopts a functional form for the utility function that implies a functional-form relationship between the marginal rate of substitution and hours worked:

woi = x2iβ2 + u2i

MRS = γHi + ziα + vi

MRS |H=0 = wR = ziα + vi

An individual works if

woi > wRi, or

x2iβ2 + u2i > ziα + vi

If the individual works, then woi = MRS:

⟹ x2iβ2 + u2i = γHi + ziα + vi

⟹ Hi = (x2iβ2 − ziα + u2i − vi)/γ = x1iβ1 + u1i

where

x1iβ1 = (x2iβ2 − ziα)γ⁻¹

u1i = (u2i − vi)γ⁻¹
We observe both hours and wages (two outcomes) if hours worked is positive:

y∗1i = x1iβ1 + u1i hours (= Hi)

y∗2i = x2iβ2 + u2i wages (= woi)

y1i = y∗1i if y∗1i > 0

= 0 if y∗1i ≤ 0

y2i = y∗2i if y∗1i > 0

= 0 if y∗1i ≤ 0.
5.4 Type IV Tobit Model
Adds another equation
y3i = y∗3i if y∗1i > 0
= 0 if y∗1i ≤ 0.
Can estimate these models by
(i) maximum likelihood
(ii) Two-step method
E(woi |Hi > 0) = γHi + ziα + E(vi|Hi > 0)
5.5 Type V Tobit: Switching Regression Model
In the so-called Switching Regression model, we observe one of two possible outcomes, and there is an equation that tells us which outcome is observed:

y∗1i = x′1iβ1 + u1i

y∗2i = x′2iβ2 + u2i

y∗3i = x′3iβ3 + u3i

y2i = y∗2i if y∗1i > 0

= 0 if y∗1i ≤ 0

y3i = y∗3i if y∗1i ≤ 0

= 0 if y∗1i > 0, i = 1, ..., n
Only the sign of y∗1i is observed.
We could assume (u1i, u2i, u3i) are draws from a trivariate normal distribution.
The likelihood can be written as

L = Π_0 ∫_{−∞}^0 f3(y∗1i, y3i) dy∗1i · Π_1 ∫_0^∞ f2(y∗1i, y2i) dy∗1i

where f3(y∗1i, y3i) is the joint density of y∗1i and y∗3i, and f2(y∗1i, y2i) is the joint density of y∗1i and y∗2i.
5.5.1 Examples
Lee (1978) unionism model

The goal of Lee’s application is to estimate the effect of unions on worker wages. y∗2i represents the log of the wage rate if the ith worker joins a union, and y∗3i represents the log of the wage rate if he does not join a union.

y∗1i = y∗2i − y∗3i + ziα + vi,

where ziα captures costs (monetary and psychic) of joining a union. This set-up is close to the Roy model, which we will consider later.
Heckman (1978) model of effects of antidiscrimination laws
This model was used to estimate the effect of antidiscrimination laws on the income of African Americans in a way that takes into account that passage of the law may be endogenous. Let

y2i = average income of African Americans within a state

y∗1i = unobservable sentiment toward blacks in the ith state

wi = 1 if y∗1i > 0

= 0 if y∗1i ≤ 0
y∗1i = γ1y2i + x′1iβ1 + δ1wi + u1i
y2i = γ2y∗1i + x′2iβ2 + δ2wi + u2i
If we solve the two equations for y∗1i, the solution should not depend on wi (since wi = 1(y∗1i > 0), a reduced form that still contained wi would not be well defined for all values of the errors). This requires the assumption that

γ1δ2 + δ1 = 0

for the model to be logically consistent.
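The consistency condition can be checked numerically: solve the two-equation system for y∗1i at two values of wi and verify that the solution moves with wi exactly when γ1δ2 + δ1 ≠ 0. A small sketch, with arbitrary coefficient values:

```python
import numpy as np

# Solve the two-equation system for y1* at two values of w and check that the
# solution is invariant to w exactly when gamma1*delta2 + delta1 = 0.
# Coefficient values are arbitrary; a1, a2 stand in for the x'b + u terms.
rng = np.random.default_rng(5)
g1, g2, d2 = 0.4, 0.3, 0.7
a1, a2 = rng.normal(size=2)

def solve_y1(d1, w):
    # y1 = g1*y2 + d1*w + a1 ;  y2 = g2*y1 + d2*w + a2
    A = np.array([[1.0, -g1], [-g2, 1.0]])
    b = np.array([d1 * w + a1, d2 * w + a2])
    return np.linalg.solve(A, b)[0]

d1_ok = -g1 * d2                       # imposes gamma1*delta2 + delta1 = 0
diff_ok = solve_y1(d1_ok, 1.0) - solve_y1(d1_ok, 0.0)
diff_bad = solve_y1(0.5, 1.0) - solve_y1(0.5, 0.0)
```

Substituting the second equation into the first gives y∗1i(1 − γ1γ2) = (γ1δ2 + δ1)wi + ..., so the coefficient on wi in the reduced form vanishes exactly under the stated condition, which is what `diff_ok` confirms.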
Chapter 6
The Roy Model
• The Roy model is a two-sector econometric model of self-selection that has been very influential in economics, especially in labor economics.

• It was first developed by A.D. Roy (1951) in his analysis of earnings in two occupational sectors, in which individuals self-select into the sector with the highest earnings. In Roy’s original example, the two occupations were hunting and fishing.

• Roy assumed that individuals have some endowed skill in both sectors and that they choose to work in the sector that would give them the highest earnings.

• Although originally developed to study occupational choice, the model has also been used to study the effects of unions on earnings, the effects of education (high school or college) on earnings, the effects of federal job training programs, and the effects of private schools on test scores.

• Heckman and Honore (1990) develop the mathematics of the Roy model and derive a number of new results, generalizing it to non-normal elliptical distributions.
6.1 Model
To describe Roy’s set-up, let (F,H) denote an individual’s endowment ofskills (talent) in the two occupations, fishing and hunting.
Distribution of skills

• Talent matters more in the occupation with the higher variance in skills.

[Figure: joint density of lnH and lnF]

It is assumed that skills are log normally distributed. Denote the earnings of fishermen and hunters by:

yFi = pF Fi

yHi = pH Hi

where pF is the unit price of fish and pH the unit price of hunting output.
It is assumed that the individual chooses the occupation that maximizes their earnings, so the choice rule is:

yi = max{yFi, yHi}

An individual chooses fishing if

pF Fi > pH Hi

and chooses hunting if

pH Hi > pF Fi.

The line of indifference is:

Fi = (pH/pF) Hi

Let g(Fi, Hi) denote the distribution of skills in the population (a two-dimensional density).
What is the average amount of fish caught by people who choose the fishingoccupation?
E(Fi | yFi > yHi) = E(Fi | Fi > (pH/pF) Hi) = [ ∫_0^∞ ∫_{(pH/pF)Hi}^∞ Fi g(Fi, Hi) dFi dHi ] / [ ∫_0^∞ ∫_{(pH/pF)Hi}^∞ g(Fi, Hi) dFi dHi ]

If M is the population size, then the total supply of fish in the population will be

M [ E(Fi | Fi > (pH/pF) Hi) Pr(yFi > yHi) + 0 · Pr(yFi < yHi) ] = M E(Fi | Fi > (pH/pF) Hi) ∫_0^∞ ∫_{(pH/pF)Hi}^∞ g(Fi, Hi) dFi dHi

As the price of fish increases, more people will choose the fishing occupation and the supply of fish will increase.
A similar expression can be derived for the average amount of hunting output for people who choose hunting as their profession, and for the total amount of hunting output in the economy.
6.2 Mean earnings given self-selection
Next, we will analyze the implications of self-selection into occupations for mean earnings in each occupation. We will also consider how to recover the parameters of the underlying skill distribution.

Let Y0 denote log earnings in the fishing occupation, which is assumed to depend on some observed characteristics X and on some unobservables ε0.

Also, let Y1 denote log earnings in the hunting occupation, which is assumed to depend on X and on some unobservables ε1.

Y0i = ln yFi = ln pF + ln Fi = Xiβ0 + ε0i

Y1i = ln yHi = ln pH + ln Hi = Xiβ1 + ε1i
Note that the prices will enter into the intercepts of the log earnings equations. (But if prices were observed and varied across individuals, we could incorporate them as additional regressors.)
We can assume that the residuals are jointly normally distributed:

(ε0i, ε1i)′ ∼ N( (0, 0)′, [ σ0² σ01 ; σ10 σ1² ] )

Define the correlation coefficient:

ρ = σ10 / (σ1 σ0) ⟺ σ10 = ρ σ0 σ1

To recover the underlying skill distribution in the population, we need to recover the parameters β0, β1, σ0², σ1², and σ10.
Define D = 1 if an individual chooses occupation H, else D = 0.
Prob(D = 1|X) = Prob(Y1 > Y0|X)

= Prob(Xβ1 + ε1 ≥ Xβ0 + ε0)

= Prob( (ε0 − ε1)/σ ≤ X(β1 − β0)/σ )

Define ξ = (ε0 − ε1)/σ, where σ = √var(ε0 − ε1) = √( σ0² + σ1² − 2σ10 ).
If occupations were chosen or assigned at random (without regard to earnings potential), then average earnings would be
E(Y0|X) = Xβ0
E(Y1|X) = Xβ1
Now ask: what are average earnings in an occupation, given that individuals choose the occupation that maximizes their earnings?
First, let’s consider earnings in sector 1 for those who chose sector 1:

E(Y1|X, D = 1) = Xβ1 + σ1 E( ε1/σ1 | (ε0 − ε1)/σ ≤ X(β1 − β0)/σ )

To derive an expression for the last term, we will use the fact that for normally distributed, mean-zero random variables, the projection of U on V is

U = [Cov(U, V)/Var(V)] V + η

With ξ = (ε0 − ε1)/σ as before, Cov(ε1, ε0 − ε1) = σ10 − σ1², so

E(Y1|X, D = 1) = Xβ1 + σ1 E( ε1/σ1 | ξ ≤ X(β1 − β0)/σ )

= Xβ1 + [(σ10 − σ1²)/σ] E( ξ | ξ ≤ X(β1 − β0)/σ )

= Xβ1 + [(ρ σ1 σ0 − σ1²)/σ] E( ξ | ξ ≤ X(β1 − β0)/σ )

The term E(ξ | ξ ≤ X(β1 − β0)/σ) < 0, because it is the mean of a standard normal random variable that is truncated from above.¹

We can rewrite the expression as

E(Y1|X, D = 1) = Xβ1 + (σ1 σ0/σ)(ρ − σ1/σ0) E( ξ | ξ ≤ X(β1 − β0)/σ )

We have that

E(Y1|X, D = 1) > E(Y1|X) if ρ < σ1/σ0

E(Y1|X, D = 1) < E(Y1|X) if ρ > σ1/σ0

¹The mean without truncation is zero, E(ξ) = 0, so E(ξ | ξ ≤ c) < 0.
Similarly, we can obtain

E(Y0|X, D = 0) = Xβ0 + (σ1 σ0/σ)(σ0/σ1 − ρ) E( ξ | ξ ≥ X(β1 − β0)/σ )

where E(ξ | ξ ≥ X(β1 − β0)/σ) > 0, because it is the mean of a standard normal random variable truncated from below.

From these expressions, we see that

E(Y0|D = 0, X) > E(Y0|X) if ρ < σ0/σ1

E(Y0|D = 0, X) < E(Y0|X) if ρ > σ0/σ1

Can we have both E(Y1|X, D = 1) < E(Y1|X) and E(Y0|X, D = 0) < E(Y0|X)? No:

var(ε0 − ε1) = σ1² + σ0² − 2σ10 > 0

implies σ10 < (σ1² + σ0²)/2, so either

σ1²/(σ1 σ0) > σ10/(σ1 σ0) or σ0²/(σ1 σ0) > σ10/(σ1 σ0)

or both conditions hold; that is,

ρ < σ1/σ0 or ρ < σ0/σ1

or both.
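The selection formula for E(Y1|X, D = 1) can be verified by Monte Carlo at a fixed X. A sketch, in which all parameter values are arbitrary illustrations:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of
#   E(Y1 | X, D=1) = X*b1 + ((s10 - s1^2)/s) * (-phi(c)/Phi(c)),
# with c = X(b1 - b0)/s and s^2 = s0^2 + s1^2 - 2*s10.
rng = np.random.default_rng(6)
n = 2_000_000
b0, b1, X = 1.0, 1.2, 1.0
s0sq, s1sq, s10 = 1.0, 2.25, 0.9       # var(e0), var(e1), cov(e0, e1)

e = rng.multivariate_normal([0.0, 0.0], [[s0sq, s10], [s10, s1sq]], size=n)
Y0, Y1 = X * b0 + e[:, 0], X * b1 + e[:, 1]
D = Y1 > Y0                            # choose sector 1 when it pays more

s = np.sqrt(s0sq + s1sq - 2 * s10)
c = X * (b1 - b0) / s
formula = X * b1 + ((s10 - s1sq) / s) * (-norm.pdf(c) / norm.cdf(c))
mc = Y1[D].mean()                      # simulated mean among the self-selected
```

With these values ρ = 0.6 < σ1/σ0 = 1.5, so the simulation also illustrates the inequality above: the selected mean exceeds Xβ1.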
6.3 Estimation
How do we estimate this model? Write

Y1 = Xβ1 + σ1 E( ε1/σ1 | (ε0 − ε1)/σ ≤ X(β1 − β0)/σ ) + v1

v1 = ε1 − E(ε1|X, D = 1)

Y1 = Xβ1 + [(σ10 − σ1²)/σ] · [ −φ( X(β1 − β0)/σ ) / Φ( X(β1 − β0)/σ ) ] + v1

Let

λ1( X(β1 − β0)/σ ) = −φ( X(β1 − β0)/σ ) / Φ( X(β1 − β0)/σ )

Then

Y1 = Xβ1 + α1 λ1 + v1

Similarly, for D = 0, write

Y0 = Xβ0 + σ0 E( ε0/σ0 | (ε0 − ε1)/σ > X(β1 − β0)/σ ) + v0

v0 = ε0 − E(ε0|X, D = 0)

Y0 = Xβ0 + [(σ0² − σ10)/σ] · [ φ( X(β1 − β0)/σ ) / (1 − Φ( X(β1 − β0)/σ )) ] + v0

Let

λ0( X(β1 − β0)/σ ) = φ( X(β1 − β0)/σ ) / (1 − Φ( X(β1 − β0)/σ ))
Then
Y0 = Xβ0 + α0λ0 + v0
We can use Heckman’s (1976) proposed two-step estimation approach forthis model.
Step 1: First estimate a probit model for Pr(D = 1|X). From that, get an estimate of X(β1 − β0)/σ and form λ̂1 and λ̂0.

Step 2: Next estimate the Y1 and Y0 regressions, including λ̂1 and λ̂0 as “control functions.” From this, we get estimates of β1, β0, α1, and α0.

To separately identify σ1², σ0², and σ10, we would also have to estimate models for

Var(Y1|D = 1, X) and Var(Y0|D = 0, X)

and use the fact that for a standard normal random variable

Var(u | u > c) = 1 − λ(c)[λ(c) − c], where λ(c) = φ(c)/Φ(−c).
For further discussion, see Heckman and Honore (1990).
6.4 Constructing counterfactuals
Once the parameters of the model are estimated, we can use the model to construct so-called counterfactuals, which are objects that are not directly observed.

For example, we can ask what average earnings would be for sector 0 persons had they chosen sector 1 instead.
E(Y1|X, D = 0) = Xβ1 + E( ε1 | X, Xβ0 + ε0 > Xβ1 + ε1 )

= Xβ1 + σ1 E( ε1/σ1 | (ε0 − ε1)/σ > X(β1 − β0)/σ )

= Xβ1 + [(σ10 − σ1²)/σ] E( ξ | ξ > X(β1 − β0)/σ )

= Xβ1 + [(σ10 − σ1²)/σ] · φ( X(β1 − β0)/σ ) / (1 − Φ( X(β1 − β0)/σ ))

which we can estimate by

Xβ̂1 + α̂1 · φ( X(β̂1 − β̂0)/σ̂ ) / (1 − Φ( X(β̂1 − β̂0)/σ̂ ))
Similarly, average earnings for sector 1 persons had they chosen sector 0:

E(Y0|X, D = 1) = Xβ0 + E( ε0 | X, Xβ0 + ε0 ≤ Xβ1 + ε1 )

= Xβ0 + σ0 E( ε0/σ0 | (ε0 − ε1)/σ ≤ X(β1 − β0)/σ )

= Xβ0 + [(σ0² − σ10)/σ] E( ξ | ξ ≤ X(β1 − β0)/σ )

= Xβ0 + [(σ0² − σ10)/σ] · [ −φ( X(β1 − β0)/σ ) / Φ( X(β1 − β0)/σ ) ]

which we can estimate by

Xβ̂0 + α̂0 · [ −φ( X(β̂1 − β̂0)/σ̂ ) / Φ( X(β̂1 − β̂0)/σ̂ ) ]
The Roy model is very powerful in that it provides a way of inferring the distribution of underlying skills even though earnings are only observed for the self-selected group of individuals who chose a particular sector.
The model makes some strong assumptions, though. Namely, the model assumes

• That agents choose sectors without incurring pecuniary costs.

• That there is no uncertainty in the future rewards from choosing a sector.

• That Y0 and Y1 are normally distributed.
Other work relaxes some of these assumptions.
• Willis and Rosen (1979) present an application of an extended version of the Roy model to the decision to go to college, where the two sectors represent stopping at a high school degree or going on to college. One focus of their study is the parameter σ10, which tells us whether people who have unobserved skills that make them higher earners without college also tend to be higher earners with college (conditional on X). They find evidence of comparative advantage: individuals who go to college would not tend to be the highest earners had they stopped at high school, and vice versa.
• The Roy model has also been applied to study the decision to go to a public or private school, the decision to take a union wage job or not, and the decision to go to a job training program.