gdo 2009 selection_bias
How to correct selection bias in management research
Team: Bin Xu, Oualid El Ouardi, Shabnam Kazempur
Teacher: Mr Christophe Benavent
GDO Master Oualid El Ouardi—Xu Bin—Shabnam Kazempur Page 2
I. Overview
1.1 What is selection bias?
Selection bias is the distortion of a statistical analysis caused by pre- or
post-selecting the samples. Typically it makes measures of statistical
significance appear much stronger than they are, but it can also produce
completely illusory artifacts. Selection bias can result from scientific
fraud that manipulates data directly, but more often it is either
unconscious or due to biases in the instruments used for observation.
1.2 Reasons for selection bias
Figure 1
Figure 1 shows the three main reasons: selective non-response,
incomplete observability and self-selection.
1.2.1 Selective non response
Selective non-response means that a substantial share of the sampled units
does not respond. For example, the 1997 Dutch Labour Force Survey (LFS)
had a response rate of 56%. Apart from the LFS, other socio-cultural
surveys in the Netherlands also show response rates between 50% and 60%.
Panel studies are even more problematic. The response rate of the
two-wave Dutch Parliamentary Election Study (DPES) has been below 50%
since 1981, and only 43% of the electorate participated in 1998 (Aarts,
Van der Kolk & Kamp, 1999, pp. 22-24).
Results based on so few responses are hard to defend, because the people
who did not respond are probably not a random subset: they may well be
those who disagree with the premise of the study.
1.2.2 Incomplete observability
Incomplete observability comes in two types: censored data and truncated data.
Censored data:
Censored data points are those whose measured properties are not known
precisely, but are known to lie above or below some limiting sensitivity.
For example, suppose a study is conducted to measure the impact of a
drug on mortality. In such a study, it may be known that an individual's
age at death is at least 75 years. Such a situation could occur if the
individual disenrolled from the study at age 75, or if the individual is
currently alive at the age of 75. Censoring also occurs when a value
occurs outside the range of a measuring instrument. For example, a
bathroom scale might only measure up to 300 lbs. If a 350 lb individual is
weighed using the scale, the observer would only know that the
individual's weight is at least 300 lbs.
Truncated data:
Truncated data points are those which are missing from the sample
altogether due to sensitivity limits. For example, if an experiment were to
be conducted to count the distribution of sizes of fish in a lake, a net
might be used to catch a representative sample of fish. If the net had a
mesh size of 1 cm, then no fish narrower than 1 cm wide would be found
in the sample. This is a result of the method of selection: there is no way
of knowing whether there are any fish smaller than 1 cm based on an
experiment using that net.
Censoring and truncation:
As the examples above show, censoring occurs when an observation is
incomplete due to some random cause. The cause of the censoring must
be independent of the event of interest if we are to use standard methods
of analysis. Truncation is a variant of censoring that occurs when the
incomplete nature of the observation is due to a systematic selection
process inherent in the study design.
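The difference between the two mechanisms can be made concrete with a minimal Python sketch. The weights below are made-up numbers, and the 300 lb limit is borrowed from the bathroom-scale example; the same limit is applied in both functions only to keep the comparison simple.

```python
# Illustrative weights in lbs (made-up values for this sketch).
weights = [120, 180, 250, 310, 350]
LIMIT = 300  # upper limit of the bathroom scale in the example above


def censor_right(values, limit):
    """Keep every observation, but record values above the limit as the limit."""
    return [min(v, limit) for v in values]


def truncate_right(values, limit):
    """Drop observations above the limit; they never enter the sample."""
    return [v for v in values if v <= limit]


censored = censor_right(weights, LIMIT)
truncated = truncate_right(weights, LIMIT)
print(censored)   # five observations, the two heaviest recorded as 300
print(truncated)  # three observations, the two heaviest are missing
```

With censoring the sample size is preserved and we know that two people weigh at least 300 lbs; with truncation we cannot even tell that those two people exist.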
1.2.3 Self-selection
Self-selection is a term used to indicate any situation in which individuals
select themselves into a group, causing a biased sample. It is commonly used to
describe situations where the characteristics that lead people to select
themselves into the group create abnormal or undesirable
conditions in the group. Self-selection is a major problem in research in
sociology, psychology, economics and many other social sciences.
Self-selection makes it difficult to determine causation. For example, one
might note significantly higher test scores among those who participate in
a test preparation course, and credit the course for the difference.
However, due to self-selection, there are a number of differences between
the people who chose to take the course and those who chose not to.
Arguably, those who chose to take the course might have been more
hard-working, studious, and dedicated than those who did not, and that
difference in dedication may have affected the test scores between the
two groups. If that was the case, then it is not meaningful to simply
compare the two sets of scores. Due to self-selection, there were other
factors affecting the scores than merely the course itself.
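The test preparation example can be illustrated with a small simulation. This is only a sketch under assumed numbers: the true course effect (2 points), the weight of dedication in the score, and the enrolment rule are all hypothetical choices, not values from any study.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is reproducible
TRUE_EFFECT = 2.0        # hypothetical score gain caused by the course

course_scores, other_scores = [], []
for _ in range(50000):
    dedication = rng.gauss(0, 1)                      # unobserved trait
    takes_course = dedication + rng.gauss(0, 1) > 0   # dedicated students enrol more often
    score = 60 + 5 * dedication + rng.gauss(0, 3)     # dedication also raises the score
    if takes_course:
        course_scores.append(score + TRUE_EFFECT)
    else:
        other_scores.append(score)

# Naive comparison of the two self-selected groups.
naive_difference = (sum(course_scores) / len(course_scores)
                    - sum(other_scores) / len(other_scores))
print(f"true effect: {TRUE_EFFECT}, naive difference: {naive_difference:.2f}")
```

The naive difference is several points larger than the effect built into the simulation, because it also picks up the difference in dedication between the two groups.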
Self-selection causes problems for research about programs or products.
In particular, self-selection makes it difficult to evaluate programs, to
determine whether the program has some effect, and makes it difficult to
do market research.
1.3 The main selection bias in management
In most observational studies in management, selection bias is usually
due to the last two reasons (incomplete observability and self-selection).
Because the objects of study are more specific, their responses are easier
to collect, so selective non-response matters less.
We therefore focus on selection bias in two main flavors: (1) self-selection
of individuals to participate in an activity or survey, or as subjects in an
experimental study; (2) selection of samples or studies by researchers to
support a particular conclusion, especially through censored data.
The next sections describe selection bias due to self-selection and
censored data in management research in more detail, and introduce the
two basic models used to correct it: the Tobit model and the Heckman
model. The former is the basis of the latter and corrects selection bias
arising from censored data; the latter is more widely applicable and is the
mainstream model for correcting selection bias.
Solutions:
The study of selection bias started in 1958 with Tobin's work on household
expenditures on durable and luxury goods. Since the publication of
Heckman's article in 1979, which has gathered some 7,300 citations and
contributed to his Nobel Prize, thousands of articles have appeared on the
subject. While searching through these articles we identified different
categories and trends for identifying, controlling or removing selection bias.
The primary methods were mainly mathematical and econometric models.
Thanks to the large amount of research and attention it has received,
this area is now almost mature, and it remains the main source of
progress on selection bias. However, in many fields there have been
efforts to simplify the mathematical models and to provide insight for
researchers in the human sciences who are not necessarily mathematicians.
Emerging experimental and quasi-experimental methods, such as the
propensity score method, are the result of these efforts.
In this paper we first present the mathematical background. We then
turn to experimental methods. Finally we provide our insights and
solutions for dealing with selection bias. The appendix gives some
information about the software and toolkits that can be used for
selection bias.
We have tried to avoid complicated mathematical equations. We would
even claim to have been more successful in this than some articles
with titles like “intuition about selection bias”.
Nonetheless, this area is still very dependent on mathematics,
and it is not easy to convey the concepts without understanding their
mathematical basis. For this reason, in the second chapter we explain
the mathematical basis of selection bias to some degree.
Although simple and introductory, it provides a foundation for those who
want to follow this field through its equations, and it gives useful
insight to those who only want to understand selection bias
qualitatively. Our last chapter is entirely qualitative and discusses
experiments and qualitative approaches.
Finally, we provide a bibliography covering the different subjects in
selection bias.
Econometric Basis of Selection Bias:
A: Tobit Model:
As discussed before, a common occurrence in many regression
models is truncation or censoring of the response
variable. Tobin (1958) pioneered the study of such models in economics,
analyzing household expenditures on durable goods while taking into
account the fact that expenditures cannot be negative. That is, for some
observations the observed response is not the actual response, but rather
the censoring value (often zero) together with an indicator that censoring
has occurred. More specifically, the so-called Type I Tobit model can be
written as a combination of two familiar models. The first is a
Probit model, which determines whether the variable $y_i$ is zero or
positive; the second is a Truncated Regression model for the
positive values of $y_i$. The Type I Tobit model assumes that the
parameters governing the effect of the explanatory variables on the
probability that an observation is censored are the same as those
governing their effect on the conditional mean of the non-censored
observations.
This section gives an overview of the Tobit model (by which we mean the
Type I Tobit model unless otherwise specified), then presents an
application of the model and comments on the results. The last part
discusses the limitations of the Tobit model.
I. Overview
1. Truncation and Censoring
a. Truncation
Truncation occurs when some observations on both the dependent
variable and the regressors are lost. For example, income may be the
dependent variable and only low-income people are included in the
sample. In effect, truncation occurs when the sample data are drawn from a
subset of a larger population.
Examples:
Let $y_i$ be the profit of the i-th firm as a percentage of assets and $x_i$
the four-firm concentration ratio of the industry the firm is in. Suppose
only firms with positive profit rates are observed and firms with
negative profit rates are not. In this case $a = 0$ and the dependent
variable is left truncated. In general, $(y_i, x_i)$ is observed only
when $y_i > a$ (left truncation), or when $y_i < b$ (right truncation),
or when $c < y_i < d$ (double truncation).
A second example: objects of a certain type in a specific region of
the sky will not be detected by the instrument if their apparent
luminosity is below a certain lower limit. This often
happens due to instrumental limitations or due to our position in
the universe.
For instance, suppose that the data concern the purchases of new
cars, with yi the price of the car and xi characteristics of the buyer
like age and income class. Then no observations on yi can be
below the price of the cheapest new car. Some households may
want to buy a new car but find it too expensive, in which case they
do not purchase a new car and are not part of the observed data.
This truncation effect should be taken into account, for instance, if
one wants to predict the potential sales of a cheaper new type of
car, because most potential buyers will not be part of the observed
sample.
This figure shows an example of a truncated normal density with
truncation from below (at x = -1).
b. Censoring
Censoring occurs when data on the dependent variable are lost (or limited)
but data on the regressors are not. Sources or events can be detected, but
their values (measurements) are not known completely; for example, we only
know that the value is less than some number. All $(y_i, x_i)$ are observed,
but when $y_i$ passes the censoring point it is recorded as the censoring
point. As in the truncation model, there can be left-censoring,
right-censoring, or both upper and lower censoring. For discussion purposes,
consider the case where $y_i \ge 0$, with contributions to a charity
as an example: some people give to the designated charity and some people
do not.
Examples:
People of all income levels may be included in the
sample, but for some reason the income of high-income people
may be top-coded as, say, $100,000. Censoring is a defect in the
sample: if there were no censoring, the data would be a
representative sample from the population of interest. Truncation
entails a greater loss of information than censoring. Long (1997,
p. 188) provides a nice picture of truncation and censoring.
This figure shows an example of a censored normal density
with censoring from below (at x = 0).
N.B.: The main difference between censoring and truncation is that
a censored object is still detected, whereas a truncated object is not
detected at all.
2. Tobit model for censored data
The dependent variable is called censored when the response cannot take
values below (left censored) or above (right censored) a certain threshold
value. For instance, in the example on investments in a new financial
product, the investments are either zero or positive. And, in deciding
about a new car, one has either to pay the cost of the cheapest car or
abstain from buying a new car. The so-called Tobit model relates the
observed outcomes $y_i \ge 0$ to an index function $y_i^*$, with
$\varepsilon_i$ standard normal:
$$y_i^* = x_i'\beta + \sigma\varepsilon_i, \qquad y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$
The Tobit model for censored data is sometimes called the Tobit type 1
model, to distinguish it from the Tobit type 2 model that will be discussed
in the next section for data with selection effects. In contrast with a
truncated sample, where only the responses with $y_i^* > 0$ are observed, it is
now assumed that the responses $y_i = 0$ corresponding to $y_i^* \le 0$ are also
observed and that the values of $x_i$ for such observations are also known.
In practice these zero-responses are of interest, as they provide relevant
information on economic behaviour. For instance, it is of interest to know
which individuals decided not to invest (as other financial products could
be developed for this group) or which individuals did not buy a new car
(as one could design other cars that appeal more to this group). The Tobit
model can be seen as a variation of the Probit model, with one discrete
option ('failure', $y_i = 0$) and where the option 'success' is replaced by the
continuous variable $y_i > 0$.
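To make the combination of a discrete part and a continuous part explicit, the log-likelihood of a Tobit model censored from below at zero can be sketched as follows. This is a minimal illustration, assuming a single regressor and a latent index y* = beta0 + beta1 * x + sigma * eps with standard normal eps; the function name `tobit_loglik` and the data are hypothetical.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tobit_loglik(beta0, beta1, sigma, xs, ys):
    """Log-likelihood of a Type 1 Tobit model censored from below at zero."""
    total = 0.0
    for x, y in zip(xs, ys):
        index = beta0 + beta1 * x
        if y > 0:
            # Uncensored observation: normal density of y around the index.
            z = (y - index) / sigma
            total += log(STD_NORMAL.pdf(z) / sigma)
        else:
            # Censored observation: probability that the latent value is <= 0.
            total += log(STD_NORMAL.cdf(-index / sigma))
    return total


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.5, 3.2]   # made-up data; the zeros are censored responses
print(tobit_loglik(-1.0, 1.2, 1.0, xs, ys))
```

Maximum likelihood estimation searches for the parameter values that maximize this function; the zero responses contribute through the probability term, which is exactly the information a truncated sample throws away.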
II. Application of Tobit model for censored data
As introduced in section 2 (Tobit model for censored data), we consider
direct marketing data concerning a new financial product: of the
925 customers, 470 responded to the mailing by investing in the new
product. The truncated sample consists of these 470 customers only.
Here we do not restrict attention to the customers of the bank who
decided to invest, as in the truncated case, because we also know the
individual characteristics of the customers who decided not to invest.
We therefore construct a Tobit model for the amount of money invested,
using the censored sample of all 925 customers.
We discuss 1) the data, 2) the ML (maximum likelihood)
estimates of the Tobit model, and 3) a comparison
with the results obtained from the truncated sample approach rather
than the censored approach.
1) The data
We consider data that were collected in a marketing campaign for a new
financial product of a commercial investment firm (Robeco). The
campaign consisted of a direct mailing to customers of the firm. The firm
is interested in identifying characteristics that might explain which
customers are interested in the new product and which ones are not. In
particular, there may be differences between male and female customers
and between active and inactive customers (where active means that the
customer already invests in other products of the firm). The age of
customers may also be of importance, as relatively young and relatively old
customers may have less interest in investing in this product than
middle-aged people.
The data set consists of 925 individuals, of whom 470 responded by
making an investment in the product and 455 did not respond. For
individuals who responded, the amount of money invested in this product
is known. The explanatory variables (gender, activity, age) are known for
all 925 individuals, hence also for the individuals that did not invest in the
product. So the dependent variable is censored, not truncated. As before,
we take as dependent variable $y_i = \log(1 + \text{invest})$, where 'invest' is the
amount of money invested. For individuals who did not invest (so that
'invest' is zero), we get $y_i = 0$.
2) The ML (maximum likelihood) estimates of the Tobit model
The Tobit estimates (ML in the censored regression model) are in Panel 2
of the figures of results. For comparison, the same table also contains the
OLS (ordinary least squares) estimates that are obtained if the censoring is
erroneously neglected (see Panel 1). The Tobit multipliers in Panel 2 are
somewhat larger than the OLS multipliers. The variables 'gender' and
'activity' have a positive effect on the amount of money invested, and age
has a parabolic effect, with a maximum at an age of around 53 years
(namely, where 0.196 - 2 * 0.185 * (age/100) = 0).
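The age at which the estimated investment is highest follows directly from the first-order condition quoted in the text. The short computation below only reuses the two coefficients from that condition; the scaling of age by 100 is taken from the formula as written and is otherwise an assumption about how the age variable enters the model.

```python
LINEAR_COEF = 0.196     # linear term in the first-order condition from the text
QUADRATIC_COEF = 0.185  # magnitude of the (negative) quadratic coefficient

# First-order condition: 0.196 - 2 * 0.185 * (age / 100) = 0
age_at_maximum = 100 * LINEAR_COEF / (2 * QUADRATIC_COEF)
print(round(age_at_maximum, 1))  # → 53.0
```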
3) Comparison of Tobit estimates with results for truncated sample
We now compare the results of the Tobit model in Panel 2 of the figures of
results with the results obtained for the truncated sample using a
truncated model (see Panel 3). In the Tobit estimates, the effect of
'activity' has the expected positive sign (instead of negative) and the
maximum investments are around an age of 53 (instead of 62). Further,
the Tobit estimates indicate higher investments by males as compared to
females, whereas the reverse effect was estimated in the truncated
sample. As the information on individuals who do not invest is of
importance in describing the general investment behaviour, the results
obtained for the censored sample are more reliable than the ones for the
truncated sample. This illustrates the general point that it is always
advisable to include relevant information in the model. The truncated
model neglects the information on non-investing customers, and this
makes it much less informative than the Tobit model for the
censored data.
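Why the treatment of the zero responses matters can be seen in a small simulation. The sketch below does not use the investment data; it assumes a hypothetical latent model y* = x + e with standard normal x and e, and compares the ordinary least squares slope on the latent data, on the censored data, and on the truncated sample.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
TRUE_SLOPE = 1.0  # hypothetical slope of the latent relation

xs = [rng.gauss(0, 1) for _ in range(20000)]
latent = [TRUE_SLOPE * x + rng.gauss(0, 1) for x in xs]

censored = [max(0.0, y) for y in latent]               # zeros kept in the sample
kept = [(x, y) for x, y in zip(xs, latent) if y > 0]   # zeros dropped entirely

slope_latent = ols_slope(xs, latent)
slope_censored = ols_slope(xs, censored)
slope_truncated = ols_slope([x for x, _ in kept], [y for _, y in kept])
print(slope_latent, slope_censored, slope_truncated)
```

In this setup least squares recovers the slope only on the latent data; applied naively to the censored or the truncated sample it is biased towards zero, which is the reason for using a likelihood that models the censoring explicitly.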
4) Figures of results
III. Mathematical definition of Tobit model types
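Amemiya (1984) classified Tobit models into five types based on the
characteristics of the likelihood function. For notational convenience, let
P denote a distribution or density function, and assume that the latent
variable $y_{ji}^*$ is normally distributed with mean $x_{ji}'\beta_j$ and
variance $\sigma_j^2$.
Type 1 Tobit
The Type 1 Tobit model, discussed above for censored data, is defined as
$$y_{1i}^* = x_{1i}'\beta_1 + u_{1i}, \qquad y_{1i} = \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases}$$
Type 2 Tobit
The Type 2 Tobit model is defined as:
$$y_{1i}^* = x_{1i}'\beta_1 + u_{1i}, \qquad y_{2i}^* = x_{2i}'\beta_2 + u_{2i}$$
$$y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases}$$
where $(u_{1i}, u_{2i})' \sim N(0, \Sigma)$.
Type 3 Tobit
The Type 3 Tobit model differs from the Type 2 Tobit in that $y_{1i}^*$ of
the Type 3 Tobit is observed when $y_{1i}^* > 0$:
$$y_{1i} = \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases}$$
where $(u_{1i}, u_{2i})'$ is i.i.d. $N(0, \Sigma)$; i.i.d. means independent
and identically distributed.
Type 4 Tobit
The Type 4 Tobit model consists of three equations, $y_{ji}^* = x_{ji}'\beta_j + u_{ji}$ for $j = 1, 2, 3$, with
$$y_{1i} = \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{3i} = \begin{cases} y_{3i}^* & \text{if } y_{1i}^* \le 0 \\ 0 & \text{if } y_{1i}^* > 0 \end{cases}$$
where $(u_{1i}, u_{2i}, u_{3i})'$ is i.i.d. $N(0, \Sigma)$.
Type 5 Tobit
The Type 5 Tobit model is defined as
$$y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{3i} = \begin{cases} y_{3i}^* & \text{if } y_{1i}^* \le 0 \\ 0 & \text{if } y_{1i}^* > 0 \end{cases}$$
where $(u_{1i}, u_{2i}, u_{3i})'$ are from an i.i.d. trivariate normal distribution.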
IV. Overview on trend corrections of Tobit model
The Tobit model was introduced by James Tobin in 1958 in order to
model a specific type of discrete-continuous data commonly found in
economic applications. The Tobit model is a specific case of a censored
regression model and assumes that the continuous component of the data
(the right tail) is normally distributed. Early examples included modeling
household expenditures on luxury goods, inheritance, and expected age of
retirement. It is now used wherever selection bias is possible.
However, later research has demonstrated that even small
departures from the underlying normality assumption may lead to
inconsistent estimators. Arabmazar and Schmidt (1982) explored the
robustness of the Tobit estimator of a population mean
when the assumption of normality is violated. They concluded that the
bias can be quite large and that it depends on the proportion of
censoring. One technique often used in an attempt to compensate
for this weakness in the case of long-tailed distributions is to apply a log
transformation to the data. Lorimer and Kiermeier (2007) conducted a
simulation study to examine the use of Tobit models on log-transformed
microbiological data. They compared the Tobit method to two other
methods: using only the uncensored observations, and using the limit of
detection for the censored values. They concluded that the two standard
methods led to biased estimates and that the Tobit model led to less
biased estimates. However, their conclusions rest on the underlying
assumption of normality. Others have pointed out that the Tobit model
usually leads to nonlinear and complicated equations that cannot be
solved analytically.
Therefore, for each variant of these models, whose solutions often
become nonlinear and complicated, researchers try to use simple,
heuristic models tailored to the case at hand.
It is important to note that Tobit models are the basic models for
identifying selection bias, and that later models such as the Heckman
model are special cases of the Tobit family. This is why there is still
extensive research in this field and why many new trends have emerged.
In addition, a large body of work tries to develop a classical solution for
each of these trends.
The Relation between Tobit model and Heckman model:
In a particular case the Tobit model reduces to a Probit model, which is the
first stage of the Heckman model; hence the first stage of the Heckman
model can be simulated in the same way as the Tobit model. The Tobit
model was introduced 23 years after the Probit model: the latter was
introduced by Chester Ittner Bliss in 1935, and a fast method for solving
the model was introduced by Ronald Fisher in an appendix to the same
article. Because the response is a series of binomial results, the likelihood
is often assumed to follow the binomial distribution. Let Y be a binary
outcome variable, and let X be a vector of regressors. The probit model
assumes that
$$\Pr(Y = 1 \mid X = x) = \Phi(x'\beta)$$
where Φ is the cumulative distribution function of the standard normal
distribution. The parameters β are typically estimated by maximum
likelihood.
While easily motivated without it, the probit model can be generated by a
simple latent variable model. Suppose that
$$Y^* = x'\beta + \varepsilon$$
where $\varepsilon \sim N(0, 1)$, and suppose that Y is an indicator for whether the
latent variable Y* is positive:
$$Y = \begin{cases} 1 & \text{if } Y^* > 0 \\ 0 & \text{otherwise} \end{cases}$$
The next part shows in detail how this is the foundation of the Heckman model.
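The link between the latent variable formulation and the probit probability can be checked numerically. In the sketch below the coefficients BETA0 and BETA1 and the value of x are hypothetical; the code draws the latent variable Y* = BETA0 + BETA1 * x + eps many times and compares the share of positive draws with the standard normal cumulative distribution function evaluated at the index.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf        # standard normal cumulative distribution function
BETA0, BETA1 = -0.5, 1.0      # hypothetical probit coefficients
rng = random.Random(1)


def draw_outcome(x):
    """Draw Y = 1 if the latent variable Y* = BETA0 + BETA1 * x + eps is positive."""
    latent = BETA0 + BETA1 * x + rng.gauss(0, 1)
    return 1 if latent > 0 else 0


x = 1.0
n_draws = 50000
simulated_share = sum(draw_outcome(x) for _ in range(n_draws)) / n_draws
probit_probability = PHI(BETA0 + BETA1 * x)
print(simulated_share, probit_probability)
```

The two numbers agree up to simulation noise, which is all the latent variable interpretation claims: the probit probability is the probability that a normally distributed index exceeds zero.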
V. Some References
Amemiya, Takeshi (1973). "Regression analysis when the dependent
variable is truncated normal". Econometrica 41 (6), 997–1016.
Amemiya, Takeshi (1984). "Tobit models: A survey". Journal of
Econometrics 24 (1-2), 3–61.
Amemiya, Takeshi (1985). Advanced Econometrics. Basil Blackwell,
Oxford.
Schnedler, Wendelin (2005). "Likelihood estimation for censored random
vectors". Econometric Reviews 24 (2), 195–217.
Tobin, James (1958). "Estimation of relationships for limited
dependent variables". Econometrica 26 (1), 24–36.
Heij, C., de Boer, P., Franses, P. H., Kloek, T., and van Dijk, H. K. (2004).
Econometric Methods with Applications in Business and Economics.
Oxford University Press.
Heckman Model:
We saw how the individuals or data units being investigated may select
themselves into the sample. Here we review one example and model
it with Heckman's basic model, which is an econometric model.
Suppose we observe that college grades are uncorrelated with success in
graduate school. Can we infer that college grades are irrelevant? Of
course not: unmeasured variables (e.g. motivation) used in the
admissions process might explain why those who
enter graduate school with low grades do as well as those who enter
graduate school with high grades.
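Formulating the problem:
$$Y_{1i} = X_{1i}\beta_1 + U_{1i}$$
$$Y_{2i} = X_{2i}\beta_2 + U_{2i}$$
where $Y_{1i}$ is observed only if $Y_{2i} > 0$, and the errors $(U_{1i}, U_{2i})$
are jointly normal with variances $\sigma_{11}$, $\sigma_{22}$ and covariance $\sigma_{12}$.
Solving problem
If we continue and solve this model (Heckman, 1979) we will have:
$$E(Y_{1i} \mid X_{1i}, Y_{2i} > 0) = X_{1i}\beta_1 + \frac{\sigma_{12}}{\sqrt{\sigma_{22}}}\lambda_i, \qquad \lambda_i = \frac{\phi(Z_i)}{1 - \Phi(Z_i)}, \qquad Z_i = -\frac{X_{2i}\beta_2}{\sqrt{\sigma_{22}}}$$
$$Y_{1i} = X_{1i}\beta_1 + \frac{\sigma_{12}}{\sqrt{\sigma_{22}}}\lambda_i + V_{1i}$$
Here $\lambda_i$ (the inverse Mills ratio) will be our new regressor and $V_{1i}$
the new error term. In our example, the unmeasured factor behind the
selection is motivation.
Y1 = the result of students at university
X1 = undergraduate scores
Y2 = admission to graduate school
X2 = all the factors that determine the admission result
I = number of observations
To be present in the regression sample, a student must be
admitted, i.e. Y2 > 0.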
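The role of the correction term can be illustrated numerically. The sketch below assumes the standard result from Heckman (1979) that, for jointly normal errors with unit variances, the mean of the outcome error among selected units equals the covariance times the inverse Mills ratio φ(Z) / (1 − Φ(Z)) with Z = −X2β2. The correlation RHO and the selection index INDEX are hypothetical values chosen for the illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
RHO = 0.6     # hypothetical correlation between the errors U1 and U2
INDEX = 0.5   # hypothetical value of X2 * beta2 (both error variances set to 1)
rng = random.Random(2)

selected_u1 = []
for _ in range(200000):
    u2 = rng.gauss(0, 1)
    u1 = RHO * u2 + (1 - RHO ** 2) ** 0.5 * rng.gauss(0, 1)  # corr(u1, u2) = RHO
    if INDEX + u2 > 0:            # Y2 > 0: the unit enters the regression sample
        selected_u1.append(u1)

mean_u1_selected = sum(selected_u1) / len(selected_u1)

z = -INDEX
inverse_mills = STD_NORMAL.pdf(z) / (1 - STD_NORMAL.cdf(z))
predicted_mean = RHO * inverse_mills   # covariance times inverse Mills ratio
print(mean_u1_selected, predicted_mean)
```

In the selected sample the outcome error no longer has mean zero, which is why a regression of Y1 on X1 alone is biased; adding the inverse Mills ratio as a regressor absorbs this non-zero mean.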
References:
Heckman, James J. (1979). "Sample Selection Bias as a
Specification Error". Econometrica 47 (1), 153–161.
Greene, W. H. Econometric Analysis, third edition.
Graphical explanation of Heckman Model:
As discussed, selection bias arises when only the units that satisfy a
prior condition are observed. This can be shown graphically as below:
we can only see the part in which r is below 0, while the units whose r is
not represented in our survey can show different characteristics.
The example we study here is an application of the Heckman
model in insurance. We apply the same probabilistic conditioning, but
since the outcome is binary we do not use regressions and instead
solve the problem with probability theory.
Auditing policies derived from statistical analysis and applied to insurance
claims face a major selection bias problem. Most insurance companies are
reluctant to carry out a random auditing policy, because the long-term
influence of an audit decision on the policyholder's value for the
company is negative. Indeed, an honest policyholder may take the audit
process amiss; his loyalty to the company may decrease as a
consequence, and with it his value for the insurer. Hence companies are
deterred from performing a systematic audit on part of their claims
database. In practice they audit only suspicious files. This creates
a selection bias: we select those who are suspicious. In order to
identify fraudulent insurance claims at an early stage, most systems score a
new incoming claim using a set of fraud indicators. If the score is high
enough, then the claim is audited and fraud (or abuse) may be confirmed.
Only claims with a high suspicion level (or score) are selected for
investigation.
Let us now formalize the selection bias issue. If all incoming claims can
be considered for audit, it can be formalized in the
following way: A and F denote the binary variables related to audit and
fraud, and x is the vector of variables which describe the claim. A
statistical model assessing fraud risk is derived from the audited claims
and estimates probabilities of the type P(F = 1 | A = 1, x) = E(F | A = 1, x).
An audit policy induced by this model is then applied to the incoming
claims, where the relevant probabilities are P(F = 1 | x) = E(F | x).
Selection bias is a consequence of the confusion between these two
probabilities, one conditional on audit and the other not.
Here S represents the suspected files. In normal auditing, the
number of audited files (A) is usually smaller than S, because we do not
audit every file and only evaluate suspected files, where S > 0.
Random auditing of claims is the basic strategy which makes it possible
to counteract selection bias. A pure random auditing strategy consists in
picking claims at random and then auditing them. This controlled
experiment eliminates the selection induced by the audit decision.
Estimating a single fraud equation on this sample provides an estimated
fraud probability for incoming claims which is not subject to selection
bias. Here we calculate the parameters for both the random and the usual
nonrandom samples, so that the difference between the two cases can be
compared.
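The gap between P(F = 1 | A = 1, x) and P(F = 1 | x) can be reproduced in a small simulation. Everything in the sketch is an assumption made for illustration: a single binary fraud indicator x, a suspicion signal u seen by the auditor but not recorded in x, the fraud probabilities, and the two audit rules. It does not use real claims data.

```python
import random

rng = random.Random(3)

claims = []
for _ in range(100000):
    x = 1 if rng.random() < 0.5 else 0        # observed fraud indicator
    u = rng.random()                          # auditor's suspicion, not recorded in x
    fraud = 1 if rng.random() < 0.05 + 0.10 * x + 0.30 * u else 0
    targeted = 0.5 * x + u > 0.9              # usual policy: audit suspicious files only
    randomly = rng.random() < 0.10            # pure random audit of about 10% of claims
    claims.append((x, fraud, targeted, randomly))


def fraud_rate(rows):
    """Share of fraudulent claims among the given rows."""
    rows = list(rows)
    return sum(row[1] for row in rows) / len(rows)


flagged = [c for c in claims if c[0] == 1]                # claims with x = 1
rate_all = fraud_rate(flagged)                            # P(F = 1 | x = 1)
rate_targeted = fraud_rate(c for c in flagged if c[2])    # P(F = 1 | A = 1, x = 1), targeted audit
rate_random = fraud_rate(c for c in flagged if c[3])      # P(F = 1 | A = 1, x = 1), random audit
print(rate_all, rate_targeted, rate_random)
```

Under the targeted policy the fraud rate among audited claims overstates the fraud rate of incoming claims with the same x, whereas the randomly audited sample reproduces it up to sampling noise.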
Experimental Models:
Application of Heckman Model:
Since its introduction, the Heckman model has been widely adopted in
statistical testing. While searching for an example of the Heckman model,
we observed that it has become a common and necessary test in
economic applications. In other words, performing a Heckman test is no
longer an advantage for an article, but omitting it counts as a weakness of
the research. Most studies use the Heckman test as a criterion for
showing the validity of their results. Sometimes it is used together with
other methods to show the possible range of answers, as in the example below:
We want to measure the components of productivity in the banking
sector:
ln Y = B0 + B1 M-O + B2 M-O^2 + B3 Sprfc + B4 Année + B_RI RI + e
where Y is the measure of output, M-O is the measure of labour,
Sprfc that of capital, Année is a binary variable distinguishing 1995 from
1996, and RI is the binary indicator of the presence of profit-sharing incentives.
In this article the authors go on to estimate the coefficients with the
Heckman method and the instrumental variable method in order to remove
selection bias, as below:
As the results show, the three methods do not give the same estimates, and
the authors go on to reason about the correct range of answers (1).
Here the authors clearly do not regard Heckman's method as the final
methodology. It is only regarded as a means of approaching the problem
from another angle. In fact, real problems are often more complex than
what we saw at the beginning with Heckman's original article. There are
fields in which Heckman's model has proved complicated and of little use,
and other fields in which research has established it as a classical and
reliable solution to the particular kind of problem that arises there. As
the next chapter shows, other studies have tried to exploit quantitative
research to provide practical hints for researchers using qualitative
methods.
(1) Simon Drolet, Paul Lanoie, Bruce Shearer, Analyse de l'impact
productif des pratiques de rémunération incitative pour une entreprise de
services : Application à une coopérative financière québécoise, 1999,
CIRANO.
Experimental Methods:
We saw the Tobit and Heckman methods, which are based on a purely
mathematical approach. However, after these original articles, selection
bias came to the attention of different fields in social science. At the
beginning, this work was exclusive to econometricians, who tried to
provide insight for other disciplines; later, other branches of the
human sciences, though lacking the econometricians' command of
mathematics, took up the problem and facilitated the
introduction of experimental methods.
As we saw in previous sections, the standard econometric method for
evaluating social programs uses the outcomes of nonparticipants to
estimate what participants would have experienced had they not
participated. The difference between participant and nonparticipant
outcomes is the estimated gross impact of a program reported in many
evaluations. The outcomes of nonparticipants may differ systematically
from what the outcomes of participants would have been without the
program, producing selection bias in estimated impacts. A variety of
non-experimental estimators adjust for this selection bias under
different assumptions. Under certain conditions, randomized social
experiments eliminate this bias (Heckman, 1997).
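The argument above can be written as a short decomposition. The notation
is an assumption introduced here for illustration and does not come from
the cited paper: D = 1 marks participants, Y_1 is the outcome with the
program and Y_0 the outcome without it.

$$
\underbrace{E[Y_1 \mid D=1] - E[Y_0 \mid D=0]}_{\text{participant minus nonparticipant outcomes}}
= \underbrace{E[Y_1 - Y_0 \mid D=1]}_{\text{impact on participants}}
+ \underbrace{E[Y_0 \mid D=1] - E[Y_0 \mid D=0]}_{\text{selection bias}}
$$

The last term is zero when participation D is independent of Y_0, which
is what randomization is meant to achieve.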
Here, as one of the most important experimental models, we study the
propensity score model and continue our discussion.
Propensity score matching
Introduction:
The probability of selection into a treatment, also called the propensity
score, plays a central role in classical selection models and in matching
models (see, e.g., Heckman, 1980; Heckman and Navarro, 2004;
Heckman and Vytlacil, 2007; Hirano et al., 2003; Rosenbaum and Rubin,
1983). Heckman and Robb (1986, reprinted 2000), Heckman and Navarro
(2004) and Heckman and Vytlacil (2007) show how the propensity score
is used differently in matching and selection models. They also show
that, given a consistent estimate of the propensity score, both matching
and classical selection methods are robust to choice-based sampling,
which occurs when treatment group members are over- or under-represented
relative to their frequency in the population. The robustness holds
because both methods are defined conditional on treatment and comparison
group status. Choice-based sampling designs are frequently chosen in
evaluation studies to reduce the costs of data collection and to obtain
more observations on treated individuals.
Hence, in statistics, propensity score matching (PSM) is one of the
quasi-empirical “correction strategies” that correct for selection bias
when making estimates.
Generally, PSM is intended for causal inference with simple selection
bias in non-experimental settings in which: (i) few units in the
non-experimental comparison group are comparable to the treatment units;
and (ii) selecting a subset of comparison units similar to the treatment
units is difficult because units must be compared across a
high-dimensional set of pretreatment characteristics.
In other words, the propensity score method is usually applied in two
situations:
Selection bias: potential bias from treatment assignment/selection
conditional on observed variables, due to the effects of unobserved
variables, controlled with selection into treatment.
Finite data: a limited sample size reduces our ability to estimate causal
effects by conditioning on observed variables.
However, PSM only adjusts for selection bias (it does not fully solve the
problem), and it only reduces the limitation that arises from matching on
many observed variables with finite data.
In this section, we introduce an empirical case to show how PSM works.
Web surveys are the most economical way to conduct social and market
surveys, but selection bias can invalidate the results obtained. The main
source of bias is Internet access coverage. Even in the USA, web surveys
of the elderly population (50 years old or more) carry a strong risk of
biased results (Couper et al., 2007). The digital divide is greater in
Italy, where 89.2% of the population over 50 has no Internet access
(Istat, 2005), which suggests caution in using web surveys there. Since
in many social or market surveys the target population does not coincide
with the general population, it can be interesting to apply methods that
correct web survey results when surveys are conducted on topics related
to niche interests.
The first step is to define the target population clearly on the basis of
the topic under investigation. Only a subset (of unknown size) of this
population has web access; but, using a national survey, it is possible
to quantify and characterize the subset of the target population with
no web access. These data can be used to correct the web survey results.
The second step is to obtain an email list to which the web survey is
submitted.
Following a proposed classification (Romano et al., 2006), email lists
can differ in accuracy and in the reasons for enrollment. There are many
thematic web sites where people must go through authentication (login
and password) to be enrolled. These email lists are certainly an
interesting starting point for carrying out web surveys: interest in the
topic itself favors a good survey result in terms of response rate (among
others, Olson, 2006).
The third step is to apply methods that correct ex post the selection
bias deriving from web access. Among the methods proposed in the
literature, the propensity score technique, originally proposed to
correct selection bias in observational health studies (Rosenbaum and
Rubin, 1983), has recently been used to correct selection bias deriving
from nonprobabilistic sampling (Terhanian et al., 2001) and/or web
surveys (Schonlau et al., 2004).
We discuss preliminary results obtained by applying this technique to
data collected through a web survey of people enrolled on a well-known
enogastronomic (food and wine) web site.
Apart from the survey's goals, in this work we consider a target
population defined as Italian people from 18 to 78 years old and focus on
estimating the proportion of people who went on holiday. For the
application of the propensity score method, the data consist of a
subset of web survey respondents (nw = 4,128) and a subset of the Istat
Multiscopo survey (nm = 37,677). A logistic regression was performed
using the survey indicator as dependent variable (0 = Multiscopo; 1 = web
survey) and, as regressors, gender (2 levels), level of education (3
levels), geographical area (5 levels) and age (6 levels). All variables
were significant with α<0.04; respondents to both surveys were divided
into 5 and then into 10 bins according to their propensity scores. The
propensity weights were applied to the web survey results, as shown in
Table 1.
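The procedure just described (logistic regression on the survey
indicator, binning by propensity score, reweighting the web respondents)
can be sketched as follows. This is a minimal illustration and not the
code of the cited study: the toy data, the single "young" regressor, the
function names and the rule "weight = reference share of the bin divided
by web share of the bin" are all assumptions made here.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(X, s, steps=500, lr=0.5):
    """Fit P(s = 1 | x) by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, si in zip(X, s):
            resid = si - sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j, v in enumerate(xi):
                grad[j] += resid * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def bin_weights(scores_ref, scores_web, n_bins=5):
    """Weight each web respondent by (reference share) / (web share) of its bin."""
    pooled = sorted(scores_ref + scores_web)
    # Bin edges are quantiles of the pooled propensity scores.
    edges = [pooled[len(pooled) * q // n_bins] for q in range(1, n_bins)]

    def which_bin(score):
        return sum(1 for e in edges if e < score)

    ref_counts = [0] * n_bins
    web_counts = [0] * n_bins
    for sc in scores_ref:
        ref_counts[which_bin(sc)] += 1
    for sc in scores_web:
        web_counts[which_bin(sc)] += 1
    weights = []
    for sc in scores_web:
        k = which_bin(sc)
        ref_share = ref_counts[k] / len(scores_ref)
        web_share = web_counts[k] / len(scores_web)
        weights.append(ref_share / web_share)
    return weights

# Invented toy data: regressors are [intercept, young]; 1 = web survey.
ref_x = [[1.0, 0.0]] * 60 + [[1.0, 1.0]] * 40      # reference survey
web_x = [[1.0, 0.0]] * 20 + [[1.0, 1.0]] * 80      # web survey
web_holiday = [1] * 10 + [0] * 10 + [1] * 64 + [0] * 16

beta = fit_logistic(ref_x + web_x, [0] * len(ref_x) + [1] * len(web_x))
scores_ref = [sigmoid(sum(b * v for b, v in zip(beta, x))) for x in ref_x]
scores_web = [sigmoid(sum(b * v for b, v in zip(beta, x))) for x in web_x]
weights = bin_weights(scores_ref, scores_web)

unweighted = sum(web_holiday) / len(web_holiday)
weighted = sum(w * y for w, y in zip(weights, web_holiday)) / sum(weights)
print(unweighted, weighted)
```

In this toy example the web sample over-represents young people, who go
on holiday more often, so the weighted proportion is lower than the
unweighted one. Bins that contain reference units but no web respondents
are simply lost in this sketch, which is one concrete form of the overlap
requirement.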
The results are quite striking considering the digital divide of the
Italian population (only 31.8% has Internet access): the difference
between weighted and unweighted web results is almost 20%. Since this is
an intentionally simple exercise, we want to highlight that more useful
results can be obtained when the target population is tailored to the
specific survey's goals and/or when the digital divide is not so
dramatically high.
Certainly, the application above is just a simple case; here we only
sketch the logic of how the propensity score method corrects selection
bias.
We should also be clear that PSM has several limitations:
Large samples are required; group overlap must be substantial; hidden
bias may remain because matching only controls for observed variables
(to the extent that they are perfectly measured) (Shadish, Cook, &
Campbell, 2002).
Analysis:
We saw the propensity score matching model as a method which, by
implementing a controlled experiment, identifies the mechanism of
selection in our estimation and gives us direction for adjusting our
estimates. However, social experiments are costly, and the identifying
assumptions required to justify them are not always satisfied.
Nonetheless, it is widely held that there is no valid alternative to
experimentation as a method for evaluating social programs (see, e.g.,
Burtless, 1995). There are other methods which combine experimental
and mathematical models to identify and remove selection bias. In an
important paper, LaLonde (1986) combines data from a social experiment
with data from nonexperimental comparison groups to evaluate the
performance of many commonly used nonexperimental estimators. For
the particular group of parametric estimators that he investigates, and
for his particular choices of regressors, he concludes that the
estimators chosen by econometric model selection criteria produce a range
of impact estimates that is unacceptably large.2 Stolzenberg and Relles,
in their article, try to extract intuitive insights for
non-econometricians and qualitative researchers. They state that they
provide mathematical tools to assist intuition about selection bias in
concrete empirical analyses. These new tools do not offer a general
solution to the selection bias problem;3 and we can claim that our work
in this project is much more intuitive for non-econometricians than
theirs. However, their statements gave us direction to correct our own
approach, and judging by how frequently they are cited, we understood
that they started a large wave of work on understanding the limitations
of Heckman models and their solutions. They mention that the Heckman
model does not necessarily improve the solution, and that it can only be
used alongside other examples to discuss the range of validity of the
results, something that we discussed before. Besides, they mention that
Heckman and other mathematical models usually do not provide enough
intuition and direction for correcting answers, and they insist on
implementing different experiments and exploiting the researcher's own
insight for detecting and adjusting for selection bias problems. Heckman
himself, in his later articles, works on papers which identify selection
bias by social experiments and then measure the accuracy of the
experiment with econometric models.4
2 Characterizing Selection Bias Using Experimental Data, James Heckman,
Hidehiko Ichimura, Jeffrey Smith, August 1997
3 STOLZENBERG R. M. and RELLES D. A., Tools for intuition about sample
selection bias and its correction, American Sociological Review, Vol 62,
N°3, 1997
4 Characterizing Selection Bias Using Experimental Data, James Heckman,
Hidehiko Ichimura, Jeffrey Smith, August 1997
Conclusion:
In this paper, we introduced and identified important types of selection
bias. As the most important models we studied the Tobit and probit
models, and identified the Heckman model as a specific case of the Tobit
model. We gave an overview of the different types of Tobit model and
their possible solutions and conditions. In particular, we studied the
Tobit 1 model in detail. We acknowledged that the Tobit model has
conditions which cannot be easily satisfied, and nonlinear equations
which are not always easy to solve. Then we studied the Heckman model as
a special case in selection bias.
Finally, we argued that there is no reliable and comprehensive answer to
selection bias, and that, given the complexity of the relations,
different methods should be used or avoided depending on the case. Some
papers argue that in some circumstances the Heckman model should be
avoided; conversely, in some areas the efficiency of the Heckman model is
so well established that researchers have developed classical solutions
for certain kinds of problems. Besides, today the treatment of selection
bias is so widespread in econometrics that it has become a necessary part
of research. However, there is a growing trend towards using experimental
methods for identifying and correcting selection bias. In this respect we
studied propensity score matching as a way of identifying selection bias
by experiment, in control groups, and of adjusting our data selection and
treatment to remove selection bias.
In addition, we saw how researchers approach the question with
different experimental and econometric methods, especially using
conditional probabilities. They calculate the relevant parameters with
different methods and discuss the different answers and their
credibility. Once they arrive at a generalization for some special case,
researchers identify it as a classical solution and in future use it as a
standard test.
Likewise, for each condition of the probit models, whose solutions often
become nonlinear and complicated, they try to use simple and heuristic
models for each case. From what we have learned through our study in
this project, we can say that if you are a researcher in management and
want to evaluate selection bias, our first advice is to start with a
mathematician, because the tools are still not independent of
mathematics, and econometrics remains the main body of knowledge
creation. The proof of our claim is the high level of mathematics in
papers that intend to give qualitative insights. This is why we saw
workshops and courses in USA and UK universities, such as Oxford, that
were specially developed to teach selection bias to students of the human
sciences. However, controlled experiments can be very useful for
qualitative researchers to identify and adjust for selection bias.
Besides, one shortcut is to find an article, either in econometrics or in
your own discipline, that has already worked on your subject. In other
words, see whether you can find classical solutions to your problems.
APPENDIX:
Software and Toolkits:
Short Toolkit for using Tobit modeling with EasyReg
1) Introduction
Working with the Tobit model through a purely mathematical approach is
very difficult: there are many types of Tobit model and several
hypotheses to consider before reaching a final result. In this section we
offer a short summary of how to use the EasyReg software to estimate a
Tobit model starting from your gathered empirical data.
2) The data
The data have been generated artificially as follows. The independent
variables $X_{1,j}$ and $X_{2,j}$ and the error $U_j$ for
$j = 1, \dots, n = 500$ have been drawn independently from the standard
normal distribution, and $Y_j$ has been generated as:
$Y_j = \max(0, X_{1,j} + X_{2,j} + U_j)$.
Thus, if an intercept is included in the model, so that the vector of
regressors is
$X_j = (X_{1,j}, X_{2,j}, 1)'$,
then the true parameter vector is $\beta = (\beta_1, \beta_2, \beta_3)'$, where
$\beta_1 = 1$
$\beta_2 = 1$
$\beta_3 = 0$
Moreover, the true value of $\sigma$ is
$\sigma = 1$
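A data set of this kind can be regenerated with a few lines of code. The
sketch below assumes standard normal draws for X1, X2 and U; the function
name `simulate_tobit` and the seed are hypothetical, so the numbers will
not reproduce Tobit_Data.TXT itself.

```python
import random

def simulate_tobit(n=500, seed=0):
    """Draw X1, X2, U ~ N(0, 1) independently and set Y = max(0, X1 + X2 + U)."""
    rng = random.Random(seed)
    rows = []
    for j in range(1, n + 1):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        y = max(0.0, x1 + x2 + u)   # censoring at zero
        rows.append((j, y, x1, x2))
    return rows

rows = simulate_tobit()
share_zero = sum(1 for row in rows if row[1] == 0.0) / len(rows)
print(len(rows), share_zero)
```

Because X1 + X2 + U is symmetric around zero, roughly half of the
generated Y values should be censored at 0.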
The data file involved is Tobit_Data.TXT, which looks like this:
Observation  Y            X1            X2            Z
1            0.000000000  -1.463631868  -0.640421391  0
2            0.000000000   0.427667916  -0.219542548  0
which is in the former EasyReg default format. This data file also
contains a variable Z, which I will use and explain later (see the guided
tour on importing data files in space-delimited text format in the
EasyReg software book).
3) How to estimate a Tobit model with EasyReg
Now open "Menu > Single equation models > Tobit models" in the
EasyReg main window, select the variables Y, X1 and X2, and keep the
default intercept, similar to running an OLS regression with intercept,
until you arrive at the following window.
In general there is no need to adjust the stopping rules of the Newton
iteration which is used to maximize the likelihood function. Thus, click
"Tobit analysis". Then after a few seconds the maximum likelihood
estimation results appear; they are listed in section 4 below.
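For readers who want to see what is being maximized, the Tobit
log-likelihood for censoring at zero can be sketched as follows. This is
an illustration of the standard formula, not EasyReg's code; the function
names and the layout of `rows` are assumptions.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def tobit_loglik(beta, sigma, rows):
    """Log-likelihood of y = max(0, b'x + u), u ~ N(0, sigma^2).

    rows is a list of (y, x) pairs, x being the list of regressors
    (including the constant 1 for the intercept).
    """
    total = 0.0
    for y, x in rows:
        xb = sum(b * v for b, v in zip(beta, x))
        if y > 0:
            # Uncensored observation: normal density of the residual.
            z = (y - xb) / sigma
            total += -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z
        else:
            # Censored observation: probability that y* <= 0.
            total += math.log(norm_cdf(-xb / sigma))
    return total

example = [(0.0, [0.0, 0.0, 1.0]), (1.0, [1.0, 0.0, 1.0])]
value = tobit_loglik([1.0, 1.0, 0.0], 1.0, example)
print(value)
```

A Newton-type optimizer then searches over the coefficients and sigma for
the values that make this sum as large as possible.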
If you click "Continue", the module NEXTMENU will be activated:
You have seen this window before after running an OLS regression, so no
further explanation is necessary.
The output is listed below. Note that I have used the option "Wald test of
linear parameter restrictions" to test the joint null hypothesis:
$\beta_1 = 1$
$\beta_2 = 1$
$\beta_3 = 0$
This hypothesis is not rejected, of course, at any reasonable
significance level.
4) The output
Tobit model:
y = y* if y* > 0, y = 0 if y* <= 0, where y* = b'x + u
with x the vector of regressors, b the parameter vector,
and u a N(0,s^2) distributed error term.
Dependent variable:
Y = Y
Characteristics:
Y
First observation = 1
Last observation = 500
Number of usable observations: 500
Minimum value: 0.0000000E+000
Maximum value: 5.4575438E+000
Sample mean: 7.2127526E-001
This variable is nonnegative, with 244 zero values.
A Tobit model is therefore suitable
X variables:
X(1) = X1
X(2) = X2
X(3) = 1
Frequency of Y = 0: 48.80%
(244 out of 500)
Newton iteration succesfully completed after 5 iterations
Last absolute parameter change = 0.0001
Last percentage change of the likelihood = 0.0603
Tobit model: Y = max(Y*,0), with
Y* = b(1)X(1) + b(2)X(2) + b(3)X(3) + u,
where u is distributed N(0,s^2), conditional on the X variables.
Maximum likelihood estimation results:
Variable ML estimates (t-value)
[p-value]
x(1)=X1 b(1)= 1.0547731 (17.0084)
[0.00000]
x(2)=X2 b(2)= 0.9905518 (15.2253)
[0.00000]
x(3)=1 b(3)= -0.0243418 (-0.3450)
[0.73011]
standard error of u s= 1.0635295 (21.9209)
[0.00000]
[The p-values are two-sided and based on the normal approximation]
Log likelihood: -4.74065017126E+002
Pseudo R^2: 0.60984
Sample size (n): 500
Information criteria:
Akaike: 1.912260069
Hannan-Quinn: 1.925490511
Schwarz: 1.945976933
If the model is correctly specified then the maximum likelihood
parameter estimators b(1),..,b(3), minus their true values, times the
square root of the sample size n, are (asymptotically) jointly normally
distributed with zero mean vector and variance matrix:
1.92290870E+00 6.77554263E-01 -9.38221607E-01
5.37455447E-01 2.11638376E+00 -9.79444588E-01
-9.81136382E-01 -1.09217153E+00 2.48931672E+00
Wald test:
x(1)=X1 b(1)= 1.0547731 (17.0084)(*)
x(2)=X2 b(2)= 0.9905518 (15.2253)(*)
x(3)=1 b(3)= -0.0243418 (-0.3450)(*)
(*): Parameters to be tested
Null hypothesis:
1.x(1)+0.x(2)+0.x(3) = 1.
0.x(1)+1.x(2)+0.x(3) = 1.
0.x(1)+0.x(2)+1.x(3) = 0.
Null hypothesis in matrix form: Rb = c, where
R =
1. 0. 0.
0. 1. 0.
0. 0. 1.
and c =
1.
1.
0.
Wald test statistic: 0.98
Asymptotic null distribution: Chi-square(3)
p-value = 0.80630
Significance levels: 10% 5%
Critical values: 6.25 7.81
Conclusions: accept accept
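The p-value reported for the Wald test can be checked by hand. The sketch
below uses the closed-form upper-tail probability of a chi-square
distribution with 3 degrees of freedom; the function name is hypothetical,
and the input 0.98 is the rounded statistic from the listing, so the
result agrees with the reported p-value only to about three decimals.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 d.f."""
    if x <= 0.0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p_value = chi2_sf_3df(0.98)     # Wald statistic from the output above
p_at_10 = chi2_sf_3df(6.25)     # 10% critical value from the output
p_at_5 = chi2_sf_3df(7.81)      # 5% critical value from the output
print(round(p_value, 3), round(p_at_10, 3), round(p_at_5, 3))
```

Since the statistic is far below both critical values, the joint
hypothesis is not rejected, as stated in the listing.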
5) An inappropriate attempt to conduct Tobit analysis
As an example of a case for which EasyReg refuses to conduct Tobit
analysis, select the variables Z, X1, X2 and the constant 1 for the
intercept, and declare Z the dependent variable. Then you will get stuck
here:
The problem is that Z is discrete, because I have generated it as
Z = Int(100*Y)
where the "Int" function truncates its argument to an integer, by cutting
off all the digits after the decimal symbol (a dot "." in the US, a comma
"," in Europe). But the Tobit model assumes that the dependent variable,
here Z, has a continuous distribution conditional on Z > 0 and X1 and X2,
so the assumptions of the Tobit model do not hold. Therefore, in order to
prevent you from doing bad econometrics, EasyReg will not allow you to
continue.
In view of the queries I have gotten about this issue, the message in this
window may not be clear enough. If so, click the "Yes" button, which
opens a PDF file:
6) What to do if the dependent variable Y is confined to a
bounded interval?
a. The case $Y \in (a,b]$
If the observed dependent variable Y is confined to an interval $(a,b]$,
where $-\infty < a < b < \infty$, with $P[Y = b] > 0$, it is possible to
transform Y to a new dependent variable Z, say, such that
$Z \in [0,\infty)$ and $P[Z = 0] = P[Y = b] > 0$, namely
$Z = -\ln[(Y - a)/(b - a)]$. Next, assume that $Z = \max(0, Z^*)$,
where $Z^* = \beta'X + U$. Then
$Y = \min(b, a + (b - a)\exp(-Z^*)) = \min(b, a + (b - a)\exp(-\beta'X - U))$.
To create this variable Z, open Menu > Input > Transform variables, and
conduct the following transformations:
1. Click the "Constant = 1" button. Then a new variable "1" is
created, which has the value 1 for all observations.
2. Click the "Linear combination of variables" button, select "1" and
use the value of a as coefficient. Then a new variable with name
"ax1" is created, which has the value a for all observations. I will
assume that you have renamed the variable "ax1" as variable A.
3. Click the "Linear combination of variables" button, select "1" and
use the value of b as coefficient. Then a new variable with name
"bx1" is created, which has the value b for all observations. I will
assume that you have renamed the variable "bx1" as variable B.
4. Click the "Linear combination of variables" button, select the
variables Y and A, and create the linear combination Y-A. I will
assume that you have renamed Y-A as YminA. Note that now
YminA $\in (0, b-a]$.
5. Click the "Linear combination of variables" button, select the
variables B and A, and create the linear combination B-A. I will
assume that you have renamed B-A as BminA. Note that BminA is a
constant with value b-a for all observations.
6. Click the "Multiplicative transformation of variables" button, select
the variables YminA and BminA and use the powers 1 and -1,
respectively, to create the new variable "YminA x BminA^-1". I will
assume that you have renamed this new variable as YminA/BminA.
Note that YminA/BminA $\in (0,1]$.
7. Click the "LOG transformation: x -> ln(x)" button, and select the
variable YminA/BminA. Then the new variable LN[YminA/BminA]
will be created. Note that LN[YminA/BminA] $\in (-\infty, 0]$.
8. Click the "Linear combination of variables" button, select the
variable LN[YminA/BminA], and use the coefficient -1 to create the
variable -LN[YminA/BminA]. I will assume that you have renamed
this variable as Z. Thus, Z = -LN[YminA/BminA]. Now
$Z \in [0,\infty)$, and $P[Z = 0] = P[Y = b] > 0$.
The new variable Z in step 8 can now be used as dependent variable in a
Tobit model. However, keep in mind that in this case a negative
coefficient of an X variable implies a positive effect on the original
dependent variable Y, because $\partial Z/\partial Y = -1/(Y-a) < 0$,
hence $\partial Y/\partial Z < 0$.
Although needless to say (but I will say it anyhow), if a = 0 and b = 1
then you can skip the steps 1 to 6, and use Y instead of YminA/BminA in
step 7.
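Outside EasyReg, the eight menu steps collapse into a single formula. The
sketch below shows the transformation for the case (a,b] and its inverse;
the function names are hypothetical and the numbers are illustrative
only.

```python
import math

def to_tobit_scale(y, a, b):
    """Map Y in (a, b] to Z = -ln((Y - a)/(b - a)) in [0, inf); Y = b gives Z = 0."""
    if not (a < y <= b):
        raise ValueError("y must lie in the interval (a, b]")
    return -math.log((y - a) / (b - a))

def from_tobit_scale(z_star, a, b):
    """Map a latent Z* back to the original scale: Y = min(b, a + (b - a) exp(-Z*))."""
    return min(b, a + (b - a) * math.exp(-z_star))

z_at_bound = to_tobit_scale(10.0, 0.0, 10.0)    # observation at the upper bound
z_inside = to_tobit_scale(2.5, 0.0, 10.0)       # interior observation
y_back = from_tobit_scale(z_inside, 0.0, 10.0)  # round trip to the Y scale
y_capped = from_tobit_scale(-1.0, 0.0, 10.0)    # negative Z* is censored at b
print(z_at_bound, z_inside, y_back, y_capped)
```

Larger values of Z correspond to smaller values of Y, which is why the
sign of a coefficient has to be read in reverse for this case.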
b. The case $Y \in [a,b)$
If $Y \in [a,b)$, where $-\infty < a < b < \infty$, with $P[Y = a] > 0$,
then $Z = -\ln[(b - Y)/(b - a)] \in [0,\infty)$, with
$P[Z = 0] = P[Y = a] > 0$. This variable Z can be created similarly to the
previous steps 1 to 8, and can be used as the new dependent variable in a
Tobit model. Since now $\partial Y/\partial Z > 0$, a positive
coefficient of an X variable implies a positive effect of this X variable
on Y.
Note that now we model the conditional distribution of Y by
$Y = \max(a, b - (b - a)\exp(-Z^*)) = \max(a, b - (b - a)\exp(-\beta'X - U))$.
c. The case $Y \in [a,b]$
This case cannot be handled by standard Tobit analysis.
You can find some specialized software at the links below.
LIMDEP offers the widest variety of sample selection models. Information
regarding LIMDEP can be found at www.limdep.com. Also, a student version,
along with documentation, can be downloaded free from
ww.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm. The student
version of this software and accompanying data sets are included with
Greene's (2000) text.
For SAS users, Jaeger (1993) provides the code for performing Heckman's
two-step estimation of sample selection bias. This program can be
downloaded from the SAS Institute web page using the following link
(http://ftp.sas.com/techsup/download/stat/heckman.html). Some adjustments
to the code are necessary for the program to work (i.e., your own variable
names must be inserted).
Finally, Stata 7 (2001) (http://www.stata.com/site.html) can also be
used to estimate Heckman's (1976, 1979) two-step detection and
correction of sample selection bias. Specific programming information
can be found at the following Stata link
(http://www.stata.com/help.cgi?heckman).
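For readers without access to these packages, the correction term at the
heart of the two-step procedure, the inverse Mills ratio, can be sketched
in a few lines. The function names are hypothetical, and the sketch
assumes that the probit index of the selection equation has already been
estimated in the first step; it does not reproduce the SAS or Stata
routines.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def inverse_mills(index):
    """Inverse Mills ratio phi(index) / Phi(index) for a selected observation."""
    return norm_pdf(index) / norm_cdf(index)

# Hypothetical first-step probit indices for three selected observations.
indices = [-1.0, 0.0, 1.0]
lambdas = [inverse_mills(v) for v in indices]
print(lambdas)
```

In the second step these values are added as an extra regressor in the
outcome equation, estimated on the selected sample only. The ratio is
larger for observations that were unlikely to be selected.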
In most software, such as Stata, SAS, and SPSS, there are special
routines for these problems. If you search the help of these packages for
Heckman, Tobit, selection bias and, more importantly, endogeneity, you
will find the related syntax, which will simplify your work
significantly.
For Stata, we observed many times that the topic is well documented there
and can usually be identified and applied. Here you can see the result of
a search in the Stata help:
-------------------------------------------------------------------------------
search for selection bias (manual: [R] search)
-------------------------------------------------------------------------------
Keywords: selection bias
Search: (1) Official help files, FAQs, Examples, SJs, and STBs
(2) Web resources from Stata and from other users
Search of official help files, FAQs, Examples, SJs, and STBs
[R] heckman . . . . . . . . . . . . . . . . . . . Heckman selection model
(help heckman)
[SVY] svy: heckman . . . . . . . . Heckman selection model for survey
data
(help svy: heckman)
FAQ . . . . . . . . . . . . . . . Endogeneity versus sample selection bias
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. Millimet
10/01 What is the difference between 'endogeneity' and
'sample selection bias'?
http://www.stata.com/support/faqs/stat/bias.html
FAQ . . . . . . . . . . . . . . Determining the sample for a Heckman model
. . . . . . . . . . . . . . . . . . . . . . . V. Wiggins and W. Gould
Besides, in Stata, if you search with a good phrase, you may be linked
via the web to other users who have worked on your subject. You can also
contact the company that develops the software and ask your specific
questions.