gdo 2009 selection_bias

How to correct the selection bias in management research

Team: Bin Xu, Oualid El Ouardi, Shabnam Kazempur
Teacher: Mr Christophe Benavent

GDO Master


I. Overview

1.1 What is selection bias?

Selection bias is the distortion of a statistical analysis caused by the
way the sample is selected, either before or after the data are collected.
Typically it makes measures of statistical significance appear much
stronger than they are, but it can also produce completely illusory
artifacts. Selection bias can result from scientific fraud, in which data
are manipulated directly, but more often it is unconscious or due to
biases in the instruments used for observation.

1.2 Reasons for selection bias

Figure 1

Figure 1 shows the three main reasons: selective non-response,
incomplete observability and self-selection.



1.2.1 Selective non-response

Selective non-response means that too few of the sampled units respond,
and that those who respond may differ from those who do not. For example,
the 1997 Dutch Labour Force Survey (LFS) had a response rate of 56%. Apart
from the LFS, other socio-cultural surveys in the Netherlands also show
response rates between 50% and 60%. Panel studies are even more
problematic. The response rate of the two-wave Dutch Parliamentary
Election Study (DPES) has been below 50% since 1981, and only 43% of the
electorate participated in 1998 (Aarts, Van der Kolk & Kamp, 1999,
pp. 22-24). Conclusions drawn from such incomplete data are hard to
defend, because the people who did not respond may well be those who hold
a different view on the question under study.

1.2.2 Incomplete observability

Incomplete observability covers two types of data: censored data and
truncated data.

Censored data:

Censored data points are those whose measured properties are not known
precisely, but are known to lie above or below some limiting sensitivity.

For example, suppose a study is conducted to measure the impact of a

drug on mortality. In such a study, it may be known that an individual's

age at death is at least 75 years. Such a situation could occur if the

individual disenrolled from the study at age 75, or if the individual is

currently alive at the age of 75. Censoring also occurs when a value

occurs outside the range of a measuring instrument. For example, a

bathroom scale might only measure up to 300 lbs. If a 350 lb individual is

weighed using the scale, the observer would only know that the

individual's weight is at least 300 lbs.

Truncated data:



Truncated data points are those which are missing from the sample

altogether due to sensitivity limits. For example, if an experiment were to

be conducted to count the distribution of sizes of fish in a lake, a net

might be used to catch a representative sample of fish. If the net had a

mesh size of 1 cm, then no fish narrower than 1 cm wide would be found

in the sample. This is a result of the method of selection: there is no way

of knowing whether there are any fish smaller than 1 cm based on an

experiment using that net.

Censoring and truncation:

As the examples above show, censoring occurs when an observation is
incomplete due to some random cause. The cause of the censoring must be
independent of the event of interest if standard methods of analysis are
to be used. Truncation is a variant of censoring in which the incomplete
nature of the observation is due to a systematic selection process
inherent in the study design.

1.2.3 Self-selection

Self-selection is a term used to indicate any situation in which
individuals select themselves into a group, causing a biased sample. It is
commonly used to describe situations where the characteristics that cause
people to select themselves into the group create abnormal or undesirable
conditions in the group. Self-selection is a major problem in research in
sociology, psychology, economics and many other social sciences.

Self-selection makes it difficult to determine causation. For example, one

might note significantly higher test scores among those who participate in

a test preparation course, and credit the course for the difference.

However, due to self-selection, there are a number of differences between

the people who chose to take the course and those who chose not to.

Arguably, those who chose to take the course might have been more

hard-working, studious, and dedicated than those who did not, and that

difference in dedication may have affected the test scores between the

two groups. If that was the case, then it is not meaningful to simply

compare the two sets of scores. Due to self-selection, there were other

factors affecting the scores than merely the course itself.
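The test-preparation example can be illustrated with a small simulation. The sketch below is hypothetical: the variable `dedication`, the score scale and all numeric values are assumptions chosen for illustration, not data from any study. It sets the true effect of the course to zero and still finds a gap between the two groups, because dedicated students both enrol more often and score higher.

```python
import random
import statistics

def simulate_course_gap(n=20000, true_effect=0.0, seed=0):
    """Return the naive score gap (takers minus non-takers) under self-selection."""
    rng = random.Random(seed)
    takers, non_takers = [], []
    for _ in range(n):
        dedication = rng.gauss(0.0, 1.0)              # unobserved trait (assumed)
        # More dedicated students are more likely to enrol: self-selection.
        enrols = dedication + rng.gauss(0.0, 1.0) > 0.0
        score = 60.0 + 5.0 * dedication + true_effect * enrols + rng.gauss(0.0, 5.0)
        (takers if enrols else non_takers).append(score)
    return statistics.mean(takers) - statistics.mean(non_takers)

# The course has no effect at all, yet the naive comparison shows a gap.
naive_gap = simulate_course_gap(true_effect=0.0)
print(f"naive gap with zero true effect: {naive_gap:.2f} points")
```

The gap comes entirely from the difference in dedication between the groups, which is the point made in the text: comparing the two sets of scores does not isolate the effect of the course.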

Self-selection causes problems for research about programs or products.
In particular, it makes it difficult to evaluate programs, to determine
whether a program has any effect, and to do market research.

1.3 The main sources of selection bias in management

In most observational studies in management, selection bias is usually due
to the last two reasons. Because the objects of study are more specific,
their responses are easier to collect, so non-response is less of a
concern. Recall also that truncation is a variant of censoring in which
the incomplete nature of the observation is due to a systematic selection
process inherent in the study design.

We therefore focus on selection bias in two main flavors: (1)
self-selection of individuals to participate in an activity or survey, or
as subjects in an experimental study; (2) selection of samples or studies
by researchers in a way that favors a particular result, especially
through censoring of data.

In the next section, the paper describes in more detail the selection bias
due to self-selection and censored data in management research, and
introduces two basic models for correcting it: the Tobit model and the
Heckman model. The former is the basis of the latter and a method for
correcting selection bias in censored data; the latter is more widely
applicable and is the mainstream model for correcting selection bias.

Solutions:

Research on selection bias started in 1958 with Tobin's studies of
household expenditures and luxury products. Since the publication of
Heckman's article in 1979, which has about 7,300 citations and helped make
him a Nobel Prize winner, thousands of articles have appeared on the
subject. While searching through these articles we identified different
categories and trends for identifying, controlling or removing selection
bias.

The primary methods used were mainly mathematical and econometric models.
Thanks to the large body of research and the attention paid to it, this
area is now almost mature, and it is in fact the main source of progress
on selection bias. However, in many areas there have been efforts to
simplify the mathematical models and provide insight for researchers in
the human sciences who are not necessarily mathematicians. Emerging
experimental and quasi-experimental methods, such as the propensity score
method, are the result of these efforts.

In this paper we first look at the mathematical background. Then we turn
to experimental methods. Finally we provide our insights and solutions for
dealing with selection bias. In the appendix we provide some information
about the software and toolkits that can be used for selection bias.


We have tried to avoid complicated mathematical equations. We can even
claim to have been more successful in this than articles with titles such
as “intuition about selection bias”. Nonetheless, this is an area that
still depends heavily on mathematics, and it is not easy to convey the
concepts without understanding their mathematical basis. For this reason,
in the second chapter we explain, to some degree, the mathematical
foundations of selection bias. Although simple and introductory, it
provides a basis for those who want to follow this field through
mathematical equations. In addition, it gives useful insights to others
who only want to understand selection bias qualitatively. Our last chapter
is entirely qualitative and discusses experiments and qualitative
approaches.

Finally, we provide a bibliography covering the different subjects in
selection bias.


Econometric Basis of Selection Bias:

A: Tobit Model:

As discussed before, a common occurrence in many regression models is
truncation or censoring of the response variable. Tobin (1958) pioneered
the study of such models in economics, analyzing household expenditures on
durable goods while taking into account the fact that expenditures cannot
be negative. That is, for some observations the observed response is not
the actual response but rather the censoring value (often zero), together
with an indicator that censoring has occurred. More specifically, the
so-called Type I Tobit model can be written as a combination of two
familiar models. The first is a Probit model, which determines whether the
variable y_i is zero or positive, and the second is a Truncated Regression
model for the positive values of y_i. The Type I Tobit model assumes that
the parameters for the effect of the explanatory variables on the
probability that an observation is censored and for the effect on the
conditional mean of the non-censored observations are the same.

In this section we give an overview of the Tobit model, by which we mean
the Type I Tobit model unless otherwise specified. We then give an
application of the model and comment on the results. The last part
introduces the limitations of the Tobit model.

I. Overview

1. Truncation and Censoring

a. Truncation

Truncation occurs when some observations on both the dependent

variable and regressors are lost. For example, income may be the


dependent variable and only low-income people are included in the

sample. In effect, truncation occurs when the sample data is drawn from a

subset of a larger population.

Examples:

Let y_i = the profit of the i-th firm as a percentage of assets and x_i =
the four-firm concentration ratio of the industry the firm is in. Suppose
only firms with positive profit rates are observed and firms with
negative profit rates are not observed. In this case a = 0 and the
dependent variable is left truncated. In general, (y_i, x_i) is observed
only when y_i > a (left truncation), or when y_i < b (right truncation),
or when c < y_i < d (double truncation).

A second example: objects of a certain type in a specific region of the
sky will not be detected by the instrument if their apparent luminosity is
below a certain lower limit. This often happens because of instrumental
limitations or because of our position in the universe.

For instance, suppose that the data concern the purchases of new

cars, with yi the price of the car and xi characteristics of the buyer

like age and income class. Then no observations on yi can be

below the price of the cheapest new car. Some households may

want to buy a new car but find it too expensive, in which case they

do not purchase a new car and are not part of the observed data.

This truncation effect should be taken into account, for instance, if

one wants to predict the potential sales of a cheaper new type of

car, because most potential buyers will not be part of the observed

sample.

Figure: a truncated normal density with truncation from below (at x = -1).


b. Censoring

Censoring occurs when data on the dependent variable are lost (or limited)
but data on the regressors are not. Sources or events can be detected, but
their values (measurements) are not completely known. We only know that
the value is less than some number. All pairs (y_i, x_i) are observed, but
when y_i passes the censoring point, y_i is recorded as the censoring
point itself. As in the truncation model, there can be left-censoring,
right-censoring, or both upper and lower censoring. For discussion
purposes, consider the case where y_i ≥ 0, with contributions to a charity
as an example: some people give to the designated charity and some people
do not.

Examples:

People of all income levels may be included in the sample, but for some
reason the income of high-income people may be top-coded as, say,
$100,000. Censoring is a defect in the sample: if there were no censoring,
the data would be a representative sample from the population of interest.
Truncation entails a greater loss of information than censoring. Long
(1997, p. 188) provides a nice picture of truncation and censoring.

Figure: a censored normal density with censoring from below (at x = 0).

N.B.: The main difference between censoring and truncation is that a
censored object is still detectable, while a truncated object is not
detected at all.
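The difference between the two mechanisms can be seen on simulated data. The following is a minimal sketch, assuming a standard normal latent variable and a limit at zero (both assumptions made here for illustration): censoring keeps every unit but records the limit instead of the true value, while truncation drops the unit altogether.

```python
import random
import statistics

rng = random.Random(1)
limit = 0.0
latent = [rng.gauss(0.0, 1.0) for _ in range(50000)]   # y* ~ N(0, 1), assumed

# Censoring: every unit stays in the sample, values below the limit are
# recorded as the limit itself.
censored = [max(y, limit) for y in latent]

# Truncation: units below the limit never enter the sample.
truncated = [y for y in latent if y > limit]

mean_latent = statistics.mean(latent)
mean_censored = statistics.mean(censored)
mean_truncated = statistics.mean(truncated)

print(f"sample sizes: censored {len(censored)}, truncated {len(truncated)}")
print(f"means: latent {mean_latent:.3f}, censored {mean_censored:.3f}, "
      f"truncated {mean_truncated:.3f}")
```

Both sample means overstate the latent mean, but the truncated sample is also smaller, which reflects the greater loss of information mentioned above.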

2. Tobit model for censored data

The dependent variable is called censored when the response cannot take
values below (left censored) or above (right censored) a certain threshold
value. For instance, in the example on investments in a new financial
product, the investments are either zero or positive. And, in deciding
about a new car, one has either to pay the cost of the cheapest car or
abstain from buying a new car. The so-called Tobit model relates the
observed outcomes y_i ≥ 0 to an index function y*_i = x_i'β + ε_i, with
y_i = y*_i if y*_i > 0 and y_i = 0 otherwise.

The Tobit model for censored data is sometimes called the Tobit type 1
model, to distinguish it from the Tobit type 2 model that will be
discussed in the next section for data with selection effects. In contrast
with a truncated sample, where only the responses with y*_i > 0 are
observed, it is now assumed that the responses y_i = 0 corresponding to
y*_i ≤ 0 are also observed and that the values of x_i for such
observations are also known. In practice these zero responses are of
interest, as they provide relevant information on economic behaviour. For
instance, it is of interest to know which individuals decided not to
invest (as other financial products could be developed for this group) or
which individuals did not buy a new car (as one could design other cars
that appeal more to this group). The Tobit model can be seen as a
variation of the Probit model, with one discrete option ('failure',
y_i = 0) and where the option 'success' is replaced by the continuous
variable y_i > 0.
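The mix of a discrete part (the zeros) and a continuous part (the positive values) shows up directly in the likelihood. The sketch below is an illustration under stated assumptions: one regressor, censoring from below at zero, normal errors, and simulated data with hypothetical parameter values (0.5, 1.0, 1.0). The function names `tobit_loglik` and `ols` are introduced here for the example. It only evaluates the log-likelihood; in practice the function would be handed to a numerical optimiser to obtain the ML estimates.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tobit_loglik(beta0, beta1, sigma, xs, ys):
    """Log-likelihood of a Type I Tobit model censored from below at zero."""
    total = 0.0
    for x, y in zip(xs, ys):
        index = beta0 + beta1 * x
        if y <= 0.0:
            # Censored observation: contributes P(y* <= 0) = Phi(-index / sigma).
            total += math.log(STD_NORMAL.cdf(-index / sigma))
        else:
            # Uncensored observation: contributes the normal density of y.
            z = (y - index) / sigma
            total += math.log(STD_NORMAL.pdf(z) / sigma)
    return total

def ols(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Simulated censored data (all values are illustrative assumptions).
rng = random.Random(42)
xs = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
latent = [0.5 + 1.0 * x + rng.gauss(0.0, 1.0) for x in xs]
ys = [max(v, 0.0) for v in latent]

b0_ols, b1_ols = ols(xs, ys)                       # ignores the censoring
ll_true = tobit_loglik(0.5, 1.0, 1.0, xs, ys)
ll_ols = tobit_loglik(b0_ols, b1_ols, 1.0, xs, ys)
print(f"OLS slope on censored data: {b1_ols:.3f} (true slope 1.0)")
print(f"Tobit log-likelihood at true values {ll_true:.1f}, at OLS values {ll_ols:.1f}")
```

The OLS slope is pulled towards zero by the censored observations, and the Tobit likelihood rates the OLS values as clearly less plausible than the values used to generate the data.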

II. Application of Tobit model for censored data

As introduced in the previous section (2. Tobit model for censored data),
we consider direct marketing data concerning a new financial product: of
the 925 customers, 470 responded to the mailing by investing in the new
product. Unlike the truncated case, we do not restrict the analysis to the
470 customers of the bank who decided to invest, because we also know the
individual characteristics of the customers who decided not to invest. The
sample of all 925 customers is therefore censored, and we construct a
Tobit model for the invested amount of money.

We discuss 1) the data, 2) the ML (maximum likelihood) estimates of the
Tobit model, and 3) a comparison with the results obtained with the
truncated-sample approach rather than the censored approach.

1) The data

We consider data that were collected in a marketing campaign for a new
financial product of a commercial investment firm (Robeco). The
campaign consisted of a direct mailing to customers of the firm. The firm

is interested in identifying characteristics that might explain which

customers are interested in the new product and which ones are not. In

particular, there may be differences between male and female customers

and between active and inactive customers (where active means that the

customer already invests in other products of the firm). Also the age of

customers may be of importance, as relatively young and relatively old

customers may have less interest in investing in this product than middle

aged people.


The data set consists of 925 individuals, of whom 470 responded by
making an investment in the product and 455 did not respond. For
individuals who responded, the amount of money invested in this product
is known. The explanatory variables (gender, activity, age) are known for
all 925 individuals, hence also for the individuals who did not invest in
the product. So the dependent variable is censored, not truncated. As
before, we take as dependent variable y_i = log(1 + invest), where
'invest' is the amount of money invested. For individuals who did not
invest (so that 'invest' is zero), we get y_i = 0.

2) The ML (maximum likelihood) estimates of the Tobit model

The Tobit estimates (ML in the censored regression model) are in Panel 2
of the figure of results. For comparison, this table also contains the OLS
(ordinary least squares) estimates that are obtained if the censoring is
erroneously neglected (see Panel 1). The Tobit multipliers in Panel 2 are
somewhat larger than the OLS multipliers. The variables 'gender' and
'activity' have a positive effect on the amount of money invested, and age
has a parabolic effect, with a maximum at an age of around 53 years
(namely, where 0.196 - 2 * 0.185 * (age/100) = 0).
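The age of maximum investment can be checked by hand. As a sketch, assume that age enters the model through the two terms 0.196 s - 0.185 s^2 with s = age/100 (the symbol s is introduced here only for this calculation). Setting the derivative with respect to s to zero gives:

```latex
\[
\frac{d}{ds}\left(0.196\,s - 0.185\,s^{2}\right) = 0.196 - 2(0.185)\,s = 0
\quad\Longrightarrow\quad
s = \frac{0.196}{0.370} \approx 0.53 ,
\]
```

that is, an age of about 53 years. Because the coefficient of the squared term is negative, this stationary point is a maximum.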

3) Comparison of Tobit estimates with results for truncated sample

We now compare the Tobit results in Panel 2 of the figure of results with
the results obtained for the truncated sample using the truncated
regression model (see Panel 3). In the Tobit model, the effect of
'activity' has the expected positive sign (instead of negative) and the
maximum investments occur around an age of 53 (instead of 62). Further,
the Tobit estimates indicate higher investments by males than by females,
whereas the reverse effect was estimated in the truncated sample. As the
information on individuals who do not invest is important in describing
general investment behaviour, the results obtained for the censored sample
are more reliable than those for the truncated sample. This illustrates
the general point that it is always advisable to include relevant
information in the model. The truncated model neglects the information on
non-investing customers, which makes it much less informative than the
Tobit model for the censored data.

4) Figures of results


III. Mathematical definition of Tobit model types

Amemiya (1984) classified Tobit models into five types based on the
characteristics of the likelihood function. For notational convenience,
let P denote a distribution or density function, assuming that y*_ji is
normally distributed with mean x_ji'β_j and variance σ_j^2.

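Type 1 Tobit

The Type 1 tobit model, discussed in the preceding section on censored
data, has a single latent variable y*_1i; the observed y_1i equals y*_1i
when y*_1i > 0 and is zero otherwise.

Type 2 Tobit

The Type 2 tobit model has two latent variables, y*_1i and y*_2i. Only the
sign of y*_1i is observed, and y_2i is observed when y*_1i > 0. The error
terms of the two equations are jointly normally distributed.

Type 3 Tobit

The Type 3 tobit model differs from the Type 2 tobit in that y_1i of the
Type 3 tobit is observed when y*_1i > 0. The error terms are i.i.d.
(independent and identically distributed) draws from a bivariate normal
distribution.

Type 4 Tobit

The Type 4 tobit model consists of three equations: those of the Type 3
model plus a third outcome, y_3i, which is observed when y*_1i ≤ 0. The
error terms are i.i.d. draws from a trivariate normal distribution.

Type 5 Tobit

The Type 5 tobit model also consists of three equations. Only the sign of
y*_1i is observed; y_2i is observed when y*_1i > 0 and y_3i when
y*_1i ≤ 0. The error terms are i.i.d. draws from a trivariate normal
distribution.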
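For reference, the five types of Amemiya (1984) are commonly written in terms of latent variables. The LaTeX sketch below follows that common textbook notation, which is an assumption here rather than the notation of a particular source; the `cases` environment assumes the amsmath package.

```latex
% Latent equations (j = 1, 2, 3 as needed), with jointly normal errors u_{ji}:
\[ y_{ji}^{*} = x_{ji}'\beta_j + u_{ji} \]

% Type 1: one equation, outcome censored at zero
\[ y_{1i} = \begin{cases} y_{1i}^{*} & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases} \]

% Type 2: only the sign of y_{1i}^{*} is observed; it selects y_{2i}
\[ y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases}
   \qquad
   y_{2i} = \begin{cases} y_{2i}^{*} & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases} \]

% Type 3: as Type 2, but y_{1i}^{*} itself is observed when positive
\[ y_{1i} = \begin{cases} y_{1i}^{*} & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases}
   \qquad
   y_{2i} = \begin{cases} y_{2i}^{*} & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases} \]

% Type 4: Type 3 plus a third outcome observed in the other regime
\[ y_{3i} = \begin{cases} y_{3i}^{*} & \text{if } y_{1i}^{*} \le 0 \\ 0 & \text{if } y_{1i}^{*} > 0 \end{cases} \]

% Type 5: sign of y_{1i}^{*} observed; y_{2i} and y_{3i} observed in opposite regimes
\[ y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases}
   \quad
   y_{2i} = \begin{cases} y_{2i}^{*} & \text{if } y_{1i}^{*} > 0 \\ 0 & \text{if } y_{1i}^{*} \le 0 \end{cases}
   \quad
   y_{3i} = \begin{cases} y_{3i}^{*} & \text{if } y_{1i}^{*} \le 0 \\ 0 & \text{if } y_{1i}^{*} > 0 \end{cases} \]
```

Type 2 is the form that underlies the Heckman selection model: the first equation decides whether the second outcome is observed.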

IV. Overview of trends in correcting the Tobit model

The Tobit model was introduced by James Tobin in 1958 in order to model a
specific type of discrete-continuous data commonly found in economic
applications. The Tobit model is a specific case of a censored regression
model and assumes that the continuous component of the data (right tail)
is normally distributed. Early examples included modeling household
expenditures on luxury goods, inheritance, and expected age of retirement.
It is now used wherever selection bias is possible.

However, later research has demonstrated that even small departures from
the underlying normality assumption may lead to inconsistent estimators.
Arabmazar and Schmidt (1982) explored the robustness of the Tobit
estimator of a population mean when the assumption of normality is
violated. They concluded that the bias can be quite large and that it
depends on the proportion of censoring. One technique often used to
compensate for this weakness in the case of long-tailed distributions is
to apply a log transformation to the data. Lorimer and Kiermeier (2007)
conducted a simulation study to examine the use of Tobit models on
log-transformed microbiological data. They compared the Tobit method with
two other methods: using only the uncensored observations, and
substituting the limit of detection for the censored values. They
concluded that the two standard methods led to biased estimates and that
the Tobit model led to less biased estimates. However, their conclusions
rest on the underlying assumption of normality. Others have pointed out
that the Tobit model usually leads to nonlinear and complicated equations
that are hard to solve, even for mathematicians.

Therefore, for each variant of these models, whose solutions often become
nonlinear and complicated, researchers try to use simple, heuristic models
suited to the case at hand.

It is important to note that Tobit models are the basic models for
identifying selection bias, and that later models such as the Heckman
model are special cases of the Tobit model. This is why there is still
extensive research in this field and why many new trends have emerged.
Besides, a large body of work tries to develop a classical solution for
each of these trends.

The Relation between Tobit model and Heckman model:

In a particular case the Tobit model reduces to a Probit model, which is
the first stage of the Heckman model; hence the first stage of the Heckman
model can be simulated in the same way as the Tobit model. The Tobit model
was introduced 23 years after the Probit model: the latter was introduced
by Chester Ittner Bliss in 1935, and a fast method for solving it was
introduced by Ronald Fisher in an appendix to the same article. Because
the response is a series of binomial results, the likelihood is often
assumed to follow the binomial distribution. Let Y be a binary outcome
variable, and let X be a vector of regressors. The probit model assumes
that

    Pr(Y = 1 | X) = Φ(X'β),

where Φ is the cumulative distribution function of the standard normal
distribution. The parameters β are typically estimated by maximum
likelihood.

While easily motivated without it, the probit model can be generated by a
simple latent variable model. Suppose that

    Y* = X'β + ε,

where ε | X ~ N(0, 1), and suppose that Y is an indicator for whether the
latent variable Y* is positive:

    Y = 1 if Y* > 0, and Y = 0 otherwise.

In the next part we will see in detail that this is the foundation of the
Heckman model.

V. Some References

Amemiya, Takeshi (1973). "Regression analysis when the dependent

variable is truncated normal". Econometrica 41 (6), 997–1016.

Amemiya, Takeshi (1984). "Tobit models: A survey". Journal of

Econometrics 24 (1-2), 3-61.

Amemiya, Takeshi (1985). "Advanced Econometrics". Basil Blackwell,
Oxford.

Schnedler, Wendelin (2005). "Likelihood estimation for censored random
vectors". Econometric Reviews 24 (2), 195–217.

Tobin, James (1958). "Estimation of relationships for limited
dependent variables". Econometrica 26 (1), 24–36.

Heij, C., de Boer, P., Franses, P. H., Kloek, T., and van Dijk, H. K.
(2004). "Econometric Methods with Applications in Business and
Economics". Oxford University Press.


Heckman Model:

We saw how there may be self-selection by the individuals or data units
being investigated. Here we review one example and try to model it with
Heckman's basic model, which is an econometric model.

Suppose we observe that college grades are uncorrelated with success in
graduate school. Can we infer that college grades are irrelevant? Of
course not: unmeasured variables (e.g. motivation) used in the admissions
process might explain why those who enter graduate school with low grades
do as well as those who enter graduate school with high grades.

Formulating the problem:

Y1 = the result of students at university
X1 = undergraduate scores
Y2 = admission to graduate school
X2 = all the factors that determine the admission result
I = number of observations

To be present in the regression sample, a student must be admitted, that
is, Y2 > 0.

Solving the problem:

If we solve this model, the regression for the admitted students contains
a new regressor, λ_i (the inverse Mills ratio), and a new specific error
term V_1i. In our example, the unmeasured factor that this correction
accounts for is motivation.
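The model can be written out explicitly. The LaTeX sketch below follows the notation of Heckman (1979), with Y_1, X_1, Y_2, X_2 and V_1i as defined in the text; the symbols σ_12 (covariance of the two error terms), σ_22 (variance of U_2i), φ and Φ (standard normal density and distribution function) are the usual ones and are assumed here, as is joint normality of the errors.

```latex
\begin{align*}
Y_{1i} &= X_{1i}\beta_1 + U_{1i}, \\
Y_{2i} &= X_{2i}\beta_2 + U_{2i}, \qquad Y_{1i} \text{ is observed only if } Y_{2i} > 0, \\[4pt]
E(Y_{1i} \mid X_{1i},\, Y_{2i} > 0) &= X_{1i}\beta_1 + \frac{\sigma_{12}}{\sigma_{22}^{1/2}}\,\lambda_i, \\
\lambda_i &= \frac{\phi(Z_i)}{1 - \Phi(Z_i)}, \qquad Z_i = -\frac{X_{2i}\beta_2}{\sigma_{22}^{1/2}}, \\[4pt]
Y_{1i} &= X_{1i}\beta_1 + \frac{\sigma_{12}}{\sigma_{22}^{1/2}}\,\lambda_i + V_{1i}.
\end{align*}
```

Leaving λ_i out of the regression on the selected sample is an omitted-variable problem, which is why Heckman describes sample selection bias as a specification error.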

References:

Heckman, James J. 1979. “Sample Selection Bias as a

Specification Error.” Econometrica 47(1): 153-161

W. H. Greene, Econometric Analysis, third edition

Graphical explanation of Heckman Model:

As discussed, selection bias arises when we only observe the units that
satisfy a primary condition. This can be shown graphically as below: we
can only see the part of the sample in which r is below 0, while the units
not represented in our survey may show different characteristics.
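The two-step correction can be tried on simulated data. The sketch below is a minimal illustration using only the Python standard library, not a production estimator: the data-generating values (true slope 2.0, error correlation 0.8) and the helper names `solve`, `ols` and `probit` are assumptions made for this example. Step 1 fits a probit for selection and computes the inverse Mills ratio φ(index)/Φ(index); step 2 adds it as a regressor in the outcome equation.

```python
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, ys):
    """Least squares via the normal equations; rows already hold a constant."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def probit(rows, ds, iters=25):
    """Probit maximum likelihood by Fisher scoring."""
    k = len(rows[0])
    g = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, d in zip(rows, ds):
            idx = sum(gi * ri for gi, ri in zip(g, r))
            p = min(max(N01.cdf(idx), 1e-10), 1.0 - 1e-10)
            f = N01.pdf(idx)
            for i in range(k):
                score[i] += f * (d - p) / (p * (1.0 - p)) * r[i]
                for j in range(k):
                    info[i][j] += f * f / (p * (1.0 - p)) * r[i] * r[j]
        g = [gi + si for gi, si in zip(g, solve(info, score))]
    return g

# Simulated data (all numbers are illustrative assumptions).
rng = random.Random(7)
sel_rows, chosen_flags, out_rows, out_y = [], [], [], []
for _ in range(4000):
    x, z = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    u2 = rng.gauss(0.0, 1.0)
    u1 = 0.8 * u2 + 0.6 * rng.gauss(0.0, 1.0)   # corr(u1, u2) = 0.8
    chosen = x + z + u2 > 0.0                    # selection equation (Y2 > 0)
    sel_rows.append([1.0, x, z])
    chosen_flags.append(1.0 if chosen else 0.0)
    if chosen:
        out_rows.append([1.0, x])
        out_y.append(1.0 + 2.0 * x + u1)         # outcome equation, true slope 2.0

# Step 1: probit for selection, then the inverse Mills ratio for selected units.
gamma = probit(sel_rows, chosen_flags)
mills = []
for r, d in zip(sel_rows, chosen_flags):
    if d == 1.0:
        idx = sum(gi * ri for gi, ri in zip(gamma, r))
        mills.append(N01.pdf(idx) / N01.cdf(idx))

# Step 2: OLS on the selected sample, without and with the correction term.
naive = ols(out_rows, out_y)
corrected = ols([r + [lam] for r, lam in zip(out_rows, mills)], out_y)
print("naive slope:", round(naive[1], 3))
print("corrected slope:", round(corrected[1], 3), "lambda coef:", round(corrected[2], 3))
```

In this setup the naive regression on the selected units underestimates the slope, while the regression that includes the inverse Mills ratio recovers a value close to the one used to generate the data. The standard errors of the second step would need a separate correction, which is not shown.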


The example we study here is an application of the Heckman model in
insurance. We apply the same probabilistic condition, but since the
situation is binary we do not use regressions and instead solve the
problem with probability theory.

Auditing policies derived from statistical analysis and applied to
insurance claims face a major selection bias problem. Most insurance
companies are reluctant to carry out a random auditing policy. This is
because the long-term influence of an audit decision on the policyholder's
value for the company is negative. Indeed, an honest policyholder may take
the audit process amiss, and his loyalty to the company is likely to
decrease as a consequence, as well as his value for the insurer. Hence
companies are deterred from performing a systematic audit on part of their
claims database. In practice they audit only suspicious files. In this way
we have a selection bias: we select those who are suspicious. In order to
identify fraudulent insurance claims at an early stage, most systems score
a new incoming claim using a set of fraud indicators. If the score is high
enough, the claim is audited and fraud (or abuse) may be confirmed. Only
claims with a high suspicion level (or score) are selected for
investigation.

Let us now formalize the selection bias issue. If all incoming claims can
be considered for audit, the issue can be formalized in the following way:
A and F denote the binary variables related to audit and fraud, and x is
the vector of variables which describe the claim. A statistical model
assessing fraud risk is derived from the audited claims and estimates
probabilities of the type P(F = 1 | A = 1, x) = E(F | A = 1, x). An audit
policy induced by this model is then applied to the incoming claims, where
the relevant probabilities are P(F = 1 | x) = E(F | x). Selection bias is
a consequence of the confusion between these conditional and
unconditional probabilities.
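The gap between the two probabilities can be made concrete with a simulation. The sketch below is hypothetical: it assumes a single binary claim descriptor x, a suspicion signal `hidden` that the adjuster sees but that is not recorded in x, and made-up fraud probabilities. Because audits are triggered by that signal, the fraud rate among audited claims overstates the fraud rate among all claims with the same x.

```python
import random

rng = random.Random(3)
# Per value of x: [claims, frauds, audited claims, frauds among audited claims]
counts = {0: [0, 0, 0, 0], 1: [0, 0, 0, 0]}

for _ in range(200000):
    x = rng.randint(0, 1)          # observed claim descriptor (assumed binary)
    hidden = rng.random()          # adjuster's suspicion, not recorded in x
    fraud = rng.random() < 0.05 + 0.10 * x + 0.30 * hidden
    audit = hidden > 0.7           # non-random audit decision
    c = counts[x]
    c[0] += 1
    c[1] += fraud
    if audit:
        c[2] += 1
        c[3] += fraud

def rates(x):
    """Return estimates of (P(F = 1 | A = 1, x), P(F = 1 | x))."""
    claims, frauds, audited, audited_frauds = counts[x]
    return audited_frauds / audited, frauds / claims

for x in (0, 1):
    among_audited, among_all = rates(x)
    print(f"x = {x}: P(F=1 | A=1, x) ~ {among_audited:.3f}, P(F=1 | x) ~ {among_all:.3f}")
```

A model fitted on the audited claims alone would therefore overstate the fraud risk of an incoming claim with the same observed characteristics.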

Here S represents the suspected files. In normal auditing, the number of
audited files (A) is usually smaller than S, because we do not audit every
file and only evaluate suspected files, for which S > 0.

Random auditing of claims is the basic strategy which makes it possible
to counteract selection bias. A pure random auditing strategy consists in
picking claims at random and then auditing these claims. This controlled
experiment eliminates the selection induced by the audit decision. The
estimation of a single fraud equation in this sample provides an estimated
fraud probability for incoming claims which is not subject to selection
bias. Here we calculate the parameters for both the random and the usual
non-random samples, so that the difference between the two cases can be
examined.

Experimental Models:

Application of Heckman Model:

Since its introduction, the Heckman model has been widely adopted in
statistical testing. While searching for an example of the Heckman model,
we observed that it has become a common and necessary test in applied
economics. In other words, carrying out a Heckman test is not an advantage
for an article, but omitting it counts as a weakness of the research. Most
studies have used the Heckman test as a criterion for showing the validity
of their results. Sometimes it is used together with other methods to show
the possible variation in the answers, as in the example below:

We want to measure the components of productivity in the banking sector:

ln Y = B0 + B1 M-O + B2 M-O^2 + B3 Sprfc + B4 Année + B_RI RI + e

where Y is the measure of output, M-O is the measure of labour, Sprfc that
of capital, Année is a binary variable distinguishing 1995 from 1996, and
RI is a binary indicator of the presence of profit-sharing schemes.

In this article the authors go on to calculate the coefficients with the
Heckman method and the instrumental variable method in order to remove
selection bias, as below:

As you can see, the results of the three methods are not the same, and the
authors go on to reason about the correct range of answers (1).

The authors do not necessarily regard Heckman's method as the final
methodology. It is regarded only as a means of approaching the problem
from other dimensions. In fact real problems are often more complex than
what we saw at the beginning with Heckman's original article. There are
fields in which Heckman's model has proved complicated and of little use.
In other fields, various studies have provided the basis for using
Heckman's model as a classical and reliable solution to the particular
kinds of problems that arise there. As we will see in the next chapter,
other studies have tried to exploit quantitative research to provide
practical hints for researchers using qualitative methods.

(1) Simon Drolet, Paul Lanoie, Bruce Shearer, Analyse de l'impact
productif des pratiques de rémunération incitative pour une entreprise de
services : Application à une coopérative financière québecoise, 1999,
CIRANO

Experimental Methodes:

We saw TOBIT and Heckman‟s methods which are based on pure

mathematical approach. However, after their primary article, the selection

bias came into the attention of different fields in social science. At the

beginning, this job was excluded to econometricians, and they were

trying to provide insight for other disciplines, but later other branches of

human science, though did not have the Excellency of econometricians in


dealing with mathematics problems, came to work and facilitated

introduction of experimental methods.

As we saw in previous sections, the standard econometric method for evaluating social programs uses the outcomes of nonparticipants to estimate what participants would have experienced had they not participated. The difference between participant and nonparticipant outcomes is the estimated gross impact of a program reported in many evaluations. The outcomes of nonparticipants may differ systematically from what the outcomes of participants would have been without the program, producing selection bias in estimated impacts. A variety of non-experimental estimators adjust for this selection bias under different assumptions. Under certain conditions, randomized social experiments eliminate this bias (Heckman, 1997).
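To make the argument above concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than part of the cited work: the unobserved "ability" trait, the outcome equation, the true program effect of 2.0 and the function name `simulate`. It compares the naive participant-minus-nonparticipant difference under self-selection with the same difference under random assignment.

```python
import random
import statistics


def simulate(n=20000, true_effect=2.0, seed=0):
    """Return (naive difference under self-selection, difference under randomization)."""
    rng = random.Random(seed)
    self_treated, self_control = [], []
    rand_treated, rand_control = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)                     # unobserved trait (assumed)
        y0 = 10.0 + 3.0 * ability + rng.gauss(0.0, 1.0)   # outcome without the program
        y1 = y0 + true_effect                             # outcome with the program
        # Self-selection: people with higher ability are more likely to participate.
        if ability + rng.gauss(0.0, 1.0) > 0:
            self_treated.append(y1)
        else:
            self_control.append(y0)
        # Randomized experiment: a coin flip decides participation.
        if rng.random() < 0.5:
            rand_treated.append(y1)
        else:
            rand_control.append(y0)
    naive = statistics.mean(self_treated) - statistics.mean(self_control)
    randomized = statistics.mean(rand_treated) - statistics.mean(rand_control)
    return naive, randomized


if __name__ == "__main__":
    naive, randomized = simulate()
    print(f"naive difference under self-selection: {naive:.2f}")
    print(f"difference under randomization:        {randomized:.2f}")
```

Under these assumptions the naive difference overstates the program effect, because participants would have had better outcomes than nonparticipants even without the program, while the randomized difference stays close to the assumed effect.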

Here we study the propensity score model, one of the most important of these methods, and continue our discussion.


Propensity score matching

Introduction:

The probability of selection into a treatment, also called the propensity

score, plays a central role in classical selection models and in matching

models (see, e.g., Heckman, 1980; Heckman and Navarro, 2004;

Heckman and Vytlacil, 2007; Hirano et al., 2003; Rosenbaum and Rubin,

1983). Heckman and Robb (1986, reprinted 2000), Heckman and Navarro

(2004) and Heckman and Vytlacil (2007) show how the propensity score

is used differently in matching and selection models. They also show that,

given the propensity score, both matching and selection models are robust

to choice-based sampling, which occurs when treatment group members

are over- or under-represented relative to their frequency in the

population. Choice-based sampling designs are frequently chosen in

evaluation studies to reduce the costs of data collection and to obtain

more observations on treated individuals. Given a consistent estimate of

the propensity score, matching and classical selection methods are robust

to choice-based sampling, because both are defined conditional on

treatment and comparison group status.

Hence, in statistics, propensity score matching (PSM) is one of the quasi-empirical “correction strategies” that correct for selection bias in estimation.

Generally, PSM is for cases of causal inference and simple selection bias

in non-experimental settings in which: (i) few units in the non-

experimental comparison group are comparable to the treatment units;

and (ii) selecting a subset of comparison units similar to the treatment

unit is difficult because units must be compared across a high-

dimensional set of pretreatment characteristics.

In other words, the propensity score method is usually applied in two situations.

Selection bias: potential bias from treatment assignment or selection that remains, conditional on observed variables, because of unobserved variables associated with selection into treatment.

Finite data: a limited sample size reduces our ability to estimate causal effects by conditioning on observed variables.

However, PSM only adjusts for selection bias (it does not solve the problem completely) and minimizes the limitation of matching on many observed variables with finite data.

In this section, we introduce an empirical case to show how PSM works.

Web surveys are the most economical way to conduct social and market surveys, but selection bias can invalidate their results. The main source of bias is Internet access coverage. Even in the USA, web surveys of the older population (50 years old or more) carry a strong risk of biased results (Couper et al., 2007). The digital divide is greater in Italy, where 89.2% of the population over 50 has no Internet access (Istat, 2005), which suggests caution in using web surveys there. Since in many social or market surveys the target population does not coincide with the general population, it is worth applying methods that correct web survey results when surveys are conducted on topics related to niche interests.


The first step is to define the target population clearly on the basis of the topic under investigation. Only a subset of this population, of unknown size, has web access; but a national survey makes it possible to quantify and describe the subset of the target population without web access. These data can then be used to correct the web survey results.

The second step is to obtain an email list to which the web survey can be sent. Following a proposed classification (Romano et al., 2006), email lists differ in accuracy and in the reasons for enrollment. On many thematic web sites, people must authenticate (login and password) in order to enroll. Such email lists are an interesting starting point for web surveys: interest in the topic is itself a favorable element for a good response rate (among others, Olson, 2006).

The third step is to apply methods that correct, ex post, the selection bias deriving from web access. Among the methods proposed in the literature, the propensity score technique, originally proposed to correct selection bias in observational health studies (Rosenbaum and Rubin, 1983), has recently been used to correct selection bias deriving from nonprobabilistic sampling (Terhanian et al., 2001) and/or web surveys (Schonlau et al., 2004).


We discuss preliminary results obtained by applying this technique to data collected through a web survey of people enrolled on a well-known enogastronomic web site.

Setting aside the survey's own goals, in this work we take as the target population Italian people from 18 to 78 years old and focus on estimating the proportion of people who went on holiday. For the application of the propensity score method, the data consist of a subset of web survey respondents (nw = 4,128) and a subset of the Istat Multiscopo survey (nm = 37,677). A logistic regression was performed with the survey indicator (0 = Multiscopo; 1 = web survey) as the dependent variable and gender (2 levels), level of education (3 levels), geographical area (5 levels) and age (6 levels) as regressors. All variables were significant with α < 0.04. Respondents from both surveys were divided into 5 and then into 10 bins according to their propensity scores. The propensity weights were applied to the web survey results, as shown in Table 1.
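The procedure described above (logistic regression on a survey indicator, binning by propensity score, reweighting the web respondents) can be sketched on synthetic data. This is only an illustration under stated assumptions, not the original study's code: the single standardized covariate `x` (standing in for age), the sample sizes, the holiday probability and the function names `fit_logistic`, `propensity_weighted_mean` and `demo` are all hypothetical.

```python
import math
import random


def fit_logistic(xs, ds, iters=25):
    """Newton-Raphson fit of P(d = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in zip(xs, ds):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += d - p
            g1 += (d - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def propensity_weighted_mean(x_web, y_web, x_ref, bins=5):
    """Reweight web respondents so each propensity bin has the reference survey's share."""
    b0, b1 = fit_logistic(x_web + x_ref, [1] * len(x_web) + [0] * len(x_ref))

    def score(x):
        return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

    pooled = sorted(score(x) for x in x_web + x_ref)
    cuts = [pooled[len(pooled) * k // bins] for k in range(1, bins)]

    def bin_of(x):
        return sum(score(x) > c for c in cuts)

    ref_count = [0] * bins
    web_count = [0] * bins
    web_sum = [0.0] * bins
    for x in x_ref:
        ref_count[bin_of(x)] += 1
    for x, y in zip(x_web, y_web):
        k = bin_of(x)
        web_count[k] += 1
        web_sum[k] += y
    used = [k for k in range(bins) if web_count[k] > 0]  # skip bins with no web respondents
    total_ref = sum(ref_count[k] for k in used)
    return sum(ref_count[k] / total_ref * web_sum[k] / web_count[k] for k in used)


def demo(seed=1):
    rng = random.Random(seed)

    def holiday(x):  # assumed: lower x (younger) means more likely to go on holiday
        return 1 if rng.random() < 1.0 / (1.0 + math.exp(-(0.5 - 0.8 * x))) else 0

    x_ref = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # reference survey
    y_ref = [holiday(x) for x in x_ref]
    x_web = [rng.gauss(-1.0, 1.0) for _ in range(1000)]  # web sample skews young
    y_web = [holiday(x) for x in x_web]
    unweighted = sum(y_web) / len(y_web)
    weighted = propensity_weighted_mean(x_web, y_web, x_ref)
    reference = sum(y_ref) / len(y_ref)
    return unweighted, weighted, reference


if __name__ == "__main__":
    print(demo())
```

In this sketch each web respondent in a bin effectively receives the weight (reference share of the bin) / (web share of the bin), which is one common way of turning binned propensity scores into weights; the original study may have used a different weighting rule.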

The results are quite striking considering the digital divide in the Italian population (only 31.8% has Internet access): the difference between the weighted and unweighted web results is almost 20%. Bearing in mind that this is an intentionally simple exercise, we want to highlight that more useful results can be obtained when the target population is tailored to the specific goals of the survey and/or when the digital divide is less dramatic.


Certainly, the application above is just a simple case; we only outline the logic by which the propensity score method corrects selection bias.

We should also be clear that PSM has several limitations:

Large samples are required; group overlap must be substantial; hidden bias may remain because matching only controls for observed variables (to the extent that they are perfectly measured) (Shadish, Cook, & Campbell, 2002).

Analysis:

We saw the propensity score matching model as a method which, by emulating a controlled experiment with a comparison group, identifies the mechanism of selection in our estimation and gives us direction to adjust our estimates. However, social experiments are costly and the identifying assumptions required to justify them are not always satisfied. Nonetheless, it is widely held that there is no valid alternative to experimentation as a method for evaluating social programs (see, e.g., Burtless, 1995). There are other methods which combine experimental and mathematical models to identify and remove selection bias. In an important paper, LaLonde (1986) combines data from a social experiment with data from nonexperimental comparison groups to evaluate the performance of many commonly used nonexperimental estimators. For the particular group of parametric estimators that he investigates, and for his particular choices of regressors, he concludes that the estimators chosen by econometric model selection criteria produce a range of impact estimates that is unacceptably large.2

Stolzenberg and Relles, in their article, try to extract intuitive insights for non-econometricians and qualitative researchers. They state that they provide mathematical tools to assist intuition about selection bias in concrete empirical analyses. These new tools do not offer a general solution to the selection bias problem3, and we can claim that our work in this project is much more intuitive for non-econometricians than theirs. However, their statements gave us the direction to correct our approach and, judging by how frequently they are cited, we understood that they started a large wave of work on understanding the limitations of Heckman models and their solutions. They mention that the Heckman model does not necessarily improve the solution, and that it can only be used alongside other estimates to discuss the range of validity of the results, as we discussed before. Besides, they mention that Heckman and other mathematical models usually do not provide enough intuition and direction for correcting answers, and they insist on implementing different experiments and exploiting the researcher's own insight for detecting and adjusting selection bias problems. Heckman himself, in his later articles, works on papers which identify selection bias by social experiments and then compare the experimental results with those of econometric models.4

2 Characterizing Selection Bias Using Experimental Data, James Heckman, Hidehiko Ichimura, Jeffrey Smith, August 1997

3 Stolzenberg R. M. and Relles D. A., Tools for intuition about sample selection bias and its correction, American Sociological Review, Vol. 62, N°3, 1997

4 Characterizing Selection Bias Using Experimental Data, James Heckman, Hidehiko Ichimura, Jeffrey Smith, August 1997

Conclusion:

In this paper, we introduced and identified important types of selection bias. As the most important models, we studied the Tobit and probit models, and identified the Heckman model as a specific case of the Tobit model. We gave an overview of the different types of Tobit model and their possible solutions and conditions. In particular, we studied the Tobit 1 model in detail. We acknowledged that the Tobit model has conditions which cannot be easily satisfied, and nonlinear equations which are not always easy to solve. Then we studied the Heckman model as a special case in selection bias.

Finally, we argued that there is no single reliable and comprehensive answer to selection bias and that, given the complexity of the relations involved, different methods should be used or avoided depending on the case. Some papers argue that in some circumstances the Heckman model should be avoided; conversely, in some areas its effectiveness is so well established that researchers have developed classical solutions for certain kinds of problems. Besides, selection bias is today so widespread a topic in econometrics that it has become a necessary part of research. However, there is a growing trend towards using experimental methods to identify and correct selection bias. In this respect we studied propensity score matching as a way of identifying selection bias by comparison with control groups, and of adjusting our data selection and treatment to remove selection bias.

In addition, we saw how researchers approach the question with different experimental and econometric methods, especially using conditional probabilities. They calculate the relevant parameters with different methods and discuss the different results and their credibility. Once they reach a generalization about some special case, researchers identify it as a classical solution and later use it as a standard test.

Likewise, for each variant of the probit models, whose solutions often become nonlinear and complicated, researchers are trying to use simple, heuristic models for each case. From what we have learned in this project, if you are a researcher in management and want to evaluate selection bias, our first advice is to start by working with a mathematician, because the tools are still not independent of mathematics, and econometrics remains the main body of knowledge creation. The proof of our claim is the high level of mathematics in papers that intend to give qualitative insights. This is why we found workshops and courses at universities in the USA and the UK, such as Oxford, that were specially developed to teach selection bias to students of the human sciences. However, controlled experiments can be very useful for qualitative researchers to identify and adjust for selection bias. Besides, one shortcut is to find an article, either in econometrics or in your own discipline, that has already worked on your subject. In other words, check whether you can find classical solutions to your problem.


APPENDIX:

Software and Toolkits:

Short Toolkit for using Tobit modeling with EasyReg

1) Introduction

Using the Tobit model with a purely mathematical approach is very difficult: there are many types of Tobit model and several hypotheses to consider before reaching a final result. In this section we give a short summary of how to use the EasyReg software to estimate a Tobit model, starting from your gathered empirical data.

2) The data

The data have been generated artificially as follows. The independent variables $X_{1,j}$ and $X_{2,j}$ and the error $U_j$, for $j = 1, \ldots, n = 500$, have been drawn independently from the standard normal distribution, and $Y_j$ has been generated as:

$Y_j = \max(0, X_{1,j} + X_{2,j} + U_j)$.

Thus, if an intercept is included in the model, so that the vectors of regressors are

$X_j = (X_{1,j}, X_{2,j}, 1)'$,

then the true parameter vector is $\beta = (\beta_1, \beta_2, \beta_3)'$, where

$\beta_1 = 1$

$\beta_2 = 1$

$\beta_3 = 0$

Moreover, the true value of $\sigma$ is

$\sigma = 1$
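For readers who want to reproduce a data set of this kind outside EasyReg, here is a minimal sketch of the generating process. It assumes the draws are standard normal; the function name `generate_tobit_data` and the seed are hypothetical, so the values will not match those in Tobit_Data.TXT.

```python
import random


def generate_tobit_data(n=500, seed=0):
    """Return rows (j, y, x1, x2) with y = max(0, x1 + x2 + u) and x1, x2, u ~ N(0, 1)."""
    rng = random.Random(seed)
    rows = []
    for j in range(1, n + 1):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        y = max(0.0, x1 + x2 + u)  # censoring at zero
        rows.append((j, y, x1, x2))
    return rows


if __name__ == "__main__":
    data = generate_tobit_data()
    zeros = sum(1 for _, y, _, _ in data if y == 0.0)
    print(f"{zeros} of {len(data)} observations are censored at zero")
```

Because the latent variable is symmetric around zero, roughly half of the observations should be censored.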

The data file involved is Tobit_Data.TXT, which begins like this:

Observation  Y            X1            X2            Z
1            0.000000000  -1.463631868  -0.640421391  0
2            0.000000000   0.427667916  -0.219542548  0

The file is in the former EasyReg default format. It also contains a variable Z, which is used and explained later (see the guided tour on importing data files in EasyReg space-delimited text format in the EasyReg software book).

3) How to estimate a Tobit model with EasyReg

Now open "Menu > Single equation models > Tobit models" in the EasyReg main window, select the variables Y, X1 and X2, and keep the default intercept, as when running an OLS regression with intercept (see http://econ.la.psu.edu/~hbierens/EasyRegTours/OLS.HTM), until you arrive at the following window.


In general there is no need to adjust the stopping rules of the Newton

iteration which is used to maximize the likelihood function. Thus, click

"Tobit analysis". Then after a few seconds the maximum likelihood

estimation results appear:

If you click "Continue", the module NEXTMENU will be activated:


You have seen this window before after running an OLS regression, so no

further explanation is necessary.

The output is listed below. Note that the option "Wald test of linear parameter restrictions" has been used to test the joint null hypothesis:

$\beta_1 = 1$

$\beta_2 = 1$

$\beta_3 = 0$

This hypothesis is, of course, not rejected at any reasonable significance level.
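The likelihood that the Newton iteration in this section maximizes can be written down directly. The following is a minimal sketch of the Tobit log-likelihood for the two-regressor model with intercept, not EasyReg's own code; the function name `tobit_loglik` is hypothetical. The first two rows are the observations shown in the data section; the third row is an invented uncensored observation, added only to exercise the positive branch.

```python
import math


def tobit_loglik(rows, b, s):
    """Log-likelihood of y = max(0, b1*x1 + b2*x2 + b3 + u) with u ~ N(0, s^2)."""
    b1, b2, b3 = b
    total = 0.0
    for y, x1, x2 in rows:
        xb = b1 * x1 + b2 * x2 + b3
        if y > 0:
            # Uncensored observation: normal density of the residual.
            z = (y - xb) / s
            total += -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * z * z
        else:
            # Censored observation: P[y* <= 0] = Phi(-xb / s), written with erfc.
            total += math.log(0.5 * math.erfc(xb / (s * math.sqrt(2.0))))
    return total


rows = [
    (0.0, -1.463631868, -0.640421391),  # observation 1 from the data file
    (0.0, 0.427667916, -0.219542548),   # observation 2 from the data file
    (1.5, 1.0, 0.2),                    # hypothetical uncensored observation
]

if __name__ == "__main__":
    print(tobit_loglik(rows, (1.0, 1.0, 0.0), 1.0))    # at the true parameters
    print(tobit_loglik(rows, (-1.0, -1.0, 0.0), 1.0))  # at deliberately wrong parameters
```

A maximum likelihood routine searches for the values of b and s that make this function as large as possible; on these three rows the true parameters already give a higher value than the deliberately wrong ones.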

4) The output

Tobit model:

y = y* if y* > 0, y = 0 if y* <= 0, where y* = b'x + u

with x the vector of regressors, b the parameter vector,

and u a N(0,s^2) distributed error term.

Dependent variable:

Y = Y

Characteristics:

Y


First observation = 1

Last observation = 500

Number of usable observations: 500

Minimum value: 0.0000000E+000

Maximum value: 5.4575438E+000

Sample mean: 7.2127526E-001

This variable is nonnegative, with 244 zero values.

A Tobit model is therefore suitable

X variables:

X(1) = X1

X(2) = X2

X(3) = 1

Frequency of Y = 0: 48.80%

(244 out of 500)

Newton iteration succesfully completed after 5 iterations

Last absolute parameter change = 0.0001

Last percentage change of the likelihood = 0.0603

Tobit model: Y = max(Y*,0), with

Y* = b(1)X(1) + b(2)X(2) + b(3)X(3) + u,

where u is distributed N(0,s^2), conditional on the X variables.

Maximum likelihood estimation results:

Variable             ML estimates        (t-value)   [p-value]
x(1)=X1              b(1)=  1.0547731    (17.0084)   [0.00000]
x(2)=X2              b(2)=  0.9905518    (15.2253)   [0.00000]
x(3)=1               b(3)= -0.0243418    (-0.3450)   [0.73011]
standard error of u  s=     1.0635295    (21.9209)   [0.00000]

[The p-values are two-sided and based on the normal approximation]

Log likelihood: -4.74065017126E+002

Pseudo R^2: 0.60984

Sample size (n): 500


Information criteria:

Akaike: 1.912260069

Hannan-Quinn: 1.925490511

Schwarz: 1.945976933

If the model is correctly specified then the maximum likelihood

parameter estimators b(1),..,b(3), minus their true values, times the

square root of the sample size n, are (asymptotically) jointly normally

distributed with zero mean vector and variance matrix:

1.92290870E+00 6.77554263E-01 -9.38221607E-01

5.37455447E-01 2.11638376E+00 -9.79444588E-01

-9.81136382E-01 -1.09217153E+00 2.48931672E+00

Wald test:

x(1)=X1 b(1)= 1.0547731 (17.0084)(*)

x(2)=X2 b(2)= 0.9905518 (15.2253)(*)

x(3)=1 b(3)= -0.0243418 (-0.3450)(*)

(*): Parameters to be tested

Null hypothesis:

1.x(1)+0.x(2)+0.x(3) = 1.

0.x(1)+1.x(2)+0.x(3) = 1.

0.x(1)+0.x(2)+1.x(3) = 0.

Null hypothesis in matrix form: Rb = c, where

R =

1. 0. 0.

0. 1. 0.

0. 0. 1.

and c =

1.

1.

0.

Wald test statistic: 0.98

Asymptotic null distribution: Chi-square(3)

p-value = 0.80630

Significance levels: 10% 5%

Critical values: 6.25 7.81

Conclusions: accept accept
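The p-value and critical values reported for the Wald test can be checked by hand. The sketch below uses the closed-form upper-tail probability of a chi-square distribution with 3 degrees of freedom; the function name `chi2_sf_3df` is hypothetical, and the input 0.98 is the rounded statistic printed above, so the result can differ slightly from the reported p-value.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability P[X > x] for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


if __name__ == "__main__":
    print(f"p-value for Wald statistic 0.98: {chi2_sf_3df(0.98):.4f}")
    print(f"tail probability at 6.25:        {chi2_sf_3df(6.25):.4f}")
    print(f"tail probability at 7.81:        {chi2_sf_3df(7.81):.4f}")
```

The tail probabilities at 6.25 and 7.81 come out close to 10% and 5%, which is why these are the critical values, and the statistic 0.98 lies far below both.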


5) An inappropriate attempt to conduct Tobit analysis

As an example of a case for which EasyReg refuses to conduct Tobit

analysis, select the variables Z, X1, X2 and the constant 1 for the

intercept, and declare Z the dependent variable. Then you will get stuck

here:

The problem is that Z is discrete, because I have generated it as

Z = Int(100*Y)

where the "Int" function truncates its argument to an integer, by cutting off

all the digits after the decimal symbol (a dot "." in the US, a comma "," in

Europe). But the Tobit model assumes that Z has a continuous

distribution, conditional on Z > 0 and X1 and X2, so that the assumptions

of the Tobit model do not hold. Therefore, in order to prevent you from

doing bad econometrics, EasyReg will not allow you to continue.

In view of the queries I have gotten about this issue, the message in this

window may not be clear enough. If so, click the "Yes" button, which

opens a PDF file:


6) What to do if the dependent variable Y is confined to a

bounded interval?

a. The case $Y \in (a,b]$

If the observed dependent variable $Y$ is confined to an interval $(a,b]$, where $-\infty < a < b < \infty$, with $P[Y = b] > 0$, it is possible to transform $Y$ to a new dependent variable $Z$, say, such that $Z \in [0,\infty)$ and $P[Z = 0] = P[Y = b] > 0$, namely $Z = -\ln[(Y - a)/(b - a)]$. Next, assume that $Z = \max(0, Z^*)$, where $Z^* = \beta'X + U$. Then

$Y = \min(b, a + (b - a)\exp(-Z^*)) = \min(b, a + (b - a)\exp(-\beta'X - U))$.

To create this variable Z, open Menu > Input > Transform variables, and

conduct the following transformations:

1. Click the "Constant = 1" button. Then a new variable "1" is created, which has the value 1 for all observations.

2. Click the "Linear combination of variables" button, select "1" and use the value of a as coefficient. Then a new variable with name "ax1" is created, which has the value a for all observations. I will assume that you have renamed the variable "ax1" as variable A.

3. Click the "Linear combination of variables" button, select "1" and use the value of b as coefficient. Then a new variable with name "bx1" is created, which has the value b for all observations. I will assume that you have renamed the variable "bx1" as variable B.

4. Click the "Linear combination of variables" button, select the variables Y and A, and create the linear combination Y-A. I will assume that you have renamed Y-A as YminA. Note that now YminA $\in (0, b-a]$.

5. Click the "Linear combination of variables" button, select the variables B and A, and create the linear combination B-A. I will assume that you have renamed B-A as BminA. Note that BminA is a constant with value b-a for all observations.

6. Click the "Multiplicative transformation of variables" button, select the variables YminA and BminA and use the powers 1 and -1, respectively, to create the new variable "YminA x BminA^-1". I will assume that you have renamed this new variable as YminA/BminA. Note that YminA/BminA $\in (0, 1]$.

7. Click the "LOG transformation: x -> ln(x)" button, and select the variable YminA/BminA. Then the new variable LN[YminA/BminA] will be created. Note that LN[YminA/BminA] $\in (-\infty, 0]$.

8. Click the "Linear combination of variables" button, select the variable LN[YminA/BminA], and use the coefficient -1 to create the variable -LN[YminA/BminA]. I will assume that you have renamed this variable as Z. Thus, Z = -LN[YminA/BminA]. Now Z $\in [0, \infty)$, and P[Z = 0] = P[Y = b] > 0.

The new variable Z in step 8 can now be used as the dependent variable in a Tobit model. However, keep in mind that in this case a negative coefficient of an X variable implies a positive effect on the original dependent variable Y, because $\partial Z/\partial Y = -1/(Y - a) < 0$, hence $\partial Y/\partial Z < 0$. Needless to say, if a = 0 and b = 1 then you can skip steps 1 to 6 and use Y instead of YminA/BminA in step 7.
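Outside EasyReg, the eight menu steps above collapse into a single formula. The sketch below shows the transformation for the case (a, b] and its censored inverse; the function names `to_tobit_scale` and `from_tobit_scale` are hypothetical and are not part of EasyReg.

```python
import math


def to_tobit_scale(y, a, b):
    """Map Y in (a, b] to Z = -ln[(Y - a)/(b - a)] in [0, inf); Y = b maps to Z = 0."""
    if not (a < y <= b):
        raise ValueError("y must lie in the interval (a, b]")
    return -math.log((y - a) / (b - a))


def from_tobit_scale(z_star, a, b):
    """Inverse map with censoring: Y = min(b, a + (b - a) * exp(-Z*))."""
    return min(b, a + (b - a) * math.exp(-z_star))


if __name__ == "__main__":
    a, b = 0.0, 10.0
    for y in (10.0, 4.0, 0.5):
        z = to_tobit_scale(y, a, b)
        print(y, z, from_tobit_scale(z, a, b))
```

Larger values of Z correspond to smaller values of Y, which is the sign reversal mentioned in the text: a negative coefficient in the Tobit model for Z means a positive effect on Y.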

b. The case $Y \in [a,b)$

If $Y \in [a,b)$, where $-\infty < a < b < \infty$, with $P[Y = a] > 0$, then $Z = -\ln[(b - Y)/(b - a)] \in [0,\infty)$, with $P[Z = 0] = P[Y = a] > 0$. This variable Z can be created similarly to the previous steps 1 to 8, and can be used as the new dependent variable in a Tobit model. Since now $\partial Y/\partial Z > 0$, a positive coefficient of an X variable implies a positive effect of this X variable on Y.

Note that now we model the conditional distribution of Y by

$Y = \max(a, b - (b - a)\exp(-Z^*)) = \max(a, b - (b - a)\exp(-\beta'X - U))$.

c. The case $Y \in [a,b]$

This case cannot be handled by standard Tobit analysis.

Some specialized software is listed below.

LIMDEP offers the widest variety of sample selection models. Information regarding LIMDEP can be found at www.limdep.com. Also, a student version, along with documentation, can be downloaded free from ww.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm. The student version of this software and accompanying data sets are included with Greene's (2000) text.

For SAS users, Jaeger (1993) provides the code for performing Heckman's two-step estimation of sample selection bias. This program can be downloaded from the SAS Institute web page using the following link (http://ftp.sas.com/techsup/download/stat/heckman.html). Some adjustments to the code are necessary for the program to work (i.e., your own variable names must be inserted).

Finally, Stata 7 (2001) (http://www.stata.com/site.html) can also be used to estimate Heckman's (1976, 1979) two-step detection and correction of sample selection bias. Specific programming information can be found at the following Stata link (http://www.stata.com/help.cgi?heckman).

Most software packages, such as Stata, SAS, and SPSS, provide special tests for these problems. If you search the help of these packages for Heckman, Tobit, selection bias and, more importantly, endogeneity, you will find the related syntax, which will simplify your work significantly.

For Stata, we observed many times that these procedures are well documented and can usually be identified and applied. Here you can see the result of a search in Stata's help:

-------------------------------------------------------------------------------

search for selection bias (manual: [R] search)

-------------------------------------------------------------------------------

Keywords: selection bias

Search: (1) Official help files, FAQs, Examples, SJs, and STBs

(2) Web resources from Stata and from other users

Search of official help files, FAQs, Examples, SJs, and STBs

[R] heckman . . . . . . . . . . . . . . . . . . . Heckman selection model

(help heckman)

[SVY] svy: heckman . . . . . . . . Heckman selection model for survey data

(help svy: heckman)

FAQ . . . . . . . . . . . . . . . Endogeneity versus sample selection bias

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. Millimet

10/01 What is the difference between 'endogeneity' and

'sample selection bias'?

http://www.stata.com/support/faqs/stat/bias.html

FAQ . . . . . . . . . . . . . . Determining the sample for a Heckman model

. . . . . . . . . . . . . . . . . . . . . . . V. Wiggins and W. Gould

Besides, in Stata, a search with a well-chosen phrase may link you through the web to other users who have worked on your subject. You can also contact the company that develops the software and ask your specific questions.
