Labour Market Evaluation: Theory and Practice

Seamus McGuinness, 8th November 2013

Why is Evaluation Necessary?

• It assesses the extent to which policy initiatives are achieving their expected targets and goals.

• Drawing on this, the evaluator will identify the nature of any shortfalls in either programme delivery or the stated objectives.

• Value for money from the perspective of the taxpayer is also likely to be a dominant feature of any evaluation.

• It fulfils a vital policy challenge role within society, helping to ensure that policy is evidence based and that ineffective programmes are modified or closed.

What are the most common forms of labour market evaluation?

• Generally, labour economists tend to focus on impact evaluation (is the programme achieving its desired impacts?).

• Process evaluation (is the programme being delivered as intended?) is less common.

• However, in practice most impact evaluations will also consider the efficiency of programme delivery and implementation.

• The bulk of impact evaluations focus on labour market programmes that are designed to improve outcomes related to employment, earnings and labour market participation.

Main Barriers to Effective Independent Evaluation

• Lack of an evaluation culture: Policy makers may view evaluation as a threat and actively seek a less rigorous form of assessment.

• The organisation being evaluated has the power to set the terms of reference and is invariably involved in choosing the evaluating body.

• Stemming from this, little consideration is often given to evaluation at the programme design and implementation stage (often there is no viable control group with which to assess the counterfactual).

• Data constraints: A lack of available and “linkable” administrative datasets also makes proper evaluation difficult.

Measuring a programme’s impact

• Not at all straightforward: there have been instances where different researchers have arrived at very different conclusions regarding a programme’s impact.

• We basically need to know what would have happened to individuals had the programme not been in place, i.e. we attempt to measure the counterfactual.

• There are various methods for estimating the counterfactual. They all generally rely on measuring the difference in outcomes between people participating in the programme (the treatment group) and those eligible for the programme but not participating in it (the control group).

The Selection Problem

• Comparison of a treatment and control group is not straightforward: assignment to either group is rarely random, so substantial differences may exist between the two groups that must be factored out.

• Such differences can also arise as a consequence of ineffective control group construction.

• Non-random selection refers to the possibility that (a) programme administrators engage in “picking winners” in order to ensure the programme’s success or (b) more capable individuals are more likely to put themselves forward for intervention.

• Failure to account for this will result in a serious over-estimate of the programme’s effectiveness.
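The over-estimate described above can be illustrated with a small simulation. This is a minimal sketch on made-up data, not drawn from any real programme: the variable `ability`, the true effect of 0.10 and both assignment rules are assumptions chosen purely for illustration.

```python
import random
from statistics import mean

def simulate(pick_winners, n=20000, true_effect=0.10, seed=0):
    """Return the raw treatment-minus-control gap in employment rates."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ability = rng.random()  # unobserved by the evaluator
        if pick_winners:
            # Administrators favour more capable applicants.
            in_programme = rng.random() < ability
        else:
            # Random assignment: a coin flip.
            in_programme = rng.random() < 0.5
        # Employment chances rise with ability, plus the true programme effect.
        p_employed = 0.2 + 0.5 * ability + (true_effect if in_programme else 0.0)
        employed = 1 if rng.random() < p_employed else 0
        (treated if in_programme else control).append(employed)
    return mean(treated) - mean(control)

random_gap = simulate(pick_winners=False)
selected_gap = simulate(pick_winners=True)
print(f"random assignment: {random_gap:.3f}")
print(f"picking winners:   {selected_gap:.3f}")
```

Under random assignment the raw gap sits close to the assumed true effect; under "picking winners" it is inflated because the treatment group was more capable to begin with.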

Programme design and control group delivery

• Piloting the programme.

• Rolling out the programme to different areas at different times.

• Ensuring access to administrative data on the targeted population (for instance live register data).

• Keeping records of unsuccessful applicants to the programme in instances where the demand for programme places exceeds supply.

Ineffective Control Group Construction

• In evaluating the National Employment Action Plan (NEAP) in 2005, Indecon consultants compared a treatment group of 1000 NEAP claimants (by definition first-time claimants) with a control group of 225 unemployed (non-NEAP) individuals taken from the ECHPS, 58% of whom were already long-term (LT) unemployed at the initial point of observation. By definition, none of the NEAP treatment group will have been LT unemployed.

• Indecon then compared the unemployment rates of the control and treatment groups 24 months down the line and concluded that the treatment group fared much better and that the NEAP programme was, therefore, effective.

• Does this represent a like for like comparison?

Methods Used for Overcoming the Selection Problem

• Difference in Difference Estimator: This is a two-period estimator and requires that the treatment is introduced in the second time period. It is a powerful approach, as it will eradicate non-random selection based on time-invariant unobserved attributes (picking winners etc).

• Matching Estimators: These try to match control and treatment group members on observable characteristics (education, age, labour market history etc) to ensure a like-for-like comparison (consider the earlier NEAP example). May still be prone to unobserved influences?

• Other methods do exist such as the use of controlled experiments but these are rarely seen in the context of labour market evaluation.

Difference in Difference

• Period 1: Outcome Y (say earnings) is determined by observable characteristics X (age, education, labour market experience etc) and unobservable factors that do not change over time I (innate ability, motivation etc).

• Period 2: The outcome variable is determined as in period 1, but say a labour market training programme (a treatment T) is now present.

• By differencing across the same individuals in the two periods we can both isolate T and remove the impact of time-invariant (and often unobserved) factors.

Difference in Difference

Period 1: $Y_{t_0} = \beta X_{it_0} + I_i$

Period 2: $Y_{t_1} = \beta X_{it_1} + I_i + T_{t_1}$

Difference: $Y_{t_1} - Y_{t_0} = \beta (X_{it_1} - X_{it_0}) + T_{t_1}$
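The differencing argument can be checked numerically. The sketch below uses simulated data only; `TRUE_EFFECT`, `BETA`, the noise levels and the rule that abler people select into treatment are all illustrative assumptions. X is held fixed across the two periods, so the X term drops out of the difference.

```python
import random
from statistics import mean

rng = random.Random(1)
TRUE_EFFECT = 2.0  # hypothetical earnings gain from the programme
BETA = 1.5         # hypothetical return to the observable characteristic X

levels = {True: [], False: []}   # period-2 outcomes by treatment status
changes = {True: [], False: []}  # period-2 minus period-1 outcomes

for _ in range(10000):
    x = rng.gauss(0, 1)                      # X: observable, fixed over time here
    ability = rng.gauss(0, 1)                # I: unobserved and time invariant
    treated = ability + rng.gauss(0, 1) > 0  # abler people are more likely to be treated
    y0 = BETA * x + ability + rng.gauss(0, 0.5)
    y1 = BETA * x + ability + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0, 0.5)
    levels[treated].append(y1)
    changes[treated].append(y1 - y0)

# Comparing period-2 levels mixes the programme effect with ability.
cross_section = mean(levels[True]) - mean(levels[False])
# Comparing changes removes ability, because it is the same in both periods.
differenced = mean(changes[True]) - mean(changes[False])
print(f"cross-section gap: {cross_section:.2f}")
print(f"differenced gap:   {differenced:.2f}")
```

The cross-section gap overstates the assumed effect, while the differenced gap lands close to it, because the time-invariant term cancels when each person is compared with themselves.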

Example of a difference in difference approach

• Say we plan to introduce a new unemployment activation measure in June 2013 in County Dublin.

• Our control group would be the rest of the country, which would not receive the measure (until perhaps 2014). We would estimate a model comparing exits from unemployment in Dublin with those in the rest of Ireland over both periods (2012 and 2013).

• Any change in the margin of difference in Dublin exit rates (relative to the rest of Ireland) between the two periods will be interpreted as the impact of the programme.

Model Estimation

• dt = dummy variable for treatment group (Dublin area), will pick up any differences between the treatment and control groups prior to the policy change.

• T is a dummy variable for time period 2 and measures the extent to which the value of y increased or fell in period 2 independent of anything else.

• dt*T will be = 1 for those individuals in the treatment group receiving intervention in the second period. It is therefore a measure of the impact of the policy.

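$$Y = \beta_0 + \beta_1\, dt + \beta_2 T_2 + \beta_3 (dt \times T_2) + \varepsilon_t$$

$$\beta_3 = (\bar{Y}_2^{treat} - \bar{Y}_1^{treat}) - (\bar{Y}_2^{cont} - \bar{Y}_1^{cont})$$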
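Because the model contains only the two dummies and their interaction, its least-squares coefficients can be read straight off the four group-by-period means. The sketch below applies this to the Dublin example using invented exit rates; the function names `did_coefficients` and `make_cell` and every number in it are hypothetical.

```python
from statistics import mean

def did_coefficients(rows):
    """rows: (dt, period, y) tuples with dt, period in {0, 1}.

    Returns (b0, b1, b2, b3) for y = b0 + b1*dt + b2*T + b3*dt*T,
    computed from the four cell means.
    """
    cells = {(d, t): [] for d in (0, 1) for t in (0, 1)}
    for dt, period, y in rows:
        cells[(dt, period)].append(y)
    m = {key: mean(values) for key, values in cells.items()}
    b0 = m[(0, 0)]                                            # control, period 1
    b1 = m[(1, 0)] - m[(0, 0)]                                # pre-existing group gap
    b2 = m[(0, 1)] - m[(0, 0)]                                # common time trend
    b3 = (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])    # policy impact
    return b0, b1, b2, b3

def make_cell(dt, period, exits, n=100):
    """n claimants in one cell, `exits` of whom leave unemployment (y = 1)."""
    return [(dt, period, 1)] * exits + [(dt, period, 0)] * (n - exits)

# Hypothetical exit rates: Dublin 30% then 38%, rest of Ireland 32% then 34%.
rows = (make_cell(1, 0, 30) + make_cell(1, 1, 38)
        + make_cell(0, 0, 32) + make_cell(0, 1, 34))
b0, b1, b2, b3 = did_coefficients(rows)
print(round(b3, 3))
```

With these invented figures Dublin improves by 8 points and the rest of Ireland by 2, so the interaction coefficient is 0.06: the part of Dublin's improvement not shared with the control area.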

Difference in Difference

• Really powerful tool for eradicating unobserved bias (“picking winners”, self-selection etc).

• Requires little data.

• Requires that policy be implemented in a rolled-out fashion, e.g. across regions over time; this is not always appreciated by policy makers.

• Is it sufficient to deal with selection bias on observables?

Propensity Score Matching

• This technique allows us to deal explicitly with the problem of differences in the characteristic make-up of the control and treatment groups that have the potential to bias our estimate of the programme impact.

• For example, say we have an active labour market programme aimed at reducing unemployment and the control group contains a higher proportion of LT unemployed. Failure to control for this will upwardly bias the estimated programme impact, as the control group, almost by definition, has a lower likelihood of labour market success even before any programme impacts arise.

• Basically, chances are that if you compare the proportions in employment of both groups at a future point in time, even in the absence of any labour market programme, the treatment group will have performed better.

• Thus the problem we must confront is that the estimated programme impact may simply be driven up by, or be entirely attributable to, differences in the characteristic make-up of our control and treatment groups.
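A short worked calculation shows how composition alone can produce an apparent impact. The LT shares (0% and 58%) echo the earlier NEAP example; the two employment rates are invented for illustration, and the programme is assumed to have no effect at all.

```python
def employment_rate(share_lt, rate_st, rate_lt):
    """Expected employment rate of a group given its share of LT unemployed."""
    return (1 - share_lt) * rate_st + share_lt * rate_lt

RATE_ST = 0.60  # hypothetical later employment rate, short-term unemployed
RATE_LT = 0.30  # hypothetical later employment rate, long-term unemployed

# The programme is assumed to do nothing, so both groups share these rates.
treatment_rate = employment_rate(0.00, RATE_ST, RATE_LT)  # no LT unemployed
control_rate = employment_rate(0.58, RATE_ST, RATE_LT)    # 58% LT unemployed

naive_gap = treatment_rate - control_rate
# Like for like: short-term unemployed compared with short-term unemployed.
like_for_like_gap = RATE_ST - RATE_ST

print(f"naive gap:         {naive_gap:.3f}")
print(f"like-for-like gap: {like_for_like_gap:.3f}")
```

The naive comparison shows a gap of about 17 percentage points even though the assumed programme effect is zero; the whole gap comes from the different mix of claimants.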

What does PSM do?

• It is a method that allows us to match the treatment and control groups on the basis of observable characteristics to ensure we are making a like-for-like comparison.

• After matching has been completed, we simply compare the mean outcomes (e.g. employment rates) of the control and treatment groups to see which is higher.

How do we match?

• We estimate a probit (1,0) model of treatment group membership. This identifies the main characteristics that separate the control group from the treatment group.

• Every member of the control and treatment groups is then assigned a probability of being in the treatment group, based on their characteristics.

• Each member of the treatment group is then “matched” with a member of the control group with a similar probability score.

• It can be shown that matching on probability score is equivalent to matching on actual characteristics.

• This process ensures that the treatment and control groups are similar in terms of their observable characteristics.
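The matching steps above can be sketched end to end on simulated data. Two simplifications are assumptions of this sketch rather than the method described: a logit fitted by plain gradient ascent stands in for the probit (to keep the example within the Python standard library), and there is a single observable characteristic `x`. `TRUE_EFFECT` and all data are invented.

```python
import math
import random
from statistics import mean

rng = random.Random(42)
TRUE_EFFECT = 1.0  # hypothetical programme effect on the outcome

# x raises both the chance of treatment and the outcome, so a raw
# comparison of treated and untreated people is biased upwards.
people = []
for _ in range(1500):
    x = rng.gauss(0, 1)
    treated = rng.random() < 1 / (1 + math.exp(-x))
    y = x + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0, 0.5)
    people.append((x, treated, y))

# Step 1: model treatment-group membership (logit, gradient ascent).
a, b = 0.0, 0.0
for step in range(300):
    grad_a = grad_b = 0.0
    for x, treated, _y in people:
        err = (1.0 if treated else 0.0) - 1 / (1 + math.exp(-(a + b * x)))
        grad_a += err
        grad_b += err * x
    a += 0.5 * grad_a / len(people)
    b += 0.5 * grad_b / len(people)

# Step 2: give everyone a predicted probability of treatment.
def score(x):
    return 1 / (1 + math.exp(-(a + b * x)))

treated_group = [(score(x), y) for x, t, y in people if t]
control_group = [(score(x), y) for x, t, y in people if not t]

# Step 3: match each treated person to the control with the closest score
# (with replacement), then average the outcome gaps.
gaps = []
for s_t, y_t in treated_group:
    s_c, y_c = min(control_group, key=lambda c: abs(c[0] - s_t))
    gaps.append(y_t - y_c)

naive = mean(y for _, y in treated_group) - mean(y for _, y in control_group)
matched = mean(gaps)
print(f"naive gap:   {naive:.2f}")
print(f"matched gap: {matched:.2f}")
```

Matching with replacement is a design choice made here for brevity; it lets one control serve several treated members when close matches are scarce.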

Matching

• Clearly again a powerful tool and the most effective for tackling the sample selection problem.

• Requires a lot of data and additional checks to ensure that matching was successful and all observable differences between the control and treatment groups were eradicated.

• Does not deal with unobserved bias.
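One commonly used check that matching was successful is the standardised difference in means of each observable, before and after matching. The sketch below is an illustration of that diagnostic, not necessarily the check used in this study; the function name and the ages are made up.

```python
import math
from statistics import mean, variance

def standardised_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical ages of treatment-group members and of two candidate
# control groups: all eligible controls, and only the matched controls.
treated_ages = [30, 40, 50]
all_control_ages = [20, 30, 40]
matched_control_ages = [29, 41, 50]

before = standardised_difference(treated_ages, all_control_ages)
after = standardised_difference(treated_ages, matched_control_ages)
print(f"before matching: {before:.2f}")
print(f"after matching:  {after:.2f}")
```

A value near zero after matching indicates that the groups are balanced on that characteristic; the check would be repeated for every observable in the matching model.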

Carrots without Sticks: An Evaluation of Active Labour Market Policy in Ireland

Seamus McGuinness, Philip O'Connell & Elish Kelly

Overview

• This study focuses on assessing the effectiveness of the Job Search Assistance (JSA) component of the National Employment Action Plan (NEAP). The NEAP is Ireland’s principal tool for activating unemployed individuals back into the labour market.

• Under the NEAP, individuals registering for unemployment benefit are “automatically” referred to FÁS for an interview after 13 weeks on the system. The FÁS interview is aimed at helping claimants back into work through advice and placement, and at referring others for further training.

• Individuals with previous exposure to NEAP – i.e. those with a previous history of unemployment – are excluded and will not be referred to FÁS for a second time.

• NEAP was distinct in an international sense in that it was characterised by an almost complete absence of monitoring and sanctions. Unusually, it did not appear to hinge on the principle of mutual obligation.

Evaluation Objectives

• To assess the extent to which individuals participating in the NEAP were more likely to find employment relative to non-participants.

• To assess the extent to which individuals in receipt of both interview and training had enhanced employment prospects relative to those in receipt of interview only (impact of training).

• We are going to focus on the effectiveness of the referral and interview process.

Problem 1: No control group?

• Selection under the NEAP is automated and universal. If all claimants are automatically sent for interview at week 13 of their claim, then how can we construct a counterfactual?

• Remember, the counterfactual assesses what happens to individuals in the absence of the programme.

• The only people not exposed to the programme are those already in employment by week 13. This rules out difference-in-difference for a start.

• The problem illustrates very clearly that the need for proper evaluation was not a major consideration in the programme’s design or implementation.

What can we do?

• The only option is to utilise the fact that individuals with previous exposure to NEAP can’t access it again (a totally counter-intuitive rule, as basically those most in need of support were being excluded from the outset).

• We take as an initial control group individuals who had previous exposure to NEAP more than two years prior to the study and whose contact was limited to a FÁS interview.

• Given the time lapse and changing macroeconomic conditions, any advice received by the control group should have declined in relevance, allowing some assessment of the impact of the programme.

• Even if the above were true, we are still left with a selection problem as, prior to the study, all of the control group will have had a previous unemployment spell of at least 13 weeks whereas none of the treatment group will.

• This difference cannot be eradicated by matching and our estimates are unlikely to be free of bias.

Construction of the Evaluation Data

[Diagram: data sources and datasets used to build the evaluation dataset]

• Weekly Population of Live Register Claimants

• Weekly Population of Live Register Claimant Closure Files

• Profiling Questionnaire Information for Claimant Population Issued June to September 2006

• FÁS Events Histories

• Live Register Claimant Population (September 2006 – June 2008)

• Dataset for NEAP Evaluation

New Control Group Found?

• On linking the data we found that around 25% of new claimants were not being referred by DSP to FÁS after 13 weeks’ unemployment duration, despite these individuals having no previous exposure to the NEAP.

• We need to establish what is going on here. Are we missing something in terms of the referral process? If not, what factors are driving the omission, and are they random?

• A list containing the PPS numbers of our potential new control group was sent to DSP for validation.

Validation checks

• DSP confirmed that individuals had fallen through the net.

• No concrete explanation was found. Most likely, individuals were not referred when the number of referrals in the DSP office exceeded the slots in the local FÁS office, and were subsequently overlooked when slots became available.

• Even before we begin we have uncovered major problems with programme processes, i.e. 25% of potential claimants excluded and a further 25% missed.

• Clear example of how process evaluation becomes a component of an impact evaluation.

The Control Group

• A natural experiment?

[Diagram: Entire Claimant Population from NEAP Evaluation Database, split into the three groups below]

• Treatment Group: New clients qualifying for NEAP intervention and intervened with who were on register for at …

• Control Group I: New clients qualifying for NEAP intervention but not contacted who were in register for at …

• Control Group II: Previous NEAP clients with light interventions with two year gap who were on register for at least 20 weeks (N=3074)

Data and methods

• In terms of econometrics, we estimate probit and matching models augmented by additional checks for unobserved heterogeneity bias.

• All models contain a wide range of controls for educational attainment, health, location attributes, access to transport, age, marital status, labour market history etc, which were available to us as a consequence of the profiling data.

How random are our control groups? Is there a selection problem?

                                Total     Treatment   Control    Control
(%)                             Sample    Group       Group I    Group II

Gender:
  Male                          60.3      58.7        57.3       70.2
  Female                        39.7      41.3        42.7       29.8
Age:
  Age 18-24                     25.4      26.6        29.8       13.3
  Age 25-34                     29.4      26.5        26.7       44.4
  Age 35-44                     21.1      22.7        15.5       23.2
  Age 45-54                     13.9      14.8        11.3       14.4
  Age 55+                       10.2      9.4         16.7       4.7
Marital Status:
  Single                        54.6      52.6        54.7       61.8
  Married                       32.2      34.4        31.7       24.6
  Cohabits                      4.8       4.8         5.0        5.1
  Separated/Divorced            7.4       7.1         7.5        8.0
  Widowed                       0.7       0.7         1.1        0.5
  Children                      25.2      26.3        21.9       23.3
Education/Training:
  Primary or Less               13.4      13.0        13.1       15.0
  Junior Certificate            26.8      25.7        24.8       34.0
  Leaving Certificate           33.7      34.0        35.3       30.2
  Third-level                   26.1      27.3        26.8       20.8
  Literacy/Numeracy Problems    7.7       8.2         6.4        7.5
  English Proficiency           3.8       4.6         3.5        1.7
  Apprenticeship                13.9      12.9        13.8       17.4
Transportation:
  Own Transport                 55.0      54.4        58.7       52.4
  Public Transport              74.0      74.6        71.8       75.0
Employment History:
  Employed in Last Month        54.4      54.4        50.2       60.1
  Employed in Last Year         23.1      24.3        21.1       21.2
  Employed in Last 5 Years      7.3       7.2         6.2        9.0
  Employed Over 6 Years Ago     2.2       2.3         1.7        2.2
  Never Employed                7.9       9.3         8.2        2.7

What are the descriptives telling us?

• The treatment group and control group I look very similar, which would suggest that the “process” that generated control group I was random in nature.

• There are more substantial differences between the treatment group and control group II in that the latter tends to be more disadvantaged in terms of their observable characteristics. Potential for selection bias here.

Kaplan-Meier survival estimate

[Figure: Kaplan-Meier survival curves for the Treatment Group, Control Group 1 and Control Group 2. Vertical axis runs from 0.5 to 1; horizontal axis runs from 20 to 50.]
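For readers unfamiliar with the estimator behind the figure, the sketch below shows how a Kaplan-Meier curve is built: at each time with at least one exit, the surviving share is multiplied by one minus exits divided by the number still at risk. "Survival" is taken here to mean remaining on the Live Register, which is an interpretation of the figure; the function name and the durations are invented.

```python
def kaplan_meier(durations, exited):
    """Return [(time, survival)] at each time with at least one exit.

    durations: time observed for each claimant (e.g. weeks on the register)
    exited:    True if the claimant left, False if still there when last
               observed (censored)
    """
    events = sorted(zip(durations, exited))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        leaving = sum(1 for d, _ in events if d == t)    # exits plus censored
        exits = sum(1 for d, e in events if d == t and e)
        if exits:
            survival *= 1 - exits / at_risk
            curve.append((t, survival))
        at_risk -= leaving
        i += leaving
    return curve

# Hypothetical spells for six claimants.
weeks = [22, 25, 25, 30, 40, 40]
left_register = [True, True, False, True, False, True]
curve = kaplan_meier(weeks, left_register)
for t, s in curve:
    print(t, round(s, 3))
```

Censored claimants still count in the at-risk group up to the week they were last observed, which is what distinguishes this from a simple share-remaining calculation.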

Regular Probit

• These will give us an initial estimate, which may or may not be biased, of the effectiveness of NEAP.

• We want to see that the data is sensible and that relationships move in the correct direction. This is important both in terms of the probit estimate and the reliability of any subsequent matching.

• Provides us with an assurance that there is nothing weird happening with our data.

• Note: The models measure the impact of variables with respect to the claimant’s probability of exiting the Live Register before 12 months.
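As a reminder of how a probit links a variable to a probability, the sketch below converts a probit index into a probability with the standard normal CDF and computes the change in probability when a 0/1 variable switches on. The slides do not say whether the reported figures are index coefficients or marginal effects, so the numbers here are purely hypothetical and are not taken from the results.

```python
from statistics import NormalDist

def probit_probability(index):
    """P(exit) implied by a probit index, via the standard normal CDF."""
    return NormalDist().cdf(index)

def dummy_marginal_effect(base_index, coefficient):
    """Change in probability when a 0/1 variable switches from 0 to 1."""
    return probit_probability(base_index + coefficient) - probit_probability(base_index)

# Hypothetical claimant with a baseline index of 0 (a 50% exit probability)
# and a hypothetical index coefficient of -0.4 on some dummy variable.
effect = dummy_marginal_effect(base_index=0.0, coefficient=-0.4)
print(round(effect, 3))
```

Because the normal CDF is non-linear, the same coefficient implies a different change in probability for claimants starting from a different baseline index.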

                                                    Model 1       Model 2       Model 3
                                                    Both Control  Control       Control
                                                    Groups        Group I       Group II

NEAP Intervention:
  FÁS Referral plus Interview                       -0.07***      -0.16***      0.02
                                                    (0.013)       (0.017)       (0.001)
Personal and Family Characteristics:
  Male                                              0.06***       0.08***       0.06***
                                                    (0.014)       (0.016)       (0.015)
Age Reference Category: Aged 18-24
  Age 25-34                                         -0.06***      -0.02         -0.04**
                                                    (0.018)       (0.023)       (0.020)
  Age 35-44                                         -0.11***      -0.09***      -0.10***
                                                    (0.020)       (0.026)       (0.022)
  Age 45-54                                         -0.10***      -0.08***      -0.08***
                                                    (0.023)       (0.029)       (0.025)
  Age 55+ Years                                     -0.22***      -0.26***      -0.18***
                                                    (0.022)       (0.026)       (0.025)
Health Reference Category: Bad/Very Bad Health
  Very Good Health                                  0.15**        0.13          0.11
                                                    (0.065)       (0.082)       (0.070)
  Good Health                                       0.11*         0.09          0.08
                                                    (0.067)       (0.084)       (0.073)
Marital Status Reference Category: Single
  Married                                           -0.03         -0.04*        -0.02
                                                    (0.021)       (0.025)       (0.023)
  Cohabits                                          -0.02         -0.05         -0.02
                                                    (0.030)       (0.035)       (0.032)
  Separated/Divorced                                -0.06**       -0.07**       -0.06*
                                                    (0.030)       (0.037)       (0.032)
  [row label and coefficients missing]
                                                    (0.075)       (0.083)       (0.084)
  Children                                          -0.04***      -0.04***      -0.04***
                                                    (0.010)       (0.012)       (0.011)
Spousal Earnings Reference Category: None
  Spouse Earnings €250                              0.14***       0.13***       0.16***
                                                    (0.036)       (0.041)       (0.040)
  Spouse Earnings €251-€350                         -0.01         0.02          0.00
                                                    (0.090)       (0.099)       (0.094)
  Spouse Earnings €351 and Above                    -0.05**       -0.06**       -0.05*
                                                    (0.023)       (0.027)       (0.025)
Human Capital Characteristics:
Education Reference Category: Primary or Less
  Junior Certificate                                -0.00         0.02          0.00
                                                    (0.021)       (0.026)       (0.022)
  Leaving Certificate                               0.05**        0.10***       0.04*
                                                    (0.021)       (0.026)       (0.023)
  Third-level                                       0.15***       0.17***       0.14***
                                                    (0.023)       (0.028)       (0.025)
  Apprenticeship                                    0.03          0.01          0.03*
                                                    (0.018)       (0.022)       (0.020)
  Literacy/Numeracy Problems                        -0.06***      -0.06**       -0.06**
                                                    (0.024)       (0.030)       (0.025)
  English Proficiency                               0.01          0.01          0.02
                                                    (0.035)       (0.040)       (0.038)
Employment/Unemployment/Benefit History:
Employment History Reference Category: Never
  Employed in Last Month                            0.08**        0.09*         0.10**
                                                    (0.040)       (0.049)       (0.043)
  Employed in Last Year                             0.06          0.07          0.08*
                                                    (0.042)       (0.051)       (0.046)
  Employed in Last 5 Years                          -0.02         -0.01         0.01
                                                    (0.042)       (0.053)       (0.047)
Job Duration Reference Category: Never Employed
  Job Duration Less than Month                      0.09*         0.12**        0.05
                                                    (0.046)       (0.057)       (0.049)
  Job Duration 1-6 Months                           0.11***       0.16***       0.10**
                                                    (0.038)       (0.048)       (0.041)
  Job Duration 6-12 Months                          0.09**        0.13***       0.06
                                                    (0.040)       (0.051)       (0.043)
  Job Duration 1-2 Years                            0.07*         0.14***       0.04
                                                    (0.041)       (0.051)       (0.043)
  Job Duration 2+ Years                             -0.00         0.02          -0.03
                                                    (0.038)       (0.048)       (0.040)
  Would Move for a Job                              0.05***       0.05***       0.05***
                                                    (0.013)       (0.016)       (0.014)
Social Welfare Payment Type Reference Category: Jobseeker’s Benefit
  Jobseeker’s Allowance                             -0.18***      -0.18***      -0.17***
                                                    (0.015)       (0.018)       (0.016)
  Signing on the Live Register for 12 Months Plus   -0.19***      -0.18***      -0.13***
                                                    (0.017)       (0.034)       (0.019)
  On CE Scheme for 12 Months Plus                   -0.14***      -0.22***      -0.12**
                                                    (0.046)       (0.063)       (0.048)
Geographic Location Information:
Location Reference Category: Rural
  Village                                           -0.02         -0.01         -0.04
                                                    (0.021)       (0.025)       (0.023)
  Town                                              -0.02         -0.01         -0.00
                                                    (0.021)       (0.024)       (0.022)
  Large Town/City                                   -0.02         -0.01         -0.02
                                                    (0.021)       (0.025)       (0.022)

Selection Bias – (see handout)

                                   FÁS Interview         FÁS Interview
                                   (Nearest Neighbour)   (Kernel)
Control Group I & II (Model 1)     -0.074 (0.018)***     -0.076 (0.013)***
Control Group I (Model 2)          -0.149 (0.022)***     -0.151 (0.017)***
Control Group II (Model 3)         0.007 (0.028)         0.006 (0.020)
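The two columns differ in how the comparison outcome is built: nearest-neighbour matching uses the single closest control, whereas kernel matching uses a weighted average of controls, with more weight on those whose scores are closer. A minimal sketch of the kernel version follows, assuming a Gaussian kernel and a bandwidth of 0.06; both are illustrative assumptions, as the slides do not state which kernel or bandwidth was used, and the data are invented.

```python
import math
from statistics import mean

def kernel_matched_effect(treated, controls, bandwidth=0.06):
    """treated, controls: lists of (propensity_score, outcome) pairs.

    Each treated member is compared with a kernel-weighted average of
    control outcomes; the average gap is returned.
    """
    gaps = []
    for score_t, outcome_t in treated:
        weights = [math.exp(-0.5 * ((score_c - score_t) / bandwidth) ** 2)
                   for score_c, _ in controls]
        total = sum(weights)
        if total == 0.0:
            continue  # no control close enough to contribute
        counterfactual = sum(w * y for w, (_, y) in zip(weights, controls)) / total
        gaps.append(outcome_t - counterfactual)
    return mean(gaps)

# Hypothetical scores and outcomes (1 = exited the register, 0 = did not).
treated = [(0.5, 1.0), (0.8, 1.0)]
controls = [(0.5, 0.0), (0.5, 1.0), (0.8, 1.0)]
effect = kernel_matched_effect(treated, controls)
print(round(effect, 3))
```

Because every control contributes something to each comparison, kernel estimates tend to have smaller standard errors than nearest-neighbour estimates, at the cost of using less exact matches.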

Checking our Assumptions 1

Table 4: Probit & PSM Estimates of Treatment Effect: 25 Week Threshold

                                 FÁS Interview       FÁS Interview         FÁS Interview
                                 (Probit)            (Nearest Neighbour)   (Kernel)
Control Group I/II (Model 1)     -0.041 (0.013)***   -0.043 (0.019)***     -0.046 (0.014)***
Control Group I (Model 2)        -0.104 (0.018)***   -0.115 (0.025)***     -0.106 (0.019)***
Control Group II (Model 3)       -0.018 (0.016)      -0.020 (0.029)        0.008 (0.021)

Table 5: Probit & PSM Estimates of Treatment Effect: 30 Week Threshold

Control Group I/II (Model 1)     -0.020 (0.013)*     -0.027 (0.019)        -0.027 (0.013)**
Control Group II (Model 3)       0.027 (0.016)       0.005 (0.029)         0.020 (0.020)

Checking our Assumptions - II

Table 6: Estimates of Treatment Effect: 15 Month Model




Control Group II (Model 3) 0.041 (0.018)** 0.035 (0.020) 0.023 (0.022)

Table 7: Estimates of Treatment Effect: 18 Month Model




Control Group II (Model 3) 0.035 (0.018)* 0.024 (0.032) 0.026 (0.022)

Summary and conclusions - I

• Strong and consistent evidence that JSA delivered under the NEAP was highly ineffective and actively reduced transitions off the Live Register to employment.

• Two possibilities arise: (i) claimants received poor advice or (ii) claimants relaxed the intensity of job-search on learning of the absence of monitoring and sanctions.

• Advice explanation not supported by results as we would expect the negative impact to fall away in medium term models as claimants adjust behaviour.

Summary and conclusions - II

• We conclude that participants attending the interview quickly learnt that their prior fears with respect to the extent of job search, monitoring and sanctions were unjustified and consequently lowered their job search activity levels.

• Note:

- The analysis was found to be robust to the influences of both sample selection and unobserved heterogeneity.

- Strong negative JSA effects were also generated using another estimation technique (Cox Proportional Hazard Model).

How Reliable are our results?

• We controlled for a wide range of observables, implying that unobserved factors should be less of an issue.

• Sensitivity tests seemed to confirm this.

• We had a highly representative control group.

• Still, while the PSM framework allows us to test the sensitivity of estimates to unobserved bias, it does not eradicate it.

• We are seeing the increased use of combined PSM and diff-in-diff methods to ensure that evaluation estimates are free from both selection bias (on observables) and unobserved bias (picking winners etc).
