Economics of Education Review 38 (2014) 9–23
Value-added models and the measurement of teacher productivity

T.R. Sass a,*, Anastasia Semykina b, Douglas N. Harris c

a Department of Economics, Georgia State University, 14 Marietta Street NW, Atlanta, GA 30303, United States
b Department of Economics, Florida State University, 113 Collegiate Loop, Tallahassee, FL 32306, United States
c Department of Economics, 302 Tilton Hall, Tulane University, New Orleans, LA 70118, United States
1. Introduction
In the last dozen years the availability of administrative databases that track individual student achievement over time and link students to their teachers has radically altered how research on education is conducted and has brought fundamental changes to the ways in which educational programs and personnel are evaluated. Until the late 1990s, research on the role of teachers in student learning was limited primarily to cross-sectional analyses of student achievement levels or simple two-period studies of student achievement gains using relatively small samples of students and teachers.1 The advent of statewide longitudinal databases in Texas, North Carolina and Florida, along with the availability of micro-level longitudinal data from large urban school districts, has allowed researchers to track changes in student achievement as students move between teachers and schools over time. This in turn has permitted the use of panel data techniques to account for the influences of prior educational inputs, students and schools when evaluating the contributions of teachers to student achievement.

The availability of student-level panel data is also fundamentally changing school accountability and the measurement of teacher performance. In Tennessee, Dallas, New York City and Washington DC, models of individual student achievement have been used for many years to evaluate individual teacher performance. While the stakes are currently low in most cases, there is growing interest among policymakers in using estimates from student achievement models for high-stakes performance pay, school grades, and other forms of accountability. Chicago, Denver, Houston and Washington, DC have all adopted compensation systems for teachers based on student performance. Further, as a result of the federal Teacher Incentive Fund and Race to the Top initiatives, many more states and districts plan to implement performance pay systems in the near future. Florida is a particularly interesting case, as the state has recently adopted a very
A R T I C L E I N F O

Article history:
Received 18 May 2012
Received in revised form 30 October 2013
Accepted 30 October 2013

JEL classification:
I21
J24

Keywords:
Teacher productivity
Value added
A B S T R A C T
Research on teacher productivity, as well as recently developed accountability systems for
teachers, relies on ‘‘value-added’’ models to estimate the impact of teachers on student
performance. We consider six value-added models that encompass most commonly
estimated specifications. We test many of the central assumptions required to derive each
of the value-added models from an underlying structural cumulative achievement model
and reject nearly all of them. While some of the six popular models produce similar
estimates, other specifications yield estimates of teacher productivity and other key
parameters that are considerably different.
© 2013 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +1 404 413 0150.
E-mail addresses: [email protected] (T.R. Sass), [email protected] (A. Semykina), [email protected] (D.N. Harris).
1 For reviews of the early literature on teacher quality see Wayne and Youngs (2003), Rice (2003), Wilson and Floden (2003) and Wilson, Floden, and Ferrini-Mundy (2001).
0272-7757/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.econedurev.2013.10.003
aggressive teacher accountability system which relies on these panel data techniques.
Measurement of teacher productivity in both education research and in accountability systems is often based on estimates from panel-data models where the individual teacher effects are interpreted as a teacher's contribution to student achievement, or teacher "value-added." The theoretical underpinning for these analyses is the cumulative achievement model developed by Boardman and Murnane (1979), Hanushek (1979), and Todd and Wolpin (2003), where current student achievement is a function of a student's entire history of educational and family inputs. However, varying data constraints have led to a wide variety of empirical specifications being estimated in practice. Each empirical specification makes (typically implicit) assumptions about the parameters of the underlying structural model.

Understanding the assumptions being made and how to test their validity is important both for interpreting the estimates from empirical models and for determining whether the estimates are subject to bias. Model misspecification, including omitted variables, can yield parameter estimates that do not represent the constructs of the underlying structural model and are biased.

Two recent studies evaluate various value-added model specifications based on the criterion of minimizing bias in estimated teacher effects.2 Kane and Staiger (2008) conduct an experiment in which 78 pairs of teachers were randomly assigned to classrooms in the same grade and school. They then compare pairwise differences in estimated teacher effects from the experimental sample to differences in the pre-experiment value-added estimates of the same teacher pairs. Their analysis included estimates derived from seven value-added specifications with varying controls for student heterogeneity. For the five value-added models that accounted for prior-year student achievement, they could not reject the null that the estimated within-pair differences in teacher productivity were equivalent to the differences under random assignment. Random assignment is a key advantage, but experiments also have limitations. They can generally only be implemented on a small scale and only for individuals or institutions that voluntarily participate. For example, in Kane and Staiger (2008), they could only test whether within-school sorting is an issue, and only among pairs of teachers that a principal was comfortable randomly assigning students to. The original experiment was recently replicated across cities with a much larger sample, with the same general results, but the participation rate was once again very low and there were significant problems with non-compliance with the randomization (Kane, McCaffrey, Miller, & Staiger, 2013).

Guarino, Reckase, and Wooldridge (2012) generate simulated data under various student grouping and teacher assignment scenarios and then compare the estimates from alternative achievement model specifications to the known (generated) teacher effects. While no specification is superior under all student/teacher assignment scenarios, a model that estimates current achievement as a function of prior-year achievement and observable student and teacher/school inputs is the most robust. The simulation approach has the advantage of producing known "true" teacher effects that can be used to evaluate the estimates from alternative models. The disadvantage, however, is that there is no way to know if the selected data generating processes actually reflect the student-teacher matching mechanisms that occur in real-world data. In particular, the data generating processes they employ rely on a number of simplifying assumptions about the underlying cumulative achievement model.

We take a different approach and test the assumptions required to derive empirical value-added models from a structural model of student achievement. This allows us to determine whether the estimates from value-added models have a structural interpretation.3 The validity of the assumptions is also important because data generation processes used in simulation work rely on (often implicit) assumptions about the underlying structural model of student achievement. The disadvantage, however, is that failure of the underlying assumptions does not necessarily mean value-added models fail to accurately classify teacher performance for accountability. While we cannot directly test the magnitude of bias in value-added models, we can and do conduct simple hypothesis testing and consider how model specification affects the estimated productivity of teachers. By comparing estimated teacher effects across models of varying flexibility, we can evaluate the sensitivity of teacher rankings to specific modeling choices, each with differing data and computational costs. If the results are insensitive to modeling choices, then one can be less concerned about imposing false restrictions. But this is not what we find. The results are very sensitive to certain types of assumptions.

We begin our analysis in the next section by considering the general form of cumulative achievement functions and the assumptions which are necessary to generate empirical models that can be practically estimated. In that section we also delineate a series of specification tests that can be used to evaluate the assumptions underlying empirical value-added models. Section 3 discusses our data and in Section 4 we present our results. In the final section we summarize our findings and consider the implications for future research and for the implementation of accountability systems.
2. Value-added models and tests
2.1. A general cumulative model of achievement
In order to clearly delineate the empirical models that have been estimated and the assumptions underlying them, we begin with a general cumulative model of

2 Another branch of recent literature investigates alternative forms of the cumulative achievement function, emphasizing the impact of historical home and schooling inputs on current achievement. See Todd and Wolpin (2007), Ding and Lehrer (2007), Andrabi, Das, Khwaja, and Zajonc (2011) and Jacob, Lefgren and Sims (2010).
3 The origins of our work are Harris and Sass (2006), though the analysis has evolved considerably since that original working paper.
student achievement in the spirit of Boardman and Murnane (1979) and Todd and Wolpin (2003):

A_{it} = A_t[X_i(t), F_i(t), E_i(t), \mu_{i0}, \varepsilon_{it}]   (1)

where A_{it} is the achievement level for individual student i at the end of their tth year of life, and X_i(t), F_i(t) and E_i(t) represent the entire histories of individual, family and school-based educational inputs, respectively. The term \mu_{i0} is a composite variable representing time-invariant characteristics an individual is endowed with at birth (such as innate ability), and \varepsilon_{it} is an idiosyncratic error.

The vector of school-based educational inputs, E_i(t), contains both school-level inputs, such as the quality of principals and other administrative staff within a school,4 as well as a vector of classroom-level inputs.5 The latter group of inputs includes peer characteristics,6 time-varying teacher characteristics (such as experience), non-teacher classroom-level inputs (such as books, computers, etc.) and the primary parameter vector of interest, time-invariant teacher characteristics (including, for example, "innate" ability and pre-service education). The time-invariant teacher characteristics can be captured by a set of teacher indicator variables.7

If we assume that the cumulative achievement function, A_t[\cdot], is linear and additively separable,8 then we can rewrite the achievement level at grade t as:

A_{it} = \sum_{h=1}^{t} [\alpha_{ht} X_{ih} + \varphi_{ht} F_{ih} + \beta_{ht} E_{ih}] + c_t \mu_{i0} + \varepsilon_{it}   (2)

where \alpha_{ht}, \varphi_{ht} and \beta_{ht} represent the vectors of (potentially time-varying) weights given to individual, family and school inputs. The impact of the individual-specific time-invariant endowment in period t is given by c_t.
2.2. Cumulative model with fixed family inputs
Estimation of Eq. (2) requires data on both current and all prior individual, family and school inputs. However, administrative records contain only limited information on family characteristics and no direct measures of parental inputs.9 Therefore, it is necessary to assume that family inputs are constant over time and are captured by a student-specific fixed component, z_i.10 However, the marginal effect of these fixed parental inputs on student achievement may vary over time and is represented by \kappa_t. Thus, the effect of the fixed family input (\kappa_t z_i) is analogous to the effect of the student component (c_t \mu_{i0}) in (1).

The assumption of fixed parental inputs of course implies that the level of inputs selected by families does not vary with the level of school-provided inputs a child receives. For example, it is assumed that parents do not systematically compensate for low-quality schooling inputs by providing tutors or other resources.11 Similarly, it is assumed that parental inputs are invariant to achievement realizations; parents do not increase their inputs when their son or daughter does poorly in school.
The validity of the assumption that family inputs do not change over time is hard to gauge. Todd and Wolpin (2007), using data from the National Longitudinal Survey of Youth 1979 Child Sample (NLSY79-CS), consistently reject exogeneity of family input measures at a 90% confidence level, but not at a 95% confidence level. They have only limited aggregate measures of schooling inputs (average pupil-teacher ratio and average teacher salary measured at the county or state level) and the coefficients on these variables are typically statistically insignificant, whether or not parental inputs are treated as exogenous. Thus it is hard to know to what extent the assumed invariance of parental inputs may bias the estimated impacts of schooling inputs. It seems reasonable, however, that parents would attempt to compensate for poor school resources, and therefore any bias in the estimated impacts of schooling inputs would be toward zero.
If we assume that the marginal effects of the endowment and family inputs are equal to each other in each period, i.e., \kappa_t = c_t, then we can re-label this effect as v_t and
4 Typically administrative data provide little information on time-varying school-level inputs like scheduling systems, curricular choices, leadership styles and the like. Consequently, it is common to replace the vector of school characteristics with a school fixed effect. When school-level effects are included, the teacher fixed effect captures the effect of a given teacher's time-invariant characteristics on her students' achievement relative to other teachers at the same school. This obviously limits the comparison group for assessing teacher productivity, which is particularly problematic in accountability contexts since one typically wants to compare the performance of a teacher with all other teachers in the school district or state, not just at their own school.
5 Classroom-level variables may be correlated with the assignment of teachers and students to classrooms. For example, principals may seek to aid inexperienced teachers by giving them additional computer resources. Similarly, classrooms containing a disproportionate share of low-achieving or disruptive students may receive additional resources like teacher aides. Due to the paucity of classroom data on non-teaching personnel and equipment, most studies omit any controls for non-teacher inputs.
6 It is well known that if students are assigned to classrooms non-randomly and peer-group composition affects achievement, then failure to control for the characteristics of classroom peers will produce biased estimates of the impact of teachers on student achievement. Recognizing this potential problem, the majority of the existing studies of teacher effects contain at least crude measures of observable peer characteristics like the proportion who are eligible for free/reduced-price lunch. An alternative approach is to focus on classes where students are, or appear to be, randomly assigned, as in Clotfelter, Ladd, and Vigdor (2006), Dee (2004), and Nye, Konstantopoulos, and Hedges (2004).
7 Alternatively, teacher effects could be modeled with random-effects estimators. Lockwood and McCaffrey (2007) provide a detailed comparative analysis of fixed and random effects estimators in the context of student achievement models.
8 Figlio (1999) and Harris (2007) explore the impact of relaxing the assumption of additive separability by estimating a translog education production function.
9 Typically the only information on family characteristics is the student's participation in free/reduced-price lunch programs, a crude and often inaccurate measure of family income. Data in North Carolina also include teacher-reported levels of parental education.
10 In general, one could consider models with uncorrelated unobserved heterogeneity. However, it is likely that the observed inputs (e.g. teacher and school quality) are correlated with the unobserved student effect, which would lead to biased estimates in a random-effects framework. Therefore, in what follows, we assume that the unobserved heterogeneity may be correlated with the observed inputs and focus on a student/family fixed effect.
11 For evidence on the impact of school resources on parental inputs see Houtenville and Conway (2008) and Bonesronning (2004).
combine the student and family components so that v_t(z_i + \mu_i) = v_t \xi_i.12 The achievement equation at grade t then becomes:

A_{it} = \sum_{h=1}^{t} [\alpha_{ht} X_{ih} + \beta_{ht} E_{ih}] + v_t \xi_i + \varepsilon_{it}   (3)

Eq. (3) is the least restrictive specification of the cumulative achievement function that can conceivably be estimated with administrative data. In this very general specification current achievement depends on current and all prior individual time-varying characteristics and school-based inputs, as well as the student's (assumed time-invariant) family inputs and the fixed individual endowment.
2.3. Assumptions underlying all commonly estimated empirical models of student achievement
2.3.1. Grade invariance of the cumulative achievement function
The cumulative model with fixed family inputs (Eq. (3)) is grade-specific and thus allows for the possibility that the achievement function varies with the grade level.13 Maintaining this flexibility carries a heavy computational cost, however. In pooled regressions, separate coefficients must be estimated for each input/grade/time-of-application combination. To make the problem more computationally tractable, it is universally assumed in the empirical literature that while the impact of inputs may decay over time, the cumulative achievement function does not vary with the grade level. In particular, it is assumed that the impact of an input on achievement varies with the time span between the application of the input and measurement of achievement, but is invariant to the grade level at which the input was applied. Thus, for example, having a small kindergarten class has the same effect on achievement at the end of third grade as does having a small class in second grade on fifth-grade achievement. This implies that for any t:

A_{it} = \sum_{h=1}^{t} [\alpha_h X_{i(t+1-h)} + \beta_h E_{i(t+1-h)}] + v_t \xi_i + \varepsilon_{it}   (4)
We refer to Eq. (4) as our "baseline model." However, the grade invariance assumption that leads to the baseline model can be tested. Specifically, each input can be interacted with time (or grade) dummies, the interaction terms can be added to Eq. (4), and the joint significance of the additional terms can be tested.
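As an illustration, this interaction-based test can be sketched as a joint significance test on input-by-grade interactions. The snippet below is a minimal sketch on simulated data; all variable names and parameter values are hypothetical, and it is not the specification estimated in this paper.

```python
# Minimal sketch of the grade-invariance test: interact an input with grade
# dummies and test the interactions jointly. Simulated data, hypothetical names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "grade": rng.integers(3, 6, size=n),   # grades 3-5
    "x": rng.normal(size=n),               # a time-varying input
})
# Achievement generated so the input's effect does NOT vary by grade
df["ach"] = 0.5 * df["x"] + rng.normal(size=n)

m = smf.ols("ach ~ x + C(grade) + x:C(grade)", data=df).fit()
# Joint test that all input-by-grade interactions are zero
wald = m.f_test("x:C(grade)[T.4] = 0, x:C(grade)[T.5] = 0")
print(m.params["x"], float(wald.pvalue))
```

Under grade invariance the interactions should be jointly insignificant; rejecting the joint null indicates that an input's effect depends on the grade at which it is applied.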
2.3.2. Assumptions about the unobserved family/student effect
Estimation of the achievement model and implementation of specification tests (such as the test of grade invariance described above) depend on assumptions about the impact of the unobserved student/family effect. If unobserved student/family heterogeneity has no important impact on student performance (v_t = 0), or such heterogeneity is not related to the observed inputs, then unbiased (or, in large samples, consistent) estimates of the parameters in the models can be obtained by estimating the equation by OLS.
If the unobserved heterogeneity is present and is potentially correlated with observed inputs, then the OLS estimator is generally biased (inconsistent). So long as the unobserved effect is time constant (v_t = v, t = 1, . . ., T) one can use fixed effects (FE) or first-difference (FD) estimators. However, these estimators are valid only if the assumption of strict exogeneity (conditional on the unobserved effect) holds. Specifically, we have to assume that a shock to student achievement in grade t (\varepsilon_{it}) does not affect the choice of inputs in any grade, including the next grade. Rothstein (2010) discusses this problem in detail in relation to estimating teacher effects on student performance. As noted by Rothstein (2010), the strict exogeneity assumption fails if future teacher assignment is partly determined by past and/or current shocks to student performance (for example, if students who experience a drop in their performance are assigned to a class taught by a relatively high-productivity teacher the next year). In this case, both FE and FD estimators are inconsistent; hence, it is important to check whether strict exogeneity holds. A simple test for strict exogeneity can be performed by adding lead (future) values of inputs to the set of explanatory variables and testing their joint significance in the FE or FD regression (Wooldridge, 2002, chap. 10).14
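A minimal version of this strict-exogeneity test, on simulated data with hypothetical variable names (fixed effects implemented via student dummies for brevity), might look like:

```python
# Strict-exogeneity check: include the lead (next-grade) input in a
# fixed-effects regression and inspect its significance. Simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_students, n_years = 200, 5
ids = np.repeat(np.arange(n_students), n_years)
alpha_i = np.repeat(rng.normal(size=n_students), n_years)  # student effect
x = rng.normal(size=ids.size)                              # strictly exogenous input
y = 0.4 * x + alpha_i + rng.normal(size=ids.size)

df = pd.DataFrame({"id": ids, "x": x, "y": y})
df["x_lead"] = df.groupby("id")["x"].shift(-1)  # next period's input
df = df.dropna()

# FE via student dummies; under strict exogeneity x_lead is insignificant
fe = smf.ols("y ~ x + x_lead + C(id)", data=df).fit()
print(fe.params["x"], fe.pvalues["x_lead"])
```

If future inputs are jointly significant, input assignment responds to past shocks, and FE/FD estimates of Eq. (4) lose their structural interpretation.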
A popular and practically feasible alternative to assuming either zero or time-constant effects of unobservable student/family inputs is to assume that the unobserved effect in Eq. (4) is trending: v_t = 1 + \psi t, where t is a time trend (see, for example, Wooldridge, 2002). After taking first differences in (4), the trending component becomes time constant, \Delta(v_t \xi_i) = \psi \xi_i \equiv \gamma_i:

\Delta A_{it} = \sum_{h=1}^{t-1} [\alpha_h \Delta X_{i(t+1-h)} + \beta_h \Delta E_{i(t+1-h)}] + \alpha_t X_{i1} + \beta_t E_{i1} + \gamma_i + e_{it}   (5)

where the new unobserved student/family effect, \gamma_i, is constant over time. Therefore, an achievement model that contains a trending unobserved effect can be consistently estimated by either FE or FD applied to the differenced equation. Strict exogeneity should also be tested in this case using the test described above.
12 Note that the marginal effects of fixed parental and child inputs, v_t, are the same for all students and thus v_t \xi_i varies over time in the same manner for all students. If the time-varying marginal effect of the individual/family fixed component were student-specific then the effect of the student-specific component in each time period would be perfectly collinear with observed achievement.
13 While the cumulative model with fixed family inputs allows for differential effects by grade, as written it assumes equal marginal effects across students within a grade. Of course interactions can be included to allow for differential effects across different types of students. See, for example, Wright, Horn, and Sanders (1997). A recent analysis by Lockwood and McCaffrey (2009) directly investigates whether teacher value-added varies across different types of students.
14 This is the same test that Rothstein (2010) uses when testing the validity of his models VAM1 (regressing test score gains on contemporaneous teacher indicators) and VAM2 (regressing test scores on contemporaneous teacher indicators and the lagged score). Koedel and Betts (2011) use the same test in the model with geometric decay. Rothstein also proposed a more advanced test, which we discuss in more detail in the Appendix.
There are several ways to check the validity of various assumptions concerning the unobserved student/family effect. First, strong positive correlation in residuals would indicate the presence of highly persistent factors, such as time-invariant student or family inputs. Therefore, if for residuals e_{it} and e_{it-1}, Corr(e_{it}, e_{it-1}) is positive, this would indicate the presence of the unobserved effect. If positive serial correlation in residuals is found not only in the levels equation (such as Eq. (4)), but also in the differenced equation (such as Eq. (5)), it would imply that the unobserved effect is trending rather than time-constant, because a time-constant effect would drop out after differencing. Another possibility is to use a Hausman-type test comparing the FE (FD) estimates in Eqs. (4) and (5) (Wooldridge, 2002). If the unobserved student/family effect in (4) is time constant, then parameter estimates should be about the same in both models. Unfortunately, the traditional form of the Hausman test statistic is not applicable in this case. A more complicated general form of the test statistic should be used.15
2.3.3. Geometric decay in the impact of prior inputs

Given the burdensome data requirements and computational cost of the full cumulative model, even the age-invariant version of the cumulative achievement model, Eq. (4), has never been directly estimated for a large sample of students.16 To make the model more tractable, it is typically assumed that the marginal impacts of all prior student and school-based inputs decline geometrically with the time between the application of the input and the measurement of achievement, at the same rate, so that for any given h, \alpha_{(t+1)-h} = \lambda \alpha_{t-h}, where \lambda is a scalar and 0 \le \lambda \le 1. With geometric decay Eq. (4) can be expressed as:
A_{it} = \sum_{h=0}^{t-1} \lambda^h [\alpha X_{i(t-h)} + \beta E_{i(t-h)}] + v_t \xi_i + \varepsilon_{it}   (6)

Taking the difference between current achievement and \lambda times prior achievement yields:

A_{it} - \lambda A_{it-1} = \left( \sum_{h=0}^{t-1} \lambda^h [\alpha X_{i(t-h)} + \beta E_{i(t-h)}] + v_t \xi_i + \varepsilon_{it} \right) - \left( \sum_{h=0}^{t-2} \lambda^{h+1} [\alpha X_{i(t-1-h)} + \beta E_{i(t-1-h)}] + \lambda v_{t-1} \xi_i + \lambda \varepsilon_{it-1} \right)   (7)

Collecting terms, simplifying and adding \lambda A_{it-1} to both sides produces:

A_{it} = \alpha X_{it} + \beta E_{it} + \lambda A_{it-1} + (v_t - \lambda v_{t-1}) \xi_i + \eta_{it}   (8)

where \eta_{it} = \varepsilon_{it} - \lambda \varepsilon_{it-1}.
Thus, given the assumed geometric rate of decay, the current achievement level is a function of contemporaneous student and school-based inputs as well as lagged achievement and an unobserved individual-specific effect. The lagged achievement variable serves as a sufficient statistic for all past time-varying student and schooling inputs, thereby avoiding the need for historical data on teachers, peers and other school-related inputs.
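This sufficient-statistic result is easy to verify numerically. The sketch below, on simulated data with hypothetical parameter values (and no student effect), generates achievement from the geometric-decay model and recovers \alpha, \beta and \lambda from a regression of current achievement on contemporaneous inputs and the lagged score:

```python
# Numerical check of Eq. (8): data built from the geometric-decay model
# should satisfy A_t = a*X_t + b*E_t + lam*A_{t-1} + eta_t.
# Hypothetical parameter values; not estimates from the paper.
import numpy as np

rng = np.random.default_rng(2)
a, b, lam = 0.5, 0.3, 0.7
n, T = 5000, 6

X = rng.normal(size=(n, T))
E = rng.normal(size=(n, T))
eps = rng.normal(scale=0.05, size=(n, T))   # small idiosyncratic error

# Achievement as a geometrically decaying sum of current and all past inputs
A = np.zeros((n, T))
for t in range(T):
    for h in range(t + 1):
        A[:, t] += lam**h * (a * X[:, t - h] + b * E[:, t - h])
    A[:, t] += eps[:, t]

# OLS of A_t on X_t, E_t and A_{t-1}, pooling periods 1..T-1
y = A[:, 1:].ravel()
Z = np.column_stack([X[:, 1:].ravel(), E[:, 1:].ravel(), A[:, :-1].ravel()])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(coef)  # close to [0.5, 0.3, 0.7]
```

Note that because \eta_{it} = \varepsilon_{it} - \lambda\varepsilon_{it-1} is an MA(1) error correlated with A_{it-1}, the lagged-score coefficient carries a small bias that shrinks with the variance of \varepsilon; this is exactly the serial-dependence caveat raised in the discussion of the partial persistence model.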
Although commonly estimated models assume all educational inputs persist at a geometric rate, \lambda, one could still have a tractable model if only some inputs decay geometrically. The most general test can be constructed using the model with grade invariance, but no restriction on input decay, which is summarized in Eq. (4). The presence of input-specific geometric decay can be checked by testing the following null hypotheses:
H_0: \alpha_{t,j}/\alpha_{t-1,j} = \alpha_{t-1,j}/\alpha_{t-2,j} = \cdots = \alpha_{2,j}/\alpha_{1,j},

or

H_0: \beta_{t,j}/\beta_{t-1,j} = \beta_{t-1,j}/\beta_{t-2,j} = \cdots = \beta_{2,j}/\beta_{1,j},
for each input j. These are nonlinear hypotheses that can betested using a Wald-type test.
2.4. Commonly estimated models and specific assumptions
In Eq. (8), it is possible to make different assumptions about the rate of decay (\lambda) and the unobserved heterogeneity (\xi_i). Below we consider several possibilities:
Assumptions and resulting model:

1. (0 < \lambda < 1), v_t = \lambda v_{t-1}. Partial persistence model:
   A_{it} = \alpha X_{it} + \beta E_{it} + \lambda A_{it-1} + \eta_{it}
2. (0 < \lambda < 1), v_t = v_{t-1}. Partial persistence model with student fixed effect:
   A_{it} = \alpha X_{it} + \beta E_{it} + \lambda A_{it-1} + \gamma_i + \eta_{it}
3. \lambda = 1, v_t = \lambda v_{t-1}. Gains model:
   \Delta A_{it} = A_{it} - A_{it-1} = \alpha X_{it} + \beta E_{it} + \eta_{it}
4. \lambda = 1, v_t = v_{t-1}. Gains model with student fixed effect:
   \Delta A_{it} = A_{it} - A_{it-1} = \alpha X_{it} + \beta E_{it} + \gamma_i + \eta_{it}
5. \lambda = 0, v_t = \lambda v_{t-1}. Immediate decay model:
   A_{it} = \alpha X_{it} + \beta E_{it} + \eta_{it}
6. \lambda = 0, v_t = v_{t-1}. Immediate decay model with student fixed effect:
   A_{it} = \alpha X_{it} + \beta E_{it} + \gamma_i + \eta_{it}
Models 1, 3, and 5 assume that the time-invariant student/family inputs decay at the same rate as other inputs (\lambda), so that v_t = \lambda v_{t-1}, and the individual-specific effect drops out of the achievement equation. In models 2, 4, and 6 the marginal effect of the individual-specific component is assumed to be constant over time, i.e. v_t = v_{t-1} = v and (v_t - \lambda v_{t-1}) = (1 - \lambda)v, so that \gamma_i = (1 - \lambda)v\xi_i is a time-invariant student/family fixed effect. The remaining differences across models are due to various assumptions about the rate of decay, \lambda.
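To illustrate how these decay assumptions matter, the simulation sketch below (hypothetical parameters, with input assignment responding partly to the prior score) generates data with true \lambda = 0.5 and compares the input coefficient from models 1, 3 and 5:

```python
# Simulated comparison of models 1, 3 and 5 when the true decay is partial
# (lam = 0.5) and the input is assigned partly on the prior score.
# All parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
b, lam = 0.4, 0.5
n, T = 5000, 7

eta = rng.normal(scale=0.3, size=(n, T))
A = np.zeros((n, T))
E = np.zeros((n, T))
A[:, 0] = eta[:, 0]
for t in range(1, T):
    E[:, t] = 0.5 * A[:, t - 1] + rng.normal(size=n)  # sorting on prior score
    A[:, t] = b * E[:, t] + lam * A[:, t - 1] + eta[:, t]

y, ylag, x = A[:, 1:].ravel(), A[:, :-1].ravel(), E[:, 1:].ravel()

def ols(dep, cols):
    Z = np.column_stack(cols)
    return np.linalg.lstsq(Z, dep, rcond=None)[0]

b1 = ols(y, [x, ylag])[0]   # model 1: levels with lagged score
b3 = ols(y - ylag, [x])[0]  # model 3: gains
b5 = ols(y, [x])[0]         # model 5: immediate decay (levels, no lag)
print(b1, b3, b5)
```

Under this data generating process model 1 recovers the true input effect, while the gains model understates it and the immediate decay model overstates it; with randomly assigned inputs the three estimates would coincide, which is why the specification tests above matter.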
15 When the relative efficiency of one of the two estimators cannot be established, the asymptotic variance of the difference between the two estimators is not the same as the difference in the two variances; the covariance should be included in the computation of the variance of the difference. See, for example, Wooldridge (2002), Section 14.5.1.
16 Todd and Wolpin (2007) estimate the cumulative achievement model using a sample of approximately 7000 students from the NLSY79-CS. Although they possess good measures of parental inputs and achievement levels, they have only a few general measures of schooling inputs measured at the county or state level.
Model 1 is valid when 0 < \lambda < 1 and is perhaps the most frequently estimated value-added model; we refer to it as the partial persistence model. In this model the lagged test score serves as a sufficient statistic for the time-constant student/family inputs as well as for the historical time-varying student and school-based inputs. OLS estimates of Eq. (8) would be unbiased (consistent) so long as the common geometric decay assumption is correct and the idiosyncratic error, \eta_{it}, is not correlated with current inputs and past achievement. The latter assumption would fail, for example, if a time-constant student/family effect is part of the error \eta_{it}, or the \eta_{it} are serially dependent for other reasons.
Model 2 maintains the assumption that 0 < \lambda < 1, but explicitly introduces a time-invariant student/family effect. Estimation is complicated due to the presence of the lagged dependent variable, which is inevitably correlated with the error in the previous period, so that strict exogeneity fails, and both FE and FD are inconsistent (asymptotically biased). Therefore, under the standard assumption that the idiosyncratic errors are serially uncorrelated, the common approach is to remove the unobserved effect by first-differencing, and then use second and possibly further lags of the dependent variable to instrument for \Delta A_{it-1}.17
In the two gains models (models 3 and 4), the coefficient on lagged achievement in Eq. (8) is unity.18 As noted by Boardman and Murnane (1979) and Todd and Wolpin (2003), setting \lambda = 1 implies that the effect of each input must be independent of when it is applied. The gains model 3 can be consistently estimated by OLS if the assumption \lambda = 1 is correct and the error \eta_{it} does not contain any factors (e.g. unobserved time-invariant student/family inputs) that may be correlated with the inputs included in the model. If \lambda = 1 holds, but unobserved student/family inputs, such as student ability and parental involvement, are present and are potentially correlated with observed inputs, such as class size and teacher assignment, then it is more appropriate to use model 4, which can be consistently estimated by FE or FD estimators.
Models 5 and 6 assume that the decay is immediate and complete, so that \lambda = 0 and lagged achievement drops out of the achievement function. Similar to the discussion above, if \lambda = 0 holds and time-invariant student/family inputs are either not present or not correlated with observed inputs, then model 5 is correct, and model parameters (including teacher effects) can be consistently estimated by OLS. However, if unobserved student/family inputs, \gamma_i, are correlated with observed inputs, then model 6 is more suitable, and estimation should be done using either FE or FD.
The validity of the above models can be checked by augmenting the corresponding equation with lagged values of observed inputs, estimating the equation using the appropriate estimation method, and testing the joint significance of the lagged inputs. If the underlying model assumptions are valid, lagged inputs should not appear in the corresponding model and, therefore, should be jointly insignificant. Specifically, significant lagged inputs in the immediate decay models (5 and 6) would mean that in the actual data \lambda is not 0. In the gains models (3 and 4) it would mean that \lambda is not 1. In the partial persistence models (1 and 2) it would imply that the decay is either not geometric, or not the same for all inputs, or both. Moreover, in the models that do not account for a time-constant student/family fixed effect (models 1, 3, and 5), significance of lagged inputs would indicate that the student/family effect is present in the actual data and needs to be accommodated in the estimation.
Several other tests can be used to determine whether the employed estimation methods are valid. Specifically, models 2, 4, and 6 are estimated using either FE or FD estimators, which are consistent only if observed inputs are strictly exogenous (see discussion in Section 2.3.1). As mentioned above, the strict exogeneity assumption can be tested by adding future values of input variables to the FE (FD) regression and subsequently testing their joint significance. Future inputs can also be included in the OLS regressions used to estimate models 1, 3, and 5. Rejecting the null that future inputs have no effect in those regressions would imply that a time-constant unobserved student/family effect is present in the data and correlated with observed inputs included in the model.
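A hedged sketch of the strict-exogeneity test: we simulate feedback in which next period's input responds to this period's shock (a stylized stand-in for assigning students to teachers on the basis of realized achievement), then add the future input to a within (FE) regression. Under strict exogeneity the lead coefficient would be zero; with feedback it is clearly positive. The setup and magnitudes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 3000, 5, 1.0

g = rng.normal(size=N)                 # student/family effect
e = rng.normal(size=(N, T))            # idiosyncratic shocks
x = rng.normal(size=(N, T))
x[:, 1:] += 0.5 * e[:, :-1]            # feedback: next input responds to this period's shock
y = beta * x + g[:, None] + e

def within(z):                         # demean each student's series (FE transformation)
    return z - z.mean(axis=1, keepdims=True)

# FE regression of y_t on x_t and the future input x_{t+1}
yd = within(y[:, :-1]).ravel()
X = np.column_stack([within(x[:, :-1]).ravel(), within(x[:, 1:]).ravel()])
b = np.linalg.lstsq(X, yd, rcond=None)[0]

print(round(b[1], 2))                  # lead coefficient; zero under strict exogeneity
```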
Finally, in the instrumental variables regression used to estimate model 2, it is important to test the validity of the instruments. As mentioned above, in order for the second lag of the test score to be a valid instrument, it is necessary that the errors η_it are not correlated over time. This assumption is usually checked by testing H0: Corr(Δη_it, Δη_it−1) = −0.5, where Δη_it is the error in the differenced equation. In practice, the correlation between the current and lagged residuals in the differenced equation is computed and used for testing. Another standard test checks whether the instruments are strongly partially correlated with the instrumented variable.
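The −0.5 benchmark follows from first differencing a serially uncorrelated error: Cov(Δη_t, Δη_t−1) = −Var(η) while Var(Δη) = 2Var(η), so the correlation is −0.5. A quick numerical check (purely illustrative, with simulated errors):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 4000, 6
e = rng.normal(size=(N, T))        # serially uncorrelated idiosyncratic errors
de = np.diff(e, axis=1)            # first differences

# Under H0 (no serial correlation in e), Corr(de_t, de_{t-1}) = -0.5
r = np.corrcoef(de[:, 1:].ravel(), de[:, :-1].ravel())[0, 1]
print(round(r, 2))
```

A computed residual correlation far from −0.5 signals serial correlation in the level errors and casts doubt on the lagged test score as an instrument.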
3. Data
In order to test alternative model specifications weutilize data from the Florida Department of Education’s K-20 Education Data Warehouse (EDW), an integratedlongitudinal database covering all Florida public schoolstudents and school employees. Our sample begins withschool-year 1999/2000, which is the first year in whichstatewide standardized testing in consecutive grade levelswas conducted in Florida. Our panel continues through the2007/2008 school year.
During our period of analysis the state administeredtwo sets of reading and math tests to all third throughtenth graders in Florida. The ‘‘Sunshine State Standards’’Florida Comprehensive Achievement Test (FCAT-SSS) is acriterion-based exam designed to test for the skills thatstudents are expected to master at each grade level. The
17 Using the instrumental variables method is necessary because in the differenced equation, Cov(ΔA_it−1, Δη_it) = Cov(A_it−1 − A_it−2, η_it − η_it−1) = −Cov(A_it−1, η_it−1), since Cov(A_it−1, η_it) = Cov(A_it−2, η_it−1) = Cov(A_it−2, η_it) = 0 when {η_it} are serially uncorrelated. Because Cov(A_it−1, η_it−1) ≠ 0 by construction, instruments are needed.
18 Alternatively, the model can be derived by starting with a model of student learning gains (rather than levels) and assuming that there is no persistence of past schooling inputs on learning gains.
second test is the FCAT Norm-Referenced Test (FCAT-NRT), a version of the Stanford-9 achievement test. The Stanford-9 is a vertical or development-scale exam. Hence scores typically increase with the grade level and a one-point increase in the score at one place on the scale is equivalent to a one-point increase anywhere else on the scale. We use FCAT-NRT scale scores in all of the analysis. The vertical scale of the Stanford Achievement Test allows us to compare achievement gains of students with differing initial achievement levels. Further, use of the FCAT-NRT minimizes potential biases associated with "teaching to the test," since all school accountability standards, as well as promotion and graduation criteria in Florida, are based on the FCAT-SSS, rather than the FCAT-NRT. The FCAT-NRT was last administered in 2007/2008, which determines the end of our sample period.

Although achievement test scores are available for both math and reading in grades 3–10, we limit our analysis to mathematics achievement in middle school, grades 6–8. We select middle-school mathematics classes for a number of reasons. First, we require second-lagged scores to serve as potential instruments for lagged achievement. Given that testing begins in grade 3, this precludes analysis of student achievement prior to grade 5.

Second, it is easier to identify the relevant teacher and peer group for middle-school students than for elementary students. The overwhelming majority of middle school students in Florida move between specific classrooms for each subject whereas elementary school students typically receive the majority of their core academic instruction in a "self-contained" classroom. However, for elementary school students enrolled in self-contained classrooms, 5% are also enrolled in a separate math course and nearly 13% are enrolled in either special-education or gifted courses.

Third, because middle-school teachers often teach multiple sections of a course during an academic year, it is easier to clearly identify the effects of individual teachers on student achievement. In elementary school, teachers typically are with the same group of students all day long and thus teacher effects can only be identified by observing multiple cohorts of students taught by a given teacher over time. In contrast, both variation in class composition across sections at a point in time as well as variation across cohorts over time help to distinguish teacher effects from other classroom-level factors affecting student achievement in middle school.

Fourth, we choose to avoid high school grades (grades 9 and 10) because of potential mis-alignment between test content and curriculum. At the high-school level math courses become more diverse and specialized. Thus the content of some high school math courses, particularly advanced courses, may have little correlation with concepts being tested on achievement exams.

Finally, we focus on math achievement rather than reading because it is easier to clearly identify the class and teacher most relevant to the material being tested. While some mathematics-related material might be presented in science courses, direct mathematics instruction almost always occurs in math classes. In contrast, middle school students in Florida may be simultaneously enrolled in "language arts" and reading courses, both of which may cover material relevant to reading achievement tests.
In addition to selecting middle-school math courses foranalysis, we have limited our sample in other ways in anattempt to get the cleanest possible measures of classroompeers and teachers. First, we restrict our analysis of studentachievement to students who are enrolled in only a singlemathematics course and drop grade repeaters (though allother students enrolled in the course are included in themeasurement of peer-group characteristics). Second, toavoid atypical classroom settings and jointly taught classeswe consider only courses in which 10–50 students areenrolled. Third, we eliminate any courses in which there ismore than one ‘‘primary instructor’’ of record for the class.Finally, we eliminate charter schools from the analysissince they may have differing curricular emphases andstudent-peer and student–teacher interactions may differin fundamental ways from traditional public schools.
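The sample restrictions above amount to a simple record filter. A schematic sketch (the record layout and field names are invented for illustration, not the EDW schema):

```python
# Hypothetical student-course records; fields are illustrative only.
students = [
    {"id": 1, "math_courses": 1, "repeater": False, "class_size": 28, "n_teachers": 1, "charter": False},
    {"id": 2, "math_courses": 2, "repeater": False, "class_size": 28, "n_teachers": 1, "charter": False},
    {"id": 3, "math_courses": 1, "repeater": False, "class_size": 8,  "n_teachers": 1, "charter": False},
    {"id": 4, "math_courses": 1, "repeater": True,  "class_size": 30, "n_teachers": 2, "charter": True},
]

def keep(s):
    return (s["math_courses"] == 1 and not s["repeater"]   # single math course, no grade repeaters
            and 10 <= s["class_size"] <= 50                # drop atypical / jointly taught classes
            and s["n_teachers"] == 1                       # one primary instructor of record
            and not s["charter"])                          # traditional public schools only

sample = [s["id"] for s in students if keep(s)]
print(sample)   # -> [1]
```

Note that, as in the paper, students dropped from the achievement sample could still contribute to peer-group measures; this sketch only illustrates the achievement-sample filter.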
Estimation of some models requires up to three lagged test scores. Given that statewide testing began in 1999/2000 and the FCAT-NRT was last administered in 2007/2008, our analysis is limited to achievement of Florida traditional public school students in grades 6–8 over the years 2002/2003 through 2007/2008 who took the FCAT-NRT for at least three consecutive years. This includes six cohorts of students. Unfortunately, it is not computationally tractable to estimate models that include both contemporaneous and multiple lagged teacher effects using the entire sample. We therefore randomly select 20 of Florida's 67 countywide school districts for analysis.19 Descriptive statistics for the variables in the 20-district data set are provided in Table 1.
4. Results
4.1. Tests of grade-invariance
Recall that empirical value-added models universallyassume lagged inputs have the same effect on contempo-raneous achievement, irrespective of grade level. This canbe tested by interacting each input with time (or grade)dummies, and testing the significance of the interactionterms. In order to ensure comparability in the interactions,we normalize the test scores for each grade/year.20 Resultsfrom estimating our baseline model with these interactionterms are presented in Table 2. There are three middle-school grades in the sample: 6, 7 and 8. Thus we includeinteractions with grade 6 and with grade 7. We presenttests for the joint significance of all grade-input interac-tions, as well as separate tests for the significance of gradesix and grade seven interactions. For the first and second-lag interactions we reject the null of grade invariance at the1% significance level in all but the FD regressions. In the FD
19 This is due to our use of explicit teacher indicators and Stata's limit of 10,998 explanatory variables. For models with only contemporaneous teacher effects, there are multiple routines available that would work with the entire statewide sample. See McCaffrey, Lockwood, Mihaly, and Sass (2012).
20 Results without this normalization are provided in Table A1 of the Appendix.
regression, the first-lag interactions are significant at the 5% level, although the second-lag interactions are less significant. Overall, we find little support for the common assumption that prior inputs affect achievement in the same way regardless of the grade in which they are applied.
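The interaction test used above can be sketched on simulated data in which one input's effect genuinely differs by grade; the joint Wald statistic on the grade-input interactions is then large. The setup below is illustrative only, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6000
grade = rng.integers(6, 9, size=n)            # grades 6-8
x = rng.normal(size=n)                        # one time-varying input
beta_g = {6: 1.0, 7: 1.0, 8: 1.4}             # effect differs in grade 8 (invented)
y = np.array([beta_g[g] for g in grade]) * x + rng.normal(size=n)

# Interact the input with grade-6 and grade-7 dummies (grade 8 omitted)
d6, d7 = (grade == 6).astype(float), (grade == 7).astype(float)
X = np.column_stack([np.ones(n), x, d6, d7, x * d6, x * d7])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)

# Joint Wald (approximate F) test that both interaction coefficients are zero
R = b[4:6]
W = R @ np.linalg.inv(V[4:6, 4:6]) @ R / 2
print(round(W, 1))
```

Under grade invariance W would be near 1; here the grade-varying effect drives it far above conventional critical values.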
4.2. General rate of decay of prior inputs
As discussed in Section 2, the validity of the geometric decay assumptions (immediate, partial, and no persistence) can be tested by determining if prior inputs have significant effects in the appropriate achievement models. In each model, finding significant effects of prior inputs would suggest that the model is incorrect (or too restrictive), so that the estimating equation is mis-specified and the resulting estimates of the teacher effects and coefficients on other inputs may be biased. In models that assume there are no unobserved student/family characteristics that may be correlated with observed inputs, significance of past inputs would also indicate that the 'no correlated unobserved heterogeneity' assumption is likely false.
We perform the tests after estimating the augmentedversions of the six student achievement models consideredin Section 2. To make the test computationally feasible welimit the additional terms to prior-year teacher identitiesand first, second and third lags of non-teacher schoolinginputs. As reported in Table 3, we strongly reject the nullthat prior inputs have no effect on current achievement inall cases. This finding suggests that all common geometricdecay models are incorrect. The test statistics are notice-ably larger in the most restrictive model that assumesimmediate decay and no unobserved heterogeneity (firstcolumn in Table 3). Such a result is expected if unobservedheterogeneity is an omitted variable, so that estimatedcoefficients on lagged inputs capture both the direct effectsof the inputs and the effects due to non-zero correlationbetween observed inputs and unobserved student/familyinputs. All other regressions account for unobserved inputsat least partially and hence, it is not surprising that laggedinputs in those regressions are less significant.
4.3. Input-specific decay of past inputs
Table 4 reports results from testing the null hypothesisthat input-specific decay is geometric against the alterna-tive that the rate of decay for that particular input is notgeometric. Tests were performed separately for each inputafter estimating Eq. (4) with varying assumptions aboutthe nature of the student/family input. For computationaltractability, we include three lags of each input and test thenull that the ratio of the coefficients on the first andsecond-lagged inputs equals the ratio of the coefficients onthe second and third lagged inputs. The results indicatethat we cannot reject the null that each input decays at itsown geometric rate. This is true whether we assume thatcorrelated unobserved student/family inputs are notpresent (OLS model), are time constant (FE, FD models)or follow a time trend (FE on FD model).21
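One illustrative way to implement the ratio test H0: α3/α2 = α2/α1 is to rewrite it as θ = α2² − α1·α3 = 0 and form a delta-method Wald statistic. The sketch below, with an invented data-generating process, fails to reject when the lags decay geometrically and rejects when they do not; it is not the paper's estimation code.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
a = np.array([1.0, 0.5, 0.25])        # geometric decay: 0.25/0.5 = 0.5/1.0
X = rng.normal(size=(n, 3))           # first, second, third lags of one input
y = X @ a + rng.normal(size=n)

def ratio_test(y, X):
    """Delta-method z-statistic for H0: a2^2 - a1*a3 = 0."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    V = (e @ e / (n - 3)) * np.linalg.inv(X.T @ X)
    theta = b[1] ** 2 - b[0] * b[2]
    grad = np.array([-b[2], 2 * b[1], -b[0]])
    return theta / np.sqrt(grad @ V @ grad)

z = ratio_test(y, X)                  # geometric decay: should not reject

# Non-geometric decay (0.5/1.0 != 0.4/0.5): the test should reject
y2 = X @ np.array([1.0, 0.5, 0.4]) + rng.normal(size=n)
z2 = ratio_test(y2, X)

print(round(z, 2), round(z2, 1))
```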
4.4. Is the effect of the unobserved student/family input time-invariant?
As discussed above, the finding that test statistics are larger in the most restrictive immediate decay model indicates the presence of correlated unobserved heterogeneity. Correlations between the current and lagged residuals reported in the next-to-last row of Table 3 are also informative about the type of unobserved heterogeneity. There is a positive correlation in residuals in the most restrictive model, which assumes immediate decay and no correlated unobserved heterogeneity (first column in
Table 1
Summary statistics for Florida public school students in 20 randomly
selected districts, 2002/2003–2007/2008.
Mean Std. Dev.
Student characteristics
Female 0.523 0.499
Black 0.205 0.403
Hispanic 0.159 0.366
Asian 0.024 0.154
American Indian 0.003 0.054
Math Score 681.312 28.246
Math Gain 6.465 22.561
Free/Reduced-Price Lunch 0.392 0.488
Number of Schools 1.013 0.113
Disciplinary Incidents 0.459 1.466
Structural Mover 0.314 0.464
Non-Structural Mover 0.140 0.347
Gifted 0.041 0.198
Mental Disability 0.000 0.015
Physical Disability 0.001 0.036
Emotional Disability 0.003 0.057
Other Disability 0.003 0.058
Speech/Language Disability 0.021 0.143
Learning Disability 0.021 0.143
Limited English Proficiency 0.026 0.160
Teacher characteristics
Advanced Degree 0.351 0.477
Professional Certificate 0.844 0.362
Years of Experience 9.290 11.891
1–2 Years of Experience 0.186 0.389
3–4 Years of Experience 0.119 0.324
5–9 Years of Experience 0.202 0.401
10–14 Years of Experience 0.126 0.332
15–24 Years of Experience 0.151 0.358
25 Years Plus 0.136 0.343
Class and peers’ characteristics
Math Class Size 24.695 5.049
Peers Proportion Female 0.500 0.115
Peers Proportion Black 0.212 0.226
Peers Proportion Hispanic 0.177 0.155
Peers Proportion Asian 0.025 0.041
Peers Average Age in Months 149.390 9.733
Peers Proportion Changed Schools 0.565 0.406
Peers Proportion Structural Movers 0.376 0.400
Number of observations 209,379
21 In addition to the test results presented in Table 4, we also tested to
see if various combinations of inputs share a common decay rate.
Occasionally we uncovered cases where we could not reject a common
decay rate for two or more inputs, but they were infrequent and did not
follow any particular pattern. For example, the effects of various teacher
credentials did not decay at similar rates.
Table 3). This indicates there is persistence in unobserved factors that determine student performance and once again suggests that a time-constant student/family effect is likely present and is part of the error. In contrast, the residual correlation is negative in the gains model without unobserved heterogeneity (second column in Table 3). Because adding lagged inputs to the gains model (second column of Table 3) is equivalent to first differencing Eq. (4), one would expect positive serial correlation in residuals if the unobserved effect were trending. The fact that the correlation coefficient in the second column of Table 3 is negative and close to −0.5 is consistent with a situation where idiosyncratic errors in (4) are serially independent and the unobserved effect is time-constant.

The correlation between the current and lagged residual in the partial persistence model with time-constant unobserved effect is −0.404 (last column in Table 3), which again speaks against the unobserved trend model. However, the correlation is statistically different from −0.5, suggesting that idiosyncratic errors in the model are serially correlated (although only slightly), so that the employed instrument (lagged test score) may not be valid.

More evidence on the type of unobserved heterogeneity is provided in Table 4, which reports estimation results for the baseline model (Eq. (4)). Similar to Table 3, the patterns observed in Table 4 indicate the presence of a time-constant unobserved student/family effect. Specifically, in the model that does not account for unobserved heterogeneity (first column in Table 4), residuals are strongly positively correlated. However, after the unobserved effect is removed by differencing (third column of Table 4), the correlation between the current and lagged residuals is negative and reasonably close to −0.5. Similar to the discussion above, this again indicates that the unobserved effect is time-constant rather than trending.
4.5. Tests of strict exogeneity
In Table 5 we present results from tests of strictexogeneity for several models with varying assumptionsregarding the persistence of schooling inputs and thenature of the unobserved student/family input. In everycase we strongly reject the null that future teacherassignments have no ‘‘effect’’ on current student achieve-ment. In cases when the model is estimated by OLS, jointsignificance of future teacher indicators signifies thepresence of unobserved student/family characteristicsthat are correlated with observed inputs. In cases whenfixed effects or differencing are used, significance of futureteacher indicators suggests that student assignment toteachers is based in part on realized prior achievement.Thus, a key assumption that is needed for the fixed effectsand first-difference estimators to yield asymptoticallyunbiased estimates fails.
4.6. Differences in estimates across models
The results above suggest several general conclusions.An unobserved student/family effect is present and is time-invariant. Even in the models that account for unobservedheterogeneity, lagged inputs are significant, suggestingthat the common-geometric-rate-of-decay assumptionfails (though input-specific geometric decay assumptioncould not be rejected). Both grade-invariance and strictexogeneity are rejected.
Given that virtually all assumptions that are used in formulating and estimating student achievement models are rejected, it is expected that commonly used empirical value-added models produce biased estimates of teacher productivity. This is not surprising. From a policy perspective, the more important issue is the magnitude of
Table 2
Tests for grade invariance (based on the augmented baseline model).

Assumption regarding student/family inputs and estimation method:
                                              No correlated        Time-constant        Time-constant        Trending
Interaction term                              unobserved effect    unobserved effect    unobserved effect    unobserved effect
                                              (OLS)                (FE)                 (FD)                 (FE on first differences)
Grade 6 and 7 × Once Lagged Covariates        F(57,136621)=19.79   F(48,136621)=1.55    F(55,62093)=1.44     F(49,62093)=1.94
                                              (0.00)               (0.008)              (0.02)               (0.00)
Grade 6 × Once Lagged Covariates              F(28,136621)=7.94    F(19,136621)=1.52    F(26,62093)=1.62     F(23,62093)=1.68
                                              (0.00)               (0.07)               (0.02)               (0.02)
Grade 7 × Once Lagged Covariates              F(29,136621)=14.66   F(29,136621)=1.33    F(29,62093)=1.63     F(26,62093)=2.30
                                              (0.00)               (0.11)               (0.02)               (0.00)
Grade 6 and 7 × Twice Lagged Covariates       F(58,136621)=6.90    F(56,136621)=2.14    F(55,62093)=1.20     F(51,62093)=2.52
                                              (0.00)               (0.00)               (0.15)               (0.00)
Grade 6 × Twice Lagged Covariates             F(29,136621)=9.74    F(29,136621)=1.66    F(27,62093)=1.46     F(25,62093)=2.49
                                              (0.00)               (0.01)               (0.06)               (0.00)
Grade 7 × Twice Lagged Covariates             F(29,136621)=8.87    F(27,136621)=2.19    F(28,62093)=0.83     F(26,62093)=1.43
                                              (0.00)               (0.00)               (0.70)               (0.07)
Grade 6 and 7 × Three Times Lagged Covariates F(58,136621)=3.67    F(55,136621)=1.15    F(29,62093)=1.50     F(28,62093)=1.86
                                              (0.00)               (0.20)               (0.04)               (0.00)
Grade 6 × Three Times Lagged Covariates       F(29,136621)=5.82    F(27,136621)=1.32    F(26,62093)=1.10     F(25,62093)=1.87
                                              (0.00)               (0.12)               (0.33)               (0.01)
Grade 7 × Three Times Lagged Covariates       F(29,136621)=2.38    F(28,136621)=1.27    F(29,62093)=1.50     F(28,62093)=1.86
                                              (0.00)               (0.15)               (0.04)               (0.00)

Note: The table displays the F-statistics for testing grade invariance. All regressions use year-by-grade normalized test scores and include grade, year and school dummies, teacher indicators for the current and last periods, as well as three lags of time-varying inputs. p-Values are reported in parentheses under the test statistics.
Table 3
Tests of the geometric decay assumption (immediate, complete or partial persistence).
Effects of student/family inputs decay at the same rate as other inputs (first three columns) | Effects of student/family inputs are time constant (last three columns)
Model name Immediate decay
model
Gains model Partial persistence
model
Immediate
decay model with
student fixed effects
Gains model with
student fixed
effects
Partial persistence
model with student
fixed effects
Model under H0: A_it = αX_it + βE_it + η_it | ΔA_it = αX_it + βE_it + η_it | A_it = αX_it + βE_it + λA_it−1 + η_it | A_it = αX_it + βE_it + γ_i + η_it | ΔA_it = αX_it + βE_it + γ_i + η_it | A_it = αX_it + βE_it + λA_it−1 + γ_i + η_it
Estimation method OLS OLS OLS FE FE FD-IV
Lagged Teacher F(1880,136621) = 6.35 F(1880,136621) = 4.15 F(1880,136621) = 4.67 F(1773,136621) = 9.01 F(1773,136621) = 6.46 F(1901,136621) = 3.19
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
Once Lagged F(29,136621) = 89.69 F(29,136621) = 5.57 F(29,136621) = 24.51 F(29,136621) = 3.49 F(29,136621) = 1.42 F(29,136621) = 8.30
Covariates (0.000) (0.000) (0.000) (0.000) (0.065) (0.000)
Twice Lagged F(29,136621) = 10.67 F(29,136621) = 1.61 F(29,136621) = 3.33 F(29,136621) = 1.39 F(29,136621) = 1.45 F(29,136621) = 5.41
Covariates (0.000) (0.020) (0.000) (0.081) (0.054) (0.000)
Three Times Lagged F(29,136621) = 12.82 F(29,136621) = 2.49 F(29,136621) = 5.43 F(28,136621) = 1.59 F(28,136621) = 1.72 F(29,136621) = 6.46
Covariates (0.000) (0.000) (0.000) (0.025) (0.011) (0.000)
Rate of Persistence, λ 0.00 1.00 0.65 0.00 1.00 −0.021
(0.000) (0.000)
Corr(res_t, res_t−1) 0.493 −0.420 −0.206 −0.404
[0.003] [0.003] [0.004] [0.003]
Strength of the Instruments F(1,136621) = 34968.89
(0.000)
Notes: Top rows of table report results of F-tests where the null hypothesis is that the effects of the inputs reported in the rows are (jointly) equal to zero. Grade, year and school dummies, as well as current period
inputs and current period teacher dummies are included in all regressions. In the last column (FD-IV), the equation was differenced and estimated by the instrumental variables estimator with twice-lagged
achievement score as the instrument for the differenced first lag of the test score. Because differencing teacher indicators was not feasible, instead of differencing these variables we included the first and second lags
of teacher indicators in the FD-IV regression. Under the null that prior teachers do not matter, the second lag of teacher indicators should be equal to zero. This is the joint hypothesis that we test after running the
FD-IV regression (first row in the last column). For all tests, p-values are reported in parentheses under the test statistics for testing the joint significance of the corresponding variables. The second to last row reports
the serial correlation in residuals (standard errors are reported in brackets underneath).
the bias and whether some models yield estimates with less bias than others. Unfortunately, absent true random assignment of students and teachers, we cannot know true teacher productivity and hence the magnitude of the bias cannot be directly assessed. However, we can compare results from commonly estimated models to our baseline model to determine the degree to which estimates vary from those produced with the fewest possible assumptions about the educational process. Further, we can determine which assumptions have the greatest impact on the resulting estimates of teacher productivity.

In Table A1 of the Appendix, we present results from tests comparing estimated coefficients for selected time-varying inputs and teacher effects produced by different models. The test results indicate that estimated teacher
effects are often statistically similar. However, this does not necessarily mean that the differences are practically unimportant. Likewise, we find that coefficients on other time-varying variables are often statistically different, which nevertheless does not guarantee that the estimates differ much in practical terms. Because in policy decisions the magnitude of the differences in estimates from competing models is most relevant, in what follows we consider other approaches that help to assess the degree of similarity in the effects of time-varying inputs and estimated teacher effects.
Although our focus is on the assumptions of value-added models and the potential for bias in measuringteacher effects, another relevant part of policy decisions isestimation error. Given finite samples of students per
Table 5
Tests of strict exogeneity.

Model (estimation method): test statistic (p-value)
Baseline model, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + η_it (OLS): F(1929,91945) = 26.583 (0.000)
Baseline model with student fixed effect, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + γ_i + η_it (FE): F(1470,91945) = 6.564 (0.000)
Baseline model with student fixed effect, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + γ_i + η_it (FD): F(1824,39868) = 17.256 (0.000)
Partial decay model, A_it = λA_it−1 + α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + η_it (OLS): F(1929,91945) = 17.681 (0.000)
Partial decay model with student fixed effect, A_it = λA_it−1 + α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + γ_i + η_it (FD-IV): F(2346,91945) = 6.441 (0.000)
Baseline model with student-specific trend, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + tγ_i + η_it (FE on first differences): F(1736,91945) = 7.862 (0.000)

Note: The table displays the F-statistics for testing joint significance of future teacher indicators. p-Values are reported in parentheses under the test statistics. All models include: (i) grade, year and school indicators, (ii) current-year time-varying non-teacher inputs and teacher indicators, (iii) prior-year teacher indicators, (iv) three prior years of non-teacher inputs and (v) future teacher indicators. In the regressions with partial persistence (with lagged test scores on the right-hand side), the second lag of the test score is used as an instrument for the differenced first lag of the test score. In the model where the unobserved effect is trending, the third lag of non-teacher inputs was not first differenced due to a lack of four-lagged data.
Table 4
Tests for input-specific geometric decay and the time-constant unobserved effect (based on the baseline model).

Assumption regarding student/family inputs and estimation method:
                                  No correlated        Time-constant        Time-constant        Trending
                                  unobserved effect    unobserved effect    unobserved effect    unobserved effect
                                  (OLS)                (FE)                 (FD)                 (FE on first differences)
Reduced/Free Lunch                0.30 (0.59)          0.29 (0.59)          0.39 (0.53)
Math Class Size                   1.43 (0.23)          0.04 (0.83)          1.27 (0.26)          3.40 (0.07)
Non-Structural Mover              0.51 (0.48)          0.03 (0.86)          0.00 (0.95)          0.15 (0.70)
1–2 Years of Experience           0.38 (0.54)          0.59 (0.44)          1.21 (0.27)          0.01 (0.90)
3–4 Years of Experience           0.37 (0.54)          1.08 (0.30)          0.00 (0.95)          0.00 (0.99)
Advanced Degree                   0.11 (0.74)          0.01 (0.94)          0.02 (0.89)          0.19 (0.67)
Professional Certificate          0.00 (0.96)          0.00 (0.95)          0.24 (0.63)          1.91 (0.17)
Corr(residual_t, residual_t−1)    0.49 [0.003]                              −0.44 [0.007]

Note: Top rows of the table display the F-statistics and t-statistics for testing input-specific geometric decay, i.e. H0: α_{3,j}/α_{2,j} = α_{2,j}/α_{1,j} or β_{3,j}/β_{2,j} = β_{2,j}/β_{1,j}. The last row reports the correlation coefficient between the current and lagged residuals. All regressions include grade, year and school dummies, teacher indicators for the current and last periods, as well as three lags of time-varying inputs. p-Values are reported in parentheses under the test statistics. Standard errors are reported in brackets under the correlation coefficients.
teacher, the mean squared error of teacher effects will be afunction of both bias and estimation error and thus a veryprecise estimator with a small degree of bias could bepreferred to a less precise unbiased estimator. Unfortu-nately, it is difficult to empirically assess the bias-efficiency tradeoff.22 However, in the concluding sectionwe do discuss the likely tradeoffs between bias andefficiency, particularly with respect to the use of studenteffects to control for unobserved heterogeneity.
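The bias-efficiency point can be made concrete with a toy calculation: since MSE = bias² + variance, a slightly biased but precise estimator can have lower mean squared error than an unbiased but noisy one. The numbers below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)
true_effect, reps = 2.0, 20000

# Estimator A: unbiased but noisy (e.g., few students per teacher)
est_a = true_effect + rng.normal(scale=1.0, size=reps)
# Estimator B: slightly biased but precise (e.g., a shrinkage-style estimator)
est_b = (true_effect - 0.3) + rng.normal(scale=0.3, size=reps)

mse = lambda est: np.mean((est - true_effect) ** 2)
print(round(mse(est_a), 2), round(mse(est_b), 2))   # ~1.00 vs ~0.18 = 0.3**2 + 0.3**2
```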
4.7. Comparing coefficients on selected time-varying inputs
For accountability purposes, the focus is on the teachereffect estimates that are produced from value-addedmodels. However, in many policy applications, themarginal effects of individual teacher characteristics, likeexperience and educational attainment, or the impacts ofclassroom variables like class size, are of interest.
Table 6 presents parameter estimates of the sixcommon value-added models and the baseline model.
While all models indicate that rookie teachers (the omittedcategory) do not perform as well as more experiencedteachers, the marginal effects of experience appear to berather different across models. The models that produceestimates closest to the baseline model are the gains model
and the partial persistence model with student fixed effects. Differences in the estimated impact of teacher educational attainment are less pronounced across models; all yield negative estimated effects in the range of −0.5 to −1.5. Likewise, with the exception of the gains model with student fixed effects, the class size effects are all fairly small; the point estimate for the baseline model is −0.02 and estimates from the other models range from −0.06 to +0.04.
4.8. Comparison of teacher rankings across models
For individual teacher effects, one is generally lessconcerned about the specific value of the point estimate.Rather, the relative ranking of teachers is of greaterinterest, particularly in the context of performance paysystems for teachers. There are various ways in which onecan assess how different models rank teachers. One waywould be to compare the rank correlations of all teacher
Table 6
Comparing coefficients on selected time-varying variables.

Models: (1) Baseline; (2) Levels model with student covariates; (3) Gains model with student covariates; (4) Partial persistence model with student covariates; (5) Levels model with student fixed effects; (6) Gains model with student fixed effects; (7) Partial persistence model with student fixed effects.

                            (1)        (2)        (3)        (4)        (5)        (6)        (7)
Estimation method           FE         OLS        OLS        OLS        FE         FE         FD-IV
Lagged inputs included?     Yes        No         No         No         No         No         No

Student characteristics
Free/Reduced-Price Lunch    0.268      −4.461***  −0.761***  −1.987***  0.336      0.283      0.190
                            (0.282)    (0.165)    (0.108)    (0.107)    (0.266)    (0.436)    (0.163)
Non-Structural Mover        −0.234     0.780**    −0.450*    −0.042     −0.023     0.019      −0.251
                            (0.366)    (0.334)    (0.266)    (0.248)    (0.327)    (0.547)    (0.183)

Teacher characteristics
1–2 Years of Experience     1.538***   2.627***   1.747***   2.039***   1.857***   3.103***   1.330***
                            (0.492)    (0.373)    (0.307)    (0.285)    (0.398)    (0.657)    (0.202)
3–4 Years of Experience     2.060***   3.465***   2.007***   2.490***   2.222***   4.471***   1.839***
                            (0.698)    (0.529)    (0.440)    (0.407)    (0.585)    (0.972)    (0.253)
5–9 Years of Experience     1.492*     4.350***   1.910***   2.719***   1.651**    4.497***   1.715***
                            (0.800)    (0.605)    (0.504)    (0.466)    (0.689)    (1.141)    (0.253)
10–14 Years of Experience   2.279**    5.559***   2.175***   3.297***   1.397      4.949***   1.838***
                            (0.980)    (0.744)    (0.624)    (0.575)    (0.853)    (1.417)    (0.273)
15–24 Years of Experience   1.929*     5.638***   2.310***   3.414***   1.079      5.212***   1.792***
                            (1.122)    (0.863)    (0.725)    (0.668)    (1.002)    (1.660)    (0.275)
25 Years Plus               2.320*     7.816***   2.931***   4.550***   1.074      7.022***   2.072***
                            (1.358)    (1.072)    (0.901)    (0.828)    (1.203)    (2.010)    (0.307)
Advanced Degree             −1.049     −1.479**   −0.367     −0.736     −1.231*    −0.750     −0.521***
                            (0.744)    (0.640)    (0.528)    (0.486)    (0.670)    (1.115)    (0.159)
Professional Certificate    −0.606     0.362      −0.924**   −0.498     −0.735     −1.620*    −0.489**
                            (0.600)    (0.471)    (0.388)    (0.360)    (0.500)    (0.844)    (0.232)

Class characteristics
Class Size                  −0.020     0.036**    −0.057***  −0.026**   −0.037**   −0.097***  −0.003
                            (0.019)    (0.017)    (0.014)    (0.013)    (0.018)    (0.030)    (0.002)

Number of observations      209,379 in all models
R-Squared                   0.462      0.486      0.078      0.697      0.443      0.081      0.120
Number of students          136,622 in all models

* Significant at the 10% significance level.
** Significant at the 5% significance level.
*** Significant at the 1% significance level.
22 We present comparisons of the estimated standard errors across
models and discuss the relevant issues further in the Appendix.
effect estimates across pairs of models. We present such estimates in Appendix Table A4. However, when estimates of teacher productivity are used for accountability, identifying the most and least productive teachers is most important. Typically the least productive teachers are targeted for dismissal or for remediation, such as additional professional development. In contrast, the most productive teachers are eligible for bonuses or permanent increases in salary. Changes in the rankings of teachers in the middle of the productivity distribution are generally of little consequence. Therefore we focus our comparison of model estimates on the teachers who are identified as the most or least productive. Table 7 provides information on the degree to which pairs of models overlap in their identification of the most and least productive teachers, those in the top and bottom deciles.23

The extent of overlap is substantial, ranging from 17% to 94% (if the estimates were independent the overlap would only be 0.1 × 0.1 or 1%). The highest degree of overlap is between the partial persistence models with a time constant effect and either one or two lagged scores. The overlaps are also high among models with partial persistence and at least one lagged test score, but no unobserved effect; adding additional lagged scores or lagged inputs only affects 20–25% of teachers identified as being in the top/bottom of teacher rankings in those models.

We also find a relatively strong overlap between models with no unobserved heterogeneity, which typically falls into the 50–81% range (upper left quadrants of the top-10% and bottom-10% matrices in Table 7). The only exception is a rather low correlation between the models with and without lagged inputs (first row, first column in each matrix). When comparing models with and without an unobserved student/family effect, the overlap is generally lower and ranges from roughly 12% to about 39% (lower left quadrants of the matrices). The latter finding suggests that accounting for unobserved heterogeneity has a substantial impact on the estimates of teacher rankings. However, the small overlap may also be partly due to the fact that FE removes much variation from the data, which makes the estimates noisier.24

When looking at the overlap among models with unobserved heterogeneity (the lower right quadrants of the two matrices in Table 7), the overlap is relatively high among partial decay models that include different numbers of test score lags, a result noted above. However, the identification of the most and least productive teachers obtained from fixed effects and first-difference regressions is rather different (numbers at the intersection of the last three rows and first two columns in the lower right quadrants of the matrices). One possible explanation for these differences may be the presence of measurement error in students' test scores. The first-difference estimator is used in dynamic (geometric decay) models where the lagged score appears on the right-hand side. If measurement errors are not correlated over time, then using the lagged score as an instrument would resolve all endogeneity problems, including the measurement error problem. However, if errors in measuring student performance persist over time, then using the lagged score as an instrument does not resolve the problem. In contrast, measuring test scores with error does not cause bias in fixed effects estimates when only current and lagged inputs are among the regressors. Fixed effects estimation may also be a preferred estimation method when strict exogeneity fails due to non-random dynamic sorting of students to teachers. If student sorting is transitory (e.g., if only previous-period performance matters for the current teacher assignment, while the more distant past does not), the asymptotic bias due to the violation of strict exogeneity is ‘‘averaged’’ over multiple periods. Thus, the asymptotic bias (or inconsistency) becomes smaller when multiple years of data are used.25 The first-difference estimator does not have this property.

Table 7
Percent of teachers classified as top/bottom 10% in both models (different combinations of two models). Row and column labels give decay/estimation method/no. of test score lags/no. of input lags (N = no decay; P = partial decay). The first five models have no unobserved effect; the remaining models include a time constant effect. Baseline model is in bold.

Top 10%      N/OLS/0/0  P/OLS/0/3  P/OLS/1/0  P/OLS/3/0  P/OLS/3/3  N/FE/0/0  P/FE/0/3  P/FD/1/0  P/FD/2/0
No unobserved effect
N/OLS/0/0      100.0
P/OLS/0/3       22.0     100.0
P/OLS/1/0       55.0      59.6     100.0
P/OLS/3/0       55.5      51.4      78.9     100.0
P/OLS/3/3       49.5      50.5      72.0      79.4     100.0
Time constant effect
N/FE/0/0        33.5      12.4      23.9      26.6      30.7     100.0
P/FE/0/3        21.1      15.1      21.1      24.8      29.4      45.4    100.0
P/FD/1/0        28.4      22.9      33.0      32.6      33.9      17.0     25.2    100.0
P/FD/2/0        27.5      24.8      34.9      32.6      35.3      17.4     24.3     94.0    100.0
P/FD/2/3        36.7      24.3      35.8      33.5      39.0      21.1     27.1     65.1     62.8

Bottom 10%   N/OLS/0/0  P/OLS/0/3  P/OLS/1/0  P/OLS/3/0  P/OLS/3/3  N/FE/0/0  P/FE/0/3  P/FD/1/0  P/FD/2/0
No unobserved effect
N/OLS/0/0      100.0
P/OLS/0/3       36.7     100.0
P/OLS/1/0       69.7      53.7     100.0
P/OLS/3/0       70.2      45.0      77.1     100.0
P/OLS/3/3       66.1      50.0      69.3      81.2     100.0
Time constant effect
N/FE/0/0        38.5      23.9      36.2      29.4      30.7     100.0
P/FE/0/3        33.0      27.1      35.8      34.9      35.8      52.8    100.0
P/FD/1/0        28.4      17.9      26.1      25.2      22.9      22.0     24.3    100.0
P/FD/2/0        28.4      18.8      27.1      26.1      23.4      21.1     24.8     95.9    100.0
P/FD/2/3        33.9      20.2      32.6      31.2      31.2      27.1     28.0     61.9     60.6

23 Comparisons based on identifying teachers in the top and bottom quintiles of the productivity distribution are provided in the Appendix.
24 A table that summarizes information about standard errors of the estimated teacher effects is presented in the Appendix. Indeed, standard errors are the largest in the FE regressions. Further discussion of standard errors is provided in the Appendix.
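The attenuating role of classical measurement error can be illustrated with a stylized simulation. This is a hedged sketch, not the paper's FD-IV estimator: a single cross-section in which the lagged score is observed with serially uncorrelated noise, OLS attenuates the persistence coefficient toward zero, and instrumenting the noisy score with a second, independently noisy measurement of the same achievement restores consistency. If the two measurement errors were correlated (the analogue of serially correlated errors above), the instrument would fail. All variable names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gamma = 0.8  # true persistence of lagged achievement (assumed)

true_lag = rng.normal(size=n)                          # true lagged achievement
y = gamma * true_lag + rng.normal(scale=0.5, size=n)   # current score

# Two noisy measurements of the same lagged achievement with
# independent (serially uncorrelated) measurement errors.
x = true_lag + rng.normal(scale=0.5, size=n)  # regressor: noisy lagged score
z = true_lag + rng.normal(scale=0.5, size=n)  # instrument: second noisy measure

# OLS slope of y on x is attenuated by the noise in x:
# plim = gamma * Var(true)/(Var(true) + Var(noise)) = 0.8 * 1/1.25 = 0.64
ols = np.cov(y, x)[0, 1] / np.var(x)

# IV slope using z is consistent for gamma because the two
# measurement errors are mutually independent.
iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]

print(f"true {gamma:.2f}  OLS {ols:.3f}  IV {iv:.3f}")
```

With a large sample the OLS estimate lands near the attenuated value 0.64 while the IV estimate lands near the true 0.8, mirroring why uncorrelated measurement errors make a lagged-score instrument work.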
Finally, the model that produces the greatest overlap with the baseline model is the gains model with student fixed effects (i.e., no decay/FE/no lagged scores/no lagged inputs); about half of the teachers identified as being in the top or bottom categories in the baseline model also appear in the top/bottom categories when the gains model with student fixed effects is employed.26
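Overlap statistics of the kind reported in Table 7 are straightforward to compute from two vectors of estimated teacher effects. The sketch below is illustrative (simulated effects stand in for actual estimates): it counts the share of one model's top-decile teachers who also land in the other model's top decile, so a model compared with itself yields 100%.

```python
import numpy as np

def decile_overlap(effects_a, effects_b, top=True, pct=0.10):
    """Percent of teachers in the top (or bottom) decile under model A
    who are also in that decile under model B."""
    effects_a = np.asarray(effects_a)
    effects_b = np.asarray(effects_b)
    k = max(1, int(round(pct * effects_a.size)))
    order_a = np.argsort(effects_a)   # ascending rank order under model A
    order_b = np.argsort(effects_b)   # ascending rank order under model B
    if top:
        set_a, set_b = set(order_a[-k:]), set(order_b[-k:])
    else:
        set_a, set_b = set(order_a[:k]), set(order_b[:k])
    return 100.0 * len(set_a & set_b) / k

rng = np.random.default_rng(1)
model_a = rng.normal(size=1000)                        # simulated teacher effects
model_b = model_a + rng.normal(scale=0.5, size=1000)   # correlated second model

print(decile_overlap(model_a, model_a))  # a model always overlaps itself: 100.0
print(decile_overlap(model_a, model_b))  # partial overlap for correlated models
```

Under independence the expected overlap so defined is 10% of the decile group; values such as the 94% between the two first-difference models in Table 7 indicate near-identical classifications.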
5. Conclusion
Empirical research on teacher productivity has been based on ‘‘value-added’’ models, which are derived from an underlying structural model of cumulative achievement. Rarely have the assumptions required to obtain value-added models been tested, however, and never in a comprehensive way. Starting with a general model of student achievement, we specify the assumptions required to obtain commonly estimated models and derive econometric tests of those assumptions. Using data from Florida, we carry out these tests and find that nearly all of the simplifying assumptions are easily rejected.
One implication of our work is that estimates from commonly used value-added models cannot be given a structural interpretation. For example, the marginal effect of prior-year achievement on current test scores cannot be interpreted as the persistence of all educational inputs. Researchers seeking to derive structural parameters from value-added models need to employ the tests we have outlined in this paper to determine if such interpretations can be justified with their particular data set.
A second implication is that the data generating processes used in simulation work, which are based on some of the same assumptions used to create empirical value-added specifications, may be too simplistic. More complex processes appear to be at work. For example, the effects of prior inputs appear to vary with the grade level at which they are applied, and the rates of decay may vary across inputs. It would be valuable to know if the performance of value-added estimators differs when less restrictive data generating processes are used.
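A less restrictive data generating process of the kind suggested above can be sketched as follows. This is a hypothetical generator with assumed, illustrative parameters (weights and decay rates are not estimates from the paper): each grade's input enters with its own productivity weight and its own decay rate, rather than a single geometric decay common to all inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n_students, n_grades = 5000, 5

# Illustrative, assumed parameters: each grade's input has its own
# productivity (weight) and its own annual persistence (decay) rate.
weights = np.array([1.0, 1.2, 0.9, 1.1, 1.0])  # effect in the grade applied
decay = np.array([0.6, 0.8, 0.5, 0.7, 0.9])    # input-specific persistence

inputs = rng.normal(size=(n_students, n_grades))  # e.g., teacher quality by grade
scores = np.zeros((n_students, n_grades))

for g in range(n_grades):
    total = np.zeros(n_students)
    for past in range(g + 1):
        # an input applied in grade `past` decays for (g - past) years
        # at its own input-specific rate
        total += weights[past] * decay[past] ** (g - past) * inputs[:, past]
    scores[:, g] = total + rng.normal(scale=0.3, size=n_students)
```

Under a common geometric decay λ, the score in grade g would equal λ times the previous score plus the current input; this generator deliberately violates that restriction, so it can be used to probe how value-added estimators behave when the usual simplifying assumptions fail.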
For accountability purposes the underlying structural model is of little importance. All that is necessary is that value-added models yield accurate estimates of the relative performance of teachers, particularly those at the top and bottom of the productivity scale. This requires that value-added models produce unbiased (or at least minimally biased) predictions of student achievement and thereby yield measures of a teacher's impact on student achievement that are free of significant bias. Except in the unlikely event that the biases created by these assumption violations happen to all cancel out, our findings suggest that all commonly estimated value-added models are biased to some degree. We are not able to determine the magnitude of the bias, however.
Given the reality that many teacher evaluation systems already have a major test-score-based component and that this is unlikely to change in the near future, the choice between alternative value-added models is of significant importance. This is reinforced by our results, which indicate that teacher effect estimates from different value-added models vary greatly in many cases and that the overlap across models in the teachers identified as high or low performing can be low. We find that the teacher effect estimates that most closely align with our most flexible baseline model are those from a gains model with student fixed effects. Models that include student fixed effects tend to produce imprecise or ‘‘noisy’’ estimates of teacher performance, however. This loss of efficiency in student-fixed-effects models could outweigh the gains from reducing bias from unobserved student heterogeneity. Also, identification requires that significant numbers of students move between teachers over time, which may be problematic in some cases. Indeed, none of the value-added models currently employed in accountability systems use student fixed effects. Among models without
25 Koedel and Betts (2011) find evidence in support of transitory sorting and show that observing teachers over multiple time periods mitigates the dynamic sorting bias.
26 The observed degree of overlap (or more generally the correlation) among estimates from different models will depend on both the correlation of the systematic bias as well as the correlation of the estimation error. As suggested by a reviewer, we attempted to sort out these two effects by estimating teacher effects from distinct student cohorts in different periods in order to calculate the true variance in teacher effects (which equals the covariance in the estimated teacher effects between time periods) for each model. Unfortunately, limiting the sample to teachers who taught at Florida public schools for a sufficiently long period of time to allow estimation of teacher effects from two distinct time periods shrinks the number of comparable teachers by two-thirds. Moreover, we found that within-model covariances between teacher effects estimated over different time periods were generally small and often not statistically significantly different from zero. While one would expect positive correlations across time periods, the low observed covariances could result from significant biases with the direction of bias varying over time. Alternatively, the large reduction in sample size could result in extremely noisy estimates. It is also possible, though unlikely, that teacher effects were not constant over time. Whatever the reason for the small covariances, we do not have sufficient confidence in the estimates of the true variance to make judgments about the contributions of estimation error to the overlap in teacher effect estimates across models.
student fixed effects, a model with partial persistence, three lagged test scores and three lags of observable inputs comes closest to mimicking the estimates from our most flexible baseline model, though results are not much different for models with fewer lagged scores and/or fewer lagged inputs.

Acknowledgements

We wish to thank the staff of the Florida Department of Education's K-20 Education Data Warehouse for their assistance in obtaining and interpreting the data used in this study. The views expressed in this paper are solely our own and do not necessarily reflect the opinions of the Florida Department of Education. This work is supported by Teacher Quality Research grant R305M040121 from the United States Department of Education Institute for Education Sciences. Thanks also go to Anthony Bryk for useful discussion of this research and to John Gibson for excellent research assistance.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.econedurev.2013.10.003.

References

Andrabi, T., Das, J., Khwaja, A. I., & Zajonc, T. (2011). Do value-added estimates add value? Accounting for learning dynamics. American Economic Journal: Applied Economics, 3, 29–54.
Boardman, A. E., & Murnane, R. J. (1979). Using panel data to improve estimates of the determinants of educational achievement. Sociology of Education, 52, 113–121.
Bonesronning, H. (2004). The determinants of parental effort in education production: Do parents respond to changes in class size? Economics of Education Review, 23, 1–9.
Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2006). Teacher–student matching and the assessment of teacher effectiveness. The Journal of Human Resources, XLI, 778–820.
Dee, T. S. (2004). Teachers, race and student achievement in a randomized experiment. Review of Economics and Statistics, 86, 195–210.
Ding, W., & Lehrer, S. F. (2007). Accounting for unobserved ability heterogeneity within education production functions. (unpublished manuscript).
Figlio, D. N. (1999). Functional form and the estimated effects of school resources. Economics of Education Review, 18, 241–252.
Guarino, C. M., Reckase, M. D., & Wooldridge, J. (2012). Can value-added measures of teacher education performance be trusted? Working paper #18. East Lansing, MI: The Education Policy Center at Michigan State University.
Hanushek, E. A. (1979). Conceptual and empirical issues in the estimation of educational production functions. Journal of Human Resources, 14, 351–388.
Harris, D. N. (2007). Diminishing marginal returns and the production of education: An international analysis. Education Economics, 15, 31–45.
Harris, D. N., & Sass, T. R. (2006). Value-added models and the measurement of teacher quality. (unpublished manuscript).
Houtenville, A. J., & Conway, K. S. (2008). Parental effort, school resources and student achievement. Journal of Human Resources, XLIII, 437–453.
Jacob, B. A., Lefgren, L., & Sims, D. P. (2010). The persistence of teacher-induced learning. Journal of Human Resources, 45, 915–943.
Kane, T., McCaffrey, D., Miller, T., & Staiger, D. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment. Seattle, WA: Bill and Melinda Gates Foundation.
Kane, T., & Staiger, D. (2008). Estimating teacher impacts on student achievement: An experimental evaluation. Working paper #14607. Washington, DC: National Bureau of Economic Research.
Koedel, C., & Betts, J. (2011). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Education Finance and Policy, 6, 18–42.
Lockwood, J. R., & McCaffrey, D. F. (2007). Controlling for individual heterogeneity in longitudinal models, with applications to student achievement. Electronic Journal of Statistics, 1, 223–252.
Lockwood, J. R., & McCaffrey, D. F. (2009). Exploring student–teacher interactions in longitudinal achievement data. Education Finance and Policy, 4, 439–467.
McCaffrey, D. F., Lockwood, J. R., Mihaly, K., & Sass, T. R. (2012). A review of Stata routines for fixed effects estimation in normal linear models. The Stata Journal, 12(3), 1–27.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257.
Rice, J. K. (2003). Teacher Quality: Understanding the Effectiveness of Teacher Attributes. Washington, DC: Economic Policy Institute.
Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay and student achievement. Quarterly Journal of Economics, 125, 175–214.
Todd, P. E., & Wolpin, K. I. (2003). On the specification and estimation of the production function for cognitive achievement. The Economic Journal, 113, F3–F33.
Todd, P. E., & Wolpin, K. I. (2007). The production of cognitive achievement in children: Home, school and racial test score gaps. Journal of Human Capital, 1, 91–136.
Wayne, A. J., & Youngs, P. (2003). Teacher characteristics and student achievement gains. Review of Educational Research, 73, 89–122.
Wilson, S., Floden, R. E., & Ferrini-Mundy, J. (2001). Teacher Preparation Research: Current Knowledge, Gaps, and Recommendations. Seattle, WA: Center for the Study of Teaching and Policy.
Wilson, S., & Floden, R. (2003). Creating Effective Teachers: Concise Answers for Hard Questions. New York, NY: AACTE Publications.
Wooldridge, J. M. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teacher and classroom context effects on student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11, 57–67.