Economics of Education Review 38 (2014) 9–23
Value-added models and the measurement of teacher productivity

T.R. Sass a,*, Anastasia Semykina b, Douglas N. Harris c

a Department of Economics, Georgia State University, 14 Marietta Street NW, Atlanta, GA 30303, United States
b Department of Economics, Florida State University, 113 Collegiate Loop, Tallahassee, FL 32306, United States
c Department of Economics, 302 Tilton Hall, Tulane University, New Orleans, LA 70118, United States
1. Introduction
In the last dozen years the availability of administrative databases that track individual student achievement over time and link students to their teachers has radically altered how research on education is conducted and has brought fundamental changes to the ways in which educational programs and personnel are evaluated. Until the late 1990s, research on the role of teachers in student learning was limited primarily to cross-sectional analyses of student achievement levels or simple two-period studies of student achievement gains using relatively small samples of students and teachers.1 The advent of statewide longitudinal databases in Texas, North Carolina and Florida, along with the availability of micro-level longitudinal data from large urban school districts, has allowed researchers to track changes in student achievement as students move between teachers and schools over time. This in turn has permitted the use of panel data techniques to account for the influences of prior educational inputs, students and schools when evaluating the contributions of teachers to student achievement.

The availability of student-level panel data is also fundamentally changing school accountability and the measurement of teacher performance. In Tennessee, Dallas, New York City and Washington DC, models of individual student achievement have been used for many years to evaluate individual teacher performance. While the stakes are currently low in most cases, there is growing interest among policymakers in using estimates from student achievement models for high-stakes performance pay, school grades, and other forms of accountability. Chicago, Denver, Houston and Washington, DC have all adopted compensation systems for teachers based on student performance. Further, as a result of the federal Teacher Incentive Fund and Race to the Top initiatives, many more states and districts plan to implement performance pay systems in the near future. Florida is a particularly interesting case, as the state has recently adopted a very
A R T I C L E I N F O

Article history:
Received 18 May 2012
Received in revised form 30 October 2013
Accepted 30 October 2013

JEL classification:
I21
J24

Keywords:
Teacher productivity
Value added
A B S T R A C T
Research on teacher productivity, as well as recently developed accountability systems for
teachers, relies on ‘‘value-added’’ models to estimate the impact of teachers on student
performance. We consider six value-added models that encompass most commonly
estimated specifications. We test many of the central assumptions required to derive each
of the value-added models from an underlying structural cumulative achievement model
and reject nearly all of them. While some of the six popular models produce similar
estimates, other specifications yield estimates of teacher productivity and other key
parameters that are considerably different.
© 2013 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +1 404 413 0150.
E-mail addresses: [email protected] (T.R. Sass), [email protected] (A. Semykina), [email protected] (D.N. Harris).
1 For reviews of the early literature on teacher quality see Wayne and Youngs (2003), Rice (2003), Wilson and Floden (2003) and Wilson, Floden, and Ferrini-Mundy (2001).
0272-7757/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.econedurev.2013.10.003
aggressive teacher accountability system which relies on these panel data techniques.
Measurement of teacher productivity in both education research and in accountability systems is often based on estimates from panel-data models where the individual teacher effects are interpreted as a teacher's contribution to student achievement, or teacher "value-added." The theoretical underpinning for these analyses is the cumulative achievement model developed by Boardman and Murnane (1979), Hanushek (1979), and Todd and Wolpin (2003), where current student achievement is a function of a student's entire history of educational and family inputs. However, varying data constraints have led to a wide variety of empirical specifications being estimated in practice. Each empirical specification makes (typically implicit) assumptions about the parameters of the underlying structural model.

Understanding the assumptions being made and how to test their validity is important both for interpreting the estimates from empirical models and for determining whether the estimates are subject to bias. Model misspecification, including omitted variables, can yield parameter estimates that do not represent the constructs of the underlying structural model and are biased.

Two recent studies evaluate various value-added model specifications based on the criterion of minimizing bias in estimated teacher effects.2 Kane and Staiger (2008) conduct an experiment in which 78 pairs of teachers were randomly assigned to classrooms in the same grade and school. They then compare pairwise differences in estimated teacher effects from the experimental sample to differences in the pre-experiment value-added estimates of the same teacher pairs. Their analysis included estimates derived from seven value-added specifications with varying controls for student heterogeneity. For the five value-added models that accounted for prior-year student achievement, they could not reject the null that the estimated within-pair differences in teacher productivity were equivalent to the differences under random assignment. Random assignment is a key advantage, but experiments also have limitations. They can generally only be implemented on a small scale and only for individuals or institutions that voluntarily participate. For example, in Kane and Staiger (2008), they could only test whether within-school sorting is an issue, and only among pairs of teachers that a principal was comfortable randomly assigning students to. The original experiment was recently replicated across cities with a much larger sample, with the same general results, but the participation rate was once again very low and there were significant problems with non-compliance with the randomization (Kane, McCaffrey, Miller, & Staiger, 2013).

Guarino, Reckase, and Wooldridge (2012) generate simulated data under various student grouping and teacher assignment scenarios and then compare the estimates from alternative achievement model specifications to the known (generated) teacher effects. While no specification is superior under all student/teacher assignment scenarios, a model that estimates current achievement as a function of prior-year achievement and observable student and teacher/school inputs is the most robust. The simulation approach has the advantage of producing known "true" teacher effects that can be used to evaluate the estimates from alternative models. The disadvantage, however, is that there is no way to know if the selected data generating processes actually reflect the student-teacher matching mechanisms that occur in real-world data. In particular, the data generating processes they employ rely on a number of simplifying assumptions about the underlying cumulative achievement model.

We take a different approach and test the assumptions required to derive empirical value-added models from a structural model of student achievement. This allows us to determine whether the estimates from value-added models have a structural interpretation.3 The validity of the assumptions is also important because data generation processes used in simulation work rely on (often implicit) assumptions about the underlying structural model of student achievement. The disadvantage, however, is that failure of the underlying assumptions does not necessarily mean value-added models fail to accurately classify teacher performance for accountability. While we cannot directly test the magnitude of bias in value-added models, we can and do conduct simple hypothesis testing and consider how model specification affects the estimated productivity of teachers. By comparing estimated teacher effects across models of varying flexibility, we can evaluate the sensitivity of teacher rankings to specific modeling choices, each with differing data and computational costs. If the results are insensitive to modeling choices, then one can be less concerned about imposing false restrictions. But this is not what we find. The results are very sensitive to certain types of assumptions.

We begin our analysis in the next section by considering the general form of cumulative achievement functions and the assumptions which are necessary to generate empirical models that can be practically estimated. In that section we also delineate a series of specification tests that can be used to evaluate the assumptions underlying empirical value-added models. Section 3 discusses our data and in Section 4 we present our results. In the final section we summarize our findings and consider the implications for future research and for the implementation of accountability systems.
2. Value-added models and tests
2.1. A general cumulative model of achievement
In order to clearly delineate the empirical models that have been estimated and the assumptions underlying them, we begin with a general cumulative model of

2 Another branch of recent literature investigates alternative forms of the cumulative achievement function, emphasizing the impact of historical home and schooling inputs on current achievement. See Todd and Wolpin (2007), Ding and Lehrer (2007), Andrabi, Das, Khwaja, and Zajonc (2011) and Jacob, Lefgren and Sims (2010).
3 The origins of our work are Harris and Sass (2006), though the analysis has evolved considerably since that original working paper.
student achievement in the spirit of Boardman and Murnane (1979) and Todd and Wolpin (2003):

A_{it} = A_t[X_i(t), F_i(t), E_i(t), \mu_{i0}, \varepsilon_{it}]   (1)

where A_{it} is the achievement level for individual student i at the end of their tth year of life, and X_i(t), F_i(t) and E_i(t) represent the entire histories of individual, family and school-based educational inputs, respectively. The term \mu_{i0} is a composite variable representing time-invariant characteristics an individual is endowed with at birth (such as innate ability), and \varepsilon_{it} is an idiosyncratic error.

The vector of school-based educational inputs, E_i(t), contains both school-level inputs, such as the quality of principals and other administrative staff within a school,4 as well as a vector of classroom-level inputs.5 The latter group of inputs includes peer characteristics,6 time-varying teacher characteristics (such as experience), non-teacher classroom-level inputs (such as books, computers, etc.) and the primary parameter vector of interest, time-invariant teacher characteristics (including, for example, "innate" ability and pre-service education). The time-invariant teacher characteristics can be captured by a set of teacher indicator variables.7

If we assume that the cumulative achievement function, A_t[\cdot], is linear and additively separable,8 then we can rewrite the achievement level at grade t as:

A_{it} = \sum_{h=1}^{t} [\alpha_{ht} X_{ih} + \varphi_{ht} F_{ih} + \beta_{ht} E_{ih}] + c_t \mu_{i0} + \varepsilon_{it}   (2)

where \alpha_{ht}, \varphi_{ht} and \beta_{ht} represent the vectors of (potentially time-varying) weights given to individual, family and school inputs. The impact of the individual-specific time-invariant endowment in period t is given by c_t.
2.2. Cumulative model with fixed family inputs
Estimation of Eq. (2) requires data on both current and all prior individual, family and school inputs. However, administrative records contain only limited information on family characteristics and no direct measures of parental inputs.9 Therefore, it is necessary to assume that family inputs are constant over time and are captured by a student-specific fixed component, z_i.10 However, the marginal effect of these fixed parental inputs on student achievement may vary over time and is represented by \kappa_t. Thus, the effect of the fixed family input (\kappa_t z_i) is analogous to the effect of the student component (c_t \mu_{i0}) in (1).

The assumption of fixed parental inputs of course implies that the level of inputs selected by families does not vary with the level of school-provided inputs a child receives. For example, it is assumed that parents do not systematically compensate for low-quality schooling inputs by providing tutors or other resources.11 Similarly, it is assumed that parental inputs are invariant to achievement realizations; parents do not increase their inputs when their son or daughter does poorly in school.
The validity of the assumption that family inputs do not change over time is hard to gauge. Todd and Wolpin (2007), using data from the National Longitudinal Survey of Youth 1979 Child Sample (NLSY79-CS), consistently reject exogeneity of family input measures at a 90% confidence level, but not at a 95% confidence level. They have only limited aggregate measures of schooling inputs (average pupil-teacher ratio and average teacher salary measured at the county or state level) and the coefficients on these variables are typically statistically insignificant, whether or not parental inputs are treated as exogenous. Thus it is hard to know to what extent the assumed invariance of parental inputs may bias the estimated impacts of schooling inputs. It seems reasonable, however, that parents would attempt to compensate for poor school resources, and therefore any bias in the estimated impacts of schooling inputs would be toward zero.
If we assume that the marginal effects of the endowment and family inputs are equal to each other in each period, i.e., \kappa_t = c_t, then we can re-label this effect as v_t and
4 Typically administrative data provide little information on time-varying school-level inputs like scheduling systems, curricular choices, leadership styles and the like. Consequently, it is common to replace the vector of school characteristics with a school fixed effect. When school-level effects are included, the teacher fixed effect captures the effect of a given teacher's time-invariant characteristics on her students' achievement relative to other teachers at the same school. This obviously limits the comparison group for assessing teacher productivity, which is particularly problematic in accountability contexts since one typically wants to compare the performance of a teacher with all other teachers in the school district or state, not just at their own school.
5 Classroom-level variables may be correlated with the assignment of teachers and students to classrooms. For example, principals may seek to aid inexperienced teachers by giving them additional computer resources. Similarly, classrooms containing a disproportionate share of low-achieving or disruptive students may receive additional resources like teacher aides. Due to the paucity of classroom data on non-teaching personnel and equipment, most studies omit any controls for non-teacher inputs.
6 It is well known that if students are assigned to classrooms non-randomly and peer-group composition affects achievement, then failure to control for the characteristics of classroom peers will produce biased estimates of the impact of teachers on student achievement. Recognizing this potential problem, the majority of the existing studies of teacher effects contain at least crude measures of observable peer characteristics like the proportion who are eligible for free/reduced-price lunch. An alternative approach is to focus on classes where students are, or appear to be, randomly assigned, as in Clotfelter, Ladd, and Vigdor (2006), Dee (2004), and Nye, Konstantopoulos, and Hedges (2004).
7 Alternatively, teacher effects could be modeled with random-effects estimators. Lockwood and McCaffrey (2007) provide a detailed comparative analysis of fixed and random effects estimators in the context of student achievement models.
8 Figlio (1999) and Harris (2007) explore the impact of relaxing the assumption of additive separability by estimating a translog education production function.
9 Typically the only information on family characteristics is the student's participation in free/reduced-price lunch programs, a crude and often inaccurate measure of family income. Data in North Carolina also include teacher-reported levels of parental education.
10 In general, one could consider models with uncorrelated unobserved heterogeneity. However, it is likely that the observed inputs (e.g. teacher and school quality) are correlated with the unobserved student effect, which would lead to biased estimates in a random-effects framework. Therefore, in what follows, we assume that the unobserved heterogeneity may be correlated with the observed inputs and focus on a student/family fixed effect.
11 For evidence on the impact of school resources on parental inputs see Houtenville and Conway (2008) and Bonesronning (2004).
combine the student and family components so that v_t(z_i + \mu_i) = v_t \xi_i.12 The achievement equation at grade t then becomes:

A_{it} = \sum_{h=1}^{t} [\alpha_{ht} X_{ih} + \beta_{ht} E_{ih}] + v_t \xi_i + \varepsilon_{it}   (3)

Eq. (3) is the least restrictive specification of the cumulative achievement function that can conceivably be estimated with administrative data. In this very general specification current achievement depends on current and all prior individual time-varying characteristics and school-based inputs, as well as the student's (assumed time-invariant) family inputs and the fixed individual endowment.
2.3. Assumptions underlying all commonly estimated empirical models of student achievement
2.3.1. Grade invariance of the cumulative achievement function
The cumulative model with fixed family inputs (Eq. (3)) is grade-specific and thus allows for the possibility that the achievement function varies with the grade level.13 Maintaining this flexibility carries a heavy computational cost, however. In pooled regressions, separate coefficients must be estimated for each input/grade/time-of-application combination. To make the problem more computationally tractable, it is universally assumed in the empirical literature that while the impact of inputs may decay over time, the cumulative achievement function does not vary with the grade level. In particular, it is assumed that the impact of an input on achievement varies with the time span between the application of the input and measurement of achievement, but is invariant to the grade level at which the input was applied. Thus, for example, having a small kindergarten class has the same effect on achievement at the end of third grade as does having a small class in second grade on fifth-grade achievement. This implies that for any t:

A_{it} = \sum_{h=1}^{t} [\alpha_h X_{i(t+1-h)} + \beta_h E_{i(t+1-h)}] + v_t \xi_i + \varepsilon_{it}   (4)
We refer to Eq. (4) as our "baseline model." However, the grade invariance assumption that leads to the baseline model can be tested. Specifically, each input can be interacted with time (or grade) dummies, the interaction terms can be added to Eq. (4), and the joint significance of the additional terms can be tested.
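As an illustration, this interaction-based test can be sketched as a joint significance test on input-by-grade interactions. The snippet below is a minimal sketch on simulated data; all variable names and parameter values are hypothetical, and it is not the specification estimated in this paper.

```python
# Minimal sketch of the grade-invariance test: interact an input with grade
# dummies and test the interactions jointly. Simulated data, hypothetical names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "grade": rng.integers(3, 6, size=n),   # grades 3-5
    "x": rng.normal(size=n),               # a time-varying input
})
# Achievement generated so the input's effect does NOT vary by grade
df["ach"] = 0.5 * df["x"] + rng.normal(size=n)

m = smf.ols("ach ~ x + C(grade) + x:C(grade)", data=df).fit()
# Joint test that all input-by-grade interactions are zero
wald = m.f_test("x:C(grade)[T.4] = 0, x:C(grade)[T.5] = 0")
print(m.params["x"], float(wald.pvalue))
```

Under grade invariance the interactions should be jointly insignificant; rejecting the joint null indicates that an input's effect depends on the grade at which it is applied.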
2.3.2. Assumptions about the unobserved family/student effect
Estimation of the achievement model and implementation of specification tests (such as the test of grade invariance described above) depend on assumptions about the impact of the unobserved student/family effect. If unobserved student/family heterogeneity has no important impact on student performance (v_t = 0), or such heterogeneity is not related to the observed inputs, then unbiased (or, in large samples, consistent) estimates of the parameters in the models can be obtained by estimating the equation by OLS.
If the unobserved heterogeneity is present and is potentially correlated with observed inputs, then the OLS estimator is generally biased (inconsistent). So long as the unobserved effect is time constant (v_t = v, t = 1, . . ., T) one can use fixed effects (FE) or first-difference (FD) estimators. However, these estimators are valid only if the assumption of strict exogeneity (conditional on the unobserved effect) holds. Specifically, we have to assume that a shock to student achievement in grade t (\varepsilon_{it}) does not affect the choice of inputs in any grade, including the next grade. Rothstein (2010) discusses this problem in detail in relation to estimating teacher effects on student performance. As noted by Rothstein (2010), the strict exogeneity assumption fails if future teacher assignment is partly determined by past and/or current shocks to student performance (for example, if students who experience a drop in their performance are assigned to a class taught by a relatively high-productivity teacher the next year). In this case, both FE and FD estimators are inconsistent; hence, it is important to check whether strict exogeneity holds. A simple test for strict exogeneity can be performed by adding lead (future) values of inputs to the set of explanatory variables and testing their joint significance in the FE or FD regression (Wooldridge, 2002, chap. 10).14
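A minimal version of this strict-exogeneity test, on simulated data with hypothetical variable names (fixed effects implemented via student dummies for brevity), might look like:

```python
# Strict-exogeneity check: include the lead (next-grade) input in a
# fixed-effects regression and inspect its significance. Simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_students, n_years = 200, 5
ids = np.repeat(np.arange(n_students), n_years)
alpha_i = np.repeat(rng.normal(size=n_students), n_years)  # student effect
x = rng.normal(size=ids.size)                              # strictly exogenous input
y = 0.4 * x + alpha_i + rng.normal(size=ids.size)

df = pd.DataFrame({"id": ids, "x": x, "y": y})
df["x_lead"] = df.groupby("id")["x"].shift(-1)  # next period's input
df = df.dropna()

# FE via student dummies; under strict exogeneity x_lead is insignificant
fe = smf.ols("y ~ x + x_lead + C(id)", data=df).fit()
print(fe.params["x"], fe.pvalues["x_lead"])
```

If future inputs are jointly significant, input assignment responds to past shocks, and FE/FD estimates of Eq. (4) lose their structural interpretation.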
A popular and practically feasible alternative to assuming either zero or time-constant effects of unobservable student/family inputs is to assume that the unobserved effect in Eq. (4) is trending: v_t = 1 + \psi t, where t is a time trend (see, for example, Wooldridge, 2002). After taking first differences in (4), the trending component becomes time constant, \Delta(v_t \xi_i) = \psi \xi_i \equiv \gamma_i:

\Delta A_{it} = \sum_{h=1}^{t-1} [\alpha_h \Delta X_{i(t+1-h)} + \beta_h \Delta E_{i(t+1-h)}] + \alpha_t X_{i1} + \beta_t E_{i1} + \gamma_i + e_{it}   (5)

where the new unobserved student/family effect, \gamma_i, is constant over time. Therefore, an achievement model that contains a trending unobserved effect can be consistently estimated by either FE or FD applied to the differenced equation. Strict exogeneity should also be tested in this case using the test described above.
12 Note that the marginal effects of fixed parental and child inputs, v_t, are the same for all students and thus v_t \xi_i varies over time in the same manner for all students. If the time-varying marginal effect of the individual/family fixed component were student-specific then the effect of the student-specific component in each time period would be perfectly collinear with observed achievement.
13 While the cumulative model with fixed family inputs allows for differential effects by grade, as written it assumes equal marginal effects across students within a grade. Of course interactions can be included to allow for differential effects across different types of students. See, for example, Wright, Horn, and Sanders (1997). A recent analysis by Lockwood and McCaffrey (2009) directly investigates whether teacher value-added varies across different types of students.
14 This is the same test that Rothstein (2010) uses when testing the validity of his models VAM1 (regressing test score gains on contemporaneous teacher indicators) and VAM2 (regressing test scores on contemporaneous teacher indicators and the lagged score). Koedel and Betts (2011) use the same test in the model with geometric decay. Rothstein also proposed a more advanced test, which we discuss in more detail in the Appendix.
There are several ways to check the validity of various assumptions concerning the unobserved student/family effect. First, strong positive correlation in residuals would indicate the presence of highly persistent factors, such as time-invariant student or family inputs. Therefore, if for residuals e_{it} and e_{it-1}, Corr(e_{it}, e_{it-1}) is positive, this would indicate the presence of the unobserved effect. If positive serial correlation in residuals is found not only in the levels equation (such as Eq. (4)), but also in the differenced equation (such as Eq. (5)), it would imply that the unobserved effect is trending rather than time-constant, because a time-constant effect would drop out after differencing. Another possibility is to use a Hausman-type test comparing the FE (FD) estimates in Eqs. (4) and (5) (Wooldridge, 2002). If the unobserved student/family effect in (4) is time constant, then parameter estimates should be about the same in both models. Unfortunately, the traditional form of the Hausman test statistic is not applicable in this case. A more complicated general form of the test statistic should be used.15
2.3.3. Geometric decay in the impact of prior inputs

Given the burdensome data requirements and computational cost of the full cumulative model, even the age-invariant version of the cumulative achievement model, Eq. (4), has never been directly estimated for a large sample of students.16 To make the model more tractable, it is typically assumed that the marginal impacts of all prior student and school-based inputs decline geometrically with the time between the application of the input and the measurement of achievement, at the same rate, so that for any given h, \alpha_{(t+1)-h} = \lambda \alpha_{t-h}, where \lambda is a scalar and 0 \le \lambda \le 1. With geometric decay Eq. (4) can be expressed as:
A_{it} = \sum_{h=0}^{t-1} \lambda^h [\alpha X_{i(t-h)} + \beta E_{i(t-h)}] + v_t \xi_i + \varepsilon_{it}   (6)

Taking the difference between current achievement and \lambda times prior achievement yields:

A_{it} - \lambda A_{it-1} = \left( \sum_{h=0}^{t-1} \lambda^h [\alpha X_{i(t-h)} + \beta E_{i(t-h)}] + v_t \xi_i + \varepsilon_{it} \right) - \left( \sum_{h=0}^{t-2} \lambda^{h+1} [\alpha X_{i(t-1-h)} + \beta E_{i(t-1-h)}] + \lambda v_{t-1} \xi_i + \lambda \varepsilon_{it-1} \right)   (7)

Collecting terms, simplifying and adding \lambda A_{it-1} to both sides produces:

A_{it} = \alpha X_{it} + \beta E_{it} + \lambda A_{it-1} + (v_t - \lambda v_{t-1}) \xi_i + \eta_{it}   (8)

where \eta_{it} = \varepsilon_{it} - \lambda \varepsilon_{it-1}.
Thus, given the assumed geometric rate of decay, the current achievement level is a function of contemporaneous student and school-based inputs as well as lagged achievement and an unobserved individual-specific effect. The lagged achievement variable serves as a sufficient statistic for all past time-varying student and schooling inputs, thereby avoiding the need for historical data on teachers, peers and other school-related inputs.
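This sufficient-statistic result is easy to verify numerically. The sketch below, on simulated data with hypothetical parameter values (and no student effect), generates achievement from the geometric-decay model and recovers \alpha, \beta and \lambda from a regression of current achievement on contemporaneous inputs and the lagged score:

```python
# Numerical check of Eq. (8): data built from the geometric-decay model
# should satisfy A_t = a*X_t + b*E_t + lam*A_{t-1} + eta_t.
# Hypothetical parameter values; not estimates from the paper.
import numpy as np

rng = np.random.default_rng(2)
a, b, lam = 0.5, 0.3, 0.7
n, T = 5000, 6

X = rng.normal(size=(n, T))
E = rng.normal(size=(n, T))
eps = rng.normal(scale=0.05, size=(n, T))   # small idiosyncratic error

# Achievement as a geometrically decaying sum of current and all past inputs
A = np.zeros((n, T))
for t in range(T):
    for h in range(t + 1):
        A[:, t] += lam**h * (a * X[:, t - h] + b * E[:, t - h])
    A[:, t] += eps[:, t]

# OLS of A_t on X_t, E_t and A_{t-1}, pooling periods 1..T-1
y = A[:, 1:].ravel()
Z = np.column_stack([X[:, 1:].ravel(), E[:, 1:].ravel(), A[:, :-1].ravel()])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(coef)  # close to [0.5, 0.3, 0.7]
```

Note that because \eta_{it} = \varepsilon_{it} - \lambda\varepsilon_{it-1} is an MA(1) error correlated with A_{it-1}, the lagged-score coefficient carries a small bias that shrinks with the variance of \varepsilon; this is exactly the serial-dependence caveat raised in the discussion of the partial persistence model.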
Although commonly estimated models assume all educational inputs persist at a geometric rate, \lambda, one could still have a tractable model if only some inputs decay geometrically. The most general test can be constructed using the model with grade invariance, but no restriction on input decay, which is summarized in Eq. (4). The presence of input-specific geometric decay can be checked by testing the following null hypotheses:
H_0: \alpha_{t,j}/\alpha_{t-1,j} = \alpha_{t-1,j}/\alpha_{t-2,j} = \cdots = \alpha_{2,j}/\alpha_{1,j},

or

H_0: \beta_{t,j}/\beta_{t-1,j} = \beta_{t-1,j}/\beta_{t-2,j} = \cdots = \beta_{2,j}/\beta_{1,j},
for each input j. These are nonlinear hypotheses that can betested using a Wald-type test.
2.4. Commonly estimated models and specific assumptions
In Eq. (8), it is possible to make different assumptions about the rate of decay (\lambda) and the unobserved heterogeneity (\xi_i). Below we consider several possibilities:
Assumptions and resulting model:

1. (0 < \lambda < 1), v_t = \lambda v_{t-1}. Partial persistence model:
   A_{it} = \alpha X_{it} + \beta E_{it} + \lambda A_{it-1} + \eta_{it}
2. (0 < \lambda < 1), v_t = v_{t-1}. Partial persistence model with student fixed effect:
   A_{it} = \alpha X_{it} + \beta E_{it} + \lambda A_{it-1} + \gamma_i + \eta_{it}
3. \lambda = 1, v_t = \lambda v_{t-1}. Gains model:
   \Delta A_{it} = A_{it} - A_{it-1} = \alpha X_{it} + \beta E_{it} + \eta_{it}
4. \lambda = 1, v_t = v_{t-1}. Gains model with student fixed effect:
   \Delta A_{it} = A_{it} - A_{it-1} = \alpha X_{it} + \beta E_{it} + \gamma_i + \eta_{it}
5. \lambda = 0, v_t = \lambda v_{t-1}. Immediate decay model:
   A_{it} = \alpha X_{it} + \beta E_{it} + \eta_{it}
6. \lambda = 0, v_t = v_{t-1}. Immediate decay model with student fixed effect:
   A_{it} = \alpha X_{it} + \beta E_{it} + \gamma_i + \eta_{it}
Models 1, 3, and 5 assume that the time-invariant student/family inputs decay at the same rate as other inputs (\lambda), so that v_t = \lambda v_{t-1}, and the individual-specific effect drops out of the achievement equation. In models 2, 4, and 6 the marginal effect of the individual-specific component is assumed to be constant over time, i.e. v_t = v_{t-1} = v and (v_t - \lambda v_{t-1}) = (1 - \lambda)v, so that \gamma_i = (1 - \lambda)v\xi_i is a time-invariant student/family fixed effect. The remaining differences across models are due to various assumptions about the rate of decay, \lambda.
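To illustrate how these decay assumptions matter, the simulation sketch below (hypothetical parameters, with input assignment responding partly to the prior score) generates data with true \lambda = 0.5 and compares the input coefficient from models 1, 3 and 5:

```python
# Simulated comparison of models 1, 3 and 5 when the true decay is partial
# (lam = 0.5) and the input is assigned partly on the prior score.
# All parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
b, lam = 0.4, 0.5
n, T = 5000, 7

eta = rng.normal(scale=0.3, size=(n, T))
A = np.zeros((n, T))
E = np.zeros((n, T))
A[:, 0] = eta[:, 0]
for t in range(1, T):
    E[:, t] = 0.5 * A[:, t - 1] + rng.normal(size=n)  # sorting on prior score
    A[:, t] = b * E[:, t] + lam * A[:, t - 1] + eta[:, t]

y, ylag, x = A[:, 1:].ravel(), A[:, :-1].ravel(), E[:, 1:].ravel()

def ols(dep, cols):
    Z = np.column_stack(cols)
    return np.linalg.lstsq(Z, dep, rcond=None)[0]

b1 = ols(y, [x, ylag])[0]   # model 1: levels with lagged score
b3 = ols(y - ylag, [x])[0]  # model 3: gains
b5 = ols(y, [x])[0]         # model 5: immediate decay (levels, no lag)
print(b1, b3, b5)
```

Under this data generating process model 1 recovers the true input effect, while the gains model understates it and the immediate decay model overstates it; with randomly assigned inputs the three estimates would coincide, which is why the specification tests above matter.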
15 When the relative efficiency of one of the two estimators cannot be established, the asymptotic variance of the difference between the two estimators is not the same as the difference in the two variances; the covariance should be included in the computation of the variance of the difference. See, for example, Wooldridge (2002), Section 14.5.1.
16 Todd and Wolpin (2007) estimate the cumulative achievement model using a sample of approximately 7000 students from the NLSY79-CS. Although they possess good measures of parental inputs and achievement levels, they have only a few general measures of schooling inputs measured at the county or state level.
Model 1 is valid when 0 < \lambda < 1 and is perhaps the most frequently estimated value-added model; we refer to it as the partial persistence model. In this model the lagged test score serves as a sufficient statistic for the time-constant student/family inputs as well as for the historical time-varying student and school-based inputs. OLS estimates of Eq. (8) would be unbiased (consistent) so long as the common geometric decay assumption is correct and the idiosyncratic error, \eta_{it}, is not correlated with current inputs and past achievement. The latter assumption would fail, for example, if a time-constant student/family effect is part of the error \eta_{it}, or the \eta_{it} are serially dependent for other reasons.
Model 2 maintains the assumption that 0 < \lambda < 1, but explicitly introduces a time-invariant student/family effect. Estimation is complicated due to the presence of the lagged dependent variable, which is inevitably correlated with the error in the previous period, so that strict exogeneity fails, and both FE and FD are inconsistent (asymptotically biased). Therefore, under the standard assumption that the idiosyncratic errors are serially uncorrelated, the common approach is to remove the unobserved effect by first-differencing, and then use second and possibly further lags of the dependent variable to instrument for \Delta A_{it-1}.17
In the two gains models (models 3 and 4), the coefficient on lagged achievement in Eq. (8) is unity.18 As noted by Boardman and Murnane (1979) and Todd and Wolpin (2003), setting \lambda = 1 implies that the effect of each input must be independent of when it is applied. The gains model 3 can be consistently estimated by OLS if the assumption \lambda = 1 is correct and the error \eta_{it} does not contain any factors (e.g. unobserved time-invariant student/family inputs) that may be correlated with the inputs included in the model. If \lambda = 1 holds, but unobserved student/family inputs, such as student ability and parental involvement, are present and are potentially correlated with observed inputs, such as class size and teacher assignment, then it is more appropriate to use model 4, which can be consistently estimated by FE or FD estimators.
Models 5 and 6 assume that the decay is immediate and complete, so that \lambda = 0 and lagged achievement drops out of the achievement function. Similar to the discussion above, if \lambda = 0 holds and time-invariant student/family inputs are either not present or not correlated with observed inputs, then model 5 is correct, and model parameters (including teacher effects) can be consistently estimated by OLS. However, if unobserved student/family inputs, \gamma_i, are correlated with observed inputs, then model 6 is more suitable, and estimation should be done using either FE or FD.
The validity of the above models can be checked by augmenting the corresponding equation with lagged values of observed inputs, estimating the equation using the appropriate estimation method, and testing the joint significance of the lagged inputs. If the underlying model assumptions are valid, lagged inputs should not appear in the corresponding model and, therefore, should be jointly insignificant. Specifically, significant lagged inputs in the immediate decay models (5 and 6) would mean that in the actual data \lambda is not 0. In the gains models (3 and 4) it would mean that \lambda is not 1. In the partial persistence models (1 and 2) it would imply that the decay is either not geometric, or not the same for all inputs, or both. Moreover, in the models that do not account for a time-constant student/family fixed effect (models 1, 3, and 5), significance of lagged inputs would indicate that the student/family effect is present in the actual data and needs to be accommodated in the estimation.
Several other tests can be used to determine whether the employed estimation methods are valid. Specifically, models 2, 4, and 6 are estimated using either FE or FD estimators, which are consistent only if observed inputs are strictly exogenous (see discussion in Section 2.3.1). As mentioned above, the strict exogeneity assumption can be tested by adding future values of input variables to the FE (FD) regression and subsequently testing their joint significance. Future inputs can also be included in the OLS regressions used to estimate models 1, 3, and 5. Rejecting the null that future inputs have no effect in those regressions would imply that a time-constant unobserved student/family effect is present in the data and correlated with observed inputs included in the model.
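A hedged sketch of the strict-exogeneity test: we simulate feedback in which next period's input responds to this period's shock (a stylized stand-in for assigning students to teachers on the basis of realized achievement), then add the future input to a within (FE) regression. Under strict exogeneity the lead coefficient would be zero; with feedback it is clearly positive. The setup and magnitudes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 3000, 5, 1.0

g = rng.normal(size=N)                 # student/family effect
e = rng.normal(size=(N, T))            # idiosyncratic shocks
x = rng.normal(size=(N, T))
x[:, 1:] += 0.5 * e[:, :-1]            # feedback: next input responds to this period's shock
y = beta * x + g[:, None] + e

def within(z):                         # demean each student's series (FE transformation)
    return z - z.mean(axis=1, keepdims=True)

# FE regression of y_t on x_t and the future input x_{t+1}
yd = within(y[:, :-1]).ravel()
X = np.column_stack([within(x[:, :-1]).ravel(), within(x[:, 1:]).ravel()])
b = np.linalg.lstsq(X, yd, rcond=None)[0]

print(round(b[1], 2))                  # lead coefficient; zero under strict exogeneity
```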
Finally, in the instrumental variables regression used to estimate model 2, it is important to test the validity of the instruments. As mentioned above, in order for the second lag of the test score to be a valid instrument, it is necessary that the errors η_it are not correlated over time. This assumption is usually checked by testing H0: Corr(Δη_it, Δη_it−1) = −0.5, where Δη_it is the error in the differenced equation. In practice, the correlation between the current and lagged residuals in the differenced equation is computed and used for testing. Another standard test checks whether the instruments are strongly partially correlated with the instrumented variable.
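The −0.5 benchmark follows from first differencing a serially uncorrelated error: Cov(Δη_t, Δη_t−1) = −Var(η) while Var(Δη) = 2Var(η), so the correlation is −0.5. A quick numerical check (purely illustrative, with simulated errors):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 4000, 6
e = rng.normal(size=(N, T))        # serially uncorrelated idiosyncratic errors
de = np.diff(e, axis=1)            # first differences

# Under H0 (no serial correlation in e), Corr(de_t, de_{t-1}) = -0.5
r = np.corrcoef(de[:, 1:].ravel(), de[:, :-1].ravel())[0, 1]
print(round(r, 2))
```

A computed residual correlation far from −0.5 signals serial correlation in the level errors and casts doubt on the lagged test score as an instrument.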
3. Data
In order to test alternative model specifications weutilize data from the Florida Department of Education’s K-20 Education Data Warehouse (EDW), an integratedlongitudinal database covering all Florida public schoolstudents and school employees. Our sample begins withschool-year 1999/2000, which is the first year in whichstatewide standardized testing in consecutive grade levelswas conducted in Florida. Our panel continues through the2007/2008 school year.
During our period of analysis the state administeredtwo sets of reading and math tests to all third throughtenth graders in Florida. The ‘‘Sunshine State Standards’’Florida Comprehensive Achievement Test (FCAT-SSS) is acriterion-based exam designed to test for the skills thatstudents are expected to master at each grade level. The
17 Using the instrumental variables method is necessary because in the differenced equation, Cov(ΔA_it−1, Δη_it) = Cov(A_it−1 − A_it−2, η_it − η_it−1) = −Cov(A_it−1, η_it−1), since Cov(A_it−1, η_it) = Cov(A_it−2, η_it−1) = Cov(A_it−2, η_it) = 0 when {η_it} are serially uncorrelated. Because Cov(A_it−1, η_it−1) ≠ 0 by construction, instruments are needed.
18 Alternatively, the model can be derived by starting with a model of student learning gains (rather than levels) and assuming that there is no persistence of past schooling inputs on learning gains.
second test is the FCAT Norm-Referenced Test (FCAT-NRT), a version of the Stanford-9 achievement test. The Stanford-9 is a vertical or development-scale exam. Hence scores typically increase with the grade level and a one-point increase in the score at one place on the scale is equivalent to a one-point increase anywhere else on the scale. We use FCAT-NRT scale scores in all of the analysis. The vertical scale of the Stanford Achievement Test allows us to compare achievement gains of students with differing initial achievement levels. Further, use of the FCAT-NRT minimizes potential biases associated with "teaching to the test," since all school accountability standards, as well as promotion and graduation criteria in Florida, are based on the FCAT-SSS, rather than the FCAT-NRT. The FCAT-NRT was last administered in 2007/2008, which determines the end of our sample period.

Although achievement test scores are available for both math and reading in grades 3–10, we limit our analysis to mathematics achievement in middle school, grades 6–8. We select middle-school mathematics classes for a number of reasons. First, we require second-lagged scores to serve as potential instruments for lagged achievement. Given that testing begins in grade 3, this precludes analysis of student achievement prior to grade 5.

Second, it is easier to identify the relevant teacher and peer group for middle-school students than for elementary students. The overwhelming majority of middle school students in Florida move between specific classrooms for each subject whereas elementary school students typically receive the majority of their core academic instruction in a "self-contained" classroom. However, for elementary school students enrolled in self-contained classrooms, 5% are also enrolled in a separate math course and nearly 13% are enrolled in either special-education or gifted courses.

Third, because middle-school teachers often teach multiple sections of a course during an academic year, it is easier to clearly identify the effects of individual teachers on student achievement. In elementary school, teachers typically are with the same group of students all day long and thus teacher effects can only be identified by observing multiple cohorts of students taught by a given teacher over time. In contrast, both variation in class composition across sections at a point in time as well as variation across cohorts over time help to distinguish teacher effects from other classroom-level factors affecting student achievement in middle school.

Fourth, we choose to avoid high school grades (grades 9 and 10) because of potential mis-alignment between test content and curriculum. At the high-school level math courses become more diverse and specialized. Thus the content of some high school math courses, particularly advanced courses, may have little correlation with concepts being tested on achievement exams.

Finally, we focus on math achievement rather than reading because it is easier to clearly identify the class and teacher most relevant to the material being tested. While some mathematics-related material might be presented in science courses, direct mathematics instruction almost always occurs in math classes. In contrast, middle school students in Florida may be simultaneously enrolled in "language arts" and reading courses, both of which may cover material relevant to reading achievement tests.
In addition to selecting middle-school math courses foranalysis, we have limited our sample in other ways in anattempt to get the cleanest possible measures of classroompeers and teachers. First, we restrict our analysis of studentachievement to students who are enrolled in only a singlemathematics course and drop grade repeaters (though allother students enrolled in the course are included in themeasurement of peer-group characteristics). Second, toavoid atypical classroom settings and jointly taught classeswe consider only courses in which 10–50 students areenrolled. Third, we eliminate any courses in which there ismore than one ‘‘primary instructor’’ of record for the class.Finally, we eliminate charter schools from the analysissince they may have differing curricular emphases andstudent-peer and student–teacher interactions may differin fundamental ways from traditional public schools.
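The sample restrictions above amount to a simple record filter. A schematic sketch (the record layout and field names are invented for illustration, not the EDW schema):

```python
# Hypothetical student-course records; fields are illustrative only.
students = [
    {"id": 1, "math_courses": 1, "repeater": False, "class_size": 28, "n_teachers": 1, "charter": False},
    {"id": 2, "math_courses": 2, "repeater": False, "class_size": 28, "n_teachers": 1, "charter": False},
    {"id": 3, "math_courses": 1, "repeater": False, "class_size": 8,  "n_teachers": 1, "charter": False},
    {"id": 4, "math_courses": 1, "repeater": True,  "class_size": 30, "n_teachers": 2, "charter": True},
]

def keep(s):
    return (s["math_courses"] == 1 and not s["repeater"]   # single math course, no grade repeaters
            and 10 <= s["class_size"] <= 50                # drop atypical / jointly taught classes
            and s["n_teachers"] == 1                       # one primary instructor of record
            and not s["charter"])                          # traditional public schools only

sample = [s["id"] for s in students if keep(s)]
print(sample)   # -> [1]
```

Note that, as in the paper, students dropped from the achievement sample could still contribute to peer-group measures; this sketch only illustrates the achievement-sample filter.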
Estimation of some models requires up to three lagged test scores. Given that statewide testing began in 1999/2000 and the FCAT-NRT was last administered in 2007/2008, our analysis is limited to achievement of Florida traditional public school students in grades 6–8 over the years 2002/2003 through 2007/2008 who took the FCAT-NRT for at least three consecutive years. This includes six cohorts of students. Unfortunately, it is not computationally tractable to estimate models that include both contemporaneous and multiple lagged teacher effects using the entire sample. We therefore randomly select 20 of Florida's 67 countywide school districts for analysis.19 Descriptive statistics for the variables in the 20-district data set are provided in Table 1.
4. Results
4.1. Tests of grade-invariance
Recall that empirical value-added models universallyassume lagged inputs have the same effect on contempo-raneous achievement, irrespective of grade level. This canbe tested by interacting each input with time (or grade)dummies, and testing the significance of the interactionterms. In order to ensure comparability in the interactions,we normalize the test scores for each grade/year.20 Resultsfrom estimating our baseline model with these interactionterms are presented in Table 2. There are three middle-school grades in the sample: 6, 7 and 8. Thus we includeinteractions with grade 6 and with grade 7. We presenttests for the joint significance of all grade-input interac-tions, as well as separate tests for the significance of gradesix and grade seven interactions. For the first and second-lag interactions we reject the null of grade invariance at the1% significance level in all but the FD regressions. In the FD
19 This is due to our use of explicit teacher indicators and Stata's limit of 10,998 explanatory variables. For models with only contemporaneous teacher effects, there are multiple routines available that would work with the entire statewide sample. See McCaffrey, Lockwood, Mihaly, and Sass (2012).
20 Results without this normalization are provided in Table A1 of the Appendix.
regression, the first-lag interactions are significant at the 5% level, although the second-lag interactions are less significant. Overall, we find little support for the common assumption that prior inputs affect achievement in the same way regardless of the grade in which they are applied.
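The interaction test used above can be sketched on simulated data in which one input's effect genuinely differs by grade; the joint Wald statistic on the grade-input interactions is then large. The setup below is illustrative only, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6000
grade = rng.integers(6, 9, size=n)            # grades 6-8
x = rng.normal(size=n)                        # one time-varying input
beta_g = {6: 1.0, 7: 1.0, 8: 1.4}             # effect differs in grade 8 (invented)
y = np.array([beta_g[g] for g in grade]) * x + rng.normal(size=n)

# Interact the input with grade-6 and grade-7 dummies (grade 8 omitted)
d6, d7 = (grade == 6).astype(float), (grade == 7).astype(float)
X = np.column_stack([np.ones(n), x, d6, d7, x * d6, x * d7])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)

# Joint Wald (approximate F) test that both interaction coefficients are zero
R = b[4:6]
W = R @ np.linalg.inv(V[4:6, 4:6]) @ R / 2
print(round(W, 1))
```

Under grade invariance W would be near 1; here the grade-varying effect drives it far above conventional critical values.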
4.2. General rate of decay of prior inputs
As discussed in Section 2, the validity of the geometric decay assumptions (immediate, partial, and no persistence) can be tested by determining if prior inputs have significant effects in the appropriate achievement models. In each model, finding significant effects of prior inputs would suggest that the model is incorrect (or too restrictive), so that the estimating equation is mis-specified and the resulting estimates of the teacher effects and coefficients on other inputs may be biased. In models that assume there are no unobserved student/family characteristics that may be correlated with observed inputs, significance of past inputs would also indicate that the 'no correlated unobserved heterogeneity' assumption is likely false.
We perform the tests after estimating the augmentedversions of the six student achievement models consideredin Section 2. To make the test computationally feasible welimit the additional terms to prior-year teacher identitiesand first, second and third lags of non-teacher schoolinginputs. As reported in Table 3, we strongly reject the nullthat prior inputs have no effect on current achievement inall cases. This finding suggests that all common geometricdecay models are incorrect. The test statistics are notice-ably larger in the most restrictive model that assumesimmediate decay and no unobserved heterogeneity (firstcolumn in Table 3). Such a result is expected if unobservedheterogeneity is an omitted variable, so that estimatedcoefficients on lagged inputs capture both the direct effectsof the inputs and the effects due to non-zero correlationbetween observed inputs and unobserved student/familyinputs. All other regressions account for unobserved inputsat least partially and hence, it is not surprising that laggedinputs in those regressions are less significant.
4.3. Input-specific decay of past inputs
Table 4 reports results from testing the null hypothesisthat input-specific decay is geometric against the alterna-tive that the rate of decay for that particular input is notgeometric. Tests were performed separately for each inputafter estimating Eq. (4) with varying assumptions aboutthe nature of the student/family input. For computationaltractability, we include three lags of each input and test thenull that the ratio of the coefficients on the first andsecond-lagged inputs equals the ratio of the coefficients onthe second and third lagged inputs. The results indicatethat we cannot reject the null that each input decays at itsown geometric rate. This is true whether we assume thatcorrelated unobserved student/family inputs are notpresent (OLS model), are time constant (FE, FD models)or follow a time trend (FE on FD model).21
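One illustrative way to implement the ratio test H0: α3/α2 = α2/α1 is to rewrite it as θ = α2² − α1·α3 = 0 and form a delta-method Wald statistic. The sketch below, with an invented data-generating process, fails to reject when the lags decay geometrically and rejects when they do not; it is not the paper's estimation code.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
a = np.array([1.0, 0.5, 0.25])        # geometric decay: 0.25/0.5 = 0.5/1.0
X = rng.normal(size=(n, 3))           # first, second, third lags of one input
y = X @ a + rng.normal(size=n)

def ratio_test(y, X):
    """Delta-method z-statistic for H0: a2^2 - a1*a3 = 0."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    V = (e @ e / (n - 3)) * np.linalg.inv(X.T @ X)
    theta = b[1] ** 2 - b[0] * b[2]
    grad = np.array([-b[2], 2 * b[1], -b[0]])
    return theta / np.sqrt(grad @ V @ grad)

z = ratio_test(y, X)                  # geometric decay: should not reject

# Non-geometric decay (0.5/1.0 != 0.4/0.5): the test should reject
y2 = X @ np.array([1.0, 0.5, 0.4]) + rng.normal(size=n)
z2 = ratio_test(y2, X)

print(round(z, 2), round(z2, 1))
```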
4.4. Is the effect of the unobserved student/family input time-invariant?
As discussed above, the finding that test statistics are larger in the most restrictive immediate decay model indicates the presence of correlated unobserved heterogeneity. Correlations between the current and lagged residuals reported in the next-to-last row of Table 3 are also informative about the type of unobserved heterogeneity. There is a positive correlation in residuals in the most restrictive model, which assumes immediate decay and no correlated unobserved heterogeneity (first column in
Table 1
Summary statistics for Florida public school students in 20 randomly
selected districts, 2002/2003–2007/2008.
Mean Std. Dev.
Student characteristics
Female 0.523 0.499
Black 0.205 0.403
Hispanic 0.159 0.366
Asian 0.024 0.154
American Indian 0.003 0.054
Math Score 681.312 28.246
Math Gain 6.465 22.561
Free/Reduced-Price Lunch 0.392 0.488
Number of Schools 1.013 0.113
Disciplinary Incidents 0.459 1.466
Structural Mover 0.314 0.464
Non-Structural Mover 0.140 0.347
Gifted 0.041 0.198
Mental Disability 0.000 0.015
Physical Disability 0.001 0.036
Emotional Disability 0.003 0.057
Other Disability 0.003 0.058
Speech/Language Disability 0.021 0.143
Learning Disability 0.021 0.143
Limited English Proficiency 0.026 0.160
Teacher characteristics
Advanced Degree 0.351 0.477
Professional Certificate 0.844 0.362
Years of Experience 9.290 11.891
1–2 Years of Experience 0.186 0.389
3–4 Years of Experience 0.119 0.324
5–9 Years of Experience 0.202 0.401
10–14 Years of Experience 0.126 0.332
15–24 Years of Experience 0.151 0.358
25 Years Plus 0.136 0.343
Class and peers’ characteristics
Math Class Size 24.695 5.049
Peers Proportion Female 0.500 0.115
Peers Proportion Black 0.212 0.226
Peers Proportion Hispanic 0.177 0.155
Peers Proportion Asian 0.025 0.041
Peers Average Age in Months 149.390 9.733
Peers Proportion Changed Schools 0.565 0.406
Peers Proportion Structural Movers 0.376 0.400
Number of observations 209,379
21 In addition to the test results presented in Table 4, we also tested to
see if various combinations of inputs share a common decay rate.
Occasionally we uncovered cases where we could not reject a common
decay rate for two or more inputs, but they were infrequent and did not
follow any particular pattern. For example, the effects of various teacher
credentials did not decay at similar rates.
Table 3). This indicates there is persistence in unobserved factors that determine student performance and once again suggests that a time-constant student/family effect is likely present and is part of the error. In contrast, the residual correlation is negative in the gains model without unobserved heterogeneity (second column in Table 3). Because adding lagged inputs to the gains model (second column of Table 3) is equivalent to first differencing Eq. (4), one would expect positive serial correlation in residuals if the unobserved effect were trending. The fact that the correlation coefficient in the second column of Table 3 is negative and close to −0.5 is consistent with a situation where idiosyncratic errors in (4) are serially independent and the unobserved effect is time-constant.

The correlation between the current and lagged residual in the partial persistence model with time-constant unobserved effect is −0.404 (last column in Table 3), which again speaks against the unobserved trend model. However, the correlation is statistically different from −0.5, suggesting that idiosyncratic errors in the model are serially correlated (although only slightly), so that the employed instrument (lagged test score) may not be valid.

More evidence on the type of unobserved heterogeneity is provided in Table 4, which reports estimation results for the baseline model (Eq. (4)). Similar to Table 3, the patterns observed in Table 4 indicate the presence of a time-constant unobserved student/family effect. Specifically, in the model that does not account for unobserved heterogeneity (first column in Table 4), residuals are strongly positively correlated. However, after the unobserved effect is removed by differencing (third column of Table 4), the correlation between the current and lagged residuals is negative and reasonably close to −0.5. Similar to the discussion above, this again indicates that the unobserved effect is time-constant rather than trending.
4.5. Tests of strict exogeneity
In Table 5 we present results from tests of strictexogeneity for several models with varying assumptionsregarding the persistence of schooling inputs and thenature of the unobserved student/family input. In everycase we strongly reject the null that future teacherassignments have no ‘‘effect’’ on current student achieve-ment. In cases when the model is estimated by OLS, jointsignificance of future teacher indicators signifies thepresence of unobserved student/family characteristicsthat are correlated with observed inputs. In cases whenfixed effects or differencing are used, significance of futureteacher indicators suggests that student assignment toteachers is based in part on realized prior achievement.Thus, a key assumption that is needed for the fixed effectsand first-difference estimators to yield asymptoticallyunbiased estimates fails.
4.6. Differences in estimates across models
The results above suggest several general conclusions.An unobserved student/family effect is present and is time-invariant. Even in the models that account for unobservedheterogeneity, lagged inputs are significant, suggestingthat the common-geometric-rate-of-decay assumptionfails (though input-specific geometric decay assumptioncould not be rejected). Both grade-invariance and strictexogeneity are rejected.
Given that virtually all assumptions that are used in formulating and estimating student achievement models are rejected, it is expected that commonly used empirical value-added models produce biased estimates of teacher productivity. This is not surprising. From a policy perspective, the more important issue is the magnitude of
Table 2
Tests for grade invariance (based on the augmented baseline model).

Assumption regarding student/family inputs and estimation method:
                                              No correlated        Time-constant        Time-constant        Trending
Interaction term                              unobserved effect    unobserved effect    unobserved effect    unobserved effect
                                              (OLS)                (FE)                 (FD)                 (FE on first differences)
Grade 6 and 7 × Once Lagged Covariates        F(57,136621)=19.79   F(48,136621)=1.55    F(55,62093)=1.44     F(49,62093)=1.94
                                              (0.00)               (0.008)              (0.02)               (0.00)
Grade 6 × Once Lagged Covariates              F(28,136621)=7.94    F(19,136621)=1.52    F(26,62093)=1.62     F(23,62093)=1.68
                                              (0.00)               (0.07)               (0.02)               (0.02)
Grade 7 × Once Lagged Covariates              F(29,136621)=14.66   F(29,136621)=1.33    F(29,62093)=1.63     F(26,62093)=2.30
                                              (0.00)               (0.11)               (0.02)               (0.00)
Grade 6 and 7 × Twice Lagged Covariates       F(58,136621)=6.90    F(56,136621)=2.14    F(55,62093)=1.20     F(51,62093)=2.52
                                              (0.00)               (0.00)               (0.15)               (0.00)
Grade 6 × Twice Lagged Covariates             F(29,136621)=9.74    F(29,136621)=1.66    F(27,62093)=1.46     F(25,62093)=2.49
                                              (0.00)               (0.01)               (0.06)               (0.00)
Grade 7 × Twice Lagged Covariates             F(29,136621)=8.87    F(27,136621)=2.19    F(28,62093)=0.83     F(26,62093)=1.43
                                              (0.00)               (0.00)               (0.70)               (0.07)
Grade 6 and 7 × Three Times Lagged Covariates F(58,136621)=3.67    F(55,136621)=1.15    F(29,62093)=1.50     F(28,62093)=1.86
                                              (0.00)               (0.20)               (0.04)               (0.00)
Grade 6 × Three Times Lagged Covariates       F(29,136621)=5.82    F(27,136621)=1.32    F(26,62093)=1.10     F(25,62093)=1.87
                                              (0.00)               (0.12)               (0.33)               (0.01)
Grade 7 × Three Times Lagged Covariates       F(29,136621)=2.38    F(28,136621)=1.27    F(29,62093)=1.50     F(28,62093)=1.86
                                              (0.00)               (0.15)               (0.04)               (0.00)

Note: The table displays the F-statistics for testing grade invariance. All regressions use year-by-grade normalized test scores and include grade, year and school dummies, teacher indicators for the current and last periods, as well as three lags of time-varying inputs. p-Values are reported in parentheses under the test statistics.
Table 3
Tests of the geometric decay assumption (immediate, complete or partial persistence).
Effects of student/family inputs decay at the same rate as other inputs (first three columns) | Effects of student/family inputs are time constant (last three columns)
Model name Immediate decay
model
Gains model Partial persistence
model
Immediate
decay model with
student fixed effects
Gains model with
student fixed
effects
Partial persistence
model with student
fixed effects
Model under H0: A_it = αX_it + βE_it + η_it | ΔA_it = αX_it + βE_it + η_it | A_it = αX_it + βE_it + λA_it−1 + η_it | A_it = αX_it + βE_it + γ_i + η_it | ΔA_it = αX_it + βE_it + γ_i + η_it | A_it = αX_it + βE_it + λA_it−1 + γ_i + η_it
Estimation method OLS OLS OLS FE FE FD-IV
Lagged Teacher F(1880,136621) = 6.35 F(1880,136621) = 4.15 F(1880,136621) = 4.67 F(1773,136621) = 9.01 F(1773,136621) = 6.46 F(1901,136621) = 3.19
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
Once Lagged F(29,136621) = 89.69 F(29,136621) = 5.57 F(29,136621) = 24.51 F(29,136621) = 3.49 F(29,136621) = 1.42 F(29,136621) = 8.30
Covariates (0.000) (0.000) (0.000) (0.000) (0.065) (0.000)
Twice Lagged F(29,136621) = 10.67 F(29,136621) = 1.61 F(29,136621) = 3.33 F(29,136621) = 1.39 F(29,136621) = 1.45 F(29,136621) = 5.41
Covariates (0.000) (0.020) (0.000) (0.081) (0.054) (0.000)
Three Times Lagged F(29,136621) = 12.82 F(29,136621) = 2.49 F(29,136621) = 5.43 F(28,136621) = 1.59 F(28,136621) = 1.72 F(29,136621) = 6.46
Covariates (0.000) (0.000) (0.000) (0.025) (0.011) (0.000)
Rate of Persistence, λ 0.00 1.00 0.65 0.00 1.00 −0.021
(0.000) (0.000)
Corr(res_t, res_t−1) 0.493 −0.420 −0.206 −0.404
[0.003] [0.003] [0.004] [0.003]
Strength of the Instruments F(1,136621) = 34968.89
(0.000)
Notes: Top rows of table report results of F-tests where the null hypothesis is that the effects of the inputs reported in the rows are (jointly) equal to zero. Grade, year and school dummies, as well as current period
inputs and current period teacher dummies are included in all regressions. In the last column (FD-IV), the equation was differenced and estimated by the instrumental variables estimator with twice-lagged
achievement score as the instrument for the differenced first lag of the test score. Because differencing teacher indicators was not feasible, instead of differencing these variables we included the first and second lags
of teacher indicators in the FD-IV regression. Under the null that prior teachers do not matter, the second lag of teacher indicators should be equal to zero. This is the joint hypothesis that we test after running the
FD-IV regression (first row in the last column). For all tests, p-values are reported in parentheses under the test statistics for testing the joint significance of the corresponding variables. The second to last row reports
the serial correlation in residuals (standard errors are reported in brackets underneath).
the bias and whether some models yield estimates with less bias than others. Unfortunately, absent true random assignment of students and teachers, we cannot know true teacher productivity and hence the magnitude of the bias cannot be directly assessed. However, we can compare results from commonly estimated models to our baseline model to determine the degree to which estimates vary from those produced with the fewest possible assumptions about the educational process. Further, we can determine which assumptions have the greatest impact on the resulting estimates of teacher productivity.

In Table A1 of the Appendix, we present results from tests comparing estimated coefficients for selected time-varying inputs and teacher effects produced by different models. The test results indicate that estimated teacher
effects are often statistically similar. However, this does not necessarily mean that the differences are practically unimportant. Likewise, we find that coefficients on other time-varying variables are often statistically different, which nevertheless does not guarantee that the estimates differ much in practical terms. Because in policy decisions the magnitude of the differences in estimates from competing models is most relevant, in what follows we consider other approaches that help to assess the degree of similarity in the effects of time-varying inputs and estimated teacher effects.
Although our focus is on the assumptions of value-added models and the potential for bias in measuringteacher effects, another relevant part of policy decisions isestimation error. Given finite samples of students per
Table 5
Tests of strict exogeneity.

Model (estimation method): test statistic (p-value)
Baseline model, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + η_it (OLS): F(1929,91945) = 26.583 (0.000)
Baseline model with student fixed effect, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + γ_i + η_it (FE): F(1470,91945) = 6.564 (0.000)
Baseline model with student fixed effect, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + γ_i + η_it (FD): F(1824,39868) = 17.256 (0.000)
Partial decay model, A_it = λA_it−1 + α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + η_it (OLS): F(1929,91945) = 17.681 (0.000)
Partial decay model with student fixed effect, A_it = λA_it−1 + α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + γ_i + η_it (FD-IV): F(2346,91945) = 6.441 (0.000)
Baseline model with student-specific trend, A_it = α_1 X_it + β_1 E_it + ⋯ + α_t X_i1 + β_t E_i1 + tγ_i + η_it (FE on first differences): F(1736,91945) = 7.862 (0.000)

Note: The table displays the F-statistics for testing joint significance of future teacher indicators. p-Values are reported in parentheses under the test statistics. All models include: (i) grade, year and school indicators, (ii) current-year time-varying non-teacher inputs and teacher indicators, (iii) prior-year teacher indicators, (iv) three prior years of non-teacher inputs and (v) future teacher indicators. In the regressions with partial persistence (with lagged test scores on the right-hand side), the second lag of the test score is used as an instrument for the differenced first lag of the test score. In the model where the unobserved effect is trending, the third lag of non-teacher inputs was not first differenced due to a lack of four-lagged data.
Table 4
Tests for input-specific geometric decay and the time-constant unobserved effect (based on the baseline model).

Assumption regarding student/family inputs and estimation method:
                                  No correlated        Time-constant        Time-constant        Trending
                                  unobserved effect    unobserved effect    unobserved effect    unobserved effect
                                  (OLS)                (FE)                 (FD)                 (FE on first differences)
Reduced/Free Lunch                0.30 (0.59)          0.29 (0.59)          0.39 (0.53)
Math Class Size                   1.43 (0.23)          0.04 (0.83)          1.27 (0.26)          3.40 (0.07)
Non-Structural Mover              0.51 (0.48)          0.03 (0.86)          0.00 (0.95)          0.15 (0.70)
1–2 Years of Experience           0.38 (0.54)          0.59 (0.44)          1.21 (0.27)          0.01 (0.90)
3–4 Years of Experience           0.37 (0.54)          1.08 (0.30)          0.00 (0.95)          0.00 (0.99)
Advanced Degree                   0.11 (0.74)          0.01 (0.94)          0.02 (0.89)          0.19 (0.67)
Professional Certificate          0.00 (0.96)          0.00 (0.95)          0.24 (0.63)          1.91 (0.17)
Corr(residual_t, residual_t−1)    0.49 [0.003]                              −0.44 [0.007]

Note: Top rows of the table display the F-statistics and t-statistics for testing input-specific geometric decay, i.e. H0: α_{3,j}/α_{2,j} = α_{2,j}/α_{1,j} or β_{3,j}/β_{2,j} = β_{2,j}/β_{1,j}. The last row reports the correlation coefficient between the current and lagged residuals. All regressions include grade, year and school dummies, teacher indicators for the current and last periods, as well as three lags of time-varying inputs. p-Values are reported in parentheses under the test statistics. Standard errors are reported in brackets under the correlation coefficients.
teacher, the mean squared error of teacher effects will be afunction of both bias and estimation error and thus a veryprecise estimator with a small degree of bias could bepreferred to a less precise unbiased estimator. Unfortu-nately, it is difficult to empirically assess the bias-efficiency tradeoff.22 However, in the concluding sectionwe do discuss the likely tradeoffs between bias andefficiency, particularly with respect to the use of studenteffects to control for unobserved heterogeneity.
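The bias-efficiency point can be made concrete with a toy calculation: since MSE = bias² + variance, a slightly biased but precise estimator can have lower mean squared error than an unbiased but noisy one. The numbers below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)
true_effect, reps = 2.0, 20000

# Estimator A: unbiased but noisy (e.g., few students per teacher)
est_a = true_effect + rng.normal(scale=1.0, size=reps)
# Estimator B: slightly biased but precise (e.g., a shrinkage-style estimator)
est_b = (true_effect - 0.3) + rng.normal(scale=0.3, size=reps)

mse = lambda est: np.mean((est - true_effect) ** 2)
print(round(mse(est_a), 2), round(mse(est_b), 2))   # ~1.00 vs ~0.18 = 0.3**2 + 0.3**2
```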
4.7. Comparing coefficients on selected time-varying inputs
For accountability purposes, the focus is on the teachereffect estimates that are produced from value-addedmodels. However, in many policy applications, themarginal effects of individual teacher characteristics, likeexperience and educational attainment, or the impacts ofclassroom variables like class size, are of interest.
Table 6 presents parameter estimates of the sixcommon value-added models and the baseline model.
While all models indicate that rookie teachers (the omittedcategory) do not perform as well as more experiencedteachers, the marginal effects of experience appear to berather different across models. The models that produceestimates closest to the baseline model are the gains model
and the partial persistence model with student fixed effects. Differences in the estimated impact of teacher educational attainment are less pronounced across models; all yield negative estimated effects in the range of −0.5 to −1.5. Likewise, with the exception of the gains model with student fixed effects, the class size effects are all fairly small; the point estimate for the baseline model is −0.02 and estimates from the other models range from −0.06 to +0.04.
4.8. Comparison of teacher rankings across models
For individual teacher effects, one is generally lessconcerned about the specific value of the point estimate.Rather, the relative ranking of teachers is of greaterinterest, particularly in the context of performance paysystems for teachers. There are various ways in which onecan assess how different models rank teachers. One waywould be to compare the rank correlations of all teacher
Table 6
Comparing coefficients on selected time-varying variables.

Models: (1) Baseline; (2) Levels model with student covariates; (3) Gains model with student covariates; (4) Partial persistence model with student covariates; (5) Levels model with student fixed effects; (6) Gains model with student fixed effects; (7) Partial persistence model with student fixed effects.

                            (1)        (2)        (3)        (4)        (5)        (6)        (7)
Estimation method           FE         OLS        OLS        OLS        FE         FE         FD-IV
Lagged inputs included?     Yes        No         No         No         No         No         No

Student characteristics
Free/Reduced-Price Lunch    0.268      −4.461***  −0.761***  −1.987***  0.336      0.283      0.190
                            (0.282)    (0.165)    (0.108)    (0.107)    (0.266)    (0.436)    (0.163)
Non-Structural Mover        −0.234     0.780**    −0.450*    −0.042     −0.023     0.019      −0.251
                            (0.366)    (0.334)    (0.266)    (0.248)    (0.327)    (0.547)    (0.183)

Teacher characteristics
1–2 Years of Experience     1.538***   2.627***   1.747***   2.039***   1.857***   3.103***   1.330***
                            (0.492)    (0.373)    (0.307)    (0.285)    (0.398)    (0.657)    (0.202)
3–4 Years of Experience     2.060***   3.465***   2.007***   2.490***   2.222***   4.471***   1.839***
                            (0.698)    (0.529)    (0.440)    (0.407)    (0.585)    (0.972)    (0.253)
5–9 Years of Experience     1.492*     4.350***   1.910***   2.719***   1.651**    4.497***   1.715***
                            (0.800)    (0.605)    (0.504)    (0.466)    (0.689)    (1.141)    (0.253)
10–14 Years of Experience   2.279**    5.559***   2.175***   3.297***   1.397      4.949***   1.838***
                            (0.980)    (0.744)    (0.624)    (0.575)    (0.853)    (1.417)    (0.273)
15–24 Years of Experience   1.929*     5.638***   2.310***   3.414***   1.079      5.212***   1.792***
                            (1.122)    (0.863)    (0.725)    (0.668)    (1.002)    (1.660)    (0.275)
25 Years Plus               2.320*     7.816***   2.931***   4.550***   1.074      7.022***   2.072***
                            (1.358)    (1.072)    (0.901)    (0.828)    (1.203)    (2.010)    (0.307)
Advanced Degree             −1.049     −1.479**   −0.367     −0.736     −1.231*    −0.750     −0.521***
                            (0.744)    (0.640)    (0.528)    (0.486)    (0.670)    (1.115)    (0.159)
Professional Certificate    −0.606     0.362      −0.924**   −0.498     −0.735     −1.620*    −0.489**
                            (0.600)    (0.471)    (0.388)    (0.360)    (0.500)    (0.844)    (0.232)

Class characteristics
Class Size                  −0.020     0.036**    −0.057***  −0.026**   −0.037**   −0.097***  −0.003
                            (0.019)    (0.017)    (0.014)    (0.013)    (0.018)    (0.030)    (0.002)

Number of observations      209,379 in all models
R-Squared                   0.462      0.486      0.078      0.697      0.443      0.081      0.120
Number of students          136,622 in all models

* Significant at the 10% significance level.
** Significant at the 5% significance level.
*** Significant at the 1% significance level.
22 We present comparisons of the estimated standard errors across
models and discuss the relevant issues further in the Appendix.
effect estimates across pairs of models. We present such estimates in Appendix Table A4. However, when estimates of teacher productivity are used for accountability, identifying the most and least productive teachers is most important. Typically the least productive teachers are targeted for dismissal or for remediation, such as additional professional development. In contrast, the most productive teachers are eligible for bonuses or permanent increases in salary. Changes in the rankings of teachers in the middle of the productivity distribution are generally of little consequence. Therefore we focus our comparison of model estimates on the teachers who are identified as the most or least productive. Table 7 provides information on the degree to which pairs of models overlap in their identification of the most and least productive teachers, those in the top and bottom deciles.23

The extent of overlap is substantial, ranging from 17% to 94% (if the estimates were independent the overlap would only be 0.1 × 0.1 or 1%). The highest degree of overlap is between the partial persistence models with a time constant effect and either one or two lagged scores. The overlaps are also high among models with partial persistence and at least one lagged test score, but no unobserved effect; adding additional lagged scores or lagged inputs only affects 20–25% of teachers identified as being in the top/bottom of teacher rankings in those models.

We also find a relatively strong overlap between models with no unobserved heterogeneity, which typically falls into the 50–81% range (upper left quadrants of the top-10% and bottom-10% matrices in Table 7). The only exception is a rather low correlation between the models with and without lagged inputs (first row, first column in each matrix). When comparing models with and without an unobserved student/family effect, the overlap is generally lower and ranges from roughly 12% to about 39% (lower left quadrants of the matrices). The latter finding suggests that accounting for unobserved heterogeneity has a substantial impact on the estimates of teacher rankings. However, the small overlap may also be partly due to the fact that FE removes much variation from the data, which makes the estimates noisier.24

When looking at the overlap among models with unobserved heterogeneity (the lower right quadrants of the two matrices in Table 7), the overlap is relatively high among partial decay models that include different numbers of test score lags, a result noted above. However, the identification of the most and least productive teachers obtained from fixed effects and first-difference regressions is rather different (numbers at the intersection of the last three rows and first two columns in the lower right quadrants of the matrices). One possible explanation for these differences may be the presence of measurement error in students' test scores. The first-difference estimator is used in dynamic (geometric decay) models where the lagged score appears on the right-hand side. If measurement errors are not correlated over time, then using the lagged score as an instrument would resolve all endogeneity problems, including the measurement error problem. However, if errors in measuring student performance persist over time, then using the lagged score as an instrument does not resolve the problem. In contrast, measuring test scores with error does not cause bias in fixed effects estimates when only current and lagged inputs are among the regressors. Fixed effects estimation may also be a preferred estimation method when strict exogeneity fails due to non-random dynamic sorting of students to teachers. If student sorting is transitory (e.g., if only previous-period performance matters for the current teacher assignment, while the more distant past does not), the asymptotic bias due to the violation of strict exogeneity is ‘‘averaged’’ over multiple periods. Thus, the asymptotic bias (or inconsistency) becomes smaller when multiple years of data are used.25 The first-difference estimator does not have this property.

Table 7
Percent of teachers classified as top/bottom 10% in both models (different combinations of two models). Row and column labels give decay/estimation method/no. of test score lags/no. of input lags (N = no decay; P = partial decay). The first five models have no unobserved effect; the remaining models include a time constant effect. Baseline model is in bold.

Top 10%      N/OLS/0/0  P/OLS/0/3  P/OLS/1/0  P/OLS/3/0  P/OLS/3/3  N/FE/0/0  P/FE/0/3  P/FD/1/0  P/FD/2/0
No unobserved effect
N/OLS/0/0      100.0
P/OLS/0/3       22.0     100.0
P/OLS/1/0       55.0      59.6     100.0
P/OLS/3/0       55.5      51.4      78.9     100.0
P/OLS/3/3       49.5      50.5      72.0      79.4     100.0
Time constant effect
N/FE/0/0        33.5      12.4      23.9      26.6      30.7     100.0
P/FE/0/3        21.1      15.1      21.1      24.8      29.4      45.4    100.0
P/FD/1/0        28.4      22.9      33.0      32.6      33.9      17.0     25.2    100.0
P/FD/2/0        27.5      24.8      34.9      32.6      35.3      17.4     24.3     94.0    100.0
P/FD/2/3        36.7      24.3      35.8      33.5      39.0      21.1     27.1     65.1     62.8

Bottom 10%   N/OLS/0/0  P/OLS/0/3  P/OLS/1/0  P/OLS/3/0  P/OLS/3/3  N/FE/0/0  P/FE/0/3  P/FD/1/0  P/FD/2/0
No unobserved effect
N/OLS/0/0      100.0
P/OLS/0/3       36.7     100.0
P/OLS/1/0       69.7      53.7     100.0
P/OLS/3/0       70.2      45.0      77.1     100.0
P/OLS/3/3       66.1      50.0      69.3      81.2     100.0
Time constant effect
N/FE/0/0        38.5      23.9      36.2      29.4      30.7     100.0
P/FE/0/3        33.0      27.1      35.8      34.9      35.8      52.8    100.0
P/FD/1/0        28.4      17.9      26.1      25.2      22.9      22.0     24.3    100.0
P/FD/2/0        28.4      18.8      27.1      26.1      23.4      21.1     24.8     95.9    100.0
P/FD/2/3        33.9      20.2      32.6      31.2      31.2      27.1     28.0     61.9     60.6

23 Comparisons based on identifying teachers in the top and bottom quintiles of the productivity distribution are provided in the Appendix.
24 A table that summarizes information about standard errors of the estimated teacher effects is presented in the Appendix. Indeed, standard errors are the largest in the FE regressions. Further discussion of standard errors is provided in the Appendix.
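The attenuating role of classical measurement error can be illustrated with a stylized simulation. This is a hedged sketch, not the paper's FD-IV estimator: a single cross-section in which the lagged score is observed with serially uncorrelated noise, OLS attenuates the persistence coefficient toward zero, and instrumenting the noisy score with a second, independently noisy measurement of the same achievement restores consistency. If the two measurement errors were correlated (the analogue of serially correlated errors above), the instrument would fail. All variable names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gamma = 0.8  # true persistence of lagged achievement (assumed)

true_lag = rng.normal(size=n)                          # true lagged achievement
y = gamma * true_lag + rng.normal(scale=0.5, size=n)   # current score

# Two noisy measurements of the same lagged achievement with
# independent (serially uncorrelated) measurement errors.
x = true_lag + rng.normal(scale=0.5, size=n)  # regressor: noisy lagged score
z = true_lag + rng.normal(scale=0.5, size=n)  # instrument: second noisy measure

# OLS slope of y on x is attenuated by the noise in x:
# plim = gamma * Var(true)/(Var(true) + Var(noise)) = 0.8 * 1/1.25 = 0.64
ols = np.cov(y, x)[0, 1] / np.var(x)

# IV slope using z is consistent for gamma because the two
# measurement errors are mutually independent.
iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]

print(f"true {gamma:.2f}  OLS {ols:.3f}  IV {iv:.3f}")
```

With a large sample the OLS estimate lands near the attenuated value 0.64 while the IV estimate lands near the true 0.8, mirroring why uncorrelated measurement errors make a lagged-score instrument work.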
Finally, the model that produces the greatest overlap with the baseline model is the gains model with student fixed effects (i.e., no decay/FE/no lagged scores/no lagged inputs); about half of the teachers identified as being in the top or bottom categories in the baseline model also appear in the top/bottom categories when the gains model with student fixed effects is employed.26
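Overlap statistics of the kind reported in Table 7 are straightforward to compute from two vectors of estimated teacher effects. The sketch below is illustrative (simulated effects stand in for actual estimates): it counts the share of one model's top-decile teachers who also land in the other model's top decile, so a model compared with itself yields 100%.

```python
import numpy as np

def decile_overlap(effects_a, effects_b, top=True, pct=0.10):
    """Percent of teachers in the top (or bottom) decile under model A
    who are also in that decile under model B."""
    effects_a = np.asarray(effects_a)
    effects_b = np.asarray(effects_b)
    k = max(1, int(round(pct * effects_a.size)))
    order_a = np.argsort(effects_a)   # ascending rank order under model A
    order_b = np.argsort(effects_b)   # ascending rank order under model B
    if top:
        set_a, set_b = set(order_a[-k:]), set(order_b[-k:])
    else:
        set_a, set_b = set(order_a[:k]), set(order_b[:k])
    return 100.0 * len(set_a & set_b) / k

rng = np.random.default_rng(1)
model_a = rng.normal(size=1000)                        # simulated teacher effects
model_b = model_a + rng.normal(scale=0.5, size=1000)   # correlated second model

print(decile_overlap(model_a, model_a))  # a model always overlaps itself: 100.0
print(decile_overlap(model_a, model_b))  # partial overlap for correlated models
```

Under independence the expected overlap so defined is 10% of the decile group; values such as the 94% between the two first-difference models in Table 7 indicate near-identical classifications.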
5. Conclusion
Empirical research on teacher productivity has been based on ‘‘value-added’’ models, which are derived from an underlying structural model of cumulative achievement. Rarely have the assumptions required to obtain value-added models been tested, however, and never in a comprehensive way. Starting with a general model of student achievement, we specify the assumptions required to obtain commonly estimated models and derive econometric tests of those assumptions. Using data from Florida, we carry out these tests and find that nearly all of the simplifying assumptions are easily rejected.
One implication of our work is that estimates from commonly used value-added models cannot be given a structural interpretation. For example, the marginal effect of prior-year achievement on current test scores cannot be interpreted as the persistence of all educational inputs. Researchers seeking to derive structural parameters from value-added models need to employ the tests we have outlined in this paper to determine if such interpretations can be justified with their particular data set.
A second implication is that the data generating processes used in simulation work, which are based on some of the same assumptions used to create empirical value-added specifications, may be too simplistic. More complex processes appear to be at work. For example, the effects of prior inputs appear to vary with the grade level at which they are applied, and the rates of decay may vary across inputs. It would be valuable to know if the performance of value-added estimators differs when less restrictive data generating processes are used.
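A less restrictive data generating process of the kind suggested above can be sketched as follows. This is a hypothetical generator with assumed, illustrative parameters (weights and decay rates are not estimates from the paper): each grade's input enters with its own productivity weight and its own decay rate, rather than a single geometric decay common to all inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n_students, n_grades = 5000, 5

# Illustrative, assumed parameters: each grade's input has its own
# productivity (weight) and its own annual persistence (decay) rate.
weights = np.array([1.0, 1.2, 0.9, 1.1, 1.0])  # effect in the grade applied
decay = np.array([0.6, 0.8, 0.5, 0.7, 0.9])    # input-specific persistence

inputs = rng.normal(size=(n_students, n_grades))  # e.g., teacher quality by grade
scores = np.zeros((n_students, n_grades))

for g in range(n_grades):
    total = np.zeros(n_students)
    for past in range(g + 1):
        # an input applied in grade `past` decays for (g - past) years
        # at its own input-specific rate
        total += weights[past] * decay[past] ** (g - past) * inputs[:, past]
    scores[:, g] = total + rng.normal(scale=0.3, size=n_students)
```

Under a common geometric decay λ, the score in grade g would equal λ times the previous score plus the current input; this generator deliberately violates that restriction, so it can be used to probe how value-added estimators behave when the usual simplifying assumptions fail.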
For accountability purposes the underlying structural model is of little importance. All that is necessary is that value-added models yield accurate estimates of the relative performance of teachers, particularly those at the top and bottom of the productivity scale. This requires that value-added models produce unbiased (or at least minimally biased) predictions of student achievement and thereby yield measures of a teacher's impact on student achievement that are free of significant bias. Except in the unlikely event that the biases created by these assumption violations happen to all cancel out, our findings suggest that all commonly estimated value-added models are biased to some degree. We are not able to determine the magnitude of the bias, however.
Given the reality that many teacher evaluation systems already have a major test-score-based component and that this is unlikely to change in the near future, the choice between alternative value-added models is of significant importance. This is reinforced by our results, which indicate that teacher effect estimates from different value-added models vary greatly in many cases and that the overlap across models in the teachers identified as high or low performing can be low. We find that the teacher effect estimates that most closely align with our most flexible baseline model are those from a gains model with student fixed effects. Models that include student fixed effects tend to produce imprecise or ‘‘noisy’’ estimates of teacher performance, however. This loss of efficiency in student-fixed-effects models could outweigh the gains from reducing bias from unobserved student heterogeneity. Also, identification requires that significant numbers of students move between teachers over time, which may be problematic in some cases. Indeed, none of the value-added models currently employed in accountability systems use student fixed effects. Among models without
25 Koedel and Betts (2011) find evidence in support of transitory sorting and show that observing teachers over multiple time periods mitigates the dynamic sorting bias.
26 The observed degree of overlap (or more generally the correlation) among estimates from different models will depend on both the correlation of the systematic bias as well as the correlation of the estimation error. As suggested by a reviewer, we attempted to sort out these two effects by estimating teacher effects from distinct student cohorts in different periods in order to calculate the true variance in teacher effects (which equals the covariance in the estimated teacher effects between time periods) for each model. Unfortunately, limiting the sample to teachers who taught at Florida public schools for a sufficiently long period of time to allow estimation of teacher effects from two distinct time periods shrinks the number of comparable teachers by two-thirds. Moreover, we found that within-model covariances between teacher effects estimated over different time periods were generally small and often not statistically significantly different from zero. While one would expect positive correlations across time periods, the low observed covariances could result from significant biases with the direction of bias varying over time. Alternatively, the large reduction in sample size could result in extremely noisy estimates. It is also possible, though unlikely, that teacher effects were not constant over time. Whatever the reason for the small covariances, we do not have sufficient confidence in the estimates of the true variance to make judgments about the contributions of estimation error to the overlap in teacher effect estimates across models.
student fixed effects, a model with partial persistence, three lagged test scores and three lags of observable inputs comes closest to mimicking the estimates from our most flexible baseline model, though results are not much different for models with fewer lagged scores and/or fewer lagged inputs.

Acknowledgements

We wish to thank the staff of the Florida Department of Education's K-20 Education Data Warehouse for their assistance in obtaining and interpreting the data used in this study. The views expressed in this paper are solely our own and do not necessarily reflect the opinions of the Florida Department of Education. This work is supported by Teacher Quality Research grant R305M040121 from the United States Department of Education Institute for Education Sciences. Thanks also go to Anthony Bryk for useful discussion of this research and to John Gibson for excellent research assistance.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.econedurev.2013.10.003.

References

Andrabi, T., Das, J., Khwaja, A. I., & Zajonc, T. (2011). Do value-added estimates add value? Accounting for learning dynamics. American Economic Journal: Applied Economics, 3, 29–54.
Boardman, A. E., & Murnane, R. J. (1979). Using panel data to improve estimates of the determinants of educational achievement. Sociology of Education, 52, 113–121.
Bonesronning, H. (2004). The determinants of parental effort in education production: Do parents respond to changes in class size? Economics of Education Review, 23, 1–9.
Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2006). Teacher–student matching and the assessment of teacher effectiveness. The Journal of Human Resources, XLI, 778–820.
Dee, T. S. (2004). Teachers, race and student achievement in a randomized experiment. Review of Economics and Statistics, 86, 195–210.
Ding, W., & Lehrer, S. F. (2007). Accounting for unobserved ability heterogeneity within education production functions. (unpublished manuscript).
Figlio, D. N. (1999). Functional form and the estimated effects of school resources. Economics of Education Review, 18, 241–252.
Guarino, C. M., Reckase, M. D., & Wooldridge, J. (2012). Can value-added measures of teacher education performance be trusted? Working paper #18. East Lansing, MI: The Education Policy Center at Michigan State University.
Hanushek, E. A. (1979). Conceptual and empirical issues in the estimation of educational production functions. Journal of Human Resources, 14, 351–388.
Harris, D. N. (2007). Diminishing marginal returns and the production of education: An international analysis. Education Economics, 15, 31–45.
Harris, D. N., & Sass, T. R. (2006). Value-added models and the measurement of teacher quality. (unpublished manuscript).
Houtenville, A. J., & Conway, K. S. (2008). Parental effort, school resources and student achievement. Journal of Human Resources, XLIII, 437–453.
Jacob, B. A., Lefgren, L., & Sims, D. P. (2010). The persistence of teacher-induced learning. Journal of Human Resources, 45, 915–943.
Kane, T., McCaffrey, D., Miller, T., & Staiger, D. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment. Seattle, WA: Bill and Melinda Gates Foundation.
Kane, T., & Staiger, D. (2008). Estimating teacher impacts on student achievement: An experimental evaluation. Working paper #14607. Washington, DC: National Bureau of Economic Research.
Koedel, C., & Betts, J. (2011). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Education Finance and Policy, 6, 18–42.
Lockwood, J. R., & McCaffrey, D. F. (2007). Controlling for individual heterogeneity in longitudinal models, with applications to student achievement. Electronic Journal of Statistics, 1, 223–252.
Lockwood, J. R., & McCaffrey, D. F. (2009). Exploring student–teacher interactions in longitudinal achievement data. Education Finance and Policy, 4, 439–467.
McCaffrey, D. F., Lockwood, J. R., Mihaly, K., & Sass, T. R. (2012). A review of Stata routines for fixed effects estimation in normal linear models. The Stata Journal, 12(3), 1–27.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257.
Rice, J. K. (2003). Teacher Quality: Understanding the Effectiveness of Teacher Attributes. Washington, DC: Economic Policy Institute.
Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay and student achievement. Quarterly Journal of Economics, 125, 175–214.
Todd, P. E., & Wolpin, K. I. (2003). On the specification and estimation of the production function for cognitive achievement. The Economic Journal, 113, F3–F33.
Todd, P. E., & Wolpin, K. I. (2007). The production of cognitive achievement in children: Home, school and racial test score gaps. Journal of Human Capital, 1, 91–136.
Wayne, A. J., & Youngs, P. (2003). Teacher characteristics and student achievement gains. Review of Educational Research, 73, 89–122.
Wilson, S., Floden, R. E., & Ferrini-Mundy, J. (2001). Teacher Preparation Research: Current Knowledge, Gaps, and Recommendations. Seattle, WA: Center for the Study of Teaching and Policy.
Wilson, S., & Floden, R. (2003). Creating Effective Teachers: Concise Answers for Hard Questions. New York, NY: AACTE Publications.
Wooldridge, J. M. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teacher and classroom context effects on student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11, 57–67.