Measuring Test Measurement Error:
A General Approach
Donald Boyd
Hamilton Lankford
University at Albany
Susanna Loeb
Stanford University
James Wyckoff
University of Virginia
Test-based accountability as well as value-added assessments and much experimental and quasi-experimental research in education rely on achievement tests
to measure student skills and knowledge. Yet, we know little regarding funda-
mental properties of these tests, an important example being the extent of mea-
surement error and its implications for educational policy and practice. While
test vendors provide estimates of split-test reliability, these measures do not
account for potentially important day-to-day differences in student perfor-
mance. In this article, we demonstrate a credible, low-cost approach for esti-
mating the overall extent of measurement error that can be applied when
students take three or more tests in the subject of interest (e.g., state assessments
in consecutive grades). Our method generalizes the test–retest framework by
allowing for (a) growth or decay in knowledge and skills between tests, (b) tests
being neither parallel nor vertically scaled, and (c) the degree of measurement
error varying across tests. The approach maintains relatively unrestrictive, tes-
table assumptions regarding the structure of student achievement growth. Esti-
mation only requires descriptive statistics (e.g., test-score correlations). With
student-level data, the extent and pattern of measurement-error heteroscedasti-
city also can be estimated. In turn, one can compute Bayesian posterior means
of achievement and achievement gains given observed scores—estimators hav-
ing statistical properties superior to those for the observed score (score gain).
We employ math and English language arts test-score data from New York City to demonstrate these methods and estimate that the overall extent of test measurement error is at least twice as large as that reported by the test vendor.
Keywords: generalizability theory, reliability, testing, high-stakes testing, correlational
analysis, longitudinal studies, effect size
Journal of Educational and Behavioral Statistics
2013, Vol. 38, No. 6, pp. 629–663
DOI: 10.3102/1076998613508584
© 2013 AERA. http://jebs.aera.net
Test-based accountability, teacher evaluation, and much experimental and
quasi-experimental research in education rely on achievement tests as an impor-
tant metric to assess student skills and knowledge. Yet, we know little regarding
the properties of these tests that bear directly on their use and interpretation. For
example, evidence is often scarce regarding the extent to which standardized
tests are aligned with educational standards or the outcomes of interest to policy
makers or analysts. Similarly, we know little about the extent of test measure-
ment error and the implications of such error for educational policy and practice.
The estimates of reliability provided by test vendors capture only one of a num-
ber of different sources of error.
This article focuses on test measurement error and demonstrates a credible
approach for estimating the overall extent of error. For the achievement tests
we analyze, the measurement error is at least twice as large as that indicated
in the technical reports provided by the test vendor. Such error in measuring stu-
dent performance results in measurement error in the estimation of teacher effec-
tiveness, school effectiveness, and other measures based on student test scores.
The relevance of test measurement error in assessing the usefulness of metrics
such as teacher value added or schools’ adequate yearly progress often is noted
but not addressed, due to the lack of easily implemented methods for quantifying
the overall extent of measurement error. This article demonstrates such a tech-
nique and provides evidence of its usefulness.
Thorndike (1951) articulates a variety of factors that can result in test scores
being noisy measures of student achievement. Technical reports by test vendors
provide information regarding test measurement error as defined in classical test
theory and item response theory (IRT). For both, the focus is on the measurement
error associated with the test instrument (i.e., randomness in the selection of test
items and the raw score to scale score conversion). This information is useful, but
provides no information regarding the error from other sources, for example,
variability in test conditions.
Reliability coefficients based on the test–retest approach using parallel test forms are viewed in the psychometric literature as the gold standard for quantifying measurement error from all sources. Students take alternative, but parallel
(i.e., interchangeable), tests two or more times sufficiently separated in time to
allow for the ‘‘random variation within each individual in health, motivation,
mental efficiency, concentration, forgetfulness, carelessness, subjectivity or
impulsiveness in response and luck in random guessing,’’ but sufficiently close
in time that the knowledge, skills, and abilities of individuals taking the tests are
unchanged (Feldt & Brennan, 1989). However, there are relatively few examples
of this approach to measurement-error estimation in practice, especially in the
analysis of student achievement tests used in high-stakes settings.
Rather than analyze the consistency of scores across tests close in time, the
standard approach is to divide a single test into parallel parts. Such split-test
reliability only accounts for the measurement error resulting from the random
selection of test items from the relevant population of items. As Feldt and Brennan
(1989) note, this approach ‘‘frequently present[s] a biased picture,’’ in that, ‘‘report-
ed reliability coefficients tend to overstate the trustworthiness of educational
measurement, and standard errors underestimate within-person variability’’ because
potentially important day-to-day differences in student performance are ignored.
In this article, we show that there is a credible approach for measuring the
overall extent of measurement error applicable in a wide variety of settings.
Estimation is straightforward and only requires estimates of the variances and
correlations of test scores in the subject of interest at several points in time
(e.g., third-, fourth-, and fifth-grade math scores for a cohort of students).
Student-level data are not needed. Our approach generalizes the test–retest
framework to allow for (a) either growth or decay in the knowledge and skills
of students between tests, (b) tests to be neither parallel nor vertically scaled,
and (c) the extent of measurement error to vary across tests. Utilizing test-
score covariance or correlation estimates and maintaining minimal structure
characterizing the nature of achievement growth, one can estimate the overall
extent of test measurement error and decompose a test-score variance into the
part attributable to real differences in achievement and the part attributable to
measurement error. When student-level data are available, the extent and pat-
tern of measurement-error heteroscedasticity also can be estimated.
The following section briefly introduces generalizability theory and shows how
the total measurement error is reflected in the covariance structure of observed test
scores. In turn, we explain our statistical approach and report estimates of the over-
all extent of measurement error associated with New York (NY) State assessments
in math and English language arts (ELA), and how the extent of test measurement
error varies across ability levels. These estimates are then used to compute Baye-
sian posterior means and variances of ability conditional on observed scores, the
posterior mean being both the best (i.e., lowest mean square error) and an unbiased
predictor of a student’s actual ability. We conclude with a summary and a brief
discussion of ways in which information regarding the extent of test measurement
error can be informative in analyses related to educational practice and policy.
1. Measurement Error and the Structure of Test-Score Covariances
From the perspective of classical test theory, an individual’s observed score is
the sum of the true score representing the expected value of test scores over some
set of test replications and the residual difference, or random error, associated
with test measurement error. Generalizability theory extends test theory to expli-
citly account for multiple sources of measurement error.1 Consider the case
where a student takes a test at a point in time with the test consisting of a set
of tasks (e.g., questions) drawn from some universe of similar conditions of
measurement. Over a short time period, there is a set of possible test occasions
(e.g., dates) for which the student’s knowledge/skills/ability is constant. Even
so, the test performance of a student typically will vary across such occasions.
First, randomness in the selection of test items along with students doing espe-
cially well or poorly on particular tasks is one source of measurement error. Tem-
poral instability in student performance due to factors aside from changes in
ability (e.g., sleepiness) is another.
Consider the case where students complete a sequence of tests in a subject or
related subjects. Let $S_{ij}$ in $S_{ij} = \tau_{ij} + \eta_{ij}$ represent the $i$th student's score on the exam taken on one occasion during the $j$th testing period. For exposition, we assume there is one exam per grade.2 The student's universe score, $\tau_{ij}$, is the expected value of $S_{ij}$ over the universe of generalization (e.g., the universes of possible tasks and occasions). Comparable to the true score in classical test theory, $\tau_{ij}$ measures the student's skills or knowledge. $\eta_{ij}$ is the test measurement error from all sources where $E\eta_{ij} = 0$, $E\eta_{ij}\tau_{ik} = 0\ \forall j,k$, and $E\eta_{ij}\eta_{ik} = 0\ \forall j \neq k$; the errors have zero mean, are not correlated with actual achievement, and are not correlated over time. Allowing for heteroscedasticity across students, $\sigma^2_{\eta_{ij}} \equiv E\eta_{ij}^2$ is the test measurement-error variance for the $i$th student in grade $j$. Let $\sigma^2_{\eta\cdot j} \equiv E\sigma^2_{\eta_{ij}}$ represent the mean measurement-error variance for a particular test and test-taking population. In the case of homoscedastic measurement error, $\sigma^2_{\eta_{ij}} = \sigma^2_{\eta\cdot j}\ \forall i$.
Researchers and policy makers are interested in decomposing the variance of observed scores for the $j$th test, $\omega_{jj}$, into the variance of universe scores, $\gamma_{jj}$, and the measurement-error variance: $\omega_{jj} = \gamma_{jj} + \sigma^2_{\eta\cdot j}$. The generalizability coefficient, $G_j \equiv \gamma_{jj}/\omega_{jj}$, measures the portion of the test-score variance that is explained by the variance of universe scores.
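As a minimal numerical sketch of this decomposition (the variance and generalizability values below are invented for illustration, not taken from the article):

```python
# Decompose an observed test-score variance omega_jj into the
# universe-score variance gamma_jj and the mean measurement-error
# variance, using the generalizability coefficient G_j = gamma_jj/omega_jj.
def decompose(omega_jj, G_j):
    gamma_jj = G_j * omega_jj         # variance of universe (true) scores
    sigma2_eta = omega_jj - gamma_jj  # omega_jj = gamma_jj + error variance
    return gamma_jj, sigma2_eta

# Invented values: observed-score variance 400, G_j = 0.8.
gamma, err = decompose(400.0, 0.8)
print(gamma, err)  # 320.0 80.0
```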
$$S_i = \tau_i + \eta_i. \qquad (1)$$

Vector notation is employed in Equation 1, where $S_i' \equiv [S_{i1}\ S_{i2}\ \cdots\ S_{iJ}]$, $\tau_i' \equiv [\tau_{i1}\ \tau_{i2}\ \cdots\ \tau_{iJ}]$, and $\eta_i' \equiv [\eta_{i1}\ \eta_{i2}\ \cdots\ \eta_{iJ}]$ for the first through the $J$th tested grades.3 Equation 2 defines $\Sigma(i)$ to be the autocovariance matrix for the $i$th student's observed test scores, $S_i$. $\Gamma$ is the autocovariance matrix for the universe scores in the population of students. $\mathrm{H}_i$ is the diagonal matrix with the measurement-error variances for the $i$th student on the diagonal.

$$
\Sigma(i) = E\big[(S_i - ES_i)(S_i - ES_i)'\big] = E\big[(\tau_i - E\tau_i)(\tau_i - E\tau_i)'\big] + E(\eta_i\eta_i')
$$
$$
=\begin{bmatrix}
\omega_{i11} & \omega_{i12} & \cdots & \omega_{i1J} \\
\omega_{i21} & \omega_{i22} & \cdots & \omega_{i2J} \\
\vdots & & \ddots & \vdots \\
\omega_{iJ1} & \omega_{iJ2} & \cdots & \omega_{iJJ}
\end{bmatrix}
=\begin{bmatrix}
\gamma_{11} & \gamma_{12} & \cdots & \gamma_{1J} \\
\gamma_{21} & \gamma_{22} & \cdots & \gamma_{2J} \\
\vdots & & \ddots & \vdots \\
\gamma_{J1} & \gamma_{J2} & \cdots & \gamma_{JJ}
\end{bmatrix}
+\begin{bmatrix}
\sigma^2_{\eta_{i1}} & 0 & \cdots & 0 \\
0 & \sigma^2_{\eta_{i2}} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2_{\eta_{iJ}}
\end{bmatrix}
= \Gamma + \mathrm{H}_i. \qquad (2)
$$
$$\bar{\Sigma} \equiv E\,\Sigma(i) = \Gamma + \bar{\mathrm{H}}. \qquad (3)$$

The test-score covariance matrix for the population of test takers, $\bar{\Sigma}$, is shown in Equation 3, where $\bar{\mathrm{H}}$ is the diagonal matrix with $\sigma^2_{\eta\cdot 1}, \sigma^2_{\eta\cdot 2}, \ldots, \sigma^2_{\eta\cdot J}$ on the diagonal.4 Note that corresponding off-diagonal elements of $\Sigma(i)$, $\Sigma(i')$, and $\bar{\Sigma}$ are equal; $\omega_{ijk} = \omega_{jk} = \gamma_{jk}\ \forall j \neq k$. In contrast, corresponding diagonal elements $\omega_{ijj} = \gamma_{jj} + \sigma^2_{\eta_{ij}}$ and $\omega_{jj} = \gamma_{jj} + \sigma^2_{\eta\cdot j}$ are not equal when measurement error is heteroscedastic.
$$
\bar{\Sigma}=\begin{bmatrix}
\omega_{11} & \omega_{12} & \omega_{13} & & \omega_{1J} \\
 & \omega_{22} & \omega_{23} & \cdots & \omega_{2J} \\
 & & \omega_{33} & & \omega_{3J} \\
 & & & \ddots & \vdots \\
 & & & & \omega_{JJ}
\end{bmatrix}
=\begin{bmatrix}
\gamma_{11}/G_1 & \gamma_{12} & \gamma_{13} & & \gamma_{1J} \\
 & \gamma_{22}/G_2 & \gamma_{23} & \cdots & \gamma_{2J} \\
 & & \gamma_{33}/G_3 & & \gamma_{3J} \\
 & & & \ddots & \vdots \\
 & & & & \gamma_{JJ}/G_J
\end{bmatrix}. \qquad (4)
$$

With $\omega_{jk} = \gamma_{jk}\ \forall j \neq k$ and $\omega_{jj} = \gamma_{jj}/G_j$, we have the formula for $\bar{\Sigma}$ in Equation 4.
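The role of measurement error in Equations 3 and 4, inflating only the diagonal of the test-score covariance matrix, can be illustrated with a small simulation; all parameters below are invented:

```python
import random

random.seed(0)

# Simulate two grades of universe scores (tau) and observed scores (S).
# Observed scores add independent measurement error eta with variance 0.36.
n = 50000
S1, S2, T1, T2 = [], [], [], []
for _ in range(n):
    t1 = random.gauss(0, 1)               # universe score, grade 1 (Var = 1)
    t2 = 0.9 * t1 + random.gauss(0, 0.5)  # universe score, grade 2
    T1.append(t1); T2.append(t2)
    S1.append(t1 + random.gauss(0, 0.6))  # observed score = tau + eta
    S2.append(t2 + random.gauss(0, 0.6))

def cov(x, y):
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# Off-diagonal element: the test-score covariance estimates the
# universe-score covariance (here 0.9 * Var(t1) = 0.9).
print(round(cov(S1, S2), 2), round(cov(T1, T2), 2))
# Diagonal element: the test-score variance is inflated by the
# measurement-error variance (1.0 + 0.36 = 1.36).
print(round(cov(S1, S1), 2))
```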
Let $r_{jk}$ and $\rho_{jk}$, respectively, represent the test-score and universe-score correlations for tests $j$ and $k$. These correlations along with Equation 4 imply the test-score correlation matrix, $R$:

$$
R=\begin{bmatrix}
1 & r_{12} & r_{13} & r_{14} & r_{15} & \\
 & 1 & r_{23} & r_{24} & r_{25} & \cdots \\
 & & 1 & r_{34} & r_{35} & \\
 & & & 1 & r_{45} & \\
 & & & & 1 & \\
 & & & & & \ddots
\end{bmatrix}
=\begin{bmatrix}
1 & \sqrt{G_1G_2}\,\rho_{12} & \sqrt{G_1G_3}\,\rho_{13} & \sqrt{G_1G_4}\,\rho_{14} & \sqrt{G_1G_5}\,\rho_{15} & \\
 & 1 & \sqrt{G_2G_3}\,\rho_{23} & \sqrt{G_2G_4}\,\rho_{24} & \sqrt{G_2G_5}\,\rho_{25} & \cdots \\
 & & 1 & \sqrt{G_3G_4}\,\rho_{34} & \sqrt{G_3G_5}\,\rho_{35} & \\
 & & & 1 & \sqrt{G_4G_5}\,\rho_{45} & \\
 & & & & 1 & \\
 & & & & & \ddots
\end{bmatrix}.
\qquad (5)
$$
The presence of test measurement error (i.e., $G_j < 1$) implies that each correlation of test scores is smaller than the corresponding correlation of universe scores. In contrast, the off-diagonal elements of the empirical test-score covariance matrix are estimates of the off-diagonal elements of the universe-score covariance matrix; $\omega_{jk} = \gamma_{jk}$.
Estimates of the $\omega_{jk}$ or the $r_{jk}$ alone are not sufficient to infer estimates of the $\gamma_{jj}$ and $G_j$, as there are $J$ more parameters in both Equations 4 and 5 than there are moments.5 However, there is a voluminous literature in which researchers employ more parsimonious covariance- and correlation-matrix specifications to economize on the number of parameters to be estimated while retaining sufficient flexibility in the covariance structure. For a variety of such structures, one can estimate $\gamma_{jj}$ and $G_j$, though the reasonableness of any particular structure will be context-specific.
As an example, suppose that one knew or had estimates of test-score correlations for parallel tests taken at times $t_1, t_2, \ldots, t_J$, where time intervals between consecutive tests can vary. Correlation structures that allow for changes in skills and knowledge over time typically maintain that the correlation between any two universe scores is smaller, the longer is the time span between the tests. For example, one possible specification is $\rho_{jk} = r^{|t_k - t_j|}$ with $r < 1$. Here the correlation of universe scores decreases at a constant rate as the time interval between the tests increases. Maintaining this structure and assuming $G_j = G\ \forall j$, $G$ and $r$ are identified with three tests, as shown in Equation 6.6 If $J \geq 4$, $G_1, G_2, \ldots, G_J$ and $r$ are identified.

$$r = (r_{13}/r_{12})^{1/|t_3 - t_2|} \qquad G = r_{12}\,r_{23}/r_{13}. \qquad (6)$$

This example generalizes the congeneric model analyzed by Joreskog (1971). Tests are said to be congeneric if the true scores, $\tau_{ik}$, are linear functions of a common $\tau_{i*}$ (i.e., true scores are perfectly correlated). For this case, Joreskog shows that $G_1$, $G_2$, and $G_3$ are identified, which generalizes the test–retest framework where $\rho = 1$ and $G_j = G\ \forall j$.

The structure $\rho_{jk} = r^{|t_k - t_j|}$ has potential uses, but is far from general. The central contribution of this article is to show that the overall extent of test measurement error and universe-score variances can be estimated maintaining far less restrictive universe-score covariance structures, thereby substantially generalizing the test–retest approach. The intuition is relatively straightforward. For example, in a wide range of universe-score covariance structures, $\gamma_{jk}$ in Equation 4 can be expressed as functions of $\gamma_{jj}$ and $\gamma_{kk}$.7 In such cases, estimates of the $\omega_{jk} = \gamma_{jk},\ j \neq k$, can be used to estimate $\gamma_{jj}$ and $G_j = \gamma_{jj}/\omega_{jj}$.
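A minimal sketch of the three-test identification in Equation 6, using invented correlation values and test dates:

```python
# Equation 6: with universe-score correlations rho_jk = r**|t_k - t_j|
# and a common generalizability coefficient G, three test-score
# correlations identify both r and G.
def identify(r12, r23, r13, t2, t3):
    # r12, r23, r13: hypothetical sample test-score correlations;
    # t2, t3: dates of the second and third tests (t1 = 0 implied).
    r = (r13 / r12) ** (1.0 / abs(t3 - t2))
    G = r12 * r23 / r13
    return r, G

# Invented correlations built from r = 0.95, G = 0.85, tests a year apart:
# r12 = G*r = 0.8075, r23 = 0.8075, r13 = G*r**2 = 0.767125.
r, G = identify(0.8075, 0.8075, 0.767125, t2=1, t3=2)
print(round(r, 4), round(G, 4))  # 0.95 0.85
```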
Additional intuition follows from an understanding of circumstances in which
our approach is not applicable. The primary case is where a universe score is
multidimensional with at least one of the dimensions of ability not correlated
with any of the abilities measured by the other tests. For example, suppose the
universe score for the second exam measures two abilities such that $\tau_{i2} = \tau^o_{i2} + \chi_{i2}$ with $\mathrm{Cov}(\chi_{i2}, \tau_{ik}) = 0\ \forall k$, and $\mathrm{Cov}(\tau^o_{i2}, \tau_{ik}) \neq 0\ \forall k \neq 2$.8 Because $\omega_{2k} = \gamma_{2k} = \mathrm{Cov}(\tau^o_{i2}, \tau_{ik})$ is not a function of $V(\chi_{i2})$, knowledge of the $\omega_{jk}$ does not identify $V(\chi_{i2})$, $\gamma_{22} = V(\tau^o_{i2}) + V(\chi_{i2})$, or $G_2 = \big[V(\tau^o_{i2}) + V(\chi_{i2})\big]/\omega_{22}$. Thus, in cases where tests measure multidimensional abilities, application of our approach is appropriate only if every skill and ability measured by each test is correlated with one or more skill or ability measured by the other tests. When this property does not hold, the extent of measurement error and the extent of variation in $\chi_{i2}$ measured by $V(\chi_{i2})$ are confounded. (Regarding dimensionality, it is relevant to note that IRT models used in test scoring typically maintain that each test measures ability along a single dimension, which can be, and often is, tested.)
Note that an increase in the measurement error in the $j$th test (i.e., a decrease in $G_j$), keeping other things constant, implies the same proportionate reduction in every test-score correlation in the $j$th row and column of $R$ in Equation 5, but no change in any of the other test-score correlations, as $G_j$ only appears in that row and column. Whether $G_j$ is identified crucially depends upon whether a change in $G_j$ is the only explanation for such a proportionate change in $r_{jk}\ \forall k$, with no change in $r_{mn},\ m, n \neq j$. Another possible explanation is the case where $\chi_{i2}$ represents an ability not correlated with any of the abilities measured by the other tests. An increase in $V(\chi_{i2})$ would imply proportionate declines in $\rho_{2k}$ and $r_{2k}\ \forall k$, with $\rho_{mn}$ and $r_{mn},\ m, n \neq 2$, unchanged. However, in many circumstances, analysts will find it reasonable to rule out this possibility, for example, dismiss the possibility that the universe-score correlations for the first and second exams and the second and third exams could decline at the same time that the universe-score correlation for the first and third exams remained unchanged. More generally, a variety of universe-score correlation structures rule out the possibility of a proportionate change in every universe-score correlation in the $j$th row and column with no change in every other $\rho_{mn},\ m, n \neq j$. In those cases, a proportionate change in the $r_{jk}\ \forall k$, with no change in $r_{mn},\ m, n \neq j$, necessarily implies an equal proportionate change in $G_j$.
In Equation 5, note that $(r_{13}/r_{14})/(r_{23}/r_{24}) = (\rho_{13}/\rho_{14})/(\rho_{23}/\rho_{24})$. In general, $r_{gj}/r_{hj}$ is to $r_{gk}/r_{hk}$ as $\rho_{gj}/\rho_{hj}$ is to $\rho_{gk}/\rho_{hk}$. Also, often it is reasonable to maintain that the universe-score correlation matrix follows some general structure, which implies functional relationships among the universe-score correlations. This, in turn, simplifies expressions such as $(\rho_{13}/\rho_{14})/(\rho_{23}/\rho_{24})$. In this way, the relative magnitudes of the $r_{jk}$ are key in identifying the $\rho_{jk}$. One example is the case of $\rho_{jk} = r^{|t_k - t_j|}$, which implies that $r = \big(\rho_{jm}/\rho_{jk}\big)^{1/|t_m - t_k|}$. More generally, the pattern of decline in $r_{j,j+m}$ as $m$ increases in the $j$th row (column) relative to the pattern of decline for $r_{k,k+m}$ in other rows (columns) is key in identifying $\rho_{jk}$.
Identification is not possible in the case of a compound symmetric universe-score correlation structure (i.e., correlations are equal for all test pairs). Substituting $\rho_{jk} = \rho\ \forall j,k$ in Equation 5 makes clear that a proportionate increase (decrease) in $\rho$ accompanied by an equal proportionate reduction (increase) in all the $G_j$ leaves all the test-score correlations unchanged. Thus, our approach can identify the $G_j$ only if it is not the case that $\rho_{jk} = \rho\ \forall j,k$. Fortunately, it is quite reasonable to rule out this possibility in cases where tests in a subject or related subjects are taken over time, as the correlations typically will differ reflecting the timing of tests.
The extent of test measurement error can be estimated whether or not tests are
vertically scaled. Given the prevalence of questions regarding whether tests in
practice are vertically scaled (e.g., Ballou, 2009), it is fortunate that our approach
can employ test-score correlations as in Equation 5. Each test must reflect an
interval scale, but the scales can differ across tests. Even though the lack of
vertical scaling has a number of undesirable consequences regarding what can be
inferred from test scores, no problem arises with respect to the estimation of the
extent of test measurement error for the individual tests, measured by Gj. In anal-
yses where tests are known, or presumed, to be vertically scaled, as in the esti-
mation of growth models, the extent of test measurement error can be
estimated employing either test-score covariances or the corresponding correla-
tions. However, in estimating the extent of measurement error and universe-score
variances, nothing is lost by employing the correlations, and there is the advan-
tage that the estimator does not depend upon whether the tests actually are ver-
tically scaled.
In summary, smaller test-score correlations can reflect either larger measure-
ment error or smaller universe-score correlations, or a combination of both. It is
possible to distinguish between these explanations in a variety of settings, includ-
ing situations in which tests are neither parallel nor vertically scaled. In fact, the
tests can measure different abilities, provided that, first, there is no ability mea-
sured by a test that is uncorrelated with all the abilities measured by the other
tests, and, second, one can credibly maintain at least minimal structure character-
izing the universe-score correlations for the tests being analyzed.
Our approach falls within the general framework for the analysis of covar-
iance structures discussed by Joreskog (1978), the kernel of which can be found
in Joreskog (1971). Our method also draws upon that employed by Abowd and
Card (1989) to study the covariance structure of individual and household earn-
ings, hours worked, and other time-series variables.
2. Estimation Strategy
To decompose the variance of test scores into the parts attributable to real
differences in achievement and measurement error requires estimates of test-
score variances and covariances or correlations along with assumptions regard-
ing the structure characterizing universe-score covariances or correlations. One
approach is to directly specify the $\rho_{jk}$ (e.g., assume $\rho_{jk} = r^{|t_k - t_j|}$). We label this
the reduced-form approach as such a specification directly assumes some
reduced-form stochastic relationship between the universe scores. An alternative
is to assume an underlying structure of achievement growth, including random
and nonrandom components, and infer the corresponding reduced-form pattern
of universe-score correlations.
$$\tau_{i,j+1} = \beta_j \tau_{ij} + \theta_{i,j+1}. \qquad (7)$$

Equation 7 is one such structural specification where academic achievement, measured by universe scores, is cumulative. This first-order autoregressive structure models attainment in grade $j+1$ as depending upon the level of knowledge and skills in the prior grade,9 possibly subject to decay (if $\beta_j < 1$) that can vary across grades. A key assumption is that decay is not complete, that is, $\beta_j > 0$.
$\beta_j = \beta\ \forall j$ is a special case, as is $\beta_j = 1$. $\theta_{i,j+1}$ is the gain in student achievement in grade $j+1$, gross of any decay. In a fully specified structural model, one must also specify the statistical structure of the $\theta_{i,j+1}$.10 For example, $\theta_{i,j+1}$ could be a function of a student-level random effect, $\mu_i$, and white noise, $\varepsilon_{i,j+1}$: $\theta_{i,j+1} = \mu_i + \varepsilon_{i,j+1}$. Alternatively, $\theta_{i,j+1}$ could be a first-order autoregressive process or a moving average. Each such specification along with Equation 7 implies reduced-form structures for the covariance and correlation matrices in Equations 4 and 5.11 As demonstrated below, one can also employ a hybrid approach that continues to maintain Equation 7 but, rather than fully specifying the underlying stochastic structure of test-to-test achievement gains, assumes that the underlying structure is such that $E[\theta_{i,j+1}|\tau_{ij}]$ is a linear function of $\tau_{ij}$.
The relative attractiveness of these approaches will vary depending upon the
particular application. For example, when analysts employ test-score data to esti-
mate models of achievement growth and also are interested in estimating the
extent of test measurement error, it would be logical in the latter analysis to main-
tain the covariance or correlation structures implied by the model/models of
achievement growth maintained in the former analysis. At the same time, there
are advantages of employing the hybrid, linear model developed below. For
example, the framework has an intuitive, relatively flexible, and easy-to-
estimate universe-score correlation structure so that the approach can be applied
whether or not the tests are vertically scaled. The hybrid model also lends itself to
a relatively straightforward analysis of measurement-error heteroscedasticity and
also allows the key linearity assumption to be tested. Of primary importance is
whether there is a convincing conceptual justification for the specification
employed in a particular application. Analysts may have greater confidence in
assessing the credibility of a structural or hybrid model of achievement growth
than assessing the credibility of a reduced-form covariance structure considered
in isolation.
2.1. A Linear Model
In general, the test-to-test gain in achievement can be written as the sum of its mean conditional on the prior level of ability and a random error having zero mean: $\theta_{i,j+1} = E[\theta_{i,j+1}|\tau_{ij}] + u_{i,j+1}$, where $u_{i,j+1} \equiv \theta_{i,j+1} - E[\theta_{i,j+1}|\tau_{ij}]$ and $E\,u_{i,j+1}\tau_{ij} = 0$. The assumption that such conditional mean functions are linear in parameters is at the core of regression analysis. We go a step further and assume that $E[\theta_{i,j+1}|\tau_{ij}]$ is a linear function of $\tau_{ij}$: $E[\theta_{i,j+1}|\tau_{ij}] = a_j + b_j\tau_{ij}$, where $a_j$ and $b_j$ are parameters. Here we do not explore the full set of stochastic structures characterizing test-to-test learning, $\theta_{i,j+1}$, for which a linear specification is a reasonably good approximation. However, it is relevant to note that the linear specification is a first-order Taylor approximation for any $E[\theta_{i,j+1}|\tau_{ij}]$ and that $\tau_{ij}$ and $\theta_{i,j+1}$ having a bivariate normal distribution is sufficient, but not
necessary, to assure linearity in $\tau_{ij}$. As discussed below, the assumption of linearity can be tested.
Equation 7 and $\theta_{i,j+1} = a_j + b_j\tau_{ij} + u_{i,j+1}$ imply that $\tau_{i,j+1} = a_j + c_j\tau_{ij} + u_{i,j+1}$, where $c_j \equiv \beta_j + b_j$; the universe score in grade $j+1$ is a linear function of the universe score in the prior grade. The two components of coefficient $c_j$ reflect (a) part of the student's proficiency in grade $j+1$ having already been attained in grade $j$, attenuated per Equation 7, and (b) the expected growth during year $j+1$ being linearly dependent on the prior-year achievement, $\tau_{ij}$.

The linear model $\tau_{i,j+1} = a_j + c_j\tau_{ij} + u_{i,j+1}$ implies that $\rho_{j,j+1} = c_j\sqrt{\gamma_{jj}/\gamma_{j+1,j+1}}$ (e.g., $\rho_{12} = c_1\sqrt{\gamma_{11}/\gamma_{22}}$). In addition, $\rho_{j,j+2} = \rho_{j,j+1}\,\rho_{j+1,j+2}$ (e.g., $\rho_{13} = c_2c_1\sqrt{\gamma_{11}/\gamma_{33}} = c_1\sqrt{\gamma_{11}/\gamma_{22}}\;c_2\sqrt{\gamma_{22}/\gamma_{33}} = \rho_{12}\rho_{23}$), $\rho_{j,j+3} = \rho_{j,j+1}\,\rho_{j+1,j+2}\,\rho_{j+2,j+3}$, and so on. This structure along with Equation 5 implies the following moment conditions:

$$
\begin{bmatrix}
r_{12} & r_{13} & r_{14} & \cdots \\
 & r_{23} & r_{24} & \cdots \\
 & & r_{34} & \cdots \\
 & & & \ddots
\end{bmatrix}
=
\begin{bmatrix}
\sqrt{G_1G_2}\,\rho_{12} & \sqrt{G_1G_3}\,\rho_{12}\rho_{23} & \sqrt{G_1G_4}\,\rho_{12}\rho_{23}\rho_{34} & \cdots \\
 & \sqrt{G_2G_3}\,\rho_{23} & \sqrt{G_2G_4}\,\rho_{23}\rho_{34} & \cdots \\
 & & \sqrt{G_3G_4}\,\rho_{34} & \cdots \\
 & & & \ddots
\end{bmatrix}. \qquad (8)
$$
Because $\sqrt{G_1}$ and $\rho_{12}$ only appear as a multiplicative pair, the parameters are not identified, but $\rho^*_{12} \equiv \sqrt{G_1}\,\rho_{12}$ is identified. The same is true for $\rho^*_{J-1,J} \equiv \sqrt{G_J}\,\rho_{J-1,J}$, where $J$ is the last grade for which test scores are available. After substituting the expressions for $\rho^*_{12}$ and $\rho^*_{J-1,J}$, the $N_m = J(J-1)/2$ moments in Equation 8 are functions of the $N_\theta = 2J - 3$ parameters in $\theta = [G_2\ G_3\ \cdots\ G_{J-1}\ \rho^*_{12}\ \rho_{23}\ \cdots\ \rho_{J-2,J-1}\ \rho^*_{J-1,J}]$, which can be identified provided that $J \geq 4$. With one or more additional parameter restrictions, $J = 3$ is sufficient for identification. For example, when $G_j = G$, estimates of the test-score correlations for $J = 3$ tests imply the following estimators:

$$\hat{\rho}_{12} = r_{13}/r_{23} \qquad \hat{\rho}_{23} = r_{13}/r_{12} \qquad \hat{G} = r_{12}\,r_{23}/r_{13}. \qquad (9)$$
In general, estimated test-score correlations together with assumptions regarding the structure of student achievement growth are sufficient to estimate the universe-score correlations and the relative extent of measurement error measured by the generalizability coefficients. In turn, estimates of $G_j$ and the test-score variance, $\omega_{jj}$, imply the test measurement-error variance estimator $\hat{\sigma}^2_{\eta\cdot j} = \omega_{jj}(1 - \hat{G}_j)$ as well as the universe-score variance estimator $\hat{\gamma}_{jj} = \omega_{jj}\hat{G}_j$ measuring the dispersion in student achievement in grade $j$.
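A sketch of the Equation 9 estimators and the implied variance decomposition, with invented sample correlations and variances:

```python
def estimate_three_tests(r12, r23, r13, omega):
    """Equation 9 estimators for three tests sharing a common G, plus the
    implied universe-score and measurement-error variance estimates.
    r12, r23, r13: sample test-score correlations (invented below).
    omega: observed test-score variances [omega_11, omega_22, omega_33]."""
    rho12 = r13 / r23          # universe-score correlation, tests 1 and 2
    rho23 = r13 / r12          # universe-score correlation, tests 2 and 3
    G = r12 * r23 / r13        # common generalizability coefficient
    gamma = [G * w for w in omega]             # universe-score variances
    sigma2_eta = [(1 - G) * w for w in omega]  # measurement-error variances
    return rho12, rho23, G, gamma, sigma2_eta

# Invented inputs for illustration.
rho12, rho23, G, gamma, err = estimate_three_tests(
    0.72, 0.70, 0.63, omega=[1.0, 1.1, 1.2])
print(round(G, 3), round(rho12, 3), round(rho23, 3))  # 0.8 0.9 0.875
```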
The equations in (9) illustrate the intuition regarding identification discussed in Section 1. Consider the implications of $r_{12}$, $r_{23}$, and $r_{13}$ being smaller, which need not imply an increase in the extent of test measurement error. The last equation in (9) implies that $dG/G = dr_{23}/r_{23} + dr_{12}/r_{12} - dr_{13}/r_{13}$. Thus, $G$ would
remain constant if the proportionate change in $r_{13}$ equals the sum of the proportionate changes in $r_{12}$ and $r_{23}$. In such cases, the magnitude of the proportionate reduction in $r_{13}$ equals or exceeds the proportionate reduction in $r_{12}$ ($r_{23}$). With strict inequalities, $\hat{\rho}_{12}$ and $\hat{\rho}_{23}$ will decline as shown in the first two formulae in Equation 9. If the proportionate reduction in $r_{13}$ equals the proportionate reductions in both $r_{12}$ and $r_{23}$, $\hat{\rho}_{12}$ and $\hat{\rho}_{23}$ would remain constant, but $\hat{G}$ would have the same proportionate reduction. In other cases, changes in $r_{12}$, $r_{23}$, and $r_{13}$ will imply changes in $\hat{G}$ as well as a change in either $\hat{\rho}_{12}$ or $\hat{\rho}_{23}$, or changes in both.
Whether the parameters are exactly identified as in Equation 9 or overidentified, the parameters can be estimated using a minimum-distance estimator. For example, suppose the elements of the column vector $\rho(\theta)$ are the moment conditions on the right-hand side of Equation 8 after having substituted the expressions for $\rho^*_{12}$ and $\rho^*_{J-1,J}$. With $r$ representing the corresponding vector of $N_M$ test-score correlations for a sample of students, the minimum-distance estimator is $\hat{\theta} = \operatorname{argmin}_\theta\,[r - \rho(\theta)]'B\,[r - \rho(\theta)]$, where $B$ is any positive semidefinite matrix. $\theta$ is locally identified if $\operatorname{plim} B = B_0$ and $\operatorname{rank}[B_0\,\partial\rho(\theta)/\partial\theta'] = N_\theta$, $N_M \ge N_\theta$ being a necessary condition. Equalities imply the parameters are exactly identified, with the estimators implicitly defined in $r = \rho(\theta)$ and unaffected by the choice of $B$; Equation 9 is one such example. We employ the identity matrix, so that $\hat{\theta}_{MD} = \operatorname{argmin}_\theta\,[r - \rho(\theta)]'[r - \rho(\theta)]$.^{12} The estimated generalizability coefficients, in turn, can be used to infer estimates of the universe-score variances, $\gamma_{jj} = \Gamma_j\,\omega_{jj}$, and measurement-error variances $\sigma^2_{\eta\cdot j} = \gamma_{jj}(1 - \Gamma_j)/\Gamma_j = (1 - \Gamma_j)\,\omega_{jj}$. Rather than estimating $\gamma_{jj}$ and $\sigma^2_{\eta\cdot j}$ in such a second step, the moment conditions $\omega_{jj} = \gamma_{jj}/\Gamma_j$ and $\omega_{jj} = \sigma^2_{\eta\cdot j}/(1 - \Gamma_j)$ can be included in $r$ and $\rho(\theta)$, yielding parameter estimates and standard errors for $\gamma_{jj}$ and $\sigma^2_{\eta\cdot j}$ in addition to the other parameters in $\rho(\theta)$.

The variance of the minimum-distance estimator is $V(\hat{\theta}_{MD}) = [Q'Q]^{-1} Q'V(r)Q\,[Q'Q]^{-1}$, where $Q$ is the matrix of derivatives $Q = \partial\rho(\theta)/\partial\theta$. $V(r)$ enters the formula because the sample moments, $r$, are employed as estimates of the corresponding population moments, $r_0$, where the limit distribution of $r$ is $\sqrt{N_S}\,(r - r_0) \xrightarrow{d} N[0, V(r)]$. The precision of $\hat{\theta}_{MD}$, affected by this random sampling error, can also be assessed using bootstrapping: computing $\hat{\theta}_{MD}$ for each of a large number of bootstrapped samples provides information regarding the distribution of $\hat{\theta}_{MD}$, including an estimate of $V(\hat{\theta}_{MD})$.
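To make the estimator concrete, here is a minimal sketch (using NumPy and SciPy with the identity weight matrix) that fits the first-order structure underlying Equation 8 to the computed ELA correlations in Table 2. It is a reconstruction of the setup described above, not the authors' code; the end-point parameters $\rho^*_{34}$ and $\rho^*_{78}$ absorb $\sqrt{\Gamma_3}$ and $\sqrt{\Gamma_8}$, which is imposed by fixing $\Gamma_3 = \Gamma_8 = 1$.

```python
import numpy as np
from scipy.optimize import least_squares

# Computed ELA correlations, grades 3-8 (below-diagonal values in Table 2).
r_obs = {
    (3, 4): 0.7416, (3, 5): 0.6949, (3, 6): 0.6899, (3, 7): 0.6573, (3, 8): 0.6356,
    (4, 5): 0.7328, (4, 6): 0.7357, (4, 7): 0.6958, (4, 8): 0.6709,
    (5, 6): 0.7198, (5, 7): 0.6800, (5, 8): 0.6514,
    (6, 7): 0.7303, (6, 8): 0.7050, (7, 8): 0.6923,
}
pairs = sorted(r_obs)

def r_model(theta):
    # theta = (rho34, rho45, rho56, rho67, rho78, G4, G5, G6, G7); the two
    # end-point correlations absorb sqrt(G3) and sqrt(G8), so G3 = G8 = 1.
    rho = dict(zip([(3, 4), (4, 5), (5, 6), (6, 7), (7, 8)], theta[:5]))
    G = {3: 1.0, 4: theta[5], 5: theta[6], 6: theta[7], 7: theta[8], 8: 1.0}
    return np.array([
        np.prod([rho[(m, m + 1)] for m in range(j, k)]) * np.sqrt(G[j] * G[k])
        for (j, k) in pairs
    ])

def residuals(theta):
    # r - rho(theta): the vector whose squared norm the estimator minimizes.
    return np.array([r_obs[p] for p in pairs]) - r_model(theta)

theta0 = np.array([0.85, 0.95, 0.95, 0.95, 0.85, 0.8, 0.8, 0.8, 0.8])
sol = least_squares(residuals, theta0, bounds=(0.01, 1.0))
G4, G5, G6, G7 = sol.x[5:]
print({k: round(v, 3) for k, v in zip(["G4", "G5", "G6", "G7"], [G4, G5, G6, G7])})
# The generalizability estimates should land close to the Table 4 ELA column.
```

Bootstrapping the standard errors amounts to resampling students, recomputing `r_obs`, and rerunning the same fit.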
2.2. Additional Points
Estimation of the overall extent of measurement error for a population of test
takers only requires descriptive statistics and correlations of test scores, an attractive feature of our approach. Additional inferences are possible when student-level data are available, an important example being the analysis of the extent and pattern of heteroscedasticity. The linear model $\tau_{i,j+1} = a_j + c_j\tau_{ij} + u_{i,j+1}$ and the formula $S_{ik} = \tau_{ik} + \eta_{ik}$ imply that $\eta_{i,j+1} - c_j\eta_{ij} + u_{i,j+1} = S_{i,j+1} - a_j - c_j S_{ij}$. Equating the variances of the two sides of this equality yields Equation 10.
$$\sigma^2_{\eta_{i,j+1}} + c_j^2\,\sigma^2_{\eta_{ij}} = V\!\left(S_{i,j+1} - c_j S_{ij}\right) - \sigma^2_{u_{j+1}} \qquad (10)$$

Here $c_j = \gamma_{j,j+1}/\gamma_{jj}$ and $\sigma^2_{u_{j+1}} = \gamma_{j+1,j+1} - \gamma_{j,j+1}^2/\gamma_{jj}$.^{13} By specifying a functional relationship between $\sigma^2_{\eta_{i,j+1}}$ and $\sigma^2_{\eta_{ij}}$, Equation 10 can be used to explore the nature and extent of measurement-error heteroscedasticity. $\sigma^2_{\eta_{i,j+1}} = \sigma^2_{\eta_{ij}}$ is one example, but it is of limited use in that it does not allow for either (a) variation in common factors affecting $\sigma^2_{\eta_{ij}}$ for all students (e.g., a decrease in $\sigma^2_{\eta\cdot j} \equiv E\sigma^2_{\eta_{ij}}$ resulting from an increase in the number of test items) or (b) variation between $\sigma^2_{\eta_{ij}}$ and $\sigma^2_{\eta_{i,j+1}}$ for individual students, holding $\sigma^2_{\eta\cdot j}$ and $\sigma^2_{\eta\cdot j+1}$ constant. To allow for differences in the population mean measurement-error variance across tests, one could employ the specification $\sigma^2_{\eta_{i,j+1}}/\sigma^2_{\eta\cdot j+1} = \sigma^2_{\eta_{ij}}/\sigma^2_{\eta\cdot j}$ or, equivalently, $\sigma^2_{\eta_{i,j+1}} = K_j\,\sigma^2_{\eta_{ij}}$, where $K_j \equiv \sigma^2_{\eta\cdot j+1}/\sigma^2_{\eta\cdot j}$. Here the proportionate difference between $\sigma^2_{\eta_{i,j+1}}$ and $\sigma^2_{\eta\cdot j+1}$ for the $i$th test taker is the same as that between $\sigma^2_{\eta_{ij}}$ and $\sigma^2_{\eta\cdot j}$. To meaningfully relax this assumption, we assume that $\sigma^2_{\eta_{i,j+1}} = K_j\,\sigma^2_{\eta_{ij}} + \epsilon_{ij}$, where the random variable $\epsilon_{ij}$ has zero mean. This formulation, along with Equation 10, implies Equation 11. Thus, the mean measurement-error variance for a group of students represented by $C$ can be estimated using Equation 12. One can also employ the noisy student-level estimate in Equation 13 as the dependent variable in a regression analysis estimating the extent to which $\sigma^2_{\eta_{ij}}$ varies with the level of student achievement or other variables, as employed below.
$$\sigma^2_{\eta_{ij}} = \left[\,V\!\left(S_{i,j+1} - c_j S_{ij}\right) - \sigma^2_{u_{j+1}} - \epsilon_{ij}\right] \Big/ \left(K_j + c_j^2\right) \qquad (11)$$

$$\hat{\sigma}^2_{\eta_{Cj}} = \left[\,(1/N_C)\sum_{i\in C}\left[\left(S_{i,j+1} - \bar{S}_{j+1}\right) - c_j\left(S_{ij} - \bar{S}_j\right)\right]^2 - \sigma^2_{u_{j+1}}\right] \Big/ \left(K_j + c_j^2\right) \qquad (12)$$

$$\hat{\sigma}^2_{\eta_{ij}} = \left[\,\left[\left(S_{i,j+1} - \bar{S}_{j+1}\right) - c_j\left(S_{ij} - \bar{S}_j\right)\right]^2 - \sigma^2_{u_{j+1}}\right] \Big/ \left(K_j + c_j^2\right) \qquad (13)$$
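A simulated sketch of Equations 12 and 13 follows, with hypothetical parameter values; $c_j$ and $\sigma^2_{u_{j+1}}$ are treated as known, as they would be after the first-stage estimation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Simulated universe scores for adjacent grades (hypothetical parameters):
# tau_{j+1} = c_j * tau_j + u, chosen so both grades have unit variance.
c_j, s2_u = 0.8, 0.36
tau_j = rng.normal(0.0, 1.0, N)
tau_j1 = c_j * tau_j + rng.normal(0.0, np.sqrt(s2_u), N)

# Heteroscedastic measurement error with variance 0.2 + 0.1*tau^2 (mean 0.3).
def score(tau):
    return tau + rng.normal(0.0, np.sqrt(0.2 + 0.1 * tau ** 2))

S_j, S_j1 = score(tau_j), score(tau_j1)
K_j = 1.0  # same mean error variance on both tests in this simulation

# Equation 12 with C = all students: mean measurement-error variance.
d = (S_j1 - S_j1.mean()) - c_j * (S_j - S_j.mean())
s2_eta_mean = (np.mean(d ** 2) - s2_u) / (K_j + c_j ** 2)

# Equation 13: noisy student-level estimates; their average equals Equation 12.
s2_eta_i = (d ** 2 - s2_u) / (K_j + c_j ** 2)

print(round(s2_eta_mean, 3))  # close to the true mean error variance of 0.3
```

The student-level estimates `s2_eta_i` are individually very noisy, which is why the article uses them only as regression outcomes rather than as stand-alone estimates.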
The parameters entering the universe-score covariance or correlation structure can be estimated without specifying the distributions of $\tau_{ij}$ and $\eta_{ij}$, but additional inferences are possible with such specifications. When needed, we assume that $\tau_{ij}$ and $\eta_{ij}$ are normally distributed. If $\eta_{ij}$ is either homoscedastic or heteroscedastic with $\sigma^2_{\eta_{ij}}$ not varying with the level of ability, $\tau_{ij}$ and $S_{ij}$ are bivariate normal, which implies that the conditional distribution of $\tau_{ij}$ given $S_{ij}$ is normal with moments $E\left(\tau_{ij} \mid S_{ij}\right) = (1 - \Gamma_{ij})\mu_j + \Gamma_{ij}S_{ij}$ and $V\left(\tau_{ij} \mid S_{ij}\right) = (1 - \Gamma_{ij})\gamma_{jj}$, where $\mu_j \equiv E\tau_{ij} = ES_{ij}$. In the homoscedastic case, $\Gamma_{ij} = \Gamma_j$. With heteroscedasticity and $\sigma^2_{\eta_{ij}}$ not varying with ability, $\Gamma_{ij} = \gamma_{jj}\big/\left(\gamma_{jj} + \sigma^2_{\eta_{ij}}\right)$. The Bayesian posterior mean of $\tau_{ij}$ given $S_{ij}$, $E\left(\tau_{ij} \mid S_{ij}\right)$, is unbiased and the best (i.e., minimum mean squared error) estimator of the student's actual ability.^{14} $V\left(\tau_{ij} \mid S_{ij}\right)$ and easily computed Bayesian credible bounds (confidence intervals) can be employed to measure the precision of the best unbiased predictor (BUP) for each student.
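With illustrative values (not estimates from the article), the posterior moments reduce to a two-line computation:

```python
from math import sqrt

# Illustrative values (not estimates from the article): mean score 650,
# universe-score variance 900, generalizability coefficient 0.8.
mu_j, gamma_jj, Gamma_j = 650.0, 900.0, 0.8

def posterior(S):
    mean = (1 - Gamma_j) * mu_j + Gamma_j * S  # E(tau | S): shrink toward mu_j
    sd = sqrt((1 - Gamma_j) * gamma_jj)        # sqrt of V(tau | S)
    return mean, sd

m, sd = posterior(775.0)
print(round(m, 1), round(sd, 1))  # 750.0 13.4: a 775 score shrinks 25 points
```

An 80% credible interval under normality is then simply $m \pm 1.2816\,sd$.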
Computing posterior means and variances, as well as credible bounds, is somewhat more complicated when the extent of test measurement error systematically varies across ability levels, as in our application (i.e., $\sigma_{\eta_{ij}} = \sigma_{\eta_j}(\tau_{ij})$). The normal density of $\eta_{ij}$ is $g_j\left(\eta_{ij} \mid \tau_{ij}\right) = \phi\!\left(\eta_{ij}\big/\sigma_{\eta_j}(\tau_{ij})\right)\big/\sigma_{\eta_j}(\tau_{ij})$, where $\phi(\cdot)$ is the standard normal density. The joint density of $\tau_{ij}$ and $\eta_{ij}$, shown in Equation 14, is not bivariate normal, due to $\sigma_{\eta_{ij}}$ being a function of $\tau_{ij}$.

$$h_j\left(\eta_{ij},\tau_{ij}\right) = g_j\left(\eta_{ij} \mid \tau_{ij}\right) f_j(\tau_{ij}) = \frac{1}{\sigma_{\eta_j}(\tau_{ij})\sqrt{\gamma_{jj}}}\,\phi\!\left(\eta_{ij}\big/\sigma_{\eta_j}(\tau_{ij})\right)\phi\!\left((\tau_{ij}-\mu_j)\big/\sqrt{\gamma_{jj}}\right) \qquad (14)$$
$$k_j\left(S_{ij}\right) = \int_{-\infty}^{\infty} h_j\left(S_{ij}-\tau_{ij},\,\tau_{ij}\right)d\tau_{ij} = \int_{-\infty}^{\infty} g_j\left(S_{ij}-\tau_{ij} \mid \tau_{ij}\right) f_j(\tau_{ij})\,d\tau_{ij} \qquad (15)$$

$$k_j\left(S_{ij}\right) = \sum_{m=1}^{M} g_j\left(S_{ij}-\tau^*_{mj} \mid \tau^*_{mj}\right)\Big/M \qquad (16)$$

$$E\left(\tau_{ij} \mid S_{ij}\right) = \frac{1}{k_j\left(S_{ij}\right)M}\sum_{m=1}^{M} \tau^*_{mj}\, g_j\left(S_{ij}-\tau^*_{mj} \mid \tau^*_{mj}\right) \qquad (17)$$

$$P\left(\tau_{ij} < a \mid S_{ij}\right) = \frac{1}{k_j\left(S_{ij}\right)M}\sum_{\tau^*_{mj} < a} g_j\left(S_{ij}-\tau^*_{mj} \mid \tau^*_{mj}\right) \qquad (18)$$
The conditional density of $\tau_{ij}$ given $S_{ij}$ is $h_j\left(S_{ij}-\tau_{ij},\,\tau_{ij}\right)\big/k_j\left(S_{ij}\right)$, where $k_j\left(S_{ij}\right)$ is the density of $S_{ij}$. As shown in Equation 15, $S_{ij}$ is a mixture of normal random variables. Given $\sigma_{\eta_{ij}} = \sigma_{\eta_j}(\tau_{ij})$, the integral can be calculated using Monte Carlo integration with importance sampling, as shown in Equation 16, where $\tau^*_{mj},\ m = 1, 2, \ldots, M$, is a sufficiently large set of random draws from the distribution $f_j(\tau_{ij})$, which need not be normal. Similarly, the posterior mean ability level given any particular score can be computed using Equation 17. Also, the formula for the cumulative posterior distribution of $\tau_{ij}$ in Equation 18 can be used to compute Bayesian credible bounds. For example, the 80% credible interval is $(L, U)$ such that $P\left(L \le \tau_{ij} \le U \mid S_{ij}\right) = 0.80$. Here we choose the lower and upper bounds such that $P\left(\tau_{ij} < L \mid S_{ij}\right) = 0.10$ and $P\left(\tau_{ij} \le U \mid S_{ij}\right) = 0.90$.
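A sketch of Equations 16 through 18 follows, with an assumed U-shaped $\sigma_{\eta_j}(\tau)$ and illustrative prior parameters; the bisection step used to invert the posterior CDF is an implementation convenience, not part of the article's method.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M = 200_000

# Illustrative prior and a U-shaped measurement-error SD (not NYC estimates).
mu_j, gamma_jj = 650.0, 900.0

def sigma_eta(tau):
    return 15.0 + 0.004 * (tau - mu_j) ** 2

tau_star = rng.normal(mu_j, np.sqrt(gamma_jj), M)  # draws from f_j

def posterior(S):
    # g_j(S - tau* | tau*): normal density of the implied measurement error.
    g = norm.pdf(S - tau_star, scale=sigma_eta(tau_star))
    k = g.mean()                                  # density of S    (Eq. 16)
    mean = np.sum(tau_star * g) / (M * k)         # E(tau | S)      (Eq. 17)
    def cdf(a):
        return np.sum(g[tau_star < a]) / (M * k)  # P(tau < a | S)  (Eq. 18)
    return mean, cdf

mean, cdf = posterior(700.0)

def quantile(p, lo=500.0, hi=800.0):
    # Invert the posterior CDF by bisection to obtain credible bounds.
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return lo

print(round(mean, 1), round(quantile(0.10), 1), round(quantile(0.90), 1))
```

Because the error SD grows away from the mean, the posterior mean for a 700 score shrinks back toward 650, and the 80% interval is the pair of quantiles at 0.10 and 0.90.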
The linear model is a useful tool for estimating the overall extent of test measurement error. Estimation is straightforward, and the key requirement that $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ is a linear function of $\tau_{ij}$ will be reasonable in a variety of circumstances. However, this will not always be the case; exams assessing minimum competency are one possible example. Thus, in assessing the applicability of the linear model in each possible use, one must assess whether the assumptions underlying the linear model are likely to hold. Fortunately, whether $\tau_{i,j+1}$ is a linear function of $\tau_{ij}$ often can be tested, as demonstrated in Section 3.1.
Finally, it is important to understand that the linear model is only one of the specifications that fall within our general approach. One can carry out empirical analyses employing fully specified statistical structures for the $y_{ij}$. Furthermore, rather than inferring the correlation structure from a set of underlying assumptions, one can start with an assumed covariance or correlation structure. A range of specifications for the structure of correlations is possible, including $\rho_{jk} = \rho^{\left|t_k - t_j\right|}$ and variations on the specification shown in Equation 8. Again, the reasonableness of any particular structure will be context specific.
3. An Empirical Application
We estimate the parameters in the linear model employing test-score moments
(e.g., correlations) for the third- through eighth-grade NY State math and ELA tests
taken by the cohort of New York City (NYC) students who were in the third grade
during the 2004–2005 school year. Students who made normal grade progression
were in the eighth grade in 2009–2010. The exams, developed by CTB/McGraw-Hill, are aligned to the NY State learning standards and are given to all registered students, with limited accommodations and exclusions. Here we analyze IRT scale scores, but our approach also can be used to analyze raw scores.
Table 1 reports descriptive statistics for the sample of students. Correlations
for ELA and math are shown below the diagonals in Tables 2 and 3. Employing
these statistics as estimates of population moments results in sampling error, as
discussed in the last paragraph of Section 2.1. However, the extent of such error
will be relatively small in cases where most students in the population of interest
are tested (e.g., statewide assessments), with missing scores primarily reflecting
absences on test days due to random factors such as illness. Individuals in the
population of interest also may not be tested due to nonrandom factors, for exam-
ple, a student subpopulation being exempt from testing. More subtle problems
also can arise. For example, across grades and subjects in our sample of NYC
students, roughly 7% of the students having scores in one grade have missing
scores for the next grade. This would not be a problem if the scores were missing
completely at random (see Rubin, 1987; Schafer, 1997). However, this is not the
TABLE 1.
Descriptive Statistics for Cohort
            ELA              Math
            M       SD       M       SD
Grade 3     626.8   37.3     616.5   42.3
Grade 4     657.9   39.0     665.8   36.0
Grade 5     659.3   36.1     665.7   37.5
Grade 6     658.0   28.8     667.8   37.5
Grade 7     661.7   24.4     671.0   32.5
Grade 8     660.5   26.0     672.2   31.9
            N = 67,528       N = 74,700
Note: ELA ¼ English language arts.
TABLE 2.
Correlations of Scores on the New York State ELA Examinations in Grades 3 Through 8
Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Grade 3 0.7416 0.6934 0.6937 0.6571 0.6332
Grade 4 0.7416 0.7342 0.7346 0.6958 0.6705
Grade 5 0.6949 0.7328 0.7173 0.6794 0.6548
Grade 6 0.6899 0.7357 0.7198 0.7309 0.7044
Grade 7 0.6573 0.6958 0.6800 0.7303 0.6923
Grade 8 0.6356 0.6709 0.6514 0.7050 0.6923
Note: ELA ¼ English language arts. Computed values below the diagonal and fitted values above.
TABLE 3.
Correlations of Scores on the New York State Math Examinations in Grades 3 Through 8
Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Grade 3 0.7286 0.7003 0.6603 0.6393 0.6119
Grade 4 0.7286 0.7694 0.7254 0.7023 0.6722
Grade 5 0.6936 0.7755 0.7597 0.7355 0.7039
Grade 6 0.6616 0.7248 0.7592 0.7964 0.7623
Grade 7 0.6480 0.6998 0.7323 0.7944 0.7929
Grade 8 0.6091 0.6685 0.7077 0.7643 0.7929
Note: Computed values below the diagonal and fitted values above.
case, as students who have missing scores typically score relatively low in the
grades for which scores are present. The exception is that scores are missing for
some very high-scoring students who skip the next grade. Dropping observations
with any missing scores would yield a sample not representative of the overall
student population. Pairwise computation of correlations would reduce, but not
eliminate, the problem. Imputation of missing data, which we employed prior
to computing the descriptive statistics reported in Tables 1 through 3, is a better
solution for dealing with such systematic patterns of missing data.15
3.1. Testing Model Assumptions
The simple correlation structure in Equation 8 follows from assuming that $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ is linear in $\tau_{ij}$. Whether linearity is a reasonably good approximation can be assessed using test-score data. The lines in Figure 1(a) and (b) are nonparametric estimates of $E\left(S_{i8} \mid S_{i7}\right)$ for ELA and math, respectively, showing how
eighth-grade scores are related to scores in the prior grade. The bubbles with
white fill show the actual combinations of observed seventh- and eighth-grade
scores, with the area of each bubble reflecting the relative number of students
with that score combination.
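A binned-means local average is one simple way to form such a nonparametric estimate; the sketch below uses simulated scores, not the NYC data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated grade 7 and grade 8 scores (illustrative, not the NYC data).
s7 = rng.normal(660.0, 30.0, 50_000)
s8 = 0.8 * s7 + 140.0 + rng.normal(0.0, 20.0, 50_000)

# Binned means: one local average of s8 per 5-point bin of s7.
edges = np.arange(600.0, 725.0, 5.0)
idx = np.digitize(s7, edges)
centers = [0.5 * (edges[i - 1] + edges[i]) for i in range(1, len(edges))]
means = [s8[idx == i].mean() for i in range(1, len(edges))]
print(centers[0], round(means[0], 1))  # local average in the lowest bin
```

Plotting `means` against `centers` traces out the conditional-mean curve; with linear simulated data the trace is a straight line, while an S shape in real data is exactly what Figure 1 exhibits.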
The dark bubbles toward the bottoms of Figure 1(a) and (b) show the IRT stan-
dard errors of measurement (SEMs) for the seventh-grade tests (right vertical axis)
reported in the test technical reports (CTB/McGraw-Hill, 2009). Note that the
extent of measurement error associated with the test instrument is meaningfully
larger for both low and high abilities, reflecting the nonlinear mapping between
raw and scale scores. Each point of the conditional SEM plot corresponds to a par-
ticular scale score as well as the corresponding raw score; movements from one dot
to the next (left to right) reflect a 1-point increase in the raw score (e.g., one addi-
tional correct answer), with the scale-score change shown on the horizontal axis.
For example, starting at an ELA scale score of 709, a 1-point raw score increase
corresponds to a 20-point increase in the scale score to 729. In contrast, starting
from a scale score of 641, a 1-point increase in the raw score corresponds to a
2-point increase in the scale score. This varying coarseness of the raw to scale score
mappings—reflected in the varying spacing of points aligned in rows and columns
in the bubble plot—explains why the reported scale score SEMs are substantially
higher for both low and high scores. Even if the variance were constant across the
range of raw scores, the same would not be true for scale scores.
The fitted nonparametric curves in Figure 1(a) and (b), as well as very similar results for other grades, provide strong evidence that $E\left(S_{i,j+1} \mid S_{ij}\right)$ is not a linear function of $S_{ij}$. Even so, this does not contradict our assumption that $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ is a linear function of $\tau_{ij}$; test measurement error can explain $E\left(S_{i,j+1} \mid S_{ij}\right)$ being S shaped even when $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ is linear in $\tau_{ij}$. It is not measurement error per se that implies $E\left(S_{i,j+1} \mid S_{ij}\right)$ will be an S-shaped function of $S_{ij}$; $E\left(S_{i,j+1} \mid S_{ij}\right)$ will be linear in $S_{ij}$ if the measurement-error variance is constant (i.e., $\sigma^2_{\eta_{ij}} = \sigma^2_{\eta\cdot j},\ \forall i$). However, $E\left(S_{i,j+1} \mid S_{ij}\right)$ will be an S-shaped function of $S_{ij}$ when $\eta_{ij}$ is heteroscedastic with $\sigma_{\eta_{ij}} = \sigma_{\eta_j}(\tau_{ij})$ having a U shape (e.g., the SEM patterns shown in Figure 1). The explanation and an example are included in the Appendix, along with a discussion of how information regarding the pattern of test measurement error can be used to obtain consistent estimates of the parameters in a polynomial specification of $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$. We utilize this approach to eliminate the inconsistency of the parameter estimates resulting from the measurement error reflected in the SEMs reported in the technical reports. Even though this does not eliminate any inconsistency resulting from other sources of measurement error, we are able to adjust for the meaningful heteroscedasticity reflected in the reported SEMs.
Results from using this approach to analyze the NYC test-score data are shown in Figure 2 for ELA and math. The thicker, S-shaped curves correspond to the ordinary least squares estimate of $S_{i8}$ regressed on $S_{i7}$ using a cubic specification. The third-order polynomial is the lowest order specification that can capture the general features of the nonparametric estimates of $E\left(S_{i,j+1} \mid S_{ij}\right)$ in Figure 1. The dashed lines are cubic estimates of $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ obtained using the approach described in the Appendix to avoid the parameter-estimate inconsistency associated with that part of test measurement error reflected in the SEMs reported in the technical reports. For comparison, the straight lines are the estimates of $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ employing this approach and a linear specification. It is striking how close the consistent cubic estimates of $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ are to being linear.^{16} Similar patterns were found for the other grades. Overall, the assumption that $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$ is a linear function of $\tau_{ij}$ appears to be quite reasonable in our application.
[Figure 1 appears here: panels (a) ELA and (b) Math, each plotting grade 8 score against grade 7 score (600–800), with the reported SEM for grade 7 on the right vertical axis (0–200).]

FIGURE 1. Nonparametric regression of Grade 8 scores on scores in Grade 7, bubble graph showing the joint distribution of scores and standard error of measurement for seventh-grade scores.
3.2. Estimated Model
Parameter estimates and standard errors are reported in Table 4. The predicted
correlations implied by the estimated models, shown above the diagonals in
Tables 2 and 3, allow us to assess how well the estimated models fit the observed
correlations shown below the diagonals. To evaluate goodness of fit, consider the
absolute differences between the empirical and predicted correlations. The aver-
age, and average proportionate, absolute differences for ELA are .001 and .002,
respectively. For math, the differences are .003 and .005. Thus, the estimated lin-
ear models fit the NYC data quite well.
The estimated generalizability coefficients in Table 4 for math are meaning-
fully larger than those for ELA, and the estimates for ELA are higher in some
grades compared to others. These differences are of sufficient size that one could
reasonably question whether they reflect estimation error or a fundamental short-
coming of our approach, or both, rather than underlying differences in the extent
of test measurement error. Fortunately, we can compare the estimates to the
reliability measures reported in the technical reports for the NY tests, to see
whether the reliability coefficients differ in similar ways. The top two lines in
Figure 3 show the reported reliability coefficients for math (solid line) and ELA
(dashed line). The lower two lines show the generalizability coefficient estimates
reported in Table 4. It is not surprising that the estimated generalizability coeffi-
cients are smaller than the corresponding reported reliability coefficients, as the lat-
ter statistics do not account for all sources of measurement error. However,
the consistency of the patterns is striking. The differences between the reliability and generalizability coefficients vary little across grades and subjects, averaging 0.117.
[Figure 2 appears here: panels (a) ELA and (b) Math, each plotting score against prior score (600–775); legend: biased cubic, consistent cubic, consistent linear.]

FIGURE 2. Cubic regression estimates of $E\left(S_{i,j+1} \mid S_{ij}\right)$ as well as consistent estimates of cubic and linear specifications of $E\left(\tau_{i,j+1} \mid \tau_{ij}\right)$, Grades 7 and 8.
The generalizability coefficient estimates for math are higher than those for ELA, mirroring the corresponding difference between the reliability coefficients reported in the technical reports. Also, in each subject the variation in the generalizability
TABLE 4.
Correlation and Generalizability Coefficient Estimates, New York City

Parameters^a        ELA                Math
$\rho^*_{34}$       0.8369 (0.0016)    0.8144 (0.0016)
$\rho_{45}$         0.9785 (0.0013)    0.9581 (0.0012)
$\rho_{56}$         0.9644 (0.0012)    0.9331 (0.0011)
$\rho_{67}$         0.9817 (0.0012)    0.9647 (0.0011)
$\rho^*_{78}$       0.8168 (0.0013)    0.8711 (0.0013)
$\Gamma_4$          0.7853 (0.0025)    0.8005 (0.0024)
$\Gamma_5$          0.7169 (0.0018)    0.8057 (0.0020)
$\Gamma_6$          0.7716 (0.0019)    0.8227 (0.0019)
$\Gamma_7$          0.7184 (0.0019)    0.8284 (0.0020)

Note: ELA = English language arts. Standard errors are in parentheses. ^aThe parameter subscripts here correspond to the grades tested. For example, $\rho^*_{34}$ is the correlation of universe scores of students in Grades 3 and 4.
[Figure 3 appears here: generalizability coefficient (y-axis, 0.70–0.95) by grade (4–7), with lines for the ELA and math Feldt-Raju alphas and the ELA and math $\Gamma$ estimates.]

FIGURE 3. Generalizability and reliability coefficient estimates for New York math and English language arts (ELA) exams by grade.
coefficient estimates across grades closely mirrors the corresponding across-grade
variation in the reported reliability coefficients. This is especially noteworthy, given
the marked differences between math and ELA in the patterns across grades.
The primary motivation for this article is the concern that measurement error in total is much larger than that reported in test technical reports. The estimates
of the overall extent of test measurement error on the NY math exams, on
average, are over twice as large as that indicated by the reported reliability
coefficients. For the NYC ELA tests, the estimates of the overall extent of mea-
surement error average 130% higher than that indicated by the reported relia-
bility coefficients. The extent of measurement error from other sources
appears to be at least as large as that associated with the construction of the test
instrument.
Estimates of the variances of actual student achievement can be obtained
employing estimates of the overall extent of test measurement error together with
the test-score variances. Universe-score variance estimates for our application are
reported in column 3 of Table 5. Estimates of the variances of universe-score
gains are shown in column 6. Because these values are much smaller than the
variances of test-score gains, the implied generalizability coefficient estimates
in column 7 are quite small. We estimate that only 20% of the variance in math
gain scores is actually attributable to variation in achievement gains. Gain scores
in ELA are even less reliable.
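The printed values in Table 5 can be checked directly against the model's variance decompositions: within each row, column (1) is the sum of columns (2) and (3), column (4) is the sum of columns (5) and (6), and, with measurement errors uncorrelated across tests, each gain-error variance in column (5) is the sum of the error variances of the two adjacent tests. A minimal check on the ELA panel (the math panel passes the same tests):

```python
# ELA rows of Table 5, in printed order:
# (s2_S, s2_eta, gamma, s2_dS, s2_deta, s2_dtau, Gamma_gain)
rows = [
    (1520.8, 326.5, 1194.3, 763.8, 695.3, 68.4, 0.090),
    (1303.0, 368.8, 934.2, 646.2, 558.9, 87.3, 0.135),
    (832.1, 190.0, 642.1, 407.4, 357.6, 49.8, 0.122),
    (595.1, 167.6, 427.5, None, None, None, None),
]
for s2S, s2e, gam, s2dS, s2de, s2dt, Gd in rows:
    assert abs(s2S - (s2e + gam)) < 0.2         # observed = error + universe
    if s2dS is not None:
        assert abs(s2dS - (s2de + s2dt)) < 0.2  # gain variance decomposes too
        assert abs(Gd - s2dt / s2dS) < 0.001    # column (7) ratio
for r, r_next in zip(rows, rows[1:]):
    if r[4] is not None:
        # Uncorrelated errors: gain-error variance sums adjacent error variances.
        assert abs(r[4] - (r[1] + r_next[1])) < 0.2
print("Table 5 ELA identities hold to rounding")
```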
Estimation of the overall extent of measurement error for a population of students
only requires test-score variances and correlations. Additional inferences are possi-
ble employing student-level test-score data. In particular, such data can be used to
TABLE 5.
Variances of Test Scores, Test Measurement Error, Universe Scores, Test-Score Gains, Measurement Error for Gains, and Universe-Score Gains and Generalizability Coefficient for Test-Score Gain for English Language Arts (ELA) and Math

          (1)                    (2)                       (3)                                          (4)                           (5)                             (6)                             (7)
          $\sigma^2_{S\cdot j}$  $\sigma^2_{\eta\cdot j}$  $\gamma_{jj} = \Gamma_j\sigma^2_{S\cdot j}$  $\sigma^2_{\Delta S\cdot j}$  $\sigma^2_{\Delta\eta\cdot j}$  $\sigma^2_{\Delta\tau\cdot j}$  $\Gamma_{\Delta j} = \sigma^2_{\Delta\tau\cdot j}\big/\sigma^2_{\Delta S\cdot j}$

ELA
Grade 4   1,520.8   326.5   1,194.3   763.8   695.3    68.4   0.090
Grade 5   1,303.0   368.8     934.2   646.2   558.9    87.3   0.135
Grade 6     832.1   190.0     642.1   407.4   357.6    49.8   0.122
Grade 7     595.1   167.6     427.5

Math
Grade 4   1,297.6   259.0   1,038.6   661.9   532.8   129.1   0.195
Grade 5   1,409.5   273.8   1,135.7   677.9   523.8   154.1   0.227
Grade 6   1,409.5   250.0   1,159.5   527.8   431.0    96.8   0.183
Grade 7   1,054.9   181.0     873.9

Note: ELA = English language arts. The gain columns in each row refer to the gain from that grade to the next.
estimate $\sigma^2_{\eta_{ij}} = \sigma^2_{\eta_j}(\tau_{ij}) + \epsilon_{ij}$, characterizing how the variance of measurement error varies with student ability. ($\epsilon_{ij}$ is a random variable having zero mean.) Here we specify $\sigma^2_{\eta_j}(\tau_{ij})$ to be a third-order polynomial, compute $\hat{\sigma}^2_{\eta_{ij}}$ using Equation 13, and employ observed scores as estimates of $\tau_{ij}$. Regressing $\hat{\sigma}^2_{\eta_{ij}}$ on $S_{ij}$ would yield inconsistent parameter estimates, since $S_{ij}$ measures $\tau_{ij}$ with error. However, consistent parameter estimates can be obtained using a two-stage least squares, instrumental-variables estimator where the instruments are the scores for each student not used to compute $\hat{\sigma}^2_{\eta_{ij}}$. In the first stage, $S_{ij}$ for grade $j$ is regressed on $S_{ik},\ k \ne j,\ j+1$, along with squares and cubes, yielding fitted values $\hat{S}_{ij}$. In turn, $\hat{\sigma}^2_{\eta_{ij}}$ is regressed on $\hat{S}_{ij}$ to obtain consistent estimates of the parameters in $\sigma^2_{\eta_j}(\tau_{ij})$.
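A stylized sketch of this instrumental-variables step on simulated data follows (a linear rather than cubic specification, for brevity, and generic variable names rather than the article's): OLS on the error-laden score is attenuated, while the two-stage estimator using another grade's score as the instrument recovers the true slope.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

def ols(x, y):
    # Intercept-and-slope least squares; returns (a, b).
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Ability and two error-laden scores; y depends on ability (a stand-in for the
# noisy student-level variance estimate used in the article).
tau = rng.normal(0.0, 1.0, N)
S_j = tau + rng.normal(0.0, np.sqrt(0.5), N)  # regressor with measurement error
S_k = tau + rng.normal(0.0, np.sqrt(0.5), N)  # another grade's score: instrument
y = 1.0 + 2.0 * tau + rng.normal(0.0, 1.0, N)

b_ols = ols(S_j, y)[1]             # attenuated toward 2 * 1/(1 + 0.5)
a1, b1 = ols(S_k, S_j)             # first stage
b_2sls = ols(a1 + b1 * S_k, y)[1]  # second stage: close to the true slope 2
print(round(b_ols, 2), round(b_2sls, 2))
```

The attenuation factor here is the generalizability coefficient of the regressor, which is exactly why uncorrected regressions understate how measurement error varies with ability.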
The bold solid lines in Figure 4 show $\hat{\sigma}_{\eta_j}(\tau_{ij})$. The dashed lines are the IRT SEMs reported in the test technical reports. Let $\eta_{ij} = \eta^a_{ij} + \eta^b_{ij}$, where $\eta^a_{ij}$ is the measurement error associated with test construction, $\eta^b_{ij}$ is the measurement error from other sources, and $\sigma^2_{\eta_{ij}} = \sigma^2_{\eta^a_{ij}} + \sigma^2_{\eta^b_{ij}}$, assuming that $\eta^a_{ij}$ and $\eta^b_{ij}$ are uncorrelated. For a particular test, $\sqrt{\sigma^2_{\eta_j}(\tau_{ij}) - \sigma^2_{\eta^a_j}(\tau_{ij})}$ can be used to estimate $\sigma_{\eta^b_j}(\tau_{ij})$. The thin lines in Figure 4 show these "residual" estimates. The range of ability levels for which $\sigma_{\eta^b_j}(\tau_{ij})$ is shown roughly corresponds to our estimates of the ranges containing 99% of actual abilities. In Figure 4(b), for example, it would be the case that $P\left(608 \le \tau_{i7} \le 715\right) = 0.99$ if our estimate of the ability distribution were correct.
There are a priori explanations for why $\sigma_{\eta^a_j}(\tau_{ij})$ would be a U-shaped function for IRT-based scale scores and an inverted U-shaped function in the case of raw scores. A speculative, but somewhat believable, hypothesis is that the variance of the measurement error unrelated to the test instrument is relatively constant across ability levels. However, this begs the question as to whether the relevant "ability" is measured in raw-score or scale-score units. If the raw-score measurement-error variance were constant, the nonlinear mapping from raw scores to scale scores would imply a U-shaped scale-score measurement-error variance, possibly explaining the U-shaped patterns of $\sigma_{\eta^b_j}(\tau_{ij})$ in Figure 4. Whatever the explanation, values of $\sigma_{\eta^a_j}(\tau_{ij})$ and $\sigma_{\eta^b_j}(\tau_{ij})$ are roughly comparable in magnitude and vary similarly over a wide range of abilities. We have less confidence in the estimates of $\sigma_{\eta^b_j}(\tau_{ij})$ for extreme ability levels. Because $\sigma_{\eta^b_j}(\tau_{ij})$ is the square root of a residual, computed values of $\sqrt{\sigma^2_{\eta_j}(\tau_{ij}) - \sigma^2_{\eta^a_j}(\tau_{ij})}$ can be quite sensitive to estimation error when $\sigma^2_{\eta_j}(\tau_{ij}) - \sigma^2_{\eta^a_j}(\tau_{ij})$ is close to zero. Note that for the case corresponding to Figure 4(a), we estimate that only 1.8% of students have universe scale scores exceeding 705. In Figure 4(d), the universe scores of slightly less than 5% of students exceed 720.
3.3. Inferences Regarding Universe Scores and Universe-Score Gains
Observed scores typically are used to directly estimate student achievement
and achievement gains. More precise estimates of universe scores and universe-
score gains for individual students can be obtained employing observed scores
along with the parameter estimates in Table 4 and the estimated measurement-
error heteroscedasticity reflected in $\sigma_{\eta_j}(\tau_{ij})$. As an example, the solid S-shaped lines in Figure 5 show the values of $E\left(\tau_{ij} \mid S_{ij}\right)$ for fifth- and seventh-grade ELA
[Figure 4 appears here: panels (a) Grade 5 ELA, (b) Grade 7 ELA, (c) Grade 5 math, (d) Grade 7 math, each plotting measurement-error SE (0–50) against level of achievement (600–750) for $\hat{\sigma}_{\eta^a_j}$, $\hat{\sigma}_{\eta_j}$, and $\hat{\sigma}_{\eta^b_j}$.]

FIGURE 4. Estimated standard errors of measurement reported in technical reports, $\sigma_{\eta^a_j}$, estimates for the measurement error from all sources, $\sigma_{\eta_j}$, and estimates for the residual measurement error, $\sigma_{\eta^b_j}$.
and math. Referencing the 45° line, the estimated posterior mean ability levels for higher scoring students are substantially below the observed scores, while predicted ability levels for low-scoring students are above the observed scores. This Bayes "shrinkage" is largest for the highest and lowest scores due to the estimated pattern of measurement-error heteroscedasticity. The dashed lines show 80% Bayesian credible bounds for ability conditional on the observed score. For example, the BUP of the universe score for fifth-grade students scoring 775 in ELA is 737, 38 points below the observed score. We estimate that 80% of students scoring 775 have universe scores in the range 719 to 757; $P\left(718.8 < \tau_{ij} < 757.2 \mid S_{ij} = 775\right) = 0.80$. In this case, the observed score is 18 points higher than the upper bound of the 80% credible interval. Midrange scores are more informative, reflecting the
[Figure 5 appears here: panels (a) Grade 5 ELA, (b) Grade 7 ELA, (c) Grade 5 Math, (d) Grade 7 Math, each plotting actual ability against observed score (575–775) with a 45° reference line.]

FIGURE 5. Estimated posterior mean ability level, given the observed score, and 80% Bayesian credible bounds, Grades 5 and 7 English language arts (ELA) and math.
smaller standard deviation of test measurement error. For an observed score
of 650, the estimated posterior mean and 80% Bayesian credible bounds are
652 and (638, 668), respectively. The credible bounds range for a 775 score
is 30% larger than that for a score of 650.
Utilizing test scores to directly estimate students' abilities clearly is problematic for high- and, to a lesser extent, low-scoring students. To explore this relationship further, consider the root of the expected mean squared error (RMSE) associated with estimating student ability using (a) observed scores and (b) estimated posterior mean abilities conditional on observed scores.^{17} For the fifth-grade math exam, the RMSE associated with using $E\left(\tau_{ij} \mid S_{ij}\right)$ to estimate students' abilities is 14.9 scale-score points. In contrast, the RMSE associated with using $S_{ij}$ is 17.2, 15% larger. This difference is meaningful, given that $E\left(\tau_{ij} \mid S_{ij}\right)$ differs little from $S_{ij}$ over the range of scores for which there are relatively more students. Over the range of abilities from 620 to 710, the RMSEs for $E\left(\tau_{ij} \mid S_{ij}\right)$ and $S_{ij}$ are 14.9 and 15.1, respectively. However, for ability levels below 620, the RMSEs are 13.4 and 20.9, respectively, the latter being 57% larger. For students whose actual abilities are greater than 710, the RMSE associated with using $S_{ij}$ to estimate $\tau_{ij}$ is 26.6, which is 62% larger than the RMSE for $E\left(\tau_{ij} \mid S_{ij}\right)$. By accounting for test measurement error from all sources, it is possible to compute estimates of student achievement that have statistical properties superior to those corresponding to the observed scores of students.
Turning to the measurement of ability gains, the solid S-shaped curve in
Figure 6 shows the posterior mean universe-score change in math between Grades
5 and 6 conditional on the observed score change.18 Again, the dashed lines show
80% credible bounds. For example, among students observed to have a 40-point
score increase between the fifth and sixth grades, their actual universe-score
changes are estimated to average 12.7. Eighty percent of all students having a
40-point score increase are estimated to have actual universe-score changes falling
in the interval −2.3 to 27.0. It is noteworthy that for the full range of score changes shown (±50 points), the 80% credible bounds include no change in actual ability.
Many combinations of scores yield a given score change. Figure 6 corre-
sponds to the case where one knows the score change but not the pre- and post-
scores. However, for a given score change, the mean universe-score change and
credible bounds will vary across known score levels because of the pattern of
measurement-error heteroscedasticity. For example, Figure 7 shows the posterior mean universe-score change and credible bounds for various scores consistent with a 40-point increase. For instance, students scoring 710 on the Grade 5 exam and 750 on the Grade 6 exam are estimated to have a 10.3-point universe-score increase on average, with 80% of such students having actual changes in ability in the interval (−11.4, 31.7). For students scoring at the fifth-grade proficiency cut score (648), the average universe-score gain is 19.6, with 80% of such students having actual changes in the interval (−0.17, 37.4). (Note that a 40-point score increase is relatively large in that the standard deviation of the
score change between the fifth and sixth grades is 26.0.) The credible bounds
for a 40-point score increase include no change in ability for all fifth-grade
scores other than those between 615 and 647.
A striking feature of Figure 7 is that the posterior mean universe-score change, $E\left(\tau_6 - \tau_5 \mid S_5, S_6\right) = E\left(\tau_6 \mid S_5, S_6\right) - E\left(\tau_5 \mid S_5, S_6\right)$, is substantially smaller than the observed-score change. Consider $E\left(\tau_6 - \tau_5 \mid S_5 = 710,\ S_6 = 750\right) = 10.3$, which
[Figure 6 appears here: change in ability plotted against change in score, each axis from −50 to 50.]

FIGURE 6. Estimated posterior mean change in ability, given the score change, and 80% credible bounds, Grades 5 and 6 mathematics.
–20
–10
0
10
20
30
40
50
chan
ge in
abi
lity
grade five score575 600 625 650 675 700 725 750 775
FIGURE 7. Estimated posterior mean change in ability for the observed scores in Grades
5 and 6 mathematics for S6 � S5 ¼ 40 and 80% credible bounds.
Boyd et al.
653
is substantially smaller than the 40-point score increase. First, Eðt6 S6 ¼ 750Þj¼ 734:0 is 16 points below the observed score due to the Bayes shrinkage toward
the mean. E t6 S5 ¼ 710; S6 ¼ 750jð Þ ¼ 729:5 is even smaller. Because S6 is a
noisy estimate of t6 and t5 is correlated with t6, the value of S5 provides infor-
mation regarding the distribution of t6 that goes beyond the information gained
by observing S6.19 Eðt5 S5 ¼ 710Þj ¼ 705:3 is less than 710 because the latter is
substantially above Eti5. However, Eðt5 S5; S6Þj ¼ 719:2 is meaningfully larger
than Eðt5 S5Þj ¼ 707:5 and larger than S5 ¼ 710, because S6 ¼ 750 is substan-
tially larger than S5. In summary, among NYC students scoring 710 on the
fifth-grade math exam and 40 points higher on the sixth-grade exam, we estimate
the mean gain in ability is little more than one fourth as large as the actual score
change; E t6 S5; S6jð Þ � E t5 S5; S6jð Þ ¼ 729:5� 719:2 ¼ 10:3. The importance of
accounting for the estimated correlation between ability levels in Grades 5 and 6
is reflected in the fact that the mean ability increase would be 2½ times as large
were the ability levels uncorrelated; E t6 S6jð Þ � E t5 S5jð Þ ¼ 734:0� 705:3 ¼ 28:7.
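The posterior mean gain given an observed score difference can be evaluated by Monte Carlo, reweighting simulated draws by the density of the implied measurement error on the later test. The sketch below illustrates this for a stylized homoscedastic normal setup; all parameter values are hypothetical, not the NYC estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 1_000_000

# Draws from a hypothetical joint distribution of (tau_j, tau_{j+1}, eta_j).
tau_j = rng.normal(670.0, 30.0, M)
tau_j1 = 0.9 * tau_j + rng.normal(67.0, 20.0, M)   # AR(1)-style growth
eta_j = rng.normal(0.0, 15.0, M)
SD_ETA_J1 = 15.0                                   # error SD on test j+1

def posterior_mean_gain(D):
    # Each draw is weighted by the density of the j+1 error it implies:
    # eta_{j+1} = D - (tau_{j+1} - tau_j) + eta_j.
    resid = D - (tau_j1 - tau_j) + eta_j
    w = np.exp(-0.5 * (resid / SD_ETA_J1) ** 2)    # proportional to normal pdf
    return float(np.sum((tau_j1 - tau_j) * w) / np.sum(w))

gain = posterior_mean_gain(40.0)
print(round(gain, 1))   # well below the 40-point observed score gain
```

In this construction roughly half of the variance of the score difference is measurement error, so the posterior mean gain is pulled far below the observed 40-point change, echoing the shrinkage seen in Figures 6 and 7.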
4. Conclusion
We show that there is a credible approach for estimating the overall extent of test measurement error using nothing more than test-score variances and nonzero correlations for three or more tests. This analysis of covariances or correlations is a meaningful generalization of the test–retest method and can be used in a variety of settings. First, our approach substantially relaxes the requirement that the tests be parallel and does not require that tests be vertically scaled. The tests can even measure different abilities, provided that no ability measured by one test is uncorrelated with all the abilities measured by the other tests. Second, as in the case of the congeneric tests analyzed by Joreskog (1971), the method allows the extent of measurement error to differ across tests. Third, the approach only requires some persistence (i.e., correlation) in ability across the test administrations, a requirement far less restrictive than requiring that ability remain constant. However, as with the test–retest framework, the applicability of our approach crucially depends upon whether a sound case can be made that the tests to be analyzed meet the necessary requirements.
We illustrate the general approach employing a model of student achievement growth in which academic achievement is cumulative, following a first-order autoregressive process: $\tau_{ij} = \beta_{j-1}\tau_{i,j-1} + \theta_{ij}$, where there is at least some persistence (i.e., $\beta_{j-1} > 0$) and the possibility of decay (i.e., $\beta_{j-1} < 1$) that can differ across grades. An additional assumption is needed regarding the stochastic properties of $\theta_{ij}$. Here we have employed a reduced-form specification in which $E(\tau_{i,j+1} \mid \tau_{ij})$ is a linear function of $\tau_{ij}$, an assumption that can be tested. Fully specified structural models also could be employed. In addition, rather than inferring the correlation structure based on a set of underlying assumptions, one can directly assume a correlation structure, where there is a range of possibilities depending upon the tests being analyzed.
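Under first-order autoregressive growth, three test scores suffice to separate universe-score variance from measurement-error variance, because independent measurement error drops out of the off-diagonal score covariances. The simulation sketch below illustrates this identification logic; all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Universe scores (abilities) follow first-order autoregressive growth.
tau1 = rng.normal(650.0, 30.0, N)
tau2 = 0.8 * tau1 + rng.normal(135.0, 20.0, N)   # persistence 0.8 plus new learning
tau3 = 0.9 * tau2 + rng.normal(70.0, 18.0, N)    # persistence 0.9

# Observed scores add independent measurement error whose variance
# differs across tests (true error SDs: 12, 15, 10).
S1 = tau1 + rng.normal(0.0, 12.0, N)
S2 = tau2 + rng.normal(0.0, 15.0, N)
S3 = tau3 + rng.normal(0.0, 10.0, N)

# Independent error leaves the off-diagonal score covariances equal to the
# universe-score covariances, so under AR(1) growth the middle test's
# universe-score variance is identified by
#   gamma_22 = omega_12 * omega_23 / omega_13,
# and its total measurement-error variance by omega_22 - gamma_22.
w = np.cov(np.vstack([S1, S2, S3]))   # 3 x 3 score covariance matrix
gamma22 = w[0, 1] * w[1, 2] / w[0, 2]
err_var2 = w[1, 1] - gamma22
print(round(err_var2, 1))   # close to the true error variance 15**2 = 225
```

Note that only the score variances and covariances enter the calculation; no student-level data beyond these descriptive statistics are needed.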
Estimation of the overall extent of measurement error for a population of students only requires test-score descriptive statistics (e.g., correlations); neither student-level test scores nor assumptions regarding functional forms for the distribution of either abilities or test measurement error are needed. However, one can explore the extent and pattern of measurement-error heteroscedasticity employing student-level data. Standard distributional assumptions (e.g., normality) allow one to make inferences regarding universe scores and gains in universe scores. In particular, for a student with a given score, the Bayesian posterior mean and variance of $\tau_{ij}$ given $S_{ij}$, $E(\tau_{ij} \mid S_{ij})$ and $V(\tau_{ij} \mid S_{ij})$, are easily computed, where the former is both unbiased and the best predictor of the student's actual ability. Similar statistics for universe-score gains also can be computed. We show that using the observed score as an estimate of a student's underlying ability can be quite misleading for relatively low- or high-scoring students. However, the bias is eliminated and the mean square error substantially reduced when the posterior mean is employed.
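For the normal homoscedastic case, the posterior moments have the closed form given in the Appendix, $E(\tau \mid S) = (1 - G)\mu + GS$, with shrinkage weight $G = \gamma/(\gamma + \text{error variance})$. A minimal sketch with hypothetical parameter values (mean 670, universe-score variance 900, error variance 300):

```python
# Closed-form normal-normal posterior; all parameter values hypothetical.
def posterior(S, mu=670.0, gamma=900.0, err_var=300.0):
    G = gamma / (gamma + err_var)        # weight placed on the observed score
    post_mean = (1.0 - G) * mu + G * S   # E(tau | S): shrunk toward mu
    post_var = G * err_var               # V(tau | S), smaller than err_var
    return post_mean, post_var

mean, var = posterior(750.0)
print(mean, var)   # 730.0 225.0: a 750 score is pulled back toward 670
```

The high observed score is pulled 20 points toward the population mean, which is why the observed score alone is misleading for high- and low-scoring students.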
We have focused on estimating the extent of test measurement error via an anal-
ysis of test-score correlations or covariances. An alternative approach is to estimate
the extent of measurement error in conjunction with estimating student-level, latent
variable models of achievement growth. The two approaches are related in that
student-level structural models of growth that can be estimated within the latent vari-
able framework imply covariance and correlation structures that can be employed
using our approach. The analysis of correlations (covariances) has several advan-
tages if the goal is to estimate the extent of test measurement error. First, estimation
only requires test-score descriptive statistics, whereas student-level data are needed
to estimate latent variable models. Maximum likelihood estimation of latent vari-
able models and the extent of measurement error require assumptions regarding the
ability and test measurement error distributions. In general, the estimation of latent
variable models is more complex, especially if the empirical model allows for het-
eroscedastic measurement error. Another consideration is that our approach can be
employed when tests are not vertically scaled or measure different, but correlated,
abilities. Finally, estimates obtained using test-score correlations are more likely
to be robust to misspecifications of the underlying structure of achievement growth.
As the analysis of Rogosa and Willett (1983) makes clear, commonly observed covariance patterns can be consistent with quite different models of achievement growth; the underlying correlation structures implied by different growth models can yield universe-score correlation patterns and values that are indistinguishable. Rather than identifying the actual underlying covariance structure, our goal is to estimate the extent of measurement error as well as the values of the universe-score variances and correlations. We conjecture that the inability to distinguish between quite different universe-score correlation structures (corresponding to different underlying models of achievement growth) actually is advantageous, given our goal, in that the estimated extent of test measurement error based on an analysis of test-score correlations will be robust to a range of universe-score covariance-structure misspecifications. This conjecture is consistent with our finding that estimates of measurement-error variances are quite robust across a range of structural specifications. Monte Carlo simulations using a wide range of underlying covariance structures could provide more convincing evidence, but they go beyond the scope of this article.
In any particular analysis, estimation will be based on empirical variances and
correlations for a sample of test takers; yet, the analysis typically will be motivated
by an interest in the extent of measurement error or the variance of abilities, or
both, for some population of individuals. Thus, an important consideration is
whether the sample of test takers employed is representative of the population
of interest. In addition to the possibility of meaningful sampling error, subpopula-
tions of interest may be systematically excluded in sampling, or data may not be
missing at random. Such possibilities need to be considered when assessing
whether parameter estimates are relevant for the population of interest. Issues of
external validity can also arise. Just as the variance of universe scores can vary across populations, the same often will be true for the extent of test measurement error, possibly reflecting differences in test-taking environments. Even if the relationship between individuals' measurement-error variances and their abilities, $\sigma^2_{\eta_j}(\tau_{ij})$, does not differ across populations, the population measurement-error variance, $\sigma^2_{\eta \cdot j}$, will when the populations have different ability distributions.
Estimates of the overall extent of test measurement error have a variety of uses that go beyond merely assessing the reliability of various assessments. Using $E(\tau_{ij} \mid S_{ij})$, rather than $S_{ij}$, to estimate $\tau_{ij}$ is one example. Judging the magnitudes of the effects of different causal factors relative to either the standard deviation of ability or the standard deviation of ability gains is another. Bloom, Hill, Black, and Lipsey (2008) discuss the desirability of assessing the magnitudes of effects relative to the dispersion of ability or ability gains, rather than test scores or test-score gains, but note that analysts often have had little, if any, information regarding the extent of test measurement error.
As demonstrated above, the same types of data researchers often employ to esti-
mate how various factors affect educational outcomes can be used to estimate the
overall extent of test measurement error. Based on the variance estimates shown
in columns 1 and 3 of Table 5, for the tests we analyze, effect sizes measured relative
to the standard deviation of ability will be 10% to 18% larger than effect sizes mea-
sured relative to the standard deviation of test scores. In cases where it is pertinent to
judge the magnitudes of effects in terms of achievement gains, effect sizes measured
relative to the standard deviation of ability gains will be 200% to over 300% larger
than effect sizes measured relative to the standard deviation of test-score gains.
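The rescaling works as follows: if a fraction $r$ of test-score variance is universe-score variance, an effect size measured against the ability standard deviation exceeds the score-based effect size by the factor $\mathrm{sd}(S)/\mathrm{sd}(\tau) = 1/\sqrt{r}$. A toy calculation with hypothetical variances (not the Table 5 estimates):

```python
import math

# Hypothetical variance decomposition of observed scores.
ability_var = 900.0    # var(tau), assumed
score_var = 1200.0     # var(S) = var(tau) + error variance, assumed

effect_per_score_sd = 0.10   # a typical score-based effect size
effect_per_ability_sd = effect_per_score_sd * math.sqrt(score_var / ability_var)
print(round(effect_per_ability_sd, 4))   # about 15% larger than 0.10
```

The adjustment is much larger for gain scores because differencing removes persistent ability variance while the error variances of the two tests add.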
Estimates of the extent and pattern of test measurement error can also be used to assess the precision of a variety of measures based on test scores, including binary indicators of student proficiency, teacher- and school-effect estimates, and accountability measures such as No Child Left Behind adequate-yearly-progress requirements.
requirements. It is possible to measure the reliability of such measures as well as
employ the estimated extent of test measurement error to calculate more accurate
measures, useful for accountability purposes, research, and policy analysis.
Overall, this article has methodological and substantive implications. Methodologically, it shows that the total measurement-error variance can be estimated without employing the limited and costly test–retest strategy. Substantively, it shows that the total measurement error is substantially greater than that measured using the split-test method, suggesting that much empirical work has been underestimating the effect sizes of interventions that affect student learning.
Appendix

Measurement error can result in $E(S_{i,j+1} \mid S_{ij})$ being a nonlinear function of $S_{ij}$ even when $E(\tau_{i,j+1} \mid \tau_{ij})$ is linear in $\tau_{ij}$. $E(\tau_{i,j+1} \mid \tau_{ij}) = \beta_0 + \beta_1\tau_{ij}$ implies that $\tau_{i,j+1} = \beta_0 + \beta_1\tau_{ij} + u_{i,j+1}$, where $Eu_{i,j+1} = 0$ and $E\tau_{ij}u_{i,j+1} = 0$. With $S_{i,j+1} = \tau_{i,j+1} + \eta_{i,j+1}$, $S_{i,j+1} = \beta_0 + \beta_1\tau_{ij} + \eta_{i,j+1} + u_{i,j+1}$ and, in turn, $E(S_{i,j+1} \mid S_{ij}) = \beta_0 + \beta_1 E(\tau_{ij} \mid S_{ij})$. Thus, the nonlinearity of $E(S_{i,j+1} \mid S_{ij})$ depends upon whether $E(\tau_{ij} \mid S_{ij})$ is nonlinear in $S_{ij}$. Consider the case where $\tau_{ij} \sim N(\mu_j, \sigma^2_{\tau_j})$ and $\eta_{ij} \sim N(0, \sigma^2_{\eta_{ij}})$, and the related discussion in Section 2.2. When $\eta_{ij}$ is either homoscedastic or heteroscedastic with $\sigma^2_{\eta_{ij}}$ not varying with the level of ability, $\tau_{ij}$ and $S_{ij}$ will be bivariate normal, so that $E(\tau_{ij} \mid S_{ij}) = (1 - G_{ij})\mu_j + G_{ij}S_{ij}$, implying that $E(S_{i,j+1} \mid S_{ij})$ is also linear in $S_{ij}$. Thus, it is not measurement error per se that implies $E(S_{i,j+1} \mid S_{ij})$ is nonlinear. Rather, $E(S_{i,j+1} \mid S_{ij})$ is nonlinear in $S_{ij}$ when $\eta_{ij}$ is heteroscedastic with the extent of measurement error varying with the ability level (i.e., $\sigma_{\eta_{ij}} = \sigma_{\eta_j}(\tau_{ij})$). When $\sigma_{\eta_j}(\tau_{ij})$ is U-shaped, as in Figure 1, $E(S_{i,j+1} \mid S_{ij})$ is an S-shaped function of $S_{ij}$, even when $E(\tau_{i,j+1} \mid \tau_{ij})$ is linear in $\tau_{ij}$.
$S_{ij}$ and $\tau_{ij}$ are not bivariate normal when $\sigma_{\eta_{ij}} = \sigma_{\eta_j}(\tau_{ij})$, but $E(\tau_{ij} \mid S_{ij})$ can be computed using simulation, as discussed in Section 2.2. Consider the following example, which is roughly consistent with the patterns found for the NYC tests: $\tau_{ij} \sim N(670, 900)$ and $\eta_{ij} \sim N(0, \sigma^2_{\eta_j}(\tau_{ij}))$ with $\sigma_{\eta_j}(\tau_{ij}) = \sigma_0 + a(\tau_{ij} - \mu_j)^2$ and $\sigma_{\eta \cdot j} \equiv E\sigma_{\eta}(\tau_{ij}) = \sigma_0 + a\gamma_{jj} = 15$. The three cases in Figure A1 differ with respect to the degree of heteroscedasticity: the homoscedastic case ($\sigma_0 = 15$ and $a = 0$), moderate heteroscedasticity ($\sigma_0 = 12$ and $a = 0.00333\ldots$), and more extreme heteroscedasticity ($\sigma_0 = 3$ and $a = 0.01333\ldots$). Simulated values of $E(S_{i,j+1} \mid S_{ij}) = \beta_0 + \beta_1 E(\tau_{ij} \mid S_{ij})$ for each case are shown in Figure A2, with $\beta_0 = 0$ and $\beta_1 = 1$. $E(S_{i,j+1} \mid S_{ij})$ is linear in the homoscedastic case, and the degree to which $E(S_{i,j+1} \mid S_{ij})$ is S-shaped depends upon the extent of this particular type of heteroscedasticity.
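The example above can be reproduced directly by simulation. The sketch below uses the "more extreme" parameters ($\sigma_0 = 3$, $a = 0.01333\ldots$) and compares local slopes of the conditional mean; with $\beta_0 = 0$ and $\beta_1 = 1$, $E(S_{i,j+1} \mid S_{ij}) = E(\tau_{ij} \mid S_{ij})$, so an S-shape shows up as a conditional mean that is steeper near the center than in the tails.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2_000_000
mu, gamma = 670.0, 900.0

# U-shaped heteroscedastic error: sd_eta(tau) = 3 + (4/300) * (tau - mu)^2,
# so the average error SD is 3 + (4/300) * 900 = 15, as in the appendix.
tau = rng.normal(mu, np.sqrt(gamma), N)
sd_eta = 3.0 + (4.0 / 300.0) * (tau - mu) ** 2
S = tau + rng.normal(0.0, 1.0, N) * sd_eta

def cond_mean_tau(lo, hi):
    """Approximate E(tau | S in [lo, hi)) by averaging simulated draws."""
    sel = (S >= lo) & (S < hi)
    return tau[sel].mean()

# Local slopes of E(tau | S): steep near the mean, flat in the tails.
slope_mid = (cond_mean_tau(675, 685) - cond_mean_tau(655, 665)) / 20.0
slope_tail = (cond_mean_tau(755, 765) - cond_mean_tau(735, 745)) / 20.0
print(slope_mid > slope_tail)
```

Because the error standard deviation grows quadratically away from the mean, high observed scores are heavily shrunk, flattening the conditional mean in the tails.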
Knowing that the S-shaped patterns of $E(S_{i,j+1} \mid S_{ij})$ in Figure 1 can be consistent with $E(\tau_{i,j+1} \mid \tau_{ij})$ being linear in $\tau_{ij}$ is useful, but of greater importance is whether $E(\tau_{i,j+1} \mid \tau_{ij})$ is in fact linear for the tests of interest. This can be explored employing the cubic specification $\tau_{i,j+1} = \beta_0 + \beta_1\tau_{ij} + \beta_2\tau_{ij}^2 + \beta_3\tau_{ij}^3 + \epsilon_{i,j+1}$, where $\beta_2 = \beta_3 = 0$ implies linearity. Substituting $S_{ij} = \tau_{ij} + \eta_{ij}$ and regressing $S_{i,j+1}$ on $S_{ij}$ would yield biased parameter estimates. However, if $\lambda_{ij}^k \equiv E(\tau_{ij}^k \mid S_{ij}),\ k = 1, 2, 3$, were known for each student, regressing $S_{i,j+1}$ on $\lambda_{ij}^1, \lambda_{ij}^2$, and $\lambda_{ij}^3$ would yield consistent estimates.20

FIGURE A1. Examples showing different degrees of heteroscedastic measurement error (measurement-error standard deviation vs. prior ability/skills for no, moderate, and more extreme heteroscedasticity).

FIGURE A2. How the relationship between $E(S_{i2} \mid S_{i1})$ and $S_{i1}$ varies with the degree of heteroscedasticity.

Computing $\lambda_{ij}^k,\ k = 1, 2, 3$, for each student requires knowledge of the overall extent and pattern of measurement error. It is the lack of such knowledge that motivates this article. However, we are able to compute $\lambda_{ij}^k = E(\tau_{ij}^k \mid S_{ij})$ accounting for the meaningful measurement-error heteroscedasticity reflected in the reported SEMs,21 even though this does not account for other sources of measurement error.
Computation of $E(\tau_{ij}^k \mid S_{ij})$ also requires an estimate of $\gamma_{jj}$, which can be obtained by solving for $\gamma_{jj}$ implicitly defined in $\gamma_{jj} = \omega_{jj} - \sigma^2_{\eta \cdot j} = \omega_{jj} - \int \sigma^2_{\eta}(\tau) f(\tau \mid \mu_j, \gamma_{jj})\, d\tau$. Using Monte Carlo integration with importance sampling, $\gamma_{jj} = \omega_{jj} - \frac{1}{M}\sum_{m=1}^{M} \sigma^2_{\eta_j}(\tau^*_{mj})$, where the $\tau^*_{mj}$ are random draws from the distribution $N(\mu_j, \tilde{\gamma}_{jj})$ and $\tilde{\gamma}_{jj}$ is an initial estimate of $\gamma_{jj}$. This yields an updated value of $\tilde{\gamma}_{jj}$, which can be used to repeat the prior step. Relatively few iterations are needed for convergence to the fixed point, which is our estimate of $\gamma_{jj}$. The estimate of $\gamma_{jj}$ allows us to compute values of $\lambda_{ij}^k$ and, in turn, regress $S_{i,j+1}$ on $\lambda_{ij}^1, \lambda_{ij}^2$, and $\lambda_{ij}^3$.
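The fixed-point iteration for $\gamma_{jj}$ is straightforward to implement. The sketch below uses the moderate-heteroscedasticity SEM function from Figure A1 ($\sigma_0 = 12$, $a = 0.00333\ldots$) and a hypothetical score variance $\omega_{jj}$ constructed so that the true $\gamma_{jj}$ is 900:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 500_000
mu_j = 670.0

# Hypothetical score variance omega_jj = gamma_jj + E[sigma_eta^2(tau)];
# with gamma_jj = 900, sigma_0 = 12, a = 1/300, E[sigma_eta^2] is 243.
omega_jj = 1143.0

def sem_sd(tau, s0=12.0, a=1.0 / 300.0):
    """Quadratic SEM function: sigma_eta(tau) = sigma_0 + a (tau - mu_j)^2."""
    return s0 + a * (tau - mu_j) ** 2

gamma = omega_jj   # initial estimate of gamma_jj
for _ in range(20):
    # gamma_jj = omega_jj - E[sigma_eta^2(tau)], tau ~ N(mu_j, gamma_jj),
    # with the expectation evaluated by Monte Carlo at the current gamma.
    draws = rng.normal(mu_j, np.sqrt(gamma), M)
    gamma = omega_jj - np.mean(sem_sd(draws) ** 2)

print(round(gamma, 1))   # converges close to the true value 900
```

The mapping is a contraction here, so a handful of iterations suffices, consistent with the convergence behavior described above.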
Authors’ Note
The authors’ joint research is summarized and available at www.TeacherPolicyResearch.org.
Acknowledgments
We are grateful to the New York City Department of Education and the New York State
Education Department for the data employed in this article. Thanks also to Dale Ballou,
Henry Braun, Ed Haertel, J. R. Lockwood, Dan McCaffrey, Tim Sass, Jeffery Zabel, JEBS
editor Sandip Sinharay, and three anonymous referees for their helpful comments and
suggestions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, author-
ship, and/or publication of this article: We appreciate financial support from the National
Science Foundation and National Center for the Analysis of Longitudinal Data in Educa-
tion Research (CALDER). CALDER is supported by IES Grant R305A060018 to the
American Institutes for Research.
Notes
1. Many authors discuss classical test theory, for example, Haertel (2006).
See Cronbach, Linn, Brennan, and Haertel (1997) and Feldt and Brennan
(1989) for useful introductions to generalizability theory and Brennan
(2001) for more detail.
2. Time intervals between tests need not be either annual or constant. For exam-
ple, from a randomized control trial, one might know test-score correlations
for tests administered at the start, end, and at a point during the experiment.
3. For example, the third grade might be the first tested grade. To simplify
exposition, we often will not distinguish between the ith grade and the ith
tested grade, even though we will mean the latter.
4. $\Omega$ can be estimated using its empirical counterpart $\hat{\Omega} = \sum_i (S_i - \bar{S})(S_i - \bar{S})'/N_S$, where $N_S$ is the number of students with observed test scores. This corresponds to the case where one or more student cohorts are tracked through all $J$ grades, a key assumption being that the values of the $\omega_{jk}$ are constant across cohorts. A subset of the $\omega_{jk}$ can be estimated when the scores for individual students only span a subset of the grades included; a particular $\omega_{jk}$ can be estimated provided one has test-score data for students in both grades $j$ and $k$.
5. In Equation 4, there are $J(J+1)/2$ moments and $J + J(J+1)/2 = J(J+3)/2$ parameters. In Equation 5, there are $J(J-1)/2$ moments and $J + J(J-1)/2 = J(J+1)/2$ parameters.
6. These estimators are consistent, but biased as they are ratios of estimators.
The same is true in several other examples discussed below.
7. In general, $\tau_{i,j+m} = E(\tau_{i,j+m} \mid \tau_{ij}) + \epsilon_{i,j+m}$, where $E(\epsilon_{i,j+m} \mid \tau_{ij}) = 0$. Utilizing a Taylor-series approximation for $E(\tau_{i,j+m} \mid \tau_{ij})$, $\tau_{i,j+m} = a^m_0 + a^m_1(\tau_{ij} - \mu_j) + a^m_2(\tau_{ij} - \mu_j)^2 + \cdots + \epsilon_{i,j+m}$, where $\mu_j = E\tau_{ij}$. Thus, $\gamma_{j,j+m} = E(\tau_{ij} - \mu_j)(\tau_{i,j+m} - \mu_{j+m}) = a^m_1\gamma_{jj} + a^m_2\sigma^3_{\tau_j} + \cdots$, where $\gamma_{j,j+m}$ is a function of $\gamma_{jj}$.
8. An example might be a series of social studies tests in which only one exam
tests whether students know the names of state capitals, with this knowledge
not correlated with any of the knowledge/abilities measured by the other
tests.
9. Todd and Wolpin (2003) discuss the conditions under which this will be the
case.
10. When $\tau_{ij}$ and $\tau_{i,j-1}$ are homoscedastic, as assumed above, the same must be true for $\theta_{ij}$ per Equation 7.
11. Examples of such derivations are available upon request.
12. See Cameron and Trivedi (2005) for a more detailed discussion of minimum distance estimators. The equally weighted minimum distance estimator is consistent but less efficient than the estimator corresponding to the optimally chosen $B$. However, it does not have the finite-sample bias problem that arises from the inclusion of second moments. See Altonji and Segal (1996).
13. The equations $\tau_{i,j+1} = a_j + c_j\tau_{ij} + u_{i,j+1}$ and $E(u_{i,j+1}\tau_{ij}) = 0$ imply that $\gamma_{j,j+1} \equiv \mathrm{cov}(\tau_{ij}, \tau_{i,j+1}) = c_j\gamma_{jj}$ and, in turn, $c_j = \gamma_{j,j+1}/\gamma_{jj}$. With $u_{i,j+1} = \tau_{i,j+1} - a_j - c_j\tau_{ij}$, it follows that $\sigma^2_{u_{j+1}} = \gamma_{j+1,j+1} + c^2_j\gamma_{jj} - 2c_j\gamma_{j,j+1} = \gamma_{j+1,j+1} - \gamma^2_{j,j+1}/\gamma_{jj}$.
14. Even though $E(\tau_{ij} \mid S_{ij})$ is the best unbiased predictor (BUP) for the ability of any individual test taker, the distribution of the $E(\tau_{ij} \mid S_{ij})$ is not the BUP for the distribution of abilities. Neither are the rankings of the $E(\tau_{ij} \mid S_{ij})$ the BUP for ability rankings. See Shen and Louis (1998). However, the latter two BUPs can be computed employing the distribution of observed scores and the parameter estimates used to compute $E(\tau_{ij} \mid S_{ij})$.
15. We impute values of missing scores using SAS Proc MI. The Markov Chain
Monte Carlo procedure is used to impute missing score gaps (e.g., a missing
fourth grade score for a student having scores for Grades 3 and 5). This
yielded an imputed database with only monotone missing data (e.g., scores
included for Grades 3 through 5 and missing in all grades thereafter). The
monotone missing data were then imputed using the parametric regression
method.
16. The cubic estimates of $E(\tau_{i,j+1} \mid \tau_{ij})$ in the graphs might be even closer to linear if we had accounted for all measurement error. This was not done to avoid possible circularity; one could question results where the estimates of the overall measurement-error variances are predicated on maintaining linearity and the estimated variances are then used to assess whether $E(\tau_{i,j+1} \mid \tau_{ij})$ is in fact linear.
17. The expected values are computed using the Monte Carlo simulation described in Section 2.2, assuming the parameter estimates are correct.
18. The joint density of $\tau_{ij}, \tau_{i,j+1}, \eta_{ij}$, and $\eta_{i,j+1}$ is $h_j(\tau_{ij}, \tau_{i,j+1}, \eta_{ij}, \eta_{i,j+1}) = g_j(\eta_{ij} \mid \tau_{ij})\, g_{j+1}(\eta_{i,j+1} \mid \tau_{i,j+1})\, f(\tau_{ij}, \tau_{i,j+1})$. With $\epsilon_i \equiv \tau_{i,j+1} - \tau_{ij}$ and $D_i \equiv S_{i,j+1} - S_{ij} = \epsilon_i + \eta_{i,j+1} - \eta_{ij}$, the joint density of $\tau_{ij}, \epsilon_i, \eta_{ij}$, and $D_i$ is $h_j(\tau_{ij}, \tau_{ij} + \epsilon_i, \eta_{ij}, D_i - \epsilon_i + \eta_{ij})$. Integrating over $\tau_{ij}$ and $\eta_{ij}$ yields the joint density of $\epsilon_i$ and $D_i$: $z(\epsilon_i, D_i) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_{j+1}(D_i - \epsilon_i + \eta_{ij} \mid \tau_{ij} + \epsilon_i)\, f_2(\tau_{ij} + \epsilon_i \mid \tau_{ij})\, g_j(\eta_{ij} \mid \tau_{ij})\, f_1(\tau_{ij})\, d\eta_{ij}\, d\tau_{ij}$, where $f_1(\tau_{ij})$ is the marginal density of $\tau_{ij}$ and $f_2(\tau_{i,j+1} \mid \tau_{ij})$ is the conditional density of $\tau_{i,j+1}$ given $\tau_{ij}$. This integral can be computed using $z(\epsilon_i, D_i) = (1/M)\sum_{m=1}^{M} g_{j+1}(D_i - \epsilon_i + \eta^*_{mj} \mid \tau^*_{mj} + \epsilon_i)\, f_2(\tau^*_{mj} + \epsilon_i \mid \tau^*_{mj})$, where $(\tau^*_{mj}, \eta^*_{mj}),\ m = 1, 2, \ldots, M$, is a sufficiently large number of draws from the joint distribution of $(\tau_{ij}, \eta_{ij})$. In turn, the density of the posterior distribution of $\epsilon_i$ given $D_i$ is $v(\epsilon_i \mid D_i) = z(\epsilon_i, D_i)/l(D_i)$, where $l(D_i) = (1/M)\sum_{m=1}^{M} g_{j+1}(D_i - \tau^*_{m,j+1} + \tau^*_{mj} + \eta^*_{mj} \mid \tau^*_{m,j+1})$ is the density of $D_i$ and $(\tau^*_{m,j+1}, \tau^*_{mj}, \eta^*_{mj}),\ m = 1, 2, \ldots, M$, is a sufficiently large number of draws from the joint distribution of $(\tau_{i,j+1}, \tau_{ij}, \eta_{ij})$. The cumulative posterior distribution is $P(\epsilon_i \le a \mid D_i) = (1/(M\,l(D_i)))\sum_{\tau^*_{m,j+1} - \tau^*_{mj} \le a} g_{j+1}(D_i - \tau^*_{m,j+1} + \tau^*_{mj} + \eta^*_{mj} \mid \tau^*_{m,j+1})$. Finally, the posterior mean ability gain given $D_i$ is $E(\epsilon_i \mid D_i) = (1/(M\,l(D_i)))\sum_{m=1}^{M}(\tau^*_{m,j+1} - \tau^*_{mj})\, g_{j+1}(D_i - \tau^*_{m,j+1} + \tau^*_{mj} + \eta^*_{mj} \mid \tau^*_{m,j+1})$.
19. $E(\tau_6 \mid S_5, S_6)$ would equal $E(\tau_6 \mid S_6)$ if either the sixth-grade exams were not subject to measurement error or the fifth- and sixth-grade universe scores were not correlated.
20. See the discussion of the "structural least squares" estimator in Kukush, Schneeweiss, and Wolf (2005).
21. Because standard error of measurement (SEM) values are reported for a limited set of scores, a flexible functional form for $\sigma^2_{\eta}(\tau)$ was fit to the reported SEMs. This function was then used in the computation of moments.
References
Abowd, J. M., & Card, D. (1989). On the covariance structure of earnings and hours
changes. Econometrica, 57, 411–445.
Altonji, J. G., & Segal, L. M. (1996). Small sample bias in GMM estimation of covariance
structures. Journal of Business and Economic Statistics, 14, 353–366.
Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and
Policy, 4, 351–383.
Bloom, H. S., Hill, C. J., Black, A. R., & Lipsey, M. W. (2008). Performance trajectories
and performance gaps as achievement effect-size benchmarks for educational inter-
ventions. Journal of Research on Educational Effectiveness, 1, 289–328.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and applications.
New York, NY: Cambridge University Press.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability
analysis for performance assessments of student achievement or school effectiveness.
Educational and Psychological Measurement, 57, 373–399.
CTB/McGraw-Hill. (2009). New York state testing program 2009: Mathematics, Grades
3-8 (Technical Report). Monterey, CA: Author.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational mea-
surement (3rd ed., pp. 105–146). New York, NY: American Council on Education.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th
ed., pp. 65–110). Westport, CT: Praeger.
Joreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36,
109–133.
Joreskog, K. G. (1978). Structural analysis of covariance and correlation matrices.
Psychometrika, 43, 443–477.
Kukush, A., Schneeweiss, H., & Wolf, R. (2005). Relative efficiency of three estimators
in a polynomial regression with measurement errors. Journal of Statistical Planning
and Inference, 127, 179–203.
Rogosa, D. R., & Willett, J. B. (1983). Demonstrating the reliability of difference
scores in the measurement of change. Journal of Educational Measurement, 20,
335–343.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY:
John Wiley.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, England: Chap-
man & Hall.
Shen, W., & Louis, T. A. (1998). Triple-goal estimates in two-stage hierarchical models.
Journal of the Royal Statistical Society, 60, 455–471.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement
(pp. 560–620). Washington, DC: American Council on Education.
Todd, P. E., & Wolpin, K. I. (2003). On the specification and estimation of the production
function for cognitive achievement. The Economic Journal, 113, F3–F33.
Authors
DONALD BOYD is a senior fellow at the Rockefeller Institute of Government, University
at Albany, 411 State Street, Albany, NY 12203; e-mail: [email protected]. Boyd
focuses on state and local government fiscal issues, with special emphasis on taxation,
budgeting, and pensions.
HAMILTON LANKFORD is a professor of Educational Administration and Policy
Studies, School of Education, University at Albany, Albany, NY 12222; e-mail:
[email protected]. His research largely focuses on the teacher workforce (e.g., career
choices and teacher sorting) and draws upon different statistical literatures to adapt/
develop empirical methods needed to address policy questions.
SUSANNA LOEB is the Barnett Family Professor of Education and director of the Center
for Education Policy Analysis, School of Education, Stanford University, Stanford, CA
94305; e-mail: [email protected]. Loeb studies the relationship between schools and
federal, state, and local policies, focusing particularly on teacher and school leader
labor markets and education finance.
JAMES WYCKOFF is the Curry Memorial Professor of Education and Policy and direc-
tor of the Center for Education Policy and Workforce Competitiveness at the University
of Virginia, Curry School of Education, University of Virginia, Charlottesville, VA
22904; e-mail: [email protected]. His research focuses on issues of teacher labor
markets including teacher preparation, recruitment, assessment, and retention.
Manuscript received April 30, 2012
First revision received December 10, 2012
Second revision received June 13, 2013
Accepted July 15, 2013