classical test theory as a first- order item response ...paul w. holland machteld hoskens research...
TRANSCRIPT
RESEARCH REPORT October 2002 RR-02-20
Classical Test Theory as a First-Order Item Response Theory: Application to True-Score Prediction From a Possibly Nonparallel Test
Paul W. Holland Machteld Hoskens
Research & Development Division Princeton, NJ 08541
Classical Test Theory as a First-Order Item Response Theory:
Application to True-Score Prediction From a Possibly Nonparallel Test
Paul W. Holland, Educational Testing Service
Machteld Hoskens, CTB-McGraw Hill
October 2002
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:
Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541
i
Abstract
We give an account of classical test theory (CTT) in terms of the more fundamental ideas of item
response theory (IRT). This approach views CTT as a very general version of IRT, and the
commonly used IRT models as detailed elaborations of CTT for special purposes. We then use
this approach to CTT to derive some general results regarding the prediction of the true score of
a test from an observed score on that test as well from an observed score on a different test. This
leads us to a new view of linking tests that were not developed to be linked to each other. In
addition we propose true-score prediction analogues of the Dorans and Holland measures of the
population sensitivity of test linking functions. We illustrate the accuracy of the first-order theory
using simulated data from the Rasch model and illustrate the effect of population differences
using a set of real data.
Key words: test theory, true scores, best linear predictors, test linking, nonparallel tests,
simulation, Rasch Model
ii
Acknowledgements
We would like to thank Neil Dorans, Skip Livingston, and two anonymous referees for many
suggestions that have greatly improved this paper. The work reported here is collaborative in
every respect and the order of authorship is alphabetical. It was begun while both authors were
on the faculty at the Graduate School of Education, University of California, Berkeley.
1
1. Introduction
This paper has two tasks. The first is to show how classical test theory (CTT) can be
viewed as a mean and variance (i.e., first-order) approximation to a very general version of item
response theory (IRT). This task connects CTT more closely to IRT and provides simplified
ways of making calculations relevant to IRT models using the easier mean and variance
calculations of CTT. It is often hard to see the structure of a full IRT model through the forest of
item response functions, their numerous parameters, prior ability distributions, complex
estimation techniques, the computation of plausible values drawn from posterior distributions of
ability, etc. This is not a call for the return to the simpler ideas of CTT, but rather a suggestion to
use them when they can give insight into the more complex IRT calculations.
Our second task is to show that this approach can bear fruit. We demonstrate how it gives
insight into the problem of predicting (a) the true score of a given test (i.e., direct true-score
prediction) and (b) the true scores of tests that are not necessarily parallel to the given test (i.e.,
indirect true-score prediction).
We organized the rest of this paper as follows. In the remainder of this section we
develop our notation and discuss the basic assumptions that underlie the general IRT model we
assume throughout the rest of this paper. The focus of the Section 2, “CTT as a First and Second
Moment Approximation to IRT,” is on showing how CTT can be derived from our general IRT
model, with the main result being Theorem 4. In Section 3, “Direct and Indirect Prediction of
True Scores From Observed Scores,” we give definitions of what we call direct and indirect true-
score prediction in terms of our general IRT model. We also introduce the idea of replacing the
posterior distribution of a true score by the best linear predictor (BLP) of the true score from the
manifest data. In addition, we discuss the relationship between the posterior variance of the true
score and the average prediction error of the BLP of the true score. Section 4, “Further Uses of
the First-Order IRT,” applies the results of the previous sections to two related problems. In
Section 5, “Examples Using Real and Simulated Data,” we examine real and simulated data to
show how these ideas work out in the case of the Rasch model. Finally, Section 6, “Discussion,”
contains suggestions for future research.
2
Basic Notation
Let X and Y denote the raw test information collected using two testing instruments that
we also call X and Y. For us, X and Y denote two random vectors, each realization of which is
associated with a single examinee. Underlying it all is a population P of examinees, which will
not play a major role in our analysis. Instead, G will denote a subgroup or subpopulation defining
variable so that G = g indicates membership in a particular subpopulation of P denoted by g. For
instance, G could denote gender, so that the possible values of G would be G = male or G =
female. The subpopulations defined by G will be ubiquitous throughout our analysis, while P
will stay in the background. In almost all of our analyses, we will be conditioning probabilities
and expectations on the event that G = g and will use |G to denote this.
This paper is closely related to the work discussed in Dorans and Holland (2000) in that
this paper is partly concerned with the effects of subgroup membership on various aspects of
linking the scores from different tests. This is why we have kept the subgroup membership
function, G, in the notation.
It was our intent to cover only the simple case of fixed-length, nonadaptive tests. At this
point we are not sure if our development is sufficiently rich to include adaptive tests or other
cases where item responses of certain types are missing. This is a problem for future
consideration.
Once we have test data, we need to score it, so we assume that there are two real-valued
scoring functions, sX (.) for X and sY (.) for Y, with the resulting scores denoted by capital letters,
i.e., SX = sX (X) and SY = sY (Y). In practice, sX might be “number-right,” formula scores, or some
weighted combination of item scores. The same holds for sY. We regard the definition of the
scoring functions as external to our analyses. The only property of the scoring functions that we
assume is that the scoring functions assign a unique score to each vector of test performance
data, X or Y. In our notation, SX and SY denote random variables over P that give the scores
obtained from the tests X and Y, respectively, for an examinee randomly sampled from P.
The IRT Assumptions
We have specified a notation for two tests, their test data, and their scores. Now we bring
unobservables or latent variables into the picture. As usual, we let �X and �Y denote two latent or
inherently unobservable variables that govern or lie behind the tests, X and Y. The thetas are
3
sometimes viewed as what the two tests measure, and they need not be the same thing in our
analysis.
Over the population P, we presume that sampling an examinee at random from P induces
a joint distribution of the variables X, Y, SX, SY, �X, �Y, and G. We use this joint distribution to
define the distributions of these variables as well as their means, variances, and covariances.
Thus, this joint distribution on P gives meaning to the equations such as (1) – (4) below. We use
P{} to denote the probability function for these random variables and to define the IRT model.
We make four very general assumptions about the IRT model:
1-DIM: �X and �Y are real numbers (not vectors of real numbers).
NO DIF: For any G, P{X, Y |�X, �Y, G} = P{X, Y |�X, �Y}. (1)
COND IND: P{X, Y |�X, �Y} = P{X |�X, �Y} P{Y |�X, �Y}. (2)
SIMP: P{X |�X, �Y} = P{X |�X} and P{Y |�X, �Y} = P{Y |�Y}. (3)
1-DIM. Initially, the latent variables �X and �Y are abstract quantities with no assumed
numerical properties. We clarify this by assuming the thetas are real numbers rather than vectors
or abstract categories. The assumption, 1-DIM, is restricting and eliminates all multidimensional
IRT models for each of the tests, but it is widely assumed in practice, so we use it in this
analysis. We have not explored the extent to which 1-DIM can be relaxed in the results we report
below, but we recognize that this is a task for further research.
The other three assumptions are often implicitly assumed to operationalize what it means
for X and Y to measure �X and �Y. We believe it is useful to make them explicit.
NO DIF. This assumption is intended to apply for any G, which we will always interpret
as any function on P (a) that involves only observable data and (b) that is not determined by the
observed test data in either X or Y but might involve some other test data as well as examinee
characteristics. The observable and nonreflexive nature of the “legitimate” Gs are important
restrictions that need to be kept in mind when applying our results. We will not mention it again,
but it is tacitly assumed that whenever we use the phrase, “for any G,” we actually mean “for any
legitimate G.”
The assumption in (1) is that �X and �Y are the only things that affect the performance
recorded in X and Y. The NO DIF assumption means no differential item functioning in the very
general sense that given �X and �Y , membership in the groups indicated by G has no additional
influence on the performance of an examinee on these tests. Because X and Y will usually
4
contain item-level responses, this use of the term NO DIF is compatible with other uses of DIF in
the literature. The NO DIF assumption is an unstated part of many IRT analyses. Within certain
IRT models, it can be tested in various ways. We do not consider testing it here but assume NO
DIF and use it in many of our calculations. We have not examined what changes in our analysis
would take place if we modified this assumption to allow DIF.
The roles of X and Y cannot be reversed with those of �X and �Y in the NO DIF
assumption. Later we will consider the reversed, or posterior, probability, P{�X, �Y | X, Y, G },
for which the effect of G can rarely be ignored in the way that it is in (1).
COND IND. Mathematically this assumption states that given �X and �Y, X and Y are
conditionally independent of each other. It means that information from Test X is useless for
predicting performance on Test Y given the two theta values for an examinee. Usually this
assumption is stated in terms of local independence of test items within a test once the theta
values are given, but we use this version of the assumption because we never look within X or Y
beyond the scores SX and SY.
SIMP. This assumption is related to COND IND because it involves conditional
independence as well. The first part assumes that X is independent of �Y given �X (i.e., �X is
specific to X). The second part assumes that �Y is specific to Y in the same sense. Relative to �X
and �Y, the SIMP assumption asserts that X and Y exhibit simple structure in the sense often used
in factor analysis.
For some, it helps to think of the thetas as what the observed test data measure and that
the three assumptions, NO DIF, COND IND, and SIMP, merely follow from what it ought to
mean for a test to measure something. For us, these assumptions together define what it means
for the thetas to govern or lie behind the observed test data.
Because SX and SY are functions of X and Y, respectively, they may be substituted for X and
Y in (1) – (4), and the resulting equations will hold as well.
The combined effect of NO DIF, COND IND, and SIMP is the following basic equation
that we state as a theorem to identify it. This result does not depend on the dimensionality
assumption, 1-DIM.
Theorem 1. Under assumptions NO DIF, COND IND, and SIMP, the conditional
distribution of X and Y given �X, �Y, and G is simplified as follows:
5
P{X, Y |�X, �Y, G} = P{X |�X} P{Y |�Y}. (4)
Equation (4) is often implicit in the particular forms of likelihood functions and other
important elements of IRT models applied to testing problems.
These four assumptions are made time and again in the application of IRT to testing
problems. Throughout this analysis, we will avoid making any additional functional form
assumptions (i.e., Rasch model, 3PL, Partial Credit, Graded Response, etc.) that are the usual
fare of IRT applications. The one exception that we make is a very mild restriction on the
functional form of the IRT model that is satisfied by every IRT model in common use. We will
show that CTT can be viewed as a mean-and-variance approximation to this very general class of
IRT models.
In Appendices A and B, we summarize two other mathematical results that we also need
for the derivations in this paper—one on using conditioning to calculate first and second
moments and the other on BLPs. The results in these two appendices are well-known and, along
with our IRT assumptions, are the only tools we use here.
Reparameterizing the Thetas in Terms of True Scores
Because the abstract nature of �X and �Y makes them somewhat difficult to discuss, we
wish to avoid that in this paper, and instead we introduce the true-score reparameterization of the
thetas. This reparameterization makes it easier to think about what the latent variables are and
will lead us to connect the general IRT model described above to CTT. We define the true
scores, �X and �Y, in the usual way by:
�X = �X (�X) = E(SX |�X), (5)
and
�Y = �Y (�Y) = E(SY |�Y). (6)
We note that due to the NO DIF assumption we also have
�X = E(SX |�X, G), and �Y = E(SY |�Y, G), (7)
for any choice of G.
The functions �X = �X (�X) and �Y = �Y (�Y) reparameterize the abstract latent quantities �X
and �Y into new latent quantities that are in the range of the values assigned by the scoring
functions SX = sX (X) and SY = sY (Y). Thus, the �s are equivalent one-dimensional
reparameterizations of the �s and have units (i.e., X- or Y-score points) that are, in some ways,
6
more understandable than the logits or probits of the theta scales. In special cases, the functions
�X = �X (�X) and �Y = �Y (�Y) are called the “test characteristic functions” of X and Y, respectively.
In order for this reparameterization from the �s to the �s to be useful, we need to make
one further assumption about the IRT model.
CSI. The functions �X (�X) and �Y (�Y) in (5) and (6) are continuous and strictly increasing
(CSI) functions of �X and �Y. The CSI assumption allows �X and �Y to be reparameterizations of
�X and �Y with no loss of information between the �s and �s. The CSI condition always holds for
the scoring functions and the IRT models that are widely used in practice. CSI is the mild
restriction on the functional form of the IRT models that was mentioned earlier.
In our development of CTT, we will reduce the joint distribution of X, Y, SX, SY, �X, �Y,
and G to the joint distribution of SX, SY, �X, �Y, and G. For example, (1) – (4) may be replaced,
without change, by the same equations where �X and �Y are replaced by �X and �Y and X and Y
are replaced by SX and SY. In what follows, we will assume that we have reparameterized the
latent quantities into the corresponding true scores (i.e., the �s) and will ignore the �s in the rest
of this paper.
2. CTT as a First and Second Moment Approximation to IRT
In this section, we show how to relate the general IRT model discussed in the previous
section to CTT. As discussed, we do this by showing in some detail that CTT gives an
approximation of the more detailed results of IRT modeling that is accurate up to the first and
second moments of the score distributions. CTT is a first-order theory because it is primarily
concerned with means and variances. As such, it applies widely to any IRT model satisfying our
basic assumptions—that is, to all of the models in routine use.
The Basics of CTT
In CTT, the data are reduced to the scores SX, SY, and G, and the IRT model is reduced to
the true scores �Y and �X and their distribution over the relevant subpopulations of P. In the
course of our development, we repeatedly use the assumptions NO DIF, COND IND, and SIMP.
The error term. The most basic equation of CTT is the equation:
SX = �X + eX. (8)
7
Equation (8) will automatically hold in our development because we define the error
term, eX, by
eX = SX - �X. (9)
We begin our analysis with an examination of the conditional mean and variance of SX
given �X and G. Because of the NO DIF assumption, we can drop the conditioning on G, so we
examine the moments of SX given its true score, �X. By definition we have
E(SX |�X) = �X, (10)
and we define the conditional variance of SX to be
Var(SX |�X) = 2
XS� (�X). (11)
Once we have these two conditional moments of SX, we can study the corresponding
moments of eX both conditionally given the true score, �X, and marginally, where �X is averaged
out. The basic results are summarized in Theorem 2.
Theorem 2: (a) E(eX |�X) = 0, (12)
so that
E(eX |G) = 0, for any G. (13)
In addition,
(b) Var(eX |�X) = 2
XS� (�X) = Var(SX |�X), (14)
(note that (14) shows 2
XS� (�X) is the conditional standard error of measurement of SX)
and
(c) 2|Xe G� = Var(eX |G) = E[ 2
XS� (�X) |G] = E[Var(SX |�X)|G]. (15)
We outline the proof of Theorem 2 to show how the definitions and assumptions we have
made work together. Part (a) follows from: E(eX |�X) = E(SX � �X |�X) = E(SX |�X) – E(�X |�X) =�X �
�X = 0. Then (b) follows from Var(eX |�X) = Var(SX � �X |�X) = Var(SX |�X) = 2
XS� (�X) and the fact
that Var(eX |G) = E[Var(eX |G, �X) |G] + Var[E(eX |G, �X) |G] = E[ 2
XS� (�X) |G] + Var[0 |G] =
E[ 2
XS� (�X) |G]. Similar reasoning gives (c). QED.
In this derivation, we used Theorem A, parts (b) and (c), which is in Appendix A, as well
as the NO DIF assumption. Theorem 2 shows that in this first-order IRT, the conditional mean of
8
the error given the true score is constant and = 0 for any value of �X, but that, in general, the
conditional variance of the error term given the true score is not constant.
Equation (14) shows that the conditional variance (given �X) of the error term, eX, and the
of observed score, SX, are the same. In addition, it is a truism that Var(�X |�X) = 0, so that we have
Var(SX |�X) = Var(�X |�X) + Var(eX |�X) (16)
However, the important formula relating the observed score variance to the sum of the
true-score variance and the error variance is actually a statement about the marginal (given G)
variances of SX, �X, and eX. Theorem 3 summarizes the basic results for the marginal variances
and covariances of SX, �X, and eX.
Theorem 3: The error term and true score are uncorrelated, such as
(a) Cov(eX, �X |G) = 0,
from which it follows that
(b) Cov(eX, SX |G) = Cov(eX, eX |G) = Var(eX |G) = 2|Xe G� ,
(c) Cov(SX, �X |G) = Cov(�X, �X |G) = Var(�X |G) = 2|X G�
� ,
and
(d) 2|XS G� = 2
|Xe G� + 2|X G�
� . (17)
Part (a) is the key result, and the rest follows from it. Part (a) follows from
Cov(eX, �X |G) = E(eX �X |G) – E(eX |G)E(�X |G) = E[E(eX �X |�X, G) |G] – 0 = E[�X E(eX |�X, G) |G]
= E[�X 0 |G] = 0. QED.
In CTT, the unconditional standard error of measurement is |Xe G� , and, from (14), the
conditional standard error of measurement is Xe� (�X) =
XS� (�X). These are not the same
things—the former is a summary average of the latter, as we see in (15).
Using (17), we may define the usual CTT form of the reliability as the ratio of (marginal)
true-score variance to the (marginal) total variance, except that because we condition on G, it is a
conditional reliability that depends, in general, on the subpopulation defined by G. Reliability is,
as usual,
9
2|XS G� =
2|
2|
X
X
G
GS
��
�
= 2 2
| |2
|
X X
X
S G G
S G
e� �
�
�
= 1 � 2
|2
|
X
X
e G
GS
�
�
. (18)
In the following development we will see that this formula for the reliability of SX,
�S GX |2 , plays its usual role in our version of CTT.
A First-Order Item Response Theory
The first-order IRT that we will discuss involves only the joint distribution (SX, SY, �X, �Y)
conditional on G up to its first and second moments. All of the other details of this distribution
are suppressed in this first-order theory. Theorem 4 gives all of the first and second moments that
are relevant to any IRT model that involves two tests, X and Y, satisfying the five IRT
assumptions defined in Section 1.
Theorem 4: If X and Y are two tests satisfying the five IRT assumptions (1-DIM, NO
DIF, COND IND, SIMP, and CSI) then
(a) the mean vector of (SX, SY, �X, �Y) given G is:
E[(SX, SY, �X, �Y) |G] = ( |XS G� , |YS G� , |XS G� , |YS G� ), (19)
and (b) the variances, correlations, and covariances of (SX, SY, �X, �Y) given G are shown in Table
1, where the covariances are above the diagonal and the correlations below it.
Table 1
Variances, Covariances, and Correlations of (SX, SY, �X, �Y) Given G
SX SY �X �Y SX 2
|XS G� �� �X Y G| 2
|X G�� �
� �X Y G|
SY � � �� �S G S G GX Y X Y| | | 2
|YS G� �� �X Y G| 2
|Y G��
�X �S GX | � �� �S G GY X Y| | �
� X G|2 �
� �X Y G|
�Y � �� �S G GX X Y| | �S GY | �
� �X Y G| �� Y G|2
10
The argument for the means is easy, i.e., |XS G� = E(SX |G) = E[E(SX |G, �X) |G] = E[�X
|G] = |X G�� , and |YS G� = E(SY |G) = E[E(SY |G, �Y) |G] = E[�Y |G) = |Y G�
� . Thus the
conditional mean vector of (SX, SY, �X, �Y), E[(SX, SY, �X, �Y) |G] is ( |X G�� , |Y G�
� , |X G�� , |Y G�
� )
= ( |XS G� , |YS G� , |XS G� , |YS G� ).
We also show how two of the covariance-matrix expressions are derived to illustrate our
analysis. The covariance between SX and SY is a good case in point:
Cov(SX, SY |G) = E[Cov(SX, SY |G, �X, �Y) |G] +
Cov[E(SX |G, �X, �Y), E(SY |G, �X, �Y) |G] = E[0 |G] + Cov[�X, �Y |G] = �� �X Y G| . In this
derivation, we used Theorem A, part (d) in Appendix A and all three of the IRT assumptions,
NO DIF, COND IND, and SIMP. The corresponding correlation computation is
Correl(SX, SY |G) = �
� �
� �X Y
X Y
G
S G S G
|
| | =
�
� �
�
�
�
�
� � �
�
�
�
X Y
X Y
X
X
Y
Y
G
S G S G
G
G
G
G
|
| |
|
|
|
|
= �
� �
�
�
�
�
� �
� �
� �X Y
X Y
X
X
Y
Y
G
G G
G
S G
G
S G
|
| |
|
|
|
| = � � �
� �S G S G GX Y X Y| | | ,
using the definition of the reliabilities given earlier.
Another interesting case is the covariance between SY and �X:
Cov(�X, SY |G) = E[Cov(�X, SY |G, �X, �Y) |G] +
Cov[E(�X |G, �X, �Y), E(SY |G, �X, �Y) |G] = E[0 |G] + Cov[�X,��Y |G] = �� �X Y G| .
In this derivation, we also made use of the fact that a random variable given itself (i.e., �X
given �X) is a constant and has zero covariance with any other variable. We hope this is enough
detail to clarify how to calculate the entries in the covariance matrix of Theorem 4. QED.
Theorem 4 summarizes all the means, correlations, and covariances we need to compute
all of the quantities of interest to us in our first-order IRT. We first want to illustrate how
Theorem 4 can be used to give all of the usual results of CTT. We do this by considering three
examples—the disattenuation formula, the interpretation of reliability as the correlation between
11
parallel tests, and the Spearman-Brown formula for predicting the reliability of a whole test from
half-tests.
Disattenuation. This is the relationship between the correlation of the two observed
scores, SX and SY, and the correlation of the two true scores, �X and �Y. From the covariance
matrix in Theorem 4, we have:
�S S GX Y | = � � �� �S G S G GX Y X Y| | | , (20)
and the usual disattenuation formula is easy to derive from (20), i.e.,
�� �X Y G| =
�
� �
S S G
S G S G
X Y
X Y
|
| |. (21)
Reliability and Parallel Tests. Suppose SX and SY are such that their true scores are
perfectly correlated (i.e., congeneric), so that
�� �X Y G| = 1, (22)
and furthermore, suppose that SX and SY are equally reliable in the sense that
�S GX |2 = �S GY |
2 . (23)
Then (21) is easily rearranged to show that
�S GX |2 = �S S GX Y | . (24)
Thus, the reliability of SX is the correlation between SX and a “parallel” (i.e., congeneric
and equally reliable) measure, SY.
The Spearman-Brown Formula: In this case the score for the whole test is the sum SX +
SY where X and Y are congeneric and equal reliability (i.e., (22) and (23).) In addition, we
assume that X and Y are given equal weight in the sense that their standard deviations are equal,
that is
X YS S�� � . (25)
When (23) holds, (25) is equivalent to the assumption of equal errors of measurement,
X Ye e�� � . When (22), (23), and (25) are satisfied, the familiar Spearman-Brown formula
holds:
2
22( )
||
|
2
1X
X YX
S GS S G
S G�
�
�
�
��
. (26)
12
Estimating Error Variances. One of the great triumphs of early psychometric theory was
the discovery of ways to estimate the parallel forms reliability of a test from a single
administration of the test rather than the administration of two tests. An early approach used the
split-halves method to correlate the scores on parallel half-forms of the test and then used the
Spearman-Brown formula to step up the correlation between the half-forms to that of parallel
full-forms. This can be interpreted as an attempt to implement (26) directly. However, from the
IRT perspective taken here, the natural approach to estimating the reliability of a test is through
model-based estimates of the error variance, Var(eX |G), using the relationship � e GX |2 =
E[� S X2 (�X)|G]. Once an estimate of � e GX |
2 is in hand, it can be combined with the sample
estimate of the total score variance and (18) to estimate �S GX |2 without any direct reference to
parallel forms of the test.
One might take various routes to estimate the error variance. A simple one that we use in
our example section estimates the function � S X2 (�X) = Var(SX |�X) by using a specific IRT
model. This variance function will depend on the form of the IRT model assumed and the form
of the scoring function. In our case, we used the sum of the individual item conditional
variances. In addition, the distribution of �X for the subgroup defined by G will need to be
estimated. Existing computer programs for IRT analyses can provide both of these estimates.
The error-term variance, � e GX |2 , is then computed by averaging the estimate of � S X
2 (�X) over
the estimated distribution of �X.
3. Direct and Indirect Prediction of True Scores From Observed Scores
Many constituencies want to link the scores on tests that were not designed to be linked.
For example, they want to link the scores from a state’s standardized assessment to the National
Assessment of Educational Progress (NAEP) scale in order to be able to interpret the state’s
testing results more widely. Or, they want to link scores from a state’s high-school exit exam to
the M and V scores on the College Board’s SAT I so that students can avoid taking the real SAT
I. One of us (PWH) chaired a National Research Council (NRC) committee that made
recommendations about the feasibility of linking tests in this more general setting (Feuer,
Holland, Green, Bertenthal, & Hemphill, 1999). It is partly the result of this experience that led
13
to the research reported here. The NRC committee’s findings were pessimistic, but the
committee also concluded that quantitative knowledge was lacking about the tradeoffs that
linking different types of tests would entail. We hope that this paper and related research, such as
Dorans and Holland (2000), will help to clarify some of the technical considerations that must be
faced in providing such information.
Direct True-Score Prediction
If we start with SX, an examinee’s score on X, and his or her group membership as
indexed by the value of G, and we then want to make an inference about �X, the unobserved true
score for that examinee, the standard way to proceed is to report the posterior probability
distribution of �X, given by:
P{�X |SX, G} = P{ | }
P{ | }P{ | }
X XX
X
SG
S G�
� . (27)
In (27), we have invoked the NO DIF assumption, which simplifies the numerator of the
ratio from P{SX |�X, G} to P{SX |�X}. The posterior distribution in (27) summarizes what is known
about the latent true score given the observed test performance (summarized by SX) and whatever
else we known about the examinee (summarized here by the examinee’s value of G). We note
that the score, SX, can be replaced by the entire set of test data, X, in (27), but for our purposes
here we will reduce all the test data to the scores.
We will call this summarization “direct true-score prediction” to reflect the direct
connection between a latent true score and its corresponding observed scores. Later we will
consider “indirect true-score prediction.”
The posterior distribution P{�X | SX, G} in (27) can be very complex and hard to calculate.
In the form specified by (27), the posterior distribution also gives little insight into its detailed
structure. Therefore, summarizing the full posterior distribution by its first and second moments
is sometimes useful. For example:
E(�X | SX, G) and Var(�X | SX, G). (28)
The posterior mean in (28) is a prediction of �X based on the information that is available
from the examinee. The posterior variance (or better yet, its square root, the posterior standard
14
deviation) in (28) is a measure of the error of this prediction. In terms of true-score prediction, an
inference about an examinee is a prediction of his or her value of �X along with a measure of the
prediction error. Some authors call these considerations true-score estimation, but we follow
Holland (1990) and call it true-score prediction.
The posterior mean and variance in (28) can also be difficult to calculate even though
they are simplified summaries of the full posterior distribution in (27). They can be
approximated by using the BLP of �X from SX and G and the average prediction error of the BLP,
which is more carefully described in Appendix B. We denote the BLP of �X from SX and G by:
L(�X |SX, G) = �G + �GSX, (29)
where �G and �G may be calculated using Theorem B in Appendix B. We use the notation L(�X |SX,
G) in order to make the best linear predictor appear formally like the conditional mean that it
approximates. Using the results of Theorem B we see that
L(�X | SX, G ) = |X G�� + |X XS G�
�|
|
X
X
G
S G
��
�
(SX � |XS G� ), (30)
and then, using the formulas in Theorem 4, (30) becomes
L(�X | SX, G ) = |XS G� + �S GX |2 (SX � |XS G� ). (31)
Thus, the BLP of �X from SX, and G is just the Kelley (1923) true-score estimation
formula. Lord and Novick (1968, pp. 64–65) give some of the same analysis in their discussion
of the linear minimum mean squared error regression functions and their relation to the Kelley
formula for estimating true scores as a function of observed scores.
Prediction also includes measures of the prediction error. When we use the posterior mean
to predict �X, then the posterior variance or its square root can be used for this purpose. When the
posterior variance is too complicated to compute, it can sometimes be roughly summarized by its
average value over the conditional distribution of SX given G:
E[Var(�X |SX, G) |G]. (32)
Even the calculation in (32) may be difficult to make, but we can always use the results
of Theorem B in Appendix B, part (e2) to show that the average prediction error of the BLP
provides an upper bound on (32) that is easy to calculate and that is close to (32) when the
posterior mean E(�X | SX, G ) is close to being linear in SX. Specifically we have:
15
E[Var(�X |SX, G) |G] = 2| ,X XS G�
� � Var[E(�X |SX, G ) � L(�X |SX, G ) |G]
= �� X G|2 (1 � 2
|X XS G�� ) � �
� X G|2 2
DGk , (33)
where 2
|
E( | , ) L( | , )Var[( ) | ]X X X X
X
DGG
S G S G Gk�
� ��
�
�
. (34)
We can use Theorem 4 to simplify (33) to
E[Var(�X |SX, G) |G] = |2X G�
� (1 � 2|XS G� ) � |
2X G�
�2DGk , (35)
or
E[Var(�X |SX, G) |G] = |2S GX
�2
|XS G� (1 � 2|XS G� ) � |
2X G�
�2DGk . (36)
If we use the BLP to predict �X, then the average prediction error, 2| ,X XS G�
� , is the proper
measure of the average uncertainty of this prediction. The first terms on the right sides of (35) or
(36) give us easily calculated measures of this uncertainty in the BLPs prediction of �X. We will
use these ideas in Section 4.
Conditioning on Nontest Information
Before we leave direct true-score prediction, we will comment on a mathematical detail
that has some important consequences. From Theorem A, part (b) in Appendix A, it follows that
E{E(�X |SX, G) |G} = E(�X |G), (37)
and furthermore that (38)
E{E(�X | SX) |G} = E(�X |G), (38)
does not hold, in general. The relevance of this last statement is that conditioning on both the test
data SX and the nontest data G is necessary if the system of predictions of �X, E(�X | SX, G) is to
produce the average value of �X over the group defined by G when the predictions are themselves
averaged over the distribution of observed scores in the group defined by G—which is the
content of (37). If G is left out of the prediction, as it is in E(�X | SX), then its average over the
distribution of observed scores in the group defined by G will not, in general, equal the average
true score, E(�X |G). The less reliable the test and the more strongly associated G is with test
performance, the larger will be the discrepancy between the average of E(�X | SX) over the group
16
determined by G and E(�X |G). This is discussed extensively in Mislevy, Beaton, Kaplan, and
Sheehan (1992) and is one of the reasons for the use of conditioning in NAEP.
The estimate a posteriori of theta, the EAP (Bock & Mislevy, 1982), is an example of a
prediction of a latent variable that does not include nontest data (i.e., G) in the conditioning.
Wainer et al. (2001) consider the prediction of what we would call �X from both SX and SY, using,
in our notation, L(�X |SX, SY), which is the BLP of �X from SX and SY. This is also an example of
not including nontest information in the prediction of �X. Wainer and his colleagues call this
scoring rather than predicting and perhaps that is a good distinction to make. When scoring a test
rather than predicting �X from available information, it is usually thought proper to exclude
anything about the test taker from the conditioning other than his or her test data. The result is
that averaging the scores over subgroups of examinees produces discrepancies between this
average and the average value of �X over the subgroups of examinees of interest. On the other
hand, averaging predictions that do include the relevant nontest information will not exhibit such
discrepancies. From this we deduce that what we are calling predicting is not the same thing as
scoring a test, no matter how similar they might seem to be.
Indirect True-Score Prediction
Suppose now that we are interested in the true score, �X, but what is available to us is the
observed score from a different test, SY, as well as the information from G. This is the setting
when talk turns to linking tests that are not assumed to be parallel. Thus, the test information is
only indirectly related to the true score that is the target of our prediction, which is why we call
this the problem of indirect true-score prediction. Following the previous discussion, we are
naturally led to consider the posterior distribution, P{�X |SY, G}. This conditional distribution has
a little more complexity than what is in (27). We have
P{�X |SY, G} = P{ | , }
P{ | }P{ | }
Y XX
Y
S GG
S G�
� . (39)
The numerator of the ratio can be reexpressed as
P{SY |�X, G} = E[P{SY |�Y } |�X, G]. (40)
In (40), we use both NO DIF and SIMP to rid the inner conditional probability of its
dependence on both G and �X. The inner conditional probability in (40) is the usual likelihood
17
function for SY, while the outer expectation is over the conditional distribution of �Y given both �X
and G.
Comparing (27) to (39) and (40), we see that indirect true-score prediction has a new
place for dependence on G to emerge. It is in the conditional distribution of �Y given �X and G,
which is used in (40) to average the likelihood function of SY, which does not depend on G. In
our real data example, we will show that this does occur.
Again, the posterior distribution in (39) can be complicated and in some cases it may be
useful to reduce it to its posterior mean and variance:
E(�X |SY, G) and Var(�X |SY, G). (41)
In turn, we can approximate the elements of (41) by the corresponding BLP, L(�X |SY, G),
and its average prediction error, 2| ,X YS G�
� .
Using Theorem B in Appendix B, the formula for the BLP L(�X |SY, G) is
L(�X |SY, G) = |X G�� + |X YS G�
�|
|
X
Y
G
S G
��
�
(SY � |YS G� ), (42)
and applying Theorem 4 to this reduces it to
L(�X |SY, G ) = |XS G� + � �� �S G GY X Y| |
|
|
X
X
G
S G
��
�
|
|
X
Y
S G
S G
�
�
(SY � |YS G� ), (43)
or
L(�X |SY, G ) = |XS G� + | |X Y YG S G� �� � |XS G�
|
|
X
Y
S G
S G
�
�
(SY � |YS G� ), (44)
or
L(�X |SY, G ) = |XS G� + |X YS S G�|
|
X
Y
S G
S G
�
�
(SY � |YS G� ). (45)
(45) shows that the BLP L(�X |SY, G ) is, in fact, the population linear regression function
of SX on SY, given G, that is, L(SX |SY, G). At first, we were surprised by this result, but on
reflection it seems intuitively plausible or perhaps even obvious.
Applying the average prediction error of the BLP to approximate the average posterior
variance, using the results of Appendix B again, we get:
18
E[Var(�X |SY, G) |G] = 2| ,X YS G�
� � Var[E(�X |SY, G ) � L(�X |SY, G ) |G]
= �� X G|2 (1 � 2
|X YS G�� ) � �
� X G|2 2
IGk , (46)
where 2IGk =
|
E( | , ) L( | , )Var[( ) | ]X Y X Y
X G
S G S G G�
� ��
�
. (47)
Again, applying the results of Theorem 4 to (46), we get:
E[Var(�X |SY, G) |G] = �� X G|2 (1 � 2
|YS G�2
|X Y G� �� ) � �
� X G|2 2
IGk , (48)
or
E[Var(�X |SY, G) |G] = �� X G|2 (1 � 2
|YS G�
2|
2 2| |
X Y
X Y
S S G
S G S G
�
� �) � �
� X G|2 2
IGk , (49)
or
E[Var(�X |SY, G) |G] = |2
GXS� ( 2|YS G� � 2
|X YS S G� ) � �� X G|2 2
IGk . (50)
Note that the leading term of (50) is smaller than the average residual variance in linear
regression because the prediction error for true scores is smaller than for their observed scores.
Again, the leading terms of (49) and (50) give the average prediction errors of this indirect BLP
of �X. We will use them again in the next section.
Using these results, we see that the practice of using linear regression to predict one
observed test score from another, as in Pashley and Phillips (1993) has a clear meaning in terms
of indirect true-score prediction as defined here—namely, that it is the same as the BLP L(�X |SY,
G). The coefficients of the BLP all depend, in general, on G, however. In the use of the BLP to
project the scores of SY onto the scale of SX, care must be given to include as predictors of SX the
main effect of and interactions of G with the test score SY. The precision measure of the BLP
given by (50) does not come out of the usual regression analysis programs and involves the
reliability of SX.
4. Further Uses of the First-Order IRT
In this section, we will examine two further applications of the material developed in the
previous two sections. First, we consider the increase in prediction error that arises as we move
19
from direct to indirect prediction. Second, we develop analogues to the measure of the
population dependence of equating functions introduced by Dorans and Holland (2000).
The Price of Indirection
We can use our results to get a measure of the increase in average prediction error that
occurs when we predict the true score of SX from scores on a test that need not be parallel or
closely related to X. We propose to use the ratio of the square roots of the average prediction
errors (either the average posterior variances given in (35) and (49) or their leading terms, the
prediction errors of the BLPs). This gives us the prediction error inflation factor given by
H = [ ( | , ) | ][ ( | , ) | ]
X Y
X X
E Var S G GE Var S G G
�
�
=
2| 2
2|
2 2|
1
1
X Y
X
X
S S GIG
S G
DGS G
k
k
�
�
�
� �
� �
. (51)
H is a measure of the amount by which the average prediction error of the prediction of
�X is inflated when we use SY rather than SX to make the prediction. Indirect prediction is always
worse, and H is a measure of how much worse on average. Using the square root puts the
inflation factor into units that are similar to a percentage increase in the standard deviation. In
Section 5, we examine how much the 2Gk -factors matter in a real application.
Analogues of Measures of the Population Dependence of Linking Functions
Dorans and Holland (2000) propose measures of the influence of subgroup membership
on observed-score linking functions. There are natural analogues of that work to true-score
prediction. Because our aim is to use the BLPs as approximations to the posterior means, we will
concentrate on the BLPs in this discussion.
In the Dorans and Holland approach, the linking functions computed on each
subpopulation are compared to the linking function that is computed on the whole population. In
our situation, this corresponds to comparing
Direct Prediction: L(�X | SX, G = g) and L(�X | SX), (52)
or
20
Indirect Prediction: L(�X | SY, G = g) and L(�X | SY). (53)
The analogues to the Dorans and Holland root-mean-square-deviation (RMSD) measure
are the predicted difference functions given by:
Direct Prediction: PDD(s) =
2[ ( | , ) ( | )]X X X X gg
X
L S s G g L S s w
�
� �� � � �
�
�, (54)
or
Indirect Prediction: PDI(s) =
2[ ( | , ) ( | )]X Y X Y gg
X
L S s G g L S s w
�
� �� � � �
�
�, (55)
where wg is the proportion of the whole population that is in the subpopulation denoted by G = g.
These measures show the average amount, in true-score standard deviation units, that the
subpopulations indicated by G affect the BLP for each value of SX.
Dorans and Holland also propose single number summaries of the RMSD functions. The
analogues here are the expected predicted difference values given by:
Direct Prediction: EPDD =
2
,
[ ( | , ) ( | )] ( | )X X X X X gs g
X
L S s G g L S s P S s G g w
�
� �� � � � � �
�
�, (56)
or
Indirect Prediction: EPDI =
2
,
[ ( | , ) ( | )] ( | )X Y X Y Y gs g
X
L S s G g L S s P S s G g w
�
� �� � � � � �
�
�. (57)
However, the numerator of (56) can be expressed as the square root of the following
quantity,
2E{[( ( | , ) ( | )] }X X X XL S G L S�� � = Var[( ( | , ) ( | )]X X X XL S G L S�� � .
The equality of the second moment and the variance follows from the equality of the
unconditional expectations of the BLPs of �X. Similar expressions hold for indirect prediction.
Hence, we obtain the following alternative representation of EPD values in (56) and (57):
21
Direct Prediction: EPDD = SD[( ( | , ) ( | )]X X X X
X
L S G L S
�
� �
�
� , (58)
and
Indirect Prediction: EPDI = SD[( ( | , ) ( | )]X Y X Y
X
L S G L S
�
� �
�
� . (59)
The measures for indirect prediction are more analogous to those of Dorans and Holland
than are those for direct prediction because they involve linking Y to the true score scale of X.
We have not investigated the utility of finding analogues to the parallel-linear linking functions
used in Dorans and Holland to simplify their measures. In the next section, we evaluate these
measures for a real data example.
5. Examples Using Real and Simulated Data
In this section we report some preliminary results using the ideas developed in this paper.
These results make use of simulated data using a simple IRT model as well as an example using
real data.
The Simulation Study
In order to see how well the first order analysis approximates the posterior mean and
variance in a real IRT model, we carried out a small simulation study using the software
ConQuest (Wu, Adams, & Wilson, 1997). In the study, the two tests, called X and Y, had 40
items each—except in Cases 5 and 6, explained below in Table 2. In all of our analyses, the Y
scores were linked to the �X scale. Because our interest was in the accuracy of the first order
theory, we did not investigate the effects of multiple groups of examinees in the simulation, so
this part of our study had no G.
All of the item responses were simulated using a one-dimensional Rasch model with b
parameters that we varied to mimic several interesting differences between X and Y. The model
had four basic sets of b parameters called “spread,” “spread low,” “spread high,” and “peaked,”
respectively. In all four conditions, the bs were on the logit scale.
In the spread condition, the bs were randomly sampled from the uniform distribution on
[-1.75, 1.75]. For the spread-low condition, they were also sampled from this uniform
distribution, and then 0.25 logits were subtracted from all of them. For the spread-high condition,
22
the bs were again randomly drawn from the uniform distribution on [–1.75, 1.75], and then 0.75
logits were added to all of them. For the peaked condition, the bs were randomly drawn from the
normal distribution with mean 0 and standard deviation 0.1 logits.
With these definitions, we created six cases or six pairs of test conditions for tests X and
Y with the item parameters described in Table 2.
Table 2
The Sets of Item Parameters Describing Each Pair of Test Conditions Used in the Simulation Study
Case Test X Test Y
1 Spread Spread 2 Peaked Peaked 3 Peaked Spread 4 Spread low Spread high 5 Spread, 20 items Spread, 20 items 6 Spread, 40 items Spread, 20 items
These six cases encompass the following conditions that might arise in linking two tests:
�� Case 1: The two tests have difficulty parameters spread out over similar large
ranges of values.
�� Case 2: The two tests have difficulty parameters concentrated in about the same
small range of values.
�� Case 3: The two tests have similar average difficulty, but Y has a wide range of
difficulty parameters and X has a narrow range of difficulty parameters.
�� Case 4: The two tests both have wide ranges of difficulty parameters, but they
have very different average difficulty, with X being the easier test.
�� Case 5: The same as the condition in Case 1 but with both tests half as long, and
therefore both X and Y are less reliable than in Case 1.
�� Case 6: The same as the condition in Case 1 except that X is twice a long as Y, so
that a less reliable test is being linked to the scale of a more reliable test.
23
The thetas for the two tests were assumed to be distributed as bivariate normal with
means 0 and standard deviations 1 and with one of two correlations between �X and �Y, either � =
.8 or � = .5. In each simulation, we used N = 2000 simulated examinees (i.e., “simulees”)
The structure of the simulation consisted of 12 = 2 × 6 combinations of a choice of one of
the two correlations for the bivariate ability distribution of (�X, �Y) and a choice of one of the six
sets of pairs of item parameters as specified in Table 2. In each simulation, a sample of 2000
simulees with (�X, �Y)-values were generated from the selected bivariate ability distribution, and
values of their dichotomous item responses from X and Y were then simulated using the selected
value of (�X, �Y) and the pair of Rasch models with item parameters indicated by the appropriate
case in Table 2. The raw scores on X and Y were taken to be the number-right scores on each
test, i.e.,
SX = sX (X) = jj
X� , and SY = sY (Y) = jj
Y� . (60)
When necessary, we transformed all the theta values to the corresponding true-score scale
using the transformations in (5) and (6) that now take the form
�X� = jj
XP� (�X), and �Y� = jj
YP� (�Y), (61)
where the item response functions in (61) are the Rasch type, i.e., they are given by
logit[PjX(��X)] = ��X – b jX , and logit[PjY(��Y)] = ��Y – b jY. (62)
In the transformations defined by (61) and (62), the true bs were used rather than estimates of
them based on the sample data. Thus, in this study, the theta-to-true-score transformation was the
correct population transformation, rather than an approximation estimated from sample data.
An important part of our simulation was to obtain estimates of the posterior means and
variances: E(�X |SX) and Var(�X |SX) for direct prediction and E(�X |SY) and Var(�X |SY) for indirect
prediction. The program ConQuest can produce plausible values (i.e., sample draws) from the
posterior distributions of �X |SX, so we exploited this facility in our simulation. In calculating the
posterior distributions, we again used the population values for the bs and the appropriate normal
distribution for the priors. When we were concerned with direct prediction, we drew from the
posterior distribution P{�X |SX}, and, when we were concerned with indirect prediction, we first
drew from the posterior distribution P{�Y |SY} and then used the conditional distribution P{�X |�Y,
SY} = P{�X |�Y} to make the final draws from the posterior distribution P{�X |SY} (Gelman,
24
Carlin, Stern, & Rubin, 1995). For each simulee, we generated 100 plausible values from the
conditional distribution of �X given either SX (for direct prediction) or SY (for indirect prediction).
We then used the true-score transformation for X in (61) to transform the theta plausible values
to true-score or tau-plausible values. Simulees were grouped on the basis of SX or of SY, and then
means and variances were computed for all of the true-score plausible values represented by each
of these groups of simulees with identical values of SX or SY. For each condition of the simulation
design, we generated 10 replicate data sets and averaged the results. The means and sds across
the 10 replications are given in Tables 3, 4, and 5.
These means and variances of the plausible values of �X then formed our estimates of the
posterior means and variances for direct and indirect prediction. They are to be compared to the
values obtained from the first-order BLP theory outlined in Section 3.
In order to implement our first-order IRT analysis, we needed estimates of the tests’
reliabilities, which we obtained by using the approach outlined in Section 2. We operationalized
the error variance as the integration specified by:
� eX2 = E[� S X
2 (�X)] = E[� S X2 (�X)] =
{ ( )[ ( )]} ( )P PjX Xj
jX X X� � � ��z��
�
�1 d�X. (63)
In (63), � (.) denotes the standard normal density function, and we used the true b-values in
the IRFs within the integral rather than estimated values. The integration was carried out
numerically. Table 3 shows the resulting reliabilities for the five different sets of item parameters
used in our study averaged across the various conditions in which they appeared, along with the
standard deviations.
25
Table 3
Average Reliability Values for Five Sets of Item Parameters
Pattern of item difficulties
Average reliability (standard deviation)
Spread .88 (.005) Peaked .89 (.002) Spread low .88 (.003) Spread high .87 (.002) Spread 20 items .79 (.008)
From Table 3, it is evident that the only factor that strongly affects the reliability of the
tests used in the simulation is the number of items in the test, i.e., 20 versus 40. These reliability
values indicate that the test used in our study was not unrealistic in terms of the usual measures
of test reliability.
Simulation Results
How good an approximation to the posterior expectation is the posterior BLP? We
answer this question in two ways. First, by using the overall measure of the discrepancy between
the posterior means and the BLP given by 2DGk and 2
IGk . Table 4 shows the values of 1,000 × the
k2 factors for the various conditions of the simulation design. The values in Table 4 are means
across the 10 replications of each simulation condition. All of these values are very small,
indicating that the BLP is a good approximation to the posterior means for the cases covered in
our simulation. In addition, it suggests that in many cases the k2 factors can be ignored in the
computation of H. The square root of each k2 factor is a percentage of a standard deviation of the
true-score distribution. This measure of the average squared difference between the BLP and the
conditional expectation ranges from 3% to 5% for direct and from 7% to 10% for indirect
prediction.
Examining Table 4 more closely, Cases 1, 4, and 6 are essentially the same for direct
prediction and the values of 2DGk reflect this. For direct prediction, Cases 2 and 3 are also the
same and have identical values of 2DGk . Case 5 is the only case of direct prediction that involves
a 20-item test, and its value of 2DGk is the largest. The values of 2
DGk are all much smaller than
26
the corresponding values of 2IGk . For indirect prediction, the biggest differences in 2
IGk are
between the two values of the theta correlations, �. The differences in 2IGk between the six cases
are much smaller than the differences due to the theta correlation. We interpret this to mean that
the details of the differences in the item parameters and number of items are not as important as
the lack of parallelism indicated by the different thetas for the two tests. In the case of � = .0.5,
the tests are measuring very different things.
Table 4
Values of 1,000 × k2 for Each Condition in the Simulation Study
Case Test X Test Y Direct Indirect ��� = 0.8
Indirect ���� = 0.5
5 Spread 20 Spread 20 2.6 (0.7) 5.7 (1.1) 7.8 (1.7) 6 Spread 40 Spread 20 1.6 (0.5) 6.2 (1.8) 9.0 (2.9) 1 Spread Spread 1.5 (0.4) 6.0 (1.3) 9.9 (3.8) 4 Spread low Spread high 1.5 (0.3) 14.1 (3.4) 11.5 (2.1) 2 Peaked Peaked 1.2 (0.3) 5.5 (1.4) 9.7 (2.0) 3 Peaked Spread 1.2 (0.4) 5.2 (1.7) 10.0 (2.7)
Note: Rows are sorted by the value of k2 for the case of direct prediction. (Values in parentheses are standard errors based on 10 replications of each simulation condition.)
Case 4 is interesting in that it is the only one in which the two tests are differentially
targeted for the underlying population. X is a bit too easy for the population and Y is a bit too
hard for them. This is roughly what can happen in vertical equating studies. We note that the
values of 2IGk are the biggest for these two cases and that they get larger as � increases. At first,
we thought this was an error in the simulation. We were convinced, however, that it was not after
we did three more versions of Case 4 where � was .90, .99, and 1.00 and the values of 1,000 2IGk
were 13.2, 14.8, and 14.9, respectively. Figure 2 shows both the conditional expectation and the
BLP for Case 4 � = 0.80, and we see substantial curvilinearity in the conditional expectation.
The “bend” gets stronger the more correlated the two tests become, and it is related to the
difference in the levels of difficulty of the two tests and not to the lack of a perfect correlation
27
between the abilities they measure. For comparison, we also did Case 1 (X spread and Y spread)
for � = 0.99, and instead of going up, 1,000 2IGk went down to 1.8 from 6.0 for � = 0.80.
While the values of the k2 factors indicate that, for these examples at least, the linear
approximation to the posterior mean by the BLP is quite good, our second approach includes two
graphs (see Figures 1 and 2) that show what the posterior means and the BLP look like as
functions of the conditioning score value. We only show the graphs that correspond to the largest
and smallest values of the k2-factors in our original study design.
The BLP is an approximation to the conditional expectation because Theorem A, part (b)
and Theorem B, part (d) in the appendices show that, averaged over the score distribution, both
L(�X |SX) and E(�X |SX) have the same value (this is also true for L(�X |SY) and E(�X |SY)). Hence,
there can be no constant bias between the approximation and the target—they must cross as we
see in Figures 1 and 2. Our conclusion is that the BLP is a remarkably good linear approximation
to the posterior mean in the IRT model studied here. This suggests the need for future analyses
along these lines for more complicated IRT models.
How good is H as an approximation to the loss of precision that arises through the use of
indirect prediction? First of all, there were, to two decimal values, virtually no differences
between the values of H computed using (51) and one where the value of k2 is set to 0. This
finding is not surprising in light of the small values of k2 that emerged in our study, and it
suggests analyses of H that ignore k2 probably give useful results in many situations. This is a
useful topic for future research.
Figure 3 gives typical examples of the posterior variances for direct and indirect
prediction. It shows a curvilinear relationship between the posterior variance and the
conditioning test score. This curvilinear relationship is predicted from the simple beta-binomial
model (Gelman et al. 1995, p. 477), which is a special case of the Rasch IRT models used here.
Comparing the two graphs, we see that the posterior standard deviations for direct prediction are
smaller than those for indirect prediction, which is exactly what the inflation factor H is
attempting to measure. To see how well H does this, we computed the average ratios of the
posterior standard deviations at each conditioning score point, indirect divided by direct, and
then averaged the results over the score points. If H is to be a useful average measure, it needs to
reflect the average amount by which the posterior standard deviation is increased when we move
from direct to indirect prediction.
28
(a)
(b)
Figure 1. (a) Plot of posterior mean, E(|), and posterior BLP, L(|), for Case 3, X peaked, Y
spread, with � = 0.8. (best fit). (b) Difference plotted against raw score.
0
10
20
30
40�
X
0 10 20 30 40SY
E(�X|SY)L(�X|SY)
0 10 20 30SY
-3
-2
-1
0
1
2
E(�X|SY)-L(�X|SY)
29
(a)
(b)
Figure 2. (a) Plot of posterior mean, E(|), and posterior BLP, L(|), for Case 4, X spread low,
Y spread high, and � = 0.8. (worst fit). (b) Difference plotted against raw score.
0 10 20 30SY
0
10
20
30�
X
L(�X|SY)E(�X|SY)
0 10 20 30SY
-3
-2
-1
0
1
2
E(�X|SY)-L(�X|SY)
30
(a) (b)
Figure 3. Comparison of conditional standard deviations and their average with the
average predicted by the BLP, for both direct and indirect prediction. (a) Case 1, X spread,
Y spread, � = 0.8. (b) Case 2, X peaked, Y peaked, � = 0.8.
0 10 20 30Observed Test Score
1
2
3
4
5
6
7
8
9
Poste
rior S
D
SD(�X|SY)
E[SD(�X|SY)]SD(�X|SX)
obs M[SD(�X|SY)]
E[SD(�X|SX)]obs M[SD(�X|SX)
0 10 20 30Observed Test Score
1
2
3
4
5
6
7
8
9
Post
erio
r SD
SD(�X|SY)
E[SD(�X|SY)]SD(�X|SX)
obs M[SD(�X|SY)]
E[SD(�X|SX)]obs M[SD(�X|SX)
31
When the two tests are not of the same length (i.e., Case 6), some means of connecting
their score values needs to be worked out to form these ratios of posterior standard deviations.
We simply scored the two tests by the percentage correct and pooled values of the standard
deviations associated with the neighboring scores of the test with the larger number of score
values. We used (64) to form the target ratios.
Target ratios = Var( | ) { }Var( | )
X YX
s X X
S s P S sS s
�
�
�
��
�
. (64)
The values of the target ratios and the values of H are given in Table 5.
Table 5
Average Values of Target Ratios and H
Case X Y H � = .8
H � = .5
Mean Ratio � = .8
Mean Ratio � = .5
5 Spread 20 Spread 20 1.49 (.04) 1.95 (.05) 1.51 (.04) 1.99 (.02) 1 Spread Spread 1.86 (.03) 2.53 (.03) 1.89 (.03) 2.58 (.04) 6 Spread 40 Spread 20 1.99 (.04) 2.60 (.05) 2.04 (.03) 2.64 (.08) 4 Spread
Low Spread High
1.92 (.03) 2.55 (.03) 1.77 (.03) 2.48 (.03)
2 Peaked Peaked 2.02 (.02) 2.70 (.02) 2.07 (.03) 2.81 (.07) 3 Peaked Spread 2.03 (.02) 2.71 (.03) 2.03 (.03) 2.77 (.06)
Note: Table shows the average values of H and of the average ratios of the target posterior standard deviations for the several simulation conditions in the study. (Standard deviations across the 10 replications are in parentheses.)
Table 5 shows that the H values, while usually smaller than the target values, are quite
close and give exactly the same general type of information about the effect of indirect
prediction relative to direct prediction for the 12 conditions of the simulation study. We regard
this finding as a clear support for further work on the BLP tools we have developed here.
How does indirect prediction increase the imprecision of the prediction of the true score
of the target test, X. Using the H values, we get a clear picture as to what happens when we link a
test that measures a different construct in a different way to a target test in terms of degrading the
accuracy of the prediction of the target true score. For example, in this study the posterior
32
standard deviation is inflated by factors ranging from 49% to 171%, depending on the simulation
condition. As the correlation between the constructs lessens, the inflation factor increases.
Interestingly, the smallest inflation factors arise for the case of the least reliable tests, Case 6.
This may be due to the fact that the least reliable tests have poorer direct prediction to begin with
and thus the least to lose from using indirect test data to predict their true scores. This finding is
worth more investigation than we have reported here.
A Real Data Example. We also examined an example using test data from an
administration of a fifth grade Science assessment in two states in 1998. This assessment
involved both a multiple choice test (MC) with 29 items and a performance task test (PT) that
had a performance task followed by nine questions asking the students to record their
observations and explain them. The PT questions were scored dichotomously using expert
judgement. We will use these two different testing formats as the two tests in our study and use
the PT scores to indirectly predict the true scores on the MC.
Data were available for 1,202 girls and 1,096 boys, and we will use gender as the
subgroup-defining variable, G.
Table 6 shows some raw score (number right) means. According to the raw score means,
the two tests, MC versus PT, reverse the order of the two groups—boys perform better on the
MC (by 1.6% of a standard deviation) and girls better on the PT (by 9.9% of a standard
deviation). We used the Dorans-Holland root-expected-mean-square-difference measure
(REMSD)(Doran & Holland, 2000) to measure the average boy-girl difference between parallel-
linear equating functions linking these two tests. The REMSD is 5.6%, which is of moderate size
compared to the examples in their paper.
We used ConQuest to estimate IRT models for these data. In particular, we wanted to
estimate two different thetas, one for the MC and another one for the PT. We also wanted to
obtain separate bivariate normal ability distributions for boys and girls. In all cases, we fit Rasch
models for the items and bivariate normal distributions for the thetas. We did not want to add an
investigation of DIF to this study, so we estimated common item parameters for both genders.
33
Table 6
Raw Means of Number Right Scores
All Girls Boys Difference Standardized differencea
Multiple choice
16.17 (4.96)
16.13 16.21 -0.08 -1.6%
Performance task
4.73 (2.03)
4.83 4.63 0.20 9.9%
Note: Means of number right scores for the MC and PT for all students and separately for boys and girls. (Standard deviations in parentheses.) a Difference divided by the standard deviation for all, expressed as a percentage.
We did this estimation in two ways. First, we estimated a model for the girls only,
anchored the item parameters for boys to the values obtained for girls, and estimated the boys’
bivariate theta distribution subject to this constraint. Second, we reversed the process and
anchored the item parameters to the values estimated for the boys. The results for both
approaches are given separately and show minor differences. Some evidence showed that the
item parameters were slightly different for the two groups, but we do not think these differences
are large enough to affect the conclusions we reached in this example. Table 7 summarizes
various quantities of interest when the IRT analyses are performed separately for boys and girls
and for the total group. We see very little difference between the two methods of anchoring the
items, so we will comment on it no further. The reliabilities of both tests, the MC and PT, are
slightly higher for boys than for girls. However, the average posterior variances (shown here in
the square root scale) show the opposite trend, with the girls having slightly smaller average
posterior variances for either direct or indirect prediction. The values of k2 are all small and have
virtually no effect on H, the inflation factor. H is larger for boys than for girls, indicating that the
indirect prediction of the true score for the MC from the PT scores inflates the prediction error
more for the boys than it does for the girls.
34
Table 7
IRT Results for the Science Assessment Test Data
Item parameters anchored at values for Girls
Item parameters anchored at values for Boys
Girls Boys All Girls Boys Reliability (MC) .74 .77 .75 .73 .77 Reliability (PT) .56 .61 .58 .55 .60 Square root of E[Var(�X|SX, G)| G]
2.11 2.14 2.13 2.13 2.16
Square root of E[Var(�X|SY, G)| G]
3.41 3.75 3.57 3.38 3.74
2DGk .002 .002 .003 .003 .002
2IGk .006 .007 .005 .012 .004
H (setting k2 = 0) 1.62 1.76 1.67 1.60 1.74 H 1.62 1.76 1.68 1.60 1.74
The values of the EPDD and EPDI values are respectively .023 (2.3%) and .038 (3.8%).
These values indicate that subgroup differences have bigger effects on indirect prediction than
they have for direct prediction. The values of the EPDD and EPDI measures are smaller than the
Dorans-Holland REMSD value of 5.6% given earlier. Since the connection between the two
calculations is mostly by analogy, there is no reason for their values to be equal.
6. Discussion
The general IRT model we have developed here reproduces the main results of CTT in
considerable detail, including CTT models that involve more than one test. In addition, we see
that using the concept of BLP provides us with a version of CTT that does not need to assume
the form of the conditional means and variances is linear or constant and that does not really hold
for test data. Furthermore, our simulation study suggests that BLPs are useful alternatives to the
posterior means of the true scores, at least for the simple model we have examined.
In addition to these general considerations, this approach allows us to successfully
distinguish between direct and indirect true-score prediction in a simple but useful way. For
example, it allows us to compute an index of the loss of information that accompanies the linking
35
of nonparallel tests using true score prediction as the criterion. This is expressed in our index, H,
which can be computed from quantities that are usually available in test-linking studies. In
addition, we can generalize the Kelley formula for predicting a true score from an observed score
so that the formula predicts the true score from a nonparallel test. Our analysis shows that linear
regression gives an appropriate linking function (viewed as a BLP approximation to the posterior
expectation) but that the proper residual standard deviation is not that given by the usual
regression results. This justifies the use of multiple regression to link tests in studies such as
Pashley and Phillips (1993) and Williams, Billeaud, Davis, Thissen, and Sanford (1995).
The oft-stated assertion that “regression is not equating” immediately comes to mind
when we talk about linking in the manner that we have in this paper. We think that our approach
is a useful starting point, but it is not directly about test equating per se. For example, the
symmetry requirement of equating cannot hold for true score prediction as we have defined it
here.
This research suggests several topics that might be worth future investigations. First, it
seems useful to investigate the improvements that could be garnered by use of the best quadratic
predictor rather than the BLP. The departures from linearity shown in Figure 2 suggest that a
quadratic term will fit most of the departure from linearity that the conditional expectation
exhibits there. Since we have only looked at the simplest IRT model, however, it also would be
worthwhile to investigate the value of these ideas in more complex models for which the total
score is not a sufficient statistic. In addition, instead of the constant average variance formula in
computing H, it may be worthwhile to find a quadratic version of this using the beta-binomial as
a starting point.
Another possible use for the BLP and the other quantities is to provide “targets” for the
convergence of complex estimation procedures such as those exploiting Markov Chain Monte
Carlo methods. It is possible that having an easily computed target quantity like the BLP
available could indicate when the samples from the posterior distributions have converged to
reasonable values.
36
References
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation in a microcomputer
environment. Applied Psychological Measurement, 6, 431–444.
Dorans, N.,& Holland, P. W. (2000). Population invariance and the equatability of tests: Basic
theory and the linear case. Journal of Educational Measurement, 37, 281–306.
Feuer, M. J., Holland, P. W., Green, B. F., Bertenthal, M. W., & Hemphill, F. C. (1999).
Uncommon measures. Washington, DC: National Academy Press.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London:
Chapman and Hall.
Holland, P. W. (1990). On the sampling theory foundations of item response theory models.
Psychometrika, 55, 577–601.
Kelley, T. L. (1923). Statistical methods. New York: Macmillan.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population
characteristics from sparse matrix samples of item responses. Journal of Educational
Measurement, 29, 133–161.
Pashley, P. J., & Phillips, G. W. (1993). Toward world-class standards: A research study linking
national and international assessments. Princeton NJ: Educational Testing Service.
Wainer, H., Vevea, J.L., Camacho, F., Reeve, B.B. III, Rosa, K., Nelson, L., et al. (2001).
Augmented scores—“borrowing strength” to compute scores based on small numbers of
items. In D. Thissen & H. Wainer (Eds.), Test scoring, pp. 343-387. Mahwah, NJ:
Erlbaum.
Williams, V., Billeaud, K., Davis, L. A., Thissen, D., & Sanford, E. E. (1995). Projecting to the
NAEP scale: Results from the North Carolina End-of-Grade testing program (Technical
Report No. 34). Chapel Hill, NC: National Institute of Statistical Science, University of
North Carolina, Chapel Hill.
Wu, M., Adams, R.,& Wilson, M. (1997). ConQuest [Computer software]. Melbourne, Australia:
Australian Council for Educational Research.
37
Appendix A
Some Facts About Conditional Distributions, Means, Variances, and Covariances
We will use U, V, and W to denote random variables defined over P so that we can state
these results more generally than the specific notation we developed in Section 1 for testing
applications. We will also, whenever possible, state results conditionally given a subpopulation,
defined in terms of G, as in Section 1. We will use the notation, E(U| V, G), to denote the
conditional expectation, or mean, of U given the values of V and G, and the notation, Var(U| V,
G), for the corresponding conditional variance. Finally, Cov(U, V |W, G) denotes the conditional
covariance of U and V given W and G. We use P{U = u |V, G} to denote the conditional
probability distribution of U given V and G.
Theorem A: If U, V, and W denote random variables for which all of the following
conditional means, variances, and covariances are well-defined, then the following relationships
always hold:
(a) P{U = u | G} = E[ P{U = u | V, G}|G],
where the outer expectation is averaging P{U = u | V, G} over the conditional distribution
of V given G; this is repeated in parts (b)–(d).
(b) E(U | G) = E[E(U | V, G)| G],
i.e., the appropriate mean of conditional expectation is a (less) conditional expectation.
(c) Var(U | G) = E[Var(U | V, G)| G] + Var[E(U | V, G)| G],
i.e., the mean of the conditional variance plus the variance of the conditional mean, which
is the basis of all “within plus between” decompositions of a variance.
(d) Cov(U, V | G) =
E[Cov(U, V | W, G)| G] + Cov[E(U | W, G), E(V | W, G) | G],
i.e., this is a generalization of partc to covariances.
38
Appendix B
Best Linear Predictors
In this paper we use the idea of a BLP of one random variable by another as a way
around doing the more difficult computations of conditional means and variances. Suppose U
and V are two random variables, then the BLP of U from V is denoted by
L(U| V) = � + �V,
where � and ��are chosen to minimize
E(U – � – �V)2.
Hence a BLP is “best” only in the sense of minimizing the average prediction error. The
BLP may also be put into a form that is similar to the “everything is conditional on G form” used
in Theorem A, by having �G and �G chosen to minimize the quantity: E[(U – � – �V)2| G]. In
this case, we will denote the BLP by L(U |V, G).
The value of the minimized E[(U – � – �V)2 |G] is called the “average prediction error”
and it may be expressed as the average, over V given G, of the “conditional prediction error,”
E[(U – � – �V)2 |V, G]. Below we use a somewhat evocative, but nonstandard, notation for these
prediction error measures. 2
| ,UV G� (V) = E[(U – L(U|V, G))2|V, G], and
2| ,UV G� = E{ 2
| ,UV G� (V)|G} = E{E[(U – L(U|V, G))2|V, G]|G}
= E{[(U – L(U|V, G))2|G}.
Thus, 2| ,UV G� (V) measures how poorly L(U |V, G) predicts U for a given value of V,
while 2| ,UV G� is an average of this prediction error measure over the distribution of V given G. In
this sense, 2| ,UV G� (V) is analogous to the conditional variance of U given V and G while 2
| ,UV G�
is analogous to the (conditional given G) mean of the conditional variance of U given V and G.
It is important to point out that when the conditional expectation function, E(U |V, G) is
linear in V, it is identical to the BLP, L(U |V, G) because the conditional expectation is the best
predictor of any form, linear or nonlinear. In this case, 2| ,UV G� (V) is the conditional variance of
39
U given V and G and 2| ,UV G� is the (conditional given G) mean of the conditional variance of U
given V and G.
Theorem B summarizes some well-known and easily derived facts about the BLP.
Theorem B: If L(U |V, G) = �G��+ �GV is the BLP of U from V, in the sense of
minimizing E[(U – �G � �GV)2 |G], then �G� �G and the average squared prediction error 2| ,UV G�
have these values:
(a) �G = | |GU G V G� � �� ,
(b) �G = ||
|
U GUV G
V G
��
�,
and
(c) 2| ,UV G� = 2 2
| |(1 )U G UV G� �� .
In addition, the BLP has these relationships to the conditional moments of U given V:
(d) E[L(U| V, G)| G] = E[U| G],
i.e., the mean of the BLP is the mean of the predicted variable, U. This parallels Theorem
A, part (b).
(e1) 2| ,UV G� = E[Var(U| V, G)| G] + E[(E(U| V, G) – L(U| V, G))2| G],
= E[Var(U| V, G)| G] + Var[E(U| V, G) – L(U| V, G)| G],
or
(e2) E[Var(U| V, G)| G] = 2| ,UV G� – Var[E(U| V, G) – L(U| V, G)| G].
Part (e1) parallels Theorem A, part (c) in that it is like a “between and within” variance
decomposition. Part (e2) shows how the mean of the conditional variance of U can be expressed
in terms of the average squared prediction error and the variance of the difference between the
BLP and the corresponding conditional expectation.
The quantities given in Theorem B, parts (a) to (c), are exactly the same as the formulas
for the intercept, slope, and residual variance formulas that hold when the conditional
distribution of U given V has a linear conditional expectation function and constant conditional
variance. These conditions hold, for example, when U and V have a joint bivariate normal
distribution. However, the BLP is useful even when the conditional mean function is not linear
40
or when the conditional variance function is not constant, as is the case in most of the IRT
applications we consider here. Parts (d), (e1), and (e2) of Theorem B show the connection
between the BLP and the conditional mean and variance of the joint distribution of U and V.
These last three results motivate our notation of L(U| V, G) to mimic the conditional expectation
notation, E(U| V, G).