
A Strategy for Developing a Common Metric in Item Response Theory When Parameter Posterior Distributions Are Known

Peter Baldwin
National Board of Medical Examiners

Journal of Educational Measurement, Spring 2011, Vol. 48, No. 1, pp. 1–11

Growing interest in fully Bayesian item response models begs the question: To what extent can model parameter posterior draws enhance existing practices? One practice that has traditionally relied on model parameter point estimates but may be improved by using posterior draws is the development of a common metric for two independently calibrated test forms. Before parameter estimates from independently calibrated forms can be compared, at least one form's estimates must be adjusted such that both forms share a common metric. Because this adjustment is estimated, there is a propagation-of-error effect when it is applied. This effect is typically ignored, which leads to overconfidence in the adjusted estimates; yet, when model parameter posterior draws are available, it may be accounted for with a simple sampling strategy. In this paper, it is shown using simulated data that the proposed sampling strategy results in adjusted posteriors with coverage properties superior to those obtained using traditional point-estimate-based methods.

In recent years, improvements in computing power and estimation methodology—notably MCMC—have made the use of fully Bayesian item response theory (IRT) models feasible for operational problems. The benefits of these models have been widely discussed elsewhere (e.g., Patz & Junker, 1999a, 1999b; Wainer, Bradlow, & Wang, 2007) and their popularity is rising. Given this growing interest, it is sensible to consider how model parameter posterior draws may enrich procedures that have traditionally relied on model parameter point estimates. Detecting differential item functioning is one such procedure that has been shown to benefit when posterior draws are available (Wang, Bradlow, Wainer, & Muller, 2008). This paper shows that another procedure, developing a common metric for two independently calibrated test forms, has the potential for improvement when fully Bayesian IRT models are utilized.

Developing a Common Metric in Item Response Theory

When test items that measure a common latent trait are modeled using IRT, some parameters must be fixed to arbitrary values to identify the model. The consequence of this identification problem is that it cannot be assumed that independently estimated model parameters are expressed on a common metric. Thus, for many IRT applications, it is necessary to linearly adjust estimates from independent calibrations such that they share a common metric. In practice, the transformation constants used to make this adjustment can never be known exactly—they must be estimated.

The errors associated with these estimated constants have been an ongoing interest for measurement specialists, which has led to two profitable areas of research: (a) developing and comparing methods for estimating the transformation constants so that practitioners can estimate them with as little error as possible (e.g., Baker & Al-Karni, 1991; Haebara, 1980; Jodoin, Keller, & Swaminathan, 2003; Stocking & Lord, 1983) and (b) estimating the errors in the estimated transformation constants, either to compare various methods of estimating these constants or to verify that the errors are small enough to ignore (e.g., Baker, 1996; Ogasawara, 2000, 2001). Both of these lines of inquiry have provided reassurance that, for many applications, transformation error is tolerable.

Copyright © 2011 by the National Council on Measurement in Education

Of course, the extent to which transformation error can be ignored depends not only on its magnitude but also on the inferences being made from the transformed model parameter estimates. Furthermore, there are test designs that may affect the size and impact of error in the transformation constants—e.g., designs with small numbers of linking items or designs that link multiple forms sequentially so that transformation error may aggregate. For these reasons, it may be preferable to account for transformation error rather than to assume it can be ignored without consequence. In this paper I show that when posterior draws from the model parameters are available, a simple sampling strategy can correct for the propagation-of-error effects that arise when independently estimated parameters are placed on a common metric.

Current Practice

The customary strategy for developing a common metric in IRT when test forms are independently calibrated is to create a link of common model parameters (most often item parameters) across forms. Estimates for these common parameters are then used to estimate the linear relationship between the two arbitrarily defined scales. Once the transformation constants are estimated, they can be used to place both forms' parameter estimates onto a common scale; however, this approach has the disadvantage of ignoring estimation error in the transformation constants. Consequently, once transformed, the adjusted model parameter estimates incorporate whatever error was present in the transformation constants.

Various methods exist for estimating the transformation constants and it is not my goal to propose another in this paper. Instead, I propose a general strategy for developing a common metric that can be applied regardless of the method used for estimating transformation constants.

Developing a Common Metric When Posterior Draws Are Available

Suppose a testing program has two independently calibrated test forms, A and B, and that the estimates from form B must be put on form A's scale to make comparisons. Further, suppose that samples A and B cannot be assumed to be equivalent, but that a subset of common items appears on both forms. This is sometimes referred to as a NEAT (non-equivalent groups with anchor test) design. The parameter estimates for these common items can then be used to estimate the linear relationship between scale A and scale B. Although the choice of common metric is arbitrary, once the linear relationship is estimated it is typical to transform only one of the forms' estimates—e.g., this year's estimates are put onto last year's scale.

Suppose that instead of having only point estimates for each model parameter, posterior draws are available. There is no reason that the customary strategy for developing a common metric cannot still be applied using posterior draws. Here, the mean or median of each common parameter's posterior may be taken as its point estimate and then used to estimate the relationship between the scales. Once this relationship is estimated, scale B may be transformed such that it is common with scale A. Note, however, that everything on scale B must be adjusted. So, in addition to the point estimates, each form B posterior draw must now also be transformed. Thus, the customary strategy generalizes to the case when posterior draws are available as follows: (a) the best estimates of the common parameters are used to obtain the best estimates of the transformation constants and (b) the estimated transformation constants are applied to every posterior draw. This strategy may often produce very reasonable results; however, error in the estimated transformation constants remains unavoidable and ignored. As a result, the transformed posterior distributions are not expected to be diffuse enough to reflect the error that is introduced through the imperfect transformation.

But, suppose that instead of estimating the transformation constants once and then applying them to every draw, a single random multivariate draw is taken from form A's posterior and a single random multivariate draw is taken from form B's posterior. Taking each of these multivariate draws, transformation constants can be estimated using the sampled values associated with the common parameters. These estimated constants can then be used to transform form B's single random multivariate draw. Of course, this yields but one transformed draw from form B's multivariate posterior; however, this procedure can simply be repeated as many times as needed to produce the desired number of transformed posterior draws.

In this way, the proposed strategy yields a set of transformed posterior draws that are more diffuse than we would expect if a single transformation were applied to all the posterior draws. This increased uncertainty reflects the error that is added to the parameter estimates by the imperfect transformation.
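The repeated-sampling loop just described can be sketched in code. The following is a minimal illustration, not the author's implementation: the function names are invented, and the transformation constants are estimated from anchor-item difficulties with the mean-sigma method (the method the simulation study later adopts).

```python
import numpy as np

def mean_sigma(bA, bB):
    """Mean-sigma estimates of the slope (gamma) and intercept (eta) that
    carry scale B onto scale A, i.e., b_A = gamma * b_B + eta."""
    gamma = np.std(bA, ddof=1) / np.std(bB, ddof=1)
    eta = np.mean(bA) - gamma * np.mean(bB)
    return gamma, eta

def transform_draws_proposed(bA_draws, bB_draws, anchor, n_out, rng):
    """Proposed strategy: re-estimate the transformation constants from a
    freshly sampled pair of multivariate draws for every transformed draw.

    bA_draws, bB_draws -- (n_draws, n_items) difficulty draws per form
    anchor             -- indices of the common (anchor) items
    """
    out = np.empty((n_out, bB_draws.shape[1]))
    for r in range(n_out):
        drawA = bA_draws[rng.integers(len(bA_draws))]  # one draw from form A
        drawB = bB_draws[rng.integers(len(bB_draws))]  # one draw from form B
        gamma, eta = mean_sigma(drawA[anchor], drawB[anchor])
        out[r] = gamma * drawB + eta  # place this single draw on A's scale
    return out
```

Because the constants are re-estimated for every draw, their estimation error is propagated into the transformed draws, leaving the resulting posteriors appropriately more diffuse than a single fixed transformation would.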

Simulation Study

A small simulation study was conducted to demonstrate proof of concept for the proposed strategy. This study followed a very simple design: Let form A and form B be two parallel forms, each with 32 dichotomously scored items, 10 of which were common to both forms. For each of 5,000 replications, 2,000 multivariate posterior draws were simulated for each form and form B's draws were transformed to form A's scale three times: once using the customary strategy, once using the proposed strategy, and once using the true transformation constants, which, because the data were simulated, were known. The coverage properties of the resultant transformed form B posteriors were then compared.

Because the goal here is merely to demonstrate proof of concept, it is desirable to control for several confounding sources of error that arise in practice. For this reason, simulating data for this study, while not difficult, was somewhat atypical. The details of the simulation procedure are not central to the study and the relative length of the description far exceeds its relative importance. Therefore, readers interested in how the data were generated are referred to the appendix for a detailed description. Here, the discussion is limited to the characteristics of the simulated data set, not the process for producing it.


Posterior draws were simulated to resemble those obtained using real data from a high stakes 10th grade exit exam calibrated with 1,000 examinees using a fully Bayesian 3-parameter logistic IRT model (3PL; Birnbaum, 1968). The posterior distributions from which these draws were sampled had two properties useful for this study. The first is that prior to being transformed, posteriors had perfect coverage properties. By "perfect coverage properties" it is meant that a given posterior distribution was a true probability distribution for its associated parameter. Perfect coverage properties were obtained by defining the posterior probability distributions first and then sampling the (true) parameters from these distributions.

Further clarification on this point may be helpful. Here, a true probability distribution identifies the probability, without error, of its associated parameter falling within a given interval. All probability distributions have this property and thus, strictly speaking, the qualifier true is unnecessary; however, denoting these probability distributions as true distinguishes them from the estimated probability distributions that arise in practice, which are imperfect due to problems such as misspecified priors, estimation error, and model misfit. Simulating posteriors with perfect coverage properties eliminates these potentially confounding sources of error.

Given the existence of true parameters, it follows that there also exist true transformation constants that describe the relationship between the scales associated with forms A and B without error. This leads to the second useful (and perhaps self-evident) property of the posterior distributions simulated for this study: transformed posteriors had perfect coverage properties when the true transformation constants were used. Thus, imperfections in the observed coverage properties could be reasonably attributed to either (a) sampling error that arises from sampling the posteriors or (b) error in the transformation process.

Sampling error arises because the coverage properties of the posteriors were measured by examining the empirical coverage probabilities associated with the posterior draws (more on the evaluation criteria below). Because the number of sampled posterior draws is finite, the observed coverage properties are imperfect due to sampling error—despite perfect posteriors. Because the interest is in measuring error in the transformation process, the confound of this sampling error must be minimized. This was accomplished by aggregating results across subsets of model parameters and 5,000 replications (these subsets are described in greater detail in the next section). In this way, observed differences across transformation strategies could be reasonably attributed to differences in the strategies themselves.

For each replication, 2,000 adjusted form B multivariate posterior draws were produced using the true transformation constants and the two strategies described in the earlier sections:

1. Customary strategy: The mean of each anchor item difficulty parameter's associated set of 2,000 posterior draws was taken as its point estimate. These point estimates were used to estimate the transformation constants, which were then used to adjust all 2,000 form B multivariate posterior draws.

2. Proposed strategy: Two multivariate draws—one from each form's set of 2,000 (multivariate) posterior draws—were selected at random. The sampled difficulty values associated with the anchor items were used to estimate the transformation constants. These estimated constants were then used to adjust form B's multivariate posterior draw. This process was repeated 2,000 times, yielding 2,000 transformed multivariate posterior draws.

Table 1
Observed 95% Coverage Probabilities for Three Groups of Transformed Form B Posteriors

                                Customary   Proposed   True Transformation
Transformed Posterior Group     Strategy    Strategy   Constants (Criterion)
Simulee Proficiency (θ)           .93         .95           .95
Item Discrimination (a)           .84         .95           .95
Item Difficulty (b)               .89         .95           .95

For both strategies, transformation constants were estimated using the mean-sigma method (see Hambleton & Swaminathan, 1985; Hambleton, Swaminathan, & Rogers, 1991; or Kolen & Brennan, 2004, for details on this procedure).¹
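For readers unfamiliar with the mean-sigma method, a brief numerical sketch follows. The anchor difficulties and the helper function below are invented for illustration; the rescaling rules for θ, a, b, and c are the standard ones for the 3PL model.

```python
import numpy as np

def apply_transformation(theta, a, b, c, gamma, eta):
    """Carry 3PL parameters from scale B to scale A given b_A = gamma*b_B + eta."""
    return gamma * theta + eta, a / gamma, gamma * b + eta, c  # c is scale-free

# Mean-sigma: choose gamma and eta so the anchor-item difficulties have the
# same mean and standard deviation on both scales.
bA = np.array([-1.0, 0.0, 1.0])          # anchor difficulties on scale A
bB = np.array([-2.0, 0.0, 2.0])          # the same anchors on scale B
gamma = bA.std(ddof=1) / bB.std(ddof=1)  # slope
eta = bA.mean() - gamma * bB.mean()      # intercept
```

With these toy numbers, γ = .5 and η = 0, and applying the transformation to the scale B difficulties recovers the scale A values exactly.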

Evaluation Criteria

Three groups of transformed form B parameter posteriors were analyzed:²

1. Simulee proficiency: all proficiency (θ) parameters (1,000 simulees × 5,000 replications = 5,000,000 posteriors)

2. Item discrimination: all item discrimination (a) parameters (32 items × 5,000 replications = 160,000 posteriors)

3. Item difficulty: all item difficulty (b) parameters (32 items × 5,000 replications = 160,000 posteriors)

To evaluate the three transformation strategies—i.e., the customary strategy, the proposed strategy, and the true transformation constants—95% coverage probabilities were computed by calculating the proportion of posteriors in each group (i.e., the simulee proficiency, item discrimination, and item difficulty groups just described) that contained the true parameter within the middle 95% of its respective empirical cumulative distribution function.
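This coverage computation can be made concrete with a short sketch (function names invented; this is the generic empirical-interval check, not the author's code):

```python
import numpy as np

def covered(draws, true_value, level=0.95):
    """True if true_value lies inside the middle `level` portion of the
    empirical distribution of a parameter's posterior draws."""
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
    return lo <= true_value <= hi

def coverage_probability(draws_matrix, true_values, level=0.95):
    """Proportion of parameters whose true value is covered; draws_matrix
    has one row of posterior draws per parameter."""
    return np.mean([covered(d, t, level) for d, t in zip(draws_matrix, true_values)])
```

A well-calibrated set of posteriors should yield a coverage probability near the nominal .95; values below it indicate overconfident (too narrow) posteriors.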

Results

Table 1 shows the 95% coverage probability results for the three groups of transformed form B posteriors: simulee proficiencies, item discriminations, and item difficulties. As expected, when the true transformation constants were used, the expected coverage probability, .95, was observed. Moreover, for all groups of transformed posteriors, the coverage probabilities for the proposed strategy came closer to the criterion value of .95 than those for the customary strategy. These improvements ranged from .02 to .10.

Discussion and Summary

For all groups of posteriors, the proposed strategy outperformed the customary strategy, with coverage probabilities very close to those observed when the true transformation constants were used (differences were less than .01 for all groups). The customary strategy performed less well, underestimating the error in the transformed parameters and exhibiting coverage probabilities ranging from .84 to .93. These results were consistent with expectation: if transformation error is ignored, confidence in the resultant transformed estimates will be too high. The proposed strategy, by contrast, produced posteriors that were more diffuse on average, reflecting the uncertainty that is added to the parameter estimates when they are transformed imperfectly.

It is important to emphasize that the proposed strategy does not reduce (or increase) error in the transformed parameter point estimates. These point estimates remain essentially unaffected by choice of strategy; the benefit of the proposed strategy is that it has the potential to provide a more accurate measure of the confidence one should have in the point estimates. This can be important because the estimated errors associated with the model parameter estimates can affect the inferences that are made. For this reason, a procedure that accounts for transformation error should be of interest to a wide range of measurement specialists. The proposed strategy was intended to be such a procedure and it performed well under the limited conditions reported here, noticeably improving the coverage properties of the transformed posterior distributions compared to those associated with the customary approach.

As described earlier, for this study data were simulated based on a 3PL IRT calibration of a 32-item test administered to 1,000 examinees. Further, test forms were linked with only 10 anchor items. These conditions are not intended to be exhaustive—again, the goal here is merely to demonstrate proof of concept—and for many applications, more examinees, more items, and longer anchor tests will be available. When this is so, the transformation line will be better estimated and the advantage of the proposed strategy may be diminished.³ Still, even when this is so, notable improvements may be realized when multiple forms are linked sequentially. In such a context, transformation error, however small, aggregates, and failing to account for it could lead to faulty inferences graver than any individual transformation error would suggest.

Although the proposed strategy is straightforward and showed promise, the findings reported here were based on posterior distributions that were simulated without error. Simulating data in this manner allowed for proof of concept to be clearly established; however, several challenges that arise in operational contexts (e.g., misspecified priors, estimation error, and model misfit) could limit the generalizability of the reported findings. So, even though the findings presented here are supportive of further research into their generalizability, the extent to which they generalize to empirical data remains unknown.

Appendix
Data Simulation

In the main body of the paper, the characteristics of the simulated data were described. This description should be sufficient to understand the reported findings; however, the actual process of simulating the data used here may be unfamiliar to some readers and, in any case, may be of general interest.

Probably the most common simulation strategy for studies like this one involves starting with a set of true parameters, generating a matrix of model-based responses, and then estimating the posterior distributions of the model parameters. Such a strategy, although straightforward, was not feasible here because it does not ensure that the estimated posteriors have perfect coverage properties. Since perfect coverage properties were a requirement of this study, data were instead simulated by first defining a multivariate posterior distribution and then sampling from this distribution to define the set of true parameters. This strategy ensures that the posteriors are indeed probability distributions.

Thirty-two dichotomously scored items and 2,000 examinees were selected at random from a large-scale high school exit exam's operational data set. Examinees were divided into two random samples of 1,000, denoted A and B. These two data sets were then independently calibrated with the software SCORIGHT (Wang, Bradlow, & Wainer, 2004a) using the 3PL model and the default (or recommended) settings (see Wang, Bradlow, & Wainer, 2004b, for details). Included in SCORIGHT's output are posterior draws for all item and person parameters. Here, 2,000 such draws for each parameter were retained after burn-in and thinning.

Item parameter posterior distributions typically resemble one of three shapes: log-normal for discrimination (a) parameters, normal for difficulty (b) parameters, and logit-normal for pseudo-guessing (c) parameters. Proficiency parameter posterior distributions are generally approximately normal. Thus, if we let h_i = log(a_i) and q_i = log(c_i/(1 − c_i)), the multivariate posterior for each form's 1,096 parameters (1,000 proficiency parameters plus 32 items with 3 parameters each) can be approximated by a multivariate normal distribution. This distribution is described by a mean vector μ, the means of each parameter's posterior draws, and Σ, the variance-covariance matrix describing the variance of each parameter's posterior draws and the covariance between the draws for every pair of parameters:

(θ_1, ..., θ_1000, h_1, ..., h_32, b_1, ..., b_32, q_1, ..., q_32)′ ∼ N_1096(μ, Σ).   (A1)


Simulated posterior draws were sampled from each form's multivariate normal distribution for this study. In this way, the simulated data, while not the direct result of a calibration, resembled operational output.⁴
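A miniature version of this sampling step might look as follows. The dimensions, means, and covariance below are toy values chosen only to show the mechanics (the study used 1,096 parameters per form, with μ and Σ taken from actual SCORIGHT draws):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 3 examinees and 2 items instead of 1,000 and 32.
n_theta, n_items = 3, 2
dim = n_theta + 3 * n_items               # theta block + h, b, q blocks

mu = np.zeros(dim)                        # posterior means (illustrative)
cov = 0.1 * np.eye(dim)                   # posterior covariance (illustrative)

draws = rng.multivariate_normal(mu, cov, size=2000)

# Unpack the blocks and map h and q back to the a and c metrics.
theta = draws[:, :n_theta]
h = draws[:, n_theta:n_theta + n_items]
b = draws[:, n_theta + n_items:n_theta + 2 * n_items]
q = draws[:, -n_items:]
a = np.exp(h)                             # log-normal discriminations
c = 1.0 / (1.0 + np.exp(-q))              # logit-normal pseudo-guessing
```

Sampling h and q on transformed scales and then mapping back guarantees that the simulated a draws are positive and the c draws fall in (0, 1).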

For each replication and form, 2,001 multivariate posterior draws were sampled from each form's respective multivariate normal distribution just described. Although both forms A and B consisted of the same 32 items, for each replication only 10 items were selected (at random) for use as the anchor test. For the 22 nonanchor items, their 2,001st draw was defined as the true parameter. For the 10 anchor items, the 2,001st draw was only provisionally defined as the true parameter—an additional small adjustment needed to be applied to half of these interim values before they satisfied the requirements of this study.

Using these provisional anchor-item parameters, a transformation line was estimated using the mean-sigma method and treating form A's scale as the base metric. The choice of the mean-sigma method was arbitrary—the goal was merely to produce a plausible transformation line (indeed, replicating the entire study using Stocking and Lord's (1983) method produced nearly identical results). For the purpose of this study, this resultant transformation was then defined as the true transformation line with a slope of γ and an intercept of η. A dilemma arises at this point: the provisional anchor-item parameters do not fall exactly on the true equating line—and thus cannot be said to be invariant. While this lack of invariance is a natural consequence of the random mechanism by which all of these true values were defined, it must be remedied. Here, the remedy took the form of small corrections that restored the invariance property, thereby producing parameters that were identical across forms excepting the arbitrary difference in metric. These corrections are described as follows:

h*_{F,i} = h_{F,i} + offset_{F,h,i},   (A2)

b*_{F,i} = b_{F,i} + offset_{F,b,i},   (A3)

q*_{F,i} = q_{F,i} + offset_{F,q,i},   (A4)

where * indicates the corrected draw; h, b, and q are item parameters (or transformations of them, in the case of h and q, as described earlier); F is form (either A or B in this case); and i is a given anchor item. Offsets were computed as follows:

offset_{A,h,i} = h_{B,i} − log(γ) − h_{A,i},   (A5)

offset_{A,b,i} = γ b_{B,i} + η − b_{A,i},   (A6)

offset_{A,q,i} = q_{B,i} − q_{A,i},   (A7)

offset_{B,h,i} = h_{A,i} − log(γ⁻¹) − h_{B,i},   (A8)

offset_{B,b,i} = (b_{A,i} − η)/γ − b_{B,i},   (A9)

offset_{B,q,i} = q_{A,i} − q_{B,i}.   (A10)

For each anchor item i, one form, A or B, was selected at random and all 2,001 item parameter draws (i.e., the 2,000 simulated posterior draws plus the provisional anchor item parameter) were subject to the appropriate adjustments shown in Equations A2–A4. If this seems complex, note that each correction does nothing more than shift a parameter's associated simulated draws by a constant such that the adjusted provisional parameter falls on the transformation line. This adjustment imposes parameter invariance.

Modifying the common item parameters (and their associated posteriors) in this manner satisfies both conditions set forth in the body of the paper: (a) the posteriors remain perfect (adding a constant to both the simulated posterior draws and the provisional anchor item parameter has no consequence for the coverage properties of a given posterior) and (b) the transformed posteriors are also perfect (applying the true linear transformation to both the true parameters and their respective posteriors also has no consequence for the coverage properties).
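As a numerical illustration of these corrections, the snippet below applies the form A offsets (Equations A5–A7) to one hypothetical anchor item; the constants and parameter values are invented. After the shift, the pair of forms satisfies the invariance relations b_A = γ b_B + η and h_A = h_B − log(γ).

```python
import numpy as np

gamma, eta = 0.8, 0.3        # "true" transformation constants (illustrative)

# Provisional anchor-item values on each form (illustrative numbers).
hA, bA, qA = 0.10, -0.50, -1.20
hB, bB, qB = 0.25, 0.40, -1.00

# Equations A5-A7: shift form A's values so the item is invariant across forms.
hA_star = hA + (hB - np.log(gamma) - hA)   # now hA_star = hB - log(gamma)
bA_star = bA + (gamma * bB + eta - bA)     # now bA_star = gamma*bB + eta
qA_star = qA + (qB - qA)                   # now qA_star = qB
```

The same constant shift would also be applied to the item's 2,000 posterior draws, which changes their location but not their coverage properties.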

Acknowledgments

This paper builds on previous work I began at the University of Massachusetts. For their contributions to this earlier work, I would like to express my gratitude to Lisa A. Keller, Ronald K. Hambleton, and Erin M. Conlon. For their thoughtful comments on the current paper, I would like to thank Brian Clauser, Polina Harik, Michael Jodoin, and Howard Wainer of the National Board of Medical Examiners. I am also pleased to thank the National Board of Medical Examiners for supporting this work.

Notes

¹Certainly, there are other methods for estimating the transformation constants and some have been shown to outperform the mean-sigma approach under some conditions—e.g., loss function methods such as Haebara (1980) and Stocking and Lord (1983). Indeed, multiple methods were initially used for this study and the mean-sigma method did not perform best. Nevertheless, the purpose here is not to compare methods for estimating scaling constants, and the mean-sigma method, being both simple and perhaps most widely known, seemed a sensible choice.

²Missing from these analyses are results related to the coverage properties of the pseudo-guessing parameters. The reason for their exclusion is that these parameters are not subject to any scale adjustment because they are expressed on the probability metric, which is in no way arbitrary. Nevertheless, it should be noted that depending on how one goes about sampling the posteriors, the proposed strategy and customary strategy may produce slightly different results for these parameters due to sampling effects.

³For example, additional analyses showed that discrepancies between the transformed posteriors yielded by the customary strategy and those obtained using the true transformation constants were halved when all 32 items were anchor items.

⁴These log-normal and logit-normal distributions raise the question: why were the means of the posterior draws taken as the point estimates for the a and c parameters? Why not use the median or, better still, a = e^h and c = e^q/(1 + e^q)? The answer is twofold. First, for better or worse, SCORIGHT reports the means of the posterior draws as the point estimates, and these data were simulated to resemble SCORIGHT output. Second, all analyses were in fact replicated using a = e^h and c = e^q/(1 + e^q) in place of the posterior means of the a and c draws, and this was found to make no practical difference: all changes in observed coverage probabilities were between −.004 and .001.

References

Baker, F. B. (1996). An investigation of the sampling distributions of equating coefficients. Applied Psychological Measurement, 20, 45–57.

Baker, F. B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28, 147–162.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method (Iowa Testing Programs Occasional Papers, No. ITPOP27). Iowa City, IA: University of Iowa, Iowa Testing Programs.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Jodoin, M. G., Keller, L. A., & Swaminathan, H. (2003). A comparison of linear, fixed common item, and concurrent parameter estimation equating procedures in capturing academic growth. Journal of Experimental Education, 71, 229–250.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York, NY: Springer.

Ogasawara, H. (2000). Asymptotic standard errors of IRT equating coefficients using moments. Economic Review (Otaru University of Commerce), 5, 1–23.

Ogasawara, H. (2001). Standard errors of item response theory equating/linking by response function methods. Applied Psychological Measurement, 25, 53–67.

Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.

Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC for IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201–210.

Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York, NY: Cambridge University Press.

Wang, X., Bradlow, E. T., & Wainer, H. (2004a). SCORIGHT: A computer program for scoring tests built of testlets including a model for covariate analysis (Version 3.0) [Computer program]. Princeton, NJ: Educational Testing Service.

Wang, X., Bradlow, E. T., & Wainer, H. (2004b). User's guide for SCORIGHT (version 3.0): A computer program for scoring tests built of testlets including a module for covariate analysis. Princeton, NJ: Educational Testing Service; Philadelphia, PA: National Board of Medical Examiners.

Wang, X., Bradlow, E., Wainer, H., & Muller, E. (2008). A Bayesian method for studying DIF: A cautionary tale filled with surprises and delights. Journal of Educational and Behavioral Statistics, 33, 363–384.

Author

PETER BALDWIN is a Measurement Scientist, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104-3102; [email protected]. His primary research interests include psychometric methods.
