
Journal of Educational Measurement, Summer 2013, Vol. 50, No. 2, pp. 186–203

Estimation Methods for One-Parameter Testlet Models

Hong Jiao
University of Maryland

Shudong Wang and Wei He
Northwest Evaluation Association

This study demonstrated the equivalence between the Rasch testlet model and the three-level one-parameter testlet model and explored the Markov Chain Monte Carlo (MCMC) method for model parameter estimation in WINBUGS. The estimation accuracy from the MCMC method was compared with those from the marginalized maximum likelihood estimation (MMLE) with the expectation-maximization algorithm in ConQuest and the sixth-order Laplace approximation estimation in HLM6. The results indicated that the estimation methods had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates. Analyses of three real tests generally supported the findings from the simulation and indicated that the estimates for item difficulty and ability parameters were highly correlated across estimation methods.

Testlets are a commonly used format in large-scale assessments (Wainer & Kiely, 1987). Passage-based reading comprehension tests and scenario-based science tests are such examples in practice. Items constructed around a testlet may be subject to local item dependence (Yen, 1993). That is, an examinee's response to an item in a testlet may affect the probability of the examinee's response to another item in the testlet given the person and item parameters (Embretson & Reise, 2000; Hambleton & Swaminathan, 1985). At present, item response theory (IRT) models, which assume local item independence, are still prevalent in calibrating, equating, and scaling testlet-based large-scale assessments, though IRT models are not robust to local item dependence caused by testlets.

Testlet effects can be conceptualized from multiple perspectives: an interaction between a testlet and persons, multidimensionality, or contextual effects on items nested within a testlet. In accordance with these conceptualizations, different models have been proposed: the Bayesian random-effects testlet models (e.g., Bradlow, Wainer, & Wang, 1999), the Rasch testlet models (Wang & Wilson, 2005), and the three-level one-parameter testlet model (Jiao, Wang, & Kamata, 2005). Estimation methods differ for these testlet models. The marginalized maximum likelihood estimation (MMLE) method with the expectation-maximization (EM) algorithm was used to estimate parameters for the Rasch testlet models (Wang & Wilson, 2005); the sixth-order Laplace approximation (Laplace) method was explored for the three-level one-parameter testlet model (Jiao et al., 2005); and the Markov Chain Monte Carlo (MCMC) method was demonstrated for the two-parameter, three-parameter,

Copyright © 2013 by the National Council on Measurement in Education


and the graded-response IRT models (Wang, Bradlow, & Wainer, 2005). One study (He, Jiao, & Wang, 2007) compared the MMLE and the Laplace methods in terms of item parameter recovery. Little is known about the relative performance of these estimation methods.

This study first demonstrated the equivalence between the Rasch testlet model and the three-level one-parameter testlet model and explored parameter estimation of the one-parameter testlet model using the MCMC method in WINBUGS 1.4.3. Its performance was compared with two other estimation methods, the MMLE and Laplace methods, under simulation conditions. Model parameter estimates from the three estimation methods were evaluated in terms of estimation errors. An example of real data analysis was provided for further illustration.

One-Parameter Testlet Models

The Rasch testlet models (Wang & Wilson, 2005) were proposed for both dichotomous and polytomous items by viewing testlet effects as other dimensions in addition to the general dimension intended to be measured by a test. The models are set up as a special case of the multidimensional random coefficients multinomial logit model (Adams, Wilson, & Wang, 1997). This is essentially a Rasch version of a bi-factor model. The Rasch testlet model is expressed as

p_jdi = 1 / (1 + exp[−(θ_j − b_i + γ_jd(i))]),   (1)

where θ_j represents the general ability for person j, b_i is the difficulty for item i within testlet d, γ_jd(i) is the random-effects parameter associated with item i within testlet d for person j, or testlet-specific ability, and p_jdi is the probability of a correct response by person j on item i within testlet d. The magnitude of a testlet effect is represented by σ²_γ, which is the variance of the γ_jd(i) parameter. A person's general ability and a testlet-specific ability underlie the performance of an item within a testlet. This is essentially the same conceptualization as in the Bayesian random-effects testlet models (Bradlow et al., 1999).
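To make Equation 1 concrete, here is a minimal sketch in Python; the function and variable names are illustrative, not from the original study:

```python
import math

def p_correct(theta, b, gamma):
    """Rasch testlet model (Equation 1): logit(p) = theta - b + gamma."""
    return 1.0 / (1.0 + math.exp(-(theta - b + gamma)))

# With no testlet effect (gamma = 0) the model reduces to the Rasch model,
# so an examinee whose ability matches the item difficulty has p = .5.
p_rasch = p_correct(theta=0.5, b=0.5, gamma=0.0)

# A positive testlet-specific ability raises the probability of success.
p_testlet = p_correct(theta=0.5, b=0.5, gamma=1.0)
```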

A three-level one-parameter testlet model (Jiao et al., 2005) was proposed for dichotomous item responses. Testlet effects are conceptualized from the hierarchical generalized linear modeling perspective, where the contextual effect exerted by a testlet on the items associated with it is modeled into a three-level hierarchical generalized linear model for item analysis (Kamata, 2001). The Level 1 model is set up by expressing the log-odds of person j answering item i in testlet d using a linear regression equation that includes an intercept term as follows:

log(p_idj / (1 − p_idj)) = η_idj = β_0dj + Σ_{q=1}^{k−1} β_qdj X_qidj,   (2)

where p_idj is the probability that person j answers item i in testlet d correctly. X_qidj is the qth dummy variable for person j, with a value of 1 when q = i and 0 when q ≠ i, for item i within testlet d. The coefficient β_0dj is an intercept term, which is the reference item effect, and β_qdj is a coefficient associated with X_qidj, where q = 1, . . . , k − 1, and k is the total number of items on a test. The individual item effect


β_qdj is the unique item effect for item q relative to the reference item effect β_0dj. The probability that person j answers item i within testlet d correctly is expressed as

p_idj = 1 / (1 + exp[−η_idj]).

Level 2 models testlet effects as

β_0dj = γ_00j + u_0dj   and   β_qdj = γ_q0j,   (3)

where q = 1, . . . , k − 1, γ_00j is the fixed effect of the Level 1 intercept and u_0dj is a random effect of the Level 1 intercept. The random effect u_0dj can be conceptualized as an interaction between a testlet and a person. This is analogous to the person-specific testlet effect γ_jd(i) in Bradlow et al.'s (1999) formulation (see Equation 1). In (3), γ_q0j is the item-specific effect for the item with the qth dummy variable. It is assumed that u_0dj ∼ N(0, σ²_u), and σ²_u is analogous to σ²_γ in Bradlow et al. (1999), which represents the magnitude of testlet effects.

Level 3 models person effects. It decomposes γ_00j in the Level 2 model (Equation 3) into a fixed part and a random part. The random part is the person effect, which is equivalent to the person's ability. The effects for the items remain fixed in the Level 3 model, although they do not have to. Therefore, the Level 3 model is

γ_00j = π_000 + w_00j   and   γ_q0j = π_q00,   (4)

where q = 1, . . . , k − 1, and w_00j ∼ N(0, σ²_w). σ²_w is the variance of the ability distribution. This is the simplest version of the multilevel testlet model. This multilevel model easily can be extended to include covariates related to the characteristics of items, testlets, or persons, or even group variables one level higher in the hierarchy, by incorporating linear predictors into the model.

This multilevel testlet model can be shown algebraically to be equivalent to the Rasch testlet model of Wang and Wilson (2005) by combining Equations 2 through 4 into

p_jdi = 1 / (1 + exp[−(w_00j − (−π_000 − π_q00) + u_0dj)]),   (5)

where w_00j = θ_j, −π_000 − π_q00 = b_i, and u_0dj = γ_jd(i). Due to the equivalence of the Rasch testlet model (Wang & Wilson, 2005) and the three-level one-parameter testlet model (Jiao et al., 2005), this study uses one generic term—the one-parameter testlet model—to refer to these two different approaches to conceptualizing and modeling testlet effects. The equivalence assures that the comparison among different parameter estimation methods is meaningful and valid.
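The identity behind this equivalence can be checked numerically. The sketch below (function and variable names are hypothetical) maps the multilevel parameters onto the Rasch testlet parameters and confirms that both parameterizations give the same response probability:

```python
import math

def rasch_testlet_p(theta, b, gamma):
    """Equation 1: logit(p) = theta - b + gamma."""
    return 1.0 / (1.0 + math.exp(-(theta - b + gamma)))

def multilevel_p(w00j, pi000, piq00, u0dj):
    """Equation 5: logit(p) = w00j - (-pi000 - piq00) + u0dj."""
    return 1.0 / (1.0 + math.exp(-(w00j - (-pi000 - piq00) + u0dj)))

# Map the multilevel parameters onto the Rasch testlet parameters:
# theta = w00j, b = -pi000 - piq00, gamma = u0dj.
w00j, pi000, piq00, u0dj = 0.8, 0.3, -0.9, 0.4
p1 = rasch_testlet_p(theta=w00j, b=-pi000 - piq00, gamma=u0dj)
p2 = multilevel_p(w00j, pi000, piq00, u0dj)
# p1 and p2 agree to machine precision for any parameter values.
```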

Estimation Methods

Wang and Wilson (2005) demonstrated the use of ConQuest (Wu, Adams, & Wilson, 1998) to estimate the Rasch testlet model parameters using the MMLE with the EM algorithm. The details can be found in Wang and Wilson (2005). In general, the MMLE method treats person parameters and testlet parameters as random effects and item difficulty as a fixed effect. The marginal likelihood is formed by integrating out the random-effects parameters from the likelihood function. The integral


in the marginal likelihood is intractable. The MMLE with the EM algorithm approximates the integral with numerical integration techniques (Tuerlinckx et al., 2004). A numerical approximation to the likelihood then is maximized with respect to the fixed-effects parameters and the parameters related to the population densities of the random effects. The population densities for the random-effects parameters are assumed to be normal with a mean of zero and unknown variances. Constraints are set for the variance-covariance matrix where the variances are to be estimated and the covariances are zero, indicating no association among any pairs of random effects. The EM algorithm, an indirect maximization method, optimizes the likelihood function indirectly. In the EM algorithm, the random effects are considered missing data and form the complete data together with the observed item response data. In each EM cycle, the expected value of the complete-data log likelihood is computed given the observed data and the fixed-effects parameter estimates and variance estimates from the previous cycle. When this E-step is completed, the M-step follows to maximize the expected log likelihood. After the estimation of the item parameters, the mean vector, and the variance-covariance matrix, the estimates of the person ability parameters can be obtained from the mean vector of the marginal posterior distribution (the expected a posteriori estimates) or the maximum point of the conditional likelihood (the maximum likelihood estimates). Quadrature points are used for integration. ConQuest also adjusts the quadrature points based on the most recent estimates for better integration (Wang & Wilson, 2005).
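The kind of quadrature-based integration MMLE relies on can be illustrated with Gauss-Hermite quadrature. The sketch below marginalizes ability out of a single examinee's likelihood under a plain Rasch model; it is a simplified illustration of the numerical integration idea, not ConQuest's actual implementation:

```python
import numpy as np

def marginal_likelihood(responses, diffs, sigma=1.0, n_quad=21):
    """Marginal likelihood of one examinee's 0/1 responses under the
    Rasch model, integrating a N(0, sigma^2) ability out with
    Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    theta = np.sqrt(2.0) * sigma * nodes        # change of variables
    w = weights / np.sqrt(np.pi)                # weights sum to 1
    # Probability matrix: quadrature nodes x items.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - np.asarray(diffs)[None, :])))
    y = np.asarray(responses)[None, :]
    lik = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    return float(np.sum(w * lik))

diffs = [-0.5, 0.0, 0.5]
L = marginal_likelihood([1, 0, 1], diffs)
```

A useful sanity check is that the marginal likelihoods of all 2^3 response patterns sum to 1, since the quadrature integrates the normal density exactly for a constant integrand.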

Jiao et al. (2005) used HLM6 (Raudenbush, Bryk, Cheong, & Congdon, 2004) to estimate the three-level one-parameter testlet model, which is essentially a hierarchical generalized linear model, using the sixth-order Laplace approximation method. In the hierarchical generalized linear model framework, the integral in the marginal likelihood is not tractable. One possible solution is to approximate the integrand so that the approximating integral has a closed form. Both the sixth-order Laplace approximation method and the penalized quasi-likelihood (PQL; Breslow & Clayton, 1993) method can be used. In the sixth-order Laplace method, the log of the integrand over the random effects is approximated by a sixth-order Taylor series expansion about the maximum estimates of the random effects (Raudenbush, Yang, & Yosef, 2000), and the Laplace method is then used to integrate. The integrated likelihood is maximized using the Fisher scoring method, and the empirical Bayes estimates and the fixed-effects estimates then can be found jointly. The Laplace method is an accurate approximation with high computational efficiency (Raudenbush et al., 2000). The Laplace approximation method was developed to overcome bias in the model parameter estimates with binary outcomes in the hierarchical generalized linear model estimated with the PQL method, which was developed for parameter estimation in linear mixed models. Research has well documented that a considerable amount of downward bias exists in the PQL estimates of random effects in the hierarchical generalized linear model (Breslow & Lin, 1995; Diaz, 2007; Goldstein & Rasbash, 1996; Raudenbush et al., 2000; Rodriguez & Goldman, 1995, 2001). Further, Snijders and Bosker (1999) indicated that the estimates from the sixth-order Laplace approximation method were more accurate than those from the PQL. Thus, this study did not include the PQL method in the comparison.
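The core Laplace idea can be shown with the classical second-order version, which replaces the log-integrand with a quadratic expansion around its mode; the sixth-order variant used by HLM6 adds higher-order Taylor terms. This sketch (all names illustrative) approximates the marginal probability of a correct response with a normally distributed random effect and compares it to brute-force integration:

```python
import math

def laplace_integral(h, h2, u_hat):
    """Second-order Laplace approximation to the integral of exp(h(u)):
    exp(h(u_hat)) * sqrt(2*pi / -h''(u_hat)), where u_hat is the mode of h."""
    return math.exp(h(u_hat)) * math.sqrt(2.0 * math.pi / (-h2(u_hat)))

# Marginal probability of a correct response with a N(0, s^2) random
# effect u: integrate sigmoid(u) against the normal density.
s = 1.0

def sig(u):
    return 1.0 / (1.0 + math.exp(-u))

def h(u):   # log of the integrand
    return (math.log(sig(u)) - u * u / (2 * s * s)
            - math.log(s * math.sqrt(2 * math.pi)))

def h2(u):  # exact second derivative of h
    return -sig(u) * (1.0 - sig(u)) - 1.0 / (s * s)

u_hat = max((k / 1000.0 for k in range(-4000, 4001)), key=h)  # crude mode search
approx = laplace_integral(h, h2, u_hat)

# Brute-force reference value for the same integral.
du = 1e-4
exact = sum(math.exp(h(-8.0 + k * du)) * du for k in range(int(16.0 / du)))
```

For this integrand the true value is exactly .5 by symmetry, and even the second-order approximation lands within about 1% of it.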


Wang et al. (2005) developed the SCORIGHT software to estimate the model parameters using the MCMC algorithm for the two-parameter and three-parameter dichotomous testlet models and the testlet version of the graded-response model (Samejima, 1969). However, their study and software did not include a testlet model related to the one-parameter/Rasch testlet model and its estimation. The current study developed WINBUGS code for estimating the model parameters using the MCMC algorithm. This Bayesian estimation method is computationally intensive, but the model parameter estimation is less complicated. In the MCMC method, every parameter is treated as a random effect. Prior distributions are specified for all model parameters. Given the priors and the likelihood of the observed item responses, the posterior distributions are obtained and used for next-stage sampling. In general, the MCMC procedure avoids numerical integration by taking samples from the posterior distributions using sampling procedures such as Gibbs sampling. The advantages of the Bayesian procedure include a full assessment of the uncertainty in model parameter estimation. The drawbacks are the intensive computation time and the need to assess convergence.

Methods

The MCMC algorithm in WINBUGS 1.4.3 (Lunn, Thomas, Best, & Spiegelhalter, 2000) was explored to estimate parameters for the one-parameter testlet model. The implementation of MCMC in WINBUGS relies on the priors for the parameters in the one-parameter testlet model and the item response data to obtain the posterior. In the current study, (1) can be expressed as follows in the Bayesian framework:

y_jdi ∼ Bernoulli(p_jdi),

logit(p_jdi) = θ_j − b_i + γ_jd(i),   i = 1, 2, . . . , I,   d = 1, 2, . . . , T,   j = 1, 2, . . . , J,

b_i ∼ N(0, 1),   θ_j ∼ N(0, σ²_θ),   and   γ_jd(i) ∼ N(0, σ²_γ).

These specify the distributions for the item difficulty (b), ability (θ), and testlet effect (γ) parameters. The priors for the item difficulty parameters follow a standard normal distribution. The priors for both the ability and testlet effect parameters are normally distributed with a mean of 0 and their respective variances. The inverse of the variance for both parameters follows a gamma distribution with both scale and shape parameters specified as 1. The inverse gamma distribution is a noninformative but proper prior distribution for the variance of random effects and is favored for its conditional conjugacy: the prior and the posterior distributions are from the same family. These priors lead to the joint posterior distributions for the set of sampled parameters

S = {θ_j, σ²_θ, b_i, γ_jd(i), σ²_γ}.   (6)

This also can be written as

P(S | U) ∝ L(u, θ_j, b_i, γ_jd(i)) P(θ_j | 0, σ²_θ) P(σ²_θ) P(b_i) P(γ_jd(i) | 0, σ²_γ) P(σ²_γ).   (7)
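Although the study's sampler was written in WINBUGS, the mechanics of sampling from such a posterior can be illustrated with a toy Metropolis-within-Gibbs sampler in plain Python. Everything below (function names, proposal scales, the fixed prior variances, treating item difficulties as known) is a simplified stand-in for illustration, not the study's code:

```python
import math
import random

random.seed(1)

def logit_p(theta, b, gamma):
    """Equation 1 response probability."""
    return 1.0 / (1.0 + math.exp(-(theta - b + gamma)))

def log_lik(theta, gammas, b, y, testlet_of):
    """Log likelihood of one examinee's 0/1 responses y; testlet_of[i]
    gives the testlet index of item i."""
    ll = 0.0
    for i, yi in enumerate(y):
        p = logit_p(theta, b[i], gammas[testlet_of[i]])
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll

def sample_person(y, b, testlet_of, n_iter=2000, burn=500):
    """Metropolis-within-Gibbs draws of one examinee's ability and
    testlet effects with item difficulties fixed, returning the
    posterior-mean (EAP) ability estimate."""
    theta, gammas = 0.0, [0.0] * (max(testlet_of) + 1)
    draws = []
    for it in range(n_iter):
        # Metropolis step for theta with a N(0, 1) prior.
        prop = theta + random.gauss(0.0, 0.5)
        log_r = (log_lik(prop, gammas, b, y, testlet_of) - prop ** 2 / 2
                 - log_lik(theta, gammas, b, y, testlet_of) + theta ** 2 / 2)
        if math.log(random.random()) < log_r:
            theta = prop
        # Metropolis step for each testlet effect with a N(0, .25) prior.
        for d in range(len(gammas)):
            gprop = list(gammas)
            gprop[d] = gammas[d] + random.gauss(0.0, 0.5)
            log_r = (log_lik(theta, gprop, b, y, testlet_of) - gprop[d] ** 2 / 0.5
                     - log_lik(theta, gammas, b, y, testlet_of) + gammas[d] ** 2 / 0.5)
            if math.log(random.random()) < log_r:
                gammas = gprop
        if it >= burn:
            draws.append(theta)
    return sum(draws) / len(draws)

# Six items of middling difficulty in two testlets.
b = [0.0] * 6
testlet_of = [0, 0, 0, 1, 1, 1]
eap_high = sample_person([1, 1, 1, 1, 1, 0], b, testlet_of)  # mostly correct
eap_low = sample_person([0, 0, 0, 0, 1, 0], b, testlet_of)   # mostly incorrect
```

A full sampler would additionally draw every person's parameters, the item difficulties, and the variance components (the latter via the conjugate inverse-gamma updates implied by the priors above), which is what WINBUGS automates.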


Simulation Study

To compare the three estimation methods in recovering the true parameters of the one-parameter testlet model, a simulation study was conducted. The true values of the person ability parameter were randomly generated from a standard normal distribution with a sample size of 1,000. The true values for the item difficulty parameters also were randomly generated from a standard normal distribution. The test structure was simulated mimicking a K-12 large-scale reading test with 54 multiple-choice items. The test consisted of six testlets with nine items in each testlet. For different study conditions, the true ability and item difficulty parameters remained the same, while the testlet effects varied.

Testlet parameters γ_jd(i) were simulated from a normal distribution N(0, σ²_γ) by specifying the testlet variance σ²_γ at four levels: 0, .25, .5625, and 1, which represent no, small, moderate, and large testlet effects, respectively (Wang & Wilson, 2005). When σ²_γ = 0, the one-parameter testlet model reduces to the Rasch model, and the Rasch model was fitted to the data.

Item responses were generated by incorporating the true ability, item difficulty, and testlet parameters into the Rasch testlet model in (1). Once item responses were generated, the three estimation methods were applied to recover the true model parameters. ConQuest (Wu, Adams, Wilson, & Haldane, 2007) was used to implement the MMLE procedure; HLM6 software (Raudenbush et al., 2004) was used for the Laplace approximation method. The MCMC approach was implemented in WINBUGS 1.4.3 (Lunn et al., 2000). Since the MCMC algorithm is within the Bayesian framework, the expected a posteriori estimation method was chosen in ConQuest to get person parameter estimates, and the empirical Bayes estimators for person ability parameters were used for the Laplace method in HLM6, to make valid comparisons of ability estimates.
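The generating design described above can be sketched as follows; this is an illustrative reimplementation of the stated design (1,000 examinees, six testlets of nine items), not the authors' code:

```python
import math
import random

random.seed(42)

def simulate_responses(n_persons=1000, n_testlets=6, items_per_testlet=9,
                       testlet_var=0.25):
    """Generate dichotomous responses from the Rasch testlet model in
    Equation 1, mirroring the 54-item, six-testlet simulated design."""
    sd = math.sqrt(testlet_var)
    thetas = [random.gauss(0, 1) for _ in range(n_persons)]  # abilities
    bs = [random.gauss(0, 1)                                 # difficulties
          for _ in range(n_testlets * items_per_testlet)]
    data = []
    for theta in thetas:
        # Each person draws one testlet-specific effect per testlet.
        gammas = [random.gauss(0, sd) for _ in range(n_testlets)]
        row = []
        for i, b in enumerate(bs):
            d = i // items_per_testlet
            p = 1.0 / (1.0 + math.exp(-(theta - b + gammas[d])))
            row.append(1 if random.random() < p else 0)
        data.append(row)
    return thetas, bs, data

thetas, bs, data = simulate_responses()
```

Setting `testlet_var=0` reproduces the null condition, where the data follow a plain Rasch model.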

Simulation conditions were obtained by fully crossing the levels of testlet effects and estimation methods. Within each condition, 25 replications were implemented due to the time-intensive analyses. For some estimation methods, one replication could last up to 2 days. Based on Harwell, Stone, Hsu, and Kirisci (1996), 25 replications should be sufficient for IRT simulation studies. In addition, a post hoc check of the standard errors of the means for the ability and item difficulty parameter estimates indicated that as sample size increased, the standard error of the mean decreased, and the diminishing rate flattened out around a sample size of 15 to 20. Further, the magnitudes of the estimation bias (between −.075 and .073) in the item difficulty parameters were very small—only about 3%—as compared to the range of the generated true values (between −2.42 and 2.92). Compared to what was reported in Wang and Wilson (2005), i.e., magnitudes of bias in item difficulty estimates for dichotomous items between −.063 and .050 for generating values between −2.00 and 2.00 over 100 replications, the estimation biases in the present study were not substantially larger. The diminishing effects of sample size and the number of replications on estimation errors are summarized in an appendix available from the authors. These results in general supported the use of 25 replications in the current study.

The three estimation methods were compared in recovering the true testlet variances and the ability variance by referencing the estimates to the true values used for


data generation over replications. The three estimation methods were further compared and evaluated in terms of bias (systematic error), standard error (SE, random error), and root mean squared error (RMSE, total error) in item and person parameter estimation. Bias, SE, and RMSE were computed based on (8), (9), and (10), respectively:

Bias(β̂) = (1/N) Σ_{r=1}^{N} (β̂_r − β),   (8)

SE(β̂) = sqrt[(1/N) Σ_{r=1}^{N} (β̂_r − β̄)²],   (9)

RMSE(β̂) = sqrt[(1/N) Σ_{r=1}^{N} (β̂_r − β)²],   (10)

where β is the true model parameter, β̂_r is the estimated model parameter for the rth replication, β̄ is the mean of the estimated model parameters over replications, and N is the number of replications.
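Equations 8 through 10 translate directly into code, and they satisfy the standard decomposition RMSE² = Bias² + SE², which is a useful check. A minimal sketch (names illustrative):

```python
import math

def bias(estimates, true_value):
    """Equation 8: mean signed deviation of the estimates from the truth."""
    return sum(e - true_value for e in estimates) / len(estimates)

def se(estimates):
    """Equation 9: spread of the estimates around their own mean."""
    m = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - m) ** 2 for e in estimates) / len(estimates))

def rmse(estimates, true_value):
    """Equation 10: root mean squared deviation from the truth."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates)
                     / len(estimates))

# Toy check: three replications of an estimate of a parameter whose true
# value is 1.0. The three errors satisfy RMSE^2 = Bias^2 + SE^2.
est, truth = [0.9, 1.1, 1.3], 1.0
```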

In addition, univariate two-way analyses of variance were run with two factors, testlet effects (four levels) and estimation methods (three levels), to determine whether observed differences across simulation conditions in parameter recovery were statistically significant. Bias, SE, and RMSE in the model parameter recovery were used as the dependent variables, respectively. Multivariate analysis of variance intentionally was not used, to avoid difficulty in interpreting the results. Effect sizes were computed to gauge the practical significance of the investigated effects.

Convergence

For the MCMC estimation method programmed in WINBUGS, several criteria were applied to evaluate convergence. Both dynamic trace lines and time series lines indicated that four chains with very divergent starting values achieved convergence before 1,000 iterations for the Rasch model (where the testlet effect was zero) and before 2,000 iterations for the one-parameter testlet model. The quantile plots showed relatively convergent and smooth lines for the four chains. A Gelman-Rubin statistic (R; as modified by Brooks and Gelman, 1998) smaller than 1.05 generally supports convergence (Lunn et al., 2000). A sample check over replications indicated that R generally was close to 1 and smaller than 1.05. The Brooks-Gelman ratio diagnostic plots indicated that stability and convergence usually occurred before 1,000 iterations for the Rasch model and 2,000 iterations for the one-parameter testlet model. Thus, 1,000 burn-in iterations were used for the conditions with null testlet effects and 2,000 for the other conditions.
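The Gelman-Rubin statistic that WINBUGS reports can be computed by hand. This sketch implements the basic (unmodified) potential scale reduction factor and shows that well-mixed chains give values near 1, while a chain stuck in a different region inflates R well above the 1.05 cutoff:

```python
import random

def gelman_rubin(chains):
    """Basic potential scale reduction factor for m equal-length chains:
    values near 1 (e.g., below 1.05) are consistent with convergence."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)           # within-chain
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n   # pooled posterior variance estimate
    return (var_hat / W) ** 0.5

random.seed(0)
# Four chains sampling the same N(0, 1) target vs. one chain stuck at mean 3.
mixed = [[random.gauss(0, 1) for _ in range(500)] for _ in range(4)]
stuck = [[random.gauss(mu, 1) for _ in range(500)] for mu in (0, 0, 0, 3)]
r_mixed = gelman_rubin(mixed)   # close to 1
r_stuck = gelman_rubin(stuck)   # well above 1.05
```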


Figure 1. Estimation bias in testlet variance.

Real Data Analysis

The real data were obtained from a K-12 large-scale reading comprehension test battery. The test battery was designed to assess students' reading achievement from Grades 3 to 11. The Grades 9 to 11 tests each consisted of 54 items, nine items in each of six passages. The sample sizes for the three grades were 5,004, 3,676, and 2,831, respectively. Yen's Q3 (1984), one of the most frequently used indices for detecting local item dependence, first was computed to detect testlet effects in the data sets. The original test battery was calibrated using the Rasch model. In this study, each test was fitted to the one-parameter testlet model using each of the three estimation methods. The estimated model parameters were compared.1
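Yen's Q3 is the correlation, across examinees, between the residuals of two items, where a residual is the observed response minus the fitted model's expected probability. A sketch with simulated data (the simulation setup and all names are illustrative, not from the study's data):

```python
import math
import random

def rasch_residuals(ys, thetas, b):
    """Observed-minus-expected residuals under a fitted Rasch model."""
    return [y - 1.0 / (1.0 + math.exp(-(t - b))) for y, t in zip(ys, thetas)]

def q3(resid_a, resid_b):
    """Yen's Q3: Pearson correlation of two items' residuals."""
    n = len(resid_a)
    ma, mb = sum(resid_a) / n, sum(resid_b) / n
    cov = sum((a - ma) * (c - mb) for a, c in zip(resid_a, resid_b))
    va = sum((a - ma) ** 2 for a in resid_a)
    vb = sum((c - mb) ** 2 for c in resid_b)
    return cov / math.sqrt(va * vb)

random.seed(3)
thetas = [random.gauss(0, 1) for _ in range(2000)]
gammas = [random.gauss(0, 1) for _ in range(2000)]  # shared testlet effect

def draw(t, b, g=0.0):
    return 1 if random.random() < 1.0 / (1.0 + math.exp(-(t - b + g))) else 0

# Items 1 and 2 share a testlet effect the Rasch model ignores; item 3 does not.
y1 = [draw(t, 0.0, g) for t, g in zip(thetas, gammas)]
y2 = [draw(t, 0.0, g) for t, g in zip(thetas, gammas)]
y3 = [draw(t, 0.0) for t in thetas]
r1, r2, r3 = (rasch_residuals(y, thetas, 0.0) for y in (y1, y2, y3))
q3_within = q3(r1, r2)    # elevated: local dependence within the testlet
q3_between = q3(r1, r3)   # near zero
```

Q3 values well above the near-zero baseline for item pairs within the same passage flag the local dependence that motivates fitting a testlet model.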

Results

Simulation Study2

Testlet variance recovery. The estimation biases of the testlet variance are summarized in Figure 1. When the testlet effect was small, the MMLE and Laplace methods recovered the true value well while the MCMC method slightly overestimated it. For the simulation conditions with moderate and large testlet effects, the Laplace and MCMC estimation methods recovered the true variance with less average bias than MMLE. The MMLE method underestimated the true variance. A univariate two-way analysis of variance was conducted with bias as the dependent variable and estimation methods and testlet effects (excluding the null condition) as factors. The results indicated that estimation methods (F(2, 216) = 23.677, p = .000), testlet variance magnitudes (F(2, 216) = 23.933, p = .000), and their interaction (F(4, 216) = 6.278,


Figure 2. Average bias in the ability variance estimation.

p = .000) all had significant effects on the bias in the testlet variance estimation. The effects were large for estimation methods (f = .47) and testlet variance magnitudes (f = .47) but moderate for their interaction (f = .34). The post hoc Tukey procedure indicated that the biases between pairwise estimation methods were significantly different except between Laplace and MCMC, and that the biases between different testlet effects were all significantly different. In general, the Laplace method produced the smallest bias. MMLE performed similarly to the Laplace method when the testlet effect was small. MCMC performed similarly to the Laplace method when the testlet effect was moderate or large. The Laplace and the MCMC estimation methods were grouped into a homogeneous set, indicating that they performed similarly and in general recovered the true testlet variance better.

Ability variance recovery. The true ability variance remained 1.0 across all simulation conditions. Figure 2 summarizes the bias in the ability variance estimates. In general, all three estimation methods overestimated the true ability variance. The deviations from the true value for MMLE, Laplace, and MCMC were rank ordered from least to most. A univariate two-way analysis of variance indicated that only estimation methods (F(2, 288) = 6.694, p = .001) significantly affected the bias in the ability variance estimation, with a small effect size (f = .22). The MMLE method had the smallest bias except in the null condition, where the bias differences among the three methods were not discernible. The Tukey comparisons indicated that the pairwise bias comparisons among the estimation methods were statistically significant except the one between the Laplace and the MCMC estimation methods (which were grouped into a homogeneous set).


Table 1
Standard Error in Ability Parameter Estimation

Testlet    Estimation  Sample                                 Std.
variance   method      size     Minimum   Maximum   Mean     deviation
Null       MMLE        1,000    .1621     .4210     .2882    .0429
           Laplace     1,000    .1753     .4140     .2810    .0414
           MCMC        1,000    .1788     .4239     .2878    .0428
Small      MMLE        1,000    .1713     .4660     .2825    .0431
           Laplace     1,000    .1677     .3874     .2652    .0385
           MCMC        1,000    .1763     .4196     .2804    .0412
Moderate   MMLE        1,000    .1471     .4620     .2766    .0441
           Laplace     1,000    .1435     .3665     .2498    .0369
           MCMC        1,000    .1427     .4142     .2702    .0436
Large      MMLE        1,000    .1257     .5098     .2745    .0468
           Laplace     1,000    .1447     .3312     .2353    .0346
           MCMC        1,000    .1525     .3920     .2620    .0403

Ability parameter recovery. Neither estimation method nor testlet variance magnitude significantly impacted the bias in the ability parameter estimates. This might be due to constraining the mean of the ability parameters to 0 for scale identification and scale comparability. However, the variability in the bias of the ability parameter estimates increased as the testlet variance increased. This pattern was consistent across estimation methods.

The Laplace method produced the smallest random errors (SE) in the ability estimates (see Table 1). As the testlet variance increased, the random error decreased for all methods. A possible explanation for this might be the reduced difficulty in detecting larger testlet effects. The MMLE method yielded the largest random error across all four levels of testlet effects (see Table 1). The analysis of variance indicated that the effects of testlet variance (F(3, 11988) = 325.06, p = .000), estimation method (F(2, 11988) = 266.691, p = .000), and their interaction (F(6, 11988) = 27.768, p = .001) all had a significant impact on the random error, with small (f = .23), moderate (f = .26), and small (f = .12) effect sizes, respectively. The Tukey HSD tests indicated that all pairwise comparisons were significant for both factors.

Estimation methods did not significantly impact the total error measured by RMSE. As testlet effects increased, the total error and its variability also increased (see Table 2). The analysis of variance indicated that only testlet effect magnitudes (F(3, 11988) = 541.709, p = .000) significantly affected the total error of estimation, with a moderate effect size (f = .37). The Tukey procedure indicated that each pairwise comparison of the total error between different levels of testlet effects was significant.

Item difficulty parameter recovery. The analysis of variance results indicated that estimation methods (F(2, 636) = 12.8, p = .000) and testlet effect magnitude (F(3, 636) = 5.789, p = .001) significantly affected the bias in the item parameter estimation (see Table 3), with small effect sizes (f = .20 and f = .17, respectively). The


Table 2
Root Mean Squared Error in Ability Parameter Estimation

Testlet    Estimation  Sample                                 Std.
variance   method      size     Minimum   Maximum   Mean     deviation
Null       MMLE        1,000    .1815     .8873     .3095    .0639
           Laplace     1,000    .1762     .9684     .3090    .0710
           MCMC        1,000    .1809     .9005     .3092    .0640
Small      MMLE        1,000    .1779     1.1196    .3584    .1005
           Laplace     1,000    .1720     1.2486    .3555    .1140
           MCMC        1,000    .1807     1.1085    .3565    .0993
Moderate   MMLE        1,000    .1667     1.3453    .4035    .1449
           Laplace     1,000    .1584     1.5049    .3977    .1623
           MCMC        1,000    .1619     1.2872    .3980    .1449
Large      MMLE        1,000    .1665     1.5324    .4551    .1921
           Laplace     1,000    .1566     1.7430    .4430    .2118
           MCMC        1,000    .1673     1.5478    .4452    .1917

Table 3
Bias in Item Difficulty Parameter Estimation

Testlet    Estimation  Sample
variance   method      size    Minimum   Maximum   Mean    Std. deviation
Null       MMLE        1,000   −.0236     .0559    .0082   .0157
           Laplace     1,000   −.0222     .0570    .0091   .0157
           MCMC        1,000   −.0751     .0465   −.0020   .0179
Small      MMLE        1,000   −.0398     .0481    .0108   .0179
           Laplace     1,000   −.0400     .0473    .0105   .0175
           MCMC        1,000   −.0529     .0437    .0022   .0211
Moderate   MMLE        1,000   −.0523     .0410    .0023   .0197
           Laplace     1,000   −.0385     .0487    .0117   .0183
           MCMC        1,000   −.0694     .0491    .0023   .0241
Large      MMLE        1,000   −.0362     .0726    .0178   .0246
           Laplace     1,000   −.0264     .0644    .0162   .0208
           MCMC        1,000   −.0740     .0714    .0066   .0299

Tukey HSD tests showed that the MCMC estimates were significantly smaller than the MMLE and Laplace estimates, which formed a homogeneous set. Regarding the testlet effects, two pairwise comparisons, one between the null and large levels and the other between the large and moderate levels, were significantly different.

Only the testlet magnitude significantly impacted the random error in the item difficulty parameter estimation (F(3, 636) = 2.666, p = .000) with a small effect size (f = .11). The Tukey test indicated no significant pairwise comparison for either factor. Also, the total error in the item difficulty parameter estimation was only significantly impacted by the testlet magnitude (F(3, 636) = 7.361, p = .000) with a small effect size (f = .19). The Tukey procedure indicated significant pairwise comparisons between the large testlet effect and each of the other three levels.


Table 4
Q3 and MCMC Results for the Real Data

                        Q3                                   MCMC
           Grade 9      Grade 10     Grade 11     Grade 9      Grade 10     Grade 11
           (N = 5,004)  (N = 3,673)  (N = 2,831)  (N = 5,004)  (N = 3,673)  (N = 2,831)
Within      .0200        .0161        .0374        .2639        .2572        .3993
Between    −.0254       −.0249       −.0284        /            /            /
Total      −.0186       −.0187       −.0185        /            /            /
Testlet1    .0339        .0171        .0211        .3632        .2392        .1855
Testlet2    .0049        .0010        .0258        .1480        .1550        .2366
Testlet3    .0325        .0105        .0135        .3788        .2048        .1707
Testlet4    .0041       −.0031        .0155        .1360        .1477        .1847
Testlet5    .0157        .0210        .0755        .2153        .2890        .8108
Testlet6    .0290        .0504        .0733        .3421        .5072        .8076

Note. The bold italicized numbers indicate testlets with small or moderate testlet effects as simulated.

In summary, the estimation methods significantly affected the bias in the ability variance and testlet variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation, with at least a small effect size. The magnitude of testlet effects significantly impacted all evaluation criteria except the bias in ability parameter estimates. The interaction between the estimation methods and the magnitude of testlet effects significantly affected only the bias in the testlet variance estimation and only the random error in the ability parameter estimation.

Real Data Analyses

The real data sets were collected from a K-12 large-scale reading comprehension test battery. Yen's (1984) Q3 was first computed to detect the existence of local item dependence in the data sets. There were 54 items with six testlets and nine items for each passage. The expected value of the Q3 statistic is −1/(n − 1) = −.0189 when there is no testlet effect, where n is the number of items. There were a total of 1,431 pairwise Q3 statistics, 216 within-testlet Q3 statistics, and 1,215 between-testlet Q3 statistics. The results are summarized in Table 4.
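
Yen's Q3 is the correlation, over examinees, of the model residuals u − P(θ̂) for each item pair. A minimal sketch under a fitted Rasch model follows; the function names are illustrative, and the article used its own implementation.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the Rasch model, logit = theta - b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def q3_matrix(responses, theta, b):
    """Yen's Q3: item-pair correlations of model residuals.

    responses: (persons, items) 0/1 matrix
    theta:     (persons,) ability estimates
    b:         (items,) difficulty estimates
    """
    resid = responses - rasch_prob(theta[:, None], b[None, :])
    return np.corrcoef(resid, rowvar=False)  # items are the variables
```

Within-testlet entries of the resulting matrix that sit well above the expected value of −1/(n − 1) flag local item dependence, which is how the within- and between-testlet averages in Table 4 are read.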

The average within-testlet Q3 was relatively large while the average between-testlet Q3 was relatively small compared to the expected value of −.0189. For Grade 9, no particular testlet showed a very large average within-testlet Q3. For Grade 10, the last reading passage displayed a relatively larger within-testlet Q3. The last two passages in the Grade 11 test showed the largest within-testlet Q3 magnitudes. The MCMC results for the same data indicated three testlets with small³ testlet effects in Grade 9, two with small testlet effects in Grade 10, and two with moderate⁴ testlet effects in Grade 11.

Further, the same data were analyzed using the MMLE and Laplace estimation

methods. The ability variance and the testlet variance estimates are summarized in Table 5. The ability variance estimates from all three methods were close to 1 with slight variation. The Laplace method provided an overall estimate of the testlet


Table 5
Ability Variance and Testlet Variance Estimates Using Three Estimation Methods

Grade  Method   Ability  Testlet1  Testlet2  Testlet3  Testlet4  Testlet5  Testlet6  Average
9      MMLE     1.0350    .3520     .1070     .3580     .1150     .2240     .3050     .2435
       Laplace  1.0197    /         /         /         /         /         /         .2439
       MCMC     1.1270    .3632     .1480     .3788     .1360     .2153     .3421     .2639
10     MMLE      .9950    .1700     .0870     .1380     .0510     .2670     .4890     .2003
       Laplace  1.0080    /         /         /         /         /         /         .2079
       MCMC      .9322    .2392     .1550     .2048     .1477     .2890     .5072     .2572
11     MMLE     1.0580    .1430     .2650     .1440     .2010     .7390     .8630     .3925
       Laplace  1.0845    /         /         /         /         /         /         .3605
       MCMC     1.0650    .1855     .2366     .1707     .1847     .8108     .8076     .3993

Table 6
Summary of Item Difficulty Estimates

Grade  Method   N    Minimum   Maximum   Mean     Std. deviation
9      MMLE     54   −2.4661    .5073    −.5922   .6658
       Laplace  54   −2.4315    .5024    −.5927   .6611
       MCMC     54   −2.4334    .5026    −.5890   .6588
10     MMLE     54   −1.8453    .5545    −.5660   .5248
       Laplace  54   −1.8667    .5582    −.5739   .5308
       MCMC     54   −1.8559    .5503    −.5721   .5270
11     MMLE     54   −2.4570    .9547    −.7330   .7597
       Laplace  54   −2.4832    .9068    −.7416   .7604
       MCMC     54   −2.4319    .9562    −.7333   .7568

variance. The MMLE and the MCMC methods produced estimates for each testlet variance. The magnitude of individual testlet variances ranged from negligible to small for most testlets across all three grades; the exceptions were the last two testlets in Grade 11, which exhibited moderate testlet effects. The testlet variance estimates from the MCMC method generally were higher than the MMLE estimates; this was consistent with the findings in the simulation study, where the MMLE estimates underestimated the true values.

The item parameter estimates from the three estimation methods were highly correlated: all correlations were above .999, consistent with the findings of He et al. (2007). Essentially, there was no difference in the distributions of item parameter estimates (Table 6). The ability parameter estimates from the three estimation methods also were highly correlated, with correlations all above .99. The distributions of ability parameter estimates were very similar, except that the variability of the Laplace estimates was relatively smaller (Table 7).

To better understand the real data analyses, one replication from a similar simulated study condition was examined further. As the average testlet effects for all three grades were small, replication 1 of the study condition with a


Table 7
Summary of Ability Estimates

Grade  Method   N      Minimum   Maximum   Mean    Std. deviation
9      MMLE     5,004  −3.5837   2.3866    .0000   .9540
       Laplace  5,004  −2.8845   2.2064    .0000   .8920
       MCMC     5,004  −3.0974   2.3516    .0000   .9467
10     MMLE     3,676  −2.5963   2.3935    .0000   .9407
       Laplace  3,676  −2.5335   2.1961    .0000   .8933
       MCMC     3,676  −2.6999   2.3691    .0000   .9385
11     MMLE     2,831  −2.5524   2.5598    .0000   .9550
       Laplace  2,831  −2.5546   2.3998    .0000   .9023
       MCMC     2,831  −2.6789   2.5471    .0000   .9457

Table 8
Summary of Item and Ability Parameter Estimates for Replication 1 of the Study Condition With Small Testlet Effect

Parameters       Method   N      Minimum   Maximum   Mean    Std. deviation
Item difficulty  MMLE     54     −2.3506   3.0520    .1484   1.1958
                 Laplace  54     −2.3583   3.0685    .1484   1.1977
                 MCMC     54     −2.3327   3.0203    .1465   1.1894
Person ability   MMLE     1,000  −2.6118   2.8925    .0000    .9537
                 Laplace  1,000  −2.3873   2.8050    .0000    .8994
                 MCMC     1,000  −2.5427   2.9973    .0000    .9511

small testlet effect was summarized, and the descriptive statistics for item and ability parameter estimates are presented in Table 8. When the testlet effect was small, both item and person parameter estimates had no essential practical differences; the estimates were perfectly correlated and their distributions resembled each other.

In general, the real data analyses indicated no essential differences in item and ability parameter estimates. These results may have occurred because there was little testlet effect in the real data sets, as supported by the average testlet variance for each test presented in Table 5. Alternatively, the results may reflect that the choice of estimation procedure makes little practical difference.

Summary and Discussion

This study demonstrated the equivalence between the Rasch testlet model (Wang & Wilson, 2005) and the three-level one-parameter testlet model (Jiao et al., 2005). It further explored the estimation of the one-parameter/Rasch testlet model using the MCMC estimation method in WINBUGS. The performance of the MCMC method was compared with that of two other estimation methods for the one-parameter/Rasch testlet model: MMLE and Laplace.
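
The Rasch testlet model adds a person-specific random effect for the testlet containing each item to the ordinary Rasch logit, with the effects for testlet d distributed N(0, σ²_d). A minimal sketch of the response function, with illustrative array names:

```python
import numpy as np

def rasch_testlet_prob(theta, b, gamma, testlet_of_item):
    """P(X_ij = 1) with logit = theta_j + gamma_{j, d(i)} - b_i.

    theta:  (persons,) abilities
    b:      (items,) difficulties
    gamma:  (persons, testlets) person-specific testlet effects,
            drawn N(0, sigma_d^2) per testlet in the generating model
    testlet_of_item: (items,) testlet index d(i) for each item
    """
    logit = theta[:, None] + gamma[:, testlet_of_item] - b[None, :]
    return 1.0 / (1.0 + np.exp(-logit))
```

Setting all gamma values to zero recovers the ordinary Rasch model; the testlet variance σ²_d is exactly the quantity whose recovery the simulation compared across estimation methods.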

Based on the simulation study, the estimation method had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates.

The testlet effects for the real data sets were negligible to small for most testlets, though a few testlets demonstrated moderate effects. On average, the testlet effects were small for all three grades. The results from the real data analyses generally supported what was observed in the simulation: the MCMC estimates were about the same as the MMLE and Laplace estimates. Essentially, the estimation method did not make a practical difference.

In terms of estimation time, the three studied estimation methods varied from 10 minutes to 2 days. In general, ConQuest implementing the MMLE method took about 30 minutes to finish one analysis. WINBUGS implementing the MCMC algorithm needed about 6 hours to finish one replication. The shortest run time for the Laplace estimation method in HLM6 was about 20 minutes, but the longest run time could go up to 2 full days. With the advance of computer technology, this may not be an issue in the near future.

Though the MCMC estimation method is computationally intensive, in some study conditions the MCMC algorithm was more efficient than the Laplace method and better recovered the testlet effects than the MMLE method. In addition, there are possibilities to extend the one-parameter testlet model to solve more complicated measurement problems. For instance, the one-parameter testlet model can be extended to include variables that model item clustering and person clustering at the same time (e.g., Jiao, Kamata, Wang, & Jin, 2012). However, ConQuest and HLM6 cannot handle the complexity of such a model. One can program the MMLE algorithm for complex models (as one reviewer pointed out), but the first and second derivatives become very complex for highly parameterized models. More importantly, convergence for the MMLE method could be very difficult to achieve, or multiple local maxima may exist. The MCMC method is a potential solution for such complex model parameter estimation. The findings from this study support MCMC as a proper estimation method for the one-parameter testlet model.
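
As a toy illustration of why MCMC scales to models that strain derivative-based MMLE, the sketch below draws one examinee's ability under the Rasch testlet model with a random-walk Metropolis step, treating item parameters as known. This is a simplified stand-in for the WINBUGS setup, not the article's actual sampler; all names and tuning settings here are assumptions.

```python
import numpy as np

def log_post(theta, gamma, u, b, d, sigma_t):
    """Log posterior of (theta, gamma) given responses u, known item
    difficulties b, item-to-testlet map d, and testlet SD sigma_t."""
    eta = theta + gamma[d] - b                       # Rasch testlet logit
    loglik = np.sum(u * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * theta**2 - 0.5 * np.sum(gamma**2) / sigma_t**2
    return loglik + logprior

def sample_ability(u, b, d, n_testlets, sigma_t=0.5,
                   draws=3000, step=0.4, seed=0):
    """Random-walk Metropolis over (theta, gamma) jointly; returns theta draws."""
    rng = np.random.default_rng(seed)
    theta, gamma = 0.0, np.zeros(n_testlets)
    lp = log_post(theta, gamma, u, b, d, sigma_t)
    out = np.empty(draws)
    for t in range(draws):
        theta_new = theta + step * rng.standard_normal()
        gamma_new = gamma + step * rng.standard_normal(n_testlets)
        lp_new = log_post(theta_new, gamma_new, u, b, d, sigma_t)
        if np.log(rng.random()) < lp_new - lp:       # Metropolis accept step
            theta, gamma, lp = theta_new, gamma_new, lp_new
        out[t] = theta
    return out
```

Because the sampler needs only the log posterior, not its derivatives, adding further random effects (e.g., person clustering) changes only `log_post`, which is the practical advantage discussed above.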

Different estimation algorithms and estimation software were compared in this study. The mean of the ability distribution was constrained to zero in each estimation algorithm for scale identification, to guarantee that the comparisons across estimation methods were meaningful and valid. When the MMLE method was implemented in ConQuest, the convergence criterion was set at .0001. For the Laplace method implemented in HLM6, the convergence criteria for the macro iterations also were set at .0001. Thus, the convergence criteria were set at about the same level. This study did not examine the effects of convergence criteria on model parameter estimation; future studies may explore this factor.

The MCMC algorithm is a Bayesian method, and priors affect the inferences based on the posteriors. In this study, the focus was on comparing estimation methods, so the priors were intentionally set as noninformative but proper to avoid strong influence of informative priors on the MCMC estimates. The sample sizes for both the simulation studies and the real data analyses were large, so the impacts of the priors on the posteriors would be outweighed by the information in the item response data. Future research may explore the impact of different priors on posterior inferences, especially under small sample size conditions such as the 200 or 500 persons studied in Wang and Wilson (2005).

Several limitations of the current study need to be addressed in future explorations. First, uniform testlet effects were simulated in this study. As observed in the real data sets, testlet effects are nonuniform across testlets in a test. Simulation conditions with varying testlet effects are expected to be a valuable extension. The total sample size was held constant at 1,000. As one reviewer pointed out, many large-scale tests have sample sizes well beyond this; a larger sample size seems a fruitful extension. This study included 25 replications. Given the purpose of the study, this number of replications may not seriously jeopardize the internal validity of this exploration. However, the number of replications may be treated as a factor, and more replications could be included in future extensions. Further, the total number of items was fixed in the current study. Future studies can vary the number of testlets and the number of items in each testlet to investigate the impact on model parameter estimation across methods. The current study simulated a test with a balanced design in which the same number of items comprises each testlet. Real tests may use an unbalanced design in which more items are developed for long passages and fewer items for short passages. In addition, some tests may contain both standalone items and testlet-based items. Whether a different test structure affects the findings from this study is unknown, and further exploration is needed.

Local item independence, one of the assumptions of IRT models, often is violated in testlet-based assessments. The development of testlet models promotes better modeling of such item response data. The utility of the one-parameter testlet model is broad in dealing with testlet-based assessments where the one-parameter/Rasch model currently is used to set up the measurement system. Accurate estimation of testlet effects (testlet variances) at the field-testing stage helps to identify passages and associated items that are likely to cause local item dependence. The explicit modeling of testlet effects helps to ensure more accurate estimation of item and ability parameters. This study provides empirical evidence regarding three possible methods of estimating the parameters of the one-parameter testlet model. It helps practitioners choose an estimation method when the estimation accuracy of one or more parameters of the one-parameter testlet model is of interest in practical testing situations. Based on the empirical results from this study, practitioners may choose the estimation method that is most convenient, as the estimation method does not make a real practical difference.

Notes

¹As the purpose of this study was to compare estimation methods for the testlet models, we did not include the comparison between the parameter estimates from the Rasch model and the Rasch testlet model.

²Only significant effects with at least small effect sizes were reported, at a nominal Type I error rate of .05. Nonsignificant results, and significant results with negligible effect sizes, were not reported in this section. The magnitude of the effect size is classified as negligible (f < .1), small (.1 < f < .25), moderate (.25 < f < .4), and large (f > .4) in the analysis of variance.
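
For reference, one common route from a reported ANOVA F statistic to Cohen's f goes through the sample eta-squared. This is a generic formula, not necessarily the exact computation used in the study, which may have worked from sums of squares directly.

```python
def cohens_f(F, df_effect, df_error):
    """Cohen's f from an F statistic via eta-squared:
    eta^2 = df1*F / (df1*F + df2),  f = sqrt(eta^2 / (1 - eta^2))."""
    eta_sq = (df_effect * F) / (df_effect * F + df_error)
    return (eta_sq / (1.0 - eta_sq)) ** 0.5

def effect_size_label(f):
    """Classification of f used in the analyses of variance above."""
    if f < .10:
        return "negligible"
    if f < .25:
        return "small"
    if f < .40:
        return "moderate"
    return "large"
```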

³Here "small" described the estimated testlet variances close to the small testlet effects in the simulation study.

⁴Here "moderate" described the estimated testlet variances close to the moderate testlet effects in the simulation.

Acknowledgments

The authors would like to thank the editors, Dr. Brian Clauser and Dr. James Carlson, and the reviewers for their valuable advice and suggestions, which greatly improved the clarity and focus of the manuscript. An earlier version of this article was presented at the 2008 Annual Meeting of the National Council on Measurement in Education in New York City, New York.

References

Adams, R. J., Wilson, M. R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.

Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9–25.

Breslow, N. E., & Lin, X. (1995). Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika, 82, 81–91.

Brooks, S. P., & Gelman, A. (1998). Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.

Diaz, R. E. (2007). Comparison of PQL and Laplace 6 estimates of hierarchical linear models when comparing groups of small incident rates in cluster randomized trials. Computational Statistics and Data Analysis, 51, 2871–2888.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.

Goldstein, H., & Rasbash, J. (1996). Improved approximations for multilevel models with binary responses. Journal of the Royal Statistical Society, Series A, 159, 505–513.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.

Harwell, M., Stone, C. A., Hsu, T., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20, 101–125.

He, W., Jiao, H., & Wang, S. (2007, April). Comparing parameter recovery between Rasch testlet model and one-parameter testlet model. Paper presented at the meeting of the American Educational Research Association, Chicago, IL.

Jiao, H., Kamata, A., Wang, S., & Jin, Y. (2012). A multilevel testlet model for dual local dependence. Journal of Educational Measurement, 49, 82–100.

Jiao, H., Wang, S., & Kamata, A. (2005). Modeling local item dependence with the hierarchical generalized linear model. Journal of Applied Measurement, 6, 311–321.

Kamata, A. (2001). Item analysis by the hierarchical generalized linear model. Journal of Educational Measurement, 38, 79–93.

Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – a Bayesian modeling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.

Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. (2004). HLM6: Hierarchical linear and nonlinear modeling [Computer program]. Chicago, IL: Scientific Software International.

Raudenbush, S., Yang, M., & Yosef, M. (2000). Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation. Journal of Computational and Graphical Statistics, 9, 141–157.

Rodriguez, G., & Goldman, N. (1995). An assessment of estimation procedures for multilevel models with binary responses. Journal of the Royal Statistical Society, Series A, 158, 73–89.

Rodriguez, G., & Goldman, N. (2001). Improved estimation procedures for multilevel models with binary response: A case study. Journal of the Royal Statistical Society, Series A, 164, 339–355.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monographs, No. 17.

Snijders, T., & Bosker, R. (1999). An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.

Tuerlinckx, F., Rijmen, F., Molenberghs, G., Verbeke, G., Briggs, D., Van den Noortgate, W., Meulders, M., & De Boeck, P. (2004). Estimation and software. In P. De Boeck & M. Wilson (Eds.), Explanatory item response models. New York, NY: Springer.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201.

Wang, W.-C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29, 126–149.

Wang, X., Bradlow, E. T., & Wainer, H. (2005). User's guide for SCORIGHT (Version 3.0): A computer program for scoring tests built of testlets including a module for covariates analysis (Research Report 04–49). Princeton, NJ: Educational Testing Service.

Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ConQuest: Generalized item response modeling software [Computer software and manual]. Camberwell, Australia: Australian Council for Educational Research.

Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. (2007). ACER ConQuest 2.0: General item response modeling software [Computer program manual]. Camberwell, Australia: ACER Press.

Yen, W. M. (1984). Effect of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.

Authors

HONG JIAO is Assistant Professor, Measurement, Statistics and Evaluation, Department of Human Development and Quantitative Methodology, 1230B Benjamin Building, University of Maryland, College Park, MD 20742; [email protected]. Her primary research interests include item response theory and psychometrics in large-scale assessments.

SHUDONG WANG is Senior Research Scientist, Northwest Evaluation Association, 121 NW Everett St., Portland, OR 97206; [email protected]. His primary research interests include computerized adaptive testing and generalized linear mixed model applications in educational research.

WEI HE is Senior Research Scientist, Northwest Evaluation Association, 121 NW Everett St., Portland, OR 97206; [email protected]. Her primary research interests include computerized adaptive/based testing, psychometrics, and large-scale educational assessment.
