TRANSCRIPT
A Critical Evaluation of Diagnostic Score Reporting: Some Theory and
Applications
Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman
Copyright 2009 by Educational Testing Service. All rights reserved. No reproduction in whole or in part is permitted without the express written permission of the copyright owner
Paper presented at the Statistical and Applied Mathematical Sciences Institute, 9 July, 2009.
Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the scores.
Paul W. Holland (2001)
Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the diagnostic scores.
Outline
- Examples of diagnostic score reports
- Approaches to reporting diagnostic scores
- Problems with existing diagnostic scores in education
- A method to evaluate whether diagnostic scores have added value
- Applications of the method to operational test data
- Conclusions and recommendations
Confidential and Proprietary. Copyright © 2007 by Educational Testing Service.
What Are Diagnostic Scores?
Diagnostic scores refer to scores on any meaningful cluster of items (subtests).
Typically, they refer to scores on content areas.
For example, on a test for prospective teachers of children, diagnostic scores are scores on the content areas Reading, Science, Social Studies, and Mathematics.
Subscores, Augmented Subscores, and Objective Performance Index
Subscores: Raw/percent scores on the subtests.
Augmented subscore (Wainer et al., 2001): A weighted average of the subscore of interest (e.g., reading) and the other subscores (e.g., science, social studies, and mathematics).
Objective Performance Index (Yen, 1987): A weighted average of (i) the observed subscore, and (ii) an estimate of the subscore based on the examinee’s overall test performance.
Cognitive Diagnostic Models (CDM)
Assumptions:
- solving each test item requires one or more skills (summarized in a Q matrix)
- each examinee has a latent ability parameter corresponding to each of the skills
- the probability of a score depends on the skills the item requires and the ability parameters
The ability estimates are the diagnostic scores.
Examples of CDMs
- Rule Space Method (RSM; Tatsuoka, 1983, 2009): an early attempt at diagnostic scoring.
- Attribute Hierarchy Method (Leighton, Gierl, & Hunka, 2004): an extension of the RSM.
- The DINA and NIDA models (Junker & Sijtsma, 2001).
- Multiple classification latent class model (Maris, 1999).
- General diagnostic model (GDM; von Davier, 2008).
- Reparameterized unified model (RUM; Hartz, 2002; Roussos et al., 2007).
Examples of CDMs…Continued
- Bayesian networks (Almond et al., 2007).
- Multidimensional item response theory (de la Torre & Patz, 2005; Yao & Boughton, 2007).
- Multicomponent latent trait model (e.g., Embretson, 1997).
- The higher-order latent trait model (de la Torre, 2005).
- The DINO and NIDO models.
Many excellent reviews of CDMs exist (e.g., Rupp & Templin, 2008; von Davier et al., 2008; DiBello, Roussos, & Stout, 2007).
Is It Possible to Report High-quality Diagnostic Scores for the Existing Educational Tests?
Standards 1.12, 2.1, 5.12 etc. of Standards for Educational & Psychological Testing (1999) demand proof of adequate reliability, validity, and distinctness of diagnostic scores.
Classical Test Theory
x = test score.
Partition the score x as x = xt + xe, with E(xe) = 0, Cov(xt, xe) = 0, V(xe) = σe², V(xt) = σt².
Reliability = correlation between scores on a test and a parallel form of the test = ρ²(x, xt).
Validity measures the extent to which a test is doing the job it is supposed to do (e.g., the correlation between x and a criterion score y).
Is It Possible to Report High-quality Diagnostic Scores?...Cont’d
Diagnostic scores on educational tests are most often based on few items yet cover broad domains. As a result, they:
- have low reliability
- are highly correlated with each other
- are outcomes of retrofitting
Is It Possible to Report High-quality Diagnostic Scores?...Cont’d
Luecht, Gierl, Tan & Huff (2006): “Inherently unidimensional item and test information cannot be decomposed to produce useful multidimensional score profiles—no matter how well intentioned or which psychometric model is used to extract the information. Our obvious recommendation is not to try to extract something that is not there.”
An Empirical Check of Reliability of Diagnostic Scores
Form X (120 Items)
Reading: 30 Items Math: 30 Items Social Studies: 30 Items Science: 30 Items
Form A (40 Items)
Reading: 10 Items Math: 10 Items Social Studies: 10 Items Science: 10 Items
Form B (40 Items)
Reading: 10 Items Math: 10 Items Social Studies: 10 Items Science: 10 Items
Form C (40 Items)
Reading: 10 Items Math: 10 Items Social Studies: 10 Items Science: 10 Items
An Empirical Check of Reliability of Diagnostic Scores…Continued
Of the 6,035 examinees who scored 4 (1st quartile) or lower on science on Form A, 49 percent scored higher than 1st quartile on science on Form B.
Of the 383 examinees who scored 8 (3rd quartile) on Math and 4 on science on Form A, 32 percent had science score higher than or equal to their Math score on Form B.
Correlation between the Science scores on Forms A and B: r = 0.48.
Correlation between the Science score on Form A and the total score on Form B: r = 0.63.
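The pattern above can be reproduced in a toy simulation. The sketch below is entirely hypothetical (four 10-item subtests per form, a single strong general factor, binomially scored subtests; none of these settings come from the operational data in this talk), but it shows the same phenomenon: a short subscore on one form correlates more with the other form's total score than with the other form's version of itself.

```python
# Hypothetical simulation: why a short subscore can be predicted better by
# the parallel form's TOTAL score than by the parallel subscore itself.
import numpy as np

rng = np.random.default_rng(0)
n = 20000                                    # examinees
g = rng.normal(size=n)                       # general ability
load = 0.95                                  # assumed loading of each skill on g

# Four skill abilities per examinee, highly correlated through g
skills = load * g[:, None] + np.sqrt(1 - load**2) * rng.normal(size=(n, 4))
p = 1.0 / (1.0 + np.exp(-skills))            # per-subtest success probability

form_a = rng.binomial(10, p)                 # four 10-item subscores, Form A
form_b = rng.binomial(10, p)                 # parallel Form B, same abilities

sci_a, sci_b = form_a[:, 3], form_b[:, 3]    # treat column 3 as "Science"
total_b = form_b.sum(axis=1)                 # 40-item total on Form B

r_sci = np.corrcoef(sci_a, sci_b)[0, 1]      # parallel-form subscore corr.
r_tot = np.corrcoef(sci_a, total_b)[0, 1]    # subscore vs other form's total
print(round(r_sci, 2), round(r_tot, 2))
```

With these assumptions the 10-item subscore is noticeably unreliable, and the longer, more reliable total score wins as a predictor, mirroring the 0.48 vs 0.63 contrast reported above.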
A Method Based on Classical Test Theory (Haberman, 2008)
Compute the PRMSE (proportional reduction in mean squared error) of
- the subscore (= subscore reliability)
- the total score
A subscore has added value over the total score only if the PRMSE of the subscore is larger than the PRMSE of the total score. Equivalently, a subscore has added value if the true subscore can be predicted better by the corresponding subscore on a parallel form than by the total score on the parallel form.
A Method Based on Classical Test Theory…Continued
Subscore: s = st + se; total score: x = xt + xe.
PRMSE for the subscore = ρ²(s, st) = subscore reliability.
PRMSE for the total score = ρ²(x, st) = ρ²(x, xt) ρ²(xt, st).
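These two PRMSEs can be computed from simple summary statistics. A minimal sketch in Python, using made-up reliabilities and a made-up subscore–total correlation (not values from this talk):

```python
# Sketch of Haberman's added-value check from summary statistics.
# All input values are hypothetical illustrations.
def prmse_check(rel_s, rel_x, corr_sx):
    """Return (PRMSE_sub, PRMSE_total).

    rel_s   : reliability of the subscore s        = rho^2(s, st)
    rel_x   : reliability of the total score x     = rho^2(x, xt)
    corr_sx : observed correlation between s and x
    """
    prmse_sub = rel_s                            # rho^2(s, st)
    # disattenuated squared correlation between true scores, rho^2(xt, st);
    # inputs must be consistent enough that this stays <= 1
    disatt_sq = corr_sx**2 / (rel_s * rel_x)
    prmse_total = rel_x * disatt_sq              # rho^2(x, xt) rho^2(xt, st)
    return prmse_sub, prmse_total

sub, tot = prmse_check(rel_s=0.80, rel_x=0.90, corr_sx=0.60)
# The subscore has added value only if PRMSE_sub > PRMSE_total.
print(sub, round(tot, 3), sub > tot)
```

Here the subscore is fairly distinct from the total (low corr_sx), so its PRMSE exceeds that of the total score and it would have added value under this criterion.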
A Method Based on Classical Test Theory…Continued
Can report a weighted average of the subscore and the total score (e.g., 0.4 × Reading + 0.2 × Total) if its PRMSE is large enough.
Special case of the augmented subscore (Wainer et al., 2001).
The computations need only simple summary statistics.
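One way to sketch this from the same summary statistics is to compute the PRMSE of the best linear combination of s and x as a predictor of the true subscore st (the R² of that regression). The sketch assumes the measurement errors in s and x are uncorrelated (a parallel-form-style framing), and all numbers are illustrative, not from the talk:

```python
# Hedged sketch: PRMSE of an optimally weighted average of subscore s and
# total score x as a predictor of the true subscore st.
# Assumption: errors in s and x are uncorrelated; numbers are hypothetical.
import numpy as np

var_s, var_x = 25.0, 100.0      # observed variances of subscore and total
rel_s, rel_x = 0.80, 0.90       # reliabilities
corr_sx = 0.60                  # observed correlation between s and x

var_st = rel_s * var_s                      # true-subscore variance
cov_sx = corr_sx * np.sqrt(var_s * var_x)   # Cov(s, x)
# Under uncorrelated errors: Cov(s, st) = Var(st) and Cov(x, st) = Cov(x, s)
C = np.array([[var_s, cov_sx], [cov_sx, var_x]])   # predictor covariance
b = np.array([var_st, cov_sx])                     # covariances with st
beta = np.linalg.solve(C, b)                # regression weights for (s, x)
prmse_wtd = beta @ b / var_st               # R^2 = PRMSE of weighted average
print(round(prmse_wtd, 3))
```

Because the weighted average is the best linear predictor, its PRMSE is never below the PRMSE of the subscore or the total score alone, which is why weighted averages have added value more often.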
What About Validity?
Reliability is an aspect of construct validity.
Recent work of Haberman (2008): A subscore that is not distinct or not reliable has limited value w.r.t. validity.
Thus, the method also examines whether the subscores have adequate validity (though additional validity studies are recommended).
An Example: GRE Subject Biology
The PRMSEs for the subscore, total score, and weighted average:

Subscore               PRMSEsub (= reliability)   PRMSEtotal   PRMSEwtd
Cellular & Molecular   .89                        .78          .91
Organismal             .85                        .89          .91
Ecology & Evolution    .87                        .79          .89
Results from a Survey of Operational Data (Sinharay, 2009)
Test name            #Subscores   Avg. length   Avg. reliability   Avg. disatt. corr.   #Subscores w/ added value   #Wtd. avgs. w/ added value
Old SAT-V            3            26            0.79               0.95                 0                           1
Sch. Std. Prg: Eng   4            15            0.70               0.98                 0                           0
DSTP (8th gr. Math)  4            19            0.77               1.00                 0                           0
Teachers of math.    3            16            0.62               0.95                 0                           0
Old SAT              2            69            0.92               0.76                 2                           2
Praxis Series™       4            25            0.72               0.78                 2                           4
SweSAT               5            24            0.78               0.69                 4                           5
[Figure: Percent of subscores with added value, by subscore length and average disattenuated correlation]
[Figure: Percent of subscores with added value, by average subscore reliability and average disattenuated correlation]
[Figure: Percent of weighted averages with added value]
Main Findings from the Survey of Operational Data
More than 50% of the tests had no subscore with added value.
Weighted averages had added value more often than subscores.
The subscores that had added value were
• based on a sufficient number of items (20+)
• sufficiently distinct from each other (disattenuated correlation less than 0.9)
Reporting of Aggregate-level Subscores
To determine if aggregate-level subscores have added value, use an approach based on PRMSEs similar to that used for individual-level subscores.
The computation of the PRMSEs is a bit different and is based on between- and within-aggregation sums of squares.
A Method Based on Classical Test Theory…Continued
Subscore: s = sA + se; total score: x = xA + xe (aggregate-level true and error parts).
PRMSE for the aggregate-level subscore = ρ²(sav, sA) = σ²(sA) / [σ²(sA) + σ²(se)/n].
PRMSE for the aggregate-level total score = ρ²(xav, sA) = ρ²(xav, xA) ρ²(xA, sA).
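The aggregate-level subscore PRMSE is a direct plug-in once the variance components are estimated (e.g., from between- and within-aggregation sums of squares). A tiny illustration with assumed, hypothetical variance components:

```python
# Illustrative aggregate-level PRMSE from assumed variance components.
# The values below are made up; in practice they would be estimated from
# between- and within-aggregation sums of squares.
var_between = 4.0    # sigma^2(sA): variance of true aggregate-level subscores
var_within = 16.0    # sigma^2(se): examinee-level (within-aggregate) variance
n = 25               # examinees per aggregate unit (e.g., per school)

# PRMSE of the mean subscore sav as a predictor of the true aggregate score sA
prmse_agg = var_between / (var_between + var_within / n)
print(round(prmse_agg, 3))
```

Note that averaging over n examinees shrinks the error term by 1/n, so aggregate-level subscores can be reliable even when individual-level ones are not; added value still depends on the comparison with the aggregate-level total score.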
Reporting of Subscores Based on MIRT Models
Fit, using a stabilized Newton-Raphson method (Haberman, von Davier, & Lee, 2008), a MIRT model with item response function

Pj(θ) = exp(aj1 θ1 + aj2 θ2 + … + ajK θK + bj) / [1 + exp(aj1 θ1 + aj2 θ2 + … + ajK θK + bj)],

where θi corresponds to subscore i.
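The item response function above is a multidimensional logistic model; a minimal sketch with made-up parameter values (the a, b, and θ values are purely illustrative):

```python
# Minimal sketch of the MIRT item response function above.
# Parameter values are hypothetical, for illustration only.
import numpy as np

def mirt_irf(theta, a, b):
    """P(correct) = exp(a.theta + b) / (1 + exp(a.theta + b))."""
    z = np.dot(a, theta) + b
    return 1.0 / (1.0 + np.exp(-z))        # equivalent, numerically stabler

a = np.array([1.2, 0.3])                   # discriminations on two dimensions
theta = np.array([0.5, -0.4])              # abilities on the two dimensions
p = mirt_irf(theta, a, b=-0.2)
print(round(p, 3))
```

An item's probability rises with the abilities on whichever subscore dimensions it discriminates, which is what lets the fitted model yield dimension-specific diagnostic scores.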
Reporting of Subscores Based on MIRT…Continued
The diagnostic scores are the posterior means of the ability parameters. Calculate the proportional reduction in mean squared error:

PRMSEMIRT = 1 − E[(θi − E(θi | X))²] / E[(θi − E(θi))²]

Compare PRMSEMIRT to PRMSEwtd to examine whether MIRT does better than CTT.
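Given each examinee's ability and posterior mean, the formula above is a one-liner. The sketch below uses tiny made-up arrays in place of a real fitted model's output:

```python
# Sketch of the PRMSE_MIRT computation: compare the posterior mean's MSE to
# the prior variance of theta. Arrays below are made-up illustrations of
# (true ability, posterior mean E(theta|X)) pairs.
import numpy as np

def prmse_mirt(theta, theta_post):
    """1 - E[(theta - E(theta|X))^2] / E[(theta - E(theta))^2]."""
    mse = np.mean((theta - theta_post) ** 2)          # posterior MSE
    var = np.mean((theta - theta.mean()) ** 2)        # variance about mean
    return 1.0 - mse / var

theta = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
post = np.array([-0.8, -0.4, 0.1, 0.4, 0.9])
print(round(prmse_mirt(theta, post), 3))
```

The closer the posterior means track the abilities, the closer this value gets to 1; comparing it with PRMSEwtd shows whether the MIRT machinery buys anything over the CTT weighted average.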
Results for Aggregate-level Subscores and MIRT-based Subscores
Aggregate-level subscores, just like individual-level subscores, rarely have added value.
The PRMSEMIRT is very close to PRMSEwtd for the several tests we looked at.
Conclusions and Recommendations
Most of the existing diagnostic scores on educational tests lack quality.
Evidence of adequate reliability, validity, and distinctness of the diagnostic scores should be provided.
If a CDM is used, it should be demonstrated that the model parameters can be reliably estimated in a timely manner and the model fits the data better than a simpler model.
Conclusions and Recommendations
To report meaningful diagnostic scores for some tests, changing the structure by using assessment engineering practices (Luecht et al., 2006) may be necessary.
Alternatives: Scale anchoring (Beaton & Allen, 1992) and item mapping (Zwick et al., 2001).
References for the Haberman method
Haberman (2008). Journal of Educational and Behavioral Statistics.
Sinharay, Haberman, & Puhan (2007). Educational Measurement: Issues and Practice.
Sinharay & Haberman (2008). Measurement.
Haberman, Sinharay, & Puhan (2009). British Journal of Math. & Stat. Psychology.
References for the Haberman method…Continued
Puhan, Sinharay, Haberman, & Larkin (in press). Applied Measurement in Education.
Sinharay (2009). ETS RR.
Haberman & Sinharay (2009). ETS RR.
Sinharay, Puhan, & Haberman (2009). Invited presentation at the annual meeting of NCME.