TRANSCRIPT
A Critical Evaluation of Diagnostic Score Reporting: Some Theory and
Applications
Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman
Copyright 2009 by Educational Testing Service. All rights reserved. No reproduction in whole or in part is permitted without the express written permission of the copyright owner
Paper presented at the Statistical and Applied Mathematical Sciences Institute, 9 July, 2009.
Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the scores.
Paul W. Holland (2001)
Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the diagnostic scores.
Outline
- Examples of diagnostic score reports
- Approaches to reporting diagnostic scores
- Problems with existing diagnostic scores in education
- A method to evaluate whether diagnostic scores have added value
- Applications of the method to operational test data
- Conclusions and recommendations
Confidential and Proprietary. Copyright © 2007 by Educational Testing Service.
What Are Diagnostic Scores?
Diagnostic scores refer to scores on any meaningful cluster of items (subtests).
Typically, they refer to scores on content areas.
For example, on a test for prospective teachers of children, diagnostic scores are scores on the content areas Reading, Science, Social Studies, and Mathematics.
Subscores, Augmented Subscores, and Objective Performance Index
Subscores: Raw/percent scores on the subtests.
Augmented subscore (Wainer et al., 2001): A weighted average of the subscore of interest (e.g., reading) and the other subscores (e.g., science, social studies, and mathematics).
Objective Performance Index (Yen, 1987): A weighted average of (i) the observed subscore, and (ii) an estimate of the subscore based on the examinee’s overall test performance.
Cognitive Diagnostic Models (CDM)
Assumptions:
- solving each test item requires one or more skills (summarized in a Q matrix)
- each examinee has a latent ability parameter corresponding to each of the skills
- the probability of a score depends on the skills the item requires and the ability parameters
The ability estimates are the diagnostic scores.
Examples of CDMs
- Rule Space Method (RSM; Tatsuoka, 1983, 2009): an early attempt at diagnostic scoring.
- Attribute Hierarchy Method (Leighton, Gierl, & Hunka, 2004): an extension of the RSM.
- The DINA and NIDA models (Junker & Sijtsma, 2001).
- Multiple classification latent class model (Maris, 1999).
- General diagnostic model (GDM; von Davier, 2008).
- Reparameterized unified model (RUM; Hartz, 2002; Roussos et al., 2007).
Examples of CDMs…Continued
- Bayesian networks (Almond et al., 2007).
- Multidimensional item response theory (de la Torre & Patz, 2005; Yao & Boughton, 2007).
- Multicomponent latent trait model (e.g., Embretson, 1997).
- The higher-order latent trait model (de la Torre, 2005).
- The DINO and NIDO models.
Many excellent reviews of CDMs exist (e.g., Rupp & Templin, 2008; von Davier et al., 2008; DiBello, Roussos, & Stout, 2007).
Is It Possible to Report High-quality Diagnostic Scores for the Existing Educational Tests?
Standards 1.12, 2.1, 5.12 etc. of Standards for Educational & Psychological Testing (1999) demand proof of adequate reliability, validity, and distinctness of diagnostic scores.
Classical Test Theory
x = test score.
Partition the score x as x = xt + xe, with E(xe) = 0, Cov(xt, xe) = 0, V(xe) = σe², V(xt) = σt².
Reliability = correlation between scores on a test and a parallel form of the test = ρ²(x, xt).
Validity measures the extent to which a test is doing the job it is supposed to do (e.g., the correlation between x and a criterion score y).
Is It Possible to Report High-quality Diagnostic Scores?...Cont’d
Diagnostic scores on educational tests are most often based on few items yet cover broad domains. As a result, they:
- have low reliability
- are highly correlated with each other
- are outcomes of retrofitting
Is It Possible to Report High-quality Diagnostic Scores?...Cont’d
Luecht, Gierl, Tan & Huff (2006): “Inherently unidimensional item and test information cannot be decomposed to produce useful multidimensional score profiles—no matter how well intentioned or which psychometric model is used to extract the information. Our obvious recommendation is not to try to extract something that is not there.”
An Empirical Check of Reliability of Diagnostic Scores
Form X (120 Items)
Reading: 30 Items Math: 30 Items Social Studies: 30 Items Science: 30 Items
Form A (40 Items)
Reading: 10 Items Math: 10 Items Social Studies: 10 Items Science: 10 Items
Form B (40 Items)
Reading: 10 Items Math: 10 Items Social Studies: 10 Items Science: 10 Items
Form C (40 Items)
Reading: 10 Items Math: 10 Items Social Studies: 10 Items Science: 10 Items
An Empirical Check of Reliability of Diagnostic Scores…Continued
Of the 6,035 examinees who scored 4 (1st quartile) or lower on science on Form A, 49 percent scored higher than 1st quartile on science on Form B.
Of the 383 examinees who scored 8 (3rd quartile) on Math and 4 on science on Form A, 32 percent had science score higher than or equal to their Math score on Form B.
Correlation between the Science scores on Forms A and B: r = 0.48.
Correlation between the Science score on Form A and the total score on Form B: r = 0.63.
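The pattern above can be reproduced in a toy simulation. The sketch below is entirely hypothetical (four 10-item subtests per form, a single strong general factor, binomially scored subtests; none of these settings come from the operational data in this talk), but it shows the same phenomenon: a short subscore on one form correlates more with the other form's total score than with the other form's version of itself.

```python
# Hypothetical simulation: why a short subscore can be predicted better by
# the parallel form's TOTAL score than by the parallel subscore itself.
import numpy as np

rng = np.random.default_rng(0)
n = 20000                                    # examinees
g = rng.normal(size=n)                       # general ability
load = 0.95                                  # assumed loading of each skill on g

# Four skill abilities per examinee, highly correlated through g
skills = load * g[:, None] + np.sqrt(1 - load**2) * rng.normal(size=(n, 4))
p = 1.0 / (1.0 + np.exp(-skills))            # per-subtest success probability

form_a = rng.binomial(10, p)                 # four 10-item subscores, Form A
form_b = rng.binomial(10, p)                 # parallel Form B, same abilities

sci_a, sci_b = form_a[:, 3], form_b[:, 3]    # treat column 3 as "Science"
total_b = form_b.sum(axis=1)                 # 40-item total on Form B

r_sci = np.corrcoef(sci_a, sci_b)[0, 1]      # parallel-form subscore corr.
r_tot = np.corrcoef(sci_a, total_b)[0, 1]    # subscore vs other form's total
print(round(r_sci, 2), round(r_tot, 2))
```

With these assumptions the 10-item subscore is noticeably unreliable, and the longer, more reliable total score wins as a predictor, mirroring the 0.48 vs 0.63 contrast reported above.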
A Method Based on Classical Test Theory (Haberman, 2008)
Compute the PRMSE (proportional reduction in mean squared error) of
- the subscore (= subscore reliability)
- the total score
A subscore has added value over the total score only if the PRMSE of the subscore is larger than the PRMSE of the total score. Equivalently, a subscore has added value if the true subscore can be predicted better by the corresponding subscore on a parallel form than by the total score on the parallel form.
A Method Based on Classical Test Theory…Continued
Subscore: s = st + se; total score: x = xt + xe.
PRMSE for the subscore = ρ²(s, st) = subscore reliability.
PRMSE for the total score = ρ²(x, st) = ρ²(x, xt) ρ²(xt, st).
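These two PRMSEs can be computed from simple summary statistics. A minimal sketch in Python, using made-up reliabilities and a made-up subscore–total correlation (not values from this talk):

```python
# Sketch of Haberman's added-value check from summary statistics.
# All input values are hypothetical illustrations.
def prmse_check(rel_s, rel_x, corr_sx):
    """Return (PRMSE_sub, PRMSE_total).

    rel_s   : reliability of the subscore s        = rho^2(s, st)
    rel_x   : reliability of the total score x     = rho^2(x, xt)
    corr_sx : observed correlation between s and x
    """
    prmse_sub = rel_s                            # rho^2(s, st)
    # disattenuated squared correlation between true scores, rho^2(xt, st);
    # inputs must be consistent enough that this stays <= 1
    disatt_sq = corr_sx**2 / (rel_s * rel_x)
    prmse_total = rel_x * disatt_sq              # rho^2(x, xt) rho^2(xt, st)
    return prmse_sub, prmse_total

sub, tot = prmse_check(rel_s=0.80, rel_x=0.90, corr_sx=0.60)
# The subscore has added value only if PRMSE_sub > PRMSE_total.
print(sub, round(tot, 3), sub > tot)
```

Here the subscore is fairly distinct from the total (low corr_sx), so its PRMSE exceeds that of the total score and it would have added value under this criterion.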
A Method Based on Classical Test Theory…Continued
Can report a weighted average of the subscore and the total score (e.g., 0.4 × Reading + 0.2 × Total) if its PRMSE is large enough.
Special case of the augmented subscore (Wainer et al., 2001).
The computations need only simple summary statistics.
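One way to sketch this from the same summary statistics is to compute the PRMSE of the best linear combination of s and x as a predictor of the true subscore st (the R² of that regression). The sketch assumes the measurement errors in s and x are uncorrelated (a parallel-form-style framing), and all numbers are illustrative, not from the talk:

```python
# Hedged sketch: PRMSE of an optimally weighted average of subscore s and
# total score x as a predictor of the true subscore st.
# Assumption: errors in s and x are uncorrelated; numbers are hypothetical.
import numpy as np

var_s, var_x = 25.0, 100.0      # observed variances of subscore and total
rel_s, rel_x = 0.80, 0.90       # reliabilities
corr_sx = 0.60                  # observed correlation between s and x

var_st = rel_s * var_s                      # true-subscore variance
cov_sx = corr_sx * np.sqrt(var_s * var_x)   # Cov(s, x)
# Under uncorrelated errors: Cov(s, st) = Var(st) and Cov(x, st) = Cov(x, s)
C = np.array([[var_s, cov_sx], [cov_sx, var_x]])   # predictor covariance
b = np.array([var_st, cov_sx])                     # covariances with st
beta = np.linalg.solve(C, b)                # regression weights for (s, x)
prmse_wtd = beta @ b / var_st               # R^2 = PRMSE of weighted average
print(round(prmse_wtd, 3))
```

Because the weighted average is the best linear predictor, its PRMSE is never below the PRMSE of the subscore or the total score alone, which is why weighted averages have added value more often.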
What About Validity?
Reliability is an aspect of construct validity.
Recent work of Haberman (2008): A subscore that is not distinct or not reliable has limited value w.r.t. validity.
Thus, the method also examines whether the subscores have adequate validity (though additional validity studies are recommended).
An Example: GRE Subject Biology
The PRMSEs for the subscore, total score, and weighted average:

Subscore               PRMSEsub (= reliability)   PRMSEtotal   PRMSEwtd
Cellular & Molecular   .89                        .78          .91
Organismal             .85                        .89          .91
Ecology & Evolution    .87                        .79          .89
Results from a Survey of Operational Data (Sinharay, 2009)
Test name            #Subscores   Avg. length   Avg. reliability   Avg. disatt. corr.   #Subscores w/ added value   #Wtd. avgs. w/ added value
Old SAT-V            3            26            0.79               0.95                 0                           1
Sch. Std. Prg: Eng   4            15            0.70               0.98                 0                           0
DSTP (8th gr. Math)  4            19            0.77               1.00                 0                           0
Teachers of math.    3            16            0.62               0.95                 0                           0
Old SAT              2            69            0.92               0.76                 2                           2
Praxis Series™       4            25            0.72               0.78                 2                           4
SweSAT               5            24            0.78               0.69                 4                           5
[Figure: Percent of subscores with added value, by subscore length and average disattenuated correlation]
[Figure: Percent of subscores with added value, by average subscore reliability and average disattenuated correlation]
[Figure: Percent of weighted averages with added value]
Main Findings from the Survey of Operational Data
More than 50% of the tests had no subscore with added value.
Weighted averages had added value more often than subscores.
The subscores that had added value were
• based on a sufficient number of items (20+)
• sufficiently distinct from each other (disattenuated correlation less than 0.9)
Reporting of Aggregate-level Subscores
To determine if aggregate-level subscores have added value, use an approach based on PRMSEs similar to that used for individual-level subscores.
The computation of the PRMSEs is a bit different and is based on between- and within-aggregation sums of squares.
A Method Based on Classical Test Theory…Continued
Subscore: s = sA + se; total score: x = xA + xe (aggregate-level true and error parts).
PRMSE for the aggregate-level subscore = ρ²(sav, sA) = σ²(sA) / [σ²(sA) + σ²(se)/n].
PRMSE for the aggregate-level total score = ρ²(xav, sA) = ρ²(xav, xA) ρ²(xA, sA).
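The aggregate-level subscore PRMSE is a direct plug-in once the variance components are estimated (e.g., from between- and within-aggregation sums of squares). A tiny illustration with assumed, hypothetical variance components:

```python
# Illustrative aggregate-level PRMSE from assumed variance components.
# The values below are made up; in practice they would be estimated from
# between- and within-aggregation sums of squares.
var_between = 4.0    # sigma^2(sA): variance of true aggregate-level subscores
var_within = 16.0    # sigma^2(se): examinee-level (within-aggregate) variance
n = 25               # examinees per aggregate unit (e.g., per school)

# PRMSE of the mean subscore sav as a predictor of the true aggregate score sA
prmse_agg = var_between / (var_between + var_within / n)
print(round(prmse_agg, 3))
```

Note that averaging over n examinees shrinks the error term by 1/n, so aggregate-level subscores can be reliable even when individual-level ones are not; added value still depends on the comparison with the aggregate-level total score.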
Reporting of Subscores Based on MIRT Models
Fit, using a stabilized Newton-Raphson method (Haberman, von Davier, & Lee, 2008), a MIRT model with item response function

Pj(θ) = exp(aj1 θ1 + aj2 θ2 + … + ajK θK + bj) / [1 + exp(aj1 θ1 + aj2 θ2 + … + ajK θK + bj)],

where θi corresponds to subscore i.
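The item response function above is a multidimensional logistic model; a minimal sketch with made-up parameter values (the a, b, and θ values are purely illustrative):

```python
# Minimal sketch of the MIRT item response function above.
# Parameter values are hypothetical, for illustration only.
import numpy as np

def mirt_irf(theta, a, b):
    """P(correct) = exp(a.theta + b) / (1 + exp(a.theta + b))."""
    z = np.dot(a, theta) + b
    return 1.0 / (1.0 + np.exp(-z))        # equivalent, numerically stabler

a = np.array([1.2, 0.3])                   # discriminations on two dimensions
theta = np.array([0.5, -0.4])              # abilities on the two dimensions
p = mirt_irf(theta, a, b=-0.2)
print(round(p, 3))
```

An item's probability rises with the abilities on whichever subscore dimensions it discriminates, which is what lets the fitted model yield dimension-specific diagnostic scores.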
Reporting of Subscores Based on MIRT…Continued
The diagnostic scores are the posterior means of the ability parameters. Calculate the proportional reduction in mean squared error:

PRMSEMIRT = 1 − E[(θi − E(θi | X))²] / E[(θi − E(θi))²]

Compare PRMSEMIRT to PRMSEwtd to examine whether MIRT does better than CTT.
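Given each examinee's ability and posterior mean, the formula above is a one-liner. The sketch below uses tiny made-up arrays in place of a real fitted model's output:

```python
# Sketch of the PRMSE_MIRT computation: compare the posterior mean's MSE to
# the prior variance of theta. Arrays below are made-up illustrations of
# (true ability, posterior mean E(theta|X)) pairs.
import numpy as np

def prmse_mirt(theta, theta_post):
    """1 - E[(theta - E(theta|X))^2] / E[(theta - E(theta))^2]."""
    mse = np.mean((theta - theta_post) ** 2)          # posterior MSE
    var = np.mean((theta - theta.mean()) ** 2)        # variance about mean
    return 1.0 - mse / var

theta = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
post = np.array([-0.8, -0.4, 0.1, 0.4, 0.9])
print(round(prmse_mirt(theta, post), 3))
```

The closer the posterior means track the abilities, the closer this value gets to 1; comparing it with PRMSEwtd shows whether the MIRT machinery buys anything over the CTT weighted average.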
Results for Aggregate-level Subscores and MIRT-based Subscores
Aggregate-level subscores, just like individual-level subscores, rarely have added value.
The PRMSEMIRT is very close to PRMSEwtd for the several tests we looked at.
Conclusions and Recommendations
Most of the existing diagnostic scores on educational tests lack quality.
Evidence of adequate reliability, validity, and distinctness of the diagnostic scores should be provided.
If a CDM is used, it should be demonstrated that the model parameters can be reliably estimated in a timely manner and the model fits the data better than a simpler model.
Conclusions and Recommendations
To report meaningful diagnostic scores for some tests, changing the structure by using assessment engineering practices (Luecht et al., 2006) may be necessary.
Alternatives: Scale anchoring (Beaton & Allen, 1992) and item mapping (Zwick et al., 2001).
References for the Haberman method
Haberman (2008). Journal of Educational and Behavioral Statistics.
Sinharay, Haberman, & Puhan (2007). Educational Measurement: Issues and Practice.
Sinharay & Haberman (2008). Measurement.
Haberman, Sinharay, & Puhan (2009). British Journal of Math. & Stat. Psychology.
References for the Haberman method…Continued
Puhan, Sinharay, Haberman, & Larkin (in press). Applied Measurement in Education.
Sinharay (2009). ETS RR.
Haberman & Sinharay (2009). ETS RR.
Sinharay, Puhan, & Haberman (2009). Invited presentation at the annual meeting of NCME.