
Criterion-Referenced Testing 30 Years Later: Promise Broken, Promise Kept

Jason Millman
Cornell University

What promises has CRT kept or broken? Why do we need denser tests to make good, criterion-referenced interpretations? What wisdom has grown from experience with CRTs?

Ron Hambleton and H. Jane Rogers (1989) quote a representative of a large testing company as having said almost 30 years ago that he “would keep an eye on criterion-referenced [CR] testing until it died” (p. 146). If a norm-referenced statement that is full of jealousy is permitted, I wish I had aged as well and received as much attention over the past 30 years as CR testing has. It is decidedly here to stay.

In his 1963 paper, Bob Glaser modestly and generously announced the heritage of his wonderful offspring. If Bob was the parent, then certainly Jim Popham and Ted Husek were the delivery team. Their widely read, widely cited paper (1969) trumpeted CR testing’s arrival and promise to the Kingdom of Measurement.

What is that promise? Glaser put it this way: “[A] student’s score on a CR measure provides explicit information as to what the individual can or cannot do” (pp. 519-520).

Did CR testing keep its promise? Were the criterion-referenced tests (CRTs) that were developed by test publishers capable of such an interpretation? Not often.

Promise Broken

The unfulfilled promise is that CRTs would permit valid inferences about what a student could and could not do. In my view, the promise was unfulfilled because of a faulty assumption.

Thirty years ago I was a young pup, full of ambition and optimism. I thought that if only educators could write good test specifications, explicitly stating what was and was not part of the content coverage, CRTs would be able to meet their promise. More than that, even. I believed CRTs could give quantitative interpretations, such as: Billy can answer 65% of the questions contained in a given domain. But I was wrong.

I was aglow over emerging methods of describing test item populations. I especially liked the item shell and replacement set scheme (Osburn & Shoemaker, 1968) and even helped develop a computer language that could be used to construct such shells and sets (Millman & Outlaw, 1978). When the first cries of dissension by test developers hit the airwaves, Popham’s (1974) amplified objectives seemed a reasonable compromise.

I remember being impatient with test publishers who used the popular CRT name but did not adhere to its rigorous requirements, such as use of clearly specified domains and an explicit strategy for selecting items from those domains. Most CRTs failed abysmally against these requirements (Walker et al., 1979). The publishers’ behavior was not necessarily a pernicious attempt to cash in on the new fad. In those days, CRTs required exacting test specifications that exceeded the capability, resources, and patience of test developers to produce.

I also remember having the pleasure of running an American Educational Research Association workshop with Bob Glaser on CR testing during its early days. My part had the participants wringing their hands over writing detailed test specifications. It was much later that I could appreciate why that activity was a bust and somewhat misguided. Clearly specified domains were not sufficient to know what examinees can and cannot do.

CRTs, however, did offer benefits even if they did not keep their promise of yielding the kind of interpretations Glaser called for. They totally destroyed the monopoly of norm-referenced interpretations that was held in many quarters. The improvement in specifications did promote the alignment of instruction and assessment. The accountability movement of the 1970s and 1980s was well served by this alignment. Some might say the current popularity of education standards was facilitated by the CRT movement. The mass of new publications and procedures assured continuing employment for psychometricians and allowed many of us to grow professionally. But these pseudo-CRTs did not tell what a student could and could not do.

I now know I was off course. Clear and well-explicated domains are insufficient to assure interpretability. If the domain defines a broad construct, such as knowledge of the American Civil War, then no matter how well spelled out it is, with a limited number of test items we still won’t know what tasks within that domain the student can and cannot do. We can construct reading proficiency and mathematical reasoning scales. We can place students on such scales, a highly important measurement function. However, we would probably still not know what tasks the student can and cannot do. Low task intercorrelations (that is, task specificity) work against such CR interpretations. Reporting by a narrower domain, such as the Battle of Gettysburg, helps only if enough items are sampled from that domain.

Jason Millman is a Professor at Cornell University, 405 Kennedy Hall, Ithaca, NY 14853-4203. His specializations are educational measurement and program and personnel evaluation.

FIGURE 1. Score interpretability versus item density (items per cell)

Promise Kept

In my view, CRTs yield CR interpretations closest to Glaser’s conception when test developers increase item density. Just as population density is a ratio of the number of organisms per unit area, so we can think of item density as the number of questions per unit of content: items per cell. Cronbach called our attention to the bandwidth-fidelity dilemma: sampling broadly within the boundaries of a domain will probably increase validity at the expense of internal consistency. High item density increases the ability to interpret what a student can and cannot do at the expense of reducing the validity of measuring broadly defined constructs. Figure 1 illustrates the relationship between item density and interpretability.
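To make the ratio concrete (a back-of-the-envelope sketch in my own notation, not a formula drawn from the measurement literature): if a test blueprint partitions a domain into $C$ content cells and the test contains $n$ items drawn from those cells, then

\[
\text{item density} = \frac{n}{C} \quad \text{items per cell}.
\]

A 40-item test spread over 40 narrowly defined cells has a density of 1 item per cell; the same 40 items concentrated on 8 cells give a density of 5, and only the denser arrangement begins to support confident statements about what a student can and cannot do within each cell.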

When CRTs came into this world, there began a long, uphill climb toward more and more tests with high item density. A number of diagnostic instruments were produced. But educators had trouble managing all this information. Gradually, the average item density slipped back, but not to pre-1963 levels. I envision the situation to look something like Figure 2. That marginal increase in item density over the 1963 levels is one indication of how well the CRT promise was kept.

FIGURE 2. Trend of mean item density (mean number of items per cell) over time, 1963 to 1993

NAEP Scales as a Case in Point

When Ralph Tyler launched the National Assessment of Educational Progress (NAEP), he wanted to find out what 90%, half, and 10% of the students could and could not do. NAEP interpretations typically focused on specific items. Even if NAEP exercise writers weren’t sure what underlying cognitive construct may have been involved, and even if they couldn’t generalize to a respectably broad construct, at least they could describe some specific tasks students could and could not do. Interpretations stayed close to describing the NAEP exercises.

Some item statistics still form part of NAEP reports (see, e.g., Mullis, Dossey, Owen, & Phillips, 1993). However, elusive constructs such as the ability to “identify extraneous information,” “list the possible arrangements in a sample space,” and “solve for the length of missing segments in more complex similarity situations” (pp. 236-237) dominate. These abilities are typically measured by one or, at most, a few items. The NAEP instrument appears to be operating with an item density of one item per cell.

NAEP tests are just not designed to provide, nor do they claim to provide, the promised CR interpretation. Their constructs are too broad. (See Forsyth, 1991, on this point.) Their role as “The Nation’s Report Card” requires, for all practical purposes, that progress be reported in broadly defined domains.

Even though NAEP’s purpose is not to yield CR interpretations in Glaser’s meaning of the term, does the project have features that promote such interpretations? I think not.

Two recent changes in NAEP’s operation have been the return of attention to performance assessment and the use of the categories basic, proficient, and advanced to report levels of achievement. Will each of these two shifts add to the CR interpretability of NAEP results? The answers are no and no.

Performance assessment works against high item density. Such tasks are often time consuming, and therefore fewer of them can be administered. Such tasks usually measure a variety of skills and, correspondingly, yield low correlations between tasks. The resulting combination of low item density and task specificity precludes CR interpretations.

Determining that a student is at the proficient level in a subject (reading, for example) does not give us a confident assessment of the specific tasks a student can or cannot do. Although the achievement levels are defined in terms of groups of “elusive constructs,” illustrated above, performance on tasks measuring such constructs can span the achievement scale.

With the shift of the NAEP contract from the Education Commission of the States to the Educational Testing Service came use of item response theory in scale development and reporting. Item response theory (IRT) technology doesn’t automatically infuse interpretability into the NAEP scales. IRT can help increase interpretability if the instruments have high item density. IRT scales have the best CR interpretability when they contain both the ability levels of subjects (students, schools, etc.) and the difficulty levels of a homogeneous set of items. Then they come closer to what Bob Glaser wanted of a CR interpretation. He wrote:

Underlying the concept of achievement measurement is the notion of a continuum of knowledge acquisition ranging from no proficiency at all to perfect performance. An individual’s achievement level falls at some point on this continuum as indicated by the behaviors he displays during testing. . . . the specific behaviors implied at each level of proficiency can be identified. . . . Criterion-referenced measures indicate the content of the behavioral repertory, and the correspondence between what an individual does and the underlying continuum of achievement. (pp. 519-520)
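One way to picture a scale that carries both student ability and item difficulty (a sketch only; the one-parameter logistic, or Rasch, model is used here purely as an illustration, not as a description of NAEP’s operational scaling) is

\[
P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{1}{1 + e^{-(\theta_i - b_j)}},
\]

where $\theta_i$ is the ability of student $i$ and $b_j$ is the difficulty of item $j$, both expressed on the same continuum. A student has a 50-50 chance on an item located at his or her own ability level, so items well below $\theta_i$ describe tasks the student can almost surely do and items well above it describe tasks the student almost surely cannot do. That reading of the scale is defensible only when the items are homogeneous and numerous enough, which is the item-density argument again.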

Older and Wiser

Now that CR testing has 30 years of experience behind it, we can reflect on its capabilities with less bright-eyed optimism and more realism. Part of that realism is that only occasionally are tests desired that, because of their restricted domain of coverage and high item saturation, are capable of supplying CR interpretations. But when they are desired, care is required in the choice of tasks. Test makers can profit from guidelines about task construction and selection. Their awareness of the underlying cognitive demands for each task can facilitate both the choice of tasks and the interpretation of performance. CR interpretation is also aided by scales that show ability and task difficulty together and by interpretations that employ constructs closer to the item level. But most of all, to capture Glaser’s original intention that CRTs be capable of telling us what students can and cannot do, we need higher item density: more items per acre of domain.

Note

This is a revision of a paper presented as part of the symposium, “Criterion-Referenced Measurement: A 30-Year Retrospective,” at the Annual Meetings of the American Educational Research Association and the National Council on Measurement in Education in Atlanta, GA, April 1993. I wish to thank the editor and reviewers for many helpful suggestions with respect to the original paper.

References

Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 10(3), 3-9, 16.

Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519-521.

Hambleton, R. K., & Rogers, H. J. (1989). Solving criterion-referenced measurement problems with latent response models. International Journal of Educational Research, 13, 145-160.

Millman, J., & Outlaw, W. S. (1978). Testing by computer. AEDS Journal, 11, 57-72.

Mullis, I. V. S., Dossey, J. A., Owen, E. H., & Phillips, G. W. (1993). NAEP 1992 mathematics report card for the nation and the states (Report No. 23-ST02). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.

Osburn, H. G., & Shoemaker, D. M. (1968). Pilot project on computer generated test items (DHEW Grant No. 1-7-068533-3917). Washington, DC: Office of Education. (ERIC Document Reproduction Service No. ED 026 856)

Popham, W. J. (1974). Curriculum design: The problem of specifying intended learning outcomes. In J. Blaney, I. Housego, & G. McIntosh (Eds.), Program development in education (pp. 76-88). Vancouver, BC: University of British Columbia, Centre for Continuing Education.

Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1-9.

Walker, C. B., Dotseth, M., Hunter, R., Smith, K. O., Kampe, L., Strickland, G., Neafsey, S., Garvey, C., Bastone, M., Weinberger, E., Yohn, K., & Smith, L. S. (1979). CSE Criterion-Referenced Test handbook. Los Angeles: University of California, Center for the Study of Evaluation.
