
Editorial

It was a pleasure to see and speak with so many colleagues during the 2005 meeting in Montreal. I thought that the French-style restaurants and cooking were quite good. And we were fortunate to experience the glorious spring weather there . . .

By good fortune, the articles and the cover graphic in this issue of EMIP focus, generally speaking, on test development issues. You will see articles on the numbers of response options for multiple-choice items, the quality of local school system assessments, the role of writing quality in scoring constructed-response items, and an ITEMS module on practice analysis questionnaires for specifying content for credentialing examinations.

Michael Rodriguez reports a meta-analysis of 27 studies conducted since 1925 on the relationship between the numbers of multiple-choice item response options and item difficulty, item discrimination, and test score reliability. He focuses the discussion of his results on the validity of inferences from test scores. Rodriguez finds that three options are optimal in many situations, especially where additional plausible distractors are difficult to generate, where test speededness encourages lower performers to guess, and where broad content coverage is difficult to achieve within time limits. He points out that because examinees tend to guess strategically rather than randomly, four- and five-option items may function as three- and four-option items for some examinees. This study highlights how much more work is required to make such a highly refined area of educational testing—writing multiple-choice items—a little less art and a little more science. It also nicely complements emerging work on examinees’ cognitive processing while responding to assessment tasks (e.g., see EMIP 23(4)).
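To see the arithmetic behind that last point: an examinee who can eliminate one implausible distractor before guessing has the same chance of a lucky correct answer on a four-option item as a random guesser has on a three-option item. The short sketch below (a hypothetical function and purely illustrative values, not anything from Rodriguez's study) makes the comparison explicit.

    def guess_probability(n_options, n_eliminated):
        # Chance of a correct guess after eliminating implausible options.
        return 1.0 / (n_options - n_eliminated)

    print(guess_probability(4, 1))  # 0.333... -- four options, one eliminated
    print(guess_probability(5, 1))  # 0.25     -- five options, one eliminated
    print(guess_probability(3, 0))  # 0.333... -- random guessing on three options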

You may recall that the summer 2004 issue of EMIP (23(2)) was a special issue on the Nebraska state assessment program. Nebraska’s program is unique: It is standards-based as required by No Child Left Behind (NCLB), but assessments are locally developed and classroom-based. In the current issue of EMIP, Susan Brookhart reports on an evaluation of the mathematics assessments submitted as part of the Nebraska statewide assessment program. You may be aware of the literature on classroom assessment practice, to which Brookhart has been a major contributor. It suggests that the quality of classroom assessment practice has improved over the last 15 years but is still evolving. Many of you may have raised a skeptical eyebrow about Nebraska’s response to NCLB. In fact, Brookhart’s evaluation of statewide mathematics assessment practice indicates room for improvement but overall good quality. The evaluation focuses on the quality of content (e.g., alignment and coverage) and on scoring procedures.

William Schafer and his colleagues at the University of Maryland conducted a scoring study when the statewide Maryland School Performance Assessment Program (MSPAP) was still operating.1 MSPAP assessed students in grades 3, 5, and 8 in six content areas. All items required examinees to construct their responses. Responses were scored for content relevance and accuracy; writing quality should not have influenced the scores assigned to responses. Schafer and his colleagues examined whether or not writing quality (e.g., correctness, word choice, organization) did influence scores. Their results are quite interesting and have implications for the scoring of constructed-response items in current assessment programs.

1 I should acknowledge that I was State Assessment Director from 1991 to 1997.

This issue also contains an ITEMS (Instructional Topics in Educational Measurement Series) module. Mark Raymond describes procedures for developing, refining, and analyzing practice analysis questionnaires for those of you who work on credentialing and other certification examinations. Raymond’s instructional module should make an excellent companion to an article in the spring 2005 issue of EMIP (24(1)) on using knowledge, skill, and ability statements in developing licensure and certification examinations.

Please be sure to look at a brief clarification in this issue regarding a previous ITEMS module on standard setting (see EMIP 23(4)).

Cover Visual: Response Time Graphic

I think that the graphic on the cover of this issue of EMIP is visually stunning. It also conveys quite a bit of information about a test item, once you understand the three dimensions it portrays. Patrick Meyer and his colleagues at James Madison University have been producing very interesting work on measures of item response time as indicators of effort in responding to items. Meyer discusses the display below; a rough computational sketch of the ideas he describes appears after his discussion.

At James Madison University, we are working on models that use item response time to improve our understanding of examinee cognition. Several of our studies use response time as an indicator of examinee motivation in low-stakes testing. For example, Wise and Kong (2005) developed a measure of examinee motivation that uses response time information only. Wise and DeMars (in press) created the effort-moderated item response theory model to improve ability estimation by using response time to account for examinee motivation. In each of these studies, collateral research was necessary to validate the interpretation of response time. We use these response time graphs in gathering such evidence and to explore the links between response time, ability, and cognition.

An item from a 60-item computer-based information literacy test (ILT) is depicted in this response time graph. The item demonstrates some interesting features. Most notably, low-ability and high-ability examinees seem to process the item differently (see the Ability axis). Response time has little effect at high ability levels: the surface is flat across the item-time deviation dimension there. The cognitive processes that these examinees employ tend to lead to the correct answer regardless of how much time they take. Perhaps examinees of high ability know the answer but some are more motivated to spend extra time checking the plausibility of the distractors. (See the Item Deviation Times for high- and low-ability examinees.) At lower levels of ability, response time plays a more prominent role. Low-ability examinees improve their chances of a correct answer by spending more time on the item (as indicated by the increasing probabilities at increasing deviation times among low-ability examinees). That is, with enough thought, they figure out the answer. A review of the item content suggests that these students may not initially know the answer, but they may be able to identify it through a process of elimination.

In some ways this graph is characteristic of an item from a speed test. However, a correct response does not depend on mental speed (response time) alone. Moreover, this test was not designed to be a speed test: Examinees complete every item. Items with response surfaces like the one depicted are scattered throughout the test. Graphs of the remaining information literacy test items indicate that the role of response time varies somewhat from item to item. There were a few groups of items in which response time had a similar effect, but finding a parametric model that would fit all items would be difficult.

The preceding statements assume that cognition underlies item response time. This assumption may or may not be supported. Cognition, motivation, and/or other examinee characteristics may contribute to response time differences. More work is needed to understand fully the role of response time in the information literacy test and testing in general. Indeed, the interpretation of response time will vary from test to test and from item to item. Visual exploration of response time is one step in using time to tap into important examinee characteristics that are not fully captured by ability.

The graph was produced using a multivariate extension of Ramsay’s (1991) nonparametric item response theory model (Meyer, 2005). The response time deviation is a robust standardization of the item response time. Times around zero are close to the median response time. Negative numbers indicate quick (shorter than median) response times, while positive numbers indicate slow (longer than median) response times. In this framework, median response times are on tempo, while other times are off tempo. Ability was estimated using a normal score transformation of the raw test score without the studied item. A benefit of this model is that it allows the effect of response time to be examined without imposing any distributional assumptions on the data. Response time does not need to fit a particular distribution such as the Weibull or log-normal, unlike parametric response time models. This aspect is particularly important given that response time distributions tend to be mixture densities (Schnipke & Scrams, 1999).
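For readers who would like to experiment with these ideas, the two sketches below are minimal illustrations, not the published procedures. The first computes a response-time-based effort index in the spirit of Wise and Kong (2005): the proportion of items an examinee answers with times at or above an item-level threshold. The threshold values, array shapes, and function names are my own illustrative assumptions. The second builds a nonparametric response surface like the one on the cover by robustly standardizing an item's response times, converting the rest score to normal scores, and kernel-smoothing the correctness indicator over a grid of ability and time deviation; the IQR scaling, bandwidth, and Gaussian kernel are generic choices, not the specifics of Meyer's (2005) model.

    import numpy as np

    def response_time_effort(response_times, thresholds):
        # response_times: (examinees x items) array of item response times in seconds.
        # thresholds: hypothetical per-item time limits below which a response is
        # treated as a rapid guess rather than solution behavior.
        solution_behavior = response_times >= thresholds
        # Effort index: proportion of items answered with solution behavior.
        return solution_behavior.mean(axis=1)

    # Toy data: three examinees, four items; threshold values are illustrative only.
    times = np.array([[12.0, 30.0,  2.0, 25.0],
                      [ 1.5,  2.0,  1.0,  3.0],
                      [20.0, 45.0, 18.0, 60.0]])
    limits = np.array([5.0, 8.0, 4.0, 10.0])
    print(response_time_effort(times, limits))   # [0.75 0.   1.  ]

    import numpy as np
    from scipy.stats import norm, iqr

    def robust_time_deviation(item_times):
        # Center at the item median and scale by the interquartile range; the exact
        # robust standardization used by Meyer (2005) may differ.
        return (item_times - np.median(item_times)) / iqr(item_times)

    def normal_score_ability(rest_scores):
        # Normal-score transform of the raw test score excluding the studied item.
        ranks = rest_scores.argsort().argsort() + 1
        return norm.ppf(ranks / (len(rest_scores) + 1.0))

    def response_surface(ability, deviation, correct, grid_a, grid_d, bandwidth=0.4):
        # Gaussian-kernel (Nadaraya-Watson) estimate of P(correct | ability, deviation)
        # on a grid -- a generic nonparametric surface that imposes no distributional
        # assumptions on response time.
        surface = np.empty((len(grid_a), len(grid_d)))
        for i, a in enumerate(grid_a):
            for j, d in enumerate(grid_d):
                w = norm.pdf((ability - a) / bandwidth) * norm.pdf((deviation - d) / bandwidth)
                surface[i, j] = np.sum(w * correct) / np.sum(w)
        return surface

    # Usage sketch for item 10 of a 60-item test, with 0/1 scores in `scores`
    # and response times in `rt` (both examinees x items):
    # rest = scores.sum(axis=1) - scores[:, 10]
    # surf = response_surface(normal_score_ability(rest),
    #                         robust_time_deviation(rt[:, 10]),
    #                         scores[:, 10],
    #                         np.linspace(-2, 2, 25), np.linspace(-2, 2, 25))

Either piece can be adapted to flag unmotivated examinees or to explore item-by-item response time effects before committing to a parametric model.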

In Closing

During this year’s annual meeting I saw presentations of many excellent papers that could be tailored quite suitably for EMIP. I hope that many of you will take a look at the EMIP editorial mission statement (see http://www.ncme.org/pubs/emip_policy.ace), give some thought to EMIP’s target audiences, and consider revising your NCME and AERA papers and submitting them to EMIP.

Several of you commented to me in Montreal about the cover visuals. I hope that you’ll send a graph, plot, table, or other visual for me to consider for a future cover.

Steve Ferrara
Editor

References

Meyer, J. P. (2005). A nonparametric response time model. Manuscript in preparation.

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611–630.

Schnipke, D. L., & Scrams, D. J. (1999). Modeling item response timeswith a two-state mixture model: A new approach to measuringspeededness (Computerized Testing Report 96-02). Newton, PA:Law School Admission Council.

Wise, S. L., & DeMars, C. E. (in press). An application of item responsetime: The effort moderated IRT model. Journal of EducationalMeasurement.

Wise, S. L., & Kong, X. (2005). Response time effort: A new measureof examinee motivation in computer-based tests. Applied Measure-ment in Education, 18, 163–183.
