
Editorial

Along with two exciting empirical studies and some NCME news, this issue of EMIP contains an interesting commentary and rejoinder and a clarification of a technical matter addressed in a previous issue of EMIP.

Mark Reckase presents a refined and expanded version of a theoretically framed empirical evaluation of standard setting methods that he first reported in an NCME symposium in 2005. Reckase proposes a theory of the standard setting judgmental task. As he puts it, the standard setter’s judgmental task is to follow prescribed procedures to find his or her “intended cut score.” Reckase makes clear that he is not proposing that a “true” standard exists theoretically in the way that an examinee’s true score exists theoretically. He uses the intended cut score concept to enable the analyses and evaluative comparisons he makes of results from simulations of applying modified Angoff and Bookmark standard settings and recommendations for improving the implementation of these methods. I will not tell you here what he concludes and recommends—you will have to read for yourself. This paper already has generated controversy, just as Reckase’s NCME 2005 presentation did. In fact, you can expect to see a commentary on the paper and a rejoinder in an upcoming issue of EMIP.

The Reckase paper is a nice companion to other papers on standard setting that have appeared in EMIP recently. In the winter 2004 issue, Gregory Cizek, Michael Bunch, and Heather Koons presented an ITEMS module on contemporary standard setting methods. In the spring 2006 issue, Ana Karantonis and Stephen Sireci provided a review of the literature on the Bookmark standard setting method. Stimulated by a discussion of response probability (RP) criteria in that paper, Huynh Huynh clarifies in this issue of EMIP the conceptual and psychometric basis for RP67.

I hope that the combination of EMIP papers and other studies on standard setting published in the Journal of Educational Measurement and Applied Measurement in Education recently (see Buckendahl, 2005; Wainer, Wang, Skorupski, & Bradlow, 2005) and elsewhere will stimulate even more debate and additional research and development work on standard setting conceptions and methods. In fact, and at the risk of sounding highfalutin, I wonder whether we are approaching where the nuclear physicists are: searching for a unifying theory. A sort of unifying theory for standard setting would address social interaction processes (e.g., Fitzpatrick, 1989) and social psychology findings on human judgment and decision making (e.g., Plous, 1993), among other considerations.

The potential for deleterious influences of high stakes testing situations on examinee performance and professional ethical behavior receives plenty of attention in professional writing and in the media. Low stakes testing situations receive less public attention. Anyone who has worked on NAEP and other low stakes assessments is familiar with concerns about examinee effort and the effects of low motivation on examinee proficiency estimates, item and test statistics, and interpretation of test scores. Steven Wise and his colleagues Dennison Bhola and Sheng-Ta Yang report results from their recent study of the effects of an “effort-monitoring computer-based test” on student motivation—as indicated by their response-time effort measure—and on student performance. This is another interesting and useful study from the research program of the faculty and students at the Institute for Computer-Based Assessment at James Madison University.

In the spring 2005 issue of EMIP, Ning Wang, Deborah Schnipke, and Elizabeth Witt illustrated the impact of using knowledge, skill, and ability (KSA) statements in developing validity evidence for licensure and certification tests. They advocated building a linkage between job analysis results and test specifications—evidence that is crucial to support the use of scores from such tests—through KSA statements that are connected to the results from the job analysis. Tony LaDuca provides four counterarguments to using KSA statements as advocated in the earlier paper. Wang and her associates provide a rejoinder. Even though licensure and certification testing is outside my experience, I find the subtleties in these discussions interesting and illuminating. Clearly, these subtle arguments and clarifications are important: The discussion is about development and validation procedures for tests that license health care providers and a range of other professionals—who provide services to all of us—and that provide assurance that these professionals are competent to provide those services.

NCME News

This issue contains photographs from the 2006 annual meeting breakfast. They remind me of the professionalism, productivity, and geniality of our colleagues in the educational measurement community. Having briefly savored that memory, you can see that the call for proposals for the 2007 meeting also appears in this issue. So get to thinking. Proposals are due by August 11!

Cover Visuals

This issue features two graphics provided by Dorry Kenyon of the Center for Applied Linguistics. He designed these graphics to communicate results from complex psychometric analyses to non-technical members of the World-class Instructional Design and Assessment (WIDA) Consortium of states (see http://www.wida.us/). The WIDA Consortium and its partners developed an English language proficiency assessment, ACCESS for ELLs™. Kenyon describes this assessment and the background and interpretation of the graphics below.

Background on the Graphics

These graphics were developed to summarize a large quantity of data analyses to address a concern of educators in the field of English language learning in the U.S. K–12 arena.


The concern arises as an outcome of the No Child Left Behind (NCLB) legislation and has received little attention to date. Specifically, NCLB requires English language learners (ELLs) to be assessed annually in four language domains (listening, speaking, reading, writing) and to show that they are making progress in acquiring English. What may be called an older generation of English language proficiency tests had been used for many years, primarily to identify English language learners needing English language support services. However, new assessments were needed to address the new context: assessments that can meet the technical rigor required by NCLB and that are based on standards that address the acquisition of the English language.

One of the new tests is the WIDA Consortium’s Assessing Comprehension and Communication in English State to State for English Language Learners (or ACCESS for ELLs™). The WIDA Consortium is a group of (currently) 12 states that began with initial funding from a federal Enhanced Assessment Grant to Wisconsin. Under the leadership of Tim Boals as Director and Margo Gottlieb as Lead Developer of the WIDA English Language Proficiency Standards for English Language Learners in Kindergarten through Grade 12, the consortium developed ACCESS for ELLs™ in partnership with the Center for Applied Linguistics. ACCESS for ELLs™ was first used operationally by three of the consortium states in 2005, with all states using it operationally in 2006.

One of the activities of the WIDA Consortium was to conduct a bridge study in 2005 that examined the relationship of performances on “older generation” tests that some consortium states had been using with performances on the new ACCESS. Across the consortium states, four main tests had been used (often at the discretion of the local district): the IDEA Proficiency Test (IPT), the Language Assessment Scales (LAS), the Language Proficiency Test Series (LPTS), and the Revised Maculaitis II (MAC II). In the bridge study, 4,985 students enrolled in grades K through 12 from selected districts in Illinois and Rhode Island were administered the IPT, LAS, LPTS, or MAC II prior to an operational administration of ACCESS for ELLs™. Data collected in the bridge study were used to develop predicted scores tables; that is, tables that show student scores on ACCESS for ELLs™ predicted from their scores on an older generation test.

Many educators in the field of English language learning had long felt that the older generation tests had a tendency to exit students from English as a second language (ESL) services too early, before students had mastered the English needed to succeed not only with the everyday social language of American schooling but also with the language needed for academic success. WIDA Consortium states were eager to answer the question, How are performances on the older tests and ACCESS for ELLs™ related, particularly in regard to recommendations on when to exit ELLs from ESL services?

ACCESS for ELLs™ operationalizes the WIDA Consortium’s English Language Proficiency Standards. Its ultimate goal is to place English learners appropriately and accurately in the proficiency levels defined by the Standards. The Standards define five proficiency levels: Entering, Beginning, Developing, Expanding, and Bridging. These are hierarchical levels, shown at the top of the cover graphic through steps 1–5. Although each state may choose different points at which students may be exited as a policy decision, a sixth category is inherent in the Standards: students who may be considered “formerly ELLs.” This stage, known in the WIDA Consortium as Reaching, is not a defined proficiency level per se and thus in the graphic is represented more as a “finish line.” The graphic showing these steps is commonly used across the Consortium member states when illustrating the hierarchical levels of the WIDA Standards.

Interpreting the Graphics

We used results from the bridge study to answer the question, How are performances on the older tests and ACCESS for ELLs™ related, particularly in regard to recommendations on when to exit ELLs from ESL services? To answer the question graphically, we examined all conversion tables in the bridge study report, noted the top proficiency level according to the score metric on the older generation test, and determined the predicted score on ACCESS for ELLs™ and the corresponding WIDA proficiency level. (ACCESS for ELLs™ proficiency level scores are reported in tenths, such as 4.6. The highest possible proficiency level score is 6.0.) Both on the older test and ACCESS for ELLs™, these proficiency levels vary by grade level. For each test on each domain, an average WIDA proficiency level score was calculated across all grade levels.
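To make the averaging step concrete, the following is a minimal sketch, not the authors' analysis code; the grade-level clusters and predicted scores shown are hypothetical placeholders, not bridge study results.

    import statistics

    # Hypothetical sketch: averaging predicted ACCESS for ELLs proficiency level
    # scores across grade levels for one older generation test in one domain.
    # Values are illustrative placeholders only.
    predicted_level_by_grade_cluster = {
        "K-2": 2.7,   # predicted ACCESS score (reported in tenths, max 6.0)
        "3-5": 2.8,   # at the older test's top proficiency level
        "6-8": 3.0,
        "9-12": 3.1,
    }

    # Average predicted WIDA proficiency level score for this test and domain
    average_level = statistics.mean(predicted_level_by_grade_cluster.values())
    print(f"Average predicted WIDA proficiency level score: {average_level:.1f}")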

The arrow on the bottom of the cover graphic shows the results for an older generation test in the writing domain, referred to here as “ELP Test A.” Among all older generation tests, this test exited students from ESL services at the lowest proficiency level vis-à-vis the WIDA proficiency levels. The average predicted ACCESS for ELLs™ proficiency level score across grade levels was 2.9. The second graphic (see Figure 2) shows the outcome for an older generation test in the reading domain, referred to here as “ELP Test B.” Among all older generation tests, this test exited students from ESL services at the highest proficiency level vis-à-vis the WIDA levels. The average predicted ACCESS for ELLs™ proficiency level score across grade levels was 5.3.

FIGURE 2. Comparison of Exit Levels for ACCESS for ELLs™ and Other English Language Proficiency Tests: Results of the WIDA Bridge Study (Highest Exit).

The result for writing on ELP Test A is not surprising because the older generation tests tended to focus on oral language skills as an indicator of English language proficiency, whereas the WIDA Standards account for the important role that writing plays in academic success. The result for reading also is not surprising for ELP Test B, which was developed more recently and, like the WIDA Standards, includes academic language in its definition of English proficiency for ELL students.

In Closing

The last two issues of 2006 will be special topic issues. The fall issue will contain a number of papers and discussion from a 2006 NCME session. Judith Koenig organized and moderated a symposium titled “Following the Standards for Educational and Psychological Testing: The Challenges of Ensuring Sound Measurement Practice.” The presenters, discussants, and audience addressed the theme of the session and turned to additional topics of interest: ensuring broader use of the standards, recognizing measurement issues that have become salient since the publication of the standards in 1999 (e.g., assessing students with disabilities and English language learners), and determining whether supplementary materials would enhance the usability and broader use of the standards. The winter issue will contain papers and discussion addressing the theme “Toward a Theory of Educational Achievement Testing: Practical Guidelines and Principles,” which is the issue’s working title.

To those of you with 9-month contracts, I hope that you are enjoying a relaxing and productive summer. To those of us with year-round jobs, let’s remember to take some time off to enjoy the warm weather. (I wrote this last sentence from a little trattoria in Tuscany...)

Steve Ferrara
Editor

References

Buckendahl, C. W. (Ed.). (2005). Special issue: Qualitative inquiries of participants’ experiences with standard setting. Applied Measurement in Education, 18(3).

Fitzpatrick, A. R. (1989). Social influences in standard setting: The effects of social interaction on group judgments. Review of Educational Research, 59(3), 315–328.

Plous, S. (1993). The psychology of judgment and decision making. New York: McGraw-Hill.

Wainer, H., Wang, X. A., Skorupski, W. P., & Bradlow, E. T. (2005). A Bayesian method for evaluating passing scores: The PPoP curve. Journal of Educational Measurement, 42(3), 271–281.
