
Editorial

In this special issue of EMIP, Judith Koenig of the Board on Testing and Assessment of the National Research Council has organized several articles around the topic, “Is it time to update the Standards for Educational and Psychological Testing?” The current Standards were published only seven years ago. Earlier versions of the Standards were published in the 1960s, in the 1970s, and in 1985. The amount of time that has elapsed since the last revision is not the important consideration here, of course. The important questions are whether the current Standards document addresses current issues using current conceptions, criteria, and considerations and whether it is actually used in testing practice. Koenig and the authors—Wayne Camara, Naomi Chudowsky, Stuart Elliott, William G. Harris, Daniel Koretz, Suzanne Lane, Robert Linn, Lorraine McDonnell, Polly Parker, Barbara Plake, Stephen Sireci, and Lauress Wise—consider these issues. Koenig provides background on the 2006 NCME symposium that is the basis for these articles and synthesizes and extends what the authors and commentators have to say.

I’d like to raise two issues for consideration as the discussion moves forward on what to do about the Standards. First, the educational testing sector of our profession and industry has evolved rapidly, startlingly so, over the last four years because of the requirements of No Child Left Behind (NCLB). For example, NCLB and related legislation require us to provide alternate assessments for students with significant cognitive disabilities and English language proficiency assessments for English language learners. In addition, the use of technology has expanded rapidly in recent years: a range of selection tests and even some state content area assessments are delivered on computer and online, innovative item formats and assessment tasks capitalize on new computing power, and computer programs generate items on the fly and score open-ended responses automatically. All of that is pretty dazzling. The 1999 Standards is not silent on these recent advances. The question is whether the current Standards addresses them adequately. Further, our colleagues in other professional education areas are now using the Standards to guide their work. You can see in the 2005 and 2006 programs of NCME and the National Conference on Large Scale Assessment that special educators with little or no training in large-scale assessment, who are involved in statewide alternate assessment programs, are relying on the Standards to understand the types of evidence they are required to include in technical reports. I had an illuminating experience in an AERA Division D session last spring on English language proficiency assessments. I was gratified when a well-known linguist and opinion leader on assessing English language learners referred explicitly to the Standards in his talk. That is, I was gratified until he said he was using the 1985 edition. So, not only should we be concerned about the currency of the Standards; we also need to consider whether they are comprehensible and practically useful for a wide range of other professionals and, in fact, educational policymakers and even legislators.

Second point: The U.S. Department of Education now regulates our work in state assessment through the NCLB peer review process. Here is what the Standards and Assessments Peer Review Guidance: Information and Examples for Meeting Requirements of the No Child Left Behind Act of 2001 (dated April 28, 2004) has to say about the role of peer review:

The No Child Left Behind Act of 2001 (NCLB) reformed Federal educational programs to support State efforts to establish challenging standards, to develop aligned assessments, and to build accountability systems for districts and schools that are based on educational results. In particular, NCLB includes explicit requirements to ensure that students served by Title I are given the same opportunity to achieve high standards and are held to the same high expectations as all other students in each State. [p. 1; retrieved September 12, 2006, from http://www.ed.gov/admins/lead/account/saa.html#peerreview]

The peer review guidance for submitting review materials specifies seven requirements for which states must provide supporting evidence for their assessment and accountability programs. For example, Section 4 of the guidance mandates “a system of assessments with high technical quality” and provides several critical elements (e.g., element 4.1 for evidence: “For each assessment, including alternate assessment(s), has the State documented the issue of validity ... as described in the Standards for Educational and Psychological Testing?”). Critical elements are illustrated with examples of acceptable and incomplete evidence. States that fail to provide sufficient and convincing evidence in these seven areas risk public embarrassment and loss of Title I funds. The Standards plays a prominent role in guiding states in providing supporting evidence. In fact, NCLB peer review guidance defines the rules of evidence for evaluating the rigor and appropriateness of state assessments and, in large part, does so by following the Standards. Attorneys follow rules of evidence required for federal and state courts (e.g., Rule 401, Definition of “Relevant Evidence”; see http://www.law.cornell.edu/rules/fre/). Our rules of evidence appear in the Standards, but they are also sprinkled through a range of publications, including the just-published fourth edition of Educational Measurement (Brennan, 2006).

Just to emphasize the regulatory function of NCLB peer review, here is a definition of government regulation from Wikipedia:

A regulation is a legal restriction promulgated by government administrative agencies through rulemaking supported by a threat of sanction or a fine ... Regulation mandated by the government or state attempts to produce outcomes which might not otherwise occur, produce or prevent outcomes in different places to what might otherwise occur, or produce or prevent outcomes in different timescales than would otherwise occur ... Common examples of regulation include attempts to control market entries, prices, wages, pollution effects, employment for certain people in certain industries, standards of production for certain goods and services. [Retrieved September 12, 2006, from http://en.wikipedia.org/wiki/Regulation]


I do not believe that our work in educational testing has ever before been regulated so comprehensively and in such detail (cf. Phillips & Camara, 2006, pp. 733–755, on federal law and regulations on assessing students with disabilities and English language learners and on testing litigation). Certainly, it is great news that the peer review guidelines rely on the Standards. Is the Standards document adequately current and comprehensive for this important new role? The articles in this issue identify areas in which it probably is not.

It is crucial that we consider that question now, given the role of the Standards in NCLB peer reviews. And it is crucial, whatever the educational measurement community decides, that the process is completed in the next year or so.¹ Otherwise, peer reviewers will regulate state assessments without adequate guidance from the Standards in a number of challenging areas.

As I discussed in the summer issue of EMIP, a lively debate about standard setting methods and incisive research have been evolving in several measurement journals since 2004. Matthew Schulz and Mark Reckase extend the debate and the research in this issue. Reckase’s article in that summer issue of EM:IP, “A Conceptual Framework for a Psychometric Theory for Standard Setting with Examples of Its Use for Evaluating the Functioning of Two Standard Setting Methods” (Reckase, 2006), is the starting point for Schulz’s commentary and Reckase’s rejoinder in this issue. They conceive of things somewhat differently but appear to agree on a couple of points: (a) The standard setting method—specifically, the panelist’s judgmental task—must be specified in detail and in a comprehensible way so that panelists make judgments as intended and avoid judgmental biases. (The literature on scoring writing assessments demonstrates that training judges to avoid biases is an achievable goal.) (b) We can learn a great deal from modeling the standard setting judgmental process statistically and translating what we learn into practical improvements to the methods we use.

The writing and thinking in these articles are lively. You’ll learn a lot and enjoy the debate if you choose to read them. And I will repeat my summer exhortation, one that both authors also have voiced: standard setting is a particularly crucial area, one that is crying out for conceptual and empirical work. So let’s sustain the momentum and extend the excellent work of the last three years.

NCME News

This issue contains the slate of candidates for vice president, Board of Directors at Large, and Board of Directors representative from a testing organization. Be sure to watch the mail for your ballot and to submit your votes.

This issue also contains the calls for NCME’s six annual awards. These awards span the arc of our careers in educational measurement: from dissertation to early career to career contributions. Why not consider nominating a deserving colleague or even yourself? And please—when the awards committee chairs ask you to join a committee—say yes, no matter how busy you may be.

Cover Visual

For this issue I have selected a bar chart from a report on NCLB from Policy Analysis for California Education (PACE). The report, Is the No Child Left Behind Act Working? The Reliability of How States Track Achievement, and bar charts on reading and mathematics are available on the PACE website (see http://pace.berkeley.edu/pace_publications.html).

This visual illustrates a debate that began at least 10 years ago about performance standards on state assessments. The initial question was whether state performance standards were demanding enough. Comparisons of students at/above Proficient on state assessments and state NAEP suggested to some that many states’ standards were woefully undemanding. Others suggested that NAEP Achievement Levels might be too high. That debate has evolved. Current questions focus on whether gains on state assessments represent real gains in achievement or score inflation. One way to consider that question is to look at changes in achievement on state assessments and NAEP. The cover visual addresses that question, and specifically with regard to the influence of NCLB requirements on achievement gains.

These paired bar graphs show average annual changes in percentages of students at/above Proficient in Grade 4 reading on a state’s customized reading assessment and the state NAEP assessment in reading. They make plain that average achievement changes on state assessments in these 12 selected states over the 4 years since the inception of NCLB do not follow a pattern similar to the changes in NAEP performance over the same period. Three states (Arkansas, California, and Washington) show average annual achievement gains of 3% to 4% on state assessments and 0% to 1% gains on NAEP. Five states (Illinois, Iowa, New Jersey, North Carolina, and Oklahoma) show achievement gains on the state assessment and small declines on state NAEP. Such comparisons involve important assumptions: (a) NAEP is a reasonable standard for comparison because the NAEP content frameworks contain academic content and skills that appear in state content standards; (b) NAEP is a relatively unpolluted measure (at least at Grades 4 and 8) of achievement because no stakes are attached to NAEP performance for schools, teachers, and students; and (c) state assessments are particularly sensitive to changes in achievement because they are tightly aligned with state content standards, and local curriculum emphasizes state standards, or they are particularly susceptible to “teaching to the test” rather than “good instruction” because the stakes are so high for states, school systems, schools, and teachers. I should emphasize that the differences between performance on the state assessments and NAEP are small but important because they compound over years.
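To make the arithmetic behind these comparisons concrete, here is a minimal sketch in Python of how average annual changes and their cumulative divergence could be computed. The percent-proficient figures below are hypothetical placeholders, not values from the PACE report, and the simple first-to-last averaging is one reasonable choice among several.

```python
# A sketch (not the PACE authors' method) of the comparison the cover
# visual makes: average annual change in percent proficient on a state
# test versus state NAEP. All figures are hypothetical placeholders.

def average_annual_change(percent_proficient_by_year):
    """Mean year-over-year change, in percentage points."""
    p = percent_proficient_by_year
    return (p[-1] - p[0]) / (len(p) - 1)

# Hypothetical Grade 4 reading percent-proficient figures, 2002-2005.
state_test = [48.0, 52.0, 55.0, 59.0]   # a state's own assessment
state_naep = [30.0, 30.5, 31.0, 31.0]   # state NAEP

state_gain = average_annual_change(state_test)  # ~3.7 points/year
naep_gain = average_annual_change(state_naep)   # ~0.3 points/year

# The per-year gap looks small, but it compounds: after n years the
# two trend lines diverge by roughly n times the annual difference.
years = 4
cumulative_gap = years * (state_gain - naep_gain)
print(f"State test gain: {state_gain:.1f} points/year")
print(f"NAEP gain:       {naep_gain:.1f} points/year")
print(f"Divergence after {years} years: {cumulative_gap:.1f} points")
```

Under these made-up numbers, an annual gap of about 3 percentage points grows to a double-digit divergence within a single NCLB accountability cycle, which is the sense in which small differences compound over years.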

How do the authors of the PACE report answer their two-part question about NCLB’s success and the trustworthiness of state assessment performance gains?

State results consistently exaggerate the percentage of fourth-graders deemed proficient or above in reading and math—for any given year and for reported rates of annual progress, compared with NAEP results. For reading, this gulf between the two testing systems has actually grown wider over time. Any analysis conducted over the 1992–2005 period based solely on state results will exaggerate the true amount of progress made by fourth-graders. (p. 19)

And further, “In short, the pattern of inflated state results has generally persisted during the three years since enactment of NCLB” (p. 14).

That’s awfully discouraging. The authors acknowledge the view that state assessments may be more sensitive to state curriculum emphases than is state NAEP, and they dismiss that view as “unlikely” (p. 18). Beyond the very real worries of test score inflation is the reality of achieving real gains in student achievement. We will need more than the 4 years since NCLB—and even the nearly 25 years of school improvement efforts since the report A Nation at Risk—to overcome the effects of decades of poverty, low expectations, and in some cases inadequate teaching.

In Closing

The final issue of 2006 will follow soon after you receive this issue. It is another special issue, this time on the theme “Toward a Theory of Educational Achievement Testing: Principles and Practical Guidelines.”

Steve Ferrara
Editor

Note

¹This sounds naïve, of course. The process for revising the 1999 Standards required six years.

References

Brennan, R. L. (Ed.). (2006). Educational Measurement (4th ed.). Westport, CT: American Council on Education and Praeger.

Fuller, B., Gesicki, K., Kang, E., & Wright, J. (2006). Is the No Child Left Behind Act Working? The Reliability of How States Track Achievement (PACE Working Paper 06-01). Retrieved August 23, 2006, from http://pace.berkeley.edu/pace_publications.html

Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 733–755). Westport, CT: American Council on Education and Praeger.

Reckase, M. D. (2006). A conceptual framework for a psychometric theory for standard setting with examples of its use for evaluating the functioning of two standard setting methods. Educational Measurement: Issues and Practice, 25(2), 4–18.
