
CHAPTER 3

Educational Testing Policy: The Changing Federal Role

Contents

Highlights
Chapter 1, Elementary and Secondary Education Act: A Lever on Testing
    History of Chapter 1 Evaluation
    Specific Requirements for Evaluating Programs
    Standardized Tests and Mandated Evaluations
    Other Uses of Tests in Chapter 1
    Competing Tensions
    Effects of Chapter 1 on Local Testing
    Ripple Effects of Chapter 1 Requirements
    Chapter 1 Testing in Transition
National Assessment of Educational Progress
    Purpose
    Safeguards and Strengths
    Accomplishments
    Caveats
    The 1988 Amendments
    NAEP in Transition
National Testing
    Overview
    Conclusions

Table

3-1. Achievement Percentiles for Chapter 1 Students Tested on an Annual Cycle, 1979-80 to 1987-88

CHAPTER 3

Educational Testing Policy: The Changing Federal Role

Highlights

● As the Federal financial commitment to education expanded during the 1960s and 1970s, new demands for test-based accountability emerged. Federal policymakers now rely on standardized tests to assess the effectiveness of several Federal programs.

● Evaluation requirements in the Federal Chapter 1 program for disadvantaged children, which result in more than 1.5 million children being tested every year, have helped escalate the amount of testing in American schools. Questions arise about whether results of Chapter 1 testing produce an accurate picture of the program's effectiveness, about the burden that the testing creates for schools, teachers, and children, and about the usefulness of the information provided by the test results.

● The National Assessment of Educational Progress (NAEP) is a unique Federal effort begun in the 1960s to provide long-term and continuous data on the achievement of American school children in many different subjects. NAEP has become a well-respected instrument to help gauge the Nation's educational health. Recent proposals to change NAEP to allow for comparisons in performance between States, to establish proficiency standards, or to use NAEP items as a basis for a system of national examinations raise questions about how much NAEP can be changed without compromising its original purposes.

● National testing is a critical issue before Congress today. Many questions remain about the objectives, content, format, cost, and administration of any national test.

The role of the Federal Government in educational testing policy has been limited but influential. Given the decentralized structure of American schooling, few decisions supported with test information are made at the Federal level. States and local school districts make most of the decisions about which tests to give, when to give them, and how to use the information. The Federal Government weighs in primarily by requiring test-based measures of effectiveness for some of the education programs it funds, operating its own testing program through the National Assessment of Educational Progress (NAEP), and affording some limited protections and rights to test takers and their parents (see ch. 2).

This circumscribed Federal role has nevertheless influenced the quantity and character of testing in American schools. As Federal funding has expanded over the past 25 years, so has the Federal appetite for test-based evidence aimed at ensuring accountability for those funds. This growth in Federal influence has evolved with no specific and deliberate Federal policy on testing. Most Federal decisions about testing have been made in the context of larger program reauthorization bills, with evaluation questions treated as program issues rather than testing policy issues. As discussed in the preceding chapter, Congress did consider several bills in the 1970s and 1980s related to test disclosure and the rights of test takers; only the Family Educational Rights and Privacy Act of 1974 became law.

This picture is changing. Congress now faces several critical choices that could redefine the Federal role in educational testing. In three policy areas, Congress has already played an important role, and its decisions in the near term could have significant consequences for the quantity and quality of educational testing. Accountability for federally funded programs is the first area. The tradition of achievement testing as a way to hold State- or district-level education authorities accountable is as old as public schooling itself. Continued spending on compensatory education has become increasingly dependent on evidence that these programs are working. Thus, for several decades now the single largest Federal education program—Chapter 1 (Compensatory Education)—has struggled with the need for evaluation data from States and districts that receive Federal monies. Increasing reliance on standardized norm-referenced achievement tests to



monitor Chapter 1 programs indicates an increasing Federal influence on the nature and quantity of testing. Congress has revised its accountability requirements on several occasions, and in today's atmosphere of test reform, the $6 billion Federal Chapter 1 program can hardly be ignored. The basic policy question is whether the Federal Government is well served by the information derived from the tests used today and whether modifications could provide improved information.

Second, Federal support for collection of educational data, traditionally intended to keep the Nation informed about overall educational progress, is now viewed by some as a lever to influence teaching and learning. Thus, the 20-year-old NAEP, widely acclaimed as an invaluable instrument to gauge the Nation's educational health, has, in the past few years, attracted the attention of some policymakers interested in using its tests to change the structure and content of schooling.

A third and related issue is national testing. In addition to various suggested changes to NAEP, a number of proposals have emerged recently—from the White House, various agencies of the executive branch, and blue ribbon commissions—to implement nationwide tests. Although the purposes of these tests vary, it is clear they are intended to bring about improved achievement, not simply to estimate current levels of learning. The idea of national testing seems to have gained greater public acceptability. Proponents argue that "national" does not equal "Federal," and that national education standards do not require Federal determination of curricula and design of tests. Others fear that national testing will lead inevitably to Federal control of education.

OTA analyzed the development and effects of the current Federal role in testing and examined pending proposals to change that role. This chapter discusses OTA's findings vis-à-vis Chapter 1, NAEP, and national testing.

Chapter 1, Elementary and Secondary Education Act: A Lever on Testing

The passage of the 1965 Federal Elementary and Secondary Education Act (ESEA) heralded a new era of broad-scale Federal involvement in education and established the principle that with Federal education funding comes Federal strings. The cornerstone of ESEA was Title I (renamed Chapter 1 in 1981), which is still the largest program of Federal aid to elementary and secondary schools.1 The purpose of Title I/Chapter 1, both then and now, is to provide supplementary educational services, primarily in reading and mathematics, to low-achieving children living in poor neighborhoods. With an appropriation of $6.2 billion for fiscal year 1991,2 Chapter 1 channels funds to almost every school district in the country. Some 51,000 schools, including over 75 percent of the Nation's elementary schools, receive Chapter 1 dollars, which are used to fund services to about 5 million children in preschool through grade 12. Given its 25-year history and broad reach, the effect of Chapter 1 on Federal testing policy is profound.

History of Chapter 1 Evaluation

From the beginning, the Title I/Chapter 1 law required participating school districts to periodically evaluate the effectiveness of the program in meeting the special educational needs of educationally disadvantaged children, using ". . . appropriate objective measures of educational achievement"3—interpreted to mean norm-referenced standardized tests. Congress has revised the evaluation requirements many times to reflect changing Federal priorities and address new State and local concerns.

During the 1960s and 1970s, the Title I evaluation provisions generally became more prescriptive and detailed. In 1981, a dramatic loosening of Federal requirements occurred: while evaluations were still required, Federal standards governing the format, frequency, and content of evaluations were deleted. In the absence of Federal guidance, confusion about just what was required ensued at the State and local

1 The remainder of this section is from Nancy Kober, "The Role and Impact of Chapter 1 Evaluation and Assessment Requirements," OTA contractor report, May 1991.

2 Of this $6.2 billion, approximately $5.5 billion is distributed by formula to local school districts. The remainder is used for three State-administered programs for migrant students, students with disabilities, and neglected and delinquent children, and for other specialized programs and activities, such as State administration and technical assistance.

3 Public Law 89-10.


Photo credit: UPI/Bettmann

President Johnson signing the Elementary and Secondary Education Act of 1965 at a school in Johnson City, Texas. The enactment of this law was a milestone in Federal education policy.

levels. Congress responded by gradually retightening the evaluation requirements. The most recent set of amendments, the 1988 reauthorization, made Chapter 1 assessment more consequential and controversial than ever before by requiring Chapter 1 schools to modify their programs if they could not demonstrate achievement gains among participating children—the so-called "program improvement" provisions.

Through all these revisions, the purposes of Title I/Chapter 1 evaluation have remained much the same: to determine the effectiveness of the program in improving the education of disadvantaged children; to instill local accountability for Federal funds; and to provide information that State and local decisionmakers can use to assess and alter programs.

Specific Requirements for Evaluating Programs

Title I/Chapter 1 is a partnership between Federal, State, and local governments, and the evaluation provisions reflect this division of responsibility. Evaluation of the effects of Chapter 1 on student achievement begins at the project level—usually the school. Test scores of participating children are collected from schools, analyzed, and summarized by the local education agency (LEA). Each LEA reports its findings to the State education agency (SEA), which aggregates the results in a report to the U.S. Department of Education. (States can, if they wish, institute additional requirements regarding the format, content, and frequency of Chapter 1 evaluations.) Congress, by statute, and the Department of Education, through regulations and other written guidance (particularly the guidance in the Department's Chapter 1 Policy Manual4), set standards for SEAs and LEAs to follow in evaluating and measuring progress of Chapter 1 students. The Department also compiles the State data and sends Congress a report summarizing the national achievement results, along with demographic data for Chapter 1 participants.

4 U.S. Department of Education, Chapter 1 Policy Manual (Washington, DC: April 1990).


Standardized Tests and Mandated Evaluations

Since the creation of the Title I/Chapter 1 Evaluation and Reporting System (TIERS) in the mid-1970s, the Department has relied on norm-referenced standardized test scores as an available, straightforward, and economical way of depicting Chapter 1 effectiveness. The law, for its part, gives an imprimatur to standardized tests, through numerous references to "testing," "scores," "objective measures," "measuring instruments," and "aggregate performance." Chapter 1 evaluation has become nearly synonymous with norm-referenced standardized testing.

The purpose of TIERS has changed little since it became operative in 1979: to establish standards that will result in nationally aggregated data showing changes in Chapter 1 students' achievement in reading, mathematics, and language arts. To conform with TIERS, States and local districts must report gains and losses in student achievement in terms of Normal Curve Equivalents (NCEs), a statistic developed specifically for Title I. NCEs resemble percentile scores, but can be used to compute group statistics, combine data from different norm-referenced tests (NRTs), and evaluate gains over time. (Gains in scores, which can range from 1 to 99, with a mean of 50, reflect an improvement in position relative to other students.5) To produce NCE scores, local districts must use an NRT or another test whose scores can be equated with national norms and aggregated. Thus, although the Chapter 1 statute does not explicitly state that LEAs must use NRTs to measure Chapter 1 effectiveness, the law and regulations together have the effect of requiring NRTs because of their insistence on aggregatable data and their reliance on the NCE standard.
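To make the NCE arithmetic concrete, the sketch below converts percentile ranks to NCEs and computes an average pretest-to-post-test gain for a group of students. It is a minimal illustration, assuming only the conventional definition of the NCE scale (an equal-interval rescaling of the normal curve with a mean of 50 and a standard deviation of 21.06, so that NCEs of 1, 50, and 99 coincide with those percentile ranks); the function names and the sample numbers are hypothetical and are not drawn from TIERS itself.

    from statistics import NormalDist, mean

    ND = NormalDist()  # standard normal distribution

    def percentile_to_nce(pctl: float) -> float:
        """Convert a percentile rank (1-99) to a Normal Curve Equivalent,
        assuming the conventional scale: mean 50, standard deviation 21.06."""
        z = ND.inv_cdf(pctl / 100.0)
        return 50.0 + 21.06 * z

    def mean_nce_gain(pretest_pctls, posttest_pctls) -> float:
        """Average NCE gain for a group; unlike raw percentile ranks, NCEs are
        equal-interval, so gains can be averaged and aggregated across tests."""
        gains = [percentile_to_nce(post) - percentile_to_nce(pre)
                 for pre, post in zip(pretest_pctls, posttest_pctls)]
        return mean(gains)

    # Hypothetical example: five Chapter 1 students, percentile ranks before
    # and after services.
    pre = [20, 25, 18, 30, 22]
    post = [27, 28, 24, 33, 25]
    # A positive value reflects an average improvement in position relative
    # to the norming population.
    print(round(mean_nce_gain(pre, post), 1))

Working in NCE units rather than raw percentile ranks is what allows the requirement for "aggregatable" data to be met: equal NCE intervals can be added and averaged across students, schools, and test publishers.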

The 1988 law, as interpreted by the Department of Education, changed the basic evaluation provisions in ways that increased the frequency and significance of standardized testing in Chapter 1. Specifically, the law:

● through the new "program improvement" provisions, put teeth into the longstanding Title I/Chapter 1 requirement that LEAs use evaluation results to determine whether and how local programs should be modified. Schools with stagnant or declining aggregate Chapter 1 test scores must develop improvement plans, first in conjunction with the district and then with the State, until test scores go up.
● gave the Department the authority to reinstate national guidelines for Chapter 1 evaluation (which had been eliminated in 1981) and required SEAs and LEAs to conform to these standards.
● focused greater attention on (and, through regulation, required measurement of) student achievement in higher order analytical, reasoning, and problem-solving skills.
● directed LEAs to develop "desired outcomes," or measurable goals, for their local Chapter 1 programs, which could include achievement outcomes to be assessed with standardized tests.
● expanded the option for high-poverty schools to operate schoolwide projects,6 as long as they can demonstrate achievement gains (i.e., higher test scores) among Chapter 1-eligible children.
● as interpreted by the Department, required LEAs to conduct a formal evaluation that met TIERS standards every year, rather than every 3 years. (In actual practice, most States required annual evaluations.)

Other Uses of Tests in Chapter 1

Producing data for national evaluations is only one of several uses of standardized tests in Chapter 1. Under the current law and regulations, LEAs are required, encouraged, or permitted to use tests for all the following decisions:

● identifying which children are eligible for Chapter 1 services and establishing a "cutoff score" to determine which children will actually be served;
● assessing the broad educational needs of Chapter 1 children in the school;

5 Mary Kennedy, Beatrice F. Birman, and Randy E. Demaline (Washington, DC: U.S. Department of Education, 1986), p. E-2.

6 Under the schoolwide project option, schools with 75 percent or more poor children may use their Chapter 1 funds for programs to upgrade the educational program for all children, without regard to Chapter 1 eligibility; in exchange for this greater flexibility, these schools must agree to increased accountability.

● determining the base level of achievement of individual Chapter 1 children before receiving services (the "pretest");
● assessing the level of achievement of Chapter 1 children after receiving services (the "post-test"), in order to calculate the change data required for national evaluations;
● deciding whether schools with high proportions of low-achieving children should be selected for projects over schools with high poverty;7
● allocating funds to individual schools;
● establishing goals for schoolwide projects;
● determining whether schoolwide projects can be continued beyond their initial 3-year project period;
● annually reviewing the effectiveness of Chapter 1 programs at the school level for purposes of program improvement;
● deciding which schools must modify their programs under the "program improvement" requirements;
● determining when a school no longer needs program improvement;
● identifying which individual students have been in the program for more than 2 years without making sufficient progress; and
● assessing the individual program needs of students that have participated for more than 2 years.

In addition, Congress and the Department of Education use standardized test data accumulated from State and local evaluations for a variety of purposes:

● justifying continued appropriations and authorizations;
● weighing major policy changes in the program;
● targeting States and districts for Federal monitoring and audits; and
● contributing to congressionally mandated studies of the program.

Competing Tensions

Chapter 1 is a good example of how Congress must weigh competing tensions when making decisions about Federal accountability and testing. For example, in Chapter 1, as in other education programs, the need for Federal accountability must be weighed against the need for State and local flexibility in program decisions. The Federal appetite for statistics must be viewed in light of the undesirable consequences of too much Federal burden and paperwork—lost instructional time and declining political support for Federal programs, to name a few. The Federal desire for succinct, "objective," and aggregatable data must be judged against the reality that test scores alone cannot provide a full and accurate picture of Chapter 1's other goals and accomplishments (e.g., redistributing resources to poor areas, mitigating the social effects of child poverty, building children's self-esteem, and keeping students in school). Finally, the Federal need for summary evaluations on which to formulate national funding and policy decisions must be weighed against the local need for meaningful, child-centered information on which to base day-to-day decisions about instructional methods and student selection.

The number of times Congress has amended the Chapter 1 evaluation requirements suggests how difficult it is to balance these competing tensions.

Effects of Chapter 1 on Local Testing

Chapter 1 has helped create an enormous system of local testing. Almost every Chapter 1 child is tested every year, and in some cases twice a year, to meet national evaluation requirements. In school year 1987-88, over 1.6 million Chapter 1 participants were tested in reading and just under 1 million in mathematics. Sometimes this testing is combined with testing that fulfills State and local needs; other times Chapter 1 has caused districts to administer tests more frequently, or with different instruments, than they would in the absence of a Federal requirement.

Because SEAs and LEAs often use the same test instruments to fulfill both their own needs and Chapter 1 requirements, and because States and districts expanded their testing programs during roughly the period when Chapter 1 appropriations were growing, it is difficult, perhaps impossible, to sort out which entity is responsible for what degree of the total testing burden. Although States and districts often coordinate their Chapter 1 testing with other testing needs, many LEAs report that without

7 A proposal to amend Title I so that all funding would be distributed on the basis of achievement test scores was put forth in the late 1970s by then-Congressman Albert Quie (R-MN). The proposal was not accepted, but a compromise provision was adopted, which remains in the law today, permitting school districts to allocate funds to schools based on test scores in certain limited situations.


Photo credit: Julie Miller, Education Week

Classrooms like this in Jefferson Parish, Louisiana, benefit from the extra assistance for disadvantaged students provided by Chapter 1 of the Elementary and Secondary Education Act. Testing has always been a big part of Chapter 1 activity.

Chapter 1, they would do less testing. A district administrator from Detroit, for example, estimated that her school system conducts twice as much testing because of Chapter 1.8 The research and evaluation staff of the Portland (Oregon) Public Schools noted that in the absence of a Chapter 1 requirement to test second graders, their district would begin standardized testing later, perhaps in the third or fourth grade.9 (In school year 1987-88, about 22 percent of Chapter 1 public and private school participants were in grades pre-K through one and were already exempted from testing. Another 26 percent of the national Chapter 1 population were in grades two and three; these children must be tested under current requirements.) One State Chapter 1 coordinator said that without Chapter 1, his State would require only its State criterion-referenced instrument, and not NRTs. At the school level, principals and teachers express frustration with the amount of time spent on testing and tracking test data in Chapter 1 and the degree of disruption it causes in the academic schedule.

National studies of Chapter 1 and case studies of its impact in particular districts have uncovered some significant concerns about the appropriateness of using standardized tests to assess the program's overall effectiveness, make program improvement decisions, and determine the success of schoolwide projects. Over the years, Chapter 1 researchers and practitioners have raised a number of technical questions about the quality of Chapter 1 evaluation data and have expressed caveats about its limitations in assessing the full impact and long-term consequences of Chapter 1 participation. With the new requirements that raised the stakes of evaluation, debate over the data's validity and limitations has

8 Sharon Johnson-Lewis, director, Office of Planning, Research and Evaluation, Detroit Public Schools, remarks at OTA Advisory Panel meeting, June 28, 1991.

9 This and the other observations about the impact of Chapter 1 on testing practices are taken from Kober, op. cit., footnote 1. Case studies of the Philadelphia, PA, and Portland, OR, public schools helped inform OTA's analysis and are cited throughout this chapter.


become more heated. For example, there is evidence from Philadelphia, Portland, and other districts that because of measurement phenomena, test results do not always target for program improvement the schools with the lowest achievement or the weakest programs. Similarly, schools with schoolwide projects have argued that a 3-year snapshot based on test scores does not always provide adequate time or an accurate picture of the project's success compared with more traditional Chapter 1 programs.

State and local administrators have also expressed concerns about the effect of Chapter 1 testing on instruction. While administrators and teachers are loath to admit to any practices that could be interpreted as "teaching to the test," there is some evidence from case studies and national evaluations that teachers make a point to emphasize skills that are likely to be tested. In districts such as Philadelphia and Portland, where a citywide test tied to local curriculum is also the instrument for Chapter 1 evaluation, teachers can readily justify this practice. Discomfort arises, however, when local administrators and teachers feel they are being pressed by Federal requirements to spend too much time drilling students in the type of "lower order" skills frequently included on commercially published NRTs, or when teachers hesitate to try newer instructional approaches, such as cooperative learning and active learning, for fear their efforts will not translate into measurable gains.

Of more general concern is the broad feeling that for the amount of burden it entails, Chapter 1 test data is not very useful for informing local program decisions. According to case studies and other analyses, teachers and administrators use federally mandated evaluation results far less often than other more immediate and more student-centered evaluation methods—e.g., criterion-referenced tests (CRTs), book tests, teacher observations, and various forms of assessment—to determine students' real progress and make decisions about instructional practices. Frequently the mandated evaluations are viewed as a compliance exercise—a "hoop" that States and local districts must jump through to obtain Federal funding.

Although Chapter 1 teachers, regular classroom teachers, and administrators do occasionally employ other types of assessment to make decisions about Chapter 1 students and projects, these alternative forms are not entrenched in the program in the same way that NRTs are, and are seldom considered part of the formal Chapter 1 evaluation process. While the Chapter 1 law contains some nods in the direction of alternative assessment—particularly for measuring progress toward desired outcomes and evaluating the effects of participation on children in preschool, kindergarten, and first grade—the general requirements for evaluation cause local practitioners to feel that NCE scores are the only results that really matter. They believe that alternative assessment will not become a meaningful component of Chapter 1 evaluation without explicit encouragement from Congress and the Department.

One bottom line question remains: what does the large volume of testing data generated by Chapter 1 evaluation tell Congress and other data users about the achievement of Chapter 1 children? To answer this question, it is useful to consider the data from a 10-year summary of Chapter 1 information, as shown in table 3-1.10 The first thing that is apparent from the summary data is how the millions of individual test scores required for Chapter 1 evaluation are aggregated into a single number for each grade for each year. Average annual percentile gains in achievement—comparing average student pretest scores and average post-test scores—have hovered in the range of 2 to 6 percentiles in reading, and 2 to 11 percentiles in mathematics. For some grade levels, in some years, there have been greater improvements, but in general the gains have been modest and the post-test scores have remained low. For example, in 1987-88 the average post-test score for Chapter 1 fourth graders was the 27th percentile in reading and the 33rd percentile in mathematics. In analyzing these data it is important to understand that Chapter 1 children, by definition, are the lowest achieving students in their schools, and that once a child's test scores exceed the cutoff score for the district that child is no longer eligible for Chapter 1 services. There has been some upward trend, more pronounced in mathematics than in reading, but overall closing of the gap has been slow. In addition, because there is no control group for Chapter 1 evaluation, it is difficult to assess what these post-test scores really mean, i.e., how well Chapter

10 For the data referred to in this discussion, see U.S. Department of Education, A Summary of State Chapter 1 Participation and Achievement Information for 1987-88 (Washington, DC: 1990).



Table 3-1—Achievement Percentiles for Chapter 1 Students Tested on an Annual Cycle, 1979-80 to 1987-88

Changes in percentile ranks for reading
Grade   1979-80  1980-81  1981-82  1982-83  1983-84  1984-85  1985-86  1986-87  1987-88
 2         2        2        2        2        2        3        2        4        4
 3         4        5        3        4        4        4        4        5        5
 4         3        4        4        4        4        5        5        6        5
 5         3        5        5        5        5        6        5        4        4
 6         4        6        5        6        5        5        5        5        5
 7         2        3        4        3        4        6        4        3        4
 8         3        4        5        4        4        4        4        3        4
 9         2        4        4        4        3        2        3        2        3
10        -1        2        1        2        1        2        2        2        2
11        -3        3        1       -1        0        2        3        3        2
12         2        0        2        0        1        0        0        2        0

Changes in percentile ranks for mathematics
Grade   1979-80  1980-81  1981-82  1982-83  1983-84  1984-85  1985-86  1986-87  1987-88
 2         2        5        5        3        6        6        9       10       11
 3         1        3        5        5        6        4        6        7        7
 4         3        6        5        4        5        6        6        8        8
 5         4        4        6        8        7        7        9        7        7
 6         6        8        6        8        7        6        7        7        7
 7         4        3        5        7        5        6        6        5        4
 8         4        5        5        6        5        5        4        4        5
 9         1        1        2        3        1        2        2        5        4
10        -2        1        0        2        1        2        4        3        4
11         1        2        1        1        2        3        4        3        3
12         2        0        1        0        3        2        2        4       -1

SOURCE: Beth Sinclair and Babette Gutmann, A Summary of State Chapter 1 Participation and Achievement Information for 1987-88 (Washington, DC: U.S. Department of Education, 1990), pp. 49-50.

1 children would achieve in the absence of any intervention.11
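Footnote 11's point about the missing control group and "regression to the mean" can be illustrated with a small simulation: when children are selected for Chapter 1 precisely because their pretest scores fall below a cutoff, the group's average score tends to rise at post-test even if no real learning gain has occurred. The sketch below is a hypothetical illustration only; the score scale, cutoff, and error variance are invented for the example and are not taken from TIERS or from the data in table 3-1.

    import random

    random.seed(0)

    # Hypothetical model: each student has a stable underlying achievement
    # level, and any single test score is that level plus measurement error.
    def observed(true_level: float, error_sd: float = 8.0) -> float:
        return true_level + random.gauss(0.0, error_sd)

    true_levels = [random.gauss(50.0, 15.0) for _ in range(100_000)]
    pretest = [observed(t) for t in true_levels]

    # Select the "Chapter 1" group: students whose pretest fell below a cutoff.
    cutoff = 35.0
    selected = [(t, pre) for t, pre in zip(true_levels, pretest) if pre < cutoff]

    # Post-test with no intervention effect at all: just a second noisy score.
    pre_mean = sum(pre for _, pre in selected) / len(selected)
    post_mean = sum(observed(t) for t, _ in selected) / len(selected)

    print(f"mean pretest score of selected group:   {pre_mean:.1f}")
    print(f"mean post-test score of selected group: {post_mean:.1f}")
    # The post-test mean is noticeably higher than the pretest mean purely
    # because selection on a noisy pretest picks up students whose measurement
    # errors happened to be negative; without a comparison group, this artifact
    # is hard to separate from a genuine program effect.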

For purposes of this analysis, the real question is whether the information from these test scores is necessary or sufficient to answer the accountability questions of interest to Congress. For the disadvantaged population targeted by the program, the achievement score gains are evidence of improvement. Thus, when taken together with other evaluative evidence about the program's impact, the test scores support continued funding. But whether the test scores reveal anything significant about what and how Chapter 1 children are learning remains ambiguous. And in the light of unanticipated effects of the extensive testing, it is not clear that the information gleaned from the tests warrants the continuation of an enormous and quite costly evaluation system in its present form.

Ripple Effects of Chapter 1 Requirements

Title I/Chapter 1 established a precedent for achievement-based accountability requirements adopted in many subsequent Federal education programs. In the migrant education program added in 1966, the bilingual education program added in 1967, the Head Start program enacted in the Economic Opportunity Amendments of 1967, and programs that followed, Congress required recipients of Federal funds to evaluate the effectiveness of the programs funded.12 As a result of Federal requirements, State and local agencies administer a whole range of tests—to place students, assess the level of participants' needs, and determine progress. Even when NRTs are not explicitly required, they are often the preferred mode of measurement for Federal accountability because they can be applied consistently, are relatively inexpensive, and leave a clearly

11 One of the more vexing evaluation problems has been to infer "treatment effects" from studies with no control group. For discussion and analysis of methods designed to correct for "regression to the mean" and other statistical constraints, see Anand Desai, "Technical Issues in Measuring Scholastic Improvement Due to Compensatory Education Programs," pp. 143-153.

12 For discussion of outcome-based performance measures in vocational education and job training programs see, e.g., U.S. Congress, Office of Technology Assessment, "Performance Standards for Secondary School Vocational Education," background paper of the Science, Education, and Transportation Program, April 1989.


understood and justifiable trail for Federal monitors and auditors.

The 1965 ESEA had another, less widely recognized impact on State testing practices. Title V of the original legislation provided Federal money to strengthen State departments of education, so that they could assume all the administrative functions bestowed on them by the new Federal education programs. This program helped usher in an era of increased State involvement in education and would have a significant impact down the road as States assumed functions and responsibilities far beyond those required by Federal programs or envisioned by Congress in 1965.

Chapter 1 Testing in Transition

OTA finds that because of its size, breadth, and influence on State and local testing practices, Chapter 1 of ESEA provides a powerful lever by which the Federal Government can affect testing policies, innovative test development, and test use throughout the Nation.

OTA's analysis brings to light several reasons why Congress ought to reexamine and consider significant changes to the Federal requirements for Chapter 1 evaluation and assessment.

● National policymakers and State and local program administrators have different data needs, not all of which are well served by NRTs.
● The implementation of the 1988 program improvement and schoolwide project requirements has underscored some of the inadequacies and limitations of using NRTs for local program decisions, while simultaneously increasing the consequences attached to these tests.
● While the uses and importance of evaluation data have changed substantially as a result of the 1988 amendments, the methods and instruments for collecting this data have remained essentially the same since the late 1970s. A better match is needed between the new goals of the law, particularly the goal to improve the quality of local projects, and the tools used to measure progress toward those goals.

As Congress approaches Chapter 1 reauthorization, it should examine how all the pieces that affect testing under the umbrella of Chapter 1 fit

Photo credit: The Jenks Studio of Photography

Research has shown that early intervention is important, and many schools like this one in Danville, Vermont, use Chapter 1 funds for preschool and kindergarten programs.

together. Many pieces are interrelated, but they do not always work harmoniously. For example, the timing and evaluation cycles for Federal, State, and local testing in existing law are not well coordinated. As part of this review, Congress should pay particular attention to the need to revise language that inadvertently endorses norm-referenced testing in situations where that type of testing may be inappropriate. Options such as data sampling may meet congressional needs. Clearer legislative language could help maintain and improve accountability, because States and local districts would know better what was expected.

The following questions can guide congressional deliberations regarding changes in Chapter 1:

● What information does Congress need to make policy and funding decisions about Chapter 1? Is Congress getting that information, and is it timely and useful?
● What information does the Department of Education need to administer the program?
● How do the data needs of State and local agencies differ from those of the Federal Government and each other?
● Is it realistic to serve national, State, and local needs with the same information system based on the same measurement tool?
● How well do NRTs measure what Chapter 1 children know and can do?
● Is the nationally aggregated evaluation data that is currently generated accomplishing what Congress intended? Specifically, do aggregates of aggregates of averages of NCE gains and negative gains present a meaningful and valid national picture of how well Chapter 1 children are achieving?
● To what extent is the value of cumulative data symbolic rather than substantive? For example, is being able to point to a rising line on a chart as important as having accurate, meaningful data about what Chapter 1 children know and can do? Can symbolic or oversight needs be fulfilled with less burdensome types of testing?
● What other types of data, beyond test scores, might meet Federal policymakers' criteria for objectivity?

In summary, OTA finds that Congress should revisit the Chapter 1 assessment and evaluation requirements in the attempt to lessen reliance on NRTs, reduce the testing burden, and stimulate the development of new methods of assessment more suited to the students and the program goals of Chapter 1. A careful reworking of the requirements could have widespread salutary effects on the use of educational tests nationwide. Congressional options for achieving these ends are identified in chapter 1 of this report.

National Assessment of Educational Progress


By the late 1960s, Title I/Chapter 1 and other Federal programs had produced a substantial amount of data concerning the achievement of disadvantaged children and other special groups of students. State and local testing told SEAs and LEAs how their students stacked up against national norms on specific test instruments. What was missing, however, was a context—a nationally representative database about the educational achievement of elementary and secondary school children as a group, against which to confirm or challenge inferences drawn from State, local, or other nationwide testing programs.

Although policymakers and the public could draw from a wide variety of statistics to make informed decisions on such issues as health and labor, they were operating in a vacuum when it came to education. The Department of Education produced a range of quantitative statistics on school facilities, teachers, students, and resources, but had never collected sound and adequate data on what American students knew and could do in key subject areas.

Francis Keppel, U.S. Commissioner of Education from 1962 to 1965, became troubled by this dearth of information and initiated a series of conferences to explore the issue.13 In 1964, as a result of these discussions, the Carnegie Corp. of New York, a private foundation, appointed an exploratory committee and charged it with examining the feasibility of conducting a national assessment of educational attainments. By 1966, the committee had concluded that a new battery of tests—carefully constructed according to the highest psychometric standards and with the consensus of those who would use it—would have to be developed.14

The vision became a reality in 1969, when the U.S. Office of Education began to conduct periodic national surveys of the educational attainments of young Americans. The resulting effort, NAEP, sometimes called "the Nation's report card," has the primary goal of obtaining reliable data on the status of student achievement and on changes in achievement in order to help educators, legislators, and others improve education in the United States.

Purpose

Today, NAEP remains the only regularly conducted national survey of educational achievement at the elementary, middle, and high school levels.15 To date it has assessed the achievement of some 1.7 million young Americans. Although not every subject is tested during every administration of the program, the core subjects of reading, writing, mathematics, science, civics, geography, and U.S.

13 In 1963, Keppel is reported to have lamented the fact that: "Congress is continually asking me about how bad or how good the schools are and we have no dependable information. They give different tests at schools for different purposes, but we have no idea generally about the subjects that educators value. . . ." OTA interview with Ralph W. Tyler, Apr. 5, 1991.

14 This early history of the National Assessment of Educational Progress (NAEP) is taken from the National Assessment of Educational Progress (Washington, DC: National Center for Education Statistics, 1974); and George Madaus and Dan Stufflebeam (eds.) (Boston, MA: Kluwer Academic Publishers, 1989). Conversations with Frank Womer, Edward Roeber, and Ralph Tyler, all involved in different capacities in the original design and implementation of NAEP, enriched the material found in published sources.

15 National Assessment of Educational Progress (Princeton, NJ: Educational Testing Service, 1986).



Photo credit: Office of Technology Assessment, 1992

Known as the Nation's Report Card, the National Assessment of Educational Progress issues summary reports for assessments conducted in a number of academic subject areas. These reports also analyze trends in achievement levels over the past 20 years.

history have been assessed more than once to determine trends over time. Occasional assessments have also examined student achievement in citizenship, literature, art, music, computer competence, and career development.

Safeguards and Strengths

The designers of the NAEP project took extreme care and built in many safeguards to ensure that a national assessment would not, in the worst fears of its critics, become any of the following: a stepping stone to a national individual testing program, a tool for Federal control of curriculum, a weapon to "blast" the schools, a deterrent to curricular change, or a vehicle for student selection or funds allocation decisions.16 An understanding of NAEP's design safeguards is crucial in order to comprehend what NAEP was and was not intended to do and why it is unique in the American ecology of student assessment. NAEP has seven distinguishing characteristics.

NAEP reports group data only, not individual scores. NAEP results cannot be used to infer how particular students, teachers, schools, or districts are achieving or to diagnose individual strengths and weaknesses. Prevention of these levels of score reporting was a prerequisite to gaining approval for

16 Tyler, op. cit., footnote 13.


the original development and implementation of NAEP.17

NAEP is essentially a battery of criterion-referenced tests in various subject areas (although its developers prefer the term "objective-referenced," since NAEP tests are not tied to any specific curriculum but measure the educational attainment of young Americans relative to broadly defined bodies of knowledge). Unlike many commercially published NRTs, NAEP scores cannot be used to rank an individual's performance relative to other students. This emphasis on criterion-referenced testing represents an important shift toward outlining how children are doing on broad educational goals rather than how they are doing relative to other students. NAEP is the only test to provide this kind of information on a national scale.

NAEP has pioneered a survey methodology known as "matrix sampling." This approach grew out of item-response theory, and has been hailed as an important contribution to the philosophy and practice of student testing.18 Under this method, a sample of students across the country is tested, rather than testing all students (which would be considered a "census" design). Furthermore, the students in the matrix sample do not take a "whole" test, or even the same subject area tests, nor are they all given the same test items. Rather, each student takes a 1-hour test that includes a mix of easy, medium, and difficult questions. Thus, NAEP uses a method of sampling, not only of the students, but also of the content that appears on the test. Any student taking a NAEP test only takes one-seventh of the test in a 1-hour testing session. Because of matrix sampling, a much wider range of content and goals can be covered by the test than most other tests can allow. This broad coverage of content is the essential foundation of a nationally relevant test, as well as a test that is relatively well protected against the negative side effects that can occur with teaching to a narrow test. It is probable that these important strengths of NAEP, which make it a robust and nationally credible test, would be difficult to incorporate into a test designed to be administered to individuals (unless it were a prohibitively long test). In addition, because no individual students can be assigned scores, the matrix sampling approach imposes an important technological barrier against the use of NAEP results for making student, school, district, or State comparisons, or for sorting or selecting students.
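The sketch below illustrates the general idea behind a matrix-sampled design: a large item pool is divided into blocks, each sampled student is assigned only a small subset of the blocks, and the full pool is still covered across the sample as a whole. It is a simplified sketch of the sampling idea only; the block counts, booklet layout, and function names are hypothetical and do not reproduce NAEP's actual booklet design or its statistical scaling.

    import random

    def build_booklets(item_pool, n_blocks, blocks_per_booklet):
        """Split an item pool into blocks, then assemble booklets that each
        contain only a few blocks, so no booklet covers the whole pool."""
        blocks = [item_pool[i::n_blocks] for i in range(n_blocks)]  # round-robin split
        booklets = []
        for first in range(n_blocks):
            chosen = [blocks[(first + k) % n_blocks] for k in range(blocks_per_booklet)]
            booklets.append([item for block in chosen for item in block])
        return booklets

    def assign_booklets(students, booklets):
        """Spiral the booklets across the sampled students: one booklet each."""
        return {s: booklets[i % len(booklets)] for i, s in enumerate(students)}

    # Hypothetical example: 70 items split into 7 blocks, one block per booklet,
    # so each student sees one-seventh of the pool.
    pool = [f"item_{i:02d}" for i in range(70)]
    booklets = build_booklets(pool, n_blocks=7, blocks_per_booklet=1)
    sample = [f"student_{i:03d}" for i in range(35)]
    assignments = assign_booklets(sample, booklets)

    covered = {item for items in assignments.values() for item in items}
    # Prints 70: the full pool is administered across the sample, even though
    # no individual student took more than 10 items.
    print(len(covered))

Because each student's booklet covers only a fraction of the pool, no individual can be assigned a meaningful score, which is the technological barrier against student-level comparisons described above.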

NAEP provides comparisons over time, by testing nationally representative samples of 4th, 8th, and 12th graders on a biennial cycle. (Prior to 1980, NAEP tested on an annual cycle.) This form of sampling deters the kinds of interpretation problems that can arise when different populations of test takers are compared.19 Due to cost constraints, the out-of-school population of students that had been sampled in early NAEP administrations was eliminated.

NAEP strives for consensus about educational goals. NAEP's governing board employs a consensus-building process for establishing content frameworks and educational objectives that are broadly accepted, relevant, and forward looking. Panels of teachers, professors, parents, community leaders, and experts in the various disciplines meet in different locales and work toward agreement on a common set of objectives for each subject area. These objectives are then given to item writers, who come up with the test questions. Before the items are administered to students, they undergo careful scrutiny by specialists in measurement and the subject matter being tested and are closely reviewed in the effort to eliminate racial, ethnic, gender, and other biases or insensitivities.20

Recognizing that changing educational objectives over time can complicate its mandate to plot trends in achievement, NAEP has developed a valuable process for updating test instruments. Using this process, NAEP revises test instruments to reflect new developments in curricular objectives, at the same time maintaining links between current and past levels of achievement of certain fixed objectives.

17 See, e.g., James Hazlett, "A History of the National Assessment of Educational Progress, 1963-1973," unpublished doctoral dissertation, December 1973.

18 The principles of matrix sampling are now used in many State assessment programs, as well as in other countries. See chs. 6 and 7 for additional discussion.

19 For example, this was a major problem in using the decline in Scholastic Aptitude Test scores as a basis for the inference that overall achievement had fallen. See Robert Linn and Stephen Dunbar, "The Nation's Report Card Goes Home: Good News and Bad About Trends in Achievement," Phi Delta Kappan, October 1990, pp. 127-133.

20 National Assessment of Educational Progress, op. cit., footnote 15.



Photo credit: National Assessment of Educational Progress

In addition to information about the Nation as a whole, the National Assessment of Educational Progress (NAEP) reports for four regions of the country as well as by sex, race/ethnicity, and size and type of community. NAEP does not report results for individual students, but generates information by sampling techniques.

In mathematics and reading, for example, representative samples of students are assessed using methods that have remained stable over the past 20 years, while additional samples of students are tested using instruments that reflect newer methods or changed definitions of learning objectives. Thus, the 1990 mathematics assessment allowed some students to use calculators, a decision generally praised by the mathematics teaching community. The NAEP authors took care to note, however, that the results of these samples were not commensurate with the mathematics achievement results from prior years.

Although NAEP is predominantly a paper-and-pencil test relying heavily on multiple-choice items, certain assessments include open-ended questions or nontraditional formats. For example: the writing assessment requires students to produce writing samples of many different kinds, such as a persuasive piece or an imaginative piece; the 1990 assessment also included a national "writing portfolio" of works produced in classrooms; the science assessment combines multiple-choice questions with essays and graphs on which students fill in a response; and the 1990 mathematics assessment included several questions assessing complex problem-solving and estimation skills, as recommended by the mathematics teaching profession.

During its early years, NAEP experimented with even more varied test formats and technologies, conducting performance assessments in music and art that were administered by trained school personnel and scored by trained teachers and graduate students. Although many of its more innovative approaches were suspended due to Federal funding constraints,21 many State testing programs continue to use the performance assessment technologies pioneered by NAEP. Moreover, NAEP continues to be a pioneer in developing open-ended test items that can be used for large scale testing; this is possible largely due to matrix sampling.

Accomplishments

All of these strengths have lent NAEP a degree of respect that is exceptional among federally sponsored evaluation and data collection efforts. NAEP has produced 20 years of unparalleled data and is considered an exemplar of careful and innovative test design. NAEP reports are eagerly awaited before publication and widely quoted afterward. In addition, NAEP collects background data about students' family attributes, school characteristics, and student attitudes and preferences that can be analyzed to help understand achievement trends, such as the relationship between television and reading achievement.

Because of NAEP, the Nation now knows, among other trends, that Black students have been narrowing the achievement gap during the past decade, 9-year-olds in general read better now than they did 10 years ago, able 13-year-olds do less well on higher order mathematics skills than they did 5 years ago, and children who do homework read better than those who do not.

Caveats

A relatively recent issue has emerged with potential consequences for NAEP administration and for interpretation of NAEP results. Researchers have begun to question whether NAEP scores tend to underestimate knowledge and skills of American students, precisely because NAEP is perceived as a low-stakes test. The question is whether students perform at less than their full ability in the absence

21 For discussion of the 1974 funding crisis, see Hazlett, op. cit., footnote 17, pp. 297-299.


of extrinsic motivation to do well. It is not purely an academic question: much of today's debate over the future of American education and educational testing turns on public perceptions of the state of American schooling, perceptions based at least in part on NAEP.

Some empirical research on the general question of motivation and test performance has already demonstrated that the issue may be more important than originally believed. For example, one study found that students who received ". . . special instructions to do as well as possible for the sake of themselves, their parents, and their teachers . . ." did significantly better on the Iowa Tests of Basic Skills than students in the control group who received ordinary instructions.22 This result supports the general findings in research discussed in the preceding chapter;23 and another analyst's observation that ". . . when a serious incentive is present (high school graduation), scores are usually higher."24

Prompted by these and other findings, several researchers are conducting empirical studies to determine the specific motivational explanations of performance on NAEP. One study involves experimental manipulation of instructions to NAEP test takers; the other involves embedding NAEP items in an otherwise high-stakes State accountability test.25 Data are to be collected in spring 1992. The results of these studies will shed light on an important aspect of how NAEP scores should be interpreted.26

The 1988 Amendments

The original vision of NAEP has been diminished by years of budget cuts and financial constraints. Some of what NAEP once had to offer the Nation has been lost as a result. Concomitantly, over the past few years, new pressures have arisen in the attempt to adapt NAEP to serve purposes for which it was never intended. Some of this pressure has come from policymakers frustrated with the lack of effect of NAEP results in shaping educational policy and the relatively "low profile" of the test and the results. Responding in part to this pressure, Congress took some cautious steps in 1988 to amend NAEP to provide new types of information.

One dilemma that surfaced during NAEP's first two decades was that its results did not appear to have much impact on education policy decisions, especially at the State and local levels. While theoretically NAEP could provide benchmarks against which State and local education authorities could measure their own progress, many educators argued that the information was too general to be of much help when they made decisions about resource allocations. Others observed that since NAEP carried no explicit or implicit system for rewards or sanctions, there was simply no incentive for States and localities to pay much attention to its results.

Had NAEP not been so highly respected, criticisms about its negligible influence on policy might have been considered minor, but given NAEP's reputation, its lack of clout was viewed as a major lost opportunity. Pressure mounted to change NAEP to make State and local education authorities take greater heed of its message. These voices for change were quickly met by experts who reissued warnings from the past: that any attempts to use NAEP for purposes other than analyzing aggregate national trends would compromise the value of its information and ultimately the integrity of the entire NAEP program.27 The principal concerns were:

1. that turning NAEP into a high-stakes test would lead to the kinds of score "inflation" or "pollution" that have undermined the credibility of other standardized tests as indicators of achievement (see ch. 2); and

2. that using NAEP to compare student attainment across States would induce States to change their curricula or instruction for the

22 Steven M. Brown and Herbert J. Walberg, University of Illinois at Chicago, "Motivational Effects on Test Scores of Elementary School Students: An Experimental Study," monograph, 1991.

23 See Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High Stakes Testing on Achievement: Preliminary Findings About Generalization Across Tests," paper presented at the annual meeting of the American Educational Research Association, Chicago, IL, April 1991.

24 See Paul Burke, "You Can Lead Adolescents to a Test But You Can't Make Them Try," OTA contractor report, Aug. 14, 1991, p. 4.

25 Robert Linn, University of Colorado at Boulder, personal communication, November 1991.

26 For discussion of general issues regarding the public's understanding of National Assessment of Educational Progress scores, see Robert Forsyth, "Do NAEP Scales Yield Valid Criterion-Referenced Interpretations?" vol. 10, No. 3, fall 1991, pp. 3-9; and Burke, op. cit., footnote 24.

27 The strongest early warnings about NAEP were found in Harold Hand, "National Assessment Viewed as the Camel's Nose," 1965, pp. 8-12; and Harold Hand, "Recipe for Control by the Few," vol. 30, No. 3, 1966, pp. 263-272.

Chapter 3—Educational Testing Policy: The Changing Federal Role ● 95

sake of showing up better on the next test,rather than as a result of careful deliberationsover what should be taught to which studentsand under what teaching methods.

When NAEP came up for congressional reauthorization in 1988, it was amid a climate of growing public demands for accountability at all levels of education (fueled in part, ironically, by NAEP's own reports of mediocre student achievement in critical subjects). Almost a decade of serious education reform efforts had made little visible impact on American students' test scores, especially relative to those of international competitors.

Trial State Assessment

Congress responded by authorizing, for the first time, State-level assessments, to be conducted on a voluntary, trial basis. Beginning with the 1990 eighth grade mathematics assessment and the 1992 fourth grade mathematics and reading assessments, NAEP results were to be published on a State-by-State basis for those States that chose to participate. Congress considered this amendment a trial, to be followed up with careful evaluation, before the establishment of a full-scale, State-level NAEP program could be considered.

While proponents believed that the experiment would yield useful information for SEAs, critics worried that a State-by-State assessment would invite fruitless comparisons among States that did not take into account other factors influencing achievement; would put pressure on States to teach to the test or find other ways to artificially inflate scores; or would lead to general "education bashing." Most importantly, critics cautioned that with the State assessment in place, Congress would eventually succumb to pressure to allow assessments and comparative reporting by district, by school, or even by student, a travesty of NAEP's original purpose and design.

Thirty-seven States, the District of Columbia, Guam, and the Virgin Islands participated in the first trial State assessment of mathematics, conducted in 1990. Results were released in June 1991.28 As expected, some media reports focused on the inevitable question: "Where does your State rank?" In general, however, the consequences of the trial will not be apparent for some time. In addition to analyzing the effects of the trial on the quality and validity of NAEP data and on State and local policy decisions, observers are likely to focus on whether the information will be worth the high cost of administering the State assessments, and whether the cost of the State programs will crowd out other necessary expenditures or improvements in the basic NAEP program.

Standard Setting

The 1988 reauthorization made another fundamental revision in the original concept of NAEP. From its inception, NAEP had reported results in terms of proficiency scales, pegged to everyday descriptions of what children at that performance level could do. For example, a 200 score in reading meant that students ". . . have learned basic comprehension skills and strategies and can locate and identify facts from simple informational paragraphs, stories, and news articles."29 NAEP has been commended for its accuracy in describing how things are. In the late 1980s, however, it came under criticism because it was silent on how things ought to be. Those who saw NAEP as a potential tool for reforming schools or measuring progress toward the President's and the Governors' National Goals for the year 2000 thought that NAEP should set proficiency standards: benchmarks of what students should be able to do. As with the statewide assessment proposal, the recommendation for proficiency standards raised the hackles of many educators, researchers, and policymakers. Opponents of the proposal said it would undermine local control of education; increase student labeling, tracking, and sorting; and compromise NAEP's original purpose and validity.
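The following sketch, added here for illustration and not part of the original report, shows in Python how this kind of criterion-referenced reporting can be read: a scale score is interpreted by the highest anchor point it reaches, and the anchor's everyday description is reported. Only the 200-level wording is taken from the NAEP reading scale quoted above; the other anchor labels are hypothetical placeholders, not actual NAEP descriptions.

    # Illustrative sketch of criterion-referenced score interpretation.
    # Only the 200-level description comes from the NAEP reading scale quoted
    # in the text; the other entries are hypothetical placeholders.
    READING_ANCHORS = {
        150: "(placeholder description for the 150 level)",
        200: ("have learned basic comprehension skills and strategies and can "
              "locate and identify facts from simple informational paragraphs, "
              "stories, and news articles"),
        250: "(placeholder description for the 250 level)",
    }

    def interpret(score):
        """Report the description attached to the highest anchor at or below the score."""
        reached = [level for level in sorted(READING_ANCHORS) if score >= level]
        if not reached:
            return "Score falls below the lowest described anchor."
        return "Students at this level " + READING_ANCHORS[max(reached)] + "."

    print(interpret(215))  # reports the 200-level description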

The 1988 amendments created a new governing body, the National Assessment Governing Board (NAGB), and charged it with identifying ". . . appropriate achievement goals for each age and grade in each subject area." NAGB has completed the standard-setting process for mathematics in 4th, 8th, and 12th grades, and in doing so, generated considerable controversy. Many observers felt that the mathematics standards were hammered out too quickly, before true consensus was achieved.

28See Ina V.S. Mullis, John A. Dossey, Eugene H. Owen, and Gary W. Phillips, Educational Testing Service, The State of Mathematics Achievement, prepared for the National Center for Education Statistics (Washington, DC: U.S. Department of Education, Education Information Branch, June 1991).

29For analysis of National Assessment of Educational Progress' definitions of literacy, see John B. Carroll, "The National Assessments in Reading: Are We Misreading the Findings?" vol. 68, No. 6, February 1987, pp. 424-430.

Adding the trial State assessment and standard-setting activity increased NAEP funding from about $9.3 million in fiscal year 1989 to over $17 million in fiscal year 1990 (nominal dollars).

NAEP in Transition

When authorization for the trial State assessments and standard-setting processes expires, Congress will face the issue of whether to continue and expand these efforts. As of now, Congress has authorized planning for the 1994 trial, but has not appropriated funds for the implementation of the trial itself. The Administration's "America 2000 Excellence in Education Act" recommends authorization of State-by-State comparisons in five core subject areas (mathematics, science, English, history, and geography) beginning in 1994 as a means of monitoring (and stimulating) progress toward the National Goals. The Administration's bill also suggests that tests used in NAEP be made available to States that wish to use them for testing at school or district levels at their own expense.

In conclusion, the basic question facing Congress is whether to make NAEP even more effective at what it was originally intended to do, or to explore ways that NAEP could serve new purposes. OTA finds that any major changes in NAEP should be carefully evaluated with respect to potential effects on NAEP's capacity to serve its original purpose.

National Testing

Overview

Perhaps the proposals with the most far-reaching implications for the Federal role in testing are those calling for the creation and implementation of a national testing program. Although the objectives of the various national testing proposals are somewhat unclear, they appear to rest on two basic assumptions: first, that the skills and knowledge of most American schoolchildren do not meet the needs of a changing global economy; and second, that new tests can create incentives for the teaching and learning of the appropriate knowledge and skills. Momentum for these efforts has built rapidly, fueled by numerous governmental and commission reports on the state of the economy and of the educational system; by the National Goals initiative of the President and Governors; by casual references to the superiority of examination systems in other countries; and most recently by the President's "America 2000" plan.

[Photo: The National Assessment of Educational Progress has developed and pilot tested a variety of hands-on science and mathematics tasks. In this example, students watch an administrator's demonstration of centrifugal force and then respond to written questions about what occurred in the demonstration. Photo credit: National Assessment of Educational Progress]

Taken together, the questions of purpose and balance between local control and national interest frame the debate regarding the desirability of national testing. This debate must reflect both the needs of the Nation and the well-being of individual students.

Congress provides the best forum for review of this question. Commitment to such a test represents a major change in education policy and should not be undertaken lightly. A number of issues must be considered in weighing the concept.

Will testing create incentives that motivate students to work harder? What are the effects of tests on the motivation of students? Tests should reward classroom effort, rather than undermine it. Tests built on comparing students to one another, for example, may reinforce the notion that effort does not matter, since the bell curve design of norm referencing always places some students at the top, some at the bottom, and most in the middle. Furthermore, if the test is of no consequence to the students, they may not be motivated to try hard or to study to prepare for it. The motivation of those who do poorly on tests must be carefully considered. Those students who repeatedly experience failure on tests (starting in the earliest years of schooling), without any assistance or guidance to help them master test content, are unlikely to be motivated by a high-stakes test. Positive motivational effects are likely only if students perceive they have a good chance of achieving the rewards attached to strong test performance.
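A minimal sketch, added for illustration and not drawn from the report itself, of why norm referencing can blunt the reward for effort: percentile ranks describe only relative standing within the norming group, so if every student improves by the same amount, no student's rank changes. The raw scores below are invented.

    # Illustrative sketch: percentile ranks under norm referencing are purely
    # relative, so an across-the-board gain leaves every student's rank unchanged.
    # The raw scores are invented for illustration.
    def percentile_rank(score, cohort):
        """Percent of the cohort scoring at or below the given score."""
        return 100.0 * sum(s <= score for s in cohort) / len(cohort)

    before = [12, 18, 25, 31, 40, 47, 55, 63, 72, 85]   # raw scores before extra effort
    after = [s + 10 for s in before]                     # every student gains 10 points

    for b, a in zip(before, after):
        assert percentile_rank(b, before) == percentile_rank(a, after)

    print("Every percentile rank is unchanged despite the uniform gain.")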

How broad will the content and skills covered be? Can just one test be offered to all students at a particular grade level, or will there need to be a range of tests at various levels and disciplines? This affects the testing burden on any one student and the range of levels at which testing can be focused. In some European countries, for example, students take subject-specific examinations at a choice of levels. Some examinations take many hours or are administered over several days, with combinations of testing items and formats that call on a range of performance by the student.

Would the test be voluntary or mandatory? Voluntary tests sound appealing. However, if a test becomes very widely used or needed for access to important resources, it will no longer be truly voluntary. Choosing not to take a test may not be a neutral option; negative consequences may result for those who choose not to be tested. This is especially true if a test is used for selection or credentialing; without a test result in hand, what chance does the student have? Furthermore, voluntary tests do not provide an accurate picture if the goal is school accountability. If only those students, schools, districts, or States that feel they can do well on a test participate in it, the results give an inaccurate picture of achievement. The claim that an important test can be voluntary should be taken with a grain of salt.

What happens to those who fail? Are there resources provided to help them? If consequences for failure are high and a student has no recourse once the examination has been taken, the wisest choice for a student who is having difficulty in school is to skip the examination altogether. The negative effects of examinations on students who do not do well have been a matter of serious concern in many European countries. Some countries have been dismayed to find that some students leave school before required high-stakes examinations are offered, rather than face the indignity and stigma that accompany failure. This has also occurred with high school graduation examinations in some parts of this country. Rather than punishing those who do not succeed at standards that seem unattainable, tests can be designed to make standards more explicit and the path to their acquisition more clear. However, if it is certain that low scores do not mean failure but that additional or refocused resources will be provided to the student, testing can have positive outcomes.

Who will design the tests and set performance standards? In the decentralized U.S. educational system, national testing proposals raise questions of State and local responsibility for determining what is taught and how it is taught. Can any test content be valid for the entire Nation? Who shall be charged with determining test content? It is important to recall that achievement tests by definition must assess material taught in the classroom. As the content of a test edges away from the specifics of what is delivered in classrooms, based on State-defined curricular goals, and searches instead for common elements, it can become either a test of "basic skills" or of more general skills and understandings. In the latter case, however, the test risks becoming more a measure of aptitude than one of achievement. (See also ch. 6, box 6-A.) Similarly, setting performance standards on a national basis assumes the feasibility of consensus not only on what is taught and measured, but also on what constitutes acceptable performance, and on procedures to distinguish among levels of performance.

Will the content and grading standards be visible or invisible? Will the examinations be secret or disclosed? Experience from the classroom and other countries suggests that students are more motivated and will learn better when they understand what is expected of them and when they know what competent performance looks like. It is important to note that in Europe the impact of examinations on teaching and learning (what is taught and learned and how it is taught and learned) is mediated through the availability of past examination papers. The tradition in this country is just the opposite. Most high-stakes examinations are kept secret, in part because of high development costs. For a national examination to have salutary effects on learning, the additional costs of item disclosure should be weighed against the larger impact of the examination on teaching and learning.

Would the examination be administered at a single sitting or several times, perhaps when students feel ready? This question affects students' control over the opportunity to study and prepare for an examination. If students can schedule a test when they feel they have mastered the material, they are more likely to be motivated by a realistic expectation of success. Conversely, accountability examinations are more likely to require single-sitting administration if they measure achievement within a common timeframe.

Do students have a chance to retake an examination to do better? Allowing retakes suggests a mastery model in which effort is rewarded and students can try again if they do not master the material the first time. It reinforces the idea that students can learn what they need to know.

Would the tests be administered to samples of students or to individuals? If a test is intended to increase student motivation, then it will have to be an individual test. However, tests administered to individuals need safeguards to meet high technical standards if they will affect the future opportunities of individuals.

At what age are students to be tested? American elementary schoolchildren are tested far more often than their European counterparts, especially with standardized examinations. Much of the rationale for this testing is related to the selection of children for Chapter 1 services and for identification of progress within those programs. This testing has had a spill-over effect, greatly influencing overall elementary school testing practice. However, the use of multiple-choice, standardized norm-referenced testing of elementary school children in general, and young (prior to grade three) children in particular, is under attack by those who see the negative consequences of early labeling. Thus, the suggestion of a new national examination at this age stands in contrast with efforts in many States to reduce early childhood standardized testing and to use instead teacher assessments, checklists, portfolios, and other forms of performance-based assessments.

What legal challenges might be raised? Legal challenges based on fairness have become a part of the American landscape. Public policy in this country is based on assurances of equal protection under the law; furthermore, cultural and racial diversity make equity issues far more significant in this country than in most others. Tests must meet these challenges by careful design that assures that the administration and scoring procedures are fair, the content measures what all participants have been taught, and the scores are used for the purposes understood and agreed to by the participants.

What test formats will be used? Tests send important signals to students about the kinds of skills and knowledge they need to learn. Tests that rely on a single format, such as multiple choice, are likely to send a limited message about necessary skills. As noted earlier, the United States and Japan are the only countries to rely almost exclusively on multiple-choice paper-and-pencil examinations for testing. Current proposals for national tests range from the use of multiple-choice norm-referenced standardized tests to the use of "state-of-the-art" assessment practices. Test format and procedures for scoring go hand in hand. Because performance assessments generally involve scoring by teachers or other experts, they are more expensive than machine-scorable tests. A diversity of formats in tasks and items may be the best means of balancing tradeoffs between the kinds of skills and understandings that any one test can measure and the costs of testing.

Conclusions

The answers to these questions will shed light on the larger question of whether or not national testing is desirable. Goals must be clearly set to determine the kind of tests, content, costs, and potential linkages to curriculum. For example, if Congress sets as its goal increasing student effort for higher achievement by testing in specific subjects, one would expect mandatory tests, administered to all individuals, with the content made explicit through a common syllabus covering a broad scope of material, and with past test items made public so students can study and practice for them. If other countries are to be a guide, this kind of examination is not used for testing children under the age of 16 or even 18. Some States are already using tests of this sort (e.g., the New York Regents and California Golden State Examinations) as high school-leaving examinations. Congress should consider how the participation of these States would be affected, how these tests could serve as models, or how they might be calibrated to match some national standard.


Furthermore, if the goal is to encourage performance that includes direct measures of complex tasks, then written essays, portfolios of work over time, or oral presentations may be called for. These tests would be considerably more costly to develop, administer, and score than machine-scored norm-referenced examinations. Tests of this type are not as carefully researched and may be challenged if used prematurely for high-stakes outcomes like selection or certification.

At present, there is controversy over the use of many test results. The development and use of tests is complicated, both in terms of science and politics. If a test is placed into service at the national level before these important questions are answered, OTA finds that the test could easily become a barrier to many of the educational reforms that have been set into motion and become the next object of concern and frustration within the American educational system.

Congress should consider the questions of test desirability and use first, and then consider policy directions that emerge from these conclusions. This deliberation cannot be separated from a comprehensive look at the other issues discussed in this section, specifically the role of NAEP in the national testing mosaic, the ways testing is used for Chapter 1 purposes, and how students' interests are to be protected. The policy implications of these choices are considered collectively in chapter 1.