2. 'in praise of educational research' = formative assessment
TRANSCRIPT
7/27/2019 2. 'in Praise of Educational Research' = Formative Assessment
http://slidepdf.com/reader/full/2-in-praise-of-educational-research-formative-assessment 1/16
BERA
'In Praise of Educational Research': Formative Assessment
Author(s): Paul Black and Dylan Wiliam
Source: British Educational Research Journal, Vol. 29, No. 5, 'In Praise of Educational Research' (Oct., 2003), pp. 623-637
Published by: Taylor & Francis, Ltd. on behalf of BERA
Stable URL: http://www.jstor.org/stable/1502114
Accessed: 19/10/2013 05:12
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].
Taylor & Francis, Ltd. and BERA are collaborating with JSTOR to digitize, preserve and extend access to
British Educational Research Journal.
http://www.jstor.org
This content downloaded from 202.43.95.117 on Sat, 19 Oct 2013 05:12:24 AM. All use subject to JSTOR Terms and Conditions.
British Educational Research Journal, Vol. 29, No. 5, October 2003
Carfax Publishing, Taylor & Francis Group
'In Praise of Educational Research':
formative assessment
PAUL BLACK & DYLAN WILIAM, King's College London
ABSTRACT The authors trace the development of the King's Formative Assessment Programme from its origins in diagnostic testing in the 1970s, through the graded assessment movement in the 1980s, to the present day. In doing so, they discuss the practical issues involved in reviewing research and outline the strategies that were used to try to communicate the findings to as wide an audience as possible (including policy-makers and practitioners as well as academics). They describe how they worked with teachers to develop formative practice in classrooms, and discuss the impact that this work has had on practice and policy. Finally, they speculate about some of the reasons for this impact, and make suggestions for how the impact of educational research on policy and practice might be improved.
Introduction
It has long been recognised that assessment can support learning as well as measure it. Whilst it appears that Michael Scriven first used the term 'formative evaluation' in connection with the curriculum and teaching (Scriven, 1967), it was Bloom et al. (1971) who first used the term in its generally accepted current meaning. They defined summative evaluation tests as those tests given at the end of episodes of teaching (units, courses, etc.) for the purpose of grading or certifying students, or for evaluating the effectiveness of a curriculum (p. 117). They contrasted these with 'another type of evaluation which all who are involved - student, teacher, curriculum maker - would welcome because they find it so useful in helping them improve what they wish to do' (p. 117), which they termed 'formative evaluation'.

From their earliest use it was clear that the terms 'formative' and 'summative' applied not to the assessments themselves, but to the functions they served. However, the methods and questions of traditional summative tests might not be very useful for the purpose of the day-to-day guidance of learning. So the development of formative assessment depended on the development of new tools. To make optimum use of these, teachers would also have to change their classroom practices. There would also be a need to align formative and summative work in new overall systems, so that teachers'
ISSN 0141-1926 (print)/ISSN 1469-3518 (online)/03/050623-15 © 2003 British Educational Research Association
DOI: 10.1080/0141192032000133721
formative work would not be undermined by summative pressures, and indeed, so that summative requirements might be better served by taking full advantage of improvements in teachers' assessment work.
The Rise and Fall of Formative Assessment
In the 1970s and 1980s, the development of new tools was advanced by a series of research projects at Chelsea College (which merged with King's College in 1985) which explored ways in which assessments might support learning. The Concepts in Secondary Mathematics and Science (CSMS) project investigated mathematical and scientific reasoning in students through the use of tests that were intended to illuminate aspects of students' thinking, rather than just measure achievement.
In this venture, the science 'wing' of the project followed a broadly Piagetian approach, seeking to understand students' scientific reasoning in terms of stages of development (Shayer & Adey, 1981). This approach did not lead as directly to results applicable in normal teaching as did the more empirical approach of the mathematics team, which focused on the diagnosis of errors in the concepts formed by secondary school students, and looked for ways to address them (Hart, 1981). Their subsequent projects sought to understand better the relationship between what was taught and what was learned (Johnson, 1989).

Such interest in the use of assessment to support learning was given added impetus by the recommendation of the Committee of Inquiry into the Teaching of Mathematics in Schools (1982) that a system of 'graded tests' be developed for pupils in secondary schools whose level of achievement was below that certificated in the current school-leaving examinations. Similar systems had been used to improve motivation and achievement in modern foreign languages for many years (Harrison, 1982). The group at Chelsea College chose rather to aim at a system for all pupils, and with support from both the Nuffield Foundation and the Inner London Education Authority (ILEA), their Graded Assessment in Mathematics (GAIM) Project was established in 1983. It was one of five graded assessment schemes supported by the ILEA.
This development was more ambitious in attempting to establish a new system. In mathematics, English, and craft, design and technology, the schemes set out to integrate the summative function with the formative. Information collected for formative purposes by teachers as part of the day-to-day classroom work would be aggregated at specific points in order to recognise the achievement of students formally. On leaving a school, the achievements to date would be 'cashed in' to secure a school-leaving certificate. This was allowed before 1988, but then the new criteria for the General Certificate of Secondary Education (GCSE) specified that, in mathematics, the assessments must include a written end-of-course examination which had to count for at least 50% of the available marks (Department of Education and Science & Welsh Office, 1985). The original developments in other subjects made more use of frequent formal tests, but were similarly constrained by the GCSE rules.
These rules were just one of three factors which undermined the graded assessment developments. The second was the introduction of a national curriculum for all schools in England and Wales in 1988. The National Curriculum Task Group on Assessment and Testing (TGAT) (1988a, 1988b) adopted the model of age-independent levels of achievement that had been used by the graded assessment schemes, but required a system of 10 levels to cover the age range 5-16, arranged so that the average student could be expected to achieve one level every two years. This was too coarse-grained a
system to be directly useful in classroom teaching: the graded assessment schemes needed 20 levels for the same age range. To add to the problem of recalibrating the levels to the new National Curriculum levels, a third problem became increasingly serious. Assuring comparability of awards, both between schools and with other traditionally based awards, required costly administration. The combined effect of these three factors led, by 1995, to the abandonment of all the schemes.

Here began the decline of the development in formative assessment achieved in the 1970s and 1980s. The graded assessment schemes had achieved remarkable success. In mathematics in particular they had given a clear indication of how formative and summative functions might be integrated in a large-scale assessment system. They had also provided models of how reporting structures for the end-of-key-stage assessments might foster and support progression in learning, and also some ideas about how moderation of teachers' awards might operate in practice. Ironically, whilst these achievements had a strong influence on the TGAT's formulation of its recommendations, the new system that developed after their report served to undermine them.
The original National Curriculum proposals had envisaged that much of the assessment at 7, 11 and 14 would:

be done by teachers as an integral part of normal classroom work. But at the heart of the assessment process there will be nationally prescribed tests done by all pupils to supplement the individual teachers' assessments. Teachers will mark and administer these, but their marking - and their assessments overall - will be externally moderated. (Department of Education and Science & Welsh Office, 1987, p. 11)
Many of these suggestions were taken up in the TGAT's recommendations. In particular, the Task Group's first report pointed out that while fine-scale assessments which served a formative function could be aggregated to serve a summative function, the reverse was not true: assessments designed to serve a summative function could not be disaggregated to identify learning needs. The Task Group therefore recommended that the formative function should be paramount for Key Stages 1, 2 and 3. Many in the teaching profession welcomed the TGAT report, but the initial reception of the research community was quite hostile, as was that of the Prime Minister (Thatcher, 1993, pp. 594-595).

Whilst the government accepted the recommendations of the Task Group (in a written parliamentary answer dated 7 June 1988), over the next 10 years the key elements were removed, one by one, so that all that now remains in place is the idea of age-independent levels of achievement (Black, 1997; Wiliam, 2001). In particular, the idea that National Curriculum assessment should support the formative purpose has been all but ignored. Daugherty (1995, p. 78) points out that the words 'teacher assessment' appeared on the agenda of the School Examinations and Assessment Council (the body responsible for the implementation of National Curriculum assessment) only twice between 1988 and 1993.
Throughout the 1990s, the debate continued about how evidence from the external tests and teachers' judgements could be combined. In the end, it was resolved by requiring schools to publish data derived from teachers' judgements and those derived from the external tests, side by side. Whilst the government could claim that it was giving equal priority to the two components, in practice schools were not required to determine the final teacher judgements until after the test results had been received by the school. This created an incentive for teachers to match their own assessments to the
results of the tests, rather than face criticism for having standards that were either too low or too strict.
So by 1995 nothing was left of the advances made in the previous decades. Government was lukewarm or uninterested in formative assessment: the systems to integrate it with the summative had gone, and the further development of tools was only weakly supported. There were some minor attempts in the 1990s by the government and its agencies to support formative assessment within the National Curriculum assessment system, including a project to promote the formative use of data from summative tests, but these efforts were little more than window dressing.
The debate over the relative weights to be applied to the results of external tests and teachers' judgements obscured the fact that both of these were summative assessments. While different teachers developed different methods of arriving at judgements about the levels achieved by their students (Gipps et al., 1995), it is clear that most of the record-keeping occasioned by the introduction of the National Curriculum placed far more emphasis on supporting the summative function of assessment. At the same time, the work of Tunstall and Gipps (1996) showed that while some teachers did use assessment in support of learning, in many classrooms it was clear that much classroom assessment did not support learning, and was often used more to socialise children than to improve achievement (Torrance & Pryor, 1998). Such studies were beginning to direct attention to the classroom processes, i.e. to the fine detail of the ways in which the day-to-day actions of teachers put formative principles into practices focused on learning.

Thus, one sign of change was that within the educational research community, there
was increasing concern that the potential of assessment to support learning was being ignored and there were continuing calls for the formative function of assessment to receive greater emphasis. The lead here was taken by the British Educational Research Association's Policy Task Group on Assessment. Their article (Harlen et al., 1992) and that of Torrance (1993) reiterated the importance of the formative function of assessment, although there was disagreement about whether the formative and summative functions should be separated or combined (Black, 1993a).

In 1997, as part of this effort to reassert the importance of formative assessment, the British Educational Research Association Policy Task Group on Assessment (with the support of the Nuffield Foundation) commissioned us to undertake a review of the research on formative assessment. They thereby initiated a new stage in the development of formative assessments.
Defining the Field and Reviewing the Relevant Research
Two substantial review articles, one by Natriello (1987) and the other by Crooks (1988), had addressed the field in which we were interested, and so our review of the research literature concentrated on articles published after 1987, which we assumed had been the cut-off point for these earlier reviews. Natriello's review covered the full range of assessment purposes (which he classified as certification, selection, direction and motivation) while Crooks's review covered only formative assessment (which he termed 'classroom evaluation'). An indication of the difficulty of defining the field, and of searching the literature, is given by the fact that while the reviews by Natriello and Crooks cited 91 and 241 items respectively, only nine references were cited in both.

Natriello's review used a model of the assessment cycle, beginning with purposes, and moving on to the setting of tasks, criteria and standards, evaluating performance and
providing feedback, and then discussed the impact of these processes on students. He stressed that the majority of the research he cited was largely irrelevant because of weak theorisation, which resulted in key distinctions (e.g. the quality and quantity of feedback) being conflated.

Crooks's article had a narrower focus - the impact of evaluation practices on students. He concluded that the summative function of assessment has been too dominant and that more emphasis should be given to the potential of classroom assessments to assist learning. Most importantly, assessments must emphasise the skills, knowledge and attitudes regarded as most important, not just those that are easy to assess.
Our review also built on four other key reviews of research published since those by Natriello and Crooks - reviews by Bangert-Drowns et al. and Kulik et al. into the effects of classroom testing (Kulik et al., 1990; Bangert-Drowns et al., 1991a, 1991b) and a review by one of us of research on summative and formative assessment in science education (Black, 1993b).
To identify relevant literature we first searched the ERIC database. However, the lack of consensus on the appropriate keywords in this area meant that this process consistently failed to yield studies that we knew from our own reading to be relevant. Citation index searches on the existing reviews added additional studies, but still failed to identify others. Study of articles cited in the studies we had identified yielded some further studies. However, it was clear overall that we needed a different approach if we were to identify all of the many relevant studies. So we resorted to a manual search through every issue, from 1987 to 1998, of 76 of the most likely journals. As we read through these journals manually, rather than relying on keywords, our notion of what was
relevant was continually expanding.

This process generated 681 publications that appeared, at first sight, to be relevant. By reading abstracts, and in some cases full articles, this number was reduced to 250 publications that were read in full, and coded with our own keywords. Keywords were then grouped to form sections of the review. In synthesising the studies allocated to each section, we rejected the use of meta-analysis. In the first place, many of the studies did not provide sufficient details to calculate standardised effect sizes, which is the standard procedure for combining results from different studies in meta-analysis (Glass et al., 1981). Second, the use of standardised effect sizes in published studies overestimates the size of actual effects. This is because the low statistical power of most experiments in the social sciences (Cohen, 1988) means that many experiments generate results that fail to reach the threshold for statistical significance, and go unpublished, so that those that are published are not a representative sample of all results actually found (Harlow et al., 1997). Third, and most importantly in our view, given the relatively weak theorisation of the field, we felt it was not appropriate simply to specify inclusion criteria and then include only studies that satisfied them. Instead, we assembled the review through a process of 'best evidence synthesis' (Slavin, 1990). Such an approach is inevitably somewhat subjective, and is in no sense replicable - other authors, even given the same 250 focal studies, would have produced a different review. Instead, our aim was to produce an account of the field that was authentic, faithful and convincing. However, it is worth noting that none of the six short commentaries by experts in this field, which were published in the same journal issue as our review, disagreed with our findings.

We believe that this story of the development of our review makes several important
points about educational research. In the first place, reviewing research is much more
difficult in any social science than it is in (say) the physical sciences because the
complexity of the field precludes any simple universal system of keywords. Any
automated research process is bound to result in a systematic underrepresentation of the body of research because it cannot hope to identify all the relevant studies.
Second, synthesising research cannot be an objective process. While review protocols such as those being used by the EPPI-Centre for the evaluation of individual research studies are helpful in identifying strengths and weaknesses in those studies, the significance attached to each study, and the way that general conclusions are drawn by weighing sometimes conflicting findings, will inevitably remain subjective.
Third, it seems to us important to point out that reviewing research is not merely a derivative form of scholarship. Reviews such as those by Natriello and Crooks can serve to reconceptualise, to organise, and to focus research. Because our definition of 'relevance' expanded as we went along, we too had to find ways of organising a widening field of research, and were forced to make new conceptual links just in order to be able to relate the various research findings into as coherent a picture as possible. This was one reason why our review generated a momentum for work in this field that would be difficult to create in any other way.

One feature of our review was that most of it was concerned with such issues as students' perceptions, peer- and self-assessment, and the role of feedback in a pedagogy focused on learning. Thus, it helped to take the emphasis in formative assessment studies away from systems, with its emphasis on the formative-summative interface, and relocate it on classroom processes.
Disseminating the Findings
Experience in producing a review of mathematics education research for teachers (Askew & Wiliam, 1995) had taught one of us that it is impossible to satisfy, in the same document, the demands of the academic community for rigour and the demands for accessibility by practitioners. Discussions with primary teachers showed that even when that draft review had been written in the most straightforward language we could manage, they still found it unsuitable. Extensive redrafting eventually made it accessible, but this work took as long as the original collection and writing, and it then turned out that the academic community found it of little interest.

Such experience convinced us that we had to produce different outputs for different
audiences. For other academics, we produced a 30,000-word journal article (Black & Wiliam, 1998a), which, together with short responses from invited commentators from around the world, formed the whole of a special issue of the journal Assessment in Education. As well as detailing our findings, we tried to lay out as clearly as possible how we had constructed the review so that, while we would not necessarily expect different authors to reach identical conclusions, we hoped that the process which we followed was verifiable and could be repeated.
To make the findings accessible to practitioners and policy-makers, we produced a 21-page booklet in A5 format entitled Inside the Black Box (Black & Wiliam, 1998b). We also produced a slightly revised version for the international audience (Black & Wiliam, 1998c).

In Inside the Black Box, we very briefly outlined the approach we had taken in our review and summarised its conclusions. However, we also felt it essential to explore implications for policy and practice. In so doing we inevitably, at some points, went beyond the evidence, relying on our experience of many years' work in the field. If we had restricted ourselves to only those policy implications that followed logically and inevitably from the research evidence, we would have been able to say very little. Whilst
the policy implications we drew were not determined by the research base, they were, we felt, the most reasonable interpretation of the findings (for more on reasonableness as a criterion, see later). The title itself was significant, for it pointed to our main policy plea - that teachers' work in the classroom was the key to raising standards and that systems of external testing should be restructured to ensure that this work was supported rather than undermined by them. The process now claimed priority in the system.
In order to secure the widest possible impact of our work, we planned the booklet's launch with members of the King's College London External Relations Department. We held a launch conference on 5 February 1998, hosted by the Nuffield Foundation, who had supported the writing of the review, and on the previous day had held a series of briefings for journalists of the national daily newspapers and the specialist educational press. The result was that almost every single national daily newspaper carried some reference to the work. While some of the reports were either inaccurate or highly politicised, much of the coverage was broadly accurate, if somewhat selective.

In the five years since its launch, Inside the Black Box has sold over 30,000 copies, and data from the Authors' Licensing and Collecting Society (ALCS) suggests that at least another 25,000 copies have been made in schools.
Inside the Black Box did not try to lay out what formative assessment would look like in practice. Indeed, it was clear that this was neither advisable nor possible:

Thus the improvement of formative assessment cannot be a simple matter. There is no 'quick fix' that can be added to existing practice with promise of rapid reward. On the contrary, if the substantial rewards of which the evidence holds out promise are to be secured, this will only come about if each teacher finds his or her own ways of incorporating the lessons and ideas that are set out above into her or his own patterns of classroom work. This can only happen relatively slowly, and through sustained programmes of professional development and support. (p. 15, emphasis in original)
Putting it into Practice

The three research reviews offered strong evidence that improving the quality of formative assessment would raise standards of achievement in each country in which it had been studied. Furthermore, the consistency of the effects across ages, subjects and countries meant that even though most of the studies reviewed had been conducted abroad, these findings could be expected to generalise to the UK. For us, the question was not, therefore, 'Does it work?' but 'How do we get it to happen?'
In mid-1998 the Nuffield Foundation agreed to support a two-year project to involve 24 teachers in six schools in two local education authorities (Oxfordshire and Medway) in exploring how formative assessment might be put into practice. So began the King's-Medway-Oxfordshire Formative Assessment Project (KMOFAP).

One of the key assumptions of the project was that if the promise of formative assessment was to be realised, traditional research designs - in which teachers are 'told' what to do by researchers - would not be appropriate. We argued that a process of supported development was an essential next step:
In such a process, the teachers in their classrooms will be working out the answers to many of the practical questions that the evidence presented here cannot answer, and reformulating the issues, perhaps in relation to fundamental
insights, and certainly in terms that can make sense to their peers in ordinary classrooms. (Black & Wiliam, 1998b, p. 16)
We had chosen to work with Oxfordshire and Medway because we knew that their officers were interested in formative assessment and would be able to support our work locally. We asked each authority to select three secondary schools and, after discussion with members of the research team, the six so chosen agreed to be involved. At this stage we were not concerned to find 'typical' schools, but schools that could provide 'existence proofs' of good practice, and would produce the 'living examples' alluded to earlier for use in further dissemination.

We decided to start with mathematics and science because these were subjects where we felt there were clear messages from the research and also where we had expertise in the subject-specific details that we thought essential in practical development. The choice of teachers, two in mathematics and two in science in each school, was left to the schools: a variety of methods were used, so that there was a considerable range of expertise and experience amongst the 24 teachers selected. In the following year we augmented the project with one additional mathematics and one additional science teacher, and also began working with two English teachers from each school.
Our 'intervention' with these teachers had two main components:
* a series of nine one-day in-service education and training (INSET) sessions over a period of 18 months, during which teachers were introduced to our view of the principles underlying formative assessment, and were given the opportunity to develop their own plans;
* visits to the schools, during which the teachers would be observed teaching by project staff, and have an opportunity to discuss their ideas and their practice; feedback from the visits helped us to attune the INSET sessions to the developing thinking and practice of the teachers.
The key feature of the INSET sessions was the development of action plans. Since we were aware from other studies that effective implementation of formative assessment requires teachers to renegotiate the 'learning contract' that they had evolved with their students (Brousseau, 1984; Perrenoud, 1991), we decided that implementing formative assessment would best be done at the beginning of a new school year. For the first six months of the project (January 1999 to July 1999), therefore, we encouraged the teachers to experiment with some of the strategies and techniques suggested by the research, such as rich questioning, comment-only marking, sharing criteria with learners, and student peer- and self-assessment. Each teacher was then asked to draw up an action plan of the practices they wished to develop and to identify a single focal class with whom these strategies would be introduced in September 1999. Details of these plans can be found in Black et al. (2003).

Our intervention did not impose a model of 'good formative assessment' on teachers, but, rather, supported them in developing their own professional practice. Since each teacher was free to decide which class to experiment with, we could not impose a standard experimental design - we could not standardise the outcome measures, nor could we rely on having the same 'input' measures for each class. In order to secure quantitative evidence, we therefore used an approach to the analysis that we have termed 'local design', making use of whatever data were available within the school in the normal course of events. In most cases, these were the results on the National Curriculum tests or GCSE, but in some cases we also made use of scores from school
assessments. Each teacher consulted with us to identify a focal variable (i.e. dependent variable or 'output') and in most cases we also had reference variables (i.e. independent variables or 'inputs'). We then set up, for each experimental class, the best possible control class in the school. In some cases, this was a parallel class taught by the same teacher (either in the same or previous years); in others, it was a parallel class taught by a different teacher. Failing that, we used a non-parallel class taught by the same or a different teacher. We also made use of national norms where these were available. In most cases, we were able to condition the focal variable on measures of prior achievement or general ability. By dividing the differences between the mean scores of the control group and experimental group by the pooled standard deviation, we were able to derive a standardised effect size (Glass et al., 1981) for each class. The median effect size was 0.27 standard deviations, and a jackknife procedure (Mosteller & Tukey, 1977) yielded a point estimate of the mean effect size of 0.32, with a 95% confidence interval of [0.16, 0.48]. Of course, we cannot be sure that it was the increased emphasis on formative assessment that was responsible for this improvement in students' scores, but this does seem the most reasonable interpretation (see discussion of 'reasonableness' below). For further details of the experimental results, see Wiliam et al. (2004).
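The two calculations described above can be sketched in code. This is an illustrative sketch only: the `effect_size` and `jackknife_ci` helpers and the sample numbers are invented for exposition, not the project's data or analysis scripts, and the interval multiplier is a rough normal-approximation stand-in for whatever critical value the original analysis used.

```python
import math

def effect_size(experimental, control):
    """Standardised effect size (Glass et al., 1981): the difference in
    group means divided by the pooled standard deviation."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    n_e, n_c = len(experimental), len(control)
    pooled_sd = math.sqrt(
        ((n_e - 1) * var(experimental) + (n_c - 1) * var(control))
        / (n_e + n_c - 2)
    )
    return (mean(experimental) - mean(control)) / pooled_sd

def jackknife_ci(values, crit=2.0):
    """Jackknife estimate of the mean of per-class effect sizes
    (Mosteller & Tukey, 1977), with a rough confidence interval.

    Leave-one-out pseudo-values are computed explicitly; for the mean
    they coincide with the raw values, so the point estimate equals the
    sample mean and the interval is estimate +/- crit * standard error.
    `crit` = 2.0 is a coarse stand-in for the appropriate critical value."""
    n = len(values)
    total = sum(values)
    full_mean = total / n
    pseudo = [n * full_mean - (n - 1) * ((total - v) / (n - 1)) for v in values]
    est = sum(pseudo) / n
    se = math.sqrt(sum((p - est) ** 2 for p in pseudo) / (n * (n - 1)))
    return est, (est - crit * se, est + crit * se)
```

For example, `effect_size([2, 4, 6], [1, 3, 5])` gives 0.5: both groups have pooled standard deviation 2 and the means differ by 1. The jackknife step would then be applied to the whole set of per-class effect sizes.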
Outcomes
The quantitative evidence that formative assessment does raise standards of achievement on National Curriculum tests and GCSE examinations is important in showing that innovations that worked in research studies in other countries could also be effective in typical UK classrooms. Part of the reason that formative assessment works appears to be an increase in students' 'mindfulness' (Bangert-Drowns et al., 1991a), and while this has been shown to increase long-term retention (Nuthall & Alton-Lee, 1995), it also depends to a certain extent on the kind of knowledge that is being assessed. More will be gained from formative feedback where a test calls for the mindfulness that it helps to develop. Thus, it is significant here that almost all the high-stakes assessments in this country require constructed responses (as opposed to multiple choice) and often assess higher-order skills.

However, other outcomes from the project are at least as important. Through our work with teachers, we have come to understand more clearly how the task of applying research into practice is much more than a simple process of 'translating' the findings of researchers into the classroom. The teachers in our project were engaged in a process of knowledge creation, albeit of a distinct kind, and possibly relevant only in the settings in which they work (see Hargreaves, 1999).

As the teachers explored the relevance of formative assessment for their own practice, they transformed ideas from other teachers into new ideas, strategies and techniques, and these were in turn communicated to other teachers, creating a 'snowball' effect. Also, as we have introduced more and more teachers to these ideas, we have become better at communicating the key ideas. A case in point is that we have each been asked several times by teachers, 'What makes for good feedback?', a question to which, at first, we had no good answer. Over the course of two or three years, we have evolved a simple answer: good feedback causes thinking.
Ever since the publication of Inside the Black Box we have been producing a series of articles in journals aimed at practitioners that showed how the ideas of formative assessment could be used in practice. We are also publishing research papers arising from the project. These have included accounts of our collaborative work with teachers,
of the processes of teacher change, of the quantitative evidence of learning gains, and of a theoretical framework for classroom formative assessment (see the project's website at http://www.kcl.ac.uk/depsta/education/KAL/ASSESSMENT.html for details of the programme's publications).

In addition, since the launch of Inside the Black Box in February 1998, members of the research team at King's have spoken to groups of teachers and policy-makers on over 400 occasions, suggesting that we have addressed well over 20,000 people directly about our work.
Following the success of Inside the Black Box, we decided to use the same format for communicating the results of the KMOFAP project to teachers. The resulting booklet, Working Inside the Black Box, sold 15,000 copies in the six months after its launch in July 2003.
Of course, working as intensively as we did with 24 teachers could not possibly influence more than a small fraction of teachers, so we have also given considerable thought to how the work could be 'scaled up'. We are working with other local education authorities (notably Hampshire) to develop local expertise, in both formative assessment and in strategies for dissemination. In Scotland, formative assessment has become an important component of the Scottish Executive's strategy for schools, and members of the project team, including the project teachers, have provided support and advice.
This work has also led to further research work. We are partners, with the Universities of Cambridge, Reading and the Open University, in the Learning How to Learn: in classrooms, schools and networks project, which is part of the Economic and Social Research Council's Teaching and Learning Research Programme. This project aims to promote and understand change in classrooms, but is also investigating how these changes are supported or inhibited by factors at the level of the whole school, and of networks of schools. We are also working with Stanford University in California on a similar project, which again combines detailed work with small numbers of teachers with a focus on how small-scale changes can be scaled up. Assessment for learning has also become one of the two key foci (along with thinking skills) of the government's Key Stage 3 Strategy for the foundation subjects.
Reflections
Critiques of educational research typically claim that it is poorly focused, fails to generate reliable and generalisable findings, is inaccessible to a non-academic audience and lacks interpretation for policy-makers and practitioners (Hillage et al., 1998). While some of these criticisms are undoubtedly applicable to some studies, we believe that this characterisation of the field as a whole is unfair.

We do not believe that all educational research should be useful, for two reasons. The first is that, just as most research in the humanities is not conducted because it is useful, we believe that there should be scope for some research in education to be absolutely uninterested in considerations of use. The second reason is that it is impossible to state, with any certainty, which research will be useful in the future.

Having said this, we believe strongly that the majority of research in education should be undertaken with a view to improving educational provision: research in what Stokes (1997) calls 'Pasteur's quadrant'. And although we do not yet know everything about 'what works' in teaching, we believe that there is a substantial consensus on the kinds
of classrooms that promote the best learning. What we know much less about is how to get this to happen.

Policy-makers appear to want large-scale research conducted to the highest standards of analytic rationality, whose findings are also relevant to policy. However, it appears that these two goals are, in fact, incompatible. Researching how teachers take on research, adapt it, and make it their own is much more difficult than researching the effects of different curricula, of class sizes, or of the contribution of classroom assistants. While we do not know as much as we would like to know about effective professional development, if we adopt 'the balance of probabilities' rather than 'beyond reasonable doubt' as our burden of proof, then educational research has much to say. When policy without evidence meets development with some evidence, development should prevail.
In terms of our own work, perhaps the most puzzling issue arising from this story is why our work has had the impact that it has. Our review did add to the weight of evidence in support of the utility of formative assessment, but did not substantially alter the conclusions reached by Crooks and Natriello 10 years earlier. It could be, of course, that the current interest in formative assessment, and its policy impact, is nothing to do with our work, but that it takes 10 years or so for such findings to filter through to policy, or that the additional studies identified by us in some way tipped the balance to make the findings more credible. All of this seems unlikely. It therefore appears that something that we did, not just in undertaking the review, but also in the way it was disseminated, has had a profound impact on how this research has fed into policy and practice. Of course, identifying which elements have been most important in this impact is impossible, but it does seem appropriate to speculate on the factors that have contributed to it.
The first factor is that although most of the studies cited in our review emanated from overseas, the review was originated and published in this country. This provided a degree of local 'ownership' that is likely to have attracted attention to the research.
The second is that although we tried to adhere closely to the traditional standards of scholarship in the social sciences when conducting and writing our review, we did not do so when exploring the policy implications in Inside the Black Box. While the standards of evidence we adopted in conducting the review might be characterised as those of 'academic rationality', the standard within Inside the Black Box is much closer to that of 'reasonableness' advocated by Stephen Toulmin for social enquiry (Toulmin, 2001). In some respects, Inside the Black Box represents our opinions and prejudices as much as anything else, although we would like to think that these are supported by evidence, and are consistent with the 50 years of experience of working in this field that we have between us. It is also important to note in this regard that the success of Inside the Black Box has been as much due to its rhetorical force as to the evidence that underpins it. This will certainly make many academics uneasy, for it appears to blur the line between fact and value, but as Flyvbjerg (2001) argues, social enquiry has failed precisely because it has focused on analytic rationality rather than value-rationality (see also Wiliam, 2003).
The third factor we believe has been significant is the steps we have taken to publicise our work. As noted earlier, this has included writing different kinds of articles for different audiences, and has also involved working with media-relations experts to secure maximum press coverage. Again, these are not activities that are traditionally associated with academic research, but they are, we believe, crucial if research is to impact on policy.

A fourth factor that appears to have been important is the credibility that we brought as researchers to the process. In their project diaries, several of the teachers commented
that it was our espousal of these ideas, as much as the ideas themselves, that persuaded them to engage with the project: where educational research is concerned, the facts do not necessarily speak for themselves.
A fifth factor is that the ideas had an intrinsic acceptability to the teachers. We were talking about improving learning in the classroom, which was central to the professional identities of the teachers, as opposed to bureaucratic measures such as target-setting.
Linked to this factor is our choice to concentrate on the classroom processes and to live with the external constraints operating at the formative-summative interface: the failed attempts to change the system in the 1980s and 1990s were set aside. Whilst it might have been merely prudent not to try again to tilt at windmills, the more fundamental strength was that it was at the level chosen, that of the core of learning, that formative work stakes its claim for attention. Furthermore, given that any change has to work out in teachers' practical action, this is where reform should always have started. The evidence of learning gains, from the literature review and from our project, restates and reinforces the claim for priority of formative work that the TGAT tried in vain to establish. The debate about how policy should secure optimum synergy between teachers' formative, teachers' summative, and external assessments is unresolved, but the terms and the balance of the arguments have been shifted.
The final, and perhaps most important, fact in all this is that in our development model we attended to both the content and the process of teacher development (Reeves et al., 2001). We attended to the process of professional development through an acknowledgement that teachers need time, freedom, and support from colleagues in order to reflect critically upon and to develop their practice (Lee, 2000), whilst offering also practical strategies and techniques about how to begin the process. By themselves, however, these are not enough. Teachers also need concrete ideas about the directions in which they can productively take their practice, and thus there is a need for work on the professional development of teachers to pay specific attention to subject-specific dimensions of teacher learning (Wilson & Berne, 1999).

One might ask of our work whether it drew, or could have drawn, strength from the
psychology of learning. For the practical demand for good tools, e.g. a good question to explore ideas about the concept of momentum, the help could only come from studies in mathematics and science education. We have shown earlier how the historical origins of our work contributed to such studies. The fact that the King's team had strong subject backgrounds and could draw on earlier work on such tools was essential. However, it remains the case that for most school subjects rich sources for the appropriate tools are lacking.

At a more general level, however, it could be seen that the practical activities developed did implement principles of learning that are prominent in the psychology literature. Examples are the constructivist principle that learning action must start from the learner's existing knowledge, the need for active and responsible involvement of the learner, the need to establish in the classroom a community of subject discourse, and the value of developing meta-cognition (see Black et al., 2003). Our orientation in developing the activities in line with such principles was in part intuitive, but became explicit when the teachers asked us to give a seminar on learning theory at one of the group's INSET sessions. We judge now that to have started with such principles when we had little idea of how to implement them in practice might have done little to motivate teachers.
One feature that emerges from considering the history of this work is that its success seems far from inevitable, but is, rather, the result of a series of contingencies. What
would have happened if our review article had not been commissioned, or if we had finished work when the review was complete (which is all we had been commissioned to do)? What if we had failed to secure funding for our project, or had failed to find teachers willing to work with us, or if we worked in a department forced to make cuts in staffing as a result of a research assessment exercise? The continuity of staffing seems to us to be particularly important. We do not think that it is a coincidence that the work here builds on a 25-year tradition of work on diagnostic and formative assessment within a single institution, and one in which the turnover of staff has been extremely low. While some of the expertise gathered over this time can be made explicit, much of it cannot, and we believe that our collective implicit knowledge has been at least as important to our work as published research findings.

This history paints a picture of educational research not as a steady building up of
knowledge towards some objective truth about teaching and learning, but, rather, as a trajectory buffeted by combinations of factors. Calling this 'the mangle of practice', Pickering (1995) shows that neither the view of scientific knowledge as a series of truths waiting to be discovered, nor the view that scientific knowledge is whatever scientists do, is an adequate description. Instead, he suggests, scientists build ideas about how the world is, which are then 'mangled' in their contact with the objective physical world. When questions become too difficult, scientists change the questions to ones that are tractable. In social sciences, the mangle is even more complex, in that what it is possible to research, and what it is good to research, keeps changing.
From this perspective, all our activities (the development of theoretical resources, work with teachers, rhetorical campaigns to convince policy-makers and teachers of the utility of formative assessment, even co-opting the print and broadcast media to our purpose) make a kind of sense.
Educational research can and does make a difference, but it will succeed only if we recognise its messy, contingent, fragile nature. Some policy-makers believe that supporting educational research is crazy, but surely the real madness is to carry on what we have been doing, and yet to expect different outcomes.
Acknowledgements
We would like to acknowledge the initiative of the Assessment Policy Task Group of the British Educational Research Association (now known as the Assessment Reform Group), who gave the initial impetus and support for our research review. We are grateful to the Nuffield Foundation, who funded the original review and the first phase of our project. We are also grateful to Professor Myron Atkin and his colleagues at Stanford University, who secured funding from the US National Science Foundation (NSF Grant REC-9909370) for the last phase. Finally, we are indebted to the Medway and Oxfordshire local education authorities, their six schools and, above all, to their 36 teachers who took on the central and risky task of turning our ideas into practical working knowledge.
Correspondence: Paul Black, King's College London, School of Social Science and Public Policy, Department of Education and Professional Studies, Franklin-Wilkins Building, Waterloo Road, London SE1 9NN, UK.
REFERENCES
ASKEW, M. & WILIAM, D. (1995) Recent Research in Mathematics Education 5-16 (London, HMSO).
BANGERT-DROWNS, R. L., KULIK, C.-L. C., KULIK, J. A. & MORGAN, M. T. (1991a) The instructional effect of feedback in test-like events, Review of Educational Research, 61, pp. 213-238.
BANGERT-DROWNS, R. L., KULIK, J. A. & KULIK, C.-L. C. (1991b) Effects of frequent classroom testing, Journal of Educational Research, 85, pp. 89-99.
BLACK, P. (1997) Whatever happened to TGAT?, in: C. CULLINGFORD (Ed.) Assessment vs. Evaluation, pp. 24-50 (London, Cassell).
BLACK, P. J. (1993a) Assessment policy and public confidence: comments on the BERA Policy Task Group's article 'Assessment and the improvement of education', The Curriculum Journal, 4, pp. 421-427.
BLACK, P. J. (1993b) Formative and summative assessment by teachers, Studies in Science Education, 21, pp. 49-97.
BLACK, P. J. & WILIAM, D. (1998a) Assessment and classroom learning, Assessment in Education, 5, pp. 7-73.
BLACK, P. J. & WILIAM, D. (1998b) Inside the Black Box: raising standards through classroom assessment (London, King's College London School of Education).
BLACK, P. J. & WILIAM, D. (1998c) Inside the black box: raising standards through classroom assessment, Phi Delta Kappan, 80, pp. 139-148.
BLACK, P., HARRISON, C., LEE, C., MARSHALL, B. & WILIAM, D. (2003) Assessment for Learning: putting it into practice (Buckingham, Open University Press).
BLOOM, B. S., HASTINGS, J. T. & MADAUS, G. F. (Eds) (1971) Handbook on the Formative and Summative Evaluation of Student Learning (New York, McGraw-Hill).
BROUSSEAU, G. (1984) The crucial role of the didactical contract in the analysis and construction of situations in teaching and learning mathematics, in: H.-G. STEINER (Ed.) Theory of Mathematics Education: ICME 5 topic area and miniconference, pp. 110-119 (Bielefeld, Institut für Didaktik der Mathematik der Universität Bielefeld).
COHEN, J. (1988) Statistical Power Analysis for the Behavioural Sciences (Hillsdale, NJ, Lawrence Erlbaum Associates).
COMMITTEE OF INQUIRY INTO THE TEACHING OF MATHEMATICS IN SCHOOLS (1982) Report: mathematics counts (London, HMSO).
CROOKS, T. J. (1988) The impact of classroom evaluation practices on students, Review of Educational Research, 58, pp. 438-481.
DAUGHERTY, R. (1995) National Curriculum Assessment: a review of policy 1987-1994 (London, Falmer Press).
DEPARTMENT OF EDUCATION AND SCIENCE & WELSH OFFICE (1985) General Certificate of Secondary Education: the national criteria (London, HMSO).
DEPARTMENT OF EDUCATION AND SCIENCE & WELSH OFFICE (1987) The National Curriculum 5-16: a consultation document (London, Department of Education and Science).
FLYVBJERG, B. (2001) Making Social Science Matter: why social inquiry fails and how it can succeed again (Cambridge, Cambridge University Press).
GIPPS, C. V., BROWN, M. L., MCCALLUM, B. & MCALISTER, S. (1995) Intuition or Evidence? Teachers and the national assessment of seven year olds (Buckingham, Open University Press).
GLASS, G. V., MCGAW, B. & SMITH, M. L. (1981) Meta-analysis in Social Research (Beverly Hills, CA, Sage).
HARGREAVES, D. H. (1999) The knowledge creating school, British Journal of Educational Studies, 47, pp. 122-144.
HARLEN, W., GIPPS, C., BROADFOOT, P. & NUTTALL, D. (1992) Assessment and the improvement of education, The Curriculum Journal, 3, pp. 215-230.
HARLOW, L. L., MULAIK, S. A. & STEIGER, J. H. (Eds) (1997) What If There Were No Significance Tests? (Mahwah, NJ, Lawrence Erlbaum Associates).
HARRISON, A. (1982) Review of Graded Tests (London, Methuen).
HART, K. M. (Ed.) (1981) Children's Understanding of Mathematics: 11-16 (London, John Murray).
HILLAGE, J., PEARSON, R., ANDERSON, A. & TAMKIN, P. (1998) Excellence in Research on Schools (London, Department for Education and Employment).
JOHNSON, D. C. (Ed.) (1989) Children's Mathematical Frameworks 8-13: a study of classroom teaching (Windsor, NFER-Nelson).
KULIK, C.-L. C., KULIK, J. A. & BANGERT-DROWNS, R. L. (1990) Effectiveness of mastery learning programs: a meta-analysis, Review of Educational Research, 60, pp. 265-299.
LEE, C. (2000) The King's Medway Oxfordshire Formative Assessment Project: studying changes in the practice of two teachers. Paper presented at symposium entitled 'Getting inside the black box: formative assessment in practice' at the British Educational Research Association 26th Annual Conference, Cardiff University.
MOSTELLER, F. W. & TUKEY, J. W. (1977) Data Analysis and Regression: a second course in statistics (Reading, MA, Addison-Wesley).
NATIONAL CURRICULUM TASK GROUP ON ASSESSMENT AND TESTING (1988a) A Report (London, Department of Education and Science).
NATIONAL CURRICULUM TASK GROUP ON ASSESSMENT AND TESTING (1988b) Three Supplementary Reports (London, Department of Education and Science).
NATRIELLO, G. (1987) The impact of evaluation processes on students, Educational Psychologist, 22, pp. 155-175.
NUTHALL, G. & ALTON-LEE, A. (1995) Assessing classroom learning: how students use their knowledge and experience to answer classroom achievement test questions in science and social studies, American Educational Research Journal, 32, pp. 185-223.
PERRENOUD, P. (1991) Towards a pragmatic approach to formative evaluation, in: P. WESTON (Ed.) Assessment of Pupil Achievement, pp. 79-101 (Amsterdam, Swets & Zeitlinger).
PICKERING, A. (1995) The Mangle of Practice: time, agency, and science (Chicago, IL, University of Chicago Press).
REEVES, J., MCCALL, J. & MACGILCHRIST, B. (2001) Change leadership: planning, conceptualization and perception, in: J. MACBEATH & P. MORTIMORE (Eds) Improving School Effectiveness, pp. 122-137 (Buckingham, Open University Press).
SCRIVEN, M. (1967) The Methodology of Evaluation (Washington, DC, American Educational Research Association).
SHAYER, M. & ADEY, P. S. (1981) Towards a Science of Science Teaching: cognitive development and curriculum demand (London, Heinemann).
SLAVIN, R. E. (1990) Achievement effects of ability grouping in secondary schools: a best evidence synthesis, Review of Educational Research, 60, pp. 471-499.
STOKES, D. E. (1997) Pasteur's Quadrant: basic science and technological innovation (Washington, DC, Brookings Institution Press).
THATCHER, M. (1993) The Downing Street Years (London, Harper Collins).
TORRANCE, H. (1993) Formative assessment: some theoretical problems and empirical questions, Cambridge Journal of Education, 23, pp. 333-343.
TORRANCE, H. & PRYOR, J. (1998) Investigating Formative Assessment (Buckingham, Open University Press).
TOULMIN, S. (2001) Return to Reason (Cambridge, MA, Harvard University Press).
TUNSTALL, P. & GIPPS, C. (1996) Teacher feedback to young children in formative assessment: a typology, British Educational Research Journal, 22, pp. 389-404.
WILIAM, D. (2001) Level Best? Levels of attainment in national curriculum assessment (London, Association of Teachers and Lecturers).
WILIAM, D. (2003) The impact of educational research on mathematics education, in: A. BISHOP, M. A. CLEMENTS, C. KEITEL, J. KILPATRICK & F. K. S. LEUNG (Eds) Second International Handbook of Mathematics Education, pp. 469-488 (Dordrecht, Kluwer).
WILIAM, D., LEE, C., HARRISON, C. & BLACK, P. (2004) Teachers developing assessment for learning: impact on student achievement. To appear in Assessment in Education: Principles, Policy and Practice, 11(2).
WILSON, S. M. & BERNE, J. (1999) Teacher learning and the acquisition of professional knowledge: an examination of research on contemporary professional development, in: A. IRAN-NEJAD & P. D. PEARSON (Eds) Review of Research in Education, pp. 173-209 (Washington, DC, American Educational Research Association).