Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Upload: ofqual-slideshare

Post on 14-Dec-2014

DESCRIPTION

Dylan Wiliam's presentation on 'The reliability of educational assessments' at Ofqual's 2009 Chief Regulator's Report event

TRANSCRIPT

Page 1: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event
Page 2: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

The reliability of educational assessments

Dylan Wiliam

www.dylanwiliam.net

Ofqual Annual Lecture, Coventry: 7 May 2009

Page 3: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

The argument

The public understanding of the reliability of assessments is weak

Contributory factors are:
  The need of humans for certainty (and beliefs that its absence is chaos)
  The inherent unreliability of all measurements, educational and otherwise
  The use in education of tools derived from individual differences psychology
  An emphasis on scores, rather than how they are used
  Political assumptions about the educability of the public, combined with
  A desire to use assessment outcomes as drivers of reform

Those who produce—and those who mandate the use of—assessments must take responsibility for informed use of assessment outcomes

Page 4: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Dealing with uncertainty in society

People like certainty…
  Hilbert (1900): “In mathematics, there is no ignorabimus”
  He was wrong
  And it was unsettling (Klein, 1980)

…and to attribute blame…
  Deaths of children in care (e.g., “Baby P.”)

…although there are some cases where uncertainty is accepted
  “It is better and more satisfactory to acquit a thousand guilty persons than to put a single innocent one to death” (Maimonides)
  “It is better that ten guilty persons escape than that one innocent suffer” (Blackstone)

Page 5: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

The very first high-stakes assessment…

“Then Jephthah gathered together all the men of Gilead, and fought with Ephraim: and the men of Gilead smote Ephraim, because they said, Ye Gileadites are fugitives of Ephraim among the Ephraimites, and among the Manassites.

And the Gileadites took the passages of Jordan before the Ephraimites: and it was so, that when those Ephraimites which were escaped said, Let me go over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay;

Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan: and there fell at that time of the Ephraimites forty and two thousand.” (Judges 12:4-6, King James Version)

Page 6: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Reliability

Hansen (1993) distinguishes between literal and representational assessments
  There are no literal assessments
  All assessments are representational
  All assessments involve generalization

Reliability is a measure of the stability of assessment outcomes under changes in, or the ability to generalize across, things that (we think) shouldn’t make a difference, such as:
  marker/rater
  occasion*
  item selection*

* UK excepted

Page 7: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Uncertainty in assessing English

Starch & Elliott (1912)

Page 8: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Uncertainty in assessing mathematics

Starch & Elliott (1913)

Page 9: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Measures of reliability

In classical test theory, reliability is defined as a kind of “signal-to-noise” ratio (in fact a signal to signal-plus-noise ratio)
  Reliability is increased by decreasing the noise, or, easier, by increasing the signal
  Hence the need for discrimination

The legacy of individual differences psychology
  A focus on discrimination between individuals

In education, more appropriate ways of estimating reliability exist
  Discriminating between those who have and have not been taught
  Discriminating between those who have and have not been taught well
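The “signal to signal-plus-noise” description can be written out as a ratio of variances. The symbols below are the usual classical test theory notation and are an assumption here, not something shown on the slide: an observed score X is taken to be a true score T plus an uncorrelated error E.

```latex
X = T + E, \qquad
\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Read this way, the ratio rises either when the error variance shrinks or when the true-score variance grows, which is why a test that spreads candidates out more will tend to report a higher reliability.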

Page 10: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Test length and reliability

To \ From   0.70   0.75   0.80   0.85   0.90   0.95
0.70         1.0
0.75         1.3    1.0
0.80         1.7    1.3    1.0
0.85         2.4    1.9    1.4    1.0
0.90         3.9    3.0    2.3    1.6    1.0
0.95         8.1    6.3    4.8    3.4    2.1    1.0

Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).
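The slide does not say how the table was computed, but its figures are consistent with the Spearman-Brown formula from classical test theory. The sketch below assumes that formula, and assumes that any added material behaves like the existing items; the function names are illustrative only.

```python
def length_factor(from_reliability: float, to_reliability: float) -> float:
    """Factor by which test length must grow to move between two reliabilities.

    Uses the Spearman-Brown formula (an assumption about how the table was
    built), which treats the extra items as parallel to the existing ones.
    """
    if not (0.0 < from_reliability < 1.0 and 0.0 < to_reliability < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (to_reliability * (1.0 - from_reliability)) / (
        from_reliability * (1.0 - to_reliability)
    )


def lengthened_reliability(reliability: float, factor: float) -> float:
    """Predicted reliability after multiplying test length by `factor`."""
    return factor * reliability / (1.0 + (factor - 1.0) * reliability)


if __name__ == "__main__":
    levels = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
    # One row per target reliability, one column per starting reliability.
    for target in levels:
        row = [f"{length_factor(start, target):4.1f}" for start in levels if start <= target]
        print(f"{target:.2f}", " ".join(row))
```

On this reading, moving a test from a reliability of 0.70 to 0.95 means making it roughly eight times as long.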

Page 11: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Reliability is not what we really want

Take a test which is known to have a reliability of around 0.90 for a particular group of students.
  Administer the test to the group of students and score it
  Give each student a random script rather than their own
  Record the scores assigned to each student

What is the reliability of the scores assigned in this way?
  A. 0.10
  B. 0.30
  C. 0.50
  D. 0.70
  E. 0.90
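One way to explore the question is a small simulation. Everything in this sketch is an assumption made for illustration: the number of students and items, the noise level (chosen so that the reliability estimate comes out near 0.90), and the use of Cronbach's alpha as the reliability estimate. It compares scripts held by their own authors with the same scripts handed out at random.

```python
import random
import statistics


def cronbach_alpha(scripts):
    """Cronbach's alpha for a list of scripts, each a list of item marks."""
    k = len(scripts[0])
    item_variances = [statistics.pvariance([s[i] for s in scripts]) for i in range(k)]
    total_variance = statistics.pvariance([sum(s) for s in scripts])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def simulate(n_students=2000, n_items=20, noise_sd=1.5, seed=1):
    """Compare own scripts with randomly reassigned scripts (hypothetical data)."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    # Each item mark is the student's ability plus independent noise.
    own = [[a + rng.gauss(0.0, noise_sd) for _ in range(n_items)] for a in abilities]
    assigned = own[:]          # same scripts...
    rng.shuffle(assigned)      # ...handed to students at random
    return {
        "alpha_own": cronbach_alpha(own),
        "alpha_assigned": cronbach_alpha(assigned),
        "corr_with_ability_own": pearson(abilities, [sum(s) for s in own]),
        "corr_with_ability_assigned": pearson(abilities, [sum(s) for s in assigned]),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the reliability estimate is the same for both sets of scores, because it depends only on the collection of scripts, while the randomly assigned scores no longer bear any relation to the students who receive them.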

Page 12: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Reliability v consistency

Classical measures of reliability
  are meaningful only for groups
  are designed for continuous measures

Marks versus grades
  Scores suffer from spurious accuracy
  Grades suffer from spurious precision

Classification consistency
  A more technically appropriate measure of the reliability of assessment
  Closer to the intuitive meaning of reliability

Page 13: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Uncertainty in assessment at A-level

Classification consistency at A-level
  Please, D. N. (1971) “Estimation of the proportion of examination candidates who are wrongly graded”. Br. J. math. statist. Psychol. 24: 230-238.

Page 14: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Here’s the table that got me into trouble…

Classification consistency of National Curriculum Assessment in England

                     Reliability
Key stage   Levels   0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95
KS1         3        73%    75%    77%    79%    81%    83%    86%    90%
KS2         5        56%    58%    60%    64%    68%    73%    77%    84%
KS3         8        45%    47%    50%    54%    57%    62%    68%    76%
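The slide does not show how these percentages were derived. As a rough illustration of the idea only, the sketch below estimates classification consistency by simulation, taking it to mean the share of students whose observed level matches the level of their true score. The normal score distribution, the cut scores and the sample size are all hypothetical, so the output is not expected to reproduce the table.

```python
import bisect
import random


def classification_consistency(reliability, cut_scores, n_students=20000, seed=0):
    """Share of simulated students whose observed level equals their true level.

    Observed scores have unit variance: true-score variance is `reliability`
    and error variance is `1 - reliability`. `cut_scores` must be sorted and
    are hypothetical level boundaries on the observed-score scale.
    """
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    error_sd = (1.0 - reliability) ** 0.5
    agree = 0
    for _ in range(n_students):
        true_score = rng.gauss(0.0, true_sd)
        observed = true_score + rng.gauss(0.0, error_sd)
        # bisect gives the index of the level band a score falls into.
        if bisect.bisect(cut_scores, true_score) == bisect.bisect(cut_scores, observed):
            agree += 1
    return agree / n_students


if __name__ == "__main__":
    few_levels = [-0.5, 0.5]                                # 3 bands (hypothetical)
    many_levels = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]    # 8 bands (hypothetical)
    for r in (0.60, 0.80, 0.95):
        print(
            f"reliability {r:.2f}: "
            f"3 bands {classification_consistency(r, few_levels):.0%}, "
            f"8 bands {classification_consistency(r, many_levels):.0%}"
        )
```

The sketch shows the same two patterns as the table: consistency rises with reliability, and it falls as the number of levels increases.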

Page 15: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

AERA, APA, NCME Standards (4th ed., 1999)

Standard 2.1
  For each total score, subscore or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported (p. 31)

Standard 2.2
  The standard error of measurement, both overall and conditional (if relevant) should be reported both in raw score or original scale units and in units of each derived score recommended for use in test interpretation (p. 31, my emphasis).

Standard 2.3
  When test interpretation emphasizes differences between two observed scores of an individual, or two averages of a group, reliability data, including standard errors, should be provided for such differences (p. 32)
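To make the quantities named in these standards concrete, here is a minimal sketch using the usual classical test theory formulas. The score scale (standard deviation 15) and the reliability (0.91) are hypothetical, and the standard error of a difference assumes two scores with equal, independent errors.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical SEM: the score standard deviation times sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


def standard_error_of_difference(score_sd: float, reliability: float) -> float:
    """SE of the difference between two scores with equal, independent errors."""
    return math.sqrt(2.0) * standard_error_of_measurement(score_sd, reliability)


if __name__ == "__main__":
    sd, reliability = 15.0, 0.91   # hypothetical reporting scale
    sem = standard_error_of_measurement(sd, reliability)
    print(f"SEM: {sem:.1f}")
    print(f"Roughly 68% band around a score of 100: {100 - sem:.1f} to {100 + sem:.1f}")
    print(f"SE of a difference between two scores: {standard_error_of_difference(sd, reliability):.1f}")
```

Even at a reliability above 0.90, the uncertainty attached to a difference between two scores is noticeably larger than the uncertainty attached to either score alone.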

Page 16: Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event

Conclusion

It is simply unethical to produce or to mandate the use of assessments without taking steps to ensure informed use of the outcomes of the assessments by those likely to do so.
  Error bounds should be routinely estimated, and reported in terms of the units used for reporting (e.g., grades and aggregate measures)
  Government and its agencies should actively promote public understanding of
    the limitations of assessments, both in terms of reliability and other aspects of validity
    appropriate interpretations of assessment outcomes, for individuals and groups