Reliability of Essay Type Questions — effect of structuring


Assessment in Education: Principles, Policy & Practice. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/caie20

Manorama Verma, Jugesh Chhatwal & Tejinder Singh, Department of Paediatrics, Christian Medical College, Ludhiana-141008, India. Published online: 28 Jul 2006.

To cite this article: Manorama Verma, Jugesh Chhatwal & Tejinder Singh (1997) Reliability of Essay Type Questions — effect of structuring, Assessment in Education: Principles, Policy & Practice, 4:2, 265-270, DOI: 10.1080/0969594970040204

To link to this article: http://dx.doi.org/10.1080/0969594970040204


Reliability of Essay Type Questions — effect of structuring

Manorama Verma, Jugesh Chhatwal & Tejinder Singh
Department of Paediatrics, Christian Medical College, Ludhiana-141008, India

ABSTRACT The present study was designed to test the reliability of traditional essay type questions and to see the effect of 'structuring' on the reliability of those questions. Sixty-two final MBBS students were divided into two groups of 31 each. Group A was administered a 2-hour test containing five traditional essay questions taken from previous university papers, while Group B was administered the same questions in a structured format. The answer sheets were photocopied and evaluated independently by seven examiners. The dispersion of marks was significantly greater in Group A, as was the variance between the marks awarded by the seven examiners to each student. Correlation of the individual marks awarded was also poor for Group A scripts. The internal consistency for Group A was 0.37 (p > 0.05) while that for Group B was 0.69 (p < 0.05). Inter-examiner agreement on the ranks awarded was better for Group B. These findings suggest that the reliability of traditional essay questions can be improved by structuring them.

Knowledge is an important component of, and prerequisite for, any educational programme. Yet its evaluation poses an enigma: knowledge cannot be directly observed; it can only be inferred. Devising a truly valid evaluation instrument is therefore difficult, and there is often a trade-off between choosing a highly reliable (but less valid) instrument, such as multiple choice questions (MCQs), and a more valid (but less reliable) instrument, such as an essay type question.

In spite of problems with reliability, traditional essay questions (TEQs) are thought to evaluate higher domains of learning and remain the mainstay of the evaluation of knowledge in most Indian universities. Lately, a trend has been observed of replacing TEQs with what are called 'short notes', which limit the available time per question without limiting the content (e.g. write short notes on: hyperthyroidism, digoxin, portal hypertension, etc.). However, this does not serve the intended purpose (unpublished observation).

It is generally accepted that one should opt for a more valid instrument rather than a more reliable one (Harper & Harper, 1990). To that extent, essay type questions remain an obvious choice for the evaluation of knowledge. However, their relatively low reliability is a cause for concern. Hence, the present study was conducted with the following twin objectives:

(1) to estimate the reliability of TEQs; and


(2) to study the effect of structural modifications on the reliability of TEQs.

Material and Methods

The study was conducted on the answer sheets of 62 final-year medical students of a traditional medical school in north India at a pre-university examination. The students were divided into two equal groups of odd and even numbers. Group A, consisting of odd-numbered students, was given a test paper consisting of five TEQs in the subject of paediatrics. To maintain objectivity, the questions were taken from previous years' university examination papers. The examination was of 2 hours' duration and all questions carried equal weight. The maximum possible mark was 100.

To the even-numbered students in Group B, the same questions were put in a structured format. For structuring, the components of each question were arrived at by discussion among teachers not involved with the study. These components were framed into questions requiring the student to provide a definite answer with minimal ambiguity. Differential weighting was given to the questions depending on the length of the expected answer. These were called structured essay questions (SEQs). Examples of each type of question are provided in the Appendix.

To test marker reliability, seven teachers of paediatrics (five from one university and two from another) were asked to evaluate the photocopied answer sheets, first for Group A and later for Group B. No model answers, check-lists or other guidelines regarding evaluation were provided.

The mean of all seven marks awarded to each candidate was taken as his/her 'fair mark'. Students in both groups were ranked in order of their 'fair marks'. The following statistical analysis was carried out on the data thus generated.

(1) The difference between the marks awarded by the seven examiners and the fair marks was calculated for each student and then for the whole group.

(2) The coefficient of variance (ANOVA) between examiners was calculated, both for the marks and for the ranks. Rank analysis was confined to the top 10 students picked on the basis of fair marks.

(3) The coefficient of correlation, using the Pearson product-moment formula, was calculated between the marks awarded by individual examiners and the fair marks; the same was done for the ranks awarded by individual examiners against the ranks obtained on the basis of fair marks.

All the values were checked for statistical significance against standard statistical tables. An illustrative sketch of this analysis is given below.
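To make the procedure concrete, here is a minimal sketch (not the authors' code; the marks matrix is simulated, since the raw data are not published) of the three analyses applied to one hypothetical group of 31 students marked by 7 examiners:

```python
# Illustrative reconstruction of the marker-reliability analysis.
# The marks are simulated; the study used real scripts
# (31 students x 7 examiners per group, maximum mark 100).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
marks = rng.normal(30, 8.4, size=(31, 7)).clip(0, 100)  # hypothetical marks

# 'Fair mark' = mean of the seven examiners' marks for each student.
fair = marks.mean(axis=1)

# (1) Dispersion: mean absolute difference of awarded marks from the fair mark.
dispersion = np.abs(marks - fair[:, None]).mean()

# (2) One-way ANOVA across examiners: do examiners differ systematically?
f_stat, p_val = stats.f_oneway(*(marks[:, j] for j in range(7)))

# (3) Pearson correlation of each examiner's marks with the fair marks.
for j in range(7):
    r, p = stats.pearsonr(marks[:, j], fair)
    print(f"Examiner {j + 1}: r = {r:.2f} (p = {p:.3f})")

print(f"dispersion = {dispersion:.1f}, F = {f_stat:.2f} (p = {p_val:.3f})")
```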

Results

The mean marks obtained by the two groups were 30 ± 8.4 and 36 ± 8.4 respectively (χ² = 2.766, p < 0.001), while the mean dispersion from fair marks was 12.9 ± 5.4 and 9.8 ± 4.4, respectively (χ² = 2.437, p < 0.05). The variance between the marks awarded by examiners to Group A (F = 2.99, p < 0.05) was significant. The correlations between individual marks and fair marks, and between individual ranks and fair ranks, are shown in Tables I and II.


TABLE I. Correlation of individual marks with 'fair marks'

Examiner    Group A    Group B
1           0.84*      0.90*
2           0.22       0.92*
3           0.60       0.85*
4           0.69*      0.97*
5           0.08       0.91*
6           0.79*      0.56
7           0.54       0.76*

*p < 0.05.

TABLE II. Rank correlation with 'fair marks'

Examiner    Group A    Group B
1           0.74*      0.78*
2           0.14       0.79*
3           0.13       0.69*
4           0.74*      0.90*
5           0.56       0.68*
6           0.08       0.59*
7           0.39       0.44

*p < 0.05.

The internal consistency of the TEQs, calculated by the split-half method, was 0.37 (p > 0.05, NS), while that of the SEQs was 0.69 (p < 0.05). The estimated internal consistency of a paper consisting of twice the number of TEQs was 0.54, while that of four times the number was 0.70 (both p > 0.05, NS).
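The lengthened-paper estimates are consistent with the Spearman-Brown prophecy formula, the standard projection of reliability for a test lengthened k-fold; the paper does not name its method, so this is an inference from the reported numbers:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a test of reliability r lengthened k times."""
    return k * r / (1 + (k - 1) * r)

r_teq = 0.37                     # split-half internal consistency of the TEQs
print(spearman_brown(r_teq, 2))  # ~0.54, the reported doubled-paper estimate
print(spearman_brown(r_teq, 4))  # ~0.70, the reported quadrupled-paper estimate
```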

Discussion

A classic study (Harper & Misra, 1988) was published in India assessing the marker reliability of essay type questions in history from a large-scale public examination. The study clearly brought out the extremely low reliability of traditional essay questions. Unfortunately, no such study has been attempted in the field of medical education, although TEQs have been used for the evaluation of knowledge in over 150 Indian medical schools for a long time. The present study aimed at providing an objective answer to this problem and highlighted some interesting aspects.

The group of students answering TEQs scored lower mean marks than the group answering SEQs (30 ± 8.4 vs 36 ± 8.4). The difference of marks from fair marks was also significantly greater for TEQs.


FIG. 1. Rank analysis: ranking pattern of Groups A and B. [Figure: two panels plot the ranks (1-10) awarded to individual students in Group A and Group B.]

This suggests that, by using TEQs, the students are put at a disadvantage -- not only of obtaining lower marks but also of being subjected to more inter-marker variation. The probability of being assigned to a 'dove' or a 'hawk' examiner depends solely on the 'luck' of the student! (Wakeford & Roberts, 1984). Thus, whether a student passes, fails or gets honours depends on extraneous factors rather than on his/her ability.

Further objective evidence of this is provided by analysis of the variance of the mean marks awarded by the seven examiners to the entire group. The variation among examiners for the TEQ group was significant, not only for marks but also for ranks awarded. However, the variance in marks and ranks was not significant for the SEQs. It was interesting to note that the student ranked first in TEQs by four examiners was ranked third, fourth and eighth by the other three. Similarly, the student ranked second by one examiner was ranked eighth, ninth and tenth by three of the other examiners. For the SEQs, the ranking was more consistent and correlated better with the fair ranks. In spite of variations in the marks awarded, the student ranking first on fair marks was ranked first by all seven examiners. Figure 1 provides the ranking pattern for the two groups. There is total scatter in the ranks awarded by examiners to the TEQ group.

For the TEQ group, only three out of seven examiners had a significant correlation with the fair marks, while for the SEQs this was true for six examiners. Similarly, rank correlation was significant for six examiners for the SEQs as against only two for the TEQs. In other words, at least four examiners improved their correlation coefficient when marking structured essay questions. Even for those examiners with a significant correlation when marking TEQs, the correlation improved further for SEQs. Though difficult to quantify, this 'proximity movement' cannot be overlooked and suggests an encouraging process.

The internal consistency of the TEQ paper was poor and did not reach a statistically significant level even for a hypothetical paper of 20 questions (although it would be impossible in actual practice to have so long a paper). For the SEQ paper, however, the internal consistency was significant.


With the advent of what can be called an 'assessment movement' (Association of Indian Universities, 1982), great emphasis was placed on MCQs because of their objectivity and reliability. However, it soon became apparent that this dependence was inflated. MCQs, at least in the most commonly used format, fail to test the higher domains of learning (McCloskey & Holland, 1976). It is very easy to construct a poor-quality MCQ, and the results obtained from such MCQs tend to gain respectability under the garb of reliability! For these reasons, many institutions have discontinued the use of MCQs in final certifying examinations (Wakeford & Roberts, 1984).

Some of the earlier studies on TEQs have suggested that reliability can be improved by reducing the length of the expected answer (Bull, 1956; Wakeford & Roberts, 1979). In another study, however, we demonstrated that merely reducing the length of the answer does not help, as students tend to be evaluated on different abilities (unpublished observation). Other suggestions, such as providing a model answer or check-lists to the examiners, are not feasible, at least in our circumstances, because the person setting the paper is usually different from the person evaluating it, and thus the chances of agreeing on a common check-list are low. Furthermore, with most medical schools having upwards of 150 students per cohort, evaluating against a check-list is a time-consuming process and operationally not feasible. For similar reasons, having TEQs evaluated by more than one examiner is also not possible.

'Structuring' an essay question provides a tool for improving marker reliability. The present study has confirmed the efficacy of such an intervention. It may, however, be noted that TEQs and SEQs, although tapping the same knowledge, do not do so in the same way. TEQs may test the analysis and synthesis of knowledge, while SEQs are more likely to test recall of facts. However, it would be wrong to tell ourselves that all examiners evaluate TEQs with that objective in mind.

It is worth mentioning some of the limitations of the present study. The 'examiners' may have been conscious of the fact that their evaluations would be subjected to scrutiny, and this may have introduced some element of artificiality. Further, the common tendency of applying criterion-referenced evaluations could have influenced the results. However, these criticisms apply to both groups and thus may have counterbalanced each other.

To conclude, it can be stated that structuring an essay question helps to improve marker reliability as well as the internal consistency of the test, without any additional input. It would be worthwhile to devise SEQs for different levels of learning (cf. Bloom's taxonomy; Bloom, 1956) and to replicate the study so that a clearer picture can be obtained.

Acknowledgements

Acknowledgements are due to the seven examiners who helped in the evaluation of scripts and to Dr Shavinder Singh, Reader in Community Health, Christian Medical College, Ludhiana, for help in the statistical analysis.


References

Association of Indian Universities (1982) Monograph on Student Assessment (New Delhi, Association of Indian Universities).
Bloom, B.S. (1956) Taxonomy of Educational Objectives: the classification of educational goals (New York, David McKay).
Bull, G.M. (1956) An examination of the final examination in medicine, Lancet, ii, pp. 368-370.
Harper, A.E. & Harper, E.S. (1990) Preparing Objective Examinations (New Delhi, Prentice Hall of India).
Harper, E. Jr & Misra, V.S. (1988) Ninety-marking ten, in: S.K. Panda & K. Murugan (Eds) Readings in Distance Education II (New Delhi, Indira Gandhi National Open University).
McCloskey, D.L. & Holland, R.A.B. (1976) A comparison of student performance in answering essay type and multiple choice questions, Medical Education, 10, pp. 382-384.
Wakeford, R.E. & Roberts, S. (1979) A pilot experiment on the inter-rater reliability of short essay questions, Medical Education, 13, pp. 342-344.
Wakeford, R.E. & Roberts, S. (1984) Short answer questions in undergraduate qualifying examinations: a study of examiner reliability, Medical Education, 18, pp. 168-169.

Appendix

Essay Question

How will you approach a 3-year-old child presenting with convulsions? (20)

Structured Essay Question

(a) Enumerate five common causes of convulsions in a 3-year-old child. (5)
(b) What emergency care will you provide to a convulsing child? (5)
(c) How will you arrive at an aetiological diagnosis? (5)
(d) How will you treat him if the diagnosis is established as primary generalised epilepsy? (5)

Validity of National Testing at Age 11

M. Brown et al.

Nevertheless, there were concerns about particular groups and individuals. In 1994, 13 of the 21 teachers interviewed after the test who felt that pupils had generally coped well still referred by name to individual pupils who they felt had underperformed compared with what they should have achieved, and nine of the 21 mentioned problems of specific groups of pupils. This was in spite of the fact that the lowest-attaining pupils, who were more than 2 years behind the mean in attainment, should not have been attempting the tests, but should instead have been assessed using the practical tasks.

In order to reduce these perceived disadvantages, in the official tests in 1995, 17 of the 30 teachers had obtained permission from the head or local education authority, in accordance with SCAA regulations (Ref. KS2/95/137), to make special arrangements for some pupils. In most cases (11 teachers) this was to allow questions to be read to poor readers and/or bilingual pupils; in some cases observed, this required teachers to read questions one by one to pupils in a separate room, which also had the effect of allowing pupils significantly more time. Three other teachers had simply asked for extra time for pupils with special needs. Some teachers volunteered the view that withdrawal of a few had had the additional benefit of enabling both the withdrawn and the remaining groups of pupils to concentrate better.

At least 20 of the teachers also attempted before the tests to assist anxious, unconfident and slow-working pupils by practising in a test format with time limits. Several teachers had discussed examination techniques, including the importance of pacing.

After all these efforts to minimise disadvantage, it is perhaps not surprising that when, in 1995, all teachers were asked in the questionnaire whether they thought the tests as a whole had disadvantaged any children in any way, 22 of the 30 teachers thought that they had not. Two of these teachers volunteered that this was due to the special arrangements made.

Nevertheless, in relation to specific tests, the proportion of teachers who expressed concern about fairness was sometimes substantially greater than this. The most commonly cited sources of 'unfairness' are discussed under separate headings below.

Time Limits

Teachers felt very concerned about the effect of time constraints on individuals and, in some cases, the whole class:

It was too long—they needed more than the specified time;

It's not the children or us who are at fault, it was all down to the timing;

They were frustrated that they haven't completed this one.

In particular, all 30 teachers responding to the questionnaire reported that the 1995 mathematics tests were unfair because they did not allow most pupils to finish what they could do, and the majority were concerned in both years about the reading comprehension tests.


In the case of writing, some teachers in both years (seven out of 17 in 1994) felt that the time limit might have disadvantaged specific children who 'take a while to get going' or who can, with a little more time, show 'creativity', as well as whole classes who were used to working on a draft version and taking much longer to produce a polished story.

Generally, time limits were thought to penalise particularly slow but thorough workers and anxious children inclined to panic under time pressure:

There were one or two who we think were panicked by the timing. I felt they didn't really have an opportunity to show what they could do;

V showed that working to a time worried her and she rushed through all of these tests; no matter how much we talked to her and reassured her it's made no difference, and she certainly won't be showing her full potential.

Although concern was eased a little in 1995 by ensuring that most marks at the end of papers were for work at higher levels, it still seems unnecessary to cause additional anxiety by setting time limits which do not enable most, perhaps all, pupils to complete everything they can do and allow them time to review their work.

Carroll suggests that teachers are right to be concerned at a lack of validity from this source; he summarises research which suggests that speed and level of response are distinguishable traits, and that:

A test given in unlimited time is more likely to be a pure measure of a level ability. (Carroll, 1993, p. 507)

This, together with the fact that the 1995 science tests were not perceived as leading to invalid results because they enabled pupils to finish comfortably in the time allowed, suggested that the mathematics and reading comprehension tests should have either longer times or reduced content. Concern about time limits has now also been shown in the official evaluation of the tests (SCAA, 1995, p. 6) to be the issue most frequently raised. In response, the SCAA has agreed to lengthen the mathematics test times from 35 to 45 minutes starting in 1996, and to remove the Level-6 assessment for reading to a separate extension test. In the case of the writing test, there seems to be no good reason why pupils should not have at least a whole morning if necessary to complete their work, although the SCAA has elected to make no changes.

Insufficient Differentiation

Teachers felt that the wide range of attainment tested (Levels 3 to 5 for science and mathematics and Levels 3 to 6 for English) not only made the tests longer but also made them especially forbidding for weak or unconfident Level-3 pupils, who were 'fidgety' and generally found it difficult to sit still for long periods, 'got bored' or 'got bogged down', or just became discouraged because they 'found it difficult'.


It should be noted that since Level 3 is notionally defined to be the attainment of an average 9-year-old, and Level 5 that of an average 13-year-old, the weak Level-3 pupil at age 11 might well only be expected to be competent to do all the questions on a Level 3-5 paper 5 years later, at the age of 16.

The worst problems occurred with the mathematics tests, with more than half of the teachers interviewed in 1994 mentioning low attainers, and an almost universal perception that the tests were 'very difficult', even 'formidable', in 1995:

Knock[ed] them back completely;

The children who are borderline 3s, they had a thoroughly horrible 35 minutes;

There wasn't anything in the paper to give them [the Level 3s] any confidence;

Set too high, at the top end of Level 3—showed what children couldn't do.

Mathematics A was the only test in which a team member observed a child crying at the end, amid a general 'stunned silence'. Teachers commented that some pupils had 'looked at it aghast' and needed reassurance to encourage them to continue.

The reading comprehension test in both years was felt to be in an attractive and motivating format, but suffered some of the same problems as the 1995 mathematics tests. The need to recognise achievement across Levels 3-6 meant that many teachers felt it made 'weak Level 3s' 'frustrated' and 'despondent about the amount and density of it'.

Ten out of 19 teachers made such comments in 1994. Similarly, in 1995, although teachers were pleased that there were easy questions at the beginning, many thought it unfair to present weaker and less mature pupils with both a story in which the main point was too subtle for them to understand and a difficult poem:

Too many catches really for other than the brightest children. So I found the brightest children coped but those I teacher-assessed at 4 found it difficult.

Vocabulary aimed in some cases at Level 6 was also felt to present unfair obstacles. Science was not thought to share these problems, although both teachers and observers commented that by the time of Science B, the last of the mainstream tests in 1995, children were test-weary, 'tired' and 'lacking in concentration', and hence may have performed below capability.

The fact that Level-3 pupils were faced with tests much of which they did not understand, while Level-5 pupils experienced tests within their grasp and with many easy items, was correctly identified as a major source of inequity. Several teachers proposed tests with a more limited range (for example, separating the tests into two, covering Levels 3 and 4 and Levels 4 and 5). One teacher suggested a graded set of tests aimed at each level in turn, with pupils joining in where appropriate and dropping out if they failed to reach a criterion mark for a level. Both of these would also enable less content to be covered and, hence, each test to be more manageable in the given time limits. It would also seem worthwhile to explore the suitability of the system at Key Stage 3, in which each pupil sits a test with questions at their


estimated level and from one level below and one above this. This latter system does not suffer from the problems of equity which have already been identified, and although it may introduce a possibility of bias in relying on teacher assessment of level, this seems likely to be a minor factor in comparison with the current unfairness to low-attaining pupils entered for mainstream tests.

The only concession made has been the decision to limit the mainstream 1996 reading comprehension test to Levels 3-5 (SCAA, 1995, p. 8).

Language Problems

The feature second most commonly cited as a source of unfairness in the mathematics tests, cited by 27 of the 30 teachers in 1995, was the difficulty of the reading level. Several teachers mentioned this as a problem for validity, in spite of the fact that 11 had made special arrangements for some pupils to have the questions read to them:

Too much literature I think in the maths test, because I do have children who are pretty able at maths, but they are not so able in language.

Nevertheless, overall, teachers in 1995 were less concerned than in 1994 about the problems for poor readers and second-language speakers, partly because of the flexibility they had had in making special arrangements, either by withdrawing groups to whom the tests were read, or by being able in the classroom to read specific phrases to individuals when they had problems. Although this is likely to be a continuing source of unfairness, it would seem that this solution is an appropriate one and goes as far as possible to provide equity.

Bias

In 1994, eight out of the 17 teachers felt the choice of writing task was discriminatory, favouring either 'middle-class children' (because of the style of the starter phrases), 'boys' (because of the 'Time Travel' title), or children who liked fantasy writing and/or writing in the first person. Presumably to counter the latter criticism, the 1995 selection included a non-narrative option of writing a letter of complaint. However, since we observed the choice between narrative and non-narrative generally to have been made by the teacher on behalf of the whole class, the ability of pupils to select according to their preferred style seems to have been limited. Nevertheless, teachers were certainly happier with the equity aspect of the selection in 1995.

Representativeness of Content and Embodiment of Good Practice

When teachers were asked to comment on the content of each of the tests, their comments concerned a number of issues, which are dealt with under the two broad headings of coverage and balance, and reflection of good practice.


Coverage, and Balance of Content and Difficulty

Few teachers, when asked about the quality of the test material, raised the problem of validity caused by basing a high-status assessment on a small sample of evidence, although those who did felt it to be an important factor:

Seventy minutes in total is not a lot to test a child's mathematical knowledge;

There is no way you can cover four years' [science] work ... in 70 minutes of tests;

It only tests a little of what they know.

Rather more teachers referred to the limited sample in other contexts, for example when explaining why they would not want to report test results to parents independently of teacher assessment.

There was a general feeling that, within the limitations of testing and the stated aims of the tests, a reasonable balance across different content topics had been achieved, especially in 1995 (but not across content and process, which is covered in the next section). However, as already noted in the section on 'fairness', there was some comment in 1994, and much criticism in 1995, by more than half the teachers sampled in each case, that the distributions of questions in the mathematics and reading comprehension tests were pitched too high and did not include sufficient questions appropriate to below-average (Level-3 and weaker Level-4) pupils.

Reflection of Good Practice

Many teachers were seriously concerned about aspects of the tests which failed to reflect their classroom practice. In science and mathematics, these problems related to the decision that neither of the national tests in these subjects should test the process-oriented Attainment Target 1, 'Using and Applying Mathematics' and 'Experimental and Investigative Science', respectively. In each case, in the statutory curriculum Order this Attainment Target is not seen as independent of the content targets, but is required to permeate the teaching of the remaining targets. Thus it is inevitable that the tests and the curriculum Orders give different messages about classroom practice.

There was most conflict in science, between the concentration on knowledge in the tests and many teachers' classroom approach, which was more process-oriented, emphasising the experimentation that is the focus of Attainment Target 1. This mismatch was mentioned explicitly by 5 of the 15 teachers who commented on science in 1994 and by 19 out of 29 in 1995:

Too much emphasis on pure memorising of facts;

We try to get people to investigate and experiment and ask 'why?'—the tests don't do that;


You feel if you'd done a year's talk and chalk, you could have prepared them for the test—there's a dilemma between the way you teach science and the way it's tested;

The tests that we have just done do not bear any resemblance to the work that the children have been doing ... although the work we've done is very closely related to the programme of study, the approach is different;

It doesn't really test their ability to get the right equipment, to organise it, to do an experiment, put their results in a chart, there was none of that ... in getting inferences from what they'd done, coming to a conclusion ... it was just purely testing knowledge;

The importance of science is the understanding, showing you know how to set up an investigation—the only way that can be tested is through a discussion-based thing.

There was more satisfaction with the mathematics tests, with 13 of the 29 teachers in 1995 welcoming the style of items, especially the emphasis on 'getting the children to think', and agreeing that it was 'quite like classroom maths'. Nevertheless, many of the other 16 felt that the testing did not reflect the full range of classroom practice. Some criticised the lack of 'depth' and of any investigatory approach in maths, as is emphasised in Attainment Target 1:

More evidence necessary of longer investigations.

In contrast, others considered that the 1995 tests put too much emphasis on children's ability to reason rather than on their knowledge and skills:

Too much deduction; not every child could cope with that;

If it was basic maths they might get a totally different result.

There was also some concern as to whether the tests were presented in a style which matched certain published mathematics schemes more closely than others.

The responses from both interviews and questionnaires only serve to highlight the variety, and sometimes the direct conflict, of views among different teachers on what are the important aspects of school mathematics. This corresponds to research findings on the divergence in teachers' beliefs as to whether mathematics is predominantly an active process or a body of knowledge and techniques (for example, Lerman, 1983; Thompson, 1984), and on teachers' differing interpretations of the meaning of the process target in the mathematics Order (Askew, 1996). It would clearly be impossible to satisfy all teachers, but nevertheless a test that leaves a majority of the 29 teachers in a random sample dissatisfied about curriculum mismatch is unlikely to earn solid support among the profession.

In the official evaluation, the number of teachers who felt that some test questions did not match their interpretation of the mathematics Order was rather smaller, at 29% (SCAA, 1995, p. 17). This, however, does not contradict our findings, since it was in response to a different question:


each question may individually be acceptable without the whole test being judged an accurate reflection of the intended curriculum.

Two teachers did refer to the logistical problems of incorporating science and mathematics investigations as part of national tests:

There is no way it can be individual, because you would then have to have a teacher supervise 30 individual investigations.

This problem had already occurred at Key Stage 1 (age 7) in 1991. Teachers then were very supportive of the principle of including investigations in science and mathematics, but found the management of them difficult, and hence such tasks were removed for 1992 and later years.

Although it was agreed that the missing process-oriented attainment targets are assessed within teacher assessment, teachers saw this as irrelevant, as they recognised that the test results would be the major basis for comparison in league tables. Whatever the limits of the test domain noted in the small print, test results would generally be considered by parents and the public to be a valid reflection of attainment across the subject as a whole.

In English, similar concerns were expressed, although not this time about Attainment Target 1 (speaking and listening), which was not assessed, and about which similar arguments could be made to those above. The most common mismatch with classroom practice in English appeared to arise from the assessment of writing in a limited time, in a single genre, with uninspiring titles to choose from:

There was no time for a proof, then best;

It tested ability to jump through hoops quickly—creativity a secondary issue;

To produce something with just 15 minutes' preparation time was quite challenging ... they are not used to such a structured approach—they are used to a great deal of discussion and the writing of ideas before we go into anything.

There is clearly a real problem, especially in relation to science, with the fact that the 'high-stakes' tests do not sample from all the material in the National Curriculum Orders. By omitting the process targets, the testing methods fail to reinforce teaching methods used by teachers to satisfy legitimate and significant requirements of the national curriculum.

The only politically practicable way of solving this problem is by aggregating assessments of the missing components with the national test results to produce more valid overall subject results. One method of doing this would be to limit summative teacher assessment to Attainment Target 1 in each subject, and in English additionally to include writing. The teacher-assessed results would have to be moderated, probably by agreement meetings within and between local clusters of schools.

Alternatively, teacher-assessed results would remain broad and unmoderated, but the process target results to be aggregated with the national test results would be


determined by a separate, externally set but teacher-marked coursework component. This could be achieved by issuing, nationally, a set of tasks and marking schemes out of which, say, two or three varied tasks for each subject would be carried out and assessed by teachers during the year. There would be some limited moderation, either by local agreement groups or by postal moderation or moderator visits on limited samples. This is the same type of solution as is adopted in the General Certificate of Secondary Education (GCSE) public examinations at age 16; it is difficult to see why it should not be acceptable at 11, although the additional workload for teachers might prove difficult to negotiate.

Although both these solutions, and particularly the first, would reduce 'objectivity' and hence reliability, the enhancement of teachers' assessment skills by moderation procedures, and the support from teachers associated with the improved validity, should make this price worth paying.

More radical solutions are possible but unlikely to be politically acceptable; for example, extended investigations could be used as the only national tests providing the basis of content assessment, as in the pilot Key Stage 3 mathematics tasks (Brown, 1992a).

Teachers' Views of Accuracy of Test Results

Usefulness of Results

One criterion for whether the teachers felt that the results of the national tests were valid was the extent to which they used them in practice. Both in 1994 and in 1995, more than two-thirds of schools said that they did not intend to make any use of the test results, other than obeying the legal requirement of reporting them in written form to parents and, in most cases, reporting also to secondary schools.

In 1995 we also asked teachers, via the questionnaires, whether they would use the returned test papers to look at children's errors, and whether they would discuss results with colleagues and/or parents. Twenty-two out of 30 said they would examine errors, and 25 said they would discuss the overall results with colleagues, but only five referred to any evaluative purpose:

Identify gaps and discuss what is expected;

Help us with forward planning.

Only ten teachers out of 29 said they intended to discuss results with parents, although all but two schools would communicate them to parents. These two schools in fact failed to satisfy a legal requirement by not supplying national test results to parents, choosing to give only teacher-assessment results.

Thus there was a general reluctance to attach significance to the results of the tests. Although most teachers were intending to study and discuss with colleagues how pupils had performed, this seemed to be more for the purpose of gaining better scores the following year than for gaining a valid evaluation of their curriculum and teaching.

One test of the validity of inferences from national test scores was whether the


secondary schools found the results useful. In general they appear not to have done so, as is reported in a parallel paper on the use and impact of results (Brown et al., 1996).

Correspondence with Teacher-assessment Results

Clearly, teachers will be more willing to accept test results as valid if they feel the results closely match their expectations (concurrent validity). In national assessment it is possible, unusually, to assess this aspect of validity, as the teacher-assessment level provides an alternative measure on the same scale. Although there are differences in the domains, due to the rulings that national tests should not attempt to measure Attainment Target 1 in each subject, these are not widely appreciated. Teachers' judgements are made in relation to more-or-less well-defined criteria in each attainment target using ongoing work in the classroom, although the methods of arriving at the final levels are quite different in different schools (McCallum et al., 1993; Gipps et al., 1996).

In 1994 we asked teachers how the outcomes of the tests, as marked by the teachers themselves in accordance with national mark schemes, compared with their own teacher assessments. Of the 25 schools which had sufficient test results to generalise, nine said there were few discrepancies ('no surprises'), seven reported that some children came out higher in the tests than in teacher assessment, and the remaining eight reported the opposite—they would have expected more Level 4s and 5s in the test than they had received. Thus two-thirds of the teachers noted that there were discrepancies, considerably more than was the case at Key Stage 1 (although at the younger age only three levels were in play). The results of the 1994 national pilot tests (SCAA/Universities of Bath & Wales, 1994) suggest that there were more Level-3 results and fewer at Level 5 than predicted by the national model, especially in science.

Nevertheless, when we asked the teachers in the 1994 questionnaire whether they had confidence in the overall test results, 14 out of the 17 who had sufficient results to comment said that they did, on the basis that the results generally corresponded to their teacher-assessment results, although nine of the 14 thought the tests had been unfair to some groups.

In 1995 we enquired in rather more detail in all 31 schools, by telephone, shortly after the scripts were returned from the external markers at the end of June. We repeated the question in the autumn term, to only those teachers who had continued teaching a Year-6 class in the same school, and the comments were consistent with those of the whole sample made in June.

In the case of mathematics and English, only about a third of the schools (ten of 31 in mathematics, 11 out of 30 in English) reported that test results matched their teacher-assessment results reasonably closely. Some of these teachers confessed to having inspected the question papers before finally deciding on their teacher-assessment levels and thus had 'erred on the side of caution'.

In both subjects there were many complaints of fewer Level 5s than expected, and of lower scores generally. Some blamed the children ('lack of checking ... not


understanding the importance of the test'). Most blamed the difficulty of the test materials and/or the time limits; in English 'the reading comprehension was hard', 'the tests were challenging', 'time factor—too short', while in mathematics 'the tests were flawed', 'the papers were dreadful', 'set too high', 'all down to the timing', 'children went to pieces'. To give an example of the extent of discrepancy, one school which had assessed 15% at Level 5 and 50% at Level 4 in teacher assessment for mathematics received test results showing only 3% at Level 5 and 25% at Level 4.

Nevertheless, at least one teacher expressed pleasant surprise:

The results actually I thought would be horrendous because of the type of paper; I thought a lot of kids would have been actually cheesed off with it. But it's surprising because more children got to [Level] 5 than I expected.

A factor which only affected English was that the markers and the marking criteria were criticised for being over-harsh. The official SCAA report (1995) shows that the resulting levels in English were changed, as a consequence of correcting clerical or marking errors, on about 1800 returned scripts; although this is less than 1%, it is probably a significant underestimate of the number of scripts with errors, since few teachers with anomalies bothered to return scripts. It is interesting to compare this with Radnor (1995, 1996), who, in an evaluation of the 1995 Key Stage 3 tests, also confirms that dissatisfactions with the English marking were both widespread and justified.

In science, even fewer schools (six out of 31) reported a close match between test results and teacher-assessed levels. However, in contrast to mathematics and English, 22 of the remaining 25 schools reported that the test results were higher than expected. Seven of these 22 schools, although pleased, spontaneously questioned the validity of the tests. It was suggested that some discrepancies may have occurred because the tests omitted Attainment Target 1 on science processes, which received a weighting of 3 in teacher-assessment levels. Although a few teachers explained discrepancies by additional revision, overwhelmingly they were attributed to the lack of challenge in the test:

Basically science was too easy—half the kids were at Level 5 and that'sridiculous.

The SCAA report (1995, p. 6) confirms the directions of the dissatisfaction in the different subjects, although the number of teachers who acknowledged in the questionnaire that there were significant differences between the test and teacher-assessment results was rather lower in their questionnaire sample, ranging from 38% (English) to 50% (science).

Implications and Recommendations

Thus many teachers were clearly reluctant to accept the results of the national tests as valid. In the majority of cases there were indeed significant discrepancies between teacher-assessment and test results. At other points in the project, the majority of teachers expressed to us the need for some means of moderation of teacher assessment, both to give themselves confidence and to ensure comparability between


standards of teacher assessment in different schools. However, they were reluctant to accept the results of the test as performing a moderating function.

Teachers gave as their reason for this refusal the already-identified shortcomings of the tests. It seems likely that at least the most criticised of these will have to be addressed before teachers will be prepared to put any trust in the results.

Although some aspects, for example the time limits, were improved in 1996 and seem likely to have led to a reduction in overall dissatisfaction, one aspect of particular concern has not been addressed. This is the feature, cited by some teachers, that the tests and the teacher assessment are 'assessing different things':

• The tests were merely a 'snapshot', using one particular and rather atypical context and mode, whereas teacher assessment takes place over several years, in many different contexts and different modes.

• The teacher assessment reflects nationally accepted criteria; while some of the marking schemes reflected these to some extent, in other cases there was a more tenuous relation with the subject Orders. Teachers felt concerned that they could not tell from pupils' scripts which level they would be awarded; the nature of the translation from marks to levels appeared to them (correctly) to be determined pragmatically after mark distributions were available, suggesting that the levels would be norm-referenced rather than criterion-referenced.

• While teachers' own assessment had to reflect the full set of attainment targets, there were, as already noted, important targets unassessed in the test result. Thus what was stated to be an overall test level for a subject was not comparable to an overall level determined by teacher assessment.

These findings suggest that many teachers have a serious lack of confidence in the test results, which they may well convey to parents and which is at risk of undermining the system.

It might have been expected that teachers would show some dissatisfaction with the tests, given the circumstances of the imposition of a national curriculum, a national testing regime and league tables on a primary school system in which teachers had previously had an unusual, perhaps even unreasonable, degree of autonomy. The complaint that the tests did not match the teaching might have been an indication that the teachers, rather than the tests, were out of step with the curriculum.

While the results reported in the next section suggest that there was indeed a small group of schools and teachers who resented the interference and would not accept the legitimacy of any national testing, no teacher or headteacher in the sample was opposed in principle to a national curriculum, and indeed the majority accepted the legitimacy of national testing. Our visits to schools, and much other evidence collected by national agencies such as SCAA and Ofsted in their various reports, suggest that primary teachers were largely satisfied with the detailed content of the core subjects in the national curriculum and were conscientiously trying to implement it.

This, and our own evaluation of the tests, suggests that the complaints were mostly legitimate, and that the problem lay in the match between the tests and the curriculum to which teachers had made great efforts to adapt, rather than in the


match between the teaching and the curriculum. In fact, in some cases, for example that of mathematics, the tests were closer to much of the teaching than to the curriculum as stated. While other tests sometimes used voluntarily by teachers would probably be perceived to suffer from some of the same problems of balance and bias, this was of little consequence because of their generally 'low-stakes' nature and because no such curriculum match was claimed. National tests, on the other hand, were both 'high stakes', providing the basis for league tables, and expected by definition to provide a valid assessment of the legally enforced curriculum.

The complaints certainly seemed to be justified in regard to the untested process elements which had been emphasised in the documentation and teacher training. It thus appeared that there was an element of official deception: while the curriculum documents legally endorsed a broad range of skills and high-level processes, the tests demonstrated that it was a much narrower range of knowledge and traditional skills that was really valued. On the other hand, the government could reasonably claim that it had tried to incorporate process testing at Key Stage 1 (age 7) and that it was to meet the demands of teachers for less onerous tests that it had been removed.

In relation to the criterion-referencing aspect, partly to appease teacher unions, there had been a deliberate move away from the initial criterion-referenced formulation to the more traditional notion of a test based more loosely on a programme of study. This made the discrepancies between teacher assessment and national test results inevitable, since they were targeting different aspects and were not tied into specific statements of attainment. Thus, although the complaint was genuine, it was not generally realised that this was one of the consequences of a national trade-off.

Clearly, there are problems in achieving a compromise between manageability and validity. Nevertheless, any methods which can be found of ensuring a closer fit between test levels and teacher-assessed levels would be worthwhile. Not only will the tests need to be improved in the ways already described, so that teachers perceive them to be a valid reflection of the curriculum, but the mark schemes will need to be more transparently related to teachers' criteria for level decisions, and more thorough trialling will be needed to ensure that the results of tests and teacher assessments coincide in a majority of cases for most teachers.

These problems did not occur to anything like the same extent at Key Stage 1 (age 7), although detailed comparisons are difficult since those results related to 3 years earlier. It is not clear whether the much greater degree of agreement was because there were fewer levels available, or whether, as already noted, it was a consequence of the greater satisfaction with the tests. There was also a much closer correspondence at Key Stage 1 between the statements of attainment then used for teacher assessment and the test items. Alternatively, it may reflect the fact that teachers at Key Stage 1 marked the tests themselves, and hence had some opportunity to incorporate, perhaps unconsciously, their own long-term judgements in the test results. Certainly, there had been less concern about validity at Key Stage 1, although there was still considerable doubt about the worthwhileness of the testing process (Gipps et al., 1995).


Teachers' General Views on the Balance between Formative and Summative in National Assessment

It is important to discuss whether the considerable doubts teachers have about the validity of the national test results arise from opposition to the current model of national assessment, or whether the objection is more to the details of its implementation.

In a questionnaire in 1995, 24 of the 30 teachers and 25 of the 28 heads saw some advantages in national tests. The perceived benefits are listed in Table II.

Two years earlier, in 1993, seven of the 32 teachers (including those from the six schools which refused to carry out any tests in 1994) had been very much against national tests of any sort, on the basis that they 'disturbed class teaching', 'put pressure on children and teachers' and were 'unfair' to many children who would be 'sitting down in front of something they've got no chance of engaging with'. One of this group was, in fact, also against any form of systematic teacher assessment, since it was 'fruitless—spending all this time doing it', while the other six saw teacher assessment as the only valid form of assessment.

However, perhaps surprisingly in view of the boycott, the remaining 25 of the 32 teachers in 1993 had replied that they found the national assessment model acceptable ('a good idea', 'probably the best you can get', 'not a bad thing'). Nineteen of the 25 expressed the positive view that both teacher assessment and test components were needed as they had to go 'hand in hand' and that 'either on their own would be wrong'. Of these 19, 13 nevertheless said that they thought teacher assessment should have the major influence as it was 'far more important' and 'more accurate'.

TABLE II. Possible benefits of national tests as perceived by teachers and headteachers in July 1995

Perceived benefit of testing                     Teachers agreeing    Heads agreeing
                                                 (n = 30)             (n = 28)

Raise standards                                         7                  10
Improve children's knowledge                           11                  10
Improve children's understanding                        5                   7
Improve children's problem-solving                      5                   5
Broaden the curriculum                                  7                   1
Improve teacher assessment                             19                  15
Improve teacher's knowledge of individuals             12                  12
Improve basic skills teaching                          13                  12
Improve teachers' planning                             10                  17
Improve design of tasks                                10                  16
Raise teachers' expectations                           15                  13
Encourage teachers' own evaluation                     18                  17
None of the above                                       6                   3
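
Because the two groups differ in size, the raw counts are not directly comparable. Purely as a reading aid (the code and names are ours; the figures are those in Table II), the sketch below re-enters the table and expresses each count as a percentage of its group:

    # Table II as (benefit, teachers agreeing of 30, heads agreeing of 28).
    N_TEACHERS, N_HEADS = 30, 28

    table_ii = [
        ("Raise standards", 7, 10),
        ("Improve children's knowledge", 11, 10),
        ("Improve children's understanding", 5, 7),
        ("Improve children's problem-solving", 5, 5),
        ("Broaden the curriculum", 7, 1),
        ("Improve teacher assessment", 19, 15),
        ("Improve teacher's knowledge of individuals", 12, 12),
        ("Improve basic skills teaching", 13, 12),
        ("Improve teachers' planning", 10, 17),
        ("Improve design of tasks", 10, 16),
        ("Raise teachers' expectations", 15, 13),
        ("Encourage teachers' own evaluation", 18, 17),
        ("None of the above", 6, 3),
    ]

    # Print each benefit with the share of each group endorsing it.
    for benefit, teachers, heads in table_ii:
        print(f"{benefit:45s} teachers {teachers / N_TEACHERS:4.0%}  heads {heads / N_HEADS:4.0%}")

Read this way, the most striking contrasts are 'Broaden the curriculum' (roughly a quarter of teachers but only one head) and the planning and task-design rows, which heads endorse noticeably more often than teachers.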


Tests were only a 'snapshot', 'somebody else's inventions', and there were fears that if they were given prominence with league tables there would be pressure to teach to the test and narrow the curriculum.

Similarly, in 1995, as the final line in Table II demonstrates, six teachers and three heads believed that national tests would do nothing to improve the quality of children's education and were therefore a 'waste of time and money'. There were also nine other heads who, in spite of seeing some advantages, were still on balance against national tests, mainly because of their distorting influence on the curriculum, and because testing and league tables were:

Not the best way to encourage schools to become effective—too much stick and not enough carrot.

When Year-6 teachers were asked in 1995 what information they thought would be of most use to parents, four said they would have preferred to give teacher-assessment levels only, as they believed that the tests by their nature could never provide sufficiently valid information:

A test is a picture of a short time;

A test can't be used to create a picture of a child's ability.

Ten other teachers supported the official position of reporting teacher assessment and test results separately, both because they felt the test results were insufficiently valid on their own, for similar reasons to those given above, and because they represented:

Completely different aspects of assessment, assessing different things.

Radnor (1995, pp. 70-71) termed such teachers 'differentialists' in her evaluation of testing at Key Stage 3 (age 14).

However, the majority of Year-6 teachers (15 out of 29) responded in a similar way to those termed 'levellers' by Radnor in the age-14 study, feeling that the teacher assessment (TA) and test results should test the same domain and should be combined in some way to give a single result to present to parents, partly to reduce the visibility of the test results with their low validity, but also because:

Parents like an official measure—some will take no notice of TA.

A single result would certainly remove concern about discrepancies, as well as improving validity. One method of achieving a single result which could command public confidence would be to employ the proposal made in an earlier section: to combine moderated teacher assessment of process targets, and Writing in English, with test results for content targets. This would not prevent teachers being encouraged to use assessment against level descriptions formatively for content as well as process areas.
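
As a hypothetical sketch of what such a combination rule might look like (the 50/50 default weighting, the rounding rule and the function name are our assumptions for illustration, not a proposal from the study), the fragment below merges a moderated teacher-assessed level with a test level into one reported level:

    # Hypothetical rule: weighted average of the moderated TA level and
    # the test level, rounded to a whole National Curriculum level.
    def combined_level(ta_level: int, test_level: int, ta_weight: float = 0.5) -> int:
        merged = ta_weight * ta_level + (1 - ta_weight) * test_level
        return round(merged)  # note: Python rounds .5 to the nearest even integer

    print(combined_level(5, 4))                 # -> 4 (4.5 rounds to even)
    print(combined_level(5, 4, ta_weight=0.6))  # -> 5

Any real scheme would, of course, depend on the moderation machinery discussed above to make the teacher-assessed input trustworthy before such arithmetic could command public confidence.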

In the longer term, the arrangement might develop towards a model such as that in the Report of the Task Group on Assessment and Testing (TGAT) (DES/Welsh Office (WO), 1987), with results from nationally prescribed standardised tasks or tests used mainly for moderation purposes, in combination with local agreement meetings.

Rather than using summative tests, an item bank of standardised tasks could be available for use at any time, as in Scotland or Northern Ireland. This system has also been used successfully at secondary level in England in Graded Assessment projects (for example, Brown, 1992b), and in the period when this was permissible it appeared to give valid results in the GCSE at age 16 without any final test. It would certainly be more possible with authentic classroom tasks to avoid some of the objections to tests, such as the unfairness resulting from time pressure, unfamiliar and artificial formats, and mismatches with the style of classroom activity. Some such classroom tasks have been offered to teachers in England and Wales in subjects which do not have statutory tests, but without any requirement on their use. Indeed, in an earlier feasibility study for the DES on a national system of attainment targets and assessment in primary schools (Denvir et al., 1986), a combined system of agreement group moderation and a national item bank, with some requirement for a minimum level of use, was recommended, but rejected by politicians.

The final outcome of such a system would be, as the majority of teachers requested, a single result, based on teacher assessment but moderated by national test results:

I do think that if you could standardise a worthwhile [teacher] assessment, it would be brilliant.

Conclusion

These findings concerning teachers' views of assessment at the end of the primary phase are important not only in England, but also for other countries which are contemplating introducing a system of national testing. The recent English and earlier Scottish boycotts should demonstrate the pitfalls of introducing a system without the support of the teaching profession. It seems that in England the current model of blanket external tests and separate teacher assessment is now perceived by teachers as being broadly acceptable for the core subjects of mathematics, English and science, provided that the tests are improved in the various ways suggested so as to give more valid results. Most teachers, however, would prefer, rather than two separate results, some combination of teacher assessment with national test results to give a single, more valid result for each subject. In the longer term, moderated teacher assessment without formal national tests seems likely to receive the strongest support from teachers; this approach seems worth aiming towards as a future national policy.

References

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION (AERA), AMERICAN PSYCHOLOGICAL ASSOCIATION & NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION (1985) Standards for Educational and Psychological Testing (Washington, DC, American Psychological Association).
ASKEW, A. (1996) Using and applying mathematics in schools: reading the texts, in: D.C. JOHNSON & A. MILLETT (Eds) Implementing the Mathematics National Curriculum: policy, politics and practice, pp. 99-112 (London, Paul Chapman).
BLACK, P. (1993) The shifting scenery of the national curriculum, in: C. CHITTY & B. SIMON (Eds) Education Answers Back—critical responses to government policy, pp. 45-60 (London, Lawrence & Wishart).
BROWN, M. (1992a) Elaborate nonsense? The muddled tale of standard assessment tasks in mathematics at Key Stage 3, in: C. GIPPS (Ed.) Developing Assessment for the National Curriculum, pp. 6-19 (London, Kogan Page).
BROWN, M. (Ed.) (1992b) Graded Assessment in Mathematics (Basingstoke, Thomas Nelson).
BROWN, M., TAGGART, B., MCCALLUM, B. & GIPPS, C. (1996) The impact of Key Stage 2 tests, Education 3 to 13, 24(3), pp. 3-7.
CARROLL, J.B. (1993) Human Cognitive Abilities: a survey of factor-analytic studies (New York, Cambridge University Press).
CRONBACH, L. (1988) Five perspectives on validity argument, in: H. WAINER & H. BRAUN (Eds) Test Validity (Hillsdale, NJ, Lawrence Erlbaum).
DAUGHERTY, R. (1995) National Curriculum Assessment: a review of policy 1987-1994 (London, Falmer Press).
DENVIR, B., BROWN, M. & EVE, P. (1986) Attainment Targets and Assessment in the Primary Phase: Mathematics Feasibility Study (London, DES).
DEPARTMENT OF EDUCATION AND SCIENCE/WELSH OFFICE, TASK GROUP ON ASSESSMENT AND TESTING (1987) A Report (London, DES/WO).
GIPPS, C. (Ed.) (1992) Developing Assessment for the National Curriculum (London, Kogan Page).
GIPPS, C. (1994) Beyond Testing: towards a theory of educational assessment (London, Falmer Press).
GIPPS, C., BROWN, M., MCCALLUM, B. & MCALISTER, S. (1995) Intuition or Evidence? Teachers and assessment of seven-year-olds (Buckingham, Open University Press).
GIPPS, C., MCCALLUM, B. & BROWN, M. (1996) Models of teacher assessment among primary school teachers in England.
LERMAN, S. (1983) Problem-solving or knowledge-centred: the influence of philosophy on mathematics teaching, International Journal of Mathematics Education in Science and Technology, 14, pp. 59-66.
MCCALLUM, B., MCALISTER, S., BROWN, M. & GIPPS, C. (1993) Teacher assessment at Key Stage One, Research Papers in Education: policy and practice, 8, pp. 305-327.
MESSICK, S. (1989) Validity, in: R. LINN (Ed.) Educational Measurement (New York, Macmillan).
MESSICK, S. (1992) The Interplay of Evidence and Consequences in the Validation of Performance Assessments (Princeton, NJ, Educational Testing Service).
MOSS, P.A. (1992) Shifting conceptions of validity in educational measurement: implications for performance assessment, Review of Educational Research, 62, pp. 229-258.
NUTTALL, D. (1987) The validity of assessments, European Journal of Psychology of Education, 11, pp. 109-118.
POPHAM, J. (1996) Consequential validity: right concern—wrong concept, paper presented at the Annual Meeting of the AERA, New York, April.
RADNOR, H. (1995) The Evaluation of Key Stage 3 Assessment in 1995 (Exeter, University of Exeter).
RADNOR, H. (1996) Evaluation of the Quality of External Marking of the 1995 Key Stage 3 Tests in English (Exeter, University of Exeter).
SCAA/UNIVERSITIES OF BATH & WALES (1994) Report on the 1994 Key Stage 2 Pilot (London, SCAA).
SCAA (1995) Report on the 1995 Key Stage 2 Tests and Tasks in English, Mathematics and Science (London, SCAA).
SMITH, M.L. (1991) Put to the test: the effects of external testing on teachers, Educational Researcher, 20, pp. 8-11.


THOMPSON, A. (1984) The relationship of teachers' conceptions of mathematics and mathematics teaching to instructional practice, Educational Studies in Mathematics, 15, pp. 105-127.
WILIAM, D. (1993) Validity, dependability and reliability in national curriculum assessment, The Curriculum Journal, 4, pp. 335-350.
