
'Standards seem like a good idea but how do we validate them?' Gordon Stanley

Presented at 'The Blind Assessor: Are we constraining or enriching student learning?' symposium. 22/11/2010

Standards seem like a good idea but how do we validate them?

Gordon Stanley

Feature presentation at ‘The Blind Assessor: Are we constraining or enriching student learning’ symposium. 22 November 2010



Media Lobbying by Head Masters




Background to contemporary standards-based reporting

Most education systems around the world now report their student performance with reference to standards.

This move has been hastened by the impact of international testing and the public policy focus on education in the development of human and social capital.

What has this meant for assessment practice?

  Assessment and grading do not take place in a vacuum. Professional judgments about the quality of student work together with interpretations of such judgments are always made against some background framework or information.

  Modern assessment practice tries to make the framework explicit.



Aggregate Approach

  Pre-dates the modern outcome/standards approach.

  Scores on different assessment tasks are added together and converted to a 100 point (percentage) scale.

  Component scores may be weighted before being added.

  Although not norm-based, it is not standards-referenced unless its components are.

Traditional Approaches

Grading Using Aggregate Scores of 100

A: 90-100
B: 80-89
C: 65-79
D: 50-64 (Passing grade)
E: 45-49 (Conditional pass)
F: < 45
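The aggregate approach can be sketched in a few lines of Python. This is only an illustration: the function names, the task names and the way marks and weights are passed in are assumptions, and only the grade bands come from the slide.

```python
# Grade bands from the slide: lowest aggregate score that earns each grade.
GRADE_BANDS = [(90, "A"), (80, "B"), (65, "C"), (50, "D"), (45, "E")]


def aggregate_score(marks, weights):
    """Return a weighted aggregate on a 0-100 scale.

    marks:   task name -> (mark awarded, maximum mark)   (hypothetical layout)
    weights: task name -> relative weight of that task   (hypothetical layout)
    """
    total_weight = sum(weights[task] for task in marks)
    weighted_sum = sum(
        weights[task] * awarded / maximum
        for task, (awarded, maximum) in marks.items()
    )
    return 100.0 * weighted_sum / total_weight


def grade_from_aggregate(score):
    """Map an aggregate score to a letter grade using the bands above."""
    for lowest, grade in GRADE_BANDS:
        if score >= lowest:
            return grade
    return "F"


marks = {"essay": (30, 40), "exam": (45, 60)}
weights = {"essay": 1, "exam": 3}
score = aggregate_score(marks, weights)
grade = grade_from_aggregate(score)
```

Nothing in this calculation refers to what a student can actually do. The grade is standards-referenced only if the tasks behind the marks are.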



Norm-based Grading

Norm-based grading, or grading on a curve, pre-sets the distribution of grades achievable. A typical example:

A: 4-6%
B: 8-12%
C: 20-30%
D: 45-55%
E: 5-15%
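A minimal sketch of grading on a curve follows. It assumes the midpoint of each range above (5%, 10%, 25%, 50%, 10%) as the fixed share for each grade, and it ignores tied scores. The function and variable names are illustrative only.

```python
# Assumed shares: midpoints of the typical ranges on the slide (sum to 100).
GRADE_SHARES = (("A", 5), ("B", 10), ("C", 25), ("D", 50), ("E", 10))


def norm_based_grades(scores, shares=GRADE_SHARES):
    """Assign grades by rank so each grade gets a pre-set share of students.

    scores: student name -> score. Ties are broken arbitrarily (an assumption).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    start = 0
    cumulative_percent = 0
    for grade, percent in shares:
        cumulative_percent += percent
        # Integer arithmetic: nearest whole number of students.
        end = (cumulative_percent * n + 50) // 100
        for student in ranked[start:end]:
            grades[student] = grade
        start = end
    return grades


cohort = {f"student{i}": 40 + i for i in range(20)}
improved_cohort = {name: score + 10 for name, score in cohort.items()}

grades = norm_based_grades(cohort)
grades_after_improvement = norm_based_grades(improved_cohort)
```

Because only rank matters, the cohort that scores ten marks higher across the board receives exactly the same grades as before.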

Move to Standards-referenced Assessment

Aggregating scores does not by itself enable one to be confident that grades can be interpreted in consistent ways, unless the tasks being scored have been designed so that higher scores can only be achieved by demonstrating higher levels of the knowledge and skills required.

Norm-based grading prevents ‘grade inflation’ but does not allow demonstration of improved student cohort performance when it occurs.



International Trends

With the growth of international education and the global human capital market, countries are looking beyond their borders for comparability in student outcomes/results.

How to meet this challenge of comparable assessment is an important issue for education systems.

In Europe and in many Commonwealth countries the first approach to comparison has been to develop qualifications frameworks to classify levels of qualifications and to define their common characteristics as outcome standards.

Definition of Standards Across Systems

Standards describe what it is that students should be able to know and do.

While this definition is basically the same across countries and systems, what varies is the “breadth” of the variable (i.e. the descriptions) and “the name (or type/purpose)”.

Standards written for qualification frameworks are generic and need to be interpreted at a given subject/discipline level.



Level of Specification of Outcome Standard

Specifying standards in terms of observable student outcomes is not easy.

Grading against specific objectives often leads to finer and finer levels of specification (‘check-list’ approaches).

The danger is that there can be too many elements to consider and each may become operationally isolated from the others.

Design Requirements

  Assessment tasks need to be designed to elicit student performance and enable it to be judged with respect to the intended outcomes.

  The syllabus and teaching program need to be explicit about the developmental continuum from novice to expert in the field of study.



Design Requirements continued

  The assessment program has to allow that higher scores or higher grades are based on evidence about progress on the developmental continuum from novice to expert.

  Generic standards descriptors can help to frame subject-specific criteria for the reporting of marks and grades.

Generic Grade Descriptions

A: Clear mastery of all course objectives with intellectual initiative demonstrated at high level.

B: Substantial mastery of most course objectives with relevant analytical skills.

C: Sound mastery of major course objectives with understanding of most of the basic course.

D: Some mastery of a range of objectives with basic understanding.

E: Few course objectives met.



Accountability

  Performance management and the effects of the accountability agenda mean that the results of testing and public examinations are high stakes for many of the stakeholders in education.

  Students: when results determine entry to selective higher education programs.

  Teachers: when results are used for accountability.

  School systems: where funding is dependent on meeting government targets.

  End-users: want credible results.

Effects of Accountability

  Student performance on external examinations and tests is seen as a performance indicator for education systems.

  In this context everyone has an interest in improved performance.

  With strong incentives for improvement, how do we know whether results are influenced more by this need than by valid evidence of improvement?

  ‘Is grade inflation occurring?’ is a common media debate.




Typical Popular Comment

How should we respond?

  This letter raises the issue commonly appearing in the media where employers complain that the skills and knowledge expected by the level of student results are not present.

  What is the evidence for this view and how can school and assessment authorities address the issue?



Grade Inflation?

  Media response to publication of results: are standards rising or falling?

  Grades represent the achievement standard so should not rise or fall.

  Numbers achieving the ‘standards’ implicit in the grade can rise or fall.

  Standards once established should remain relatively constant.




Source: www.gradeinflation.com. Constructed by Rojstracze

Grade trend in US universities and colleges 1920 to 2006

Standards Versus Norms

  The media problem is caused by the move away from normative equating procedures for reporting results.

  When normative scaling is applied it is relatively easy to have a fixed proportion of candidates achieving the highest reported grade each year.

  With standards-referenced systems, outcomes are determined by a judgment process as to whether or not the standard has been reached.

  How stable is the judgment process?



Validating Standards

One of the main differences between norm- and standards-referencing is that with the latter there is no inherent limit to the percentage of students achieving a particular standard. In theory it is possible for all students to achieve any performance standard.

This means the percentage achieving particular performance bands or levels can vary from year to year. The question is how do we know that the percentage we have nominated as achieving the bands is comparable from year to year?

Is it good enough to just attest that due process has been carried out or is there a need for more substantive information regarding the percentages to be produced?

Interpreting Results Over Time

  How should we interpret variation in the numbers achieving the top grade over time?

  Time series data often show incremental creep with more students achieving the top levels of performance each year.

  This result then leads to debate about whether or not standards are falling or whether the education system itself is delivering some consistent improvement (Wikstrom, 2005).



The Need to Validate

  Clearly a fundamental question which arises in education systems is how do we validate standards-referenced results?

  Public examination authorities are expected to maintain standards.

  There are a number of procedures and tools to assist the process in systems where results are referenced to standards.

  They all ultimately rely on professional judgment to some degree.

A Scenario

  An education system introduces a new professional development programme aimed at increasing student outcomes.

  The ‘high stakes’ exit examination is reported with respect to standards. Much effort is put into ensuring that standards are being maintained.

  Student performance on the exit examination appears to be much improved.

  Is it an ‘easier’ examination, so that the cut-scores used to determine grades need to be raised? Or

  Is it evidence that the professional development programme is bearing fruit?

  How would we know?



The Validation Problem

A professional judgement-based standard setting exercise has been conducted and the percentages achieving particular grade levels determined.

The examinations are high stakes and previous years’ examinations have been used by teachers and students to prepare for the current examination.

The performance standards that are used to assign grades are available to school systems so teachers and students have the opportunity to “internalise” the performance standards.

Validating Standards

What other ways are there to obtain multiple, alternative sources of information about the current distribution (relative to the previous distributions) that will enable us to judge (validate) the results of the standard setting exercise?

The emphasis in these questions is on whether the cut-score aligned to the “borderline student” in one year is equivalent to the cut-score for the same borderline student in subsequent years.

Create multiple, alternative sources of information that indicate the relative stability of the results of the standard setting exercise. If these different sources give similar information then we can be more confident that the results are comparable and that any change is genuinely a change in the distribution of performance from one year to the next (convergent validation).



Comparability

  For example:

  Using statistical moderation to moderate school-based assessments (common students and common items)

  Using common items (moderating test) to compare performance over time in the same subjects in Hong Kong

  Using common judges to compare performance over time in subjects in New South Wales and the GCSE

  Using common students to equate the distributions of different subjects in the same year.

Professional Judge Approaches

A panel of independent judges interrogates the data and process to make their own independent, professional judgement about the relative differences between the distributions from the different years. This could involve interviewing the examiners, markers and judges and asking them such questions as “Is this year’s paper more difficult than last year’s?”; “Is there a difference in the ability of this year’s cohort relative to the previous year?”; etc.

Professional Judge Approaches

Complete an independent standard setting exercise using equivalent judges, or use judges from a different educational system who are familiar with the content.

Ask the examiners to estimate the cut-scores when they set the examination and compare the two sets of cut-scores.

Have teachers in the system estimate the cut-scores on the examination after the examination has started and before it is complete, so that the students have not contaminated the judgement by giving the teachers their views on the relative difficulty of the paper.

Professional Judge Approaches

Use the pair-wise comparison method to equate the scales from two adjacent years:

Choose items from different examinations (including the one currently being completed by the students) and ask a number of teachers (100 or more ~ online) to take each pair of items in turn and indicate which item is the harder of the two. The results from these judgements can then be used to produce a common scale, which can be used to compare the cut-scores from the two, or more, years.
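The presentation does not say how the pair-wise judgements are turned into a common scale. One widely used option is a Bradley-Terry model, sketched below as an assumption rather than as the method actually used. Each judgement is recorded as a hypothetical `(harder_item, easier_item)` pair, and the function name and item labels are illustrative.

```python
import math
from collections import defaultdict


def pairwise_difficulty_scale(judgements, iterations=200):
    """Estimate item difficulties from 'which is harder?' judgements.

    judgements: list of (harder_item, easier_item) tuples.
    Returns item -> difficulty on a log scale centred on zero.
    Assumes every item was judged harder at least once and easier at least once.
    """
    wins = defaultdict(int)         # times an item was judged the harder one
    appearances = defaultdict(int)  # comparisons involving the item
    pair_counts = defaultdict(int)  # comparisons made for each unordered pair
    for harder, easier in judgements:
        wins[harder] += 1
        appearances[harder] += 1
        appearances[easier] += 1
        pair_counts[frozenset((harder, easier))] += 1

    items = sorted(appearances)
    for item in items:
        if wins[item] == 0 or wins[item] == appearances[item]:
            raise ValueError(f"{item!r} needs both 'harder' and 'easier' judgements")

    # Iterative (minorisation-maximisation) fit of the Bradley-Terry strengths.
    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denominator = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denominator
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}

    logs = {item: math.log(value) for item, value in strength.items()}
    centre = sum(logs.values()) / len(logs)
    return {item: value - centre for item, value in logs.items()}


# Hypothetical judgements on three items drawn from two years' papers.
judgements = (
    [("Q1", "Q2")] * 8 + [("Q2", "Q1")] * 2
    + [("Q2", "Q3")] * 7 + [("Q3", "Q2")] * 3
    + [("Q1", "Q3")] * 9 + [("Q3", "Q1")] * 1
)
scale = pairwise_difficulty_scale(judgements)
```

With items from both years placed on one scale, the difficulty of the items around each year's cut-score can be compared directly.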



Professional Judgement Method

Advantages

1.  Involves teachers in applying the standards; helps internalise the standards across the system
2.  It gives the system level authorities feedback as to how well the standard is effectively embedded
3.  Not statistical; relies on professional judgement
4.  Relatively cheap and non-intrusive

Disadvantages

1.  Transparency in cut-scores ~ what happens if there is a large differential?
2.  Not getting student comparison, only getting teacher estimates, i.e. teacher effect
3.  Needs to be done online or by phone
4.  Could lack authenticity within the community because the teachers themselves are making the judgements
5.  It is similar in that it validates professional judgement with professional judgement

Common Item Method for Validating Standards

The Common Items Method involves moderating tests: a generic moderating test (Core Skills or General Achievement Test) can be developed and used on a sample of students in a sample of subjects each year. It needs to be kept secure.

The distributions of results from different years can then be mapped onto the scale of the moderating test and comparisons can then be made to make sure that the cut-scores do align (within reason).

Calibrated item banks can be used to develop the moderating tests so that the security of the moderating tests is not a major issue.
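The mapping onto the moderating test scale can be illustrated with linear (mean and standard deviation) equating. This is an assumed, deliberately simple choice; the presentation does not specify an equating method, and the scores and cut-scores below are invented for the example.

```python
from statistics import mean, pstdev


def linear_equate(exam_scores, moderating_scores):
    """Return a function mapping exam scores onto the moderating test scale.

    Both lists come from the same sample of students: one sat the subject
    examination, the other the moderating test. The mapping matches the
    mean and standard deviation of the two sets of scores.
    """
    exam_mean, exam_sd = mean(exam_scores), pstdev(exam_scores)
    mod_mean, mod_sd = mean(moderating_scores), pstdev(moderating_scores)
    if exam_sd == 0:
        raise ValueError("exam scores must vary to define a linear mapping")
    slope = mod_sd / exam_sd
    return lambda score: mod_mean + slope * (score - exam_mean)


# Invented data: sampled students' exam and moderating test scores per year.
year1_exam = [40, 50, 60, 70, 80]
year1_moderating = [30, 40, 50, 60, 70]
year2_exam = [50, 60, 70, 80, 90]
year2_moderating = [30, 40, 50, 60, 70]

to_common_year1 = linear_equate(year1_exam, year1_moderating)
to_common_year2 = linear_equate(year2_exam, year2_moderating)

# Invented cut-scores for the same grade boundary in each year.
year1_cut_on_common_scale = to_common_year1(65)
year2_cut_on_common_scale = to_common_year2(75)
```

In this made-up case the second year's cut-score is ten marks higher on the raw scale, yet both cut-scores land on the same point of the moderating scale, which is the kind of alignment the method is meant to check.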



Advantages and Disadvantages - Moderating Test Method -

Advantages

1.  It is perceived to be an alternative to professional judgement
2.  It is well known and accepted as a method to equate and compare distributions
3.  One single test can be used to accommodate most subjects, and sub-tests of the test can be used to equate the different subjects
4.  Actual student performance is used to compare the subjects

Disadvantages

1.  Relatively costly and quite intrusive
2.  May be difficult to motivate students ~ this could lead to a diminution of validity
3.  Generic tests are only loosely linked to the actual content in the examinations
4.  Adds to the examination load of students
5.  Statistical in nature and would be relatively difficult for teachers and the community to understand
6.  Security is an issue

Common Student Method

This method involves students sitting items from examinations from different years.

Students from a similar, but different, system could be asked to complete a shortened composite paper that comprises items (assessing material that is known to the students in the chosen system) from the years that need to be equated or compared. The results can then be used to place the distributions onto a common scale so that the cut-scores can be compared.



Alternative Scenarios - School Based Assessment -

  Professional judgement-based standard setting exercise at the subject level by different teachers in different schools.

  No subject-based examinations, but a generic skills test is available.

  Consensus or social moderation methods are used to achieve comparability across schools (reliability assumed at the school level).

  How can we get an alternate estimate of the percentage achieving the performance standards?

Clearly validation is not a straightforward exercise. Most procedures are expensive in time and professional resources and cannot provide unequivocal answers.

Validation does not occur in a neutral environment when accountability agendas place such high stakes on yearly improvement.
