Reliability and Dependability by Neil Jones

The Routledge Handbook of Language Testing, edited by Glenn Fulcher and Fred Davidson

Prepared by: Amirhamid Foroughameri [email protected]

November 2015

Upload: ahfameri

Post on 13-Jan-2017

Page 2: Reliability and dependability by neil jones

Reliability as an aspect of test quality

• Reliability and validity are classically cited as the two most important properties of a test.

• Bachman (1990) identified four key qualities: validity, reliability, impact and practicality.

• He proposed that in any testing situation validity and reliability should be maximised to produce the most useful results for test users, within the practical constraints that always exist.

• Here, reliability will be presented rather as an integral component of validity, and approaches to estimating reliability as potential sources of evidence for the construct validity of a test.

Page 3: Reliability and dependability by neil jones

Measurement

• The idea that quantification is the way to understanding was memorably expressed by Kelvin in 1883:

• ‘… when you can measure what you are speaking about, and express it in numbers you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.’ (Kelvin, quoted by Stellman, 1998: 1973)

Page 4: Reliability and dependability by neil jones

Does this apply to the case of language proficiency?

The answer could be No, for two reasons:

• First, the measurement metaphor suggests that language proficiency is an enduring real property that resides in a person’s head and can be quantified, like their height or weight.

• Second, the metaphor implies that language proficiency, like temperature, has a single unique meaning, and can be precisely quantified.

We cannot take a one-size-fits-all approach to language assessment.

Page 5: Reliability and dependability by neil jones

The concept of reliability

Reliability equals consistency

• Reliability in assessment means something rather different from its everyday use as a synonym of ‘trustworthy’ or ‘accurate’.

• In testing, reliability has the narrower meaning of ‘consistent’.

• A reliable test is consistent in that it produces the same or similar result on repeated use; that is, it would rank-order a group of test takers in nearly the same way.

• But the result need not be a correct or accurate measure of what the test claims to measure.

• Just as a train service can run consistently late, a test may provide an incorrect result in a consistent manner.

• High reliability does not necessarily imply that a test is good, i.e., valid.

• Nonetheless, a valid test must have acceptable reliability, because without it the results can never be meaningful.

• Thus a degree of reliability is a necessary but not sufficient condition of validity.

Page 6: Reliability and dependability by neil jones

Reliability and error

• When a group of learners takes a test their scores will differ, reflecting their relative ability.

• Reliability is defined as the proportion of variation in scores caused by the ability measured, and not by other factors.

• This proportion is typically described as a correlation (or correlation-like) coefficient.

• Depending on the type of reliability being analysed, what is correlated with what will change.

• A perfectly reliable test would have a reliability coefficient (r) of 1.

• The variability caused by other factors is called error.
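
The definition above can be illustrated with a small simulation. This is only a sketch under classical test theory assumptions that the slides do not spell out: each observed score is taken to be an ability value plus independent random error, and all names and numbers (`simulate_parallel_forms`, a mean of 50, the chosen standard deviations) are invented for illustration. Under those assumptions, the correlation between two administrations that share ability but not error should approach the proportion of score variance caused by ability.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def theoretical_reliability(ability_sd, error_sd):
    """Proportion of observed-score variance caused by the ability measured."""
    ability_var = ability_sd ** 2
    error_var = error_sd ** 2
    return ability_var / (ability_var + error_var)


def simulate_parallel_forms(n_takers, ability_sd, error_sd, seed=0):
    """Simulate two administrations sharing ability but with separate error."""
    rng = random.Random(seed)
    abilities = [rng.gauss(50, ability_sd) for _ in range(n_takers)]
    form_a = [a + rng.gauss(0, error_sd) for a in abilities]
    form_b = [a + rng.gauss(0, error_sd) for a in abilities]
    return form_a, form_b


if __name__ == "__main__":
    form_a, form_b = simulate_parallel_forms(5000, ability_sd=3, error_sd=1)
    print("variance ratio:", theoretical_reliability(3, 1))
    print("correlation between administrations:", pearson(form_a, form_b))
```

With an ability standard deviation of 3 and an error standard deviation of 1, the variance ratio is 9 / (9 + 1) = 0.9, and the simulated correlation should land close to that value.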

Page 7: Reliability and dependability by neil jones
Page 8: Reliability and dependability by neil jones

Replications and generalizability

‘A person with one watch knows what time it is; a person with two watches is never quite sure.’

Thus Brennan (2001: 295) introduces a presentation of reliability from the perspective of replications.

Information from only one observation may easily deceive, because it is unverifiable, while to get direct information about consistency (i.e., reliability) at least two instances are required.

Replications in some form are necessary to estimate reliability.

Page 9: Reliability and dependability by neil jones

Even more importantly, Brennan argues, ‘they are required for an unambiguous conceptualization of the very notion of reliability.’

Specifying exactly what would constitute a replication of a measurement procedure is necessary to provide any meaningful statement about its reliability.

The individual variation in test-takers from one day to another is difficult to measure, because the test is taken only once.

Thus its impact is very likely ignored, leading to an overestimate of reliability, unless we can do specific experiments to replicate the testing event in a way that will provide evidence.

Page 10: Reliability and dependability by neil jones

Reliability and dependability

• Dependability is a term sometimes used (in preference to reliability) to refer to the consistency of a classification – that is, of a test-taker receiving the same grade or score interpretation on repeated testing.

• The way the term is used relates to the distinction made between norm-referenced and criterion-referenced approaches to testing.

• Taken literally, norm-referencing means interpreting a learner’s performance relative to other learners, i.e., as better or worse, while criterion-referencing interprets performance relative to some fixed external criterion, such as a specified level of a proficiency framework like the CEFR.

• The term dependability is used in a criterion-referencing context where the aim is to classify learners, for example as masters or non-masters of a domain of knowledge.
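
Consistency of classification can be made concrete with a simple agreement proportion. The sketch below is an illustration rather than a method named in the slides: it assumes two administrations of the same test, a single cut score separating masters from non-masters, and invented scores. The function names are hypothetical.

```python
def classify(score, cut_score):
    """Label a score as 'master' or 'non-master' relative to a cut score."""
    return "master" if score >= cut_score else "non-master"


def classification_agreement(first, second, cut_score):
    """Proportion of test-takers given the same label on both occasions."""
    if len(first) != len(second):
        raise ValueError("both administrations need the same test-takers")
    same = sum(
        classify(a, cut_score) == classify(b, cut_score)
        for a, b in zip(first, second)
    )
    return same / len(first)


if __name__ == "__main__":
    # Invented scores for five test-takers on two occasions.
    first_sitting = [55, 62, 48, 70, 59]
    second_sitting = [58, 57, 45, 73, 61]
    print(classification_agreement(first_sitting, second_sitting, cut_score=60))
```

In this invented data, the two test-takers whose scores sit close to the cut score change classification, which is why dependability matters most near cut-off points.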

Page 11: Reliability and dependability by neil jones

• But if dependability relates to a particular criterion-referenced approach to interpretation we should not conclude that reliability relates only to norm-referenced interpretations.

• It is true that reliability is defined in terms of the consistency with which individuals are ranked relative to each other, but in many testing applications it is no less concerned with consistency of classification relative to cut-off points that have well-defined criterion interpretations.

Item response theory has the particular advantage that it models a learner’s ability in terms of probable performance on specific tasks. Henning (1987: 111) argues that IRT reconciles norm- and criterion-referencing.
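
To show what modelling ability "in terms of probable performance on specific tasks" can look like, here is a minimal sketch using the one-parameter (Rasch) logistic model. The choice of model is an assumption made for illustration; the slides do not say which IRT model is intended, and the function name is hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success on a task under the Rasch model.

    Ability and difficulty are expressed on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    # A learner whose ability equals the task difficulty succeeds half the time.
    print(rasch_probability(ability=0.0, difficulty=0.0))
    # The same learner facing an easier and a harder task.
    print(rasch_probability(ability=0.0, difficulty=-1.0))
    print(rasch_probability(ability=0.0, difficulty=1.0))
```

Because ability and task difficulty share one scale, a position on that scale can be read both relative to other learners and relative to what a learner can probably do, which is the sense in which the two kinds of referencing are brought together.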

Page 12: Reliability and dependability by neil jones

The standard error of measurement

• The standard error of measurement (SEM) is a transformation of reliability in terms of test scores, which is useful in considering consistency of classification.

• While reliability refers to a group of test-takers, the SEM shows the impact of reliability on the likely score of an individual: it indicates how close a test-taker’s score is likely to be to their ‘true score’.

One difference often cited between CTT and IRT is that CTT SEM is a single value applied to all possible scores in a test, while the IRT SEM is conditional on each possible score, and is probably of greater technical value.

However, as Haertel (2006: 82) points out, CTT also has techniques for estimating SEM conditional on score.
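
The slides do not give a formula, but the single-value CTT SEM is conventionally computed as the score standard deviation multiplied by the square root of one minus the reliability. The sketch below applies that standard formula to invented numbers; the function names and the example values (a standard deviation of 10, a reliability of 0.91, an observed score of 60) are assumptions for illustration.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: score SD times the square root of (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed_score, sem, width=1.0):
    """Range of `width` SEMs either side of an observed score."""
    return observed_score - width * sem, observed_score + width * sem


if __name__ == "__main__":
    sem = standard_error_of_measurement(score_sd=10, reliability=0.91)
    print("SEM:", sem)
    print("band around a score of 60:", score_band(60, sem))
```

A band of one SEM either side of the observed score is often read as covering the true score roughly two times in three, assuming normally distributed error. If a cut score falls inside that band, the classification of that test-taker is correspondingly uncertain.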

Page 13: Reliability and dependability by neil jones

Internal consistency as the definition of a trait

• It is important to note that internal consistency is conceptually quite unrelated to the definition of reliability.

• Think of a short test consisting of items on, say: your shoe size, visual acuity, the number of children you have, and the distance from your house to work. Assume that with appropriate procedures each of these can be found without error, for a group of candidates. The reliability of this error-free test will be a perfect 1.

• But these items are completely unrelated to each other, and so an internal consistency estimate of their reliability would be about zero. For this reason too, it is impossible to put a name to this test, that is, to say what it is actually a test of.

Page 14: Reliability and dependability by neil jones

Internal consistency as the definition of a trait

• Now suppose the test contained, say, items on shoe size, height, gender. This time it is likely that on administering the test the internal consistency estimate of reliability would be found to be considerably higher than zero.

• The difference is that this time the items are related to each other.

• Study them and you could probably name what it measures: something like ‘physical build’.

• So the trait which a test actually measures is whatever explains its internal consistency.
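
The contrast between unrelated and related items can be checked numerically. The sketch below uses Cronbach's alpha, one widely used internal consistency estimate; the slides do not name a particular coefficient, so that choice is an assumption, and the tiny data sets are invented so that the arithmetic is easy to follow.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` holds one list of scores per item, with test-takers in the
    same order in every list.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Two items with no relationship to each other across four test-takers.
    unrelated = [[0, 0, 1, 1], [0, 1, 0, 1]]
    # Two items that rank the four test-takers identically.
    related = [[1, 2, 3, 4], [1, 2, 3, 4]]
    print("unrelated items:", cronbach_alpha(unrelated))
    print("related items:", cronbach_alpha(related))
```

Every score in both data sets is treated as error-free, yet the unrelated items yield an alpha of about zero and the related items an alpha of about one. This mirrors the shoe-size example: the coefficient reflects how far the items hang together, not how accurately each one was measured.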

Page 15: Reliability and dependability by neil jones

Reliability and validity

• Validity nowadays tends to be judged in terms of whether the uses made of test results are justified (Messick, 1989). This implies a complex set of arguments that go well beyond the older and purely psychometric issue of whether the test measures what it is believed to measure.

Page 16: Reliability and dependability by neil jones

Reliability and validity

Coherent measurement and construct definition

• In the trait-based, unidimensional approaches, conceptions of validity and reliability emerge as rather closely linked. They both relate to the same notion of coherent measurement: of focusing in on ‘one thing’ at a time.

• Typically this means identifying skills such as Reading, Writing, Listening and Speaking as distinct traits, and testing them separately.

• Each of these traits requires definition: what do we understand by ‘Reading’ or ‘Listening’ ability, and how is it to be tested?

• Such construct definition provides the basis of a validity argument for how test results can be interpreted.

• Defining constructs encourages test developers to identify explicit models of language competence, enables useful profiling of an individual learner’s strengths or weaknesses, and helps to interpret test performance in meaningful terms.

Page 17: Reliability and dependability by neil jones

Focusing on specific contexts

• The conclusion is thus that the trait-based measurement models presented here enable approaches to language proficiency testing which can work well, achieving a useful blend of reliability, validity and practicality.

• However, there is a condition: each testing context must be treated on its own terms, and tests designed for one context may not be readily comparable with tests designed for another context.

Page 18: Reliability and dependability by neil jones

Mislevy (1992: 22) identifies four possible levels at which tests can be compared:

• Equating – the strongest level: refers to testing the same thing in the same way, e.g. two tests constructed from the same test specification to the same blueprint. Equating such tests allows them to be used interchangeably.

• Calibration – refers to testing the same thing in a different way, e.g. two tests constructed from the same specification but to a different blueprint, which thus have different measurement characteristics.

• Projection – refers to testing a different thing in a different way, e.g. where constructs are differently specified. It predicts learners’ scores on one test from another, with accuracy dependent on the degree of similarity. It is relevant where both tests target the same basic population of learners.

• Moderation – the weakest level: can be applied where performance on one test does not predict performance on the other for an individual learner, e.g. tests of French and German.

Page 19: Reliability and dependability by neil jones

Issues with reliability

In practice language testing seeks to achieve both reliability and validity within the practical constraints which limit every testing context. The aim should be to optimise both, rather than prioritise one over the other. If reliability is prioritised, then indeed it may conflict with validity.

Internal consistency estimates of reliability make it possible to drive up the reliability of tests over time, simply by weeding out items which correlate less highly with the others.

This, as Ennis (1999) points out, is potentially a serious threat to the validity of a test, as it leads to a progressive narrowing of what is tested, without explicit consideration of how the content of the test is being modified.

A classic way of narrowing the testing focus is to restrict the range of task types used and select items primarily on psychometric quality – the discrete item multiple-choice test format which Spolsky questioned.
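
The weeding-out mechanism described above can be sketched with item-rest correlations: each item is correlated with the summed score on the remaining items, and the weakest item is the natural candidate for removal. This is an illustration with invented data and hypothetical function names, not a procedure taken from the chapter.

```python
def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def item_rest_correlations(items):
    """Correlate each item with the total score on all the other items."""
    results = []
    for index, item in enumerate(items):
        others = [other for j, other in enumerate(items) if j != index]
        rest_totals = [sum(scores) for scores in zip(*others)]
        results.append(pearson(item, rest_totals))
    return results


if __name__ == "__main__":
    # Invented scores for six test-takers on three items.
    items = [
        [1, 2, 3, 4, 5, 6],
        [1, 2, 3, 4, 5, 6],
        [3, 1, 2, 2, 3, 1],  # behaves differently from the other two
    ]
    correlations = item_rest_correlations(items)
    weakest = correlations.index(min(correlations))
    print("item-rest correlations:", correlations)
    print("weakest item index:", weakest)
```

The statistic only flags the third item as different; it says nothing about whether that item covers content the test ought to include. Dropping it purely on this evidence is exactly the unexamined narrowing of content that the text warns about.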

Page 20: Reliability and dependability by neil jones

Trait-based measures versus cognitive models

The trait-based measurement approach is most useful in summative assessment, where at the end of a course of study the learner’s achievements can be summarised as a simple grade or proficiency level.

Formative assessment, which aims to feed forward into future learning, needs to provide more information, not simply about how much a learner knows, but about the nature of that knowledge.

As Mislevy (1992: 15) states: ‘Contemporary conceptions of learning do not describe developing competence in terms of increasing trait values, but in terms of alternative constructs.’

Page 21: Reliability and dependability by neil jones

Thank You