When enough is enough: a conceptual basis for fair and defensible practice performance assessment
L W T Schuwirth, L Southgate, G G Page, N S Paget, J M J Lescop, S R Lew, W B Wade & M Baron-Maldonado
Introduction An essential element of practice performance assessment involves combining the results of various procedures in order to see the whole picture. This picture must be derived from both objective and subjective assessment, and from a combination of quantitative and qualitative assessment procedures. Because of the severe consequences an assessment of practice performance may have, it is essential that the procedure is both defensible to the stakeholders and fair, in that it distinguishes well between good performers and underperformers.
Lessons from competence assessment Large samples of behaviour are always necessary because of the domain specificity of competence and performance. The test content is considerably more important in determining which competency is being measured than the test format, and it is important to recognise that the problem-solving process is more idiosyncratic than its outcome. It is advisable to add some structure to the assessment but to refrain from over-structuring, as this tends to trivialise the measurement.
Implications for practice performance assessment A practice performance assessment should use multiple instruments. The reproducibility of subjective parts should be increased not by over-structuring, but by sampling across sources of bias. As many sources of bias may exist, sampling across all of them may not prove feasible. Therefore, a more project-orientated approach using a range of instruments is suggested. At various timepoints during any assessment with a particular instrument, questions should be raised as to whether the sampling is sufficient with respect to the quantity and quality of the observations, and whether the totality of assessments across instruments is sufficient to see 'the whole picture'. This policy is embedded within a larger organisational and health care context.
Keywords clinical competence/*standards; physicians, family/*standards; education, medical/*standards; quality of health care/standards.
Medical Education 2002;36:925–930
Introduction
The area of practice performance assessment is relatively new in the field of medical assessment. However, it has been very high on the agenda in the last decade, because of the possibilities that assessment offers for improving the quality of patient care and because of the demonstrated limitations of competence assessment procedures. Moreover, it has received additional attention as a result of societal concerns about the quality of practising doctors. Many major medical boards have defined standards for good medical practice and are now seeking useful instruments to assess whether practising doctors meet these standards.1–4 While these instruments may be used as screening tools to detect areas of strength and weakness in order to guide remediation, in some cases the purpose is to decide whether the assessee is still fit for practice. Clearly the consequences in the latter case may have enormous impact both on society and on the assessee; therefore rigour in the assessment procedures and in the determination of outcomes is essential.
Not only should the procedures used in practice performance assessment be defensible, but they must also inspire confidence and be congruent with best evidence-based medical educational standards.

Although existing assessment methods, both traditional and new, are being suggested and trialled for performance assessment, the value of many of them for this purpose has not been demonstrated. The findings from competence assessment make it highly unlikely that one superior assessment instrument will be developed for practice performance assessment, given that the goal in such assessments is to see the whole picture of a practitioner's performance. Instead, it is more likely that a varied palette of methods will be necessary to achieve this goal.5,6 In this paper we suggest an approach to selecting, using and combining methods to set up a practice performance assessment that will illuminate the entirety of a practitioner's performance, presenting a picture which is accurate and defensible.
Before doing so, however, it would be useful to distinguish between competence and performance. Many different definitions of competence and performance have been proposed, but they mainly converge on the notion that competence indicates what people will do under optimal conditions, knowing that they are challenged to demonstrate that they have the particular knowledge, skills and attitudes required for a task. Performance indicates how people will behave when unobserved, in real life, on a day-to-day basis.7 Further agreement seems to exist that competence is a necessary but not sufficient requirement for performance.8,9 In other words, performance can be seen as the result of competence combined with the conditions which both enable and impose boundaries on the practitioner. Because competence and performance are strongly related, some of the lessons learnt from competence assessment can serve as initial guides to assessing performance.
Lessons from competence assessment
Domain specificity requires large samples.
The domain specificity of many competencies constitutes a large threat to the reproducibility of assessment results.10 Although it is often intuitively assumed that competencies are stable, generic traits that, once mastered, can be used in any given situation, the opposite has been demonstrated.10,11 This has been studied particularly in the field of medical problem solving, where the result obtained on one case proved to be a poor predictor of the result on any other given case. For this reason, large samples of cases are needed to achieve sufficient reliability of the assessment.12
The test content, rather than the format, decides which competency is being measured, and no single format can assess all aspects of medical competence.
In terms of validity, or what the assessment really measures, the format appears to be relatively unimportant.13–16 If the same content is assessed, it is not how things are asked but what things are asked that decides which competency is being measured. Certain formats are naturally better suited to certain content, but there is certainly no single method that can do it all. A complete assessment package for competence assessment must therefore consist of a variety of methods, each chosen on the basis of its effectiveness in assessing a particular aspect of competence.17,18
The problem-solving process is more idiosyncratic than the outcome.
When presented with the same problem, different experts will suggest different strategies to solve it, although they may come to the same solution.19,20 Each individual strategy is determined by the expert's individual experiences and the organisation of his or her knowledge, and idiosyncrasy tends to increase with increasing expertise. Therefore, an outcome-based approach will be more useful for high-stakes assessments of experts than a process-based approach. In other words, the quality of a doctor's clinical decisions is often a better measure of his or her medical competence than the reasoning process leading to those decisions.
Some structure in assessment adds a lot, but too much structure loses ground.
Structuring an assessment lightly can bring enormous improvements in reproducibility.21 Adding too much structure, however, trivialises the measurement. Sometimes relevant facets of an assessment, such as rapport with the patient, cannot be structured or made objective, and assessment designers may resolve the issue by ignoring them. It is important to recognise that some elements of clinical behaviour are more subjective than others and cannot be assessed objectively. This is not an argument against precision, but a challenge to assessment developers not to apply objectivity as the sole route to reproducibility.

Key learning points
• A combination of assessment instruments (both objective measurements and subjective judgements) is necessary to achieve fair and defensible practice performance assessment.
• In order to view the whole picture of a candidate's performance, adequate sampling across possible bias and error sources is more effective than the sole use of objective instruments.
• A project management approach can contribute towards making broad sampling feasible and resource-effective.
• Careful planning and the production of a written project plan are essential to fair and defensible practice performance assessment.
Performance assessment as a method of judging the whole picture
Given that the purpose of practice performance assessment is to see the whole picture, pointillist painting serves as a useful metaphor. In such painting, the quality of each of the dots, as well as their quantity, must be sufficient for their purpose. Moreover, the relationship between the dots is essential to facilitating our view of the whole picture. We suggest a conceptual approach based on this metaphor for making practice performance assessment procedures both fair and defensible. Fairness is used here to mean the accuracy with which decisions about candidates (e.g. 'good performance', 'in need of remediation', 'poor performance') can be made. This implies an approach in which assessment is an integrated view of the quality of the dots (analogous to the assessment methods), the quantity of the dots (analogous to the number of items/observations per method) and the relationship between the dots (analogous to the combination of the different assessment methods).
Implications for the concept of performance assessment
In setting up a fair and defensible performance assessment procedure, especially one used for high-stakes decisions, it is advisable to incorporate these lessons. Procedures for performance assessment should consist of:

1 obtaining sufficiently large samples of practice;
2 with a sufficiently large variety of methods;
3 with a main focus on outcomes, and
4 with a judicious blend of structure/objectivity and subjective methods.
A popular misconception about subjectivity exists in assessment. It is often thought that subjectivity is synonymous with unreliability, and that objectivity is synonymous with reliability. As a consequence, we might surmise that the only way to improve reliability is to add structure to the measurement and to make the assessment more objective. However, this risks trivialising the assessment rather than improving it. Often it is more effective to stick to subjective judgements, but to sample across error sources. If, for example, the judge's bias negatively influences reproducibility, it is better to collect independent judgements from many different judges than to produce overly detailed checklists.
Returning to the metaphor of pointillist painting: the artistic quality of such a painting is determined not only by the number of dots and the quality of each individual dot (all of which may differ from one another) but, more importantly, by the combination of the dots. In judging the artistic value of a painting, applying rating scales to the quality and quantity of the dots would not be sufficient; a judgement of the whole picture is necessary. A more rational approach to assessing the artistry of a painter would be to have 10 different experts look independently at 10 paintings by the artist. This would lead to a matrix of 100 judgements, each of which might be subjective or qualitative. The average judgement, however, would be generalisable, even though no detailed yes-no criteria were developed for artistry. The same argument applies to judgements of practice performance. If judgements and assessments are collected which sample across all possible sources of error, the end result may be highly reproducible despite the subjectivity of some of the individual observations. As noted earlier, some structure, in the form of general criteria, and the grounding of judgements in concrete observations are important in this respect.
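To make this generalisability argument concrete, consider the following minimal simulation sketch. It is not from the original paper: the rating model (true quality plus judge bias plus occasion-specific noise) and all variance figures are illustrative assumptions. It compares how well a single subjective judgement and the mean of a 10 judge by 10 observation matrix reproduce candidates' true quality.

```python
import numpy as np

rng = np.random.default_rng(seed=2002)

n_candidates = 500  # hypothetical assessees (the "artists")
n_judges = 10       # independent judges per candidate
n_occasions = 10    # observations (paintings, consultations) per judge

# One judgement = true quality + judge bias (harshness/leniency)
# + occasion-specific noise; all variances are illustrative.
true_quality = rng.normal(0.0, 1.0, size=n_candidates)
judge_bias = rng.normal(0.0, 1.0, size=(n_candidates, n_judges, 1))
noise = rng.normal(0.0, 1.5, size=(n_candidates, n_judges, n_occasions))
ratings = true_quality[:, None, None] + judge_bias + noise

one_judgement = ratings[:, 0, 0]         # a single subjective rating
matrix_mean = ratings.mean(axis=(1, 2))  # mean of the 10 x 10 matrix

# How well does each score reproduce the candidates' true ranking?
print("single judgement:", np.corrcoef(true_quality, one_judgement)[0, 1])
print("mean of 100:     ", np.corrcoef(true_quality, matrix_mean)[0, 1])
```

Under these assumptions the single rating correlates only moderately with true quality, while the mean of the 100 judgements correlates far more strongly: averaging across judges and occasions, rather than detailed checklists, is what buys reproducibility.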
Implications for the practice of performance assessment
Several sources of bias may exist in judgements and assessments of practice performance. The most obvious is the personal bias of the judge. The judge may be harsh or lenient, or may even have a personal liking or dislike for the assessee. But there are other sources of error. The specific time frame of the assessment, the assessment methods used, the specific selection of patients, the specific selection of domains or elements of performance, the selection of tasks, the specific occasions: these are all examples of context. Ideally, a performance assessment procedure would sample across all of these sources of error or bias. However conceptually appealing this may be, it is not feasible. In order to see the whole picture, a seven-dimensional blueprint would be needed, defining the sample content and sample size for each of the biases or error sources mentioned above. This is not possible in practice, as each individual assessment would simply be too extensive.
A more efficient approach would involve including decision points in the practice performance assessment procedure and determining at each point whether sufficient information has been collected to see the whole picture. At these decision points, four questions should be considered.
Is the quality of the individual sample sufficient?
For each element of performance, certain assessment methods can be used. It is important to make a rational decision in choosing the method for the specific element. If decision-making is the performance element of interest, a chart review may be more useful than undercover simulated patients. If patient education is the aim of the sampling, assessing the information retained by the patient may be more informative than videotaping the patient education process in the consultation room.
Is the quantity of the individual sample sufficient?
There are two important aspects to this question. Firstly, the quantity of the sample should be large enough to judge the purported element of performance (not to provide a general judgement about the assessee's performance). Secondly, the decision to collect more evidence should be based on the results of the previous judgements. If, for example, the judgements about communication behaviour in the previous 10 observations are excellent, it is highly unlikely that the next will be far below standard. In determining whether or not to go on collecting observations or asking items, a binomial approach is more helpful than a standard generalisability approach: after each observation or item, the chance that the next item will give contradictory information can be calculated. If, on the other hand, the 10 judgements are inconclusive, more evidence is needed. Therefore, the size of the sample can vary from situation to situation.
Figure 1 A decision schematic for efficient sampling in practice performance assessment.
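A minimal sketch of such a binomial stopping rule follows. This is our own illustration rather than the authors' procedure: observations are reduced to satisfactory/unsatisfactory, a uniform Beta(1,1) prior gives Laplace's rule of succession for the posterior predictive chance that the next observation contradicts the evidence so far, and the 0.1 stopping threshold is an arbitrary assumption.

```python
def chance_next_contradicts(n_pass: int, n_fail: int) -> float:
    """Posterior predictive chance (uniform Beta(1,1) prior) that the
    next dichotomous observation contradicts the majority so far."""
    p_pass = (n_pass + 1) / (n_pass + n_fail + 2)  # rule of succession
    return min(p_pass, 1.0 - p_pass)

def keep_sampling(judgements: list[bool], threshold: float = 0.1) -> bool:
    """True while more observations are needed for this element.
    judgements: True = satisfactory, False = below standard."""
    n_pass = sum(judgements)
    return chance_next_contradicts(n_pass, len(judgements) - n_pass) > threshold

print(keep_sampling([True] * 10))        # False: 10 excellent ratings suffice
print(keep_sampling([True, False] * 5))  # True: inconclusive, sample further
```

After 10 consistently excellent judgements, the estimated chance that the next one contradicts them falls to about 0.08, so collecting more evidence for this element adds little; an inconclusive mix keeps the chance high and the sampling open.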
Is the quality of the samples sufficient?
The instruments used to collect evidence of good or bad performance should be sufficiently diverse. Methods should be selected on their merits. Avoiding overlapping methods avoids redundancy; instead, a combination of methods that covers a good range of the elements of the whole picture should be selected.
Is the quantity of the samples sufficient?
Good coverage of the whole picture can only be obtained when a sufficient number of methods is used. The use of previous results to determine whether or not more methods should be used is also relevant to this decision.
Figure 1 shows a schematic for the procedure of a
practice performance assessment.
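The decision flow of Figure 1 might be paraphrased in code roughly as follows. This sketch is entirely hypothetical: it reuses keep_sampling() from the previous sketch, and the element-to-instrument mapping and the simulated observe() function are placeholders standing in for real assessment data.

```python
import random

random.seed(1)

# Hypothetical, rationally chosen instrument per element
# (question 1: quality of the individual sample).
INSTRUMENT_FOR = {
    "decision-making": "chart review",
    "patient education": "patient recall interview",
}

def observe(element: str) -> bool:
    """Stand-in for one real observation; True = satisfactory."""
    return random.random() < 0.85

evidence = {}
for element, instrument in INSTRUMENT_FOR.items():
    judgements: list[bool] = []
    while keep_sampling(judgements):  # question 2: quantity of the sample
        judgements.append(observe(element))
    evidence[element] = (instrument, judgements)

# Questions 3 and 4: quality and quantity across samples; are the
# instruments sufficiently diverse, and is every element covered?
instruments_used = {instr for instr, _ in evidence.values()}
whole_picture = len(instruments_used) >= 2 and all(
    judgements for _, judgements in evidence.values()
)
for element, (instr, judgements) in evidence.items():
    print(f"{element}: {instr}, n={len(judgements)}, "
          f"satisfactory={sum(judgements)}")
print("whole picture covered:", whole_picture)
```

The point of the sketch is structural: the sample size per element is not fixed in advance but emerges from the evidence, and the coverage check across instruments is a separate, later decision point.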
When is enough enough?
The approach described above demonstrates the need for a detailed and structured plan for the set-up of an assessment procedure. We would support incorporating the approach into a larger plan of attack. This plan should be described in a formal document, in much the same way as in project planning. This document would then serve as the basis for defending the approach chosen for practice performance assessment. The main considerations to describe in this document would be:

1 the purposes of the assessment procedure, and how the process is tailored to meet those purposes as closely as possible;
2 the regulatory structure of the assessment, dealing with the consequences of decisions and possibilities for appeal;
3 quality control measures for methods, domains, judges, tasks and time frame;
4 cost-effectiveness or, better still, investment-benefit analysis;
5 the rationale and/or scientific underpinning of the choices made;
6 the relationships between evidence (defined as the collection of judgements and outcomes), criteria (defined as what ideal judgements and outcomes should be) and standards (defined as the qualities the assessee should have).
In a larger framework, practice performance assessment would entail a process that starts with the writing of a plan containing the above-mentioned elements of defensibility and honesty. The implementation of the plan should entail a careful evaluation of the outcomes, which in turn may have consequences for the assessee or for the assessment plan itself. Figure 2 presents this proposal schematically.
Conclusion
In this paper we have suggested and defended a systematic approach towards the set-up of a performance assessment process. We suggest that the defensibility of such a programme would be better served by careful selection of a variety of quantitative and qualitative instruments and careful monitoring of their values, rather than by trying to design individual instruments that are made as objective as possible. In performance assessment, it is more important to view the whole picture than to examine individual 'dots'. It is essential that some of the lessons learned from competence assessment and best evidence medical education be applied in this area.
Acknowledgements
Grateful acknowledgement is made to the sponsors of
the 10th Cambridge Conference: the Medical Council
of Canada, the Smith & Nephew Foundation, the
American Board of Internal Medicine, the National
Board of Medical Examiners and the Royal College of
Physicians.
Figure 2 Organisational schematic for the implementation of a practice performance assessment (plan: purposes, elements of defensibility, quality control measures, pathways; then implementation, judgement and consequences for the assessee).
References
1 Jolly B, McAvoy P, Southgate L. GMC’s proposals for
revalidation. Effective revalidation system looks at how
doctors practise and quality of patients’ experience. BMJ
2001;322:358–9.
2 Southgate L, Dauphinee D. Maintaining standards in British and Canadian medicine: the developing role of the regulatory body. BMJ 1998;316:697–700.
3 Southgate L, Pringle M. Revalidation in the United Kingdom:
general principles based on experience in general practice.
BMJ 1999;319:1180–3.
4 Southgate L, Hays R, Norcini J, Mulholland H, Ayers B,
Woolliscroft J et al. Setting performance standards for medical
practice: a theoretical framework. Med Educ 2001;35:474–81.
5 Southgate L, Cox J, David T, Hatch D, Howes A, Johnson N
et al. The assessment of poorly performing doctors: the
development of the assessment programmes for the General
Medical Council’s Performance Procedures. Med Educ
2001;35:2–8.
6 Southgate L, Cox J, David T, Hatch D, Howes A, Johnson N
et al. The General Medical Council’s performance procedures:
peer review of performance in the workplace. Med Educ
2001;35:9–19.
7 Rethans J, Sturmans F, Drop M, Van der Vleuten C. Assessment of performance in actual practice of general practitioners by use of standardized patients. Br J Gen Pract 1991;41:97–9.
8 Miller GE. The assessment of clinical skills/competence/performance. Acad Med 1990;65:S63–7.
9 Southgate L, Campbell M, Cox J, Foulkes J, Jolly B,
McCrorie P et al. The General Medical Council’s perform-
ance procedures: the development and implementation of
tests of competence with examples from general practice. Med
Educ 2001;35:20–8.
10 Elstein AS, Shulmann LS, Sprafka SA. Medical Problem-
Solving: an Analysis of Clinical Reasoning. Cambridge,
Massachusetts: Harvard University Press; 1978.
11 Chi MTH, Glaser R, Rees E. Expertise in problem solving.
In: Sternberg RJ, ed. Advances in the Psychology of Human
Intelligence. Hillsdale, New Jersey: Lawrence Erlbaum;
1982:7–76.
12 Swanson DB. A measurement framework for performance-
based tests. In: Hart I, Harden R, eds. Further Developments in
Assessing Clinical Competence. Montreal: Can-Heal Publica-
tions; 1987:13–45.
13 An evaluation of the construct validity of four alternative theories of
clinical competence. Proceedings of the 25th Annual RIME Con-
ference. Chicago: AAMC; 1986.
14 Norman G, Tugwell P, Feightner J, Muzzin L, Jacoby L.
Knowledge and clinical problem-solving. Med Educ
1985;19:344–56.
15 Norman GR, Smith EKM, Powles AC, Rooney PJ, Henry
NL, Dodd PE. Factors underlying performance on written
tests of knowledge. Med Educ 1987;21:297–304.
16 Norman GR. Reliability and construct validity of some cog-
nitive measures of clinical reasoning. Teaching Learning Med
1989;1:194–9.
17 Ram P. Comprehensive Assessment of General Practitioners.
Maastricht: University of Maastricht; 1998.
18 Van der Vleuten CPM. The assessment of professional com-
petence: developments, research and practical implications.
Adv Health Sci Education 1996;1:41–67.
19 Polsen P, Jeffries R. Expertise in problem solving. In: Stern-
berg RJ, ed. Advances in the Psychology of Human Intelli-
gence. Hillsdale, New Jersey: Lawrence Erlbaum; 1982:367–
411.
20 Swanson DB, Norcini JJ, Grosso LJ. Assessment of clinical
competence: written and computer-based simulations.
Assessment Evaluation Higher Education 1987;12:220–46.
21 Frijns P. Scoringsmodellen Voor Open-Vraag Vormen [Scoring
Models for Open-Ended Question Formats]. Maastricht: Uni-
versity of Maastricht; 1992.
Received 21 March 2002; editorial comments to authors 13 June 2002;
accepted for publication 17 June 2002