College of Physicians and Surgeons Pakistan
TRANSCRIPT
Contents

Quality assurance in examinations: pre and post exam analysis 85
Assembling an examination 89
Standard setting 90
Statistical analysis of examination performance 92
Difficulty and discrimination indices 98
Reliability analysis scale 99
Common methods of evaluating psychomotor skills 100
The objective structured clinical examination (OSCE) 101
Psychometric analysis of OSCE 103
Holistic rating scale: long and short cases 106
Preparing TOACS 107
Organizing TOACS 109
TOACS sheets 111
ASSESSMENT OF COMPETENCE
SCHEDULE
DAY THREE

Quality assurance in examinations: pre and post exam analysis
Assembling an examination
Standard setting
Statistical analysis of examination performance
Difficulty and discrimination indices
Reliability analysis scale
Common methods of evaluating psychomotor skills
The objective structured clinical examination
Psychometric analysis of OSCE
Holistic rating scale: long and short cases
Preparing TOACS
Organizing TOACS
TOACS sheets
QUALITY ASSURANCE IN EXAMINATIONS: PRE AND POST EXAM ANALYSIS
INTRODUCTION
This document outlines a procedure for selecting examination questions, setting the pass mark, and measuring the performance of each item, as well as of the examination as a whole. It has been adapted from standard, published methods and is set out in a way that gives a logical progression through the whole process, from selecting items to setting the pass mark. It is not intended as a detailed description of the procedure, and a list of texts is appended for those who are interested in reading about preparation and quality assurance in more detail.
Quality assurance procedures are an essential and integral part of CPSP examinations. These procedures are carried out routinely after every Part I and Part II examination. Judgement about the quality of the questions used in an examination is based on scientific principles and the evidence obtained. The preparation of tables of specifications by all the faculties is an important aspect of pre-exam quality assurance procedures.
SELECTING QUESTIONS AND EXAM MAPPING
Most MCQ exams consist of a mixture of new questions and items selected from the question bank. Usually between 10% and 20% of the questions will be new. The Examination Board must decide how to deal with the new questions from among 3 options:
1. to treat all the questions (new and ones from the question bank) the same and have them all counting towards the candidates' final marks;
2. to exclude all the new questions from calculating the candidates' final marks, so that marks depend only on the questions taken from the bank;
3. to count new questions that are shown to perform well on the statistical analysis described later, but exclude from the final marks those that do not perform well.
== Once the examination is well established, either option 2 or 3 should be used. Option 1 is not recommended (even though it is the method currently used in the MRCP Part 1 examination in the UK) because it does not allow dysfunctional items to be excluded and, therefore, is likely to impair the validity and reliability of the exam.
85 Assessment of Competence
DAY
THREE
= The most obvious choice is option 3 – in other words, to perform statistical analysis on every item in the examination and base the marks only on those questions that perform well.
Step 1 – collecting questions
The Examiners need to collect a substantial number of questions – considerably more than will be needed in the examination itself.
Step 2 – exam content
Having collected the questions, the examiners will match them against the table of specifications.
Topics in the table of specifications should be assigned to one of 3 categories, depending on the importance of the material they are testing:
l Essential
l Important
l Supplementary
If this has not already been done, Examiners should do it at this stage.
Questions testing trivial material, peculiarities or rarities should be discarded – they will only waste valuable testing time.
A good way to match questions to topics in the table of specifications is to place the questions into groups or piles on a very large table or on the floor. When this has been done, the examiners look at the distribution of questions with 2 things in mind:
1. Are there any essential or important areas of the syllabus that are not covered by questions?
2. Are there too many questions on some topics (especially topics that are not 'essential' or 'important')?

If some essential or important areas are not covered, Examiners might decide to commission new questions to fill the gaps.

If some areas are over-represented, Examiners will look at all the questions in these groups and remove some of them. (They might choose to retain what look to be the best questions, or remove questions that duplicate, or closely match, others.)

At this point, Examiners should still be left with more questions than they will eventually need in the exam, because more will be removed later.
Step 3 – item difficulty
Either at the item-writing stage, or when compiling an examination once the items have been collected, it is important to judge how difficult each item is. This is because the most effective items in an examination are those of moderate difficulty and high discriminant function. Examiners often think that there should be a predominance of difficult items, forgetting that most candidates will not be able to answer many of them.
Each item should be judged as:
l Difficult
l Moderate
l Easy
== It is recommended that about 75% of the items in the examination should be of moderate difficulty.
Step 4 - producing the draft examination paper
One convenient way of assembling the draft examination paper is to enter each question on a grid according to its importance and difficulty:
So, for example, if we have 3 questions with the following characteristics:
1. this question is difficult and essential
2. this question is moderate and important
3. this question is easy and supplementary
They would be entered onto the grid as follows. When all the questions are entered on the grid, the examiners make a number of checks and, if necessary, amendments:
= All items in the 'QUESTIONABLE' column are reviewed. If they are not testing valid material, they must be rejected. If they are testing valid material they should be re-allocated into one of the other columns – they might require some re-writing before doing this. By the end of this
stage, there should be no items left in the 'QUESTIONABLE' column.
a) There might now be items in all the remaining 9 cells in the grid, and the examiners need to check and, if necessary, adjust the distribution (by replacing or re-writing some items) according to the following guidelines:
l The majority of questions (about 75%) should be in just 2 cells – the 'ESSENTIAL' and 'IMPORTANT' columns of the 'moderate' row.
l The remaining questions should be distributed fairly evenly in most of the other 7 cells. However, the 'difficult' 'supplementary' cell can be left empty or with just one or two questions in it.
This method ensures that the majority of items test the most important content of the syllabus; that most are of moderate difficulty (which tend to be the items that discriminate best between good and poor candidates – after all, that is the main aim of this examination); and that there are some items from which the best candidates might pick up additional marks.
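The grid-entry step above can be sketched in code; the question ids and their importance/difficulty classifications below are invented for illustration:

```python
# Hypothetical sketch: placing draft questions onto an importance x
# difficulty grid, as in Step 4 (question data are invented).
from collections import defaultdict

questions = [
    {"id": 1, "importance": "essential", "difficulty": "difficult"},
    {"id": 2, "importance": "important", "difficulty": "moderate"},
    {"id": 3, "importance": "supplementary", "difficulty": "easy"},
]

# Each grid cell is keyed by (difficulty row, importance column).
grid = defaultdict(list)
for q in questions:
    grid[(q["difficulty"], q["importance"])].append(q["id"])

# Inspect the distribution of questions across the cells.
for cell, ids in sorted(grid.items()):
    print(cell, ids)
```

Once all questions are on the grid, the examiners can see at a glance which cells are over- or under-populated.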
The next stage of the process is to set the pass mark.
ASSEMBLING AN EXAMINATION
STEPS
Step 1 - Collect questions
Step 2 - Allocate each question to one of the three groups
Essential
Important
Supplementary
Step 3 - Ascertain item difficulty
Step 4 - Allocate each item to one of the three categories
Difficult
Moderate
Easy
Step 5 - Produce the draft examination paper
Step 6 - Review the questions and prepare the final paper
Have a final look for spelling errors etc. before sending for printing
Ensure secrecy
ESSENTIAL IMPORTANT SUPPLEMENTARY QUESTIONABLE
Difficult
Moderate
Easy
STANDARD SETTING
By this stage, every item should be classified according to 2 criteria –
importance and difficulty. Both these indicators are essential for the
recommended method of standard setting, which is described below.
There are 3 main methods of standard setting (ie setting the pass mark)
in exams, though there are several potential variations of each method.
The simplest is Angoff's method. Ebel's method is slightly more
complicated, yet leads to a better examination design. The Hofstee method
is more complex and best used with large cohorts of examinees.
The method described here is a combination of both Ebel’s and Angoff’s
method, utilizing the advantages of both.
However, during the development stage with the new one-best answer
MCQs, Hofstee’s technique might also be a useful research method, so it
should not be entirely ruled out at this stage. It would probably not be
necessary to use it, though, if the suggested method produces good results.
Suggested method of standard setting
In setting the passing standard by this method, the examiners take
account of both the importance and difficulty of each item. It is likely to be
particularly useful in the development of new examinations or new items
within an existing examination.
Having selected a sufficient number of questions, ensured that they give
good coverage of the syllabus, and checked that they plot well onto the grid above, the
draft examination is scrutinised by a small panel of examiners who will set
the pass mark. This pass mark might be modified in the light of post hoc
analysis of the examination, but it is extremely helpful to have a provisional
pass mark in mind before the examination is sat.
== It is recommended that each panel of examiners consists of at least
3 people; 5 – 8 members would be ideal. This should be sufficient to
minimize hawk/dove effects while still giving a depth of experience in a
group of manageable size.
l The examiners are asked to orientate themselves by first briefly discussing the characteristics of a 'borderline' candidate – one whose knowledge would be just adequate for them to pass the examination.
l Next, with the agreed view of a 'borderline' candidate in mind, each examiner is asked to make their own, personal estimation of the percentage of 'borderline' candidates who might answer each question correctly. This is done without further discussion among the examiners – each needs to make their own personal estimate at this stage.
== For a large examination, such as the College Fellowships with many questions in each MCQ paper, it is probably more feasible to divide the task and allocate, say, 25 questions to each small group of examiners, rather than to ask one team to judge all the questions. If this is done, then the questions for each group should be drawn from several different cells in the grid. This will minimize the errors induced by variations in judgement between the different groups of examiners.
l When each examiner has judged each of the allocated questions, the group's leader collects the estimated percentage 'pass rates' for each item in turn. If there is close agreement between the examiners, it is usually a simple matter to set the 'pass mark' for that item. If the examiners' estimates for a particular question are spread out, the leader adopts the following strategy:
q The leader asks the examiner who gave the highest estimate to briefly explain why.
q Next, the examiner giving the lowest estimate is asked to explain their reasons.
q If necessary, the group as a whole discusses the matter.
q Finally, agreement is reached through compromise, by averaging the estimated marks, or by majority vote.
l When estimates have been agreed for all the questions, the provisional 'pass mark' for the whole examination is calculated by adding together the agreed estimate for each question and dividing by the number of questions.
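The calculation just described can be sketched in a few lines; the agreed per-question estimates below are invented for illustration:

```python
# Provisional pass-mark calculation (Angoff-style): average the panel's
# agreed estimates of the percentage of 'borderline' candidates expected
# to answer each question correctly (values are invented).
agreed_estimates = [55, 60, 40, 70, 65]

pass_mark = sum(agreed_estimates) / len(agreed_estimates)
print(f"Provisional pass mark: {pass_mark:.1f}%")  # 58.0%
```

This provisional mark may then be adjusted in the light of post hoc analysis, as noted above.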
STATISTICAL ANALYSIS OF EXAMINATION PERFORMANCE
It is strongly recommended that statistical analysis is performed using SPSS software running under the Microsoft Windows operating system. At the time of writing (December 2000), the latest version of SPSS is version 10.
There are a few statistical measurements on examination items that
are extremely useful in examination development and quality assurance.
These are:
= Reliability analysis
= Inter-item and item-total correlations
= Difficulty indices
= Discrimination indices
RELIABILITY ANALYSIS
This is extremely important because, as the name implies, it reveals how
reliable the examination is and is the basis for a further calculation – the
Standard Error of Measurement – that gives the confidence intervals for
the marks. This is explained below.
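As a sketch of that further calculation, the Standard Error of Measurement can be computed with the standard test-theory formula SEM = SD × √(1 − reliability); the formula is standard practice rather than taken from this document, and the values below are invented:

```python
import math

# Standard error of measurement from the exam's standard deviation and
# reliability coefficient (both values invented for illustration).
sd = 8.0        # standard deviation of candidates' marks
alpha = 0.90    # reliability (Cronbach's alpha)

sem = sd * math.sqrt(1 - alpha)

# A 95% confidence interval around an individual candidate's mark.
score = 62.0
lower, upper = score - 1.96 * sem, score + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% CI for a mark of {score}: {lower:.1f} to {upper:.1f}")
```

The interval is most useful near the pass mark, where it shows which candidates cannot be confidently classified as passing or failing.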
There are many aspects of reliability, all of which can be measured. The
standard measurement, though, is the internal consistency of the
examination. The basis for this measurement is that items in a test such
as a College Fellowship MCQ paper should be assessing different aspects
of the same domain (ie specialist medical knowledge) and should therefore
correlate well with each other and with the total score. For simplicity,
correlations themselves are covered in the next section, but correlations
and internal consistency can be measured in the same set of calculations
on SPSS.
There are two well-known (and related) methods of calculating internal
consistency. The KR-20 formula (Kuder-Richardson formula 20) goes back
to the 1930s and is suitable for items with dichotomous answers (eg
true/false). KR-20 would therefore be suitable for use with the new one-
best answer MCQs (where the correct option for each question would be
regarded as ‘true’ and all the other options as ‘false’).
However, it is not suitable for negatively-marked multiple true/false MCQs
because these have a third option – 'don't know'.
More common nowadays is Cronbach's α (alpha) formula, which was
developed from the KR-20. For dichotomous items (such as one-best
answer MCQs) both KR-20 and Cronbach's α give the same value,
but α can also be used for questions that have more than 2 possible
answers.
= It is recommended that the measurement of reliability (internal consistency) is made using Cronbach's α, for three reasons:
1. it is today’s standard measure used by test developers
2. SPSS software is programmed to perform it and it is available on the
drop-down menu
3. it can be used with other examination items, apart from MCQs, so
reports of all examinations can be prepared using the same method
of reliability measurement.
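As an illustration of the recommended measure, here is a minimal sketch of Cronbach's alpha for dichotomous (0/1) item scores; the response matrix below is invented:

```python
from statistics import pvariance

# Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) / total variance).
# One row per candidate, one column per item; 1 = correct (data invented).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

k = len(responses[0])                                  # number of items
item_vars = [pvariance(col) for col in zip(*responses)]
total_var = pvariance([sum(row) for row in responses]) # variance of totals
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.3f}")  # 0.800 for this data
```

In practice SPSS produces this figure directly, but the hand calculation shows what the software is doing.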
= The minimum value for alpha that we are looking for in an examination is 0.8, and for an MCQ exam we would expect a figure of about 0.9.
Immediately above the reliability coefficients for the whole examination is
a table giving various characteristics for each item analyzed. The column
of particular interest to us here is that on the extreme right hand side –
labeled ‘alpha if item deleted’. (Though there is another important column
in this table that we shall consider in the next section). Bearing in mind
the standardized item alpha score for the whole exam, we need to look at
the results in the 'alpha if item deleted' column to identify two things:
1. items where the scores INCREASE when the item is deleted
2. items where the scores DECREASE SIGNIFICANTLY when the item
is deleted.
Test theory tells us that when an item is removed from an exam, so that the
exam gets shorter, the reliability will be reduced. Clearly, in an exam with
many items, this reduction will be insignificantly small when only one item
is removed. Consequently, if the alpha for the exam actually increases
when the item is removed, that item must be IMPAIRING the reliability, not
helping it. Therefore, that item is likely to be faulty.
Conversely, if the reliability is reduced by more than a tiny amount, the item
must be contributing more than its fair share towards reliability and it is
therefore probably a particularly good item. A note should be made of
items of both types.
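The 'alpha if item deleted' check described above can be sketched as follows; the 0/1 response data are invented so that one item visibly impairs reliability:

```python
from statistics import pvariance

# Recompute reliability with each item removed in turn. Rows are
# candidates, columns are items; 1 = correct (data invented).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],   # item 4 answered correctly by a weaker candidate...
    [0, 0, 0, 1],   # ...and by the weakest: it may be faulty
]

def cronbach_alpha(rows):
    k = len(rows[0])
    item_vars = sum(pvariance(col) for col in zip(*rows))
    total_var = pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

overall = cronbach_alpha(responses)
print(f"overall alpha: {overall:.3f}")
for i in range(len(responses[0])):
    reduced = [r[:i] + r[i + 1:] for r in responses]
    print(f"alpha if item {i + 1} deleted: {cronbach_alpha(reduced):.3f}")
```

Here deleting item 4 raises alpha well above the overall figure, flagging it as a likely faulty item, exactly the pattern the text tells us to look for.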
DIFFICULTY INDICES
This measure is of limited value, yet seems to have a great attraction for
many examiners. However, it does have a role to play in evaluating new
exam items, particularly MCQs, because it can confirm (or refute) the
examiners' estimation of whether an item is hard, moderate or easy.
More correctly known as the frequency of endorsement (p-value), difficulty
indices are most usefully calculated and interpreted alongside
discrimination indices, although for simplicity the two will be dealt with
separately here.
Essentially, in an MCQ, the difficulty of a question is the proportion of
people who answered it correctly. It can be calculated by counting the
number of correct answers for each question in turn and converting it to a
percentage of the total number of candidates. This will produce a figure
between zero (where nobody got the answer correct) and 100 (where
everybody got it right). However, p-values can also be reported as a figure
between zero and one, simply by dividing by 100.
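The calculation just described can be sketched in a few lines; the candidates' answers and the answer key are invented:

```python
# Difficulty index (p-value): the proportion of candidates answering the
# item correctly (answer data and key are invented for illustration).
answers = ["B", "C", "B", "A", "B", "B", "D", "B"]  # candidates' responses
key = "B"

p_value = sum(a == key for a in answers) / len(answers)
print(p_value)  # 5 of 8 correct
```

Multiplying by 100 gives the percentage form of the same index.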
It is obviously pointless having an exam item that nobody gets right,
although it might occasionally be worth including an item that everybody
can be expected to get right. However, this must be because of its
fundamental importance, not because it is too simple. (The reason for this
is that it will send out a signal to candidates, along the lines of ‘there is
always a question about xyz’ and the candidates will, therefore, learn the
material).
A second factor is present in one-best answer MCQs – the chances of a
candidate getting the right answer for the wrong reason. This is usually
because they guess. (Guessing is a permanent problem with MCQs and
there is nothing we can do to stop it – the best we can do is contain the
damage). Consequently, with one-best answer MCQs having 5 options for
each question, around 20% of candidates might get the right answer by
chance alone. Therefore, we should be looking for questions where the
correct branch has a p-value higher than 20% (or 0.20).
In practice, with this type of question the target p value for all but the
occasional, exceptional item would be between 25% (0.25) and 80%
(0.80). However, it is important not to judge this information in isolation.
It is much better to look at the pattern of responses rather than the p-value
by itself. This is because it is quite possible for a question with serious
faults to have a p-value that looks satisfactory – for example, when 40%
of candidates choose the correct answer but a substantial proportion of
the remainder all choose the same incorrect answer.
DISCRIMINATION INDICES
Taken together with the pattern of responses, the discrimination index is a very useful measure. The principle here is that the main purpose of the College Fellowship examination is to discriminate between the strong candidates, who reach the required standard, and the weaker ones who do not. Therefore, the examination must contain a substantial proportion of items that discriminate well between these two kinds of candidate.
There are various ways of calculating discrimination indices. The most common are biserials and point biserials. The simplest formula is:

    (n above – n below) / total

where: n above = the number of candidates above the median who answered the item correctly
       n below = the number of candidates below the median who answered the item correctly
       total   = the total number of candidates
The best discriminators will have the highest discrimination index. In general, discrimination indices greater than 0.25 would probably be regarded as acceptable. Any item with a negative discrimination index is obviously very suspicious and must be scrutinised.
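The simple formula above can be sketched as follows, with invented candidate totals and per-item responses:

```python
from statistics import median

# Discrimination index: (n_above - n_below) / total, splitting the
# candidates at the median total score (all data invented).
totals = [82, 75, 68, 60, 55, 48, 40, 33]  # candidates' total marks
correct = [1, 1, 1, 0, 1, 0, 0, 0]         # 1 = answered this item correctly

m = median(totals)
n_above = sum(c for t, c in zip(totals, correct) if t > m)
n_below = sum(c for t, c in zip(totals, correct) if t < m)
d = (n_above - n_below) / len(totals)
print(f"Discrimination index = {d}")  # (3 - 1) / 8 = 0.25
```

Here three of the four candidates above the median got the item right against one below it, giving an index just at the acceptable threshold.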
Further reading
Case S M and Swanson D B (1996) Constructing Written Test Questions for the Basic and Clinical Sciences. Philadelphia, PA: National Board of Medical Examiners.

Holsgrove G (1997) Chapters 27 to 30 in Teaching Medicine in General Practice (Editors Whitehouse C, Roland M and Campion P). Oxford University Press.

Streiner DL and Norman GR (1995) Health Measurement Scales (2nd edition). Oxford University Press.
KEYS TO DIFFICULTY AND DISCRIMINATION INDICES
(Based on: Statistical Analyses: Classical and Rasch. American Society of Clinical Pathologists).
Discrimination Index    Interpretation
Negative value          Inverse discrimination
0.00 to 0.19            Discrimination is (at best) questionable
0.20 to 0.29            Acceptable
0.30 to 1.00            Good

Difficulty Index        Interpretation
0.00 to 0.29            Very difficult – may be inappropriate
0.30 to 0.49            Difficult
0.50 to 0.69            Moderate (the majority of questions should be in this category)
0.70 to 0.89            Easy
0.90 to 1.00            Very easy
OPTION ANALYSIS BASED ON DIFFICULTY INDICES

Q. No.  Key    A%      B%      C%      D%      E%     Blank   Total
1. B 19.74 30.26 46.05 2.63 1.32 0.00 176
2. C 0.00 0.00 92.32 2.63 5.05 0.00 176
3. A 88.16 6.58 0.00 2.63 2.63 0.00 176
4. D 31.58 13.16 3.95 39.47 11.84 0.00 176
5. C 1.32 1.32 34.21 32.89 30.26 0.00 176
6. E 9.21 19.74 11.84 22.37 36.84 0.00 176
7. C 7.89 17.11 46.05 2.63 25.00 1.32 176
8. D 13.70 4.11 16.44 61.64 4.11 0.00 176
9. D 2.63 19.74 36.84 30.26 10.53 0.00 176
10. D 0.00 38.00 00.00 24.00 38.00 0.00 176
11. E 1.32 14.47 5.26 6.58 72.37 0.00 176
12. C 6.58 14.47 50.00 22.37 6.58 0.00 176
13. B 5.26 81.58 5.26 2.63 5.26 0.00 176
14. A 45.33 14.67 14.67 22.67 2.67 0.00 176
15. B 19.74 36.84 26.32 7.89 9.21 0.00 176
16. D 3.95 66.58 20.26 0.00 9.21 0.00 176
17. E 9.33 12.00 5.33 1.33 72.00 0.00 176
18. C 9.21 15.79 73.68 0.00 1.32 0.00 176
19. C 2.63 39.47 44.74 2.63 10.53 0.00 176
20. A 100.00 00.00 100.00 0.00 0.00 0.00 176
21. C 17.11 25.00 40.79 6.58 10.53 0.00 176
22. D 15.79 31.58 9.21 26.32 15.79 1.32 176
23. A 69.74 2.63 13.16 3.95 10.53 0.00 176
24. E 15.79 47.37 6.58 21.05 9.21 0.00 176
25. B 14.47 21.05 42.11 5.26 17.11 0.00 176
26. B 36.28 40.03 19.74 0.00 3.95 0.00 176
27. B 9.21 78.95 5.26 3.95 2.63 0.00 176
28. C 9.21 7.89 36.84 23.68 22.37 0.00 176
29. E 9.72 0.00 0.00 19.28 71.00 0.00 176
30. E 0.00 13.88 10.23 0.00 75.89 0.00 176
Q. No. Difficulty Index Discrimination Index
1. 0.30 0.08
2. 0.92 -0.23
3. 0.88 0.03
4. 0.39 0.05
5. 0.34 0.21
6. 0.37 0.16
7. 0.46 0.56
8. 0.59 0.34
9. 0.30 0.08
10. 0.24 -0.09
11. 0.72 0.13
12. 0.50 0.36
13. 0.82 0.16
14. 0.45 0.05
15. 0.37 0.26
16. 0.00 0.00
17. 0.71 0.11
18. 0.74 0.05
19. 0.45 0.11
20. 1.00 0.00
21. 0.41 0.12
22. 0.26 -0.03
23. 0.70 0.24
24. 0.09 0.47
25. 0.21 0.11
26. 0.40 -0.10
27. 0.79 0.11
28. 0.37 0.37
29. 0.71 0.16
30. 0.76 0.32
DIFFICULTY AND DISCRIMINATION INDICES
RELIABILITY ANALYSIS - SCALE (ALPHA)
Alpha-if-item-deleted calculations are done to see what will happen to the overall reliability value of the exam if a particular MCQ is deleted from the reliability calculations. So, if deleting an MCQ raises the reliability of the exam by more than 0.01, it is a good idea to look carefully at that MCQ and re-write it or delete it. If, on the other hand, deleting an MCQ drops the alpha value by 0.01 or more, then that MCQ is particularly well made and MUST be banked.
Alpha if item Deleted
Alpha = .6656 Standardized item alpha = .6851
Q 1  .6624    Q 11 .6622    Q 21 .6693
Q 2  .7321    Q 12 .6630    Q 22 .7522
Q 3  .6600    Q 13 .6630    Q 23 .6688
Q 4  .6683    Q 14 .6678    Q 24 .6613
Q 5  .6744    Q 15 .6640    Q 25 .6635
Q 6  .6755    Q 16 .6589    Q 26 .7381
Q 7  .6843    Q 17 .6655    Q 27 .6671
Q 8  .6723    Q 18 .6634    Q 28 .6892
Q 9  .6510    Q 19 .6598    Q 29 .6837
Q 10 .7406    Q 20 .6571    Q 30 .6646
COMMON METHODS OF EVALUATING PSYCHOMOTOR SKILLS

METHOD
I (a)  Procedure lists / log books
I (b)  Procedure lists with log books, with prescribed minimum number
II (a) Direct observation with patients, without criteria
II (b) Direct observation with patients, using criteria checklist

Each method is rated against seven criteria: reliability, face validity,
content validity, consequence validity, construct validity, predictive
validity and feasibility.

Key:
+++ most effective method
++  adequately effective
+   effectiveness is questionable

[Ratings table: each method is scored from + to +++ against each of the
seven criteria.]
THE OBJECTIVE STRUCTURED CLINICAL EXAMINATION (OSCE)
HISTORICAL BACKGROUND
One of the major aspects of CPSP postgraduate training programs is
assuring that residents who complete the training programs are clinically
competent before being allowed to practice as consultants. Although many
types of examinations are currently used to evaluate clinical competence
by various specialty boards, very few, if any, have been demonstrated to be
valid and reliable measures of clinical competence. An ideal examination
should be valid, reliable, and feasible.
The Objective Structured Clinical Examination (OSCE) was developed in the
mid-1970s at the University of Dundee in Scotland in an attempt to meet
these criteria. It has become widely used in medical schools throughout
the world both at the undergraduate and postgraduate level and has been
added to "the examiner's toolbox."
The features of an OSCE include:
l It tests individual components of clinical competence.
l It tests process as well as product.
l Checklists are used for assessing skills.
l Clinical material and simulations can be used.
l It assesses a broad spectrum of clinical skills.
In an OSCE, the candidates rotate through a series of stations, which may
be observed or unobserved. Usually an expert observes them, using a
checklist that sets out the necessary steps of the procedure. These
stations assess basic clinical skills, including
procedural, problem solving and counseling skills. Examples of stations
include: 1) taking a history or doing a physical on a simulated patient; 2)
interpreting x-rays, microscopic slides or ECGs; 3) analyzing diagnostic or
management data.
After performing the required procedure/task, the candidate either gives a
structured viva or answers written questions of true/false or restricted
type. The questions are aimed at testing application of knowledge and
interpretation of given data. A limited time is allotted at the stations. All
of the candidates are thus tested on the same or very similar material
avoiding the variation in the difficulty of the clinical material presented to
the candidate.
It should be noted that the OSCE is not an examination method. It is an
examination framework or format. Many types of examination methods
can be incorporated into this format. If the ground rules are followed
(multiple stations, time limits, checklists which have been standardized
and agreed upon, and the use of objective questions where possible), the
OSCE can incorporate:
a) stations of varying lengths
b) stations which test basic clinical skills, procedural skills,
problem-solving skills, attitudinal skills, counseling skills, etc.
The test methods used in measuring these skills can include written, oral,
clinical, simulation and in fact the whole wide range of examination
methods available to us. Thus, the OSCE can be seen as a framework to
which can be attached various test methods designed to measure a variety
of components of clinical competence.
FEEDBACK
The OSCE, in addition to being an evaluating tool, can be a powerful
learning tool because it can be used to provide feedback. Relatively
immediate feedback can be given not only by providing marks but also by
giving candidates their own completed checklist for each station.
The coordinating committee along with the program directors and
examiners should evaluate each station in the examination in the light of
the results obtained. Deficiencies in the candidates as a whole should be
fed back to the teaching staff.
PSYCHOMETRIC ANALYSIS OF OSCE
The OSCE is a very useful addition to "the examiner's toolbox". Subjectivity, inter-examiner variability, variability in patient-case material, and differences in settings and contexts each contribute to the poor measurement characteristics of the traditional assessment approaches. The Objective Structured Clinical Examination (OSCE), introduced by Harden to assess basic clinical skills, was intended to avoid the measurement weaknesses inherent in the traditional assessment approaches.
I) RELIABILITY:
A test's reliability is a measure of its precision and is a function both of the test and the candidates being tested. For example, a test designed for interns may be reliable when administered to interns and less reliable when administered to third-year MBBS students.
a) Inter-rater reliability:
Inter-rater reliability reflects the degree of inter-examiner agreement. It is calculated by correlating two examiners' ratings of the same candidates and may vary from 0 (no agreement) to 1 (perfect agreement).
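The correlation described here can be sketched with a Pearson coefficient computed by hand; the two examiners' ratings below are invented:

```python
# Inter-rater reliability as a Pearson correlation between two examiners'
# ratings of the same candidates (ratings invented for illustration).
examiner_a = [7, 5, 8, 4, 6, 9]
examiner_b = [6, 5, 8, 3, 7, 9]

n = len(examiner_a)
mean_a, mean_b = sum(examiner_a) / n, sum(examiner_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(examiner_a, examiner_b))
var_a = sum((a - mean_a) ** 2 for a in examiner_a)
var_b = sum((b - mean_b) ** 2 for b in examiner_b)
r = cov / (var_a * var_b) ** 0.5
print(f"Inter-rater correlation r = {r:.2f}")
```

A value close to 1, as here, indicates that the two examiners rank the candidates in very nearly the same order.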
b) Internal consistency:
Another perspective on test reliability is offered by a test's internal consistency, that is, the extent to which the test items serve as a series of repeated measures of the same factor. If 'basic clinical skill' is a trait that should manifest itself in the variety of OSCE tasks, then the internal consistency definition of reliability can be applied in the examination of the OSCE.
Overall, it would seem that the reasons for the OSCE's acceptable reliability are at least three-fold:
1. Examiner variability is minimized
2. Patient variability is minimized
3. The task sheets contain specific performance criteria that should contribute to the achievement of a more careful and objective assessment.
II) VALIDITY:
A test's validity is defined with reference to evidence that attests to the
test's ability to measure what it is intended to measure. Three approaches
are commonly used.
a) Content Validity:
Evidence of a test's content validity is usually provided by statements
from panels of experts, or by descriptions of a test's construction that
support the notion of adequate content sampling.
b) Criterion Validity:
Evidence of a test's criterion validity is derived by correlating the
test's results with the results of the administration of another test
that tests the same skills. This second test has to have high validity
and is administered to the same candidates. With reference to the
assessment of clinical skills, a true "gold standard" does not exist.
There is evidence that criterion validity is greater for the
OSCE than for the oral examination.
c) Construct Validity:
In one study, the construct of interest was "the ability to perform
basic clinical procedures". In the context of validating the OSCE, a
useful hypothesis was that the ability to perform basic clinical skills
should improve as length of training increases (improves with clinical
experience).
The results provided evidence of the construct validity of the OSCE.
Evidence for construct validity of tests can rest on the calculation of
correlation coefficients. In this process, one looks for relationships that
support hypotheses relating to construct-postulated attributes of people,
assumed to be reflected in test performance. Validation of such
hypotheses provides evidence of the construct validity of a test or
measure.
III) FEASIBILITY:
The feasibility of OSCE has been reported to vary according to the
sophistication of the stations and the interest of the faculty. A typical
institution does not have to spend a great deal of money on its development,
since most of the material (rooms, furniture, patients, instruments, etc.)
is already available in-house. Expenses rise when the institution has to
train people to act as simulated patients. Another investment that the OSCE requires
(apart from the financial investment) is that of time of the experts. They
have to sit and prepare stations, checklists and gather material for the
stations. Hence, the concerned faculty has to have a high motivation level
not only to start an OSCE but also to maintain it.
CONCLUSION:
The OSCE is a useful approach to the assessment of basic clinical skills in
medical subjects. It is reasonably feasible and acceptable to both examiners
and examinees, and the evidence indicates that its quality, reliability and
validity are acceptable and better than those of the oral examination.
Hence, the OSCE is useful for the assessment of the basic clinical skills of
general medicine residents.
HOLISTIC RATING SCALE: LONG AND SHORT CASES
Excellent
The candidates addressed this point with particular expertise, clarity and accuracy. Descriptions and explanations were clear and concise. Supplementary information (if required) was relevant and up-to-date. Virtually all of the important information was given. There were no significant errors, omissions or misinterpretations. A particularly good performance.

Good
Clearly of a high standard, showing a good command of the material. Good explanations and, where appropriate, supplementary information were given. Most of the important information was given, without significant errors, omissions or misinterpretations. A good, solid performance: not exceptional, but clearly of an acceptable standard.

Adequate
A safe and competent performance, reaching an acceptable standard without significantly exceeding it. No major errors, omissions or misinterpretations, though perhaps a little hesitant at times. A bare pass.

Inadequate
Not quite reaching the required standard, indicating the need for improvements in order to reach the required level of competency, safety and patient management. The candidates might have been noticeably hesitant and undecided, with some important defects in their performance and perhaps a few major errors or omissions. Such candidates show signs that they might achieve the required standard with further work, but are not clearly demonstrating it yet.

Poor
Well below the required standard. Major omissions or errors, particularly regarding safety and competence. Explanations may have been incomprehensible, confused or incorrect. The candidates were possibly out of their depth, with a great deal to do before they might attain the required standard.
PREPARING AND RUNNING TOACS STATIONS
PLANNING TOACS STATIONS
Each TOACS station has to be prepared in a similar manner, with its
description, instructions and procedures given in detail. These details
enable each station to be clearly identified, to be set up and run in the
manner intended, and allow candidates to be scored in a systematic and
standardized way. The information required for each station is indicated
in the sections below.
1. Cover sheet
This sheet contains information about the test station and allows it to be
clearly identifiable as part of a TOACS station bank. Its contents are:
= Station title
= Objectives to be assessed
= Area/ topic covered
= Type of station (active / static)
= Resource requirements
= Names of Examiners developing the station
= Date that station was developed
= Dates on which station was used in the examination
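One way to picture the cover sheet is as a record in the station bank. The sketch below is an illustration only: the class and its field names mirror the list above, but are hypothetical, not part of any CPSP system.

```python
# Sketch only: a cover-sheet record for a TOACS station bank (hypothetical).
from dataclasses import dataclass, field

@dataclass
class StationCoverSheet:
    title: str
    objectives: list[str]          # objectives to be assessed
    area_topic: str                # area/topic covered
    station_type: str              # "active" or "static"
    resources: list[str]
    developers: list[str]          # examiners who developed the station
    date_developed: str
    dates_used: list[str] = field(default_factory=list)  # exam dates

station = StationCoverSheet(
    title="Example station",
    objectives=["Objective to be assessed"],
    area_topic="Area/topic covered",
    station_type="active",
    resources=["table", "couch", "screen"],
    developers=["Examiner A", "Examiner B"],
    date_developed="2004-01-01",
)
print(station.title)
```

Keeping the record structured in this way makes it straightforward for the Examinations Department to append the dates on which the station is reused.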
2. Detailed outline of the station
In this section there needs to be a very detailed description of the station,
including the clinical task or skills that will be assessed. Of considerable
importance are the key features of the task that make it clear what
aspects of a candidate’s performance in the station will need to be
demonstrated if the task is to be completed satisfactorily.
= Station title
= Objectives to be assessed
= Area/ topic covered
= Skills/clinical task to be assessed
= Key features of clinical task
3. Instructions for Examiners
The procedure instructions are to be given, again in detail, so that the
station can be implemented in a standardized manner on repeated
occasions. Instructions to examiners on how to introduce the task to
candidates and how to manage the time allocated to each component
must be written down in a way that ensures they are presented in a
similar fashion to each candidate, and will include:
= Station title
= Introduction of the station and task to the candidate
= Instructions to candidates on how to proceed with the clinical task
= Indicative time allocation within the station
= Prompt questions to be used
= Ways in which examiners will be best able to observe or interact with candidates
= Any other pertinent information necessary to ensure that the station runs smoothly
Note: Standardized patient instructions (if real or simulated patients are used)
If standardized patients are being used, instructions should be written for
training each person to give the required answers and to respond to
physical examination in a consistent and standardized manner. The
patient has to be advised on how to react to expected situations,
questions and statements, and must also know the bounds within which
they may respond to unanticipated situations.
4. Scoring instructions
These enable the examiner to score each candidate in a standardized and
systematic manner through interaction and/or observation. The
candidate's performance is scored globally: impressions of the
performance in sections of the clinical task are assigned a score
according to their completeness and relevance against the defined key
features. These scores are entered on the structured scoring scale
developed in the station plan and are made available to the
Chief Examiner at the end of the examination.
= Station title
= Structured scoring scale including key features
= Weightings of station components
= Indication of acceptable score for candidates reaching a minimally acceptable standard of performance.
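As an illustration of how the component weightings and the minimally acceptable score might be applied, the following sketch combines an examiner's component scores into a weighted station total and checks it against the agreed minimum. The component names, weights and scores are hypothetical, not CPSP values.

```python
# Sketch only: weighted station total checked against a minimally
# acceptable score (all names and numbers are hypothetical).

def station_score(component_scores, weights):
    """Weighted total for one station; weights are assumed to sum to 1."""
    return sum(component_scores[c] * weights[c] for c in weights)

# Hypothetical station: three components, each scored 0-10 by the examiner.
weights = {"history": 0.4, "examination": 0.4, "communication": 0.2}
scores = {"history": 7, "examination": 6, "communication": 8}

total = station_score(scores, weights)   # weighted total on the 0-10 scale
minimally_acceptable = 5.0               # agreed by the station's examiners
passed = total >= minimally_acceptable
print(total, passed)
```

The weighting step is where the "weightings of station components" from the scoring instructions take effect; changing a weight shifts how much each component contributes to the pass/fail decision.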
ORGANIZING TOACS
PREPARATION BEFORE THE TOACS
l Form a group comprising subject experts and a medical educationist.
l Keep the objectives of training and the examination and a table of
specifications in front of you
l Develop TOACS stations according to the required framework
l Review the stations in the light of the objectives and the table of
specifications to ensure their consistency and validity with the
competencies required
l Make a list of staff needed for a smoothly running TOACS
l Prepare guidelines for examiners and candidates
l Ensure availability of examiners, patients and other required
examination material
l Draw a plan showing the layout of the TOACS stations
l Ensure the availability of the venue for the day of the examination
l Ensure that the stations are all set a day before the TOACS starts
(tables, chairs, couches, screens, bell, stopwatches etc.)
ON THE DAY OF THE TOACS
l Arrive well before time
l Re-check the stations
l Designate a timekeeper at a central place
l Distribute plans to invigilators and examiners
l Meet with the examiners and brief them about the plan of the day
l Explain the whole procedure in detail to the candidates
l Ensure that the examination runs smoothly and on time
l Attend to any irregularities that may arise
l Collect all candidate score sheets from all stations
l Note and record any situations that might have influenced candidate
performance
l Check with each station examiner on the standard required for a
minimal pass
l Ensure score sheets are conveyed to the Examinations Department
with appropriate security
POST EXAMINATION ANALYSIS
After collecting the response sheets from the stations (active or static), the scores will be collated and analyzed by the Examinations Department. This analysis will include detailed statistical analysis of each station's and each candidate's performance, which will be taken into account in the determination of the passing score. Other post hoc analyses, including reliability and inter-station correlations, will also be carried out.
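As an illustration of the reliability analysis mentioned above, the following sketch computes Cronbach's alpha across stations, treating each station as an item of the examination. The scores are hypothetical and the code is plain Python, not the Examinations Department's actual procedure.

```python
# Sketch only: Cronbach's alpha as a reliability estimate for an
# examination, with each station treated as an item (hypothetical data).

def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """rows: one list of station scores per candidate."""
    k = len(rows[0])  # number of stations
    station_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(station_vars) / total_var)

# Hypothetical scores for 5 candidates across 4 stations (0-10 each).
scores = [
    [6, 7, 5, 6],
    [8, 8, 7, 9],
    [4, 5, 4, 5],
    [7, 6, 6, 7],
    [9, 9, 8, 9],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

A low alpha, or a station whose scores correlate poorly with the rest, flags material for the item-level review that feeds into the determination of the passing score.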
GLOBAL SCORING
Research on the assessment of clinical competence has now clearly and consistently demonstrated that global scoring is more valid and reliable than detailed checklist scoring, particularly at postgraduate examination level. In practice this means that, although some structure and a clear definition of what is required for a candidate to perform a clinical task at an acceptable level are necessary, the scoring procedure must also take into account the quality and appropriateness of the performance. Many contextual variables influence clinical performance, including the type and severity of the presenting problem and the background of the patient. Communication and interpersonal skills are usually important in reaching an appropriate diagnosis and management plan, as are the professional and ethical considerations shown by the candidate.
Taking all these issues into consideration, a complex clinical task that requires a number of skills to be applied successfully can best be judged by senior clinicians with wide-ranging experience and expertise. Although examiners can be guided by what must be done to complete a task acceptably, the quality of, and clinical reasoning behind, the actions are often of greater importance and need to be observed or discussed. The ultimate test of clinical competence is not how much or how quickly a task is performed, but the manner in which it is performed and the outcomes of that performance. It is important, for example, not to penalize a competent candidate who reaches a satisfactory outcome in a shorter time and with fewer questions or investigations because he or she is extremely familiar with the clinical presentation. Similarly, it is important not to score a performance highly when most steps of a task are taken but the performance is disorganized, unsystematic and carried out without due regard for the patient's rights or condition.
Examiners of clinical competence require some flexibility to apply their professional judgement to a range of candidate performances within a standardized setting and task definition. Global scoring allows this to occur.
TOACS SHEETS
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS INSTRUCTIONS FOR EXAMINERS
STATION NO. _____________________________________________________________
TOPIC: __________________________________________________________________
COMPETENCE TO BE ASSESSED: __________________________________________
TYPE OF STATION: INTERACTIVE/STATIC: __________________________________
INSTRUCTIONS AS GIVEN TO CANDIDATE IN HIS / HER INSTRUCTION SHEET:
(Questions to be asked of the candidate are on the scoring sheet, if applicable.)
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
PROCEDURE INSTRUCTIONS FOR CANDIDATES
STATION NO: __________________________________________________________
TOPIC: _______________________________________________________________
TIME ALLOWED: _______________________________________________________
DESCRIPTION OF THE TASK TO THE CANDIDATES (information and instructions)
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS COVER SHEET
SERIAL NO: _______________________________________________________________
TOPIC: ___________________________________________________________________
COMPETENCE TO BE ASSESSED: ___________________________________________
TYPE OF STATION: INTERACTIVE/ STATIC: ___________________________________
RESOURCES REQUIRED: (Please write the number of items required where necessary and put a tick mark before each item)
Table    Chairs    Couch    Illuminator    Paper    Pencil/s    Eraser    Screen/s
Any other, please write the item and the number required below:
NAMES OF EXAMINERS DEVELOPING THE STATION:
DATE THAT STATION WAS DEVELOPED: _______________________________
(TO BE FILLED IN BY THE EXAMINATION DEPARTMENT)
Date when previously used | Difficulty index | Discrimination index
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS SCORING SHEET
STATION NO.: _______________________________
TOPIC: _____________________________________
EXAM CENTRE: _______________________________

COMPONENT | KEY FEATURES | AGREED ANSWERS | COMPONENT PASSING SCORE | AWARDED SCORE

PROMPT QUESTIONS: (if any)
Total: _____

EXAMINERS' NAME: ______________________  ______________________
EXAMINERS' SIGNATURES: ______________________  ______________________
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS SCORING SHEET
STATION NO.: _______________________________
TOPIC: _____________________________________
EXAM CENTRE: _______________________________

COMPONENT | KEY FEATURES | AGREED ANSWERS | COMPONENT SCORE

RATING SCALE: POOR / INADEQUATE / ADEQUATE / GOOD / EXCELLENT
PROMPT QUESTIONS: (if any)
Total: _____

EXAMINERS' NAME: ______________________  ______________________
EXAMINERS' SIGNATURES: ______________________  ______________________