College of Physicians and Surgeons Pakistan
TRANSCRIPT
Contents

Quality assurance in examinations: pre and post exam analysis 85
Assembling an examination 89
Standard setting 90
Statistical analysis of examination performance 92
Difficulty and discrimination indices 98
Reliability analysis scale 99
Common methods of evaluating psychomotor skills 100
The objective structured clinical examination (OSCE) 101
Psychometric analysis of OSCE 103
Holistic rating scale: long and short cases 106
Preparing TOACS 107
Organizing TOACS 109
TOACS sheets 111
ASSESSMENT OF COMPETENCE
SCHEDULE
DAY THREE

Quality assurance in examinations: pre and post exam analysis
Assembling an examination
Standard setting
Statistical analysis of examination performance
Difficulty and discrimination indices
Reliability analysis scale
Common methods of evaluating psychomotor skills
The objective structured clinical examination
Psychometric analysis of OSCE
Holistic rating scale: long and short cases
Preparing TOACS
Organizing TOACS
TOACS sheets
QUALITY ASSURANCE IN EXAMINATIONS: PRE AND POST EXAM ANALYSIS
INTRODUCTION
This document outlines a procedure for selecting examination questions, setting the pass mark, and measuring the performance of each item, as well as of the examination as a whole. It has been adapted from standard, published methods and is set out in a way that gives a logical progression through the whole process, from selecting items to setting the pass mark. It is not intended as a detailed description of the procedure, and a list of texts is appended for those who are interested in reading about preparation and quality assurance in more detail.
Quality assurance procedures are an essential and integral part of CPSP examinations. These procedures are carried out routinely after every Part I and Part II examination. Judgement about the quality of the questions used in an examination is based on scientific principles and the evidence obtained. The preparation of tables of specifications by all the faculties is an important aspect of pre-exam quality assurance procedures.
SELECTING QUESTIONS AND EXAM MAPPING
Most MCQ exams consist of a mixture of new questions and items selected from the question bank. Usually between 10% and 20% of the questions will be new. The Examination Board must decide how to deal with the new questions from among 3 options:
1. to treat all the questions (new and ones from the question bank) the same and have them all counting towards the candidates' final marks;
2. to exclude all the new questions from calculating the candidates' final marks, so that marks depend only on the questions taken from the bank;
3. to count new questions that are shown to perform well on the statistical analysis described later, but exclude from the final marks those that do not perform well.
== Once the examination is well established, either option 2 or 3 should be used. Option 1 is not recommended (even though it is the method currently used in the MRCP Part 1 examination in the UK) because it does not allow dysfunctional items to be excluded and, therefore, is likely to impair the validity and reliability of the exam.
85 Assessment of Competence
DAY
THREE
= The most obvious choice is option 3 – in other words, to perform statistical analysis on every item in the examination and base the marks only on those questions that perform well.
Step 1 – collecting questions
The Examiners need to collect a substantial number of questions – considerably more than will be needed in the examination itself.
Step 2 – exam content
Having collected the questions, the examiners will match them against the table of specifications.
Topics in the table of specifications should be assigned to one of 3 categories, depending on the importance of the material they are testing:
l Essential
l Important
l Supplementary
If this has not already been done, Examiners should do it at this stage.
Questions testing trivial material, peculiarities or rarities should be discarded – they will only waste valuable testing time.
A good way to match questions to topics in the table of specifications is to place the questions into groups or piles on a very large table or on the floor. When this has been done, the examiners look at the distribution of questions with 2 things in mind:
1. Are there any essential or important areas of the syllabus that are not covered by questions?
2. Are there too many questions on some topics (especially topics that are not 'essential' or 'important')?

If some essential or important areas are not covered, Examiners might decide to commission new questions to fill the gaps.

If some areas are over-represented, Examiners will look at all the questions in these groups and remove some of them. (They might choose to retain what look to be the best questions, or remove questions that duplicate, or closely match, others.)

At this point, Examiners should still be left with more questions than they will eventually need in the exam, because more will be removed later.
Step 3 – item difficulty
Either at the item-writing stage, or when compiling an examination once the items have been collected, it is important to judge how difficult each item is. This is because the most effective items in an examination are those of moderate difficulty and high discriminant function. Examiners often think that there should be a predominance of difficult items, forgetting that most candidates will not be able to answer many of them.
Each item should be judged as:
l Difficult
l Moderate
l Easy
== It is recommended that about 75% of the items in the examination should be of moderate difficulty.
Step 4 - producing the draft examination paper
One convenient way of assembling the draft examination paper is to enter each question on a grid according to its importance and difficulty:
So, for example, if we have 3 questions with the following characteristics:
1. this question is difficult and essential
2. this question is moderate and important
3. this question is easy and supplementary
They would be entered onto the grid as follows. When all the questions are entered on the grid, the examiners make a number of checks and, if necessary, amendments:
= All items in the 'QUESTIONABLE' column are reviewed. If they are not testing valid material, they must be rejected. If they are testing valid material they should be re-allocated into one of the other columns – they might require some re-writing before doing this. By the end of this
stage, there should be no items left in the 'QUESTIONABLE' column.
a) There might now be items in all the remaining 9 cells in the grid, and the examiners need to check and, if necessary, adjust the distribution (by replacing or re-writing some items) according to the following guidelines:
l The majority of questions (about 75%) should be in just 2 cells – the 'ESSENTIAL' and 'IMPORTANT' columns of the 'moderate' row.
l The remaining questions should be distributed fairly evenly in most of the other 7 cells. However, the 'difficult' 'supplementary' cell can be left empty or with just one or two questions in it.
This method ensures that the majority of items test the most important content of the syllabus; that most are of moderate difficulty (which tend to be the items that discriminate best between good and poor candidates – after all, that is the main aim of this examination); and that there are some items from which the best candidates might pick up additional marks.
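The grid-entry step above can be sketched in code; the question ids and their importance/difficulty classifications below are invented for illustration:

```python
# Hypothetical sketch: placing draft questions onto an importance x
# difficulty grid, as in Step 4 (question data are invented).
from collections import defaultdict

questions = [
    {"id": 1, "importance": "essential", "difficulty": "difficult"},
    {"id": 2, "importance": "important", "difficulty": "moderate"},
    {"id": 3, "importance": "supplementary", "difficulty": "easy"},
]

# Each grid cell is keyed by (difficulty row, importance column).
grid = defaultdict(list)
for q in questions:
    grid[(q["difficulty"], q["importance"])].append(q["id"])

# Inspect the distribution of questions across the cells.
for cell, ids in sorted(grid.items()):
    print(cell, ids)
```

Once all questions are on the grid, the examiners can see at a glance which cells are over- or under-populated.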
The next stage of the process is to set the pass mark.
ASSEMBLING AN EXAMINATION
STEPS
Step 1 - Collect questions
Step 2 - Allocate each question to one of the three groups
Essential
Important
Supplementary
Step 3 - Ascertain item difficulty
Step 4 - Allocate each item to one of the three categories
Difficult
Moderate
Easy
Step 5 - Produce the draft examination paper
Step 6 - Review the questions and prepare the final paper
Have a final look for spelling errors etc. before sending for printing
Ensure secrecy
ESSENTIAL IMPORTANT SUPPLEMENTARY QUESTIONABLE
Difficult
Moderate
Easy
STANDARD SETTING
By this stage, every item should be classified according to 2 criteria –
importance and difficulty. Both these indicators are essential for the
recommended method of standard setting, which is described below.
There are 3 main methods of standard setting (ie setting the pass mark)
in exams, though there are several potential variations of each method.
The simplest is Angoff's method. Ebel's method is slightly more
complicated, yet leads to a better examination design. The Hofstee method
is more complex and best used with large cohorts of examinees.
The method described here is a combination of both Ebel’s and Angoff’s
method, utilizing the advantages of both.
However, during the development stage with the new one-best answer
MCQs, Hofstee’s technique might also be a useful research method, so it
should not be entirely ruled out at this stage. It would probably not be
necessary to use it, though, if the suggested method produces good results.
Suggested method of standard setting
In setting the passing standard by this method, the examiners take
account of both the importance and difficulty of each item. It is likely to be
particularly useful in the development of new examinations or new items
within an existing examination.
Having selected a sufficient number of questions, ensured that they give
good coverage of the syllabus, and checked that they plot well onto the grid above, the
draft examination is scrutinised by a small panel of examiners who will set
the pass mark. This pass mark might be modified in the light of post hoc
analysis of the examination, but it is extremely helpful to have a provisional
pass mark in mind before the examination is sat.
== It is recommended that each panel of examiners consists of at least
3 people; 5 – 8 members would be ideal. This should be sufficient to
minimize hawk/dove effects while still giving a depth of experience in a
group of manageable size.
l The examiners are asked to orientate themselves by first briefly discussing the characteristics of a 'borderline' candidate – one whose knowledge would be just adequate for them to pass the examination.
l Next, with the agreed view of a 'borderline' candidate in mind, each examiner is asked to make their own, personal estimation of the percentage of 'borderline' candidates who might answer each question correctly. This is done without further discussion among the examiners – each needs to make their own personal estimate at this stage.
== For a large examination, such as the College Fellowships with many questions in each MCQ paper, it is probably more feasible to divide the task and allocate, say, 25 questions to each small group of examiners, rather than to ask one team to judge all the questions. If this is done, then the questions for each group should be drawn from several different cells in the grid. This will minimize the errors induced by variations in judgement between the different groups of examiners.
l When each examiner has judged each of the allocated questions, the group's leader collects the estimated percentage 'pass rates' for each item in turn. If there is close agreement between the examiners, it is usually a simple matter to set the 'pass mark' for that item. If the examiners' estimates for a particular question are spread out, the leader adopts the following strategy:
q The leader asks the examiner who gave the highest estimate to briefly explain why.
q Next, the examiner giving the lowest estimate is asked to explain their reasons.
q If necessary, the group as a whole discusses the matter.
q Finally, agreement is reached through compromise, by averaging the estimated marks, or by majority vote.
l When estimates have been agreed for all the questions, the provisional 'pass mark' for the whole examination is calculated by adding together the agreed estimate for each question and dividing by the number of questions.
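The calculation just described can be sketched in a few lines; the agreed per-question estimates below are invented for illustration:

```python
# Provisional pass-mark calculation (Angoff-style): average the panel's
# agreed estimates of the percentage of 'borderline' candidates expected
# to answer each question correctly (values are invented).
agreed_estimates = [55, 60, 40, 70, 65]

pass_mark = sum(agreed_estimates) / len(agreed_estimates)
print(f"Provisional pass mark: {pass_mark:.1f}%")  # 58.0%
```

This provisional mark may then be adjusted in the light of post hoc analysis, as noted above.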
STATISTICAL ANALYSIS OF EXAMINATION PERFORMANCE
It is strongly recommended that statistical analysis is performed using SPSS software running under the Microsoft Windows operating system. At the time of writing (December 2000), the latest version of SPSS is version 10.
There are a few statistical measurements on examination items that
are extremely useful in examination development and quality assurance.
These are:
= Reliability analysis
= Inter-item and item-total correlations
= Difficulty indices
= Discrimination indices
RELIABILITY ANALYSIS
This is extremely important because, as the name implies, it reveals how
reliable the examination is and is the basis for a further calculation – the
Standard Error of Measurement – that gives the confidence intervals for
the marks. This is explained below.
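As a sketch of that further calculation, the Standard Error of Measurement can be computed with the standard test-theory formula SEM = SD × √(1 − reliability); the formula is standard practice rather than taken from this document, and the values below are invented:

```python
import math

# Standard error of measurement from the exam's standard deviation and
# reliability coefficient (both values invented for illustration).
sd = 8.0        # standard deviation of candidates' marks
alpha = 0.90    # reliability (Cronbach's alpha)

sem = sd * math.sqrt(1 - alpha)

# A 95% confidence interval around an individual candidate's mark.
score = 62.0
lower, upper = score - 1.96 * sem, score + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% CI for a mark of {score}: {lower:.1f} to {upper:.1f}")
```

The interval is most useful near the pass mark, where it shows which candidates cannot be confidently classified as passing or failing.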
There are many aspects of reliability, all of which can be measured. The
standard measurement, though, is the internal consistency of the
examination. The basis for this measurement is that items in a test such
as a College Fellowship MCQ paper should be assessing different aspects
of the same domain (ie specialist medical knowledge) and should therefore
correlate well with each other and with the total score. For simplicity,
correlations themselves are covered in the next section, but correlations
and internal consistency can be measured in the same set of calculations
on SPSS.
There are two well-known (and related) methods of calculating internal
consistency. The KR-20 formula (Kuder-Richardson formula 20) goes back
to the 1930s and is suitable for items with dichotomous answers (eg
true/false). KR-20 would therefore be suitable for use with the new one-
best answer MCQs (where the correct option for each question would be
regarded as ‘true’ and all the other options as ‘false’).
However, it is not suitable for negatively-marked multiple true/false MCQs
because these have a third option – 'don't know'.
More common nowadays is Cronbach's α (alpha) formula, which was
developed from the KR-20. For dichotomous items (such as one-best
answer MCQs) both KR-20 and Cronbach's α give the same value,
but α can also be used for questions that have more than 2 possible
answers.
= It is recommended that the measurement of reliability (internal consistency) is made using Cronbach's α, for three reasons:
1. it is today’s standard measure used by test developers
2. SPSS software is programmed to perform it and it is available on the
drop-down menu
3. it can be used with other examination items, apart from MCQs, so
reports of all examinations can be prepared using the same method
of reliability measurement.
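As an illustration of the recommended measure, here is a minimal sketch of Cronbach's alpha for dichotomous (0/1) item scores; the response matrix below is invented:

```python
from statistics import pvariance

# Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) / total variance).
# One row per candidate, one column per item; 1 = correct (data invented).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

k = len(responses[0])                                  # number of items
item_vars = [pvariance(col) for col in zip(*responses)]
total_var = pvariance([sum(row) for row in responses]) # variance of totals
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.3f}")  # 0.800 for this data
```

In practice SPSS produces this figure directly, but the hand calculation shows what the software is doing.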
= The minimum value for alpha that we are looking for in an examination is 0.8, and for an MCQ exam we would expect a figure of about 0.9.
Immediately above the reliability coefficients for the whole examination is
a table giving various characteristics for each item analyzed. The column
of particular interest to us here is that on the extreme right hand side –
labeled ‘alpha if item deleted’. (Though there is another important column
in this table that we shall consider in the next section). Bearing in mind
the standardized item alpha score for the whole exam, we need to look at
the results in the 'alpha if item deleted' column to identify two things:
1. items where the scores INCREASE when the item is deleted
2. items where the scores DECREASE SIGNIFICANTLY when the item
is deleted.
Test theory tells us that when an item is removed from an exam, so that the
exam gets shorter, the reliability will be reduced. Clearly, in an exam with
many items, this reduction will be insignificantly small when only one item
is removed. Consequently, if the alpha for the exam actually increases
when the item is removed, that item must be IMPAIRING the reliability, not
helping it. Therefore, that item is likely to be faulty.
Conversely, if the reliability is reduced by more than a tiny amount, the item
must be contributing more than its fair share towards reliability and it is
therefore probably a particularly good item. A note should be made of
items of both types.
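The 'alpha if item deleted' check described above can be sketched as follows; the 0/1 response data are invented so that one item visibly impairs reliability:

```python
from statistics import pvariance

# Recompute reliability with each item removed in turn. Rows are
# candidates, columns are items; 1 = correct (data invented).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],   # item 4 answered correctly by a weaker candidate...
    [0, 0, 0, 1],   # ...and by the weakest: it may be faulty
]

def cronbach_alpha(rows):
    k = len(rows[0])
    item_vars = sum(pvariance(col) for col in zip(*rows))
    total_var = pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

overall = cronbach_alpha(responses)
print(f"overall alpha: {overall:.3f}")
for i in range(len(responses[0])):
    reduced = [r[:i] + r[i + 1:] for r in responses]
    print(f"alpha if item {i + 1} deleted: {cronbach_alpha(reduced):.3f}")
```

Here deleting item 4 raises alpha well above the overall figure, flagging it as a likely faulty item, exactly the pattern the text tells us to look for.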
DIFFICULTY INDICES
This measure is of limited value, yet seems to have a great attraction for
many examiners. However, it does have a role to play in evaluating new
exam items, particularly MCQs, because it can confirm (or refute) the
examiners' estimation of whether an item is hard, moderate or easy.
More correctly known as the frequency of endorsement (p-value), difficulty
indices are most usefully calculated and interpreted alongside
discrimination indices, although for simplicity the two will be dealt with
separately here.
Essentially, in an MCQ, the difficulty of a question is the proportion of
people who answered it correctly. It can be calculated by counting the
number of correct answers for each question in turn and converting it to a
percentage of the total number of candidates. This will produce a figure
between zero (where nobody got the answer correct) and 100 (where
everybody got it right). However, p-values can also be reported as a figure
between zero and one, simply by dividing by 100.
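The calculation just described can be sketched in a few lines; the candidates' answers and the answer key are invented:

```python
# Difficulty index (p-value): the proportion of candidates answering the
# item correctly (answer data and key are invented for illustration).
answers = ["B", "C", "B", "A", "B", "B", "D", "B"]  # candidates' responses
key = "B"

p_value = sum(a == key for a in answers) / len(answers)
print(p_value)  # 5 of 8 correct
```

Multiplying by 100 gives the percentage form of the same index.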
It is obviously pointless having an exam item that nobody gets right,
although it might occasionally be worth including an item that everybody
can be expected to get right. However, this must be because of its
fundamental importance, not because it is too simple. (The reason for this
is that it will send out a signal to candidates, along the lines of ‘there is
always a question about xyz’ and the candidates will, therefore, learn the
material).
A second factor is present in one-best answer MCQs – the chances of a
candidate getting the right answer for the wrong reason. This is usually
because they guess. (Guessing is a permanent problem with MCQs and
there is nothing we can do to stop it – the best we can do is contain the
damage). Consequently, with one-best answer MCQs having 5 options for
each question, around 20% of candidates might get the right answer by
chance alone. Therefore, we should be looking for questions where the
correct branch has a p-value higher than 20% (or 0.20).
In practice, with this type of question the target p value for all but the
occasional, exceptional item would be between 25% (0.25) and 80%
(0.80). However, it is important not to judge this information in isolation.
It is much better to look at the pattern of responses rather than the p-value
by itself. This is because it is quite possible for a question with serious
faults to have a p-value that looks satisfactory – for example, when 40%
of candidates choose the correct answer but a substantial proportion of
the remainder all choose the same incorrect answer.
DISCRIMINATION INDICES
Taken together with the pattern of responses, the discrimination index is a very useful measure. The principle here is that the main purpose of the College Fellowship examination is to discriminate between the strong candidates, who reach the required standard, and the weaker ones who do not. Therefore, the examination must contain a substantial proportion of items that discriminate well between these two kinds of candidate.
There are various ways of calculating discrimination indices. The most common are biserials and point biserials. The simplest formula is:

    (n above – n below) / total

where: n above = the number of candidates above the median who answered the item correctly
       n below = the number of candidates below the median who answered the item correctly
       total   = the total number of candidates
The best discriminators will have the highest discrimination index. In general, discrimination indices greater than 0.25 would probably be regarded as acceptable. Any item with a negative discrimination index is obviously very suspicious and must be scrutinised.
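The simple formula above can be sketched as follows, with invented candidate totals and per-item responses:

```python
from statistics import median

# Discrimination index: (n_above - n_below) / total, splitting the
# candidates at the median total score (all data invented).
totals = [82, 75, 68, 60, 55, 48, 40, 33]  # candidates' total marks
correct = [1, 1, 1, 0, 1, 0, 0, 0]         # 1 = answered this item correctly

m = median(totals)
n_above = sum(c for t, c in zip(totals, correct) if t > m)
n_below = sum(c for t, c in zip(totals, correct) if t < m)
d = (n_above - n_below) / len(totals)
print(f"Discrimination index = {d}")  # (3 - 1) / 8 = 0.25
```

Here three of the four candidates above the median got the item right against one below it, giving an index just at the acceptable threshold.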
Further reading
Case S M and Swanson D B (1996) Constructing Written Test Questions for the Basic and Clinical Sciences. Philadelphia, PA: National Board of Medical Examiners.

Holsgrove G (1997) Chapters 27 to 30 in Teaching Medicine in General Practice (Editors Whitehouse C, Roland M and Campion P). Oxford University Press.

Streiner DL and Norman GR (1995) Health Measurement Scales (2nd edition). Oxford University Press.
KEYS TO DIFFICULTY AND DISCRIMINATION INDICES
(Based on: Statistical Analyses: Classical and Rasch. American Society of Clinical Pathologists).
Discrimination Index    Interpretation
Negative value          Inverse discrimination
0.00 to 0.19            Discrimination is (at best) questionable
0.20 to 0.29            Acceptable
0.30 to 1.00            Good

Difficulty Index        Interpretation
0.00 to 0.29            Very difficult – may be inappropriate
0.30 to 0.49            Difficult
0.50 to 0.69            Moderate (the majority of questions should be in this category)
0.70 to 0.89            Easy
0.90 to 1.00            Very easy
OPTION ANALYSIS BASED ON DIFFICULTY INDICES

Q. No.  Key    A%      B%      C%      D%      E%     Blank   Total
1. B 19.74 30.26 46.05 2.63 1.32 0.00 176
2. C 0.00 0.00 92.32 2.63 5.05 0.00 176
3. A 88.16 6.58 0.00 2.63 2.63 0.00 176
4. D 31.58 13.16 3.95 39.47 11.84 0.00 176
5. C 1.32 1.32 34.21 32.89 30.26 0.00 176
6. E 9.21 19.74 11.84 22.37 36.84 0.00 176
7. C 7.89 17.11 46.05 2.63 25.00 1.32 176
8. D 13.70 4.11 16.44 61.64 4.11 0.00 176
9. D 2.63 19.74 36.84 30.26 10.53 0.00 176
10. D 0.00 38.00 00.00 24.00 38.00 0.00 176
11. E 1.32 14.47 5.26 6.58 72.37 0.00 176
12. C 6.58 14.47 50.00 22.37 6.58 0.00 176
13. B 5.26 81.58 5.26 2.63 5.26 0.00 176
14. A 45.33 14.67 14.67 22.67 2.67 0.00 176
15. B 19.74 36.84 26.32 7.89 9.21 0.00 176
16. D 3.95 66.58 20.26 0.00 9.21 0.00 176
17. E 9.33 12.00 5.33 1.33 72.00 0.00 176
18. C 9.21 15.79 73.68 0.00 1.32 0.00 176
19. C 2.63 39.47 44.74 2.63 10.53 0.00 176
20. A 100.00 00.00 100.00 0.00 0.00 0.00 176
21. C 17.11 25.00 40.79 6.58 10.53 0.00 176
22. D 15.79 31.58 9.21 26.32 15.79 1.32 176
23. A 69.74 2.63 13.16 3.95 10.53 0.00 176
24. E 15.79 47.37 6.58 21.05 9.21 0.00 176
25. B 14.47 21.05 42.11 5.26 17.11 0.00 176
26. B 36.28 40.03 19.74 0.00 3.95 0.00 176
27. B 9.21 78.95 5.26 3.95 2.63 0.00 176
28. C 9.21 7.89 36.84 23.68 22.37 0.00 176
29. E 9.72 0.00 0.00 19.28 71.00 0.00 176
30. E 0.00 13.88 10.23 0.00 75.89 0.00 176
Q. No. Difficulty Index Discrimination Index
1. 0.30 0.08
2. 0.92 -0.23
3. 0.88 0.03
4. 0.39 0.05
5. 0.34 0.21
6. 0.37 0.16
7. 0.46 0.56
8. 0.59 0.34
9. 0.30 0.08
10. 0.24 -0.09
11. 0.72 0.13
12. 0.50 0.36
13. 0.82 0.16
14. 0.45 0.05
15. 0.37 0.26
16. 0.00 0.00
17. 0.71 0.11
18. 0.74 0.05
19. 0.45 0.11
20. 1.00 0.00
21. 0.41 0.12
22. 0.26 -0.03
23. 0.70 0.24
24. 0.09 0.47
25. 0.21 0.11
26. 0.40 -0.10
27. 0.79 0.11
28. 0.37 0.37
29. 0.71 0.16
30. 0.76 0.32
DIFFICULTY AND DISCRIMINATION INDICES
RELIABILITY ANALYSIS - SCALE (ALPHA)
Alpha-if-item-deleted calculations are done to see what will happen to the overall reliability value of the exam if a particular MCQ is deleted from the reliability calculations. So, if deleting an MCQ raises the reliability of the exam by more than 0.01, it is a good idea to look carefully at that MCQ and re-write it or delete it. If, on the other hand, deleting an MCQ drops the alpha value by 0.01 or more, then that MCQ is particularly well made and MUST be banked.
Alpha if item Deleted
Alpha = .6656 Standardized item alpha = .6851
Q 1  .6624    Q 11 .6622    Q 21 .6693
Q 2  .7321    Q 12 .6630    Q 22 .7522
Q 3  .6600    Q 13 .6630    Q 23 .6688
Q 4  .6683    Q 14 .6678    Q 24 .6613
Q 5  .6744    Q 15 .6640    Q 25 .6635
Q 6  .6755    Q 16 .6589    Q 26 .7381
Q 7  .6843    Q 17 .6655    Q 27 .6671
Q 8  .6723    Q 18 .6634    Q 28 .6892
Q 9  .6510    Q 19 .6598    Q 29 .6837
Q 10 .7406    Q 20 .6571    Q 30 .6646
COMMON METHODS OF EVALUATING PSYCHOMOTOR SKILLS

METHOD
I (a)  Procedure lists / log books
I (b)  Procedure lists with log books, with prescribed minimum number
II (a) Direct observation with patients, without criteria
II (b) Direct observation with patients, using criteria checklist

Each method is rated against seven criteria: reliability, face validity,
content validity, consequence validity, construct validity, predictive
validity and feasibility.

Key:
+++ most effective method
++  adequately effective
+   effectiveness is questionable

[Ratings table: each method is scored from + to +++ against each of the
seven criteria.]
THE OBJECTIVE STRUCTURED CLINICAL EXAMINATION (OSCE)
HISTORICAL BACKGROUND
One of the major aspects of CPSP postgraduate training programs is
assuring that residents who complete the training programs are clinically
competent before being allowed to practice as consultants. Although many
types of examinations are currently used to evaluate clinical competence
by various specialty boards, very few, if any, have been demonstrated to be
valid and reliable measures of clinical competence. An ideal examination
should be valid, reliable, and feasible.
The Objective Structured Clinical Examination (OSCE) was developed in the
mid-1970s at the University of Dundee in Scotland in an attempt to meet
these criteria. It has become widely used in medical schools throughout
the world both at the undergraduate and postgraduate level and has been
added to "the examiner's toolbox."
The features of an OSCE include:
l It tests individual components of clinical competence.
l It tests process as well as product.
l Checklists are used for assessing skills.
l Clinical material and simulations can be used.
l It assesses a broad spectrum of clinical skills.
In an OSCE, the candidates rotate through a series of stations, which may
be observed or unobserved. Usually an expert observes them, using a
checklist that sets out the necessary steps of the procedure. These
stations assess basic clinical skills, including
procedural, problem solving and counseling skills. Examples of stations
include: 1) taking a history or doing a physical on a simulated patient; 2)
interpreting x-rays, microscopic slides or ECGs; 3) analyzing diagnostic or
management data.
After performing the required procedure/task, the candidate either gives a
structured viva or answers written questions of true/false or restricted
type. The questions are aimed at testing application of knowledge and
interpretation of given data. A limited time is allotted at the stations. All
of the candidates are thus tested on the same or very similar material
avoiding the variation in the difficulty of the clinical material presented to
the candidate.
It should be noted that the OSCE is not an examination method. It is an
examination framework or format. Many types of examination methods
can be incorporated into this format. If the ground rules are followed
(multiple stations, time limits, checklists which have been standardized
and agreed upon, and the use of objective questions where possible), the
OSCE can incorporate:
a) stations of varying lengths
b) stations which test basic clinical skills, procedural skills,
problem-solving skills, attitudinal skills, counseling skills, etc.
The test methods used in measuring these skills can include written, oral,
clinical, simulation and in fact the whole wide range of examination
methods available to us. Thus, the OSCE can be seen as a framework to
which can be attached various test methods designed to measure a variety
of components of clinical competence.
FEEDBACK
The OSCE, in addition to being an evaluating tool, can be a powerful
learning tool because it can be used to provide feedback. Relatively
immediate feedback can be given not only by providing marks but also by
giving candidates their own completed checklist for each station.
The coordinating committee along with the program directors and
examiners should evaluate each station in the examination in the light of
the results obtained. Deficiencies in the candidates as a whole should be
fed back to the teaching staff.
PSYCHOMETRIC ANALYSIS OF OSCE
The OSCE is a very useful addition to "the examiner's toolbox". Subjectivity, inter-examiner variability, variability in patient-case material, and differences in settings and contexts each contribute to the poor measurement characteristics of the traditional assessment approaches. The Objective Structured Clinical Examination (OSCE), introduced by Harden to assess basic clinical skills, was intended to avoid the measurement weaknesses inherent in the traditional assessment approaches.
I) RELIABILITY:
A test's reliability is a measure of its precision and is a function both of the test and the candidates being tested. For example, a test designed for interns may be reliable when administered to interns and less reliable when administered to third-year MBBS students.
a) Inter-rater reliability:
Inter-rater reliability reflects the degree of inter-examiner agreement. It is calculated by correlating two examiners' ratings of the same candidates and may vary from 0 (no agreement) to 1 (perfect agreement).
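The correlation described here can be sketched with a Pearson coefficient computed by hand; the two examiners' ratings below are invented:

```python
# Inter-rater reliability as a Pearson correlation between two examiners'
# ratings of the same candidates (ratings invented for illustration).
examiner_a = [7, 5, 8, 4, 6, 9]
examiner_b = [6, 5, 8, 3, 7, 9]

n = len(examiner_a)
mean_a, mean_b = sum(examiner_a) / n, sum(examiner_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(examiner_a, examiner_b))
var_a = sum((a - mean_a) ** 2 for a in examiner_a)
var_b = sum((b - mean_b) ** 2 for b in examiner_b)
r = cov / (var_a * var_b) ** 0.5
print(f"Inter-rater correlation r = {r:.2f}")
```

A value close to 1, as here, indicates that the two examiners rank the candidates in very nearly the same order.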
b) Internal consistency:
Another perspective on test reliability is offered by a test's internal consistency, that is, the extent to which the test items serve as a series of repeated measures of the same factor. If 'basic clinical skill' is a trait that should manifest itself in the variety of OSCE tasks, then the internal consistency definition of reliability can be applied in the examination of the OSCE.
Overall, it would seem that the reasons for the OSCE's acceptable reliability are at least three-fold:
1. Examiner variability is minimized
2. Patient variability is minimized
3. The task sheets contain specific performance criteria that should contribute to the achievement of a more careful and objective assessment.
II) VALIDITY:
A test's validity is defined with reference to evidence that attests to the
test's ability to measure what it is intended to measure. Three approaches
are commonly used.
a) Content Validity:
Evidence of a test's content validity is usually provided by statements
from panels of experts, or by descriptions of a test's construction that
support the notion of adequate content sampling.
b) Criterion Validity:
Evidence of a test's criterion validity is derived by correlating the
test's results with the results of the administration of another test
that tests the same skills. This second test has to have high validity
and is administered to the same candidates. With reference to the
assessment of clinical skills, a true "gold standard" does not exist.
There is evidence that criterion validity is greater for the
OSCE than for the oral examination.
c) Construct Validity:
In one study, the construct of interest was "the ability to perform
basic clinical procedures". In the context of validating the OSCE, a
useful hypothesis was that the ability to perform basic clinical skills
should improve as length of training increases (improves with clinical
experience).
The results provided evidence of the construct validity of the OSCE.
Evidence for construct validity of tests can rest on the calculation of
correlation coefficients. In this process, one looks for relationships that
support hypotheses relating to construct-postulated attributes of people,
assumed to be reflected in test performance. Validation of such
hypotheses provides evidence of the construct validity of a test or
measure.
III) FEASIBILITY:
The feasibility of OSCE has been reported to vary according to the
sophistication of the stations and the interest of the faculty. A typical
institution does not have to spend a great deal of money on its development,
since most of the material (rooms, furniture, patients, instruments, etc.)
is already available in-house. Expenses rise when the institution has to
train people to act as simulated patients. Another investment that the OSCE requires
(apart from the financial investment) is that of time of the experts. They
have to sit and prepare stations, checklists and gather material for the
stations. Hence, the concerned faculty has to have a high motivation level
not only to start an OSCE but also to maintain it.
CONCLUSION:
The OSCE is a useful approach to the assessment of basic clinical skills in
medical subjects. It is reasonably feasible and acceptable to both examiners
and examinees, and the evidence indicates that its quality, reliability and
validity are acceptable and better than those of the oral examination.
Hence, the OSCE is useful for the assessment of the basic clinical skills of
general medicine residents.
HOLISTIC RATING SCALE: LONG AND SHORT CASES
Excellent
The candidates addressed this point with particular expertise, clarity and accuracy. Descriptions and explanations were clear and concise. Supplementary information (if required) was relevant and up-to-date. Virtually all of the important information was given. There were no significant errors, omissions or misinterpretations. A particularly good performance.

Good
Clearly of a high standard, showing a good command of the material. Good explanations and, where appropriate, supplementary information were given. Most of the important information was given, without significant errors, omissions or misinterpretations. A good, solid performance: not exceptional, but clearly of an acceptable standard.

Adequate
A safe and competent performance, reaching an acceptable standard without significantly exceeding it. No major errors, omissions or misinterpretations, though perhaps a little hesitant at times. A bare pass.

Inadequate
Not quite reaching the required standard, indicating the need for improvements in order to reach the required level of competency, safety and patient management. The candidates might have been noticeably hesitant and undecided, with some important defects in their performance and perhaps a few major errors or omissions. Such candidates show signs that they might achieve the required standard with further work, but are not clearly demonstrating it yet.

Poor
Well below the required standard. Major omissions or errors, particularly regarding safety and competence. Explanations may have been incomprehensible, confused or incorrect. The candidates were possibly out of their depth, with a great deal to do before they might attain the required standard.
PREPARING AND RUNNING TOACS STATIONS
PLANNING TOACS STATIONS
Each TOACS station has to be prepared in a similar manner, with its
description, instructions and procedures given in detail. These details
enable each station to be clearly identified, to be set up and run in the
manner intended, and allow candidates to be scored in a systematic and
standardized way. The information required for each station is indicated
in the sections below.
1. Cover sheet
This sheet contains information about the test station and allows it to be
clearly identifiable as part of a TOACS station bank. Its contents are:
= Station title
= Objectives to be assessed
= Area/ topic covered
= Type of station (active / static)
= Resource requirements
= Names of Examiners developing the station
= Date that station was developed
= Dates on which station was used in the examination
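One way to picture the cover sheet is as a record in the station bank. The sketch below is an illustration only: the class and its field names mirror the list above, but are hypothetical, not part of any CPSP system.

```python
# Sketch only: a cover-sheet record for a TOACS station bank (hypothetical).
from dataclasses import dataclass, field

@dataclass
class StationCoverSheet:
    title: str
    objectives: list[str]          # objectives to be assessed
    area_topic: str                # area/topic covered
    station_type: str              # "active" or "static"
    resources: list[str]
    developers: list[str]          # examiners who developed the station
    date_developed: str
    dates_used: list[str] = field(default_factory=list)  # exam dates

station = StationCoverSheet(
    title="Example station",
    objectives=["Objective to be assessed"],
    area_topic="Area/topic covered",
    station_type="active",
    resources=["table", "couch", "screen"],
    developers=["Examiner A", "Examiner B"],
    date_developed="2004-01-01",
)
print(station.title)
```

Keeping the record structured in this way makes it straightforward for the Examinations Department to append the dates on which the station is reused.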
2. Detailed outline of the station
In this section there needs to be a very detailed description of the station,
including the clinical task or skills that will be assessed. Of considerable
importance are the key features of the task that make it clear what
aspects of a candidate’s performance in the station will need to be
demonstrated if the task is to be completed satisfactorily.
= Station title
= Objectives to be assessed
= Area/ topic covered
= Skills/clinical task to be assessed
= Key features of clinical task
3. Instructions for Examiners
The procedure instructions are to be given, again in detail, so that the
station can be implemented in a standardized manner on repeated
occasions. Instructions to examiners on how to introduce the task to
candidates and how to manage the time allocated to each component
must be written down in a way that ensures they are presented in a
similar fashion to each candidate, and will include:
= Station title
= Introduction of the station and task to the candidate
= Instructions to candidates on how to proceed with the clinical task
= Indicative time allocation within the station
= Prompt questions to be used
= Ways in which examiners will be best able to observe or interact with candidates
= Any other pertinent information necessary to ensure that the station runs smoothly
Note: Standardized patient instructions (if real or simulated patients are used)
If standardized patients are being used, instructions should be written for
training each person to give the required answers and to respond to
physical examination in a consistent and standardized manner. The
patient has to be advised on how to react to expected situations,
questions and statements, and must also know the bounds within which
they may respond to unanticipated situations.
4. Scoring instructions
These enable the examiner to score each candidate in a standardized and
systematic manner through interaction and/or observation. The
candidate's performance is scored globally: impressions of the
performance in sections of the clinical task are assigned a score
according to their completeness and relevance against the defined key
features. These scores are entered on the structured scoring scale
developed in the station plan and are made available to the
Chief Examiner at the end of the examination.
= Station title
= Structured scoring scale including key features
= Weightings of station components
= Indication of acceptable score for candidates reaching a minimally acceptable standard of performance.
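As an illustration of how the component weightings and the minimally acceptable score might be applied, the following sketch combines an examiner's component scores into a weighted station total and checks it against the agreed minimum. The component names, weights and scores are hypothetical, not CPSP values.

```python
# Sketch only: weighted station total checked against a minimally
# acceptable score (all names and numbers are hypothetical).

def station_score(component_scores, weights):
    """Weighted total for one station; weights are assumed to sum to 1."""
    return sum(component_scores[c] * weights[c] for c in weights)

# Hypothetical station: three components, each scored 0-10 by the examiner.
weights = {"history": 0.4, "examination": 0.4, "communication": 0.2}
scores = {"history": 7, "examination": 6, "communication": 8}

total = station_score(scores, weights)   # weighted total on the 0-10 scale
minimally_acceptable = 5.0               # agreed by the station's examiners
passed = total >= minimally_acceptable
print(total, passed)
```

The weighting step is where the "weightings of station components" from the scoring instructions take effect; changing a weight shifts how much each component contributes to the pass/fail decision.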
ORGANIZING TOACS
PREPARATION BEFORE THE TOACS
l Form a group comprising subject experts and a medical educationist.
l Keep the objectives of training and the examination and a table of
specifications in front of you
l Develop TOACS stations according to the required framework
l Review the stations in the light of the objectives and the table of
specifications to ensure their consistency and validity with the
competencies required
l Make a list of staff needed for a smoothly running TOACS
l Prepare guidelines for examiners and candidates
l Ensure availability of examiners, patients and other required
examination material
l Draw a plan showing the layout of the TOACS stations
l Ensure the availability of the venue for the day of the examination
l Ensure that the stations are all set a day before the TOACS starts
(tables, chairs, couches, screens, bell, stopwatches etc.)
ON THE DAY OF THE TOACS
l Arrive well before time
l Re-check the stations
l Designate a timekeeper at a central place
l Distribute plans to invigilators and examiners
l Meet with the examiners and brief them about the plan of the day
l Explain the whole procedure in detail to the candidates
l Ensure that the examination runs smoothly and on time
l Attend to any irregularities that may arise
l Collect all candidate score sheets from all stations
l Note and record any situations that might have influenced candidate
performance
l Check with each station examiner on the standard required for a
minimal pass
l Ensure score sheets are conveyed to the Examinations Department
with appropriate security
POST EXAMINATION ANALYSIS
After collecting the response sheets from the stations (active or static), the scores will be collated and analyzed by the Examinations Department. This analysis will include detailed statistical analysis of each station's and each candidate's performance, which will be taken into account in the determination of the passing score. Other post hoc analyses, including reliability and inter-station correlations, will also be carried out.
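As an illustration of the reliability analysis mentioned above, the following sketch computes Cronbach's alpha across stations, treating each station as an item of the examination. The scores are hypothetical and the code is plain Python, not the Examinations Department's actual procedure.

```python
# Sketch only: Cronbach's alpha as a reliability estimate for an
# examination, with each station treated as an item (hypothetical data).

def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """rows: one list of station scores per candidate."""
    k = len(rows[0])  # number of stations
    station_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(station_vars) / total_var)

# Hypothetical scores for 5 candidates across 4 stations (0-10 each).
scores = [
    [6, 7, 5, 6],
    [8, 8, 7, 9],
    [4, 5, 4, 5],
    [7, 6, 6, 7],
    [9, 9, 8, 9],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

A low alpha, or a station whose scores correlate poorly with the rest, flags material for the item-level review that feeds into the determination of the passing score.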
GLOBAL SCORING
Research on the assessment of clinical competence has now clearly and consistently demonstrated that global scoring is more valid and reliable than detailed checklist scoring, particularly at postgraduate examination level. In practice this means that, although some structure and a clear definition of what is required for a candidate to perform a clinical task at an acceptable level are necessary, the scoring procedure must also take into account the quality and appropriateness of the performance. Many contextual variables influence clinical performance, including the type and severity of the presenting problem and the background of the patient. Communication and interpersonal skills are usually important in reaching an appropriate diagnosis and management plan, as are the professional and ethical considerations shown by the candidate.
Taking all these issues into consideration, a complex clinical task that requires a number of skills to be applied successfully can best be judged by senior clinicians with wide-ranging experience and expertise. Although examiners can be guided by what must be done to complete a task acceptably, the quality of, and clinical reasoning behind, the actions are often of greater importance and need to be observed or discussed. The ultimate test of clinical competence is not how much or how quickly a task is performed, but the manner in which it is performed and the outcomes of that performance. It is important, for example, not to penalize a competent candidate who reaches a satisfactory outcome in a shorter time and with fewer questions or investigations because he or she is extremely familiar with the clinical presentation. Similarly, it is important not to score a performance highly when most steps of a task are taken but the performance is disorganized, unsystematic and carried out without due regard for the patient's rights or condition.
Examiners of clinical competence require some flexibility to apply their professional judgement to a range of candidate performances within a standardized setting and task definition. Global scoring allows this to occur.
TOACS SHEETS
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS INSTRUCTIONS FOR EXAMINERS
STATION NO. _____________________________________________________________
TOPIC: __________________________________________________________________
COMPETENCE TO BE ASSESSED: __________________________________________
TYPE OF STATION: INTERACTIVE/STATIC: __________________________________
INSTRUCTIONS AS GIVEN TO CANDIDATE IN HIS / HER INSTRUCTION SHEET:
(Questions to be asked of the candidate are on the scoring sheet, if applicable.)
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
PROCEDURE INSTRUCTIONS FOR CANDIDATES
STATION NO: __________________________________________________________
TOPIC: _______________________________________________________________
TIME ALLOWED: _______________________________________________________
DESCRIPTION OF THE TASK TO THE CANDIDATES (information and instructions)
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS COVER SHEET
SERIAL NO: _______________________________________________________________
TOPIC: ___________________________________________________________________
COMPETENCE TO BE ASSESSED: ___________________________________________
TYPE OF STATION: INTERACTIVE/ STATIC: ___________________________________
RESOURCES REQUIRED: (Please write the number of items required where necessary and put a tick mark before each item)
Table    Chairs    Couch    Illuminator    Paper    Pencil/s    Eraser    Screen/s
Any other, please write the item and the number required below:
NAMES OF EXAMINERS DEVELOPING THE STATION:
DATE THAT STATION WAS DEVELOPED: _______________________________
(TO BE FILLED IN BY THE EXAMINATION DEPARTMENT)
Date when previously used | Difficulty index | Discrimination index
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS SCORING SHEET
STATION NO.: _______________________________
TOPIC: _____________________________________
EXAM CENTRE: _______________________________

COMPONENT | KEY FEATURES | AGREED ANSWERS | COMPONENT PASSING SCORE | AWARDED SCORE

PROMPT QUESTIONS: (if any)
Total: _____

EXAMINERS' NAME: ______________________  ______________________
EXAMINERS' SIGNATURES: ______________________  ______________________
COLLEGE OF PHYSICIANS AND SURGEONS PAKISTAN
TOACS SCORING SHEET
STATION NO.: _______________________________
TOPIC: _____________________________________
EXAM CENTRE: _______________________________

COMPONENT | KEY FEATURES | AGREED ANSWERS | COMPONENT SCORE

RATING SCALE: POOR / INADEQUATE / ADEQUATE / GOOD / EXCELLENT
PROMPT QUESTIONS: (if any)
Total: _____

EXAMINERS' NAME: ______________________  ______________________
EXAMINERS' SIGNATURES: ______________________  ______________________