A procedural skills OSCE: assessing technical and non-technical skills of internal medicine residents
Debra Pugh • Stanley J. Hamstra • Timothy J. Wood •
Susan Humphrey-Murto • Claire Touchie • Rachel Yudkowsky •
Georges Bordage
Received: 18 September 2013 / Accepted: 5 May 2014
© Springer Science+Business Media Dordrecht 2014
Abstract Internists are required to perform a number of procedures that require mastery
of technical and non-technical skills; however, formal assessment of these skills is often
lacking. The purpose of this study was to develop, implement, and gather validity evidence
for a procedural skills objective structured clinical examination (PS-OSCE) for internal
medicine (IM) residents to assess their technical and non-technical skills when performing
procedures. Thirty-five first to third-year IM residents participated in a 5-station PS-OSCE,
which combined partial task models, standardized patients, and allied health professionals.
Formal blueprinting was performed and content experts were used to develop the cases and
rating instruments. Examiners underwent a frame-of-reference training session to prepare
them for their rater role. Scores were compared by levels of training, experience, and to
evaluation data from a non-procedural OSCE (IM-OSCE). Reliability was calculated using
Generalizability analyses. Reliabilities for the technical and non-technical scores were 0.68
and 0.76, respectively. Third-year residents scored significantly higher than first-year
residents on the technical (73.5 vs. 62.2 %) and non-technical (83.2 vs. 75.1 %) components of the PS-OSCE (p < 0.05). Residents who had performed the procedures more frequently scored higher on three of the five stations (p < 0.05). There was a moderate disattenuated correlation (r = 0.77) between the IM-OSCE and the technical component of the PS-OSCE scores. The PS-OSCE is a feasible method for assessing multiple competencies related to performing procedures, and this study provides validity evidence to support its use as an in-training examination.

Keywords Assessment · Non-technical skills · OSCE · Post-graduate trainees · Procedures · Technical skills

D. Pugh · T. J. Wood · S. Humphrey-Murto · C. Touchie
Department of Medicine, University of Ottawa, Ottawa, ON, Canada

D. Pugh (corresponding author)
Division of General Internal Medicine, The Ottawa Hospital, General Campus, 501 Smyth Road, Box 209, Ottawa, ON K1H 8L6, Canada
e-mail: [email protected]

S. J. Hamstra
Departments of Medicine, Surgery and Anesthesia, University of Ottawa, Ottawa, ON, Canada

R. Yudkowsky · G. Bordage
Department of Medical Education, University of Illinois at Chicago, Chicago, IL, USA
Introduction
Graduates of Canadian Internal Medicine (IM) Residency programs are required to be
proficient in performing a number of medical procedures, such as thoracentesis and lumbar
puncture (Royal College of Physicians and Surgeons of Canada 2011). The formal
assessment of these skills, however, is often lacking. With no standardized and validated
approach available to assess residents’ competence in performing procedures, most
Canadian IM program directors continue to rely on informal assessment methods such as
logbooks that document the number of procedures done (Pugh et al. 2010). As a result, IM
residents, including recent graduates, do not feel adequately prepared to perform many
procedures expected of them (Card et al. 2006; Hicks et al. 2000; Huang et al. 2006;
Wickstrom et al. 2000).
One way to address this issue and formalize the learning experience is to apply a
mastery approach to teaching procedural skills. Using this method, trainees are taught to
perform procedures (e.g., on a partial task model) and their technical skills are then
observed until a pre-determined level of competence has been achieved (Barsuk et al.
2009; Wayne et al. 2008). Another well-described approach is the objective structured
assessment of technical skill (OSATS), which is a performance-based examination that
assesses the technical skills of surgical trainees as they perform a series of procedures on
bench models (e.g., excision of a skin lesion, tracheostomy) (Reznick et al. 1997).
Completing a procedure requires mastery of both technical and non-technical skills;
however, both of these approaches focus primarily on the technical aspects of procedural
competence (i.e., ensuring that certain steps are completed). The OSATS does
incorporate the use of an assistant, but the assistant's role is to follow instructions rather
than to challenge the candidates' ability to communicate and collaborate.
Using the CanMEDS framework (Frank 2005), several distinct but overlapping roles
can be used to describe the competencies required when performing procedures. For
example, when performing an endotracheal intubation, a physician must be able to obtain
informed consent from the patient or their substitute decision maker (Communicator),
work with a registered nurse and respiratory therapist to ensure adequate preparation and
flow of the procedure (Collaborator), remain calm if the patient has an unanticipated
deterioration in their clinical status (Professional), and perform the technical aspects of the
procedure to properly place the tube (Medical Expert). Although assessing multiple
competencies with one test poses several challenges, Jefferies et al. (2007) have shown that
it is feasible to use an objective structured clinical examination (OSCE) to assess multiple
CanMEDS roles in an examination of non-procedural clinical skills.
The integrated procedural performance instrument (IPPI) (Kneebone et al. 2006) is a
tool that has been used to teach and assess technical skills in context. Using a hybrid
model, both standardized patients (SPs) and partial task models are included in a
performance-based examination to assess non-technical (i.e., professionalism and com-
munication) as well as technical skills. The IPPI has been shown to be feasible and to have
acceptable psychometric characteristics when used to assess some of the procedures
commonly performed by medical students (e.g., urinary catheterization or venipuncture)
and surgical trainees (e.g., wound closure and cast application) (LeBlanc et al. 2009;
Moulton et al. 2009). However, the IPPI has not been used to assess the invasive proce-
dures performed by IM residents, nor has it been used to assess the collaborative skills that
are often required when performing complex medical procedures.
Given the lack of standardized and validated means of assessing procedural skills in
internal medicine, we developed and implemented an innovative procedural skills OSCE
(PS-OSCE) for IM residents to simultaneously assess both the technical and non-technical
skills (including communication, collaboration, and professionalism) required when per-
forming procedures. Modern validity theory was used as a framework to gather multiple
sources of evidence for the validity of the scores from this examination (Messick 1989;
American Educational Research Association et al. 1999; Downing 2003). In this frame-
work, the onus is on the experimenters to demonstrate construct validity, that is, to what
degree the scores from an assessment actually measure the underlying construct of interest.
Validity is demonstrated by collecting evidence that supports or refutes the proposed
meaning of a set of scores, and this evidence can be grouped into five categories or sources.
The five sources of evidence gathered to demonstrate validity for the scores from the PS-OSCE were: content (i.e., test items are representative of the construct of interest),
response process (i.e., evidence of data integrity including: clear test instructions for
candidates, rigorous rater training, methods for scoring, and data entry), internal structure
(i.e., psychometric properties of the exam including: score reliability, exam difficulty, and
inter-item correlations), relations with other variables (i.e., convergent and discriminant
evidence, including correlations to other variables), and consequences (i.e., impact on
learners, instructors, and curriculum).
Methods
Study participants
The PS-OSCE was a formative but mandatory in-training examination for first to third-year
residents (PGY1-3) in the University of Ottawa IM Residency Program in 2012. The exam
was administered near the end of the academic year, by which time all trainees would be
expected to have participated in at least one formal, hands-on procedural teaching
session. Ethics approval was obtained from the University of Illinois at Chicago (as part of
DP's thesis requirements) and the University of Ottawa. Examinees received an email
invitation to participate in the study and written consent was obtained. Non-consenting
examinees still participated in the PS-OSCE but, as per Research Ethics Board rules, their
results were not included in any analyses.
Using five simultaneous tracks and two consecutive administrations (i.e., early-evening and late-evening testing periods), all participants were assessed in one evening. Each track consisted of a series of rooms distributed throughout the testing centre and included all five OSCE stations; each station had a different examiner, SP, and allied health professional (AHP). Each resident completed the entire exam by going through only one track (e.g., track A, B, or C). As such, residents were each examined by five different raters, and all residents in a given track (e.g., track A) saw the same five raters, SPs, and AHPs.
Content
We developed a PS-OSCE with five 18-min stations, each containing: (1) a partial task
model (i.e., for lumbar puncture, endotracheal intubation, central venous catheter insertion,
thoracentesis, and knee joint aspiration), (2) an SP, and (3) an AHP.
The examination blueprint was based on the procedures required of Canadian IM
graduates (Royal College of Physicians and Surgeons of Canada 2011). The five proce-
dures were selected (from a total of ten possible procedures) because of feasibility. That is,
the procedures were amenable to testing in an OSCE setting and suitable partial task
models were available. Other important skills that were not included on the examination
were: abdominal paracentesis and peripheral arterial line insertion (because of the lack of
suitable commercially available partial task models at the time of the study); resuscitation
skills (as this is assessed through a separate course at our institution); electrocardiogram
interpretation (as this skill could be assessed through more cost-efficient ways); and airway
management (as this was essentially assessed in the intubation station).
In addition, four of the seven CanMEDS roles were represented in each case, that is, the
Professional, the Communicator, the Collaborator, and the Medical Expert. These roles
were judged to be most relevant for the proper execution of the procedures being assessed.
The cases were written by the investigators (DP, CT, SH-M) who have experience with
OSCE case development at a local and national level. As an example, for the thoracentesis
station, residents had to: obtain consent from an angry SP; collaborate with an inexperi-
enced nurse (actor) who accidentally contaminates the sterile field; perform the thora-
centesis on a model; and discuss the results of a post-procedure radiograph demonstrating a
pneumothorax with the SP. Each case was reviewed by content experts and pilot-tested.
Checklists were developed by groups composed of 7–12 content experts, chosen for
their familiarity with the procedures of interest. An online, iterative survey was used to
obtain consensus about checklist items. For items in which unanimous consensus could not
be reached (ranging between zero and four checklist items for each of the five cases),
decisions to include the items were based on a majority rule.
Two separate rating scales were developed for each of the three non-technical domains
assessed (i.e., professionalism, collaboration, and communication skills). Each rating scale
was developed to assess different non-technical skills and the descriptors referred to
specific behaviours rather than norm-referenced anchors. The scales were pilot-tested in
two settings. Using an iterative process, the rating scales were then modified following
review by clinicians with expertise in the area of assessment to ensure clarity and ease of
use; see "Appendix".
Response process
Residents’ technical skills were rated by physician examiners (PEs) using task-specific
checklists. Residents received a score of 1 for each item done correctly, and 0 for items
unsatisfactorily attempted or not done. Non-technical skills were assessed using six 7-point
rating scales.
Faculty from the University of Ottawa were recruited as PEs and underwent a 2-h
frame-of-reference training session to introduce them to the rating instruments, prepare
them for their rater role, and ensure the accuracy of their ratings. During their training, they
watched two videos of simulated PS-OSCE stations, and rated the candidates’ non-tech-
nical performance. Facilitators collected the forms and presented data on the distribution of
scores, and participants were led in a discussion about their individual ratings. They then
re-watched and re-rated the training videos. This resulted in four separate rating forms for
each rater (i.e., initial and revised ratings for each video). A total score for each rating form
was calculated by summing the six ratings, resulting in a maximum score of 42. Mean total
scores were then compared using a one-sample t test, and standard deviations were
compared using an F test.
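For illustration, the following is a minimal sketch of how such an analysis could be run; it is not the authors' code. The rater count of 17 is inferred from the F(16, 16) degrees of freedom reported in the Results, and the rating values are simulated.

```python
# A minimal sketch (not the authors' code) of the rater-training analysis: initial and
# revised total ratings (six 7-point scales summed, max 42) for the same raters are
# compared with a one-sample t test on the paired differences, and the spread of the
# ratings with a variance-ratio F test. All values below are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
initial = rng.normal(29.6, 4.57, size=17)  # hypothetical initial total ratings
revised = rng.normal(32.0, 2.76, size=17)  # hypothetical revised total ratings

# One-sample t test on the rating change (H0: mean change = 0)
t_stat, t_p = stats.ttest_1samp(revised - initial, popmean=0.0)

# Variance-ratio F test (H0: equal variances); one-sided p from the F distribution
f_stat = np.var(initial, ddof=1) / np.var(revised, ddof=1)
f_p = stats.f.sf(f_stat, len(initial) - 1, len(revised) - 1)

print(f"t({len(initial) - 1}) = {t_stat:.2f}, p = {t_p:.3f}")
print(f"F({len(initial) - 1}, {len(revised) - 1}) = {f_stat:.2f}, p = {f_p:.3f}")
```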
Immediately preceding the exam, candidates and PEs participated in separate orienta-
tion sessions to ensure that all exam instructions were clear. The SPs and AHPs received
training to ensure that the portrayal of their roles was accurate and consistent.
During the administration of the PS-OSCE, staff monitored data entry by reviewing the
checklist and rating scale forms completed after the first and second rounds of candidates.
Data entry was performed by experienced staff with quality control checks.
Internal structure
Exam scores were reported separately for the technical (checklist) and non-technical
(rating scale) components of each station. Scores for each station were calculated by
converting the respective checklist score or rating into a percentage. Total scores for each measure
(i.e., technical and non-technical) were determined by averaging the scores on the five
stations. Corrected item-total correlations for the stations were calculated for technical and
non-technical scores. The correlation between total technical and non-technical scores was
calculated using Pearson’s correlation coefficient.
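To make these computations concrete, the following sketch (simulated data, not the study data; variable names are ours) expresses station scores as percentages, averages them into totals, and computes each station's corrected item-total correlation against the mean of the remaining stations:

```python
# A minimal sketch (simulated data) of the internal-structure computations.
import numpy as np

rng = np.random.default_rng(1)
n_residents, n_stations = 35, 5
technical = rng.uniform(40, 95, size=(n_residents, n_stations))      # % per station
non_technical = rng.uniform(50, 95, size=(n_residents, n_stations))  # % per station

def corrected_item_total(scores: np.ndarray) -> np.ndarray:
    """Correlate each station with the mean of the other stations."""
    out = np.empty(scores.shape[1])
    for s in range(scores.shape[1]):
        rest = np.delete(scores, s, axis=1).mean(axis=1)
        out[s] = np.corrcoef(scores[:, s], rest)[0, 1]
    return out

total_tech = technical.mean(axis=1)          # total technical score per resident
total_non_tech = non_technical.mean(axis=1)  # total non-technical score per resident

print("corrected ITCs (technical):", np.round(corrected_item_total(technical), 2))
# Pearson correlation between total technical and non-technical scores
print("r =", np.round(np.corrcoef(total_tech, total_non_tech)[0, 1], 2))
```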
Reliability of the PS-OSCE scores was assessed using Generalizability analyses. This model allows one to identify which variables (i.e., participants, tracks, training level, or stations) contribute the most and the least to the overall variability in the scores. The amount of variance accounted for by each variable is expressed as a percentage of the overall variability, with higher percentages indicating a greater contribution. For the Generalizability model that was used, PGY level (l) and track (t) were crossed with stations (s). Participants were nested within l and t because each participant belongs to a unique lt combination. Because there was only one PE per station, examiners were confounded with stations and were therefore not included as a separate facet in the model.
The variance components that were generated from this model were also used to derive estimates of reliability for the examination. A relative error coefficient was assumed, and the G-coefficient was calculated as G = σ²(p:lt) / [σ²(p:lt) + σ²(ps:lt)/n_s], where n_s is the number of stations. UrGenova was used to generate the variance components for this analysis (Brennan 2001).
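As a worked check, substituting the variance components reported later in Table 2 into this formula (with n_s = 5 stations) reproduces, to rounding, the G-coefficients of 0.68 and 0.76 reported under Internal structure:

```latex
% Worked check with the Table 2 variance components and n_s = 5 stations:
G_{\text{technical}} = \frac{58.13}{58.13 + 136.57/5} = 0.680, \qquad
G_{\text{non-technical}} = \frac{60.96}{60.96 + 93.62/5} = 0.765
```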
Relations to other variables
Self-reported (i.e., recollected) data on the number of times a resident had performed each procedure (i.e., 0–1 time, 2–4 times, 5–10 times, or >10 times) were collected using a pre-OSCE survey. Differences in performance based on year of training (i.e., PGY1-3) and experience (i.e., number of times the procedure had been performed) were compared using univariate ANOVA. Effect sizes were calculated using partial eta squared (ηp²). Post hoc independent t tests were used to further explore any differences found.
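A minimal sketch of this comparison follows (not the authors' code; the group means and sizes are seeded from Table 3, but the individual scores are invented). For a one-way design, partial eta squared reduces to SS_between / (SS_between + SS_within):

```python
# A minimal sketch (simulated data) of the comparison by training year: a one-way
# ANOVA across PGY levels, with partial eta squared computed from sums of squares.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(mean, 10, size=n)
          for mean, n in [(62.2, 15), (66.3, 10), (73.5, 10)]]  # PGY-1, -2, -3

f_stat, p_val = stats.f_oneway(*groups)

grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
eta_p2 = ss_between / (ss_between + ss_within)  # partial eta^2 for a one-way design

df_within = sum(len(g) for g in groups) - len(groups)
print(f"F(2, {df_within}) = {f_stat:.2f}, p = {p_val:.3f}, partial eta^2 = {eta_p2:.2f}")
```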
Scores were also compared to a non-procedural internal medicine OSCE (IM-OSCE), a
mandatory in-training exam for IM residents; it was administered 2 months before the PS-OSCE. The IM-OSCE was composed of 1 communication, 4 structured oral, and 4 physical examination stations, and had a test-score reliability of 0.74 (Cronbach's α). Correlations
between PS-OSCE and IM-OSCE scores were calculated using Pearson’s correlation coef-
ficient; disattenuated correlations were calculated using the known reliabilities of the exams.
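The standard disattenuation formula (Spearman's correction for attenuation) divides the observed correlation by the geometric mean of the two score reliabilities; this is presumably the correction applied, although the paper does not spell out the computation:

```latex
% Spearman's correction for attenuation: the observed correlation r_{xy} is divided
% by the geometric mean of the two score reliabilities r_{xx} and r_{yy}.
r_{\text{disattenuated}} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}}
```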
Consequences
Residents were surveyed after the PS-OSCE to determine the acceptability of the exam.
Costs for the administration of the PS-OSCE were calculated and compared to a hypo-
thetical non-procedural OSCE of the same size at our institution.
Residents received a written summary of their scores for each station (technical and
non-technical), and the results were forwarded to the IM residency program director. For
the purpose of this paper, standard setting procedures are not discussed because, while
mandatory, the PS-OSCE was a formative examination.
Results
Forty-one IM residents took the examination and 35 consented to participate in the study
(n = 15 PGY-1, n = 10 PGY-2, n = 10 PGY-3). The six non-consenting participants
were distributed evenly between the first 2 years of training; their results were not included
in any analyses.
Content
The PS-OSCE blueprint represented five of the ten procedures required of graduates of
Canadian IM programs, and four of the seven CanMEDS roles.
The scoring instruments were developed with input from content experts. This resulted in five
station-specific checklists, each composed of 16–24 items, as well as a set of six 7-point
rating scales used to assess non-technical skills across all stations.
Response process
Following frame-of-reference training, mean PE ratings changed significantly, from 21.0 to 18.5 for video one and from 29.6 to 32.0 for video two (p < 0.001). Based on F tests, the SD of the ratings decreased significantly with the second ratings, from 4.82 to 2.40 for video one [F(16, 16) = 3.372, p < 0.01] and from 4.57 to 2.76 for video two [F(16, 16) = 2.74, p < 0.05].
Internal structure
Mean total scores for the technical and non-technical components of the exam were 66.6 and
77.6 %, with total scores ranging from 36.9 to 82.6 and 58.6 to 90.0 %, respectively; see
Table 1. Corrected item-total correlations ranged from 0.27 to 0.62 and from 0.15 to 0.50
for technical and non-technical scores, respectively.
There was a significant correlation between total scores for the technical and non-technical components of the exam (r = 0.76, p < 0.001); correlations between technical and non-technical components for each station were all significant (r = 0.35–0.87, p < 0.05).
Based on a Generalizability analysis, the track to which a participant was assigned
accounted for 6 % (for technical skills) and 0 % (for non-technical skills) of the variability
in scores and PGY level accounted for 4 % of the variance in scores for both measures; see
Table 2. Participants accounted for 20 % (for technical) and 22 % (for non-technical) of
the variance, indicating that there were differences between residents within a PGY level
on each track. Stations did not account for a significant amount of the variation in the
scores but the interaction between track and station accounted for 15 % of the variation in
technical skills and 33 % of the variation in non-technical skills, indicating that there may
have been differences in the way stations were scored across tracks.
These variance components were used to generate Generalizability (G) coefficients for
scores on both components and the resulting coefficients were 0.68 and 0.76 for the
technical and non-technical components, respectively. A D study revealed that to achieve a
G-coefficient of 0.80, at least 10 stations would be required for the technical component,
and 7 stations for the non-technical component of the exam.
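These projections follow from the same G-coefficient formula with the number of stations replaced by a hypothetical n′_s; using the Table 2 variance components, both projected values clear the 0.80 threshold:

```latex
% D study: projected G-coefficients for alternative station counts n'_s,
% using the Table 2 variance components.
G_{\text{technical}}(n'_s = 10) = \frac{58.13}{58.13 + 136.57/10} \approx 0.81, \qquad
G_{\text{non-technical}}(n'_s = 7) = \frac{60.96}{60.96 + 93.62/7} \approx 0.82
```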
Relations to other variables
For both the technical and non-technical components of the exam, senior residents scored
significantly higher than more junior residents; see Table 3. Post hoc pair-wise compari-
sons, using Tukey’s HSD test, showed that PGY-3 residents scored higher than PGY-1
residents on the technical (73.5 vs. 62.2 %, p = 0.029) and non-technical (83.2 vs. 75.1 %,
p = 0.024) components.
Scores for the technical component of the stations differed as a function of the number
of times a resident reported having previously performed a procedure for the central line,
lumbar puncture, and thoracentesis stations; see Table 4. Independent t tests showed the
differences were between those who had performed the following procedures more frequently: central line insertion (>10 times vs. 2–4 times; >10 times vs. 5–10 times); lumbar puncture (>10 times vs. 0–1 time; 5–10 times vs. 0–1 time); and thoracentesis (2–4 times vs. 0–1 time), p < 0.05. Scores for the non-technical component of the stations did not
differ significantly by the amount of previous procedural experience.
Of the 35 residents in this study, 27 had also participated in the IM-OSCE. There was a
significant positive correlation between the IM-OSCE scores and total score for the
technical (r = 0.47, p = 0.013), but not the non-technical (r = 0.35, p = 0.074) scores of
the PS-OSCE. Disattenuated correlations between the IM-OSCE scores and total scores
were calculated: r = 0.77 for the technical, and r = 0.54 for the non-technical compo-
nents, respectively.
Consequences
When surveyed about their experience with the PS-OSCE compared to assessments using a
partial task model alone, 23 residents (66 %) reported that an OSCE with SPs and AHPs
allowed for a more valid assessment of their skills than a model alone; eight (23 %)
reported that it depends on the skill while four (11 %) reported that assessing technical
skills alone is more valid.
Costs related to the administration of this 5-station, 5-track OSCE were approximately $33,500 CAN, including payment to PEs (~$10,000), SPs (~$4,000), AHPs (~$7,500), SP trainers ($3,500), OSCE staff (~$4,000), catering (~$2,500), data entry (~$500), administrative fees (~$1,000) and medical equipment (~$500). Purchase of 20 models
would have added approximately $31,000 to the total cost. The estimated cost of admin-
istering a non-procedural OSCE of the same size at our institution would be approximately
$21,750.
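As a quick arithmetic check, the itemized costs sum to the reported total:

```latex
% Itemized PS-OSCE costs (CAN$) sum to the reported ~$33,500:
10{,}000 + 4{,}000 + 7{,}500 + 3{,}500 + 4{,}000 + 2{,}500 + 500 + 1{,}000 + 500 = 33{,}500
```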
Table 1 Technical and non-technical mean scores and corrected item-total correlations according to stations

Station | Technical score: mean % (SD); min/max; corrected ITC | Non-technical score: mean raw score (SD); mean converted to % (SD)(a); min/max; corrected ITC

Central line | 65.1 (18.0); 21.7–91.3; 0.62 | 5.4 (0.9); 77.2 (12.7); 45.2–100.0; 0.15
Lumbar puncture | 70.1 (13.9); 33.3–90.5; 0.28 | 5.6 (0.8); 79.3 (11.5); 47.6–100.0; 0.15
Intubation | 64.6 (16.1); 19.1–90.5; 0.49 | 5.1 (0.9); 73.4 (12.2); 47.6–97.6; 0.49
Knee | 70.9 (14.1); 37.5–93.8; 0.27 | 5.6 (0.7); 80.7 (9.9); 61.9–97.6; 0.37
Thoracentesis | 62.1 (19.1); 20.8–87.5; 0.59 | 5.4 (1.3); 77.2 (17.9); 38.1–100.0; 0.50
Total score | 66.6 (11.0); 36.9–82.6; n/a | 27.1 (2.8); 77.6 (7.9); 58.6–90.0; n/a

ITC item-total correlation
(a) Ratings were converted from 1–7 to 0–6, and then a percentage was calculated
Table 2 Generalizability analysis by skills type

Facet | Technical: variance component (% variance) | Non-technical: variance component (% variance) | Explanation

t | 16.33 (6) | 1.32 (0) | To what degree do scores on the five tracks differ?
l | 10.63 (4) | 10.48 (4) | To what degree do scores assigned to each of the training levels differ within each track?
p:lt | 58.13 (20) | 60.96 (22) | To what degree do scores assigned to participants within a training level and track differ? This is the object of measurement for this study.
s | 0 (0) | 0 (0) | To what degree do scores on the five stations differ?
tl | 5.76 (2) | 0 (0) | To what degree do scores assigned to each training level differ by track?
ts | 41.85 (15) | 90.86 (33) | To what degree do station scores differ as a function of track?
ls | 0.79 (0) | 13.12 (5) | To what degree do station scores differ as a function of training level?
tls | 14.57 (5) | 9.02 (3) | To what degree do station scores differ across the training levels and track?
ps:lt | 136.57 (45) | 93.62 (34) | Represents a combination of unexplained error and the degree to which station scores for participants within a training level and track differed

t Track, l level (post-graduate year), s stations, p participant
Scores on both the technical and non-technical components of the exam (see Table 1)
revealed that the lowest mean scores were on the central line insertion, intubation, and thoracentesis
stations.
Discussion
The standardized, formal assessment of IM residents’ ability to perform procedures was
effectively achieved with this PS-OSCE. Interpreting the scores from the PS-OSCE, as
with any exam, requires analysis of several sources of validity evidence, which are dis-
cussed successively in the following sections.
Content
Content evidence refers to ensuring that the construct being assessed is accurately and com-
pletely represented on a test (Cook and Beckman 2006). In this case, the PS-OSCE blueprint
allowed for the assessment of residents’ skills in performing five procedures, which represents
half of all procedures required of graduates of Canadian IM programs, and four of the seven
CanMEDS roles. Although five procedures were not tested, resident abilities may generalize to
other procedures because of the similarity in skills across procedures (e.g., the technical skills
required for thoracentesis may be transferable to paracentesis, given that the basic procedural
steps and equipment are similar). Given that trainees have variable opportunities to perform
procedures while being observed by faculty (Pugh et al. 2010; Boots et al. 2009), this was an
efficient way to formally assess a number of skills. Attention to pilot-testing and revisions of the
cases by content experts also helped to ensure that the PS-OSCE was representative of the
challenges faced by residents when performing procedures. A rigorous approach to the
development and pilot-testing of the instruments, including groups of content experts and a
consensus survey, provides further evidence to support content validity.
Response process
To ensure that the results reported to candidates are valid, one must ensure that the ratings
provided by examiners are accurate. An important step when introducing a new assessment
Table 3 Technical and non-technical mean scores by levels of training

Level | Total score technical (SD) | Total score non-technical (SD)
PGY-1 (n = 15) | 62.2 (11.8) | 75.1 (7.2)
PGY-2 (n = 10) | 66.3 (9.5) | 75.7 (7.9)
PGY-3 (n = 10) | 73.5 (7.9) | 83.2 (6.5)
Total (n = 35) | 66.6 (11.0) | 77.6 (7.9)
F(2, 32) | 3.66 | 4.32
p | 0.037 | 0.022
ηp² | 0.19 | 0.21
Table 4 Technical and non-technical mean scores by number of times previously performed and by station type

Scores are mean % (SD); N = number of residents in each experience group.

Central line
0–1: N = 3, technical 60.87 (15.68), non-technical 73.81 (11.90)
2–4: N = 12, technical 58.70 (20.18), non-technical 76.39 (12.82)
5–10: N = 9, technical 59.90 (16.38), non-technical 71.42 (13.63)
>10: N = 11, technical 77.47 (12.11), non-technical 83.77 (10.26)
Total: N = 35, technical 65.09 (18.03), non-technical 77.21 (12.66)
F(3, 31): technical 2.98, non-technical 1.82; p: 0.047, 0.164; ηp²: 0.22, 0.15

Lumbar puncture
0–1: N = 7, technical 59.86 (12.25), non-technical 74.15 (8.95)
2–4: N = 18, technical 69.31 (14.17), non-technical 78.97 (10.27)
5–10: N = 8, technical 77.38 (9.78), non-technical 84.22 (15.99)
>10: N = 2, technical 83.33 (10.10), non-technical 79.76 (5.05)
Total: N = 35, technical 70.07 (13.93), non-technical 79.25 (11.47)
F(3, 31): technical 3.09, non-technical 0.97; p: 0.041, 0.422; ηp²: 0.23, 0.09

Intubation
0–1: N = 8, technical 54.17 (22.90), non-technical 69.64 (11.15)
2–4: N = 12, technical 69.44 (9.62), non-technical 69.44 (10.97)
5–10: N = 12, technical 65.87 (14.76), non-technical 77.78 (12.97)
>10: N = 3, technical 68.25 (16.72), non-technical 81.74 (11.00)
Total: N = 35, technical 64.63 (16.14), non-technical 73.40 (12.16)
F(3, 31): technical 1.64, non-technical 1.78; p: 0.200, 0.171; ηp²: 0.14, 0.15

Knee
0–1: N = 27, technical 69.68 (14.58), non-technical 80.34 (10.05)
2–4: N = 6, technical 75.00 (14.25), non-technical 80.95 (11.27)
5–10: N = 2, technical 75.00 (8.83), non-technical 84.52 (1.68)
>10: N = 0, technical n/a, non-technical n/a
Total: N = 35, technical 70.89 (14.13), non-technical 80.68 (9.85)
F(2, 32): technical 0.42, non-technical 0.16; p: 0.659, 0.850; ηp²: 0.03, 0.01

Thoracentesis
0–1: N = 18, technical 53.24 (20.64), non-technical 69.71 (19.03)
2–4: N = 11, technical 69.70 (13.45), non-technical 85.07 (12.14)
5–10: N = 3, technical 73.61 (10.49), non-technical 83.33 (15.61)
>10: N = 3, technical 76.39 (4.81), non-technical 87.30 (17.87)
Total: N = 35, technical 62.14 (19.08), non-technical 77.21 (17.88)
F(3, 31): technical 3.39, non-technical 2.49; p: 0.030, 0.079; ηp²: 0.25, 0.19
instrument is to ensure that raters undergo training with the instrument in question. Our
PEs underwent frame-of-reference training, which was used to help them develop per-
formance schemas in order to arrive at a consensus regarding rating varying levels of
performance (Castorr et al. 1990; Gorman and Rentsch 2009). This helped to prepare them
for their rater role and to ensure that all raters were applying the instruments in a stan-
dardized way, as demonstrated by the decrease in variability of scores with their revised
ratings. The decrease in variability suggests that, after the training sessions, raters were
scoring performance in a more unified way. Quality assurance measures, such as ensuring
that all ratings were being completed after each candidate, also helped to ensure the
accurate collection of data for this examination.
Internal structure
Evidence for internal structure relates to the psychometric properties of an examination
(Downing 2004). The scores from this examination were found to be reliable, especially
for a formative examination. If greater levels of reliability were required, one could
incorporate additional shorter cases to minimize the impact on cost and feasibility. The
incorporation of more stations is certainly feasible, especially if one were to include a
smaller number of candidates (e.g., only PGY-3 residents).
However, given content specificity, the restricted measurement domain (at least for
technical skills) and the difficulties with simulating certain conditions, the number of tasks
that can be modeled and incorporated in an examination is finite. Therefore, while the
consistency of the scores is important, and certainly dependent on the number of tasks, the
choice of what to measure (validity) is key.
Generalizability analyses were useful in identifying various sources of error in the PS-
OSCE. Although track by itself did not account for much of the variability in scores, there
was a significant interaction between track and station, suggesting that stations were scored
differently across tracks. This may relate to rater differences on the tracks, station order
differences within the tracks, or it may represent actual differences in participants’ ability,
given that there were small numbers of participants in each track and they were not equally
distributed by PGY level. In future studies, the use of more than one rater per station could
help to explore rater reliability. The analysis also showed that there were differences
between residents within a PGY level within tracks; that is, not all PGY-3s have equal
ability.
Other measures of internal structure include scenario difficulty and correlations between
station scores. In this study, corrected item-total correlations ranged from low to moderate
for both technical and non-technical skills. One would not expect high correlations
between stations for technical ability, given that the stations were developed to measure
different procedural skills (i.e., case specificity) (Norman et al. 2006). However, it is
somewhat surprising that the item-total correlations for the non-technical skills were not
higher since the same three skills (i.e., communication, collaboration, and professionalism)
were assessed using the same rating scales in all stations. Further analyses showed that
performance on the non-technical skills was moderately to highly correlated with technical
skills, which may indicate that those who are more skilled in performing a given procedure
have more spare attentional capacity, resulting in a greater ability to communicate
effectively, collaborate with others, and behave professionally. Non-technical skills are
largely contextual, and in the case of executing procedures, they may be dependent on
trainees’ technical ability (Ginsburg et al. 2000).
Relations to other variables
When considering the relationship between measures of the construct of interest with other
variables, one must consider both convergent and discriminant validity as sources of
evidence (Campbell and Fiske 1959). For procedural skills, it might be expected that more
senior trainees and those who have performed a given procedure more frequently would
perform better. In this study, as expected, PGY-3 residents performed better overall than
PGY-1 residents on both the technical and non-technical components of the stations.
Although the numbers of participants in each sub-group were small, the effect size was
moderate (i.e., ηp² > 0.11).
Similarly, for three stations (central line, lumbar puncture and thoracentesis), technical
skill performance was better for those who had performed the procedure more often.
Again, although group numbers were small, the effect size was moderate. For the knee and
intubation stations, however, greater experience did not result in higher scores, perhaps
because of the relative paucity of participants’ experience with these procedures. Most
residents (27 out of 35) had never performed a knee aspiration, or only once, and only three
had performed more than ten endotracheal intubations.
Technical performance on the PS-OSCE correlated moderately with performance on a
non-procedural OSCE, while non-technical performance did not correlate significantly.
Although both the IM-OSCE and the technical component of the PS-OSCE were designed
to assess different constructs, they are both purportedly assessing the Medical Expert role,
and so one would expect a positive, yet moderate correlation between the two. Conversely,
the IM-OSCE does not attempt to assess the non-Medical Expert roles (with the exception
of one station designed to assess communication skills), so it is not surprising that scores
did not correlate with the non-technical scores on the PS-OSCE (i.e., discriminant
validity).
Consequences
The results from this study have important implications for our IM residency program. As
assessment drives learning (Kromann et al. 2009), we expect residents to demand more and
more opportunities to practice procedures and receive feedback on their strengths and
weaknesses. As a formative examination, the information collected can help identify areas
for improvement for individual residents, as well as identify potential weaknesses in the
current procedural skills curriculum. Overall, scores on the PS-OSCE were low, which
calls for program revisions or adjustments regarding procedural skills.
The number of procedures required to achieve competency varies greatly by
individual (Naik et al. 2003), and it is clear that trainees need more experience with
procedures to become proficient, not just more time in-training. One cannot assume that
more senior trainees will be competent in performing routine procedures unless they have
had sufficient opportunities to practice their skills over time. For each of the procedures
studied, there was great variability in the amount of experience that residents had accu-
mulated, with many residents having never performed a given procedure, despite the fact
that this examination was near the end of the academic year. As emphasized by Ericsson’s
theory of expertise, trainees require the opportunity for deliberate practice with
feedback in order to develop their technical skills (Ericsson 2008). Training programs must
ensure that trainees are gaining sufficient and repeated exposure to procedures over time,
and that formal assessment of these skills is occurring on a regular basis.
Limitations and strengths
Limitations of this study include the relatively small sample size and use of a single
institution. Although the procedures we tested are important for all Canadian IM graduates,
the requirements differ from those of other countries. For example, the American
Board of Internal Medicine requires residents to be knowledgeable about the indications
for a number of procedures and their interpretation; however, its trainees are not required
to demonstrate competence in actually performing any of the procedures tested in our PS-
OSCE (American Board of Internal Medicine 2013). In addition, the cost, although
comparable to that of other OSCEs, may limit the feasibility for some institutions to
implement a similar examination.
Strengths of this study include the use of a validity framework and the incorporation of
non-technical skills. Several sources of evidence for the validity of the scores from the
exam were systematically presented including: the process for the development of the
cases and the rating instruments, extensive rater training, inclusion of Generalizability
analyses and calculation of effect sizes, comparison to performance on another exam, and
discussion about cost and feasibility. The incorporation of SPs and AHPs allowed for the
assessment of trainees in a more realistic and complex setting than if part-task models
alone had been used.
Although this was a formative examination, it could potentially be used as a summative,
high stakes assessment. Future steps will include developing PS-OSCE stations to assess
IM residents’ ability to perform other medical procedures and determining a fair and valid
process for setting a passing standard that ensures mastery of the skills.
Conclusion
IM residents are required to be proficient in performing a number of medical procedures,
and yet, due to limited opportunities to perform some of these procedures, these skills are
being performed and assessed infrequently. It is imperative to ensure that residents’ ability
to perform procedures, from both technical and non-technical perspectives, is being
assessed, and the PS-OSCE is one way to accomplish this goal. Although the implemen-
tation of a PS-OSCE is time-consuming, it is a feasible and worthwhile initiative for IM
programs, given the importance of resident education and patient safety.
Acknowledgments We would like to acknowledge Dr. John (Jack) R. Boulet, Ms. Lesley Ananny, and the staff at the Ottawa Exam Centre, Academy for Innovation in Medical Education (AIME), and University of Ottawa Skills and Simulation Centre (uOSSC) for their support and valuable advice. Funding for this project was provided by the University of Ottawa through the Academy for Innovation in Medical Education, the Department of Medicine, and the Office of Postgraduate Residency Education. Dr. Pugh was also partly supported by the W. Dale Dauphinee Fellowship from the Medical Council of Canada.
Appendix: Rating scales for non-technical skills
In each scale below, the four descriptors are behavioural anchors ordered from lowest to highest performance along the 7-point scale.

Professionalism: Altruism; respect; compassion; integrity and honesty; disclosure of errors and adverse events

1 2 3 4 5 6 7
- Ignores patient's comfort, needs or rights (e.g., dishonest; failure to use appropriate analgesia)
- On several occasions, fails to acknowledge patient's comfort, needs, or rights
- Acknowledges and attempts to tend to patient's comfort, needs, or rights
- Attentive to patient's comfort, needs and rights (e.g., enquires frequently about patient's comfort; fully discloses errors)

1 2 3 4 5 6 7
- Loses control/composure (e.g., raised voice; inappropriate language)
- Often appears flustered during encounter
- Maintains composure throughout most of the encounter, but has difficulty in stressful situations
- Maintains composure throughout encounter, even when under stress

Collaboration: Appropriate delegation; respect and understanding for others' roles; prevention and negotiation of conflict

1 2 3 4 5 6 7
- Does not delegate tasks or involve team members when appropriate (e.g., ignores other team members)
- Seems unsure of role of other team members and/or delegates inappropriately (e.g., asks nurse for dose of medications; takes over the roles of others)
- Delegates most tasks appropriately but does not always involve team members in decision-making
- Appropriately delegates tasks, involves team members in decision-making when appropriate (e.g., integrates information from others when planning)

1 2 3 4 5 6 7
- Interferes with team functioning; and/or escalates conflict (dismissive; condescending; hostile)
- Often avoids conflict with team members by ignoring rather than addressing issues
- Attempts to intervene to mitigate conflict throughout most of the encounter
- Adequately mitigates conflict throughout encounter to ensure functional team dynamic (e.g., allows team members to clarify their opinions when there is disagreement)

Communication: Informed consent; use of expert verbal and non-verbal communication; effective listening

1 2 3 4 5 6 7
- Fails to explain the procedure and its risks to patient or substitute decision maker
- Explains some aspects of the procedure but does not provide enough information for informed consent (e.g., omits essential information; does not use language that the patient is likely to understand)
- Clearly explains the procedure and its risks but does not verify for understanding (e.g., does not allow patient to ask questions)
- Clearly explains the procedure and its risks and ensures that they have understood all the information

1 2 3 4 5 6 7
- Fails to communicate effectively with patient or team members (e.g., provides information in disorganized way; dismissive of others' views)
- Usually communicates their perspective, but does not allow others to express themselves (e.g., interrupts; ignores verbal or non-verbal cues)
- Communicates effectively with patient and team members about plan throughout most of the encounter
- Consistently communicates clearly and effectively with patient and team members (e.g., listens carefully; attentive to verbal and non-verbal cues; demonstrates sensitivity)
References
American Board of Internal Medicine. (2013). Retrieved Sept 18, 2013, from http://www.abim.org/certification/policies/imss/im.aspx.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Barsuk, J., Ahya, S., Cohen, E., McGaghie, W., & Wayne, D. (2009). Mastery learning of temporary hemodialysis catheter insertion by nephrology fellows using simulation technology and deliberate practice. American Journal of Kidney Diseases, 54, 70–76.
Boots, R. J., Egerton, W., McKeering, H., & Winter, H. (2009). They just don't get enough! Variable intern experience in bedside procedural skills. Internal Medicine, 39, 222–227.
Brennan, R. (2001). GENOVA suite programs. Retrieved Apr 15, 2013, from http://www.education.uiowa.edu/centers/casma/computer-programs.aspx.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Card, S., Snell, L., & O'Brien, B. (2006). Are Canadian General Internal Medicine training program graduates well prepared for their future careers? BMC Medical Education, 6, 56.
Castorr, A. H., Thompson, K. O., Ryan, J. W., Phillips, C. Y., Prescott, P. A., & Soeken, K. L. (1990). The process of rater training for observational instruments: Implications for interrater reliability. Research in Nursing & Health, 13, 311–318.
Cook, D., & Beckman, T. (2006). Current concepts in validity and reliability for psychometric instruments: Theory and application. The American Journal of Medicine, 119, 166.e7–166.e16.
Downing, S. M. (2003). Validity: On meaningful interpretation of assessment data. Medical Education, 37, 830–837.
Downing, S. M. (2004). Reliability: On the reproducibility of assessment data. Medical Education, 38, 1006–1012.
Ericsson, K. A. (2008). Deliberate practice and acquisition of expert performance: A general overview. Academic Emergency Medicine, 15, 988–994.
Frank, J. (Ed.). (2005). The CanMEDS 2005 physician competency framework: Better standards. Better physicians. Better care. Ottawa: The Royal College of Physicians and Surgeons of Canada.
Ginsburg, S., Regehr, G., Hatala, R., McNaughton, N., Frohna, A., Hodges, B., et al. (2000). Context, conflict, and resolution: A new conceptual framework for evaluating professionalism. Academic Medicine, 75, S6–S11.
Gorman, C. A., & Rentsch, J. R. (2009). Evaluating frame-of-reference rater training effectiveness using performance schema accuracy. Journal of Applied Psychology, 94, 1336–1344.
Hicks, C. M., Gonzalez, R., Morton, M. T., Gibbons, R. V., Wigton, R. S., & Anderson, R. J. (2000). Procedural experience and comfort level in internal medicine trainees. Journal of General Internal Medicine, 15, 716–722.
Huang, G., Smith, C. C., Gordon, C., Feller-Kopman, D., Davis, R., Phillips, R., et al. (2006). Beyond the comfort zone: Residents assess their comfort performing inpatient medical procedures. American Journal of Medicine, 119, 71.e17–71.e24.
Jefferies, A., Simmons, B., Tabak, D., McIlroy, J., Lee, K., Roukema, H., et al. (2007). Using an objective structured clinical examination (OSCE) to assess multiple physician competencies in postgraduate training. Medical Teacher, 29, 183–191.
Kneebone, R., Nestel, D., Yadollahi, F., Brown, R., Nolan, C., Durack, J., et al. (2006). Assessing procedural skills in context: Exploring the feasibility of an Integrated Procedural Performance Instrument (IPPI). Medical Education, 40, 1105–1114.
Kromann, C., Jensen, M., & Ringsted, C. (2009). The effect of testing on skills learning. Medical Education, 43, 21–27.
LeBlanc, V., Tabak, D., Kneebone, R., Nestel, D., MacRae, H., & Moulton, C. (2009). Psychometric properties of an integrated assessment of technical and communication skills. American Journal of Surgery, 197, 96–101.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104). New York: American Council on Education and Macmillan.
Moulton, C., Tabak, D., Kneebone, R., Nestel, D., MacRae, H., & LeBlanc, V. (2009). Teaching communication skills using the integrated procedural performance instrument (IPPI): A randomized controlled trial. American Journal of Surgery, 197, 113–118.
Naik, V., Devito, I., & Halpern, S. (2003). Cusum analysis is a useful tool to assess resident proficiency at insertion of labour epidurals. Canadian Journal of Anesthesia, 50, 694–698.
Norman, G., Bordage, G., Page, G., & Keane, D. (2006). How specific is case specificity? Medical Education, 40, 618–623.
Pugh, D., Touchie, C., Code, C., & Humphrey-Murto, S. (2010). Teaching and testing procedural skills: Survey of Canadian Internal Medicine program directors and residents. International Conference on Residency Education, Ottawa, ON. [OM Abstract 76]. Retrieved Sept 18, 2013, from http://www.openmedicine.ca/article/view/439/353.
Reznick, R., Regehr, G., MacRae, H., & Martin, J. (1997). Testing technical skill via an innovative "Bench Station" examination. The American Journal of Surgery, 173, 226–230.
Royal College of Physicians and Surgeons of Canada. (2011). Objectives of training in Internal Medicine. Retrieved Sept 18, 2013, from http://www.deptmedicine.utoronto.ca/Assets/DeptMed?Digital?Assets/Core?Internal?Medicine?Files/rcpscobjectives.pdf.
Wayne, D., Barsuk, J., O'Leary, K., Fudala, M., & McGaghie, W. (2008). Mastery learning of thoracentesis skills by internal medicine residents using simulation technology and deliberate practice. Journal of Hospital Medicine, 3, 48–54.
Wickstrom, G. C., Kolar, M. M., Keyserling, T. C., Kelley, D. K., Xie, S. X., Bognar, B. A., et al. (2000). Confidence of graduating internal medicine residents to perform ambulatory procedures. Journal of General Internal Medicine, 15, 361–365.