A procedural skills OSCE: assessing technical and non-technical skills of internal medicine residents
Debra Pugh • Stanley J. Hamstra • Timothy J. Wood •
Susan Humphrey-Murto • Claire Touchie • Rachel Yudkowsky •
Georges Bordage
Received: 18 September 2013 / Accepted: 5 May 2014
© Springer Science+Business Media Dordrecht 2014
Abstract Internists are required to perform a number of procedures that require mastery
of technical and non-technical skills; however, formal assessment of these skills is often
lacking. The purpose of this study was to develop, implement, and gather validity evidence
for a procedural skills objective structured clinical examination (PS-OSCE) for internal
medicine (IM) residents to assess their technical and non-technical skills when performing
procedures. Thirty-five first to third-year IM residents participated in a 5-station PS-OSCE,
which combined partial task models, standardized patients, and allied health professionals.
Formal blueprinting was performed and content experts were used to develop the cases and
rating instruments. Examiners underwent a frame-of-reference training session to prepare
them for their rater role. Scores were compared by levels of training, experience, and to
evaluation data from a non-procedural OSCE (IM-OSCE). Reliability was calculated using
Generalizability analyses. Reliabilities for the technical and non-technical scores were 0.68
and 0.76, respectively. Third-year residents scored significantly higher than first-year
residents on the technical (73.5 vs. 62.2 %) and non-technical (83.2 vs. 75.1 %) components of the PS-OSCE (p < 0.05). Residents who had performed the procedures more frequently scored higher on three of the five stations (p < 0.05). There was a moderate disattenuated correlation (r = 0.77) between the IM-OSCE and the technical component of the PS-OSCE scores. The PS-OSCE is a feasible method for assessing multiple competencies related to performing procedures, and this study provides validity evidence to support its use as an in-training examination.

Keywords Assessment · Non-technical skills · OSCE · Post-graduate trainees · Procedures · Technical skills

D. Pugh · T. J. Wood · S. Humphrey-Murto · C. Touchie
Department of Medicine, University of Ottawa, Ottawa, ON, Canada

D. Pugh (corresponding author)
Division of General Internal Medicine, The Ottawa Hospital, General Campus, 501 Smyth Road, Box 209, Ottawa, ON K1H 8L6, Canada
e-mail: [email protected]

S. J. Hamstra
Departments of Medicine, Surgery and Anesthesia, University of Ottawa, Ottawa, ON, Canada

R. Yudkowsky · G. Bordage
Department of Medical Education, University of Illinois at Chicago, Chicago, IL, USA
Introduction
Graduates of Canadian Internal Medicine (IM) Residency programs are required to be
proficient in performing a number of medical procedures, such as thoracentesis and lumbar
puncture (Royal College of Physicians and Surgeons of Canada 2011). The formal
assessment of these skills, however, is often lacking. With no standardized and validated
approach available to assess residents’ competence in performing procedures, most
Canadian IM program directors continue to rely on informal assessment methods such as
logbooks that document the number of procedures done (Pugh et al. 2010). As a result, IM
residents, including recent graduates, do not feel adequately prepared to perform many
procedures expected of them (Card et al. 2006; Hicks et al. 2000; Huang et al. 2006;
Wickstrom et al. 2000).
One way to address this issue and formalize the learning experience is to apply a
mastery approach to teaching procedural skills. Using this method, trainees are taught to
perform procedures (e.g., on a partial task model) and their technical skills are then
observed until a pre-determined level of competence has been achieved (Barsuk et al.
2009; Wayne et al. 2008). Another well-described approach is the objective structured
assessment of technical skill (OSATS), which is a performance-based examination that
assesses the technical skills of surgical trainees as they perform a series of procedures on
bench models (e.g., excision of a skin lesion, tracheostomy) (Reznick et al. 1997).
Completing a procedure requires mastery of both technical and non-technical skills;
however, both of these approaches focus primarily on the technical aspects of procedural
competence (i.e., ensuring that certain steps are completed). The OSATS does
incorporate the use of an assistant, but the assistant's role is to follow instructions rather
than to challenge the candidates' ability to communicate and collaborate.
Using the CanMEDS framework (Frank 2005), several distinct but overlapping roles
can be used to describe the competencies required when performing procedures. For
example, when performing an endotracheal intubation, a physician must be able to obtain
informed consent from the patient or their substitute decision maker (Communicator),
work with a registered nurse and respiratory therapist to ensure adequate preparation and
flow of the procedure (Collaborator), remain calm if the patient has an unanticipated
deterioration in their clinical status (Professional), and perform the technical aspects of the
procedure to properly place the tube (Medical Expert). Although assessing multiple
competencies with one test poses several challenges, Jefferies et al. (2007) have shown that
it is feasible to use an objective structured clinical examination (OSCE) to assess multiple
CanMEDS roles in an examination of non-procedural clinical skills.
The integrated procedural performance instrument (IPPI) (Kneebone et al. 2006) is a
tool that has been used to teach and assess technical skills in context. Using a hybrid
model, both standardized patients (SPs) and partial task models are included in a
performance-based examination to assess non-technical (i.e., professionalism and com-
munication) as well as technical skills. The IPPI has been shown to be feasible and to have
acceptable psychometric characteristics when used to assess some of the procedures
commonly performed by medical students (e.g., urinary catheterization or venipuncture)
and surgical trainees (e.g., wound closure and cast application) (LeBlanc et al. 2009;
Moulton et al. 2009). However, the IPPI has not been used to assess the invasive proce-
dures performed by IM residents, nor has it been used to assess the collaborative skills that
are often required when performing complex medical procedures.
Given the lack of standardized and validated means of assessing procedural skills in
internal medicine, we developed and implemented an innovative procedural skills OSCE
(PS-OSCE) for IM residents to simultaneously assess both the technical and non-technical
skills (including communication, collaboration, and professionalism) required when per-
forming procedures. Modern validity theory was used as a framework to gather multiple
sources of evidence for the validity of the scores from this examination (Messick 1989;
American Educational Research Association et al. 1999; Downing 2003). In this frame-
work, the onus is on the experimenters to demonstrate construct validity, that is, to what
degree the scores from an assessment actually measure the underlying construct of interest.
Validity is demonstrated by collecting evidence that supports or refutes the proposed
meaning of a set of scores, and this evidence can be grouped into five categories or sources.
The five sources of evidence gathered to demonstrate validity for the scores from the PS-OSCE were: content (i.e., test items are representative of the construct of interest),
response process (i.e., evidence of data integrity including: clear test instructions for
candidates, rigorous rater training, methods for scoring, and data entry), internal structure
(i.e., psychometric properties of the exam including: score reliability, exam difficulty, and
inter-item correlations), relations with other variables (i.e., convergent and discriminant
evidence, including correlations to other variables), and consequences (i.e., impact on
learners, instructors, and curriculum).
Methods
Study participants
The PS-OSCE was a formative but mandatory in-training examination for first to third-year
residents (PGY1-3) in the University of Ottawa IM Residency Program in 2012. The exam
was administered near the end of the academic year, by which time all trainees would be
expected to have participated in at least one formal, hands-on procedural teaching
session. Ethics approval was obtained from the University of Illinois at Chicago (as part of
DP's thesis requirements) and the University of Ottawa. Examinees received an email
invitation to participate in the study and written consent was obtained. Non-consenting
examinees still participated in the PS-OSCE but, as per Research Ethics Board rules, their
results were not included in any analyses.
Using five simultaneous tracks and two consecutive administrations (i.e., early-evening and late-evening testing periods), all participants were assessed in one evening. Each track consisted of a series of rooms distributed throughout the testing centre and included all five OSCE stations; each station had a different examiner, SP, and allied health professional (AHP). Each resident completed the entire exam by going through only one track (e.g., track A, B, or C). As such, residents were each examined by five different raters, and all residents in a given track (e.g., track A) saw the same five raters, SPs, and AHPs.
Content
We developed a PS-OSCE with five 18-min stations, each containing: (1) a partial task
model (i.e., for lumbar puncture, endotracheal intubation, central venous catheter insertion,
thoracentesis, and knee joint aspiration), (2) an SP, and (3) an AHP.
The examination blueprint was based on the procedures required of Canadian IM
graduates (Royal College of Physicians and Surgeons of Canada 2011). The five proce-
dures were selected (from a total of ten possible procedures) because of feasibility. That is,
the procedures were amenable to testing in an OSCE setting and suitable partial task
models were available. Other important skills that were not included on the examination
were: abdominal paracentesis and peripheral arterial line insertion (because of the lack of
suitable commercially available partial task models at the time of the study); resuscitation
skills (as this is assessed through a separate course at our institution); electrocardiogram
interpretation (as this skill could be assessed through more cost-efficient ways); and airway
management (as this was essentially assessed in the intubation station).
In addition, four of the seven CanMEDS roles were represented in each case, that is, the
Professional, the Communicator, the Collaborator, and the Medical Expert. These roles
were judged to be most relevant for the proper execution of the procedures being assessed.
The cases were written by the investigators (DP, CT, SH-M) who have experience with
OSCE case development at a local and national level. As an example, for the thoracentesis
station, residents had to: obtain consent from an angry SP; collaborate with an inexperi-
enced nurse (actor) who accidentally contaminates the sterile field; perform the thora-
centesis on a model; and discuss the results of a post-procedure radiograph demonstrating a
pneumothorax with the SP. Each case was reviewed by content experts and pilot-tested.
Checklists were developed by groups composed of 7–12 content experts, chosen for
their familiarity with the procedures of interest. An online, iterative survey was used to
obtain consensus about checklist items. For items in which unanimous consensus could not
be reached (ranging between zero and four checklist items for each of the five cases),
decisions to include the items were based on a majority rule.
Two separate rating scales were developed for each of the three non-technical domains
assessed (i.e., professionalism, collaboration, and communication skills). Each rating scale
was developed to assess different non-technical skills and the descriptors referred to
specific behaviours rather than norm-referenced anchors. The scales were pilot-tested in
two settings. Using an iterative process, the rating scales were then modified following
review by clinicians with expertise in the area of assessment to ensure clarity and ease of
use; see "Appendix".
Response process
Residents’ technical skills were rated by physician examiners (PEs) using task-specific
checklists. Residents received a score of 1 for each item done correctly, and 0 for items
unsatisfactorily attempted or not done. Non-technical skills were assessed using six 7-point
rating scales.
Faculty from the University of Ottawa were recruited as PEs and underwent a 2-h
frame-of-reference training session to introduce them to the rating instruments, prepare
them for their rater role, and ensure the accuracy of their ratings. During their training, they
watched two videos of simulated PS-OSCE stations, and rated the candidates’ non-tech-
nical performance. Facilitators collected the forms and presented data on the distribution of
scores, and participants were led in a discussion about their individual ratings. They then
re-watched and re-rated the training videos. This resulted in four separate rating forms for
each rater (i.e., initial and revised ratings for each video). A total score for each rating form
was calculated by summing the six ratings, resulting in a maximum score of 42. Mean total
scores were then compared using a one-sample t test, and standard deviations were
compared using an F test.
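For illustration, the following is a minimal sketch of how such an analysis could be run; it is not the authors' code. The rater count of 17 is inferred from the F(16, 16) degrees of freedom reported in the Results, and the rating values are simulated.

```python
# A minimal sketch (not the authors' code) of the rater-training analysis: initial and
# revised total ratings (six 7-point scales summed, max 42) for the same raters are
# compared with a one-sample t test on the paired differences, and the spread of the
# ratings with a variance-ratio F test. All values below are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
initial = rng.normal(29.6, 4.57, size=17)  # hypothetical initial total ratings
revised = rng.normal(32.0, 2.76, size=17)  # hypothetical revised total ratings

# One-sample t test on the rating change (H0: mean change = 0)
t_stat, t_p = stats.ttest_1samp(revised - initial, popmean=0.0)

# Variance-ratio F test (H0: equal variances); one-sided p from the F distribution
f_stat = np.var(initial, ddof=1) / np.var(revised, ddof=1)
f_p = stats.f.sf(f_stat, len(initial) - 1, len(revised) - 1)

print(f"t({len(initial) - 1}) = {t_stat:.2f}, p = {t_p:.3f}")
print(f"F({len(initial) - 1}, {len(revised) - 1}) = {f_stat:.2f}, p = {f_p:.3f}")
```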
Immediately preceding the exam, candidates and PEs participated in separate orienta-
tion sessions to ensure that all exam instructions were clear. The SPs and AHPs received
training to ensure that the portrayal of their roles was accurate and consistent.
During the administration of the PS-OSCE, staff monitored data entry by reviewing the
checklist and rating scale forms completed after the first and second rounds of candidates.
Data entry was performed by experienced staff with quality control checks.
Internal structure
Exam scores were reported separately for the technical (checklist) and non-technical
(rating scale) components of each station. Scores for each station were calculated by
converting the respective checklist score or rating into a percentage. Total scores for each measure
(i.e., technical and non-technical) were determined by averaging the scores on the five
stations. Corrected item-total correlations for the stations were calculated for technical and
non-technical scores. The correlation between total technical and non-technical scores was
calculated using Pearson’s correlation coefficient.
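To make these computations concrete, the following sketch (simulated data, not the study data; variable names are ours) expresses station scores as percentages, averages them into totals, and computes each station's corrected item-total correlation against the mean of the remaining stations:

```python
# A minimal sketch (simulated data) of the internal-structure computations.
import numpy as np

rng = np.random.default_rng(1)
n_residents, n_stations = 35, 5
technical = rng.uniform(40, 95, size=(n_residents, n_stations))      # % per station
non_technical = rng.uniform(50, 95, size=(n_residents, n_stations))  # % per station

def corrected_item_total(scores: np.ndarray) -> np.ndarray:
    """Correlate each station with the mean of the other stations."""
    out = np.empty(scores.shape[1])
    for s in range(scores.shape[1]):
        rest = np.delete(scores, s, axis=1).mean(axis=1)
        out[s] = np.corrcoef(scores[:, s], rest)[0, 1]
    return out

total_tech = technical.mean(axis=1)          # total technical score per resident
total_non_tech = non_technical.mean(axis=1)  # total non-technical score per resident

print("corrected ITCs (technical):", np.round(corrected_item_total(technical), 2))
# Pearson correlation between total technical and non-technical scores
print("r =", np.round(np.corrcoef(total_tech, total_non_tech)[0, 1], 2))
```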
Reliability of the PS-OSCE scores was assessed using Generalizability analyses. This model allows one to identify which variables (i.e., participants, tracks, training level, or stations) contribute the most and the least to the overall variability in the scores. The amount of variance accounted for by each variable is expressed as a percentage of the overall variability, with higher percentages indicating a greater contribution. For the Generalizability model that was used, PGY level (l) and track (t) were crossed with stations (s). Participants were nested within l and t because each participant belongs to a unique lt combination. Because there was only one PE per station, examiners were confounded with stations and were therefore not included as a separate facet in the model.
The variance components that were generated from this model were also used to derive estimates of reliability for the examination. A relative error coefficient was assumed, and the G-coefficient was calculated as G = σ²(p:lt) / [σ²(p:lt) + σ²(ps:lt)/n_s], where n_s is the number of stations. UrGenova was used to generate the variance components for this analysis (Brennan 2001).
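As a worked check, substituting the variance components reported later in Table 2 into this formula (with n_s = 5 stations) reproduces, to rounding, the G-coefficients of 0.68 and 0.76 reported under Internal structure:

```latex
% Worked check with the Table 2 variance components and n_s = 5 stations:
G_{\text{technical}} = \frac{58.13}{58.13 + 136.57/5} = 0.680, \qquad
G_{\text{non-technical}} = \frac{60.96}{60.96 + 93.62/5} = 0.765
```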
Relations to other variables
Self-reported (i.e., recollected) data on the number of times a resident had performed each procedure (i.e., 0–1 time, 2–4 times, 5–10 times, or >10 times) were collected using a pre-OSCE survey. Differences in performance based on year of training (i.e., PGY1-3) and experience (i.e., number of times the procedure had been performed) were compared using univariate ANOVA. Effect sizes were calculated using partial eta squared (ηp²). Post hoc independent t tests were used to further explore any differences found.
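A minimal sketch of this comparison follows (not the authors' code; the group means and sizes are seeded from Table 3, but the individual scores are invented). For a one-way design, partial eta squared reduces to SS_between / (SS_between + SS_within):

```python
# A minimal sketch (simulated data) of the comparison by training year: a one-way
# ANOVA across PGY levels, with partial eta squared computed from sums of squares.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(mean, 10, size=n)
          for mean, n in [(62.2, 15), (66.3, 10), (73.5, 10)]]  # PGY-1, -2, -3

f_stat, p_val = stats.f_oneway(*groups)

grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
eta_p2 = ss_between / (ss_between + ss_within)  # partial eta^2 for a one-way design

df_within = sum(len(g) for g in groups) - len(groups)
print(f"F(2, {df_within}) = {f_stat:.2f}, p = {p_val:.3f}, partial eta^2 = {eta_p2:.2f}")
```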
Scores were also compared to a non-procedural internal medicine OSCE (IM-OSCE), a
mandatory in-training exam for IM residents; it was administered 2 months before the PS-OSCE. The IM-OSCE was composed of 1 communication, 4 structured oral, and 4 physical examination stations, and had a test-score reliability of 0.74 (Cronbach's α). Correlations
between PS-OSCE and IM-OSCE scores were calculated using Pearson’s correlation coef-
ficient; disattenuated correlations were calculated using the known reliabilities of the exams.
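The standard disattenuation formula (Spearman's correction for attenuation) divides the observed correlation by the geometric mean of the two score reliabilities; this is presumably the correction applied, although the paper does not spell out the computation:

```latex
% Spearman's correction for attenuation: the observed correlation r_{xy} is divided
% by the geometric mean of the two score reliabilities r_{xx} and r_{yy}.
r_{\text{disattenuated}} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}}
```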
Consequences
Residents were surveyed after the PS-OSCE to determine the acceptability of the exam.
Costs for the administration of the PS-OSCE were calculated and compared to a hypo-
thetical non-procedural OSCE of the same size at our institution.
Residents received a written summary of their scores for each station (technical and
non-technical), and the results were forwarded to the IM residency program director. For
the purpose of this paper, standard setting procedures are not discussed because, while
mandatory, the PS-OSCE was a formative examination.
Results
Forty-one IM residents took the examination and 35 consented to participate in the study
(n = 15 PGY-1, n = 10 PGY-2, n = 10 PGY-3). The six non-consenting participants
were distributed evenly between the first 2 years of training; their results were not included
in any analyses.
Content
The PS-OSCE blueprint represented five of the ten procedures required of graduates of
Canadian IM programs, and four of the seven CanMEDS roles.
The scoring instruments were developed with input from content experts. This resulted in five
station-specific checklists, each composed of 16–24 items, as well as a set of six 7-point
rating scales used to assess non-technical skills across all stations.
Response process
Following frame-of-reference training, mean PE ratings changed significantly, from 21.0 to 18.5 for video one and from 29.6 to 32.0 for video two (p < 0.001). Based on F tests, the SD of the ratings decreased significantly with the second ratings, from 4.82 to 2.40 for video one [F(16, 16) = 3.372, p < 0.01] and from 4.57 to 2.76 for video two [F(16, 16) = 2.74, p < 0.05].
Internal structure
Mean total scores for the technical and non-technical components of the exam were 66.6 and
77.6 %, with total scores ranging from 36.9 to 82.6 and 58.6 to 90.0 %, respectively; see
Table 1. Corrected item-total correlations ranged from 0.27 to 0.62 and from 0.15 to 0.50
for technical and non-technical scores, respectively.
There was a significant correlation between total scores for the technical and non-technical components of the exam (r = 0.76, p < 0.001); correlations between technical and non-technical components for each station were all significant (r = 0.35–0.87, p < 0.05).
Based on a Generalizability analysis, the track to which a participant was assigned
accounted for 6 % (for technical skills) and 0 % (for non-technical skills) of the variability
in scores and PGY level accounted for 4 % of the variance in scores for both measures; see
Table 2. Participants accounted for 20 % (for technical) and 22 % (for non-technical) of
the variance, indicating that there were differences between residents within a PGY level
on each track. Stations did not account for a significant amount of the variation in the
scores but the interaction between track and station accounted for 15 % of the variation in
technical skills and 33 % of the variation in non-technical skills, indicating that there may
have been differences in the way stations were scored across tracks.
These variance components were used to generate Generalizability (G) coefficients for
scores on both components and the resulting coefficients were 0.68 and 0.76 for the
technical and non-technical components, respectively. A D study revealed that to achieve a
G-coefficient of 0.80, at least 10 stations would be required for the technical component,
and 7 stations for the non-technical component of the exam.
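These projections follow from the same G-coefficient formula with the number of stations replaced by a hypothetical n′_s; using the Table 2 variance components, both projected values clear the 0.80 threshold:

```latex
% D study: projected G-coefficients for alternative station counts n'_s,
% using the Table 2 variance components.
G_{\text{technical}}(n'_s = 10) = \frac{58.13}{58.13 + 136.57/10} \approx 0.81, \qquad
G_{\text{non-technical}}(n'_s = 7) = \frac{60.96}{60.96 + 93.62/7} \approx 0.82
```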
Relations to other variables
For both the technical and non-technical components of the exam, senior residents scored
significantly higher than more junior residents; see Table 3. Post hoc pair-wise compari-
sons, using Tukey’s HSD test, showed that PGY-3 residents scored higher than PGY-1
residents on the technical (73.5 vs. 62.2 %, p = 0.029) and non-technical (83.2 vs. 75.1 %,
p = 0.024) components.
Scores for the technical component of the stations differed as a function of the number
of times a resident reported having previously performed a procedure for the central line,
lumbar puncture, and thoracentesis stations; see Table 4. Independent t tests showed the
differences were between those who had performed the following procedures more frequently: central line insertion (>10 times vs. 2–4 times; >10 times vs. 5–10 times); lumbar puncture (>10 times vs. 0–1 time; 5–10 times vs. 0–1 time); and thoracentesis (2–4 times vs. 0–1 time), p < 0.05. Scores for the non-technical component of the stations did not
differ significantly by the amount of previous procedural experience.
Of the 35 residents in this study, 27 had also participated in the IM-OSCE. There was a
significant positive correlation between the IM-OSCE scores and total score for the
technical (r = 0.47, p = 0.013), but not the non-technical (r = 0.35, p = 0.074) scores of
the PS-OSCE. Disattenuated correlations between the IM-OSCE scores and total scores
were calculated: r = 0.77 for the technical, and r = 0.54 for the non-technical compo-
nents, respectively.
Consequences
When surveyed about their experience with the PS-OSCE compared to assessments using a
partial task model alone, 23 residents (66 %) reported that an OSCE with SPs and AHPs
allowed for a more valid assessment of their skills than a model alone; eight (23 %)
reported that it depends on the skill while four (11 %) reported that assessing technical
skills alone is more valid.
Costs related to the administration of this 5-station, 5-track OSCE were approximately $33,500 CAN, including payment to PEs (~$10,000), SPs (~$4,000), AHPs (~$7,500), SP trainers ($3,500), OSCE staff (~$4,000), catering (~$2,500), data entry (~$500), administrative fees (~$1,000) and medical equipment (~$500). Purchase of 20 models
would have added approximately $31,000 to the total cost. The estimated cost of admin-
istering a non-procedural OSCE of the same size at our institution would be approximately
$21,750.
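As a quick arithmetic check, the itemized costs sum to the reported total:

```latex
% Itemized PS-OSCE costs (CAN$) sum to the reported ~$33,500:
10{,}000 + 4{,}000 + 7{,}500 + 3{,}500 + 4{,}000 + 2{,}500 + 500 + 1{,}000 + 500 = 33{,}500
```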
Table 1 Technical and non-technical mean scores and corrected item-total correlations according to stations

Station | Technical score: mean % (SD); min/max; corrected ITC | Non-technical score: mean raw score (SD); mean converted to % (SD)(a); min/max; corrected ITC

Central line | 65.1 (18.0); 21.7–91.3; 0.62 | 5.4 (0.9); 77.2 (12.7); 45.2–100.0; 0.15
Lumbar puncture | 70.1 (13.9); 33.3–90.5; 0.28 | 5.6 (0.8); 79.3 (11.5); 47.6–100.0; 0.15
Intubation | 64.6 (16.1); 19.1–90.5; 0.49 | 5.1 (0.9); 73.4 (12.2); 47.6–97.6; 0.49
Knee | 70.9 (14.1); 37.5–93.8; 0.27 | 5.6 (0.7); 80.7 (9.9); 61.9–97.6; 0.37
Thoracentesis | 62.1 (19.1); 20.8–87.5; 0.59 | 5.4 (1.3); 77.2 (17.9); 38.1–100.0; 0.50
Total score | 66.6 (11.0); 36.9–82.6; n/a | 27.1 (2.8); 77.6 (7.9); 58.6–90.0; n/a

ITC item-total correlation
(a) Ratings were converted from 1–7 to 0–6, and then a percentage was calculated
Table 2 Generalizability analysis by skills type

Facet | Technical: variance component (% variance) | Non-technical: variance component (% variance) | Explanation

t | 16.33 (6) | 1.32 (0) | To what degree do scores on the five tracks differ?
l | 10.63 (4) | 10.48 (4) | To what degree do scores assigned to each of the training levels differ within each track?
p:lt | 58.13 (20) | 60.96 (22) | To what degree do scores assigned to participants within a training level and track differ? This is the object of measurement for this study.
s | 0 (0) | 0 (0) | To what degree do scores on the five stations differ?
tl | 5.76 (2) | 0 (0) | To what degree do scores assigned to each training level differ by track?
ts | 41.85 (15) | 90.86 (33) | To what degree do station scores differ as a function of track?
ls | 0.79 (0) | 13.12 (5) | To what degree do station scores differ as a function of training level?
tls | 14.57 (5) | 9.02 (3) | To what degree do station scores differ across the training levels and track?
ps:lt | 136.57 (45) | 93.62 (34) | Represents a combination of unexplained error and the degree to which station scores for participants within a training level and track differed

t Track, l level (post-graduate year), s stations, p participant
Scores on both the technical and non-technical components of the exam (see Table 1)
revealed that the lowest mean scores were on the central line insertion, intubation, and thoracentesis
stations.
Discussion
The standardized, formal assessment of IM residents’ ability to perform procedures was
effectively achieved with this PS-OSCE. Interpreting the scores from the PS-OSCE, as
with any exam, requires analysis of several sources of validity evidence, which are dis-
cussed successively in the following sections.
Content
Content evidence refers to ensuring that the construct being assessed is accurately and com-
pletely represented on a test (Cook and Beckman 2006). In this case, the PS-OSCE blueprint
allowed for the assessment of residents’ skills in performing five procedures, which represents
half of all procedures required of graduates of Canadian IM programs, and four of the seven
CanMEDS roles. Although five procedures were not tested, resident abilities may generalize to
other procedures because of the similarity in skills across procedures (e.g., the technical skills
required for thoracentesis may be transferable to paracentesis, given that the basic procedural
steps and equipment are similar). Given that trainees have variable opportunities to perform
procedures while being observed by faculty (Pugh et al. 2010; Boots et al. 2009), this was an
efficient way to formally assess a number of skills. Attention to pilot-testing and revisions of the
cases by content experts also helped to ensure that the PS-OSCE was representative of the
challenges faced by residents when performing procedures. A rigorous approach to the
development and pilot-testing of the instruments, including groups of content experts and a
consensus survey, provides further evidence to support content validity.
Response process
To ensure that the results reported to candidates are valid, one must ensure that the ratings
provided by examiners are accurate. An important step when introducing a new assessment
Table 3 Technical and non-technical mean scores by levels of training

Level | Total score technical (SD) | Total score non-technical (SD)
PGY-1 (n = 15) | 62.2 (11.8) | 75.1 (7.2)
PGY-2 (n = 10) | 66.3 (9.5) | 75.7 (7.9)
PGY-3 (n = 10) | 73.5 (7.9) | 83.2 (6.5)
Total (n = 35) | 66.6 (11.0) | 77.6 (7.9)
F(2, 32) | 3.66 | 4.32
p | 0.037 | 0.022
ηp² | 0.19 | 0.21
Table 4 Technical and non-technical mean scores by number of times previously performed and by station type

Scores are mean % (SD); N = number of residents in each experience group.

Central line
0–1: N = 3, technical 60.87 (15.68), non-technical 73.81 (11.90)
2–4: N = 12, technical 58.70 (20.18), non-technical 76.39 (12.82)
5–10: N = 9, technical 59.90 (16.38), non-technical 71.42 (13.63)
>10: N = 11, technical 77.47 (12.11), non-technical 83.77 (10.26)
Total: N = 35, technical 65.09 (18.03), non-technical 77.21 (12.66)
F(3, 31): technical 2.98, non-technical 1.82; p: 0.047, 0.164; ηp²: 0.22, 0.15

Lumbar puncture
0–1: N = 7, technical 59.86 (12.25), non-technical 74.15 (8.95)
2–4: N = 18, technical 69.31 (14.17), non-technical 78.97 (10.27)
5–10: N = 8, technical 77.38 (9.78), non-technical 84.22 (15.99)
>10: N = 2, technical 83.33 (10.10), non-technical 79.76 (5.05)
Total: N = 35, technical 70.07 (13.93), non-technical 79.25 (11.47)
F(3, 31): technical 3.09, non-technical 0.97; p: 0.041, 0.422; ηp²: 0.23, 0.09

Intubation
0–1: N = 8, technical 54.17 (22.90), non-technical 69.64 (11.15)
2–4: N = 12, technical 69.44 (9.62), non-technical 69.44 (10.97)
5–10: N = 12, technical 65.87 (14.76), non-technical 77.78 (12.97)
>10: N = 3, technical 68.25 (16.72), non-technical 81.74 (11.00)
Total: N = 35, technical 64.63 (16.14), non-technical 73.40 (12.16)
F(3, 31): technical 1.64, non-technical 1.78; p: 0.200, 0.171; ηp²: 0.14, 0.15

Knee
0–1: N = 27, technical 69.68 (14.58), non-technical 80.34 (10.05)
2–4: N = 6, technical 75.00 (14.25), non-technical 80.95 (11.27)
5–10: N = 2, technical 75.00 (8.83), non-technical 84.52 (1.68)
>10: N = 0, technical n/a, non-technical n/a
Total: N = 35, technical 70.89 (14.13), non-technical 80.68 (9.85)
F(2, 32): technical 0.42, non-technical 0.16; p: 0.659, 0.850; ηp²: 0.03, 0.01

Thoracentesis
0–1: N = 18, technical 53.24 (20.64), non-technical 69.71 (19.03)
2–4: N = 11, technical 69.70 (13.45), non-technical 85.07 (12.14)
5–10: N = 3, technical 73.61 (10.49), non-technical 83.33 (15.61)
>10: N = 3, technical 76.39 (4.81), non-technical 87.30 (17.87)
Total: N = 35, technical 62.14 (19.08), non-technical 77.21 (17.88)
F(3, 31): technical 3.39, non-technical 2.49; p: 0.030, 0.079; ηp²: 0.25, 0.19
instrument is to ensure that raters undergo training with the instrument in question. Our
PEs underwent frame-of-reference training, which was used to help them develop per-
formance schemas in order to arrive at a consensus regarding rating varying levels of
performance (Castorr et al. 1990; Gorman and Rentsch 2009). This helped to prepare them
for their rater role and to ensure that all raters were applying the instruments in a stan-
dardized way, as demonstrated by the decrease in variability of scores with their revised
ratings. The decrease in variability suggests that, after the training sessions, raters were
scoring performance in a more unified way. Quality assurance measures, such as ensuring
that all ratings were being completed after each candidate, also helped to ensure the
accurate collection of data for this examination.
Internal structure
Evidence for internal structure relates to the psychometric properties of an examination
(Downing 2004). The scores from this examination were found to be reliable, especially
for a formative examination. If greater levels of reliability were required, one could
incorporate additional shorter cases to minimize the impact on cost and feasibility. The
incorporation of more stations is certainly feasible, especially if one were to include a
smaller number of candidates (e.g., only PGY-3 residents).
However, given content specificity, the restricted measurement domain (at least for
technical skills) and the difficulties with simulating certain conditions, the number of tasks
that can be modeled and incorporated in an examination is finite. Therefore, while the
consistency of the scores is important, and certainly dependent on the number of tasks, the
choice of what to measure (validity) is key.
Generalizability analyses were useful in identifying various sources of error in the PS-
OSCE. Although track by itself did not account for much of the variability in scores, there
was a significant interaction between track and station, suggesting that stations were scored
differently across tracks. This may relate to rater differences on the tracks, station order
differences within the tracks, or it may represent actual differences in participants’ ability,
given that there were small numbers of participants in each track and they were not equally
distributed by PGY level. In future studies, the use of more than one rater per station could
help to explore rater reliability. The analysis also showed that there were differences
between residents within a PGY level within tracks; that is, not all PGY-3s have equal
ability.
Other measures of internal structure include scenario difficulty and correlations between
station scores. In this study, corrected item-total correlations ranged from low to moderate
for both technical and non-technical skills. One would not expect high correlations
between stations for technical ability, given that the stations were developed to measure
different procedural skills (i.e., case specificity) (Norman et al. 2006). However, it is
somewhat surprising that the item-total correlations for the non-technical skills were not
higher since the same three skills (i.e., communication, collaboration, and professionalism)
were assessed using the same rating scales in all stations. Further analyses showed that
performance on the non-technical skills was moderately to highly correlated with technical
skills, which may indicate that those who are more skilled in performing a given procedure
have more spare attentional capacity, resulting in a greater ability to communicate
effectively, collaborate with others, and behave professionally. Non-technical skills are
largely contextual, and in the case of executing procedures, they may be dependent on
trainees’ technical ability (Ginsburg et al. 2000).
Relations to other variables
When considering the relationship between measures of the construct of interest with other
variables, one must consider both convergent and discriminant validity as sources of
evidence (Campbell and Fiske 1959). For procedural skills, it might be expected that more
senior trainees and those who have performed a given procedure more frequently would
perform better. In this study, as expected, PGY-3 residents performed better overall than
PGY-1 residents on both the technical and non-technical components of the stations.
Although the numbers of participants in each sub-group were small, the effect size was
moderate (i.e., ηp² > 0.11).
Similarly, for three stations (central line, lumbar puncture and thoracentesis), technical
skill performance was better for those who had performed the procedure more often.
Again, although group numbers were small, the effect size was moderate. For the knee and
intubation stations, however, greater experience did not result in higher scores, perhaps
because of the relative paucity of participants’ experience with these procedures. Most
residents (27 out of 35) had never performed a knee aspiration, or only once, and only three
had performed more than ten endotracheal intubations.
Technical performance on the PS-OSCE correlated moderately with performance on a
non-procedural OSCE, while non-technical performance did not correlate significantly.
Although both the IM-OSCE and the technical component of the PS-OSCE were designed
to assess different constructs, they are both purportedly assessing the Medical Expert role,
and so one would expect a positive, yet moderate correlation between the two. Conversely,
the IM-OSCE does not attempt to assess the non-Medical Expert roles (with the exception
of one station designed to assess communication skills), so it is not surprising that scores
did not correlate with the non-technical scores on the PS-OSCE (i.e., discriminant
validity).
Consequences
The results from this study have important implications for our IM residency program. As
assessment drives learning (Kromann et al. 2009), we expect residents to demand more and
more opportunities to practice procedures and receive feedback on their strengths and
weaknesses. As a formative examination, the information collected can help identify areas
for improvement for individual residents, as well as identify potential weaknesses in the
current procedural skills curriculum. Overall, scores on the PS-OSCE were low, which
calls for program revisions or adjustments regarding procedural skills.
The number of procedures required to achieve competency varies greatly by
individual (Naik et al. 2003), and it is clear that trainees need more experience with
procedures to become proficient, not just more time in-training. One cannot assume that
more senior trainees will be competent in performing routine procedures unless they have
had sufficient opportunities to practice their skills over time. For each of the procedures
studied, there was great variability in the amount of experience that residents had accu-
mulated, with many residents having never performed a given procedure, despite the fact
that this examination was near the end of the academic year. As emphasized by Ericsson’s
theory of expertise, trainees require the opportunity for deliberate practice with
feedback in order to develop their technical skills (Ericsson 2008). Training programs must
ensure that trainees are gaining sufficient and repeated exposure to procedures over time,
and that formal assessment of these skills is occurring on a regular basis.
Limitations and strengths
Limitations of this study include the relatively small sample size and use of a single
institution. Although the procedures we tested are important for all Canadian IM graduates,
the requirements differ from those of other countries. For example, the American
Board of Internal Medicine requires residents to be knowledgeable about the indications
for a number of procedures and their interpretation; however, its trainees are not required
to demonstrate competence in actually performing any of the procedures tested in our PS-
OSCE (American Board of Internal Medicine 2013). In addition, the cost, although
comparable to that of other OSCEs, may limit the feasibility for some institutions to
implement a similar examination.
Strengths of this study include the use of a validity framework and the incorporation of
non-technical skills. Several sources of evidence for the validity of the scores from the
exam were systematically presented including: the process for the development of the
cases and the rating instruments, extensive rater training, inclusion of Generalizability
analyses and calculation of effect sizes, comparison to performance on another exam, and
discussion about cost and feasibility. The incorporation of SPs and AHPs allowed for the
assessment of trainees in a more realistic and complex setting than if part-task models
alone had been used.
Although this was a formative examination, it could potentially be used as a summative,
high stakes assessment. Future steps will include developing PS-OSCE stations to assess
IM residents’ ability to perform other medical procedures and determining a fair and valid
process for setting a passing standard that ensures mastery of the skills.
Conclusion
IM residents are required to be proficient in performing a number of medical procedures,
and yet, due to limited opportunities to perform some of these procedures, these skills are
being performed and assessed infrequently. It is imperative to ensure that residents’ ability
to perform procedures, from both technical and non-technical perspectives, is being
assessed, and the PS-OSCE is one way to accomplish this goal. Although the implemen-
tation of a PS-OSCE is time-consuming, it is a feasible and worthwhile initiative for IM
programs, given the importance of resident education and patient safety.
Acknowledgments We would like to acknowledge Dr. John (Jack) R. Boulet, Ms. Lesley Ananny, and the staff at the Ottawa Exam Centre, Academy for Innovation in Medical Education (AIME), and University of Ottawa Skills and Simulation Centre (uOSSC) for their support and valuable advice. Funding for this project was provided by the University of Ottawa through the Academy for Innovation in Medical Education, the Department of Medicine, and the Office of Postgraduate Residency Education. Dr. Pugh was also partly supported by the W. Dale Dauphinee Fellowship from the Medical Council of Canada.
Appendix: Rating scales for non-technical skills
In each scale below, the four descriptors are behavioural anchors ordered from lowest to highest performance along the 7-point scale.

Professionalism: Altruism; respect; compassion; integrity and honesty; disclosure of errors and adverse events

1 2 3 4 5 6 7
- Ignores patient's comfort, needs or rights (e.g., dishonest; failure to use appropriate analgesia)
- On several occasions, fails to acknowledge patient's comfort, needs, or rights
- Acknowledges and attempts to tend to patient's comfort, needs, or rights
- Attentive to patient's comfort, needs and rights (e.g., enquires frequently about patient's comfort; fully discloses errors)

1 2 3 4 5 6 7
- Loses control/composure (e.g., raised voice; inappropriate language)
- Often appears flustered during encounter
- Maintains composure throughout most of the encounter, but has difficulty in stressful situations
- Maintains composure throughout encounter, even when under stress

Collaboration: Appropriate delegation; respect and understanding for others' roles; prevention and negotiation of conflict

1 2 3 4 5 6 7
- Does not delegate tasks or involve team members when appropriate (e.g., ignores other team members)
- Seems unsure of role of other team members and/or delegates inappropriately (e.g., asks nurse for dose of medications; takes over the roles of others)
- Delegates most tasks appropriately but does not always involve team members in decision-making
- Appropriately delegates tasks, involves team members in decision-making when appropriate (e.g., integrates information from others when planning)

1 2 3 4 5 6 7
- Interferes with team functioning; and/or escalates conflict (dismissive; condescending; hostile)
- Often avoids conflict with team members by ignoring rather than addressing issues
- Attempts to intervene to mitigate conflict throughout most of the encounter
- Adequately mitigates conflict throughout encounter to ensure functional team dynamic (e.g., allows team members to clarify their opinions when there is disagreement)

Communication: Informed consent; use of expert verbal and non-verbal communication; effective listening

1 2 3 4 5 6 7
- Fails to explain the procedure and its risks to patient or substitute decision maker
- Explains some aspects of the procedure but does not provide enough information for informed consent (e.g., omits essential information; does not use language that the patient is likely to understand)
- Clearly explains the procedure and its risks but does not verify for understanding (e.g., does not allow patient to ask questions)
- Clearly explains the procedure and its risks and ensures that they have understood all the information

1 2 3 4 5 6 7
- Fails to communicate effectively with patient or team members (e.g., provides information in disorganized way; dismissive of others' views)
- Usually communicates their perspective, but does not allow others to express themselves (e.g., interrupts; ignores verbal or non-verbal cues)
- Communicates effectively with patient and team members about plan throughout most of the encounter
- Consistently communicates clearly and effectively with patient and team members (e.g., listens carefully; attentive to verbal and non-verbal cues; demonstrates sensitivity)
References
American Board of Internal Medicine. (2013). Retrieved Sept 18, 2013, from http://www.abim.org/certification/policies/imss/im.aspx.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Barsuk, J., Ahya, S., Cohen, E., McGaghie, W., & Wayne, D. (2009). Mastery learning of temporary hemodialysis catheter insertion by nephrology fellows using simulation technology and deliberate practice. American Journal of Kidney Diseases, 54, 70–76.
Boots, R. J., Egerton, W., McKeering, H., & Winter, H. (2009). They just don't get enough! Variable intern experience in bedside procedural skills. Internal Medicine, 39, 222–227.
Brennan, R. (2001). GENOVA suite programs. Retrieved Apr 15, 2013, from http://www.education.uiowa.edu/centers/casma/computer-programs.aspx.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Card, S., Snell, L., & O'Brien, B. (2006). Are Canadian General Internal Medicine training program graduates well prepared for their future careers? BMC Medical Education, 6, 56.
Castorr, A. H., Thompson, K. O., Ryan, J. W., Phillips, C. Y., Prescott, P. A., & Soeken, K. L. (1990). The process of rater training for observational instruments: Implications for interrater reliability. Research in Nursing & Health, 13, 311–318.
Cook, D., & Beckman, T. (2006). Current concepts in validity and reliability for psychometric instruments: Theory and application. The American Journal of Medicine, 119, 166.e7–166.e16.
Downing, S. M. (2003). Validity: On meaningful interpretation of assessment data. Medical Education, 37, 830–837.
Downing, S. M. (2004). Reliability: On the reproducibility of assessment data. Medical Education, 38, 1006–1012.
Ericsson, K. A. (2008). Deliberate practice and acquisition of expert performance: A general overview. Academic Emergency Medicine, 15, 988–994.
Frank, J. (Ed.). (2005). The CanMEDS 2005 physician competency framework: Better standards. Better physicians. Better care. Ottawa: The Royal College of Physicians and Surgeons of Canada.
Ginsburg, S., Regehr, G., Hatala, R., McNaughton, N., Frohna, A., Hodges, B., et al. (2000). Context, conflict, and resolution: A new conceptual framework for evaluating professionalism. Academic Medicine, 75, S6–S11.
Gorman, C. A., & Rentsch, J. R. (2009). Evaluating frame-of-reference rater training effectiveness using performance schema accuracy. Journal of Applied Psychology, 94, 1336–1344.
Hicks, C. M., Gonzalez, R., Morton, M. T., Gibbons, R. V., Wigton, R. S., & Anderson, R. J. (2000). Procedural experience and comfort level in internal medicine trainees. Journal of General Internal Medicine, 15, 716–722.
Huang, G., Smith, C. C., Gordon, C., Feller-Kopman, D., Davis, R., Phillips, R., et al. (2006). Beyond the comfort zone: Residents assess their comfort performing inpatient medical procedures. American Journal of Medicine, 119, 71.e17–71.e24.
Jefferies, A., Simmons, B., Tabak, D., McIlroy, J., Lee, K., Roukema, H., et al. (2007). Using an objective structured clinical examination (OSCE) to assess multiple physician competencies in postgraduate training. Medical Teacher, 29, 183–191.
Kneebone, R., Nestel, D., Yadollahi, F., Brown, R., Nolan, C., Durack, J., et al. (2006). Assessing procedural skills in context: Exploring the feasibility of an Integrated Procedural Performance Instrument (IPPI). Medical Education, 40, 1105–1114.
Kromann, C., Jensen, M., & Ringsted, C. (2009). The effect of testing on skills learning. Medical Education, 43, 21–27.
LeBlanc, V., Tabak, D., Kneebone, R., Nestel, D., MacRae, H., & Moulton, C. (2009). Psychometric properties of an integrated assessment of technical and communication skills. American Journal of Surgery, 197, 96–101.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–104). New York: American Council on Education and Macmillan.
Moulton, C., Tabak, D., Kneebone, R., Nestel, D., MacRae, H., & LeBlanc, V. (2009). Teaching communication skills using the integrated procedural performance instrument (IPPI): A randomized controlled trial. American Journal of Surgery, 197, 113–118.
Naik, V., Devito, I., & Halpern, S. (2003). Cusum analysis is a useful tool to assess resident proficiency at insertion of labour epidurals. Canadian Journal of Anesthesia, 50, 694–698.
Norman, G., Bordage, G., Page, G., & Keane, D. (2006). How specific is case specificity? Medical Education, 40, 618–623.
Pugh, D., Touchie, C., Code, C., & Humphrey-Murto, S. (2010). Teaching and testing procedural skills: Survey of Canadian Internal Medicine program directors and residents. International Conference on Residency Education, Ottawa, ON. [OM Abstract 76]. Retrieved Sept 18, 2013, from http://www.openmedicine.ca/article/view/439/353.
Reznick, R., Regehr, G., MacRae, H., & Martin, J. (1997). Testing technical skill via an innovative "Bench Station" examination. The American Journal of Surgery, 173, 226–230.
Royal College of Physicians and Surgeons of Canada. (2011). Objectives of training in Internal Medicine. Retrieved Sept 18, 2013, from http://www.deptmedicine.utoronto.ca/Assets/DeptMed?Digital?Assets/Core?Internal?Medicine?Files/rcpscobjectives.pdf.
Wayne, D., Barsuk, J., O'Leary, K., Fudala, M., & McGaghie, W. (2008). Mastery learning of thoracentesis skills by internal medicine residents using simulation technology and deliberate practice. Journal of Hospital Medicine, 3, 48–54.
Wickstrom, G. C., Kolar, M. M., Keyserling, T. C., Kelley, D. K., Xie, S. X., Bognar, B. A., et al. (2000). Confidence of graduating internal medicine residents to perform ambulatory procedures. Journal of General Internal Medicine, 15, 361–365.