Peering Through the Looking Glass: Towards a Programmatic View of the Qualifying Examination
TRANSCRIPT
|||
André De Champlain, PhD
Director, Psychometrics and Assessment Services
Presented at the MCC Annual General Meeting | 27 September 2015
Peering Through the Looking Glass: Towards a Programmatic View of the Qualifying Examination
||
A Brave New Journey… Programmatic Assessment
MCC Annual Meeting – September 2015
||
Guiding Vision: The Assessment Review Task Force
6 ARTF Recommendations
Recommendation 1
• LMCC becomes ultimate credential (legislation issue)
Recommendation 2
• Validate and update blueprint for MCC examinations
Recommendation 3
• More frequent scheduling of the exams and associated automation
Recommendation 4
• IMG assessment enhancement and national standardization (NAC & Practice Ready Assessment)
Recommendation 5
• Physician practice improvement assessments
Recommendation 6
• Implementation oversight, committee priorities, and budgets
|||
Validate and Update Blueprint for MCC Examinations
ARTF Recommendation #2
||
Blueprint Project
That the content of MCC examinations be expanded by:
• Defining knowledge and behaviours in all CanMEDS Roles that demonstrate competency of the physician about to enter independent practice
• Reviewing adequacy of content and skill coverage on blueprints for all MCC examinations
• Revising examination blueprints and reporting systems with the aim of demonstrating that appropriate assessment of all core competencies is covered and fulfills the purpose of each examination
• Determining whether any general core competencies considered essential cannot be tested employing the current MCC examinations, and exploring the development of new tools to assess these specific competencies when current examinations cannot
||
Addressing Micro-Gaps
A number of efforts are underway to assess how the current MCCQE Parts I & II might evolve towards better fulfilling the MCCQE blueprint
• New OSCE stations focusing on more generic skills and complex presentations expected of all physicians, irrespective of specialty, have been piloted with great success in the fall 2014 and spring 2015 MCCQE Part II administrations
• Potential inclusion of innovative item types, including situational judgment test challenges, is also under careful consideration and review
• From a micro-gap perspective, the MCCQE is aligning more closely with the two dimensions outlined in our new blueprint
||
Tacit Sub-Recommendation: Macro-Analysis
A challenge intimated in the ARTF report pertains to the need to conduct a macro-analysis and review of the MCCQE
• Applying a systemic (macroscopic) lens to the MCCQE as an integrated examination system and not simply as a restricted number of episodic “hurdles” (MCCQE Parts I & II)
• How are the components of the MCCQE interconnected, and how do they inform key markers along a physician’s educational and professional continuum?
• How can the MCCQE progress towards embodying an integrated, logically planned and sequenced system of assessments that mirrors the Canadian physician’s journey?
||
Key Recommendation: MEAAC
• Medical Education Assessment Advisory Committee (MEAAC)
• MEAAC was a key contributor to our practice analysis (blueprinting) efforts through their report, Current Issues in Health Professional and Health Professional Trainee Assessment
• Key recommendations
◦ Explore the implementation of an integrated & continuous model of assessment (linked assessments)
◦ Continue to incorporate “authentic” assessments in the MCCQE
– OSCE stations that mimic real practice
– Direct observation based assessment to supplement MCCQE Parts I & II
|||
Framework for a Systemic Analysis
of the MCCQE
||
Programmatic Assessment (van der Vleuten et al., 2012)
• Calls for a “deliberate”, arranged set of longitudinal assessment activities
• Joint attestation of all data points for decision and remediation purposes
• Input of expert professional judgment is a cornerstone of this model
• (Purposeful) link between assessment and learning/remediation
• Dynamic, recursive relationships between assessment and learning points
||
Programmatic Assessment (van der Vleuten et al., 2012)
• Application of a program evaluation framework to assessment
• Systematic collection of data to answer specific questions about a program
• Gaining in popularity within several medical education settings
◦ Competency-based workplace learning
◦ Medical schools (e.g., Dalhousie University, University of Toronto, etc.)
◦ Etc.
||
Programmatic Assessment Refocuses the Debate
Reductionism
• A system reduces to its most basic elements (e.g., corresponds to the sum of its parts)
• Decision point I = MCCQE Part I
• Decision point II = MCCQE Part II
VS.
Emergentism
• A system is more than the sum of its parts & also depends on complex interdependencies amongst its component parts
• Decision point I: Purposeful integration of MCCQE Part I scores with other data elements
• Decision point II: Purposeful integration of MCCQE Part II scores with other data elements
||
Can a Programmatic Assessment Framework Be
Applied to the MCCQE?
• Programmatic assessment is primarily restricted to local contexts (medical school, postgraduate training program, etc.)
• What about applicability to high-stakes registration/licensing exam programs?
• “The model is limited to programmatic assessment in the educational context, and consequently licensing assessment programmes are not considered” (van der Vleuten et al., 2012, p. 206)
||
Can a Programmatic Assessment Framework Be
Applied to the MCCQE?
• Probably not as conceived, due to differences in:
◦ Settings (medical school vs. qualifying exam)
◦ Stakes (graduation vs. licence to practise as a physician)
◦ Outcomes
◦ Interpretation of data sources
◦ Nature of the program and its constituent elements
||
Can the Philosophy Underpinning Programmatic
Assessment Be Applied to the MCCQE?
||
How can the MCCQE Evolve?
• At its philosophical core, from (1) an episodic system of two point-in-time exams to (2) an integrated program of assessment, continually supported by best practice and evidence, which includes:
◦ The identification of data elements aimed at informing key decisions and activities along the continuum of a physician’s medical education
◦ Clearly laid-out relationships exemplified by the interactions between those elements, predicated on a clearly defined program (the MCCQE)
◦ Defensible feedback interwoven at key points in the program
• The $64,000 question: What does the MCCQE program of assessment look like (actually, the $529,153.80 question)?
||
Assessment Continuum for Canadian Trainees
[Figure: timeline spanning Undergraduate Education, Postgraduate Training, In Practice and Continuing Professional Development, with decision points D1 and D2 and full licensure]
UGME assessments potentially leading to MCC BP Decision Point 1 in clerkship:
• SR-items (17/17)
• CR-items (16/17)
• OSCE (17/17)
• Direct observation reports (14/17)
• In-training evaluation (13/17)
• Simulation (10/17)
• MSF/360 (6/17)
• Others (11/17)
CFPC:
• Direct observation
• CR-items (SAMPs)
• Structured orals
Royal College (32 entry specialties):
• ITEs
• Direct observation
• SR-items
• CR-items
• OSCE/orals
• Simulations
• Chart audits
PPI (FMRAC):
• Assessment of practice
• Audits
||
Major Step Towards a Programmatic View of the MCCQE
• What constitutes the learning/assessment continuum for physicians from “cradle to grave” (UGME to PPI)?
• At a pan-Canadian level:
◦ A temporal timeline is a necessary, but insufficient, condition for better understanding the life cycle of a physician
◦ What competencies do physicians develop throughout this life cycle?
◦ What behavioural indicators (elements) best describe “competency” at various points in the life cycle?
◦ How are these competencies related (both within and across points in the life cycle)?
◦ How do these competencies evolve?
◦ Etc.
• All of these questions are critical in better informing the development of a programmatic model for the LMCC
||
First Step Towards a Programmatic View of the MCCQE
November Group on Assessment (NGA)
• Purpose
◦ To define the “life of a physician” from the beginning of medical school to retirement in terms of assessments
◦ To propose a common national framework of assessment using a programmatic approach
• Composition
◦ Includes representation from the AFMC, CFPC, CMQ, FMRAC, MCC, MRAs and Royal College
• First step
◦ Physician pathway
||
November Group on Assessment: Next Step
• Summit to define a program of assessment
◦ Planned for the first quarter of 2016
• Starting points to develop a program of assessment
◦ Various ongoing North American EPA projects
◦ Milestone projects (Royal College, ACGME)
◦ CanMEDS 2015
◦ MCCQE Blueprint!
◦ … and many others
• Critical to develop an overarching framework (program) prior to specifying the elements and relationships of this program
|||
Validating a Program of Assessment
||
Appeal
• Emphasis is on a composite of data elements (quantitative and qualitative) to better inform key educational decisions as well as learning
• Including both micro-level (element) and macro-level (complex system of interrelated elements) indicators adds an extra layer of complexity to the MCCQE validation process
• The systemic nature of programmatic assessment requires validating not only the constituent elements (various data points) but also the program in and of itself
• How do we proceed?
||
Standards for Educational and Psychological Testing (2014)
Key Objectives
• Provide criteria for the development and evaluation of tests and testing practices, and guidelines for assessing the validity of interpretations of test scores for the intended test uses
• Although such evaluations should depend heavily on professional judgment, the Standards provides a frame of reference to ensure that relevant issues are addressed
||
Assessing the Foundational Properties of a Program of Assessment
1. Reliability
2. Validity
||
Reliability
• “Test” score: a reminder
◦ Any assessment, by virtue of practical constraints (e.g., available testing time), is composed of a very restricted sample of the items, stations and tasks that make up the domain of interest
◦ Example: my WBA program includes 12 completed mini-CEX forms
◦ But as a test score user, are you really interested in the performance of candidates in those 12 very specific encounters? No!
||
Reliability
• You’re interested in generalizing from the performance in those 12 very specific encounters to the broader domains of interest
• Reliability provides an indication of the degree of consistency (or precision) with which test scores and/or decisions are produced by a given examination (sample of OSCE stations, sample of MCQs, sample of workplace-based assessments, etc.)
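The force of this point can be made concrete with the Spearman-Brown prophecy formula, which projects how score reliability changes as the number of sampled encounters grows. A minimal sketch; the single-encounter reliability of 0.30 is an invented illustration, not an MCC figure:

```python
def spearman_brown(rho_1: float, k: int) -> float:
    """Project the reliability of a score based on k parallel units
    (encounters, stations, items), given single-unit reliability rho_1."""
    return k * rho_1 / (1 + (k - 1) * rho_1)

# Illustrative only: a single mini-CEX encounter with reliability 0.30
# projects to roughly 0.84 when 12 encounters are sampled.
print(round(spearman_brown(0.30, 12), 2))  # 0.84
```

Read in reverse, the same formula answers “how many encounters do I need?” for a target reliability, which is the practical question behind sampling 12 mini-CEX forms rather than one.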
||
Reliability
• Measurement error arises from multiple sources (it is multifaceted)
• For a WBA, measurement error could be attributable to:
◦ Selection of a particular set of patient encounters
◦ Patient effects
◦ Occasion effects
◦ Rater effects
◦ Setting effects (if given at multiple locations)
• These sources need to be clearly identified and addressed a priori
• The impact of each of these sources needs to be estimated
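For a fully crossed persons × raters design, these sources can be disentangled with a generalizability (G) study. Below is a minimal sketch of the textbook variance-component estimates from mean squares; the 3 × 2 score matrix is invented for illustration, and a real G-study would use dedicated software and richer designs:

```python
def g_study(scores):
    """Variance components for a fully crossed persons x raters design
    (one score per cell) and the relative G coefficient for the mean
    score over the observed number of raters."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ms_res = sum(
        (scores[p][r] - p_means[p] - r_means[r] + grand) ** 2
        for p in range(n_p) for r in range(n_r)
    ) / ((n_p - 1) * (n_r - 1))

    var_res = ms_res                         # residual (incl. person x rater interaction)
    var_p = max(0.0, (ms_p - ms_res) / n_r)  # candidate (universe score) variance
    var_r = max(0.0, (ms_r - ms_res) / n_p)  # rater leniency/severity variance
    g_rel = var_p / (var_p + var_res / n_r)  # relative G coefficient
    return var_p, var_r, var_res, g_rel

# Invented ratings: 3 candidates, each scored by the same 2 raters
var_p, var_r, var_res, g = g_study([[4, 5], [2, 3], [6, 7]])
```

In this toy matrix the second rater is uniformly one point more lenient, so that difference is absorbed entirely by the rater component and the relative G coefficient (which ignores constant rater effects) is perfect; real data would also show a non-zero residual.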
||
Reliability of a Program of Assessment?
• Programmatic assessment is predicated on the notion that many purposefully selected and arranged data elements contribute to the evaluation of candidates
• In addition to assessing the reliability of each element in the system, the reliability of scores/decisions based on this composite of measures therefore needs to be assessed
• Models:
◦ Multivariate generalizability theory (Moonen-van Loon et al., 2013)
◦ Structural equation modeling
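One simpler alternative to a full multivariate G-theory analysis is the classical composite-reliability result: treat the program decision as a weighted sum of element scores and assume measurement errors are uncorrelated across elements. A sketch under those assumptions, with invented numbers:

```python
from itertools import product

def composite_reliability(weights, sds, rels, corr):
    """Reliability of a weighted composite of element scores, assuming
    errors are uncorrelated across elements. corr[i][j] is the
    observed-score correlation between elements i and j."""
    n = len(weights)
    # Observed variance of the composite: sum of all weighted covariances
    var_comp = sum(
        weights[i] * weights[j] * sds[i] * sds[j] * corr[i][j]
        for i, j in product(range(n), repeat=2)
    )
    # Error variance contributed by each element: w^2 * sd^2 * (1 - rel)
    var_err = sum(
        w * w * sd * sd * (1 - r) for w, sd, r in zip(weights, sds, rels)
    )
    return 1 - var_err / var_comp

# Two invented elements (e.g., an MCQ score and a WBA composite),
# equally weighted, correlated at 0.5, each with reliability 0.80
rel = composite_reliability(
    weights=[1, 1], sds=[1, 1], rels=[0.8, 0.8], corr=[[1, 0.5], [0.5, 1]]
)
```

With these numbers the composite comes out at about 0.87, higher than either element alone: positively correlated elements reinforce the true-score signal while their independent errors partially cancel.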
||
Validity: What It Is
Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment. (Messick, 1989)
||
Validity: What It’s Not
• There is no such thing as a valid or invalid exam or assessment
• Statements such as “my mini-CEX shows construct validity” are completely devoid of meaning
• Validity refers to the appropriateness of inferences or judgments based on test scores, given supporting empirical evidence
||
Validity: Kane’s Framework (1992)
1. State the interpretive argument as clearly as possible
2. Assemble evidence relevant to the interpretive argument
3. Evaluate the weakest part(s) of the interpretive argument
4. Restate the interpretive argument and repeat
• Five key arguments
||
Validity: Kane’s Five Key Arguments
1. Evaluation Argument
• The scoring rule is appropriate
• The scoring rule is applied accurately and consistently
• Evidence
◦ Clearly documented training, scoring rules and processes for the elements included in the program, as well as for complex interactions among these components
2. Generalization Argument
• The sample of items/cases in the exam is representative of the domain (universe of items/cases)
• Evidence
◦ Practice analysis/blueprinting effort
||
Validity: Kane’s Five Key Arguments
3. Extrapolation Argument
• Does the program of assessment lead to the intended outcomes?
• Evidence
◦ Do the outcomes of the program (LMCC or not) relate to clear practice-based indicators as anticipated?
4. Explanation Argument
• Is the program of assessment measuring what was intended?
• Evidence
◦ Structural equation modeling, mapping of expert judgments, etc.
||
Validity: Kane’s Five Key Arguments
5. Decision-Making Argument
• Is the program of assessment appropriately passing and failing the “right” candidates?
• How do we set a passing standard for a program of assessment?
• Evidence
◦ Internal validity
– Documentation of the process followed
– Inter-judge reliability, generalizability analyses, etc.
◦ External validity
– Relationship of performance on the exam to other criteria
||
A Practical Framework (Dijkstra et al., 2012)
Collecting information:
• Identify the components of the assessment program (What?)
• Identify how each component contributes to the goal of the assessment program for stakeholders (Why?)
• Outline the balance between components that best achieves the goal of the assessment program for stakeholders (How?)
||
A Practical Framework (Dijkstra et al., 2012)
Obtaining support for the program:
• A significant amount of faculty development is required to ensure a level of expertise in performing critical tasks (e.g., rating)
• The higher the stakes, the more robust the procedures need to be
• Acceptability
◦ Involve and seek buy-in from key stakeholders
||
A Practical Framework (Dijkstra et al., 2012)
Domain mapping:
• Gather evidence to support that each assessment component targets the intended element in the program (micro-level)
• Gather evidence to support that the combination of components measures the overarching framework (macro-level)
||
A Practical Framework (Dijkstra et al., 2012)
Justifying the program:
• All new initiatives need to be supported by scientific (psychometric) evidence
• A cost-benefit analysis should be undertaken in light of the purpose(s) of the assessment program
|||
Some Additional Challenges
||
Narrative Data in the MCCQE
• Narrative (qualitative) data poses unique opportunities and challenges for inclusion in a high-stakes program of assessment (MCCQE)
• Opportunity
◦ Enhance the quality and usefulness of feedback provided at key points in the program
• Challenge
◦ How best to integrate qualitative data in a sound, defensible, reliable and valid fashion, in a program of assessment that fully meets legal and psychometric best practice
• How can we better systematize feedback?
||
Narrative Data in the MCCQE
• Automated Essay Scoring (AES)
◦ AES can build scoring models based on previously human-scored responses
◦ AES relies on:
– Natural language processing (NLP) to extract linguistic features of each written answer
– Machine-learning algorithms (MLA) to construct a mathematical model linking the linguistic features to the human scores
◦ The same scoring model can then be applied to new sets of answers
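As a toy illustration of that NLP-plus-machine-learning pipeline (not the LightSide software itself, and far simpler than any production AES system), the sketch below extracts bag-of-words features from human-scored answers, averages them into one profile per score level, and scores new answers by similarity; all the sample answers are invented:

```python
import math
from collections import Counter

def features(answer: str) -> Counter:
    """NLP step (deliberately minimal): bag-of-words token counts."""
    return Counter(answer.lower().split())

def train(scored_answers):
    """ML step: average feature vector (centroid) per human score level."""
    sums, counts = {}, Counter()
    for text, human_score in scored_answers:
        sums.setdefault(human_score, Counter()).update(features(text))
        counts[human_score] += 1
    return {s: {w: c / counts[s] for w, c in vec.items()} for s, vec in sums.items()}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score(model, answer: str):
    """Apply the trained model to a new write-in answer."""
    vec = features(answer)
    return max(model, key=lambda s: cosine(vec, model[s]))

# Invented training data: write-ins scored 1 (acceptable) or 0 (not)
model = train([
    ("order ecg and troponin for chest pain", 1),
    ("ecg troponin chest pain aspirin", 1),
    ("give antibiotics", 0),
    ("prescribe antibiotics and rest", 0),
])
```

Production systems use much richer linguistic features and stronger learners, but the three stages (feature extraction, model fitting, application to new answers) mirror the bullets above.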
||
Narrative Data in the MCCQE
• AES of MCCQE Part I CDM write-in responses
◦ AES (LightSide) was used to parallel-score 73 spring 2015 CDM write-ins
◦ Overall human-machine concordance rate >0.90 (higher for dichotomous items; lower for polytomous items)
◦ Overall pass/fail concordance near 0.99, whether CDMs are scored by residents or by computer
• AES holds a great deal of promise as a means to systematize qualitative data in the MCCQE program
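Concordance figures like these reduce to simple agreement statistics. The sketch below computes the raw human-machine agreement rate and Cohen's kappa (which corrects for agreement expected by chance) on invented pass/fail decisions:

```python
from collections import Counter

def agreement(human, machine):
    """Raw concordance: proportion of cases where the two scores match."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)

def cohens_kappa(human, machine):
    """Chance-corrected agreement between two scorers on nominal labels."""
    n = len(human)
    p_o = agreement(human, machine)
    h_freq, m_freq = Counter(human), Counter(machine)
    # Chance agreement: product of each scorer's marginal label rates
    p_e = sum(
        (h_freq[label] / n) * (m_freq[label] / n)
        for label in set(human) | set(machine)
    )
    return (p_o - p_e) / (1 - p_e)

# Invented pass (1) / fail (0) decisions for five candidates
human = [1, 1, 0, 1, 0]
machine = [1, 1, 0, 0, 0]
print(agreement(human, machine))  # 0.8
```

Kappa is the more conservative number to report: with a highly imbalanced pass rate, a raw concordance near 0.99 can be achieved by always predicting a pass, and kappa would expose that.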
||
Argument for Accreditation of Observation-Based Data
• There is insufficient evidence to support incorporating “local” (e.g., medical school) scores (ratings) obtained from direct observation into the MCCQE without addressing a number of issues
◦ Examiner training, patient problem variability, etc.
◦ These issues may never be fully resolved to high-stakes assessment standards
• However, accrediting (attesting) observational data sources based on strict criteria and guidelines might be a viable compromise
||
Argument for Accreditation of Observation-Based Data
• Accrediting (with partners) all facets of observation-based data sources will require meeting a number of agreed-upon standards:
◦ Selection of specific rating tool(s) (e.g., mini-CEX)
◦ Adherence to a strict examiner training protocol
– Attestation that examiners have successfully met training targets (online video training module)
◦ A sampling strategy (patient mix) based on an agreed-upon list of common problems (and the MCCQE blueprint)
◦ Adherence to common scoring models
||
Putting the Pieces Together
• At a programmatic level, how can we aggregate this combination of low- and high-stakes data to arrive at a defensible decision, both for entry into supervised practice and for entry into independent practice?
• A standard-setting process offers a defensible model that would allow expert judgment to be applied towards the development of a policy that could factor in all sources of data
• Empirical (substantively based) analyses would then be carried out to support and better inform (or even refute) that policy
◦ Structural equation modeling, multivariate generalizability analysis, etc.
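A standard-setting policy for combining such data sources usually lands in one of two families of rules, which can be stated precisely: compensatory (a weighted composite against a single cut score, so strength in one element can offset weakness in another) or conjunctive (a minimum standard on every element). A minimal sketch; the element names, weights and cut scores are all invented:

```python
def compensatory_pass(scores, weights, cut):
    """Pass if the weighted composite of all data elements meets one cut score."""
    return sum(w * s for w, s in zip(weights, scores)) >= cut

def conjunctive_pass(scores, minima):
    """Pass only if every data element meets its own minimum standard."""
    return all(s >= m for s, m in zip(scores, minima))

# Invented scores for one candidate: [MCCQE Part I, MCCQE Part II, WBA ratings]
scores = [72, 61, 80]
print(compensatory_pass(scores, weights=[0.4, 0.4, 0.2], cut=65))  # True
print(conjunctive_pass(scores, minima=[65, 65, 65]))               # False
```

The two rules embody different validity claims: compensatory rules assume the elements measure a common composite, while conjunctive rules assert that each element guards a distinct, non-negotiable competency, which is why the choice is a policy decision for expert judgment rather than a purely statistical one.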
||
Next Steps
• Begin to lay the foundation for an MCCQE program of assessment
◦ Define both the micro- and macro-elements that define a program of assessment leading up to each MCCQE decision point
◦ Initial efforts led by the November Group on Assessment
• Agree on all supporting standards that need to be uniformly adopted by all stakeholders:
◦ Accreditation criteria, where applicable
◦ Core tools and a pool of cases to be adopted by all schools
◦ Training standards and clear outcomes for examiners
◦ Scoring and standard-setting frameworks
||
Next Steps
• A collaborative pilot project framework with key partners and stakeholders
◦ Formulate key targeted research questions needed to support the implementation of a programmatic framework for the MCCQE
◦ Identify collaborators (e.g., UGME programs, postgraduate training programs, MRAs, etc.) to answer specific questions through these investigations
◦ Aggregate information to better inform and support a programmatic model of assessment for the MCCQE
||
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to!”
- Alice in Wonderland