Psychometrics for Clinical Skills Assessment

Serkan Toy, PhD, Director of Evaluation and Program Development, Graduate Medical Education, Children’s Mercy Hospital – Kansas City

Upload: inspirenetwork

Post on 19-Jun-2015


DESCRIPTION

Dr. Serkan Toy (Children's Mercy Hospital Kansas City) summarizes current literature on assessment, evaluations, rubrics, and Global Assessment Scales from the perspective of Psychometrics.

TRANSCRIPT

Page 1: Psychometrics for Clinical Skills Assessment

Psychometrics for Clinical Skills Assessment

Serkan Toy, PhD

Director of Evaluation and Program Development

Graduate Medical Education

Children’s Mercy Hospital – Kansas City

Page 2: Psychometrics for Clinical Skills Assessment

Outline

• What is psychometrics?
– Measurement
– Construct development

• Reliability

• Validity

• A few other issues to consider
– Checklists vs. Global ratings

Page 3: Psychometrics for Clinical Skills Assessment

Psychometrics

• Educational & Psychological Measurement
– measurement of knowledge, abilities, attitudes, and personality traits
– mainly concerned with the construction and validation of measurement instruments (i.e., cognitive tests, surveys/questionnaires, and personality assessments)

Page 4: Psychometrics for Clinical Skills Assessment

Measurement

• Assigning numerals to observations based on some pre-defined criteria
• Or assigning a value to one object/observation in relation to another
– Intelligence: IQ
– Personality: Big Five
– Academic or procedural performance

Page 5: Psychometrics for Clinical Skills Assessment

Measurement

Subjective vs. Objective

Qualitative vs. Quantitative

Page 6: Psychometrics for Clinical Skills Assessment

Measurement

Global vs. Analytic

Page 7: Psychometrics for Clinical Skills Assessment

Construct

Well-defined vs. Ill-defined

What do we already know about the construct in question?

• Epistemological Beliefs

• Self efficacy

• Academic Performance

• Procedural Competency

Page 8: Psychometrics for Clinical Skills Assessment

Defining the Construct

• Factor Analysis
• Cluster Analytic Approach
• Cognitive/Procedural Task Analysis
– Hierarchical Task Analysis
– Delphi Technique: expert consensus (face validity)
– Angoff Method: determining passing (cut-off) scores
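The Angoff idea can be sketched roughly as follows. This is a minimal illustration, assuming each judge estimates, for every checklist item, the probability that a minimally competent trainee would perform it correctly; the cut score is then the sum over items of the mean estimate. The function name, judge labels, and all numbers are hypothetical.

```python
from statistics import mean


def angoff_cut_score(estimates):
    """Return a cut score from per-item judge estimates.

    estimates: dict mapping a judge label to a list of probabilities
    (0..1), one per item, that a minimally competent trainee would
    perform the item correctly. All values here are hypothetical.
    """
    lengths = {len(items) for items in estimates.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must rate the same number of items")
    # Average the judges' estimates item by item, then sum across items.
    per_item_means = [mean(item) for item in zip(*estimates.values())]
    return sum(per_item_means)


# Hypothetical panel: three judges, four checklist items.
judges = {
    "judge_a": [0.9, 0.6, 0.7, 0.5],
    "judge_b": [0.8, 0.7, 0.6, 0.4],
    "judge_c": [1.0, 0.5, 0.8, 0.6],
}
cut = angoff_cut_score(judges)
print(f"Cut score: {cut:.2f} out of {len(judges['judge_a'])} items")
```

In practice a panel would usually discuss and revise its estimates over more than one round before the cut score is fixed; the sketch covers only the arithmetic.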

Page 9: Psychometrics for Clinical Skills Assessment

Key Concepts in Assessment

• Reliability

• Validity

Page 10: Psychometrics for Clinical Skills Assessment

Reliability

“The more consistent the scores are over different raters and occasions, the more reliable the assessment is thought to be” (Moskal & Leydens, 2000, as cited in Jonsson & Svingby, 2007)

• Test-retest
• Inter-rater
• Binary vs. Likert scale
• Rubrics and calibration process

Jonsson, A. & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2, 130-144.
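For binary (performed / not performed) items scored by two raters, one common chance-corrected agreement index is Cohen's kappa. The slide does not name a specific statistic, so kappa is an illustrative choice here, and the ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when no variation exists
    return (observed - expected) / (1 - expected)


# Hypothetical binary checklist scores (1 = performed, 0 = not performed).
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

Raw percent agreement for the same data is 80%, which is why a chance-corrected index is often reported alongside it.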

Page 11: Psychometrics for Clinical Skills Assessment

Rubrics for Assessment

• “An assessment tool that describes levels of performance on a particular task” (Jonsson & Svingby, 2007)

• Analytic, topic-specific rubrics seem to enhance reliable scoring of performance especially when accompanied by examples and/or rater training

• Example:
– Objective assessment of surgical competence in gynaecological laparoscopy: development and validation of a procedure-specific rating scale (Larsen et al., 2008)

Jonsson, A. & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2, 130-144.

Larsen C.R., Soerensen J.L., Grantcharov T.P., Dalsgaard T, Schouenborg L., Ottosen C., Schroeder T.V., Ottesen B.S. (2008) Effect of virtual reality training on laparoscopic surgery: randomised controlled trial. BMJ, 14, 338, b1802.

Page 12: Psychometrics for Clinical Skills Assessment

Larsen et al. 2008

Example rating scale

| Item | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Economy of movements | Many unnecessary movements | | Efficient motion but some unnecessary motion | | Maximum economy of movements |
| Economy of time | Too long time used to perform sufficiently | | Intermediate time used to perform sufficiently | | Minimal time used to perform sufficiently |
| Errors: respect for tissue | … | | … | | … |
| Flow of operation | … | | … | | … |

Page 13: Psychometrics for Clinical Skills Assessment

Traditionally Validity

• Three C’s of the “Trinitarian” view
– Content
– Criterion
– Construct

Page 14: Psychometrics for Clinical Skills Assessment

Conceptual Change in Validity

“Validity is not a property of the test or assessment as such, but rather of the meaning of the test scores. These scores are a function not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment. In particular, what needs to be valid is the meaning or interpretation of the score; as well as any implications for action that this meaning entails” (p. 741).

Messick, S. (1995) Validity of psychological assessment: validation of inferences from persons’ responses and performance as scientific inquiry into score meaning. American Psychologist, 50, 741–9.

Page 15: Psychometrics for Clinical Skills Assessment

Validity

• Construct Validity
– Content / Face
– Convergent
– Discriminant
– Predictive

The question is: What can we validly conclude about a trainee who receives a score of “X” vs. that of receiving “Y”?

Page 16: Psychometrics for Clinical Skills Assessment

Construct Validity

• For competency assessment instruments, validity is usually established by examining whether they distinguish between groups logically presumed to differ on the competency being measured
– Experienced practitioners vs. trainees, or
– Peer-nominated superior performers vs. average performers

Scofield, M. E., & Yoxtheimer, L. Y. (1983). Psychometric issues in the assessment of clinical competencies. Journal of Counseling Psychology, 30, 413-420.
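A known-groups comparison of this kind can be sketched as below. This is a minimal illustration, assuming total scores from two independent groups; it computes Welch's t statistic and Cohen's d using only the standard library. The scores and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical checklist totals for two groups presumed to differ.
experienced = [24, 26, 23, 27, 25]
trainees = [18, 20, 17, 21, 19]

t_stat = welch_t(experienced, trainees)
effect = cohens_d(experienced, trainees)
print(f"t = {t_stat:.2f}, d = {effect:.2f}")
```

The standard library has no t distribution, so this sketch stops at the statistic; a p-value could be obtained with, for example, `scipy.stats.ttest_ind(experienced, trainees, equal_var=False)`.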

Page 17: Psychometrics for Clinical Skills Assessment

Other Issues to Consider

• Global ratings vs. Checklist scores
• Required sample size for validity testing
• Training the trainees
• Scoring the videotaped performances
– Individual procedural vs. qualitative skills
– Team performance
• Formative and summative assessment at the same time
– Training (intervention) or assessment (measurement)?

Page 18: Psychometrics for Clinical Skills Assessment

Global ratings vs. Checklist scores

• High correlation between global ratings and checklist scores

• Both seem to differentiate similarly between more experienced trainees and novices

Examples

• Kim J., Neilipovitz D., Cardinal P., Chiu M. (2009) A comparison of global rating scale and checklist scores in the validation of an evaluation tool to assess performance in the resuscitation of critically ill patients during simulated emergencies (abbreviated as “CRM simulator study IB”). Simulation in Healthcare, 4, 6–16.
• Morgan P.J., Cleave-Hogg D., Guest C.B. (2001) A comparison of global ratings and checklist scores from an undergraduate assessment using an anesthesia simulator. Academic Medicine, 76, 1053–5.

Page 19: Psychometrics for Clinical Skills Assessment

A Comparison of Global Rating Scale and Checklist Scores in the Validation of an Evaluation Tool to Assess Performance in the Resuscitation of Critically Ill Patients During Simulated Emergencies (Abbreviated as "CRM Simulator Study IB")

Kim, John MD, MEd, FRCPC; Neilipovitz, David MD, FRCPC; Cardinal, Pierre MD, FRCPC; Chiu, Michelle MD, FRCPC

• 32 PGY-1 & 28 PGY-3 residents: 2 simulation scenarios on Crisis Resource Management (CRM)
• Ottawa Global Rating Scale: 7-point anchored ordinal scale
– 5 CRM categories and overall performance score
• Ottawa CRM Checklist: 12 items in 5 CRM categories, max 30 points
• 3 raters blinded to year of training rated each videotaped performance (no order specified for use of evaluation tools)

Page 20: Psychometrics for Clinical Skills Assessment

Kim et al. 2009 - Continued

Reliability: inter-rater reliability, Intraclass Correlation Coefficient (ICC)

• Ottawa GRS: S1 = 0.59 & S2 = 0.61
– Subcategories showed similar ICC except “resource utilization & communication” (range 0.24 to 0.38)
• Cumulative CRM checklist: S1 = 0.63 & S2 = 0.55
– Again, subcategories showed similar ICC except “resource utilization & communication” (again range 0.24 to 0.38)

Validity:

• Content validation (face validity): Delphi process
• Response process: resident orientation & rater training
• Comparison of scores by PGY: t test (ANOVA)
– Both the checklist and GRS overall & subcategory scores showed statistically significant differences between PGY-1 and PGY-3 (more experienced residents receiving higher scores)
– ANOVA showed similar results by each scenario as well as per rater
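To make the ICC figures more concrete, here is a minimal sketch of one common form, the two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1). The slides do not say which ICC form Kim et al. used, so this choice is an assumption, and the ratings matrix is illustrative rather than study data.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a complete subjects-by-raters matrix.

    ratings: list of rows, one row per rated performance and one
    column per rater. Every rater must score every performance.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with >= 2 rows and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings (6 performances x 4 raters); not data from Kim et al.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(ratings)
print(f"ICC(2,1) = {icc:.2f}")
```

Because this form penalizes systematic differences in rater severity, raters who rank performances similarly but score on different levels will still produce a low value.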

Page 21: Psychometrics for Clinical Skills Assessment

A Comparison of Global Ratings and Checklist Scores from an Undergraduate Assessment Using an Anesthesia Simulator

Morgan, Pamela J. MD; Cleave-Hogg, Doreen PhD; Guest, Cameron B. MD, MEd

• 140 final-year medical students: 15-minute faculty-facilitated simulation scenario
– Conducted in 2nd week of the 2-week anesthesia rotation
– Faculty followed a script and each session was videotaped
– Each student completed 1 of 6 scenarios (each with similar learning objectives)
• 25-point criterion checklist for each scenario (not performed = 0; performed = 1)
• 5-point global rating (clear failure = 1 to superior performance = 5)
• 10 faculty attended a workshop on performance protocols
– Randomly assigned to a rating pair to evaluate 25 to 34 videotaped performances

Page 22: Psychometrics for Clinical Skills Assessment

Morgan et al. 2001 - Continued

• Correlation between checklist scores and global ratings: Pearson r = 0.74
• Global ratings correlated more highly with technical skills and judgment than with knowledge
– Knowledge r = 0.24
– Technical skills r = 0.51
– Judgment r = 0.53
• Single-rater reliability (consistency)
– Mean ICC for Checklist scores 0.77 (range 0.58 to 0.93)
– Mean ICC for Global ratings 0.62 (range 0.40 to 0.77)
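The checklist-to-global-rating correlation reported above is a Pearson r. A minimal sketch of that computation follows, assuming one checklist total and one global rating per student; the scores are hypothetical and are not data from Morgan et al.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares around each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores: 25-point checklist totals and 5-point global ratings.
checklist_totals = [12, 15, 18, 20, 22, 25]
global_ratings = [2, 3, 3, 4, 4, 5]
r = pearson_r(checklist_totals, global_ratings)
print(f"r = {r:.2f}")
```

A 5-point global rating is ordinal, so a rank-based coefficient such as Spearman's rho is sometimes reported as well; Pearson r is shown here because it is the statistic the slide cites.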

Page 23: Psychometrics for Clinical Skills Assessment

Other Issues to Consider

• Required sample size for validity testing
• Training the trainees
• Scoring the videotaped performances
– Individual procedural vs. qualitative skills
– Team performance
• Formative and summative assessment at the same time
– Training (intervention) or assessment (measurement)?

• Assessment tools to link simulated performance to actual patient outcomes