
Threats to Validity in Evaluating CME Activities

Jason King, Baylor College of Medicine

Citation: King, J. E. (2008, January). Threats to Validity in Evaluating CME Activities. Presented at the annual meeting of the Alliance for Continuing Medical Education, Orlando, FL.

Please do not disseminate or adapt without express permission of the author. Thank you.

Introduction

CME activity evaluation can be conceptualized as a research study because it includes:

• Design (how the study is conducted)

• Instrument(s)

• Analysis of data to draw inferences about the effect of an intervention or treatment (ie, the educational activity)

Each component can be affected by bias

GOAL: Minimize Bias → Increase Validity → Ability to Draw More Plausible Conclusions

Introduction

Objectives for Participants:

1. Be aware of the importance of collecting valid data

2. Begin to think about ways to minimize the effects of bias

3. Keep things simple

• Much can be learned by applying a simple, yet effective research design...and fully describing the data using simple summary statistics!

Introduction

There are potential biases associated with each component of a study:

1. Study Design

• Issues related to Internal Validity

• Issues related to External Validity

2. Instrument Design

• Issues related to Construct Validity

3. Data Analysis

• Issues related to Statistical Conclusion Validity

Study Design—Internal Validity Issues

Internal Validity: To what degree is the study designed such that we can infer that the treatment caused the measured effect?

EXAMPLE: Did participation in a CME program change physician prescribing practices?

• An internally valid study will minimize the influence of extraneous variables


Types of Designs:

1. One Group Posttest Design

X E

X = Implementation of the treatment

E = Measurement of subjects in experimental group


Types of Designs:

1. One Group Posttest Design

X E

• NOT a true experiment

• Perhaps most common design in CME, yet this design is most likely to produce biased results: Why?


Types of Designs:

2. One Group Pretest/Posttest Design

E1 X E2

X = Implementation of the treatment

E = Measurement of subjects in experimental group


Types of Designs:

2. One Group Pretest/Posttest Design

E1 X E2

• NOT a true experiment

• Potentially offers less biased results because each subject serves as their own control


Types of Designs:

3. Comparison Group Posttest Design

X E

C

X = Implementation of the treatment

E = Measurement of subjects in experimental group

C = Measurement of subjects in comparison group


Types of Designs:

3. Comparison Group Posttest Design

X E

C

• NOT a true experiment

• Comparison group potentially controls some threats to validity, but is not randomly assigned


Types of Designs:

4. Comparison Group Pretest/Posttest Design

E1 X E2

C1 C2

X = Implementation of the treatment

E = Measurement of subjects in experimental group

C = Measurement of subjects in comparison group


Types of Designs:

4. Comparison Group Pretest/Posttest Design

E1 X E2

C1 C2

• NOT a true experiment

• Comparison group potentially controls some threats to validity

• Pre‐existing differences can be measured


Types of Designs:

5. Control Group Posttest Design

R  X  E1

R     C1

X = Implementation of the treatment

E = Measurement of subjects in experimental group

C = Measurement of subjects in control group

R = Randomization to treatment groups


Types of Designs:

5. Control Group Posttest Design

R  X  E1

R     C1

• TRUE experiment because of randomization

• Pre‐existing differences can be minimized through randomization


Types of Designs:

6. Control Group Pretest/Posttest Design

R  E1  X  E2

R  C1     C2

X = Implementation of the treatment

E = Measurement of subjects in experimental group

C = Measurement of subjects in control group

R = Randomization to treatment groups


Types of Designs:

6. Control Group Pretest/Posttest Design

R  E1  X  E2

R  C1     C2

• TRUE experiment because of randomization

• Offers strong control and measurement of pre‐existing differences


Types of Designs:

6. Control Group Pretest/Posttest Design

EXAMPLE:

• Comparison of online CME vs. traditional, live CME

• Internet novices were expected to gravitate away from online CME, so subjects were randomly assigned
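The random assignment step in the example above can be sketched in a few lines. This is only an illustration, not the procedure used in the study: the function name `randomly_assign`, the group labels, and the 10-person roster are hypothetical.

```python
import random

def randomly_assign(participant_ids, seed=None):
    """Shuffle participants and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Hypothetical labels: first half gets online CME, second half live CME
    return {"online": shuffled[:half], "live": shuffled[half:]}

# Hypothetical roster of 10 volunteer physicians, identified by number
groups = randomly_assign(range(1, 11), seed=2008)
print(groups)
```

Because chance, not preference, decides the group, Internet novices should end up spread across both formats rather than clustering in the live activity.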




Potential Biases and Threats to Validity:

1. HISTORY: Events that occur in the subjects’ environment that may affect the outcome

Scenario

A CME program is held that targets specific changes in physician prescribing practices




Potential Bias

During the study period, a series of meetings is offered nationally (external to the study) to increase the same prescribing practices, which artificially increases the treatment effect




Minimize Bias

• Include a control group (eg, individuals who did not participate in the CME program)

• Report results (eg, means) disaggregated by the variables that may be confounding the relationship (eg, attendance at one of the meetings associated with the initiative)




When possible, always plan to include measures of any variables that may potentially cause bias!
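Reporting means disaggregated by a potential confounder might look like the sketch below. The records are made-up illustration data, and the field names (`cme`, `meeting`, `score`) and the helper `disaggregated_means` are assumptions for the example only.

```python
from collections import defaultdict
from statistics import mean

def disaggregated_means(records, outcome, by):
    """Return the mean of `outcome` for each combination of the `by` variables."""
    cells = defaultdict(list)
    for row in records:
        key = tuple(row[name] for name in by)
        cells[key].append(row[outcome])
    return {key: mean(values) for key, values in cells.items()}

# Made-up data: prescribing score by CME participation and outside-meeting attendance
records = [
    {"cme": True, "meeting": True, "score": 8},
    {"cme": True, "meeting": False, "score": 6},
    {"cme": True, "meeting": False, "score": 7},
    {"cme": False, "meeting": True, "score": 6},
    {"cme": False, "meeting": False, "score": 4},
    {"cme": False, "meeting": False, "score": 5},
]

for key, value in sorted(disaggregated_means(records, "score", ["cme", "meeting"]).items()):
    print(key, value)
```

Comparing CME participants with non-participants within the same meeting-attendance stratum keeps the outside meetings from being mistaken for a treatment effect.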



2. TESTING EFFECTS: Changes in what is being measured brought about by the reaction to the process of measurement

Scenario

A test is administered before and after an online CME activity to measure change in knowledge levels




Potential Bias

Using the same items on both instruments “cues” subjects into the topics that will be assessed on the posttest, resulting in an artificially increased treatment effect

• Can affect attitudinal measures as well




Minimize Bias

• Include a comparison group not receiving the pretest

• Create a parallel posttest

However, a Psychometrician may be needed (ie, to ensure that items are of similar difficulty and discrimination)



3. INSTRUMENTATION: Changes in the attributes of the measuring instrument or procedure take place during the study

Scenario

Physicians are asked to rate their office personnel before and after a systems‐based educational intervention aimed at improving communication skills




Potential Bias

Physicians may be disposed to give more favorable ratings the second time because they expect (consciously or subconsciously) a change to have occurred




Minimize Bias

Have an external observer rate the office personnel, perhaps without knowledge of when the intervention takes place (blinding)




Scenario #2

A self‐report instrument is administered before and after the CME activity to determine changes in self‐assessed confidence levels




Potential Bias

The intervention changes the subject’s evaluation standard with regard to the dimension measured, resulting in incommensurate pre/post data (Response Shift Bias)




Minimize Bias

• Administer a post‐activity survey using a retrospective assessment:

Confidence in ability to differentially diagnose vascular dementia from other forms of cognitive dysfunction

Scale anchors: No Confidence    Some Confidence    High Confidence    Very High Confidence

Before Activity:   1   2   3   4   5   6   7   8   9   10

After Activity:    1   2   3   4   5   6   7   8   9   10

Absent from related presentation(s)

Not applicable to my practice
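Summarizing the retrospective item above could be done along these lines. This is a minimal sketch: the response layout (`before`, `after`, `opt_out` keys) and the function `retrospective_change` are assumptions, and the ratings are invented.

```python
from statistics import mean

# The two opt-out boxes on the sample item (labels shortened for the example)
EXCLUDED = {"absent", "not applicable"}

def retrospective_change(responses):
    """Mean before, mean after, and mean change for respondents who rated both."""
    rated = [
        r for r in responses
        if r.get("opt_out") not in EXCLUDED
        and r.get("before") is not None
        and r.get("after") is not None
    ]
    return {
        "n": len(rated),
        "before": mean(r["before"] for r in rated),
        "after": mean(r["after"] for r in rated),
        "change": mean(r["after"] - r["before"] for r in rated),
    }

# Invented ratings on the 1-10 confidence scale, both collected after the activity
responses = [
    {"before": 3, "after": 7},
    {"before": 5, "after": 8},
    {"before": 4, "after": 6},
    {"opt_out": "absent"},
    {"opt_out": "not applicable"},
]
print(retrospective_change(responses))
```

Both ratings are made with the same post‐activity frame of reference, which is the point of the retrospective format; respondents who checked an opt‐out box are left out of the means rather than scored as zero.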




4. SELECTION: Bias occurring when naturally existing groups are studied (eg, volunteers)

Scenario

Physicians choosing to attend a CME activity are surveyed and tested after the activity, with results compared to data obtained from a group of physicians who elected not to attend the activity



4. SELECTION: Bias occurring when naturally existing groups are studied

Potential Bias

Physicians who chose not to attend the activity had little interest in the subject matter and thus entered the study with lower motivation and knowledge levels



4. SELECTION: Bias occurring when naturally existing groups are studied

Minimize Bias

• Administering a pretest will help to determine whether or not selection bias is a problem, but will not solve the problem

• Offer the activity at two time periods and randomly assign volunteers to attend either the first or second activity (the latter serves as control group)
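One way to use a pretest to check for selection bias is to express the baseline gap between groups in standard deviation units. The sketch below assumes two lists of invented pretest scores; the function name is hypothetical, and no cutoff for "too large" is implied.

```python
from math import sqrt
from statistics import mean, stdev

def standardized_difference(group_a, group_b):
    """Standardized mean difference (pooled-SD version) between two groups."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Invented pretest scores for attendees and for physicians who elected not to attend
attendees = [6, 7, 8, 9, 10]
non_attendees = [4, 5, 6, 7, 8]
print(round(standardized_difference(attendees, non_attendees), 2))
```

A large baseline difference signals that the groups were not comparable before the activity. As noted above, this diagnoses the problem but does not remove it.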



5. DIFFERENTIAL ATTRITION: When subjects who drop out differ in important ways from the remaining subjects (related to Non‐Response Bias)



To Strengthen Internal Validity:

1. Use a comparison group to control some threats to validity

2. Use random assignment to groups to equally distribute prior differences between individuals

• Does not always work, so differences should also be measured


To Strengthen Internal Validity :

3. Use statistical analysis to control for differences on relevant background variables.

• In spite of frequent usage, this approach is not optimal!¹

¹ Loftin, L. B., & Madison, S. Q. (1991). The extreme dangers of covariance corrections. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 133‐147). Greenwich, CT: JAI Press.

Miller, G. A., & Chapman, J. P. (2001). Misunderstanding Analysis of Covariance. Journal of Abnormal Psychology, 110(1), 40‐48.

Study Design—External Validity Issues

External Validity: Are the results generalizable outside the context of the study?

Question to Keep in Mind: Would effects observed for a CME activity generalize to any physician who might participate in a similar future activity? What limitations should be considered?


Factors that may contribute to lack of generalizability:

• Age

• Gender

• Education

• Occupational goals and interests; motivation

• Volunteer status


Scenario

Physicians attend a CME activity and are asked to complete an outcomes assessment at post‐activity and again 3 months later, but fewer complete the follow‐up assessment

Potential Bias

Those subjects who made fewer practice changes chose not to complete the assessment



Minimize Bias

Include a follow‐up request and offer incentives to increase response rate


Strategies Shown to Improve Response Rate:

Incentives

• Monetary incentive vs. no incentive

• Incentive with questionnaire vs. incentive on return

Length

• Shorter vs. longer questionnaire

Edwards, P., et al. “Increasing response rates to postal questionnaires: systematic review,” BMJ, 324 (May 2002).



Appearance

• Colored ink vs. standard [slight effect]

• More personalized vs. less personalized [slight effect]

Delivery

• Recorded delivery vs. standard

• Stamped return envelope vs. business reply [slight effect]

• First class outward mailing vs. other class

Contact

• Pre‐contact vs. no pre‐contact

• Follow‐up vs. no follow‐up

Content

• More interesting vs. less interesting


To Establish Trust

• Provide token of appreciation in advance

• Sponsorship by legitimate authority

• Make the task appear important

• Invoke other exchange relationships

To Increase Rewards

• Show positive regard

• Say thank you

• Ask for advice

• Support group values

• Give tangible rewards

• Make the questionnaire interesting

• Give social validation

• Communicate scarcity of response opportunities

To Reduce Social Costs

• Avoid subordinating language

• Avoid embarrassment

• Avoid inconvenience

• Make questionnaire short and easy

• Minimize requests to obtain personal information

• Emphasize similarity to other requests

Dillman, D.A. (2000). Mail and Internet Surveys: The Tailored Design Method, Second Edition. New York: John Wiley.


Minimize Bias (cont.)

Randomly select a sample of dropouts and diligently attempt to collect data from them

• If results for dropouts and respondents are similar, non‐response bias is less likely


Minimize Bias

Compare the dropouts and respondents on other available measures such as demographics or (better yet) proxy measures related to the outcomes of interest

However, be sure that the variable in question is related to the outcome variable!


Minimize Bias

EXAMPLE:

• Paper published in a 2007 CME journal compared demographic data for 3 groups

• A demographic difference was reported as a limitation of the study, but not examined in relation to the outcomes of interest

• May not have been a “limitation” at all

• Perhaps the groups also differed on eye color!



Minimize Bias

If data are collected anonymously, use a linking code to match respondents and dropouts, and compare their responses to the initial assessment

Example:

a. 4-digit month/day of birth (e.g., Jan. 15 = 01/15): /

b. 2-digit year of graduation from medical school (e.g., 1973 = 73):

c. First 3 letters of city in which you attended medical school (e.g., El Paso = ELP):
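The three answers above can be combined into a single linking code and used to separate follow‐up respondents from dropouts. The sketch below is an assumption about how that might be done: the code format (`MMDDYYCCC`) and both function names are hypothetical, and two people could in principle produce the same code.

```python
def linking_code(birth_month, birth_day, grad_year, school_city):
    """Build an anonymous code such as '011573ELP' from the three form answers."""
    letters = "".join(ch for ch in school_city.upper() if ch.isalpha())[:3]
    return f"{birth_month:02d}{birth_day:02d}{grad_year % 100:02d}{letters}"

def split_by_follow_up(initial, follow_up):
    """Partition initial responses into those matched at follow-up and the dropouts."""
    returned = set(follow_up)
    matched = {code: resp for code, resp in initial.items() if code in returned}
    dropouts = {code: resp for code, resp in initial.items() if code not in returned}
    return matched, dropouts

# Example from the form: born Jan. 15, graduated 1973, medical school in El Paso
code = linking_code(1, 15, 1973, "El Paso")
print(code)

# Invented initial-assessment scores keyed by linking code
initial = {code: 7, "030480HOU": 5, "122591DAL": 6}
follow_up = {code: 8}
matched, dropouts = split_by_follow_up(initial, follow_up)
print(matched, dropouts)
```

The initial responses of the dropouts can then be compared with those of the matched respondents without ever recording a name.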


Instrument Design – Construct Validity Issues

Construct Validity: To what extent do the items measure the presumed theoretical construct(s)?

• Could be attitudes, satisfaction, knowledge, behaviors, skills, etc.


Construct validity can be viewed as encompassing:

Face Validity

• The extent to which an item/instrument appears to actually measure what it is supposed to measure

• Weakest of the four types because assessment is not supported by any empirical evidence

• Not an assessment of validity in the technical sense




Content Validity

• The extent to which the items cover a representative sample of the content domain of interest

• Content experts are often consulted in item development to ensure content validity (eg, using Bloom’s taxonomy)



Criterion Validity

• The extent to which items correlate with items on other instruments that measure the same or different construct(s)

• Criterion validity exists if high/low correlations emerge as expected
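Checking whether correlations "emerge as expected" comes down to computing a correlation coefficient between scores on the two instruments. A minimal Pearson correlation sketch follows; the scores are invented, and which instruments should correlate highly or weakly remains a judgment made in advance by the evaluator.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented scores: a new knowledge test vs. an established test of the same construct
new_test = [12, 15, 9, 18, 14, 11]
established = [30, 36, 25, 41, 33, 29]
print(round(pearson_r(new_test, established), 2))
```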




1. EXTREMITY BIAS: Avoiding extreme responses

Minimize Bias

• Use more “moderate” answer options

Frequently used scale:

Strongly Disagree    Disagree    Neither Agree Nor Disagree    Agree    Strongly Agree



2. FLOOR/CEILING EFFECTS: Compression of scores at the top or bottom of the scale

Minimize Bias

• Spread out the score distribution

Original Scale:

Poor    Fair    Good    Excellent

[Figure: bar chart of response frequencies for the original scale; x-axis: Poor, Fair, Good, Excellent (coded 1 to 4); y-axis: Frequency (f), 0 to 120]





Revision #1:

Poor    Fair    Good    Very Good    Excellent





Revision #2: (after listing our expectations)





Example #2:

7‐point knowledge rating changed to a 10‐point rating (assessed pre, post, 3‐month followup)

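[Figure: Knowledge--Before. Histogram of ratings on the 10‐point scale; x-axis: rating, 1 to 10; y-axis: frequency, 0 to 50]

[Figure: Knowledge--After. Histogram of ratings on the 10‐point scale; x-axis: rating, 1 to 10; y-axis: frequency, 0 to 50]

[Figure: Knowledge--Follow-Up. Histogram of ratings on the 10‐point scale; x-axis: rating, 1 to 10; y-axis: frequency, 0 to 30]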
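A quick numeric check for floor or ceiling effects is the share of responses sitting at the lowest and highest scale points. The sketch below uses invented ratings on a 4‐point scale; the function name is hypothetical and no particular threshold is implied.

```python
from collections import Counter

def floor_ceiling_share(ratings, scale_min, scale_max):
    """Proportion of ratings at the bottom and at the top of the scale."""
    counts = Counter(ratings)
    n = len(ratings)
    return {"floor": counts[scale_min] / n, "ceiling": counts[scale_max] / n}

# Invented ratings on a 1-4 scale (1 = Poor ... 4 = Excellent)
ratings = [4, 4, 4, 3, 4, 2, 4, 4, 3, 4]
print(floor_ceiling_share(ratings, scale_min=1, scale_max=4))
```

When most responses pile up in the top category, as in this invented data, adding scale points gives respondents room to spread out.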



3. ACQUIESCENCE AND SOCIAL DESIRABILITY: Agreeing with all questions or desiring to create a favorable impression

Minimize Bias

• Reverse code items

• Anonymity, confidentiality

The likelihood of obtaining biased data is high for non‐anonymous surveys!
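Reverse coding flips the score of a reverse‐worded item so that a high number means the same thing on every item before scores are summed. A minimal sketch, assuming a 1 to 5 agreement scale by default (the function name and defaults are illustrative):

```python
def reverse_code(score, scale_min=1, scale_max=5):
    """Flip a rating so the lowest value becomes the highest and vice versa."""
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - score

# On a 1-5 scale: 1 <-> 5, 2 <-> 4, and the midpoint 3 stays 3
print([reverse_code(s) for s in [1, 2, 3, 4, 5]])
```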




4. RESPONDENT FATIGUE: Tiring when answering questions

Minimize Bias

• Use fewer items, with more important items first

RATING SCALE: E = Excellent, G = Good, F = Fair, P = Poor

Presenter    Topic    Content    Delivery    Course Materials    Audio-Visuals    Overall

                      E G F P    E G F P     E G F P             E G F P          E G F P


Data Analysis – Statistical Conclusion Validity Issues

Statistical Conclusion Validity: To what degree does the statistical analysis allow one to draw the correct conclusions?


Potential Biases and Threats to Validity

1. Lack of “Power”

• Statistical power is the ability to detect relationships between variables that truly exist

• Need more power to find a needle in a haystack than to find a MACK truck in a haystack

• Important consideration if you wish to draw a sample (eg, potential CME participants)



1. Lack of “Power”

• Power is reduced through:

a. Small sample size

b. Unreliable measures

c. Violating the assumptions of the statistical test
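The link between sample size and power can be made concrete with a rough planning calculation. The sketch below uses the normal approximation for comparing two group means, n per group ≈ 2((z for alpha + z for power) / d)², where d is the standardized effect size. The defaults (two‐sided alpha of .05, power of .80) are common conventions assumed here, not values from this presentation, and an exact t‐based calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group mean comparison."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Smaller effects (the "needle") need far more participants than large ones (the "truck")
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```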



2. Apply Inappropriate Statistical Tests

• Important issue is how the items are scaled (eg, dichotomy, Likert scale, unordered categories)

• Parametric tests are preferred, when applicable

Conclusions

• Bias can affect the validity of a study at any point: Study Design, Item Development, Data Analysis

• Taking proactive steps to reduce bias at each step will result in more valid assessments of CME effects

• Should also assess Reliability (eg, Cronbach’s alpha, test/retest)

• END RESULT: Improved CME activities
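For the reliability check mentioned above, Cronbach's alpha can be computed directly from item‐level data as (k / (k − 1)) × (1 − sum of item variances / variance of total scores), where k is the number of items. The sketch below is a minimal illustration with invented responses; it assumes complete data (no missing items) and items already coded in the same direction.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of scores per item
    sum_item_vars = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Invented responses: 4 respondents answering 2 items
data = [[2, 3], [4, 5], [3, 3], [5, 4]]
print(round(cronbach_alpha(data), 2))
```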


Additional Slides


Questionnaire Item Writing Tips

• Use only one question per issue (avoid double‐barreled questions)

• Use mutually exclusive response categories

• Questions with more than one embedded concept are difficult to answer, and thus impossible to interpret

• The use of “and” or “or” often indicates a double‐barreled item

• Use simple language

Adapted from Wildes, Kimberly R. MEI’s Hints for Writing Effective Survey Items, Measurement Excellence Initiative Site, available http://www.measurementexperts.org/hints.htm; Internet; Accessed Jan. 7, 2004.; and other sources.

Questionnaire Item Writing Tips

• Use the fewest words and simplest grammatical structure possible; avoid compound sentences

• Minimize respondent reading time in phrasing each item

• Keep vocabulary consistent with the respondent’s level of understanding

• Avoid, or use sparingly, the phrase “all of the above”

• Avoid, or use sparingly, the phrase “none of the above”

• Avoid the use of the phrase “I don’t know”


Questionnaire Item Writing Tips

• Avoid negatively worded items

• Some contend that alternating between negatively and positively worded items helps to reduce response bias (responding similarly to every item). However, negatively worded items have the propensity to load on a single, separate factor. Further, they are easily misunderstood.

• In general, negative items should be avoided, especially when “strongly disagree to strongly agree” response categories are used

Questionnaire Item Writing Tips

• Do not use leading questions

• Leading questions result in bias, even if unintended

• Questions should be fair to the respondent and not one‐sided

• Avoid ambiguous words (e.g., occasionally, regularly)

• Avoid extreme words (e.g., always, all, never, ever)


Pretest the Items

• Eliminates complex or technical questions

• Ensures face validity

• Ensures that item, length, and placement are appropriate

• Effective at revealing items that may have double meanings or other problems

• Often facilitates changing open to closed‐ended questions

Approaches to Pretesting

• Think Aloud – Have respondents report aloud what they are thinking as they work through the test items

• Immediate Recall – Immediately after choosing a response, have respondents describe why they chose that response

• Criteria Probe – After respondents have marked an answer, ask if various pieces of information in the item affected their response


Phases of Pretesting

Phase 1: Review by knowledgeable colleagues and analysts

• Have I included all of the necessary questions?

• Can I eliminate some of the questions?

• Did I use categories that will allow me to compare responses to census data or results of other surveys?

• What are the merits of modernizing categories versus keeping categories as they have been used for past studies?

Dillman, D. A. (2000). Mail and Internet surveys: The tailored design method, 2nd Ed. New York: John Wiley & Sons.

Phase 2: Review by potential respondents to evaluate cognitive and motivational components

• Are all of the words understood?

• Do respondents have the information to answer the question?

• Does the respondent really know?

• Can the respondent remember?

• Will the questions be interpreted similarly by all respondents?



Phase 2: Review by potential respondents (cont.)

• Will the respondents be willing to answer?

• Will respondents truthfully provide correct information?

• Place sensitive questions at the end of the survey

• Place the respondent at ease by suggesting that the behavior is common

• Is the question necessary?

• Does each question have an answer that can be marked by every respondent?

Phase 2: Review by potential respondents (cont.)

• Does the question lead the respondent to answer in a certain way?

• Is each respondent likely to read and answer each question?

• Does the mailing package (envelope, cover letter, and questionnaire) create a positive impression?



Phase 3: Conduct a small pilot study

• Have I constructed the response categories for scalar questions so people distribute themselves across categories rather than being concentrated in only one or two of them?

• Do items from which I hope to build a scale correlate in a way that will allow me to build the scale?

• What kind of response rate is the survey likely to obtain?

• Are some questions generating a high nonresponse rate?

Phase 3: Conduct a small pilot study (cont.)

• Do some variables correlate so highly that for all practical purposes I can eliminate one or more of them?

• Is useful information being obtained from open‐ended questions?

• Are entire pages or sections of the questionnaire being skipped?
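Several of the pilot‐study questions above can be answered with simple tabulations. As one example, item nonresponse rates might be computed as sketched below; the data layout (one dict per returned questionnaire, with `None` or a missing key for a skipped item) and the function name are assumptions, and the pilot data are invented.

```python
def item_nonresponse_rates(responses, items):
    """Share of returned questionnaires in which each item was left blank."""
    n = len(responses)
    return {
        item: sum(1 for r in responses if r.get(item) is None) / n
        for item in items
    }

# Invented pilot data: four returned questionnaires, two items
pilot = [
    {"q1": 4, "q2": None},
    {"q1": 5, "q2": 3},
    {"q1": None, "q2": None},
    {"q1": 2},  # q2 missing entirely
]
print(item_nonresponse_rates(pilot, ["q1", "q2"]))
```

Items with unusually high rates, or runs of skipped items on the same page, are candidates for rewording or repositioning before the survey goes to the field.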



Phase 4: A final check

• Present to a few people unfamiliar with the questionnaire for a final look‐over

Principles to Follow in Layout Design

• Place general questions before specific questions

• Some researchers suggest that the answer to a general question may be influenced by previous specific questions

• Others suggest random ordering of items to evaluate the existence of order effects; the placement of items would then be based on whether or not order effects are found

• Place sensitive items near the end of survey

• Sensitive questions may provoke embarrassment or resentment, resulting in non‐response

• Begin with emotionally neutral questions to “warm‐up” the respondents

Dillman, D. A. (2000). Mail and Internet surveys: The tailored design method, 2nd Ed. New York: John Wiley & Sons.


Principles to Follow in Layout Design

• Questions and corresponding response categories should never be broken up between pages

• Ensure enough space between items for response (Use white space as divider between items, rather than dividing lines)

• Provide square boxes for marking answers

• Square boxes should be no smaller than 1/8” x 1/8”

• Leave as much space between boxes as within them

• Do not print items on the front or back of cover pages

Principles to Follow in Layout Design

• Format items vertically, rather than horizontally

• Keep the length of the options fairly consistent

• If using skip patterns, provide clear instructions

• Respondent should be clear on when and how to utilize skips

• Instructions may be provided in parentheses next to the relevant response choice, or arrows may be used in self‐administered surveys to indicate response flow

• Patterns should be checked & tested several times to ensure items perform correctly, before using the survey in the field


Principles to Follow in Layout Design

• Surround answer boxes by a black line & print against a colored background field (Encourages making marks within boxes):

• Background color should provide clear contrast with white boxes, but not so intense that it results in poor contrast with the words in black print

• Use 20% tints of certain blues or greens

• Use 80% or even 100% of the full tint of certain yellows

• Provide plenty of space and no segmentation marks when seeking open‐ended answers
