
CLEAR 2008 Annual Conference

Anchorage, Alaska

Fundamental Testing Assumptions Revisited: Examination Length and Number of Options

Karine Georges & Kelly Piasentin

Assessment Strategies Inc.

2

Overview

Credentialing organizations seek to balance many factors such as program validity and credibility with more tangible aspects such as costs and ease of development. Two such aspects are investigated:

• Method to reduce the total number of test questions while retaining validity and reliability.

• The effects of reducing the typical number of options from four (4) to three (3).

3

Part I

Examination Length: A Case Study

Karine Georges, MSc.

4

Case Study: Certification Program

• Tasked in 2007 with determining whether 180-item, 4-hour examinations could be shortened in light of a potential move to computer-based testing (CBT).

5

Validity and Examination Length

• Content Validity: The number of items on an examination must be sufficient to ensure adequate representative coverage.

• Face Validity: If shortened, perceptions of stakeholders need to be considered vis-a-vis comparable professions.

6

Examination Length and Reliability

• What is an acceptable reliability index for credentialing?

• “A reliability correlation coefficient should fall in the high .80s or above for longer examinations (e.g., 150 or more items)” (NOCA, 2004).

• What is the range of reliability indices for the current 180-item certification examinations?

– Average: .84
– Min: .78
– Max: .92

7

Examination Length and Practical Considerations

If reliability is related to the number of items, why shorten the examination?

Costs and efficiency

• Each item costs between $300 and $1,000 to develop (Vale, 2006).

• Additional items are needed for safeguard purposes, or for ancillary materials such as prep guides or readiness tests.

• Client’s intention to go to CBT makes it an advantage to have shorter examinations so seat time can be reduced and more candidates accommodated within the testing period.

8

Research Approaches

• Two approaches:

• Classical Test Theory (CTT) approach: examining the reliability coefficient using the Spearman-Brown formula.

• Item Response Theory (IRT) approach: examining the item information function using empirical data.

9

CTT Results for the Two Certification Programs

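Spearman-Brown formula:

$$P_{xx} = \frac{N \, p_{xx}}{1 + (N - 1) \, p_{xx}}$$

where p_xx is the reliability of the current examination, N is the ratio of the new length to the current length, and P_xx is the predicted reliability at the new length.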

• Results show that the examinations can be shortened by 20-30 questions (or about 10%) while reliability remains above .80.

Predicted reliability by number of items (as a percentage of the current length):

Program   100%   90%    75%    50%
A         .91    .90    .88    .84
B         .89    .88    .86    .80
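The Spearman-Brown projection behind the table can be sketched in a few lines of Python. This is an illustrative sketch, not the presenters' own analysis: the function name `spearman_brown` is made up for the example, and the inputs are the rounded full-length reliabilities shown on the slide, so a computed entry may differ from the slide in the second decimal (for example, Program A at 50%).

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predicted reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)


# Full-length reliabilities as reported (rounded) for Programs A and B.
full_length = {"A": 0.91, "B": 0.89}

# New length as a proportion of the current length.
ratios = [1.00, 0.90, 0.75, 0.50]

table = {
    program: [round(spearman_brown(r, n), 2) for n in ratios]
    for program, r in full_length.items()
}

for program, row in table.items():
    print(program, row)
```

Because the formula assumes the removed items are parallel to the retained ones, the projection says nothing about which items to drop.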

10

Limitations of CTT Results

• General limitations of Spearman-Brown:
– Assumption that examinations are exactly parallel
– Only one value for a range of abilities
– Largely impacted by cohort

11

IRT Approach: Item Information Curve

• Research has shown that in higher stakes examinations with Pass/Fail decisions such as certification examinations, examinations can be shortened without impacting classification abilities (Schulz & Wang, 2001)

• What would be the impact if the certification examinations had 10% fewer items?
– How about 25% or 50%?

12

IRT - Item Information Curve

• IRT models specify the probability of a discrete outcome such as a correct response to an item, in terms of person and item parameters.

• Person parameter: ability of a candidate (theta)

• Item parameters:
– a: Discrimination (slope)
– b: Difficulty (location)
– c: Guessing
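To make the roles of theta, a, b and c concrete, here is a minimal sketch that assumes the three-parameter logistic (3PL) form without a scaling constant. The slides name the parameters but do not state the model equation, so the formula choice and the item values below are assumptions for illustration only.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response for a candidate of ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical 4-option item: moderate discrimination, average difficulty,
# and a guessing parameter set to 1/4.
a, b, c = 1.2, 0.0, 0.25

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta={theta:+.1f}  P(correct)={p_correct(theta, a, b, c):.3f}")
```

At theta equal to b the probability sits halfway between c and 1, and it approaches c for very low abilities, which is why c is read as the guessing parameter.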

13

IRT - Test Information Curve

• All Item Information Curves add to a Test Information Curve

• Amount of information scale differs based on length of examination and quality of the items

• Pass/Fail decision must be made where error is minimal (ideally where the passmark is located) and where level of ability can be clearly differentiated

14

IRT Results for Program A

[Figure: Test Information Functions for Program A, 2006-2007. Horizontal axis: Theta (Ability) Scale, from -3.0 to 3.0. Vertical axis: Amount of Information, from 0 to 35. Curves: 168 items (100%), 150 items (~90%), 125 items (~75%), 85 items (~50%).]

15

IRT Results for Program B

[Figure: Test Information Functions for Program B, 2004-2006. Horizontal axis: Theta (Ability) Scale, from -3.0 to 3.0. Vertical axis: Amount of Information, from 0 to 35. Curves: 162 items (100%), 146 items (~90%), 120 items (~75%), 80 items (~50%).]

16

IRT - Results and Implications

• The examinations can be reduced by at least 10% without significantly impacting the pass/fail decision.

• Other factors to take into consideration:
– Number of candidates
– Robustness of item bank

17

Other Considerations

• What about face validity?

• How would an examination with 90 items be viewed by other professionals compared to a comparable examination of 180 items?

18

Other Certification Programs

• Review of over 75 certification programs within the same profession.

• The average number of items: 164, or between 150 and 175 items (including experimental items)
– Minimum: 100
– Maximum: 250

19

Summary

• Data suggest that the number of items can be reduced by 10% with minimal impact on the validity and reliability.

20

Part II

How Many Options Are Optimal in Multiple Choice Testing?

Kelly Piasentin, PhD

21

Multiple Choice Testing

• Most common format used in Licensure and Certification examinations

• Consists of a stem (i.e., the question being asked) and a series of options to choose from (usually 4)

Example:

Stem: In which state is the 2008 CLEAR conference being held?

Options:
1. Arkansas
2. Alaska
3. Arizona
4. Alabama

22

Advantages of Multiple Choice

• Versatility
• Efficiency
• Scoring accuracy and economy
• Reliability
• Diagnosis
• Control of difficulty
• Amenable to item analysis

23

Disadvantages of Multiple Choice

• Time consuming to write

• Difficult to create effective distracters (i.e., options that are plausible, but incorrect)

24

Time Spent Writing MCQs

• Sample of 75 Item Writers for 3 different licensing/certification examinations

• Average time spent writing an MCQ: 52 minutes

• Percentage of time spent writing:

Stem 26%

Correct Response 12%

1st Distracter 11%

2nd Distracter 13%

3rd Distracter 17%

Rationales/References 21%
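As a rough worked example, the percentages above can be converted into minutes per item. The only inputs are the 52-minute average and the reported shares; the variable names are illustrative.

```python
AVERAGE_MINUTES = 52  # average time to write one MCQ, as reported above

# Share of writing time by component, as reported above.
share = {
    "Stem": 0.26,
    "Correct Response": 0.12,
    "1st Distracter": 0.11,
    "2nd Distracter": 0.13,
    "3rd Distracter": 0.17,
    "Rationales/References": 0.21,
}

minutes = {part: AVERAGE_MINUTES * s for part, s in share.items()}
distracter_minutes = sum(m for part, m in minutes.items() if "Distracter" in part)

for part, m in minutes.items():
    print(f"{part:<22} {m:5.2f} min")
print(f"All three distracters  {distracter_minutes:5.2f} min")
```

On these figures the third distracter alone accounts for close to 9 minutes of the 52, more than either of the first two.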

25

Effort Spent Writing Distracters

Of the 75 Item Writers…

• 25% reported that it was difficult to write the 1st distracter

• 40% reported that it was difficult to write the 2nd distracter

• 75% reported that it was difficult to write the 3rd distracter

26

How many options should an MCQ have?

• 4-option MCQs are widely used in standardized testing everywhere

• But, are 4 options ideal?

• Some item-writing (IW) guidelines say, “develop as many options as feasible” (Haladyna & Downing, 1989)

• More recently, “develop as many functional distractors as are feasible” (Haladyna, Downing, & Rodriguez, 2002)

• Increasing emphasis on the quality of distractors as opposed to the quantity

27

Definition of a Functional Distracter

“A functional distracter is one that has (a) a significant negative point-biserial correlation with the total test score, (b) a negative sloping item characteristic curve, and (c) a frequency of response greater than 5% for the total group.”

Haladyna & Downing (1988)
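A partial check of this definition can be sketched in Python. The sketch covers only the response-frequency criterion and the sign of the point-biserial correlation; it does not test statistical significance or the slope of the item characteristic curve. The function names and the 20-candidate data set are hypothetical.

```python
from math import sqrt


def point_biserial(chose: list[int], totals: list[float]) -> float:
    """Pearson correlation between a 0/1 choice indicator and total score."""
    n = len(chose)
    mean_c = sum(chose) / n
    mean_t = sum(totals) / n
    cov = sum((c - mean_c) * (t - mean_t) for c, t in zip(chose, totals))
    var_c = sum((c - mean_c) ** 2 for c in chose)
    var_t = sum((t - mean_t) ** 2 for t in totals)
    if var_c == 0 or var_t == 0:
        return 0.0  # correlation undefined; treat as no relationship
    return cov / sqrt(var_c * var_t)


def looks_functional(responses, totals, option, min_freq=0.05):
    """True if the option is chosen by more than min_freq of candidates
    and choosing it is negatively related to total score."""
    chose = [1 if r == option else 0 for r in responses]
    freq = sum(chose) / len(chose)
    if freq <= min_freq:
        return False
    return point_biserial(chose, totals) < 0


# Hypothetical data for one item whose correct answer is "B".
totals = [48, 47, 45, 44, 42, 41, 40, 38, 37, 35,
          33, 32, 30, 28, 27, 25, 24, 22, 20, 18]
responses = ["B", "B", "B", "B", "B", "B", "B", "A", "B", "B",
             "C", "B", "A", "B", "C", "A", "C", "A", "D", "A"]

for option in ("A", "C", "D"):
    print(option, looks_functional(responses, totals, option))
```

In this made-up item, option D is chosen by exactly 5% of candidates and therefore fails the "greater than 5%" criterion, while A and C pass both checks.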

28

How does # options impact guessing?

• With 4 options, candidates have a 25% chance of getting any one question correct by simply guessing
– Probability is reduced to 20% if there are 5 options
– Probability is increased to 33% if there are 3 options

• BUT… if a typical examination has 25 items, each with 3 options, the chance of scoring at least 70% on the examination by pure blind guessing is 1 in 25,000

• So, do you get more bang for your buck by having more options?
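The guessing argument can be explored with a binomial model, assuming every item is answered independently by blind guessing. The sketch below is an illustration, not the source of the figure quoted above; in particular, it treats "at least 70%" of 25 items as at least 18 correct, which is an assumption.

```python
from math import ceil, comb


def p_at_least(n_items: int, n_options: int, min_correct: int) -> float:
    """Probability of min_correct or more correct answers by blind guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p ** k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


n_items = 25
min_correct = ceil(n_items * 70 / 100)  # 17.5 rounds up to 18 items

p3 = p_at_least(n_items, 3, min_correct)
p4 = p_at_least(n_items, 4, min_correct)
p5 = p_at_least(n_items, 5, min_correct)

for n_options, prob in ((3, p3), (4, p4), (5, p5)):
    print(f"{n_options} options: about 1 in {1 / prob:,.0f}")
```

With a cut-off of 18 of 25 this sketch gives roughly 1 in 11,000 for 3-option items, and a cut-off of 19 gives roughly 1 in 64,000, so the exact odds are sensitive to how 70% is rounded to a whole number of items. Either way, passing by blind guessing is very unlikely even with 3 options.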

29

Are 4-option MCQs optimal?

Factors to consider:

• Time and cost it takes to develop distracters

• Time it takes for candidates to complete the examination

• Psychometric properties of the examination
– Item difficulty
– Item discrimination
– Test reliability (coefficient alpha)
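The three psychometric properties listed above can be computed from a matrix of scored responses. The sketch below assumes 0/1 item scores, uses the proportion correct as item difficulty, and uses the item-total point-biserial correlation as the discrimination index; the slides do not say which discrimination index was used, and the small data set is hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance

# Hypothetical scored responses: one row per candidate, one column per item.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

n_items = len(scores[0])
totals = [sum(row) for row in scores]
columns = [[row[j] for row in scores] for j in range(n_items)]


def pearson(x, y):
    """Pearson correlation; equals the point-biserial when x is 0/1."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Item difficulty: proportion of candidates answering correctly.
difficulty = [mean(col) for col in columns]

# Item discrimination: correlation of item score with total score.
discrimination = [pearson(col, totals) for col in columns]

# Coefficient alpha from item variances and total-score variance.
alpha = (n_items / (n_items - 1)) * (
    1 - sum(pvariance(col) for col in columns) / pvariance(totals)
)

print("difficulty:    ", [round(d, 3) for d in difficulty])
print("discrimination:", [round(d, 3) for d in discrimination])
print("alpha:         ", round(alpha, 3))
```

With real data the total score would often exclude the item being correlated, which lowers the discrimination values slightly; the uncorrected version is kept here for brevity.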

30

Arguments in favour of 3-options:

• Less time is needed to develop two plausible distracters

• More 3-option items can be administered without increasing testing time

– Inclusion of additional high quality items per unit of time should improve test score reliability

• Having fewer options decreases the likelihood of exposing additional aspects of the domain to candidates (e.g., context clues to other questions)

31

Data from a Licensing/Certification Examination

• Number of MCQs: 235

• Number of candidates: 5,393

• Mean item difficulty: .721

• Mean discrimination index: .166

• Test reliability: .88

• Most chosen distracter: .167

• 2nd most chosen distracter: .077

• Least chosen distracter: .035

32

Reducing Examination Items to 3 Options

What would be the effect on item difficulty, discrimination and reliability of reducing the items on the examination to 3 options if the least chosen distracter was:

• Attributed to correct answer?

• Attributed to 2nd least chosen distracter?

• Randomly distributed to each of the other 3 choices?

33


If the least chosen distracter is attributed to the correct answer:

• Item difficulty: .752

• Item discrimination: .136

• Reliability (coefficient alpha): .834

34


If the least chosen distracter is attributed to the 2nd least chosen distracter:

• Item difficulty: .720

• Item discrimination: .168

• Reliability (coefficient alpha): .881

35


If the least chosen distracter is distributed randomly to each of the other 3 choices:

• Item difficulty: .731

• Item discrimination: .158

• Reliability (coefficient alpha): .868

36

Summary

Condition        Difficulty   Discrimination   Reliability
4 options        .721         .166             .880
LCD → Correct    .752         .136             .834
LCD → 2nd LCD    .720         .168             .881
LCD → Random     .731         .158             .868

LCD = least chosen distracter.

37

4 Options vs. 3 Options

• Moving from 4 options to 3 options did not have a significant impact on average item difficulty, discrimination or test reliability.

38

Summary

• Two primary benefits of using 3 options (as opposed to 4 options):
– Faster item writing
– Better testing

• Better quality items

• Cost savings

• Shorter test time

• More questions in the same amount of time (potential for increased reliability)

39

Conclusion

• These two presentations demonstrate that you can accrue some efficiencies from reducing test length and number of response options without compromising test validity.

• Further research needed to confirm findings.

40

Contact Information

Assessment Strategies Inc.
1400 Blair Place, Suite 210
Ottawa, ON K1J 9B8
Canada

Telephone: 613-237-0241
Website: www.asinc.ca

• Karine Georges, MSc [email protected]
• Kelly Piasentin, PhD [email protected]
