TRANSCRIPT
CLEAR 2008 Annual Conference
Anchorage, Alaska
Fundamental Testing Assumptions Revisited: Examination Length and
Number of Options
Karine Georges & Kelly Piasentin
Assessment Strategies Inc.
2
Overview
Credentialing organizations seek to balance many factors such as program validity and credibility with more tangible aspects such as costs and ease of development. Two such aspects are investigated:
• A method to reduce the total number of test questions while retaining validity and reliability.
• The effects of reducing the typical number of options from four (4) to three (3).
3
Part I
Examination Length: A Case Study
Karine Georges, MSc.
4
Case Study: Certification Program
• Tasked in 2007 to determine whether 180-item, 4-hour examinations could be shortened in light of a potential move to computer-based testing (CBT).
5
Validity and Examination Length
• Content Validity: The number of items on an examination must be sufficient to ensure adequate representative coverage.
• Face Validity: If shortened, perceptions of stakeholders need to be considered vis-a-vis comparable professions.
6
Examination Length and Reliability
• What is an acceptable reliability index for credentialing?
• “A reliability correlation coefficient should fall in the high .80s or above for longer examinations (e.g., 150 or more items)” [NOCA, 2004].
• What is the range of reliability indices for the current 180-item certification examinations?
– Average: .84
– Min: .78
– Max: .92
7
Examination Length and Practical Considerations
If reliability is related to examination length, why shorten the examination?
Costs and efficiency
• Each item costs between $300 and $1,000 to develop (Vale, 2006).
• Additional items are needed for safeguard purposes, or for ancillary materials such as prep guides or readiness tests.
• The client's intention to move to CBT favours shorter examinations: seat time can be reduced and more candidates accommodated within the testing period.
8
Research Approaches
• Two approaches:
– Classical Test Theory (CTT) approach: examining the reliability coefficient using the Spearman-Brown formula.
– Item Response Theory (IRT) approach: examining the item information function using empirical data.
9
CTT Results for the Two Certification Programs
Spearman-Brown Formulation:
$$\rho'_{xx} = \frac{N\,\rho_{xx}}{1 + (N - 1)\,\rho_{xx}}$$
• Results show that the examinations can be shortened by 20-30 questions (or about 10%) and still retain a reliability above .80.
Reliability by Number of Items (percentage of current length)
Program   100%   90%   75%   50%
A         .91    .90   .88   .84
B         .89    .88   .86   .80
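The projections in the table above can be approximated with a short Spearman-Brown calculation. This is a minimal sketch, not the presenters' code: the function name is an assumption, and the inputs are the two-decimal full-length reliabilities from the table, so results agree with the reported values only to within rounding.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Full-length reliabilities as reported in the table (already rounded to two decimals).
full_length = {"A": 0.91, "B": 0.89}

for program, rho in full_length.items():
    # 1.0 = current length, 0.9 = 10% fewer items, and so on.
    predicted = [round(spearman_brown(rho, n), 2) for n in (1.0, 0.9, 0.75, 0.5)]
    print(program, predicted)
```

Because the formula takes only the current reliability and the length ratio, it carries the limitations listed on the next slide: it assumes the removed items are parallel to those retained.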
10
Limitations of CTT Results
• General Limitations of Spearman-Brown:
– Assumption that examinations are exactly parallel
– Only one value for a range of abilities
– Largely impacted by cohort
11
IRT Approach: Item Information Curve
• Research has shown that in higher stakes examinations with Pass/Fail decisions such as certification examinations, examinations can be shortened without impacting classification abilities (Schulz & Wang, 2001)
• What would be the impact if the certification examinations had 10% fewer items?
– How about 25% or 50%?
12
IRT - Item Information Curve
• IRT models specify the probability of a discrete outcome such as a correct response to an item, in terms of person and item parameters.
• Person parameter: ability of a candidate (theta)
• Item parameters:
– a: Discrimination (slope)
– b: Difficulty (location)
– c: Guessing
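For reference, one common model with these three item parameters is the three-parameter logistic (3PL) model. The transcript does not state which model or scaling constant was used, so the form below is an assumption, written without a scaling constant:

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}$$

Under that model, the information contributed by item $i$ at ability $\theta$ is

$$I_i(\theta) = a_i^{2}\,\frac{1 - P_i(\theta)}{P_i(\theta)}\left(\frac{P_i(\theta) - c_i}{1 - c_i}\right)^{2}$$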
13
IRT - Test Information Curve
• All Item Information Curves add to a Test Information Curve
• Amount of information scale differs based on length of examination and quality of the items
• Pass/Fail decision must be made where error is minimal (ideally where the passmark is located) and where level of ability can be clearly differentiated
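The way item information curves add up to a test information curve can be sketched as follows. This is an illustration under stated assumptions: it uses the three-parameter logistic (3PL) model without a scaling constant, and the item parameters and passmark location are hypothetical values, not data from the certification programs.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2


def total_information(theta, items):
    """Test information is the sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


# Hypothetical item bank of (a, b, c) triples, for illustration only.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.25), (0.8, 0.5, 0.2), (1.5, 1.0, 0.25)]

theta_cut = 0.0  # hypothetical passmark location on the theta scale
full = total_information(theta_cut, items)
shortened = total_information(theta_cut, items[:3])  # same test with one item removed
standard_error = 1 / math.sqrt(full)  # measurement error shrinks as information grows
print(round(full, 3), round(shortened, 3), round(standard_error, 3))
```

Evaluating the function at the passmark before and after removing items shows how much precision is lost exactly where the pass/fail decision is made.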
14
IRT Results for Program A
[Figure: Test Information Functions for Program A, 2006-2007. X-axis: Theta (Ability) Scale, -3.0 to 3.0. Y-axis: Amount of Information, 0 to 35. Curves: 168 items (100%), 150 items (~90%), 125 items (~75%), 85 items (~50%).]
15
IRT Results for Program B
[Figure: Test Information Functions for Program B, 2004-2006. X-axis: Theta (Ability) Scale, -3.0 to 3.0. Y-axis: Amount of Information, 0 to 35. Curves: 162 items (100%), 146 items (~90%), 120 items (~75%), 80 items (~50%).]
16
IRT - Results and Implications
• The examinations can be reduced by at least 10% without significantly impacting the pass/fail decision.
• Other factors to take into consideration:
– Number of candidates
– Robustness of item bank
17
Other Considerations
• What about face validity?
• How would an examination with 90 items be viewed by other professionals compared to a comparable examination of 180 items?
18
Other Certification Programs
• Review of over 75 certification programs within the same profession.
• Average number of items: 164, or between 150 and 175 items (including experimental items)
• Minimum: 100
• Maximum: 250
19
Summary
• Data suggest that the number of items can be reduced by 10% with minimal impact on the validity and reliability.
20
Part II
How Many Options Are Optimal in Multiple Choice Testing?
Kelly Piasentin, PhD
21
Multiple Choice Testing
• Most common format used in Licensure and Certification examinations
• Consists of a stem (i.e., the question being asked) and a series of options to choose from (usually 4)
Example:
• Stem: In which state is the 2008 CLEAR conference being held?
• Options:
1. Arkansas
2. Alaska
3. Arizona
4. Alabama
22
Advantages of Multiple Choice
• Versatility
• Efficiency
• Scoring accuracy and economy
• Reliability
• Diagnosis
• Control of difficulty
• Amenable to item analysis
23
Disadvantages of Multiple Choice
• Time consuming to write
• Difficult to create effective distracters (i.e., options that are plausible, but incorrect)
24
Time Spent Writing MCQs
• Sample of 75 Item Writers for 3 different licensing/certification examinations
• Average time spent writing an MCQ: 52 minutes
• Percentage of time spent writing:
Component               Share of time
Stem                    26%
Correct Response        12%
1st Distracter          11%
2nd Distracter          13%
3rd Distracter          17%
Rationales/References   21%
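Converting these percentages into minutes makes the cost of distracters concrete. The sketch below uses only the figures from this slide; the variable names are assumptions. It shows that roughly 21 of the 52 minutes go to writing distracters, about 9 of them to the third distracter alone.

```python
AVERAGE_MINUTES = 52  # average time to write one MCQ, from the slide

# Share of writing time per component, from the slide (sums to 100%).
shares = {
    "Stem": 0.26,
    "Correct response": 0.12,
    "1st distracter": 0.11,
    "2nd distracter": 0.13,
    "3rd distracter": 0.17,
    "Rationales/references": 0.21,
}

minutes = {part: share * AVERAGE_MINUTES for part, share in shares.items()}
distracter_minutes = sum(m for part, m in minutes.items() if "distracter" in part)

for part, m in minutes.items():
    print(f"{part}: {m:.1f} min")
print(f"All three distracters: {distracter_minutes:.1f} min")
```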
25
Effort Spent Writing Distracters
Of the 75 Item Writers…
• 25% reported that it was difficult to write the 1st distracter
• 40% reported that it was difficult to write the 2nd distracter
• 75% reported that it was difficult to write the 3rd distracter
26
How many options should an MCQ have?
• 4-option MCQs are widely used in standardized testing everywhere
• But, are 4 options ideal?
• Some item-writing (IW) guidelines say, “develop as many options as feasible” (Haladyna & Downing, 1989)
• More recently, “develop as many functional distractors as are feasible” (Haladyna, Downing, & Rodriguez, 2002)
• Increasing emphasis on the quality of distractors as opposed to the quantity
27
Definition of a Functional Distracter
“A functional distracter is one that has (a) a significant negative point-biserial correlation with the total test score, (b) a negative sloping item characteristic curve, and (c) a frequency of response greater than 5% for the total group.”
Haladyna & Downing (1988)
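Part of this definition can be screened for automatically. The sketch below is a partial check on hypothetical data: it covers the sign of the point-biserial correlation from criterion (a) and the 5% frequency rule from criterion (c), but it does not test statistical significance or examine the item characteristic curve in criterion (b). Function names and data are assumptions.

```python
import math


def point_biserial(indicator, scores):
    """Pearson correlation between a 0/1 indicator and total test scores."""
    n = len(scores)
    mean_x = sum(indicator) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(indicator, scores))
    var_x = sum((x - mean_x) ** 2 for x in indicator)
    var_y = sum((y - mean_y) ** 2 for y in scores)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when nobody (or everybody) chose the option
    return cov / math.sqrt(var_x * var_y)


def screen_distracter(choices, scores, option, min_frequency=0.05):
    """Return (frequency, point-biserial, passes_screen) for one option."""
    indicator = [1 if choice == option else 0 for choice in choices]
    frequency = sum(indicator) / len(indicator)
    r = point_biserial(indicator, scores)
    return frequency, r, (r < 0 and frequency > min_frequency)


# Hypothetical data: option chosen on one item and total score for ten candidates.
choices = ["A", "B", "A", "C", "A", "B", "A", "A", "C", "A"]
scores = [88, 61, 92, 55, 79, 66, 85, 90, 58, 81]
print(screen_distracter(choices, scores, "B"))
```

In this toy data, option B is chosen by lower-scoring candidates often enough to pass the screen, whereas option A, chosen mainly by high scorers, behaves like a keyed answer rather than a distracter.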
28
How does # options impact guessing?
• With 4 options, candidates have a 25% chance of getting any one question correct by simply guessing
– Probability is reduced to 20% if there are 5 options
– Probability is increased to 33% if there are 3 options
• BUT… if a typical examination has 25 items, each with 3 options, the chance of scoring at least 70% on the examination by pure blind guessing is 1 in 25,000
• So, do you get more bang for your buck by having more options?
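The guessing argument is a binomial tail probability and can be checked directly. This is a minimal sketch assuming independent blind guesses on every item and a cut score taken as the smallest whole number of items reaching the percentage (18 of 25 for 70%). Under those assumptions the result for 3-option items is roughly 1 in 11,000, the same order of magnitude as the slide's figure, which presumably rests on slightly different assumptions that the transcript does not state.

```python
from math import comb


def prob_at_least(n_items: int, n_options: int, pass_percent: int) -> float:
    """P(score >= pass_percent) when every answer is an independent blind guess."""
    p = 1 / n_options
    # Smallest number of correct answers that reaches the cut (integer ceiling).
    needed = (pass_percent * n_items + 99) // 100
    return sum(
        comb(n_items, k) * p ** k * (1 - p) ** (n_items - k)
        for k in range(needed, n_items + 1)
    )


for options in (3, 4, 5):
    prob = prob_at_least(25, options, 70)
    print(f"{options} options: about 1 in {round(1 / prob):,}")
```

The per-item guessing advantage of 3-option items is real, but across a whole examination the chance of passing by guessing alone stays very small for any of the three formats.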
29
Are 4-option MCQs optimal?
Factors to consider:
• Time and cost it takes to develop distracters
• Time it takes for candidates to complete the examination
• Psychometric properties of examination
– Item difficulty
– Item discrimination
– Test reliability (Coefficient alpha)
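Two of these psychometric properties, item difficulty and coefficient alpha, can be computed from a matrix of scored responses as sketched below. The data are hypothetical and the function names are assumptions. Item discrimination is left out because the transcript does not say which discrimination index was used.

```python
from statistics import pvariance


def item_difficulty(score_matrix, item):
    """Proportion of candidates answering the item correctly (p-value)."""
    return sum(row[item] for row in score_matrix) / len(score_matrix)


def coefficient_alpha(score_matrix):
    """Cronbach's coefficient alpha for a candidates-by-items matrix of item scores."""
    n_items = len(score_matrix[0])
    item_variances = [
        pvariance([row[i] for row in score_matrix]) for i in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in score_matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scored responses: 5 candidates (rows) by 4 items (columns), 1 = correct.
score_matrix = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print([item_difficulty(score_matrix, i) for i in range(4)])
print(round(coefficient_alpha(score_matrix), 3))
```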
30
Arguments in favour of 3-options:
• Less time is needed to develop two plausible distracters
• More 3-option items can be administered without increasing testing time
– Inclusion of additional high quality items per unit of time should improve test score reliability
• Having fewer options decreases the likelihood of exposing additional aspects of the domain to candidates (e.g., context clues to other questions)
31
Data from a Licensing/Certification Examination
• Number of MCQs: 235
• Number of candidates: 5,393
• Mean item difficulty: .721
• Mean discrimination index: .166
• Test reliability: .88
• Most chosen distracter: .167
• 2nd most chosen distracter: .077
• Least chosen distracter: .035
32
Reducing Examination Items to 3 Options
What would be the effect on item difficulty, discrimination and reliability of reducing the items on the examination to 3 options if the least chosen distracter was:
• Attributed to correct answer?
• Attributed to 2nd least chosen distracter?
• Randomly distributed to each of the other 3 choices?
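The three reassignment rules can be sketched as a recoding step on raw responses. This is a minimal illustration on hypothetical data for a single item; the function names, option labels, and counts are assumptions, and the transcript does not describe the presenters' actual procedure.

```python
import random


def recode_item(responses, answer, least_chosen, rule, rng):
    """Collapse a 4-option item to 3 options by reassigning the least chosen distracter.

    rule: "correct" (credit as the keyed answer), "second" (merge into the 2nd
    least chosen distracter), or "random" (spread over the 3 remaining options).
    """
    counts = {opt: responses.count(opt) for opt in set(responses)}
    remaining = sorted(opt for opt in counts if opt != least_chosen)
    distracters = [opt for opt in remaining if opt != answer]
    second_least = min(distracters, key=lambda opt: counts[opt])

    recoded = []
    for choice in responses:
        if choice != least_chosen:
            recoded.append(choice)
        elif rule == "correct":
            recoded.append(answer)
        elif rule == "second":
            recoded.append(second_least)
        else:
            recoded.append(rng.choice(remaining))
    return recoded


def difficulty(responses, answer):
    """Item difficulty: proportion of candidates choosing the keyed answer."""
    return sum(choice == answer for choice in responses) / len(responses)


# Hypothetical item: 100 candidates, keyed answer "A", least chosen distracter "D".
responses = ["A"] * 72 + ["B"] * 17 + ["C"] * 8 + ["D"] * 3
rng = random.Random(0)  # seeded so the random rule is repeatable

for rule in ("correct", "second", "random"):
    recoded = recode_item(responses, "A", "D", rule, rng)
    print(rule, difficulty(recoded, "A"))
```

Applying the same recoding to every item and then recomputing difficulty, discrimination, and reliability would give a comparison of the kind reported on the following slides.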
33
Reducing Examination Items to 3 Options
If least chosen attributed to correct answer:
• Item difficulty: .752
• Mean discrimination index: .136
• Coefficient Alpha: .834
34
Reducing Examination Items to 3 Options
If least chosen attributed to 2nd least chosen distracter:
• Item difficulty: .720
• Mean discrimination index: .168
• Reliability: .881
35
Reducing Examination Items to 3 Options
If least chosen distributed randomly to each of the other 3 choices:
• Item difficulty: .731
• Mean discrimination index: .158
• Reliability: .868
36
Summary
Condition       Difficulty   Discrimination   Reliability
4 options       .721         .166             .880
LCD → Correct   .752         .136             .834
LCD → 2nd LCD   .720         .168             .881
LCD → Random    .731         .158             .868
(LCD = least chosen distracter)
37
4 Options vs. 3 Options
• Moving from 4 options to 3 options did not have a significant impact on average item difficulty, discrimination or test reliability.
38
Summary
• Two primary benefits of using 3 options (as opposed to 4 options):
– Faster item writing
– Better testing
• Better quality items
• Cost savings
• Shorter test time
• More questions in same amount of time (potential for increased reliability)
39
Conclusion
• These two presentations demonstrate that you can accrue some efficiencies from reducing test length and number of response options without compromising test validity.
• Further research is needed to confirm these findings.
40
Contact Information
Assessment Strategies
1400 Blair Place, Suite 210
Ottawa, ON K1J 9B8
Canada
Telephone: 613-237-0241
E-mail: www.asinc.ca
• Karine Georges, MSc [email protected]
• Kelly Piasentin, PhD [email protected]