Caveon Webinar Series: Improving Testing with Key Strength Analysis


Upload: caveon-test-security

Post on 12-Jun-2015


DESCRIPTION

Improving Testing with Key Strength Analysis

Have you ever wondered whether some distractors were just a little too close to being a right answer? Have you wished you had a way to decide whether an item's answer choice did not meet your standard? What about those items which were published with the wrong answer key? If you have ever asked yourself these questions, be sure to watch our webinar, presented as part of the Caveon Webinar Series on September 18, 2013. You will learn a new evaluation method that will help you feel confident about your key strength. The webinar will discuss the underlying concepts, the theory, and applications for the method Caveon has been using since 2011. The method uses classical item statistics, so it can be used for all assessments that can be analyzed using p-values and point-biserial correlations. As such, we believe it to be a valuable enhancement to other commonly used item analyses.

TRANSCRIPT

Page 1: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Upcoming Caveon Events

• Caveon Webinar Series: Next session, October 16
– The Good and Bad of Online Proctoring, Part 2
• EATP – September 25-27 in St. Julian’s, Malta
– Caveon’s John Fremer and Steve Addicott presenting: What are we Accountable For? Security Standards and Resources for High Stakes Testing Programs
– Steve Addicott hosting an ignite session: Leveraging Social Media to Connect with International Test Candidates
• The 2nd Annual Statistical Detection of Potential Test Fraud Conference – October 17-19, 2013, Madison, Wisconsin
– Caveon’s Dennis Maynes and Cindy Butler will be presenting three sessions
• Handbook of Test Security – Now Available. We will share a discount code at the end of this session.

Page 2: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Caveon Online

• Caveon Security Insights Blog
– http://www.caveon.com/blog/
• Twitter
– Follow @Caveon
• LinkedIn
– Caveon Company Page
– “Caveon Test Security” Group
  • Please contribute!
• Facebook
– Will you be our “friend?”
– “Like” us!

www.caveon.com

Page 3: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Caveon Webinar Series: Improving Testing with Key Strength Analysis

September 18, 2013

• Dennis Maynes, Chief Scientist, Caveon Test Security
• Dan Allen, Psychometrician, Western Governors University
• Marcus Scott, Data Forensics Scientist, Caveon Test Security
• Barbara Foster, Psychometrician, American Board of Obstetrics and Gynecology

Page 4: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Agenda for Today

• Review classical item analysis
• Introduce Key Strength Analysis
• Derive Key Strength Analysis
• Observations by Dan Allen and Barbara Foster
• Conclusions and Q&A

Page 5: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Review Classical Item Analysis

• Statistics
– P-value
– Point-biserial correlation
• Typical rules
– Low p-values (hard items)
– High p-values (easy items)
– Low point-biserial correlations (low discrimination)
• Easy to understand and implement
• Good at flagging poor items
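The classical statistics named on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not Caveon's implementation: the function names and the flagging thresholds (`p_low`, `p_high`, `r_low`) are hypothetical values chosen only for the example, since the slides do not state any cut-offs.

```python
import math

def p_value(item_scores):
    """Classical item p-value: proportion of examinees answering correctly (scores are 0/1)."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, test_scores):
    """Pearson correlation between a 0/1 item score and the test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(test_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, test_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in test_scores)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when everyone has the same score
    return cov / math.sqrt(var_x * var_y)

def flag_item(p, r_pb, p_low=0.20, p_high=0.95, r_low=0.10):
    """Apply the 'typical rules'; the thresholds are hypothetical, not from the slides."""
    flags = []
    if p < p_low:
        flags.append("hard")
    if p > p_high:
        flags.append("easy")
    if r_pb < r_low:
        flags.append("low discrimination")
    return flags
```

For example, `flag_item(p_value(scores), point_biserial(scores, totals))` returns an empty list for an item that passes all three rules.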

Page 6: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Introduce Key Strength Analysis

• Why Key Strength Analysis?
– Model uses information from all items
– Answer choices for same item are compared
– Provides possible reasons for poor performance
• High performing test takers (knowledgeable students)
– Typically report problems with the answer key
– Usually choose the correct answer
• Most frequently selected choice
– Is usually correct for easy items
– Is not necessarily correct for hard items

Page 7: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Capabilities of Key Strength Analysis

• Built upon classical item analysis
– Point-biserial correlations discriminate between high and low performers
– P-values detect hard/easy items
• Typical problems with items
– Mis-keyed items
– Weakly keyed items
– Ambiguously keyed items
• Use probabilities to make inferences about item performance

Page 8: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Modify Point-Biserial Correlation

1. Exclude the item score from the test score
• Places all answer choices on “the same playing field”
• Allows correct and incorrect answers to be compared using “what if”
2. Compute point-biserial correlations
• For the correct answer and
• For distractors
3. Scale point-biserial appropriately
• We call this statistic z*
• Use z* to compute the probability of the choice (A, B, etc.) being a key; this is the “key strength”
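The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the presenters' code: the helper names are hypothetical, and z* is computed here as the standardized mean rest score of the examinees who selected a choice, using a finite-population standard error. That reading is inferred from the derivation slides and equals the choice-level point-biserial multiplied by sqrt(N - 1).

```python
import math

def rest_scores(score_matrix, item):
    """Step 1: each examinee's total score with one item's score excluded."""
    return [sum(row) - row[item] for row in score_matrix]

def choice_z_star(responses, rest, choice):
    """Steps 2-3: z* for one answer choice of one item.

    responses: the choice each examinee selected on this item
    rest:      each examinee's test score excluding this item
    """
    N = len(responses)
    chosen = [s for r, s in zip(responses, rest) if r == choice]
    n = len(chosen)
    if n == 0 or n == N:
        return float("nan")  # no comparison group
    mean_all = sum(rest) / N
    sd_all = math.sqrt(sum((s - mean_all) ** 2 for s in rest) / N)
    if sd_all == 0:
        return float("nan")
    mean_chosen = sum(chosen) / n
    # Standard error of a mean of n units sampled from a finite population of N.
    se = sd_all / math.sqrt(n) * math.sqrt((N - n) / (N - 1))
    return (mean_chosen - mean_all) / se
```

Because every choice is evaluated against the same rest score, the keyed answer and the distractors are directly comparable, which is the "same playing field" idea from step 1.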

Page 9: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Derive Key Strength Analysis

• Point-biserial correlation: the Pearson correlation between the item score and the test score
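The formula on this slide did not survive extraction. As a reference point, the standard textbook definition is given below; the symbols are assumptions chosen for this sketch ($x_i$ for the 0/1 item score, $y_i$ for the test score, $p$ for the proportion answering correctly, $\bar{y}_1$ for the mean test score of correct responders, and $s_y$ for the standard deviation of test scores computed with divisor $N$) and may differ from the presenters' notation.

```latex
r_{pb}
  = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \; \sum_{i=1}^{N} (y_i - \bar{y})^2}}
  = \frac{\bar{y}_1 - \bar{y}}{s_y} \sqrt{\frac{p}{1 - p}}
```

The second form follows because, for a 0/1 variable, the covariance with $y$ is $p(\bar{y}_1 - \bar{y})$ and the standard deviation of $x$ is $\sqrt{p(1-p)}$.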

Page 10: Caveon Webinar Series: Improving Testing with Key Strength Analysis

After Some Algebra

z* depends upon all the right quantities:
• Number of respondents
• Proportion answering correctly
• Standard deviation of test scores
• Average score for examinees who answered correctly
• Average score for all examinees
• Difference from overall mean

Page 11: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Why z* Depends on all the Right Quantities

• Assume a random sample of n units from a population of size N
• Distribution of the sample mean
– The expected value is the population mean
– The standard deviation is the standard error of the mean, which depends on the sample size and the population standard deviation
• The standardized value of the sample mean happens to be z* for correct responders.
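The formulas on the two derivation slides were lost in extraction. One reconstruction that is consistent with the quantities the slides list is sketched below. All symbols are assumptions: $N$ respondents, $n = Np$ correct responders, $p$ the proportion answering correctly, $s_y$ the standard deviation of test scores (divisor $N$), $\bar{y}_1$ the mean score of correct responders, $\bar{y}$ the overall mean, and $r_{pb}$ the point-biserial correlation. Treating the correct responders as a sample drawn without replacement from all respondents gives:

```latex
\sigma_{\bar{y}_1} = \frac{s_y}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}},
\qquad n = Np

z^{*} = \frac{\bar{y}_1 - \bar{y}}{\sigma_{\bar{y}_1}}
      = \frac{\bar{y}_1 - \bar{y}}{s_y} \sqrt{\frac{Np\,(N - 1)}{N - Np}}
      = \frac{\bar{y}_1 - \bar{y}}{s_y} \sqrt{\frac{p}{1 - p}} \, \sqrt{N - 1}
      = r_{pb} \sqrt{N - 1}
```

Under these assumptions z* is the point-biserial correlation scaled by $\sqrt{N - 1}$, which matches the earlier instruction to "scale point-biserial appropriately". The same expression applies to a distractor if $p$ and $\bar{y}_1$ are taken for the examinees who selected that distractor.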

Page 12: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Z* for all Items and Responses

[Figure: distributions of z* for Right and Wrong answer choices; x-axis: z*, from -10 to 10; y-axis from 0 to 0.16; 154 examinees, 100 items]

Page 13: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Calculating p(choice is a key | data)

• We want to calculate p(choice is a key | z*)
• We can easily calculate p(z* | choice is a key)
• Bayes’ Rule allows probability inversion

p(z*) = p(z* | choice is a key) p(key) + p(z* | choice is not a key) p(not key)
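The probability inversion can be sketched in Python as below. This is a sketch under assumptions, not the presenters' code: the two likelihoods are modeled as normal densities with a common standard deviation (the approximation described on the next slide), and the parameter values used in the usage note (means, standard deviation, prior) are hypothetical, since the slides do not report fitted values.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def key_strength(z_star, mean_key, mean_not_key, sd, prior_key):
    """p(choice is a key | z*) by Bayes' rule.

    The denominator is p(z*), the total-probability expression on the slide.
    """
    joint_key = normal_pdf(z_star, mean_key, sd) * prior_key
    joint_not = normal_pdf(z_star, mean_not_key, sd) * (1 - prior_key)
    total = joint_key + joint_not
    if total == 0:
        return float("nan")  # both densities underflowed; z* is far in a tail
    return joint_key / total
```

For a four-choice item one might take `prior_key=0.25`; with hypothetical means of 2.0 for keys and -1.0 for non-keys, `key_strength(z, 2.0, -1.0, 1.5, 0.25)` rises steadily as `z` increases.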

Page 14: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Approximation Theory

• By the Central Limit Theorem, z* is approximately normal
• Probability function should be monotonic increasing, which requires equal variances

[Figure: z* distributions with normal curves (series: Right, Right Normal, Wrong, Wrong Normal); x-axis: z*, from -10 to 10; y-axis from 0 to 0.16]
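The link between equal variances and a monotonic probability function can be made explicit with a short derivation. The symbols are assumptions introduced for this sketch: $\mu_K$ and $\mu_W$ are the means of z* for keys and non-keys, $\sigma$ is their common standard deviation, and $\pi$ is the prior probability that a choice is a key.

```latex
\log \frac{p(\text{key} \mid z^{*})}{p(\text{not key} \mid z^{*})}
  = \log \frac{\pi}{1 - \pi}
    + \frac{(z^{*} - \mu_W)^2 - (z^{*} - \mu_K)^2}{2\sigma^2}
  = \log \frac{\pi}{1 - \pi}
    + \frac{\mu_K - \mu_W}{\sigma^2}
      \left( z^{*} - \frac{\mu_K + \mu_W}{2} \right)
```

With a common variance the log-odds are linear in z*, so the probability increases monotonically whenever $\mu_K > \mu_W$. If the two variances differed, a term proportional to $(z^{*})^2$ would remain and the curve would eventually turn back down in one tail.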

Page 15: Caveon Webinar Series: Improving Testing with Key Strength Analysis

P(choice is a key | z*)

[Figure: p(choice is a key | z*) plotted against z*; x-axis: z*, from -4 to 5; y-axis: p(choice is a key | z*), from 0 to 1]

Page 16: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Analysis of Distractors

• Compute key strength (KS) for all responses
• Low KS – probability less than 50%
• High KS – probability 50% or more

Answer \ Distractors   Low KS          High KS
Low KS                 Weakly keyed    Potential mis-key
High KS                Normal          Ambiguously keyed
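The two-by-two table above translates directly into a small classifier. The sketch below is illustrative: the function name and the dictionary input format are hypothetical, and "High KS for distractors" is read as "at least one distractor has key strength of 50% or more", which is an assumption about how the table is applied.

```python
def classify_item(key_strengths, keyed_choice, threshold=0.5):
    """Classify an item from the key strength of each answer choice.

    key_strengths: dict mapping choice label to p(choice is a key | z*)
    keyed_choice:  the label currently treated as the answer key
    """
    answer_high = key_strengths[keyed_choice] >= threshold
    distractor_high = any(
        ks >= threshold
        for choice, ks in key_strengths.items()
        if choice != keyed_choice
    )
    if answer_high and distractor_high:
        return "Ambiguously keyed"
    if answer_high:
        return "Normal"
    if distractor_high:
        return "Potential mis-key"
    return "Weakly keyed"
```

For instance, `classify_item({"A": 0.32, "B": 0.06, "C": 0.0, "D": 0.0}, "A")` returns `"Weakly keyed"`, assuming choice A is the keyed answer.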

Page 17: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example I – Good Key

[Figure: p(choice is a key | z*) plotted against z* (from -4 to 5), with arrows marking choices A, B, C, and D]

Response   z*      Probability
A          3.25    0.99
B          0.25    0.06
C          -2.75   0
D          -2.4    0

Answer key arrow is colored gold

Page 18: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example II – Potential Mis-key

[Figure: p(choice is a key | z*) plotted against z* (from -4 to 5), with arrows marking choices A, B, C, and D]

Response   z*      Probability
A          3.25    0.99
B          0.25    0.06
C          -2.75   0
D          -2.4    0

Answer key arrow is colored gold

Page 19: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example III – Weak Key

[Figure: p(choice is a key | z*) plotted against z* (from -4 to 5), with arrows marking choices A, B, C, and D]

Response   z*      Probability
A          1.0     0.32
B          0.25    0.06
C          -3      0
D          -2.5    0

Answer key arrow is colored gold

Page 20: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example IV – Ambiguous Key

[Figure: p(choice is a key | z*) plotted against z* (from -4 to 5), with arrows marking choices A, B, C, and D]

Response   z*      Probability
A          3.75    0.99
B          2.25    0.9
C          -3      0
D          -2.5    0

Answer key arrow is colored gold

Page 21: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Validation – Answer Key Estimation

• Assume the key is not known
• Check accuracy of estimated answer key
• Algorithm:
– Start with most frequent response as initial guess
– Revise key using probabilities until no more changes
• For 12 different exams
– Key estimation accuracy varied from 81% to 99%
– Cannot infer multiple keys
– Cannot guess key when there are no correct responses
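The estimation loop described above can be sketched as follows. This is a simplified illustration, not the algorithm Caveon ran: it picks, for each item, the choice with the largest z* rather than the largest posterior probability (equivalent only if the probability is monotonic in z*), it computes z* as a standardized mean rest score under a finite-population assumption, and `max_iter` is a hypothetical guard against cycling.

```python
import math
from collections import Counter

def z_star(responses, rest, choice):
    """Standardized mean rest score of examinees who selected `choice`."""
    N = len(responses)
    chosen = [s for r, s in zip(responses, rest) if r == choice]
    n = len(chosen)
    mean_all = sum(rest) / N
    sd_all = math.sqrt(sum((s - mean_all) ** 2 for s in rest) / N)
    if n == 0 or n == N or sd_all == 0:
        return float("nan")
    se = sd_all / math.sqrt(n) * math.sqrt((N - n) / (N - 1))
    return (sum(chosen) / n - mean_all) / se

def estimate_key(response_matrix, max_iter=20):
    """Estimate an answer key from raw responses (rows = examinees, columns = items)."""
    n_items = len(response_matrix[0])
    columns = [[row[j] for row in response_matrix] for j in range(n_items)]
    # Initial guess: the most frequent response to each item.
    key = [Counter(col).most_common(1)[0][0] for col in columns]
    for _ in range(max_iter):
        totals = [sum(r == k for r, k in zip(row, key)) for row in response_matrix]
        new_key = []
        for j, col in enumerate(columns):
            # Score each examinee with item j excluded.
            rest = [t - (r == key[j]) for t, r in zip(totals, col)]
            scored = {c: z_star(col, rest, c) for c in set(col)}
            valid = {c: z for c, z in scored.items() if not math.isnan(z)}
            new_key.append(max(valid, key=valid.get) if valid else key[j])
        if new_key == key:
            break  # no more changes
        key = new_key
    return key
```

The sketch shares the limitations listed on the slide: it returns a single key per item, and it cannot recover a key that no examinee selected.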

Page 22: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Summary of Validation Study

• Accuracy improves with item quality
• Accuracy affected by sample size & test length

Exam  N       Forms  Form Length   Items  Non-scored Items  Accuracy  Observations
A     2,966   2      180           307    0                 99.2%
B     337     2      107           214    0                 85.5%
C     337     1      230           230    0                 90.9%
D     1,815   1      204           204    7                 92.1%     Some association with "deleted" items
E     1,408   1      199           199    1                 96.0%
F     46,356  2      240           480    0                 96.0%
G     44,104  2      120           240    0                 95.8%
H     25,448  2      60            120    0                 93.3%
I     121     3      165           417    43                81.0%     Strong association with "field test" items
J     1,071   8      52 & 61       391    0                 80.5%     85.2% (English-only)
K     2,033   8      68, 76 & 77   510    0                 85.9%
L     6,473   21     250           1,050  850               85.7%     All errors except one were on non-scored items.

Page 23: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Reason for Answer Key Estimation

• If a group of test takers has stolen the test and worked out their own answer key, it is likely some answers will be wrong.

• Answer key estimation can find the errors committed by test thieves.

Page 24: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Dan Allen

Psychometrician

Western Governors University

Page 25: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example Item: Ambiguous Key

Which is a property of all X?

A. They contain Y.

B. They have property Z.

C. * They do not contain Y.

D. They have property W.

Looking at the item text, we see that the ambiguity is likely caused by rival options A and C. Feedback from subject-matter experts (SMEs) suggests the item is too text-specific.

Page 26: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example Item: Ambiguous Key

Which is a component of X?

A. * Real anticipated expense

B. Time spent

C. Liquid assets

D. Quality

In this case, students of high ability were often selecting C instead of A. SME feedback suggests the deleted word may have been turning students off to that option.

Page 27: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example Item: Weak Key

Select 3 possible causes of X

A. *Obesity

B. Contaminated drinking water

C. *Unhealthy diet

D. *Genetic factors

E. Lack of exercise

High performing students were picking C and D correctly, but were as likely to pick E as they were to pick A. SME feedback suggested that E may be a reasonable answer to the question. The revision involved making A, C, and E all incorrect answers so that D would remain the sole answer.

Page 28: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Example Item: Potential Mis-key

Which is a sound accounting principle?

A. X

B. Not X

C. *Y

D. Z

Nearly all students selected distractor B (Not X). This item was not mis-keyed. It seems most likely that this concept was not covered sufficiently in the text and/or other learning resources, leaving students to use guessing strategies rather than content knowledge.

Page 29: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Barbara Foster

Psychometrician

The American Board of Obstetrics and Gynecology

Page 30: Caveon Webinar Series: Improving Testing with Key Strength Analysis

The American Board of Obstetrics and Gynecology

2013 Certifying Exam
• 180 scored items
• Five sets of 40 field test items

Page 31: Caveon Webinar Series: Improving Testing with Key Strength Analysis

The American Board of Obstetrics and Gynecology

• Potential mis-keys from Caveon
– 8 identified among the scored items (4%)
– 22 identified among the field test items (11%)

The lower proportion in the scored items is not surprising, since those items have been field tested and some may have been previously used.

Page 32: Caveon Webinar Series: Improving Testing with Key Strength Analysis

The American Board of Obstetrics and Gynecology

• Result of the SME review of the flagged scored items:
– 4 of the 8 (50%) were found to have problems.

These problems were a combination of ambiguous wording, new information published just prior to the exam, recent changes in guidelines, or just a very difficult item. These items were deleted from the exam prior to scoring.

Page 33: Caveon Webinar Series: Improving Testing with Key Strength Analysis

The American Board of Obstetrics and Gynecology

• Result of the SME review of the flagged field test items:
– 15 of the 22 (68%) were found to have problems.

These problems were mostly a combination of ambiguous wording, responses too closely related, and changes in the field.

Page 34: Caveon Webinar Series: Improving Testing with Key Strength Analysis

The American Board of Obstetrics and Gynecology

[Venn diagram: field test items flagged by each method]
• Our Standard Methods: 27 field test items flagged (13.5%)
• The z* Method: 22 field test items flagged (11.0%)
• Flagged by both: 8 items (4%)

Page 35: Caveon Webinar Series: Improving Testing with Key Strength Analysis

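The American Board of Obstetrics and Gynecology

[Venn diagram: field test items flagged by each method, with SME review results]
• Our Standard Methods: 27 field test items flagged (13.5%); 13 had problems
• The z* Method: 22 field test items flagged (11.0%); 15 had problems
• Flagged by both: 8 items (4%); 5 had problems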

Page 36: Caveon Webinar Series: Improving Testing with Key Strength Analysis

The American Board of Obstetrics and Gynecology

• Conclusion

The new method appears to detect differences that our current methods do not detect. These differences do not appear to be strictly keying errors; they involve other important problem areas as well.

Page 37: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Conclusions

• Item analysis helps ensure
– Unidimensionality
– Desired item performance
• Key Strength Analysis enhances classical item analysis
– Uses information from all items
– Compares answer choices for same item
  • Can detect structural flaws in items
  • Can suggest the actual key when the item is mis-keyed
– Suggests possible reasons for poor performance
• Future research
– Investigate thresholds for Key Strength Analysis
– Simulate item problems to measure ability to detect
– Evaluate performance when assumptions fail

Page 38: Caveon Webinar Series: Improving Testing with Key Strength Analysis

Questions?

Please type questions for our presenters in the GoToWebinar control panel on your screen.

Page 39: Caveon Webinar Series: Improving Testing with Key Strength Analysis

HANDBOOK OF TEST SECURITY

• Editors - James Wollack & John Fremer
• Published March 2013
• Preventing, Detecting, and Investigating Cheating
• Testing in Many Domains
– Certification/Licensure
– Clinical
– Educational
– Industrial/Organizational
• Don’t forget to order your copy at www.routledge.com
– http://bit.ly/HandbookTS (Case Sensitive)
– Save 20% - Enter discount code: HYJ82

Page 40: Caveon Webinar Series: Improving Testing with Key Strength Analysis

THANK YOU!

- Follow Caveon on Twitter: @caveon
- Check out our blog: www.caveon.com/blog
- LinkedIn Group: “Caveon Test Security”

- Dennis Maynes, Chief Scientist, Caveon Test Security
- Dan Allen, Psychometrician, Western Governors University
- Marcus Scott, Data Forensics Scientist, Caveon Test Security
- Barbara Foster, Psychometrician, American Board of Obstetrics and Gynecology