Implementing Item Response Theory


Page 1: Implementing Item Response Theory

Day 2 AM: Advanced IRT topics

Linking and Equating

DIF

Polytomous IRT

IRT Software overview

Dimensionality

Page 2: Implementing Item Response Theory

Part 1

Linking, equating, and scaling

Linking = setting different sets of item parameters onto the same scale

Equating = setting different sets of students/scores on the same scale

Page 3: Implementing Item Response Theory

Linking and equating

Why important? This is necessary for a stable scale.

If we equate scores on this year’s exam forms to last year’s, we know that a score of X still means the exact same thing.

If we don’t, a score of X could mean different things.

Page 4: Implementing Item Response Theory

Linking and equating

Two approaches

◦ Base form: Completely map Form B onto the scale of Form A (the base form)

Appropriate for maintaining continuity across time… Form A is the base for a long time

◦ Merged scale: Combine data and scales for Forms A and B (“super matrix”)

OK for multiple forms at the same time, but not across time

Page 5: Implementing Item Response Theory

Linking and equating

In IRT, items and people are on the same scale, so linking and equating are theoretically equivalent (although they can be conducted separately).

In CTT, this is not the case.

Linking doesn’t really exist in CTT – but there is extensive research on equating because it is so important.

Page 6: Implementing Item Response Theory

IRT equating

Many issues in CTT equating are reduced with IRT because of the property of invariance.

Item parameters are invariant across calibration groups, except for a linear transformation.

All we have to do is find it.

Page 7: Implementing Item Response Theory

IRT equating

Why? The scales are defined by the sample scores, centered on N(0,1).

Another sample might have a slightly different distribution of true scores, and is calibrated with a theoretically different N(0,1).

So we find how to map the two scales to each other.

Page 8: Implementing Item Response Theory

Prerequisites

To accomplish linking/equating:

1. The two tests/forms must measure the same thing (otherwise the result is a concordance, not an equating)

2. The two tests/forms must have equal reliability (classically)

3. The equating transformation must be invertible

◦ (A to B and B to A)

Page 9: Implementing Item Response Theory

Common items or people?

To do an effective linking between two data sets, you need something in common

◦ People – some of the students are the same as last year (but unchanged, so this is probably not a good idea for education!)

◦ Items – rule of thumb is 20% of the test or 20 items

Page 10: Implementing Item Response Theory

Common item linking

Page 11: Implementing Item Response Theory

Common item linking

Suppose there were 100 items on a test in 2010…

…and the first 20 were anchors back to 2009.

Then we need to pick 20 out of the last 80 to be anchors in 2011.

80 additional items would be selected as “new” (not necessarily brand new).

Page 12: Implementing Item Response Theory

Common item linking

2009 average = 65

2010 average = 67

2009 anchor average = 11

2010 anchor average = 12

The anchor items (the same 20 items both years) went up by 1 point, which suggests the 2010 group is somewhat more able; the linking uses this to separate group change from form difficulty.

Page 13: Implementing Item Response Theory

Common item linking

Items should specifically be selected to be the anchors

Difficulty: spread similar to the test as a whole

Discrimination: higher is better, but not so much that it is unrepresentative of the test

Not previous anchors

Page 14: Implementing Item Response Theory

IRT linking analyses

There are two paradigms for linking:

◦ Concurrent calibration linking

Full Group (merges scale!!!)

Target Group (fix parameters)

◦ Conversion linking

Parameter transformation (mean/mean, mean/sigma; see the sketch below)

TRF methods (Stocking & Lord, Haebara)
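For intuition, here is a minimal sketch of the mean/sigma transformation in Python; the anchor-item values are made up for illustration. The slope A is the ratio of the standard deviations of the anchor b parameters, and the intercept B lines up their means.

```python
import numpy as np

# Hypothetical anchor-item difficulty estimates from two separate calibrations
b_base = np.array([-1.2, -0.5, 0.1, 0.8, 1.4])   # base-form (old) scale
b_new  = np.array([-1.0, -0.2, 0.4, 1.1, 1.8])   # new-form scale

# Mean/sigma method: find the linear conversion b* = A*b + B that puts
# the new-form parameters onto the base scale
A = b_base.std(ddof=1) / b_new.std(ddof=1)
B = b_base.mean() - A * b_new.mean()

print(f"slope A = {A:.3f}, intercept B = {B:.3f}")
```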

Page 15: Implementing Item Response Theory

IRT linking analyses

I recommend either target group calibration or S&L conversion.

Xcalibre does concurrent calibration methods.

Conversion methods are an additional post-hoc analysis, so separate software: IRTEQ.

Page 16: Implementing Item Response Theory

Linking software - IRTEQ

User-friendly conversion linking

◦ Kyung (Chris) T. Han

Now at GMAC

◦ Windows GUI

◦ Does all major conversion methods, and compares them

◦ Interfaces with Parscale

Page 17: Implementing Item Response Theory

Linking software - IRTEQ

Purpose of conversion methods: estimate the linear conversion between two IRT scales (two different forms).

Kind of like regression.

Since it is linear, “no change” is a slope (A) of 1 and an intercept (B) of 0.

There are five different methods of estimating these.

Page 18: Implementing Item Response Theory

Linking software – IRTEQ

Page 19: Implementing Item Response Theory

Linking software - IRTEQ

Compare b parameters

Page 20: Implementing Item Response Theory

Linking software - IRTEQ

Output from example data…

Page 21: Implementing Item Response Theory

Linking software - IRTEQ

Then what? Apply the conversion:

θ* = Aθ + B

b* = Ab + B

a* = a/A
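A minimal sketch of applying these three conversions in Python, with hypothetical constants standing in for the A and B that IRTEQ would estimate:

```python
import numpy as np

# Hypothetical conversion constants (e.g., from a Stocking & Lord run);
# the values here are made up for illustration
A, B = 0.94, 0.12

theta = np.array([-1.5, 0.0, 0.7])   # new-form person estimates
a = np.array([0.8, 1.2, 1.6])        # new-form discriminations
b = np.array([-0.9, 0.1, 1.3])       # new-form difficulties

theta_star = A * theta + B           # θ* = Aθ + B
b_star = A * b + B                   # b* = Ab + B
a_star = a / A                       # a* = a/A

print(theta_star, b_star, a_star)
```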

Page 22: Implementing Item Response Theory

Part 2

Differential item functioning (DIF)

Page 23: Implementing Item Response Theory

What is DIF?

Differential item functioning

The item functions differently for different groups – and is therefore unfair

One group is more likely to get an item correct when ability is held constant

Page 24: Implementing Item Response Theory

What is DIF?

Two ways to operationally define it:

Directly evaluate the probability of response for ability-level slices (Mantel-Haenszel)

Compare item parameters or statistics for each group

◦ Basically, analyze the data for each group separately, then compare

Page 25: Implementing Item Response Theory

DIF Groups

Reference group – the main (usually majority) group

Focal group – the group being examined to see if it differs from the reference group (usually a minority)

DIF analyses assume that both are on the same scale

Page 26: Implementing Item Response Theory

Types of DIF

Non-Crossing DIF = same across all ability levels

◦ Females do better than males, regardless of ability

◦ “Bias”

◦ aka Uniform DIF

Crossing DIF = not the same across ability levels

◦ Females do better than males at above-average ability, but the same for low ability

◦ aka Non-Uniform DIF

Page 27: Implementing Item Response Theory

Non-crossing DIF Example

Page 28: Implementing Item Response Theory

Crossing DIF Example

Page 29: Implementing Item Response Theory

Quantifying DIF

Mantel-Haenszel (in Xcalibre)

Make N ability-level slices

At each, a 2 x 2 table of reference/focal and correct/incorrect

“Ability” can be classical or IRT scores

Show Xcalibre – SMKING with P/L
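As a sketch of the idea (not Xcalibre’s implementation), the Mantel-Haenszel common odds ratio pools the 2 x 2 tables across the ability slices; the counts below are hypothetical:

```python
import numpy as np

def mantel_haenszel_odds(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """Mantel-Haenszel common odds ratio across ability slices.

    Each argument is an array with one count per ability slice.
    alpha > 1 means the reference group is more likely to answer
    correctly than the focal group at the same ability level.
    """
    n = ref_correct + ref_wrong + foc_correct + foc_wrong   # slice totals
    num = np.sum(ref_correct * foc_wrong / n)
    den = np.sum(ref_wrong * foc_correct / n)
    return num / den

# Hypothetical counts for one item over four ability slices
alpha = mantel_haenszel_odds(
    ref_correct=np.array([20, 35, 50, 60]),
    ref_wrong=np.array([30, 25, 15, 10]),
    foc_correct=np.array([15, 28, 45, 55]),
    foc_wrong=np.array([35, 32, 20, 15]),
)
print(f"MH odds ratio: {alpha:.2f}")   # here > 1, favoring the reference group
```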

Page 30: Implementing Item Response Theory

Quantifying DIF

There are two IRT-only approaches to quantifying DIF

◦ Difference between item parameters

bR = bF?

Parscale uses this

◦ Difference between IRFs

More advanced and recent (1995)

Special program needed: DFIT

ASC sells the Windows version; DOS is free

Page 31: Implementing Item Response Theory

DIF in Parscale

Parscale gives several indices (bR = bF)

◦ 1. Absolute difference in parameters

◦ 2. Standardized difference (absolute difference / SE)

◦ 3. Chi-square = (standardized difference)²
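A minimal sketch of these three indices for a single item, using made-up estimates and standard errors; pooling the two SEs in the denominator is one common choice, not necessarily Parscale’s exact formula:

```python
import math

b_ref, se_ref = 0.35, 0.08   # hypothetical reference-group estimate and SE
b_foc, se_foc = 0.62, 0.11   # hypothetical focal-group estimate and SE

raw_diff = b_foc - b_ref                                  # 1. absolute difference
stan_diff = raw_diff / math.sqrt(se_ref**2 + se_foc**2)   # 2. standardized difference
chi_sq = stan_diff**2                                     # 3. chi-square (1 df)

# Two-sided p value for the standardized difference
p = math.erfc(abs(stan_diff) / math.sqrt(2))
print(f"diff={raw_diff:.2f}  z={stan_diff:.2f}  chi2={chi_sq:.2f}  p={p:.3f}")
```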

Page 32: Implementing Item Response Theory

DIF in Parscale

Contrast and standardized difference

Page 33: Implementing Item Response Theory

DIF in Parscale

Chi-square: more conservative, so better to use

Page 34: Implementing Item Response Theory

DIF in Parscale

Straightforward interpretations

◦ Raw diff

◦ Standard diff

◦ p values

But notice that they are different in the two tables!

Page 35: Implementing Item Response Theory

DFIT

Tests the shaded area between the reference and focal IRFs
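A rough sketch of computing that area numerically for two hypothetical 2PL items; the DFIT framework itself uses more refined indices (CDIF, NCDIF, DTF), so this only illustrates the idea:

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 801)
p_ref = irf_2pl(theta, a=1.2, b=0.0)   # hypothetical reference-group curve
p_foc = irf_2pl(theta, a=1.2, b=0.4)   # hypothetical focal-group curve

# Numerical integration over theta; the unsigned area matters because
# crossing DIF can cancel out in a signed area
dx = theta[1] - theta[0]
unsigned_area = np.sum(np.abs(p_ref - p_foc)) * dx
signed_area = np.sum(p_ref - p_foc) * dx
print(f"unsigned={unsigned_area:.3f}, signed={signed_area:.3f}")
```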

Page 36: Implementing Item Response Theory

Compensatory DIF

Another thing to keep an eye out for

DIF in one item can be offset by DIF in another

So a few items in one direction can be offset by a few in the other

The total test then shows no differential test functioning (DTF)

Page 37: Implementing Item Response Theory

Compensatory DIF

And you’re not likely to have a test without DIF items; it just happens for whatever reason (Type I error).

DIF analysis can only flag items for you.

You then need to closely evaluate content and decide whether there is an issue, or to proceed.

Page 38: Implementing Item Response Theory

Part 3

Polytomous IRT

Page 39: Implementing Item Response Theory

Polytomous IRT

For data that is scored with 3+ score points (remember, multiple choice collapses to two)

As mentioned previously, there are two main families of polytomous models

◦ Rating Scale

◦ Partial Credit

Each comes in Rasch and non-Rasch (“Generalized”) versions (a sketch follows below)
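As a concrete example of the partial credit family, here is a minimal sketch of the generalized partial credit model (GPCM) category probabilities; fixing a = 1 gives the Rasch-family partial credit model. All values are made up:

```python
import numpy as np

def gpcm_probs(theta, a, b_steps):
    """Generalized partial credit model category probabilities.

    theta   : latent trait value
    a       : item discrimination (fix a=1 for the Rasch-family PCM)
    b_steps : step difficulties b_1..b_m for an item with m+1 categories
    """
    # Cumulative sums of a*(theta - b_j); category 0 has an empty sum (0)
    z = np.concatenate([[0.0], np.cumsum(a * (theta - np.asarray(b_steps)))])
    ez = np.exp(z - z.max())   # subtract max for numerical stability
    return ez / ez.sum()

# Hypothetical 4-category item scored 0-3
print(gpcm_probs(theta=0.5, a=1.1, b_steps=[-1.0, 0.2, 1.4]))
```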

Page 40: Implementing Item Response Theory

Rating scale approach

Designed for Likert-type questions…

◦ Rate on a scale of 1 to 5 whether the adjective applies to you:

Adjective       1  2  3  4  5
Trustworthy
Outgoing
Diligent
Conscientious

Page 41: Implementing Item Response Theory

Rating scale approach

We assume that the process or mental structure behind the 1-5 scale is the same for every item

But items might differ in “difficulty”

Page 42: Implementing Item Response Theory

Partial credit approach

We assume that the response process is different for every item

The difference between 2 and 4 points might be wider/narrower

Page 43: Implementing Item Response Theory

Rasch/Non-Rasch

Non-Rasch allows discrimination to vary between items

This means curves can have different steepness/separation

Page 44: Implementing Item Response Theory

Comparison table

Model   Item Disc.               Step Spacing   Step Ordering   Option Disc.
RSM     Fixed                    Fixed          Fixed           Fixed
PCM     Fixed                    Variable       Variable        Fixed
GRSM    Variable                 Fixed          Fixed           Fixed
GRM     Variable                 Variable       Fixed           Fixed
GPCM    Variable                 Variable       Variable        Fixed
NRM     Variable (each option)   Variable       Variable        Variable

Fixed/Variable = between items

Page 45: Implementing Item Response Theory

Polytomous IRT

It used to be that you had to program all of that manually (PARSCALE)

Let’s look at it in Xcalibre 4…

Page 46: Implementing Item Response Theory

Part 4

IRT Software

Page 47: Implementing Item Response Theory

IRT Software

There are a number of programs out there, reflecting:

◦ Types of approaches (Rasch, 3PL, Poly)

◦ Cost (free up to 100s of dollars)

◦ Special topics like fit, linking, and form assembly

◦ Usability vs. flexibility

Page 48: Implementing Item Response Theory

Some IRT calibration programs

Xcalibre 4 – easy to use, complete reports

Parscale – extremely flexible, does most models, but difficult to use

Bilog – most powerful dichotomous program, difficult to use

ConQuest – advanced things like facets models and multidimensional

Winsteps – most common Rasch program

Page 49: Implementing Item Response Theory

Some IRT calibration programs

PARAM3PL – free; only 3PL

ICL – free; lots of stuff, but difficult to use and no support

R – free; some routines there, but slow, and inferior output

OPLM – free; from Cito

Page 50: Implementing Item Response Theory

Other IRT programs

ASC’s Form Building Tool to build new forms using calibrated items

DIF Tool for DIF graphing

DFIT8 – DFIT framework for DIF

ScoreAll – scores examinees

CATSim – CAT simulations

IRTLRDIF2

Most organizations build their own tools for specific purposes

Page 51: Implementing Item Response Theory

What to do with the results?

Often a good idea to import scores and item parameters into Excel (ASC does CSV directly)

You can manipulate and further analyze (frequency graphs, etc.)

Also helpful for further importing – scores into a database and item parameters into the item banker
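For example, a minimal sketch with pandas (file and column names are hypothetical; the histogram needs matplotlib installed):

```python
import pandas as pd

# Hypothetical CSV exports from the calibration software
items = pd.read_csv("item_parameters.csv")   # columns: item, a, b, c
scores = pd.read_csv("scores.csv")           # columns: examinee, theta

print(items.describe())                      # quick parameter summary
scores["theta"].hist(bins=30)                # frequency graph of scores
```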

Page 52: Implementing Item Response Theory

Part 5

Assumptions of IRT: Dimensionality

Page 53: Implementing Item Response Theory

IRT assumptions

Basic Assumptions

1. A stable unidimensional trait

Item responses are independent of each other (local independence), except for the trait/ability that they measure

2. A specific form of the relationship between trait level and probability of a response (the response function, or IRT model)

Page 54: Implementing Item Response Theory

IRT assumptions

Unidimensionality and local independence are actually equivalent

◦ If items are interdependent, then the probability of a response is due to two things: your trait level, and whether you saw the “tripping” item first

◦ This makes it two-dimensional

Page 55: Implementing Item Response Theory

IRT assumptions

Two other common violations:

◦ Speededness

◦ An actually multidimensional test (medical knowledge vs. clinical ability, language vs. math)

Page 56: Implementing Item Response Theory

How to check

So there are two important things to check:

◦ Unidimensionality

Factor Analysis

Bejar’s method

DIMTEST

◦ Whether our IRT model was a good choice

Model fit

Page 57: Implementing Item Response Theory

Checking unidimensionality

Factor analysis

Used often in research investigating dimensionality

But it is not recommended to use “normal” factor analysis, which uses Pearson correlations

◦ This is what is used in typical software packages like SPSS

Page 58: Implementing Item Response Theory

Checking unidimensionality

Item-level data from tests are dichotomous, unlike total scores, which are continuous

Special software does factor analysis with tetrachoric correlations for this

MicroFact (from ASC)

TESTFACT (from SSI)
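For intuition only, a rough sketch of the tetrachoric correlation for one item pair using the classic cosine-pi approximation; dedicated programs like MicroFact estimate it properly by maximum likelihood. The data below are made up:

```python
import numpy as np

def tetrachoric_approx(x, y):
    """Cosine-pi approximation to the tetrachoric correlation
    between two dichotomous (0/1) item-response vectors.
    """
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))   # both correct
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))   # both incorrect
    return np.cos(np.pi / (1 + np.sqrt(a * d / (b * c))))

# Hypothetical responses from 10 examinees to two items
item1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
item2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(f"{tetrachoric_approx(item1, item2):.2f}")
```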

Page 59: Implementing Item Response Theory

Checking unidimensionality

Output is still similar to regular factor analysis

Eigenvalue plot to examine the number of factors

Factor loading matrix to examine the “sorting” of items

Page 60: Implementing Item Response Theory

Checking unidimensionality

Page 61: Implementing Item Response Theory

Checking unidimensionality

See output files…

If unidimensional, factor loadings will pattern similarly to the IRT a parameters

Item    a     Loading
1      .72    .42
2      .81    .44
3      .96    .54
4      .83    .25
5      .47    .11
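The pattern is not accidental: under the normal-ogive formulation, the IRT discrimination is approximately a monotone transformation of the factor loading λ,

$$ a_i \approx \frac{\lambda_i}{\sqrt{1 - \lambda_i^2}} $$

(in the normal-ogive metric; this is a rule of thumb, and sample estimates such as those above will not line up exactly).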

Page 62: Implementing Item Response Theory

Checking unidimensionality

Bejar’s Method

Useful in situations where you know your test has different content areas

Examples:

◦ Cognitive test with fluid and crystallized components

◦ Math test with story problems and number-only problems

◦ Language test with writing and reading

Page 63: Implementing Item Response Theory

Checking unidimensionality

It is possible that these tests are not completely unidimensional, and we have a good reason to check

Page 64: Implementing Item Response Theory

Checking unidimensionality

Bejar’s method:

◦ 1. Do an IRT calibration of the entire test

◦ 2. Do an IRT calibration of each area separately

◦ 3. Compare the item parameters
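A minimal sketch of step 3 in Python, with made-up b estimates; if the test is essentially unidimensional, the two calibrations should agree up to a linear transformation:

```python
import numpy as np

# Hypothetical b estimates for the items of one content area, from
# (1) the whole-test calibration and (2) the area-only calibration
b_whole = np.array([-1.1, -0.4, 0.2, 0.9, 1.5])
b_area  = np.array([-1.0, -0.5, 0.3, 1.0, 1.4])

# If the trait is essentially unidimensional, the two sets should fall
# on a straight line (they may differ by a linear scale shift)
r = np.corrcoef(b_whole, b_area)[0, 1]
print(f"correlation: {r:.3f}")   # values near 1 support unidimensionality
```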

Page 65: Implementing Item Response Theory

Checking unidimensionality

b parameters

Page 66: Implementing Item Response Theory

Checking unidimensionality

a parameters

Page 67: Implementing Item Response Theory

Checking unidimensionality

c parameters