Implementing Item Response Theory


Page 1: Implementing Item Response Theory

Day 2 AM: Advanced IRT topics

Linking and Equating

DIF

Polytomous IRT

IRT Software overview

Dimensionality

Page 2: Implementing Item Response Theory

Part 1

Linking, equating, and scaling

Linking = setting different sets of item parameters onto the same scale

Equating = setting different sets of students/scores on the same scale

Page 3: Implementing Item Response Theory

Linking and equating

Why important? This is necessary for a stable scale.

If we equate scores on this year’s exam forms to last year’s, we know that a score of X still means the exact same thing.

If we don’t, a score of X could mean different things.

Page 4: Implementing Item Response Theory

Linking and equating

Two approaches

◦ Base form: Completely map Form B onto the scale of Form A (the base form)

Appropriate for maintaining continuity across time… Form A is the base for a long time

◦ Merged scale: Combine data and scales for Forms A and B (“super matrix”)

OK for multiple forms at the same time, but not across time

Page 5: Implementing Item Response Theory

Linking and equating

In IRT, items and people are on the same scale, so linking and equating are theoretically equivalent (although they can be conducted separately).

In CTT, this is not the case.

Linking doesn’t really exist in CTT – but there is extensive research on equating because it is so important.

Page 6: Implementing Item Response Theory

IRT equating

Many issues in CTT equating are reduced with IRT because of the property of invariance.

Item parameters are invariant across calibration groups, except for a linear transformation.

All we have to do is find it.

Page 7: Implementing Item Response Theory

IRT equating

Why? The scales are defined by the sample scores, centered on N(0,1).

Another sample might have a slightly different distribution of true scores, and is calibrated with a theoretically different N(0,1).

So we find how to map the two scales to each other.

Page 8: Implementing Item Response Theory

Prerequisites

To accomplish linking/equating:

1. The two tests/forms must measure the same thing (otherwise the result is a concordance, not an equating)

2. The two tests/forms must have equal reliability (classically)

3. The equating transformation must be invertible

◦ (A to B and B to A)

Page 9: Implementing Item Response Theory

Common items or people?

To do an effective linking between two data sets, you need something in common

◦ People – some of the students are the same as last year (but unchanged, so this is probably not a good idea for education!)

◦ Items – rule of thumb is 20% of the test or 20 items

Page 10: Implementing Item Response Theory

Common item linking

Page 11: Implementing Item Response Theory

Common item linking

Suppose there were 100 items on a test in 2010…

…and the first 20 were anchors back to 2009.

Then we need to pick 20 out of the last 80 to be anchors in 2011.

80 additional items would be selected as “new” (not necessarily brand new).

Page 12: Implementing Item Response Theory

Common item linking

2009 average = 65

2010 average = 67

2009 anchor average = 11

2010 anchor average = 12

The anchor items (the same 20 items both years) went up by 1 point, which suggests the 2010 group is somewhat more able; the linking uses this to separate group change from form difficulty.

Page 13: Implementing Item Response Theory

Common item linking

Items should specifically be selected to be the anchors

Difficulty: spread similar to the test as a whole

Discrimination: higher is better, but not so much that it is unrepresentative of the test

Not previous anchors

Page 14: Implementing Item Response Theory

IRT linking analyses

There are two paradigms for linking:

◦ Concurrent calibration linking

Full Group (merges scale!!!)

Target Group (fix parameters)

◦ Conversion linking

Parameter transformation (mean/mean, mean/sigma; see the sketch below)

TRF methods (Stocking & Lord, Haebara)
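For intuition, here is a minimal sketch of the mean/sigma transformation in Python; the anchor-item values are made up for illustration. The slope A is the ratio of the standard deviations of the anchor b parameters, and the intercept B lines up their means.

```python
import numpy as np

# Hypothetical anchor-item difficulty estimates from two separate calibrations
b_base = np.array([-1.2, -0.5, 0.1, 0.8, 1.4])   # base-form (old) scale
b_new  = np.array([-1.0, -0.2, 0.4, 1.1, 1.8])   # new-form scale

# Mean/sigma method: find the linear conversion b* = A*b + B that puts
# the new-form parameters onto the base scale
A = b_base.std(ddof=1) / b_new.std(ddof=1)
B = b_base.mean() - A * b_new.mean()

print(f"slope A = {A:.3f}, intercept B = {B:.3f}")
```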

Page 15: Implementing Item Response Theory

IRT linking analyses

I recommend either target group calibration or S&L conversion.

Xcalibre does concurrent calibration methods.

Conversion methods are an additional post-hoc analysis, so separate software: IRTEQ.

Page 16: Implementing Item Response Theory

Linking software - IRTEQ

User-friendly conversion linking

◦ Kyung (Chris) T. Han

Now at GMAC

◦ Windows GUI

◦ Does all major conversion methods, and compares them

◦ Interfaces with Parscale

Page 17: Implementing Item Response Theory

Linking software - IRTEQ

Purpose of conversion methods: estimate the linear conversion between two IRT scales (two different forms).

Kind of like regression.

Since it is linear, “no change” is a slope (A) of 1 and an intercept (B) of 0.

There are five different methods of estimating these.

Page 18: Implementing Item Response Theory

Linking software – IRTEQ

Page 19: Implementing Item Response Theory

Linking software - IRTEQ

Compare b parameters

Page 20: Implementing Item Response Theory

Linking software - IRTEQ

Output from example data…

Page 21: Implementing Item Response Theory

Linking software - IRTEQ

Then what? Apply the conversion:

θ* = Aθ + B

b* = Ab + B

a* = a/A
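A minimal sketch of applying these three conversions in Python, with hypothetical constants standing in for the A and B that IRTEQ would estimate:

```python
import numpy as np

# Hypothetical conversion constants (e.g., from a Stocking & Lord run);
# the values here are made up for illustration
A, B = 0.94, 0.12

theta = np.array([-1.5, 0.0, 0.7])   # new-form person estimates
a = np.array([0.8, 1.2, 1.6])        # new-form discriminations
b = np.array([-0.9, 0.1, 1.3])       # new-form difficulties

theta_star = A * theta + B           # θ* = Aθ + B
b_star = A * b + B                   # b* = Ab + B
a_star = a / A                       # a* = a/A

print(theta_star, b_star, a_star)
```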

Page 22: Implementing Item Response Theory

Part 2

Differential item functioning (DIF)

Page 23: Implementing Item Response Theory

What is DIF?

Differential item functioning

The item functions differently for different groups – and is therefore unfair

One group is more likely to get an item correct when ability is held constant

Page 24: Implementing Item Response Theory

What is DIF?

Two ways to operationally define it:

Directly evaluate the probability of response for ability-level slices (Mantel-Haenszel)

Compare item parameters or statistics for each group

◦ Basically, analyze the data for each group separately, then compare

Page 25: Implementing Item Response Theory

DIF Groups

Reference group – the main (usually majority) group

Focal group – the group being examined to see if it differs from the reference group (usually a minority)

DIF analyses assume that both are on the same scale

Page 26: Implementing Item Response Theory

Types of DIF

Non-Crossing DIF = same across all ability levels

◦ Females do better than males, regardless of ability

◦ “Bias”

◦ aka Uniform DIF

Crossing DIF = not the same across ability levels

◦ Females do better than males at above-average ability, but the same for low ability

◦ aka Non-Uniform DIF

Page 27: Implementing Item Response Theory

Non-crossing DIF Example

Page 28: Implementing Item Response Theory

Crossing DIF Example

Page 29: Implementing Item Response Theory

Quantifying DIF

Mantel-Haenszel (in Xcalibre)

Make N ability-level slices

At each, a 2 x 2 table of reference/focal and correct/incorrect

“Ability” can be classical or IRT scores

Show Xcalibre – SMKING with P/L
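As a sketch of the idea (not Xcalibre’s implementation), the Mantel-Haenszel common odds ratio pools the 2 x 2 tables across the ability slices; the counts below are hypothetical:

```python
import numpy as np

def mantel_haenszel_odds(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """Mantel-Haenszel common odds ratio across ability slices.

    Each argument is an array with one count per ability slice.
    alpha > 1 means the reference group is more likely to answer
    correctly than the focal group at the same ability level.
    """
    n = ref_correct + ref_wrong + foc_correct + foc_wrong   # slice totals
    num = np.sum(ref_correct * foc_wrong / n)
    den = np.sum(ref_wrong * foc_correct / n)
    return num / den

# Hypothetical counts for one item over four ability slices
alpha = mantel_haenszel_odds(
    ref_correct=np.array([20, 35, 50, 60]),
    ref_wrong=np.array([30, 25, 15, 10]),
    foc_correct=np.array([15, 28, 45, 55]),
    foc_wrong=np.array([35, 32, 20, 15]),
)
print(f"MH odds ratio: {alpha:.2f}")   # here > 1, favoring the reference group
```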

Page 30: Implementing Item Response Theory

Quantifying DIF

There are two IRT-only approaches to quantifying DIF

◦ Difference between item parameters

bR = bF?

Parscale uses this

◦ Difference between IRFs

More advanced and recent (1995)

Special program needed: DFIT

ASC sells the Windows version; DOS is free

Page 31: Implementing Item Response Theory

DIF in Parscale

Parscale gives several indices (bR = bF)

◦ 1. Absolute difference in parameters

◦ 2. Standardized difference (absolute difference / SE)

◦ 3. Chi-square = (standardized difference)²
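A minimal sketch of these three indices for a single item, using made-up estimates and standard errors; pooling the two SEs in the denominator is one common choice, not necessarily Parscale’s exact formula:

```python
import math

b_ref, se_ref = 0.35, 0.08   # hypothetical reference-group estimate and SE
b_foc, se_foc = 0.62, 0.11   # hypothetical focal-group estimate and SE

raw_diff = b_foc - b_ref                                  # 1. absolute difference
stan_diff = raw_diff / math.sqrt(se_ref**2 + se_foc**2)   # 2. standardized difference
chi_sq = stan_diff**2                                     # 3. chi-square (1 df)

# Two-sided p value for the standardized difference
p = math.erfc(abs(stan_diff) / math.sqrt(2))
print(f"diff={raw_diff:.2f}  z={stan_diff:.2f}  chi2={chi_sq:.2f}  p={p:.3f}")
```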

Page 32: Implementing Item Response Theory

DIF in Parscale

Contrast and standardized difference

Page 33: Implementing Item Response Theory

DIF in Parscale

Chi-square: more conservative, so better to use

Page 34: Implementing Item Response Theory

DIF in Parscale

Straightforward interpretations

◦ Raw diff

◦ Standard diff

◦ p values

But notice that they are different in the two tables!

Page 35: Implementing Item Response Theory

DFIT

Tests the shaded area between the reference and focal IRFs
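A rough sketch of computing that area numerically for two hypothetical 2PL items; the DFIT framework itself uses more refined indices (CDIF, NCDIF, DTF), so this only illustrates the idea:

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 801)
p_ref = irf_2pl(theta, a=1.2, b=0.0)   # hypothetical reference-group curve
p_foc = irf_2pl(theta, a=1.2, b=0.4)   # hypothetical focal-group curve

# Numerical integration over theta; the unsigned area matters because
# crossing DIF can cancel out in a signed area
dx = theta[1] - theta[0]
unsigned_area = np.sum(np.abs(p_ref - p_foc)) * dx
signed_area = np.sum(p_ref - p_foc) * dx
print(f"unsigned={unsigned_area:.3f}, signed={signed_area:.3f}")
```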

Page 36: Implementing Item Response Theory

Compensatory DIF

Another thing to keep an eye out for

DIF in one item can be offset by DIF in another

So a few items in one direction can be offset by a few in the other

The total test then shows no differential test functioning (DTF)

Page 37: Implementing Item Response Theory

Compensatory DIF

And you’re not likely to have a test without DIF items; it just happens for whatever reason (Type I error).

DIF analysis can only flag items for you.

You then need to closely evaluate content and decide whether there is an issue, or to proceed.

Page 38: Implementing Item Response Theory

Part 3

Polytomous IRT

Page 39: Implementing Item Response Theory

Polytomous IRT

For data that is scored with 3+ score points (remember, multiple choice collapses to two)

As mentioned previously, there are two main families of polytomous models

◦ Rating Scale

◦ Partial Credit

Each comes in Rasch and non-Rasch (“Generalized”) versions (a sketch follows below)
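As a concrete example of the partial credit family, here is a minimal sketch of the generalized partial credit model (GPCM) category probabilities; fixing a = 1 gives the Rasch-family partial credit model. All values are made up:

```python
import numpy as np

def gpcm_probs(theta, a, b_steps):
    """Generalized partial credit model category probabilities.

    theta   : latent trait value
    a       : item discrimination (fix a=1 for the Rasch-family PCM)
    b_steps : step difficulties b_1..b_m for an item with m+1 categories
    """
    # Cumulative sums of a*(theta - b_j); category 0 has an empty sum (0)
    z = np.concatenate([[0.0], np.cumsum(a * (theta - np.asarray(b_steps)))])
    ez = np.exp(z - z.max())   # subtract max for numerical stability
    return ez / ez.sum()

# Hypothetical 4-category item scored 0-3
print(gpcm_probs(theta=0.5, a=1.1, b_steps=[-1.0, 0.2, 1.4]))
```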

Page 40: Implementing Item Response Theory

Rating scale approach

Designed for Likert-type questions…

◦ Rate on a scale of 1 to 5 whether the adjective applies to you:

Adjective       1  2  3  4  5
Trustworthy
Outgoing
Diligent
Conscientious

Page 41: Implementing Item Response Theory

Rating scale approach

We assume that the process or mental structure behind the 1-5 scale is the same for every item

But items might differ in “difficulty”

Page 42: Implementing Item Response Theory

Partial credit approach

We assume that the response process is different for every item

The difference between 2 and 4 points might be wider/narrower

Page 43: Implementing Item Response Theory

Rasch/Non-Rasch

Non-Rasch allows discrimination to vary between items

This means curves can have different steepness/separation

Page 44: Implementing Item Response Theory

Comparison table

Model   Item Disc.               Step Spacing   Step Ordering   Option Disc.
RSM     Fixed                    Fixed          Fixed           Fixed
PCM     Fixed                    Variable       Variable        Fixed
GRSM    Variable                 Fixed          Fixed           Fixed
GRM     Variable                 Variable       Fixed           Fixed
GPCM    Variable                 Variable       Variable        Fixed
NRM     Variable (each option)   Variable       Variable        Variable

Fixed/Variable = between items

Page 45: Implementing Item Response Theory

Polytomous IRT

It used to be that you had to program all of that manually (PARSCALE)

Let’s look at it in Xcalibre 4…

Page 46: Implementing Item Response Theory

Part 4

IRT Software

Page 47: Implementing Item Response Theory

IRT Software

There are a number of programs out there, reflecting:

◦ Types of approaches (Rasch, 3PL, Poly)

◦ Cost (free up to 100s of dollars)

◦ Special topics like fit, linking, and form assembly

◦ Usability vs. flexibility

Page 48: Implementing Item Response Theory

Some IRT calibration programs

Xcalibre 4 – easy to use, complete reports

Parscale – extremely flexible, does most models, but difficult to use

Bilog – most powerful dichotomous program, difficult to use

ConQuest – advanced things like facets models and multidimensional

Winsteps – most common Rasch program

Page 49: Implementing Item Response Theory

Some IRT calibration programs

PARAM3PL – free; only 3PL

ICL – free; lots of stuff, but difficult to use and no support

R – free; some routines there, but slow, and inferior output

OPLM – free; from Cito

Page 50: Implementing Item Response Theory

Other IRT programs

ASC’s Form Building Tool to build new forms using calibrated items

DIF Tool for DIF graphing

DFIT8 – DFIT framework for DIF

ScoreAll – scores examinees

CATSim – CAT simulations

IRTLRDIF2

Most organizations build their own tools for specific purposes

Page 51: Implementing Item Response Theory

What to do with the results?

Often a good idea to import scores and item parameters into Excel (ASC does CSV directly)

You can manipulate and further analyze (frequency graphs, etc.)

Also helpful for further importing – scores into a database and item parameters into the item banker
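For example, a minimal sketch with pandas (file and column names are hypothetical; the histogram needs matplotlib installed):

```python
import pandas as pd

# Hypothetical CSV exports from the calibration software
items = pd.read_csv("item_parameters.csv")   # columns: item, a, b, c
scores = pd.read_csv("scores.csv")           # columns: examinee, theta

print(items.describe())                      # quick parameter summary
scores["theta"].hist(bins=30)                # frequency graph of scores
```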

Page 52: Implementing Item Response Theory

Part 5

Assumptions of IRT: Dimensionality

Page 53: Implementing Item Response Theory

IRT assumptions

Basic Assumptions

1. A stable unidimensional trait

Item responses are independent of each other (local independence), except for the trait/ability that they measure

2. A specific form of the relationship between trait level and probability of a response (the response function, or IRT model)

Page 54: Implementing Item Response Theory

IRT assumptions

Unidimensionality and local independence are actually equivalent

◦ If items are interdependent, then the probability of a response is due to two things: your trait level, and whether you saw the “tripping” item first

◦ This makes it two-dimensional

Page 55: Implementing Item Response Theory

IRT assumptions

Two other common violations:

◦ Speededness

◦ An actually multidimensional test (medical knowledge vs. clinical ability, language vs. math)

Page 56: Implementing Item Response Theory

How to check

So there are two important things to check:

◦ Unidimensionality

Factor Analysis

Bejar’s method

DIMTEST

◦ Whether our IRT model was a good choice

Model fit

Page 57: Implementing Item Response Theory

Checking unidimensionality

Factor analysis

Used often in research investigating dimensionality

But it is not recommended to use “normal” factor analysis, which uses Pearson correlations

◦ This is what is used in typical software packages like SPSS

Page 58: Implementing Item Response Theory

Checking unidimensionality

Item-level data from tests are dichotomous, unlike total scores, which are continuous

Special software does factor analysis with tetrachoric correlations for this

MicroFact (from ASC)

TESTFACT (from SSI)
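For intuition only, a rough sketch of the tetrachoric correlation for one item pair using the classic cosine-pi approximation; dedicated programs like MicroFact estimate it properly by maximum likelihood. The data below are made up:

```python
import numpy as np

def tetrachoric_approx(x, y):
    """Cosine-pi approximation to the tetrachoric correlation
    between two dichotomous (0/1) item-response vectors.
    """
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))   # both correct
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))   # both incorrect
    return np.cos(np.pi / (1 + np.sqrt(a * d / (b * c))))

# Hypothetical responses from 10 examinees to two items
item1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
item2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(f"{tetrachoric_approx(item1, item2):.2f}")
```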

Page 59: Implementing Item Response Theory

Checking unidimensionality

Output is still similar to regular factor analysis

Eigenvalue plot to examine the number of factors

Factor loading matrix to examine the “sorting” of items

Page 60: Implementing Item Response Theory

Checking unidimensionality

Page 61: Implementing Item Response Theory

Checking unidimensionality

See output files…

If unidimensional, factor loadings will pattern similarly to the IRT a parameters

Item    a     Loading
1      .72    .42
2      .81    .44
3      .96    .54
4      .83    .25
5      .47    .11
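The pattern is not accidental: under the normal-ogive formulation, the IRT discrimination is approximately a monotone transformation of the factor loading λ,

$$ a_i \approx \frac{\lambda_i}{\sqrt{1 - \lambda_i^2}} $$

(in the normal-ogive metric; this is a rule of thumb, and sample estimates such as those above will not line up exactly).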

Page 62: Implementing Item Response Theory

Checking unidimensionality

Bejar’s Method

Useful in situations where you know your test has different content areas

Examples:

◦ Cognitive test with fluid and crystallized components

◦ Math test with story problems and number-only problems

◦ Language test with writing and reading

Page 63: Implementing Item Response Theory

Checking unidimensionality

It is possible that these tests are not completely unidimensional, and we have a good reason to check

Page 64: Implementing Item Response Theory

Checking unidimensionality

Bejar’s method:

◦ 1. Do an IRT calibration of the entire test

◦ 2. Do an IRT calibration of each area separately

◦ 3. Compare the item parameters
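A minimal sketch of step 3 in Python, with made-up b estimates; if the test is essentially unidimensional, the two calibrations should agree up to a linear transformation:

```python
import numpy as np

# Hypothetical b estimates for the items of one content area, from
# (1) the whole-test calibration and (2) the area-only calibration
b_whole = np.array([-1.1, -0.4, 0.2, 0.9, 1.5])
b_area  = np.array([-1.0, -0.5, 0.3, 1.0, 1.4])

# If the trait is essentially unidimensional, the two sets should fall
# on a straight line (they may differ by a linear scale shift)
r = np.corrcoef(b_whole, b_area)[0, 1]
print(f"correlation: {r:.3f}")   # values near 1 support unidimensionality
```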

Page 65: Implementing Item Response Theory

Checking unidimensionality

b parameters

Page 66: Implementing Item Response Theory

Checking unidimensionality

a parameters

Page 67: Implementing Item Response Theory

Checking unidimensionality

c parameters