Implementing Item Response Theory
TRANSCRIPT
Day 2 AM: Advanced IRT topics
Linking and Equating
DIF
Polytomous IRT
IRT Software overview
Dimensionality
Part 1
Linking, equating, and scaling
Linking = setting different sets of item parameters onto the
same scale
Equating = setting different sets of students/scores on the
same scale
Linking and equating
Why important? This is necessary for
a stable scale
If we equate scores on this year’s
exam forms to last year’s, we know
that a score of X still means exactly
the same thing
If we don’t, a score of X could mean
different things
Linking and equating
Two approaches
◦ Base form: Completely map Form B on to
the scale of Form A (the base form)
Appropriate for maintaining continuity across
time… Form A is base for a long time
◦ Merged scale: Combine data and scales
for Forms A and B (“super matrix”)
OK for multiple forms at same time, but not
across time
Linking and equating
In IRT, items and people are on the
same scale, so linking/equating are
equivalent theoretically (although
they can be conducted separately)
In CTT, this is not the case
Linking doesn’t really exist in CTT –
but there is extensive research on
equating because it is so important
IRT equating
Many issues in CTT equating are
reduced with IRT because of the
property of invariance
Item parameters are invariant across
calibration groups, except for a
linear transformation
All we have to do is find it
IRT equating
Why? The scales are defined by the
sample scores, centered on N(0,1)
Another sample might have a slightly
different distribution of true scores,
and is calibrated with a theoretically
different N(0,1)
So we find how to map the two scales
to each other
Prerequisites
To accomplish linking/equating:
1. The two tests/forms must measure
the same thing (otherwise it is
concordance, not equating)
2. The two tests/forms must have
equal reliability (classically)
3. The equating transformation must
be invertible
◦ (A to B and B to A)
Common items or people?
To do an effective linking between
two data sets, you need something in
common
◦ People – some of the students are the
same as last year (but they must be
unchanged, so this is probably not a
good idea in education!)
◦ Items – Rule of thumb is 20% or 20 items
Common item linking
Common item linking
Suppose there were 100 items on a
test in 2010
…and the first 20 were anchors back
to 2009
Then we need to pick 20 out of the
last 80 to be anchors in 2011
80 additional items would be
selected as “new” (not necessarily
brand new)
Common item linking
2009 average = 65
2010 average = 67
2009 anchor average = 11
2010 anchor average = 12
Common item linking
Items should be specifically selected
to serve as anchors
Difficulty: spread similar to the test
as a whole
Discrimination: higher is better, but
not so much that it is
unrepresentative of the test
Not previous anchors
IRT linking analyses
There are two paradigms for linking:
◦ Concurrent calibration linking
Full Group (merges scale!!!)
Target Group (fix parameters)
◦ Conversion linking
Parameter transformation (mean/mean,
mean/sigma)
TRF methods (Stocking & Lord, Haebara)
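The mean/sigma conversion can be sketched in a few lines: the slope A comes from the ratio of the anchor items' b-parameter spreads, and the intercept B from their means. The b values below are made up for illustration.

```python
import statistics

# Hypothetical anchor-item b estimates from two separate calibrations
b_base = [-1.2, -0.4, 0.1, 0.8, 1.5]      # anchors on the base form's scale
b_new = [-0.76, -0.12, 0.28, 0.84, 1.40]  # same items on the new form's scale

# mean/sigma method: slope from the ratio of spreads,
# intercept from the means
A = statistics.stdev(b_base) / statistics.stdev(b_new)
B = statistics.mean(b_base) - A * statistics.mean(b_new)

# Map any new-form difficulty onto the base scale: b* = A*b + B
b_converted = [A * b + B for b in b_new]
```

With these invented values the converted difficulties land back on the base-form values, which is what a successful linking should do for the anchors.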
IRT linking analyses
I recommend either target group
calibration or S&L conversion
Xcalibre does concurrent calibration
methods
Conversion methods are an additional
post-hoc analysis, so separate
software: IRTEQ
Linking software - IRTEQ
User-friendly conversion linking
◦ Kyung (Chris) T. Han
Now at GMAC
◦ Windows GUI
◦ Does all major conversion methods, and
compares them
◦ Interfaces with Parscale
Linking software - IRTEQ
Purpose of conversion methods:
estimate the linear conversion
between two IRT scales (two
different forms)
Kind of like regression
Since it is linear, “no change” is a slope
(A) of 1 and an intercept (B) of 0
Five different methods of estimating
these
Linking software – IRTEQ
Linking software - IRTEQ
Compare the b parameters
Linking software - IRTEQ
Output from example data…
Linking software - IRTEQ
Then what?
θ* = Aθ + B
b* = Ab + B
a* = a/A
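Once A and B are estimated, applying the conversion is direct. The A, B, and item parameters below are illustrative values of the kind IRTEQ would estimate, not output from a real linking run; the final check shows why a divides by A — the quantity a(θ − b), and hence every response probability, is left unchanged.

```python
# Illustrative linking constants (assumed, not from a real analysis)
A, B = 1.1, -0.25

def convert(a, b, theta):
    """theta* = A*theta + B,  b* = A*b + B,  a* = a / A"""
    return a / A, A * b + B, A * theta + B

a_star, b_star, theta_star = convert(a=0.88, b=1.0, theta=0.5)

# Invariance check: a*(theta - b) is identical on both scales
assert abs(a_star * (theta_star - b_star) - 0.88 * (0.5 - 1.0)) < 1e-9
```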
Part 2
Differential item functioning (DIF)
What is DIF?
Differential item functioning
The item functions differently for
different groups – and is therefore unfair
One group is more likely to get an item
correct when ability is held constant
What is DIF?
Two ways to operationally define:
Directly evaluate probability of response
for ability level slices (Mantel-Haenszel)
Compare item parameters or statistics
for each group
◦ Basically, analyze the data for each group
separately, then compare
DIF Groups
Reference group – the main (usually
majority) group
Focal group – the group being
examined to see if it differs from the
reference group (usually minority)
DIF analyses assume that both are on
same scale
Types of DIF
Non-crossing DIF = same across all ability levels
◦ Females do better than males, regardless of ability
◦ “Bias”
◦ aka Uniform DIF
Crossing DIF = not the same across ability levels
◦ Females do better than males at above-average ability, but the same at low ability
◦ aka Non-Uniform DIF
Non-crossing DIF Example
Crossing DIF Example
Quantifying DIF
Mantel-Haenszel (in Xcalibre)
Make N ability level slices
At each, 2 x 2 table of reference/
focal and correct/incorrect
“Ability” can be classical or IRT
scores
Show Xcalibre – SMKING with P/L
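A minimal sketch of the Mantel-Haenszel computation, with made-up counts: each ability slice contributes a 2×2 table, and the slices pool into a common odds ratio (1.0 means no DIF). The ETS delta rescaling shown at the end is one common way to report it.

```python
import math

# One 2x2 table per ability slice, as (ref_correct, ref_wrong,
# focal_correct, focal_wrong); all counts are invented for illustration
tables = [
    (10, 30, 8, 32),   # low-ability slice
    (25, 25, 20, 30),  # middle slice
    (45, 15, 40, 20),  # high-ability slice
]

# Mantel-Haenszel common odds ratio pooled across slices
num = sum(rc * fw / (rc + rw + fc + fw) for rc, rw, fc, fw in tables)
den = sum(rw * fc / (rc + rw + fc + fw) for rc, rw, fc, fw in tables)
alpha_mh = num / den                    # > 1 favors the reference group
delta_mh = -2.35 * math.log(alpha_mh)   # ETS delta metric
```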
![Page 30: Implementing Item Response Theory](https://reader031.vdocuments.site/reader031/viewer/2022020204/58adf79a1a28abf0628b5305/html5/thumbnails/30.jpg)
Quantifying DIF
There are two IRT-only approaches to
quantifying DIF
◦ Difference between item parameters
bR = bF?
Parscale uses this
◦ Difference between IRFs
More advanced and recent (1995)
Special program needed: DFIT
ASC sells Windows version; DOS is free
![Page 31: Implementing Item Response Theory](https://reader031.vdocuments.site/reader031/viewer/2022020204/58adf79a1a28abf0628b5305/html5/thumbnails/31.jpg)
DIF in Parscale
Parscale gives several indices (bR = bF)
◦ 1. Absolute difference in parameter
◦ 2. Standardized difference (absolute/SE)
◦ 3. Chi-square = (StanDiff)²
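A quick sketch of those three indices for a single item, using invented b estimates and standard errors. The SEs from the two independent group calibrations combine in quadrature, and the chi-square has 1 degree of freedom.

```python
import math

# Invented b estimates and standard errors from separate group calibrations
b_ref, se_ref = 0.40, 0.08   # reference group
b_foc, se_foc = 0.65, 0.10   # focal group

raw_diff = b_ref - b_foc                      # 1. raw difference in b
se_diff = math.sqrt(se_ref**2 + se_foc**2)    # SEs combine in quadrature
stan_diff = raw_diff / se_diff                # 2. standardized difference
chi_sq = stan_diff ** 2                       # 3. chi-square, 1 df
```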
DIF in Parscale
Contrast and standardized difference
DIF in Parscale
Chi-square is more conservative, so it
is better to use
DIF in Parscale
Straightforward interpretations
◦ Raw diff
◦ Standard diff
◦ p values
But notice that they are different in the two
tables!
DFIT
Tests the shaded area
Compensatory DIF
Another thing to keep an eye out for
DIF in one item can be offset by DIF
in another
So a few items in one direction can
be offset by a few in the other
The total test then shows no DTF
(differential test functioning)
Compensatory DIF
You are unlikely to have a test with
no DIF items; some flags just happen
for whatever reason (Type I error)
DIF analysis can only flag items for
you
You then need to closely evaluate
content and decide whether there is an
issue or whether to proceed
Part 3
Polytomous IRT
Polytomous IRT
For data that are scored with 3+
categories (remember, multiple choice
collapses to two)
As mentioned previously, there are
two main families of polytomous
models
◦ Rating Scale
◦ Partial Credit
Rasch and non-Rasch (“Generalized”)
Rating scale approach
Designed for Likert-type questions…
◦ Rate on a scale of 1 to 5 whether the
adjective applies to you:
Adjective 1 2 3 4 5
Trustworthy
Outgoing
Diligent
Conscientious
Rating scale approach
We assume that the process or
mental structure behind the 1-5 scale
is the same for every item
But items might differ in “difficulty”
Partial credit approach
We assume that the response process
is different for every item
The difference between 2 and 4
points might be wider/narrower
Rasch/Non-Rasch
Non-Rasch allows discrimination to
vary between items
This means curves can have different
steepness/separation
Comparison table
Model   Item Disc.              Step Spacing   Step Ordering   Option Disc.
RSM     Fixed                   Fixed          Fixed           Fixed
PCM     Fixed                   Variable       Variable        Fixed
GRSM    Variable                Fixed          Fixed           Fixed
GRM     Variable                Variable       Fixed           Fixed
GPCM    Variable                Variable       Variable        Fixed
NRM     Variable (each option)  Variable       Variable        Variable
(Fixed/Variable refers to whether the property varies between items)
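As a concrete example of the partial-credit family, here is a small sketch of GPCM category probabilities; fixing a at 1 for every item gives the (Rasch) Partial Credit Model, while the generalized version lets a vary by item. The θ, a, and step values are arbitrary.

```python
import math

def gpcm_probs(theta, a, steps):
    """Generalized Partial Credit Model category probabilities.

    steps: step difficulties b_1..b_m; response categories are 0..m.
    """
    # Running sum of a*(theta - b_j); category 0 contributes an implicit 0
    z, zs = 0.0, [0.0]
    for b in steps:
        z += a * (theta - b)
        zs.append(z)
    denom = sum(math.exp(v) for v in zs)
    return [math.exp(v) / denom for v in zs]

# Arbitrary example: 4 categories, steps symmetric around theta = 0
probs = gpcm_probs(theta=0.0, a=1.0, steps=[-1.0, 0.0, 1.0])
```

With symmetric steps the outer categories get equal probability and the middle categories are most likely, which is a useful sanity check on the arithmetic.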
Polytomous IRT
It used to be that you had to program
all that manually (PARSCALE)
Let’s look at it in Xcalibre 4…
Part 4
IRT Software
IRT Software
There are a number of programs out
there, reflecting:
◦ Types of approaches (Rasch, 3PL, Poly)
◦ Cost (free up to 100s of dollars)
◦ Special topics like fit, linking, and form
assembly
◦ Usability vs. flexibility
Some IRT calibration programs
Xcalibre 4 – easy to use, complete reports
Parscale – extremely flexible, does most models, but difficult to use
Bilog – most powerful dichotomous program, difficult to use
ConQuest – advanced things like facets models and multidimensional
Winsteps – most common Rasch program
Some IRT calibration programs
PARAM3PL – free; only 3PL
ICL – free; lots of stuff, but difficult
to use and no support
R – free; some routines there, but
slow, and inferior output
OPLM – free; from Cito
Other IRT programs
ASC’s Form Building Tool to build new forms using calibrated items
DIF Tool for DIF graphing
DFIT8 – DFIT framework for DIF
ScoreAll – scores examinees
CATSim – CAT simulations
IRTLRDIF2
Most organizations build their own tools for specific purposes
What to do with the results?
Often a good idea to import scores
and item parameters into Excel (ASC
does CSV directly)
You can manipulate and further
analyze (frequency graphs, etc.)
Also helpful for further importing –
scores into a database and item
parameters into the item banker
Part 5
Assumptions of IRT: Dimensionality
IRT assumptions
Basic Assumptions
1. A stable unidimensional trait
Item responses are independent of each
other (local independence), except for the
trait/ability that they measure
2. A specific form of the relationship
between trait level and probability of a
response (the response function, or IRT
model)
IRT assumptions
Unidimensionality and local
independence are actually equivalent
◦ If items are interdependent, then the
probability of a response is due to two
things: your trait level, and whether you
saw the “tripping” item first
◦ This makes it two-dimensional
IRT assumptions
Two other common violations:
◦ Speededness
◦ Actual multidimensional test (medical
knowledge vs. clinical ability, language
vs. math)
How to check
So there are two important things to
check:
◦ Unidimensionality
Factor Analysis
Bejar’s method
DIMTEST
◦ Whether our IRT model was a good
choice
Model fit
Checking unidimensionality
Factor analysis
Used often in research investigating
dimensionality
But it is not recommended to use
“normal” factor analysis, which uses
Pearson correlations
◦ This is used in typical software packages
like SPSS
Checking unidimensionality
Item-level data of tests are
dichotomous, unlike total scores,
which are continuous
Special software does factor analysis
with tetrachoric correlations for this
MicroFact (from ASC)
TESTFACT (from SSI)
Checking unidimensionality
Output is still similar to regular
factor analysis
Eigenvalue plot to examine number
of factors
Factor loading matrix to examine
“sorting” of items
Checking unidimensionality
Checking unidimensionality
See output files…
If unidimensional, factor loadings will
pattern similar to IRT a parameters
Item   a     Loading
1      .72   .42
2      .81   .44
3      .96   .54
4      .83   .25
5      .47   .11
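The connection between a parameters and loadings can be made concrete with the standard normal-ogive approximation λ = a/√(1 + a²), applied after rescaling a logistic-metric a by D ≈ 1.702. This is only an approximation, but its point is that if the test is unidimensional, the observed loadings should rank the same way as the a parameters (note that items 4 and 5 in the table above fall short of the pattern).

```python
import math

def a_to_loading(a, D=1.702):
    """Approximate factor loading implied by a logistic-metric a,
    via the normal-ogive relationship lambda = a / sqrt(1 + a^2)."""
    a_norm = a / D                        # rescale to the normal metric
    return a_norm / math.sqrt(1 + a_norm ** 2)

# a values from the table above; implied loadings should rank the same way
loadings = [a_to_loading(a) for a in (0.72, 0.81, 0.96, 0.83, 0.47)]
```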
Checking unidimensionality
Bejar’s Method
Useful in situations where you know
your test has different content areas
Examples:
◦ Cognitive test with fluid and crystallized intelligence
◦ Math test with story problems and
number-only problems
◦ Language test with writing and reading
Checking unidimensionality
It is possible that these tests are not
completely unidimensional, and we
have a good reason to check
Checking unidimensionality
Bejar’s method:
◦ 1. Do an IRT calibration of the entire test
◦ 2. Do an IRT calibration of each area
separately
◦ 3. Compare the item parameters
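Step 3 can be as simple as correlating the two sets of estimates: a correlation near 1.0 (points hugging the identity line) supports unidimensionality. The b values below are invented for illustration.

```python
# Hypothetical b estimates for five items: once from the full-test
# calibration and once from a calibration of one content area alone
b_full = [-0.8, -0.2, 0.3, 0.9, 1.4]
b_area = [-0.7, -0.1, 0.4, 1.1, 1.3]

# Pearson correlation between the two sets of estimates
n = len(b_full)
mx, my = sum(b_full) / n, sum(b_area) / n
cov = sum((x - mx) * (y - my) for x, y in zip(b_full, b_area)) / n
sx = (sum((x - mx) ** 2 for x in b_full) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in b_area) / n) ** 0.5
r = cov / (sx * sy)   # near 1.0 supports unidimensionality
```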
Checking unidimensionality
b parameters
Checking unidimensionality
a parameters
Checking unidimensionality
c parameters