introduction to item response theory

Day 1 AM: An Introduction to Item Response Theory
Nathan A. Thompson
Vice President, Assessment Systems Corporation
Adjunct faculty, University of [email protected]

Welcome!
Thank you for attending!
Introductions and important info now
Software… download or USB
Please ask questions
◦Also, slow me down or ask for translation!
Goal: provide an intro on IRT/CAT to those who are new
◦For those with some experience, to provide new viewpoints and more resources/recommendations

Where I’m from, professionally
PhD, University of Minnesota
◦CAT for classifications
Test development manager for ophthalmology certifications
Psychometrician at Prometric (many certifications)
VP at ASC

Where I’m from, geographically

Except now things look like…

We do odd things in winter

Introduce yourselves
Name
Employer/organization
Types of tests you do and/or why you are interested in IRT/CAT
(There might be someone with similar interests here)

Another announcement
Newly formed: International Association for Computerized Adaptive Testing (IACAT)
◦www.iacat.org
◦Free membership
◦Growing resources
◦Next conference: August 2012, Sydney

http://www.iacat.org/

Welcome!
This workshop is on two highly related topics: IRT and CAT
IRT is the modern paradigm for developing, analyzing, scoring, and linking tests
CAT is a next-generation method of delivering tests
CAT is not feasible without IRT, so we do IRT first

IRT – where are we going?
IRT, as many of you know, provides a way of analyzing items
However, it has drawbacks (no distractor analysis), so the main reasons to use IRT are at the test level
It solves certain issues with classical test theory (CTT)
But the two should always be used together

IRT – where are we going?
Advantages
◦Better error characterization
◦More precise scores
◦Better linking
◦Model-based
◦Items and people on same scale (CAT)
◦Sample-independence
◦Powerful test assembly

IRT – where are we going?
Keyword: paradigm or approach
◦Not just another statistical analysis
◦It is a different way of thinking about how tests should work, and how we can approach specific problems (scaling, equating, test assembly) from that viewpoint

Day 1
There will be four parts this morning, covering the theory behind IRT:
◦Rationale: A graphical introduction to IRT
◦Models (dichotomous and polytomous) and their response functions
◦IRT scoring (θ estimation)
◦Item parameter estimation and model fit

Part 1: A graphical introduction to IRT

What is IRT?
Basic Assumptions
1. Unidimensionality
◦A unidimensional latent trait (1 at a time)
◦Item responses are independent of each other (local independence), except for the trait/ability that they measure
2. A specific form of the relationship between trait level and probability of a response
◦The response function, or IRT model
◦There are a growing number of models

What is IRT?
A theory of mathematical functions that model the responses of examinees to test items/questions
These functions are item response functions (IRFs)
Historically, it has also been known as latent trait theory and item characteristic curve theory
The IRFs are best described by showing how the concept is derived from classical analysis…

Classical item statistics
CTT statistics are typically calculated for each option

Option  N    Prop    Rpbis   Mean
1       307  0.860   0.221   91.876
2       25   0.070   -0.142  85.600
3       14   0.039   -0.137  83.929
4       11   0.031   -0.081  86.273

Classical item statistics
The proportions are often translated to a figure like this, where examinees are split into groups

Classical item statistics
The general idea of IRT is to split the previous graph up into more groups, and then find a mathematical model for the blue line
This is what makes the item response function (IRF)

Classical item statistics
Example with 10 groups
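The grouping idea can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_irf` and the tiny data set are hypothetical, and the groups are formed by sorting on total score and cutting into equal-sized slices.

```python
def empirical_irf(total_scores, item_responses, n_groups=10):
    """Split examinees into score groups; return the proportion correct in each."""
    # Sort examinee indices from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda j: total_scores[j])
    n = len(order)
    proportions = []
    for g in range(n_groups):
        # Slice out group g (groups are as equal in size as possible).
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        correct = sum(item_responses[j] for j in members)
        proportions.append(correct / len(members))
    return proportions

# Hypothetical data: 20 examinees with total scores 1..20; this item is
# answered correctly by everyone who scored above 10.
scores = list(range(1, 21))
responses = [1 if s > 10 else 0 for s in scores]
curve = empirical_irf(scores, responses, n_groups=4)
print(curve)
```

Plotting `curve` against the group number gives the kind of stepped line that the IRF then smooths with a mathematical model.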

The item response function
Reflects the probability of a given response as a function of the latent trait (z-score)
Example:

The IRF
For dichotomously scored items, it is the probability of a correct or keyed response
Also called Item Characteristic Curve (ICC) or Trace Line
Only one curve (correct response), and all other responses are grouped as (1-IRF)
For polytomous items (partial credit, etc.), it is the probability of each response

The IRF
How do we know exactly what the IRF for an item is?
We estimate parameters for an equation that draws the curve
For dichotomous IRT, there are three relevant parameters: a, b, and c

The IRF
a: The discrimination parameter; represents how well the item differentiates examinees; slope of the curve at its center
b: The difficulty parameter; represents how easy or hard the item is with respect to examinees; location of the curve (left to right)
c: The pseudoguessing parameter; represents the ‘base probability’ of answering the question correctly; lower asymptote

The IRF
Example: a=1, b=0, c=0.25

The IRF…
is the “basic building block” of IRT
will differ from item to item
can be one of several different models (now)
can be used to evaluate items (now)
is used for IRT scoring (next)
leads to “information” used for test design (after that)
is the basis of CAT (tomorrow)

Part 2: IRT models

IRT models
Several families of models
◦Dichotomous
◦Polytomous
◦Multidimensional
◦Facets (scenarios vs raters)
◦Mixed (additional parameters)
◦Cognitive diagnostic
We will focus on the first two

Dichotomous IRT models
There are 3 main models in use, as mentioned earlier: 1PL, 2PL, 3PL
The “L” refers to “logistic”, which is the type of equation
IRT was originally developed decades ago with a cumulative normal curve
This means that calculus needed to be used

The logistic function
An approximation was developed: the logistic curve
No calculus needed
There are two formats based on the scaling constant D
If D = 1.702, the logistic curve differs from the cumulative normal curve by less than 0.01
If D = 1.0, there is a little more difference; this is called the true logistic form
It does not really matter, as long as you are consistent
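The size of the difference can be checked numerically. The sketch below compares the two curves on a grid from -4 to 4; the function names and the grid spacing are illustrative choices, not part of any IRT package.

```python
import math

def normal_ogive(z):
    """Cumulative normal curve, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z, D=1.702):
    """Logistic curve with scaling constant D."""
    return 1.0 / (1.0 + math.exp(-D * z))

# Largest gap between the two curves over z = -4.00, -3.99, ..., 4.00.
grid = [i / 100.0 for i in range(-400, 401)]
max_diff_scaled = max(abs(normal_ogive(z) - logistic(z, 1.702)) for z in grid)
max_diff_true = max(abs(normal_ogive(z) - logistic(z, 1.0)) for z in grid)
print(round(max_diff_scaled, 4), round(max_diff_true, 4))
```

With D = 1.702 the largest gap stays under 0.01; with D = 1.0 the curves have the same shape but the scale is stretched, so the gap is much larger.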

The logistic function
The basic form of the curve

Item parameters
We add parameters to slightly modify the shape to get it to match our data
For example, a 4-option multiple choice item has a 25% chance of being guessed correctly
So we add a c parameter as a lower asymptote, which means that the curve is “squished” so it never goes below 0.25 (next)

Item parameters
Sample IRF to show c

Item parameters
We can also add a parameter (a) that modifies the slope
And a b parameter that slides the entire curve left or right
◦Tells us the person z-score for which the item is appropriate
Items can be evaluated based on these just like with CTT statistics
A little more next…

Item parameters: a
The a parameter ranges from 0.0 to about 2.0 in practice (theoretically to infinity)
Higher means better discriminating
For achievement testing, 0.7 or 0.8 is good; aptitude testing is higher
Helps you:
◦Remove items with a<0.4?
◦Identify a>1.0 as great items?

Item parameters: b
For what person z-score is the item appropriate? (non-Rasch)
Should be between -3 and 3
◦About 99.7% of students are in that range
0.0 is average person
1.0 is difficult (about the 84th percentile)
-1.0 is easy (about the 16th percentile)
2.0 is super difficult (98%)
-2.0 is super easy (2%)
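These percentile interpretations come straight from the standard normal curve, so they can be reproduced with a short check. The helper name `percentile` is just for illustration, and the calculation assumes abilities are normally distributed.

```python
import math

def percentile(z):
    """Percent of a standard normal population falling below z."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Percentile rank of a person located at each b value.
for b in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(b, round(percentile(b), 1))

# Percent of the population between -3 and +3.
within_3 = percentile(3.0) - percentile(-3.0)
print(round(within_3, 1))
```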

Item parameters: b
If item difficulties are normally distributed, where does this fall? (Rasch)
0.0 is average item (NOT PERSON)

Item parameters: c
The c parameter should be about 1/k, where k is the number of options
If higher, this indicates that options are not attractive
For example, suppose c = 0.5
This means there is a 50/50 chance
That implies that even the lowest students are able to ignore two options and guess between the other two options

Item parameters
Extreme example:
◦What is 23+25?
 A. 48
 B. 47
 C. 3.141529…
 D. 1,256,457

The (3PL) logistic function
Here is the equation for the 3PL, so you can see where the parameters are inserted
Item i, person j
\[ P_i(X_i = 1 \mid \theta_j) = c_i + (1 - c_i) \frac{1}{1 + e^{-D a_i (\theta_j - b_i)}} \]
Equivalent formulations can be seen in the literature, like moving the (1-c) above the line

The (3PL) logistic function
a_i is the item discrimination parameter for item i,
b_i is the item difficulty or location parameter for item i,
c_i is the lower asymptote, or pseudoguessing parameter for item i,
D is the scaling constant equal to 1.702 or 1.0.
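To see where each parameter enters, here is a minimal Python sketch of the 3PL response function. The function name `irf_3pl` is an illustrative choice, and D defaults to 1.702 only as one of the two conventions mentioned.

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL.

    theta: person location; a: discrimination; b: difficulty;
    c: lower asymptote (pseudoguessing); D: scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# The example item used earlier: a = 1, b = 0, c = 0.25.
p_at_b = irf_3pl(0.0, 1.0, 0.0, 0.25)       # person located exactly at b
p_very_low = irf_3pl(-10.0, 1.0, 0.0, 0.25)  # far below b: approaches c
print(p_at_b, p_very_low)
```

When θ equals b, the probability is halfway between c and 1, which is why b is read as the location of the curve.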

The (3PL) logistic function
P is due primarily to (θ - b)
The effect due to a and c is not as strong
That is, your probability of getting the item correct is mostly due to whether it is easy/difficult for you
◦This leads to the idea of adaptive testing

3PL
IRT has 3 dichotomous models
I’ll now go through the models in more detail, from 3PL down to 1PL
The 3PL is appropriate for knowledge or ability testing, where guessing is relevant
Each item will have an a, b, and c parameter

IRT models
Three 3PL IRFs: c = 0, 0.1, 0.2 (b = -1, 0, 1; a = 1, 1, 1)
[Figure: probability (0.0 to 1.0) plotted against theta (-3 to 3)]

2PL
The 2PL assumes that there is no guessing (c = 0.0)
Items can still differ in discrimination
This is appropriate for attitude or psychological type data with dichotomous responses
◦I like recess time at school (T/F)
◦My favorite subject is math (T/F)

IRT models
Three 2PL IRFs: a = 0.75, 1.5, 0.3; b = -1.0, 0.0, 1.0
[Figure: probability (0.0 to 1.0) plotted against theta (-3 to 3)]

1PL
The 1PL assumes that all items are of equal discrimination
Items only differ in terms of difficulty
The raw score is now a sufficient statistic for the IRT score
◦Not the case with 2PL or 3PL; it’s not just how many items you get right, but which ones
◦10 hard items vs. 10 easy items
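The sufficiency point can be illustrated with two response patterns that share a raw score of 2. The sketch below is hypothetical throughout (four made-up items, D = 1.0, a simple grid search for the maximum); it is meant only to show the pattern, not how any particular program scores.

```python
import math

def log_likelihood(theta, responses, a_params, b_params):
    """Log-likelihood under a 2PL (the 1PL is the special case of equal a)."""
    total = 0.0
    for u, a, b in zip(responses, a_params, b_params):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle(responses, a_params, b_params):
    """Grid-search maximum likelihood estimate of theta, from -4 to 4."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, a_params, b_params))

b_params = [-1.0, 0.0, 1.0, 2.0]
easy_right = [1, 1, 0, 0]   # raw score 2: the two easiest items correct
hard_right = [0, 0, 1, 1]   # raw score 2: the two hardest items correct

equal_a = [1.0, 1.0, 1.0, 1.0]     # 1PL: same discrimination everywhere
unequal_a = [0.5, 0.5, 2.0, 2.0]   # 2PL: discriminations differ

theta_1pl_easy = mle(easy_right, equal_a, b_params)
theta_1pl_hard = mle(hard_right, equal_a, b_params)
theta_2pl_easy = mle(easy_right, unequal_a, b_params)
theta_2pl_hard = mle(hard_right, unequal_a, b_params)
print(theta_1pl_easy, theta_1pl_hard, theta_2pl_easy, theta_2pl_hard)
```

Under the 1PL both patterns land on the same θ estimate; under the 2PL the estimate depends on which items were answered correctly.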

1PL
The 1PL is also appropriate for attitude or psychological type data, but where there is no reason to believe items differ substantially in terms of discrimination
This is rarely the case
Still used: see Rasch discussion later

1PL
Three 1PL IRFs: b = -1, 0, 1
[Figure: probability (0.0 to 1.0) plotted against theta (-3 to 3)]

How to choose?
Characteristics of the items
Check with the data! (fit)
Sample size:
◦1PL = 100 minimum
◦2PL = 300 minimum
◦3PL = 500 minimum
Score report considerations (sufficient statistics)

The Rasch Perspective
Another argument in choice
There is a group of psychometricians (mostly from Australia and Chicago) who believe that the 1PL is THE model
Everything else is just noise
Data should be “cleaned” to reflect this

The Rasch Perspective
How to clean? A big target is to eliminate guessing
But how do you know?
Slumdog Millionaire Effect

The Rasch Perspective
This group is very strong in their belief
Why? They believe it is “objective” measurement
Score scale centered on items, not people, so “person-free”
Software and journals devoted just to the Rasch idea

The Rasch Perspective
Should you use it?
I was trained to never use Rasch
◦Equal discrimination assumption is completely unrealistic… we all know some items are better than others
◦We all know guessing should not be ignored
◦Data should probably not be doctored
◦Instead, data should drive the model

The Rasch Perspective
However, while some researchers hate the Rasch model, I don’t
◦It is very simple
◦It works better with tiny samples
◦It is easier to describe
◦Score reports and sufficient statistics
◦Discussion points from you?
Nevertheless, I recommend IRT

Polytomous models
Polytomous models are for items that are not scored correct/incorrect, yes/no, etc.
Two types:
◦Rating scale or Likert: “Rate on a scale of 1 to 5”
◦Partial credit – very useful in constructed-response educational items
 My experience as a scorer

Polytomous models
Partial credit example with rubric:
◦Open response question to “2+3(4+5)=”
 0: no answer
 1: 2, 3, 4, or 5 (picks one)
 2: 14 (adds all)
 3: 45 (does (2+3) x (4+5))
 4: 27 (everything but add 2)
 5: 29 (correct)

The IRF
Polytomous example (CRFs):

Comparison table

Model  Item Disc.              Step Spacing  Step Ordering  Option Disc.
RSM    Fixed                   Fixed         Fixed          Fixed
PCM    Fixed                   Variable      Variable       Fixed
GRSM   Variable                Fixed         Fixed          Fixed
GRM    Variable                Variable      Fixed          Fixed
GPCM   Variable                Variable      Variable       Fixed
NRM    Variable (each option)  Variable      Variable       Variable

Fixed/Variable between items… more later, if time

Part 3: Ability (θ) estimation (IRT Scoring)

Scoring
First: throw out your idea of a “score” as the number of items correct
We actually want something more accurate: the precise z-score
Because the z-score axis is called θ in IRT, the scoring is called θ estimation

Scoring
IRT utilizes the IRFs in scoring examinees
If an examinee gets a question right, they “get” the item’s IRF
If they get the question wrong, they “get” the (1-IRF)
These curves are multiplied for all items to get a final curve called the likelihood function

Scoring
Here’s an example IRF; a = 1, b = 0, c = 0

Scoring
A “1-IRF”

Scoring
We multiply those to get a curve like this…

Scoring - MLE
The score is the point on the x-axis where the likelihood is highest
This is the maximum likelihood estimate
In the example, 0.0 (average ability)
This obtains precise estimates on the θ scale

Maximum likelihood
The LF is technically defined as:
\[ L(\mathbf{u}_j \mid \theta_j) = \prod_{i=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1 - u_{ij}} \]
Where u is a response vector of 1s and 0s
Note what this does to the exponents
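The multiplication of IRFs and (1-IRF)s can be sketched directly in Python. The example mirrors the two-curve picture from the scoring slides (two items with a = 1, b = 0, c = 0, one right and one wrong); the function names and the brute-force grid search are illustrative assumptions, and Q is taken as 1 - P.

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, responses, items):
    """Multiply P for each correct response and Q = 1 - P for each incorrect one."""
    value = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        # The exponents switch each factor on or off: u = 1 keeps P, u = 0 keeps Q.
        value *= p ** u * (1.0 - p) ** (1 - u)
    return value

# Two identical items (a = 1, b = 0, c = 0): one answered right, one wrong.
items = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
responses = [1, 0]

# Brute-force search for the highest point of the likelihood function.
grid = [i / 100.0 for i in range(-400, 401)]
theta_hat = max(grid, key=lambda t: likelihood(t, responses, items))
print(theta_hat, likelihood(theta_hat, responses, items))
```

The highest point sits at θ = 0.0, matching the MLE quoted for the example.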

Scoring - SEM
A quantification of just how precise the estimate is can also be calculated, called the standard error of measurement
This is assumed to be the same for everyone in classical test theory, but in IRT depends on the items and the responses, and the level of θ

Scoring - SEM
Here’s a new LF – blue has the same MLE but is less spread out
Both are two items, blue with a = 2

Scoring - SEM
The first LF had an SEM ~ 1.0
The second LF had an SEM ~ 0.5
We have more certainty about the second person’s score
This shows how much high-quality items aid in measurement
◦Same items and responses, except a higher a

Scoring - SEM
SEM is usually used to stop CATs
General interpretation: confidence interval
Plus or minus 1.96 (about 2) SEMs is 95%
So if the SEM in the example is 0.5, we are 95% sure that the student’s true ability is somewhere between -1.0 and +1.0
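As a worked check of that interval, taking the example estimate of 0.0 and an SEM of 0.5:

\[ \hat{\theta} \pm 1.96 \times SEM = 0.0 \pm 1.96 \times 0.5 = 0.0 \pm 0.98 \]

which rounds to the range of -1.0 to +1.0.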

Scoring - SEM
If a student gives aberrant responses (cheating, not paying attention, etc.) they will have a larger SEM
This is not enough to accuse a student of cheating (they could have just dozed off), but it can provide useful information for research

Scoring - SEM
SEM CI is also used to make decisions
◦Pass if 2 SEMs above a cutoff

Details on IRT Scores
Student scores are on the θ scale, which is analogous to the standard normal z scale – same interpretations!
There are four methods of scoring
◦Maximum Likelihood (MLE)
◦Bayesian Modal (or MAP, for maximum a posteriori)
◦Bayesian EAP (expected a posteriori)
◦Weighted MLE (less common)

Maximum likelihood
Take the likelihood function “as is” and find the highest point

Maximum likelihood
Problem: all incorrect or all correct answers

Bayesian modal
Addresses that problem by always multiplying the LF by a bell-shaped curve, which forces it to have a maximum somewhere
Still find the highest point

Bayesian EAP
Argues that the curve is not symmetrical, and we should not ignore everything except the maximum
So it takes the “average” of the curve by splitting it into many slices and finding the weighted average
The slices are called quadrature points or nodes
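A rough sketch of the slicing idea follows. It assumes a standard normal curve as the bell-shaped prior and 61 evenly spaced nodes from -4 to 4; both are illustrative choices, as are the function names and the three made-up items.

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, responses, items):
    """Product of P for correct responses and 1 - P for incorrect ones."""
    value = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value

def eap(responses, items, n_points=61, lo=-4.0, hi=4.0):
    """Weighted average of theta over the quadrature points."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + q * step for q in range(n_points)]
    # Height of the standard normal curve at each node (the constant cancels out).
    prior = [math.exp(-0.5 * t * t) for t in nodes]
    posterior = [likelihood(t, responses, items) * w for t, w in zip(nodes, prior)]
    return sum(t * p for t, p in zip(nodes, posterior)) / sum(posterior)

items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
eap_all_correct = eap([1, 1, 1], items)
eap_all_incorrect = eap([0, 0, 0], items)
print(round(eap_all_correct, 3), round(eap_all_incorrect, 3))
```

Because the prior pulls the curve back down at the extremes, even all-correct and all-incorrect response vectors receive finite estimates.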

Bayesian EAP
Example: see 3PL tail

Bayesian EAP
Simple EAP overlay: ~ -0.50

Bayesian
Why Bayesian?
◦Nonmixed response vectors
◦Asymmetric LF
Why not Bayesian?
◦Biased inward – if you find the estimates of 1000 students, the SD would be smaller with the Bayesian estimates, maybe 0.95

Newton-Raphson
Most IRT software actually uses a somewhat different approach to MLE and Bayesian Modal
The straightforward way is to calculate the value of the LF at each point in θ, within reason
For example, -4 to 4 at 0.001
That’s 8,000 calculations! Too much for 1970s computers…

Newton-Raphson
Newton-Raphson is a shortcut method that searches the curve iteratively for its maximum
Why? Same 0.001 level of accuracy in only 5 to 20 iterations
Across thousands of students, that is a huge number of calculations saved
But certain issues (local maxima or minima)… maybe time to abandon?
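For a sense of how the iteration works, here is a minimal Newton-Raphson sketch for the 2PL (no guessing), where the derivatives of the log-likelihood have a simple closed form. The item parameters, the starting value of 0.0, and the function names are assumptions made for the example; production software adds safeguards that are left out here.

```python
import math

def p_2pl(theta, a, b, D=1.702):
    """Probability of a correct response under the 2PL."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def derivatives(theta, responses, items, D=1.702):
    """First and second derivatives of the log-likelihood with respect to theta."""
    first = 0.0
    second = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b, D)
        first += D * a * (u - p)
        second -= (D * a) ** 2 * p * (1.0 - p)
    return first, second

def newton_raphson(responses, items, start=0.0, tol=0.001, max_iter=20):
    """Step toward the maximum until the change in theta is below tol."""
    theta = start
    for iteration in range(1, max_iter + 1):
        first, second = derivatives(theta, responses, items)
        step = first / second
        theta -= step
        if abs(step) < tol:
            break
    return theta, iteration

# Hypothetical items (a, b) and a mixed response vector.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
responses = [1, 1, 0]
theta_hat, n_iter = newton_raphson(responses, items)
print(round(theta_hat, 3), n_iter)
```

Each pass uses the slope and curvature at the current θ to jump toward the peak, so only a handful of evaluations are needed instead of thousands of grid points.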

Examples
See IRT Scoring and Graphing Tool

Part 4: Item parameter estimation
How do we get a, b, and c?

The estimation problem
Estimating student θ given a set of known item parameters is easy because we have something established
But what about the first time a test is given?
All items are new, and there are no established student scores

The estimation problem
Which came first, the chicken or the egg?
Since we don’t know, we go back and forth, trying one and then the other
◦Fix “temporary” z-scores
◦Estimate item parameters
◦Fix the new item parameters
◦Estimate scores
◦Do it again until we’re satisfied
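The back-and-forth loop can be sketched for the simplest case, the 1PL with D = 1.0. Everything here is an assumption made for illustration: the tiny data set, the single capped Newton step per parameter per cycle, the fixed 50 cycles, and the centering of item difficulties at 0 to pin down the scale. It is a toy version of the alternating idea, not the algorithm of any particular program.

```python
import math

def p_rasch(theta, b):
    """Probability of a correct response under the 1PL (D = 1.0)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def newton_step(observed, expected, info, max_step=1.0):
    """One Newton step, capped so early cycles cannot overshoot."""
    step = (observed - expected) / info
    return max(-max_step, min(max_step, step))

def calibrate(data, n_cycles=50):
    n_persons, n_items = len(data), len(data[0])
    thetas = [0.0] * n_persons   # "temporary" scores to start from
    bs = [0.0] * n_items
    for _ in range(n_cycles):
        # Fix the thetas and update each item difficulty.
        for i in range(n_items):
            ps = [p_rasch(t, bs[i]) for t in thetas]
            observed = sum(row[i] for row in data)
            # More correct answers than expected means the item is easier.
            bs[i] -= newton_step(observed, sum(ps), sum(p * (1 - p) for p in ps))
        # Center the difficulties at 0 so the scale does not drift.
        mean_b = sum(bs) / n_items
        bs = [b - mean_b for b in bs]
        # Fix the item difficulties and update each person's theta.
        for j in range(n_persons):
            ps = [p_rasch(thetas[j], b) for b in bs]
            thetas[j] += newton_step(sum(data[j]), sum(ps), sum(p * (1 - p) for p in ps))
    return thetas, bs

# Hypothetical data: 8 examinees (rows) by 4 items (columns).
data = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]
thetas, bs = calibrate(data)
print([round(b, 2) for b in bs])
print([round(t, 2) for t in thetas])
```

Items answered correctly by fewer examinees end up with higher difficulties, and examinees with the same raw score end up with the same θ, as expected under the 1PL.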

Calibration algorithms
There are two main calibration algorithms
◦Joint maximum likelihood (JML) – older
◦Marginal maximum likelihood (MML) – newer, and works better with smaller samples… the standard
◦Also conditional maximum likelihood, but it only works with 1PL, so rarer
◦New in research, but not in standard software: Markov chain Monte Carlo

Calibration algorithms
The term maximum likelihood is used here because we are maximizing the likelihood of the entire data set, for all items i and persons j
X is the data set of responses x_ij
b is the set of item parameters b_i
θ is the set of examinee θ_j values
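Written out for dichotomous items, and assuming the same product form as the scoring likelihood (this is a reconstruction from the definitions above, with P_ij denoting the IRF of item i evaluated at θ_j), the likelihood of the entire data set is:

\[ L(X \mid b, \theta) = \prod_{i} \prod_{j} P_{ij}^{x_{ij}} \left(1 - P_{ij}\right)^{1 - x_{ij}} \]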

Calibration algorithms
This means we want to find the b and θ that make that number the largest
So we set θ, find a good b, use it to score students and find a new θ, find a better b, etc…
◦Marginal ML uses marginal distributions, not exact points, hence it being faster and working better with smaller samples of people/items

Calibration algorithms
Note: rather than examine the LF (which gets incredibly small), software examines -2*ln(LF)
IRT software tracks these iterations because they provide information on model fit
See output

Part 4 (cont.): Assumptions of IRT: Model-data fit

Checking fit
One assumption of IRT (#2) is that our data even follows the idea of IRT!
This is true at both the item and the test level
Also true about examinees: they should be getting items wrong that are above their θ and getting items correct that are below their θ

Model-data fit
Whenever fitting any mathematical model to empirical data (not just IRT), it is important to assess fit
Fit refers to whether the model adequately represents the data
Alternatively, whether the data is far away from the model

Model-data fit
There are two types of fit important in IRT
◦Item (and test) - compares observed data to the IRF
◦Person – evaluates whether individual students are responding according to the model
 Easy items correct, hard items incorrect

Model-data fit
Remember the 10-group empirical IRF that I drew? This is great!

Model-data fit
You’re more likely to see something like this:

Model-data fit
Or even worse…

Model-data fit
Note that if we drew an IRF in each of those graphs, it would be about the same
But it is obviously less appropriate in Graph #3 (“even worse”)
Fit analyses provide a way of quantifying this

Item fit
Most basic approach is to subtract observed frequency correct from the expected value for each slice (g) of θ
This is then summarized in a chi-square statistic
Bigger = bad fit
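One common way to turn those differences into a chi-square is sketched below. The specific formula (squared difference between observed and expected proportions, weighted by group size and divided by the expected binomial variance) is an assumption for illustration, since programs differ in the exact statistic; the group values and function names are hypothetical.

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_chi_square(group_thetas, group_sizes, observed_props, a, b, c):
    """Sum over slices of N * (observed - expected)^2 / (expected * (1 - expected))."""
    chi_sq = 0.0
    for theta, n, observed in zip(group_thetas, group_sizes, observed_props):
        expected = irf_3pl(theta, a, b, c)
        chi_sq += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi_sq

# Five slices of theta with 50 examinees each, for an item with a=1, b=0, c=0.2.
group_thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
group_sizes = [50, 50, 50, 50, 50]
a, b, c = 1.0, 0.0, 0.2

# Observed proportions that match the IRF exactly, and the same values reversed.
well_fitting = [irf_3pl(t, a, b, c) for t in group_thetas]
poorly_fitting = list(reversed(well_fitting))

chi_good = item_chi_square(group_thetas, group_sizes, well_fitting, a, b, c)
chi_bad = item_chi_square(group_thetas, group_sizes, poorly_fitting, a, b, c)
print(chi_good, round(chi_bad, 1))
```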

Item fit
Graphical depiction:

Item fit
Better fit

Item fit
The slices are called quadrature points
Also used for item parameter estimation
The number of slices for chi-square need not be the same as for estimation, but it helps interpretation

Item fit
Chi-square is oversensitive to sample size
A better way is to compute standardized residuals
Divide a chi-square by its df = G-m where m is the number of item parameters
This is more interpretable because of the well-known scale
0 is OK, examine items > 2

Item fit
For broad analysis of fit, use quantile plots (Xcalibre, Iteman, or Lertap)
◦3 to 7 groups
◦Can find hidden issues (My example: social desirability in Likert #2)
See Xcalibre output
◦Fit statistics
◦Fit graphs (many more groups, and IRF)

Person fit
Is an examinee responding oddly?
Most basic measure: take the log of the LF at the max (θ estimate)
\[ l_0 = \ln \prod_{i=1}^{n} P_i(\hat{\theta})^{u_i} Q_i(\hat{\theta})^{1 - u_i} \]
A higher number means we are more sure of the estimate
But this is dependent on the level of θ, so we need it standardized: lz

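Person fit
lz is like a z-score for fit: z = (x-μ)/s
Less than -2 means bad fit
\[ l_z = \frac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}} \]
\[ E(l_0) = \sum_{i=1}^{n} \left[ P_i(\hat{\theta}) \ln P_i(\hat{\theta}) + \left(1 - P_i(\hat{\theta})\right) \ln \left(1 - P_i(\hat{\theta})\right) \right] \]
\[ \mathrm{Var}(l_0) = \sum_{i=1}^{n} P_i(\hat{\theta}) \left(1 - P_i(\hat{\theta})\right) \left[ \ln \frac{P_i(\hat{\theta})}{1 - P_i(\hat{\theta})} \right]^2 \]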
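The lz calculation can be sketched as follows. The five model probabilities stand in for P evaluated at an examinee's θ estimate and are made up for the example, as are the two response patterns and the function name.

```python
import math

def lz(responses, probs):
    """Standardized log-likelihood person-fit statistic."""
    # Observed log-likelihood at the theta estimate.
    l0 = sum(u * math.log(p) + (1 - u) * math.log(1.0 - p)
             for u, p in zip(responses, probs))
    # Its expected value and variance under the model.
    expected = sum(p * math.log(p) + (1.0 - p) * math.log(1.0 - p) for p in probs)
    variance = sum(p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical probabilities of a correct response, easiest item first.
probs = [0.9, 0.8, 0.6, 0.3, 0.1]

consistent = [1, 1, 1, 0, 0]   # easy items right, hard items wrong
aberrant = [0, 0, 0, 1, 1]     # easy items wrong, hard items right

lz_consistent = lz(consistent, probs)
lz_aberrant = lz(aberrant, probs)
print(round(lz_consistent, 2), round(lz_aberrant, 2))
```

The consistent pattern gives a positive lz, while the aberrant pattern falls well below -2 and would be flagged.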

Person fit
lz is sensitive to the distribution of item difficulties
Works best when there is a range of difficulty
That is, if there are no items for high-ability examinees, none of them will have a good estimate!
Best to evaluate groups, not individuals

How is fit useful?
Throw out items?
Throw out people?
Change model used?
Bad fit can flag other possible issues
◦Speededness: fit (and N) gets worse at end of test
◦Multidimensionality: certain areas

How is fit useful?
Note that this fits in with the estimation process
IRT calibration is not “one-click”
Review results, then make adjustments
◦Remove items/people
◦Modify par distributions
◦Modify quadrature points
◦Etc.

Summary
That was a basic intro to the rationale of IRT
Now we will start talking about some applications and uses
We will also examine IRT software and output
