
1

Item Response Theory

Ryan P. Bowles, May 17, 2005

2

Outline• IRT as model-based measurement

– Measurement theory– Why sum scores are insufficient

• Rasch measurement• IRT models• Advanced topics in IRT

– Linking– DIF– CAT

• IRT as explanatory models– Item predictor models– Multidimensional models– Dynamic models

3

Item Response Theory as Model-Based Measurement

4

What is measurement?
• Main use of IRT is in psychological measurement
– psychological measurement = measurement in the social sciences = measurement of latent traits
• What is measurement?
• What makes measurement meaningful?
• How is meaningful measurement achieved?

Hand, D. J. (1996). Statistics and the theory of measurement. Journal of the Royal Statistical Society, Series A, 159, 445-492.

Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.

5

Representational view
• Standard conceptualization of psychological measurement
• “Assignment of numbers to objects according to a rule” (Stevens, 1951)

6

Representational view

• Attribute(s) of interest- latent trait– E.g., print concepts knowledge, anxiety, skill

• Population of objects of measurement having the attribute(s)- subjects– E.g., kids, test-takers, teams


7

Representational view

• Agent of measurement- instrument
– test, questionnaire
• Create a homomorphism (mapping) from the objects to the real numbers such that
– each object is represented by a number
– relations between the objects on the attributes are represented by relations between the real numbers (equality, order)

8

6 ft.

Representational view

6 ft.

Heights are the same

6 ft.

Assigned numbers are the same

9

Representational view

Differences in height are the same

6’ (72”)5’5” (67”)4’10” (58”)

Differences in assigned numbers are the same

10

Representational view

• Meaningful measurement occurs when the assignment of numbers successfully achieves representation of the desired relations between objects

11

What is the appropriate way to conduct meaningful

measurement?

12

Measurement Models
• A measurement model is the way numbers are “assigned” to objects
– a model is a mathematical expression of how observed and unobserved relate to each other
– use model to get an estimate of latent trait level using observed responses
– how we “choose” the numbers
• Measurement is meaningful when the model provides an appropriate number assignment
– trait level estimates that reflect relations between trait levels
• What is an appropriate measurement model?


13

Item Response Theory

• Collection of models and associated statistical techniques for converting (usually) ordinal categorical data to trait level estimates– dichotomous: right/wrong, yes/no, win/loss– polytomous: Likert scales, other rating scales

14

IRT• Used in many contexts

– standardized educational tests: SAT, GRE, ACT, NAEP– normed psychological tests: Woodcock-Johnson– licensure and certification: nurses, systems administrators– educational standards: Lexiles– rehabilitation medicine– sports

• Two branches– IRT perspective (mechanistic)– Rasch perspective (prescriptive)

15

Mechanistic (IRT)

• Model should describe response mechanism
– model process by which latent trait yields observations
– how do subjects interact with items to yield responses?
– if model is correct, then can work backwards from responses to get trait level estimates
• Measurement is successful when
– the model fully describes or closely approximates the response process
– when model fits the data

16

Mechanistic (IRT)

• If not successful, then– conclude that our theory about response

process was wrong– reformulate theory– reformulate model– usually by relaxing constraints (i.e. adding

parameters)

17

Prescriptive (Rasch)

• Model should reflect properties required for measurement– define measurement properties– derive or identify model that reflects these

properties– such that measurement properties are

testable in the framework of the model• Measurement is meaningful when

– the required measurement properties are met– when data fit the model

18

Prescriptive (Rasch)

• If not successful, then– meaningful measurement cannot be achieved

with the current subjects and instrument• some items may not behave as required• latent trait may not have same meaning for all

subjects– identify and eliminate measurement

anomalies– usually by removing or replacing parts of

measurement instrument


19

Comparison

                   | Rasch                                   | IRT
Model is valid     | If has important measurement properties | If describes response process
Validity evidence  | Data fit model                          | Model fits data
Response to misfit | Remove anomalies                        | Revise theory/model

20

Comparison

• Ongoing debate
– Jaeger (1987), Fisher (1994), Andrich (2002)
• Great deal of confusion about differences
– perhaps Kuhnian incompatibility between the paradigms (Andrich, 2002)
– incorrect interpretations of results
– difficulties in peer review
– incompatible recommendations for practice

21

Comparison

• In practice, many similarities
– both use statistical fit as evidence of successful measurement
– Rasch model is functionally equivalent to the 1PL IRT (often called Rasch) model (Andrich, 1989)
– often yield very similar results
• correlations between ability estimates with Rasch or IRT are above .9 if the data-model fit is reasonable

22

Comparison

• Can straddle the fence– Accept measurement properties as ideal

but relax when model becomes untenable for a given data set

– Proponents of IRT perspective sometimes eliminate observations because of undesired properties (e.g., multidimensionality)

– Proponents of Rasch perspective sometimes generalize model at expense of relaxation of measurement properties (e.g., mixed Rasch model)

23

Sum scores

• So where does typical practice of summing responses fall in this?

• Two ways to look at sum scores– operationalism– classical test theory

24

Operationalism

• Alternative to representational measurement
• Attribute is defined by the instrument
– no more and no less
• No assumption of an unobservable reality
– no latent attributes
– as soon as you give a psychological name to the attribute, you are in a gray area between operationalism and representationalism
• Example: SAT


25

Operationalism

• Meaningful measurement is achieved if it is useful

• Usefulness stems from convenience, widespread use, and/or predictive power– no theoretical considerations or evidence of

meaningful measurement is needed– goal of SAT is to predict college success

26

Operationalism

• Measurement in the social sciences is generally representational– social scientists routinely hypothesize

latent attributes that can only be approximated by measurement procedure

27

Classical Test Theory

• Representational way to look at sum scores: Classical Test Theory (CTT)– aka True Score Theory

X = T + E

28

CTT

• Simplistic view of the response process• Mostly untestable• Items are essentially irrelevant

– item effects are ignored– item properties are not linked to behavior– no clear way to compare item functioning

across non-equivalent samples

29

CTT

• Items must be regarded as fixed– difficult to generalize to new items– dealing with different tests is awkward

• test parallelism or test equating– dealing with missing data is very awkward

• Measurement error is same for all subjects– but we cannot measure everyone equally well

30

General Principles of IRT


31

Parameterization

• Set of parameters describing the latent trait(s) of the person– most of IRT is unidimensional– single trait level (or ability), usually called θ – measurement is estimation of θ

• Set of parameters describing the items– item difficulty, discrimination, guessing– category thresholds

32

Functional form

• Function that describes how latent trait parameters and item parameters combine to yield a response to an item

• Usually logistic

P(X_ni = x) = exp(ν_ni) / [1 + exp(ν_ni)]

33

Local Independence

• Assume that a person’s responses to items are independent conditional on the trait level– the only thing that makes two item responses

related to each other is the trait level– conditional independence– local independence

P(X_ni = x | X_nj = y) = P(X_ni = x)

P(X_ni = x AND X_nj = y) = P(X_ni = x) · P(X_nj = y)

34

Estimation

• Maximum likelihood estimation
– best trait level estimate is the one that makes the pattern of observed responses most likely
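As a concrete illustration of “most likely,” here is a short Python sketch, not from the slides, that grid-searches the trait level maximizing the likelihood of a response pattern under a logistic (Rasch-type) item response function with known item difficulties; all names and values are illustrative.

import math

def rasch_p(theta, b):
    # Probability of a correct response under the Rasch model
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, responses, difficulties):
    # Local independence: the log-likelihood is a sum over items.
    # A missing response (None) is simply skipped, which is why missing data
    # raise the SE but do not bias the estimate.
    ll = 0.0
    for x, b in zip(responses, difficulties):
        if x is None:
            continue
        p = rasch_p(theta, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

def estimate_theta(responses, difficulties):
    # "Best" trait level = the grid point that makes the pattern most likely
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, difficulties))

# Example: five items, one response missing
print(estimate_theta([1, 1, 0, None, 1], [-1.0, -0.5, 0.0, 0.5, 1.0]))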

35

Immunity to missing data

• Missing data are generally ignorable
• Missing data have no biasing effect on trait level estimates
– they only yield a higher SE
• Why?
– all responses are conditionally independent, so a missing response doesn’t affect the likelihood of another response

36

Rasch Measurement


37

Measurement Properties
Specific Objectivity:

“the comparison of any two subjects can be carried out in such a way that no other parameters are involved than those of the two subjects, neither the parameter of any other subject nor any of the stimulus (i.e., measurement instrument) parameters” (Rasch, 1966, p. 92)

– separability, test-free measurement, sample independence

38

Measurement Properties

• Existence of sufficient statistic• Interval level measurement

– additive conjoint measurement

• Objective measurement• Fundamental measurement

39

History

Georg Rasch Ben Wright Gerhard Fischer

40

The Rasch (1960/1980) model

• Note that you can add a fixed amount to all trait levels (θ) and item difficulties (β) and get the same probability– need identification constraint– mean item difficulty = 0– mean trait level = 0

P(X_ni = 1) = exp(θ_n − β_i) / [1 + exp(θ_n − β_i)]

ln[ P(X_ni = 1) / P(X_ni = 0) ] = θ_n − β_i
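The identification point can be checked numerically. A short Python sketch, not from the slides (values illustrative): adding the same constant to every θ and β leaves the probabilities unchanged, which is why the mean item difficulty is fixed at 0.

import math

def rasch_p(theta, b):
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

theta, beta = 0.3, -1.0                      # illustrative person and item
shift = 5.0
print(rasch_p(theta, beta))                  # same value...
print(rasch_p(theta + shift, beta + shift))  # ...after shifting both parameters

# Identification constraint: center the item difficulties at 0
betas = [-0.5, 0.4, 1.3]
mean_b = sum(betas) / len(betas)
print([b - mean_b for b in betas])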

41

[Figure: Rasch item characteristic curves for items with β = -1, β = 0, and β = 1; probability of a correct response vs. trait level (theta) from -4 to 4]

42

What you need

• Using the Rasch model requires
– an instrument designed to measure a single latent trait
– dichotomous data
– orderable responses (i.e., you have to know which response indicates more of the latent trait)
– multiple items (10 or more?)
– multiple subjects (at least 10? at least 300?)


43

Letters data

• Example: Letters data
– Justice, Pence, Bowles, & Wiggins (2005)
– What (print) letters are easiest for kids to learn?
• 339 preschool kids
• 26 letters (PALS)
• Winsteps
– winsteps.com

44

Data file

100110000000000000000000000000
100211011000100010100010000100
100301011000001100111000000110
100411101100101011101011101111
100501001010011000000000000100
100600000100010000000001010000
100700000000000000000000001000
100800000000000000000000000000
100911000000000000000000001100
101011101111011111111111011110
…

ID Responses

45

Control file

&INST
TITLE='Letters data'
data=letters_forWinsteps.prn
ITEM1=5
NI=26
CODES=01
&END
A
B
…
Z
END NAMES

Run Winsteps

46

Analysis

1. Summary statistics (Table 3.2)
2. Person-item map (Table 1)
– aka Wright plot
3. Item fit (Table 14, or 10, 13, 15)
– unexpected responses (Table 10)
4. Person fit (Table 6, or 17, 18, 19)
– unexpected responses (Table 6)

47

Fit analysis

• Two common Rasch fit statistics
– infit and outfit
• Both are based on the response residual
X_ni − E(X_ni) = X_ni − P(X_ni)
Ex.: if response is 1 (correct) and expectation (probability) is .9, then residual is .1
• Different weightings

48

Fit analysis

• Infit– information-weighted fit– relatively sensitive to patterns of misfit– conservative, so misfit indicated by infit is

likely important


49

Fit analysis

• Outfit– outlier-sensitive fit– relatively sensitive to outliers, or highly

unexpected responses– more liberal, so misfit indicated by outfit may

be unimportant – one highly unexpected response (e.g., from

miscoding) may be enough to indicate substantial misfit

50

Fit analysis

• Two forms
• Mean-square
– indicates amount of misfit
– expectation is 1
– values higher than 1 indicate more noise (randomness, error) than expected
– values below 1 indicate less information than expected (redundancy)
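A short Python sketch, not from the slides, of the usual infit and outfit mean-square computations for one item: outfit averages the squared standardized residuals, while infit weights the squared residuals by the information P(1 − P). The responses and probabilities are illustrative.

def fit_mean_squares(responses, probs):
    # responses: observed 0/1 scores; probs: model probabilities P(X = 1)
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probs)]
    info = [p * (1.0 - p) for p in probs]
    outfit = sum(r / i for r, i in zip(sq_resid, info)) / len(responses)
    infit = sum(sq_resid) / sum(info)
    return infit, outfit

# Example: one highly improbable correct response (p = .05) inflates outfit far more than infit
print(fit_mean_squares([1, 0, 1, 1, 0], [0.9, 0.3, 0.05, 0.7, 0.4]))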

51

Fit analysis

– Cutoffs depend on goals
– Conservative (Smith & Suh, 2002): .9 to 1.1
– Typical (Wright & Linacre, 1994): .7 to 1.3
– Liberal (Wright & Linacre, 1994): .6 to 1.4
• Standardized
– converted to approximate standard normal distribution
– significant misfit: <-2 or >2

52

Fit analysis

• What should you use?– matter of opinion– I prefer to use mean-square for items and

concentrate on standardized for persons

53

Fit analysis

• Why does an item misfit?– assesses a different dimension– multiple correct answers– contaminated by irrelevant factors

• answer given away by another item• display is hard to read

54

Fit analysis

• Why does a person misfit?– not answering based on trait of interest

• guessing• inattention• test anxiety• fatigue• learning from other items


55

Fit analysis

If something misfits, what should you do?
1. Reformulate model to relax assumptions
– IRT method, not appropriate for Rasch analysis
2. Eliminate misfitting data
– remove misfitting items or persons
– remove specific data points (if you can identify source of misfit)
3. Eliminate misfitting data, rerun, and compare
– if it doesn’t make any noticeable difference, leave misfitting data in

56

Control file

&INST
TITLE='Letters data'
data=letters_forWinsteps.prn
ITEM1=5
NI=26
CODES=01
IDFILE=*
17
24
26
*
&END

Run Winsteps

57

Results

• This instrument is an internally valid measure of letter knowledge– interval scaling

• X, Q, and Z do not behave as expected– something about letters that are rarely used,

perhaps especially at the beginning of words– offers a direction for further exploration of the

latent trait of print letter knowledge

58

Results

• What now?– tweak instrument: e.g., use fewer items– use trait level estimates externally– e.g., as indicator of risk

• kids with low letter knowledge may need remedial help

– e.g., correlation with gender = .11• females tend to have greater letter knowledge

59

Polytomous models
• For rating scale data
• Responses must be ordered
– Likert scales
– Test items with partial credit
• Two main models
– Rating Scale Model for a rating scale common to all items
– Partial Credit Model for instruments with different rating scales on each item
– hybrid: some items share a common rating scale, some don’t

60

Polytomous models

Rating Scale Model (RSM; Andrich, 1978)

P(X_ni = x | X_ni = x or x−1) = exp[θ_n − (β_i + τ_x)] / {1 + exp[θ_n − (β_i + τ_x)]}

P(X_ni = x) = exp{ Σ_{k=0..x} [θ_n − (β_i + τ_k)] } / Σ_{z=0..m} exp{ Σ_{k=0..z} [θ_n − (β_i + τ_k)] }


61

[Figure: category probability curves (categories 1-4) plotted against trait level (theta) from -3 to 3, shown for two rating scale structures]

62

Polytomous models

Partial Credit Model (PCM; Masters, 1982)

P(X_ni = x | X_ni = x or x−1) = exp(θ_n − τ_ix) / [1 + exp(θ_n − τ_ix)]

P(X_ni = x) = exp[ Σ_{k=0..x} (θ_n − τ_ik) ] / Σ_{z=0..m} exp[ Σ_{k=0..z} (θ_n − τ_ik) ]
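A short Python sketch, not from the slides, of PCM category probabilities for a single item; the step parameters are illustrative, and constraining every item to share the same steps (shifted by an item difficulty) gives the RSM.

import math

def pcm_probs(theta, tau):
    # Return P(X = 0), ..., P(X = m) for an item with step parameters tau[1..m]
    # cumulative sums of (theta - tau_k), with the k = 0 term defined as 0
    sums = [0.0]
    for t in tau:
        sums.append(sums[-1] + (theta - t))
    denom = sum(math.exp(s) for s in sums)
    return [math.exp(s) / denom for s in sums]

# Example: a 4-category item (scores 0-3) with three step parameters
print(pcm_probs(theta=0.5, tau=[-1.0, 0.0, 1.2]))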

63

What you need

• Using the RSM or PCM requires
– an instrument designed to measure a single latent trait
– polytomous data
– orderable responses (i.e., you have to know which direction of the rating scale indicates more of the latent trait)
– multiple items (5 or more? fewer than the Rasch model because there is more information per item)
– multiple subjects (at least 20? 300?)

64

PWPA data

• Justice, Bowles, & Skibbe (in press)
• 128 preschoolers
• 12 items
– 7 dichotomous (0-1)
– 4 with a 3-option rating scale (0-2)
– 1 with a 4-option rating scale (0-3)
• Designed to measure Print Concept Knowledge
– validate instrument to assess Print Concept Knowledge

65

Score/Item | Examiner Script | Scoring Criteria
_____ 1. Front of book | Before administering task: Give book to child with spine facing child. Cover: Show me the front of the book. | 1 pt: turns book to front or points to front
_____ 2. Title of book | Cover: Show me the name of the book. | 1 pt: points to one or more words in title
_____ 3. Role of title | Cover: What do you think it says? | 1 pt: explains role of title (‘tells what book’s about’); 2 pt: says 1 or more words in title or relevant title
_____ 4. Print vs pictures | Page 1-2: Where do I begin to read? After administering task: Put finger on first word in top line and say: I begin to read here | 2 pt: points to first word, top line; 1 pt: points to any part of narrative text
_____ 5. Directionality | Page 1-2: Then which way do I read? | 2 pt: sweeps left to right; 1 pt: sweeps top to bottom

66

&INST
TITLE='PWPA data'
data=pwpa_composite.prn
ITEM1=8
NI=12
CODES=0123
groups=0
&END
pwpa1
pwpa2
pwpa3
…
END NAMES

Run Winsteps


67

&INST
TITLE='PWPA data'
data=pwpa_composite.prn
ITEM1=8
NI=12
CODES=0123
groups=0
NEWSCORE=0011
RESCORE=001000000000
&END
pwpa1
pwpa2
pwpa3
…
END NAMES

68

Results

• PWPA (slightly modified) is an internally valid instrument for the measurement of print concepts knowledge
• Also externally valid
– low-SES kids are 1.5 SDs lower than middle-SES kids
– kids with language impairment are 1.2 SDs lower than kids without LI
– no additional risk from having both risk factors compared to having one or the other

69

[Scoring chart: items 1-12 with their possible responses (0-3) arranged under a Print Concepts Knowledge (PCK) scale from 40 to 160 and a Total Score scale from 0 to 17; locations for the Mid/TL, Mid/LI, Low/TL, and Low/LI groups are marked]

Directions: Circle the responses to each item, and circle the total score. Draw a vertical line through the total score. PCK is indicated by where the line passes through the bottom row, and is also indicated in the chart mapping total score to PCK. The vertical line also goes through the expected item responses. Responses far off the line indicate unexpected scores.

Total Score to PCK conversion:
Total Score: 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
PCK: 161, 145, 134, 128, 123, 118, 115, 111, 107, 104, 100, 97, 92, 88, 82, 74, 63, 46

70

[The Print Concepts Knowledge scoring chart shown again, with a set of item responses filled in as a worked example]

71

Multifaceted models

• Two facets of measurement– subjects– items

• Can have multiple facets– subjects– items– raters– diving: divers, dives, raters

72

Multifaceted Rasch Model

Linacre (1989)

FACETS program: winsteps.com (same authors as Winsteps: Linacre & Wright)

P(X_nij = 1) = exp(θ_n − β_i − γ_j) / [1 + exp(θ_n − β_i − γ_j)]


73

Multifaceted Rasch Model

• Can incorporate rating scales• Have to determine level of rating scale

invariance– do all subjects, raters, and items use the

rating scale the same way?– differ across items?– differ across raters?– can’t have all three varying at once

• need invariance to link scales

74

Rooms data
• Sutfin, Fulcher, Bowles, & Patterson (2005)
• Subjects: 57 kids’ rooms
• Items: 4 questions designed to assess the stereotypicality of pictures of the rooms
• Raters: 79 UVA undergrads
• Each rater rated up to 12 rooms
– cannot use sum scores unless adjusted for differential rater severity
– connectedness
• Goal: to assess whether kids of lesbian parents have less stereotypical gender development

75

title=rooms
facets=3 ;rater, room, item
positive=1 ;
data=facetsdata.csv
models=
?,?,2,D
?,?,4,R
?,?,7,R
?,?,9,R
*
labels=
1,rater
100-270
*
2,room
11-150
*
3,Item
1,B_G
2,Corr
3,Confi
4,Stere
5,Gender
6,Masc
7,Masc2
8,Femin
9,Femin2
10,Qual
11,Impr

76

[FACETS variable map: columns show the measure scale, rater severities, room measures, item measures (Stere, Femin2, Masc2, Corr), and the category structure of rating scales S.2, S.3, and S.4]

• Raters differed in severity
– ignoring this would have yielded inaccurate stereotypicality estimates
• Children of heterosexual parents had more strongly stereotyped rooms (d = .45)

77

Summary

• Rasch measurement models share the property of specific objectivity
– trait and measurement instrument are statistically separable
• Fit of data to these models provides strong validity evidence
– misfit indicates measurement anomalies

78

IRT models


79

IRT models

• Describe response process• What is a good theory of the response

process?– go with what’s worked well before– set of standard IRT models that have

provided good results in the past

80

History

• Fred Lord & Melvin Novick (1968)– 4 chapters by Allan Birnbaum

• Fumiko Samejima (1969)• Darrell Bock

– University of Chicago at same time as Ben Wright

81

1PL or Rasch model

• Items differ in difficulty only

P(X_ni = 1) = exp(θ_n − β_i) / [1 + exp(θ_n − β_i)]

82

[Figure: 1PL item characteristic curves for items with β = -1, β = 0, and β = 1; probability vs. trait level (theta) from -4 to 4]

83

2PL

• Items differ in two ways– difficulty– discrimination

• how well the item can discriminate between subjects • how strongly the item reflects the latent trait• similar to factor loading

P(X_ni = 1) = exp[α_i(θ_n − β_i)] / {1 + exp[α_i(θ_n − β_i)]}

84

[Figure: 2PL item characteristic curves for items with (β = -1, α = 1.5), (β = 0, α = .7), and (β = 1, α = 1)]


85

3PL

• Items differ in three ways
– difficulty
– discrimination
– guessing or pseudoguessing
• minimal probability of correct response
• lower asymptote

P(X_ni = 1) = c_i + (1 − c_i) · exp[α_i(θ_n − β_i)] / {1 + exp[α_i(θ_n − β_i)]}

86

[Figure: 3PL item characteristic curves for items with (β = -1, α = 1.5, c = 0), (β = 0, α = .7, c = 0), and (β = 1, α = 1, c = .2)]
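A short Python sketch, not from the slides, of the 3PL response function; setting c = 0 gives the 2PL, and additionally fixing α = 1 gives the 1PL/Rasch model. The example items reuse the parameter values from the figure above.

import math

def irf_3pl(theta, b, a=1.0, c=0.0):
    # Probability of a correct response under the 3PL model
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Items from the figure: (b=-1, a=1.5, c=0), (b=0, a=.7, c=0), (b=1, a=1, c=.2)
for b, a, c in [(-1.0, 1.5, 0.0), (0.0, 0.7, 0.0), (1.0, 1.0, 0.2)]:
    print(b, a, c, round(irf_3pl(theta=0.0, b=b, a=a, c=c), 3))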

87

GSS data

• Bowles, Grimm, & McArdle (in press)
• Over 21000 persons from 15 nationally representative samples
– roughly every other year between 1974 and 2000
• 10 item multiple choice vocabulary test
– 5 options
• Goal: understand relation between age and vocabulary knowledge

88

Item  | Target     | Option 1   | Option 2     | Option 3     | Option 4  | Option 5
WordA | Space      | school     | noon         | captain      | room      | board
WordB | Broaden    | efface     | make level   | elapse       | embroider | widen
WordC | Emanate    | populate   | free         | prominent    | rival     | come
WordD | Edible     | auspicious | eligible     | fit to eat   | sagacious | able to speak
WordE | Animosity  | hatred     | animation    | disobedience | diversity | friendship
WordF | Pact       | puissance  | remonstrance | agreement    | skillet   | pressure
WordG | Cloistered | miniature  | bunched      | arched       | malady    | secluded
WordH | Caprice    | value      | a star       | grimace      | whim      | inducement
WordI | Accustom   | disappoint | customary    | encounter    | get used  | business
WordJ | Allusion   | aria       | illusion     | eulogy       | dream     | reference

89

GSS data

• Winsteps analysis indicated that the test did not fit Rasch model

• Try 2PL and 3PL• Bilog-MG

– dichotomous data– Scientific Software International– ssicentral.com

90

GSS

>GLOBAL DFName = 'gssdkas0.dat', NPArm = 1, LOGistic, SAVe;
>SAVE PARm = 'gss.PAR';
>LENGTH NITems = (10);
>INPUT NTOtal = 10, NALt = 2, NIDchar = 1;
>ITEMS INAmes = (word1(1)word10);
>TEST1 TNAme = 'TEST0001', INUmber = (1);
(A1,10A1)
>CALIB ACCel = 1.0000;

Callout labels on the slide: title; 1PL (2 for 2PL, 3 for 3PL); test length; right/wrong; item names; format of data file


91

Results

Model | X2     | ΔX2  | Δdf
1PL   | 189577 |      |
2PL   | 187230 | 2347 | 10
3PL   | 186743 | 487  | 10

92

Results

Item  | Difficulty | Discrimination | Guessing
wordA | -1.137     | 1.434          | .023
wordB | -1.266     | 3.554          | .016
wordC |  1.203     | 1.982          | .035
wordD | -1.281     | 3.636          | .019
wordE |  -.551     | 2.428          | .012
wordF |  -.690     | 2.659          | .030
wordG |   .875     | 2.172          | .092
wordH |   .870     | 2.839          | .060
wordI |  -.908     | 1.407          | .017
wordJ |  1.043     | 2.918          | .039

93

[Figure: estimated item characteristic curves for wordA through wordJ; probability vs. trait level (theta) from -4 to 4]

94

[Figure: estimated vocabulary trait level (θ) plotted against age from 10 to 90; the θ axis runs from -0.8 to 0.4]

95

Polytomous models

• Three types of polytomous models
– adjacent logits
• probability of being in higher of adjacent categories
• RSM and PCM
– cumulative logits
• probability of being in category x or above compared to category x−1 or below
– baseline-category logits
• probability of being in category x compared to reference category z
• unordered categories

96

Graded Response Model

• Cumulative logits

P*(x) = P(X_ni ≥ x) = exp[α_i(θ_n − τ_ix)] / {1 + exp[α_i(θ_n − τ_ix)]}

P(X_ni = x) = P*(x) − P*(x+1)
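A short Python sketch, not from the slides, of GRM category probabilities built from cumulative logits; the discrimination and thresholds are illustrative.

import math

def grm_probs(theta, alpha, tau):
    # tau[0..m-1] are ordered thresholds for categories 1..m (category 0 has none)
    # cumulative probabilities P*(0), ..., P*(m), with P*(0) = 1 by definition
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-alpha * (theta - t))) for t in tau]
    p_star.append(0.0)                      # P*(m + 1) = 0
    return [p_star[x] - p_star[x + 1] for x in range(len(tau) + 1)]

# Example: a 4-category item with discrimination 1.5 and thresholds -1, 0, 1
print(grm_probs(theta=0.5, alpha=1.5, tau=[-1.0, 0.0, 1.0]))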


97

[Figure: GRM category probability curves (categories 1-4) against trait level (theta) from -3 to 3, for items with α = 1.5 and α = 2.5]

98

GRM

• Parscale• Multilog

– SSI• SEM programs for categorical outcomes

– Mplus

99

Polytomous models

• Which model?
– r > .97
– global fit statistics are not well developed
– GRM and PCM are not nested
– biggest difference is what happens when you collapse categories (Masters, 1985)
– I prefer PCM/RSM
• fit and results indistinguishable
• PCM/RSM has Rasch measurement properties
• Sample size
– at least 250, maybe more

100

Nominal Response Model

• Baseline-category logits– for unordered categories with a single reference

category– identify what response options are indicative of the

latent trait

P(X_ni = x | X_ni = x or X_ni = z) = exp[α_ix(θ_n − τ_ix)] / {exp[α_ix(θ_n − τ_ix)] + exp(θ_n)}

101

Bulimia data

• Thissen (1993)
• One unordered item:
– I prefer to eat
A. at home alone
B. at home with others
C. in a public restaurant
D. at a friend’s house
E. doesn’t matter

102


103

Summary

• IRT models are designed to reflect the response process

• Ryan’s opinion: there is little theory in IRT– models are selected because

• they have worked well before• estimation software is available

– little effort to develop response process theory– I call them IRMs- item response models

• Huge advantage over sum scores– item effects are explicitly modeled

104

Advanced topics in IRT

Advantages of explicitly modeling item effects

105

Advanced topics

• Linking• Differential Item Functioning• Computer Adaptive Tests

106

Linking

107

Linking

• How can two tests that measure the same latent trait be linked to the same measurement scale?– same zero point– same measurement unit

108

Linking

• With sum scores, need representative sample– items properties are not part of the model

• cannot be used to link tests– can only use person properties

• only property: sum score• equipercentile equating• regression prediction


109

Linking

• Common item equating
– items are given on more than one test
– assume item parameters are same for both tests
• Common person equating
– persons take multiple items from one or more tests
– assume latent trait level is same for all items
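A short Python sketch, not from the slides, of one simple common-item linking method (a mean shift under the Rasch model): the average difficulty difference on the shared items is the constant that puts the second form's estimates on the first form's scale. All item and person values are illustrative.

def linking_constant(common_a, common_b):
    # Difficulties of the common items as estimated on Form A and on Form B
    diffs = [a - b for a, b in zip(common_a, common_b)]
    return sum(diffs) / len(diffs)

# Illustrative difficulties for three shared items on each form
form_a = [-0.4, 0.1, 0.9]
form_b = [-0.9, -0.3, 0.4]
shift = linking_constant(form_a, form_b)

# Put a Form B person estimate on the Form A scale
theta_b = 0.25
print(theta_b + shift)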

110

Vocabulary data

• McArdle, Grimm, Hamagami, Bowles, & Meredith (2005)

• Goal: understand lifespan development of vocabulary knowledge

• 441 persons from 3 longitudinal studies assessed on tests of intelligence from as young as 3 to as old as age 75

• Problem: tests were not repeated

111

[Diagram of the data file: 10 rows (1.-10.), one per study/occasion pattern, showing which vocabulary tests (SB, SB-L, SB-M, SB-LM, WB-I, WAIS, WAIS-R, WJ-R) each group received at each occasion]

112

Vocabulary data

• Common item linking
– some items are on multiple tests
• for example, WAIS and WAIS-R Vocabulary share 33 items
– tests given in more than one year
• Estimated vocabulary knowledge (trait level) for each person at each measurement occasion
– using Rasch model and PCM

113 114


115 116

117

Differential Item Functioning

118

DIF

• Does an item work the same for all people?

• Examine if there are differences in item parameters between groups

• May indicate bias– ETS examines tests for evidence of bias

against minorities

119

DIF

1. Force item parameters to be the same in both groups
2. Allow item parameters to vary across groups
3. Compare fit
4. Interpret parameter differences
• requires linking assumption between groups
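A short Python sketch, not from the slides, of step 4 under the linking approach used later for the WM span data (assume the anchor item has the same difficulty in every group, then compare the remaining difficulties); the difficulty values are illustrative.

def dif_contrasts(diff_ref, diff_focal, anchor=0):
    # Align the focal group's difficulties to the reference group via the anchor
    # item, then return aligned-focal minus reference differences for every item
    shift = diff_ref[anchor] - diff_focal[anchor]
    aligned = [b + shift for b in diff_focal]
    return [f - r for f, r in zip(aligned, diff_ref)]

ref_group = [0.0, -0.5, 0.8, 1.2]      # e.g., younger adults
focal_group = [0.3, -0.1, 1.4, 2.0]    # e.g., older adults
print(dif_contrasts(ref_group, focal_group))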

120

WM Span data

• Bowles & Salthouse (2003)• 698 adults• Working memory span (WM span) is the

maximum amount of information that can be stored while engaging in ongoing processing


121

Theories
• Inhibition deficit
– Hasher & Zacks (1988)
• Older adults are less able to suppress irrelevant information in working memory
• Prediction: older adults should be more susceptible to proactive interference (PI) than younger adults
– May, Hasher, & Kane (1999)
• PI will build up faster for older adults

122

Participants

• N = 698
• Young: 17 <= age < 40, N = 280
• Middle: 40 <= age < 60, N = 187
• Old: 60 <= age <= 92, N = 231

123

WM tasks

• Computation Span task– solve arithmetic problem– remember second number in problem

124

READY

125

4 + 2 =
a. 3
b. 7
c. 4
d. 6

126

3 + 5 =
a. 8
b. 6
c. 9
d. 7


127

RECALL

128

Reading span

John read the book in the library.

Who read the book?
A. Tim
B. John
C. Karen
D. Jack

129

WM Span data

• Rasch model
• Assume no DIF
– item difficulty same for all three age groups
– R2 (age, computation span) = .066 (r = -.26)
– R2 (age, reading span) = .099 (r = -.32)
• Allow item difficulty to differ
– link by assuming first item has same difficulty for all three age groups- no PI on first item

130

Computation span

[Figure: item difficulty (0-16) plotted against order position (0-10), with separate curves for the young, middle, and old groups]

131

What’s left?

• Before accounting for differential effects of PI, R2(WM span on age) = .066

• After accounting for differential effects of PI, R2(WM span on age) = .037

• Percent change = 45%

Computation span

132

Reading span

[Figure: item difficulty (0-14) plotted against order position (0-10), with separate curves for the young, middle, and old groups]


133

What’s left?

• Before accounting for differential effects of PI, R2(WM span on age) = .099

• After accounting for differential effects of PI, R2(WM span on age) = .043

• Percent change = 57%

Reading span

134

Results

• DIF analysis indicates that about 50% of the age-related variance in WM span is due to differential susceptibility to PI

135

Computer Adaptive Tests

136

[Figure: adaptive branching tree of Items; a Correct response leads to a harder next Item and an Incorrect response to an easier one]

137

Advantages

• Tests are targeted (aka tailored)• Greater efficiency

– usually test length cut in half for same precision

• Advantages from computer-administration– e.g. immediate scoring

138

Disadvantages

• Critically dependent on appropriateness of IRT model

• Need large item bank of calibrated items• Scheduling

– GRE and China


139

[Flowchart: Select Item 1 → Administer item → Response → Provisional ability estimate → Check stopping rule. If "Keep going", return to Item Selection (identify potential items, maximize objective function, check controls, administer item); if "Done", the Scoring procedure yields the Final ability estimate]
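A short Python sketch, not from the slides, of the loop just diagrammed, using a Rasch item bank, maximum-information item selection, a grid-search provisional estimate, and a fixed-length stopping rule; the item bank, answer function, and settings are illustrative.

import math
import random

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def information(theta, b):
    p = p_rasch(theta, b)
    return p * (1.0 - p)                      # Rasch item information

def estimate_theta(administered):
    # Grid-search ML estimate from the (difficulty, score) pairs given so far
    grid = [g / 50.0 for g in range(-200, 201)]
    def ll(t):
        return sum(math.log(p_rasch(t, b)) if x == 1 else math.log(1.0 - p_rasch(t, b))
                   for b, x in administered)
    return max(grid, key=ll)

def run_cat(bank, answer, max_items=5):
    theta, administered = 0.0, []
    while len(administered) < max_items:      # stopping rule: fixed test length
        used = {b for b, _ in administered}
        candidates = [b for b in bank if b not in used]           # identify potential items
        b = max(candidates, key=lambda d: information(theta, d))  # maximize information
        administered.append((b, answer(b)))                       # administer item
        theta = estimate_theta(administered)                      # provisional ability estimate
    return theta                                                  # final ability estimate

# Toy run: simulated examinee with true theta = 1.0
random.seed(0)
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
print(run_cat(bank, answer=lambda b: int(random.random() < p_rasch(1.0, b))))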

140

MultiCAT

• Two language placement CATs in three languages– reading ability– language in context

141

Summary

• Explicit modeling of item effects– can link tests through items– can test if tests function the same for different

groups– can give different tests to everyone

142

IRT as Explanatory Models

143

Explanatory models
• Goal of explanatory models is to
– test theories
– understand processes
– make predictions
– not to estimate a trait level, although latent traits are often part of the theory
– can apply to items or persons or both
• Item predictor models
• Multidimensional models
• Dynamic models

144

Item predictor models


145

Item predictor models

• Item difficulty may be determined by features of the item

• Can incorporate the predictors into the item response model

146

Linear Logistic Test Model• LLTM (Fischer, 1973)

• similar to incorporating regression of item difficulty on design features

• equivalent to multilevel logistic regression (items clustered within persons)

P(X_ni = 1) = exp(θ_n − Σ_k β_k q_ik) / [1 + exp(θ_n − Σ_k β_k q_ik)]

147

Letters data revisited

• Letters data
– what makes a letter easier or harder to know?
• Four predictors
– First letter of child’s first or nick name (dummy)
– Position in alphabet (A=1, B=2, etc.)
– Average age of phoneme acquisition (missing for letters with multiple phonemes)

Letters data revisited

– CV pattern (3 dummy variables)
• pattern 1: CV with V being /i/ (B, D)
• pattern 2: CV with V not /i/ (J, K)
• pattern 3: VC (F, M)
• pattern 4: name is not part of pronunciation (H, W)

difficulty_i = Σ_k β_k q_ik = β_0 + β_1·flnn_i + β_2·position_i + β_3·ph_age_i + β_4·CVD12_i + β_5·CVD13_i + β_6·CVD14_i
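A short Python sketch, not from the slides, of the LLTM decomposition: an item's difficulty is its row of the Q matrix times the β weights. The feature values and weights below are illustrative placeholders, not the estimates reported two slides later.

def lltm_difficulty(q_row, betas):
    # Item difficulty = weighted sum of the item's design features
    return sum(q * b for q, b in zip(q_row, betas))

# Features: [intercept, flnn, position, ph_age, CVD12, CVD13, CVD14] (illustrative)
q_letter = [1, 0, 2, 3.0, 1, 0, 0]
betas = [0.5, -1.2, 0.04, 0.15, 0.0, 0.3, 0.6]   # illustrative weights only
print(lltm_difficulty(q_letter, betas))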

149

LLTM

• Need specialized or general program– LPCM– SAS PROC NLMIXED

150

Parameter Estimates

Parameter    | Estimate | Standard Error | DF  | t Value | Pr > |t|
beta0        | 0.8264   | 0.1835         | 338 | 4.50    | <.0001
betaflnn     | 2.8355   | 0.2022         | 338 | -14.02  | <.0001
betaposition | 0.03534  | 0.006408       | 338 | 5.51    | <.0001
betaph_age   | 0.1542   | 0.02899        | 338 | 5.32    | <.0001
betaCVD12    | -0.06078 | 0.1444         | 338 | -0.42   | 0.6740
betaCVD13    | -0.4790  | 0.09683        | 338 | 4.95    | <.0001
betaCVD14    | -0.9604  | 0.1278         | 338 | 7.51    | <.0001
vtheta       | 5.9847   | 0.6664         | 338 | 8.98    | <.0001

Results

odds ratio = exp(β)

e.g., exp(2.8355) = 17


151

Results
• 17 times more likely to know the first letter of first or nickname
• 1.03 times more likely to know a letter one position earlier in the alphabet
• 1.17 times more likely to know a letter for each half year earlier that a consonantal phoneme is acquired
• No more likely to know a letter containing a consonant vowel pattern CV not /i/ than a letter containing a consonant vowel pattern CV/i/
• .62 times as likely to know VC than CV/i/
• .38 times as likely to know a NOT CV than CV/i/

152

Multidimensional Models

153

Multidimensional models

• Item responses depend on two or more latent traits

• Equivalent to item factor analysis

154

[Path diagram: factors F1 and F2 with observed indicators Y1-Y6]

155

[Path diagram: factors F1 and F2 load (λ11, λ21, λ31, λ32, λ42, λ52, λ62) on latent response propensities RP1-RP6, which produce the observed item responses X1-X6 through thresholds τ1-τ6]

156

Multidimensional models

For a one-factor model, equivalent to the 2PL with
– α = λ
– β = τ/λ

P(X_ni = 1) = exp[ Σ_{k=1..K} λ_ki θ_kn − τ_i ] / {1 + exp[ Σ_{k=1..K} λ_ki θ_kn − τ_i ]}
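A short Python sketch, not from the slides, of the multidimensional response function above and of the one-factor conversion to 2PL parameters (α = λ, β = τ/λ); loadings, thresholds, and trait values are illustrative.

import math

def mirt_p(thetas, loadings, tau):
    # P(X = 1) for one item given trait values, loadings, and a threshold
    z = sum(l * t for l, t in zip(loadings, thetas)) - tau
    return 1.0 / (1.0 + math.exp(-z))

# One item loading on two factors
print(mirt_p(thetas=[0.5, -0.3], loadings=[1.2, 0.4], tau=0.2))

# One-factor case: the same probability written as a 2PL item
lam, tau = 1.2, 0.6
alpha, beta = lam, tau / lam
theta = 0.5
print(mirt_p([theta], [lam], tau), 1.0 / (1.0 + math.exp(-alpha * (theta - beta))))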


157

Multidimensional models

• Can do exploratory or confirmatory item factor analysis

• Exploratory– same statistics to decide on number of factors

• scree plot• eigenvalues (of tetrachoric correlation matrix) > 1

– can do rotation to simple structure

158

Multidimensional models

• Estimation (Tate, 2003)• SEM programs

– Mplus– Less robust: Lisrel

• IRT programs– TESTFACT

• Other– NOHARM

159

GSS data

• Goals: – identify whether vocabulary knowledge as

measured by the GSS vocabulary test is unidimensional

– assess relation to age for each dimension• Mplus

160

Dimensions

RMSEA: .048 vs. .022
Fit: ΔX2 = 1416, Δdf = 9

161

Item  | Factor 1 loading | Factor 2 loading | Proportion correct
worda | .703             | .003             | .78
wordb | .931             | .018             | .88
wordc | .045             | .668             | .21
wordd | .894             | .051             | .88
worde | .558             | .337             | .69
wordf | .565             | .356             | .74
wordg | -.022            | .705             | .32
wordh | -.025            | .796             | .28
wordi | .685             | .033             | .73
wordj | .052             | .752             | .22

162


163

Dynamic Models

164

Dynamic models

Incorporate change model into IRT model

P(X_nit = 1) = exp[f(Θ_n, t) − β_i] / {1 + exp[f(Θ_n, t) − β_i]}

Linear change: f(Θ_n, t) = θ_level + θ_slope·t

Exponential learning: f(Θ_n, t) = θ_asymptote + θ_difference·exp(−θ_rate·t)
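A short Python sketch, not from the slides, of a dynamic Rasch model with linear change: the trait level at time t is θ_level + θ_slope·t and feeds the usual response function; all values are illustrative.

import math

def theta_at(t, theta_level, theta_slope):
    # Linear change model for the trait level
    return theta_level + theta_slope * t

def p_correct(t, theta_level, theta_slope, b):
    theta_t = theta_at(t, theta_level, theta_slope)
    return 1.0 / (1.0 + math.exp(-(theta_t - b)))

# Illustrative person improving by 0.1 logits per occasion on an item with b = 0.5
for t in range(5):
    print(t, round(p_correct(t, theta_level=-0.5, theta_slope=0.1, b=0.5), 3))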

165

Emotions data

• Ram, Chow, Bowles, Wang, Grimm, Fujita, & Nesselroade (2005)

• 179 college students• Diaries over 52 days• 16 items

– 8 assessing positive affect– 8 assessing negative affect– 7 point rating scales– item parameters assumed invariant over time

166

Emotions data

• How prevalent is 7-day cycle of emotions?
– blue Monday
– accounts for 40% of variance in aggregated data

f(Θ_n, t) = μ_n + β_n·t + R_n·[cos(ω_n·t + φ_n)] + ε_nt

167

Predicted cycles

168

Results

• Average variance explained
PA: 6.3%
NA: 3.0%
– average variance explained for random data: 4%
• Only 77 of 179 show a weekly cycle in PA, 44 of 179 in NA

• Phase shift (day of peak) unreliable


169

Summary

• IRT can be used not just as a measurement model, but also as an explanatory model

• Three ways– item predictors– multidimensional models– dynamic models

• Can also do– person predictors– item by person interactions

170

Conclusions

171

When to use IRT

• Measurement– validate instruments– choose numbers to assign to subjects

• Explanatory modeling– test theories – make predictions

172

What you need

• Ordered categorical data– except NRM: need baseline category

• Relatively large sample size– I recommend 50 for Rasch model– may need >1000 for complex models

• 3PL• multidimensional

173

Software

• Specialized software– IRT statistics– applicable to only a small number of models– generally not user friendly– error messages incomprehensible– developers helpful and friendly

174

Software

• Generalized software– can do many models– can be painfully slow

• 36 hours for cyclical model– often require programming equations


175

IRT Software

Software | Models
Winsteps | Rasch, RSM, PCM
Facets   | MRM, Rasch, RSM, PCM
Bilog-MG | 1PL, 2PL, 3PL
Multilog | GRM, NRM, 1-3PL
Parscale | GRM, RSM, PCM, 1-3PL
Testfact | Multidimensional
Noharm   | Multidimensional

176

General Software

Software | Models
Mplus    | 2PL, GRM, MDim, SEM
NLMIXED  | All, but slow and inaccurate
WinBUGS  | All, but slow