TRANSCRIPT
Introduction to Differential Item Functioning
American Board of Internal Medicine, Item Response Theory Course
Overview
• Detection of Differential Item Functioning (DIF):
– Distinguish "Bias" from "DIF."
– Test- vs. item-level bias.
• Revisit some test score equating issues.
Outline
• Review relevant assumptions.
• Concept of potentially biased test items (or "DIF," as we'll call it).
• IRT-based methods of detecting DIF.
Some Notes
• We will focus on:
– DIF with 0-1 test items; DIF with polytomous items is more complicated, though similar approaches are used.
– IRT methods only.
– DIF as a statistical issue; interpretation of "why?" can be quite a bit trickier.
Why study DIF?
• We're often interested in comparing cultural, ethnic, or gender groups.
• Meaningful comparisons require that measurement equivalence holds.
• Classical test theory methods confound "bias" with true mean differences; IRT methods do not.
• In IRT terminology, item/test "bias" is referred to as DIF/DTF.
Definition of DIF
• A test item is labeled with "DIF" when examinees with equal ability, but from different groups, have an unequal probability of item success.
• A test item is labeled as "non-DIF" if examinees having the same ability have an equal probability of getting the item correct, regardless of group membership.
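The definition above can be sketched numerically. A minimal illustration, assuming a standard 3PL item response function; the item parameters are made up for the example and do not come from any real test:

```python
import math

D = 1.7  # scaling constant commonly used in IRT

def icc(theta, a, b, c=0.0):
    """3PL item characteristic curve: probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: same discrimination in both groups, but the item is
# harder for the focal group -- the textbook picture of DIF.
p_ref = icc(0.0, a=1.0, b=0.0)   # reference-group probability at theta = 0
p_foc = icc(0.0, a=1.0, b=0.5)   # focal-group probability at theta = 0

print(round(p_ref, 3), round(p_foc, 3))  # equal ability, unequal P -> DIF
```

Matched on ability (θ = 0), the two groups have different success probabilities, so this item would be flagged.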
DIF versus DTF
• DIF is basically found by examining differences in ICCs across groups.
• Differential Test Functioning, or DTF, is the analogous procedure for determining differences in TCCs.
• DTF is arguably more important because it speaks to impact (DIF in one item may be significant, but might not have much practical impact on test results).
Classical Approach to DIF
• Compare item p-values in the two groups:
– Criticism: a p-value difference may be due to real and important group differences. Need to compare the p-value difference for one item in relation to the differences for other items.
– This provides a baseline for interpreting the difference, but item p-value differences will be confounded by discrimination differences.
Important Assumptions
• Unidimensionality.
• Parameter invariance.
• Violations of these assumptions across identifiable groups basically provide us with a definition of DIF.
Unidimensionality
• This assumption basically states that the test of interest measures ONE construct (e.g., math proficiency, verbal ability, trait, etc.).
• Group membership (e.g., gender, ethnicity, high-low ability) should not differentially impact success.
Unidimensionality
• If group membership impacts or explains performance, then the assumptions of the model are not being met.
• Basically, DIF = dimensionality.
Parameter Invariance
• Invariance of item parameters (a, b, c):
– Compare item statistics obtained in two or more groups (e.g., high- and low-performing groups; ethnicity; gender).
– If the model fits, different examinee samples should still produce close to the same item parameter estimates.
Parameter Invariance
• Invariance of item parameters (a, b, c):
– Scatterplots of b-b, a-a, and c-c estimates based on one sample of examinees versus the other should be strongly linearly related.
– The relationship won't be perfect: sampling error does enter the picture.
– Estimates far from the best-fit line represent a violation of invariance.
Parameter Invariance
[b-plot: scatterplot of b (Form B) against b (Form A), both axes running from -3 to 3; under invariance the points cluster along a straight line.]
Check Item Invariance
• If the test is unidimensional and parameters are invariant across groups, then even though there may be differences in the distributions of ability, similar (linearly related) results will be obtained for item parameter estimates.
[Figure: an ICC (probability of correct response vs. ability, -3 to 3) plotted above two relative-frequency distributions. Two groups may have different distributions for the trait being measured, but the same model should fit.]
[Figure: the same layout. Frequencies may differ, but matched-ability groups should have the same probability of success on the item.]
Check Item Invariance
• Differential Item Functioning (DIF).
• This violates parameter invariance because items aren't unidimensional.
DIF in Practice
• "Reference" versus "Focal" groups:
– Males – Females
– Whites – Blacks
– Majority – Minority
– English – ESL groups
• One example of DIF is "item drift" due to over- or under-emphasis of content measured over time.
May have to remove items: Parameter Drift
[Scatterplot: b (Form B) against b (Form A), both axes from -3 to 3; drifting items fall away from the best-fit line.]
DIF in Practice
• "Differential Item Functioning," NOT "item bias," is the term used.
• DIF is a statistical property, which states that matched-ability groups have differential probabilities of success on an item.
• Bias is a substantive interpretation of any DIF that may be observed.
DIF in Practice
• DIF studies are absolutely essential in testing programs with high stakes.
• Potential gender and/or ethnicity bias could negatively impact one or more groups in a construct-irrelevant way; we're only interested in θ!
Identifying DIF
• Uniform DIF: the item is systematically more difficult for members of one group, even after matching examinees on ability (θ).
• Cause: shift in the b-parameter.
Identifying DIF
• Non-uniform DIF: the shift in item difficulty is not consistent across the ability continuum.
• An increase/decrease in P for low-ability examinees is offset by the converse for high-ability examinees.
• Cause: shift in a (and possibly b).
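The two DIF types can be contrasted numerically. A sketch using a 2PL response function with hypothetical parameter shifts (all values invented for illustration):

```python
import math

D = 1.7

def icc(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Uniform DIF (b-shift): the focal group is disadvantaged at every theta.
uniform = [icc(t, 1.0, 0.0) - icc(t, 1.0, 0.5) for t in (-2, 0, 2)]

# Non-uniform DIF (a-shift): the sign of the difference flips across theta.
nonuniform = [icc(t, 1.5, 0.0) - icc(t, 0.7, 0.0) for t in (-2, 0, 2)]

print([round(d, 3) for d in uniform], [round(d, 3) for d in nonuniform])
```

The b-shift produces a difference with the same sign everywhere; the a-shift produces a difference that is negative for low-ability examinees and positive for high-ability examinees, exactly the offsetting pattern described above.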
Steps in a DIF Study
• Identify Reference and Focal groups of interest (usually two at a time).
• Design the DIF study to have samples as large as possible.
• Choose DIF statistics which are appropriate for the data.
• Carry out the statistical analyses.
• Interpret DIF statistics and delete items or make item changes as necessary.
The problem of sample size
• CAUTION: Don't assume that if DIF is found with statistical tests, there are definitely problems. With big samples, the power exists to detect even small conditional differences.
• The flip side is that with small samples, sufficient power may not exist to identify truly problematic items.
IRT DIF Analyses
• Combine data from both groups and obtain c-parameter estimates.
• Fix the c-parameters and obtain a- and b-parameter estimates in each group from separate calibrations.
• If a multivariate DIF test is done, two completely separate calibrations may be conducted (not always).
IRT DIF Analyses
• Before the ICCs in the two groups can be compared, they must be placed onto the same scale (equated). Typically, item parameter estimates from the Focal group (minority) are placed onto the scale of the Reference group (majority).
Mean & Sigma Equating
• After separate calibrations, determine the linear transformation that matches the mean and SD of anchor-item b-values across administrations:

x = σ(b,ref) / σ(b,foc)
y = μ(b,ref) − x · μ(b,foc)
Mean & Sigma Equating
• This transformation places the scale of item parameters from the Focal group onto the scale of item parameters from the Reference group:

x = σ(b,ref) / σ(b,foc)
y = μ(b,ref) − x · μ(b,foc)
Mean & Sigma with DIF
• All items comprise the "anchor test," as both groups took the same form.
• Treat the entire test as a block of linking items, determine the equating constants, x and y, and transform.
• Another transformation procedure, such as TCC or robust Mean & Sigma, could alternatively be utilized.
Mean & Sigma Example
• 25 items:
– Reference: μb = -0.06, σb = 1.03
– Focal: μb = -0.25, σb = 1.12
• x = 1.03 / 1.12 = 0.92
y = -0.06 − (0.92 × -0.25) = 0.17
• bnew = x · bold + y
• bnew = 0.92 bold + 0.17
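The arithmetic on this slide can be reproduced in a few lines (the summary statistics are taken directly from the slide):

```python
# Mean & Sigma equating constants from the slide's summary statistics.
mu_ref, sd_ref = -0.06, 1.03   # reference-group b-parameter mean / SD
mu_foc, sd_foc = -0.25, 1.12   # focal-group b-parameter mean / SD

x = sd_ref / sd_foc            # slope
y = mu_ref - x * mu_foc        # intercept

b_focal = -0.25                # an example focal-group b-value
b_new = x * b_focal + y        # placed on the reference scale

print(round(x, 2), round(y, 2), round(b_new, 2))
```

As a sanity check: transforming the focal-group mean b-value reproduces the reference-group mean, which is exactly what the Mean & Sigma method is constructed to do.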
M & S Transformation of Focal Group b-parameters
[Scatterplot: b (Focal, Form B) against b (Reference, Form A) with the fitted line bnew = x·bold + y = 0.92 bold + 0.17. The intercept (y) is above zero (i.e., the common items were easier). This indicates that the Focal group had higher ability than the Reference group.]
After item parameters are adjusted, the same transformation is applied throughout:

if θnew = x·θ + y, then
bnew = x·b + y
anew = a / x
cnew = c

…now the Focal examinees will look more able (as they should).

These transformations preserve the probability:

P(u = 1 | θ) = P(u = 1 | θnew)
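The probability-preservation claim can be checked directly. A sketch with arbitrary made-up equating constants and item parameters:

```python
import math

D = 1.7

def icc(theta, a, b, c):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical equating constants and item parameters.
x, y = 0.92, 0.17
theta, a, b, c = -0.4, 1.2, 0.3, 0.15

# Apply the slide's transformations to theta and the item parameters.
theta_new = x * theta + y
a_new = a / x
b_new = x * b + y
c_new = c

p_old = icc(theta, a, b, c)
p_new = icc(theta_new, a_new, b_new, c_new)
print(abs(p_old - p_new) < 1e-12)  # probabilities match
```

The algebra behind this: a_new·(θnew − bnew) = (a/x)·(xθ + y − xb − y) = a·(θ − b), so the logistic argument, and hence the probability, is unchanged.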
Sample Size Issue
• We might also not have enough data to estimate two calibrations (e.g., the Focal N may be small).
• One solution is to estimate the ICCs with just the majority group, score the minority group with those parameters, and examine their residuals against the model.
[Figure: probability of correct response plotted against ability (θ, -3 to 3). This shows a pattern of non-uniform DIF for the minority group.]
Important Note
• Group differences do NOT (necessarily) mean DIF (or bias, for that matter).
• Two groups can have differences in their means, and if the model fits, the same ICCs will be estimated.
Important Note
• The point here is that if, after being matched on ability, different ICCs occur, then the item displays DIF.
• If matched-ability examinees have the same probability of success on the item, then there is no DIF, even if one group is smarter than the other.
[Figure: ICC above two ability distributions. No DIF: despite group differences, one ICC fits for both groups.]
[Figure: two separate ICCs above two ability distributions. DIF: the item performs differentially even after matching on ability.]
IRT DIF Analyses
• After equating parameters, compare the items across groups using one of the methods we'll discuss.
• Optional: delete items that have large DIF statistics, re-run the analyses, and re-estimate the equating constants. Carefully review any flagged items for substantive interpretation.
Two-stage Methods
• Conduct a DIF analysis.
• Flag DIF items and temporarily remove them from the criterion (raw score, θ).
• Re-evaluate DIF with the new criterion.
DIF Detection Methods
• Mantel-Haenszel: condition on raw score; statistical test of contingency tables.
• Logistic regression: condition on raw score; model the group-response relationship.
• IRT methods: condition on ability (θ); compare item parameters or ICCs.
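As a sketch of the first method: the Mantel-Haenszel statistic pools 2×2 (group × right/wrong) tables across raw-score strata into a common odds ratio. The counts below are invented for illustration, and the −2.35 rescaling to the ETS delta metric is a common reporting convention:

```python
import math

# One 2x2 table per raw-score stratum k (hypothetical counts):
# (right_ref, wrong_ref, right_foc, wrong_foc)
strata = [
    (30, 20, 20, 30),
    (45, 15, 35, 25),
    (55,  5, 50, 10),
]

# Mantel-Haenszel common odds ratio.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den

# ETS delta metric: negative values indicate DIF against the focal group.
delta_mh = -2.35 * math.log(alpha_mh)

print(round(alpha_mh, 2), round(delta_mh, 2))
```

An odds ratio of 1 means no DIF; here the ratio is above 1, so the reference group is advantaged even after conditioning on raw score.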
IRT DIF Detection
• Compare item parameter estimates:
– Multivariate test (b, a, and c).
– t-tests on b-values.
• Area methods:
– Total area (e.g., Raju, 1988, 1990).
– Squared differences.
– Weighted areas and differences.
Parameter Comparisons
• H0: b1 = b2, a1 = a2, c1 = c2
• H1: b1 ≠ b2, a1 ≠ a2, c1 ≠ c2
Parameter Comparisons
• χ² test of parameter differences, after parameters have been equated:

χ² = (adiff, bdiff, cdiff)′ Σ⁻¹ (adiff, bdiff, cdiff)

• This info can be obtained from BILOG-MG.
• df = p, the number of parameters; sometimes done with a common c, which is left out (df = 2).
• Criticism: the asymptotic properties are known, but it's not clear how big a sample you need to use it!
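A sketch of this quadratic form with a common c (so the difference vector is just (adiff, bdiff) and df = 2). The parameter differences and their covariance matrix are invented for illustration; in practice they come from the calibration output:

```python
# Wald-type chi-square for (a, b) differences with a common c (df = 2).
a_diff = 1.10 - 0.85    # hypothetical equated a-estimates, ref - foc
b_diff = -0.20 - 0.15   # hypothetical equated b-estimates, ref - foc

cov = [[0.010, 0.002],
       [0.002, 0.020]]  # hypothetical covariance matrix of the differences

# Invert the 2x2 covariance matrix by hand.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
inv = [[ cov[1][1] / det, -cov[0][1] / det],
       [-cov[1][0] / det,  cov[0][0] / det]]

v = (a_diff, b_diff)
chi2 = sum(v[i] * inv[i][j] * v[j] for i in range(2) for j in range(2))
print(round(chi2, 2))  # compare against the chi-square critical value, df = 2
```

Here the statistic exceeds the 5% critical value for df = 2 (5.99), so this hypothetical item would be flagged.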
Parameter Comparisons
• t-test of the difference between b-parameters, after parameters have been equated.
• This is a very simple procedure which may be informative for identifying items that call for a closer look, but it is not common to rely solely on it.
• Doesn't account for the a- and c-parameters, which may vary even for a fixed b-value.
t-test for b-parameter

t_DIF = (b_ref − b_foc) / √(SE(b_ref)² + SE(b_foc)²)

• More useful in a Rasch situation, because there it is the same as the multivariate test!
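A minimal sketch of this statistic, with invented equated estimates and standard errors:

```python
import math

# Hypothetical equated b-estimates and their standard errors.
b_ref, se_ref = 0.40, 0.08
b_foc, se_foc = 0.72, 0.10

t_dif = (b_ref - b_foc) / math.sqrt(se_ref**2 + se_foc**2)
print(round(t_dif, 2))  # |t| > ~2 flags the item for a closer look
```

A negative t here means the item is harder for the focal group; note that this says nothing about the a- or c-parameters, which is exactly the limitation the slide mentions.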
t-test for b-parameter
• This may be done automatically for you in BILOG-MG, using a DIF command (see the Help menu).
• DIF is only assessed in terms of item difficulty (the b-parameter), with no consideration of non-uniform DIF possibilities.
IRT Area Methods
• These methods are more common, as they compare an ICC from one group against an ICC from the other and look at how much area is between the two… this doesn't depend on differences in parameters, but on actual differences in conditional probability.
IRT Area Methods
• The basic approach to these methods is to evaluate the amount of "space" between the two ICCs… the smaller, the better.
• Recall that if the item parameters were invariant across groups, the two ICCs would be coincident.
Calculating Area
• The area between the ICCs is defined as:

Area = Σk Δθ · |P_ref(θk) − P_foc(θk)|

where Δθ is the width of a quadrature node.
[Figure: P_ref(θ) and P_foc(θ) plotted against ability (θ), with the area between them divided into slices of width Δθ. The user must choose the range, [-3, 3], and the size of the interval (e.g., 0.01).]
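The quadrature sum above is straightforward to implement. A sketch assuming a 3PL response function and made-up item parameters; the grid range and step mirror the figure's choices:

```python
import math

D = 1.7

def icc(theta, a, b, c=0.0):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def unsigned_area(params_ref, params_foc, lo=-3.0, hi=3.0, step=0.01):
    """Sum |P_ref - P_foc| * step over a grid of quadrature nodes."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n + 1):
        theta = lo + k * step
        total += abs(icc(theta, *params_ref) - icc(theta, *params_foc)) * step
    return total

# Hypothetical item: a pure b-shift of 0.5 between groups.
area = unsigned_area((1.0, 0.0, 0.0), (1.0, 0.5, 0.0))
print(round(area, 2))
```

For a pure b-shift with c = 0, the area over the whole θ line equals the shift itself (0.5); truncating at [-3, 3] gives a value slightly below that.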
Calculating Area
• Closed-form solution (requires that a common c-parameter be fixed across calibrations):

Area = (1 − c) · | [2(a_foc − a_ref) / (D·a_ref·a_foc)] · ln(1 + e^{D·a_ref·a_foc·(b_foc − b_ref) / (a_foc − a_ref)}) − (b_foc − b_ref) |

• Otherwise, the area between the ICCs would be infinite!
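A sketch of this closed-form area computation with hypothetical equated parameters. When the discriminations are equal, the expression reduces to (1 − c)·|b_foc − b_ref|, so that case is handled separately:

```python
import math

D = 1.7

# Hypothetical equated parameter estimates with a common c.
a_ref, b_ref = 1.2, 0.0
a_foc, b_foc = 0.8, 0.4
c = 0.2

if a_ref == a_foc:
    # Equal slopes: the area is just the (attenuated) b-shift.
    area = (1 - c) * abs(b_foc - b_ref)
else:
    z = D * a_ref * a_foc * (b_foc - b_ref) / (a_foc - a_ref)
    area = (1 - c) * abs(
        2 * (a_foc - a_ref) / (D * a_ref * a_foc) * math.log(1 + math.exp(z))
        - (b_foc - b_ref)
    )

print(round(area, 3))
```

The (1 − c) factor is what keeps the area finite: with a common lower asymptote, the two curves coincide in the left tail.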
IRT Area Methods
• In practice, a very useful way to examine DIF is to actually construct the ICCs for both groups and visually inspect any differences.
• Let's go through an example of an actual DIF study to see how this would work…
DIF Example
• To demonstrate the graphical methods of DIF, we will now work through an example.
• Data from a single test were collected:
– Simulated data (for example purposes).
– 50 multiple-choice items total.
• All scored correct/incorrect.
– 10,000 examinees took the test.
– Two groups:
• 5,000 examinees per group.
• The data for each group can be found in the files group1.dat and group2.dat.
Analysis Steps
1. Run a 2PL model in BILOG on Group 1.
• Save the item and examinee parameters.
2. Run a 2PL model in BILOG on Group 2.
• Again, save the item and examinee parameters.
3. Load the estimated item parameter values into Excel.
• Find the parameters for all items for each exam.
Analysis Steps
4. Using the estimated b-parameters, compute:

x = σ(b,Group 1) / σ(b,Group 2)
y = μ(b,Group 1) − x · μ(b,Group 2)

5. Transform the a- and b-parameters from Group 2 onto the scale of the Group 1 calibration.
6. Reconstruct the ICCs for each group and visually inspect the slides.
• Use the macro in Excel.
DIF, DTF Interpretation
• DTF can be examined in the same way, only using a comparison of TCCs instead of ICCs.
• How much is too much?
– There is no significance test, so one common approach is to determine the amount of DIF present in simulated non-DIF data.
Simulated Data for Comparison
• Combine groups and estimate item and ability parameters.
• Use the item and person parameter estimates to simulate item response data; any DIF present will only be due to statistical artifact.
Simulated Data for Comparison
• For each item, determine the amount of DIF present, which will only be a result of sampling error.
• This amount of DIF becomes the cutoff point; any DIF greater and the item is flagged.
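The simulation step can be sketched as follows. The response function, item parameters, and sample size here are all made up for illustration; the key point is that one set of parameters generates the data for everyone, so any DIF detected afterward is pure sampling artifact:

```python
import math
import random

random.seed(1)
D = 1.7

def icc(theta, a, b, c):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical (a, b, c) estimates from a combined-group calibration.
items = [(1.0, -0.5, 0.20), (1.3, 0.3, 0.15), (0.8, 1.0, 0.25)]
thetas = [random.gauss(0, 1) for _ in range(1000)]  # simulated abilities

# Simulate 0/1 response data under the fitted model.
data = [[1 if random.random() < icc(t, a, b, c) else 0
         for (a, b, c) in items] for t in thetas]

p_values = [sum(row[j] for row in data) / len(data) for j in range(len(items))]
print([round(p, 2) for p in p_values])
```

Splitting this simulated sample into two arbitrary "groups" and re-running the DIF analysis then yields the baseline (sampling-error-only) amount of DIF used as the cutoff.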
Other Area Methods
• The absolute value of each difference is the most commonplace approach.
• Less common:
– Signed DIF – positives and negatives can cancel out (with non-uniform DIF).
– Squared DIF – less intuitive to interpret.
– Weighted DIF – the magnitudes of the differences are weighted by conditional sample sizes.
Practical Concerns
• When all is said and done, after an item has been identified as DIF, the question then becomes "why?"
• In the absence of interpretation, it is difficult to justify deleting an item.
*Example DIF against Females

Decoy : Duck :: _______ : _______
(A) Net : Butterfly
(B) Web : Spider
(C) Lure : Fish
(D) Lasso : Rope
(E) Detour : Shortcut

*from Holland & Wainer, 1993
Practical Concerns
• Some "essential" items (either due to content or statistical properties) may be left on the test, even with DIF.
• This is ameliorated by constructing forms balanced to include some other item or items which favor the focal group… make the TCCs match.
Conclusion
• Good reference: Holland & Wainer (Eds.) (1993), Differential Item Functioning.
• DIF is a heavily researched topic; there is no shortage of articles comparing methods (or creating new ones).
• The methods presented today are general but very useful.