TRANSCRIPT
Introduction to Differential Item Functioning
American Board of Internal Medicine, Item Response Theory Course
Overview
• Detection of Differential Item Functioning (DIF):
– Distinguish "Bias" from "DIF."
– Test- vs. item-level bias.
• Revisit some test score equating issues.
Outline
• Review relevant assumptions.
• Concept of potentially biased test items (or "DIF," as we'll call it).
• IRT-based methods of detecting DIF.
Some Notes
• We will focus on:
– DIF with 0-1 test items; DIF with polytomous items is more complicated, though similar approaches are used.
– IRT methods only.
– DIF as a statistical issue; interpretation of "why?" can be quite a bit trickier.
Why study DIF?
• We're often interested in comparing cultural, ethnic, or gender groups.
• Meaningful comparisons require that measurement equivalence holds.
• Classical test theory methods confound "bias" with true mean differences; IRT methods do not.
• In IRT terminology, item/test "bias" is referred to as DIF/DTF.
Definition of DIF
• A test item is labeled with "DIF" when examinees with equal ability, but from different groups, have an unequal probability of item success.
• A test item is labeled as "non-DIF" if examinees having the same ability have an equal probability of getting the item correct, regardless of group membership.
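The definition above can be sketched numerically. A minimal illustration, assuming a standard 3PL item response function; the item parameters are made up for the example and do not come from any real test:

```python
import math

D = 1.7  # scaling constant commonly used in IRT

def icc(theta, a, b, c=0.0):
    """3PL item characteristic curve: probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: same discrimination in both groups, but the item is
# harder for the focal group -- the textbook picture of DIF.
p_ref = icc(0.0, a=1.0, b=0.0)   # reference-group probability at theta = 0
p_foc = icc(0.0, a=1.0, b=0.5)   # focal-group probability at theta = 0

print(round(p_ref, 3), round(p_foc, 3))  # equal ability, unequal P -> DIF
```

Matched on ability (θ = 0), the two groups have different success probabilities, so this item would be flagged.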
DIF versus DTF
• DIF is basically found by examining differences in ICCs across groups.
• Differential Test Functioning, or DTF, is the analogous procedure for determining differences in TCCs.
• DTF is arguably more important because it speaks to impact (DIF in one item may be significant, but might not have much practical impact on test results).
Classical Approach to DIF
• Compare item p-values in the two groups:
– Criticism: a p-value difference may be due to real and important group differences. Need to compare the p-value difference for one item in relation to the differences for other items.
– This provides a baseline for interpreting the difference, but item p-value differences will be confounded by discrimination differences.
Important Assumptions
• Unidimensionality.
• Parameter invariance.
• Violations of these assumptions across identifiable groups basically provide us with a definition of DIF.
Unidimensionality
• This assumption basically states that the test of interest measures ONE construct (e.g., math proficiency, verbal ability, trait, etc.).
• Group membership (e.g., gender, ethnicity, high-low ability) should not differentially impact success.
Unidimensionality
• If group membership impacts or explains performance, then the assumptions of the model are not being met.
• Basically, DIF = dimensionality.
Parameter Invariance
• Invariance of item parameters (a, b, c):
– Compare item statistics obtained in two or more groups (e.g., high- and low-performing groups; ethnicity; gender).
– If the model fits, different examinee samples should still produce close to the same item parameter estimates.
Parameter Invariance
• Invariance of item parameters (a, b, c):
– Scatterplots of b-b, a-a, and c-c estimates based on one sample of examinees versus the other should be strongly linearly related.
– The relationship won't be perfect: sampling error does enter the picture.
– Estimates far from the best-fit line represent a violation of invariance.
Parameter Invariance
[b-plot: scatterplot of b (Form B) against b (Form A), both axes running from -3 to 3; under invariance the points cluster along a straight line.]
Check Item Invariance
• If the test is unidimensional and parameters are invariant across groups, then even though there may be differences in the distributions of ability, similar (linearly related) results will be obtained for item parameter estimates.
[Figure: an ICC (probability of correct response vs. ability, -3 to 3) plotted above two relative-frequency distributions. Two groups may have different distributions for the trait being measured, but the same model should fit.]
[Figure: the same layout. Frequencies may differ, but matched-ability groups should have the same probability of success on the item.]
Check Item Invariance
• Differential Item Functioning (DIF).
• This violates parameter invariance because items aren't unidimensional.
DIF in Practice
• "Reference" versus "Focal" groups:
– Males – Females
– Whites – Blacks
– Majority – Minority
– English – ESL groups
• One example of DIF is "item drift" due to over- or under-emphasis of content measured over time.
May have to remove items: Parameter Drift
[Scatterplot: b (Form B) against b (Form A), both axes from -3 to 3; drifting items fall away from the best-fit line.]
DIF in Practice
• "Differential Item Functioning," NOT "item bias," is the term used.
• DIF is a statistical property, which states that matched-ability groups have differential probabilities of success on an item.
• Bias is a substantive interpretation of any DIF that may be observed.
DIF in Practice
• DIF studies are absolutely essential in testing programs with high stakes.
• Potential gender and/or ethnicity bias could negatively impact one or more groups in a construct-irrelevant way; we're only interested in θ!
Identifying DIF
• Uniform DIF: the item is systematically more difficult for members of one group, even after matching examinees on ability (θ).
• Cause: shift in the b-parameter.
Identifying DIF
• Non-uniform DIF: the shift in item difficulty is not consistent across the ability continuum.
• An increase/decrease in P for low-ability examinees is offset by the converse for high-ability examinees.
• Cause: shift in a (and possibly b).
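The two DIF types can be contrasted numerically. A sketch using a 2PL response function with hypothetical parameter shifts (all values invented for illustration):

```python
import math

D = 1.7

def icc(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Uniform DIF (b-shift): the focal group is disadvantaged at every theta.
uniform = [icc(t, 1.0, 0.0) - icc(t, 1.0, 0.5) for t in (-2, 0, 2)]

# Non-uniform DIF (a-shift): the sign of the difference flips across theta.
nonuniform = [icc(t, 1.5, 0.0) - icc(t, 0.7, 0.0) for t in (-2, 0, 2)]

print([round(d, 3) for d in uniform], [round(d, 3) for d in nonuniform])
```

The b-shift produces a difference with the same sign everywhere; the a-shift produces a difference that is negative for low-ability examinees and positive for high-ability examinees, exactly the offsetting pattern described above.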
Steps in a DIF Study
• Identify Reference and Focal groups of interest (usually two at a time).
• Design the DIF study to have samples as large as possible.
• Choose DIF statistics which are appropriate for the data.
• Carry out the statistical analyses.
• Interpret DIF statistics and delete items or make item changes as necessary.
The problem of sample size
• CAUTION: Don't assume that if DIF is found with statistical tests, there are definitely problems. With big samples, the power exists to detect even small conditional differences.
• The flip side is that with small samples, sufficient power may not exist to identify truly problematic items.
IRT DIF Analyses
• Combine data from both groups and obtain c-parameter estimates.
• Fix the c-parameters and obtain a- and b-parameter estimates in each group from separate calibrations.
• If a multivariate DIF test is done, two completely separate calibrations may be conducted (not always).
IRT DIF Analyses
• Before the ICCs in the two groups can be compared, they must be placed onto the same scale (equated). Typically, item parameter estimates from the Focal group (minority) are placed onto the scale of the Reference group (majority).
Mean & Sigma Equating
• After separate calibrations, determine the linear transformation that matches the mean and SD of anchor-item b-values across administrations:

x = σ(b,ref) / σ(b,foc)
y = μ(b,ref) − x · μ(b,foc)
Mean & Sigma Equating
• This transformation places the scale of item parameters from the Focal group onto the scale of item parameters from the Reference group:

x = σ(b,ref) / σ(b,foc)
y = μ(b,ref) − x · μ(b,foc)
Mean & Sigma with DIF
• All items comprise the "anchor test," as both groups took the same form.
• Treat the entire test as a block of linking items, determine the equating constants, x and y, and transform.
• Another transformation procedure, such as TCC or robust Mean & Sigma, could alternatively be utilized.
Mean & Sigma Example
• 25 items:
– Reference: μb = -0.06, σb = 1.03
– Focal: μb = -0.25, σb = 1.12
• x = 1.03 / 1.12 = 0.92
y = -0.06 − (0.92 × -0.25) = 0.17
• bnew = x · bold + y
• bnew = 0.92 bold + 0.17
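The arithmetic on this slide can be reproduced in a few lines (the summary statistics are taken directly from the slide):

```python
# Mean & Sigma equating constants from the slide's summary statistics.
mu_ref, sd_ref = -0.06, 1.03   # reference-group b-parameter mean / SD
mu_foc, sd_foc = -0.25, 1.12   # focal-group b-parameter mean / SD

x = sd_ref / sd_foc            # slope
y = mu_ref - x * mu_foc        # intercept

b_focal = -0.25                # an example focal-group b-value
b_new = x * b_focal + y        # placed on the reference scale

print(round(x, 2), round(y, 2), round(b_new, 2))
```

As a sanity check: transforming the focal-group mean b-value reproduces the reference-group mean, which is exactly what the Mean & Sigma method is constructed to do.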
M & S Transformation of Focal Group b-parameters
[Scatterplot: b (Focal, Form B) against b (Reference, Form A) with the fitted line bnew = x·bold + y = 0.92 bold + 0.17. The intercept (y) is above zero (i.e., the common items were easier). This indicates that the Focal group had higher ability than the Reference group.]
After item parameters are adjusted, the same transformation is applied throughout:

if θnew = x·θ + y, then
bnew = x·b + y
anew = a / x
cnew = c

…now the Focal examinees will look more able (as they should).

These transformations preserve the probability:

P(u = 1 | θ) = P(u = 1 | θnew)
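The probability-preservation claim can be checked directly. A sketch with arbitrary made-up equating constants and item parameters:

```python
import math

D = 1.7

def icc(theta, a, b, c):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical equating constants and item parameters.
x, y = 0.92, 0.17
theta, a, b, c = -0.4, 1.2, 0.3, 0.15

# Apply the slide's transformations to theta and the item parameters.
theta_new = x * theta + y
a_new = a / x
b_new = x * b + y
c_new = c

p_old = icc(theta, a, b, c)
p_new = icc(theta_new, a_new, b_new, c_new)
print(abs(p_old - p_new) < 1e-12)  # probabilities match
```

The algebra behind this: a_new·(θnew − bnew) = (a/x)·(xθ + y − xb − y) = a·(θ − b), so the logistic argument, and hence the probability, is unchanged.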
Sample Size Issue
• We might also not have enough data to estimate two calibrations (e.g., the Focal N may be small).
• One solution is to estimate the ICCs with just the majority group, score the minority group with those parameters, and examine their residuals against the model.
[Figure: probability of correct response plotted against ability (θ, -3 to 3). This shows a pattern of non-uniform DIF for the minority group.]
Important Note
• Group differences do NOT (necessarily) mean DIF (or bias, for that matter).
• Two groups can have differences in their means, and if the model fits, the same ICCs will be estimated.
Important Note
• The point here is that if, after being matched on ability, different ICCs occur, then the item displays DIF.
• If matched-ability examinees have the same probability of success on the item, then there is no DIF, even if one group is smarter than the other.
[Figure: ICC above two ability distributions. No DIF: despite group differences, one ICC fits for both groups.]
[Figure: two separate ICCs above two ability distributions. DIF: the item performs differentially even after matching on ability.]
IRT DIF Analyses
• After equating parameters, compare the items across groups using one of the methods we'll discuss.
• Optional: delete items that have large DIF statistics, re-run the analyses, and re-estimate the equating constants. Carefully review any flagged items for substantive interpretation.
Two-stage Methods
• Conduct a DIF analysis.
• Flag DIF items and temporarily remove them from the criterion (raw score, θ).
• Re-evaluate DIF with the new criterion.
DIF Detection Methods
• Mantel-Haenszel: condition on raw score; statistical test of contingency tables.
• Logistic regression: condition on raw score; model the group-response relationship.
• IRT methods: condition on ability (θ); compare item parameters or ICCs.
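As a sketch of the first method: the Mantel-Haenszel statistic pools 2×2 (group × right/wrong) tables across raw-score strata into a common odds ratio. The counts below are invented for illustration, and the −2.35 rescaling to the ETS delta metric is a common reporting convention:

```python
import math

# One 2x2 table per raw-score stratum k (hypothetical counts):
# (right_ref, wrong_ref, right_foc, wrong_foc)
strata = [
    (30, 20, 20, 30),
    (45, 15, 35, 25),
    (55,  5, 50, 10),
]

# Mantel-Haenszel common odds ratio.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den

# ETS delta metric: negative values indicate DIF against the focal group.
delta_mh = -2.35 * math.log(alpha_mh)

print(round(alpha_mh, 2), round(delta_mh, 2))
```

An odds ratio of 1 means no DIF; here the ratio is above 1, so the reference group is advantaged even after conditioning on raw score.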
IRT DIF Detection
• Compare item parameter estimates:
– Multivariate test (b, a, and c).
– t-tests on b-values.
• Area methods:
– Total area (e.g., Raju, 1988, 1990).
– Squared differences.
– Weighted areas and differences.
Parameter Comparisons
• H0: b1 = b2, a1 = a2, c1 = c2
• H1: b1 ≠ b2, a1 ≠ a2, c1 ≠ c2
Parameter Comparisons
• χ² test of parameter differences, after parameters have been equated:

χ² = (adiff, bdiff, cdiff)′ Σ⁻¹ (adiff, bdiff, cdiff)

• This info can be obtained from BILOG-MG.
• df = p, the number of parameters; sometimes done with a common c, which is left out (df = 2).
• Criticism: the asymptotic properties are known, but it's not clear how big a sample you need to use it!
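A sketch of this quadratic form with a common c (so the difference vector is just (adiff, bdiff) and df = 2). The parameter differences and their covariance matrix are invented for illustration; in practice they come from the calibration output:

```python
# Wald-type chi-square for (a, b) differences with a common c (df = 2).
a_diff = 1.10 - 0.85    # hypothetical equated a-estimates, ref - foc
b_diff = -0.20 - 0.15   # hypothetical equated b-estimates, ref - foc

cov = [[0.010, 0.002],
       [0.002, 0.020]]  # hypothetical covariance matrix of the differences

# Invert the 2x2 covariance matrix by hand.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
inv = [[ cov[1][1] / det, -cov[0][1] / det],
       [-cov[1][0] / det,  cov[0][0] / det]]

v = (a_diff, b_diff)
chi2 = sum(v[i] * inv[i][j] * v[j] for i in range(2) for j in range(2))
print(round(chi2, 2))  # compare against the chi-square critical value, df = 2
```

Here the statistic exceeds the 5% critical value for df = 2 (5.99), so this hypothetical item would be flagged.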
Parameter Comparisons
• t-test of the difference between b-parameters, after parameters have been equated.
• This is a very simple procedure which may be informative for identifying items that call for a closer look, but it is not common to rely solely on it.
• Doesn't account for the a- and c-parameters, which may vary even for a fixed b-value.
t-test for b-parameter

t_DIF = (b_ref − b_foc) / √(SE(b_ref)² + SE(b_foc)²)

• More useful in a Rasch situation, because there it is the same as the multivariate test!
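A minimal sketch of this statistic, with invented equated estimates and standard errors:

```python
import math

# Hypothetical equated b-estimates and their standard errors.
b_ref, se_ref = 0.40, 0.08
b_foc, se_foc = 0.72, 0.10

t_dif = (b_ref - b_foc) / math.sqrt(se_ref**2 + se_foc**2)
print(round(t_dif, 2))  # |t| > ~2 flags the item for a closer look
```

A negative t here means the item is harder for the focal group; note that this says nothing about the a- or c-parameters, which is exactly the limitation the slide mentions.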
t-test for b-parameter
• This may be done automatically for you in BILOG-MG, using a DIF command (see the Help menu).
• DIF is only assessed in terms of item difficulty (the b-parameter), with no consideration of non-uniform DIF possibilities.
IRT Area Methods
• These methods are more common, as they compare an ICC from one group against an ICC from the other and look at how much area is between the two… this doesn't depend on differences in parameters, but on actual differences in conditional probability.
IRT Area Methods
• The basic approach to these methods is to evaluate the amount of "space" between the two ICCs… the smaller, the better.
• Recall that if the item parameters were invariant across groups, the two ICCs would be coincident.
Calculating Area
• The area between the ICCs is defined as:

Area = Σk Δθ · |P_ref(θk) − P_foc(θk)|

where Δθ is the width of a quadrature node.
[Figure: P_ref(θ) and P_foc(θ) plotted against ability (θ), with the area between them divided into slices of width Δθ. The user must choose the range, [-3, 3], and the size of the interval (e.g., 0.01).]
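The quadrature sum above is straightforward to implement. A sketch assuming a 3PL response function and made-up item parameters; the grid range and step mirror the figure's choices:

```python
import math

D = 1.7

def icc(theta, a, b, c=0.0):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def unsigned_area(params_ref, params_foc, lo=-3.0, hi=3.0, step=0.01):
    """Sum |P_ref - P_foc| * step over a grid of quadrature nodes."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n + 1):
        theta = lo + k * step
        total += abs(icc(theta, *params_ref) - icc(theta, *params_foc)) * step
    return total

# Hypothetical item: a pure b-shift of 0.5 between groups.
area = unsigned_area((1.0, 0.0, 0.0), (1.0, 0.5, 0.0))
print(round(area, 2))
```

For a pure b-shift with c = 0, the area over the whole θ line equals the shift itself (0.5); truncating at [-3, 3] gives a value slightly below that.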
Calculating Area
• Closed-form solution (requires that a common c-parameter be fixed across calibrations):

Area = (1 − c) · | [2(a_foc − a_ref) / (D·a_ref·a_foc)] · ln(1 + e^{D·a_ref·a_foc·(b_foc − b_ref) / (a_foc − a_ref)}) − (b_foc − b_ref) |

• Otherwise, the area between the ICCs would be infinite!
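A sketch of this closed-form area computation with hypothetical equated parameters. When the discriminations are equal, the expression reduces to (1 − c)·|b_foc − b_ref|, so that case is handled separately:

```python
import math

D = 1.7

# Hypothetical equated parameter estimates with a common c.
a_ref, b_ref = 1.2, 0.0
a_foc, b_foc = 0.8, 0.4
c = 0.2

if a_ref == a_foc:
    # Equal slopes: the area is just the (attenuated) b-shift.
    area = (1 - c) * abs(b_foc - b_ref)
else:
    z = D * a_ref * a_foc * (b_foc - b_ref) / (a_foc - a_ref)
    area = (1 - c) * abs(
        2 * (a_foc - a_ref) / (D * a_ref * a_foc) * math.log(1 + math.exp(z))
        - (b_foc - b_ref)
    )

print(round(area, 3))
```

The (1 − c) factor is what keeps the area finite: with a common lower asymptote, the two curves coincide in the left tail.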
IRT Area Methods
• In practice, a very useful way to examine DIF is to actually construct the ICCs for both groups and visually inspect any differences.
• Let's go through an example of an actual DIF study to see how this would work…
DIF Example
• To demonstrate the graphical methods of DIF, we will now work through an example.
• Data from a single test were collected:
– Simulated data (for example purposes).
– 50 multiple-choice items total.
• All scored correct/incorrect.
– 10,000 examinees took the test.
– Two groups:
• 5,000 examinees per group.
• The data for each group can be found in the files group1.dat and group2.dat.
Analysis Steps
1. Run a 2PL model in BILOG on Group 1.
• Save the item and examinee parameters.
2. Run a 2PL model in BILOG on Group 2.
• Again, save the item and examinee parameters.
3. Load the estimated item parameter values into Excel.
• Find the parameters for all items for each exam.
Analysis Steps
4. Using the estimated b-parameters, compute:

x = σ(b,Group 1) / σ(b,Group 2)
y = μ(b,Group 1) − x · μ(b,Group 2)

5. Transform the a- and b-parameters from Group 2 onto the scale of the Group 1 calibration.
6. Reconstruct the ICCs for each group and visually inspect the slides.
• Use the macro in Excel.
DIF, DTF Interpretation
• DTF can be examined in the same way, only using a comparison of TCCs instead of ICCs.
• How much is too much?
– There is no significance test, so one common approach is to determine the amount of DIF present in simulated non-DIF data.
Simulated Data for Comparison
• Combine groups and estimate item and ability parameters.
• Use the item and person parameter estimates to simulate item response data; any DIF present will only be due to statistical artifact.
Simulated Data for Comparison
• For each item, determine the amount of DIF present, which will only be a result of sampling error.
• This amount of DIF becomes the cutoff point; any DIF greater and the item is flagged.
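The simulation step can be sketched as follows. The response function, item parameters, and sample size here are all made up for illustration; the key point is that one set of parameters generates the data for everyone, so any DIF detected afterward is pure sampling artifact:

```python
import math
import random

random.seed(1)
D = 1.7

def icc(theta, a, b, c):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical (a, b, c) estimates from a combined-group calibration.
items = [(1.0, -0.5, 0.20), (1.3, 0.3, 0.15), (0.8, 1.0, 0.25)]
thetas = [random.gauss(0, 1) for _ in range(1000)]  # simulated abilities

# Simulate 0/1 response data under the fitted model.
data = [[1 if random.random() < icc(t, a, b, c) else 0
         for (a, b, c) in items] for t in thetas]

p_values = [sum(row[j] for row in data) / len(data) for j in range(len(items))]
print([round(p, 2) for p in p_values])
```

Splitting this simulated sample into two arbitrary "groups" and re-running the DIF analysis then yields the baseline (sampling-error-only) amount of DIF used as the cutoff.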
Other Area Methods
• The absolute value of each difference is the most commonplace approach.
• Less common:
– Signed DIF – positives and negatives can cancel out (with non-uniform DIF).
– Squared DIF – less intuitive to interpret.
– Weighted DIF – the magnitudes of the differences are weighted by conditional sample sizes.
Practical Concerns
• When all is said and done, after an item has been identified as DIF, the question then becomes "why?"
• In the absence of interpretation, it is difficult to justify deleting an item.
*Example DIF against Females

Decoy : Duck :: _______ : _______
(A) Net : Butterfly
(B) Web : Spider
(C) Lure : Fish
(D) Lasso : Rope
(E) Detour : Shortcut

*from Holland & Wainer, 1993
Practical Concerns
• Some "essential" items (either due to content or statistical properties) may be left on the test, even with DIF.
• This is ameliorated by constructing forms balanced to include some other item or items which favor the focal group… make the TCCs match.
Conclusion
• Good reference: Holland & Wainer (Eds.) (1993), Differential Item Functioning.
• DIF is a heavily researched topic; there is no shortage of articles comparing methods (or creating new ones).
• The methods presented today are general but very useful.