
Page 1: Cross-Grade Scales in NAEP:  Research and Real-Life Experience

Copyright © 2005 Educational Testing Service

Listening. Learning. Leading.

Cross-Grade Scales in NAEP: Research and Real-Life Experience

Catherine A. McClellan, John R. Donoghue, Lydia Gladkova, & Xueli Xu

Page 2: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Measurement invariance

• One key idea in all types of modeling discussed here is that of invariance

• Many of the thorny assessment problems we face require an assumption of invariance

• In order to do modeling across grade levels, across time, across groups of people, and across scorers, something, somewhere must be assumed to be invariant – and usually it is some aspect of construct invariance

Page 3: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


It’s all about the construct…

• Cross-grade scaling needs construct invariance across ages/grades

• Differential item functioning (DIF) needs construct invariance across groups

• Trend measurement needs construct invariance across time

• Constructed-response (CR) item scoring needs rater invariance in interpreting the construct as reflected in the item and rubric

Page 4: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


There are a couple of other invariance areas to watch

• Design invariance – the assessment design should not change without careful study of the impact
  – Ink color matters! (particularly in reading)
  – Context matters – what items and subject matter appear with (particularly before) others matters

• Analysis invariance – changes in analysis methodology can introduce artifactual changes in results and should be carefully evaluated before implementation

Page 5: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Cross-grade scales

• Cross-grade scales in NAEP must measure the growth between pairs of grade levels while maintaining the trend line for each grade

• Assessment design must meet these constraints:
  – non-adjacent grades assessed (4, 8, and 12)
  – use of IRT methodology to link grades and trend points
  – the necessity of trend measurement
  – item release and replacement
  – years with missing grades

Page 6: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Differential Item Functioning (DIF)

• DIF requires construct invariance across groups of students defined by some known variable (race, gender, parental education, SES, etc.)

• Cross-grade scaling issues can be viewed as age (grade) DIF

Page 7: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Trend issues

• Items are assumed to function the same way across time
  – Item parameter drift is a threat – societal changes and scientific discoveries can alter item functioning

– Most marginal estimation procedures are sample-dependent

– Sets of items that refer to common stimulus materials are prone to dependence, and the structure of the dependence can change over time

Page 8: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Constructed response scoring issues

• CR items must be scored the same way
  – Rater change (or drift) corrupts trend measures
  – Often can’t get the same raters; even if they are the same people, they have changed
  – Training may differ, especially if the trainer is not the same
  – Historical events (state initiatives, etc.) may change how raters perceive items and may even introduce new correct responses or remove previously correct responses

– Scoring may differ across grade levels


Page 10: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


• The Ugly – Long-Term Trend Writing

• The Bad – US History and Geography

• The Good – Reading

Page 11: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


The Ugly: NAEP long-term trend writing

• Originally designed in 1984
• 6 writing prompts in 2 disjoint sets (4/2)
  – Each student receives 1-4 prompts
  – Scored according to primary trait rubrics
  – Scores are on a four-point scale, 0-3
• In 1986, there were problems with scoring
  – Items were declared non-trend and 1984 responses were rescored in 1986
  – 1986 then became the base year for the trend
• Items continued in the same form through 1999

Page 12: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


IRT scaling with LTT writing

• IRT scaling (GPCM; Muraki, 1992) introduced in 1992 (the model is sketched below)

• 1992, 1990, 1988, and 1984 data calibrated simultaneously

• NAEP marginal estimation and plausible values technology used to produce trend results

• For each new wave of data, adjacent pairs of years (i.e., current and previous) are scaled together to place the current assessment onto the reporting scale
  – Applied in 1994, 1996, and 1999
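For concreteness, a minimal sketch of the GPCM category probabilities (Muraki, 1992) underlying this scaling; the function and parameter values are illustrative, not NAEP's implementation:

```python
import numpy as np

def gpcm_probabilities(theta, a, b):
    """Generalized Partial Credit Model (Muraki, 1992).

    Returns P(X = k | theta) for k = 0..m, where `a` is the item
    discrimination and `b` holds the m step parameters; the k = 0
    term of the category logit is fixed at 0 by convention.
    """
    # Cumulative sums of a * (theta - b_v) define the category logits
    steps = np.concatenate(([0.0], a * (theta - np.asarray(b))))
    logits = np.cumsum(steps)
    expz = np.exp(logits - logits.max())   # stabilize the softmax
    return expz / expz.sum()

# Example: a 0-3 primary-trait score (four categories, three steps)
print(gpcm_probabilities(theta=0.5, a=1.2, b=[-1.0, 0.2, 1.5]))
```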

Page 13: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Cross-year invariance issues in 1999

• Basis of trend—assumption that items function identically across time

• In 1999, the re-score data and plots raised concerns about whether the assessment data supported this assumption

• Cross-year drift essentially “splits” an item into two separate items, one as rated in each year

• Creating two items from one can be done in analysis, as sketched below
  – Requires judgment, as there is currently no valid statistical test for this type of misfit
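A minimal pandas sketch of such an item split (the data layout and names are hypothetical; NAEP's actual machinery differs): responses before the suspected drift point keep one version of the item, later responses move to a second version, and the unused version is left missing so each gets its own parameters in calibration.

```python
import pandas as pd

def split_item(df, item, split_year, year_col="year"):
    """Split one item into pre/post-drift versions at `split_year`.

    Early responses stay in `<item>_pre`; later ones move to
    `<item>_post`.  The other version is left missing (treated as
    not presented), so each version is calibrated separately.
    """
    early = df[year_col] < split_year
    df[f"{item}_pre"] = df[item].where(early)    # NaN in later years
    df[f"{item}_post"] = df[item].where(~early)  # NaN in earlier years
    return df.drop(columns=[item])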

Page 14: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Item effects

• Recall that there were a small number of prompts in the assessment and that the design was weakly linked across items

• Overall trend and results proved to be sensitive to decisions made about a single item

• Simultaneous calibration of all years was less sensitive, but still showed the same effect

Page 15: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


So now what?

• The alternatives seemed to be:
  – Report the 1999 IRT-based results as they stood
  – Do an alternative, non-IRT analysis to further evaluate the situation and possibly use it for the reported results
    • Account for rater effects
    • Incorporate sources of error
    • Develop standard errors that reflect these sources of error

• It was decided to pursue the non-IRT analyses

Page 16: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Accounting for rater effects - 1

• In 1988 rater drift was noted and a portion of 1984 papers were rescored, so 1988 became the official base year for subsequent scoring

• In pursuing the non-IRT analyses in 1999, data from all assessment years subsequent to 1988 (1990, 1992, 1994, 1996, and 1999) were analyzed

• A small number (230-500) of 1988 papers were rescored as part of the current assessment’s scoring

• These 1988 papers were used to estimate and remove the rater drift effect in subsequent years

Page 17: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Accounting for rater effects - 2

1. Form the rescore data table with 1988 scores as the rows, 1999 scores as the columns
2. Compute the conditional probability (formula below)
3. Take multiple draws from this posterior (sketched in code below)
4. Analyze as if regular student scores
5. Repeat the analysis on each set of draws to yield an estimate of the uncertainty due to imputation

$$P(X_{88} = j \mid X_{99} = k) \;=\; \frac{P(X_{88} = j,\, X_{99} = k)}{P(X_{99} = k)}$$
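A sketch of steps 2-3 (names illustrative): estimate the conditional probabilities from the rescore cross-table and draw plausible 1988-style scores for each current-year response. Note that an empty 1999 score column leaves the conditional undefined, which is exactly the gap problem raised on the following slides.

```python
import numpy as np

rng = np.random.default_rng(1999)

def impute_1988_scores(rescore_table, x99_scores, n_draws=5):
    """Estimate P(X88 = j | X99 = k) from the rescore cross-table
    (rows = 1988 scores, columns = 1999 scores) and draw plausible
    1988-style scores given each student's 1999 score.

    Returns an (n_draws, n_students) array of imputed scores.
    """
    table = np.asarray(rescore_table, dtype=float)
    # Column-normalize: P(X88 = j | X99 = k); empty columns yield NaN
    cond = table / table.sum(axis=0, keepdims=True)
    n_cat = table.shape[0]
    draws = np.empty((n_draws, len(x99_scores)), dtype=int)
    for d in range(n_draws):
        for s, k in enumerate(x99_scores):
            draws[d, s] = rng.choice(n_cat, p=cond[:, k])
    return draws
```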

Page 18: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Potential concerns

• Tables were based on small samples, so estimates of $P(X_{88} = j \mid X_{99} = k)$ were likely to be unstable

• Rescore table data had some (significant) gaps in some years
  – No scores in the highest score level for some tables
  – No exact agreement for some tables

Page 19: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Smoothing (Part 1)

• Deal with variability using a smoothing procedure on the rescore table, then draw values from the smoothed table

• Loglinear smoothing (Holland & Thayer, 1998) was applied (a sketch follows below)
  – This method preserves the moments of the margins and the correlation
  – Margins of the original tables were preserved exactly

• Results indicated poor model-data agreement
  – For age 9, 4-point items, 14 of 30 tables yielded significant log-likelihood chi-square values
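A hedged sketch of loglinear smoothing in the spirit of Holland & Thayer (1998): Poisson regression with full row and column effects (which reproduces the observed margins exactly) plus a single cross-moment term (which preserves the covariance, hence the correlation). The model specification here is illustrative; empty margins are exactly where such fits struggle, consistent with the poor fit noted above.

```python
import numpy as np
import statsmodels.api as sm

def loglinear_smooth(table):
    """Fit log m_ij = row_i + col_j + beta * (i * j) by Poisson
    regression.  Row/column indicator terms reproduce the observed
    margins exactly; the cross-moment term preserves the covariance
    (hence the correlation) of the two score variables.
    """
    n_r, n_c = table.shape
    i, j = np.meshgrid(np.arange(n_r), np.arange(n_c), indexing="ij")
    i, j = i.ravel(), j.ravel()
    y = np.asarray(table, dtype=float).ravel()
    X = np.column_stack([
        np.eye(n_r)[i],             # row indicators (carry the intercept)
        np.eye(n_c)[j][:, 1:],      # column indicators, one dropped
        (i * j)[:, None],           # cross-moment term
    ])
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    return fit.fittedvalues.reshape(n_r, n_c)
```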

Page 20: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Other concerns

• The empty diagonal cells and empty margins had to be dealt with

• Solution chosen was to insert a single observation into the table (a sketch follows below)
  – The original cells were all multiplied by (N-1)/N to maintain the overall N

• This preserves important aspects of the table:
  – Percent exact agreement
  – Mean difference of (current year - 1988)
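A minimal sketch of the pre-smoothing step. The slides do not say where the single observation was placed, and the placement is what determines the preservation properties listed above; as one reading, this version spreads it uniformly over the empty cells and rescales the original cells by (N-1)/N.

```python
import numpy as np

def presmooth_insert_one(table):
    """Insert one pseudo-observation while keeping the total count N.

    Placement is an assumption: here the observation is spread
    uniformly over the empty cells, and the original cells are scaled
    by (N - 1) / N so the table still sums to N.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    empty = table == 0
    if not empty.any():
        return table.copy()
    out = table * (n - 1.0) / n
    out[empty] = 1.0 / empty.sum()   # the inserted pseudo-observation
    return out
```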

Page 21: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Smoothing (Part 2)

• Pre-smoothed tables input to loglinear smoothing procedure

• Fit was better than with the un-pre-smoothed data, but there were still some questionable cases

• We tried using a Bayesian method (Fienberg & Holland, 1970) to form a weighted combination of the two tables (sketched below)

• These tables were used to compute the conditional probabilities to draw imputations
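A sketch of the weighted combination in the spirit of Fienberg & Holland (1970): shrink the observed cell proportions toward the smoothed table. The default weight rule here is a placeholder; the paper derives data-dependent choices of the prior weight K.

```python
import numpy as np

def pseudo_bayes_table(observed, smoothed, K=None):
    """Weighted combination of the observed and smoothed tables,
    in the spirit of Fienberg & Holland (1970).

    K acts as a prior 'sample size'; K = N (the default used here)
    weights the two tables equally and is purely illustrative.
    """
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    p_obs = observed / n
    p_smooth = np.asarray(smoothed, dtype=float) / np.sum(smoothed)
    if K is None:
        K = n
    w = n / (n + K)   # weight on the observed proportions
    return w * p_obs + (1.0 - w) * p_smooth
```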

Page 22: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Quantifying uncertainty

• Usual sources of uncertainty
  – Sampling of PSUs, schools, & students
  – Partial knowledge of student achievement: few items
  – Usual jackknife procedures

• Plus
  – Uncertainty due to lack of knowledge of the scores the 1988 raters would have assigned
  – Error introduced by estimation of the conditional probabilities (one standard way to combine these components is sketched below)

• This got ugly in a hurry…
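The extra components can be combined in the style of multiple imputation. A sketch of the standard Rubin-type combination rule, offered as an illustration rather than the exact 1999 procedure:

```python
import numpy as np

def total_variance(sampling_vars, imputed_estimates):
    """Rubin-style combination of sampling and imputation uncertainty.

    sampling_vars:      jackknife sampling variance of the statistic,
                        computed on each imputed data set
    imputed_estimates:  the statistic itself on each set of draws
    """
    m = len(imputed_estimates)
    within = np.mean(sampling_vars)                  # sampling error
    between = np.var(imputed_estimates, ddof=1)      # imputation error
    return within + (1.0 + 1.0 / m) * between

# Hypothetical numbers: 3 sets of draws on a mean scale score
print(total_variance([1.1, 0.9, 1.0], [220.4, 221.0, 219.8]))
```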

Page 23: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


The Ugly: In summary

• In 1999, important drift issues rose to the fore
• The treatment of a single item (trend or split) changed the direction of the overall national trend result

• Acting Commissioner Phillips — “I have lost faith in the instrument”

• The 1999 LTT writing results were never released

Page 24: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


The Bad: US History and Geography

• Base year is 1994 for both subjects

• Assessed again in 2001

• Reported using a cross-grade scale

• Two aspects for consideration: the analysis design and the construct

Page 25: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Cross-grade scale design

[Diagram: both 1994 (base year) and 2001 (first trend year) assessed all three levels: age 9/grade 4, age 13/grade 8, and age 17/grade 12.]

Page 26: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Analysis design:US History, 1994 – 1

• There are no common items between grades 4 and 12, nor any across all 3 grades

Grade      Democracy   Cultures   Technology   World Role   Total
4 only        18          20          17           6           61
8 only        24          28          17          12           81
12 only       29          28          34          30          121
4 and 8       12           9           5           7           33
8 and 12      10           4          10          10           34
Total         93          89          83          65          330

Page 27: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Analysis design:US History, 1994 – 2

• History has four subscales: Democracy, Cultures, Technology, and World Role

• The IRT scaling and the vertical linking of the grades were done at the subscale level, using a weighted generalized Stocking-Lord procedure on the test characteristic curve of the common items (a sketch of the criterion follows below)

• Grade 4 and grade 12 were each linked separately to grade 8, since both had common items with grade 8
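A sketch of a weighted Stocking-Lord criterion: choose linking constants (A, B) so that the transformed test characteristic curve (TCC) of the common items matches the target grade's TCC in weighted least squares. 2PL response functions are used for brevity (the operational scales mix 3PL and GPCM items), and the theta grid and weights are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def tcc(theta, a, b):
    """Test characteristic curve: expected score on the common items.
    2PL for brevity; the operational scales mix 3PL and GPCM items."""
    a, b = np.asarray(a), np.asarray(b)
    return (1.0 / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))).sum(axis=1)

def stocking_lord(a_from, b_from, a_to, b_to,
                  grid=np.linspace(-4, 4, 41), weights=None):
    """Find (A, B) placing the 'from' calibration on the 'to' scale:
    theta_to = A * theta_from + B, so a -> a / A and b -> A * b + B."""
    w = np.ones_like(grid) if weights is None else np.asarray(weights)
    target = tcc(grid, a_to, b_to)

    def loss(ab):
        A, B = ab
        fitted = tcc(grid, np.asarray(a_from) / A, A * np.asarray(b_from) + B)
        return np.sum(w * (fitted - target) ** 2)

    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x
```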

Page 28: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Design concerns

• There are some subscales that are quite thin across grade levels:
  – Technology across 4 and 8: 5 items
  – Cultures across 8 and 12: 4 items
  – World Role across 4 and 8: 7 items (note also that there are only six grade-4-specific items in this subscale)

• The TCCs that result from a weighted combination of IRFs from so few items may retain substantial variability within year, and also may be subject to trend instability

Page 29: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Construct concerns

• Vertical scales are generally based on content areas that are thought of as “developmental” in some way
  – The baseline construct is established early, and the skill is refined and the scope of application expanded as the child matures

• It is not clear that US History (or Geography, for that matter) fits this description well

Page 30: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


The Bad: In summary

• The analysis design is not poor, but there are relatively few items across grades to support the linking

• A larger concern is whether these academic subject areas are appropriate for a vertical scale

Page 31: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


The Good: Reading

• The current reading assessment has a trend line back to 1992

• The cross-grade scale design used there was also used in mathematics, starting in 1990

• The design implements a concurrent calibration of all three grade levels of data in the base year, then within-grade calibration in subsequent trend years

Page 32: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Cross-grade scale design

[Diagram: concurrent calibration of grades 4, 8, and 12 in Year 1 (base year), then within-grade calibration at each of grades 4, 8, and 12 in Year 2 (first trend year), Year 3 (second trend year), and so on.]

Page 33: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Some complications

           2000        2002        2003        2005
Grade 4    National    Combined    Combined    Combined
Grade 8    —           Combined    Combined    Combined
Grade 12   —           National    —           National

• NAEP does not assess every grade in every assessment year, so the design has some holes

• The sample sizes vary quite a lot: combined samples run ~170,000, national ~10,000

Page 34: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Would an alternate design change the results?

[Diagram: the alternate design studied, with cross-grade calibration repeated within each assessment year (1998, 2000, 2002, 2003, 2005) at the grades assessed in that year.]

Page 35: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Not much

[Line chart: mean scale score by year ('92, '94, '98R2, '98R3, '02OP, '03/03All) for grades 4, 8, and 12 under the cross-grade and the operational scalings. The two scalings track within about two scale-score points at every time point: grade 4 means run roughly 213-221, grade 8 means 260-264, and grade 12 means 286-292 under both.]

Page 36: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


Summary of study results

• The majority of cross-grade items fit well in cross-grade calibration

• In general, reported values for cross-grade and operational scaling are close, both in mean scale scores and percentages at achievement levels

• In a number of subgroups, tests of significant differences led to different results under the two scalings

• The reported values for cross-grade and operational scaling differ more for the later years

Page 37: Cross-Grade Scales in NAEP:  Research and Real-Life Experience


The Good: In summary

• The current cross-grade scale design used in NAEP appears stable relative to the alternate design studied

• Little construct drift was apparent; the results were quite similar under both analysis designs

• This was an analytic study only: alternative assessment or item designs would almost certainly yield different conclusions