
Page 1: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Andrew Ho
Harvard Graduate School of Education

Maryland Assessment Research Center for Education Success (MARCES) Assessment Conference: Value Added Modeling and Growth Modeling with Particular Application to Teacher and School Effectiveness

College Park, Maryland, October 18, 2012

Page 2: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• How can we advance from passing judgment on schools and teachers to facilitating their improvement?

• By which criteria should we evaluate accountability models?

• Predictive accuracy of individual student "growth" models for school-level accountability.
– Projection Model
– Trajectory Model
– Conditional Status Percentile Rank models (e.g., SGPs)

• Incentives.
– Conditional Incentive diagrams and alignment to policy goals.

• Transparency, black boxes, and score reporting.

What makes a good growth model?

Page 3: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• School accountability metrics that count students who are “on track” to proficiency, career and college readiness, or some other future outcome.

• A seemingly straightforward criterion is minimization of the distance between predicted and actual future performance.

Context

Page 4: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Once the discussion is framed in terms of standards, the only rhetorically acceptable choice is high standards.

• Once the discussion is framed in terms of predictive accuracy, the only rhetorically acceptable choice is maximal accuracy.

How “predictive accuracy” is like “standards”

Page 5: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Define $X_g$ as a test score in Grade $g$.
• Consider the problem of predicting a "future" Grade 8 test score $X_8$ from a current Grade 7 score $X_7$ and a past Grade 6 score $X_6$.

1) Take data from a past cohort with complete Grade 6, 7, and 8 data.
2) Estimate a simple prediction equation: $\hat{X}_8 = \hat{\beta}_0 + \hat{\beta}_6 X_6 + \hat{\beta}_7 X_7$.
3) Assume the equation holds for the current cohort.
4) Plug in $X_6$ and $X_7$ data from the current cohort into the equation estimated from the past cohort (often looks like this for standardized data):

A simple projection model

Page 6: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

A simple projection model

$\hat{X}_8^{\,proj} = (0.5)\,X_7 + (0.4)\,X_6$
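A minimal sketch of these four steps in Python, using simulated data in place of a real past cohort; the 0.8 intergrade correlation, cohort sizes, and variable names are illustrative assumptions, not values from the deck.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "past cohort" with complete Grade 6, 7, and 8 scores (standardized).
# The equal 0.8 intergrade correlation is an assumption for illustration.
n, rho = 5000, 0.8
cov = [[1, rho, rho], [rho, 1, rho], [rho, rho, 1]]
x6, x7, x8 = rng.multivariate_normal([0, 0, 0], cov, size=n).T

# Step 2: estimate the prediction equation X8_hat = b0 + b6*X6 + b7*X7 by OLS.
design = np.column_stack([np.ones(n), x6, x7])
b0, b6, b7 = np.linalg.lstsq(design, x8, rcond=None)[0]
print(f"X8_hat = {b0:.2f} + {b6:.2f}*X6 + {b7:.2f}*X7")

# Steps 3-4: assume the equation holds for the current cohort and plug in their
# Grade 6 and 7 scores (here, new simulated students with no Grade 8 score yet).
cur6, cur7 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=1000).T
x8_proj = b0 + b6 * cur6 + b7 * cur7
```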

Page 7: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Define a criterion that describes the average "miss" on the scale (lower the better): $\text{RMSE} = \sqrt{\mathrm{mean}\big[(X_8 - \hat{X}_8)^2\big]}$.

• As you might expect of a regression model given only prior-year variables, the projection model does about as well as possible with prediction, with RMSEs between 0.4 and 0.6 standard deviation units.

• A convenient representation of RMSE assumes equal intergrade correlations $\rho$ (this is unrealistic but tolerable for approximation) for the $m$ prior years (in this case, $m = 2$).

Minimize Squared Error
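A hedged sketch of this criterion: the RMSE defined above, plus a closed-form value under the equal-correlation simplification. The closed form is a standard result for equicorrelated standardized predictors, stated here as one convenient representation rather than the slide's exact expression.

```python
import numpy as np

def rmse(actual, predicted):
    """Average 'miss' on the score scale: root mean squared prediction error."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def projection_rmse_equicorrelated(rho, m):
    """RMSE of an OLS projection for standardized scores when every intergrade
    correlation equals rho and m prior grades are used as predictors
    (uses R^2 = m * rho^2 / (1 + (m - 1) * rho) for equicorrelated predictors)."""
    r2 = m * rho ** 2 / (1 + (m - 1) * rho)
    return float(np.sqrt(1 - r2))

for rho in (0.75, 0.80, 0.85):
    print(rho, round(projection_rmse_equicorrelated(rho, m=2), 2))
# Prints roughly 0.60, 0.54, and 0.47 -- consistent with the 0.4-0.6 SD range above.
```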

Page 8: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Requires an argument for a vertical scale.
• Extends past gains into the future.

• Compare with a typical projection model.

• Under the caricatured conditions of equal intergrade correlations $\rho$ and $m$ prior years, an "average gain" trajectory model has an RMSE that can be compared directly against the projection model's.

• The ratio $\text{RMSE}_{\text{traj}} / \text{RMSE}_{\text{proj}}$ is 1.4 to 2 over common scenarios.

A simple trajectory model
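A minimal sketch of one simple trajectory rule, assuming scores sit on a vertical (cross-grade) scale; the average-gain and best-fit-line forms and the function names are illustrative assumptions.

```python
import numpy as np

def trajectory_prediction(x6, x7):
    """'Average gain' trajectory with two prior years: extend the Grade 6-to-7
    gain one more year (requires a vertical, cross-grade scale)."""
    x6, x7 = np.asarray(x6), np.asarray(x7)
    return x7 + (x7 - x6)          # equivalently 2*x7 - x6

def trajectory_prediction_bestfit(scores, grades, target_grade):
    """With more prior years, fit a straight line to one student's scores over
    grades and extend it; a curve could be substituted, since extended
    trajectories do not have to be linear."""
    slope, intercept = np.polyfit(grades, scores, deg=1)
    return intercept + slope * target_grade

print(trajectory_prediction(x6=-1.0, x7=-0.5))                 # -> 0.0
print(trajectory_prediction_bestfit([-1.0, -0.5], [6, 7], 8))  # same line -> 0.0
```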

Page 9: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

A simple trajectory model

Page 10: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Castellano and Ho (2012) describe metrics like SGPs (Betebenner, 2008) and Percentile Ranks of OLS Residuals (PRRs) as Conditional Status Percentile Ranks.

• These can be used to make predictions as follows:

1) Regress "current" Grade 7 on "past" Grade 6: $X_7 = \hat{\gamma}_0 + \hat{\gamma}_1 X_6 + e_7$.
2) Define $p$ as the percentile rank of a student's residual $e_7$ in the distribution of $e_7$.
3) Obtain $\hat{X}_8^{\,proj}$ following the projection model.
4) For each student, add $e_8^{(p)}$, the $p$th percentile residual from the Grade 8 regression, where $p$ corresponds with their percentile rank from step 2.

Projections from Conditional Status Percentile Ranks
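A sketch of one reading of steps 1-4, with illustrative names; it assumes the Grade 6-to-7 regression and its percentile ranks come from the current cohort, while the Grade 8 regression and its residual distribution come from the past cohort.

```python
import numpy as np

def cspr_projection(past6, past7, past8, cur6, cur7):
    """CSPR-style prediction: assume each current student carries the percentile
    rank of their conditional status forward into Grade 8."""
    past6, past7, past8 = map(np.asarray, (past6, past7, past8))
    cur6, cur7 = np.asarray(cur6), np.asarray(cur7)

    # Step 1: regress "current" Grade 7 on "past" Grade 6 within the current cohort.
    g1, g0 = np.polyfit(cur6, cur7, deg=1)
    resid7 = cur7 - (g0 + g1 * cur6)

    # Step 2: percentile rank p of each student's residual.
    ranks = np.argsort(np.argsort(resid7))
    p = 100.0 * (ranks + 0.5) / len(resid7)

    # Step 3: projection of Grade 8 from the past cohort's regression.
    design_past = np.column_stack([np.ones(len(past6)), past6, past7])
    coef = np.linalg.lstsq(design_past, past8, rcond=None)[0]
    resid8 = past8 - design_past @ coef
    proj = np.column_stack([np.ones(len(cur6)), cur6, cur7]) @ coef

    # Step 4: add the p-th percentile residual from the Grade 8 regression.
    return proj + np.percentile(resid8, p)
```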

Page 11: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Contrasting all predictive models

Page 12: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Growth Description
• Gain Scores, Trajectories: where a student was, where a student is, and what has been learned in between.
• Status Beyond Prediction (CSPR): where a student is, above and beyond where we would have predicted she would be, given past scores.

Growth Prediction
• Trajectory, Gain-Score Model: extend past gains in systematic fashion into the future; consider whether future performance is adequate.
• Projection/Regression Model: use a regression model to predict future scores from past scores, statistically, empirically.

An important contrast in “growth” use and interpretation

Page 13: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Two approaches, shown as figures in the deck:
• Gain Scores, Trajectories: the gain itself; adding two students with equal gains.
• Status Beyond Prediction: prediction from previous score (or scores, or scores and demographics), and status beyond that prediction; adding two different students with equal status beyond prediction.

Two Approaches to Growth Description

Page 14: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Trajectory Model
• Extends gains over time in straightforward fashion.
• With more prior years, a best-fit line or curve can be extended similarly.
• Extended trajectories do not have to be linear.

Projection/Regression Model
• Estimates a prediction equation for the "future" score.
• Because current students have unknown future scores, estimate the prediction equation from a previous cohort that does have their "future" year's score.
• Input current cohort data into this prediction equation.

Two Approaches to Growth Prediction

Page 15: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Figure captions:
• Three students with equal projections from a regression model.
• The same three students' predictions with a regression model.
• Three students with equal projections from a gain-score model.

Stark Contrasts in Projections

Page 16: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Models by RMSE

• As noted, the ratio $\text{RMSE}_{\text{traj}} / \text{RMSE}_{\text{proj}}$ ranges from 1.4 to 2 under common conditions.

• The ratio $\text{RMSE}_{\text{CSPR}} / \text{RMSE}_{\text{proj}}$ is around 1.4 (Castellano & Ho, in preparation).
• Compared to a projection model (regression) baseline, CSPRs have an average "miss" that is 40% greater, and trajectory models are worse still.

• Regression does well what regression does well.
• Case closed?
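As a rough check on these magnitudes, a minimal simulation assuming equal intergrade correlations of 0.8 and the simple average-gain trajectory sketched earlier; it illustrates the kind of gap described here, not the Castellano & Ho results themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8                                   # assumed equal intergrade correlation
cov = [[1, rho, rho], [rho, 1, rho], [rho, rho, 1]]
past = rng.multivariate_normal([0, 0, 0], cov, size=20000)  # columns: grades 6, 7, 8
cur = rng.multivariate_normal([0, 0, 0], cov, size=20000)   # Grade 8 kept only for scoring

def rmse(err):
    return float(np.sqrt(np.mean(np.asarray(err) ** 2)))

# Projection: regression estimated on the past cohort, applied to the current one.
design = np.column_stack([np.ones(len(past)), past[:, 0], past[:, 1]])
coef = np.linalg.lstsq(design, past[:, 2], rcond=None)[0]
proj = np.column_stack([np.ones(len(cur)), cur[:, 0], cur[:, 1]]) @ coef

# Trajectory: extend the Grade 6-to-7 gain one more year.
traj = 2 * cur[:, 1] - cur[:, 0]

print("projection RMSE:", round(rmse(cur[:, 2] - proj), 2))   # about 0.54
print("trajectory RMSE:", round(rmse(cur[:, 2] - traj), 2))   # about 1.10
# The ratio is roughly 2 here -- the high end of the 1.4-to-2 range quoted above,
# as expected under caricatured equal-correlation conditions with two prior years.
```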

Page 17: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• What sorts of decisions do these models support?
• How do these decisions set incentives for teachers and school administrators?
• Consider a future standard, $c_8$, where if $\hat{X}_8 \ge c_8$, one is deemed to be "on track."
• Given $X_6$, $X_7$, $c_8$, and any of our three models, we can graph the "on track" boundary on an $X_7$ vs. $X_6$ plane (Ho, 2011; Ho, et al., 2009; Hoffer, et al., 2011).
• For example, we can plot this boundary, assuming a 50th percentile cut score ($c_8 = 0$ for standardized scores) for simplicity:

Models by RMSE
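A minimal sketch of how those boundaries can be computed, assuming the deck's example projection equation ($\hat{X}_8 = 0.5\,X_7 + 0.4\,X_6$), the simple average-gain trajectory, and a standardized cut score $c_8 = 0$; the CSPR boundary, which the deck also plots, falls between these two and is omitted here.

```python
def on_track_x7_projection(x6, b6=0.4, b7=0.5, c8=0.0):
    """Smallest Grade 7 score that keeps a student 'on track' under the
    projection model: solve b6*x6 + b7*x7 >= c8 for x7."""
    return (c8 - b6 * x6) / b7

def on_track_x7_trajectory(x6, c8=0.0):
    """Smallest Grade 7 score that keeps a student 'on track' under the
    average-gain trajectory model: solve 2*x7 - x6 >= c8 for x7."""
    return (c8 + x6) / 2.0

for x6 in (-2.0, 0.0, 2.0):
    print(x6, on_track_x7_projection(x6), on_track_x7_trajectory(x6))
# Grade 6 = -2: projection requires 1.6, trajectory requires -1.0.
# Grade 6 = +2: projection requires -1.6, trajectory requires 1.0.
# These match the values on the next slide's decision plots.
```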

Page 18: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Decision Plots
For a Grade 6 score of -2, projection models require a Grade 7 score of 1.6, CSPRs require a -0.5, and trajectory models require a -1.

For a Grade 6 score of 2, projection models require anything above a -1.6, CSPRs require a 0.5, and trajectory models require a 1.

Page 19: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Torquing Projection Lines
The projection line (and the CSPR line) is empirically derived. When does the projection line start looking more like the trajectory line?

As an illustration, the projection line's slope is 0 when the distal correlation equals the product of the adjacent correlations: $\rho_{68} = \rho_{67}\,\rho_{78}$.

Adjacent grade correlations generally range from 0.6 to 0.8. Distal grade correlations are rarely that much lower, so this is an uncommon occurrence.
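A short derivation, for standardized scores, of the zero-slope condition stated above; the notation is mine and is offered as one way of reconstructing the expression on the original slide.

```latex
% Regress X_8 on X_6 and X_7 (all standardized), with intergrade correlations
% \rho_{67}, \rho_{68}, \rho_{78}. The OLS coefficients are
\begin{align*}
\beta_7 &= \frac{\rho_{78} - \rho_{67}\,\rho_{68}}{1 - \rho_{67}^{2}}, &
\beta_6 &= \frac{\rho_{68} - \rho_{67}\,\rho_{78}}{1 - \rho_{67}^{2}}.
\end{align*}
% The on-track boundary \beta_6 X_6 + \beta_7 X_7 = c_8 is flat in the
% X_7-versus-X_6 plane exactly when \beta_6 = 0, that is, when
\[
\rho_{68} \;=\; \rho_{67}\,\rho_{78}.
\]
% With adjacent correlations of 0.6 to 0.8 the product is only 0.36 to 0.64,
% below typical distal correlations, so the projection line rarely flattens.
```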

Page 20: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Aspirational Models vs. Predictive Models
The CSPR and trajectory model lines are, from this perspective, more aspirational than predictive.

They envision a covariance structure where relative position is less fixed over time than it is empirically.

Can this be okay, even if it decreases predictive accuracy?

Page 21: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

From Decision Plots to Effort Plots
The regression line or "conditional expectation" line gives us a baseline expectation given our Grade 6 scores.

Anything above this line may require "effort." We can plot this effort against prior-year scores by subtracting out this regression line.
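A sketch of that subtraction, assuming an adjacent-grade correlation of 0.7 for the conditional expectation line and the same example boundaries as before; the correlation value and function names are illustrative assumptions.

```python
def conditional_effort(x6, required_x7, rho67=0.7):
    """Gain demanded beyond expectation: the Grade 7 score a model requires minus
    the conditional expectation E[X7 | X6] = rho67 * x6 for standardized scores."""
    return required_x7 - rho67 * x6

for x6 in (-2.0, 0.0, 2.0):
    proj_req = (0.0 - 0.4 * x6) / 0.5      # projection boundary from the earlier sketch
    traj_req = x6 / 2.0                    # average-gain trajectory boundary
    print(x6,
          round(conditional_effort(x6, proj_req), 2),
          round(conditional_effort(x6, traj_req), 2))
# At Grade 6 = -2 the projection model asks for about 3.0 SD beyond expectation
# (the next slide's "ball and chain"); at Grade 6 = +2 it asks for about -3.0
# (the "free pass"), while the trajectory model's demands stay much flatter.
```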

Page 22: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Conditional Effort Plots
Required gain beyond expectation.

Maximizing predictive accuracy may lead to implausible gains required to get low-achieving students to be “on track.”

A low score as a “ball and chain.” A high score as a “free pass.”

Page 23: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Conditional Incentive Plots
In a zero-sum model for incentives to teach certain students, these conditional effort plots imply conditional incentive plots, as shown.

The question may be, what is the goal of the policy? This informs conditional incentive plots, and these can inform model selection.

This is a useful alternative to letting prediction drive model selection and then being surprised by the shape of incentives.

Page 24: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Trajectory, Gain-Score Model
• Lower initial scores can inflate trajectories:
• New Model Rewards Low Scores, Encourages "Fail-First Strategy"
• Very intuitive, requires vertical scales, less accurate in terms of future classifications.

Regression/Prediction Model
• Low scorers require huge gains. High scorers can fall comfortably.
• New Model Labels High and Low Achievers Early, Permanently.
• Counterintuitive, does not require vertical scales, more accurate classifications.

Stark Contrasts in Incentives

Page 25: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• I argue that school accountability metrics should be designed with less attention to “standards” and “prediction” and closer attention to conditional incentives and their alignment with policy goals.

• What about teacher accountability metrics?
• Conditional incentive plots for VAMs are generally uniform across the distributions of variables included in the model.
• Scaling anomalies may lead to distortions, if equal intervals do not correspond with equal "effort for gains," although this is difficult to game.

Conditional Incentive Plots for VAMs

Page 26: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• As accountability calculations become increasingly complicated, score reporting and transparency become even more necessary mechanisms for the improvement of schools and teaching.

• Systems will be more successful with clear reporting of actionable (and presumably defensible) results.

• An example: I used to be very suspicious of categorical/value-table models, as they create pseudo-vertical scales and sacrifice information.
– I still have reservations, but they are excellent tools for reporting and communicating results, even when the underlying models are not themselves categorical.
– An actionable, interpretable categorical framework can be layered over a continuous model.

Transparency and Score Reporting

Page 27: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• If I had a choice between
– A simple VAM that communicated actionable responses and incentives clearly, vs.
– A complicated VAM that did not have any guidance about how to improve…
• This is a false dichotomy.
• We can make the complex seem simple.
• Conditional incentive plots and similar attention to differential student contribution to VAM estimates are one approach to this, both to anticipate gaming behavior and to encourage desired responses.

Communicating Incentives Clearly

Page 28: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• In his NCME Career Award address, Haertel distinguished between two categories of purposes of large-scale testing: Measurement and Influence.

• An "influencing" purpose often depends less on the results of the test itself.
– Directing student effort
– "Shaping public perception"

• Validation arguments for Influencing purposes are rarely well described.

• These plots are a modest first step for visualizing the Influencing mechanisms of proposed models.

Haertel (2012) Measurement vs. Influence

Page 29: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• A medical analogy (thanks to Catherine McClellan) can be helpful in thinking about where VAM and school accountability research should continue to go.

• Doctors must gather data, identify symptoms, reach a diagnosis, and prescribe a treatment.
– In school and teacher effectiveness conversations, we often get stuck at "symptoms."
– Doctors do not average blood pressure results with fMRI results to get increasingly reliable and accurate measures of "health." Or at least they don't stop there.
– We need to continue advancing the science of diagnosis (what's wrong) and treatment (now what).

• We must continue beyond predictive accuracy and even conditional incentives to deeper understanding of teachers’ and administrators’ learning in response to evaluation systems.

School/Teacher Effectiveness and House, MD