
Page 1: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Andrew Ho
Harvard Graduate School of Education

Maryland Assessment Research Center for Education Success (MARCES) Assessment Conference: Value Added Modeling and Growth Modeling with Particular Application to Teacher and School Effectiveness

College Park, Maryland, October 18, 2012

Page 2: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• How can we advance from passing judgment on schools and teachers to facilitating their improvement?

• By which criteria should we evaluate accountability models?

• Predictive accuracy of individual student "growth" models for school-level accountability.
– Projection Model
– Trajectory Model
– Conditional Status Percentile Rank models (e.g., SGPs)

• Incentives.
– Conditional Incentive diagrams and alignment to policy goals.

• Transparency, black boxes, and score reporting.

What makes a good growth model?

Page 3: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• School accountability metrics that count students who are “on track” to proficiency, career and college readiness, or some other future outcome.

• A seemingly straightforward criterion is minimization of the distance between predicted and actual future performance.

Context

Page 4: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Once the discussion is framed in terms of standards, the only rhetorically acceptable choice is high standards.

• Once the discussion is framed in terms of predictive accuracy, the only rhetorically acceptable choice is maximal accuracy.

How “predictive accuracy” is like “standards”

Page 5: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Define $X_g$ as a test score in Grade $g$.
• Consider the problem of predicting a "future" Grade 8 test score $X_8$ from a current Grade 7 score $X_7$ and a past Grade 6 score $X_6$.

1) Take data from a past cohort with complete Grade 6, 7, and 8 data.
2) Estimate a simple prediction equation: $\hat{X}_8 = \hat{\beta}_0 + \hat{\beta}_6 X_6 + \hat{\beta}_7 X_7$.
3) Assume the equation holds for the current cohort.
4) Plug in $X_6$ and $X_7$ data from the current cohort into the equation estimated from the past cohort (often looks like this for standardized data):

A simple projection model

Page 6: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

A simple projection model

$\hat{X}_8^{\,proj} = (0.5)\,X_7 + (0.4)\,X_6$
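A minimal sketch of these four steps in Python, using simulated data in place of a real past cohort; the 0.8 intergrade correlation, cohort sizes, and variable names are illustrative assumptions, not values from the deck.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "past cohort" with complete Grade 6, 7, and 8 scores (standardized).
# The equal 0.8 intergrade correlation is an assumption for illustration.
n, rho = 5000, 0.8
cov = [[1, rho, rho], [rho, 1, rho], [rho, rho, 1]]
x6, x7, x8 = rng.multivariate_normal([0, 0, 0], cov, size=n).T

# Step 2: estimate the prediction equation X8_hat = b0 + b6*X6 + b7*X7 by OLS.
design = np.column_stack([np.ones(n), x6, x7])
b0, b6, b7 = np.linalg.lstsq(design, x8, rcond=None)[0]
print(f"X8_hat = {b0:.2f} + {b6:.2f}*X6 + {b7:.2f}*X7")

# Steps 3-4: assume the equation holds for the current cohort and plug in their
# Grade 6 and 7 scores (here, new simulated students with no Grade 8 score yet).
cur6, cur7 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=1000).T
x8_proj = b0 + b6 * cur6 + b7 * cur7
```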

Page 7: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Define a criterion that describes the average "miss" on the scale (lower the better): $\text{RMSE} = \sqrt{\mathrm{mean}\big[(X_8 - \hat{X}_8)^2\big]}$.

• As you might expect of a regression model given only prior-year variables, the projection model does about as well as possible with prediction, with RMSEs between 0.4 and 0.6 standard deviation units.

• A convenient representation of RMSE assumes equal intergrade correlations $\rho$ (this is unrealistic but tolerable for approximation) for the $m$ prior years (in this case, $m = 2$).

Minimize Squared Error
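A hedged sketch of this criterion: the RMSE defined above, plus a closed-form value under the equal-correlation simplification. The closed form is a standard result for equicorrelated standardized predictors, stated here as one convenient representation rather than the slide's exact expression.

```python
import numpy as np

def rmse(actual, predicted):
    """Average 'miss' on the score scale: root mean squared prediction error."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def projection_rmse_equicorrelated(rho, m):
    """RMSE of an OLS projection for standardized scores when every intergrade
    correlation equals rho and m prior grades are used as predictors
    (uses R^2 = m * rho^2 / (1 + (m - 1) * rho) for equicorrelated predictors)."""
    r2 = m * rho ** 2 / (1 + (m - 1) * rho)
    return float(np.sqrt(1 - r2))

for rho in (0.75, 0.80, 0.85):
    print(rho, round(projection_rmse_equicorrelated(rho, m=2), 2))
# Prints roughly 0.60, 0.54, and 0.47 -- consistent with the 0.4-0.6 SD range above.
```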

Page 8: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Requires an argument for a vertical scale.
• Extends past gains into the future.

• Compare with a typical projection model.

• Under the caricatured conditions of equal intergrade correlations $\rho$ and $m$ prior years, an "average gain" trajectory model has an RMSE that can be compared directly against the projection model's.

• The ratio $\text{RMSE}_{\text{traj}} / \text{RMSE}_{\text{proj}}$ is 1.4 to 2 over common scenarios.

A simple trajectory model
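A minimal sketch of one simple trajectory rule, assuming scores sit on a vertical (cross-grade) scale; the average-gain and best-fit-line forms and the function names are illustrative assumptions.

```python
import numpy as np

def trajectory_prediction(x6, x7):
    """'Average gain' trajectory with two prior years: extend the Grade 6-to-7
    gain one more year (requires a vertical, cross-grade scale)."""
    x6, x7 = np.asarray(x6), np.asarray(x7)
    return x7 + (x7 - x6)          # equivalently 2*x7 - x6

def trajectory_prediction_bestfit(scores, grades, target_grade):
    """With more prior years, fit a straight line to one student's scores over
    grades and extend it; a curve could be substituted, since extended
    trajectories do not have to be linear."""
    slope, intercept = np.polyfit(grades, scores, deg=1)
    return intercept + slope * target_grade

print(trajectory_prediction(x6=-1.0, x7=-0.5))                 # -> 0.0
print(trajectory_prediction_bestfit([-1.0, -0.5], [6, 7], 8))  # same line -> 0.0
```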

Page 9: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

A simple trajectory model

Page 10: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• Castellano and Ho (2012) describe metrics like SGPs (Betebenner, 2008) and Percentile Ranks of OLS Residuals (PRRs) as Conditional Status Percentile Ranks.

• These can be used to make predictions as follows:

1) Regress "current" Grade 7 on "past" Grade 6: $X_7 = \hat{\gamma}_0 + \hat{\gamma}_1 X_6 + e_7$.
2) Define $p$ as the percentile rank of a student's residual $e_7$ in the distribution of $e_7$.
3) Obtain $\hat{X}_8^{\,proj}$ following the projection model.
4) For each student, add $e_8^{(p)}$, the $p$th percentile residual from the Grade 8 regression, where $p$ corresponds with their percentile rank from step 2.

Projections from Conditional Status Percentile Ranks
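A sketch of one reading of steps 1-4, with illustrative names; it assumes the Grade 6-to-7 regression and its percentile ranks come from the current cohort, while the Grade 8 regression and its residual distribution come from the past cohort.

```python
import numpy as np

def cspr_projection(past6, past7, past8, cur6, cur7):
    """CSPR-style prediction: assume each current student carries the percentile
    rank of their conditional status forward into Grade 8."""
    past6, past7, past8 = map(np.asarray, (past6, past7, past8))
    cur6, cur7 = np.asarray(cur6), np.asarray(cur7)

    # Step 1: regress "current" Grade 7 on "past" Grade 6 within the current cohort.
    g1, g0 = np.polyfit(cur6, cur7, deg=1)
    resid7 = cur7 - (g0 + g1 * cur6)

    # Step 2: percentile rank p of each student's residual.
    ranks = np.argsort(np.argsort(resid7))
    p = 100.0 * (ranks + 0.5) / len(resid7)

    # Step 3: projection of Grade 8 from the past cohort's regression.
    design_past = np.column_stack([np.ones(len(past6)), past6, past7])
    coef = np.linalg.lstsq(design_past, past8, rcond=None)[0]
    resid8 = past8 - design_past @ coef
    proj = np.column_stack([np.ones(len(cur6)), cur6, cur7]) @ coef

    # Step 4: add the p-th percentile residual from the Grade 8 regression.
    return proj + np.percentile(resid8, p)
```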

Page 11: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Contrasting all predictive models

Page 12: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Growth Description
• Gain Scores, Trajectories: where a student was, where a student is, and what has been learned in between.
• Status Beyond Prediction (CSPR): where a student is, above and beyond where we would have predicted she would be, given past scores.

Growth Prediction
• Trajectory, Gain-Score Model: extend past gains in systematic fashion into the future; consider whether future performance is adequate.
• Projection/Regression Model: use a regression model to predict future scores from past scores, statistically, empirically.

An important contrast in “growth” use and interpretation

Page 13: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Two approaches, shown as figures in the deck:
• Gain Scores, Trajectories: the gain itself; adding two students with equal gains.
• Status Beyond Prediction: prediction from previous score (or scores, or scores and demographics), and status beyond that prediction; adding two different students with equal status beyond prediction.

Two Approaches to Growth Description

Page 14: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Trajectory Model
• Extends gains over time in straightforward fashion.
• With more prior years, a best-fit line or curve can be extended similarly.
• Extended trajectories do not have to be linear.

Projection/Regression Model
• Estimates a prediction equation for the "future" score.
• Because current students have unknown future scores, estimate the prediction equation from a previous cohort that does have their "future" year's score.
• Input current cohort data into this prediction equation.

Two Approaches to Growth Prediction

Page 15: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Figure captions:
• Three students with equal projections from a regression model.
• The same three students' predictions with a regression model.
• Three students with equal projections from a gain-score model.

Stark Contrasts in Projections

Page 16: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Models by RMSE

• As noted, the ratio $\text{RMSE}_{\text{traj}} / \text{RMSE}_{\text{proj}}$ ranges from 1.4 to 2 under common conditions.

• The ratio $\text{RMSE}_{\text{CSPR}} / \text{RMSE}_{\text{proj}}$ is around 1.4 (Castellano & Ho, in preparation).
• Compared to a projection model (regression) baseline, CSPRs have an average "miss" that is 40% greater, and trajectory models are worse still.

• Regression does well what regression does well.
• Case closed?
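As a rough check on these magnitudes, a minimal simulation assuming equal intergrade correlations of 0.8 and the simple average-gain trajectory sketched earlier; it illustrates the kind of gap described here, not the Castellano & Ho results themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8                                   # assumed equal intergrade correlation
cov = [[1, rho, rho], [rho, 1, rho], [rho, rho, 1]]
past = rng.multivariate_normal([0, 0, 0], cov, size=20000)  # columns: grades 6, 7, 8
cur = rng.multivariate_normal([0, 0, 0], cov, size=20000)   # Grade 8 kept only for scoring

def rmse(err):
    return float(np.sqrt(np.mean(np.asarray(err) ** 2)))

# Projection: regression estimated on the past cohort, applied to the current one.
design = np.column_stack([np.ones(len(past)), past[:, 0], past[:, 1]])
coef = np.linalg.lstsq(design, past[:, 2], rcond=None)[0]
proj = np.column_stack([np.ones(len(cur)), cur[:, 0], cur[:, 1]]) @ coef

# Trajectory: extend the Grade 6-to-7 gain one more year.
traj = 2 * cur[:, 1] - cur[:, 0]

print("projection RMSE:", round(rmse(cur[:, 2] - proj), 2))   # about 0.54
print("trajectory RMSE:", round(rmse(cur[:, 2] - traj), 2))   # about 1.10
# The ratio is roughly 2 here -- the high end of the 1.4-to-2 range quoted above,
# as expected under caricatured equal-correlation conditions with two prior years.
```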

Page 17: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• What sorts of decisions do these models support?
• How do these decisions set incentives for teachers and school administrators?
• Consider a future standard, $c_8$, where if $\hat{X}_8 \ge c_8$, one is deemed to be "on track."
• Given $X_6$, $X_7$, $c_8$, and any of our three models, we can graph the "on track" boundary on an $X_7$ vs. $X_6$ plane (Ho, 2011; Ho, et al., 2009; Hoffer, et al., 2011).
• For example, we can plot this boundary, assuming a 50th percentile cut score ($c_8 = 0$ for standardized scores) for simplicity:

Models by RMSE
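A minimal sketch of how those boundaries can be computed, assuming the deck's example projection equation ($\hat{X}_8 = 0.5\,X_7 + 0.4\,X_6$), the simple average-gain trajectory, and a standardized cut score $c_8 = 0$; the CSPR boundary, which the deck also plots, falls between these two and is omitted here.

```python
def on_track_x7_projection(x6, b6=0.4, b7=0.5, c8=0.0):
    """Smallest Grade 7 score that keeps a student 'on track' under the
    projection model: solve b6*x6 + b7*x7 >= c8 for x7."""
    return (c8 - b6 * x6) / b7

def on_track_x7_trajectory(x6, c8=0.0):
    """Smallest Grade 7 score that keeps a student 'on track' under the
    average-gain trajectory model: solve 2*x7 - x6 >= c8 for x7."""
    return (c8 + x6) / 2.0

for x6 in (-2.0, 0.0, 2.0):
    print(x6, on_track_x7_projection(x6), on_track_x7_trajectory(x6))
# Grade 6 = -2: projection requires 1.6, trajectory requires -1.0.
# Grade 6 = +2: projection requires -1.6, trajectory requires 1.0.
# These match the values on the next slide's decision plots.
```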

Page 18: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Decision Plots
For a Grade 6 score of -2, projection models require a Grade 7 score of 1.6, CSPRs require a -0.5, and trajectory models require a -1.

For a Grade 6 score of 2, projection models require anything above a -1.6, CSPRs require a 0.5, and trajectory models require a 1.

Page 19: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Torquing Projection Lines
The projection line (and the CSPR line) is empirically derived. When does the projection line start looking more like the trajectory line?

As an illustration, the projection line's slope is 0 when the distal correlation equals the product of the adjacent correlations: $\rho_{68} = \rho_{67}\,\rho_{78}$.

Adjacent grade correlations generally range from 0.6 to 0.8. Distal grade correlations are rarely that much lower, so this is an uncommon occurrence.
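A short derivation, for standardized scores, of the zero-slope condition stated above; the notation is mine and is offered as one way of reconstructing the expression on the original slide.

```latex
% Regress X_8 on X_6 and X_7 (all standardized), with intergrade correlations
% \rho_{67}, \rho_{68}, \rho_{78}. The OLS coefficients are
\begin{align*}
\beta_7 &= \frac{\rho_{78} - \rho_{67}\,\rho_{68}}{1 - \rho_{67}^{2}}, &
\beta_6 &= \frac{\rho_{68} - \rho_{67}\,\rho_{78}}{1 - \rho_{67}^{2}}.
\end{align*}
% The on-track boundary \beta_6 X_6 + \beta_7 X_7 = c_8 is flat in the
% X_7-versus-X_6 plane exactly when \beta_6 = 0, that is, when
\[
\rho_{68} \;=\; \rho_{67}\,\rho_{78}.
\]
% With adjacent correlations of 0.6 to 0.8 the product is only 0.36 to 0.64,
% below typical distal correlations, so the projection line rarely flattens.
```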

Page 20: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Aspirational Models vs. Predictive Models
The CSPR and trajectory model lines are, from this perspective, more aspirational than predictive.

They envision a covariance structure where relative position is less fixed over time than it is empirically.

Can this be okay, even if it decreases predictive accuracy?

Page 21: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

From Decision Plots to Effort Plots
The regression line or "conditional expectation" line gives us a baseline expectation given our Grade 6 scores.

Anything above this line may require "effort." We can plot this effort against prior-year scores by subtracting out this regression line.
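A sketch of that subtraction, assuming an adjacent-grade correlation of 0.7 for the conditional expectation line and the same example boundaries as before; the correlation value and function names are illustrative assumptions.

```python
def conditional_effort(x6, required_x7, rho67=0.7):
    """Gain demanded beyond expectation: the Grade 7 score a model requires minus
    the conditional expectation E[X7 | X6] = rho67 * x6 for standardized scores."""
    return required_x7 - rho67 * x6

for x6 in (-2.0, 0.0, 2.0):
    proj_req = (0.0 - 0.4 * x6) / 0.5      # projection boundary from the earlier sketch
    traj_req = x6 / 2.0                    # average-gain trajectory boundary
    print(x6,
          round(conditional_effort(x6, proj_req), 2),
          round(conditional_effort(x6, traj_req), 2))
# At Grade 6 = -2 the projection model asks for about 3.0 SD beyond expectation
# (the next slide's "ball and chain"); at Grade 6 = +2 it asks for about -3.0
# (the "free pass"), while the trajectory model's demands stay much flatter.
```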

Page 22: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Conditional Effort Plots
Required gain beyond expectation.

Maximizing predictive accuracy may lead to implausible gains required to get low-achieving students to be “on track.”

A low score as a “ball and chain.” A high score as a “free pass.”

Page 23: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Conditional Incentive Plots
In a zero-sum model for incentives to teach certain students, these conditional effort plots imply conditional incentive plots, as shown.

The question may be, what is the goal of the policy? This informs conditional incentive plots, and these can inform model selection.

This is a useful alternative to letting prediction drive model selection and then being surprised by the shape of incentives.

Page 24: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Trajectory, Gain-Score Model
• Lower initial scores can inflate trajectories:
• New Model Rewards Low Scores, Encourages "Fail-First Strategy"
• Very intuitive, requires vertical scales, less accurate in terms of future classifications.

Regression/Prediction Model
• Low scorers require huge gains. High scorers can fall comfortably.
• New Model Labels High and Low Achievers Early, Permanently.
• Counterintuitive, does not require vertical scales, more accurate classifications.

Stark Contrasts in Incentives

Page 25: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• I argue that school accountability metrics should be designed with less attention to “standards” and “prediction” and closer attention to conditional incentives and their alignment with policy goals.

• What about teacher accountability metrics?
• Conditional incentive plots for VAMs are generally uniform across the distributions of variables included in the model.
• Scaling anomalies may lead to distortions, if equal intervals do not correspond with equal "effort for gains," although this is difficult to game.

Conditional Incentive Plots for VAMs

Page 26: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• As accountability calculations become increasingly complicated, score reporting and transparency become even more necessary mechanisms for the improvement of schools and teaching.

• Systems will be more successful with clear reporting of actionable (and presumably defensible) results.

• An example: I used to be very suspicious of categorical/value-table models, as they create pseudo-vertical scales and sacrifice information.
– I still have reservations, but they are excellent tools for reporting and communicating results, even when the underlying models are not themselves categorical.
– An actionable, interpretable categorical framework can be layered over a continuous model.

Transparency and Score Reporting

Page 27: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• If I had a choice between
– A simple VAM that communicated actionable responses and incentives clearly, vs.
– A complicated VAM that did not have any guidance about how to improve…
• This is a false dichotomy.
• We can make the complex seem simple.
• Conditional incentive plots and similar attention to differential student contribution to VAM estimates are one approach to this, both to anticipate gaming behavior and to encourage desired responses.

Communicating Incentives Clearly

Page 28: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• In his NCME Career Award address, Haertel distinguished between two categories of purposes of large-scale testing: Measurement and Influence.

• An "influencing" purpose often depends less on the results of the test itself.
– Directing student effort
– "Shaping public perception"

• Validation arguments for Influencing purposes are rarely well described.

• These plots are a modest first step for visualizing the Influencing mechanisms of proposed models.

Haertel (2012) Measurement vs. Influence

Page 29: Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

• A medical analogy (thanks to Catherine McClellan) can be helpful in thinking about where VAM and school accountability research should continue to go.

• Doctors must gather data, identify symptoms, reach a diagnosis, and prescribe a treatment.
– In school and teacher effectiveness conversations, we often get stuck at "symptoms."
– Doctors do not average blood pressure results with fMRI results to get increasingly reliable and accurate measures of "health." Or at least they don't stop there.
– We need to continue advancing the science of diagnosis (what's wrong) and treatment (now what).

• We must continue beyond predictive accuracy and even conditional incentives to deeper understanding of teachers’ and administrators’ learning in response to evaluation systems.

School/Teacher Effectiveness and House, MD