nyscoss conference superintendents training on assessment 9 14

Andy Hegedus, Ed.D.

September 2014

From a Superintendent’s Perspective: Using data wisely

• How many of you think your literacy with assessments is “Good” or better?

• How many of you have a fine-tuned assessment program?

• How many of you think your practical knowledge about using data for systemic improvement is “Good” or better?

Trying to gauge my audience and adjust my speed . . .

• An adequate depth of knowledge to ask really good questions and the tenacity to do so

• A person willing to “Speak the Truth, With Love, To Power”

• A big lever – “What gets measured (and attended to), gets done”

The measurement and reinforcement system is your responsibility

The Bottom Line First: Here’s what you must have

• Increase your understanding about various urgent assessment-related topics
  – Ask better questions
  – Useful for making all types of decisions with data

My Purpose

• Assessment basics +
• Improving your assessment program
• Data culture

Three main topics

• What we’ve known to be true is now being shown to be true
  – Using data thoughtfully improves student achievement and growth rates
  – 12% mathematics, 13% reading
• There are dangers present, however
  – Unintended consequences

Go forth thoughtfully with care

Slotnik, W. J., Smith, M. D., It’s more than money, February 2013, retrieved from http://www.ctacusa.com/PDFs/MoreThanMoney-report.pdf

“What gets measured (and attended to), gets done”

Remember the old adage?

• NCLB
  – Cast light on inequities
  – Improved performance of “Bubble Kids”
  – Narrowed taught curriculum

The same dynamic happens inside your schools

An infamous example

It’s what we do that counts

A patient’s health doesn’t change because we know their blood pressure

It’s our response that makes all the difference

Be considerate of the continuum of stakes involved

Support

Compensate

Terminate

Increasing levels of required rigor

Increasing risk

Assessment basics (in a teacher evaluation frame)

1. Alignment between the content assessed and the content to be taught

2. Selection of an appropriate assessment
  • Used for the purpose for which it was designed (proficiency vs. growth)
  • Can accurately measure the knowledge of all students
  • Adequate sensitivity to growth

3. Adjust for context/control for factors outside a teacher’s direct control (value-added)

Three primary conditions

1. Assessment results used wisely as part of a dialogue to help teachers set and meet challenging goals

2. Use of tests as a “yellow light” to identify teachers who may be in need of additional support or are ready for more

Two approaches we like

Is the progress produced by this teacher dramatically different than teaching peers who deliver instruction to comparable students in comparable situations?

What question is being answered in support of

using data in evaluating teachers?

Marcus Normal Growth Needed Growth

Marcus’ growth

College readiness standard

The Test

The Growth Metric

The Evaluation

The Rating

There are four key steps required to answer this question

Top-Down Model

The Test

The Growth Metric

The Evaluation

The Rating

Let’s begin at the beginning

3rd Grade ELA Standards

3rd Grade ELA Teacher?

3rd Grade Social Studies Teacher?

Elem. Art Teacher?

What is measured should be aligned to what is to be taught

1. Answer questions to demonstrate understanding of text….

2. Determine the main idea of a text….

3. Determine the meaning of general academic and domain specific words…

Would you use a general reading assessment in the evaluation of a….

~30% of teachers teach in tested subjects and grades

The Other 69 Percent: Fairly Rewarding the Performance of Teachers of Nontested Subjects and Grades, http://www.cecr.ed.gov/guides/other69Percent.pdf

• Assessments should align with the teacher’s instructional responsibility
  – Specific advanced content
    • HS teachers teaching discipline-specific content, especially 11th and 12th grade
    • MS teachers teaching HS content to advanced students
  – Non-tested subjects
    • School-wide results are more likely “professional responsibility” rather than reflecting competence
  – HS teachers providing remedial services

What is measured should be aligned to what is to be taught

• Many assessments are not designed to measure growth

• Others do not measure growth equally well for all students

The purpose and design of the instrument is significant


Let’s ensure we have similar meaning

[Figure: scale from Beginning Literacy to Adult Reading, with a 5th Grade student marked (x) at Time 1 and at Time 2; labels: Status, Growth]

Two assumptions:
1. Measurement accuracy, and
2. Vertical interval scale

Accurately measuring growth depends on accurately measuring achievement

Questions surrounding the student’s achievement level

The more questions the merrier

What does it take to accurately measure achievement?

Teachers encounter a distribution of student performance

[Figure: scale from Beginning Literacy to Adult Reading, with individual students (x) spread above and below 5th Grade; label: Grade Level Performance]

Adaptive testing works differently

Item bank can span full range of achievement

How about accurately measuring height?

What if the yardstick stopped in the middle of his back?

Items available need to match student ability

California STAR NWEA MAP

How about accurately measuring height?

What if we could only mark within a pre-defined six inch range?

5th Grade Level Items

These differences impact measurement error

[Figure: test information (vertical axis, .00 to .12) by scale score (horizontal axis, 190 to 260) for a Fully Adaptive Test and a Constrained Adaptive or Paper/Pencil Test; annotation: Significantly Different Error]

To determine growth, achievement measurements must be related through a scale

If I was measured as: 5’ 9”

And a year later I was: 1.82 m

Did I grow? Yes. ~2.5”

How do you know?

Let’s measure height again
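The height question can only be answered once both measurements sit on one scale. The short Python sketch below illustrates that step under simple assumptions: the function names are hypothetical, and the only facts used are the standard conversions (12 inches per foot, 2.54 cm per inch).

```python
# Minimal sketch: two measurements in different units must be placed
# on a common scale before growth can be computed.
INCHES_PER_FOOT = 12
CM_PER_INCH = 2.54


def feet_inches_to_cm(feet, inches):
    """Convert a height in feet and inches to centimeters."""
    return (feet * INCHES_PER_FOOT + inches) * CM_PER_INCH


def meters_to_cm(meters):
    """Convert a height in meters to centimeters."""
    return meters * 100


first_cm = feet_inches_to_cm(5, 9)   # first measurement: 5' 9"
second_cm = meters_to_cm(1.82)       # second measurement: 1.82 m
growth_cm = second_cm - first_cm
growth_in = growth_cm / CM_PER_INCH  # growth expressed back in inches

print(f"Growth: {growth_cm:.2f} cm ({growth_in:.2f} in)")
```

The result is a little over two and a half inches, in line with the rough figure on the slide. Without the shared scale, the two readings cannot be compared at all.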

Traditional assessment uses items reflecting the grade level standards

[Figure: scale from Beginning Literacy to Adult Reading, showing Grade Level Standards for 4th, 5th, and 6th Grade; label: Traditional Assessment Item Bank]

Traditional assessment uses items reflecting the grade level standards

[Figure: scale from Beginning Literacy to Adult Reading, showing Grade Level Standards for 4th, 5th, and 6th Grade]

Overlap allows linking and scale construction


• Think of a high stakes test – State Summative
  – Designed mainly to identify if a student is proficient or not
• Do they do that well?
  • 93% correct on Proficiency determination
• Does it go off design well?
  • 75% correct on Performance Levels determination

Error can change your life!

*Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May 2004, http://dspace.udel.edu:8080/dspace/handle/19716/244

• Tests specifically designed to inform classroom instruction and school improvement in formative ways

No incentive in the system for inaccurate data

Using tests in high stakes ways creates new dynamic

[Figure: bar chart by school (schools 1 to 71; vertical axis from -6.00 to 10.00). Series: students taking 10+ minutes longer in spring than in fall; all other students]

New phenomenon when used as part of a compensation program

Mean value-added growth by school

When teachers are evaluated on growth using a once per year assessment, one teacher who cheats disadvantages the next teacher

Other consequence

• What were some things you learned?
• What practices do you want to reinforce?
• What do you need to do differently?

• Think – Pair
  – 2 min to make some notes
  – 3 min to share with a neighbor

Lessons?

Testing is complete . . . What is useful to answer our question?

The Test

The Growth Metric

The Evaluation

The Rating

The problem with spring-spring testing

[Timeline: 4/14 through 4/15, divided into Teacher 1, Summer, and Teacher 2]

• When possible use a spring – fall – spring approach

• Measure summer loss and incentivize schools and teachers to minimize it

• Measure teacher performance fall to spring, giving as much instructional time as possible between assessments

• Monitor testing conditions to minimize gaming of fall or spring results

A better approach
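A spring – fall – spring schedule lets the spring-to-spring change be split into a summer component and a school-year component. The Python sketch below shows that split under stated assumptions: the function name and the three scores are hypothetical, chosen only to illustrate the arithmetic.

```python
# Minimal sketch: separating summer change from school-year growth
# when a fall test sits between two spring tests.
def split_growth(prev_spring, fall, spring):
    """Return the components of growth across one calendar year."""
    return {
        "summer_change": fall - prev_spring,       # outside the teacher's time
        "school_year_growth": spring - fall,       # the teacher's instructional window
        "spring_to_spring": spring - prev_spring,  # what a once-per-year test sees
    }


# Hypothetical scores for one student.
result = split_growth(prev_spring=205, fall=201, spring=212)
print(result)
```

In this made-up case the spring-to-spring figure (7 points) understates what happened during the school year (11 points) because it absorbs a 4-point summer loss.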

The metric matters - Let’s go underneath “Proficiency”

[Figure: Difficulty of New York Proficient Cut Score. Vertical axis: National Percentile (0 to 100); horizontal axis: Grade 2 through Grade 8; series: Reading, Math; reference: College Readiness]

New York Linking Study: A Study of the Alignment of the NWEA RIT Scale with the New York State (NYS) Testing Program, November 2013

The metric matters - Let’s go underneath “Proficiency”

Dahlin, M. and Durant, S., The State of Proficiency, Kingsbury Center at NWEA, July 2011

[Figure: Estimated Proficiency Rates For Six NY Districts, 4th Grade Mathematics, With Proficiency Cut Scores Changed. Vertical axis from -40 to 100. 2012: 87; 2013: 55; Reported Change: -31]

What actually happened?


[Figure: Estimated Proficiency Rates For Six NY Districts, 4th Grade Mathematics, With 2013 Proficiency Cut Scores Applied. Vertical axis from 0 to 60. 2012: 46; 2013: 55; Actual Change: 10]
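The contrast between the reported change and the actual change comes down to whether both years are judged against the same cut score. The Python sketch below illustrates the idea under stated assumptions: the scores and both cut scores are invented for illustration and are not the district data behind the charts.

```python
# Minimal sketch: a proficiency rate only compares across years when
# the same cut score is applied to both years.
def proficiency_rate(scores, cut):
    """Percent of scores at or above the cut score."""
    return 100 * sum(score >= cut for score in scores) / len(scores)


# Hypothetical scale scores for two cohorts.
scores_2012 = [195, 200, 204, 208, 211, 215, 219, 224, 230, 236]
scores_2013 = [196, 201, 206, 210, 214, 216, 221, 226, 231, 238]

OLD_CUT = 203  # hypothetical cut score in force in 2012
NEW_CUT = 214  # hypothetical, harder cut score in force in 2013

# Each year judged by its own cut score: the "reported" change.
reported_change = (proficiency_rate(scores_2013, NEW_CUT)
                   - proficiency_rate(scores_2012, OLD_CUT))

# Both years judged by the 2013 cut score: the "actual" change.
actual_change = (proficiency_rate(scores_2013, NEW_CUT)
                 - proficiency_rate(scores_2012, NEW_CUT))

print(reported_change, actual_change)
```

With these made-up numbers the reported rate falls while performance against a constant standard rises, which is the same pattern the two charts describe.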

What gets measured and attended to really does matter

[Figure: Mathematics. One district’s change in 5th grade mathematics performance relative to the KY proficiency cut scores. Horizontal axis: Fall RIT; vertical axis: Number of Students; categories: Up, Down, No Change; markers: Proficiency, College Readiness]

[Figure: Mathematics. Number of 5th grade students meeting projected mathematics growth in the same district. Horizontal axis: Student’s score in fall; vertical axis: Number of Students; categories: Below projected growth, Met or above projected growth]

Changing from Proficiency to Growth means all kids matter


• Think – Pair – Share
  – 2 min to make some notes
  – 3 min to share with a neighbor

Lessons?

How can we make it fair?

The Test

The Growth Metric

The Evaluation

The Rating

Without context what is “Good”?

[Figure: scale from Beginning Reading to Adult Literacy. Labels: Scale; Norms Study; National Percentile; College Readiness Benchmarks; ACT; Performance Levels; State Test; “Meets” Proficiency; Performance Levels; Common Core; Proficient]

Normative data for growth is a bit different

Fall Score

Subject: Reading

Grade: 5th

7 points

FRL vs. non-FRL?

IEP vs. non-IEP?

ESL vs. non-ESL?

Outside of a teacher’s direct control

Starting Achievement

Instructional Weeks

Basic Factors

Typical growth

A Visual Representation of Value Added

Fall 5th Grade Test: Student A Fall Score 200

Spring 5th Grade Test: Student A Spring Score 209

Score 207 (Average Spring Score for Similar Students)

Value Added (+2 Score)
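The arithmetic in this visual can be sketched in a few lines of Python. This is only an illustration of the subtraction shown on the slide, not a full value-added model: the function name is hypothetical, and the list of similar students’ scores is invented so that its average matches the slide’s 207.

```python
# Minimal sketch: value added as the gap between a student's spring
# score and the average spring score of similar students.
from statistics import mean


def value_added(student_spring, similar_spring_scores):
    """Student's spring score minus the average for similar students."""
    expected = mean(similar_spring_scores)
    return student_spring - expected


# Hypothetical spring scores for students who also started near 200
# in the fall; chosen so the average is 207, as on the slide.
similar_students = [205, 206, 207, 208, 209]

student_a = value_added(student_spring=209, similar_spring_scores=similar_students)
print(student_a)
```

Real models define “similar students” with many more controls, which is where the modeling choices discussed later come in.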

• What if I skip this step?
  – Comparison is likely against normative data, so the comparison is to “typical kids in typical settings”
• How fair is it to disregard context?
  – Good teacher – bad school
  – Good teacher – challenging kids

Consider . . .

• Value-added models can control for a variety of classroom, school level, and other conditions
  – Proven statistical methods
  – All attempt to minimize error
  – Variables outside controls are assumed as random

Value-added is science

• Control for measurement error
  – All models attempt to address this issue
    • Population size
    • Multiple data points
  – Error is compounded with combining two test events
  – Many teachers’ value-added scores will fall within the range of statistical error

A variety of errors means more stability only at the extremes
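One way to see why error compounds when two test events are combined is the standard formula for the error of a difference. The Python sketch below assumes the fall and spring measurement errors are independent, which is a simplifying assumption; the function name and the 3-point errors are hypothetical.

```python
# Minimal sketch: the standard error of a growth score (spring minus
# fall) when the two measurement errors are assumed independent.
import math


def growth_standard_error(se_fall, se_spring):
    """Combine two independent standard errors for a difference score."""
    return math.sqrt(se_fall ** 2 + se_spring ** 2)


# Hypothetical: each test event carries a standard error of 3 points.
se_growth = growth_standard_error(3.0, 3.0)
print(round(se_growth, 2))
```

Under that assumption the growth score is noisier than either test alone, so small differences in growth are harder to trust than small differences in status.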

[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered. Vertical axis: Average Growth Index Score and Range (-12.00 to 12.00); teachers grouped into quintiles Q1 through Q5]

Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (blue line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.

Range of teacher value-added estimates

With one teacher, error means a lot
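The display described above plots each teacher’s average growth index plus or minus its standard error. The Python sketch below shows one simple way such a band could be computed and read; it is an assumption-laden illustration (standard error of a mean, hypothetical data for ten students), not the method behind the chart.

```python
# Minimal sketch: a teacher's average growth index with a band of
# plus or minus one standard error of the mean.
import math
from statistics import mean, stdev


def teacher_estimate(growth_index_scores):
    """Return (average, standard error of the average) for one teacher."""
    n = len(growth_index_scores)
    return mean(growth_index_scores), stdev(growth_index_scores) / math.sqrt(n)


def outside_error_band(average, standard_error):
    """True if the average differs from zero by more than one standard error."""
    return abs(average) > standard_error


# Hypothetical growth index scores for one teacher's students.
scores = [4, -3, 6, -5, 2, 7, -4, 1, 3, -1]
average, standard_error = teacher_estimate(scores)
print(average, round(standard_error, 2), outside_error_band(average, standard_error))
```

In this example the teacher’s average is positive, yet it sits inside the error band, so it cannot be separated from typical growth. That is the situation many teachers in the middle of the distribution are in.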

• Value-added models assume that variation is caused by randomness if not controlled for explicitly
  – Young teachers are assigned disproportionate numbers of students with poor discipline records
  – Parent requests for the “best” teachers are honored
• Sound educational reasons for placement are likely to be defensible

Assumption of randomness can have risk implications

“The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.”

Ballou, D., Mokher, C. and Cavalluzzo, L. (2012) Using Value-Added Assessment for Personnel Decisions: How Omitted Variables and Model Specification Influence Teachers’ Outcomes.

Instability at the tails of the distribution

LA Times Teacher #1

LA Times Teacher #2

http://www.aefpweb.org/sites/default/files/webform/AEFP-Using%20VAM%20for%20personnel%20decisions_02-29-12.pdf


http://projects.latimes.com/value-added/value-added-comparison


How tests are used to evaluate teachers

The Test

The Growth Metric

The Evaluation

The Rating

• How would you translate a rank order to a rating?
• Data can be provided
• Value judgment is ultimately the basis for setting cut scores for points or rating

Translation into ratings can be difficult to inform with data


• What is far below a district’s expectation is subjective

• What about
  • Obligation to help teachers improve?
  • Quality of replacement teachers?

Decisions are value based, not empirical

• System for combining elements and producing a rating is also a value based decision
  – Multiple measures and principal judgment must be included
  – Evaluate the extremes to make sure it makes sense

Even multiple measures need to be used well

Leadership Courage Is A Key

[Figure: bar chart of ratings (0 to 5) for Teacher 1, Teacher 2, and Teacher 3]

Ratings can be driven by the assessment

Observation Assessment

Real or Noise?

If evaluators do not differentiate their ratings,

then all differentiation comes from the test

Big Message
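A small Python sketch makes this message concrete. It assumes a rating built as a weighted average of an observation score and an assessment score; the equal weights, the function name, and the three teachers’ scores are all hypothetical.

```python
# Minimal sketch: when every teacher gets the same observation score,
# all spread in the combined rating comes from the assessment score.
from statistics import pvariance


def combined_ratings(observation, assessment, observation_weight=0.5):
    """Weighted average of observation and assessment scores per teacher."""
    assessment_weight = 1 - observation_weight
    return [observation_weight * obs + assessment_weight * test
            for obs, test in zip(observation, assessment)]


# Hypothetical scores on a 1 to 5 scale for three teachers.
observation = [3, 3, 3]   # evaluators did not differentiate
assessment = [2, 3, 5]    # test-based scores differ

ratings = combined_ratings(observation, assessment)
print(ratings)
print(pvariance(observation), pvariance(assessment), pvariance(ratings))
```

The observation scores have zero variance, so the ordering of the final ratings is exactly the ordering of the test-based scores, whether those differences are real or noise.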


• Think – Pair – Share
  – 2 min to make some notes
  – 3 min to share with a neighbor
  – 2 min for two report outs on anything so far

Lessons?

Improving your assessment program

• Read the sheet and highlight anything interesting to you

Let’s Define Types of Assessments

The pursuit of compliance is exhausting because it is always a moving target. Governors move on, the party in power gets replaced, a new president is elected, and all want to put their own stamp on education.

It is saner and less exhausting to define your own course and align compliance requirements to that.

Seven standards that define the purpose driven assessment system

1. The purposes of all assessments are defined and the assessments are valid and useful for their purposes

2. Teachers are educated in the proper administration and application of the assessments used in their classrooms

3. Redundant, misaligned, or unused assessments are eliminated

4. Assessment results are aligned to the needs of their audiences

5. Assessment results are delivered in a timely and useful manner

6. The metrics and incentives used encourage a focus on all learners

7. The assessment program contributes to a climate of transparency and objectivity with a long-term focus

1. Typical assessment purposes

• Identify student learning needs
• Identify groupings of students for instruction
• Guide instruction
• Course placement
• Determine eligibility for programs
• Award credits and/or assign grades
• Evaluate proficiency
• Monitor student progress
• Predict proficiency
• Project achievement of a goal
• Formative and summative evaluation of programs
• Formative evaluation to support school and teacher improvement
• Report student achievement, growth, and progress to the community and stakeholders
• Summative evaluation of schools and teachers

To increase value… Identify gaps between:
1. How critical is this data to your work?
2. How do you actually use this data?

Take 10 min to fill this out and 5 min to pair and discuss areas of biggest gap

1. Assessment Purpose Survey
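Once the survey is filled out, the gaps can be ranked with a few lines of Python. The sketch below is only an illustration: it assumes each purpose is rated on a 1 to 5 scale for criticality and for actual use, and the function name and ratings are hypothetical rather than taken from the survey sheet.

```python
# Minimal sketch: rank assessment purposes by the gap between how
# critical the data is and how much it is actually used.
def biggest_gaps(survey, top=3):
    """Return the purposes with the largest (critical - use) gap, largest first."""
    gaps = {purpose: critical - use
            for purpose, (critical, use) in survey.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical ratings: purpose -> (how critical, how much actually used).
survey = {
    "Guide instruction": (5, 2),
    "Predict proficiency": (3, 3),
    "Monitor student progress": (4, 3),
}

ranked = biggest_gaps(survey)
print(ranked)
```

The purposes at the top of the list are the natural starting points for the pair discussion.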

Compare assessments and their purposes to find unnecessary overlaps

Take 10 min to fill this out and 5 min to pair and discuss areas of redundancy

3. Eliminate waste
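The comparison of assessments and purposes can also be sketched in Python. The example assumes a simple inventory that maps each assessment to the purposes it is used for; the assessment names, the inventory, and the function name are hypothetical.

```python
# Minimal sketch: find purposes that are served by more than one
# assessment, as candidates for a redundancy discussion.
from collections import defaultdict


def find_overlaps(assessment_purposes):
    """Map each purpose served by 2+ assessments to those assessments."""
    by_purpose = defaultdict(list)
    for assessment, purposes in assessment_purposes.items():
        for purpose in purposes:
            by_purpose[purpose].append(assessment)
    return {purpose: sorted(names)
            for purpose, names in by_purpose.items() if len(names) > 1}


# Hypothetical inventory of assessments and their purposes.
inventory = {
    "Assessment A": {"Monitor student progress", "Guide instruction"},
    "Assessment B": {"Monitor student progress", "Predict proficiency"},
    "Assessment C": {"Course placement"},
}

overlaps = find_overlaps(inventory)
print(overlaps)
```

An overlap is not automatically waste; it flags where to ask whether both assessments are needed for that purpose.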


• Think – Pair – Share
  – 2 min to make some notes
  – 3 min to share with a neighbor
  – 2 min for two report outs on this section

Lessons?

Data Culture

Education Organizations Mature

Barber, M., Chijioke, C., & Mourshed, M. (2011). How the world’s most improved school systems keep getting better. McKinsey & Company.

Poor to Fair: Achieving the basics of literacy and numeracy

Fair to Good: Getting the foundations in place

Good to Great: Shaping the professional

Great to Excellent: Improving through peers and innovation

Data use does too

Data Use Continuum

Poor to Fair

Fair to Good

Good to Great

Great to Excellent

One on One

Within Teams

Within the Walls

Across the Walls

Requires a shift in the culture

What would you do with this?

• Where are your pockets of most maturity?
• Least maturity?
• What is causing the differences?

Think - 2 min

Reflection Time

• Education problems are “Wicked”
  – Problem boundaries are ill-defined
  – No definitive solutions
  – Highly resistant to change
  – Problem and solutions depend on perspective
  – Changes are consequential

Data can only take you so far

Research on data use in school improvement

• Use data as a platform for deeper conversations

• Define your problem well
  – Problem title and description
  – Magnitude
  – Location
  – Duration


• Part of a continuous improvement process
  – Data conversations
    • Collaborative
    • Embedded in culture
    • Structured process


• Love, Nancy – Using data to improve learning for all
• Lipton & Wellman – Got data? Now what?
• NWEA Data Coaching

Final Question

• Think – 2 min
• Pair – 3 min
• Share – 2 min - Two people

What are your biggest take-aways?

• Materials in your Conference App

Presentation available on Slideshare.net

Last thing

More information

Contacting me:
E-mail: [email protected]

Contacting us:
Exhibit Booth or Liz Kaplan
