Issues in Teacher Evaluation and Validity: Conceptual, Methodological, and Practical

Jose Felipe Martinez

University of California, Los Angeles, Graduate School of Education

New Mexico Teacher Evaluation Advisory Council (NMTEACH)

New Mexico Public Education Department

UCLA Graduate School of Education & Information Studies


Overview

• Teacher Evaluation: The Policy Context

• Teacher Evaluation

• Conceptual/Methodological Issues: Why, What, How

• Constructs and methods

• Teacher Evaluation with Multiple Measures

• Multiple Measures and Validity

• Models for combining indicators

• Validation Frameworks and Sources of Evidence

• Consequences, additional issues



Teacher Evaluation:

The Policy Context


Teacher Evaluation: A New Silver Bullet?

• Teacher evaluation systems undergoing reform

• Tied to perceptions of performance in national or international evaluations

• Reverse Lake Wobegon: all below average (Feuer, 2012)

• …assumptions about the role of “good/bad” teachers in explaining/improving the results and

• …about our ability to identify these teachers

• Related to perceptions of teaching profession

• …quality of existing teacher evaluation systems


Many Prominent examples

• United States

• Los Angeles, New York, Chicago (2012)

• Denver (2010)

• Tennessee (1992, 2012)

• Toledo, Cincinnati (1990s)

• Worldwide

• Singapore (2006)

• Chile (2003)

• Mexico (1993, 2009)

• Australia (2013)


Teacher Evaluation:

Conceptual/Methodological Issues


Why Evaluate?

• Motivations, inferences and uses

• Identify struggling teachers to help them improve

• Identify recurrent struggling teachers for sanction

• Provide incentives to the best teachers

• Inform school practice/district policies on Teacher Preparation and Professional Development

• Identify and scale effective teacher practice

• Or typically a combination... (e.g. NMTEACH)


Teacher Evaluation

Conceptual/Methodological Issues:

Why, What, How


What to Evaluate?

• Teacher competence (Reynolds, 1999):

• Knowledge: Subject, Pedagogical

• Skill: Ability, applied knowledge

• Disposition: Attitudes, Perceptions, Beliefs

• Practice: Classroom processes (e.g. instruction, assessment, management)

• And...

• Seniority, Credentials

• School citizenship, contributions to community…

• “Effectiveness”: Ability to raise student test scores


What to Evaluate? All of the above?

• “We fully understand that standardized tests don't capture all of the subtle qualities of successful teaching. That's why we call for multiple measures in evaluating teachers. In an ideal world, that data should also drive instruction and drive useful professional development.”

Arne Duncan

U.S. Secretary of Education


How to Evaluate?

Teacher Constructs (What?) | Measures (How?)

Knowledge (subject, pedagogical); Skills (ability, applied knowledge) | Multiple choice tests; performance assessments; vignettes

Practice, classroom performance (instruction, assessment, management) | Surveys, logs; classroom observations, video; artifacts, portfolios

Disposition (beliefs, attitudes) | Survey, interview

Citizenship (contributions to community) | Surveys, interview, self-assessment

Effectiveness (contribution to student achievement) | Student test score gains; “value added”

(Reynolds, 1999)


Which is Best? Which should we use?

• No method is inherently preferable

• Each illuminates a different aspect of Teacher [insert euphemism here].

• Different kind of information from different sources

• Pros and cons in reliability, validity, credibility…

• Here I will briefly discuss:

• Value Added Models

• Observations

• Surveys

• Portfolios


Value Added Models

• Culture changing towards using student achievement to evaluate teachers

• Simple Logic:

• Students do better (grow more) in some classrooms (Weisberg et al., 2009; Kane et al., 2011)

• Student learning should be a (the?) key criterion to evaluate teacher quality

• Seemingly Simple Method:

• With longitudinal data… compare teachers on the progress of their students, not their achievement.

• Estimate teachers' unique contributions to student academic growth, net of factors outside teachers' control
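To make the "seemingly simple method" concrete, here is a deliberately simplistic sketch on simulated data. It is not TVAAS or any operational model: it assumes a single prior-year score is the only control, equal class sizes, and random assignment of students to teachers. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (assumption): 20 teachers with 25 students each.
n_teachers, n_students = 20, 25
teacher = np.repeat(np.arange(n_teachers), n_students)
true_effect = rng.normal(scale=0.2, size=n_teachers)
prior = rng.normal(size=teacher.size)
current = 0.7 * prior + true_effect[teacher] + rng.normal(scale=0.6, size=teacher.size)

# Step 1: predict the current score from the prior score (ordinary least squares).
A = np.column_stack([np.ones_like(prior), prior])
coef, *_ = np.linalg.lstsq(A, current, rcond=None)
residual = current - A @ coef

# Step 2: a teacher's estimate is the mean residual of his or her students,
# i.e. how much better or worse they did than predicted from prior scores.
value_added = np.bincount(teacher, weights=residual) / np.bincount(teacher)

print(np.round(value_added[:5], 2))
```

Even in this idealized setting each estimate rests on only 25 noisy residuals, which gives a sense of why instability is a recurring concern with such estimates.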


Value Added Models

• A family of statistical models

• e.g. TVAAS, growth percentiles, (variable) persistence

• Correlated; choice of measures also important (Lockwood et al., 2007)

• A variety of issues:

• Partial view of student learning (Baker et al., 2010)

• Unstable estimates (Schochet & Chiang, 2010)

• Descriptive, not causal (Stuart, Rubin, & Zanutto, 2004), nor explanatory/diagnostic (Goe, 2011)

• Available only for some teachers (30-40% in the US)

• “…VAM estimates best used in combination with other indicators” (Braun et al., 2010)


Classroom Observations

• Widely used to assess quality teaching practice

• Explanatory + formative counterpart to VAM

• Identify areas in need of improvement; inform PD

• Expensive if standardized (training, time)

• Error from complex rubrics, human judgment

• Bias/subjectivity in construct definition/emphasis

• Lower reliability than traditional instruments (live or video)

• Weak correlations with other indicators including student achievement (Kane et al. 2010)


Classroom Observation: Constructs


Classroom Observation: Reliability

(Source: Bill & Melinda Gates Foundation, 2011)


Teacher Surveys

• Common method for collecting data on teacher (classroom) practice on a large scale

• Good coverage; low cost; low burden for teachers

• Adequate reliability

• Questionable validity

• Error from inconsistency in interpretation of questions

• …and social desirability

• e.g. Emphasis on higher order thinking

• Weak correlations with other indicators including student achievement (Kane et al. 2010)


Student Surveys

• Increasingly popular for teacher evaluation

• Coverage; cost; perceived validity

• Adequate reliability aggregated by classroom

• Correlated w/student achievement as much or more than teacher surveys (Kane et al., 2010)

• Additional information at the student level

• Variance reflects differentiated teacher practice with different students (Martínez, 2012; Muthen, 1995)

• Correlated w/achievement also within classrooms


Student Surveys: Remaining Issues

• Memory errors, inconsistency in interpretation

• Particularly with younger children

• Concerns for high stakes teacher evaluation

• Social desirability, pressure, other validity issues

• Cost issues

• Unit of measurement, construct invariance

• “My teacher asks me to read books”

• vs. “Our teacher asks us to read books”


Student Surveys: Correlation to Achievement


Teacher Portfolios

What’s in a Teacher Portfolio?

Classroom Artifacts (lesson plans, assignments, samples of student work, etc.)

Teacher Reflections (on practice reflected in artifacts)

Student/Teacher Survey/Log (classroom practice, attitudes, perceptions)

vs. Surveys: (+) Richer, better validity, PD value; (-) Higher cost, rater/rubric error, burden on teachers

vs. Observations: Debate taking form

• Compile evidence of teacher practice over a period of time


Portfolios vs. Observations

• 1. Cost to collect & score?

• Similar or lower than observations

• 2. Score reliability?

• Similar to observations/video (see MET study)

• May need to re-examine ideas of “acceptable reliability”

• Better coverage, validity for some aspects of practice

• Interesting possibilities with newer technologies

• 3. More burdensome for teachers?

• Yes, much more so (20-30+ hour effort)

• But, with burden comes Professional Development

• So far used mostly for “National Certification”

• Growing interest?: edTPA, PACT

• May be feasible as integral to an evaluation/PD cycle


Teacher Evaluation and

Multiple Measures

(Validity)


Validity

• How do we know we are doing a good job of evaluating teachers?

• Are our inferences and decisions valid?

“An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment.”

Messick (1989)


• “In educational settings, a decision or characterization that will have major impact [on a student] should not be made on the basis of a single score. Other relevant information should be taken into account if it will enhance the overall validity of the decision.”

Standards for Educational and Psychological Testing, Standard 13.7 (AERA, APA, & NCME, 1999)


What to Evaluate? All of the above

• New Mexico’s teacher evaluation system should utilize a matrix in which multiple components of a teacher’s evaluation combine to determine a teacher’s overall effectiveness rating.

• Effectiveness levels should only be assigned after careful consideration of multiple measures, including student achievement data, observations, and other proven measures [emphasis added]

New Mexico Effective Teaching Task Force


Multiple measures: Logic and Assumptions

1. Accuracy: Teachers classified into finer, more stable categories (De Pascale, 2012; Steele et al., 2010)

2. Validity: More complete picture of performance (Goe, 2011); less incentive for test preparation (Steele et al., 2010)

3. Feedback: Information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011)

4. Relevance: Greater confidence in results of evaluation among the public and stakeholders (Glazerman et al., 2011)

• General Assumption:

• Combining multiple measures leads to better informed (more valid) decisions about teachers and teaching


Combining Multiple measures: Conceptual Issues

• When and where do these assumptions hold? In what situations? Depends on several factors

• Assumptions about nature of constructs involved

• Intended inferences and uses

• What is meant exactly by combining (Brookhart, 2009)

• Not self-explanatory. A variety of models is available

• Substantial literature in psychology, personnel evaluation, and student assessment.

• Only starting to be applied to Teacher Evaluation


Models for Combining Multiple Measures

Model | Description

Conjunctive | Must meet criteria (pass) for all measures

Disjunctive | Must meet criteria (pass) for k measures

Compensatory | Based on composite measures. A high level on one measure compensates for low levels on others

Hybrid | e.g. Compensatory-conjunctive, Sequential

(Mehrens, 1989; Chester, 2003)
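The three basic decision models in the table can be sketched in a few lines. This is only an illustration: the function names, the 1-5 score scale, the pass mark of 3, and the equal weights are assumptions, not features of any actual evaluation system.

```python
from typing import Sequence


def conjunctive(passes: Sequence[bool]) -> bool:
    """Pass only if every measure is passed."""
    return all(passes)


def disjunctive(passes: Sequence[bool], k: int = 1) -> bool:
    """Pass if at least k of the measures are passed."""
    return sum(passes) >= k


def compensatory(scores: Sequence[float], weights: Sequence[float], cut: float) -> bool:
    """Pass if the weighted average of the scores reaches the cut score."""
    composite = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return composite >= cut


# Hypothetical teacher: three measures on a 1-5 scale, pass mark 3 on each.
scores = [4.5, 2.0, 3.5]
passes = [s >= 3.0 for s in scores]

print(conjunctive(passes))                        # False: the second measure is failed
print(disjunctive(passes, k=2))                   # True: two of three measures are passed
print(compensatory(scores, [1, 1, 1], cut=3.0))   # True: the mean (about 3.33) exceeds 3
```

The same teacher passes or fails depending only on the rule chosen, which is why the choice of model is itself a judgment.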


Combination Model 0: Do not Combine!

• May consider not combining the indicators!

• Summary indices not essential to formative or summative evaluation

• Key measures may be collected, maintained, and reported separately

• All used to illuminate a side of the picture (improve teaching, communication, citizenship, achievement?)

• And used jointly as needed where summative judgments are sought (Mehrens, 1989; Brookhart, 2009)

• Making combined use of multiple indicators ≠ Combining multiple indicators


Combination Model 1: Conjunctive, Disjunctive

[Diagram: Portfolio, Classroom Observation, Student Survey, Teacher Test, Student Achievement, Other Indicators]


Decision Rules and Reliability

• Error in multiple measures may cancel out or compound

• Assume Teacher A's true scores on T1 and T2 are both passes

• Because of unreliability, the probability of a passing observed score is estimated at 0.80 and 0.90, respectively

• Probability of passing scores on both tests (Conjunctive Model): 0.8 × 0.9 = 0.72

• Probability of a passing score on either test (Disjunctive Model): 1 − (0.2 × 0.1) = 0.98

(see e.g. Cronbach, Linn, Brennan, & Haertel, 1997; Douglas and Mislevy, 2010)
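The arithmetic above generalizes to any number of measures. The sketch below assumes, as the slide's calculation implicitly does, that measurement errors are independent across measures; the function names are illustrative.

```python
import math
from typing import Sequence


def p_pass_all(p: Sequence[float]) -> float:
    """Conjunctive rule: probability that every observed score is a pass."""
    return math.prod(p)


def p_pass_any(p: Sequence[float]) -> float:
    """Disjunctive rule: probability that at least one observed score is a pass."""
    return 1 - math.prod(1 - pi for pi in p)


# Teacher A truly passes both tests; observed passes occur with these probabilities.
p = [0.80, 0.90]

print(f"{p_pass_all(p):.2f}")  # 0.72
print(f"{p_pass_any(p):.2f}")  # 0.98
```

For a teacher whose true scores are passes, the conjunctive rule compounds error (a 28% chance of a wrong decision here) while the disjunctive rule lets errors cancel (2%). The trade-off reverses for a teacher whose true scores are fails.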


Decision Rules and Reliability

• Simplistic scenario. Complex rules often used in practice according to policy context and goals

• E.g.: Teachers must pass Measure 1 or 2, AND not rank lowest in Measure 3 (e.g. New Haven)

• Choice of decision rule more important for accuracy and validity than the reliability of the component measures chosen (Chester, 2003)

• Importantly: Models are not “objective”; each involves judgment

• Why satisfy k criteria, not k-1? Why those criteria?
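A compound rule of the kind described above can be written as a single boolean expression. This is a hypothetical sketch of the generic "pass Measure 1 or 2, AND not lowest on Measure 3" pattern, not the actual New Haven rule; the rating scale and parameter names are assumptions.

```python
def hybrid_rule(pass_m1: bool, pass_m2: bool, m3_rating: int, lowest_rating: int = 1) -> bool:
    """Disjunctive over Measures 1 and 2, conjunctive with a floor on Measure 3."""
    return (pass_m1 or pass_m2) and m3_rating > lowest_rating


# Hypothetical cases, with Measure 3 rated on a 1-5 scale.
print(hybrid_rule(True, False, m3_rating=3))   # True: passes M1, not lowest on M3
print(hybrid_rule(True, True, m3_rating=1))    # False: lowest rating on M3 overrides
print(hybrid_rule(False, False, m3_rating=5))  # False: passes neither M1 nor M2
```

Every element here (which measures are interchangeable, where the floor sits) is a policy choice rather than a technical result.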


Hybrid system: e.g. New Haven

• Synthesizes three component measures (each on 5-pt. scale):

• Teacher instructional practice

• Teacher professional values

• Student learning outcomes


Combination Model 2 (Compensatory): Principal Components / Factor Analysis

[Diagram: Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, Other Measures, Student Achievement; Global Construct]
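One way to read the principal-components version of this model: standardize the indicators and use the first principal component as the composite. The sketch below uses simulated data; the number of indicators, the loadings, and the sample size are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): 200 teachers, 5 indicators driven by one latent trait.
n_teachers = 200
loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.4])
latent = rng.normal(size=n_teachers)
noise = rng.normal(size=(n_teachers, loadings.size)) * np.sqrt(1 - loadings**2)
X = latent[:, None] * loadings + noise

# Standardize each indicator, then take the leading eigenvector of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues are returned in ascending order
weights = eigvecs[:, -1]
if weights.sum() < 0:                    # the sign of an eigenvector is arbitrary
    weights = -weights

composite = Z @ weights                  # one composite score per teacher
explained = eigvals[-1] / eigvals.sum()  # share of total variance captured

print(np.round(weights, 2), round(float(explained), 2))
```

The weights come entirely from the correlations among indicators, so a measure that correlates weakly with the others is down-weighted regardless of how much policy importance it carries.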


Combination Model 3 (Compensatory): Optimal Weight (Achievement as Criterion)

[Diagram: Artifacts/Portfolio, Teacher Survey, Other Measures, Teacher Construct, Student Achievement; paths labeled β]
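If student achievement is treated as the criterion, "optimal" weights are the regression coefficients (the β in the diagram) that best predict it from the other measures. A minimal sketch with ordinary least squares on simulated data follows; the three predictors, their generating weights, and the noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (assumption): 300 teachers, three standardized measures
# (say portfolio, teacher survey, other measures) and an achievement criterion.
n = 300
X = rng.normal(size=(n, 3))
generating_beta = np.array([0.5, 0.2, 0.1])
achievement = X @ generating_beta + rng.normal(scale=0.8, size=n)

# Least-squares weights, with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, achievement, rcond=None)
intercept, beta = coef[0], coef[1:]

# The composite is the predicted criterion; its correlation with the
# criterion is the largest any linear composite can reach in this sample.
composite = A @ coef
r = np.corrcoef(composite, achievement)[0, 1]

print(np.round(beta, 2), round(float(r), 2))
```

Weights derived this way are optimal only for predicting the chosen criterion, so the composite inherits whatever limitations the achievement measure has.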


MM Combination Model 4 (Compensatory): PC/FA: Student Achievement as Indicator

[Diagram: Artifacts/Portfolio, Teacher Survey, Other Measures, Student Achievement; Teacher Construct]


MM Combination Model 5 (Compensatory): SEM/Canonical Correlates

[Diagram: Teacher Construct (Artifacts/Portfolio, Teacher Survey, Other Measures); Student Outcomes (Student Measure #1, Student Measure #2, Other, e.g. non-cognitive)]


MM Combination Model 6 (Darlington, 1970): Unmeasured Criterion, Theoretical Weights

[Diagram: Artifacts/Portfolio, Teacher Survey, Other Measures, Student Achievement; Unmeasured Teacher Construct]


Empirical vs. Theoretical Weighting

• Model 6 is the most likely scenario in practice

• Policy assumptions/values (consensual) inform the system, alongside technical considerations

• It really is the only feasible scenario

• Empirical weights cannot be derived

• Ultimate criterion measure is NOT available

• Note Model 3 assumes such a measure is available

• But does not give “correct” weight for criterion

• Exposure to validity shrinkage (weights change over time)
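The validity-shrinkage point can be illustrated with simulated data, assuming (unlike Model 6) that a criterion is observed so that empirical weights can be fit at all. Weights estimated on one cohort are carried over to a second cohort in which the true relationships differ; cohort sizes and generating weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)


def simulate_cohort(n, beta):
    """Simulate three standardized measures and a criterion (hypothetical data)."""
    X = rng.normal(size=(n, 3))
    y = X @ beta + rng.normal(size=n)
    return X, y


def fit_weights(X, y):
    """Least-squares weights (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def composite_validity(X, y, coef):
    """Correlation between the weighted composite and the criterion."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.corrcoef(A @ coef, y)[0, 1]


X1, y1 = simulate_cohort(60, np.array([0.4, 0.3, 0.1]))  # calibration cohort
X2, y2 = simulate_cohort(60, np.array([0.2, 0.3, 0.4]))  # later cohort

w1 = fit_weights(X1, y1)
w2 = fit_weights(X2, y2)

r_calibration = composite_validity(X1, y1, w1)   # validity where the weights were fit
r_carried_over = composite_validity(X2, y2, w1)  # same weights applied to the later cohort
r_refit = composite_validity(X2, y2, w2)         # best achievable in the later cohort

print(round(float(r_calibration), 2), round(float(r_carried_over), 2), round(float(r_refit), 2))
```

By construction the carried-over weights can never beat weights refit on the later cohort, which is one way to see why empirically derived weights need periodic re-examination.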


Multiple Measures and Validity

• Models may lead to different inferences.

• Little guidance available; so…

• LOCAL VALIDITY STUDIES NEEDED (lots of them)

• As with single measures, need to set up testable validation hypotheses (Kane, 2006)

• Whatever the construct : Teacher [euphemism]

• 1. Describe intended inferences, uses, AND CONSEQUENCES

• 2. Collect empirical evidence to support

• 2012, 2013 MET reports will be influential. May force field to broaden our lens and revise assumptions and expectations

• No getting around conducting local validation studies


What KINDS of EVIDENCE?

• All of them: Validity is a unitary notion

• Theoretical support

• Consistency and accuracy (Reliability)

• Correlations, Internal structure

• Predictive power

• Consequences of use

• Validity becomes a rather empty academic topic if the consequences are not considered

• Or if they differ markedly from expectation


What consequences?

• Intended and Unintended Effects

• On teaching practice

• On different student outcomes

• On recruitment and retention

• On Motivation, Competition, Fraud

• On Perceptions of validity, fairness, utility

• On dynamic of relationships with parents and community

• Etc.


Final Remarks. Teacher Evaluation: Why are we doing this again?

• Some good reasons

• Make student achievement a priority

• Monitor & assess teacher performance

• Develop a culture of accountability

• and of reflection and improvement

• Inform PD to improve teacher performance

• However

• Multiple fallible indicators do not automatically yield better, less fallible inferences. But they always yield more complex ones

• Using indicators in combination involves technical but also conceptual and policy assumptions



• Because “the stakes are high, and the future of our children is at stake” (insert public official name here, circa 2012) we should proceed carefully and deliberately.

• Good measures take time to develop.

• Solid systems based on these measures take longer to test and implement.

• The consequences of implementing these systems are unknown and will take longer to assess.

• Experience suggests moving too fast to implement may shortchange the system



• Most important goal in my view is not only to avoid unfair decisions, and negative unintended consequences (though the potential for both should give us pause)

• Greatest risk is missing an opportunity to enact sound teacher evaluation policy with great potential to positively impact educational practice and outcomes


Thank you

[email protected]
