Issues in Teacher Evaluation and Validity: Conceptual, Methodological, and Practical
1/27University of California, Los Angeles
Issues in Teacher Evaluation and Validity:Conceptual, Methodological, and Practical
Jose Felipe Martinez
University of California, Los Angeles
Graduate School of Education
New Mexico Teacher Evaluation Advisory Council (NMTEACH)
New Mexico Public Education Department
UCLA Graduate School of Education & Information Studies
2/27University of California, Los Angeles
Overview
• Teacher Evaluation: The Policy Context
• Teacher Evaluation
  • Conceptual/Methodological Issues: Why, What, How
  • Constructs and methods
• Teacher Evaluation with Multiple Measures
  • Multiple Measures and Validity
  • Models for combining indicators
  • Validation Frameworks and Sources of Evidence
  • Consequences, additional issues
3/27University of California, Los Angeles
Teacher Evaluation:
The Policy Context
4/27University of California, Los Angeles
Teacher Evaluation: A New Silver Bullet?
• Teacher evaluation systems undergoing reform
• Tied to perceptions of performance in national or international evaluations
  • Reverse Lake Wobegon: all below average (Feuer, 2012)
• …assumptions about the role of “good/bad” teachers in explaining/improving the results and
• …about our ability to identify these teachers
• Related to perceptions of teaching profession
• …quality of existing teacher evaluation systems
5/27University of California, Los Angeles
Many Prominent examples
• United States
  • Los Angeles, New York, Chicago (2012)
  • Denver (2010)
  • Tennessee (1992, 2012)
  • Toledo, Cincinnati (1990s)
• Worldwide
  • Singapore (2006)
  • Chile (2003)
  • Mexico (1993, 2009)
  • Australia (2013)
6/27University of California, Los Angeles
Teacher Evaluation:
Conceptual/Methodological Issues
7/27University of California, Los Angeles
Why Evaluate?
• Motivations, inferences and uses
  • Identify struggling teachers to help them improve
  • Identify recurrently struggling teachers for sanction
  • Provide incentives to the best teachers
  • Inform school practice/district policies on Teacher Preparation and Professional Development
  • Identify and scale effective teacher practice
• Or typically a combination... (e.g. NMTEACH)
8/27University of California, Los Angeles
Teacher Evaluation
Conceptual/Methodological Issues:
Why, What, How
9/27University of California, Los Angeles
What to Evaluate?
• Teacher competence (Reynolds, 1999):
  • Knowledge: Subject, Pedagogical
  • Skill: Ability, applied knowledge
  • Disposition: Attitudes, Perceptions, Beliefs
  • Practice: Classroom processes (e.g. instruction, assessment, management)
• And...
  • Seniority, Credentials
  • School citizenship, contributions to community…
  • “Effectiveness”: Ability to raise student test scores
10/27University of California, Los Angeles
What to Evaluate? All of the above?
• “We fully understand that standardized tests don't capture all of the subtle qualities of successful teaching. That's why we call for multiple measures in evaluating teachers. In an ideal world, that data should also drive instruction and drive useful professional development.”
Arne Duncan
U.S. Secretary of Education
11/27University of California, Los Angeles
How to Evaluate? (Reynolds, 1999)
Teacher Constructs (What?) | Measures (How?)
Knowledge (subject, pedagogical); Skills (ability, applied knowledge) | Multiple Choice Tests, Performance Assessments, Vignettes
Practice, Classroom Performance (instruction, assessment, management) | Surveys, Logs, Classroom Observations, Video, Artifacts, Portfolios
Disposition (beliefs, attitudes) | Survey, Interview
Citizenship (contributions to community) | Surveys, Interview, Self Assessment
Effectiveness (contribution to student achievement) | Student Test Score Gains; “Value Added”
12/27University of California, Los Angeles
Which is Best? Which should we use?
• No method is inherently preferable
• Each illuminates a different aspect of Teacher [insert euphemism here].
  • Different kinds of information from different sources
  • Pros and cons in reliability, validity, credibility…
• Here I will briefly discuss:
  • Value Added Models
  • Observations
  • Surveys
  • Portfolios
13/27University of California, Los Angeles
Value Added Models
• Culture changing towards using student achievement to evaluate teachers
• Simple Logic:
  • Students do better (grow more) in some classrooms (Weisberg et al., 2009; Kane et al., 2011)
  • Student learning should be a (the?) key criterion to evaluate teacher quality
• Seemingly Simple Method:
  • With longitudinal data… compare teachers on the progress of their students, not their achievement.
  • Estimate teachers' unique contributions to student academic growth, net of factors outside teacher control
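The "seemingly simple method" above can be sketched in a few lines. This is only a toy illustration, not any operational model such as TVAAS: it assumes one prior-year score per student, a single pooled linear regression, and no other covariates or shrinkage. The function name `value_added` and the record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def value_added(records):
    """Toy value-added estimate.

    records: list of (teacher_id, prior_score, current_score), one per student.
    Fits one pooled regression of current score on prior score, then
    averages each teacher's student residuals (actual minus predicted).
    Assumes prior scores are not all identical.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    mean_prior, mean_current = mean(prior), mean(current)
    sxx = sum((x - mean_prior) ** 2 for x in prior)
    sxy = sum((x - mean_prior) * (y - mean_current)
              for x, y in zip(prior, current))
    slope = sxy / sxx
    intercept = mean_current - slope * mean_prior

    residuals = defaultdict(list)
    for teacher, x, y in records:
        predicted = intercept + slope * x
        residuals[teacher].append(y - predicted)
    # A positive mean residual means students grew more than predicted.
    return {teacher: mean(vals) for teacher, vals in residuals.items()}

# Hypothetical data: both classrooms start at the same level,
# but students of teacher "A" gain more than students of teacher "B".
records = [("A", 50, 60), ("A", 60, 70), ("B", 50, 50), ("B", 60, 60)]
effects = value_added(records)
```

The comparison is on progress relative to prediction, not on final achievement, which is the distinction the slide draws. Everything that makes real models hard (multiple years, missing data, non-random assignment of students) is left out here.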
14/27University of California, Los Angeles
Value Added Models
• A family of statistical models
  • e.g. TVAAS, Growth percentiles, (variable) Persistence
  • Estimates correlated across models; the achievement measure used is also important (Lockwood et al., 2007)
• A variety of issues:
  • Partial view of student learning (Baker et al., 2010)
  • Unstable estimates (Schochet & Chiang, 2010)
  • Descriptive, not causal (Stuart, Rubin, Zanutto, 2004), nor explanatory/diagnostic (Goe, 2011)
  • Available only for some teachers (30-40% in the US)
• “…VAM estimates best used in combination with other indicators” (Braun et al., 2010)
15/27University of California, Los Angeles
Classroom Observations
• Widely used to assess quality teaching practice
  • Explanatory + formative counterpart to VAM
  • Identify areas in need of improvement; inform PD
• Expensive if standardized (training, time)
• Error from complex rubrics, human judgment
  • Bias/subjectivity in construct definition/emphasis
  • Lower reliability than traditional instruments (live or video)
• Weak correlations with other indicators including student achievement (Kane et al. 2010)
16/27University of California, Los Angeles
Classroom Observation: Constructs
17/27University of California, Los Angeles
Classroom Observation: Reliability
(Source: Bill and Melinda Gates Foundation, 2011)
18/27University of California, Los Angeles
Classroom Observation: Reliability
(Source: Bill and Melinda Gates Foundation, 2011)
19/27University of California, Los Angeles
Teacher Surveys
• Common method for collecting data on teacher (classroom) practice on a large scale
  • Good coverage; low cost; low burden for teachers
  • Adequate reliability
• Questionable Validity
  • Error from inconsistency in interpretation of questions
  • …and social desirability
  • e.g. Emphasis on higher order thinking
• Weak correlations with other indicators including student achievement (Kane et al. 2010)
20/27University of California, Los Angeles
Student Surveys
• Increasingly popular for teacher evaluation
  • Coverage; cost; perceived validity
• Adequate reliability aggregated by classroom
  • Correlated w/student achievement as much or more than teacher surveys (Kane et al. 2010)
• Additional information at the student level
  • Variance reflects differentiated teacher practice with different students (Martínez, 2012; Muthen, 1995)
  • Correlated w/achievement also within classrooms
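Why can reliability be adequate once student ratings are aggregated by classroom? One standard way to see it is the Spearman-Brown formula, sketched below. The slide does not name a formula, so this is an assumption on my part: it treats the students in a classroom as parallel raters, and the single-student reliability of 0.20 is a made-up value for illustration.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Reliability of the mean of n parallel raters (Spearman-Brown formula)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)

# Hypothetical: one student's rating is a noisy indicator (reliability 0.20),
# but the classroom mean over 25 students is much more dependable.
single_student = 0.20
classroom_mean = spearman_brown(single_student, 25)
```

With these assumed numbers the classroom mean has a reliability of roughly 0.86. The student-level variance discussed above is treated as pure noise in this calculation, which is exactly what the "differentiated teacher practice" bullet calls into question.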
21/27University of California, Los Angeles
Student Surveys: Remaining Issues
• Memory errors, inconsistency in interpretation
  • Particularly with younger children
• Concerns for high stakes teacher evaluation
  • Social desirability, pressure, other validity issues
  • Cost Issues
• Unit of measurement, construct invariance
  • “My teacher asks me to read books”
  • vs. “Our teacher asks us to read books”
22/27University of California, Los Angeles
Student Surveys: Correlation to Achievement
23/27University of California, Los Angeles
Teacher Portfolios
• Compile evidence of teacher practice over a period of time
What’s in a Teacher Portfolio?
• Classroom Artifacts (lesson plans, assignments, samples of student work, etc.)
• Teacher Reflections (on practice reflected in artifacts)
• Student/Teacher Survey/Log (classroom practice, attitudes, perceptions)
vs. Surveys
+ Richer, Better Validity, PD value
- Higher cost, Rater/Rubric Error, Burden on teachers
vs. Observations
• Debate taking form
24/27University of California, Los Angeles
Portfolios vs. Observations
• 1. Cost to Collect & Score?
  • Similar or lower than observations
• 2. Score Reliability?
  • Similar to observations/video (see MET study)
  • May need to re-examine ideas of “acceptable reliability”
• Better coverage, validity for some aspects of practice
  • Interesting possibilities with newer technologies
• 3. More burdensome for teachers?
  • Yes, much more so (20-30+ hour effort)
  • But, with burden comes Professional Development
• So far used mostly for “National Certification”
  • Growing interest?: EdTPA, PACT
• May be feasible as integral to an evaluation/PD cycle
25/27University of California, Los Angeles
Teacher Evaluation and
Multiple Measures
(Validity)
26/27University of California, Los Angeles
Validity
• How do we know we are doing a good job of evaluating teachers?
  • Are our inferences and decisions valid?
“An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment.”
Messick (1989)
27/27University of California, Los Angeles
• “In educational settings, a decision or characterization that will have major impact [on a student] should not be made on the basis of a single score. Other relevant information should be taken into account if it will enhance the overall validity of the decision.”
Standards for Educational and Psychological Testing, Standard 13.7 (AERA, APA, & NCME, 1999)
28/27University of California, Los Angeles
What to Evaluate? All of the above
• New Mexico’s teacher evaluation system should utilize a matrix in which multiple components of a teacher’s evaluation combine to determine a teacher’s overall effectiveness rating.
• Effectiveness levels should only be assigned after careful consideration of multiple measures, including student achievement data, observations, and other proven measures [emphasis added]
New Mexico Effective Teaching Task Force
29/27University of California, Los Angeles
Multiple measures: Logic and Assumptions
1. Accuracy
- Teachers classified into finer, more stable categories (De Pascale, 2012; Steele et al., 2010)
2. Validity
- More complete picture of performance (Goe, 2011)
- Less incentive for test preparation (Steele et al., 2010)
3. Feedback
- Information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011)
4. Relevance
- Greater confidence in results of evaluation among the public and stakeholders (Glazerman et al., 2011)
• General Assumption:
  • Combining multiple measures leads to better informed (more valid) decisions about teachers and teaching
30/27University of California, Los Angeles
Combining Multiple measures: Conceptual Issues
• When/where do these assumptions hold? In what situations? Depends on several factors
• Assumptions about nature of constructs involved
• Intended inferences and uses
• What is meant exactly by combining (Brookhart, 2009)
• Not self-explanatory. A variety of models is available
• Substantial literature in psychology, personnel evaluation, and student assessment.
• Only starting to be applied to Teacher Evaluation
31/27University of California, Los Angeles
Models for Combining Multiple Measures
Model | Description
Conjunctive | Must meet criteria (pass) for all measures
Disjunctive | Must meet criteria (pass) for k measures
Compensatory | Based on composite measures. High level in one measure compensates for low levels in others
Hybrid | e.g. Compensatory-conjunctive, Sequential
(Mehrens, 1989; Chester, 2003)
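The three basic models in the table can be written as decision rules. The sketch below is a minimal illustration: the measure names, the 1-5 scale, the cutoffs, and the equal weights are all hypothetical, chosen only to show how the same teacher can pass under one model and fail under another.

```python
def conjunctive(scores, cutoffs):
    """Pass only if every measure meets its cutoff."""
    return all(scores[m] >= cutoffs[m] for m in cutoffs)

def disjunctive(scores, cutoffs, k=1):
    """Pass if at least k measures meet their cutoffs."""
    n_passed = sum(scores[m] >= cutoffs[m] for m in cutoffs)
    return n_passed >= k

def compensatory(scores, weights, composite_cutoff):
    """Pass if the weighted composite meets the cutoff.
    A high score on one measure can offset a low score on another."""
    composite = sum(weights[m] * scores[m] for m in weights)
    return composite >= composite_cutoff

# Hypothetical teacher rated on three measures, each on a 1-5 scale.
scores = {"observation": 3.5, "student_survey": 2.0, "value_added": 4.0}
cutoffs = {"observation": 3.0, "student_survey": 3.0, "value_added": 3.0}
weights = {"observation": 1 / 3, "student_survey": 1 / 3, "value_added": 1 / 3}

passes_conjunctive = conjunctive(scores, cutoffs)          # False: survey is below cutoff
passes_disjunctive = disjunctive(scores, cutoffs, k=2)     # True: two of three pass
passes_compensatory = compensatory(scores, weights, 3.0)   # True: composite is about 3.17
```

A hybrid model would chain these, for example applying a compensatory composite only to teachers who first clear a conjunctive minimum on one measure.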
32/27University of California, Los Angeles
Combination Model 0: Do not Combine!
• May consider not combining the indicators!
  • Summary indices not essential to formative or summative evaluation
  • Key measures may be collected, maintained, and reported separately
  • All used to illuminate a side of the picture (improve teaching, communication, citizenship, achievement?)
  • And used jointly as needed where summative judgments are sought (Mehrens 1989; Brookhart 2009)
• Making combined use of multiple indicators ≠ Combining multiple indicators
33/27University of California, Los Angeles
Combination Model 1: Conjunctive, Disjunctive
[Diagram: Portfolio, Classroom Observation, Other Indicators, Student Survey, Teacher Test, Student Achievement]
34/27University of California, Los Angeles
Decision Rules and Reliability
• Error in Multiple Measures may cancel out or compound
  • Assume Teacher A's True Scores in T1, T2 are passes
  • Because of unreliability, the probability of passing Observed Scores is estimated at 0.80 and 0.90, respectively
  • Probability of pass scores in both tests (Conjunctive Model): 0.8 × 0.9 = 0.72
  • Probability of pass scores in either test (Disjunctive Model): 1 − [0.2 × 0.1] = 0.98
(see e.g. Cronbach, Linn, Brennan, & Haertel, 1997; Douglas and Mislevy, 2010)
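The arithmetic on this slide generalizes to any number of measures. The sketch below reproduces the two-test case and extends it; it relies on the assumption, implicit in multiplying the probabilities, that measurement errors on the different measures are independent. The function names are hypothetical.

```python
from math import prod

def p_pass_conjunctive(pass_probs):
    """Probability that a truly passing teacher passes ALL measures,
    assuming independent measurement errors."""
    return prod(pass_probs)

def p_pass_disjunctive(pass_probs):
    """Probability that a truly passing teacher passes AT LEAST ONE measure,
    assuming independent measurement errors."""
    return 1 - prod(1 - p for p in pass_probs)

# The slide's example: observed pass probabilities of 0.80 and 0.90.
both = p_pass_conjunctive([0.80, 0.90])    # about 0.72
either = p_pass_disjunctive([0.80, 0.90])  # about 0.98

# Adding a third measure with pass probability 0.90 makes the
# conjunctive rule stricter still: about 0.648.
three_way = p_pass_conjunctive([0.80, 0.90, 0.90])
```

Under a conjunctive rule every added measure lowers the chance that a truly competent teacher passes, so errors compound. Under a disjunctive rule the same errors work in the teacher's favor.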
35/27University of California, Los Angeles
Decision Rules and Reliability
• Simplistic scenario. Complex rules often used in practice according to policy context and goals
• E.g.: Teachers must pass Measure 1 or 2, AND not rank lowest in Measure 3 (e.g. New Haven)
• Choice of decision rule more important for accuracy and validity than the reliability of the component measures chosen (Chester, 2003)
• Importantly: Models are not “objective”; each involves judgment
• Why satisfy k criteria, not k-1? Why those criteria?
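The example rule above (pass Measure 1 or 2, and do not rank lowest on Measure 3) can be written out directly. This is a sketch of the rule as stated on the slide, not New Haven's actual scoring procedure; the argument names and the convention that category 1 is the lowest are assumptions.

```python
def hybrid_rule(passed_measure_1, passed_measure_2,
                measure_3_category, lowest_category=1):
    """Disjunctive on Measures 1 and 2, combined conjunctively with
    a minimum requirement on Measure 3."""
    passed_either = passed_measure_1 or passed_measure_2
    not_lowest = measure_3_category != lowest_category
    return passed_either and not_lowest

# Hypothetical cases on a 5-category scale for Measure 3.
case_a = hybrid_rule(True, False, measure_3_category=3)   # True
case_b = hybrid_rule(True, True, measure_3_category=1)    # False: lowest on Measure 3
case_c = hybrid_rule(False, False, measure_3_category=5)  # False: passed neither 1 nor 2
```

Each piece of the rule encodes a judgment: which measures may substitute for one another, and which one acts as a veto.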
36/27University of California, Los Angeles
Hybrid system: e.g. New Haven
• Synthesizes three component measures (each on 5-pt. scale):
  • Teacher instructional practice
  • Teacher professional values
  • Student learning outcomes
37/27University of California, Los Angeles
Combination Model 2 (Compensatory): Principal Components / Factor Analysis
[Diagram: Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, Other Measures, and Student Achievement, linked to a Global Construct]
38/27University of California, Los Angeles
Combination Model 3 (Compensatory): Optimal Weight (Achievement as Criterion)
[Diagram: Artifacts/Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, and Other Measures; Teacher Construct; Student Achievement]
39/27University of California, Los Angeles
Combination Model 3 (Compensatory): Optimal Weight (Achievement as Criterion)
[Diagram: Artifacts/Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, and Other Measures, each with a weight β; Teacher Construct; Student Achievement]
40/27University of California, Los Angeles
MM Combination Model 4 (Compensatory): PC/FA: Student Achievement as Indicator
[Diagram: Artifacts/Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, Other Measures, and Student Achievement, linked to a Teacher Construct]
41/27University of California, Los Angeles
MM Combination Model 5 (Compensatory): SEM/Canonical Correlates
[Diagram: Artifacts/Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, and Other Measures, linked to a Teacher Construct; Student Measure #1, Student Measure #2, and Other (e.g. non-cognitive), linked to Student Outcomes]
42/27University of California, Los Angeles
MM Combination Model 6 (Darlington, 1970): Unmeasured Criterion, Theoretical Weights
[Diagram: Artifacts/Portfolio, Student/Parent Survey, Classroom Observation, Teacher Survey, Other Measures, and Student Achievement, linked to an Unmeasured Teacher Construct]
43/27University of California, Los Angeles
Empirical vs. Theoretical Weighting
• Model 6 is most likely scenario in practice
  • Policy assumptions/values (consensual) inform the system, alongside technical considerations
  • It really is the only feasible scenario
• Empirical weights cannot be derived
  • Ultimate criterion measure is NOT available
  • Note model 3 assumes such measure is available
  • But does not give “correct” weight for criterion
  • Exposure to Validity shrinkage (weight change over time)
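In the theoretical-weights scenario the weights come from policy deliberation rather than from a regression. A minimal sketch of such a composite follows. It assumes each measure is first standardized across teachers so that the stated weights are not distorted by differences in scale; the measure names, scores, and weights are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize scores across teachers (assumes the scores are not all equal)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def weighted_composite(measures, weights):
    """measures: {measure_name: [one score per teacher, same teacher order]}
    weights:  {measure_name: policy weight}
    Returns one composite score per teacher."""
    standardized = {name: z_scores(vals) for name, vals in measures.items()}
    n_teachers = len(next(iter(measures.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in weights)
        for i in range(n_teachers)
    ]

# Hypothetical scores for three teachers on two measures that disagree.
measures = {"observation": [1, 2, 3], "achievement": [3, 2, 1]}

equal = weighted_composite(measures, {"observation": 0.5, "achievement": 0.5})
tilted = weighted_composite(measures, {"observation": 0.75, "achievement": 0.25})
```

With equal weights the two measures cancel and all three teachers tie. Tilting the weights toward observation ranks the third teacher highest. The ranking is driven by the weights, which is why the choice is a policy judgment as much as a technical one.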
44/27University of California, Los Angeles
Multiple Measures and Validity
• Models may lead to different inferences.
  • Little guidance available; so…
  • LOCAL VALIDITY STUDIES NEEDED (lots of them)
• As with single measures, need to set up testable validation hypotheses (Kane, 2006)
  • Whatever the construct: Teacher [euphemism]
  • 1. Describe intended inferences, uses, AND CONSEQUENCES
  • 2. Collect empirical evidence to support
• 2012, 2013 MET reports will be influential. May force field to broaden our lens and revise assumptions and expectations
• No getting around conducting local validation studies
45/27University of California, Los Angeles
What KINDS of EVIDENCE?
• All of them: Validity is a unitary notion
  • Theoretical support
  • Consistency and accuracy (Reliability)
  • Correlations, Internal structure
  • Predictive power
  • Consequences of use
• Validity becomes a rather empty academic topic if the consequences are not considered
  • Or if they differ markedly from expectation
46/27University of California, Los Angeles
What consequences?
• Intended and Unintended Effects
  • On teaching practice
  • On different student outcomes
  • On recruitment and retention
  • On Motivation, Competition, Fraud
  • On Perceptions of validity, fairness, utility
  • On dynamic of relationships with parents and community
  • Etc.
47/27University of California, Los Angeles
Final Remarks. Teacher Evaluation: Why are we doing this again?
• Some good reasons
  • Make student achievement priority
  • Monitor & assess teacher performance
  • Develop a culture of accountability
  • and of reflection and improvement
  • Inform PD to improve teacher performance
• However
  • Multiple fallible indicators do not automatically yield better, less fallible inferences. But they always yield more complex ones
  • Using indicators in combination involves technical but also conceptual and policy assumptions
48/27University of California, Los Angeles
Final Remarks. Teacher Evaluation: Why are we doing this again?
• Because “the stakes are high, and the future of our children is at stake” (insert public official name here, circa 2012) we should proceed carefully and deliberately.
• Good measures take time to develop.
• Solid systems based on these measures take longer to test and implement.
• The consequences of implementing these systems are unknown and will take longer to assess.
• Experience suggests moving too fast to implement may shortchange the system
49/27University of California, Los Angeles
Final Remarks. Teacher Evaluation: Why are we doing this again?
• Most important goal in my view is not only to avoid unfair decisions, and negative unintended consequences (though the potential for both should give us pause)
• Greatest risk is missing an opportunity to enact sound teacher evaluation policy with great potential to positively impact educational practice and outcomes