
Mathematics Education Centre

Evidence in Education
Matthew Inglis

Plan

• A story about teaching mathematical proofs.

• How can we evaluate teaching innovations, or teaching more generally?

• I will argue that education research tells us that we can’t evaluate teaching in real world contexts, so we need to adopt a different approach.


Investigating and Improving Undergraduate Proof Comprehension
Lara Alcock, Mark Hodds, Somali Roy, and Matthew Inglis

Undergraduate mathematics students see a lot of written proofs. But how much do they learn from them? Perhaps not as much as we would like; every professor knows that students struggle to make sense of the proofs presented in lectures and textbooks. Of course, written proofs are only one resource for learning; students also attend lectures and work independently or with support on problems. But because mathematics majors are expected to learn much of their mathematics by studying proofs, it is important that we understand how to support them in reading and understanding mathematical arguments.

This observation was the starting point for the research reported in this article. Our work uses psychological research methods to generate and analyze empirical evidence on mathematical thinking, in this case via experimental studies of teaching interventions and quantitative analyses of eye-movement data. What follows is a chronological account of three stages in our attempts to better understand students' mathematical reading processes and to support students in learning to read effectively.

In the first stage, we designed resources we called e-Proofs to support students in understanding specific written proofs. These e-Proofs conformed to typical guidelines for multimedia learning resources, and students experienced them as useful. But a more rigorous test of their efficacy revealed that students who studied an e-Proof did not learn more than students who had simply studied a printed proof and in fact retained their knowledge less well. This led us to suspect that e-Proofs made learning feel easier, but as a consequence resulted in shallower engagement and therefore poorer learning.

At the second stage we sought insight into possible underlying reasons for this effect by using eye-movement data to study the mechanisms of mathematical reading. We asked undergraduate students and mathematicians to read purported proofs and found that experts paid more attention to the words and made significantly more back-and-forth eye movements of a type consistent with attempts to infer possible justifications for mathematical claims. This result is in line with the idea that mathematical experts make active efforts to identify logical relationships within a proof and that effective guidance might therefore be needed to teach students to do the same thing.

Lara Alcock is a senior lecturer in the Mathematics Education Centre at Loughborough University in the UK. She was the recipient of the 2012 Annie and John Selden Prize for Research in Undergraduate Mathematics Education, and she is author of the research-based study guides How to Study as a Mathematics Major and How to Think about Analysis (both published by Oxford University Press). Her email address is [email protected].

Mark Hodds is Mathematics Support Centre manager at Coventry University in the UK. He completed his PhD at the Mathematics Education Centre at Loughborough University with a thesis entitled "Improving proof comprehension in undergraduate mathematics." His email address is [email protected].

Somali Roy recently completed her PhD at the Mathematics Education Centre at Loughborough University with a thesis entitled "Evaluating a novel pedagogy in higher education: A case study of e-Proofs." Her email address is [email protected].

DOI: http://dx.doi.org/10.1090/noti1263

Matthew Inglis is a senior lecturer in the Mathematics Education Centre at Loughborough University, an honorary research fellow in the Learning Sciences Research Institute at the University of Nottingham, and a Royal Society Worshipful Company of Actuaries research fellow. He was the recipient of the 2014 Annie and John Selden Prize for Research in Undergraduate Mathematics Education. His email address is [email protected].

The present article draws on work supported by the Royal Society and by the Higher Education Academy’s Maths, Stats & OR Network. Self-Explanation Training for Mathematics Students is available at www.setmath.lboro.ac.uk.

Figures 5, 8, 9, 10, and 11 have been adapted with permission from the Journal for Research in Mathematics Education, ©2012, 2014, by the National Council of Teachers of Mathematics.

Notices of the American Mathematical Society, 62(7), 742-752

Proofs in lectures

Understanding proofs from lectures might be difficult because:

• Following live explanation requires rapid recall of basic knowledge, recognition and validation of logical deductions, and recognition of larger-scale structure.

• Hand gestures may not direct attention precisely enough.

• Explanation is ephemeral - it is no longer there when a student studies their lecture notes.

Lara designed a resource...

E-Proof Demo

Students liked them…

Student comments from feedback forms and focus groups:

• “I found hearing the lecturer explaining each line individually helpful in understanding particular parts and how they relate to the entire proof.”

• “e-Proof good for understanding the situation, rather than copying from the board etc.”

• “Having proofs online does make it easier to go at my own pace whilst still having the lecturer explain each part.”

Students liked them…

We used an online feedback form. Eight statements like:

• “I understand more by studying a proof on paper for myself than I do when using an e-Proof”

• “e-Proofs helped me understand where different parts of a proof come from and how they fit together”

Each scored 0 (negative) to 4 (positive).

Uniformly positive responses:

Mean score 25.5 (out of 32), 95% CI [24.5, 26.6]
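For concreteness, a summary of this kind can be computed as in the minimal sketch below; the response matrix is hypothetical, standing in for the real form data.

```python
import numpy as np
from scipy import stats

# Hypothetical responses: one row per student, one column per statement,
# each item scored 0 (negative) to 4 (positive), so totals range from 0 to 32.
rng = np.random.default_rng(0)
responses = rng.integers(2, 5, size=(60, 8))     # placeholder data only

totals = responses.sum(axis=1)                   # per-student total out of 32
mean = totals.mean()
sem = stats.sem(totals)                          # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(totals) - 1, loc=mean, scale=sem)
print(f"Mean score {mean:.1f} (out of 32), 95% CI [{lo:.1f}, {hi:.1f}]")
```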

Lecturers liked them…

• Many lecturers expressed an interest in designing their own e-proofs.

• Some kind of computer programme was needed to make this possible.

• Jisc used to have a funding scheme for this sort of thing.

Educational technologists liked them…

Based on feedback from students and colleagues, Lara applied for funding.

There was a robust review process.

And an interview.

Finally the University was awarded an £80k grant by JISC to develop further e-proofs.

Evidence for E-Proofs

Evidence that e-Proofs are effective:

• Qualitative feedback from students

• Quantitative feedback from students

• Peer observations of e-proof use

• Expert peer review of a funding proposal

Unfortunately, e-Proofs don’t work.

Somali Roy’s PhD

An example study:

Participants: 49 undergraduates.

Two proof versions: e-Proof; textbook.

Comprehension test: score out of 18.

Time: 15 minutes to study proof; 30 minutes to complete comprehension test.

Delayed post-test: same comprehension test; 20 minutes to complete. Two weeks later.

(The textbook version was identical to the e-Proof version, except with no audio or graphics, just text.)

Measuring 'proof comprehension' is non-trivial. We used work from an NSF-funded project at Rutgers University.
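As an illustration of the kind of group comparison this design calls for, here is a minimal sketch with made-up score arrays; it is not the study's actual analysis or data, which are reported in the Notices paper.

```python
import numpy as np
from scipy import stats

# Hypothetical comprehension scores (out of 18) for the two conditions;
# placeholders only, not the study's data.
eproof_immediate   = np.array([12, 10, 14, 11,  9, 13, 12, 10])
textbook_immediate = np.array([11, 12, 13, 10, 12, 14, 11, 13])
eproof_delayed     = np.array([ 8,  7, 10,  8,  6,  9,  8,  7])
textbook_delayed   = np.array([10, 11, 12,  9, 11, 12, 10, 12])

for label, a, b in [("immediate", eproof_immediate, textbook_immediate),
                    ("delayed",   eproof_delayed,   textbook_delayed)]:
    t, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    print(f"{label}: e-Proof {a.mean():.1f} vs textbook {b.mean():.1f}, "
          f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```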

E-Proofs

• Robust evidence that e-Proofs led to worse retention than reading a textbook.

• Why?

• Both Somali Roy and Mark Hodds worked on answering this question for their PhDs.

• Answer is, roughly, that you benefit from having to construct your own explanations.

• E-Proofs were too easy.

• Students learned less, but felt like they’d learned more.

• Full story in the Notices of the AMS paper.

Evidence in Education

• When looking at e-Proofs, we collected all the standard measures of educational effectiveness used in higher education:

• qualitative student feedback

• quantitative student feedback

• expert peer review

• All pointed in the same direction.

• But actually measuring student learning revealed that all these methods were misleading.

Evidence in Education

This is all very well known to education researchers.

What makes great teaching? Review of the underpinning research

Robert Coe, Cesare Aloisi, Steve Higgins and Lee Elliot Major

October 2014


all rated e-proofs highly

Evidence in Education

Range of methods used to evaluate educational quality in HE:

• Peer/expert classroom observation

• Student ratings

• Analysis of educational materials

• Lecturer portfolios

I’ll review the research on these

Classroom Observations

Journal of Teacher Education, 62(4), 367–382. © 2011 American Association of Colleges for Teacher Education. DOI: 10.1177/0022487110390221

Do We Know a Successful Teacher When We See One? Experiments in the Identification of Effective Teachers

Michael Strong, John Gargani, and Özge Hacifazlioglu

Abstract: The authors report on three experiments designed to (a) test under increasingly more favorable conditions whether judges can correctly rate teachers of known ability to raise student achievement, (b) inquire about what criteria judges use when making their evaluations, and (c) determine which criteria are most predictive of a teacher's effectiveness. All three experiments resulted in high agreement among judges but low ability to identify effective teachers. Certain items on the established measure that are related to instructional behavior did reliably predict teacher effectiveness. The authors conclude that (a) judges, no matter how experienced, are unable to identify successful teachers; (b) certain cognitive operations may be contributing to this outcome; (c) it is desirable and possible to develop a new measure that does produce accurate predictions of a teacher's ability to raise student achievement test scores.

Keywords: teacher effectiveness, teacher evaluation, classroom observation, value-added

Since the No Child Left Behind Act (2002) became law, the term teacher quality has been close to the surface of many an educator's consciousness. Now, with President Obama's Race to the Top, there is a focus on teacher effectiveness. It is fairly well documented that the best school predictor of student outcomes is high-quality, effective teaching as defined by performance in the classroom (Goldhaber, 2002; Goldhaber & Brewer, 2000; Hanushek, Kain, O'Brien, & Rivkin, 2005; Wright, Horn, & Sanders, 1997). A high-quality teacher may have considerable impact on student learning. For example, Hanushek (1992) found that, all things being equal, a student with a very high-quality teacher will achieve a learning gain of 1.5 grade-level equivalents, whereas a student with a low-quality teacher achieves a gain of only 0.5 grade-level equivalents. This translates to one year's growth being attributable to teacher quality differences. More recently Aaronson, Barrow, and Sander (2007) examined data from the Chicago Public Schools and found that a one-standard-deviation, one-semester improvement in math teacher quality raised student math scores by 0.13 grade equivalents or, over one year, roughly one-fifth of average yearly gains.

William Sanders, who pioneered the Tennessee Value-Added Assessment System, summarizing his own studies, stated that especially in math, the cumulative and residual effects of teachers are still measurable at least four years after students leave a classroom (Sanders, 2000, p. 335). A study by Nye, Konstantopoulos, and Hedges (2004), unusual because it randomly assigned students to classes, estimated

teacher effects on student achievement over four years. Their estimates of teacher effects on achievement gains were similar in magnitude to those of previous studies done by economists, but they found larger effects on mathematics than on reading achievement.

Observational Measurement of Teaching Practice

Findings such as these are convincing as to the importance of having an effective teacher but do nothing to tell us how to identify an effective teacher when we see one. Over the past few decades, researchers have attempted many ways of accomplishing this task, using a wide variety of first impressionistic and later systematic methods to investigate teaching practices through classroom observations (Brophy & Good, 1986; Nuthall & Alton-Lee, 1990; Stallings & Mohlman, 1988; Waxman, 1995), and findings from their studies have contributed to educators' notions of what constitutes good teaching. The several hundred observational systems that have been …

Affiliations: University of California, Santa Cruz, CA, USA; Gargani + Company, Inc., Berkeley, CA, USA; Bahçeşehir University, Istanbul, Turkey



Used value-added measures to identify a group of 'effective' teachers and a group of 'ineffective' teachers (separated by 0.5 SDs in learning gain). Showed videos of them teaching to experienced teachers and headteachers, and asked them to identify which teacher was in which group.

Teachers cannot do this. Across three experiments they performed worse than 50% (the score expected if they had guessed).
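To make "worse than 50%" concrete: when judges make forced binary classifications, their accuracy can be compared against the guessing baseline with a binomial test. The counts below are hypothetical, purely to show the calculation.

```python
from scipy import stats

# Hypothetical counts: judges classified each teacher as 'effective' or
# 'ineffective'; chance accuracy is 50%.
correct, total = 130, 288          # placeholder numbers, not the study's data

accuracy = correct / total
result = stats.binomtest(correct, total, p=0.5, alternative="less")
print(f"accuracy = {accuracy:.2%}, p(below chance) = {result.pvalue:.3f}")
```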

Classroom Observations

U.S. Department of Education

November 2016

Making Connections: The content, predictive power, and potential bias in five widely used teacher observation instruments

Brian Gill, Megan Shoji, Thomas Coen, Kate Place (Mathematica Policy Research)

Key findings

This study seeks to inform decisions about the selection and use of teacher observation instruments using data from the Measures of Effective Teaching project. It compares five widely used observation instruments on the practices they measure, their relationship to student learning, and whether they are affected by the characteristics of students in a teacher's classroom. The study found that:

• Eight of ten dimensions of instructional practice are common across all five examined teacher observation instruments.

• All seven of the dimensions of instructional practice with quantitative data are modestly but significantly related to teachers' value-added scores.

• The classroom management dimension is most consistently and strongly related to teachers' value-added scores across instruments, subjects, and grades.

• The characteristics of students in the classroom affect teacher observation results for some instruments, more often in English language arts classes than in math classes.


Measures of Effective Teaching project, funded by the Gates Foundation.

Large project which aimed to produce and evaluate reliable methods of measuring teacher quality.

Gold standard in lesson observation: much more sophisticated than anything we do in HE.

Classroom Observations


Table 4. Summary results for the strength of relationship between teachers’ cross-instrument observation dimension scores and their value added to student learning

Dimension                                       Adjusted correlation with underlying value-added score
Supportive learning environment                 .18
Classroom management                            .28
Student intellectual engagement with content    .22
Lesson structure and facilitation               .18
Content understanding                           .13
Language and discourse                          .14
Feedback and assessment                         .20

Note: Sample includes all teachers who taught a class with at least one valid observation instrument score and who taught at least five students with a valid state assessment outcome score. Grade 4–5 teachers in district 4 were excluded from these analyses because data on eligibility for the federal school lunch program were missing for all students. Results are reported for only 7 of the 10 dimensions identified in the content analysis because the scores for the other 3 dimensions were available for only one instrument. Reported results summarize correlations between teachers' value-added scores and their cross-instrument dimension score, adjusted for measurement error (see appendix A for details). All correlations shown were statistically distinguishable from zero (two-tailed test at the .05 level). The analysis estimated correlations separately by subject (English language arts or math) and primary and secondary grade levels (grades 4–5 or grades 6–9). For grades 4–5 correlations were estimated between value-added scores in one year and observation scores in another year for all available year combinations (year 1 value-added scores with year 2 observation scores, and vice versa). The reported results summarize findings across all grades and subjects as the mean of the adjusted rho estimates for each of the four subject-by-grade combinations: grade 4–5 English language arts, grade 4–5 math, grade 6–9 English language arts, and grade 6–9 math, weighted by the number of teachers included in each subject-by-grade group. See table C5 in appendix C for supplementary results, presented by subject and grade level.

Source: Authors’ analysis of Measures of Effective Teaching data.
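Corrections for measurement error of this kind typically use the classical disattenuation formula: divide the observed correlation by the square root of the product of the two measures' reliabilities (the report's own procedure is described in its appendix A). A minimal sketch with hypothetical reliability values:

```python
import math

def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: r_true = r_obs / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Hypothetical values: an observed correlation of .14 between observation
# scores and value-added, with assumed reliabilities of .7 and .5 respectively.
print(round(disattenuate(0.14, 0.7, 0.5), 2))   # -> 0.24
```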

Table 5. Summary results for the consistency of the relationship between teachers’ instrument-specific dimension scores and their value added to student learning for all subjects and grade levels

Dimension                                       Total correlations   Significant correlations   Percentage significant
Supportive learning environment                 14                   4                          29
Classroom management                            17                   10                         59
Student intellectual engagement with content    14                   4                          29
Lesson structure and facilitation               20                   6                          30
Content understanding                           14                   4                          29
Language and discourse                          17                   4                          24
Feedback and assessment                         17                   10                         59

Note: Sample includes all teachers who taught a class with at least one valid observation instrument score and who taught at least five students with a valid state assessment outcome score. Grade 4–5 teachers in district 4 were excluded from these analyses because data on eligibility for the federal school lunch program were missing for all students. Significance is based on a two-tailed test at the .05 level.

Source: Authors’ analysis of Measures of Effective Teaching data.


For comparison:

• Correlation between height and intelligence ≈ .20
• Correlation between swearing frequency and fear of pain ≈ .17
• Correlation between SNIP and REF article quality ≈ .33

Evidence in Education

Range of methods used to evaluate educational quality in HE:

• Peer/expert classroom observation

• Student ratings

• Analysis of educational materials

• Lecturer portfolios

I’ll review the research on these

Student Feedback/Ratings

Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related

Bob Uttl, Carmela A. White, Daniela Wong Gonzalez

Department of Psychology, Mount Royal University, Canada


Keywords: meta-analysis of student evaluation of teaching; multisection studies; validity; teaching effectiveness; evaluation of faculty; SET and learning correlations

Abstract

Student evaluation of teaching (SET) ratings are used to evaluate faculty's teaching effectiveness based on a widespread belief that students learn more from highly rated professors. The key evidence cited in support of this belief are meta-analyses of multisection studies showing small-to-moderate correlations between SET ratings and student achievement (e.g., Cohen, 1980, 1981; Feldman, 1989). We re-analyzed previously published meta-analyses of the multisection studies and found that their findings were an artifact of small sample sized studies and publication bias. Whereas the small sample sized studies showed large and moderate correlation, the large sample sized studies showed no or only minimal correlation between SET ratings and learning. Our up-to-date meta-analysis of all multisection studies revealed no significant correlations between the SET ratings and learning. These findings suggest that institutions focused on student learning and career success may want to abandon SET ratings as a measure of faculty's teaching effectiveness.


"For every complex problem there is an answer that is clear, simple, and wrong." H. L. Mencken

Student Evaluation of Teaching (SET) ratings are used to evaluate faculty's teaching effectiveness based on an assumption that students learn more from highly rated professors. Although SET were used as early as the 1920s, their use expanded across the USA in the late 1960s and early 1970s (Murray, 2005; Wachtel, 1998). Today, nearly all colleges and universities in North America use SET to evaluate their faculty's teaching effectiveness (Murray, 2005; Wachtel, 1998). Typically, SET are conducted within the last few weeks of courses, before the final grades are assigned. Students are presented with rating forms that ask them to rate their perceptions of instructors and courses, often on a 5-point Likert scale ranging from Strongly Disagree to Strongly Agree. The rating forms may ask students to provide overall ratings of instructor and/or course and they may also ask students to rate numerous specific characteristics of teachers (e.g., knowledge, clarity of explanation, organization, enthusiasm, friendliness, fairness, availability, approachability, use of humor, contribution to students' learning) and courses (e.g., organization, difficulty) (Feldman, 1989; Spooren, Brockx, & Mortelmans, 2013). The ratings for each course/class are summarized, typically by calculating mean ratings across all responding students for each rated item and across all rated items, and these mean class SET ratings are then used to evaluate professors' teaching effectiveness by comparing them, for example, to department or university average ratings. Although use of SET as feedback for professors' own use is not controversial, the use of SET as a measure of professors' teaching effectiveness for making high stakes administrative decisions about instructors' hiring, firing, merit pay, and promotions is highly controversial (e.g., Emery, Kramer, & Tian, 2003; Spooren, Brockx, & Mortelmans, 2013; Stark & Freishtat, 2014; Wachtel, 1998).

Proponents of SET as a measure of instructor teaching effectiveness have put forward a number of reasons for their use: (1) SET are cheap and convenient means to evaluate faculty's teaching, (2) SET are very useful to demonstrate administrators' concerns with public accountability and public relations, (3) SET allow students to have a say in evaluation of faculty's teaching, and (4) students are uniquely positioned to evaluate their experiences and perceptions of instructors as they are teaching classes (Murray, 2005; Wachtel, 1998). The last reason on this list is the SET proponents' main rationale for why SET ought to measure instructors' teaching effectiveness. The SET proponents assume that students observe instructors' behavior, assess how much they learned from the instructor, rate the instructor according to how …


Studies in Educational Evaluation 54 (2017) 22–42


Student Feedback/Ratings

… r = .12 with 95% CI = (0, .24) than for studies without such adjustments, r = .30 with 95% CI = (.20, .38), Q(1) = 5.21, p = .022. However, this estimate does not take into account the presence of the small study effects. Using all studies, the linear regression test of funnel plot asymmetry indicated asymmetry, p = .002. The estimates of SET/learning r adjusted for small study effects were: TF: .12 (with 22 filled in effects); NGT30: .10; Top10: .08; and limit meta-analysis adjusted r = .12 with 95% CI = (.03, .21) (test of small-study effects: Q-Q'(1) = 21.24, p < .001; test of residual heterogeneity: Q(95) = 191.49, p < .001).

We re-ran the above analyses but only for studies with prior knowledge/ability adjustments. The random effect model (k = 34) shows r = .16 with 95% CI = (−.02, .32), with moderate heterogeneity as measured by I² = 72.2%, Q(33) = 118.92, p < .001. The linear regression test of funnel plot asymmetry was not significant, p = .113. The estimates of SET/learning r adjusted for small study effects were: TF: −.01 (with 8 filled in effects); NGT30: .08; Top10: −.03; and limit meta-analysis adjusted r = −.06 with 95% CI = (−.17, .07) (test of small-study effects: Q-Q'(1) = 9.10, p = .003; test of residual heterogeneity: Q(32) = 109.82, p < .001).

Finally, the two studies identified as univariate outliers in the preliminary analyses, Capozza (1973) (n = 8) and Rodin and Rodin (1972) (n = 12), were also extreme outliers with studentized residuals below −3.0. Accordingly, we re-ran the above analyses with these two studies removed. With the two outliers removed, the random effect model (k = 95) shows r = .25 with 95% CI = (.18, .31), with lower heterogeneity, I² = 48.0%, Q(95) = 182.85, p < .001. Moreover, the mixed effects moderator analysis showed that SET/learning correlations were substantially smaller for studies with adjustment for prior knowledge/ability, r = .17 with 95% CI = (.05, .27), than for studies without such adjustments, r = .30 with 95% CI = (.21, .38), Q(1) = 3.34, p = .068. However, as noted above, this estimate does not take into account the presence of the small study effects. Using all studies, the linear regression test of funnel plot asymmetry indicated asymmetry, p < .001. The estimates of SET/learning r adjusted for small study effects were: TF: .13 (with 24 filled in effects), NGT30: .10, Top10: .08, and limit meta-analysis adjusted r = .11 with 95% CI = (.02, .20) (test of small-study effects: Q-Q'(df = 1) = 29.09, p < .001; test of residual heterogeneity: Q(93) = 153.63, p < .001).

We re-ran the above analyses but only for studies with prior knowledge/ability adjustments. The random effect model (k = 32) shows r = .20 with 95% CI = (.06, .34), with moderate heterogeneity as measured by I² = 66.2%, Q(31) = 94.48, p < .001. The linear regression test of funnel plot asymmetry was significant, p = .006. We recalculated the estimates of SET/learning r adjusted for small study effects: TF: .04 (with 10 filled in effects), NGT30: .08, Top10: −.03, and limit meta-analysis adjusted r = −.05 with 95% CI = (−.17, .07) (test of small-study effects: Q-Q'(1) = 20.41, p < .001; test of residual heterogeneity: Q(30) = 72.07, p < .001). Fig. 8, top left panel, shows the magnitude of the SET/learning correlations as a function of the multisection study size, revealing the familiar small study size effects. Fig. 8, right panel, shows the cumulative meta-analysis, starting with the largest study and adding smaller studies in each successive step; it indicates that the magnitude of the correlation increases as the smaller studies are added into subsequent meta-analyses. Finally, Fig. 8, bottom left panel, shows the result of the regression-based limit meta-analysis, including the adjusted r = −.04.
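The "linear regression test of funnel plot asymmetry" referred to here is an Egger-type test: regress each study's standardized effect on its precision and ask whether the intercept departs from zero. A minimal sketch on invented effect sizes and standard errors (not the paper's data):

```python
import numpy as np
from scipy import stats

# Hypothetical study-level effects (Fisher z of SET/learning r) and standard errors.
effects = np.array([0.45, 0.30, 0.35, 0.20, 0.15, 0.12, 0.10, 0.08])
ses     = np.array([0.50, 0.40, 0.33, 0.25, 0.20, 0.15, 0.12, 0.10])

snd = effects / ses            # standardized effects
precision = 1.0 / ses
res = stats.linregress(precision, snd)

# Egger's test: asymmetry is indicated by an intercept that differs from zero.
t_stat = res.intercept / res.intercept_stderr
p_val = 2 * stats.t.sf(abs(t_stat), df=len(effects) - 2)
print(f"intercept = {res.intercept:.2f}, p = {p_val:.3f}")
```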

4.2. Averaged SET/learning correlations

Fig. 9 shows the forest plot and both fixed and random effects model meta-analyses for SET/learning correlations using all SETs. The random effect model (k = 97) shows r = .17 with 95% CI = (.11, .23), with low heterogeneity as measured by I² = 34.1%, Q …

[Fig. 9. Forest plot for Averaged SET/learning correlations, including fixed effect and random effects model estimates. The plot includes the study identifier, number of sections, correlation, 95% CI, and weights for each study, as well as the fixed effect estimate (r = .13, 95% CI [.09, .18]) and random effects estimate (r = .17, 95% CI [.11, .23]); heterogeneity I² = 34.1%.]

… in nearly all multisection studies students were not randomly assigned to the sections, we followed the previous meta-analyses and used the SET/learning correlation adjusted for prior knowledge/ability, and we used zero order correlations only if the prior knowledge/ability adjusted correlations were not available. Consistent with our aims, we tested whether the type of the best available SET/learning correlation, zero order or adjusted for prior knowledge/ability, moderates the SET/learning relationship. Finally, we conducted separate meta-analyses using only multisection studies that provided knowledge/ability adjusted correlations because, even if the moderator test was not statistically significant, the prior knowledge/ability adjusted correlations are the better estimate of learning than zero order correlations.

Some multisection studies reported only one SET/learning correlation, typically between overall instructor SET rating and learning/achievement. Other multisection studies reported a number of SET/learning correlations, for example, one for each SET item. Accordingly, we analyzed the data two ways. First, in the first set of meta-analyses, for each multisection study, we used only one SET/learning correlation, that is, the one that best captured the correlation between overall instructor rating and learning/achievement. This approach follows Cohen (1981) as well as Clayson (2009). For the second set of meta-analyses, for each multisection study, we used SET/learning correlations averaged across all SET/learning items. However, we never averaged across zero order and ability/prior achievement adjusted correlations.

We examined the data for the presence of outliers and small study effects using boxplots, scatterplots, funnel plots, and regression tests. Next, we estimated effect size using the random effect model (using the restricted maximum-likelihood estimator, or REML) but also provided fixed effect estimates for basic analyses and for comparison with prior meta-analyses. A random effect model allows for true effect size to vary from study to study; for example, the effect size may be a little higher for some academic disciplines than for others, or it may be higher for studies conducted in colleges than for studies conducted in universities. In contrast, the fixed effect model assumes that all primary studies provide estimates of a single true effect size. Given the variety of disciplines, institutions, SET measures, learning measures, etc. employed by primary studies, the key assumption of the fixed effect model is unlikely to be true and the random effect model is more appropriate. We supported these analyses with forest plots. Next, we estimated SET/learning correlations adjusted for the small study effects using several basic as well as more sophisticated methods, including the trim-and-fill estimate, the cumulative meta-analysis starting with the largest sample study and adding the next smaller study on each successive step, the estimate based on all studies with sample size equal to or greater than 30 (NGT30), the estimate based on the top 10% of the most precise studies (TOP10), and the regression-based estimates using the limit meta-analysis method. In general, based on a variety of simulation studies, when small study effects are present, the TOP10 and the regression-based estimates using the limit meta-analysis method perform the best (Moreno et al., 2009; Stanley & Doucouliagos, 2014). Next, we also examined the sensitivity of meta-analyses to outliers. All reported analyses were conducted using R, and more specifically, using the packages meta, metafor, and metasens.
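The authors ran their random effects models with REML in R's metafor. Purely to illustrate the pooling logic, here is a sketch in Python of a simpler DerSimonian-Laird random effects meta-analysis of correlations via the Fisher z transform, on invented section-level correlations:

```python
import numpy as np
from scipy import stats

# Hypothetical multisection studies: (SET/learning correlation, number of sections).
studies = [(0.52, 8), (0.30, 25), (0.18, 36), (0.10, 63), (0.05, 85), (0.02, 190)]

r = np.array([s[0] for s in studies])
n = np.array([s[1] for s in studies], dtype=float)

z = np.arctanh(r)              # Fisher z transform
v = 1.0 / (n - 3.0)            # approximate sampling variance of z
w = 1.0 / v                    # fixed-effect (inverse-variance) weights

z_fixed = np.sum(w * z) / np.sum(w)
Q = np.sum(w * (z - z_fixed) ** 2)                 # heterogeneity statistic
df = len(z) - 1
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)                      # DerSimonian-Laird tau^2

w_star = 1.0 / (v + tau2)                          # random-effects weights
z_re = np.sum(w_star * z) / np.sum(w_star)
se_re = np.sqrt(1.0 / np.sum(w_star))
ci = np.tanh([z_re - 1.96 * se_re, z_re + 1.96 * se_re])

print(f"pooled r = {np.tanh(z_re):.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], "
      f"Q = {Q:.1f} (p = {stats.chi2.sf(Q, df):.3f}), tau^2 = {tau2:.3f}")
```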

4. Results

The 51 articles yielded 97 multisection studies. Table 2 shows overall instructor SET/learning correlations (column labeled "CIS r") as well as averaged SET/learning correlations (column labeled "CAS r") across all items/factors for each multisection study. Fig. 6 shows the relationship between Instructor SET/learning correlations and study size (number of sections), the relationship between Averaged SET/learning correlations and study size …

[Fig. 7. Forest plot for Instructor SET/learning correlations. The plot includes the study identifier, number of sections, correlation, 95% CI, and weights for each study, as well as the fixed effect estimate (r = .16, 95% CI [.12, .21]) and random effects estimate (r = .23, 95% CI [.16, .31]); heterogeneity I² = 54.9%.]

• Meta-analysis of 97 correlational studies.

• Found a correlation of r ≈ .2 (varies a bit depending on statistical assumptions).

• Re-analysed Cohen’s 1980s meta-analysis and found it was misleading due to small samples and publication bias.

Student Feedback/Ratings

• Correlational studies on student feedback are easy to run.

• Much harder to establish that teachers with good feedback scores cause better student learning.

• Best design:

1. Randomly allocate students to lecturers;

2. Lecturers teach the same material;

3. Students take the same examination;

4. Students move to a subsequent follow-on module, study it and take another examination.

5. Does the lecturer’s student evaluations on the prerequisite module predict scores on the subsequent module?
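A minimal simulation of the logic behind this design, with every parameter invented for illustration (the generative assumptions are chosen only to mirror the pattern reported in the two studies discussed next): randomly assigned students sit a common follow-on exam, and each lecturer's mean evaluation on the prerequisite module is then correlated with their students' mean follow-on mark.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_lecturers, students_per_class = 30, 40

# Simulated lecturer-level quantities (illustration only).
deep_teaching = rng.normal(size=n_lecturers)      # raises follow-on achievement
crowd_pleasing = rng.normal(size=n_lecturers)     # raises evaluations

set_scores = (3.5 + 0.4 * crowd_pleasing + 0.1 * deep_teaching
              + rng.normal(scale=0.2, size=n_lecturers))

followon_means = []
for j in range(n_lecturers):
    # Randomly assigned students sit a common follow-on examination.
    marks = 60 + 5 * deep_teaching[j] + rng.normal(scale=10, size=students_per_class)
    followon_means.append(marks.mean())

r, p = stats.pearsonr(set_scores, np.array(followon_means))
print(f"SET on prerequisite vs follow-on performance: r = {r:.2f}, p = {p:.3f}")
```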

This study has only been done twice

Journal of Political Economy, 2010, vol. 118, no. 3

Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors

Scott E. Carrell (University of California, Davis and National Bureau of Economic Research)

James E. West (U.S. Air Force Academy)

In primary and secondary education, measures of teacher quality are often based on contemporaneous student performance on standardized achievement tests. In the postsecondary environment, scores on student evaluations of professors are typically used to measure teaching quality. We possess unique data that allow us to measure relative student performance in mandatory follow-on classes. We compare metrics that capture these three different notions of instructional quality and present evidence that professors who excel at promoting contemporaneous student achievement teach in ways that improve their student evaluations but harm the follow-on achievement of their students in more advanced classes.


Evaluating students' evaluations of professors

Michela Braga (Bocconi University, Department of Economics, Italy), Marco Paccagnella (Bank of Italy, Trento Branch, Italy), Michele Pellizzari (University of Geneva, Institute of Economics and Econometrics, Switzerland)

1. Introduction

The use of anonymous students' evaluations of professors to measure teachers' performance has become extremely popular in many universities (Becker & Watts, 1999). They normally include questions about the clarity of lectures, the logistics of the course, and many others. They are either administered during a teaching session toward the end of the term or, more recently, filled in online.

The university administration uses such evaluations to solve the agency problems related to the selection and motivation of teachers, in a context in which neither the types of teachers, nor their effort, can be observed precisely. In fact, students' evaluations are often used to inform hiring and promotion decisions (Becker & Watts, 1999) and, in institutions that put a strong emphasis on research, to avoid strategic behavior in the allocation of time or effort between teaching and research activities (Brown & Saks, 1987; De Philippis, 2013).¹

Economics of Education Review 41 (2014) 71–88


Keywords: teacher quality; postsecondary education; students' evaluations

Abstract

This paper contrasts measures of teacher effectiveness with the students' evaluations for the same teachers using administrative data from Bocconi University. The effectiveness measures are estimated by comparing the performance in follow-on coursework of students who are randomly assigned to teachers. We find that teacher quality matters substantially and that our measure of effectiveness is negatively correlated with the students' evaluations of professors. A simple theory rationalizes this result under the assumption that students evaluate professors based on their realized utility, an assumption that is supported by additional evidence that the evaluations respond to meteorological conditions.



¹ Although there is some evidence that a more research-oriented faculty also improve academic and labor market outcomes of graduate students (Hogan, 1981).


Student Feedback/Ratings

• Braga et al.: “We find that teacher quality matters substantially and that our measure of effectiveness is negatively correlated with the students’ evaluations of professors.”

• Carrell & West: “We present evidence that professors who excel at promoting contemporaneous student achievement teach in ways that improve their student evaluations but harm the follow-on achievement of their students in more advanced classes.”

These results are broadly consistent with the findings of other studies (Carrell & West, 2010; Krautmann & Sander, 1999; Weinberg et al., 2009). Krautmann and Sander (1999) only look at the correlation of students' evaluations with (expected) contemporaneous grades and find that a one-standard-deviation increase in classroom GPA results in an increase of between 0.16 and 0.28 of a standard deviation in the evaluation. Weinberg et al. (2009) estimate the correlation of the students' evaluations of professors and both their current and future grades and report that "a one standard deviation change in the current course grade is associated with a large increase in evaluations, more than a quarter of the standard deviation in evaluations", a finding that is very much in line with our results. When looking at the correlation with future grades, Weinberg et al. (2009) do not find significant results.²⁴ The comparison with the findings in Carrell and West (2010) is complicated by the fact that in their analysis the dependent variable is the teacher effect whereas the items of the evaluation questionnaires are on the right-hand side of their model. Despite these difficulties, the positive correlation of evaluations and current grades and the reversal to negative correlation with future grades is a rather robust finding.

These results clearly challenge the validity of students' evaluations of professors as a measure of teaching quality. Even abstracting from the possibility that professors strategically adjust their grades to please the students (a practice that is made difficult by the timing of the evaluations, which are always collected before the exam takes place), it might still be possible that professors who make the classroom experience more enjoyable do so at the expense of true learning, or fail to encourage students to exert effort. Alternatively, students might reward teachers who prepare them for the exam, that is, teachers who teach to the test, even if this is done at the expense of true learning. This interpretation is consistent with the results in Weinberg et al. (2009), who provide evidence that students are generally unaware of the value of the material they have learned in a course.

Of course, one may also argue that students' satisfaction is important per se and, even, that universities should aim at maximizing satisfaction rather than learning, especially private institutions like Bocconi. We doubt that this is the most common understanding of higher education policy.

5. Robustness checks

In this section we present some robustness checks for our main results.

First, one might be worried that students might not comply with the random assignment to the classes. For various reasons they may decide to attend one or more courses in a different class from the one to which they were formally allocated.[25] Unfortunately, such changes would

Fig. 1. Students’ evaluations and estimated teacher effectiveness.

[24] Notice also that Weinberg et al. (2009) consider average evaluations taken over the entire career of professors.

[25] For example, they may wish to stay with their friends, who might have been assigned to a different class, or they may like a specific teacher, who is known to present the subject particularly clearly.

[Figure panels: Subsequent Module; Current Module]

Student Feedback/Ratings

Notable that the only two experimental evaluations of student feedback/ratings find the same results:

• student evaluations show weak positive correlations with current achievement;

• student evaluations show weak negative correlations with subsequent achievement.

No reason to suppose these are better

Evidence in Education

Range of methods used to evaluate educational quality in HE:

• Peer/expert classroom observation

• Student ratings

• Analysis of educational materials

• Lecturer portfolios

Summary

• A fundamental problem with teaching is that we have no valid way of assessing teaching quality, beyond assessing learning gains.

• (And assessing learning gains is also extremely tricky, especially in contexts where there are no control groups or standard examinations.)

• The methods we try to use in higher education are known to be invalid.

• What are the implications of this?

A Serious Problem

• What do I do if I want to change my teaching practice?

• If we can’t evaluate teaching quality validly, how can I know if my change has improved my students’ learning?

• Standard advice is to be a “reflective practitioner” and evaluate whether or not the change has helped my students.

• I think this is very bad advice.

• Evaluating student learning is really hard, especially so in higher education where we don’t have obvious control groups or standard examinations.

• Especially true when some effects are only revealed after a delay (e-Proofs, student feedback).

Do not despair!

• So if I can't trust student feedback or advice from observers, what can I do?

• I can carefully evaluate my proposed change against what we know about the (cognitive) mechanisms which underlie learning.

• If I can’t see how the proposed change will assist these mechanisms then I shouldn’t waste my time doing it.

• What are the cognitive mechanisms which underlie learning? Happily there are now quite a few good introductory texts to the science of learning.

THE SCIENCE OF LEARNING

www.deansforimpact.org

Perspectives on Psychological Science, 2016, Vol. 11(5), 652–660. © The Author(s) 2016. DOI: 10.1177/1745691616645770

Imagine a first-year college student (let’s call him Mark) preparing for his first big mid-term examination. Mark studies hard: He stays up late the night before the test, highlighting his textbook and poring over his notes. A week later, Mark is surprised to find a big red C– on his examination. Mark’s case is not unusual, especially in large introductory lecture courses where some students do not know how to prepare and to study for success in a college class.

Consider introductory psychology. Most students arrive with little idea of what psychology is about, but they are suddenly thrust into a rampage through a textbook with 15 chapters on 15 very different topics. The professor may cover perceptual illusions one week, classical conditioning the next week, and romantic relationships after that. The sheer number of technical concepts introduced each week demands study and concentration. How should students study and prepare for such a seemingly insurmountable amount of material? Every introductory college course – biology, economics, history, or physics – poses a similar challenge.

When college students are surveyed on how they study, many report relying on certain strategies such as highlighting (or underlining) as they read; upward of 80% of students report rereading their textbook and notes as their primary form of studying, often focusing on only the highlighted parts (Hartwig & Dunlosky, 2011; Karpicke, Butler, & Roediger, 2009). Despite the popularity of such strategies, research in cognitive and educational psychology suggests that they may consume considerable time without leading to durable learning (e.g., Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013).

The good news is that psychologists now better understand which study strategies are effective and which are not. For example, research has shown that students learn more when their studying is spaced apart in time rather than crammed into one long session (Carpenter, Cepeda, Rohrer, Kang, & Pashler, 2012) and that taking practice quizzes may be better than rereading the textbook (Roediger & Karpicke, 2006). Another effective strategy is to answer some questions about a topic before doing the reading rather than afterward (e.g., Richland, Kornell, & Kao, 2009). We discuss these techniques in more detail later.
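As a concrete aside on what "spacing study apart in time" can look like, here is a minimal sketch of an expanding-interval review schedule; the function name and the interval lengths are our own illustrative choices and are not taken from Putnam et al. or the studies they cite.

```python
# Illustrative only: a simple expanding-interval schedule for spaced self-quizzing.
from datetime import date, timedelta

def review_schedule(first_study: date, gaps_in_days=(1, 3, 7, 14, 30)):
    """Return the dates on which a topic could be re-quizzed after first studying it."""
    schedule = []
    current = first_study
    for gap in gaps_in_days:
        current = current + timedelta(days=gap)
        schedule.append(current)
    return schedule

# Example: a topic first studied on 1 September would be revisited on
# 2 September, 5 September, 12 September, 26 September and 26 October.
for when in review_schedule(date(2016, 9, 1)):
    print(when.isoformat())
```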

Many of these strategies may seem counterproductive to both students and teachers. After all, quizzing and



Optimizing Learning in College: Tips From Cognitive Psychology

Adam L. Putnam (Department of Psychology, Carleton College), Victor W. Sungkhasettee and Henry L. Roediger, III (both Psychological & Brain Sciences Department, Washington University in St. Louis)

Abstract

Every fall, thousands of college students begin their first college courses, often in large lecture settings. Many students, even those who work hard, flounder. What should students be doing differently? Drawing on research in cognitive psychology and our experience as educators, we provide suggestions about how students should approach taking a course in college. We discuss time management techniques, identify the ineffective study strategies students often use, and suggest more effective strategies based on research in the lab and the classroom. In particular, we advise students to space their study sessions on a topic and to quiz themselves, as well as using other active learning strategies while reading. Our goal was to provide a framework for students to succeed in college classes.

Keywords: learning techniques, memory, study strategies, metacognition, college success