strengthening the research base that informs … meta...strengthening the research base that informs...

Strengthening the Research Base that

Informs STEM Instructional Improvement Efforts:

A Meta-Analysis

Kathleen Lynch

Brown University

Heather C. Hill, Kathryn Gonzalez, & Cynthia Pollard

Harvard Graduate School of Education

March 2019

Suggested Citation:

Lynch, K., Hill, H.C., Gonzalez, K, & Pollard, C. (2019). Strengthening the Research Base that

Informs STEM Instructional Improvement Efforts: A Meta-Analysis. Brown University Working

Paper.

Correspondence concerning this article can be addressed to Kathleen Lynch at

[email protected].

This material is based upon work supported by the National Science Foundation under Grant

#1348669. Any opinions, findings, and conclusions or recommendations expressed in this

material are those of the author(s) and do not necessarily reflect the views of the National

Science Foundation. The authors would like to thank Patrick Aquino, Wilson Boardman,

Jonathan Hamel Sellman, Andrea Humez, Meghan Kelly, Jiwon Lee, Angie Luo, Manish Parmar,

and Melanie Rucinski for their assistance with data preparation and analysis.

INSTRUCTIONAL IMPROVEMENT IN STEM

1

Abstract

We present results from a meta-analysis of 95 experimental and quasi-experimental preK-12

science, technology, engineering, and mathematics (STEM) professional development and

curriculum programs, seeking to understand what content, activities and formats relate to

stronger student outcomes. Across rigorously conducted studies, we found an average weighted

impact estimate of +0.21 standard deviations. Programs saw stronger outcomes when they helped

teachers learn to use curriculum materials; focused on improving teachers' content knowledge,

pedagogical content knowledge and/or understanding of how students learn; incorporated

summer workshops; and included teacher meetings to troubleshoot and discuss classroom

implementation. We discuss implications for policy and practice.

Keywords: Professional development, curriculum, mathematics, science, STEM


2

Strengthening the Research Base that Informs STEM Instructional Improvement Efforts:

A Meta-Analysis

Instructional improvement efforts constitute a persistent feature of the preK-12 science,

technology, engineering, and mathematics (STEM) educational landscape, with a decades-long

trail of new curriculum materials and professional development programs aimed at changing how

teachers interact with students around content. Such initiatives bring with them significant costs;

scholars estimate that districts spend between 1 and 6% of their budgets on professional

development (see, e.g., Corcoran,1995; Miles, Odden, Fermanich & Archibald, 2004; Miller,

Lord & Dorney, 1994), and the market for instructional materials totaled $11.9 billion in 2013

(Cavanagh, 2015). Prior to 2002, scholars rarely rigorously evaluated such instructional

improvement programs, often using instead cross-sectional and/or self-report data to identify

“best practices” in professional development and curricular design (e.g., Becker & Park, 2011;

Desimone, 2009; Sparks, 2002). However, following calls in the early 2000s for stronger

research into the impact of educational interventions (Confrey & Stohl, 2004; Raudenbush, 2008;

Shavelson & Towne, 2001), federal research portfolios began to prioritize research methods that

allow causal inference, and to use student outcomes as the major indicator of program success.

Dollars’ and scholars’ turn toward using causal methods and student-level impacts has

resulted in a wealth of new studies. These new studies, we argue, permit rigorous empirical

analyses linking program characteristics to outcomes, a topic long of interest to practitioners who

make decisions regarding intervention design and/or adoption. In this paper, we present results

from a meta-analysis of preK-12 STEM curriculum materials and professional development

programs intended to improve instructional quality and student learning, seeking to understand

what content, activities, and formats are linked to stronger student outcomes. Our work differs


3

from other recent efforts in that it is a formal meta-analysis rather than a structured review (e.g.,

Kennedy, 2015; Gersten, 2014), and because the newly available studies allow us to compile a

dataset larger than past reviews, and thus to exclude studies with weaker designs.

We argue that this work is particularly timely. The Every Student Succeeds Act requires

that districts receiving Title I funds must adopt “evidence-based interventions,” including

programs and strategies proven to be effective in raising student achievement. However, recent

null findings from large-scale studies (e.g., Borman, Cotner, Lee, Boydston, & Lanehart, 2009;

Garet et al., 2011; Santagata, Kersting, Givven & Stigler, 2010), as well as an analysis of studies

curated by What Works Clearinghouse (WWC; Malouf & Taymans, 2016), has led many to

doubt the efficacy of instructional improvement programs. By contrast, we find an effect of

+0.21 standard deviations among the programs we study. We describe our study in more detail

below.

Background

STEM Instructional Improvement

Recent calls for reform in STEM education have focused on increasing both student

understanding of core disciplinary ideas and student engagement with key disciplinary practices

such as inquiry, argumentation and proof (NRC, 2011; NGSS 2013; CCSS-M, 2010). Yet

observational studies have shown that instruction in U.S. STEM classrooms tends to be lacking

in disciplinary concepts and low in student cognitive demand (e.g., Authors, in press; Banilower,

Smith, Weiss & Pasley, 2006; Kane, Kerr, & Pianta, 2014; Hiebert et al., 2005).

Perhaps as a result, programs aimed at improving STEM instructional quality abound.

These programs tend to involve two main strategies for improvement: teacher professional

development, which is typically intended to change some aspect of teachers’ instruction, and


4

new curriculum materials, which are typically intended to shape both instruction and the subject

matter content teachers teach. Professional development and curriculum programs can be

independent (e.g., professional development alone) or used in combination (e.g., when

curriculum materials’ implementation is supported by professional development). In contrast to

alternative reforms, such as standards, test-based accountability and market-based reforms,

which out broad principles and signals to which schools and teachers must interpret and react,

curriculum and professional development provide direct instructional guidance, attempting to

shape schools’ and teachers’ day-to-day interactions with students through lesson plans and

instructional strategies (Ball & Cohen, 1996). We discuss the specific theory of action for both

professional development and curriculum below.

Providers of STEM teacher professional development – often districts, but sometimes

non-profits, professional associations, or university-based faculty – typically design a set of

experiences intended to affect change in a variety of teacher- and classroom-level phenomena.

Most programs focus on instruction as a primary target for change, providing teachers new

routines, instructional strategies, and/or ways of teaching content. However, many of these

programs also hope to change teachers’ beliefs, knowledge of content, and knowledge of how

students learn content in order to support instructional improvement. For instance, Garet et al.

(Garet et al., 2010) describe a program in which teachers learn rational number concepts, learn

about persistent student misconceptions with this topic, then receive group and individual

coaching on how to apply the material learned in the content-focused sessions in their own

classrooms. Schneider and Meyer (2012) describe a program in which teachers learn about

assessments and formative assessment practices, then put them into practice. The STeLLA

program (Taylor et al., 2015) focuses mainly on improving teachers’ capacity to conduct


5

analyses of instruction, but also includes sessions on science content and how students learn

content. Such programs reflect the view that teachers’ instructional practice cannot change

without corresponding changes in teachers’ beliefs and knowledge (Smith & O’Day, 1990).

As these examples imply, the experiences designed by professional development

providers can vary substantially. As critiques of the “one-shot workshop” surfaced two decades

ago (Goldenberg & Gallimore, 1991; Kennedy, 1999), providers experimented with new

formats, including program delivery distributed across one or more school years, 1:1 coaching,

collaborative workgroups, and teacher learning in online settings. Recent reform movements

have also changed professional development foci, from programs primarily targeted toward

improving teacher content knowledge (Frechtling, Sharp, Carey & Vaden-Kiernan, 1995) to

more diverse topics, including improving teachers’ pedagogical content knowledge (Shulman,

1986), helping teachers develop and implement content-specific formative assessment, helping

teachers address the needs of English Learners, and helping teachers use technology more

effectively in STEM classrooms. With these new foci, professional development activities also

changed, with newer programs offering more study of student work, more focus on the

curriculum materials to be used in classrooms, and more focus on lesson analysis and planning.

The theory of action around curriculum materials also involves shaping what teachers do

in classrooms, but here the focus is on both introducing specific instructional moves – supported

by the activities and guidance in the text itself – as well as changing the content taught in

substantial ways. For instance, the elementary curriculum materials supported by the National

Science Foundation in the 1990s were designed to engage students in mathematical practices

such as problem-solving, mathematical reasoning, and precise communication, but also to

include topics that had not previously been taught in elementary schools, such as data, statistics,


6

and early algebra (Stein, Remillard & Smith, 2007). In science, numerous projects have built

units and curricula intended to replace conventional textbooks with more inquiry-oriented

instruction and content that is either more relevant to students’ lives or future careers (e.g., Marx

et al., 2004; Schwartz-Bloom & Halpin, 2003).

It is worth noting that many providers of curriculum materials also hold a theory of action

similar to that of professional developers – that teacher beliefs and knowledge must change in

order to support improvements in classroom practice. In some programs, providers intentionally

mix the new curriculum materials with intensive professional development. For instance, Saxe,

Gearhardt & Nasir (2001) paired a new fractions unit with a five-day summer workshop and 13

in-school sessions focused on improving teachers’ knowledge of fractions, teachers’ knowledge

of how students learn fractions, and teachers’ knowledge of student motivation. Roschelle

(Roschelle et al., 2010) and Hand (Hand et al., 2014) followed a similar approach. Many

curriculum providers also embed guidance for teachers in the written materials themselves. This

may include explanations of content intended for teachers – for instance, pages explaining the

conceptual ideas within the unit, how they connect to one another, how to represent those ideas,

alternative solution methods to problems, and even sample dialogue or scripts. Examples of such

curricula include Investigations in Number, Data, and Space in mathematics (Agodini et al.,

2013) and P-SELL (Llosa et al., 2016) in science.

In other cases, curriculum providers offer less professional development, often only a few

days intended to orient teachers to routines in the curriculum, adaptations for students with

specific needs, and technological enhancements to the curriculum (e.g., Pane, Griffin,

McCaffrey, & Karam, 2014); Resendez & Azin, 2006). A small number of curriculum materials


7

providers (e.g., Cervetti, Barber, Dorph, Pearson, & Goldschmidt, 2012) eschew professional

development entirely, often in hopes of replicating real-life conditions in schools.

Syntheses of STEM Instructional Improvement Programs

Several recent syntheses examine STEM teacher professional development and

curriculum improvement efforts. We review the methodology and findings from these syntheses,

then comment on their characteristics.

In the area of professional development (Blank, de las Alas, & Smith, 2008; Kennedy,

1999, 2015; Yoon, 2007; Scher & O’Reilly, 2009; Wilson, 2013), prior syntheses have generally

indicated positive impacts on learning. Meta-analyses suggest that effect sizes range between

+0.21 SD (Blank & de las Alas, 2010) and +0.54 SD (Yoon et al., 2007) on student outcomes,

with some suggestion that content-specific (math) rather than content-general (classroom

management) programs produce greater learning gains (Kennedy, 1999; Scher & O’Reilly,

2009). However, Scher and O’Reilly (2009) noted that the pool of STEM-focused articles and

reports published by 2004 could not support most expert recommendations regarding “best

practices” in professional development (see, e.g., Desimone, 2009). For example, although

experts have frequently expressed the opinion that professional development must be longer in

duration in order to be effective (e.g., Darling-Hammond & McLaughlin, 1995; Desimone,

2011), recent syntheses of the empirical literature have returned inconsistent findings on this

point (e.g., Blank & de las Alas, 2010). Kennedy (2016) found that more time-intensive

programs had weaker effects, whereas Yoon found that the three studies in his review that had

the least amount of professional development (5-14 hours) showed no statistically significant

impacts on student learning. However, Yoon’s review included only nine studies. Scher and


8

O’Reilly (2009) found that multi-year interventions were more effective than single-year

interventions in math, but not in science.

Two recent structured reviews of professional development, each conducted several years

after federal guidelines changed to prioritize causal research, provide another form of evidence

regarding instructional improvement programs. One review specific to mathematics (Gersten,

Taylor, Keys, Rolfhus, & Newman-Gonchar, 2014) used the stringent WWC evidence standards

to screen studies, and perhaps as a result returned too few (five) to discern patterns in program

impacts. Another (Kennedy, 2015) identified 28 studies, split equally between English Language

Arts, science, and mathematics. Kennedy found that the characteristics often cited as best

practices in professional development (content-focused, collective participation of all teachers

within a grade/school, duration) were less predictive of positive outcomes than other features,

including assisting teachers in gaining insight into their practice and helping teachers inject novel

ideas into their practice. The latter categories included several coaching and curriculum-based

programs. Finally, in science, Slavin and colleagues (Slavin, Lake, Hanley, & Thurston, 2014)

found that professional development programs supporting inquiry-based elementary science

raised student outcomes by an average of 0.36 SD. For secondary science, programs that

emphasized inquiry through teacher professional development saw an average effect size of

+0.24 SD.

In the area of curriculum materials, two research syntheses by Slavin and colleagues

(Slavin & Lake, 2008; Slavin, Lake, & Groff, 2009) using studies published between 1971 and

2008 found fairly small effects for mathematics curriculum materials, at +0.03 SD and +0.10 SD

for secondary and elementary curricula respectively. For science (Slavin, et al., 2014),

elementary programs with “kits” and accompanying professional development had a near-zero


9

(+0.03 SD) average impact on student outcomes. The average impact of kits in secondary science

proved similar (+0.04 SD), with the average impact of science textbooks not substantially larger

(+0.10 SD) (Cheung, Slavin, Kim, & Lake, 2017). A review of studies by the National Research

Council (2004) conducted in the early 2000s found stronger results for NSF-funded curricula

than conventional materials, in that 59% of comparisons between materials favored the NSF

materials. However, many regard vote counting as a weak and misleading synthesis procedure

(Hedges & Olkin, 1980), and Confrey (2006) herself noted that the studies included in the NRC

report were of varying quality, and that as study rigor increased, effect sizes diminished. Slavin

and colleagues, perhaps because they used more rigorous study selection criteria, did not find

any relationship between study design and effect size, although they note that the methodological

rigor of most included studies was still relatively weak, and “quality research is particularly

lacking in this area” (Slavin, 2008, p. 480).

Finally, one recent study (Taylor et al., 2018) examined interventions that offered

professional development, curriculum materials, and computer software intended to improve

student science outcomes. Across 96 such studies they found an average impact of 0.489 SD,

with programs evaluated by researcher-designed assessments posting significantly higher

impacts, similar to an earlier report by Hill, Bloom, Black & Lipsey (2008). Other study and

intervention characteristics did not significantly predict program outcomes.

Motivation for the Current Study

These research syntheses share several important characteristics. To start, nearly all noted

that the evidence base for making recommendations was thin at the time of the review. Using the

pool of articles and reports on professional development published by 2003, Yoon and

colleagues (2007) could find only nine ELA, math, and science studies that met WWC evidence


10

standards. Years later, Gersten et al. (2014) located only five mathematics professional

development studies that met the WWC’s exacting standards. Other evaluations (e.g., Scher &

O’Reilly, 2009; Slavin, Lake, Hanley, & Thurston, 2014; Cheung, Slavin, Kim & Lake 2017)

achieved a greater sample size by including quasi-experimental studies with matched comparison

groups (and sometimes retrospective matching, a disputed technique [Slavin, 2008]). Kennedy’s

cross-subject synthetic review (2016) with 28 studies was an exception, as many studies

described more recent strong quasi-experiments or actual randomized trials. In general, however,

the small sample sizes typical in these reviews limited both the generalizability of findings and

also the number of program features, or moderators, that could be coded and examined. We

argue that recent IES- and NSF-funded classroom-level experiments provide a thicker evidence

base, eliminating the need to include studies that conducted post-hoc matching and facilitating

more moderator analyses.

These reviews also share another important characteristic, in that many identified only a

small number of program features for study. This is most apparent in the area of professional

development, where research syntheses almost universally coded studies for duration, content-

specificity, and delivery format. For curriculum materials, most analyses categorized programs

by the degree of alignment with standards-based reform (e.g., Confrey & Stohl, 2004). With the

accumulation of more recent rigorous studies, however, have come opportunities to understand

how features of professional development or curriculum materials moderate effects on student

outcomes. For instance, professional development activities (watching video, designing lessons,

studying student work) may differentially impact program outcomes, as might curriculum design

features, such as length of implementation and degree of support for teachers. In addition, the


11

programs studied recently contain more varied delivery methods and features (e.g., coaching;

online learning components) than those of a decade ago.

Viewed from a contemporary perspective, several other issues emerge. First, existing

research syntheses of professional development studies tend to find that most returned positive

and significant effects. Yoon et al. (2007), for instance, found only one of twenty effects

identified across nine studies was negative, and only one was zero. Yet the more recent wave of

studies tends to find more mixed impacts (e.g., Garet et al., 2008; Santagata et al., 2010). This

suggests that the pool of studies included in earlier syntheses might have suffered from two

issues: the “file drawer” problem, in which studies with null results are not published (Slavin,

2008); or the “boutique” problem, in which only highly promising – and often unusual –

programs are studied (Author, 2004). We believe that both problems may have been ameliorated

in recent years. IES and NSF funding guidelines have explicitly encouraged the evaluation of

programs that have wide reach, meaning those programs may be more typical of the offerings

available to teachers. Further, better reporting practices – e.g., final reports posted on websites or

short reports archived on conference websites – may have rescued studies from file drawers.

Second, the practice of reviewing professional development and curriculum studies

separately creates conceptual and practical difficulties. On the conceptual side, most curriculum

programs also include a professional development component for teachers; likewise, some

professional development programs offer teachers materials to support the implementation of

new practices within classrooms. On the practical side, the combination of professional

development and materials together may be especially effective, as compared to either one alone

or one with a minimal dose of the other (Authors, 2001). Studying both within one review may


12

enhance our understanding of how these instructional improvement efforts can complement one

another.

Methods

Search Procedures

We searched for studies in three phases. We began by scanning the reference lists of prior

reviews of math and science professional development and curriculum improvement programs

(Blank, de las Alas, & Smith, 2008; Cheung, Slavin, Kim, & Lake, 2017; Furtak, Seidel, Iverson,

& Briggs, 2012; Gersten et al., 2014; Kennedy, 1998, 2015; Scher & O’Reilly, 2009; Slavin,

Lake, & Groff, 2009; Slavin & Lake, 2008; Slavin, Lake, Hanley, & Thurston, 2014; Timperley,

Wilson, Barrar, & Fung, 2007; Wilson, 2013; Yoon, 2007; Zaslow et al., 2010) for the years

1989 through 2004. Due to resource constraints, we did not conduct additional searches for

materials dated prior to 2004. We argue that this is reasonable given that prior reviewers have

previously searched the published and grey literature from the pre-2004 period extensively1;

given that rigorous studies of teacher PD and curriculum improvement were relatively rare prior

to the early 2000s (e.g., Kennedy, 2016); we considered the likelihood relatively low that we

would have uncovered previously unknown randomized trials or sufficiently well-designed

quasi-experiments from the pre-2004 period.2 See Appendix Table C13 for a descriptive

comparison of the studies published pre-2004 versus studies published 2004 or later. As

illustrated in the table, descriptively, studies from the earlier period had larger effect sizes on

average; they were also less likely to be RCTs, and less likely to use a state standardized test as

an outcome variable. As we discuss below, we control for these methodological variables in all

of our primary models.


13

In the second search phase, we conducted an electronic search using the databases

Academic Search Premier, ERIC, Ed Abstracts, PsycINFO, EconLit, and ProQuest Dissertations

& Theses, for the period January 2004 - March 2016. Searches were conducted using subject-

related keywords adapted from Yoon (2009), and methodology-related keywords designed to

capture experimental and quasi-experimental methods adapted from Kim & Quinn (2012).3 We

also searched the websites of Regional Education Labs, What Works Clearinghouse, the World

Bank, Inter-American Development Bank, Empirical Education, Mathematica, MDRC, and AIR

for relevant materials. We also searched the abstracts of the Society for Research on Educational

Effectiveness (SREE) conference. We ceased our materials search in March 2016.

In the third search phase, we downloaded, from the NSF Community for Advancing

Discovery Research in Education (CADRE) and IES websites, a list of all STEM award grantees

from the years 2002 to 2012. We then conducted electronic database and Web searches to find

all studies published from each award. In the case of 29 awards, we could find no publicly

available reports or information that included student outcomes, and we attempted to contact

project PIs via email to obtain study results. Of these 29 grant awards, we were sent or later

located reports from 17. Of these, the reports from 16 studies were excluded according to the

criteria above, and one was included in our analyses.

The search procedures described above netted a total of 8,099 records identified through

database screening, and an additional 1,391 records identified through other sources (see Figure

1 for screening flow chart). After removing duplicates, we were left with 7,926 records.

Screening Procedures

Screening proceeded in two phases. First two raters screened each of the studies' titles

and abstracts to identify potentially relevant studies, passing studies into the second phase when


14

they covered grades preK-12, included student outcomes, included quantitative data, and focused

on math and science-specific content and/or instructional strategies. All studies flagged as

potentially relevant by either rater were then reviewed by one of the authors, who made a final

determination about moving the study forward. A total of 656 studies met the initial relevance

criteria and were advanced to full-text screening.

In the second screening phase, two authors examined the full text of each study and

applied more detailed content and methodological criteria. In order to qualify for inclusion in the

final dataset, we required that studies use a randomized or quasi-experimental research design

with a comparison group. Following Slavin and Lake (2008), we included studies that assembled

comparison groups via prospective, but not post-hoc, matching. Slavin and Lake (2008) have

argued that “Prospective studies, in which experimental and control groups were designated in

advance and outcomes are likely to be reported whatever they turn out to be, are always to be

preferred to post hoc studies, other factors being equal.” According to Slavin and Lake (2008),

post-hoc designs have several characteristic problems. First, when researchers attempt to

examine the efficacy of a program by simply comparing the test scores of schools or classrooms

who completed the program versus those that did not, only “survivors” are included in the

treatment group. This can lead to upwardly biased estimates of the treatment impact, if more

schools or classrooms began the treatment but dropped it, potentially because it was not working.

Second, when researchers construct post-hoc comparisons of treated and matched comparison

schools, they generally have many potential ‘comparison’ schools to choose from; this raises the

concern that researchers may inadvertantly (or even intentionally) choose ‘comparison’ schools

that have made less academic progress over the study period than the treatment schools. Third,

post-hoc matching studies are often commissioned by textbook publishers and curriculum


15

developers because they are low-cost and easy to conduct. For these same reasons, study authors

may easily abandon them if they do not show positive results, and thus the retrievable studies

may be especially skewed toward positive findings (Slavin & Lake, 2008). To classify a study as

prospective matching, we required that the study authors explicitly report that they matched

treatment and comparison groups on pretest data prior to the intervention’s commencement.

Studies that explicitly stated that matching was done retrospectively, or that were silent on this

point, were excluded. We also excluded studies in which treatments had been in place prior to

pretesting. We also required that studies present pretest means, and that pretest differences

between groups be less than 50% of a standard deviation. Despite this relatively liberal threshold

for allowable pretest differences, these data requirements resulted in the inclusion of only nine

quasi-experiments.

Studies had to be published in 1989 or later, be written in English, and have participating

students in grades preK-12. The program had to focus on classroom-level STEM instructional

improvement through professional development, curriculum materials, or both. We excluded

programs with no instructional improvement component, such as after-school peer tutoring or

computerized at-home skills practice, as well as studies that examined the impact of variables,

rather than programs.4 Studies had to include sufficient data to calculate an effect size for student

outcomes. The most common reasons for exclusion were for characteristics of the intervention

(e.g., off-topic; not a classroom-level intervention (N = 561), methodology issues (e.g., no

control group; post-hoc design) (N = 310); and sample issues (e.g., did not have at least two

teachers and fifteen students in each condition) (N = 45) (note that some studies had multiple

reasons for exclusion) (see Figure 1).


16

Ninety-five studies met the review inclusion criteria and advanced to the study coding

phase. In cases where we encountered multiple versions of the same study, we used all available

reports to glean information about the study, and used the most recent version (often the peer-

reviewed publication version) for impact estimates. Of the studies that met the inclusion criteria,

44 percent were from peer-reviewed journal articles, 23 percent were conference papers or

presentations, 18 percent were technical reports, 5 percent were district, state, or federal

government reports, 4 percent were doctoral dissertations, and 5 percent were from other

sources. Many of these studies reported multiple effect sizes due to the inclusion of multiple

outcome measures, multiple versions of the same program with a common control group,

multiple samples, or multiple programs.

The final analysis sample includes 258 effect sizes nested within these 95 studies. This

includes a separate effect size for each treatment contrast, each assessment of math or science

achievement, and each sample of students and teachers reported by the study.5 To account for the

nested nature of our data, our analysis used the robust variance estimation (RVE) approach

(Tanner-Smith & Tipton, 2013), as discussed below.

Study Coding

To develop content codes, we began by reviewing prior meta-analyses and reviews of

instructional improvement interventions, naming both broad categories (professional

development; curriculum; technology; subject matter) and specific codes (teachers studied

curriculum materials; teachers solved math problems) that commonly appeared in the literature.

We then used this skeleton structure to code a sample of studies, modifying existing codes to be

more clear and adding codes as needed to cover additional program features. We also added a set

of codes that captured study size, design, and quality. Finally, our technical advisory board,


17

comprised of university faculty and other experts in teacher professional development and

curriculum materials, reviewed our codes, making suggestions for amending and adding as

necessary.6

For programs involving professional development, four primary coding categories

emerged. A first pair of codes captured professional development duration, indexed as the

number of contact hours teachers spent in professional development experiences (both for

professional development and curriculum materials programs) and as the timespan over which

those hours were spread. A second set of codes recorded the main focus (or foci) of the

professional development, including emphases on improving teacher content knowledge,

exploring how students learn, integrating technology into the classroom, and learning how to use

curriculum materials. Third, we coded for specific activities that teachers engaged in during the

professional development, such as observing a demonstration of instruction or working through

student curriculum materials. A final set of codes captured the format of the professional

development, for instance whether it was delivered during a summer workshop, contained

coaching, or involved online learning. Within each of the three substantive coding categories, we

also created an indicator of the number of codes applied, hypothesizing that professional

development programs with multiple foci, activities or formats (e.g., a focus on both how

students learn and how to use curriculum materials) might be more effective compared to more

narrowly-focused programs.

For programs that involved new curriculum materials, we coded for whether the

curriculum provided implementation guidance (e.g., text describing possible student-teacher

dialogues around content) or supplied teachers with kits for implementation. We also recorded


18

the proportion of a lesson the curriculum was intended to replace (i.e., one to 100 percent), and

the total number of minutes the curriculum was intended for use in a school year.

We also designed codes to capture the rigor of the design as well as aspects of its

implementation. Specifically, to record potential selection bias of participants into groups, we

coded whether treatment and control groups were formed via random assignment or by

prospectively matching individuals/classrooms/schools on baseline data. To explore potential

impacts of general and differential attrition bias, we included two measures of potential effect

size-level attrition problems: (1) author-reported attrition of more than 20% at the student or

cluster level, and (2) differential attrition between treatment and control groups of more than

10%. Finally, we coded whether authors simply reported a given outcome as “non-significant” in

order to capture the potential for bias due to selective reporting.

We note we were unable to capture other potential sources of bias such as performance

and detection bias, or biases stemming from participants’ and researchers’ knowledge of whether

participants were in the treatment or control group. Given that it is evident to participants and

researchers alike whether teachers are engaged in new professional development and curriculum

programs, in most cases it is not possible to blind participants and researchers to condition.

We also coded for a variety of other study descriptors, such as publication type (peer-

reviewed vs. not), and whether the study took place in the U.S. or abroad. We examine whether

any of these study descriptors relate to effect sizes. Based on evidence that effect sizes vary by

type of test (Hill et al., 2008; Taylor et al., 2018), we also coded for the nature of the student

assessment, including state or district standardized tests (e.g., the Texas Essential Knowledge

and Skills assessment), standardized tests available through commercial vendors (e.g.,

Woodcock-Johnson), and researcher-designed assessments. We expected the last to be more


19

sensitive to the content of instructional improvement programs, and thus potentially provide

larger effect sizes.

Coding of full-text studies was conducted by the study authors, along with two trained

research assistants. Before beginning operational coding, we went through a process of

establishing inter-rater reliability. Each week, each member of the team coded two studies, then

met to reconcile disagreements and refine the code descriptions. This process continued until

coders reached 80% agreement (four weeks for low-inference codes such as bibliographic

information, five weeks for higher-inference codes surrounding program characteristics). Once

we achieved a stable set of codes and the 80% threshold, each study was coded by two

researchers, including at least one study author, working as a pair. Each researcher coded studies

independently, and met to reconcile discrepant records. All disagreements were resolved through

discussion.

Effect Size Calculation

We calculated standardized mean difference effect sizes using Hedges’ g:

𝑔 = 𝐽 ∗ (𝑌𝐸 − 𝑌𝐶 )/S∗

In this formula, 𝑌𝐸 , represents the average treatment group outcome, 𝑌𝐶 , represents the average

control group outcome, and S∗ represents the pooled within-group standard deviation. J is a

correction factor that adjusts the standardized mean difference to avoid bias in small samples:

𝐽 = 1 −3

4 ∗ (𝑁𝐸 + 𝑁𝐶 − 2) − 1)

In this formula, 𝑁𝐸 represents the number of students in the treatment group and 𝑁𝐶 represents

the number of students in the control group. Effect sizes were calculated using the software

package Comprehensive Meta Analysis for the majority of cases. We used the following decision

rules to calculate effect sizes: If the authors reported a standardized mean difference effect size,


20

such as Cohen’s d or Glass’s delta, we converted author-reported effect sizes to Hedges’s g (52

percent of effect sizes).7 If authors did not report a standardized mean difference effect size but

did report a covariate-adjusted unstandardized mean difference (e.g., a coefficient from a

multilevel model) and raw standard deviations, we calculated a standard mean difference effect

size and converted to Hedges’s g (15 percent). If covariate-adjusted mean differences were not

reported, we calculated effect sizes based on raw posttest means and standard deviations (22

percent). If neither standardized mean differences nor raw means and standard deviations were

reported, effect sizes were calculated based on the coefficients and standard errors of multilevel

or regression models (4 percent).8 In the remaining cases, effect sizes were calculated from other

results (e.g., studies that reported the results of ANOVAs; 7 percent).

We applied corrections to study-reported standard errors when necessary to account for

the clustering of students within classrooms or schools (Littell, Corcoran, & Pillai, 2008).9 For

the five studies that did not report the number of clusters, we followed Scher and O’Reilly’s

(2009) procedure for imputing that number based on available information, typically estimating

the total number of classrooms from the total number of students. When intraclass correlations

could not be inferred from the study report, we also followed Scher and O’Reilly (2009) in

adopting the WWC recommended default intraclass correlations of 0.20.

We had 29 outcomes where the study did not provide enough information to calculate an

effect size.10 All came from studies that did report an effect size for at least one additional

achievement outcome, and so no studies were dropped from the analysis due to missing

outcomes. Furthermore, many of these missing outcomes were subscales, meaning they were a

smaller selection of items already represented in the reported effects. Therefore, we ignore these

missing outcomes in the main analyses. However, we also examine the sensitivity of our results


21

to the inclusion of these outcomes by imputing a range of values for these missing effect sizes

and re-estimating our primary models.11

Model Selection

A common problem in meta-analyses occurs when single studies yield multiple effect

sizes, as authors often measure more than one outcome. These nested effect sizes are likely to be

correlated, violating the assumption of statistical independence. To address this issue, prior meta-

analyses in this area have averaged effect sizes (Kennedy, 2016; Scher & O’Reilly, 2009; Slavin,

Lake, Hanley & Thurston, 2014; Yoon et al., 2007). Here, we argue that a robust variance

estimation (RVE) approach developed by Tanner-Smith and Tipton (2014) more properly

models our data. This approach adjusts standard errors to account for the dependencies between

effect sizes, and is analogous to methods for adjusting standard errors in OLS regression models

for heteroskedasticity (e.g., using Huber-White standard errors) or to account for the nesting of

data within clusters (e.g., clustered standard errors). Importantly, this approach allows for the

inclusion of multiple effect sizes from the same study within the meta analysis, and therefore

avoids the loss of information that would occur by either dropping effect sizes or including a

single average effect size for each study (for a more detailed discussion, see Tanner-Smith &

Tipton, 2014). RVE models can also control for methodological features known to affect

outcomes (e.g., Hill et al. [2008]; Taylor et al., 2018). This approach has been used in multiple

recent meta-analyses where individual studies report multiple effect sizes (e.g., Clark, Tanner-

Smith, & Killingsworth, 2016; Dietrichson, Bøg, Filges, & Jørgensen, 2017; Gardella, Fisher, &

Teurbe-Tolon, 2017).

The RVE approach is designed to account for two types of dependencies among effect

sizes. The first type of dependency is correlated effects, which arises when a single study


22

provides multiple effect sizes estimates for the same underlying construct or of correlated

underlying measures, or when the same control group is used for multiple treatment contrasts.

The second type of dependency is hierarchical effects, which arises when multiple treatment-

control contrasts are nested within a larger cluster of experiments (e.g., a research group

conducts multiple evaluations of the same program). When both of these dependencies are

present, Tanner-Smith and Tipton (2014) recommend selecting a method based on the more

frequent type of dependence. Because correlated effects predominated in our data, we used the

recommended inverse variance weights recommended by Tanner-Smith and Tipton (2014).

The weight for effect size i in study j is calculated by the following:

𝑤𝑖𝑗 = 1

{(𝑣∗𝑗 + 𝜏2)[1 + (𝑘𝑗 − 1)𝜌]}

Where 𝑣∗𝑗 is the mean of within-study sampling variances (𝑆𝐸𝑖𝑗2 ), 𝜏2 is the estimate of the

between-studies variance component, 𝑘𝑗 is the number of effect sizes within each study, and 𝜌 is

the assumed correlation between all pairs of effect sizes within each study. Therefore, effect

sizes from studies with more effect sizes and higher sampling variances are given lower weight.

It is assumed that 𝜌 is constant across studies, and we use the recommended default value of 𝜌 =

0.80 (Tanner-Smith & Tipton, 2014). However, simulation studies suggest that results using the

RVE approach are not sensitive to values of 𝜌 (e.g., Tanner-Smith & Tipton, 2014; Tanner-

Smith et al., 2013; Wilson et al., 2011). We also conduct a series of sensitivity checks to

determine whether our results are sensitive to alternative values of 𝜌. We use the robumeta

package in Stata 15 (developed by Tanner-Smith & Tipton, 2013) to estimate our RVE models,

and incorporate the small-sample correction proposed by the developers of the RVE approach

(Tanner-Smith & Tipton, 2014; Tipton & Pustejovsky, 2015). We also report the results of F-

tests to test the joint significance of the program features included in our RVE models. These F-


23

tests was conducted using the robumeta and clubSandwich packages in R (Fisher & Tipton,

2014; Tipton & Pustejovsky, 2015).

As noted above, we grouped potential moderators of program impact into five categories

indicating the time/duration, focus, activities, and format of professional development, and the

characteristics of new curriculum materials. To examine whether specific features in each

category moderated program impact, we fit four sets of conditional meta-regression models with

robust variance estimation (RVE), including the coded features as moderators and treating these

moderators as fixed. Within each category, we first modeled the effect of each code separately.

To further understand their joint relationships, we then fit a model entering all codes together.

All models controlled for whether the study was a randomized controlled trial, whether the effect

size estimate controlled for covariates, the type of student assessment used, whether the program

focused on mathematics (rather than science), and an indicator variable for whether the study

was conducted at the preschool level.

In some cases, there is within-study variability in program features (e.g., among studies

with multiple treatment arms). Following the recommendation of Tanner-Smith & Tipton (2014),

we include the study-level mean value of each covariate and moderator. For the two covariates

where there is within-study variability in at least 10 percent of studies (state standardized test,

other standardized test), we also include a within-study version of the covariate that is calculated

by subtracting the study-level mean values from the original covariate values. The full dataset is

available from the authors on request.

Finally, we note that the RVE method addresses heterogeneity of effect sizes differently

compared to traditional meta-analyses. As described by the RVE developers, the primary aim of

this method is to estimate fixed effects, such as meta-regression coefficients, rather than to model


24

variation in effect sizes. As a result, tests for heterogeneity used in traditional meta-analysis are

not available with the RVE approach (Tanner-Smith & Tipton, 2014; Tanner-Smith, Tipton &

Polanin, 2016). However, we do report the method of moments estimate of 𝜏2 for each of our

primary models as measures of between-study heterogeneity in effect sizes.

Results

Descriptives and Overall Average Impacts

Table 1 provides descriptive statistics regarding the study designs and programs included

in our dataset. Of the included studies, 22% focused on professional development alone (without

the introduction of new curriculum materials), 9% focused on curriculum materials alone

(without the provision of professional development) and 75% included both new curriculum

materials and professional development. Most included studies had randomized designs; only

nine quasi-experimental studies met the criteria for inclusion. Roughly two-thirds of studies

focused on mathematics rather than science. Table 1 also shows study-level frequencies for the

format, timing and duration, foci, and activities of programs with any professional development

component, and for characteristics of curriculum materials programs.12 Because we had a

relatively small sample size, when two variables correlated at the 0.40 level or above, we

combined them into a single predictor indicating whether the program had either feature. This

occurred in two cases (a focus on content knowledge/pedagogical content knowledge and

knowledge of how students learn; and, under activities, solving problems and working through

student materials).

Across all included studies, we found an average weighted impact estimate of +0.21

standard deviations (see Table 2). To contextualize the magnitude of this effect, a typical

treatment group student would be expected to rank about 8 percentile points higher than a typical


25

control group student (Lipsey et al., 2012). We found a method-of-moments estimate of the

between-study variance in study average effect sizes of 0.037, with a between-study standard

deviation of 0.191, indicating that most studies had small to moderate positive impacts. Of the

258 effect sizes included in the 95 identified studies, 214 effect sizes (83 percent) were positive

in sign and 88 effect sizes (34 percent) were statistically significant. Only 40 effect sizes were

negative in sign (16 percent) and only 4 were negative and statistically significant (2 percent).

Four effect sizes had point estimates of zero (2 percent). The unweighted average effect size for

math outcomes was +0.27 SD and the average effect size for science outcomes was +0.18 SD.

However, the results of conditional meta-regression model estimated using RVE indicates no

statistically significant difference in mean effect sizes based on whether programs focused on

math or science (see Table 3). We therefore include both math and science outcomes in all

analyses.

As shown in Table 3, the type of assessment employed was a strong predictor of effect

size magnitude, with impacts for both standardized (e.g., Woodcock-Johnson) and state

standardized tests lower by about 0.27 SD (p<0.01) relative to impacts for researcher-designed

assessments. Table 3 also shows that there were not statistically significant differences in effect

sizes based on study design (i.e., whether the study was an RCT), whether study authors adjusted

for covariates such as student demographic indicators, whether the study sample included

preschool students, and subject matter.

Table 4 shows that the average effect size for programs that incorporated both

professional development and new curriculum materials was approximately 0.10 SD (p<0.05)

larger than the average ES for programs that included only professional development or only

new curriculum materials.


26

Features that moderate program impacts

Next, we turn to features that may moderate impacts on student outcomes, beginning with

the professional development models. We did not find a significant relationship between

professional development contact hours and student outcomes (see Table 5). We first examined

whether there was a linear association between contact hours and effect size (column 1). Next, to

look for non-linearities in the contact hours data, the model presented in column 2 compares

programs with few hours (i.e., below the 25th percentile; omitted) and those with a larger number

of contact hours (i.e., 25th to 50th percentile; 50th to 75th percentile; and 75th percentile and

above). We did not find evidence of a significant association in either model. To investigate a

threshold effect, the model presented in column 3 examines programs with 16 hours or more and

found no relationship; this finding held at various threshold levels (not shown). In a separate

analysis, we did not find a significant relationship between the timespan (e.g., one month, one

year) over which professional development activities were conducted and effect sizes (models

not shown).

We next examined the associations between the focus of the professional development

and effect sizes via a series of multilevel regression models (Table 6). Average effect sizes were

larger when programs focused on how to use curriculum materials (+0.12 SD, p<0.10) and when

they focused on improving teachers' content and pedagogical content knowledge and/or how

students learned the content (+0.09 SD, p<0.05). Both of these associations remained significant

and similar in size when all predictors were included in the model. Average effect sizes were

also larger when the program focused on formative assessment (+0.13 SD, p<0.10), although this

did not retain its size or significance in the final model. A focus on content-generic instructional

strategies was not a significant predictor of the effect size magnitude. Finally, we also find that


27

programs that included more of the foci listed in Table 6 had larger effect sizes, on average,

compared to programs focused on a more narrow set of topics (+0.08 SD, p<0.01).

Next, we turned to professional development activities (Table 7). No activities for which

we coded – observing demonstrations, reviewing generic student work (problems or

investigations completed by students outside the teachers’ classes), solving math or science

problems/working through student materials, developing curriculum or lesson plans, and

reviewing teachers’ own students’ work – were significant predictors of effect size magnitude.

However, we find that the number of PD activities incorporated in the program, defined as the

total number of the five activities listed in Table 7, significantly predicted effect size magnitude

(+0.05 SD, p<0.10).

Table 8 examines the relationship between effect sizes and professional development

formats. On average, PD programs that had teachers participate alongside other teachers in their

school, which we refer to as same-school collaboration, yielded outcomes 0.12 SD larger

(p<0.10) than programs without such collaboration. This result is significant in the final model

only. In addition, programs that included PD with implementation meetings yielded significantly

larger average effect sizes on average than without these meetings (+0.12 SD in the final model,

p<0.05). These implementation meetings typically allowed teachers to convene briefly with other

activity participants to troubleshoot and discuss obstacles and aids to putting the program into

practice. Meanwhile, programs that included professional development that incorporated an

online component yielded significantly smaller impacts on average relative to programs that did

not involve any online components (-0.15 SD in the final model, p<0.05). Additionally, in the

final model, programs that included a summer workshop had larger effect sizes, on average, as

compared with those that lacked this component (+0.07 SD, p<0.10). The remaining formats


28

examined – whether the program featured coaching from experts or professional development

led by researchers – were not significant predictors of effect size magnitude. The number of PD

features was also not a significant predictor of effect size magnitude.

Table 9 shows the associations between features of new curriculum materials and effect

sizes. None of the features examined were significantly associated with average effect sizes.

It is important to note that the above moderator analyses provide estimates of the

associations between the presence of specific program features and average effect sizes;

however, programs without those specific features that positively predict impacts may still

positively impact student outcomes on average. Thus in Table 10 we present the results of these

moderator tests summarized in terms of regression-adjusted mean effect sizes. First, we present

mean effect sizes based on subgroup analyses without controls for additional program features.

These mean effect sizes are based on unconditional meta-regression models estimated using

RVE to account for the nesting of effect sizes within studies. Next, we present mean effect sizes

based on conditional meta-regression models corresponding to our main moderation analyses

with each predictor included separately and controlling for the program features listed in Table 3.

Finally, we present mean effect sizes corresponding to our final moderation analyses with all

predictors within each category entered simultaneously. For brevity, we focus on only those

features that were statistically significant predictors of program impact in our main models.

The mean effect sizes presented in Table 10 indicate that the overall impacts of programs

with and without the moderators of interest were generally positive. Estimated mean effect sizes

for STEM professional development and curriculum improvement programs are generally

positive even among programs that lacked the intervention features we identified as associated

with larger effect sizes. For example, average effect sizes were positive among programs


29

including either professional development or new curriculum materials but not both components

(𝑔𝑐 = 0.156, 𝑔𝑢𝑐 = 0.136, puc < 0.01), professional development programs that had an online

component (𝑔𝑐+ = 0.096, 𝑔𝑐 = 0.103, 𝑔𝑢𝑐 = 0.116, puc < 0.05), and professional development that

did not contain a focus on improving teachers’ content knowledge, pedagogical content

knowledge or knowledge of how students learn (𝑔𝑐+ = 0.179, 𝑔𝑐 = 0.175, 𝑔𝑢𝑐 = 0.095, puc <

0.01). Although impacts were largest on intervenor-developed assessments, estimated mean

effects sizes are still positive for impacts on state standardized assessments (𝑔𝑐 = 0.0101, 𝑔𝑢𝑐 =

0.060, puc < 0.01) and other standardized assessments (𝑔𝑐 = 0.087, 𝑔𝑢𝑐 = 0.084, puc < 0.01).

Furthermore, the differences in mean effect sizes within each category based on estimating

unconditional models, with the exception of same-school collaboration, are generally comparable

in direction and magnitude to those based on conditional models.

Study Design Moderators

Next, we examined whether a wider range of study design characteristics and

implementation contexts were associated with effect size magnitudes (Table 11). Studies that

included credit incentives for participating teachers showed smaller impacts on average (-0.17

SD in the final model, p<0.05). We found that the unit of assignment to the program (schools vs.

teachers) did not predict effect sizes. We did not find significant relationships between effect size

magnitudes and whether teacher participation was voluntary or mandatory, or whether teachers

received monetary incentives; however, this information was unreported in many studies.

Finally, we found that the level of attrition did not predict the size of impacts.

To further examine the potential role of attrition bias, we additionally examined the

adjusted mean effect sizes for studies that do and do not meet our attrition standards. We see


30

little evidence that studies with attrition problems reported larger impacts (see Appendix Table

A2).

In separate analyses, no other study design features were significantly related to

outcomes, including whether the study involved one district, multiple districts and/or states;

whether the study was conducted in the U.S. or abroad; whether the study was conducted in an

urban vs. non-urban setting; and whether the study sample was majority low-income. We also

found that study size (defined as the average number of treatment clusters across effect sizes

within a study) was not a significant predictor of study outcomes (results not shown).

Publication Bias

Finally, we considered whether effect sizes from peer-reviewed sources differed in

magnitude, on average, relative to effect sizes from other sources. In Table 12, we present results

from a meta-regression model estimated using RVE that includes whether the effect size was

from a peer-reviewed publication as a moderator. We found that effect sizes from peer-reviewed

studies were larger by 0.05 (uncontrolled, Model 1) or 0.07 SD (controlled, Model 2) than those

from other studies when using the RVE approach and after controlling for study design, study

sample, subject area, and outcome measure type. However, these differences were not

statistically significant.

We also conducted two additional tests for publication bias. First, we assessed

publication bias using Egger's regression test (Egger et al., 1997). Given the multiple effect sizes

within each study, we conducted this test at the study level by regressing the study average

standard normal deviation (the average effect size divided by the average standard error) on the

inverse of the study average effect size standard error. This approach tests the null hypothesis

that the intercept is zero; if the null hypothesis is rejected this indicates that smaller (or less


31

precise) studies have systematically larger or smaller reported effect sizes relative to larger (or

more precise) studies which could be due to publication bias. We then use the “trim and fill”

method to examine the magnitude of potential publication bias (Duval & Tweedie, 2000).

Although the models estimated are not precisely analogous to our preferred estimation approach,

results indicate potential publication bias in the full sample as demonstrated by the fact that

studies with larger effect size standard errors (i.e., smaller and less precise studies) have larger

effects (p<0.001). A comparison of the adjusted and unadjusted estimated average effect sizes

from the trim-and-fill method indicates that the magnitude of potential publication bias is

substantial. Second, to account for the nested structure of the data, we also used a modification

of the Egger’s regression test by adding the standard errors of the effect sizes as a moderator to

the unconditional RVE meta-regression model. If effect size standard errors are significant

predictors of effect sizes, this could similarly indicate the presence of publication bias. As above,

results indicate potential publication bias in the full sample (p<0.001).

However, results of conducting both methods separately for peer reviewed and non-peer

reviewed studies detect potential publication bias only among peer reviewed studies. Results of

both Egger’s regression test and the RVE approach indicate that smaller (or less precise) studies

report larger impacts (p<0.001). However, there is less evidence of systematic differences in

reported effects sizes based on study size or precision among studies from other sources

(p=0.116; p=0.050). This suggests that the association between study size or precision and

reported impacts among peer-reviewed studies may be a result of publication bias rather than

other factors (e.g., more effective interventions are more costly and therefore evaluated with

small samples). This highlights the importance of the inclusion of the “grey literature” in


32

systematic reviews and meta-analyses of instructional improvement efforts (Polanin, Tanner-

Smith, & Hennessy, 2016). For full results of these tests, see Appendix Table A3 and Figure A1.

Sensitivity Checks

We conducted additional analyses to examine the robustness of our results. First, we

address the fact that in 20 cases, authors reported some impacts as ‘not statistically significant’

and did not provide enough information to calculate an effect size. A concern is that failing to

provide this information may be correlated with program features. In addition, that effect sizes

that are not statistically significant but negative in sign may be less frequently reported that

effect sizes that are positive in sign. We therefore tested the robustness of our results to the

inclusion of these 20 missing effect sizes by imputing a range of values for missing effect sizes

(g = 0.00, g = -0.10, and g = -0.20) and using the study-level mean of the effect size standard

error to calculate their weights. If our results are driven by differential reporting of information

regarding effect sizes that are not statistically significant based on whether point estimates are

positive or negative in sign, including these impacts should attenuate our results. In general,

including these effect sizes did not substantively change our results. In the majority of cases

where our primary results showed a significant association between program features and

outcomes, results are comparable in magnitude and remain statistically significant. The only

exceptions are the number of professional development activities and use of same-school

collaboration, which are no longer significant predictors of program impact, although the

associations are comparable in magnitude to our main estimates.

We also found that our results were largely robust to excluding studies in settings outside

the U.S., and to excluding studies with weaker designs (i.e., studies that were not RCTs).

Associations are generally comparable in magnitude to our main results, although in some cases


33

less precisely estimated. The only exceptions are that a focus on how to use new curriculum

materials is not a significant predictor of impacts after excluding non-RCT studies, and a focus

on how to use new curriculum materials and the number of professional development activities

are not significant predictors of impacts after excluding studies outside the U.S. These

associations are comparable in magnitude to our main estimates. We also examined the

sensitivity of our results to choosing different values of the within-study correlation between

effect sizes. In our primary models, we specify the correlation to be 0.80, which is the value

recommended by Tanner-Smith and Tipton (2014). Using alternative specifications of (𝜌 = 0.50,

0.70, and 0.90) did not change our results. Full results of these sensitivity checks are available

from the authors on request.

Discussion and Conclusions

To summarize, we found that studies of STEM instructional improvement programs had

on average positive effects on student achievement, with a mean pooled effect size across studies

of 0.21 SD. Compared to those found in prior reviews, these pooled effect sizes are in the

middle range, smaller than those identified in the Yoon et al. review (0.57 SD in math and 0.51

SD in science) and the Taylor et al. review (0.489 SD in science), and somewhat larger than

those identified in the Scher & O'Reilly review (0.14 SD in math and 0.13 SD in science).

Although the effect sizes differ, our results confirm earlier reviewers' findings that studies of

STEM instructional improvement efforts tend to show positive results.

We conducted a series of analyses to examine the relationships between program

characteristics and the size of achievement impacts. The characteristics that were significantly

associated with improved student learning across the current set of studies included the

following:


34

• The use of professional development along with new curriculum materials;

• A focus on improving teachers' content/pedagogical content knowledge, or

understanding of how students learn;

• Specific formats, including:

• meetings to troubleshoot and discuss classroom implementation of the

program;

• the provision of summer workshops to begin the professional development

learning process;

• same-school collaboration.

We also found that on average, programs that provided any component of the PD online had

poorer student outcomes than programs that did not use an online PD component. In general,

there was not a statistically significant difference in the magnitudes of these associations

depending on whether programs focused on mathematics or science.

Components Associated with STEM Program Effectiveness

Taken together, we generally find that providing teachers with opportunities to learn

about the materials they will use with students and/or to participate in programs that seek to

improve their content or pedagogical content knowledge is associated with improved student

outcomes. These findings accord with prior cross-sectional research. For example, Boyd,

Grossman, Lankford, Loeb, and Wyckoff (2009) found that teacher preparation programs that

focused closely on teachers' classroom practice, including reviewing the district curriculum,

produced teachers with better student outcomes in their first year of teaching. Authors (1998)

also found that teachers' participation in curriculum-centered professional development was

related to student achievement. These findings also lend support to the conclusions drawn in


35

prior reviews conducted by Scher and O'Relly (2009) and Kennedy. Scher and O'Reilly (2009)

found that inverventions that focused on both content and pedagogy posted stronger student

outcomes as compared with interventions that focused on pedagogy alone, while Kennedy

(1999) concluded that programs were more effective when they focused on how to teach specific

content and on how students learn the same content.

Most studies of curriculum materials in our dataset included at least some component of

professional development, and vice versa. However, examining studies that included both

elements jointly leads us to see that on average, programs that incorporated both professional

development and new curriculum materials had larger impacts as compared with programs that

included only one of these components. These findings lend support to the argument advanced

by some scholars that curriculum materials alone, even those designed to be educative and

supportive of teacher implementation, may be insufficient to change teaching practice, given the

complexity of classroom instructional interactions, and the often-ingrained nature of traditional

inquiry-response-evaluation teaching practices (Alozie, Moje, & Krajcik, 2010). Meanwhile,

perhaps when professional development is provided without reference to specific curriculum,

teachers may struggle to implement what they have learned while using existing curricular

materials and textbooks.

We also found a positive association between student outcomes and teachers'

participation in implementation meetings, which were defined as meetings in which teachers met

formally or informally with other activity participants to discuss enacting intended practices. We

also found a significant relationship, in our final model, between same-school collaboration

(teachers participating in PD alongside their colleagues) and student outcomes. These findings

align with prior work (e.g., Darling-Hammond & McLaughlin, 1995) emphasizing the


36

importance of providing teachers with opportunities to discuss instructional innovations with

colleagues (e.g., Penuel, Sun, Frank & Gallagher, 2012), and discuss and troubleshoot issues that

arise when implementing new instructional approaches.

Perhaps more surprisingly, the inclusion of a summer workshop was positively related to

student outcomes. Prior syntheses (Scher & O'Reilly, 2009; Yoon et al., 2009) did not examine

this variable specifically. The summer workshop format for professional development has been

critiqued in the past for its typically 'one-shot' nature, but it may be the case that an intensive

summer professional development provides an effective 'springboard' for school-year

implementation. On the other hand, programs in which participants completed a portion of the

professional development online had weaker student outcomes, on average, as compared with

programs that did not include any online PD. Dede, Ketelhut, Whitehouse, Breit, and McCloskey

(2008) point out that although online teacher professional development is burgeoning, relatively

little rigorous research has examined its effectiveness. Although these analyses are correlational

in nature, the positive results associated with both the summer professional development and the

teacher implementation meetings are consistent with the notion that teachers may have benefitted

from both intensive and ongoing opportunities to interact with one another around program

content, which may have been less salient in the online format.

In contrast to two earlier reviews (Scher & O’Reilly, 2009; Yoon et al., 2007), we find no

evidence of a positive association between the duration of professional development and

program impacts. Yoon compared the effectiveness of interventions that included greater than

versus less than 14 hours of professional development, finding that the former had larger impacts

on student achievement. Scher and O'Reilly (2009) compared the effectiveness of interventions

conducted over two or more years versus those conducted over one year, and found that those


37

conducted over two or more years were more effective on average among math-focused, but not

science-focused, interventions. However, both of these reviews note that the small number of

included studies limited their ability to draw firm conclusions. The current findings, using a

continuous measure of contact hours and a separate measure of timespan, suggest that programs

that were limited in duration nonetheless generally had positive impacts on average. For

example, several programs that combined new curriculum materials with a short amount of

professional development documented moderate to large impacts on student achievement (e.g.,

Arnold et al., 2002; Clements & Sarama, 2007; Presser, Vahey, & Dominguez, 2015). In

contrast, some studies of highly-intensive professional development programs showed little or no

impacts on student learning (e.g., Authors, 2016; Devlin-Sherer et al., 1998; Van Egeren et al.,

2014). Our findings echo those of Kennedy (1999, 2016), who did not find a clear benefit of

contact hours or program duration, and concluded that the core condition for program

effectiveness was valuable content; more hours of a given intervention will not help if the

intervention content is not useful.

Similar to Taylor et al., (2018), our analysis did not detect any relationship between

student outcomes and study design, science subject matter, and grade level. However, we found

that average impacts varied considerably depending on the type of student test used. On average,

student outcomes were larger on researcher-designed assessments as compared with standardized

tests. While standardized tests may have benefits such as face validity and broad content

representation, it may be that standardized tests are not especially sensitive to instructional

improvement efforts, due to differences in the skills measured by the tests versus those targeted

in the intervention (see also Hill et al., 2008; Taylor et al., 2018; Sussman & Wilson, 2018). If

this is the case, in order to understand how instructional improvement interventions influence


38

student learning, researchers may need to include both assessments of outcomes that are closely

tied to the student learning goals along with broader standardized tests.

Limitations and Future Directions

We note several limitations to the current review. One limitation is missing data. First,

although we attempted to search both the published and unpublished literatures, studies may

have eluded our grasp. Second, many programs and interventions that are routinely conducted in

schools are never evaluated, and these programs may differ in unknown ways from those that are

formally evaluated. Most of the programs studied are boutique programs, often designed,

operated and evaluated by university or contract researchers. We know little about the efficacy of

professional development that reaches typical teachers. The characteristics we identified as

potentially effective here may not carry over to typical conditions and with the resources

conventionally available in school districts.

Missing data in study reports posed another limitation. We initially hoped to code the

included studies for a number of additional features that have been hypothesized in the literature

to influence the effectiveness of instructional improvement programs, including district and

school leadership support, competing instructional improvement initiatives, teacher recruitment

methods, and the financial resources provided to support the intervention (Wilson, 2013).

However, we found that few study reports contained sufficient detail on these matters, making

tracking their impact impossible. This omission is striking; in nearly all published and

unpublished reports, the district context is simply a black box, or merely a site for teacher

recruitment and service delivery. The inclusion of more contextual information in study reports

is pressing, especially because many district administrators evaluate proposed programs based on

the extent to which studies were carried out in “districts like ours.” Gathering contextual


39

information would also provide insight into the conditions necessary for instructional

improvement programs to thrive.

In addition, it is important to note that, as is generally the case in meta-analyses, the

moderator analyses we have conducted are correlational. The authors of the included studies

generally did not randomly manipulate the variables that we identified as associated with

improved student outcomes in the moderator analyses, such as by randomly assigning teachers to

participate versus not participate in a summer workshop. As a result, moderator analyses are

potentially confounded by unobserved or inadequately measured study and student

characteristics. The characteristics of professional development and curriculum programs that

appear promising in the current study's moderator analyses thus are not definitive, but point

toward potentially productive areas for future experimental work. Future randomized

experiments comparing instructional programs that do versus do not have these components are

warranted, to estimate causally the effects of robust STEM professional development and

curriculum interventions on student learning. Additionally, our use of binary indicators rather

than raw attrition rates constitutes a limitation, as continuous attrition measures would have

allowed us to estimate the impacts of attrition more precisely.

Despite these limitations, however, we were able to distill findings from dozens of recent

experimental and strong quasi-experimental evaluations of instructional improvement programs

in STEM. Based on the extant literature, the types of practices that were associated in our review

with improved student learning, such as providing teachers with opportunities to engage with the

curriculum they teach, develop their content knowledge, pedagogical content knowledge and

understanding of how students learn, and discuss classroom implementation, likely occur

infrequently in the instructional improvement programs typically offered in US schools (e.g.,


40

Author, 2009). Future experimental studies that build on the current findings are warranted, to

advance researchers' and policymakers' understanding of core instructional reform practices that

improve student learning in STEM.

Endnotes

1 For example, in the area of middle and secondary math curriculum materials, Slavin and

colleagues searched the published and grey literature dating back to 1970, conducting "[a] broad

literature search ... in an attempt to locate every study that could possibly meet the inclusion

requirements. This included obtaining all of the middle school studies cited by the What Works

Clearinghouse (2008b) and the middle and high school studies cited by NRC (2004), by Clewell

et al. (2004), and by other reviews of mathematics programs, including technology programs that

teach math (e.g., Chambers, 2003; Kulik, 2003; Murphy et al., 2002). Electronic searches were

made of educational databases (JSTOR, Education Resources Information Center, EBSCO,

PsycINFO, Dissertation Abstracts), Web-based repositories (Google, Yahoo, Google Scholar),

and education publishers’ Web sites. Citations of studies appearing in the first wave of studies

were also followed up. A particular effort was made to find non-U.S. studies." They performed a

similar search for elementary math curriculum materials (Slavin & Lake, 2008), elementary

science curriculum materials (Slavin et al., 2014), and secondary science materials (Cheung,

Slavin, Kim, & Lake, 2016). We refer the reader to those articles for specifics. In the area of

professional development, Yoon et al. (2007) searched for studies dating from 1986 and explain

that "Studies were gathered through an extensive electronic search of published and unpublished

research literature. The review protocol included a list of keywords that guided the literature

search. Seven electronic databases were core data sources: ERIC, PsycINFO, ProQuest,


41

EBSCO’s Professional Development Collection, Dissertation Abstracts, Sociological Collection,

and Campbell Collaboration. These databases were searched separately for each of the three

subjects under review (mathematics, science, and reading and English/language arts). In

consultation with a reference librarian, search parameters were developed using database-

specific keywords ... A deliberately wide net captured literature on professional development and

student achievement, broadly defined ... Fourteen key researchers were also asked to identify

research for the study. Eight researchers responded, recommending additional studies that fit the

study purpose. Finally, existing literature reviews and research syntheses were consulted to

ensure that no key studies were omitted."

2As another point of comparison, the What Works Clearinghouse protocols have generally

limited their scope to studies from the past twenty years, with some categories such as

conference proceedings limited to the past seven years (Brown, Card, Dickersin, Greenhouse,

Kling, & Littell, 2008).

3The specific search strings applied to searches of the titles and abstracts were as follows:

(“professional development” OR “faculty development” OR “Staff development” OR “teacher

improvement” OR “inservice teacher education” OR “peer coaching” OR “teachers’ institute*”

OR “teacher mentoring” OR “Beginning teacher induction” OR “teachers’ Seminar*” OR

“teachers’ workshop*” OR “teacher workshop*” OR “teacher center*” OR “teacher mentoring”

OR curriculum OR instruction*) AND ( “Student achievement” OR “academic achievement”

OR “mathematics achievement” OR “math achievement” OR “science achievement”

OR “Student development” OR “individual development” OR “student learning” OR

“intellectual development” OR “cognitive development” OR “cognitive learning” OR “Student

Outcomes” OR “Outcomes of education” OR “educational assessment” OR “educational


42

measurement” OR “educational tests and measurements” OR “educational indicators” OR

“educational accountability”) AND ("*experiment*" OR "control*" OR "regression

discontinuity” OR “compared” OR “comparison” OR “field trial*” OR “effect size*” OR

“evaluation”) AND (“Math*” OR “*Algebra*” OR “Number concepts” OR “Arithmetic” OR

“Computation” OR “Data analysis” OR “Data processing” OR “Functions" OR “Calculus” OR

“Geometry” OR “Graphing” OR “graphical displays” OR “graphic methods” OR “Science*”

OR “Data Interpretation” OR “Laboratory Experiments” OR “Laboratory Procedures” OR

“Experiment*” OR “Inquiry” OR “Questioning” OR “investigation*” OR “evaluation methods”

OR “laboratories” OR “biology” OR “observation” OR “physics” OR “chemistry” OR

“scientific literacy” OR “scientific knowledge” OR “empirical methods” OR “reasoning” OR

“hypothesis testing”).

4As Slavin (2008) defined this contrast, “A program is defined here as any set of replicable

procedures, materials, professional development, or service configurations that educators could

choose to implement to improve student outcomes. A program is distinct from a variable in

consisting of a specific, well-specified set of procedures and supports. Class size, assigning

homework, or provision of bilingual education are variables, for example, whereas programs

typically are based on particular textbooks, computer software, and/or instructional processes

and usually have a name and a specific provider, such as a company, university, or individual.”

5 For studies with multiple treatment arms, this includes separate effect sizes from each treatment

contrast. For studies that reported impacts for multiple groups of teachers or students (e.g.,

studies that reported impacts separately by grade), this includes separate impacts for each teacher

and student sample. However, we do not include multiple effect sizes for impacts on the same

assessment for the same teacher and student sample (e.g., cases where studies administered the


43

same assessment to examine follow-up impacts over time). We also include a separate effect size

for each assessment of math or science achievement reported by each study. For example, if a

study reported impacts on two separate assessments (e.g., a standardized test and a researcher-

designed assessment), both effect sizes were coded and included in the analysis. We did not

include impacts on assessment subscales or sub-scores if impacts on total scores on the

assessment were also reported. We included impacts on subscales or sub-scores only if impacts

on total scores were not reported.

6 Our technical advisory board consisted of five members – three with expertise in instructional

improvement and program evaluation, and two with expertise in meta-analysis. We met in

person with the first group to gather feedback on our proposed coding system. We consulted with

and sent drafts of our paper to the latter group for feedback on our data and models.

7 Some author-supplied effect sizes that were described as only as being similar to a standardized

mean difference (including standardized coefficients from multilevel models). These were

treated as Cohen’s d effect sizes and converted to Hedges’s g.

8 For these cases a standardized effect size was calculated by the ratio of the unstandardized

regression coefficient and an estimate of the standard deviation of the outcome. The outcome

standard deviation was estimated using the formula provided by Higgins and Deeks (2008):

𝑆𝐷 =𝑆𝐸

√1

𝑁𝑇+

1𝑁𝑐

where 𝑁𝐸 represents the number of students in the treatment group and 𝑁𝐶 represents the number

of students in the control group, and SE represents the coefficient standard error.

9 These include cases where the authors did not take into account the nesting of students within

classrooms and/or schools, cases where effect sizes were reported without the associated


44

standard errors, and cases where the authors reported cluster-adjusted impact estimates, but it

was necessary to calculate a standardized effect sizes based on other information. For example,

this includes cases where the authors reported that cluster-adjusted regression results were

significant below a given value (e.g., p<0.05) but neither standard errors nor p-values nor other

test statistics (e.g., t-statistics) were reported. Of the 258 effect sizes included in this study, we

applied a correction to adjust the standard error for clustering in 114 cases. The majority of effect

sizes that did not require the standard error correction for clustering were based on results of

multilevel models and regression models with clustered standard errors (e.g., using t-statistics,

standard errors and regression coefficients, and p-values). Other effect sizes were reported at the

cluster level (e.g., differences in mean classroom performance); no clustering adjustment was

necessary in these cases.

10 These include 20 cases where the authors referred to “no significant effect” on one or more

outcomes but did not report an effect sizes, and 9 cases where authors reported some outcome

information but did not provide enough information to calculate an effect size (e.g. cases where

the outcome information included only raw means).

11 Specifically, we assumed that the standardized effect size (Hedges’s g) was took on a range of

values (g = 0.00, g =-0.10, and g = -0.20), and used the study-level mean standard error based on

non-missing effect sizes.

12 Some studies contained multiple treatment arms where the characteristic was present in at least

one arm, but not present in at least one other arm. In Table 1, we consider whether the feature

was present in any treatment arm of the study. In all subsequent analyses, we use the study-level

mean.


45

References

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the

Chicago public high schools. Journal of Labor Economics, 25(1), 95-135.

Agodini, R., Harris, B., Remillard, J., & Thomas, M. (2013). After two years, three elementary

math curricula outperform a fourth. Washington, DC: National Center for Education

Evaluation and Regional Assistance.

Alozie, N. M., Moje, E. B., & Krajcik, J. S. (2010). An analysis of the supports and constraints

for scientific discussion in high school project‐based science. Science Education, 94(3),

395-427.

Arnold, D. H., Fisher, P. H., Doctoroff, G. L., & Dobbs, J. (2002). Accelerating math

development in Head Start classrooms. Journal of Educational Psychology, 94(4), 762-

770.

Ball, D. L., & Cohen, D. K. (1996). Reform by the book: What is: or might be: the role of

curriculum materials in teacher learning and instructional reform? Educational

Researcher, 25(9), 6-14.

Banilower, E., Smith, P. S., Weiss, I. R., & Pasley, J. D. (2006). The status of K-12 science

teaching in the United States: Results from a national observation survey.

In D.Sunal & E.Wright (Eds.), The impact of the state and national standards on K-12

science teaching (pp. 83–122). Greenwich, CT: Information Age Publishing.

Becker, K., & Park, K. (2011). Effects of integrative approaches among science, technology,

engineering, and mathematics (STEM) subjects on students' learning: A preliminary

meta-analysis. Journal of STEM Education: Innovations and Research, 12(5/6), 23-37.

Blank, R. K., & De Las Alas, N. (2010, March). Effects of teacher professional development on


46

gains in student achievement: How meta-analysis provides scientific evidence

useful to education leaders. Paper presented at the Society for

Research on Educational Effectiveness Spring Conference, Washington, DC.

Borman, K. M., Cotner, B. A., Lee, R. S., Boydston, T. L., & Lanehart, R. (2009, March).

Improving Elementary Science Instruction and Student Achievement: The Impact of a

Professional Development Program. Paper presented at the Society for Research on

Educational Effectiveness Spring Conference, Washington, DC.

Boyd, D. J., Grossman, P. L., Lankford, H., Loeb, S., & Wyckoff, J. (2009). Teacher preparation

and student achievement. Educational Evaluation and Policy Analysis, 31(4), 416-440.

Cavanagh, S. (2015, August 17). Spending on instructional materials, construction climbing in

K-12. EdWeek. Retrieved from https://marketbrief.edweek.org/marketplace-k-

12/instructional_materials_construction_spending_climbing_in_k-12/

Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact

of an integrated approach to science and literacy in elementary school classrooms.

Journal of Research in Science Teaching, 49(5), 631-658.

Cheung, A., Slavin, R. E., Kim, E., & Lake, C. (2017). Effective secondary science programs: A

best‐evidence synthesis. Journal of Research in Science Teaching, 54(1), 58-81.

Clark, D. B., Tanner-Smith, E. E., & Killingsworth, S. S. (2016). Digital games, design, and

learning: A systematic review and meta-analysis. Review of Educational Research, 86(1),

79-122.

Clements, D. H., & Sarama, J. (2007). Effects of a preschool mathematics curriculum:

Summative research on the Building Blocks project. Journal for Research in

Mathematics Education, 38(2), 136-163.

https://marketbrief.edweek.org/marketplace-k-12/instructional_materials_construction_spending_climbing_in_k-12/

https://marketbrief.edweek.org/marketplace-k-12/instructional_materials_construction_spending_climbing_in_k-12/


47

Clewell, B. C., Campbell, P. B., & Perlman, L. (2004). Review of evaluation studies mathematics

and science curricula and professional development models. Washington, DC: The

Urban Institute.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:

Lawrence Earlbaum Associates.

Confrey, J. (2006). Comparing and contrasting the National Research Council report on

evaluating curricular effectiveness with the What Works Clearinghouse approach.

Educational Evaluation and Policy Analysis, 28(3), 195-213.

Confrey, J., & Stohl, V. (Eds.). (2004). On evaluating curricular effectiveness: Judging the

quality of K-12 mathematics evaluations. Washington, DC: National Academies Press.

Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach. Los

Angeles, CA: Sage.

Corcoran, T. B. (1995). Helping teachers teach well: Transforming professional development.

Philadelphia, PA: CPRE Policy Briefs. Retrieved from

https://files.eric.ed.gov/fulltext/ED388619.pdf

Darling-Hammond, L., & McLaughlin, M. W. (1995). Policies that support professional

development in an era of reform. Phi Delta Kappan, 76(8), 597-604.

Davis, E. A., & Krajcik, J. S. (2005). Designing educative curriculum materials to promote

teacher learning. Educational Researcher, 34(3), 3-14.

Dede, C., Ketelhut, D., Whitehouse, P., Breit, L., & McCloskey, E. M. (2009). A research

agenda for online teacher professional development. Journal of Teacher Education,

60(1), 8-19.



48

Desimone, L. M. (2009). Improving impact studies of teachers' professional development:

Toward better conceptualizations and measures. Educational Researcher, 38(3), 181-199.

doi: 10.3102/0013189X08331140

Desimone, L. M. (2011). A primer on effective professional development. Phi Delta

Kappan, 92(6), 68-71.

Devlin-Scherer, W., Spinelli, A. M., Giammatteo, D., Johnson, C., Mayo-Molina, S., McGinley,

P., ... & Zisk, L. (1998, February). Action research in professional development schools:

Effects on student learning. Paper presented at the Annual Meeting of the American

Association of Colleges for Teacher Education, New Orleans, LA.

Dietrichson, J., Bøg, M., Filges, T., & Klint Jørgensen, A. M. (2017). Academic interventions for

elementary and middle school students with low socioeconomic status: A systematic

review and meta-analysis. Review of Educational Research, 87(2), 243-282.

Duval, S., & Tweedie, R. (2000). Trim and fill: a simple funnel‐plot–based method of testing and

adjusting for publication bias in meta‐analysis. Biometrics, 56(2), 455-463.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by

a simple, graphical test. British Medical Journal, 315(7109), 629-634.

Fisher, Z., & Tipton, E. (2015). robumeta: An R-package for robust variance estimation in meta-

analysis. Retrieved from https://arxiv.org/pdf/1503.02220.pdf

Frechtling, J., Sharp, L., Carey, N., & Vaden-Kiernan, N. (1995). Professional development

programs: A perspective on the last four decades. Washington, DC: National Science

Foundation, Division of Research. Evaluation and Communication.

Furtak, E. M., Seidel, T., Iverson, H., & Briggs, D. C. (2012). Experimental and quasi-


49

experimental studies of inquiry-based science teaching: A meta-analysis. Review of

educational research, 82(3), 300-329.

Gardella, J. H., Fisher, B. W., & Teurbe-Tolon, A. R. (2017). A systematic review and meta-

analysis of cyber-victimization and educational outcomes for adolescents. Review of

Educational Research, 87(2), 283-308.

Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., ... & Sepanik, S.

(2011). Middle school mathematics professional development impact study: Findings

after the second year of implementation. Washington, DC: National Center for Education


Gersten, R., Taylor, M. J., Keys, T. D., Rolfhus, E., & Newman-Gonchar, R. (2014). Summary of

research on the effectiveness of math professional development approaches. (REL 2014–

010). Washington, DC: U.S. Department of Education, Institute of Education Sciences,

National Center for Education Evaluation and Regional Assistance, Regional Educational

Laboratory Southeast. Retrieved from http://ies.ed.gov/ncee/edlabs

Goldenberg, C., & Gallimore, R. (1991). Changing teaching takes more than a one-shot

workshop. Educational Leadership, 49(3), 69-72.

Hand, B., Norton-Meier, L.A., Gunel, M., & Akkus, R. (2016). Aligning teaching to learning: A

3-year study examining the embedding of language and argumentation into elementary

science classrooms. International Journal of Science and Mathematics Education, 14(5),

847-863.

Hedges, L. V. and Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological

Bulletin, 88, 359–369.

http://ies.ed.gov/ncee/edlabs


50

Hiebert, J., Stigler, J. W., Jacobs, J. K., Givvin, K. B., Garnier, H., Smith, M., ... & Gallimore, R.

(2005). Mathematics teaching in the United States today (and tomorrow): Results from

the TIMSS 1999 video study. Educational Evaluation and Policy Analysis, 27(2), 111-

132.

Higgins, J. P. T & Deeks, J. J. (2008). Selecting studies and collecting data. In J.P.T. Higgins &

S. Green (Eds.), Conchrane handbook for systematic reviews of interventions (151-186).

Chichester, U.K.: John Wiley & Sons.

Higgins, J. P. T., Deeks, J. J., & Altman, D.G. (2008). Special topics in statistics. In J.P.T.

Higgins & S. Green (Eds.), Conchrane handbook for systematic reviews of interventions

(481-529). Chichester, U.K.: John Wiley & Sons.

Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for

interpreting effect sizes in research. Child Development Perspectives, 2(3), 172-177.

Kane, T., Kerr, K., & Pianta, R. (2014). Designing teacher evaluation systems: New guidance

from the measures of effective teaching project. John Wiley & Sons.

Kennedy, M. M. (1999). Form and substance in in-service teacher education. (Research

Monograph No. 13). Arlington, VA: National Science Foundation.

Kennedy, M. M. (2016). How does professional development improve teaching? Review of

Educational Research, 86(4), 945-980.

Kim, J. S., & Quinn, D. M. (2012). A meta-analysis of K-8 summer reading interventions: The

role of socioeconomic status in explaining variation in treatment effects. Paper presented

at the annual meeting of the Society for Research on Educational Effectiveness.

Lipsey, M. W., Puzio, K., Yun, C., Hebert, M. A., Steinka-Fry, K., Cole, M. W., ... & Busick, M.

D. (2012). Translating the Statistical Representation of the Effects of Education


51

Interventions into More Readily Interpretable Forms. National Center for Special

Education Research.

Littell, J. H., Corcoran, J., & Pillai, V. (2008). Systematic reviews and meta-analysis. New York,

NY: Oxford University Press.

Llosa, L., Lee, O., Jiang, F., Haas, A., O’Connor, C., Van Booven, C. D., & Kieffer, M. J.

(2016). Impact of a large-scale science intervention focused on English language

learners. American Educational Research Journal, 53(2), 395-424.

Malouf, D. B., & Taymans, J. M. (2016). Anatomy of an Evidence Base. Educational

Researcher, 45(8), 454-459.

Marx, R. W., Blumenfeld, P. C., Krajcik, J. S., Fishman, B., Soloway, E., Geier, R., & Tal, R. T.

(2004). Inquiry‐based science in the middle grades: Assessment of learning in urban

systemic reform. Journal of research in Science Teaching, 41(10), 1063-1080.

Miller, B., Lord, B., & Dorney, J. A. (1994). Staff development for teachers: A study of

configurations and costs in four districts. Boston, MA: Education Development Center.

Miles, K. H., Odden, A., Fermanich, M., & Archibald, S. (2004). Inside the black box of school

district spending on professional development: Lessons from five urban districts. Journal

of Education Finance, 30(1), 1-26.

National Governors Association Center for Best Practices, Council of Chief State School

Officers. (2010). Common Core State Standards. Washington, DC: Author.

National Research Council. (2004). On evaluating curricular effectiveness: Judging the quality

of K-12 mathematics evaluations. Washington, DC: The National Academies Press.


52

National Research Council. (2011). Successful K-12 STEM education: Identifying effective

approaches in science, technology, engineering, and mathematics. National Academies

Press.

National Research Council, Board on Science Education. (2012). A framework for K-12

science education: Practices, crosscutting concepts, and core ideas. Washington, DC:

The National Academies Press. Retrieved from

http://www.nap.edu/catalog.php?record_id=13165

National Research Council, National Committee for Science Education Standards and

Assessment. (1996). National science education standards. Washington, DC: The

National Academies Press. Retrieved from http://www.nap.edu/catalog/4962.html

Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam, R. (2014). Effectiveness of Cognitive

Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127-144.

Penuel, W. R., Sun, M., Frank, K. A., & Gallagher, H. A. (2012). Using social network analysis

to study how collegial interactions can augment teacher learning from external

professional development. American Journal of Education, 119(1), 103-136.

Presser, A. L., Vahey, P., & Dominguez, X. (2015, March). Improving mathematics learning by

integrating curricular activities with innovative and developmentally appropriate digital

apps: Findings from the Next Generation Preschool Math Evaluation. Paper presented at

the Society for Research on Educational Effectiveness Spring Conference, Washington,

DC.

Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between

published and unpublished effect sizes: A meta-review. Review of Educational

Research, 86(1), 207-236.

http://www.nap.edu/catalog/4962.html


53

Raudenbush, S. W. (2008). Advancing educational policy by advancing research on instruction.

American Educational Research Journal, 45(1), 206-230.

Raudenbush, S. W., & Bryk, A. S. (1985). Empirical Bayes meta-analysis. Journal of

Educational Statistics, 10(2), 75-98.

Resendez, M., and Azin, M., (2006). 2005 Prentice Hall Science Explorer randomized control

trial. Jackson, WY: PRES Associates, Inc.

Roschelle, J., Shechtman, N., Tatar, D., Hegedus, S., Hopkins, B., Empson, S., ... & Gallagher,

L. P. (2010). Integration of technology, curriculum, and professional development for

advancing middle school mathematics: Three large-scale studies. American Educational

Research Journal, 47(4), 833-878.

Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as a

lever for change: An experimental study of the effects of a professional development

program on students’ mathematics learning. Journal of Research on Educational

Effectiveness, 4(1), 1-24.

Saxe, G. B., & Gearhart, M. (2001). Enhancing students' understanding of mathematics: A study

of three contrasting approaches to professional support. Journal of Mathematics Teacher

Education, 4(1), 55-79.

Scher, L., & O'Reilly, F. (2009). Professional development for K–12 math and science teachers:

What do we really know? Journal of Research on Educational Effectiveness, 2(3), 209-

249.

Schneider, M. C., & Meyer, J. P. (2012). Investigating the efficacy of a professional

development program in formative classroom assessment in middle-school English

language arts and mathematics. Journal of Multidisciplinary Evaluation, 8(17), 1-24.


54

Schochet, P. Z. (2011). Do typical RCTs of education interventions have sufficient statistical

power for linking impacts on teacher practice and student achievement outcomes?

Journal of Educational and Behavioral Statistics, 36(4), 441-471.

Schwartz‐Bloom, R. D., & Halpin, M. J. (2003). Integrating pharmacology topics in high school

biology and chemistry classes improves performance. Journal of Research in Science

Teaching, 40(9), 922-938.

Shadish, W. R., & Haddock, C. K. (2009). Combining estimates of effect size. In H. Cooper, L.

V. Hedges, & J. C. Valentine, (Eds.), The handbook of research synthesis and meta-

analysis (Second edition). New York: Russell Sage.

Shavelson, R. J., & Towne, L. (Eds.). (2001). Scientific research in education. Washington, DC:

The National Academies Press.

Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational

researcher, 15(2), 4-14.

Slavin, R. E. (2008). Perspectives on evidence-based research in education—What works? Issues

in synthesizing educational program evaluations. Educational Researcher, 37(1), 5-14.

Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence

synthesis. Review of Educational Research, 78(3), 427-515.

Slavin, R. E., Lake, C., & Groff, C. (2009). Effective programs in middle and high school

mathematics: A best-evidence synthesis. Review of Educational Research, 79(2), 839-

911.

Slavin, R. E., Lake, C., Hanley, P., & Thurston, A. (2014). Experimental evaluations of

elementary science programs: A best-evidence synthesis. Journal of Research in Science

Teaching, 51(7), 870-901.


55

Smith , M. S., & O’Day, J. (1990). Systemic school reform. Politics of Education Association

Yearbook (pp. 233-267). London: Taylor & Francis Ltd.

Sparks, D. (2002). Designing powerful professional development for teachers and principals.

Oxford, OH: National Staff Development Council.

Stein, M. K., Remillard, J., & Smith, M. S. (2007). How curriculum influences student learning.

In F.K. Lester, Jr. (Ed.), Second handbook of research on mathematics teaching and

learning (319-369). New York, NY: Macmillan.

Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation with dependent effect

sizes: Practical considerations and a software tutorial in Stata and SPSS. Research

Synthesis Methods, 5(1), 13–30.

Tanner-Smith, E. E., Tipton, E., & Polanin, J. R. (2016). Handling complex meta-analytic data

structures using robust variance estimates: A tutorial in R. Journal of Developmental and

Life-Course Criminology, 2(1), 85-112.

Taylor, J. A., Getty, S. R., Kowalski, S. M., Wilson, C. D., Carlson, J., & Van Scotter, P. (2015).

An efficacy trial of research-based curriculum materials with curriculum-based

professional development. American Educational Research Journal, 52(5), 984-1017.

Taylor, J. A., Kowalski, S. M., Polanin, J. R., Askinas, K., Stuhlsatz, M. A., Wilson, C. D., ... &

Wilson, S. J. (2018). Investigating Science Education Effect Sizes: Implications for

Power Analyses and Programmatic Decisions. AERA Open, 4(3), 2332858418791991.

Timperley, H., Wilson, A., Barrar, H., & Fung, I. (2008). Teacher professional learning and

development.


56

Tipton, E., & Pustejovsky, J. E. (2015). Small-sample adjustments for tests of moderators and

model fit using robust variance estimation in meta-regression. Journal of Educational

and Behavioral Statistics, 40(6), 604-634.

What Works Clearinghouse. (2010). Procedures and standards handbook (Version 2.1).

Washington, DC: US Department of Education.

Wilson S.J., Tanner-Smith E.E., Lipsey, M.W., Steinka-Fry, K., Morrison, J. (2011) Dropout

prevention and intervention programs: Effects on school completion and dropout among

school aged children and youth. Campbell Systematic Reviews 2011:8 DOI:

10.4073/csr.2011.8

Wilson, S. M. (2013). Professional development for science teachers. Science, 340(6130),

310-313. doi: 10.1126/science.1230725

Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. (2007). Reviewing the

evidence on how teacher professional development affects student achievement (Issues &

Answers Report, REL 2007–No. 033). Washington, DC: U.S. Department of Education,

Institute of Education Sciences, National Center for Education Evaluation and Regional

Assistance, Regional Educational Laboratory Southwest. Retrieved from

http://ies.ed.gov/ncee/edlabs

Zaslow, M., Tout, K., Halle, T., Whittaker, J. V., & Lavelle, B. (2010). Toward the Identification

of Features of Effective Professional Development for Early Childhood Educators.

Literature Review. Office of Planning, Evaluation and Policy Development, US

Department of Education.

http://sfx.hul.harvard.edu/hvd?__char_set=utf8&id=doi:10.1126/science.1230725&sid=libx%3Ahul.harvard&genre=article


57

Tables and Figures

Table 1

Categories and descriptions of codes

Code Code description Code

presenta

Intervention Type Professional development

only Study included only professional development.

22%

Professional development

and curriculum materials

Study included both professional development and new

curriculum materials.

75%

Curriculum materials only Study included only new curriculum materials. 9%

Research Design and Sample

Characteristics

RCT Study is a randomized controlled trial. 91%

Subject matter – Math Subject matter focus of study is math or math/science. 64%

Preschool Study sample included preschool students. 19%

Effect Size Type State standardized test Outcome is a state standardized test. 17%

Other standardized test Outcome is from other standardized test. 30%

Adjusted for covariates Effect size is adjusted for covariates (e.g., pretest score). 76%

PD Formatb

Same-school collaboration Teachers participated in professional development with

other teachers from their own school. 74%

Implementation meetings

Teachers met formally or informally with other activity

participants to discuss classroom implementation (e.g.,

troubleshooting meeting).

35%

Online professional

development

Part or all of the professional development was

conducted online. 18%

Summer workshop The professional development included a summer

workshop. 54%

Expert coaching

The professional development involved coaching or

mentoring from experts who observed instruction and

provided feedback (e.g., a debriefing meeting; via video

or live).

20%

PD lead by

researchers/intervention

developers

The PD was led by the intervention developers and/or

the study authors. 64%


58


presenta

PD Timing and Durationc Contact hours Total number of PD contact hours. 45 hours

Timespan over which

professional development

was conducted:

Less than one week The PD was conducted over less than one week. 15%

One week The PD was conducted over one week. 3%

One month

(8 days to 30 days) The PD was conducted over 8 – 30 days. 3%

One semester

(31 days to 4 months) The PD was conducted over 31 days – 4 months. 13%

One year The PD was conducted over 4 months – 1 year. 49%

More than one year The PD was conducted over more than 1 year. 16%

PD Focusb

Generic instructional

strategies

The professional development was focused on content-

generic instructional strategies (e.g., improving

classroom climate and student motivation).

11%

How to use curriculum

materials The PD focused on how to use curriculum materials. 75%

Integrate technology

The PD focused on how to integrating technology into

the classroom. 11%

Content-specific formative

assessment

The PD focused on formative assessment strategies

specific to mathematics and science teaching (e.g.

strategies to elicit student understanding of fractions or

the scientific method).

17%

Improve content knowledge/

Pedagogical content

knowledge/How students

learn

The PD focused on improving teachers' pedagogical

content knowledge (e.g., how students learn

mathematics or science).

55%

PD Activitiesb

Review sample student work Teachers studied examples of students’ work (including

watching videos of students). 16%

Observed demonstration Teachers observed a video or live demonstration/

modeling of instruction. 35%

Solved problems/Worked

through student materials

Teachers solved problems or exercises during the PD or

worked through student materials during the PD. 42%

Developed curriculum/lesson

plans

Teachers developed curricula or lesson plans during the

PD. 19%

Reviewed own student work Teachers studied examples of their own students' work

during the PD. 11%


59


presenta

Curriculum Materialsd

Implementation guidance

The curriculum materials provided teachers with

implementation guidance (e.g., support for student-

teacher dialogues around the content).

43%

Laboratory/Hands-on

experience or curriculum kits

The curriculum materials included materials/guidance

that supported inquiry-oriented explorations (e.g.,

science laboratory or hands-on mathematics kits).

28%

Curriculum dosage (hours) Total number of hours that the curriculum was intended

to be used.

66.6

hours

Curriculum proportion

replaced (percent)

The proportion of each lesson that the new curriculum

was intended to replace existing curriculum.

91%

Note: N = 95 studies. a Figures in third column include percent of studies which feature the row code for binary

variables, or the sample average calculated at the study level for continuous variables (e.g.,

contact hours and curriculum dosage). For studies that had the feature present in one treatment

arm but not another treatment arm, the code is counted as present if it is present is any treatment

arm (e.g., a study with one treatment arm including only curriculum materials and a second

treatment arm including both professional development and curriculum materials would be

included in both rows). b Codes for PD focus, activities, and format were counted as “Not

Present” for studies that did not involve a PD component. c PD timing/duration includes excludes

studies without a PD component. d Codes for features of interventions involving new curriculum

materials were counted as “Not Present” for studies that did not involve new curriculum

materials.


60

Table 2

Results of estimating an unconditional meta-regression model with robust variance estimation

(RVE)

Dependent variable:

Effect Size (Hedges’s g)

Constant 0.209***

(0.025)

N effect sizes 258

N studies 95

𝜏2a 0.037

95% prediction intervalb (-0.165, 0.583)

Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects

provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b The 95%

prediction interval is calculated as the estimated average effect size +/- 1.96* 𝜏.

*p < .10, **p < .05, ***p < .01.


61

Table 3.

RVE results including controls for study design, study sample, subject area and outcome

measure type

Dependent variable: Effect Size (Hedges’s g)

Between-study effects

RCT -0.022

(0.104)

State standardized test -0.264***

(0.055)

Other standardized test -0.277***

(0.053)

Grade - preschool 0.133

(0.084)

Effect size adjusted for covariates -0.040

(0.055)

Subject matter- math -0.009

(0.044)

Within-study effects

State standardized test -0.208***

(0.052)

Other standardized test -0.203***

(0.059)

Constant 0.395***

(0.111)

N effect sizes 258

N studies 95

𝜏2a 0.025

Results of joint F-testb F = 5.75, df = 26.9, p<0.001

Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects

provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the

joint F test are from a test of the joint significance of all study characteristics included in the

model. The F test was estimated using the robumeta and clubSandwich package in R (Fisher &

Tipton, 2014; Tipton & Pustejovsky, 2015).

*p < .10, **p < .05, ***p < .01.


62

Table 4

RVE results including intervention characteristics (professional development and/or curriculum

materials as moderators



Professional development only -0.084 -0.090*

(0.051) (0.052)

New curriculum materials only -0.129

(0.118)

Professional development only/New

Curriculum materials only

-0.099**

(0.046)

N effect sizes 258 258 258

N studies 95 95 95

𝜏2a 0.025 0.025 0.025

Raw mean ES:

Professional development only 0.254

New curriculum materials only 0.091

Professional development only/ New

curriculum materials only 0.214

Both professional development and new

curriculum materials 0.252

Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.

Models include controls for the following: RCT, state standardized test, other standardized test,

grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects

provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014).

*p < .10, **p < .05, ***p < .01.


63

Table 5

RVE results including professional development contact hours as moderators

Dependent variable:

Effect Size (Hedges’s g)


PD contact hoursa 0.005

(0.007)

PD contact hours between 25th and 50th percentile (16

– 34.5 hours) 0.012

(0.050)

PD contact hours between 50th and 75th percentile (35

- 68 hours) -0.033

(0.065)

PD contact hours above 75th percentile

(> 68 hours) 0.069

(0.067)

PD contact hours at or above 25th percentile

(> 16 hours) 0.017

(0.043)


N studies 85 85 85

𝜏2b 0.022 0.024 0.023

Results of joint F testc F = 0. 647, df = 35.2, p = 0.590


All models include only studies and/or treatment arms with a professional development

component. All models include controls for the following: RCT, state standardized test, other

standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a PD contact hours measured as raw PD contact hours / 10. b 𝜏2 is the method of moments

estimate of the between study variance in the underlying effects provided by the robumeta

package in Stata 15 (Tanner-Smith & Tipton, 2014). c Results of the joint F test are from a test of

the joint significance of the PD contact hours predictors from the second model. The F test was

estimated using the robumeta and clubSandwich package in R (Fisher & Tipton, 2014; Tipton &

Pustejovsky, 2015).

*p < .10, **p < .05, ***p < .01.


64

Table 6

RVE results including professional development foci as moderators



Generic instructional

strategies -0.014 -0.074

(0.076) (0.072)

How to use curriculum

materials 0.118* 0.119*

(0.060) (0.063)

Integrate technology 0.185 0.139

(0.117) (0.102)

Content-specific

formative assessment 0.132* 0.105

(0.067) (0.062)

Improve pedagogical

content

knowledge/how

students learn 0.094** 0.096**

(0.043) (0.043)

Number of PD

features 0.077***

(0.024)

N effect sizes 237 237 237 237 237 237 237

N studies 89 89 89 89 89 89 89

𝜏2a 0.025 0.025 0.024 0.024 0.026 0.025 0.023

Results of joint

F testb F = 3.89, df = 20.4, p = 0.012




standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects


joint F test are from a test of the joint significance of the predictors for all professional

development foci included in the model. The F test was estimated using the robumeta and

clubSandwich package in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015).

*p < .10, **p < .05, ***p < .01.


65

Table 7

RVE results including professional development activities as moderators



Observed demonstration

0.033 0.025

(0.049) (0.046)

Review generic student

work 0.047 -0.014

(0.087) (0.085)

Solved problems/Worked

through

student materials 0.077 0.070

(0.047) (0.047)

Developed curriculum

materials/lesson plans 0.064 0.036

(0.061) (0.063)

Reviewed own student

work 0.033 0.017

(0.064) (0.071)

Number of PD features 0.046*

(0.024)

N effect sizes 237 237 237 237 236 236 237

N studies 89 89 89 89 88 88 89

𝜏2a 0.025 0.021 0.024 0.025 0.018 0.020 0.021

Results of joint F testb F = 0.883, df = 21.2, p = 0.509







development activities included in the model. The F test was estimated using the robumeta and


*p < .10, **p < .05, ***p < .01.


66

Table 8

RVE results including professional development formats as moderators



Same-school

collaboration 0.109 0.123*

(0.076) (0.067)

Implementati

on meetings 0.085* 0.117**

(0.049) (0.045)

Any online

PD -0.161*** -0.153**

(0.053) (0.057)

Summer

workshop 0.093** 0.074*

(0.043) (0.038)

Expert

coaching 0.035 0.053

(0.049) (0.051)

PD leaders –

researchers -0.019 -0.037

(0.043) (0.038)

Number of

PD features 0.023

(0.024)

N effect sizes 237 237 237 236 237 231 231 237

N studies 89 89 89 88 89 86 86 89

𝜏2a 0.020 0.022 0.024 0.023 0.026 0.025 0.015 0.023

Results of

joint F testb F = 2.72, df = 26.2, p = 0.035







development formats included in the model. The F test was estimated using the robumeta and


*p < .10, **p < .05, ***p < .01.


67

Table 9

RVE results including characteristics of interventions involving new curriculum materials as

moderators



Implementation guidance 0.060 0.062

(0.053) (0.056)

Laboratory/hands-on

experience, curriculum kits -0.029 -0.026

(0.055) (0.054)

Curriculum dosage

(Number of hours) 0.000 0.000

(0.001) (0.001)

Curriculum proportion replaced

(0.00-1.00) -0.060

(0.151)

N effect sizes 193 193 193 192 193

N studies 77 77 77 76 77

𝜏2a 0.027 0.031 0.030 0.025 0.031

Results of joint F testb F = 0.415, df = 24.4, p = 0.744


All models include only studies and/or treatment arms that included new curriculum materials.

All models include controls for the following: RCT, state standardized test, other standardized

test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects


joint F test are from a test of the joint significance of the predictors for all characteristics of

interventions involving new curriculum materials included in the model. The F test was

estimated using the robumeta and clubSandwich package in R (Fisher & Tipton, 2014; Tipton &

Pustejovsky, 2015).

*p < .10, **p < .05, ***p < .01.


68

Table 10

Regression adjusted mean effect sizes based on unconditional and conditional RVE meta-

regression models

Subgroup analysis

(Unconditional

RVE model)

Conditional RVE

model Conditional RVE

model with controls

for other program

features

𝑔𝑢𝑐

, 𝑝𝑢𝑐 𝑔

�� pc 𝑔

𝑐+ pc+

Overall effect size 0.209*** -- -- -- --

Program type Professional development

and curriculum materials 0.235*** 0.254 ** -- --

Professional development

only/Curriculum

materials only+

0.136*** 0.156 -- --

Outcome type State standardized test 0.060*** 0.101 *** -- --

Other standardized test 0.084*** 0.087 *** -- --

Intervenor-developed

test+ 0.371*** 0.365 -- --

Program feature category: Professional development focus Focus on how to use

curriculum materials

Yes 0.228*** 0.258 * 0.260 *

No 0.134*** 0.140 0.141 Focus on content-specific

formative assessment

Yes 0.387*** 0.340 * 0.321 No 0.178*** 0.208 0.216

Focus on improving

content

knowledge/pedagogical

content knowledge/how

students learn

Yes 0.303*** 0.269 ** 0.275 **

No 0.095*** 0.175 0.179


69

Subgroup analysis

(Unconditional

RVE model)

Conditional RVE

model Conditional RVE

model with controls

for other program

features

𝑔𝑢𝑐

a, pcb 𝑔

��c pc

d 𝑔𝑐+

e pc+f

Program feature category: Professional development formats

PD includes same-school

collaboration

Yes 0.198*** 0.248 0.247 *

No 0.263*** 0.139 0.125 PD includes

implementation meetings

Yes 0.283*** 0.284 * 0.297 **

No 0.169*** 0.199 0.180 Any online professional

development

Yes 0.116** 0.103 *** 0.096 **

No 0.235*** 0.264 0.249 PD includes summer

workshop

Yes 0.218*** 0.266 ** 0.253 *

No 0.190*** 0.174 0.179

Note: First column: Regression-adjusted mean effect sizes are based on subgroup analyses.

Unconditional RVE meta-regression models were estimated including only effect sizes with the

row feature. a guc = Estimated mean effect size from unconditional RVE meta-regression model. b puc = p-

value on constant from unconditional meta-regression model.

Second and third column: Regression-adjusted mean effect sizes are based on the results of

estimating conditional RVE meta-regression models. Models included each the row feature

separately as a moderator. All models included controls for the variables listed in Table 3.

Regression-adjusted mean effect sizes were calculated using the overall average of the study-

level values of each included covariate. c 𝑔𝑐 = Estimated regression-adjusted mean effect size from conditional RVE meta-regression

model. d pc = p-value on coefficient for program feature on interest from meta-regression model.

Fourth and fifth column: Regression-adjusted mean effect sizes are based on the results of

estimating conditional RVE meta-regression models. Models included all row features in a given

category simultaneously as moderators. All models included controls for the variables listed in

Table 3. Regression-adjusted mean effect sizes were calculated using the overall average of the

study-level values of each included covariate. e 𝑔𝑐+ = Estimated regression-adjusted mean effect size from conditional RVE meta-regression

model. f pc+ = p-value on coefficient for program feature on interest from meta-regression model.

*p < .10, **p < .05, ***p < .01.


70

Table 11

RVE results including other research design elements as moderators Dependent variable: Effect Size (Hedges’s g)


Unit of assignment

is teacher -0.054 -0.060

(0.043) (0.052)

Business as usual

control group 0.008 0.076

(0.079) (0.085)

Teacher

participation –

Voluntary -0.035 -0.032

(0.045) (0.054)

Teacher

participation –

Missing 0.111 0.114

(0.089) (0.103)

Teacher incentive –

Credit -0.167** -0.177*

(0.074) (0.097)

Teacher credit

incentive – Missing -0.050 -0.137

(0.055) (0.097)

Teacher incentive –

Monetary -0.054 0.016

(0.042) (0.057)

Teacher monetary

incentive – Missing -0.019 0.073

(0.063) (0.110)


High cluster -0.056 0.022


71

attrition

(0.051) (0.057)

Cluster attrition -

Missing -0.070 -0.069

(0.064) (0.078)

Higher student

attrition -0.068 -0.054

(0.046) (0.052)

Student attrition -

Missing -0.031 0.031

(0.054) (0.065)

N effect sizes 258 258 258 258 258 258 258 258

N studies 95 95 95 95 95 95 95 95

𝜏2a 0.024 0.022 0.026 0.024 0.026 0.026 0.027 0.028

Results of joint F

testb F = 1.30, df = 23.2, p = 0.284

Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include controls for the

following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject

matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects provided by the robumeta package in

Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the joint F test are from a test of the joint significance of the predictors for all

additional research design elements in the model. The F test was estimated using the robumeta and clubSandwich package in R

(Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015).

*p < .10, **p < .05, ***p < .01.


72

Table 12

RVE results including peer-reviewed publication type as a moderator Dependent variable: Effect Size (Hedges’s g)


Peer-reviewed source 0.046 0.067

(0.050) (0.041)

N effect sizes 258 258

N studies 95 95

𝜏2a 0.038 0.026


Models in column 2 include controls for the following: RCT, state standardized test, other


provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014).

*p < .10, **p < .05, ***p < .01.


73

Figure 1. PRISMA study screening flow chart (PRISMA, 2009).

Records identified through database searching

(n = 8,099)

Scre

enin

g In

clu

ded

El

igib

ility

Id

Ien

tifi

cati

on

Additional records identified through other sources

(n = 1,391)

Records after duplicates removed (n = 7,926)

Records screened (n = 7,926)

Records excluded (n = 7,270)

Full-text articles assessed for eligibility

(n = 656)

Full-text articles excluded (n = 561) (Reasons: Did not meet intervention characteristics [e.g., off-topic, not a classroom-level intervention]: n=248; Methodology issues [e.g., no control group; post-hoc design]: n=310; Sample issues [e.g., did not have at least two teachers and fifteen students in each condition], n=45; Report was subsumed under another study, n=11; Full-text of study could not be located, n=11; Study characteristics issues [published before 1989 or not written in English], n=4)

Studies included in quantitative synthesis

(meta-analysis) (n = 95)

Iden

tifi

cati

on

Sc

reen

ing

Elig

ibili

ty

Incl

ud

ed


74

Online Appendix A:

Tables and Figures


75

Table A1

Included studies and effect sizes

Authors Content Area Study Design Setting Grades

Served

Effect

Size

Effect

Size

SE

Effect

Size

Variance

(SE^2)

Agodini, Harris, Remillard, & Thomas (2013)

Mathematics Random

assignment

Elementary 2

-0.210 0.064 0.004

0.000 0.175 0.031

0.040 0.068 0.005

Argentin, Pennisi, Vidoni, Abbiati, & Caputo

(2014)

Mathematics Random

assignment

Middle

School

6, 7, 8

-0.001 0.044 0.002

Arnold, Fisher, Doctoroff, & Dobbs (2002) Mathematics Random

assignment

Preschool PK

0.436 0.364 0.132

Authors (2016) Mathematics Random

assignment

Elementary 4, 5

-0.060 0.085 0.007

-0.030 0.068 0.005

-0.020 0.044 0.002

0.000 0.161 0.026

0.000 0.155 0.024

0.020 0.086 0.007

0.030 0.085 0.007

0.040 0.062 0.004

0.050 0.147 0.022

0.080 0.297 0.088

0.080 0.052 0.003

0.100 0.073 0.005

Batiza, Luo, Zhang, Gruhl, Nelson, Hoelzer,

… & Marcey (2016)

Science Random

assignment

High School 9

0.549 0.096 0.009

0.699 0.259 0.067


76

Battistich, Alldredge, & Tsuchida (2003) Mathematics Random

assignment

Elementary 2

-0.404 0.601 0.361

-0.363 0.476 0.226

-0.157 0.498 0.248

0.133 0.405 0.164

0.206 0.616 0.379

0.292 0.588 0.346

0.755 0.570 0.324

1.007 0.529 0.280

Berlinski & Busso (2015) Mathematics Random

assignment

Middle

School

7

-0.247 0.081 0.007

-0.171 0.080 0.006

Beuermann, Naslund-Hadley, Ruprah, &

Thompson (2013)

Science Random

assignment

Elementary 3

-0.030 0.060 0.004

0.030 0.070 0.005

0.180 0.080 0.006

Borman, Gamoran & Bowdon (2008) Science Random

assignment

Elementary 4,5

-0.270 0.092 0.008

-0.080 0.104 0.011

0.010 0.085 0.007

Bottge, Ma, Gassaway, Toland, Butler, &

Cho (2014)

Mathematics Random

assignment

Middle

School, High

School

6,7,8, 9,

10, 11

0.040 0.188 0.035

0.390 0.190 0.036

0.440 0.191 0.037

1.000 0.200 0.040

Bottge, Toland, Gassaway, Butler, Cho,

Griffen, & Ma (2015)

Mathematics Random

assignment

Middle

School

6,7,8

-0.220 0.180 0.032

0.100 0.112 0.013

0.150 0.112 0.013

0.260 0.180 0.032


77

0.290 0.178 0.032

0.380 0.110 0.012

0.470 0.175 0.031

0.610 0.114 0.013

Bradshaw (2011) Science Random

assignment

Middle

School

6,7,8

0.251 0.131 0.017

0.310 0.130 0.017

Brendefur, Strother, Thiede, Lane, & Surges-

Prokop (2013)

Mathematics Random

assignment

Preschool PK

0.872 0.288 0.083

Brown, Greenfield, Bell, Juárez, Myers, &

Nayfield (2013)

Science Random

assignment

Preschool PK

0.121 0.066 0.004

Carpenter, Fennema, Peterson, Chiang, &

Loef (1989)

Mathematics Random

assignment

Elementary 1

0.202 0.810 0.657

0.426 0.818 0.669

0.452 0.819 0.671

0.453 0.819 0.671

0.511 0.822 0.675

0.545 0.824 0.678

0.551 0.824 0.679

0.692 0.833 0.694

0.747 0.837 0.701

Cervetti, Barber, Dorph, Pearson, &

Goldschmidt (2012)

Science and

Reading

Random

assignment

Elementary 4

0.220 0.073 0.005

0.400 0.102 0.010

0.650 0.083 0.007

Clark, Arens & Stewart (2015) Mathematics Random

assignment

Middle

School

7

0.040 0.153 0.023

0.056 0.153 0.023

Clarke, Smolkowski, Baker, Fien, Doabler, & Mathematics Random Elementary K 0.189 0.099 0.010


78

Chard (2011) assignment

0.232 0.097 0.009

Clements & Sarama (2007) Mathematics Random

assignment

Preschool PK

0.851 0.517 0.267

1.464 0.558 0.311

Clements & Sarama (2008) Mathematics Random

assignment

Preschool PK

0.637 0.215 0.046

1.066 0.116 0.013

Clements, Sarama, Spitler, Lange, & Wolfe

(2011)

Mathematics Random

assignment

Preschool,

Elementary

PK, K

0.720 0.099 0.010

Dash, De Kramer, O'dwyer, Masters, &

Russell, (2012)

Mathematics Random

assignment

Elementary 5

0.132 0.068 0.005

DeBarger, Penuel, Moorthy, Beauvineau,

Kennedy, & Boscardin (2016)

Science Quasi-

experimental

Middle

School

Not

specifie

d 0.637 0.249 0.062

Devlin-Sherer, Spinelli, Giamatteo, Johnson,

Mayo-Molina, McGinley, … & Zisk (1998)

Mathematics Quasi-

experimental

Elementary 3, 4, 5

-0.343 0.695 0.482

-0.227 0.692 0.478

-0.136 0.682 0.465

0.297 0.683 0.466

Dominguez, Nicholls, & Storandt (2006) Mathematics Random

assignment

Elementary 3, 4, 5

0.082 0.107 0.012

0.100 0.108 0.012

Eddy & Berry (2006) Mathematics Random

assignment

High School 9, 10,

11, 12 0.006 0.053 0.003

Eddy, Ruitman, Sloper, & Hankel (2010) Mathematics Random

assignment

High School 9, 10,

11 0.175 0.189 0.036

Garet, Wayne, Stancavage, Taylor, Walters,

Song, … & Doolittle (2010)

Mathematics Random

assignment

Middle

School

7

0.020 0.044 0.002

0.050 0.066 0.004

Granger, Bevis, Saka, Southerland, Sampson,

& Tate (2012)

Science Random

assignment

Elementary 4, 5

0.067 0.051 0.003


79

0.573 0.089 0.008

Greenleaf, Litman, Hanson, Rosen,

Boscardin, Herman, … & Jones (2011)

Science Random

assignment

High School 9, 10

0.280 0.109 0.012

Gropen, Clark-Chiarelli, Chalufour,

Hoisington, & Eggers-Piérola, (2009)

Science Random

assignment

Preschool PK

0.078 0.112 0.012

0.141 0.112 0.012

0.144 0.112 0.012

0.222 0.112 0.013

Gropen, Clark-Chiarelli, Ehrlich, & Thieu

(2011)

Science Random

assignment

Preschool PK

0.379 0.119 0.014

Hand, Therrien, & Shelley (2013) Science Random

assignment

Elementary,

Middle

School

3, 4, 5,

6

-0.001 0.132 0.018

0.426 0.134 0.018

Harris, Penuel, D'Angelo, Haydel, DeBarger,

Gallagher, Kennedy, … & Krajcik (2015)

Science Random

assignment

Middle

School

6

0.220 0.102 0.010

0.250 0.124 0.015

Heller (2012) Science Random

assignment

Middle

School

8

0.029 0.064 0.004

0.109 0.052 0.003

Heller, Curtis, Rabe-Hesketh, & Verboncoeur

(2007)

Mathematics Random

assignment

Elementary,

Middle

School

2, 4, 6

0.010 0.198 0.039

0.111 0.179 0.032

0.352 0.220 0.048

0.429 0.137 0.019

0.659 0.155 0.024

Heller, Daehler, Wong, Shinohara, & Miratrix

(2012)

Science Random

assignment

Elementary 4

0.010 0.090 0.008


80

0.070 0.095 0.009

0.310 0.090 0.008

0.370 0.084 0.007

0.570 0.086 0.007

0.600 0.092 0.008

Heller, Hanson, & Barnett-Clarke (2010) Mathematics Random

assignment

Elementary 4, 5

0.040 0.094 0.009

0.080 0.079 0.006

0.090 0.083 0.007

0.150 0.088 0.008

0.180 0.088 0.008

0.190 0.087 0.008

Hinerman, Hull, Chen, Booker, & Naslund-

Hadley (2014)

Mathematics Random

assignment

Elementary,

Middle

School

K, 1, 2,

3, 4, 5,

6, 7 0.280 0.135 0.018

Jaciw, Hegseth, Ma, & Lai (2012) Mathematics Random

assignment

Elementary 3, 4, 5

0.050 0.082 0.007

0.120 0.061 0.004

0.140 0.085 0.007

Jacobs, Franke, Carpenter, Levi, & Battey

(2007)

Mathematics Random

assignment

Elementary 1, 2, 3,

4, 5 0.277 0.197 0.039

0.316 0.230 0.053

0.348 0.187 0.035

0.766 0.339 0.115

0.830 0.256 0.065

0.841 0.233 0.054

1.016 0.348 0.121

Jerrim & Vignoles (2015) Mathematics Random

assignment

Elementary,

Middle

School

UK1, 7

0.055 0.046 0.002

0.099 0.054 0.003

Kaldon & Zoblotsky (2014) Science Random Elementary, 3 0.020 0.038 0.001


81

assignment Middle

School

0.040 0.027 0.001

Kim, Van Tassel-Baska, Bracken, Feng,

Stambaugh, & Bland (2012)

Science Random

assignment

Elementary K, 1, 2,

3 0.168 0.184 0.034

Kinzie, Whittaker, Williford, Decoster,

Mcguire, Lee, & Kilday (2014)

Mathematics

and Science

Random

assignment

Preschool PK

-0.076 0.060 0.004

-0.025 0.039 0.002

-0.010 0.056 0.003

-0.002 0.062 0.004

0.016 0.204 0.042

0.062 0.067 0.005

0.070 0.076 0.006

0.073 0.062 0.004

0.081 0.037 0.001

0.131 0.056 0.003

Kisker, Lipka, Adams, Rickard, Andrew-

Ihrke, Yanez, & Millard (2012)

Mathematics Random

assignment

Elementary 2

0.390 0.118 0.014

0.819 0.157 0.025

Klein, Starkey, Clements, Sarama, & Iyer

(2008)

Mathematics Random

assignment

Preschool PK

0.229 0.178 0.032

0.549 0.180 0.033

Klein, Starkey, DeFlorio, & Brown (2012) Mathematics Random

assignment

Preschool,

Elementary

PK, K

0.331 0.118 0.014

0.450 0.114 0.013

0.699 0.127 0.016

0.829 0.118 0.014

Lafferty (1994) Mathematics Quasi-

experimental

Middle

School

0.428 0.345 0.119

Lanehart, Borman, Boydston, Cotner, & Lee

(2010)

Science Random

assignment

Elementary 3,4,5

0.188 0.063 0.004


82

Lang, Schoen, LaVenia, & Oberlin (2014) Mathematics Random

assignment

Elementary K, 1

0.200 0.090 0.008

0.240 0.080 0.006

Lara-Alecio, Tong, Irby, Guerrero, Huerta, &

Fan (2012)

Science Quasi-

experimental

Elementary 5

-0.080 0.306 0.094

0.052 0.306 0.094

0.361 0.311 0.097

0.394 0.312 0.097

0.396 0.312 0.097

Lehrer (2015) Mathematics Random

assignment

Middle

School

6

0.418 0.097 0.009

0.568 0.082 0.007

Lewis & Perry (2015) Mathematics Random

assignment

Elementary 2, 3, 4,

5 0.496 0.136 0.018

Llorente, Pasknik, Moorthy, Hupert,

Rosenfeld, & Gerard (2015)

Mathematics Random

assignment

Preschool PK

0.000 0.078 0.006

0.010 0.039 0.002

0.150 0.081 0.007

0.240 0.048 0.002

Llosa, Lee, Jiang, Haas, O’Connor,

Van Booven, & Kieffer (2016)

Science Random

assignment

Elementary 5

0.150 0.051 0.003

0.250 0.113 0.013

Maerten-Rivera, Ahn, Lanier, Diaz, & Lee

(2014)

Science Random

assignment

Elementary 5

-0.011 0.151 0.023

0.136 0.155 0.024

0.170 0.155 0.024

Martin, Braisel, & Turner, & Wise (2012) Mathematics Random

assignment

Middle

School

6

0.020 0.071 0.005

0.090 0.069 0.005

McCoach, Gubbins, Foreman, Rubenstein, &

Rambo-Hernandez (2014)

Mathematics Random

assignment

Elementary 3

0.030 0.077 0.006


83

Miller, Jaciw, & Ma (2007) Science Random

assignment

Elementary 3,4,5

-0.020 0.040 0.002

Montague, Krawec, Enders, & Dietz (2014) Mathematics Random

assignment

Middle

School

7

0.613 0.250 0.063

Mutch-Jones, Puttick, & Demers (2014) Science Random

assignment

Elementary

and Middle

School

5, 6, 7,

8

0.082 0.039 0.001

Newman, Finney, Bell, Turner, Jaciw,

Zacamy, & Gould (2012)

Mathematics Random

assignment

Elementary

and Middle

School

4,5,6,7,

8

0.048 0.015 0.000

0.050 0.028 0.001

Niess (2005) Mathematics Random

assignment

Elementary

and Middle

School

3, 4, 5,

6, 7, 8

0.129 0.185 0.034

0.362 0.212 0.045

Oh, Lachapelle, Shams, Hertel, &

Cunningham (2016)

Science Random

assignment

Elementary Not

specifie

d -0.070 0.111 0.012

-0.030 0.125 0.016

0.010 0.070 0.005

0.030 0.072 0.005

0.070 0.092 0.008

0.100 0.058 0.003

0.110 0.134 0.018

0.140 0.081 0.006

0.170 0.072 0.005

0.180 0.080 0.006

0.200 0.124 0.015

0.220 0.077 0.006

Pane, Griffin, Mccaffrey, Karam (2014) Mathematics Random

assignment

High School 6, 7, 8,

9, 10, -0.100 0.100 0.010


84

11, 12

-0.030 0.110 0.012

0.190 0.130 0.017

0.220 0.090 0.008

Penuel, Gallagher & Moorthy (2011) Science Random

assignment

Middle

School

6,7,8

0.180 0.183 0.034

0.290 0.114 0.013

0.340 0.124 0.015

Piasta, Logan, Pelatti, Capps, & Petrill (2015) Mathematics

and Science

Random

assignment

Preschool PK

-0.080 0.307 0.094

-0.010 0.332 0.110

0.080 0.301 0.091

0.130 0.266 0.071

Presser, Clements, Ginsburg, & Ertle (2015) Mathematics Random

assignment

Preschool,

Elementary

PK, K

0.318 0.247 0.061

0.319 0.147 0.022

0.491 0.251 0.063

0.555 0.253 0.064

Presser, Vahey, & Dominguez (2015) Mathematics Random

assignment

Preschool PK

0.508 0.228 0.052

Pyke, Lynch, Kuipers, Szesze, & Driver

(2004)

Science Random

assignment

Middle

School

8

0.320 0.166 0.028

0.383 0.165 0.027

Pyke, Lynch, Kuipers, Szesze, & Watson

(2006)

Science Random

assignment

Middle

School

6, 7, 8

0.230 0.318 0.101

Pyke, Lynch, Kuipers, Szesze, & Watson

(2005)

Science Random

assignment

Middle

School

7

-0.216 0.237 0.056

Reid, Chen, & McCray (2014) Mathematics Quasi-

experimental

Preschool,

Elementary

PK, K

0.060 0.079 0.006

Resendez & Azin (2006) Mathematics Random

assignment

Elementary 3,5

-0.070 0.057 0.003


85

0.050 0.086 0.007

Resendez & Azin (2008) Mathematics Random

assignment

Elementary 2, 3, 4,

5 0.200 0.063 0.004

0.210 0.021 0.000

0.240 0.102 0.011

Rimbey (2013) Mathematics Random

assignment

Elementary 2

0.245 0.323 0.104

Roschelle, Shechtman, Tatar, Hegedus,

Hopkins, Empson, … & Gallagher (2010)

Mathematics Random

assignment

Middle

School

7,8

0.379 0.063 0.004

0.606 0.059 0.003

Roth, Wilson, Taylor, Hvidsten, Stennett,

Wickler, … & Bintz (2015)

Science Random

assignment

Elementary 4,5

0.680 0.041 0.002

San Antonio, Morales, & Moral (2011) Mathematics Random

assignment

Middle

School

6

0.268 0.280 0.078

Santagata, Kersting, Givvin, & Stigler (2010) Mathematics Random

assignment

Middle

School

6

-0.022 0.134 0.018

Sarama, Clements, Starkey, Klein, &

Wakeley (2008)

Mathematics Random

assignment

Preschool PK

0.618 0.166 0.028

Saxe, Gearhart, & Nasir (2001) Mathematics Random

assignment

Elementary 4, 5, 6

0.770 1.251 1.565

1.450 1.366 1.867

Schneider (2013) Mathematics Random

assignment

Middle

School, High

School

8, 9

0.298 0.135 0.018

0.400 0.184 0.034

Schneider & Meyer (2012) Mathematics Random

assignment

Middle

School

6

0.030 0.049 0.002

Schwartz‐Bloom & Halpin (2003) Science Random

assignment

High School 9, 10,

11, 12 0.380 0.138 0.019

0.580 0.145 0.021

Shannon & Grant (2012) Science Random

assignment

High School 9, 10,

11, 12 0.060 0.083 0.007


86

0.120 0.097 0.009

Sophian (2004) Mathematics Quasi-

experimental

Preschool PK

0.406 0.431 0.185

0.794 0.442 0.196

Supovitz (2013) Mathematics Random

assignment

Elementary 1, 2, 3,

4, 5 0.001 0.046 0.002

0.035 0.046 0.002

0.059 0.046 0.002

Star, Pollack, Durkin, Rittle-Johnson, Lynch,

Newton, & Gogolen (2015)

Mathematics Random

assignment

Middle

School, High

School

8, 9

-0.058 0.120 0.014

0.047 0.134 0.000

Starkey, Klein, & DeFlorio (2013) Mathematics Random

assignment

Preschool PK

0.589 0.165 0.027

1.078 0.173 0.030

Tatar, Roschelle, Knudsen, Shechtman,

Kaput, & Hopkins (2008)

Mathematics Random

assignment

Middle

School

7, 8

1.249 0.735 0.540

Tauer (2002) Mathematics Quasi-

experimental

High School 10

0.208 0.484 0.234

Taylor, Getty, Kowalski, Wilson, Carlson, &

Van Scotter (2015)

Science Random

assignment

High School 9, 10

0.090 0.040 0.002

Taylor, Roth, Wilson, Stuhlsatz, & Tipton

(2016)

Science Random

assignment

Elementary 4, 5

0.520 0.071 0.005

Thompson, Senk, & Yu (2012) Mathematics Quasi-

experimental

Middle

School

7

-0.125 0.247 0.061

-0.086 0.247 0.061

0.199 0.248 0.061

Vaden-Kiernan, Borman, Caverly, Bell, Ruiz

de Castilla, Sullivan, & Rodriguez (2016)

Mathematics Random

assignment

Elementary K, 1, 3,

4 -0.069 0.036 0.001

0.059 0.035 0.001


87

Van Egeren, Schwarz, Gerde, Morris, Pierce,

Brophy-Herb, … & Stoddard (2014)

Science Random

assignment

Preschool PK

-0.269 0.193 0.037

-0.030 0.192 0.037

0.017 0.192 0.037

0.150 0.192 0.037

0.251 0.193 0.037

Walsh-Cavazos (1994) Mathematics Quasi-

experimental

Elementary 5

0.554 0.439 0.193


88

Table A2

Estimated mean effect sizes based on unconditional RVE meta-regression models: Differences by

presence of attrition problems

Subgroup analysis

(Unconditional RVE

model)

Attrition problemsa

Attrition problem at the cluster level 0.180***

Differential attrition between treatment and control groups 0.131***

No attrition problems 0.243***

Note: Table includes estimated mean effect sizes from unconditional RVE models. Mean effect

sizes are based on subgroup analyses where unconditional RVE meta-regression models were

estimated using only effect sizes with (or without) attrition problems. a In some studies, attrition problems were only present for some outcomes (e.g., in studies with

multiple treatment arms, there may have been differential attrition between the control group and

one treatment arm, but not between the control group and a different treatment arm). In these

cases, only effect sizes from contrasts with (or without) the relevant attrition problems were

included. Studies may be included in more than one row if studies included had both cluster-

level attrition and differential attrition between treatment and control groups.

*p < .10, **p < .05, ***p < .01.


89

Table A3

Results of publication bias tests among full sample, peer-reviewed, and non-peer-reviewed

studies

Full sample Peer-reviewed

studies

Non-peer-reviewed

studies

Egger’s regression test

Precision 0.034 0.006 0.064

(0.031) (0.035) (0.051)

[0.276] [0.866] [0.212]

Intercept 1.488 1.875 1.052

(0.389) (0.434) (0.657)

[<0.001] [<0.001] [0.116]

N studies 95 46 49

Average impact,

based on random

effects meta-analysis

0.207*** 0.227*** 0.189***

Average impact, trim-

and-fill results based

on random effects

meta-analysis

0.085*** 0.122*** 0.189***

Estimating meta-regression model using RVE, including effect size standard error as a

predictor

Standard error 1.078 1.412 0.691

(0.219) (0.258) (0.329)

[<0.001] [<0.001] [0.050]

Intercept 0.085 0.065 0.111

(0.035) (0.043) (0.050)

[0.015] [0.143] [0.037]


N studies 95 46 49

Note: Standard errors in parentheses. p-values in brackets. Models in the first column include all

studies. Models in second column include only peer-reviewed studies. Models in the third

column include only non-peer-reviewed studies.

Top panel: Observations are at the study level. Outcomes are study-level average standard

normal deviations (the study-average effect size divided by the study-average standard error).

Egger’s regression test carried out using the metabias package in Stata 15; trim-and-fill method

carried out using the metatrim package in Stata 15.

Bottom panel: Observations are at the effect size level. Outcomes are effect sizes.

*p < .10, **p < .05, ***p < .01.


90

Figure A1. Funnel plot of included effect sizes.

0.5

11

.5S

tan

da

rd e

rro

r

-4 -2 0 2 4Effect size (Hedges's g)

Peer-reviewed studies

Funnel plot for impacts on math, science achievement

0.5

11

.5S

tan

da

rd e

rro

r

-4 -2 0 2 4Effect size (Hedges's g)

All studies

Funnel plot for impacts on math, science achievement0

.2.4

.6.8

Sta

nd

ard

err

or

-1 0 1 2Effect size (Hedges's g)

Non-peer-reviewed studies

Funnel plot for impacts on math, science achievement


91

Online Appendix B:

References for Included Studies

*Agodini, R., Harris, B., Remillard, J., & Thomas, M. (2013). After two years, three elementary

math curricula outperform a fourth. Washington, DC: National Center for Education


*Argentin, G., Pennisi, A., Vidoni, D., Abbiati, G., & Caputo, A. (2014). Trying to raise (low)

math achievement and to promote (rigorous) policy evaluation in Italy: Evidence from a

large-scale randomized trial. Evaluation Review, 38(2), 99-132.

*Arnold, D. H., Fisher, P. H., Doctoroff, G. L., & Dobbs, J. (2002). Accelerating math

development in Head Start classrooms. Journal of Educational Psychology, 94(4), 762-

770.

**Batiza, A. F., Gruhl, M., Zhang, B., Harrington, T., Roberts, M., LaFlamme, D., ... &

Hagedorn, E. (2013). The effects of the SUN project on teacher knowledge and self-

efficacy regarding biological energy transfer are significant and long-lasting: Results of a

randomized controlled trial. CBE-Life Sciences Education, 12(2), 287-305.

*Batiza, A., Luo, W., Zhang, B., Gruhl, M., Nelson, D., Hoelzer, M., ... & LaFlamme, D. (2016,

March). Regular Biology Students Learn Like AP Students with SUN. Paper presented at


DC.

*Battistich, V., Alldredge, S. and Tsuchida, I. (2003). Number power: An elementary school

program to enhance students' mathematical and social development. In S.L. Senk, & D.R.

Thompson (Eds.) Standards-based school mathematics curricula: What are they? What

do students learn? (133–160). Mahwah, NJ: Erlbaum.


92

**Berlinski, S.G., & Busso, M. (2013). Pedagogical change in mathematics teaching: Evidence

from a randomized control trial. Retrieved from

http://www.webmeets.com/files/papers/lacea-lames/2013/438/BB2013_september.pdf.

*Berlinski, S.G.; Busso, M. (2015). Challenges in Educational Reform: An Experiment on Active

Learning in Mathematics (IDB Working Paper Series, No. IDB-WP-561). Washington,

DC: Inter-American Development Bank. Retrieved from

http://hdl.handle.net/11319/6825

*Beuermann, D. W., Naslund-Hadley, E., Ruprah, I. J., & Thompson, J. (2013). The pedagogy of

science and environment: Experimental evidence from Peru. The Journal of Development

Studies, 49(5), 719-736.

*Borman, K. M., Cotner, B. A., Lee, R. S., Boydston, T. L., & Lanehart, R. (2009, March).

Improving Elementary Science Instruction and Student Achievement: The Impact of a

Professional Development Program. Paper presented at the Society for Research on

Educational Effectiveness Spring Conference, Washington, DC.

*Borman, G. D., Gamoran, A., & Bowdon, J. (2008). A randomized trial of teacher development

in elementary science: First-year achievement effects. Journal of Research on

Educational Effectiveness, 1(4), 237-264.

*Bottge, B. A., Ma, X., Gassaway, L., Toland, M. D., Butler, M., & Cho, S. J. (2014). Effects of

blended instructional models on math performance. Exceptional children, 80(4), 423-437.

*Bottge, B. A., Toland, M. D., Gassaway, L., Butler, M., Choo, S., Griffen, A. K., & Ma, X.

(2015). Impact of enhanced anchored instruction in inclusive math classrooms.

Exceptional Children, 81(2), 158-175.




93

*Bradshaw, T. J. (2012). Impact of inquiry based distance learning and availability of classroom

materials on physical science content knowledge of teachers and students in Central

Appalachia (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses

Global. (Order no. 1506611403) http://search.proquest.com.ezp-

prod1.hul.harvard.edu/docview/1506611403?accountid=11311

*Brendefur, J., Strother, S., Thiede, K., Lane, C., & Surges-Prokop, M. J. (2013). A professional

development program to improve math skills among preschool children in Head Start.

Early Childhood Education Journal, 41(3), 187-195.

*Brown, J. A., Greenfield, D. B., Bell, E., Juárez, C. L., Myers, T., & Nayfeld, I. (2013, March).

ECHOS: Early Childhood Hands-On Science efficacy study. Paper presented at the

Society for Research on Educational Effectiveness Spring Conference, Washington, DC.

*Carpenter, T. P., Fennema, E., Peterson, P. L., Chiang, C. P., & Loef, M. (1989). Using

knowledge of children’s mathematics thinking in classroom teaching: An experimental

study. American Educational Research Journal, 26(4), 499-531.

*Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact

of an integrated approach to science and literacy in elementary school classrooms.

Journal of Research in Science Teaching, 49(5), 631-658.

*Clark, T. F., Arens, S. A., & Stewart, J. (2015, March). Efficacy study of a pre-algebra

supplemental program in rural Mississippi: Preliminary findings. Paper presented at the


*Clarke, B., Smolkowski, K., Baker, S. K., Fien, H., Doabler, C. T., & Chard, D. J. (2011). The

impact of a comprehensive Tier I core kindergarten program on the achievement of

students at risk in mathematics. The Elementary School Journal, 111(4), 561-584.


94

*Clements, D. H., & Sarama, J. (2007). Effects of a preschool mathematics curriculum:

Summative research on the Building Blocks project. Journal for Research in

Mathematics Education, 38(2), 136-163.

*Clements, D. H., & Sarama, J. (2008). Experimental evaluation of the effects of a research-

based preschool mathematics curriculum. American Educational Research Journal,

45(2), 443-494.

*Clements, D. H., Sarama, J., Spitler, M. E., Lange, A. A., & Wolfe, C. B. (2011). Mathematics

learned by young children in an intervention based on learning trajectories: A large-scale

cluster randomized trial. Journal for Research in Mathematics Education, 42(2), 127–

166.

*Dash, S., De Kramer, R.M., O’Dwyer, L. M., Masters, J., & Russell, M. (2012). Impact of

online professional development or teacher quality and student achievement in fifth grade

mathematics. Journal of research on Technology in Education, 45(1), 1-26.

*Debarger, A. H., Penuel, W. R., Moorthy, S., Beauvineau, Y., Kennedy, C. A., & Boscardin, C.

K. (2017). Investigating purposeful science curriculum adaptation as a strategy to

improve teaching and learning. Science Education, 101(1), 66-98.

*Dominguez, P. S., Nicholls, C., & Storandt, B. (2006). Experimental methods and results in a

study of PBS TeacherLine math courses. Syracuse, NY: Hezel Associates. Retrieved from


*Eddy, R. M., & Berry, T. (2006). A randomized control trial to test the effects of Prentice

Hall’s Miller and Levine (2006) Biology curriculum on student performance: Final

report. Claremont, CA: Claremont Graduate University.


95

*Eddy, R. M., Ruitman, H. T., Sloper, M., & Hankel, N. (2010). The effects of Miller & Levine

Biology (2010) on student performance: Final report. LaVerne, CA: Cobblestone

Applied Research & Evaluation, Inc. Retrieved from

http://assets.pearsonschool.com/asset_mgr/pending/2013-

09/FinalEfficacyReport_ML%20Bio.pdf

*Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., ... & Sepanik, S.

(2011). Middle school mathematics professional development impact study: Findings

after the second year of implementation. Washington, DC: National Center for Education


*Granger, E. M., Bevis, T. H., Saka, Y., Southerland, S. A., Sampson, V., & Tate, R. L. (2012).

The efficacy of student-centered instruction in supporting science learning. Science,

338(6103), 105-108.

*Gropen, J., Clark-Chiarelli, N., Chalufour, I., Hoisington, C., & Eggers-Pierola, C. (2009,

March). Creating a successful professional development program in science for Head

Start teachers and children: Understanding the relationship between development,

intervention, and evaluation. Paper presented at the Society for Research on Educational

Effectiveness Spring Conference Washington, DC.

*Gropen, J., Clark-Chiarelli, N., Ehrlich, S., & Thieu, Y. (2011, March). Examining the efficacy

of Foundations of Science Literacy: Exploring contextual factors. Paper presented at the


**Hand, B., Norton-Meier, L.A., Gunel, M., & Akkus, R. (2016). Aligning teaching to learning:

A 3-year study examining the embedding of language and argumentation into elementary


96

science classrooms. International Journal of Science and Mathematics Education, 14(5),

847-863.

*Hand, B., Therrien, W., & Shelley, M. (2013, March). Examining the impact of using the

science writing heuristic approach in learning science: A cluster randomized study.

Paper presented at the Society for Research on Educational Effectiveness Spring

Conference, Washington, D.C.

*Harris, C. J., Penuel, W. R., D'Angelo, C. M., DeBarger, A. H., Gallagher, L. P., Kennedy, C.

A., ... & Krajcik, J. S. (2015). Impact of project‐based curriculum materials on student

learning in science: Results of a randomized controlled trial. Journal of Research in

Science Teaching, 52(10), 1362-1385.

*Harris, C. J., Penuel, W. R., DeBarger, A., D’Angelo, C., & Gallagher, L. P. (2014).

Curriculum Materials make a difference for next generation science learning: Results

from year 1 of a randomized control trial. Menlo Park, CA: SRI International.

* Heller, J. I. (2012). Effects of Making Sense of SCIENCE™ professional development on the

achievement of middle school students, including English language learners. (NCEE

2012-4002). Washington, DC: National Center for Education Evaluation and Regional

Assistance, Institute of Education Sciences, U.S. Department of Education.

* Heller, J. I., Curtis, D. A., Rabe-Hesketh, S., & Verboncoeur, C. (2007). The effects of Math

Pathways and Pitfalls on students’ mathematics achievement: National Science

Foundation final report. Washington, DC: U.S. Department of Education, National

Center for Education Statistics. http://eric. ed.gov/?id=ED498258

*Heller, J. I., Daehler, K. R., Wong, N., Shinohara, M., & Miratrix, L. W. (2012). Differential

effects of three professional development models on teacher knowledge and student


97

achievement in elementary science. Journal of Research in Science Teaching, 49(3), 333-

362.

*Heller, J., Hanson, T., Barnett-Clarke, C. (2010). The impact of Math Pathways & Pitfalls on

students’ mathematics achievement and mathematical language development: A study

conducted in schools with high concentrations of Latino/a student and English learners.

Oakland, CA: WestEd. Retrieved from https://mpp.wested.org/wp-

content/uploads/2013/05/mpp-ies-report.pdf

*Hinerman, K. M., Hull, D. M., Chen, Q., Booker, D. D., & Naslund-Hadley, E. I. (2014,

March). Teacher-led math inquiry in Belize: A cluster randomized trial. Paper presented

at the Society for Research on Educational Effectiveness Spring Conference,

Washington, DC.

*Jaciw, A. P., Hegseth, W., Ma, B., & Lai, G. (2012). Assessing impacts of Math in Focus, a

‘Singapore Math’ program for American schools: A report of findings from a randomized

control trial. Palo Alto, CA: Empirical Education.

**Jaciw, A. P., Hegseth, W., & Toby, M. (2015, March). Assessing impacts of" Math in Focus,"

a ‘Singapore Math’ program for American schools. Paper presented at the Society for


*Jacobs, V. R., Franke, M. L., Carpenter, T. P., Levi, L., & Battey, D. (2007). Professional

development focused on children's algebraic reasoning in elementary school. Journal for

Research in Mathematics Education, 38(3), 258-288.

*Jerrim, J., & Vignoles, A. (2015). The causal effect of East Asian “mastery” teaching methods

on English children’s mathematics skills. UCL Institute of Education, University College

London.


98

*Kaldon, C. R., & Zoblotsky, T. A. (2014, March). A randomized controlled trial validating the

impact of the LASER model of science education on student achievement and teacher

instruction. Paper presented at the Society for Research on Educational Effectiveness

Spring Conference, Washington, DC.

*Kim, K. H., VanTassel-Baska, J., Bracken, B. A., Feng, A., Stambaugh, T., & Bland, L. (2012).

Project Clarion: Three years of science instruction in Title I schools among K – third-

grade students. Research in Science Education, 42(83), 1-17.

*Kinzie, M. B., Whittaker, J. V., Williford, A. P., DeCoster, J., McGuire, P., Lee, Y., & Kilday,

C. R. (2014). MyTeachingPartner-Math/Science pre-kindergarten curricula and teacher

supports: Associations with children's mathematics and science learning. Early

Childhood Research Quarterly, 29(4), 586-599.

*Kisker, E. E., Lipka, J., Adams, B. L., Rickard, A., Andrew-Ihrke, D., Yanez, E. E., & Millard,

A. (2012). The potential of a culturally based supplemental mathematics curriculum to

improve the mathematics performance of Alaska Native and other students. Journal for

Research in Mathematics Education, 43(1), 75-113.

*Klein, A., Starkey, P., Clements, D., Sarama, J., & Iyer, R. (2008). Effects of a pre-kindergarten

mathematics intervention: A randomized experiment. Journal of Research on

Educational Effectiveness, 1(3), 155-178.

*Klein, A., Starkey, P., Deflorio, L., Brown, E.T., (2012, March). Establishing and sustaining an

effective pre-kindergarten math intervention at scale. Paper presented at the Society for

Research in Educational Effectiveness Spring Conference, Washington, DC.


99

*Lafferty, J. F. (1994). The links among mathematics text, students’ achievement, and students’

mathematics anxiety: A comparison of the incremental development and traditional texts.

Unpublished doctoral dissertation, Widener University, Wilmington, DE.

*Lanehart, R.E., Borman, K.M., Boydston, T. L., Cotner, B. A., & Lee, R. S. (2010, March).

Improving gender, racial, and social equity in elementary science instruction and student

achievement: The impact of a professional development program. Paper presented


*Lang, L. B., Schoen, R. R., LaVenia, M., & Oberlin, M. (2014). Mathematics Formative

Assessment system – Common Core State Standards: A randomized field trial in

kindergarten and first grade. Paper presented at the Society for Research on Educational

Effectiveness Spring Conference, Washington, DC.

*Lara‐Alecio, R., Tong, F., Irby, B. J., Guerrero, C., Huerta, M., & Fan, Y. (2012). The effect of

an instructional intervention on middle school English learners' science and English

reading achievement. Journal of Research in Science Teaching, 49(8), 987-1011.

*Lehrer, R. (2010). Assessing data modeling and statistical reasoning project: IES final report.

Washington, DC: Institute for Educational Sciences.

**Lewis, C.C., & Perry, R. (2014). Lesson study with mathematical resources: A sustainable

model for locally-led teacher professional learning. Mathematics Teacher Education and

Development, 16(1).

**Leuzzi, A. & Pennisi, A. (no date). Learning about the effectiveness of EU structural fund

projects in Italian schools: A large-scale RCT for math teachers. [Presentation slides].

Retrieved from http://www.esparama.lt/c/document_library/get_file?uuid=cbebbfe8-

e2a9-4486-8422-fbded985dfca&groupId=19002


100

*Lewis, C. C., & Perry, R. R. (2015). A randomized trial of lesson study with mathematical

resource kits: Analysis of impact on teachers’ beliefs and learning community. In

Middleton, J.A., Cai, J., & Hwang, S. (Eds). Large-scale studies in mathematics

education (133-158). New York, NY: Springer International Publishing.

*Llorente, C., Pasnik, S., Moorthy, S., Hupert, N., Rosenfeld, D., & Gerard, S. (2015, March).

Preschool teachers can use a PBS KIDS Transmedia Curriculum Supplement to support

young children's mathematics learning: Results of a randomized controlled trial. Paper

presented at the Society for Research on Educational Effectiveness Spring Conference,

Washinton, DC.

*Llosa, L., Lee, O., Jiang, F., Haas, A., O’Connor, C., Van Booven, C. D., & Kieffer, M. J.

(2016). Impact of a large-scale science intervention focused on English language

learners. American Educational Research Journal, 53(2), 395-424.

*Maerten-Rivera, J., Ahn, S., Lanier, K., Diaz, J., & Lee, O. (2016). Effect of a multiyear

intervention on science achievement of all students including English language learners.

The Elementary School Journal, 116(4), 600-624.

*Martin, T., Brasiel, S. J., Turner, H., & Wise, J. C. (2012). Effects of the Connected

Mathematics Project 2 (CMP2) on the mathematics achievement of Grade 6 students in

the Mid-Atlantic region: Final report. Washington, DC: National Center for Education


*McCoach, D. B., Gubbins, E. J., Foreman, J., Rubenstein, L. D., & Rambo-Hernandez, K. E.

(2014). Evaluating the efficacy of using pre-differentiated and enriched mathematics

curricula for grade 3 students: A multisite cluster-randomized trial. Gifted Child

Quarterly, 58(4), 272-286.


101

*Miller, G., Jaciw, A., Ma, B., & Wei, X. (2007). Comparative effectiveness of Scott Foresman

Science: A report of randomized experiments in five school districts. Palo Alto, CA:

Empirical Education.

*Montague, M., Krawec, J., Enders, C., & Dietz, S. (2014). The effects of cognitive strategy

instruction on math problem solving of middle-school students of varying ability. Journal

of Educational Psychology, 106(2), 469.

*Mutch-Jones, Puttick, & Demers (2014, April). Differentiating science instruction: Teacher and

student findings from the Accessing Science Ideas project. Poster presented at the

American Educational Research Association Annual Meeting, Philadelphia, PA.

* Newman,D., Finney, P.B., Bell, S., Turner, H., Jaciw, A.P., Zacamy, J.L., & Feagans Gould, L.

(2012). Evaluation of the Effectiveness of the Alabama Math, Science, and Technology

Initiative (AMSTI). (NCEE 2012–4008). Washington, DC: National Center for Education

Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of

Education. Retrieved from http://dx.doi.org/10.2139/ssrn.2511347.

*Oh, Y., Lachapelle, C. P., Shams, M. F., Hertel, J. D., & Cunningham, C. M. (2016, April).

Evaluating the efficacy of Engineering is Elementary for student learning of engineering

and science concepts. Paper presented at the American Educational Research Association

Annual Meeting, Washington, DC.

*Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam, R. (2014). Effectiveness of Cognitive

Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127-144.

*Penuel, W. R., Bates, L., Pasnik, S., Townsend, E., Gallagher, L. P., Llorente, C., & Hupert, N.

(2010, June). The impact of a media-rich science curriculum on low-income preschoolers'


102

science talk at home. In Proceedings of the 9th International Conference of the Learning

Sciences, (238-245). International Society of the Learning Sciences, Chicago, IL.

*Penuel, W. R., Gallagher, L. P., & Moorthy, S. (2011). Preparing teachers to design sequences

of instruction in earth systems science: A comparison of three professional development

programs. American Educational Research Journal, 48(4), 996-1025.

*Piasta, S. B., Logan, J. A., Pelatti, C. Y., Capps, J. L., & Petrill, S. A. (2015). Professional

development for early childhood educators: Efforts to improve math and science learning

opportunities in early childhood classrooms. Journal of Educational Psychology, 107(2),

407-422.

*Presser, A. L., Clements, M., Ginsburg, H., & Ertle, B. (2012). Effects of a preschool and

kindergarten mathematics curriculum: Big Math for Little Kids. New York, NY: Center

for Children and Technology, Retrieved from

http://cct.edc.org/sites/cct.edc.org/files/publications/BigMathPaper_Final.pdf

*Presser, A. L., Vahey, P., & Dominguez, X. (2015, March). Improving mathematics learning by

integrating curricular activities with innovative and developmentally appropriate digital

apps: Findings from the Next Generation Preschool Math Evaluation. Paper presented at


DC.

*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Driver, H. (2004). Implementation study of

Chemistry that Applies (2002-2003): SCALE-uP Report No. 2. George Washington

University, Washington, DC.

http://cct/


103

*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Watson, W. (2006). Implementation study of

Exploring Motion and Forces (2005-2006): SCALE-uP Report No. 13. George

Washington University, Washington, DC.

*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Watson, W. (2005). Implementation study of

The Real Reasons for Seasons (2004-2005): SCALE-uP Report No. 7. George

Washington University, Washington, DC.

*Reid, E. E., Chen, J. Q., & McCray, J. (March, 2014). Achieving high standards for pre-K –

Grade 3 mathematics: A whole teacher approach to professional development. Paper

presented at the Society for Research Educational Effectiveness Spring Conference,

Washington, DC.

*Resendez, M., and Azin, M., (2006). 2005 Prentice Hall Science Explorer randomized control

trial. Jackson, WY: PRES Associates, Inc.

*Resendez, M., and Azin, M. (2008). A study of the effects of Pearson’s 2009 enVision Math

Program. Jackson, WY: PRES Associates, Inc.

*Rethinam, V., Pyke, C., Lynch, S., Pyke, C., & Lynch, S. (2008). Using multilevel analyses to

study the effectiveness of science curriculum materials. Evaluation & Research in

Education, 21(1), 18-42.

*Rimbey, K. A. (2013). From the Common Core to the classroom: A professional development

efficacy study for the common core state standards for mathematics (Doctoral

dissertation). Retrieved from Pro Quest. (Order No. 3560379).

**Roschelle, J., Shechtman, N., Tatar, D., Hegedus, S., Hopkins, B., Empson, S., ... & Gallagher,

L. P. (2010). Integration of technology, curriculum, and professional development for


104

advancing middle school mathematics: Three large-scale studies. American Educational

Research Journal, 47(4), 833-878.

*Roth, K., Wilson, C., Taylor, J., Hvidsten, C., Stennett, B., Wickler, N., ... & Bintz, J. (2015,

March). Testing the consensus model of effective PD: Analysis of practice and the PD

research terrain. Paper presented at the International Conference of the National

Association of Science Teacher Researchers, Chicago, IL.

*San Antonio, D. M., Morales, N. S., & Moral, L. S. (2011). Module‐based professional

development for teachers: a cost‐effective Philippine experiment. Teacher development,

15(2), 157-169.

*Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as

a lever for change: An experimental study of the effects of a professional development

program on students’ mathematics learning. Journal of Research on Educational


*Sarama, J., Clements, D. H., Starkey, P., Klein, A., & Wakeley, A. (2008). Scaling up the

implementation of a pre-kindergarten mathematics curriculum: Teaching for

understanding with trajectories and technologies. Journal of Research on Educational


*Sarama, J., Lange, A., Clements, D.H., & Wolfe, C.B. (2012). The impacts of an early

mathematics curriculum on emerging literacy and language. Early Childhood Research

Quarterly, 27, 489-502. doi: 10.1016/j.ecresq.2011.12.002

*Saxe, G. B., & Gearhart, M. (2001). Enhancing students' understanding of mathematics: A

study of three contrasting approaches to professional support. Journal of Mathematics

Teacher Education, 4(1), 55-79.


105

*Schneider, S. (2013). Final Report for IES R305A70105: Algebra Interventions for Measured

Achievement – Full Year. San Francisco, CA: WestEd.

*Schneider, M. C., & Meyer, J. P. (2012). Investigating the efficacy of a professional

development program in formative classroom assessment in middle-school English

language arts and mathematics. Journal of Multidisciplinary Evaluation, 8(17), 1-24.

Schochet, P. Z. (2011). Do typical RCTs of education interventions have sufficient statistical

power for linking impacts on teacher practice and student achievement outcomes?

Journal of Educational and Behavioral Statistics, 36(4), 441-471.

*Schwartz‐Bloom, R. D., & Halpin, M. J. (2003). Integrating pharmacology topics in high

school biology and chemistry classes improves performance. Journal of Research in

Science Teaching, 40(9), 922-938.

*Shannon, L., & Grant, B. (2012). A final evaluation report of Houghton Mifflin Harcourt’s Holt

McDougal Biology. Charlottesville, VA: Magnolia Consulting, LLC.

*Sophian, C. (2004). Mathematics for the future: Developing a Head Start curriculum to support

mathematics learning. Early Childhood Research Quarterly, 19(1), 59-81.

*Star, J. R., Pollack, C., Durkin, K., Rittle-Johnson, B., Lynch, K., Newton, K., & Gogolen, C.

(2015). Learning from comparison in algebra. Contemporary Educational Psychology,

40, 41-54.

*Starkey, P., Klein, A., & , Deflorio, L. (2013). Changing the developmental trajectory in early

math through a two-year preschool math intervention. Paper presented at the Society for



106

Sussman, J., & Wilson, M. R. (2018). The Use and Validity of Standardized Achievement Tests

for Evaluating New Curricular Interventions in Mathematics and Science. American

Journal of Evaluation, 1098214018767313.

*Tatar, D., Roschelle, J., Knudsen, J., Shechtman, N., Kaput, J., & Hopkins, B. (2008). Scaling

up innovative technology-based mathematics. The Journal of the Learning Sciences,

17(2), 248-286.

*Tauer, S. (2002). How does the use of two different mathematics curricula affect student

achievement?: A comparison study in Derby, Kansas. Retrieved from

http://www.cpmponline.org/pdfs/CPMP_Achievement_Derby.pdf

*Taylor, J. A., Getty, S. R., Kowalski, S. M., Wilson, C. D., Carlson, J., & Van Scotter, P.

(2015). An efficacy trial of research-based curriculum materials with curriculum-based

professional development. American Educational Research Journal, 52(5), 984-1017.

*Thompson, D. R., Senk, S. L., & Yu, Y. (2012). An evaluation of the Third Edition of the

University of Chicago School Mathematics Project: Transition Mathematics. Chicago,

IL: University of Chicago School Mathematics Project.

*Vaden-Kiernan, M., Borman, G., Caverly,S., Bell, N., Ruiz de Castilla, V., Sullivan, K. &

Rodriguez, D. (March, 2016). Findings from a multi-year scale-up effectiveness trial of

Everyday Mathematics. Paper presented at the Society for Research in Educational

Effectiveness Spring Conference, Washington, DC.

*Van Egeren, L. A., Schwarz, C., Gerde, H., Morris, B., Pierce, S., Brophy-Herb, .., & Stoddard,

D. (2014, August). Cluster-randomized trial of the efficacy of early childhood science

education with low-income children: Years 1-3. Poster presented at the 2014 Discovery

Research K-12 PI Meeting, Arlington, VA.


107

*Walsh-Cavazos, S. (1994). A study of the effects of a mathematics staff development module on

teachers' and students' achievement. (Doctoral dissertation). Retrieved from ProQuest

(Order no. 9517241)


108

Online Appendix C:

Full Results of Sensitivity Checks

Table C1

Regression adjusted mean effect sizes based on unconditional and conditional RVE meta-

regression models: Differences by subject (math vs. science)

Subgroup analysis

(Unconditional RVE

model)

Conditional RVE model

𝑔𝑢𝑐 a, pucb 𝑔𝑐 c pc

d

Subject Matter: Focus of interventione

Science 0.208*** 0.225 0.848

Math (including Math and

Science)

0.201*** 0.216

Subject Area: Outcome measuref

Science 0.188*** 0.223 0.942

Math 0.201*** 0.226

Note: First column: Mean effect sizes in the second column are based on subgroup analyses

where unconditional RVE meta-regression models were estimated separately for studies that

focused on math/science, and for math/science outcomes. a 𝑔𝑢𝑐 = Estimated mean effect size from unconditional RVE meta-regression model. b puc = p-

value on constant from unconditional meta-regression model.

Second column: Mean effect sizes in the third column are based on the results of estimating RVE

meta-regression models including all variables listed in Table 3, or including an indicator for

math outcome (using the study-level mean) and including other variables listed in Table 3.

Regression-adjusted mean effect sizes were calculated using the overall average of the study-

level values of each included covariate. c 𝑔𝑐 = Estimated regression-adjusted mean effect size from conditional RVE meta-regression

model include covariates listed in Table 3. d pc = p-value on coefficient for program feature on

interest from meta-regression model including relevant set of controls. e Subject Matter: Focus of intervention refers to the focus of the intervention: whether the

intervention focused on math (including 3 studies which focus on both math and science). f Subject Area: Outcome measure refers to the subject area of the outcome measure. This

accounts for the fact that some studies focus on both math and science, and include both math

and science outcomes.

*p < .10, **p < .05, ***p < .01.


109

Table C2

Estimated mean effect sizes based on unconditional RVE meta-regression models: Differences by

grade level

Subgroup analysis

(Unconditional RVE model)

Grade level of participants

Preschool 0.390***

Kindergarten 0.133**

Early Elementary 0.200***

Upper Elementary 0.176***

Middle School 0.159***

High School 0.166**

Note: Estimated mean effect sizes from unconditional RVE meta-regression models. Mean effect

sizes are based on subgroup analyses where unconditional meta-regression analyses were

estimated using RVE separately for studies that included participants in each grade level. Studies

may be included in more than one row if studies included students across multiple grades.

*p < .10, **p < .05, ***p < .01.


110

Table C3

Sensitivity tests: Results of estimating an unconditional meta-regression model with robust

variance estimation (RVE) and imputing missing effect sizes

Imputing missing

effect sizes as zero

(g = 0.00)

Imputing missing

effect sizes as

negative

(g = -0.10)

Imputing missing

effect sizes as

negative

(g = -0.20)

Constant 0.197*** 0.193*** 0.190***

(0.024) (0.024) (0.024)


N studies 95 95 95

𝜌a 0.80 0.80 0.80

Note: a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


111

Table C4

Sensitivity tests: RVE results including professional development foci as moderators

Imputing missing


(g = 0.00)

Imputing missing

effect sizes as

negative

(g = -0.10)

Imputing missing

effect sizes as

negative

(g = -0.20)

Between-study effects: Professional development foci

Generic instructional strategies -0.078 -0.079 -0.080

(0.067) (0.067) (0.067)

How to use curriculum materials 0.118* 0.114* 0.110*

(0.062) (0.063) (0.064)

Integrate technology 0.135 0.134 0.132

(0.104) (0.105) (0.107)

Content-specific formative assessment 0.097 0.095 0.093

(0.062) (0.063) (0.064)

Improve pedagogical content knowledge/how students learn 0.094** 0.095** 0.096**

(0.042) (0.042) (0.043)


N studies 89 89 89

𝜌a 0.80 0.80 0.80

Between-study effects: Number of PD foci

Number of PD features 0.074*** 0.073*** 0.072***

(0.023) (0.024) (0.024)


N studies 89 89 89

𝜌a 0.80 0.80 0.80

Note: All models include only studies and/or treatment arms with a professional development component. All models include controls

for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject

matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


112

Table C5

Sensitivity tests: RVE results including professional development activities as moderators

Imputing missing


(g = 0.00)

Imputing missing

effect sizes as

negative

(g = -0.10)

Imputing missing

effect sizes as

negative

(g = -0.20)


Observed demonstration 0.033 0.033 0.033

(0.048) (0.048) (0.049)

Review generic student work -0.026 -0.029 -0.032

(0.070) (0.070) (0.069)

Solved problems/Worked through

student materials 0.076 0.074 0.072

(0.046) (0.047) (0.048)

Developed curriculum materials/lesson plans 0.046 0.046 0.047

(0.062) (0.062) (0.063)

Reviewed own student work -0.002 -0.007 -0.011

(0.070) (0.073) (0.076)


N studies 88 88 88

𝜌a 0.80 0.80 0.80

Between-study effects: Number of PD activities

Number of PD features 0.039* 0.038 0.036

(0.022) (0.023) (0.024)


N studies 89 89 89

𝜌a 0.80 0.80 0.80d

Note: All models include only studies and/or treatment arms with a professional development component and include controls for:

RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


113

Table C6

Sensitivity tests: RVE results including professional development formats as moderators

Imputing missing


(g = 0.00)

Imputing missing effect

sizes as negative

(g = -0.10)

Imputing missing

effect sizes as

negative

(g = -0.20)

Between-study effects: Professional development formats

Same-school collaboration 0.106 0.104 0.101

(0.066) (0.065) (0.064)

Implementation meetings 0.104** 0.102** 0.099**

(0.047) (0.047) (0.047)

Any online PD -0.137** -0.136** -0.135**

(0.054) (0.053) (0.053)

Summer workshop 0.072** 0.071* 0.071*

(0.038) (0.038) (0.038)

Expert coaching 0.054 0.057 0.060

(0.051) (0.051) (0.051)

PD leaders – researchers -0.039 -0.043 -0.048

(0.037) (0.037) (0.037)


N studies 86 86 86

𝜌a 0.80 0.80 0.80

Between-study effects: Number of PD formats

Number of PD features 0.020 0.019 0.018

(0.023) (0.023) (0.023)


N studies 89 89 89

𝜌a 0.80 0.80 0.80

Note: All models include only studies and/or treatment arms with a professional development component and include controls for:

RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies. *p < .10, **p < .05, ***p < .01


114

Table C7

Differences in regression-adjusted mean effect sizes by subject: Regression-adjusted mean effect

sizes

Imputing

missing

effect sizes as

zero

(g = 0.00)

Imputing

missing effect

sizes as

negative

(g = -0.10)

Imputing

missing

effect sizes as

negative

(g = -0.20)


Implementation guidance 0.075 0.080 0.086

(0.058) (0.058) (0.059)

Laboratory/hands-on experience, curriculum

kits -0.026 -0.027 -0.028

(0.053) (0.054) (0.054)

Curriculum dosage

(Number of hours) 0.000 0.000 0.000

(0.001) (0.001) (0.001)


N studies 77 77 77

𝜌a 0.80 0.80 0.80

Note: All models include only studies and/or treatment arms with a professional development

component. All models include controls for: RCT, state standardized test, other standardized test,

grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


115

Table C8

Sensitivity tests: Results of estimating an unconditional meta-regression model with robust

variance estimation (RVE)

Preferred

model

Excluding

non-US

studies

Excluding

non-RCT

studies

Alternative values of correlation

between pairs of observed effect sizes

within studies

Constant 0.209*** 0.223*** 0.208*** 0.209*** 0.209*** 0.209***

(0.025) (0.025) (0.025) (0.025) (0.025) (0.025)

N effect

sizes

258 248 239 258 258 258

N studies 95 89 86 95 95 95

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90

Note: a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


116

Table C9

Sensitivity tests: RVE results including professional development foci as moderators

Preferred

model

Excluding

non-US

studies

Excluding

non-RCT

studies

Alternative values of correlation between

effect sizes within studies


Generic instructional strategies -0.074 0.071 -0.079 -0.074 -0.074 -0.074

(0.072) (0.074) (0.077) (0.072) (0.072) (0.072)

How to use curriculum materials 0.119* 0.077 0.110 0.119* 0.119* 0.119*

(0.063) (0.057) (0.066) (0.063) (0.063) (0.063)

Integrate technology 0.139 0.203* 0.136 0.139 0.139 0.139

(0.102) (0.095) (0.104) (0.102) (0.102) (0.102)

Content-specific formative assessment 0.105 0.109 0.104 0.105 0.105 0.105

(0.062) (0.069) (0.067) (0.062) (0.062) (0.062)

Improve pedagogical content knowledge/how

students learn 0.096** 0.078* 0.101** 0.096** 0.096** 0.096**

(0.043) (0.039) (0.044) (0.043) (0.043) (0.043)

N effect sizes 237 227 222 237 237 237

N studies 89 83 82 89 89 89

𝜌a 237 0.80 0.80 0.50 0.50 0.50

Between-study effects: Number of PD foci

Number of PD features 0.077*** 0.069*** 0.075*** 0.077*** 0.077*** 0.077***

(0.024) (0.024) (0.025) (0.024) (0.024) (0.024)

N effect sizes 237 227 222 237 237 237

N studies 89 83 82 89 89 89

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90


for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


117

Table C10

Sensitivity tests: RVE results including professional development activities as moderators

Preferred

model

Excluding

non-US

studies

Excludin

g non-

RCT

studies


between effect sizes within

studies

Between-study effects: Professional development activities

Observed demonstration 0.025 0.004 0.017 0.025 0.025 0.025

(0.046) (0.048) (0.047) (0.046) (0.046) (0.046)

Review generic student work -0.014 -0.057 0.015 -0.014 -0.014 -0.014

(0.085) (0.081) (0.093) (0.085) (0.085) (0.085)

Solved problems/Worked through

student materials 0.070 0.098** 0.062 0.070 0.070 0.070

(0.047) (0.045) (0.048) (0.047) (0.047) (0.047)

Developed curriculum materials/lesson plans 0.036 -0.001 0.044 0.036 0.036 0.036

(0.063) (0.073) (0.065) (0.063) (0.063) (0.063)

Reviewed own student work 0.017 0.046 0.009 0.017 0.017 0.017

(0.071) (0.085) (0.073) (0.071) (0.071) (0.071)

N effect sizes 236 226 221 236 236 236

N studies 88 82 81 88 88 88

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90

Between-study effects: Number of PD activities

Number of PD features 0.046* 0.041 0.048* 0.046* 0.046* 0.046*

(0.024) (0.024) (0.024) (0.024) (0.024) (0.024)

N effect sizes 237 227 222 237 237 237

N studies 89 83 82 89 89 89

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90



*p < .10, **p < .05, ***p < .01.


118

Table C11

Sensitivity tests: RVE results including professional development formats as moderators

Preferred

model

Excluding

non-US

studies

Excluding

non-RCT

studies

Alternative values of correlation between

effect sizes within studies

Between-study effects: Professional development formats

Same-school collaboration 0.123* 0.152** 0.127* 0.123* 0.123* 0.122*

(0.067) (0.067) (0.068) (0.067) (0.067) (0.067)

Implementation meetings 0.117** 0.091** 0.109** 0.117** 0.117** 0.117**

(0.045) (0.0435) (0.046) (0.045) (0.045) (0.045)

Any online PD -0.153** -0.169** -0.160** -0.153** -0.153** -0.153**

(0.057) (0.063) (0.059) (0.057) (0.057) (0.057)

Summer workshop 0.074* 0.069* 0.077* 0.074* 0.074* 0.074*

(0.038) (0.038) (0.039) (0.038) (0.038) (0.038)

Expert coaching 0.053 0.016 0.051 0.053 0.053 0.053

(0.051) (0.049) (0.053) (0.051) (0.051) (0.051)

PD leaders – researchers -0.037 -0.025 -0.040 -0.037 -0.037 -0.037

(0.038) (0.034) (0.038) (0.038) (0.038) (0.038)

N effect sizes 231 221 216 231 231 231

N studies 86 80 79 86 86 86

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90

Between-study effects: Number of PD formats

Number of PD features 0.023 0.029 0.022 0.023 0.023 0.023

(0.024) (0.026) (0.024) (0.024) (0.024) (0.024)

N effect sizes 237 227 222 237 237 237

N studies 89 83 82 89 89 89

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90



*p < .10, **p < .05, ***p < .01.


119

Table C12

Differences in regression-adjusted mean effect sizes by subject: Regression-adjusted mean effect sizes

Preferred model Excluding

non-US

studies

Excluding non-

RCT studies


between effect sizes within studies


Implementation guidance 0.062 0.028 0.057 0.062 0.062 0.062

(0.056) (0.051) (0.060) (0.056) (0.056) (0.056)

Laboratory/hands-on experience, curriculum kits -0.026 -0.028 -0.026 -0.026 -0.026 -0.026

(0.054) (0.059) (0.056) (0.054) (0.054) (0.054)

Curriculum dosage

(Number of hours) 0.000 0.000 0.000 0.000 0.000 0.000

(0.001) (0.001) (0.001) (0.001) (0.001) (0.001)

N effect sizes 193 185 175 197 197 197

N studies 77 73 69 78 78 78

𝜌a 0.80 0.80 0.80 0.50 0.70 0.90

Note: All models include only studies and/or treatment arms with new curriculum materials. All models include controls for: RCT,

state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.

*p < .10, **p < .05, ***p < .01.


120

Table C13

Descriptions of studies published before and after 2004

Code Code presenta

Studies

published

pre-2004

(N = 9)

Studies

published

2004 or later

(N = 86)

p-value of

differencee

Effect Size (Hedges’ g) 0.360 0.224 0.023

Intervention Type

Professional development only 22% 22% 0.993

Professional development and curriculum materials 67% 76% 0.563

Curriculum materials only 22% 8% 0.173

Research Design and Sample Characteristics

RCT 56% 94% 0.001

Subject matter – Math 89% 62% 0.107

Preschool 11% 20% 0.533

Effect Size Type

Intervenor developed test 69% 51% 0.063

State standardized test 3% 19% 0.039

Other standardized test 28% 31% 0.743

Adjusted for covariates 10% 84% <0.001

PD Formatb

Same-school collaboration 67% 74% 0.620

Implementation meetings 33% 35% 0.927

Online professional development 0% 20% 0.144

Summer workshop 67% 53% 0.437

Expert coaching 0% 22% 0.117

PD lead by researchers/intervention developers 78% 65% 0.449

PD Timing and Durationc

Contact hours 56 hours 44 hours 0.371

Timespan over which professional development was

conducted:

Less than one week 25% 14% 0.419

One week 13% 3% 0.148

One month (8 days to 30 days) 0% 4% 0.578

One semester (31 days to 4 months) 13% 13% 0.980

One year 13% 53% 0.031

More than one year 38% 14% 0.090


121

Code presenta

Studies

published

pre-2004

(N = 9)

Studies

published

2004 or later

(N = 86)

p-value of

differencee

PD Focusb

Generic instructional strategies 22% 9% 0.234

How to use curriculum materials 78% 74% 0.828

Integrate technology 0% 12% 0.284

Content-specific formative assessment 22% 16% 0.654

Improve content knowledge/Pedagogical content

knowledge/How students learn 56% 55% 0.959

PD Activitiesb

Review sample student work 22% 15% 0.583

Observed demonstration 0% 38% 0.021

Solved problems/Worked through student materials 44% 42% 0.883

Developed curriculum/lesson plans 22% 19% 0.795

Reviewed own student work 0% 12% 0.281

Curriculum Materialsd

Laboratory/Hands-on experience or curriculum kits 22% 29% 0.669

Curriculum dosage (hours) 71.2 66.2 0.850

Curriculum proportion replaced (percent) 87% 91% 0.651

Implementation guidance 22% 45% 0.186

Note: N = 95 studies. a Figures in third column include percent of studies which feature the row code for binary

variables, or the sample average calculated at the study level for continuous variables (e.g.,

contact hours and curriculum dosage). For studies that had the feature present in one treatment

arm but not another treatment arm, the code is counted as present if it is present is any treatment

arm (e.g., a study with one treatment arm including only curriculum materials and a second

treatment arm including both professional development and curriculum materials would be

included in both rows). b Codes for PD focus, activities, and format were counted as “Not

Present” for studies that did not involve a PD component. c PD timing/duration excludes studies

without a PD component. d Codes for features of interventions involving new curriculum

materials were counted as “Not Present” for studies that did not involve new curriculum

materials. e p-value from t-tests. All t-tests were conducted at the study level, except for t-test for

variables in the Effect Size Type category which were conducted at the effect size level.


122

strengthening the research base that informs … meta...strengthening the research base that informs...

Documents