strengthening the research base that informs … meta...strengthening the research base that informs...
TRANSCRIPT
Strengthening the Research Base that
Informs STEM Instructional Improvement Efforts:
A Meta-Analysis
Kathleen Lynch
Brown University
Heather C. Hill, Kathryn Gonzalez, & Cynthia Pollard
Harvard Graduate School of Education
March 2019
Suggested Citation:
Lynch, K., Hill, H.C., Gonzalez, K, & Pollard, C. (2019). Strengthening the Research Base that
Informs STEM Instructional Improvement Efforts: A Meta-Analysis. Brown University Working
Paper.
Correspondence concerning this article can be addressed to Kathleen Lynch at
This material is based upon work supported by the National Science Foundation under Grant
#1348669. Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the National
Science Foundation. The authors would like to thank Patrick Aquino, Wilson Boardman,
Jonathan Hamel Sellman, Andrea Humez, Meghan Kelly, Jiwon Lee, Angie Luo, Manish Parmar,
and Melanie Rucinski for their assistance with data preparation and analysis.
INSTRUCTIONAL IMPROVEMENT IN STEM
1
Abstract
We present results from a meta-analysis of 95 experimental and quasi-experimental preK-12
science, technology, engineering, and mathematics (STEM) professional development and
curriculum programs, seeking to understand what content, activities and formats relate to
stronger student outcomes. Across rigorously conducted studies, we found an average weighted
impact estimate of +0.21 standard deviations. Programs saw stronger outcomes when they helped
teachers learn to use curriculum materials; focused on improving teachers' content knowledge,
pedagogical content knowledge and/or understanding of how students learn; incorporated
summer workshops; and included teacher meetings to troubleshoot and discuss classroom
implementation. We discuss implications for policy and practice.
Keywords: Professional development, curriculum, mathematics, science, STEM
INSTRUCTIONAL IMPROVEMENT IN STEM
2
Strengthening the Research Base that Informs STEM Instructional Improvement Efforts:
A Meta-Analysis
Instructional improvement efforts constitute a persistent feature of the preK-12 science,
technology, engineering, and mathematics (STEM) educational landscape, with a decades-long
trail of new curriculum materials and professional development programs aimed at changing how
teachers interact with students around content. Such initiatives bring with them significant costs;
scholars estimate that districts spend between 1 and 6% of their budgets on professional
development (see, e.g., Corcoran,1995; Miles, Odden, Fermanich & Archibald, 2004; Miller,
Lord & Dorney, 1994), and the market for instructional materials totaled $11.9 billion in 2013
(Cavanagh, 2015). Prior to 2002, scholars rarely rigorously evaluated such instructional
improvement programs, often using instead cross-sectional and/or self-report data to identify
“best practices” in professional development and curricular design (e.g., Becker & Park, 2011;
Desimone, 2009; Sparks, 2002). However, following calls in the early 2000s for stronger
research into the impact of educational interventions (Confrey & Stohl, 2004; Raudenbush, 2008;
Shavelson & Towne, 2001), federal research portfolios began to prioritize research methods that
allow causal inference, and to use student outcomes as the major indicator of program success.
Dollars’ and scholars’ turn toward using causal methods and student-level impacts has
resulted in a wealth of new studies. These new studies, we argue, permit rigorous empirical
analyses linking program characteristics to outcomes, a topic long of interest to practitioners who
make decisions regarding intervention design and/or adoption. In this paper, we present results
from a meta-analysis of preK-12 STEM curriculum materials and professional development
programs intended to improve instructional quality and student learning, seeking to understand
what content, activities, and formats are linked to stronger student outcomes. Our work differs
INSTRUCTIONAL IMPROVEMENT IN STEM
3
from other recent efforts in that it is a formal meta-analysis rather than a structured review (e.g.,
Kennedy, 2015; Gersten, 2014), and because the newly available studies allow us to compile a
dataset larger than past reviews, and thus to exclude studies with weaker designs.
We argue that this work is particularly timely. The Every Student Succeeds Act requires
that districts receiving Title I funds must adopt “evidence-based interventions,” including
programs and strategies proven to be effective in raising student achievement. However, recent
null findings from large-scale studies (e.g., Borman, Cotner, Lee, Boydston, & Lanehart, 2009;
Garet et al., 2011; Santagata, Kersting, Givven & Stigler, 2010), as well as an analysis of studies
curated by What Works Clearinghouse (WWC; Malouf & Taymans, 2016), has led many to
doubt the efficacy of instructional improvement programs. By contrast, we find an effect of
+0.21 standard deviations among the programs we study. We describe our study in more detail
below.
Background
STEM Instructional Improvement
Recent calls for reform in STEM education have focused on increasing both student
understanding of core disciplinary ideas and student engagement with key disciplinary practices
such as inquiry, argumentation and proof (NRC, 2011; NGSS 2013; CCSS-M, 2010). Yet
observational studies have shown that instruction in U.S. STEM classrooms tends to be lacking
in disciplinary concepts and low in student cognitive demand (e.g., Authors, in press; Banilower,
Smith, Weiss & Pasley, 2006; Kane, Kerr, & Pianta, 2014; Hiebert et al., 2005).
Perhaps as a result, programs aimed at improving STEM instructional quality abound.
These programs tend to involve two main strategies for improvement: teacher professional
development, which is typically intended to change some aspect of teachers’ instruction, and
INSTRUCTIONAL IMPROVEMENT IN STEM
4
new curriculum materials, which are typically intended to shape both instruction and the subject
matter content teachers teach. Professional development and curriculum programs can be
independent (e.g., professional development alone) or used in combination (e.g., when
curriculum materials’ implementation is supported by professional development). In contrast to
alternative reforms, such as standards, test-based accountability and market-based reforms,
which out broad principles and signals to which schools and teachers must interpret and react,
curriculum and professional development provide direct instructional guidance, attempting to
shape schools’ and teachers’ day-to-day interactions with students through lesson plans and
instructional strategies (Ball & Cohen, 1996). We discuss the specific theory of action for both
professional development and curriculum below.
Providers of STEM teacher professional development – often districts, but sometimes
non-profits, professional associations, or university-based faculty – typically design a set of
experiences intended to affect change in a variety of teacher- and classroom-level phenomena.
Most programs focus on instruction as a primary target for change, providing teachers new
routines, instructional strategies, and/or ways of teaching content. However, many of these
programs also hope to change teachers’ beliefs, knowledge of content, and knowledge of how
students learn content in order to support instructional improvement. For instance, Garet et al.
(Garet et al., 2010) describe a program in which teachers learn rational number concepts, learn
about persistent student misconceptions with this topic, then receive group and individual
coaching on how to apply the material learned in the content-focused sessions in their own
classrooms. Schneider and Meyer (2012) describe a program in which teachers learn about
assessments and formative assessment practices, then put them into practice. The STeLLA
program (Taylor et al., 2015) focuses mainly on improving teachers’ capacity to conduct
INSTRUCTIONAL IMPROVEMENT IN STEM
5
analyses of instruction, but also includes sessions on science content and how students learn
content. Such programs reflect the view that teachers’ instructional practice cannot change
without corresponding changes in teachers’ beliefs and knowledge (Smith & O’Day, 1990).
As these examples imply, the experiences designed by professional development
providers can vary substantially. As critiques of the “one-shot workshop” surfaced two decades
ago (Goldenberg & Gallimore, 1991; Kennedy, 1999), providers experimented with new
formats, including program delivery distributed across one or more school years, 1:1 coaching,
collaborative workgroups, and teacher learning in online settings. Recent reform movements
have also changed professional development foci, from programs primarily targeted toward
improving teacher content knowledge (Frechtling, Sharp, Carey & Vaden-Kiernan, 1995) to
more diverse topics, including improving teachers’ pedagogical content knowledge (Shulman,
1986), helping teachers develop and implement content-specific formative assessment, helping
teachers address the needs of English Learners, and helping teachers use technology more
effectively in STEM classrooms. With these new foci, professional development activities also
changed, with newer programs offering more study of student work, more focus on the
curriculum materials to be used in classrooms, and more focus on lesson analysis and planning.
The theory of action around curriculum materials also involves shaping what teachers do
in classrooms, but here the focus is on both introducing specific instructional moves – supported
by the activities and guidance in the text itself – as well as changing the content taught in
substantial ways. For instance, the elementary curriculum materials supported by the National
Science Foundation in the 1990s were designed to engage students in mathematical practices
such as problem-solving, mathematical reasoning, and precise communication, but also to
include topics that had not previously been taught in elementary schools, such as data, statistics,
INSTRUCTIONAL IMPROVEMENT IN STEM
6
and early algebra (Stein, Remillard & Smith, 2007). In science, numerous projects have built
units and curricula intended to replace conventional textbooks with more inquiry-oriented
instruction and content that is either more relevant to students’ lives or future careers (e.g., Marx
et al., 2004; Schwartz-Bloom & Halpin, 2003).
It is worth noting that many providers of curriculum materials also hold a theory of action
similar to that of professional developers – that teacher beliefs and knowledge must change in
order to support improvements in classroom practice. In some programs, providers intentionally
mix the new curriculum materials with intensive professional development. For instance, Saxe,
Gearhardt & Nasir (2001) paired a new fractions unit with a five-day summer workshop and 13
in-school sessions focused on improving teachers’ knowledge of fractions, teachers’ knowledge
of how students learn fractions, and teachers’ knowledge of student motivation. Roschelle
(Roschelle et al., 2010) and Hand (Hand et al., 2014) followed a similar approach. Many
curriculum providers also embed guidance for teachers in the written materials themselves. This
may include explanations of content intended for teachers – for instance, pages explaining the
conceptual ideas within the unit, how they connect to one another, how to represent those ideas,
alternative solution methods to problems, and even sample dialogue or scripts. Examples of such
curricula include Investigations in Number, Data, and Space in mathematics (Agodini et al.,
2013) and P-SELL (Llosa et al., 2016) in science.
In other cases, curriculum providers offer less professional development, often only a few
days intended to orient teachers to routines in the curriculum, adaptations for students with
specific needs, and technological enhancements to the curriculum (e.g., Pane, Griffin,
McCaffrey, & Karam, 2014); Resendez & Azin, 2006). A small number of curriculum materials
INSTRUCTIONAL IMPROVEMENT IN STEM
7
providers (e.g., Cervetti, Barber, Dorph, Pearson, & Goldschmidt, 2012) eschew professional
development entirely, often in hopes of replicating real-life conditions in schools.
Syntheses of STEM Instructional Improvement Programs
Several recent syntheses examine STEM teacher professional development and
curriculum improvement efforts. We review the methodology and findings from these syntheses,
then comment on their characteristics.
In the area of professional development (Blank, de las Alas, & Smith, 2008; Kennedy,
1999, 2015; Yoon, 2007; Scher & O’Reilly, 2009; Wilson, 2013), prior syntheses have generally
indicated positive impacts on learning. Meta-analyses suggest that effect sizes range between
+0.21 SD (Blank & de las Alas, 2010) and +0.54 SD (Yoon et al., 2007) on student outcomes,
with some suggestion that content-specific (math) rather than content-general (classroom
management) programs produce greater learning gains (Kennedy, 1999; Scher & O’Reilly,
2009). However, Scher and O’Reilly (2009) noted that the pool of STEM-focused articles and
reports published by 2004 could not support most expert recommendations regarding “best
practices” in professional development (see, e.g., Desimone, 2009). For example, although
experts have frequently expressed the opinion that professional development must be longer in
duration in order to be effective (e.g., Darling-Hammond & McLaughlin, 1995; Desimone,
2011), recent syntheses of the empirical literature have returned inconsistent findings on this
point (e.g., Blank & de las Alas, 2010). Kennedy (2016) found that more time-intensive
programs had weaker effects, whereas Yoon found that the three studies in his review that had
the least amount of professional development (5-14 hours) showed no statistically significant
impacts on student learning. However, Yoon’s review included only nine studies. Scher and
INSTRUCTIONAL IMPROVEMENT IN STEM
8
O’Reilly (2009) found that multi-year interventions were more effective than single-year
interventions in math, but not in science.
Two recent structured reviews of professional development, each conducted several years
after federal guidelines changed to prioritize causal research, provide another form of evidence
regarding instructional improvement programs. One review specific to mathematics (Gersten,
Taylor, Keys, Rolfhus, & Newman-Gonchar, 2014) used the stringent WWC evidence standards
to screen studies, and perhaps as a result returned too few (five) to discern patterns in program
impacts. Another (Kennedy, 2015) identified 28 studies, split equally between English Language
Arts, science, and mathematics. Kennedy found that the characteristics often cited as best
practices in professional development (content-focused, collective participation of all teachers
within a grade/school, duration) were less predictive of positive outcomes than other features,
including assisting teachers in gaining insight into their practice and helping teachers inject novel
ideas into their practice. The latter categories included several coaching and curriculum-based
programs. Finally, in science, Slavin and colleagues (Slavin, Lake, Hanley, & Thurston, 2014)
found that professional development programs supporting inquiry-based elementary science
raised student outcomes by an average of 0.36 SD. For secondary science, programs that
emphasized inquiry through teacher professional development saw an average effect size of
+0.24 SD.
In the area of curriculum materials, two research syntheses by Slavin and colleagues
(Slavin & Lake, 2008; Slavin, Lake, & Groff, 2009) using studies published between 1971 and
2008 found fairly small effects for mathematics curriculum materials, at +0.03 SD and +0.10 SD
for secondary and elementary curricula respectively. For science (Slavin, et al., 2014),
elementary programs with “kits” and accompanying professional development had a near-zero
INSTRUCTIONAL IMPROVEMENT IN STEM
9
(+0.03 SD) average impact on student outcomes. The average impact of kits in secondary science
proved similar (+0.04 SD), with the average impact of science textbooks not substantially larger
(+0.10 SD) (Cheung, Slavin, Kim, & Lake, 2017). A review of studies by the National Research
Council (2004) conducted in the early 2000s found stronger results for NSF-funded curricula
than conventional materials, in that 59% of comparisons between materials favored the NSF
materials. However, many regard vote counting as a weak and misleading synthesis procedure
(Hedges & Olkin, 1980), and Confrey (2006) herself noted that the studies included in the NRC
report were of varying quality, and that as study rigor increased, effect sizes diminished. Slavin
and colleagues, perhaps because they used more rigorous study selection criteria, did not find
any relationship between study design and effect size, although they note that the methodological
rigor of most included studies was still relatively weak, and “quality research is particularly
lacking in this area” (Slavin, 2008, p. 480).
Finally, one recent study (Taylor et al., 2018) examined interventions that offered
professional development, curriculum materials, and computer software intended to improve
student science outcomes. Across 96 such studies they found an average impact of 0.489 SD,
with programs evaluated by researcher-designed assessments posting significantly higher
impacts, similar to an earlier report by Hill, Bloom, Black & Lipsey (2008). Other study and
intervention characteristics did not significantly predict program outcomes.
Motivation for the Current Study
These research syntheses share several important characteristics. To start, nearly all noted
that the evidence base for making recommendations was thin at the time of the review. Using the
pool of articles and reports on professional development published by 2003, Yoon and
colleagues (2007) could find only nine ELA, math, and science studies that met WWC evidence
INSTRUCTIONAL IMPROVEMENT IN STEM
10
standards. Years later, Gersten et al. (2014) located only five mathematics professional
development studies that met the WWC’s exacting standards. Other evaluations (e.g., Scher &
O’Reilly, 2009; Slavin, Lake, Hanley, & Thurston, 2014; Cheung, Slavin, Kim & Lake 2017)
achieved a greater sample size by including quasi-experimental studies with matched comparison
groups (and sometimes retrospective matching, a disputed technique [Slavin, 2008]). Kennedy’s
cross-subject synthetic review (2016) with 28 studies was an exception, as many studies
described more recent strong quasi-experiments or actual randomized trials. In general, however,
the small sample sizes typical in these reviews limited both the generalizability of findings and
also the number of program features, or moderators, that could be coded and examined. We
argue that recent IES- and NSF-funded classroom-level experiments provide a thicker evidence
base, eliminating the need to include studies that conducted post-hoc matching and facilitating
more moderator analyses.
These reviews also share another important characteristic, in that many identified only a
small number of program features for study. This is most apparent in the area of professional
development, where research syntheses almost universally coded studies for duration, content-
specificity, and delivery format. For curriculum materials, most analyses categorized programs
by the degree of alignment with standards-based reform (e.g., Confrey & Stohl, 2004). With the
accumulation of more recent rigorous studies, however, have come opportunities to understand
how features of professional development or curriculum materials moderate effects on student
outcomes. For instance, professional development activities (watching video, designing lessons,
studying student work) may differentially impact program outcomes, as might curriculum design
features, such as length of implementation and degree of support for teachers. In addition, the
INSTRUCTIONAL IMPROVEMENT IN STEM
11
programs studied recently contain more varied delivery methods and features (e.g., coaching;
online learning components) than those of a decade ago.
Viewed from a contemporary perspective, several other issues emerge. First, existing
research syntheses of professional development studies tend to find that most returned positive
and significant effects. Yoon et al. (2007), for instance, found only one of twenty effects
identified across nine studies was negative, and only one was zero. Yet the more recent wave of
studies tends to find more mixed impacts (e.g., Garet et al., 2008; Santagata et al., 2010). This
suggests that the pool of studies included in earlier syntheses might have suffered from two
issues: the “file drawer” problem, in which studies with null results are not published (Slavin,
2008); or the “boutique” problem, in which only highly promising – and often unusual –
programs are studied (Author, 2004). We believe that both problems may have been ameliorated
in recent years. IES and NSF funding guidelines have explicitly encouraged the evaluation of
programs that have wide reach, meaning those programs may be more typical of the offerings
available to teachers. Further, better reporting practices – e.g., final reports posted on websites or
short reports archived on conference websites – may have rescued studies from file drawers.
Second, the practice of reviewing professional development and curriculum studies
separately creates conceptual and practical difficulties. On the conceptual side, most curriculum
programs also include a professional development component for teachers; likewise, some
professional development programs offer teachers materials to support the implementation of
new practices within classrooms. On the practical side, the combination of professional
development and materials together may be especially effective, as compared to either one alone
or one with a minimal dose of the other (Authors, 2001). Studying both within one review may
INSTRUCTIONAL IMPROVEMENT IN STEM
12
enhance our understanding of how these instructional improvement efforts can complement one
another.
Methods
Search Procedures
We searched for studies in three phases. We began by scanning the reference lists of prior
reviews of math and science professional development and curriculum improvement programs
(Blank, de las Alas, & Smith, 2008; Cheung, Slavin, Kim, & Lake, 2017; Furtak, Seidel, Iverson,
& Briggs, 2012; Gersten et al., 2014; Kennedy, 1998, 2015; Scher & O’Reilly, 2009; Slavin,
Lake, & Groff, 2009; Slavin & Lake, 2008; Slavin, Lake, Hanley, & Thurston, 2014; Timperley,
Wilson, Barrar, & Fung, 2007; Wilson, 2013; Yoon, 2007; Zaslow et al., 2010) for the years
1989 through 2004. Due to resource constraints, we did not conduct additional searches for
materials dated prior to 2004. We argue that this is reasonable given that prior reviewers have
previously searched the published and grey literature from the pre-2004 period extensively1;
given that rigorous studies of teacher PD and curriculum improvement were relatively rare prior
to the early 2000s (e.g., Kennedy, 2016); we considered the likelihood relatively low that we
would have uncovered previously unknown randomized trials or sufficiently well-designed
quasi-experiments from the pre-2004 period.2 See Appendix Table C13 for a descriptive
comparison of the studies published pre-2004 versus studies published 2004 or later. As
illustrated in the table, descriptively, studies from the earlier period had larger effect sizes on
average; they were also less likely to be RCTs, and less likely to use a state standardized test as
an outcome variable. As we discuss below, we control for these methodological variables in all
of our primary models.
INSTRUCTIONAL IMPROVEMENT IN STEM
13
In the second search phase, we conducted an electronic search using the databases
Academic Search Premier, ERIC, Ed Abstracts, PsycINFO, EconLit, and ProQuest Dissertations
& Theses, for the period January 2004 - March 2016. Searches were conducted using subject-
related keywords adapted from Yoon (2009), and methodology-related keywords designed to
capture experimental and quasi-experimental methods adapted from Kim & Quinn (2012).3 We
also searched the websites of Regional Education Labs, What Works Clearinghouse, the World
Bank, Inter-American Development Bank, Empirical Education, Mathematica, MDRC, and AIR
for relevant materials. We also searched the abstracts of the Society for Research on Educational
Effectiveness (SREE) conference. We ceased our materials search in March 2016.
In the third search phase, we downloaded, from the NSF Community for Advancing
Discovery Research in Education (CADRE) and IES websites, a list of all STEM award grantees
from the years 2002 to 2012. We then conducted electronic database and Web searches to find
all studies published from each award. In the case of 29 awards, we could find no publicly
available reports or information that included student outcomes, and we attempted to contact
project PIs via email to obtain study results. Of these 29 grant awards, we were sent or later
located reports from 17. Of these, the reports from 16 studies were excluded according to the
criteria above, and one was included in our analyses.
The search procedures described above netted a total of 8,099 records identified through
database screening, and an additional 1,391 records identified through other sources (see Figure
1 for screening flow chart). After removing duplicates, we were left with 7,926 records.
Screening Procedures
Screening proceeded in two phases. First two raters screened each of the studies' titles
and abstracts to identify potentially relevant studies, passing studies into the second phase when
INSTRUCTIONAL IMPROVEMENT IN STEM
14
they covered grades preK-12, included student outcomes, included quantitative data, and focused
on math and science-specific content and/or instructional strategies. All studies flagged as
potentially relevant by either rater were then reviewed by one of the authors, who made a final
determination about moving the study forward. A total of 656 studies met the initial relevance
criteria and were advanced to full-text screening.
In the second screening phase, two authors examined the full text of each study and
applied more detailed content and methodological criteria. In order to qualify for inclusion in the
final dataset, we required that studies use a randomized or quasi-experimental research design
with a comparison group. Following Slavin and Lake (2008), we included studies that assembled
comparison groups via prospective, but not post-hoc, matching. Slavin and Lake (2008) have
argued that “Prospective studies, in which experimental and control groups were designated in
advance and outcomes are likely to be reported whatever they turn out to be, are always to be
preferred to post hoc studies, other factors being equal.” According to Slavin and Lake (2008),
post-hoc designs have several characteristic problems. First, when researchers attempt to
examine the efficacy of a program by simply comparing the test scores of schools or classrooms
who completed the program versus those that did not, only “survivors” are included in the
treatment group. This can lead to upwardly biased estimates of the treatment impact, if more
schools or classrooms began the treatment but dropped it, potentially because it was not working.
Second, when researchers construct post-hoc comparisons of treated and matched comparison
schools, they generally have many potential ‘comparison’ schools to choose from; this raises the
concern that researchers may inadvertantly (or even intentionally) choose ‘comparison’ schools
that have made less academic progress over the study period than the treatment schools. Third,
post-hoc matching studies are often commissioned by textbook publishers and curriculum
INSTRUCTIONAL IMPROVEMENT IN STEM
15
developers because they are low-cost and easy to conduct. For these same reasons, study authors
may easily abandon them if they do not show positive results, and thus the retrievable studies
may be especially skewed toward positive findings (Slavin & Lake, 2008). To classify a study as
prospective matching, we required that the study authors explicitly report that they matched
treatment and comparison groups on pretest data prior to the intervention’s commencement.
Studies that explicitly stated that matching was done retrospectively, or that were silent on this
point, were excluded. We also excluded studies in which treatments had been in place prior to
pretesting. We also required that studies present pretest means, and that pretest differences
between groups be less than 50% of a standard deviation. Despite this relatively liberal threshold
for allowable pretest differences, these data requirements resulted in the inclusion of only nine
quasi-experiments.
Studies had to be published in 1989 or later, be written in English, and have participating
students in grades preK-12. The program had to focus on classroom-level STEM instructional
improvement through professional development, curriculum materials, or both. We excluded
programs with no instructional improvement component, such as after-school peer tutoring or
computerized at-home skills practice, as well as studies that examined the impact of variables,
rather than programs.4 Studies had to include sufficient data to calculate an effect size for student
outcomes. The most common reasons for exclusion were for characteristics of the intervention
(e.g., off-topic; not a classroom-level intervention (N = 561), methodology issues (e.g., no
control group; post-hoc design) (N = 310); and sample issues (e.g., did not have at least two
teachers and fifteen students in each condition) (N = 45) (note that some studies had multiple
reasons for exclusion) (see Figure 1).
INSTRUCTIONAL IMPROVEMENT IN STEM
16
Ninety-five studies met the review inclusion criteria and advanced to the study coding
phase. In cases where we encountered multiple versions of the same study, we used all available
reports to glean information about the study, and used the most recent version (often the peer-
reviewed publication version) for impact estimates. Of the studies that met the inclusion criteria,
44 percent were from peer-reviewed journal articles, 23 percent were conference papers or
presentations, 18 percent were technical reports, 5 percent were district, state, or federal
government reports, 4 percent were doctoral dissertations, and 5 percent were from other
sources. Many of these studies reported multiple effect sizes due to the inclusion of multiple
outcome measures, multiple versions of the same program with a common control group,
multiple samples, or multiple programs.
The final analysis sample includes 258 effect sizes nested within these 95 studies. This
includes a separate effect size for each treatment contrast, each assessment of math or science
achievement, and each sample of students and teachers reported by the study.5 To account for the
nested nature of our data, our analysis used the robust variance estimation (RVE) approach
(Tanner-Smith & Tipton, 2013), as discussed below.
Study Coding
To develop content codes, we began by reviewing prior meta-analyses and reviews of
instructional improvement interventions, naming both broad categories (professional
development; curriculum; technology; subject matter) and specific codes (teachers studied
curriculum materials; teachers solved math problems) that commonly appeared in the literature.
We then used this skeleton structure to code a sample of studies, modifying existing codes to be
more clear and adding codes as needed to cover additional program features. We also added a set
of codes that captured study size, design, and quality. Finally, our technical advisory board,
INSTRUCTIONAL IMPROVEMENT IN STEM
17
comprised of university faculty and other experts in teacher professional development and
curriculum materials, reviewed our codes, making suggestions for amending and adding as
necessary.6
For programs involving professional development, four primary coding categories
emerged. A first pair of codes captured professional development duration, indexed as the
number of contact hours teachers spent in professional development experiences (both for
professional development and curriculum materials programs) and as the timespan over which
those hours were spread. A second set of codes recorded the main focus (or foci) of the
professional development, including emphases on improving teacher content knowledge,
exploring how students learn, integrating technology into the classroom, and learning how to use
curriculum materials. Third, we coded for specific activities that teachers engaged in during the
professional development, such as observing a demonstration of instruction or working through
student curriculum materials. A final set of codes captured the format of the professional
development, for instance whether it was delivered during a summer workshop, contained
coaching, or involved online learning. Within each of the three substantive coding categories, we
also created an indicator of the number of codes applied, hypothesizing that professional
development programs with multiple foci, activities or formats (e.g., a focus on both how
students learn and how to use curriculum materials) might be more effective compared to more
narrowly-focused programs.
For programs that involved new curriculum materials, we coded for whether the
curriculum provided implementation guidance (e.g., text describing possible student-teacher
dialogues around content) or supplied teachers with kits for implementation. We also recorded
INSTRUCTIONAL IMPROVEMENT IN STEM
18
the proportion of a lesson the curriculum was intended to replace (i.e., one to 100 percent), and
the total number of minutes the curriculum was intended for use in a school year.
We also designed codes to capture the rigor of the design as well as aspects of its
implementation. Specifically, to record potential selection bias of participants into groups, we
coded whether treatment and control groups were formed via random assignment or by
prospectively matching individuals/classrooms/schools on baseline data. To explore potential
impacts of general and differential attrition bias, we included two measures of potential effect
size-level attrition problems: (1) author-reported attrition of more than 20% at the student or
cluster level, and (2) differential attrition between treatment and control groups of more than
10%. Finally, we coded whether authors simply reported a given outcome as “non-significant” in
order to capture the potential for bias due to selective reporting.
We note we were unable to capture other potential sources of bias such as performance
and detection bias, or biases stemming from participants’ and researchers’ knowledge of whether
participants were in the treatment or control group. Given that it is evident to participants and
researchers alike whether teachers are engaged in new professional development and curriculum
programs, in most cases it is not possible to blind participants and researchers to condition.
We also coded for a variety of other study descriptors, such as publication type (peer-
reviewed vs. not), and whether the study took place in the U.S. or abroad. We examine whether
any of these study descriptors relate to effect sizes. Based on evidence that effect sizes vary by
type of test (Hill et al., 2008; Taylor et al., 2018), we also coded for the nature of the student
assessment, including state or district standardized tests (e.g., the Texas Essential Knowledge
and Skills assessment), standardized tests available through commercial vendors (e.g.,
Woodcock-Johnson), and researcher-designed assessments. We expected the last to be more
INSTRUCTIONAL IMPROVEMENT IN STEM
19
sensitive to the content of instructional improvement programs, and thus potentially provide
larger effect sizes.
Coding of full-text studies was conducted by the study authors, along with two trained
research assistants. Before beginning operational coding, we went through a process of
establishing inter-rater reliability. Each week, each member of the team coded two studies, then
met to reconcile disagreements and refine the code descriptions. This process continued until
coders reached 80% agreement (four weeks for low-inference codes such as bibliographic
information, five weeks for higher-inference codes surrounding program characteristics). Once
we achieved a stable set of codes and the 80% threshold, each study was coded by two
researchers, including at least one study author, working as a pair. Each researcher coded studies
independently, and met to reconcile discrepant records. All disagreements were resolved through
discussion.
Effect Size Calculation
We calculated standardized mean difference effect sizes using Hedges’ g:
𝑔 = 𝐽 ∗ (𝑌𝐸 − 𝑌𝐶 )/S∗
In this formula, 𝑌𝐸 , represents the average treatment group outcome, 𝑌𝐶 , represents the average
control group outcome, and S∗ represents the pooled within-group standard deviation. J is a
correction factor that adjusts the standardized mean difference to avoid bias in small samples:
𝐽 = 1 −3
4 ∗ (𝑁𝐸 + 𝑁𝐶 − 2) − 1)
In this formula, 𝑁𝐸 represents the number of students in the treatment group and 𝑁𝐶 represents
the number of students in the control group. Effect sizes were calculated using the software
package Comprehensive Meta Analysis for the majority of cases. We used the following decision
rules to calculate effect sizes: If the authors reported a standardized mean difference effect size,
INSTRUCTIONAL IMPROVEMENT IN STEM
20
such as Cohen’s d or Glass’s delta, we converted author-reported effect sizes to Hedges’s g (52
percent of effect sizes).7 If authors did not report a standardized mean difference effect size but
did report a covariate-adjusted unstandardized mean difference (e.g., a coefficient from a
multilevel model) and raw standard deviations, we calculated a standard mean difference effect
size and converted to Hedges’s g (15 percent). If covariate-adjusted mean differences were not
reported, we calculated effect sizes based on raw posttest means and standard deviations (22
percent). If neither standardized mean differences nor raw means and standard deviations were
reported, effect sizes were calculated based on the coefficients and standard errors of multilevel
or regression models (4 percent).8 In the remaining cases, effect sizes were calculated from other
results (e.g., studies that reported the results of ANOVAs; 7 percent).
We applied corrections to study-reported standard errors when necessary to account for
the clustering of students within classrooms or schools (Littell, Corcoran, & Pillai, 2008).9 For
the five studies that did not report the number of clusters, we followed Scher and O’Reilly’s
(2009) procedure for imputing that number based on available information, typically estimating
the total number of classrooms from the total number of students. When intraclass correlations
could not be inferred from the study report, we also followed Scher and O’Reilly (2009) in
adopting the WWC recommended default intraclass correlations of 0.20.
We had 29 outcomes where the study did not provide enough information to calculate an
effect size.10 All came from studies that did report an effect size for at least one additional
achievement outcome, and so no studies were dropped from the analysis due to missing
outcomes. Furthermore, many of these missing outcomes were subscales, meaning they were a
smaller selection of items already represented in the reported effects. Therefore, we ignore these
missing outcomes in the main analyses. However, we also examine the sensitivity of our results
INSTRUCTIONAL IMPROVEMENT IN STEM
21
to the inclusion of these outcomes by imputing a range of values for these missing effect sizes
and re-estimating our primary models.11
Model Selection
A common problem in meta-analyses occurs when single studies yield multiple effect
sizes, as authors often measure more than one outcome. These nested effect sizes are likely to be
correlated, violating the assumption of statistical independence. To address this issue, prior meta-
analyses in this area have averaged effect sizes (Kennedy, 2016; Scher & O’Reilly, 2009; Slavin,
Lake, Hanley & Thurston, 2014; Yoon et al., 2007). Here, we argue that a robust variance
estimation (RVE) approach developed by Tanner-Smith and Tipton (2014) more properly
models our data. This approach adjusts standard errors to account for the dependencies between
effect sizes, and is analogous to methods for adjusting standard errors in OLS regression models
for heteroskedasticity (e.g., using Huber-White standard errors) or to account for the nesting of
data within clusters (e.g., clustered standard errors). Importantly, this approach allows for the
inclusion of multiple effect sizes from the same study within the meta analysis, and therefore
avoids the loss of information that would occur by either dropping effect sizes or including a
single average effect size for each study (for a more detailed discussion, see Tanner-Smith &
Tipton, 2014). RVE models can also control for methodological features known to affect
outcomes (e.g., Hill et al. [2008]; Taylor et al., 2018). This approach has been used in multiple
recent meta-analyses where individual studies report multiple effect sizes (e.g., Clark, Tanner-
Smith, & Killingsworth, 2016; Dietrichson, Bøg, Filges, & Jørgensen, 2017; Gardella, Fisher, &
Teurbe-Tolon, 2017).
The RVE approach is designed to account for two types of dependencies among effect
sizes. The first type of dependency is correlated effects, which arises when a single study
INSTRUCTIONAL IMPROVEMENT IN STEM
22
provides multiple effect sizes estimates for the same underlying construct or of correlated
underlying measures, or when the same control group is used for multiple treatment contrasts.
The second type of dependency is hierarchical effects, which arises when multiple treatment-
control contrasts are nested within a larger cluster of experiments (e.g., a research group
conducts multiple evaluations of the same program). When both of these dependencies are
present, Tanner-Smith and Tipton (2014) recommend selecting a method based on the more
frequent type of dependence. Because correlated effects predominated in our data, we used the
recommended inverse variance weights recommended by Tanner-Smith and Tipton (2014).
The weight for effect size i in study j is calculated by the following:
𝑤𝑖𝑗 = 1
{(𝑣∗𝑗 + 𝜏2)[1 + (𝑘𝑗 − 1)𝜌]}
Where 𝑣∗𝑗 is the mean of within-study sampling variances (𝑆𝐸𝑖𝑗2 ), 𝜏2 is the estimate of the
between-studies variance component, 𝑘𝑗 is the number of effect sizes within each study, and 𝜌 is
the assumed correlation between all pairs of effect sizes within each study. Therefore, effect
sizes from studies with more effect sizes and higher sampling variances are given lower weight.
It is assumed that 𝜌 is constant across studies, and we use the recommended default value of 𝜌 =
0.80 (Tanner-Smith & Tipton, 2014). However, simulation studies suggest that results using the
RVE approach are not sensitive to values of 𝜌 (e.g., Tanner-Smith & Tipton, 2014; Tanner-
Smith et al., 2013; Wilson et al., 2011). We also conduct a series of sensitivity checks to
determine whether our results are sensitive to alternative values of 𝜌. We use the robumeta
package in Stata 15 (developed by Tanner-Smith & Tipton, 2013) to estimate our RVE models,
and incorporate the small-sample correction proposed by the developers of the RVE approach
(Tanner-Smith & Tipton, 2014; Tipton & Pustejovsky, 2015). We also report the results of F-
tests to test the joint significance of the program features included in our RVE models. These F-
INSTRUCTIONAL IMPROVEMENT IN STEM
23
tests was conducted using the robumeta and clubSandwich packages in R (Fisher & Tipton,
2014; Tipton & Pustejovsky, 2015).
As noted above, we grouped potential moderators of program impact into five categories
indicating the time/duration, focus, activities, and format of professional development, and the
characteristics of new curriculum materials. To examine whether specific features in each
category moderated program impact, we fit four sets of conditional meta-regression models with
robust variance estimation (RVE), including the coded features as moderators and treating these
moderators as fixed. Within each category, we first modeled the effect of each code separately.
To further understand their joint relationships, we then fit a model entering all codes together.
All models controlled for whether the study was a randomized controlled trial, whether the effect
size estimate controlled for covariates, the type of student assessment used, whether the program
focused on mathematics (rather than science), and an indicator variable for whether the study
was conducted at the preschool level.
In some cases, there is within-study variability in program features (e.g., among studies
with multiple treatment arms). Following the recommendation of Tanner-Smith & Tipton (2014),
we include the study-level mean value of each covariate and moderator. For the two covariates
where there is within-study variability in at least 10 percent of studies (state standardized test,
other standardized test), we also include a within-study version of the covariate that is calculated
by subtracting the study-level mean values from the original covariate values. The full dataset is
available from the authors on request.
Finally, we note that the RVE method addresses heterogeneity of effect sizes differently
compared to traditional meta-analyses. As described by the RVE developers, the primary aim of
this method is to estimate fixed effects, such as meta-regression coefficients, rather than to model
INSTRUCTIONAL IMPROVEMENT IN STEM
24
variation in effect sizes. As a result, tests for heterogeneity used in traditional meta-analysis are
not available with the RVE approach (Tanner-Smith & Tipton, 2014; Tanner-Smith, Tipton &
Polanin, 2016). However, we do report the method of moments estimate of 𝜏2 for each of our
primary models as measures of between-study heterogeneity in effect sizes.
Results
Descriptives and Overall Average Impacts
Table 1 provides descriptive statistics regarding the study designs and programs included
in our dataset. Of the included studies, 22% focused on professional development alone (without
the introduction of new curriculum materials), 9% focused on curriculum materials alone
(without the provision of professional development) and 75% included both new curriculum
materials and professional development. Most included studies had randomized designs; only
nine quasi-experimental studies met the criteria for inclusion. Roughly two-thirds of studies
focused on mathematics rather than science. Table 1 also shows study-level frequencies for the
format, timing and duration, foci, and activities of programs with any professional development
component, and for characteristics of curriculum materials programs.12 Because we had a
relatively small sample size, when two variables correlated at the 0.40 level or above, we
combined them into a single predictor indicating whether the program had either feature. This
occurred in two cases (a focus on content knowledge/pedagogical content knowledge and
knowledge of how students learn; and, under activities, solving problems and working through
student materials).
Across all included studies, we found an average weighted impact estimate of +0.21
standard deviations (see Table 2). To contextualize the magnitude of this effect, a typical
treatment group student would be expected to rank about 8 percentile points higher than a typical
INSTRUCTIONAL IMPROVEMENT IN STEM
25
control group student (Lipsey et al., 2012). We found a method-of-moments estimate of the
between-study variance in study average effect sizes of 0.037, with a between-study standard
deviation of 0.191, indicating that most studies had small to moderate positive impacts. Of the
258 effect sizes included in the 95 identified studies, 214 effect sizes (83 percent) were positive
in sign and 88 effect sizes (34 percent) were statistically significant. Only 40 effect sizes were
negative in sign (16 percent) and only 4 were negative and statistically significant (2 percent).
Four effect sizes had point estimates of zero (2 percent). The unweighted average effect size for
math outcomes was +0.27 SD and the average effect size for science outcomes was +0.18 SD.
However, the results of conditional meta-regression model estimated using RVE indicates no
statistically significant difference in mean effect sizes based on whether programs focused on
math or science (see Table 3). We therefore include both math and science outcomes in all
analyses.
As shown in Table 3, the type of assessment employed was a strong predictor of effect
size magnitude, with impacts for both standardized (e.g., Woodcock-Johnson) and state
standardized tests lower by about 0.27 SD (p<0.01) relative to impacts for researcher-designed
assessments. Table 3 also shows that there were not statistically significant differences in effect
sizes based on study design (i.e., whether the study was an RCT), whether study authors adjusted
for covariates such as student demographic indicators, whether the study sample included
preschool students, and subject matter.
Table 4 shows that the average effect size for programs that incorporated both
professional development and new curriculum materials was approximately 0.10 SD (p<0.05)
larger than the average ES for programs that included only professional development or only
new curriculum materials.
INSTRUCTIONAL IMPROVEMENT IN STEM
26
Features that moderate program impacts
Next, we turn to features that may moderate impacts on student outcomes, beginning with
the professional development models. We did not find a significant relationship between
professional development contact hours and student outcomes (see Table 5). We first examined
whether there was a linear association between contact hours and effect size (column 1). Next, to
look for non-linearities in the contact hours data, the model presented in column 2 compares
programs with few hours (i.e., below the 25th percentile; omitted) and those with a larger number
of contact hours (i.e., 25th to 50th percentile; 50th to 75th percentile; and 75th percentile and
above). We did not find evidence of a significant association in either model. To investigate a
threshold effect, the model presented in column 3 examines programs with 16 hours or more and
found no relationship; this finding held at various threshold levels (not shown). In a separate
analysis, we did not find a significant relationship between the timespan (e.g., one month, one
year) over which professional development activities were conducted and effect sizes (models
not shown).
We next examined the associations between the focus of the professional development
and effect sizes via a series of multilevel regression models (Table 6). Average effect sizes were
larger when programs focused on how to use curriculum materials (+0.12 SD, p<0.10) and when
they focused on improving teachers' content and pedagogical content knowledge and/or how
students learned the content (+0.09 SD, p<0.05). Both of these associations remained significant
and similar in size when all predictors were included in the model. Average effect sizes were
also larger when the program focused on formative assessment (+0.13 SD, p<0.10), although this
did not retain its size or significance in the final model. A focus on content-generic instructional
strategies was not a significant predictor of the effect size magnitude. Finally, we also find that
INSTRUCTIONAL IMPROVEMENT IN STEM
27
programs that included more of the foci listed in Table 6 had larger effect sizes, on average,
compared to programs focused on a more narrow set of topics (+0.08 SD, p<0.01).
Next, we turned to professional development activities (Table 7). No activities for which
we coded – observing demonstrations, reviewing generic student work (problems or
investigations completed by students outside the teachers’ classes), solving math or science
problems/working through student materials, developing curriculum or lesson plans, and
reviewing teachers’ own students’ work – were significant predictors of effect size magnitude.
However, we find that the number of PD activities incorporated in the program, defined as the
total number of the five activities listed in Table 7, significantly predicted effect size magnitude
(+0.05 SD, p<0.10).
Table 8 examines the relationship between effect sizes and professional development
formats. On average, PD programs that had teachers participate alongside other teachers in their
school, which we refer to as same-school collaboration, yielded outcomes 0.12 SD larger
(p<0.10) than programs without such collaboration. This result is significant in the final model
only. In addition, programs that included PD with implementation meetings yielded significantly
larger average effect sizes on average than without these meetings (+0.12 SD in the final model,
p<0.05). These implementation meetings typically allowed teachers to convene briefly with other
activity participants to troubleshoot and discuss obstacles and aids to putting the program into
practice. Meanwhile, programs that included professional development that incorporated an
online component yielded significantly smaller impacts on average relative to programs that did
not involve any online components (-0.15 SD in the final model, p<0.05). Additionally, in the
final model, programs that included a summer workshop had larger effect sizes, on average, as
compared with those that lacked this component (+0.07 SD, p<0.10). The remaining formats
INSTRUCTIONAL IMPROVEMENT IN STEM
28
examined – whether the program featured coaching from experts or professional development
led by researchers – were not significant predictors of effect size magnitude. The number of PD
features was also not a significant predictor of effect size magnitude.
Table 9 shows the associations between features of new curriculum materials and effect
sizes. None of the features examined were significantly associated with average effect sizes.
It is important to note that the above moderator analyses provide estimates of the
associations between the presence of specific program features and average effect sizes;
however, programs without those specific features that positively predict impacts may still
positively impact student outcomes on average. Thus in Table 10 we present the results of these
moderator tests summarized in terms of regression-adjusted mean effect sizes. First, we present
mean effect sizes based on subgroup analyses without controls for additional program features.
These mean effect sizes are based on unconditional meta-regression models estimated using
RVE to account for the nesting of effect sizes within studies. Next, we present mean effect sizes
based on conditional meta-regression models corresponding to our main moderation analyses
with each predictor included separately and controlling for the program features listed in Table 3.
Finally, we present mean effect sizes corresponding to our final moderation analyses with all
predictors within each category entered simultaneously. For brevity, we focus on only those
features that were statistically significant predictors of program impact in our main models.
The mean effect sizes presented in Table 10 indicate that the overall impacts of programs
with and without the moderators of interest were generally positive. Estimated mean effect sizes
for STEM professional development and curriculum improvement programs are generally
positive even among programs that lacked the intervention features we identified as associated
with larger effect sizes. For example, average effect sizes were positive among programs
INSTRUCTIONAL IMPROVEMENT IN STEM
29
including either professional development or new curriculum materials but not both components
(𝑔𝑐 = 0.156, 𝑔𝑢𝑐 = 0.136, puc < 0.01), professional development programs that had an online
component (𝑔𝑐+ = 0.096, 𝑔𝑐 = 0.103, 𝑔𝑢𝑐 = 0.116, puc < 0.05), and professional development that
did not contain a focus on improving teachers’ content knowledge, pedagogical content
knowledge or knowledge of how students learn (𝑔𝑐+ = 0.179, 𝑔𝑐 = 0.175, 𝑔𝑢𝑐 = 0.095, puc <
0.01). Although impacts were largest on intervenor-developed assessments, estimated mean
effects sizes are still positive for impacts on state standardized assessments (𝑔𝑐 = 0.0101, 𝑔𝑢𝑐 =
0.060, puc < 0.01) and other standardized assessments (𝑔𝑐 = 0.087, 𝑔𝑢𝑐 = 0.084, puc < 0.01).
Furthermore, the differences in mean effect sizes within each category based on estimating
unconditional models, with the exception of same-school collaboration, are generally comparable
in direction and magnitude to those based on conditional models.
Study Design Moderators
Next, we examined whether a wider range of study design characteristics and
implementation contexts were associated with effect size magnitudes (Table 11). Studies that
included credit incentives for participating teachers showed smaller impacts on average (-0.17
SD in the final model, p<0.05). We found that the unit of assignment to the program (schools vs.
teachers) did not predict effect sizes. We did not find significant relationships between effect size
magnitudes and whether teacher participation was voluntary or mandatory, or whether teachers
received monetary incentives; however, this information was unreported in many studies.
Finally, we found that the level of attrition did not predict the size of impacts.
To further examine the potential role of attrition bias, we additionally examined the
adjusted mean effect sizes for studies that do and do not meet our attrition standards. We see
INSTRUCTIONAL IMPROVEMENT IN STEM
30
little evidence that studies with attrition problems reported larger impacts (see Appendix Table
A2).
In separate analyses, no other study design features were significantly related to
outcomes, including whether the study involved one district, multiple districts and/or states;
whether the study was conducted in the U.S. or abroad; whether the study was conducted in an
urban vs. non-urban setting; and whether the study sample was majority low-income. We also
found that study size (defined as the average number of treatment clusters across effect sizes
within a study) was not a significant predictor of study outcomes (results not shown).
Publication Bias
Finally, we considered whether effect sizes from peer-reviewed sources differed in
magnitude, on average, relative to effect sizes from other sources. In Table 12, we present results
from a meta-regression model estimated using RVE that includes whether the effect size was
from a peer-reviewed publication as a moderator. We found that effect sizes from peer-reviewed
studies were larger by 0.05 (uncontrolled, Model 1) or 0.07 SD (controlled, Model 2) than those
from other studies when using the RVE approach and after controlling for study design, study
sample, subject area, and outcome measure type. However, these differences were not
statistically significant.
We also conducted two additional tests for publication bias. First, we assessed
publication bias using Egger's regression test (Egger et al., 1997). Given the multiple effect sizes
within each study, we conducted this test at the study level by regressing the study average
standard normal deviation (the average effect size divided by the average standard error) on the
inverse of the study average effect size standard error. This approach tests the null hypothesis
that the intercept is zero; if the null hypothesis is rejected this indicates that smaller (or less
INSTRUCTIONAL IMPROVEMENT IN STEM
31
precise) studies have systematically larger or smaller reported effect sizes relative to larger (or
more precise) studies which could be due to publication bias. We then use the “trim and fill”
method to examine the magnitude of potential publication bias (Duval & Tweedie, 2000).
Although the models estimated are not precisely analogous to our preferred estimation approach,
results indicate potential publication bias in the full sample as demonstrated by the fact that
studies with larger effect size standard errors (i.e., smaller and less precise studies) have larger
effects (p<0.001). A comparison of the adjusted and unadjusted estimated average effect sizes
from the trim-and-fill method indicates that the magnitude of potential publication bias is
substantial. Second, to account for the nested structure of the data, we also used a modification
of the Egger’s regression test by adding the standard errors of the effect sizes as a moderator to
the unconditional RVE meta-regression model. If effect size standard errors are significant
predictors of effect sizes, this could similarly indicate the presence of publication bias. As above,
results indicate potential publication bias in the full sample (p<0.001).
However, results of conducting both methods separately for peer reviewed and non-peer
reviewed studies detect potential publication bias only among peer reviewed studies. Results of
both Egger’s regression test and the RVE approach indicate that smaller (or less precise) studies
report larger impacts (p<0.001). However, there is less evidence of systematic differences in
reported effects sizes based on study size or precision among studies from other sources
(p=0.116; p=0.050). This suggests that the association between study size or precision and
reported impacts among peer-reviewed studies may be a result of publication bias rather than
other factors (e.g., more effective interventions are more costly and therefore evaluated with
small samples). This highlights the importance of the inclusion of the “grey literature” in
INSTRUCTIONAL IMPROVEMENT IN STEM
32
systematic reviews and meta-analyses of instructional improvement efforts (Polanin, Tanner-
Smith, & Hennessy, 2016). For full results of these tests, see Appendix Table A3 and Figure A1.
Sensitivity Checks
We conducted additional analyses to examine the robustness of our results. First, we
address the fact that in 20 cases, authors reported some impacts as ‘not statistically significant’
and did not provide enough information to calculate an effect size. A concern is that failing to
provide this information may be correlated with program features. In addition, that effect sizes
that are not statistically significant but negative in sign may be less frequently reported that
effect sizes that are positive in sign. We therefore tested the robustness of our results to the
inclusion of these 20 missing effect sizes by imputing a range of values for missing effect sizes
(g = 0.00, g = -0.10, and g = -0.20) and using the study-level mean of the effect size standard
error to calculate their weights. If our results are driven by differential reporting of information
regarding effect sizes that are not statistically significant based on whether point estimates are
positive or negative in sign, including these impacts should attenuate our results. In general,
including these effect sizes did not substantively change our results. In the majority of cases
where our primary results showed a significant association between program features and
outcomes, results are comparable in magnitude and remain statistically significant. The only
exceptions are the number of professional development activities and use of same-school
collaboration, which are no longer significant predictors of program impact, although the
associations are comparable in magnitude to our main estimates.
We also found that our results were largely robust to excluding studies in settings outside
the U.S., and to excluding studies with weaker designs (i.e., studies that were not RCTs).
Associations are generally comparable in magnitude to our main results, although in some cases
INSTRUCTIONAL IMPROVEMENT IN STEM
33
less precisely estimated. The only exceptions are that a focus on how to use new curriculum
materials is not a significant predictor of impacts after excluding non-RCT studies, and a focus
on how to use new curriculum materials and the number of professional development activities
are not significant predictors of impacts after excluding studies outside the U.S. These
associations are comparable in magnitude to our main estimates. We also examined the
sensitivity of our results to choosing different values of the within-study correlation between
effect sizes. In our primary models, we specify the correlation to be 0.80, which is the value
recommended by Tanner-Smith and Tipton (2014). Using alternative specifications of (𝜌 = 0.50,
0.70, and 0.90) did not change our results. Full results of these sensitivity checks are available
from the authors on request.
Discussion and Conclusions
To summarize, we found that studies of STEM instructional improvement programs had
on average positive effects on student achievement, with a mean pooled effect size across studies
of 0.21 SD. Compared to those found in prior reviews, these pooled effect sizes are in the
middle range, smaller than those identified in the Yoon et al. review (0.57 SD in math and 0.51
SD in science) and the Taylor et al. review (0.489 SD in science), and somewhat larger than
those identified in the Scher & O'Reilly review (0.14 SD in math and 0.13 SD in science).
Although the effect sizes differ, our results confirm earlier reviewers' findings that studies of
STEM instructional improvement efforts tend to show positive results.
We conducted a series of analyses to examine the relationships between program
characteristics and the size of achievement impacts. The characteristics that were significantly
associated with improved student learning across the current set of studies included the
following:
INSTRUCTIONAL IMPROVEMENT IN STEM
34
• The use of professional development along with new curriculum materials;
• A focus on improving teachers' content/pedagogical content knowledge, or
understanding of how students learn;
• Specific formats, including:
• meetings to troubleshoot and discuss classroom implementation of the
program;
• the provision of summer workshops to begin the professional development
learning process;
• same-school collaboration.
We also found that on average, programs that provided any component of the PD online had
poorer student outcomes than programs that did not use an online PD component. In general,
there was not a statistically significant difference in the magnitudes of these associations
depending on whether programs focused on mathematics or science.
Components Associated with STEM Program Effectiveness
Taken together, we generally find that providing teachers with opportunities to learn
about the materials they will use with students and/or to participate in programs that seek to
improve their content or pedagogical content knowledge is associated with improved student
outcomes. These findings accord with prior cross-sectional research. For example, Boyd,
Grossman, Lankford, Loeb, and Wyckoff (2009) found that teacher preparation programs that
focused closely on teachers' classroom practice, including reviewing the district curriculum,
produced teachers with better student outcomes in their first year of teaching. Authors (1998)
also found that teachers' participation in curriculum-centered professional development was
related to student achievement. These findings also lend support to the conclusions drawn in
INSTRUCTIONAL IMPROVEMENT IN STEM
35
prior reviews conducted by Scher and O'Relly (2009) and Kennedy. Scher and O'Reilly (2009)
found that inverventions that focused on both content and pedagogy posted stronger student
outcomes as compared with interventions that focused on pedagogy alone, while Kennedy
(1999) concluded that programs were more effective when they focused on how to teach specific
content and on how students learn the same content.
Most studies of curriculum materials in our dataset included at least some component of
professional development, and vice versa. However, examining studies that included both
elements jointly leads us to see that on average, programs that incorporated both professional
development and new curriculum materials had larger impacts as compared with programs that
included only one of these components. These findings lend support to the argument advanced
by some scholars that curriculum materials alone, even those designed to be educative and
supportive of teacher implementation, may be insufficient to change teaching practice, given the
complexity of classroom instructional interactions, and the often-ingrained nature of traditional
inquiry-response-evaluation teaching practices (Alozie, Moje, & Krajcik, 2010). Meanwhile,
perhaps when professional development is provided without reference to specific curriculum,
teachers may struggle to implement what they have learned while using existing curricular
materials and textbooks.
We also found a positive association between student outcomes and teachers'
participation in implementation meetings, which were defined as meetings in which teachers met
formally or informally with other activity participants to discuss enacting intended practices. We
also found a significant relationship, in our final model, between same-school collaboration
(teachers participating in PD alongside their colleagues) and student outcomes. These findings
align with prior work (e.g., Darling-Hammond & McLaughlin, 1995) emphasizing the
INSTRUCTIONAL IMPROVEMENT IN STEM
36
importance of providing teachers with opportunities to discuss instructional innovations with
colleagues (e.g., Penuel, Sun, Frank & Gallagher, 2012), and discuss and troubleshoot issues that
arise when implementing new instructional approaches.
Perhaps more surprisingly, the inclusion of a summer workshop was positively related to
student outcomes. Prior syntheses (Scher & O'Reilly, 2009; Yoon et al., 2009) did not examine
this variable specifically. The summer workshop format for professional development has been
critiqued in the past for its typically 'one-shot' nature, but it may be the case that an intensive
summer professional development provides an effective 'springboard' for school-year
implementation. On the other hand, programs in which participants completed a portion of the
professional development online had weaker student outcomes, on average, as compared with
programs that did not include any online PD. Dede, Ketelhut, Whitehouse, Breit, and McCloskey
(2008) point out that although online teacher professional development is burgeoning, relatively
little rigorous research has examined its effectiveness. Although these analyses are correlational
in nature, the positive results associated with both the summer professional development and the
teacher implementation meetings are consistent with the notion that teachers may have benefitted
from both intensive and ongoing opportunities to interact with one another around program
content, which may have been less salient in the online format.
In contrast to two earlier reviews (Scher & O’Reilly, 2009; Yoon et al., 2007), we find no
evidence of a positive association between the duration of professional development and
program impacts. Yoon compared the effectiveness of interventions that included greater than
versus less than 14 hours of professional development, finding that the former had larger impacts
on student achievement. Scher and O'Reilly (2009) compared the effectiveness of interventions
conducted over two or more years versus those conducted over one year, and found that those
INSTRUCTIONAL IMPROVEMENT IN STEM
37
conducted over two or more years were more effective on average among math-focused, but not
science-focused, interventions. However, both of these reviews note that the small number of
included studies limited their ability to draw firm conclusions. The current findings, using a
continuous measure of contact hours and a separate measure of timespan, suggest that programs
that were limited in duration nonetheless generally had positive impacts on average. For
example, several programs that combined new curriculum materials with a short amount of
professional development documented moderate to large impacts on student achievement (e.g.,
Arnold et al., 2002; Clements & Sarama, 2007; Presser, Vahey, & Dominguez, 2015). In
contrast, some studies of highly-intensive professional development programs showed little or no
impacts on student learning (e.g., Authors, 2016; Devlin-Sherer et al., 1998; Van Egeren et al.,
2014). Our findings echo those of Kennedy (1999, 2016), who did not find a clear benefit of
contact hours or program duration, and concluded that the core condition for program
effectiveness was valuable content; more hours of a given intervention will not help if the
intervention content is not useful.
Similar to Taylor et al., (2018), our analysis did not detect any relationship between
student outcomes and study design, science subject matter, and grade level. However, we found
that average impacts varied considerably depending on the type of student test used. On average,
student outcomes were larger on researcher-designed assessments as compared with standardized
tests. While standardized tests may have benefits such as face validity and broad content
representation, it may be that standardized tests are not especially sensitive to instructional
improvement efforts, due to differences in the skills measured by the tests versus those targeted
in the intervention (see also Hill et al., 2008; Taylor et al., 2018; Sussman & Wilson, 2018). If
this is the case, in order to understand how instructional improvement interventions influence
INSTRUCTIONAL IMPROVEMENT IN STEM
38
student learning, researchers may need to include both assessments of outcomes that are closely
tied to the student learning goals along with broader standardized tests.
Limitations and Future Directions
We note several limitations to the current review. One limitation is missing data. First,
although we attempted to search both the published and unpublished literatures, studies may
have eluded our grasp. Second, many programs and interventions that are routinely conducted in
schools are never evaluated, and these programs may differ in unknown ways from those that are
formally evaluated. Most of the programs studied are boutique programs, often designed,
operated and evaluated by university or contract researchers. We know little about the efficacy of
professional development that reaches typical teachers. The characteristics we identified as
potentially effective here may not carry over to typical conditions and with the resources
conventionally available in school districts.
Missing data in study reports posed another limitation. We initially hoped to code the
included studies for a number of additional features that have been hypothesized in the literature
to influence the effectiveness of instructional improvement programs, including district and
school leadership support, competing instructional improvement initiatives, teacher recruitment
methods, and the financial resources provided to support the intervention (Wilson, 2013).
However, we found that few study reports contained sufficient detail on these matters, making
tracking their impact impossible. This omission is striking; in nearly all published and
unpublished reports, the district context is simply a black box, or merely a site for teacher
recruitment and service delivery. The inclusion of more contextual information in study reports
is pressing, especially because many district administrators evaluate proposed programs based on
the extent to which studies were carried out in “districts like ours.” Gathering contextual
INSTRUCTIONAL IMPROVEMENT IN STEM
39
information would also provide insight into the conditions necessary for instructional
improvement programs to thrive.
In addition, it is important to note that, as is generally the case in meta-analyses, the
moderator analyses we have conducted are correlational. The authors of the included studies
generally did not randomly manipulate the variables that we identified as associated with
improved student outcomes in the moderator analyses, such as by randomly assigning teachers to
participate versus not participate in a summer workshop. As a result, moderator analyses are
potentially confounded by unobserved or inadequately measured study and student
characteristics. The characteristics of professional development and curriculum programs that
appear promising in the current study's moderator analyses thus are not definitive, but point
toward potentially productive areas for future experimental work. Future randomized
experiments comparing instructional programs that do versus do not have these components are
warranted, to estimate causally the effects of robust STEM professional development and
curriculum interventions on student learning. Additionally, our use of binary indicators rather
than raw attrition rates constitutes a limitation, as continuous attrition measures would have
allowed us to estimate the impacts of attrition more precisely.
Despite these limitations, however, we were able to distill findings from dozens of recent
experimental and strong quasi-experimental evaluations of instructional improvement programs
in STEM. Based on the extant literature, the types of practices that were associated in our review
with improved student learning, such as providing teachers with opportunities to engage with the
curriculum they teach, develop their content knowledge, pedagogical content knowledge and
understanding of how students learn, and discuss classroom implementation, likely occur
infrequently in the instructional improvement programs typically offered in US schools (e.g.,
INSTRUCTIONAL IMPROVEMENT IN STEM
40
Author, 2009). Future experimental studies that build on the current findings are warranted, to
advance researchers' and policymakers' understanding of core instructional reform practices that
improve student learning in STEM.
Endnotes
1 For example, in the area of middle and secondary math curriculum materials, Slavin and
colleagues searched the published and grey literature dating back to 1970, conducting "[a] broad
literature search ... in an attempt to locate every study that could possibly meet the inclusion
requirements. This included obtaining all of the middle school studies cited by the What Works
Clearinghouse (2008b) and the middle and high school studies cited by NRC (2004), by Clewell
et al. (2004), and by other reviews of mathematics programs, including technology programs that
teach math (e.g., Chambers, 2003; Kulik, 2003; Murphy et al., 2002). Electronic searches were
made of educational databases (JSTOR, Education Resources Information Center, EBSCO,
PsycINFO, Dissertation Abstracts), Web-based repositories (Google, Yahoo, Google Scholar),
and education publishers’ Web sites. Citations of studies appearing in the first wave of studies
were also followed up. A particular effort was made to find non-U.S. studies." They performed a
similar search for elementary math curriculum materials (Slavin & Lake, 2008), elementary
science curriculum materials (Slavin et al., 2014), and secondary science materials (Cheung,
Slavin, Kim, & Lake, 2016). We refer the reader to those articles for specifics. In the area of
professional development, Yoon et al. (2007) searched for studies dating from 1986 and explain
that "Studies were gathered through an extensive electronic search of published and unpublished
research literature. The review protocol included a list of keywords that guided the literature
search. Seven electronic databases were core data sources: ERIC, PsycINFO, ProQuest,
INSTRUCTIONAL IMPROVEMENT IN STEM
41
EBSCO’s Professional Development Collection, Dissertation Abstracts, Sociological Collection,
and Campbell Collaboration. These databases were searched separately for each of the three
subjects under review (mathematics, science, and reading and English/language arts). In
consultation with a reference librarian, search parameters were developed using database-
specific keywords ... A deliberately wide net captured literature on professional development and
student achievement, broadly defined ... Fourteen key researchers were also asked to identify
research for the study. Eight researchers responded, recommending additional studies that fit the
study purpose. Finally, existing literature reviews and research syntheses were consulted to
ensure that no key studies were omitted."
2As another point of comparison, the What Works Clearinghouse protocols have generally
limited their scope to studies from the past twenty years, with some categories such as
conference proceedings limited to the past seven years (Brown, Card, Dickersin, Greenhouse,
Kling, & Littell, 2008).
3The specific search strings applied to searches of the titles and abstracts were as follows:
(“professional development” OR “faculty development” OR “Staff development” OR “teacher
improvement” OR “inservice teacher education” OR “peer coaching” OR “teachers’ institute*”
OR “teacher mentoring” OR “Beginning teacher induction” OR “teachers’ Seminar*” OR
“teachers’ workshop*” OR “teacher workshop*” OR “teacher center*” OR “teacher mentoring”
OR curriculum OR instruction*) AND ( “Student achievement” OR “academic achievement”
OR “mathematics achievement” OR “math achievement” OR “science achievement”
OR “Student development” OR “individual development” OR “student learning” OR
“intellectual development” OR “cognitive development” OR “cognitive learning” OR “Student
Outcomes” OR “Outcomes of education” OR “educational assessment” OR “educational
INSTRUCTIONAL IMPROVEMENT IN STEM
42
measurement” OR “educational tests and measurements” OR “educational indicators” OR
“educational accountability”) AND ("*experiment*" OR "control*" OR "regression
discontinuity” OR “compared” OR “comparison” OR “field trial*” OR “effect size*” OR
“evaluation”) AND (“Math*” OR “*Algebra*” OR “Number concepts” OR “Arithmetic” OR
“Computation” OR “Data analysis” OR “Data processing” OR “Functions" OR “Calculus” OR
“Geometry” OR “Graphing” OR “graphical displays” OR “graphic methods” OR “Science*”
OR “Data Interpretation” OR “Laboratory Experiments” OR “Laboratory Procedures” OR
“Experiment*” OR “Inquiry” OR “Questioning” OR “investigation*” OR “evaluation methods”
OR “laboratories” OR “biology” OR “observation” OR “physics” OR “chemistry” OR
“scientific literacy” OR “scientific knowledge” OR “empirical methods” OR “reasoning” OR
“hypothesis testing”).
4As Slavin (2008) defined this contrast, “A program is defined here as any set of replicable
procedures, materials, professional development, or service configurations that educators could
choose to implement to improve student outcomes. A program is distinct from a variable in
consisting of a specific, well-specified set of procedures and supports. Class size, assigning
homework, or provision of bilingual education are variables, for example, whereas programs
typically are based on particular textbooks, computer software, and/or instructional processes
and usually have a name and a specific provider, such as a company, university, or individual.”
5 For studies with multiple treatment arms, this includes separate effect sizes from each treatment
contrast. For studies that reported impacts for multiple groups of teachers or students (e.g.,
studies that reported impacts separately by grade), this includes separate impacts for each teacher
and student sample. However, we do not include multiple effect sizes for impacts on the same
assessment for the same teacher and student sample (e.g., cases where studies administered the
INSTRUCTIONAL IMPROVEMENT IN STEM
43
same assessment to examine follow-up impacts over time). We also include a separate effect size
for each assessment of math or science achievement reported by each study. For example, if a
study reported impacts on two separate assessments (e.g., a standardized test and a researcher-
designed assessment), both effect sizes were coded and included in the analysis. We did not
include impacts on assessment subscales or sub-scores if impacts on total scores on the
assessment were also reported. We included impacts on subscales or sub-scores only if impacts
on total scores were not reported.
6 Our technical advisory board consisted of five members – three with expertise in instructional
improvement and program evaluation, and two with expertise in meta-analysis. We met in
person with the first group to gather feedback on our proposed coding system. We consulted with
and sent drafts of our paper to the latter group for feedback on our data and models.
7 Some author-supplied effect sizes that were described as only as being similar to a standardized
mean difference (including standardized coefficients from multilevel models). These were
treated as Cohen’s d effect sizes and converted to Hedges’s g.
8 For these cases a standardized effect size was calculated by the ratio of the unstandardized
regression coefficient and an estimate of the standard deviation of the outcome. The outcome
standard deviation was estimated using the formula provided by Higgins and Deeks (2008):
𝑆𝐷 =𝑆𝐸
√1
𝑁𝑇+
1𝑁𝑐
where 𝑁𝐸 represents the number of students in the treatment group and 𝑁𝐶 represents the number
of students in the control group, and SE represents the coefficient standard error.
9 These include cases where the authors did not take into account the nesting of students within
classrooms and/or schools, cases where effect sizes were reported without the associated
INSTRUCTIONAL IMPROVEMENT IN STEM
44
standard errors, and cases where the authors reported cluster-adjusted impact estimates, but it
was necessary to calculate a standardized effect sizes based on other information. For example,
this includes cases where the authors reported that cluster-adjusted regression results were
significant below a given value (e.g., p<0.05) but neither standard errors nor p-values nor other
test statistics (e.g., t-statistics) were reported. Of the 258 effect sizes included in this study, we
applied a correction to adjust the standard error for clustering in 114 cases. The majority of effect
sizes that did not require the standard error correction for clustering were based on results of
multilevel models and regression models with clustered standard errors (e.g., using t-statistics,
standard errors and regression coefficients, and p-values). Other effect sizes were reported at the
cluster level (e.g., differences in mean classroom performance); no clustering adjustment was
necessary in these cases.
10 These include 20 cases where the authors referred to “no significant effect” on one or more
outcomes but did not report an effect sizes, and 9 cases where authors reported some outcome
information but did not provide enough information to calculate an effect size (e.g. cases where
the outcome information included only raw means).
11 Specifically, we assumed that the standardized effect size (Hedges’s g) was took on a range of
values (g = 0.00, g =-0.10, and g = -0.20), and used the study-level mean standard error based on
non-missing effect sizes.
12 Some studies contained multiple treatment arms where the characteristic was present in at least
one arm, but not present in at least one other arm. In Table 1, we consider whether the feature
was present in any treatment arm of the study. In all subsequent analyses, we use the study-level
mean.
INSTRUCTIONAL IMPROVEMENT IN STEM
45
References
Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the
Chicago public high schools. Journal of Labor Economics, 25(1), 95-135.
Agodini, R., Harris, B., Remillard, J., & Thomas, M. (2013). After two years, three elementary
math curricula outperform a fourth. Washington, DC: National Center for Education
Evaluation and Regional Assistance.
Alozie, N. M., Moje, E. B., & Krajcik, J. S. (2010). An analysis of the supports and constraints
for scientific discussion in high school project‐based science. Science Education, 94(3),
395-427.
Arnold, D. H., Fisher, P. H., Doctoroff, G. L., & Dobbs, J. (2002). Accelerating math
development in Head Start classrooms. Journal of Educational Psychology, 94(4), 762-
770.
Ball, D. L., & Cohen, D. K. (1996). Reform by the book: What is: or might be: the role of
curriculum materials in teacher learning and instructional reform? Educational
Researcher, 25(9), 6-14.
Banilower, E., Smith, P. S., Weiss, I. R., & Pasley, J. D. (2006). The status of K-12 science
teaching in the United States: Results from a national observation survey.
In D.Sunal & E.Wright (Eds.), The impact of the state and national standards on K-12
science teaching (pp. 83–122). Greenwich, CT: Information Age Publishing.
Becker, K., & Park, K. (2011). Effects of integrative approaches among science, technology,
engineering, and mathematics (STEM) subjects on students' learning: A preliminary
meta-analysis. Journal of STEM Education: Innovations and Research, 12(5/6), 23-37.
Blank, R. K., & De Las Alas, N. (2010, March). Effects of teacher professional development on
INSTRUCTIONAL IMPROVEMENT IN STEM
46
gains in student achievement: How meta-analysis provides scientific evidence
useful to education leaders. Paper presented at the Society for
Research on Educational Effectiveness Spring Conference, Washington, DC.
Borman, K. M., Cotner, B. A., Lee, R. S., Boydston, T. L., & Lanehart, R. (2009, March).
Improving Elementary Science Instruction and Student Achievement: The Impact of a
Professional Development Program. Paper presented at the Society for Research on
Educational Effectiveness Spring Conference, Washington, DC.
Boyd, D. J., Grossman, P. L., Lankford, H., Loeb, S., & Wyckoff, J. (2009). Teacher preparation
and student achievement. Educational Evaluation and Policy Analysis, 31(4), 416-440.
Cavanagh, S. (2015, August 17). Spending on instructional materials, construction climbing in
K-12. EdWeek. Retrieved from https://marketbrief.edweek.org/marketplace-k-
12/instructional_materials_construction_spending_climbing_in_k-12/
Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact
of an integrated approach to science and literacy in elementary school classrooms.
Journal of Research in Science Teaching, 49(5), 631-658.
Cheung, A., Slavin, R. E., Kim, E., & Lake, C. (2017). Effective secondary science programs: A
best‐evidence synthesis. Journal of Research in Science Teaching, 54(1), 58-81.
Clark, D. B., Tanner-Smith, E. E., & Killingsworth, S. S. (2016). Digital games, design, and
learning: A systematic review and meta-analysis. Review of Educational Research, 86(1),
79-122.
Clements, D. H., & Sarama, J. (2007). Effects of a preschool mathematics curriculum:
Summative research on the Building Blocks project. Journal for Research in
Mathematics Education, 38(2), 136-163.
INSTRUCTIONAL IMPROVEMENT IN STEM
47
Clewell, B. C., Campbell, P. B., & Perlman, L. (2004). Review of evaluation studies mathematics
and science curricula and professional development models. Washington, DC: The
Urban Institute.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Lawrence Earlbaum Associates.
Confrey, J. (2006). Comparing and contrasting the National Research Council report on
evaluating curricular effectiveness with the What Works Clearinghouse approach.
Educational Evaluation and Policy Analysis, 28(3), 195-213.
Confrey, J., & Stohl, V. (Eds.). (2004). On evaluating curricular effectiveness: Judging the
quality of K-12 mathematics evaluations. Washington, DC: National Academies Press.
Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach. Los
Angeles, CA: Sage.
Corcoran, T. B. (1995). Helping teachers teach well: Transforming professional development.
Philadelphia, PA: CPRE Policy Briefs. Retrieved from
https://files.eric.ed.gov/fulltext/ED388619.pdf
Darling-Hammond, L., & McLaughlin, M. W. (1995). Policies that support professional
development in an era of reform. Phi Delta Kappan, 76(8), 597-604.
Davis, E. A., & Krajcik, J. S. (2005). Designing educative curriculum materials to promote
teacher learning. Educational Researcher, 34(3), 3-14.
Dede, C., Ketelhut, D., Whitehouse, P., Breit, L., & McCloskey, E. M. (2009). A research
agenda for online teacher professional development. Journal of Teacher Education,
60(1), 8-19.
INSTRUCTIONAL IMPROVEMENT IN STEM
48
Desimone, L. M. (2009). Improving impact studies of teachers' professional development:
Toward better conceptualizations and measures. Educational Researcher, 38(3), 181-199.
doi: 10.3102/0013189X08331140
Desimone, L. M. (2011). A primer on effective professional development. Phi Delta
Kappan, 92(6), 68-71.
Devlin-Scherer, W., Spinelli, A. M., Giammatteo, D., Johnson, C., Mayo-Molina, S., McGinley,
P., ... & Zisk, L. (1998, February). Action research in professional development schools:
Effects on student learning. Paper presented at the Annual Meeting of the American
Association of Colleges for Teacher Education, New Orleans, LA.
Dietrichson, J., Bøg, M., Filges, T., & Klint Jørgensen, A. M. (2017). Academic interventions for
elementary and middle school students with low socioeconomic status: A systematic
review and meta-analysis. Review of Educational Research, 87(2), 243-282.
Duval, S., & Tweedie, R. (2000). Trim and fill: a simple funnel‐plot–based method of testing and
adjusting for publication bias in meta‐analysis. Biometrics, 56(2), 455-463.
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by
a simple, graphical test. British Medical Journal, 315(7109), 629-634.
Fisher, Z., & Tipton, E. (2015). robumeta: An R-package for robust variance estimation in meta-
analysis. Retrieved from https://arxiv.org/pdf/1503.02220.pdf
Frechtling, J., Sharp, L., Carey, N., & Vaden-Kiernan, N. (1995). Professional development
programs: A perspective on the last four decades. Washington, DC: National Science
Foundation, Division of Research. Evaluation and Communication.
Furtak, E. M., Seidel, T., Iverson, H., & Briggs, D. C. (2012). Experimental and quasi-
INSTRUCTIONAL IMPROVEMENT IN STEM
49
experimental studies of inquiry-based science teaching: A meta-analysis. Review of
educational research, 82(3), 300-329.
Gardella, J. H., Fisher, B. W., & Teurbe-Tolon, A. R. (2017). A systematic review and meta-
analysis of cyber-victimization and educational outcomes for adolescents. Review of
Educational Research, 87(2), 283-308.
Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., ... & Sepanik, S.
(2011). Middle school mathematics professional development impact study: Findings
after the second year of implementation. Washington, DC: National Center for Education
Evaluation and Regional Assistance.
Gersten, R., Taylor, M. J., Keys, T. D., Rolfhus, E., & Newman-Gonchar, R. (2014). Summary of
research on the effectiveness of math professional development approaches. (REL 2014–
010). Washington, DC: U.S. Department of Education, Institute of Education Sciences,
National Center for Education Evaluation and Regional Assistance, Regional Educational
Laboratory Southeast. Retrieved from http://ies.ed.gov/ncee/edlabs
Goldenberg, C., & Gallimore, R. (1991). Changing teaching takes more than a one-shot
workshop. Educational Leadership, 49(3), 69-72.
Hand, B., Norton-Meier, L.A., Gunel, M., & Akkus, R. (2016). Aligning teaching to learning: A
3-year study examining the embedding of language and argumentation into elementary
science classrooms. International Journal of Science and Mathematics Education, 14(5),
847-863.
Hedges, L. V. and Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological
Bulletin, 88, 359–369.
INSTRUCTIONAL IMPROVEMENT IN STEM
50
Hiebert, J., Stigler, J. W., Jacobs, J. K., Givvin, K. B., Garnier, H., Smith, M., ... & Gallimore, R.
(2005). Mathematics teaching in the United States today (and tomorrow): Results from
the TIMSS 1999 video study. Educational Evaluation and Policy Analysis, 27(2), 111-
132.
Higgins, J. P. T & Deeks, J. J. (2008). Selecting studies and collecting data. In J.P.T. Higgins &
S. Green (Eds.), Conchrane handbook for systematic reviews of interventions (151-186).
Chichester, U.K.: John Wiley & Sons.
Higgins, J. P. T., Deeks, J. J., & Altman, D.G. (2008). Special topics in statistics. In J.P.T.
Higgins & S. Green (Eds.), Conchrane handbook for systematic reviews of interventions
(481-529). Chichester, U.K.: John Wiley & Sons.
Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for
interpreting effect sizes in research. Child Development Perspectives, 2(3), 172-177.
Kane, T., Kerr, K., & Pianta, R. (2014). Designing teacher evaluation systems: New guidance
from the measures of effective teaching project. John Wiley & Sons.
Kennedy, M. M. (1999). Form and substance in in-service teacher education. (Research
Monograph No. 13). Arlington, VA: National Science Foundation.
Kennedy, M. M. (2016). How does professional development improve teaching? Review of
Educational Research, 86(4), 945-980.
Kim, J. S., & Quinn, D. M. (2012). A meta-analysis of K-8 summer reading interventions: The
role of socioeconomic status in explaining variation in treatment effects. Paper presented
at the annual meeting of the Society for Research on Educational Effectiveness.
Lipsey, M. W., Puzio, K., Yun, C., Hebert, M. A., Steinka-Fry, K., Cole, M. W., ... & Busick, M.
D. (2012). Translating the Statistical Representation of the Effects of Education
INSTRUCTIONAL IMPROVEMENT IN STEM
51
Interventions into More Readily Interpretable Forms. National Center for Special
Education Research.
Littell, J. H., Corcoran, J., & Pillai, V. (2008). Systematic reviews and meta-analysis. New York,
NY: Oxford University Press.
Llosa, L., Lee, O., Jiang, F., Haas, A., O’Connor, C., Van Booven, C. D., & Kieffer, M. J.
(2016). Impact of a large-scale science intervention focused on English language
learners. American Educational Research Journal, 53(2), 395-424.
Malouf, D. B., & Taymans, J. M. (2016). Anatomy of an Evidence Base. Educational
Researcher, 45(8), 454-459.
Marx, R. W., Blumenfeld, P. C., Krajcik, J. S., Fishman, B., Soloway, E., Geier, R., & Tal, R. T.
(2004). Inquiry‐based science in the middle grades: Assessment of learning in urban
systemic reform. Journal of research in Science Teaching, 41(10), 1063-1080.
Miller, B., Lord, B., & Dorney, J. A. (1994). Staff development for teachers: A study of
configurations and costs in four districts. Boston, MA: Education Development Center.
Miles, K. H., Odden, A., Fermanich, M., & Archibald, S. (2004). Inside the black box of school
district spending on professional development: Lessons from five urban districts. Journal
of Education Finance, 30(1), 1-26.
National Governors Association Center for Best Practices, Council of Chief State School
Officers. (2010). Common Core State Standards. Washington, DC: Author.
National Research Council. (2004). On evaluating curricular effectiveness: Judging the quality
of K-12 mathematics evaluations. Washington, DC: The National Academies Press.
INSTRUCTIONAL IMPROVEMENT IN STEM
52
National Research Council. (2011). Successful K-12 STEM education: Identifying effective
approaches in science, technology, engineering, and mathematics. National Academies
Press.
National Research Council, Board on Science Education. (2012). A framework for K-12
science education: Practices, crosscutting concepts, and core ideas. Washington, DC:
The National Academies Press. Retrieved from
http://www.nap.edu/catalog.php?record_id=13165
National Research Council, National Committee for Science Education Standards and
Assessment. (1996). National science education standards. Washington, DC: The
National Academies Press. Retrieved from http://www.nap.edu/catalog/4962.html
Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam, R. (2014). Effectiveness of Cognitive
Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127-144.
Penuel, W. R., Sun, M., Frank, K. A., & Gallagher, H. A. (2012). Using social network analysis
to study how collegial interactions can augment teacher learning from external
professional development. American Journal of Education, 119(1), 103-136.
Presser, A. L., Vahey, P., & Dominguez, X. (2015, March). Improving mathematics learning by
integrating curricular activities with innovative and developmentally appropriate digital
apps: Findings from the Next Generation Preschool Math Evaluation. Paper presented at
the Society for Research on Educational Effectiveness Spring Conference, Washington,
DC.
Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between
published and unpublished effect sizes: A meta-review. Review of Educational
Research, 86(1), 207-236.
INSTRUCTIONAL IMPROVEMENT IN STEM
53
Raudenbush, S. W. (2008). Advancing educational policy by advancing research on instruction.
American Educational Research Journal, 45(1), 206-230.
Raudenbush, S. W., & Bryk, A. S. (1985). Empirical Bayes meta-analysis. Journal of
Educational Statistics, 10(2), 75-98.
Resendez, M., and Azin, M., (2006). 2005 Prentice Hall Science Explorer randomized control
trial. Jackson, WY: PRES Associates, Inc.
Roschelle, J., Shechtman, N., Tatar, D., Hegedus, S., Hopkins, B., Empson, S., ... & Gallagher,
L. P. (2010). Integration of technology, curriculum, and professional development for
advancing middle school mathematics: Three large-scale studies. American Educational
Research Journal, 47(4), 833-878.
Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as a
lever for change: An experimental study of the effects of a professional development
program on students’ mathematics learning. Journal of Research on Educational
Effectiveness, 4(1), 1-24.
Saxe, G. B., & Gearhart, M. (2001). Enhancing students' understanding of mathematics: A study
of three contrasting approaches to professional support. Journal of Mathematics Teacher
Education, 4(1), 55-79.
Scher, L., & O'Reilly, F. (2009). Professional development for K–12 math and science teachers:
What do we really know? Journal of Research on Educational Effectiveness, 2(3), 209-
249.
Schneider, M. C., & Meyer, J. P. (2012). Investigating the efficacy of a professional
development program in formative classroom assessment in middle-school English
language arts and mathematics. Journal of Multidisciplinary Evaluation, 8(17), 1-24.
INSTRUCTIONAL IMPROVEMENT IN STEM
54
Schochet, P. Z. (2011). Do typical RCTs of education interventions have sufficient statistical
power for linking impacts on teacher practice and student achievement outcomes?
Journal of Educational and Behavioral Statistics, 36(4), 441-471.
Schwartz‐Bloom, R. D., & Halpin, M. J. (2003). Integrating pharmacology topics in high school
biology and chemistry classes improves performance. Journal of Research in Science
Teaching, 40(9), 922-938.
Shadish, W. R., & Haddock, C. K. (2009). Combining estimates of effect size. In H. Cooper, L.
V. Hedges, & J. C. Valentine, (Eds.), The handbook of research synthesis and meta-
analysis (Second edition). New York: Russell Sage.
Shavelson, R. J., & Towne, L. (Eds.). (2001). Scientific research in education. Washington, DC:
The National Academies Press.
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational
researcher, 15(2), 4-14.
Slavin, R. E. (2008). Perspectives on evidence-based research in education—What works? Issues
in synthesizing educational program evaluations. Educational Researcher, 37(1), 5-14.
Slavin, R., & Lake, C. (2008). Effective programs in elementary mathematics: A best-evidence
synthesis. Review of Educational Research, 78(3), 427-515.
Slavin, R. E., Lake, C., & Groff, C. (2009). Effective programs in middle and high school
mathematics: A best-evidence synthesis. Review of Educational Research, 79(2), 839-
911.
Slavin, R. E., Lake, C., Hanley, P., & Thurston, A. (2014). Experimental evaluations of
elementary science programs: A best-evidence synthesis. Journal of Research in Science
Teaching, 51(7), 870-901.
INSTRUCTIONAL IMPROVEMENT IN STEM
55
Smith , M. S., & O’Day, J. (1990). Systemic school reform. Politics of Education Association
Yearbook (pp. 233-267). London: Taylor & Francis Ltd.
Sparks, D. (2002). Designing powerful professional development for teachers and principals.
Oxford, OH: National Staff Development Council.
Stein, M. K., Remillard, J., & Smith, M. S. (2007). How curriculum influences student learning.
In F.K. Lester, Jr. (Ed.), Second handbook of research on mathematics teaching and
learning (319-369). New York, NY: Macmillan.
Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation with dependent effect
sizes: Practical considerations and a software tutorial in Stata and SPSS. Research
Synthesis Methods, 5(1), 13–30.
Tanner-Smith, E. E., Tipton, E., & Polanin, J. R. (2016). Handling complex meta-analytic data
structures using robust variance estimates: A tutorial in R. Journal of Developmental and
Life-Course Criminology, 2(1), 85-112.
Taylor, J. A., Getty, S. R., Kowalski, S. M., Wilson, C. D., Carlson, J., & Van Scotter, P. (2015).
An efficacy trial of research-based curriculum materials with curriculum-based
professional development. American Educational Research Journal, 52(5), 984-1017.
Taylor, J. A., Kowalski, S. M., Polanin, J. R., Askinas, K., Stuhlsatz, M. A., Wilson, C. D., ... &
Wilson, S. J. (2018). Investigating Science Education Effect Sizes: Implications for
Power Analyses and Programmatic Decisions. AERA Open, 4(3), 2332858418791991.
Timperley, H., Wilson, A., Barrar, H., & Fung, I. (2008). Teacher professional learning and
development.
INSTRUCTIONAL IMPROVEMENT IN STEM
56
Tipton, E., & Pustejovsky, J. E. (2015). Small-sample adjustments for tests of moderators and
model fit using robust variance estimation in meta-regression. Journal of Educational
and Behavioral Statistics, 40(6), 604-634.
What Works Clearinghouse. (2010). Procedures and standards handbook (Version 2.1).
Washington, DC: US Department of Education.
Wilson S.J., Tanner-Smith E.E., Lipsey, M.W., Steinka-Fry, K., Morrison, J. (2011) Dropout
prevention and intervention programs: Effects on school completion and dropout among
school aged children and youth. Campbell Systematic Reviews 2011:8 DOI:
10.4073/csr.2011.8
Wilson, S. M. (2013). Professional development for science teachers. Science, 340(6130),
310-313. doi: 10.1126/science.1230725
Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. (2007). Reviewing the
evidence on how teacher professional development affects student achievement (Issues &
Answers Report, REL 2007–No. 033). Washington, DC: U.S. Department of Education,
Institute of Education Sciences, National Center for Education Evaluation and Regional
Assistance, Regional Educational Laboratory Southwest. Retrieved from
http://ies.ed.gov/ncee/edlabs
Zaslow, M., Tout, K., Halle, T., Whittaker, J. V., & Lavelle, B. (2010). Toward the Identification
of Features of Effective Professional Development for Early Childhood Educators.
Literature Review. Office of Planning, Evaluation and Policy Development, US
Department of Education.
INSTRUCTIONAL IMPROVEMENT IN STEM
57
Tables and Figures
Table 1
Categories and descriptions of codes
Code Code description Code
presenta
Intervention Type Professional development
only Study included only professional development.
22%
Professional development
and curriculum materials
Study included both professional development and new
curriculum materials.
75%
Curriculum materials only Study included only new curriculum materials. 9%
Research Design and Sample
Characteristics
RCT Study is a randomized controlled trial. 91%
Subject matter – Math Subject matter focus of study is math or math/science. 64%
Preschool Study sample included preschool students. 19%
Effect Size Type State standardized test Outcome is a state standardized test. 17%
Other standardized test Outcome is from other standardized test. 30%
Adjusted for covariates Effect size is adjusted for covariates (e.g., pretest score). 76%
PD Formatb
Same-school collaboration Teachers participated in professional development with
other teachers from their own school. 74%
Implementation meetings
Teachers met formally or informally with other activity
participants to discuss classroom implementation (e.g.,
troubleshooting meeting).
35%
Online professional
development
Part or all of the professional development was
conducted online. 18%
Summer workshop The professional development included a summer
workshop. 54%
Expert coaching
The professional development involved coaching or
mentoring from experts who observed instruction and
provided feedback (e.g., a debriefing meeting; via video
or live).
20%
PD lead by
researchers/intervention
developers
The PD was led by the intervention developers and/or
the study authors. 64%
INSTRUCTIONAL IMPROVEMENT IN STEM
58
Code Code description Code
presenta
PD Timing and Durationc Contact hours Total number of PD contact hours. 45 hours
Timespan over which
professional development
was conducted:
Less than one week The PD was conducted over less than one week. 15%
One week The PD was conducted over one week. 3%
One month
(8 days to 30 days) The PD was conducted over 8 – 30 days. 3%
One semester
(31 days to 4 months) The PD was conducted over 31 days – 4 months. 13%
One year The PD was conducted over 4 months – 1 year. 49%
More than one year The PD was conducted over more than 1 year. 16%
PD Focusb
Generic instructional
strategies
The professional development was focused on content-
generic instructional strategies (e.g., improving
classroom climate and student motivation).
11%
How to use curriculum
materials The PD focused on how to use curriculum materials. 75%
Integrate technology
The PD focused on how to integrating technology into
the classroom. 11%
Content-specific formative
assessment
The PD focused on formative assessment strategies
specific to mathematics and science teaching (e.g.
strategies to elicit student understanding of fractions or
the scientific method).
17%
Improve content knowledge/
Pedagogical content
knowledge/How students
learn
The PD focused on improving teachers' pedagogical
content knowledge (e.g., how students learn
mathematics or science).
55%
PD Activitiesb
Review sample student work Teachers studied examples of students’ work (including
watching videos of students). 16%
Observed demonstration Teachers observed a video or live demonstration/
modeling of instruction. 35%
Solved problems/Worked
through student materials
Teachers solved problems or exercises during the PD or
worked through student materials during the PD. 42%
Developed curriculum/lesson
plans
Teachers developed curricula or lesson plans during the
PD. 19%
Reviewed own student work Teachers studied examples of their own students' work
during the PD. 11%
INSTRUCTIONAL IMPROVEMENT IN STEM
59
Code Code description Code
presenta
Curriculum Materialsd
Implementation guidance
The curriculum materials provided teachers with
implementation guidance (e.g., support for student-
teacher dialogues around the content).
43%
Laboratory/Hands-on
experience or curriculum kits
The curriculum materials included materials/guidance
that supported inquiry-oriented explorations (e.g.,
science laboratory or hands-on mathematics kits).
28%
Curriculum dosage (hours) Total number of hours that the curriculum was intended
to be used.
66.6
hours
Curriculum proportion
replaced (percent)
The proportion of each lesson that the new curriculum
was intended to replace existing curriculum.
91%
Note: N = 95 studies. a Figures in third column include percent of studies which feature the row code for binary
variables, or the sample average calculated at the study level for continuous variables (e.g.,
contact hours and curriculum dosage). For studies that had the feature present in one treatment
arm but not another treatment arm, the code is counted as present if it is present is any treatment
arm (e.g., a study with one treatment arm including only curriculum materials and a second
treatment arm including both professional development and curriculum materials would be
included in both rows). b Codes for PD focus, activities, and format were counted as “Not
Present” for studies that did not involve a PD component. c PD timing/duration includes excludes
studies without a PD component. d Codes for features of interventions involving new curriculum
materials were counted as “Not Present” for studies that did not involve new curriculum
materials.
INSTRUCTIONAL IMPROVEMENT IN STEM
60
Table 2
Results of estimating an unconditional meta-regression model with robust variance estimation
(RVE)
Dependent variable:
Effect Size (Hedges’s g)
Constant 0.209***
(0.025)
N effect sizes 258
N studies 95
𝜏2a 0.037
95% prediction intervalb (-0.165, 0.583)
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b The 95%
prediction interval is calculated as the estimated average effect size +/- 1.96* 𝜏.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
61
Table 3.
RVE results including controls for study design, study sample, subject area and outcome
measure type
Dependent variable: Effect Size (Hedges’s g)
Between-study effects
RCT -0.022
(0.104)
State standardized test -0.264***
(0.055)
Other standardized test -0.277***
(0.053)
Grade - preschool 0.133
(0.084)
Effect size adjusted for covariates -0.040
(0.055)
Subject matter- math -0.009
(0.044)
Within-study effects
State standardized test -0.208***
(0.052)
Other standardized test -0.203***
(0.059)
Constant 0.395***
(0.111)
N effect sizes 258
N studies 95
𝜏2a 0.025
Results of joint F-testb F = 5.75, df = 26.9, p<0.001
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the
joint F test are from a test of the joint significance of all study characteristics included in the
model. The F test was estimated using the robumeta and clubSandwich package in R (Fisher &
Tipton, 2014; Tipton & Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
62
Table 4
RVE results including intervention characteristics (professional development and/or curriculum
materials as moderators
Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Professional development only -0.084 -0.090*
(0.051) (0.052)
New curriculum materials only -0.129
(0.118)
Professional development only/New
Curriculum materials only
-0.099**
(0.046)
N effect sizes 258 258 258
N studies 95 95 95
𝜏2a 0.025 0.025 0.025
Raw mean ES:
Professional development only 0.254
New curriculum materials only 0.091
Professional development only/ New
curriculum materials only 0.214
Both professional development and new
curriculum materials 0.252
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
Models include controls for the following: RCT, state standardized test, other standardized test,
grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
63
Table 5
RVE results including professional development contact hours as moderators
Dependent variable:
Effect Size (Hedges’s g)
Between-study effects
PD contact hoursa 0.005
(0.007)
PD contact hours between 25th and 50th percentile (16
– 34.5 hours) 0.012
(0.050)
PD contact hours between 50th and 75th percentile (35
- 68 hours) -0.033
(0.065)
PD contact hours above 75th percentile
(> 68 hours) 0.069
(0.067)
PD contact hours at or above 25th percentile
(> 16 hours) 0.017
(0.043)
N effect sizes 231 231 231
N studies 85 85 85
𝜏2b 0.022 0.024 0.023
Results of joint F testc F = 0. 647, df = 35.2, p = 0.590
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
All models include only studies and/or treatment arms with a professional development
component. All models include controls for the following: RCT, state standardized test, other
standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a PD contact hours measured as raw PD contact hours / 10. b 𝜏2 is the method of moments
estimate of the between study variance in the underlying effects provided by the robumeta
package in Stata 15 (Tanner-Smith & Tipton, 2014). c Results of the joint F test are from a test of
the joint significance of the PD contact hours predictors from the second model. The F test was
estimated using the robumeta and clubSandwich package in R (Fisher & Tipton, 2014; Tipton &
Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
64
Table 6
RVE results including professional development foci as moderators
Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Generic instructional
strategies -0.014 -0.074
(0.076) (0.072)
How to use curriculum
materials 0.118* 0.119*
(0.060) (0.063)
Integrate technology 0.185 0.139
(0.117) (0.102)
Content-specific
formative assessment 0.132* 0.105
(0.067) (0.062)
Improve pedagogical
content
knowledge/how
students learn 0.094** 0.096**
(0.043) (0.043)
Number of PD
features 0.077***
(0.024)
N effect sizes 237 237 237 237 237 237 237
N studies 89 89 89 89 89 89 89
𝜏2a 0.025 0.025 0.024 0.024 0.026 0.025 0.023
Results of joint
F testb F = 3.89, df = 20.4, p = 0.012
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
All models include only studies and/or treatment arms with a professional development
component. All models include controls for the following: RCT, state standardized test, other
standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the
joint F test are from a test of the joint significance of the predictors for all professional
development foci included in the model. The F test was estimated using the robumeta and
clubSandwich package in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
65
Table 7
RVE results including professional development activities as moderators
Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Observed demonstration
0.033 0.025
(0.049) (0.046)
Review generic student
work 0.047 -0.014
(0.087) (0.085)
Solved problems/Worked
through
student materials 0.077 0.070
(0.047) (0.047)
Developed curriculum
materials/lesson plans 0.064 0.036
(0.061) (0.063)
Reviewed own student
work 0.033 0.017
(0.064) (0.071)
Number of PD features 0.046*
(0.024)
N effect sizes 237 237 237 237 236 236 237
N studies 89 89 89 89 88 88 89
𝜏2a 0.025 0.021 0.024 0.025 0.018 0.020 0.021
Results of joint F testb F = 0.883, df = 21.2, p = 0.509
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
All models include only studies and/or treatment arms with a professional development
component. All models include controls for the following: RCT, state standardized test, other
standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the
joint F test are from a test of the joint significance of the predictors for all professional
development activities included in the model. The F test was estimated using the robumeta and
clubSandwich package in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
66
Table 8
RVE results including professional development formats as moderators
Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Same-school
collaboration 0.109 0.123*
(0.076) (0.067)
Implementati
on meetings 0.085* 0.117**
(0.049) (0.045)
Any online
PD -0.161*** -0.153**
(0.053) (0.057)
Summer
workshop 0.093** 0.074*
(0.043) (0.038)
Expert
coaching 0.035 0.053
(0.049) (0.051)
PD leaders –
researchers -0.019 -0.037
(0.043) (0.038)
Number of
PD features 0.023
(0.024)
N effect sizes 237 237 237 236 237 231 231 237
N studies 89 89 89 88 89 86 86 89
𝜏2a 0.020 0.022 0.024 0.023 0.026 0.025 0.015 0.023
Results of
joint F testb F = 2.72, df = 26.2, p = 0.035
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
All models include only studies and/or treatment arms with a professional development
component. All models include controls for the following: RCT, state standardized test, other
standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the
joint F test are from a test of the joint significance of the predictors for all professional
development formats included in the model. The F test was estimated using the robumeta and
clubSandwich package in R (Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
67
Table 9
RVE results including characteristics of interventions involving new curriculum materials as
moderators
Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Implementation guidance 0.060 0.062
(0.053) (0.056)
Laboratory/hands-on
experience, curriculum kits -0.029 -0.026
(0.055) (0.054)
Curriculum dosage
(Number of hours) 0.000 0.000
(0.001) (0.001)
Curriculum proportion replaced
(0.00-1.00) -0.060
(0.151)
N effect sizes 193 193 193 192 193
N studies 77 77 77 76 77
𝜏2a 0.027 0.031 0.030 0.025 0.031
Results of joint F testb F = 0.415, df = 24.4, p = 0.744
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
All models include only studies and/or treatment arms that included new curriculum materials.
All models include controls for the following: RCT, state standardized test, other standardized
test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the
joint F test are from a test of the joint significance of the predictors for all characteristics of
interventions involving new curriculum materials included in the model. The F test was
estimated using the robumeta and clubSandwich package in R (Fisher & Tipton, 2014; Tipton &
Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
68
Table 10
Regression adjusted mean effect sizes based on unconditional and conditional RVE meta-
regression models
Subgroup analysis
(Unconditional
RVE model)
Conditional RVE
model Conditional RVE
model with controls
for other program
features
𝑔𝑢𝑐
, 𝑝𝑢𝑐 𝑔
�� pc 𝑔
𝑐+ pc+
Overall effect size 0.209*** -- -- -- --
Program type Professional development
and curriculum materials 0.235*** 0.254 ** -- --
Professional development
only/Curriculum
materials only+
0.136*** 0.156 -- --
Outcome type State standardized test 0.060*** 0.101 *** -- --
Other standardized test 0.084*** 0.087 *** -- --
Intervenor-developed
test+ 0.371*** 0.365 -- --
Program feature category: Professional development focus Focus on how to use
curriculum materials
Yes 0.228*** 0.258 * 0.260 *
No 0.134*** 0.140 0.141 Focus on content-specific
formative assessment
Yes 0.387*** 0.340 * 0.321 No 0.178*** 0.208 0.216
Focus on improving
content
knowledge/pedagogical
content knowledge/how
students learn
Yes 0.303*** 0.269 ** 0.275 **
No 0.095*** 0.175 0.179
INSTRUCTIONAL IMPROVEMENT IN STEM
69
Subgroup analysis
(Unconditional
RVE model)
Conditional RVE
model Conditional RVE
model with controls
for other program
features
𝑔𝑢𝑐
a, pcb 𝑔
��c pc
d 𝑔𝑐+
e pc+f
Program feature category: Professional development formats
PD includes same-school
collaboration
Yes 0.198*** 0.248 0.247 *
No 0.263*** 0.139 0.125 PD includes
implementation meetings
Yes 0.283*** 0.284 * 0.297 **
No 0.169*** 0.199 0.180 Any online professional
development
Yes 0.116** 0.103 *** 0.096 **
No 0.235*** 0.264 0.249 PD includes summer
workshop
Yes 0.218*** 0.266 ** 0.253 *
No 0.190*** 0.174 0.179
Note: First column: Regression-adjusted mean effect sizes are based on subgroup analyses.
Unconditional RVE meta-regression models were estimated including only effect sizes with the
row feature. a guc = Estimated mean effect size from unconditional RVE meta-regression model. b puc = p-
value on constant from unconditional meta-regression model.
Second and third column: Regression-adjusted mean effect sizes are based on the results of
estimating conditional RVE meta-regression models. Models included each the row feature
separately as a moderator. All models included controls for the variables listed in Table 3.
Regression-adjusted mean effect sizes were calculated using the overall average of the study-
level values of each included covariate. c 𝑔𝑐 = Estimated regression-adjusted mean effect size from conditional RVE meta-regression
model. d pc = p-value on coefficient for program feature on interest from meta-regression model.
Fourth and fifth column: Regression-adjusted mean effect sizes are based on the results of
estimating conditional RVE meta-regression models. Models included all row features in a given
category simultaneously as moderators. All models included controls for the variables listed in
Table 3. Regression-adjusted mean effect sizes were calculated using the overall average of the
study-level values of each included covariate. e 𝑔𝑐+ = Estimated regression-adjusted mean effect size from conditional RVE meta-regression
model. f pc+ = p-value on coefficient for program feature on interest from meta-regression model.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
70
Table 11
RVE results including other research design elements as moderators Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Unit of assignment
is teacher -0.054 -0.060
(0.043) (0.052)
Business as usual
control group 0.008 0.076
(0.079) (0.085)
Teacher
participation –
Voluntary -0.035 -0.032
(0.045) (0.054)
Teacher
participation –
Missing 0.111 0.114
(0.089) (0.103)
Teacher incentive –
Credit -0.167** -0.177*
(0.074) (0.097)
Teacher credit
incentive – Missing -0.050 -0.137
(0.055) (0.097)
Teacher incentive –
Monetary -0.054 0.016
(0.042) (0.057)
Teacher monetary
incentive – Missing -0.019 0.073
(0.063) (0.110)
Dependent variable: Effect Size (Hedges’s g)
High cluster -0.056 0.022
INSTRUCTIONAL IMPROVEMENT IN STEM
71
attrition
(0.051) (0.057)
Cluster attrition -
Missing -0.070 -0.069
(0.064) (0.078)
Higher student
attrition -0.068 -0.054
(0.046) (0.052)
Student attrition -
Missing -0.031 0.031
(0.054) (0.065)
N effect sizes 258 258 258 258 258 258 258 258
N studies 95 95 95 95 95 95 95 95
𝜏2a 0.024 0.022 0.026 0.024 0.026 0.026 0.027 0.028
Results of joint F
testb F = 1.30, df = 23.2, p = 0.284
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80. All models include controls for the
following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject
matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects provided by the robumeta package in
Stata 15 (Tanner-Smith & Tipton, 2014). b Results of the joint F test are from a test of the joint significance of the predictors for all
additional research design elements in the model. The F test was estimated using the robumeta and clubSandwich package in R
(Fisher & Tipton, 2014; Tipton & Pustejovsky, 2015).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
72
Table 12
RVE results including peer-reviewed publication type as a moderator Dependent variable: Effect Size (Hedges’s g)
Between-study effects
Peer-reviewed source 0.046 0.067
(0.050) (0.041)
N effect sizes 258 258
N studies 95 95
𝜏2a 0.038 0.026
Note: We assume the average correlation between all pairs of effect sizes within studies is 0.80.
Models in column 2 include controls for the following: RCT, state standardized test, other
standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜏2 is the method of moments estimate of the between study variance in the underlying effects
provided by the robumeta package in Stata 15 (Tanner-Smith & Tipton, 2014).
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
73
Figure 1. PRISMA study screening flow chart (PRISMA, 2009).
Records identified through database searching
(n = 8,099)
Scre
enin
g In
clu
ded
El
igib
ility
Id
Ien
tifi
cati
on
Additional records identified through other sources
(n = 1,391)
Records after duplicates removed (n = 7,926)
Records screened (n = 7,926)
Records excluded (n = 7,270)
Full-text articles assessed for eligibility
(n = 656)
Full-text articles excluded (n = 561) (Reasons: Did not meet intervention characteristics [e.g., off-topic, not a classroom-level intervention]: n=248; Methodology issues [e.g., no control group; post-hoc design]: n=310; Sample issues [e.g., did not have at least two teachers and fifteen students in each condition], n=45; Report was subsumed under another study, n=11; Full-text of study could not be located, n=11; Study characteristics issues [published before 1989 or not written in English], n=4)
Studies included in quantitative synthesis
(meta-analysis) (n = 95)
Iden
tifi
cati
on
Sc
reen
ing
Elig
ibili
ty
Incl
ud
ed
INSTRUCTIONAL IMPROVEMENT IN STEM
74
Online Appendix A:
Tables and Figures
INSTRUCTIONAL IMPROVEMENT IN STEM
75
Table A1
Included studies and effect sizes
Authors Content Area Study Design Setting Grades
Served
Effect
Size
Effect
Size
SE
Effect
Size
Variance
(SE^2)
Agodini, Harris, Remillard, & Thomas (2013)
Mathematics Random
assignment
Elementary 2
-0.210 0.064 0.004
0.000 0.175 0.031
0.040 0.068 0.005
Argentin, Pennisi, Vidoni, Abbiati, & Caputo
(2014)
Mathematics Random
assignment
Middle
School
6, 7, 8
-0.001 0.044 0.002
Arnold, Fisher, Doctoroff, & Dobbs (2002) Mathematics Random
assignment
Preschool PK
0.436 0.364 0.132
Authors (2016) Mathematics Random
assignment
Elementary 4, 5
-0.060 0.085 0.007
-0.030 0.068 0.005
-0.020 0.044 0.002
0.000 0.161 0.026
0.000 0.155 0.024
0.020 0.086 0.007
0.030 0.085 0.007
0.040 0.062 0.004
0.050 0.147 0.022
0.080 0.297 0.088
0.080 0.052 0.003
0.100 0.073 0.005
Batiza, Luo, Zhang, Gruhl, Nelson, Hoelzer,
… & Marcey (2016)
Science Random
assignment
High School 9
0.549 0.096 0.009
0.699 0.259 0.067
INSTRUCTIONAL IMPROVEMENT IN STEM
76
Battistich, Alldredge, & Tsuchida (2003) Mathematics Random
assignment
Elementary 2
-0.404 0.601 0.361
-0.363 0.476 0.226
-0.157 0.498 0.248
0.133 0.405 0.164
0.206 0.616 0.379
0.292 0.588 0.346
0.755 0.570 0.324
1.007 0.529 0.280
Berlinski & Busso (2015) Mathematics Random
assignment
Middle
School
7
-0.247 0.081 0.007
-0.171 0.080 0.006
Beuermann, Naslund-Hadley, Ruprah, &
Thompson (2013)
Science Random
assignment
Elementary 3
-0.030 0.060 0.004
0.030 0.070 0.005
0.180 0.080 0.006
Borman, Gamoran & Bowdon (2008) Science Random
assignment
Elementary 4,5
-0.270 0.092 0.008
-0.080 0.104 0.011
0.010 0.085 0.007
Bottge, Ma, Gassaway, Toland, Butler, &
Cho (2014)
Mathematics Random
assignment
Middle
School, High
School
6,7,8, 9,
10, 11
0.040 0.188 0.035
0.390 0.190 0.036
0.440 0.191 0.037
1.000 0.200 0.040
Bottge, Toland, Gassaway, Butler, Cho,
Griffen, & Ma (2015)
Mathematics Random
assignment
Middle
School
6,7,8
-0.220 0.180 0.032
0.100 0.112 0.013
0.150 0.112 0.013
0.260 0.180 0.032
INSTRUCTIONAL IMPROVEMENT IN STEM
77
0.290 0.178 0.032
0.380 0.110 0.012
0.470 0.175 0.031
0.610 0.114 0.013
Bradshaw (2011) Science Random
assignment
Middle
School
6,7,8
0.251 0.131 0.017
0.310 0.130 0.017
Brendefur, Strother, Thiede, Lane, & Surges-
Prokop (2013)
Mathematics Random
assignment
Preschool PK
0.872 0.288 0.083
Brown, Greenfield, Bell, Juárez, Myers, &
Nayfield (2013)
Science Random
assignment
Preschool PK
0.121 0.066 0.004
Carpenter, Fennema, Peterson, Chiang, &
Loef (1989)
Mathematics Random
assignment
Elementary 1
0.202 0.810 0.657
0.426 0.818 0.669
0.452 0.819 0.671
0.453 0.819 0.671
0.511 0.822 0.675
0.545 0.824 0.678
0.551 0.824 0.679
0.692 0.833 0.694
0.747 0.837 0.701
Cervetti, Barber, Dorph, Pearson, &
Goldschmidt (2012)
Science and
Reading
Random
assignment
Elementary 4
0.220 0.073 0.005
0.400 0.102 0.010
0.650 0.083 0.007
Clark, Arens & Stewart (2015) Mathematics Random
assignment
Middle
School
7
0.040 0.153 0.023
0.056 0.153 0.023
Clarke, Smolkowski, Baker, Fien, Doabler, & Mathematics Random Elementary K 0.189 0.099 0.010
INSTRUCTIONAL IMPROVEMENT IN STEM
78
Chard (2011) assignment
0.232 0.097 0.009
Clements & Sarama (2007) Mathematics Random
assignment
Preschool PK
0.851 0.517 0.267
1.464 0.558 0.311
Clements & Sarama (2008) Mathematics Random
assignment
Preschool PK
0.637 0.215 0.046
1.066 0.116 0.013
Clements, Sarama, Spitler, Lange, & Wolfe
(2011)
Mathematics Random
assignment
Preschool,
Elementary
PK, K
0.720 0.099 0.010
Dash, De Kramer, O'dwyer, Masters, &
Russell, (2012)
Mathematics Random
assignment
Elementary 5
0.132 0.068 0.005
DeBarger, Penuel, Moorthy, Beauvineau,
Kennedy, & Boscardin (2016)
Science Quasi-
experimental
Middle
School
Not
specifie
d 0.637 0.249 0.062
Devlin-Sherer, Spinelli, Giamatteo, Johnson,
Mayo-Molina, McGinley, … & Zisk (1998)
Mathematics Quasi-
experimental
Elementary 3, 4, 5
-0.343 0.695 0.482
-0.227 0.692 0.478
-0.136 0.682 0.465
0.297 0.683 0.466
Dominguez, Nicholls, & Storandt (2006) Mathematics Random
assignment
Elementary 3, 4, 5
0.082 0.107 0.012
0.100 0.108 0.012
Eddy & Berry (2006) Mathematics Random
assignment
High School 9, 10,
11, 12 0.006 0.053 0.003
Eddy, Ruitman, Sloper, & Hankel (2010) Mathematics Random
assignment
High School 9, 10,
11 0.175 0.189 0.036
Garet, Wayne, Stancavage, Taylor, Walters,
Song, … & Doolittle (2010)
Mathematics Random
assignment
Middle
School
7
0.020 0.044 0.002
0.050 0.066 0.004
Granger, Bevis, Saka, Southerland, Sampson,
& Tate (2012)
Science Random
assignment
Elementary 4, 5
0.067 0.051 0.003
INSTRUCTIONAL IMPROVEMENT IN STEM
79
0.573 0.089 0.008
Greenleaf, Litman, Hanson, Rosen,
Boscardin, Herman, … & Jones (2011)
Science Random
assignment
High School 9, 10
0.280 0.109 0.012
Gropen, Clark-Chiarelli, Chalufour,
Hoisington, & Eggers-Piérola, (2009)
Science Random
assignment
Preschool PK
0.078 0.112 0.012
0.141 0.112 0.012
0.144 0.112 0.012
0.222 0.112 0.013
Gropen, Clark-Chiarelli, Ehrlich, & Thieu
(2011)
Science Random
assignment
Preschool PK
0.379 0.119 0.014
Hand, Therrien, & Shelley (2013) Science Random
assignment
Elementary,
Middle
School
3, 4, 5,
6
-0.001 0.132 0.018
0.426 0.134 0.018
Harris, Penuel, D'Angelo, Haydel, DeBarger,
Gallagher, Kennedy, … & Krajcik (2015)
Science Random
assignment
Middle
School
6
0.220 0.102 0.010
0.250 0.124 0.015
Heller (2012) Science Random
assignment
Middle
School
8
0.029 0.064 0.004
0.109 0.052 0.003
Heller, Curtis, Rabe-Hesketh, & Verboncoeur
(2007)
Mathematics Random
assignment
Elementary,
Middle
School
2, 4, 6
0.010 0.198 0.039
0.111 0.179 0.032
0.352 0.220 0.048
0.429 0.137 0.019
0.659 0.155 0.024
Heller, Daehler, Wong, Shinohara, & Miratrix
(2012)
Science Random
assignment
Elementary 4
0.010 0.090 0.008
INSTRUCTIONAL IMPROVEMENT IN STEM
80
0.070 0.095 0.009
0.310 0.090 0.008
0.370 0.084 0.007
0.570 0.086 0.007
0.600 0.092 0.008
Heller, Hanson, & Barnett-Clarke (2010) Mathematics Random
assignment
Elementary 4, 5
0.040 0.094 0.009
0.080 0.079 0.006
0.090 0.083 0.007
0.150 0.088 0.008
0.180 0.088 0.008
0.190 0.087 0.008
Hinerman, Hull, Chen, Booker, & Naslund-
Hadley (2014)
Mathematics Random
assignment
Elementary,
Middle
School
K, 1, 2,
3, 4, 5,
6, 7 0.280 0.135 0.018
Jaciw, Hegseth, Ma, & Lai (2012) Mathematics Random
assignment
Elementary 3, 4, 5
0.050 0.082 0.007
0.120 0.061 0.004
0.140 0.085 0.007
Jacobs, Franke, Carpenter, Levi, & Battey
(2007)
Mathematics Random
assignment
Elementary 1, 2, 3,
4, 5 0.277 0.197 0.039
0.316 0.230 0.053
0.348 0.187 0.035
0.766 0.339 0.115
0.830 0.256 0.065
0.841 0.233 0.054
1.016 0.348 0.121
Jerrim & Vignoles (2015) Mathematics Random
assignment
Elementary,
Middle
School
UK1, 7
0.055 0.046 0.002
0.099 0.054 0.003
Kaldon & Zoblotsky (2014) Science Random Elementary, 3 0.020 0.038 0.001
INSTRUCTIONAL IMPROVEMENT IN STEM
81
assignment Middle
School
0.040 0.027 0.001
Kim, Van Tassel-Baska, Bracken, Feng,
Stambaugh, & Bland (2012)
Science Random
assignment
Elementary K, 1, 2,
3 0.168 0.184 0.034
Kinzie, Whittaker, Williford, Decoster,
Mcguire, Lee, & Kilday (2014)
Mathematics
and Science
Random
assignment
Preschool PK
-0.076 0.060 0.004
-0.025 0.039 0.002
-0.010 0.056 0.003
-0.002 0.062 0.004
0.016 0.204 0.042
0.062 0.067 0.005
0.070 0.076 0.006
0.073 0.062 0.004
0.081 0.037 0.001
0.131 0.056 0.003
Kisker, Lipka, Adams, Rickard, Andrew-
Ihrke, Yanez, & Millard (2012)
Mathematics Random
assignment
Elementary 2
0.390 0.118 0.014
0.819 0.157 0.025
Klein, Starkey, Clements, Sarama, & Iyer
(2008)
Mathematics Random
assignment
Preschool PK
0.229 0.178 0.032
0.549 0.180 0.033
Klein, Starkey, DeFlorio, & Brown (2012) Mathematics Random
assignment
Preschool,
Elementary
PK, K
0.331 0.118 0.014
0.450 0.114 0.013
0.699 0.127 0.016
0.829 0.118 0.014
Lafferty (1994) Mathematics Quasi-
experimental
Middle
School
0.428 0.345 0.119
Lanehart, Borman, Boydston, Cotner, & Lee
(2010)
Science Random
assignment
Elementary 3,4,5
0.188 0.063 0.004
INSTRUCTIONAL IMPROVEMENT IN STEM
82
Lang, Schoen, LaVenia, & Oberlin (2014) Mathematics Random
assignment
Elementary K, 1
0.200 0.090 0.008
0.240 0.080 0.006
Lara-Alecio, Tong, Irby, Guerrero, Huerta, &
Fan (2012)
Science Quasi-
experimental
Elementary 5
-0.080 0.306 0.094
0.052 0.306 0.094
0.361 0.311 0.097
0.394 0.312 0.097
0.396 0.312 0.097
Lehrer (2015) Mathematics Random
assignment
Middle
School
6
0.418 0.097 0.009
0.568 0.082 0.007
Lewis & Perry (2015) Mathematics Random
assignment
Elementary 2, 3, 4,
5 0.496 0.136 0.018
Llorente, Pasknik, Moorthy, Hupert,
Rosenfeld, & Gerard (2015)
Mathematics Random
assignment
Preschool PK
0.000 0.078 0.006
0.010 0.039 0.002
0.150 0.081 0.007
0.240 0.048 0.002
Llosa, Lee, Jiang, Haas, O’Connor,
Van Booven, & Kieffer (2016)
Science Random
assignment
Elementary 5
0.150 0.051 0.003
0.250 0.113 0.013
Maerten-Rivera, Ahn, Lanier, Diaz, & Lee
(2014)
Science Random
assignment
Elementary 5
-0.011 0.151 0.023
0.136 0.155 0.024
0.170 0.155 0.024
Martin, Braisel, & Turner, & Wise (2012) Mathematics Random
assignment
Middle
School
6
0.020 0.071 0.005
0.090 0.069 0.005
McCoach, Gubbins, Foreman, Rubenstein, &
Rambo-Hernandez (2014)
Mathematics Random
assignment
Elementary 3
0.030 0.077 0.006
INSTRUCTIONAL IMPROVEMENT IN STEM
83
Miller, Jaciw, & Ma (2007) Science Random
assignment
Elementary 3,4,5
-0.020 0.040 0.002
Montague, Krawec, Enders, & Dietz (2014) Mathematics Random
assignment
Middle
School
7
0.613 0.250 0.063
Mutch-Jones, Puttick, & Demers (2014) Science Random
assignment
Elementary
and Middle
School
5, 6, 7,
8
0.082 0.039 0.001
Newman, Finney, Bell, Turner, Jaciw,
Zacamy, & Gould (2012)
Mathematics Random
assignment
Elementary
and Middle
School
4,5,6,7,
8
0.048 0.015 0.000
0.050 0.028 0.001
Niess (2005) Mathematics Random
assignment
Elementary
and Middle
School
3, 4, 5,
6, 7, 8
0.129 0.185 0.034
0.362 0.212 0.045
Oh, Lachapelle, Shams, Hertel, &
Cunningham (2016)
Science Random
assignment
Elementary Not
specifie
d -0.070 0.111 0.012
-0.030 0.125 0.016
0.010 0.070 0.005
0.030 0.072 0.005
0.070 0.092 0.008
0.100 0.058 0.003
0.110 0.134 0.018
0.140 0.081 0.006
0.170 0.072 0.005
0.180 0.080 0.006
0.200 0.124 0.015
0.220 0.077 0.006
Pane, Griffin, Mccaffrey, Karam (2014) Mathematics Random
assignment
High School 6, 7, 8,
9, 10, -0.100 0.100 0.010
INSTRUCTIONAL IMPROVEMENT IN STEM
84
11, 12
-0.030 0.110 0.012
0.190 0.130 0.017
0.220 0.090 0.008
Penuel, Gallagher & Moorthy (2011) Science Random
assignment
Middle
School
6,7,8
0.180 0.183 0.034
0.290 0.114 0.013
0.340 0.124 0.015
Piasta, Logan, Pelatti, Capps, & Petrill (2015) Mathematics
and Science
Random
assignment
Preschool PK
-0.080 0.307 0.094
-0.010 0.332 0.110
0.080 0.301 0.091
0.130 0.266 0.071
Presser, Clements, Ginsburg, & Ertle (2015) Mathematics Random
assignment
Preschool,
Elementary
PK, K
0.318 0.247 0.061
0.319 0.147 0.022
0.491 0.251 0.063
0.555 0.253 0.064
Presser, Vahey, & Dominguez (2015) Mathematics Random
assignment
Preschool PK
0.508 0.228 0.052
Pyke, Lynch, Kuipers, Szesze, & Driver
(2004)
Science Random
assignment
Middle
School
8
0.320 0.166 0.028
0.383 0.165 0.027
Pyke, Lynch, Kuipers, Szesze, & Watson
(2006)
Science Random
assignment
Middle
School
6, 7, 8
0.230 0.318 0.101
Pyke, Lynch, Kuipers, Szesze, & Watson
(2005)
Science Random
assignment
Middle
School
7
-0.216 0.237 0.056
Reid, Chen, & McCray (2014) Mathematics Quasi-
experimental
Preschool,
Elementary
PK, K
0.060 0.079 0.006
Resendez & Azin (2006) Mathematics Random
assignment
Elementary 3,5
-0.070 0.057 0.003
INSTRUCTIONAL IMPROVEMENT IN STEM
85
0.050 0.086 0.007
Resendez & Azin (2008) Mathematics Random
assignment
Elementary 2, 3, 4,
5 0.200 0.063 0.004
0.210 0.021 0.000
0.240 0.102 0.011
Rimbey (2013) Mathematics Random
assignment
Elementary 2
0.245 0.323 0.104
Roschelle, Shechtman, Tatar, Hegedus,
Hopkins, Empson, … & Gallagher (2010)
Mathematics Random
assignment
Middle
School
7,8
0.379 0.063 0.004
0.606 0.059 0.003
Roth, Wilson, Taylor, Hvidsten, Stennett,
Wickler, … & Bintz (2015)
Science Random
assignment
Elementary 4,5
0.680 0.041 0.002
San Antonio, Morales, & Moral (2011) Mathematics Random
assignment
Middle
School
6
0.268 0.280 0.078
Santagata, Kersting, Givvin, & Stigler (2010) Mathematics Random
assignment
Middle
School
6
-0.022 0.134 0.018
Sarama, Clements, Starkey, Klein, &
Wakeley (2008)
Mathematics Random
assignment
Preschool PK
0.618 0.166 0.028
Saxe, Gearhart, & Nasir (2001) Mathematics Random
assignment
Elementary 4, 5, 6
0.770 1.251 1.565
1.450 1.366 1.867
Schneider (2013) Mathematics Random
assignment
Middle
School, High
School
8, 9
0.298 0.135 0.018
0.400 0.184 0.034
Schneider & Meyer (2012) Mathematics Random
assignment
Middle
School
6
0.030 0.049 0.002
Schwartz‐Bloom & Halpin (2003) Science Random
assignment
High School 9, 10,
11, 12 0.380 0.138 0.019
0.580 0.145 0.021
Shannon & Grant (2012) Science Random
assignment
High School 9, 10,
11, 12 0.060 0.083 0.007
INSTRUCTIONAL IMPROVEMENT IN STEM
86
0.120 0.097 0.009
Sophian (2004) Mathematics Quasi-
experimental
Preschool PK
0.406 0.431 0.185
0.794 0.442 0.196
Supovitz (2013) Mathematics Random
assignment
Elementary 1, 2, 3,
4, 5 0.001 0.046 0.002
0.035 0.046 0.002
0.059 0.046 0.002
Star, Pollack, Durkin, Rittle-Johnson, Lynch,
Newton, & Gogolen (2015)
Mathematics Random
assignment
Middle
School, High
School
8, 9
-0.058 0.120 0.014
0.047 0.134 0.000
Starkey, Klein, & DeFlorio (2013) Mathematics Random
assignment
Preschool PK
0.589 0.165 0.027
1.078 0.173 0.030
Tatar, Roschelle, Knudsen, Shechtman,
Kaput, & Hopkins (2008)
Mathematics Random
assignment
Middle
School
7, 8
1.249 0.735 0.540
Tauer (2002) Mathematics Quasi-
experimental
High School 10
0.208 0.484 0.234
Taylor, Getty, Kowalski, Wilson, Carlson, &
Van Scotter (2015)
Science Random
assignment
High School 9, 10
0.090 0.040 0.002
Taylor, Roth, Wilson, Stuhlsatz, & Tipton
(2016)
Science Random
assignment
Elementary 4, 5
0.520 0.071 0.005
Thompson, Senk, & Yu (2012) Mathematics Quasi-
experimental
Middle
School
7
-0.125 0.247 0.061
-0.086 0.247 0.061
0.199 0.248 0.061
Vaden-Kiernan, Borman, Caverly, Bell, Ruiz
de Castilla, Sullivan, & Rodriguez (2016)
Mathematics Random
assignment
Elementary K, 1, 3,
4 -0.069 0.036 0.001
0.059 0.035 0.001
INSTRUCTIONAL IMPROVEMENT IN STEM
87
Van Egeren, Schwarz, Gerde, Morris, Pierce,
Brophy-Herb, … & Stoddard (2014)
Science Random
assignment
Preschool PK
-0.269 0.193 0.037
-0.030 0.192 0.037
0.017 0.192 0.037
0.150 0.192 0.037
0.251 0.193 0.037
Walsh-Cavazos (1994) Mathematics Quasi-
experimental
Elementary 5
0.554 0.439 0.193
INSTRUCTIONAL IMPROVEMENT IN STEM
88
Table A2
Estimated mean effect sizes based on unconditional RVE meta-regression models: Differences by
presence of attrition problems
Subgroup analysis
(Unconditional RVE
model)
Attrition problemsa
Attrition problem at the cluster level 0.180***
Differential attrition between treatment and control groups 0.131***
No attrition problems 0.243***
Note: Table includes estimated mean effect sizes from unconditional RVE models. Mean effect
sizes are based on subgroup analyses where unconditional RVE meta-regression models were
estimated using only effect sizes with (or without) attrition problems. a In some studies, attrition problems were only present for some outcomes (e.g., in studies with
multiple treatment arms, there may have been differential attrition between the control group and
one treatment arm, but not between the control group and a different treatment arm). In these
cases, only effect sizes from contrasts with (or without) the relevant attrition problems were
included. Studies may be included in more than one row if studies included had both cluster-
level attrition and differential attrition between treatment and control groups.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
89
Table A3
Results of publication bias tests among full sample, peer-reviewed, and non-peer-reviewed
studies
Full sample Peer-reviewed
studies
Non-peer-reviewed
studies
Egger’s regression test
Precision 0.034 0.006 0.064
(0.031) (0.035) (0.051)
[0.276] [0.866] [0.212]
Intercept 1.488 1.875 1.052
(0.389) (0.434) (0.657)
[<0.001] [<0.001] [0.116]
N studies 95 46 49
Average impact,
based on random
effects meta-analysis
0.207*** 0.227*** 0.189***
Average impact, trim-
and-fill results based
on random effects
meta-analysis
0.085*** 0.122*** 0.189***
Estimating meta-regression model using RVE, including effect size standard error as a
predictor
Standard error 1.078 1.412 0.691
(0.219) (0.258) (0.329)
[<0.001] [<0.001] [0.050]
Intercept 0.085 0.065 0.111
(0.035) (0.043) (0.050)
[0.015] [0.143] [0.037]
N effect sizes 258 129 129
N studies 95 46 49
Note: Standard errors in parentheses. p-values in brackets. Models in the first column include all
studies. Models in second column include only peer-reviewed studies. Models in the third
column include only non-peer-reviewed studies.
Top panel: Observations are at the study level. Outcomes are study-level average standard
normal deviations (the study-average effect size divided by the study-average standard error).
Egger’s regression test carried out using the metabias package in Stata 15; trim-and-fill method
carried out using the metatrim package in Stata 15.
Bottom panel: Observations are at the effect size level. Outcomes are effect sizes.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
90
Figure A1. Funnel plot of included effect sizes.
0.5
11
.5S
tan
da
rd e
rro
r
-4 -2 0 2 4Effect size (Hedges's g)
Peer-reviewed studies
Funnel plot for impacts on math, science achievement
0.5
11
.5S
tan
da
rd e
rro
r
-4 -2 0 2 4Effect size (Hedges's g)
All studies
Funnel plot for impacts on math, science achievement0
.2.4
.6.8
Sta
nd
ard
err
or
-1 0 1 2Effect size (Hedges's g)
Non-peer-reviewed studies
Funnel plot for impacts on math, science achievement
INSTRUCTIONAL IMPROVEMENT IN STEM
91
Online Appendix B:
References for Included Studies
*Agodini, R., Harris, B., Remillard, J., & Thomas, M. (2013). After two years, three elementary
math curricula outperform a fourth. Washington, DC: National Center for Education
Evaluation and Regional Assistance.
*Argentin, G., Pennisi, A., Vidoni, D., Abbiati, G., & Caputo, A. (2014). Trying to raise (low)
math achievement and to promote (rigorous) policy evaluation in Italy: Evidence from a
large-scale randomized trial. Evaluation Review, 38(2), 99-132.
*Arnold, D. H., Fisher, P. H., Doctoroff, G. L., & Dobbs, J. (2002). Accelerating math
development in Head Start classrooms. Journal of Educational Psychology, 94(4), 762-
770.
**Batiza, A. F., Gruhl, M., Zhang, B., Harrington, T., Roberts, M., LaFlamme, D., ... &
Hagedorn, E. (2013). The effects of the SUN project on teacher knowledge and self-
efficacy regarding biological energy transfer are significant and long-lasting: Results of a
randomized controlled trial. CBE-Life Sciences Education, 12(2), 287-305.
*Batiza, A., Luo, W., Zhang, B., Gruhl, M., Nelson, D., Hoelzer, M., ... & LaFlamme, D. (2016,
March). Regular Biology Students Learn Like AP Students with SUN. Paper presented at
the Society for Research on Educational Effectiveness Spring Conference, Washington,
DC.
*Battistich, V., Alldredge, S. and Tsuchida, I. (2003). Number power: An elementary school
program to enhance students' mathematical and social development. In S.L. Senk, & D.R.
Thompson (Eds.) Standards-based school mathematics curricula: What are they? What
do students learn? (133–160). Mahwah, NJ: Erlbaum.
INSTRUCTIONAL IMPROVEMENT IN STEM
92
**Berlinski, S.G., & Busso, M. (2013). Pedagogical change in mathematics teaching: Evidence
from a randomized control trial. Retrieved from
http://www.webmeets.com/files/papers/lacea-lames/2013/438/BB2013_september.pdf.
*Berlinski, S.G.; Busso, M. (2015). Challenges in Educational Reform: An Experiment on Active
Learning in Mathematics (IDB Working Paper Series, No. IDB-WP-561). Washington,
DC: Inter-American Development Bank. Retrieved from
http://hdl.handle.net/11319/6825
*Beuermann, D. W., Naslund-Hadley, E., Ruprah, I. J., & Thompson, J. (2013). The pedagogy of
science and environment: Experimental evidence from Peru. The Journal of Development
Studies, 49(5), 719-736.
*Borman, K. M., Cotner, B. A., Lee, R. S., Boydston, T. L., & Lanehart, R. (2009, March).
Improving Elementary Science Instruction and Student Achievement: The Impact of a
Professional Development Program. Paper presented at the Society for Research on
Educational Effectiveness Spring Conference, Washington, DC.
*Borman, G. D., Gamoran, A., & Bowdon, J. (2008). A randomized trial of teacher development
in elementary science: First-year achievement effects. Journal of Research on
Educational Effectiveness, 1(4), 237-264.
*Bottge, B. A., Ma, X., Gassaway, L., Toland, M. D., Butler, M., & Cho, S. J. (2014). Effects of
blended instructional models on math performance. Exceptional children, 80(4), 423-437.
*Bottge, B. A., Toland, M. D., Gassaway, L., Butler, M., Choo, S., Griffen, A. K., & Ma, X.
(2015). Impact of enhanced anchored instruction in inclusive math classrooms.
Exceptional Children, 81(2), 158-175.
INSTRUCTIONAL IMPROVEMENT IN STEM
93
*Bradshaw, T. J. (2012). Impact of inquiry based distance learning and availability of classroom
materials on physical science content knowledge of teachers and students in Central
Appalachia (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses
Global. (Order no. 1506611403) http://search.proquest.com.ezp-
prod1.hul.harvard.edu/docview/1506611403?accountid=11311
*Brendefur, J., Strother, S., Thiede, K., Lane, C., & Surges-Prokop, M. J. (2013). A professional
development program to improve math skills among preschool children in Head Start.
Early Childhood Education Journal, 41(3), 187-195.
*Brown, J. A., Greenfield, D. B., Bell, E., Juárez, C. L., Myers, T., & Nayfeld, I. (2013, March).
ECHOS: Early Childhood Hands-On Science efficacy study. Paper presented at the
Society for Research on Educational Effectiveness Spring Conference, Washington, DC.
*Carpenter, T. P., Fennema, E., Peterson, P. L., Chiang, C. P., & Loef, M. (1989). Using
knowledge of children’s mathematics thinking in classroom teaching: An experimental
study. American Educational Research Journal, 26(4), 499-531.
*Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact
of an integrated approach to science and literacy in elementary school classrooms.
Journal of Research in Science Teaching, 49(5), 631-658.
*Clark, T. F., Arens, S. A., & Stewart, J. (2015, March). Efficacy study of a pre-algebra
supplemental program in rural Mississippi: Preliminary findings. Paper presented at the
Society for Research on Educational Effectiveness Spring Conference, Washington, DC.
*Clarke, B., Smolkowski, K., Baker, S. K., Fien, H., Doabler, C. T., & Chard, D. J. (2011). The
impact of a comprehensive Tier I core kindergarten program on the achievement of
students at risk in mathematics. The Elementary School Journal, 111(4), 561-584.
INSTRUCTIONAL IMPROVEMENT IN STEM
94
*Clements, D. H., & Sarama, J. (2007). Effects of a preschool mathematics curriculum:
Summative research on the Building Blocks project. Journal for Research in
Mathematics Education, 38(2), 136-163.
*Clements, D. H., & Sarama, J. (2008). Experimental evaluation of the effects of a research-
based preschool mathematics curriculum. American Educational Research Journal,
45(2), 443-494.
*Clements, D. H., Sarama, J., Spitler, M. E., Lange, A. A., & Wolfe, C. B. (2011). Mathematics
learned by young children in an intervention based on learning trajectories: A large-scale
cluster randomized trial. Journal for Research in Mathematics Education, 42(2), 127–
166.
*Dash, S., De Kramer, R.M., O’Dwyer, L. M., Masters, J., & Russell, M. (2012). Impact of
online professional development or teacher quality and student achievement in fifth grade
mathematics. Journal of research on Technology in Education, 45(1), 1-26.
*Debarger, A. H., Penuel, W. R., Moorthy, S., Beauvineau, Y., Kennedy, C. A., & Boscardin, C.
K. (2017). Investigating purposeful science curriculum adaptation as a strategy to
improve teaching and learning. Science Education, 101(1), 66-98.
*Dominguez, P. S., Nicholls, C., & Storandt, B. (2006). Experimental methods and results in a
study of PBS TeacherLine math courses. Syracuse, NY: Hezel Associates. Retrieved from
https://files.eric.ed.gov/fulltext/ED510045.pdf
*Eddy, R. M., & Berry, T. (2006). A randomized control trial to test the effects of Prentice
Hall’s Miller and Levine (2006) Biology curriculum on student performance: Final
report. Claremont, CA: Claremont Graduate University.
INSTRUCTIONAL IMPROVEMENT IN STEM
95
*Eddy, R. M., Ruitman, H. T., Sloper, M., & Hankel, N. (2010). The effects of Miller & Levine
Biology (2010) on student performance: Final report. LaVerne, CA: Cobblestone
Applied Research & Evaluation, Inc. Retrieved from
http://assets.pearsonschool.com/asset_mgr/pending/2013-
09/FinalEfficacyReport_ML%20Bio.pdf
*Garet, M. S., Wayne, A. J., Stancavage, F., Taylor, J., Eaton, M., Walters, K., ... & Sepanik, S.
(2011). Middle school mathematics professional development impact study: Findings
after the second year of implementation. Washington, DC: National Center for Education
Evaluation and Regional Assistance.
*Granger, E. M., Bevis, T. H., Saka, Y., Southerland, S. A., Sampson, V., & Tate, R. L. (2012).
The efficacy of student-centered instruction in supporting science learning. Science,
338(6103), 105-108.
*Gropen, J., Clark-Chiarelli, N., Chalufour, I., Hoisington, C., & Eggers-Pierola, C. (2009,
March). Creating a successful professional development program in science for Head
Start teachers and children: Understanding the relationship between development,
intervention, and evaluation. Paper presented at the Society for Research on Educational
Effectiveness Spring Conference Washington, DC.
*Gropen, J., Clark-Chiarelli, N., Ehrlich, S., & Thieu, Y. (2011, March). Examining the efficacy
of Foundations of Science Literacy: Exploring contextual factors. Paper presented at the
Society for Research on Educational Effectiveness Spring Conference, Washington, DC.
**Hand, B., Norton-Meier, L.A., Gunel, M., & Akkus, R. (2016). Aligning teaching to learning:
A 3-year study examining the embedding of language and argumentation into elementary
INSTRUCTIONAL IMPROVEMENT IN STEM
96
science classrooms. International Journal of Science and Mathematics Education, 14(5),
847-863.
*Hand, B., Therrien, W., & Shelley, M. (2013, March). Examining the impact of using the
science writing heuristic approach in learning science: A cluster randomized study.
Paper presented at the Society for Research on Educational Effectiveness Spring
Conference, Washington, D.C.
*Harris, C. J., Penuel, W. R., D'Angelo, C. M., DeBarger, A. H., Gallagher, L. P., Kennedy, C.
A., ... & Krajcik, J. S. (2015). Impact of project‐based curriculum materials on student
learning in science: Results of a randomized controlled trial. Journal of Research in
Science Teaching, 52(10), 1362-1385.
*Harris, C. J., Penuel, W. R., DeBarger, A., D’Angelo, C., & Gallagher, L. P. (2014).
Curriculum Materials make a difference for next generation science learning: Results
from year 1 of a randomized control trial. Menlo Park, CA: SRI International.
* Heller, J. I. (2012). Effects of Making Sense of SCIENCE™ professional development on the
achievement of middle school students, including English language learners. (NCEE
2012-4002). Washington, DC: National Center for Education Evaluation and Regional
Assistance, Institute of Education Sciences, U.S. Department of Education.
* Heller, J. I., Curtis, D. A., Rabe-Hesketh, S., & Verboncoeur, C. (2007). The effects of Math
Pathways and Pitfalls on students’ mathematics achievement: National Science
Foundation final report. Washington, DC: U.S. Department of Education, National
Center for Education Statistics. http://eric. ed.gov/?id=ED498258
*Heller, J. I., Daehler, K. R., Wong, N., Shinohara, M., & Miratrix, L. W. (2012). Differential
effects of three professional development models on teacher knowledge and student
INSTRUCTIONAL IMPROVEMENT IN STEM
97
achievement in elementary science. Journal of Research in Science Teaching, 49(3), 333-
362.
*Heller, J., Hanson, T., Barnett-Clarke, C. (2010). The impact of Math Pathways & Pitfalls on
students’ mathematics achievement and mathematical language development: A study
conducted in schools with high concentrations of Latino/a student and English learners.
Oakland, CA: WestEd. Retrieved from https://mpp.wested.org/wp-
content/uploads/2013/05/mpp-ies-report.pdf
*Hinerman, K. M., Hull, D. M., Chen, Q., Booker, D. D., & Naslund-Hadley, E. I. (2014,
March). Teacher-led math inquiry in Belize: A cluster randomized trial. Paper presented
at the Society for Research on Educational Effectiveness Spring Conference,
Washington, DC.
*Jaciw, A. P., Hegseth, W., Ma, B., & Lai, G. (2012). Assessing impacts of Math in Focus, a
‘Singapore Math’ program for American schools: A report of findings from a randomized
control trial. Palo Alto, CA: Empirical Education.
**Jaciw, A. P., Hegseth, W., & Toby, M. (2015, March). Assessing impacts of" Math in Focus,"
a ‘Singapore Math’ program for American schools. Paper presented at the Society for
Research on Educational Effectiveness Spring Conference, Washington, DC.
*Jacobs, V. R., Franke, M. L., Carpenter, T. P., Levi, L., & Battey, D. (2007). Professional
development focused on children's algebraic reasoning in elementary school. Journal for
Research in Mathematics Education, 38(3), 258-288.
*Jerrim, J., & Vignoles, A. (2015). The causal effect of East Asian “mastery” teaching methods
on English children’s mathematics skills. UCL Institute of Education, University College
London.
INSTRUCTIONAL IMPROVEMENT IN STEM
98
*Kaldon, C. R., & Zoblotsky, T. A. (2014, March). A randomized controlled trial validating the
impact of the LASER model of science education on student achievement and teacher
instruction. Paper presented at the Society for Research on Educational Effectiveness
Spring Conference, Washington, DC.
*Kim, K. H., VanTassel-Baska, J., Bracken, B. A., Feng, A., Stambaugh, T., & Bland, L. (2012).
Project Clarion: Three years of science instruction in Title I schools among K – third-
grade students. Research in Science Education, 42(83), 1-17.
*Kinzie, M. B., Whittaker, J. V., Williford, A. P., DeCoster, J., McGuire, P., Lee, Y., & Kilday,
C. R. (2014). MyTeachingPartner-Math/Science pre-kindergarten curricula and teacher
supports: Associations with children's mathematics and science learning. Early
Childhood Research Quarterly, 29(4), 586-599.
*Kisker, E. E., Lipka, J., Adams, B. L., Rickard, A., Andrew-Ihrke, D., Yanez, E. E., & Millard,
A. (2012). The potential of a culturally based supplemental mathematics curriculum to
improve the mathematics performance of Alaska Native and other students. Journal for
Research in Mathematics Education, 43(1), 75-113.
*Klein, A., Starkey, P., Clements, D., Sarama, J., & Iyer, R. (2008). Effects of a pre-kindergarten
mathematics intervention: A randomized experiment. Journal of Research on
Educational Effectiveness, 1(3), 155-178.
*Klein, A., Starkey, P., Deflorio, L., Brown, E.T., (2012, March). Establishing and sustaining an
effective pre-kindergarten math intervention at scale. Paper presented at the Society for
Research in Educational Effectiveness Spring Conference, Washington, DC.
INSTRUCTIONAL IMPROVEMENT IN STEM
99
*Lafferty, J. F. (1994). The links among mathematics text, students’ achievement, and students’
mathematics anxiety: A comparison of the incremental development and traditional texts.
Unpublished doctoral dissertation, Widener University, Wilmington, DE.
*Lanehart, R.E., Borman, K.M., Boydston, T. L., Cotner, B. A., & Lee, R. S. (2010, March).
Improving gender, racial, and social equity in elementary science instruction and student
achievement: The impact of a professional development program. Paper presented
Society for Research on Educational Effectiveness Spring Conference, Washington, DC.
*Lang, L. B., Schoen, R. R., LaVenia, M., & Oberlin, M. (2014). Mathematics Formative
Assessment system – Common Core State Standards: A randomized field trial in
kindergarten and first grade. Paper presented at the Society for Research on Educational
Effectiveness Spring Conference, Washington, DC.
*Lara‐Alecio, R., Tong, F., Irby, B. J., Guerrero, C., Huerta, M., & Fan, Y. (2012). The effect of
an instructional intervention on middle school English learners' science and English
reading achievement. Journal of Research in Science Teaching, 49(8), 987-1011.
*Lehrer, R. (2010). Assessing data modeling and statistical reasoning project: IES final report.
Washington, DC: Institute for Educational Sciences.
**Lewis, C.C., & Perry, R. (2014). Lesson study with mathematical resources: A sustainable
model for locally-led teacher professional learning. Mathematics Teacher Education and
Development, 16(1).
**Leuzzi, A. & Pennisi, A. (no date). Learning about the effectiveness of EU structural fund
projects in Italian schools: A large-scale RCT for math teachers. [Presentation slides].
Retrieved from http://www.esparama.lt/c/document_library/get_file?uuid=cbebbfe8-
e2a9-4486-8422-fbded985dfca&groupId=19002
INSTRUCTIONAL IMPROVEMENT IN STEM
100
*Lewis, C. C., & Perry, R. R. (2015). A randomized trial of lesson study with mathematical
resource kits: Analysis of impact on teachers’ beliefs and learning community. In
Middleton, J.A., Cai, J., & Hwang, S. (Eds). Large-scale studies in mathematics
education (133-158). New York, NY: Springer International Publishing.
*Llorente, C., Pasnik, S., Moorthy, S., Hupert, N., Rosenfeld, D., & Gerard, S. (2015, March).
Preschool teachers can use a PBS KIDS Transmedia Curriculum Supplement to support
young children's mathematics learning: Results of a randomized controlled trial. Paper
presented at the Society for Research on Educational Effectiveness Spring Conference,
Washinton, DC.
*Llosa, L., Lee, O., Jiang, F., Haas, A., O’Connor, C., Van Booven, C. D., & Kieffer, M. J.
(2016). Impact of a large-scale science intervention focused on English language
learners. American Educational Research Journal, 53(2), 395-424.
*Maerten-Rivera, J., Ahn, S., Lanier, K., Diaz, J., & Lee, O. (2016). Effect of a multiyear
intervention on science achievement of all students including English language learners.
The Elementary School Journal, 116(4), 600-624.
*Martin, T., Brasiel, S. J., Turner, H., & Wise, J. C. (2012). Effects of the Connected
Mathematics Project 2 (CMP2) on the mathematics achievement of Grade 6 students in
the Mid-Atlantic region: Final report. Washington, DC: National Center for Education
Evaluation and Regional Assistance.
*McCoach, D. B., Gubbins, E. J., Foreman, J., Rubenstein, L. D., & Rambo-Hernandez, K. E.
(2014). Evaluating the efficacy of using pre-differentiated and enriched mathematics
curricula for grade 3 students: A multisite cluster-randomized trial. Gifted Child
Quarterly, 58(4), 272-286.
INSTRUCTIONAL IMPROVEMENT IN STEM
101
*Miller, G., Jaciw, A., Ma, B., & Wei, X. (2007). Comparative effectiveness of Scott Foresman
Science: A report of randomized experiments in five school districts. Palo Alto, CA:
Empirical Education.
*Montague, M., Krawec, J., Enders, C., & Dietz, S. (2014). The effects of cognitive strategy
instruction on math problem solving of middle-school students of varying ability. Journal
of Educational Psychology, 106(2), 469.
*Mutch-Jones, Puttick, & Demers (2014, April). Differentiating science instruction: Teacher and
student findings from the Accessing Science Ideas project. Poster presented at the
American Educational Research Association Annual Meeting, Philadelphia, PA.
* Newman,D., Finney, P.B., Bell, S., Turner, H., Jaciw, A.P., Zacamy, J.L., & Feagans Gould, L.
(2012). Evaluation of the Effectiveness of the Alabama Math, Science, and Technology
Initiative (AMSTI). (NCEE 2012–4008). Washington, DC: National Center for Education
Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of
Education. Retrieved from http://dx.doi.org/10.2139/ssrn.2511347.
*Oh, Y., Lachapelle, C. P., Shams, M. F., Hertel, J. D., & Cunningham, C. M. (2016, April).
Evaluating the efficacy of Engineering is Elementary for student learning of engineering
and science concepts. Paper presented at the American Educational Research Association
Annual Meeting, Washington, DC.
*Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam, R. (2014). Effectiveness of Cognitive
Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127-144.
*Penuel, W. R., Bates, L., Pasnik, S., Townsend, E., Gallagher, L. P., Llorente, C., & Hupert, N.
(2010, June). The impact of a media-rich science curriculum on low-income preschoolers'
INSTRUCTIONAL IMPROVEMENT IN STEM
102
science talk at home. In Proceedings of the 9th International Conference of the Learning
Sciences, (238-245). International Society of the Learning Sciences, Chicago, IL.
*Penuel, W. R., Gallagher, L. P., & Moorthy, S. (2011). Preparing teachers to design sequences
of instruction in earth systems science: A comparison of three professional development
programs. American Educational Research Journal, 48(4), 996-1025.
*Piasta, S. B., Logan, J. A., Pelatti, C. Y., Capps, J. L., & Petrill, S. A. (2015). Professional
development for early childhood educators: Efforts to improve math and science learning
opportunities in early childhood classrooms. Journal of Educational Psychology, 107(2),
407-422.
*Presser, A. L., Clements, M., Ginsburg, H., & Ertle, B. (2012). Effects of a preschool and
kindergarten mathematics curriculum: Big Math for Little Kids. New York, NY: Center
for Children and Technology, Retrieved from
http://cct.edc.org/sites/cct.edc.org/files/publications/BigMathPaper_Final.pdf
*Presser, A. L., Vahey, P., & Dominguez, X. (2015, March). Improving mathematics learning by
integrating curricular activities with innovative and developmentally appropriate digital
apps: Findings from the Next Generation Preschool Math Evaluation. Paper presented at
the Society for Research on Educational Effectiveness Spring Conference, Washington,
DC.
*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Driver, H. (2004). Implementation study of
Chemistry that Applies (2002-2003): SCALE-uP Report No. 2. George Washington
University, Washington, DC.
INSTRUCTIONAL IMPROVEMENT IN STEM
103
*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Watson, W. (2006). Implementation study of
Exploring Motion and Forces (2005-2006): SCALE-uP Report No. 13. George
Washington University, Washington, DC.
*Pyke, C., Lynch, S., Kuipers, J., Szesze, M., & Watson, W. (2005). Implementation study of
The Real Reasons for Seasons (2004-2005): SCALE-uP Report No. 7. George
Washington University, Washington, DC.
*Reid, E. E., Chen, J. Q., & McCray, J. (March, 2014). Achieving high standards for pre-K –
Grade 3 mathematics: A whole teacher approach to professional development. Paper
presented at the Society for Research Educational Effectiveness Spring Conference,
Washington, DC.
*Resendez, M., and Azin, M., (2006). 2005 Prentice Hall Science Explorer randomized control
trial. Jackson, WY: PRES Associates, Inc.
*Resendez, M., and Azin, M. (2008). A study of the effects of Pearson’s 2009 enVision Math
Program. Jackson, WY: PRES Associates, Inc.
*Rethinam, V., Pyke, C., Lynch, S., Pyke, C., & Lynch, S. (2008). Using multilevel analyses to
study the effectiveness of science curriculum materials. Evaluation & Research in
Education, 21(1), 18-42.
*Rimbey, K. A. (2013). From the Common Core to the classroom: A professional development
efficacy study for the common core state standards for mathematics (Doctoral
dissertation). Retrieved from Pro Quest. (Order No. 3560379).
**Roschelle, J., Shechtman, N., Tatar, D., Hegedus, S., Hopkins, B., Empson, S., ... & Gallagher,
L. P. (2010). Integration of technology, curriculum, and professional development for
INSTRUCTIONAL IMPROVEMENT IN STEM
104
advancing middle school mathematics: Three large-scale studies. American Educational
Research Journal, 47(4), 833-878.
*Roth, K., Wilson, C., Taylor, J., Hvidsten, C., Stennett, B., Wickler, N., ... & Bintz, J. (2015,
March). Testing the consensus model of effective PD: Analysis of practice and the PD
research terrain. Paper presented at the International Conference of the National
Association of Science Teacher Researchers, Chicago, IL.
*San Antonio, D. M., Morales, N. S., & Moral, L. S. (2011). Module‐based professional
development for teachers: a cost‐effective Philippine experiment. Teacher development,
15(2), 157-169.
*Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as
a lever for change: An experimental study of the effects of a professional development
program on students’ mathematics learning. Journal of Research on Educational
Effectiveness, 4(1), 1-24.
*Sarama, J., Clements, D. H., Starkey, P., Klein, A., & Wakeley, A. (2008). Scaling up the
implementation of a pre-kindergarten mathematics curriculum: Teaching for
understanding with trajectories and technologies. Journal of Research on Educational
Effectiveness, 1(2), 89-119.
*Sarama, J., Lange, A., Clements, D.H., & Wolfe, C.B. (2012). The impacts of an early
mathematics curriculum on emerging literacy and language. Early Childhood Research
Quarterly, 27, 489-502. doi: 10.1016/j.ecresq.2011.12.002
*Saxe, G. B., & Gearhart, M. (2001). Enhancing students' understanding of mathematics: A
study of three contrasting approaches to professional support. Journal of Mathematics
Teacher Education, 4(1), 55-79.
INSTRUCTIONAL IMPROVEMENT IN STEM
105
*Schneider, S. (2013). Final Report for IES R305A70105: Algebra Interventions for Measured
Achievement – Full Year. San Francisco, CA: WestEd.
*Schneider, M. C., & Meyer, J. P. (2012). Investigating the efficacy of a professional
development program in formative classroom assessment in middle-school English
language arts and mathematics. Journal of Multidisciplinary Evaluation, 8(17), 1-24.
Schochet, P. Z. (2011). Do typical RCTs of education interventions have sufficient statistical
power for linking impacts on teacher practice and student achievement outcomes?
Journal of Educational and Behavioral Statistics, 36(4), 441-471.
*Schwartz‐Bloom, R. D., & Halpin, M. J. (2003). Integrating pharmacology topics in high
school biology and chemistry classes improves performance. Journal of Research in
Science Teaching, 40(9), 922-938.
*Shannon, L., & Grant, B. (2012). A final evaluation report of Houghton Mifflin Harcourt’s Holt
McDougal Biology. Charlottesville, VA: Magnolia Consulting, LLC.
*Sophian, C. (2004). Mathematics for the future: Developing a Head Start curriculum to support
mathematics learning. Early Childhood Research Quarterly, 19(1), 59-81.
*Star, J. R., Pollack, C., Durkin, K., Rittle-Johnson, B., Lynch, K., Newton, K., & Gogolen, C.
(2015). Learning from comparison in algebra. Contemporary Educational Psychology,
40, 41-54.
*Starkey, P., Klein, A., & , Deflorio, L. (2013). Changing the developmental trajectory in early
math through a two-year preschool math intervention. Paper presented at the Society for
Research on Educational Effectiveness Spring Conference, Washington, DC.
INSTRUCTIONAL IMPROVEMENT IN STEM
106
Sussman, J., & Wilson, M. R. (2018). The Use and Validity of Standardized Achievement Tests
for Evaluating New Curricular Interventions in Mathematics and Science. American
Journal of Evaluation, 1098214018767313.
*Tatar, D., Roschelle, J., Knudsen, J., Shechtman, N., Kaput, J., & Hopkins, B. (2008). Scaling
up innovative technology-based mathematics. The Journal of the Learning Sciences,
17(2), 248-286.
*Tauer, S. (2002). How does the use of two different mathematics curricula affect student
achievement?: A comparison study in Derby, Kansas. Retrieved from
http://www.cpmponline.org/pdfs/CPMP_Achievement_Derby.pdf
*Taylor, J. A., Getty, S. R., Kowalski, S. M., Wilson, C. D., Carlson, J., & Van Scotter, P.
(2015). An efficacy trial of research-based curriculum materials with curriculum-based
professional development. American Educational Research Journal, 52(5), 984-1017.
*Thompson, D. R., Senk, S. L., & Yu, Y. (2012). An evaluation of the Third Edition of the
University of Chicago School Mathematics Project: Transition Mathematics. Chicago,
IL: University of Chicago School Mathematics Project.
*Vaden-Kiernan, M., Borman, G., Caverly,S., Bell, N., Ruiz de Castilla, V., Sullivan, K. &
Rodriguez, D. (March, 2016). Findings from a multi-year scale-up effectiveness trial of
Everyday Mathematics. Paper presented at the Society for Research in Educational
Effectiveness Spring Conference, Washington, DC.
*Van Egeren, L. A., Schwarz, C., Gerde, H., Morris, B., Pierce, S., Brophy-Herb, .., & Stoddard,
D. (2014, August). Cluster-randomized trial of the efficacy of early childhood science
education with low-income children: Years 1-3. Poster presented at the 2014 Discovery
Research K-12 PI Meeting, Arlington, VA.
INSTRUCTIONAL IMPROVEMENT IN STEM
107
*Walsh-Cavazos, S. (1994). A study of the effects of a mathematics staff development module on
teachers' and students' achievement. (Doctoral dissertation). Retrieved from ProQuest
(Order no. 9517241)
INSTRUCTIONAL IMPROVEMENT IN STEM
108
Online Appendix C:
Full Results of Sensitivity Checks
Table C1
Regression adjusted mean effect sizes based on unconditional and conditional RVE meta-
regression models: Differences by subject (math vs. science)
Subgroup analysis
(Unconditional RVE
model)
Conditional RVE model
𝑔𝑢𝑐 a, pucb 𝑔𝑐 c pc
d
Subject Matter: Focus of interventione
Science 0.208*** 0.225 0.848
Math (including Math and
Science)
0.201*** 0.216
Subject Area: Outcome measuref
Science 0.188*** 0.223 0.942
Math 0.201*** 0.226
Note: First column: Mean effect sizes in the second column are based on subgroup analyses
where unconditional RVE meta-regression models were estimated separately for studies that
focused on math/science, and for math/science outcomes. a 𝑔𝑢𝑐 = Estimated mean effect size from unconditional RVE meta-regression model. b puc = p-
value on constant from unconditional meta-regression model.
Second column: Mean effect sizes in the third column are based on the results of estimating RVE
meta-regression models including all variables listed in Table 3, or including an indicator for
math outcome (using the study-level mean) and including other variables listed in Table 3.
Regression-adjusted mean effect sizes were calculated using the overall average of the study-
level values of each included covariate. c 𝑔𝑐 = Estimated regression-adjusted mean effect size from conditional RVE meta-regression
model include covariates listed in Table 3. d pc = p-value on coefficient for program feature on
interest from meta-regression model including relevant set of controls. e Subject Matter: Focus of intervention refers to the focus of the intervention: whether the
intervention focused on math (including 3 studies which focus on both math and science). f Subject Area: Outcome measure refers to the subject area of the outcome measure. This
accounts for the fact that some studies focus on both math and science, and include both math
and science outcomes.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
109
Table C2
Estimated mean effect sizes based on unconditional RVE meta-regression models: Differences by
grade level
Subgroup analysis
(Unconditional RVE model)
Grade level of participants
Preschool 0.390***
Kindergarten 0.133**
Early Elementary 0.200***
Upper Elementary 0.176***
Middle School 0.159***
High School 0.166**
Note: Estimated mean effect sizes from unconditional RVE meta-regression models. Mean effect
sizes are based on subgroup analyses where unconditional meta-regression analyses were
estimated using RVE separately for studies that included participants in each grade level. Studies
may be included in more than one row if studies included students across multiple grades.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
110
Table C3
Sensitivity tests: Results of estimating an unconditional meta-regression model with robust
variance estimation (RVE) and imputing missing effect sizes
Imputing missing
effect sizes as zero
(g = 0.00)
Imputing missing
effect sizes as
negative
(g = -0.10)
Imputing missing
effect sizes as
negative
(g = -0.20)
Constant 0.197*** 0.193*** 0.190***
(0.024) (0.024) (0.024)
N effect sizes 278 278 278
N studies 95 95 95
𝜌a 0.80 0.80 0.80
Note: a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
111
Table C4
Sensitivity tests: RVE results including professional development foci as moderators
Imputing missing
effect sizes as zero
(g = 0.00)
Imputing missing
effect sizes as
negative
(g = -0.10)
Imputing missing
effect sizes as
negative
(g = -0.20)
Between-study effects: Professional development foci
Generic instructional strategies -0.078 -0.079 -0.080
(0.067) (0.067) (0.067)
How to use curriculum materials 0.118* 0.114* 0.110*
(0.062) (0.063) (0.064)
Integrate technology 0.135 0.134 0.132
(0.104) (0.105) (0.107)
Content-specific formative assessment 0.097 0.095 0.093
(0.062) (0.063) (0.064)
Improve pedagogical content knowledge/how students learn 0.094** 0.095** 0.096**
(0.042) (0.042) (0.043)
N effect sizes 257 257 257
N studies 89 89 89
𝜌a 0.80 0.80 0.80
Between-study effects: Number of PD foci
Number of PD features 0.074*** 0.073*** 0.072***
(0.023) (0.024) (0.024)
N effect sizes 257 257 257
N studies 89 89 89
𝜌a 0.80 0.80 0.80
Note: All models include only studies and/or treatment arms with a professional development component. All models include controls
for the following: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject
matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
112
Table C5
Sensitivity tests: RVE results including professional development activities as moderators
Imputing missing
effect sizes as zero
(g = 0.00)
Imputing missing
effect sizes as
negative
(g = -0.10)
Imputing missing
effect sizes as
negative
(g = -0.20)
Between-study effects: Professional development foci
Observed demonstration 0.033 0.033 0.033
(0.048) (0.048) (0.049)
Review generic student work -0.026 -0.029 -0.032
(0.070) (0.070) (0.069)
Solved problems/Worked through
student materials 0.076 0.074 0.072
(0.046) (0.047) (0.048)
Developed curriculum materials/lesson plans 0.046 0.046 0.047
(0.062) (0.062) (0.063)
Reviewed own student work -0.002 -0.007 -0.011
(0.070) (0.073) (0.076)
N effect sizes 256 256 256
N studies 88 88 88
𝜌a 0.80 0.80 0.80
Between-study effects: Number of PD activities
Number of PD features 0.039* 0.038 0.036
(0.022) (0.023) (0.024)
N effect sizes 257 257 257
N studies 89 89 89
𝜌a 0.80 0.80 0.80d
Note: All models include only studies and/or treatment arms with a professional development component and include controls for:
RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
113
Table C6
Sensitivity tests: RVE results including professional development formats as moderators
Imputing missing
effect sizes as zero
(g = 0.00)
Imputing missing effect
sizes as negative
(g = -0.10)
Imputing missing
effect sizes as
negative
(g = -0.20)
Between-study effects: Professional development formats
Same-school collaboration 0.106 0.104 0.101
(0.066) (0.065) (0.064)
Implementation meetings 0.104** 0.102** 0.099**
(0.047) (0.047) (0.047)
Any online PD -0.137** -0.136** -0.135**
(0.054) (0.053) (0.053)
Summer workshop 0.072** 0.071* 0.071*
(0.038) (0.038) (0.038)
Expert coaching 0.054 0.057 0.060
(0.051) (0.051) (0.051)
PD leaders – researchers -0.039 -0.043 -0.048
(0.037) (0.037) (0.037)
N effect sizes 251 251 251
N studies 86 86 86
𝜌a 0.80 0.80 0.80
Between-study effects: Number of PD formats
Number of PD features 0.020 0.019 0.018
(0.023) (0.023) (0.023)
N effect sizes 257 257 257
N studies 89 89 89
𝜌a 0.80 0.80 0.80
Note: All models include only studies and/or treatment arms with a professional development component and include controls for:
RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies. *p < .10, **p < .05, ***p < .01
INSTRUCTIONAL IMPROVEMENT IN STEM
114
Table C7
Differences in regression-adjusted mean effect sizes by subject: Regression-adjusted mean effect
sizes
Imputing
missing
effect sizes as
zero
(g = 0.00)
Imputing
missing effect
sizes as
negative
(g = -0.10)
Imputing
missing
effect sizes as
negative
(g = -0.20)
Between-study effects
Implementation guidance 0.075 0.080 0.086
(0.058) (0.058) (0.059)
Laboratory/hands-on experience, curriculum
kits -0.026 -0.027 -0.028
(0.053) (0.054) (0.054)
Curriculum dosage
(Number of hours) 0.000 0.000 0.000
(0.001) (0.001) (0.001)
N effect sizes 205 205 205
N studies 77 77 77
𝜌a 0.80 0.80 0.80
Note: All models include only studies and/or treatment arms with a professional development
component. All models include controls for: RCT, state standardized test, other standardized test,
grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
115
Table C8
Sensitivity tests: Results of estimating an unconditional meta-regression model with robust
variance estimation (RVE)
Preferred
model
Excluding
non-US
studies
Excluding
non-RCT
studies
Alternative values of correlation
between pairs of observed effect sizes
within studies
Constant 0.209*** 0.223*** 0.208*** 0.209*** 0.209*** 0.209***
(0.025) (0.025) (0.025) (0.025) (0.025) (0.025)
N effect
sizes
258 248 239 258 258 258
N studies 95 89 86 95 95 95
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Note: a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
116
Table C9
Sensitivity tests: RVE results including professional development foci as moderators
Preferred
model
Excluding
non-US
studies
Excluding
non-RCT
studies
Alternative values of correlation between
effect sizes within studies
Between-study effects: Professional development foci
Generic instructional strategies -0.074 0.071 -0.079 -0.074 -0.074 -0.074
(0.072) (0.074) (0.077) (0.072) (0.072) (0.072)
How to use curriculum materials 0.119* 0.077 0.110 0.119* 0.119* 0.119*
(0.063) (0.057) (0.066) (0.063) (0.063) (0.063)
Integrate technology 0.139 0.203* 0.136 0.139 0.139 0.139
(0.102) (0.095) (0.104) (0.102) (0.102) (0.102)
Content-specific formative assessment 0.105 0.109 0.104 0.105 0.105 0.105
(0.062) (0.069) (0.067) (0.062) (0.062) (0.062)
Improve pedagogical content knowledge/how
students learn 0.096** 0.078* 0.101** 0.096** 0.096** 0.096**
(0.043) (0.039) (0.044) (0.043) (0.043) (0.043)
N effect sizes 237 227 222 237 237 237
N studies 89 83 82 89 89 89
𝜌a 237 0.80 0.80 0.50 0.50 0.50
Between-study effects: Number of PD foci
Number of PD features 0.077*** 0.069*** 0.075*** 0.077*** 0.077*** 0.077***
(0.024) (0.024) (0.025) (0.024) (0.024) (0.024)
N effect sizes 237 227 222 237 237 237
N studies 89 83 82 89 89 89
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Note: All models include only studies and/or treatment arms with a professional development component. All models include controls
for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
117
Table C10
Sensitivity tests: RVE results including professional development activities as moderators
Preferred
model
Excluding
non-US
studies
Excludin
g non-
RCT
studies
Alternative values of correlation
between effect sizes within
studies
Between-study effects: Professional development activities
Observed demonstration 0.025 0.004 0.017 0.025 0.025 0.025
(0.046) (0.048) (0.047) (0.046) (0.046) (0.046)
Review generic student work -0.014 -0.057 0.015 -0.014 -0.014 -0.014
(0.085) (0.081) (0.093) (0.085) (0.085) (0.085)
Solved problems/Worked through
student materials 0.070 0.098** 0.062 0.070 0.070 0.070
(0.047) (0.045) (0.048) (0.047) (0.047) (0.047)
Developed curriculum materials/lesson plans 0.036 -0.001 0.044 0.036 0.036 0.036
(0.063) (0.073) (0.065) (0.063) (0.063) (0.063)
Reviewed own student work 0.017 0.046 0.009 0.017 0.017 0.017
(0.071) (0.085) (0.073) (0.071) (0.071) (0.071)
N effect sizes 236 226 221 236 236 236
N studies 88 82 81 88 88 88
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Between-study effects: Number of PD activities
Number of PD features 0.046* 0.041 0.048* 0.046* 0.046* 0.046*
(0.024) (0.024) (0.024) (0.024) (0.024) (0.024)
N effect sizes 237 227 222 237 237 237
N studies 89 83 82 89 89 89
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Note: All models include only studies and/or treatment arms with a professional development component. All models include controls
for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
118
Table C11
Sensitivity tests: RVE results including professional development formats as moderators
Preferred
model
Excluding
non-US
studies
Excluding
non-RCT
studies
Alternative values of correlation between
effect sizes within studies
Between-study effects: Professional development formats
Same-school collaboration 0.123* 0.152** 0.127* 0.123* 0.123* 0.122*
(0.067) (0.067) (0.068) (0.067) (0.067) (0.067)
Implementation meetings 0.117** 0.091** 0.109** 0.117** 0.117** 0.117**
(0.045) (0.0435) (0.046) (0.045) (0.045) (0.045)
Any online PD -0.153** -0.169** -0.160** -0.153** -0.153** -0.153**
(0.057) (0.063) (0.059) (0.057) (0.057) (0.057)
Summer workshop 0.074* 0.069* 0.077* 0.074* 0.074* 0.074*
(0.038) (0.038) (0.039) (0.038) (0.038) (0.038)
Expert coaching 0.053 0.016 0.051 0.053 0.053 0.053
(0.051) (0.049) (0.053) (0.051) (0.051) (0.051)
PD leaders – researchers -0.037 -0.025 -0.040 -0.037 -0.037 -0.037
(0.038) (0.034) (0.038) (0.038) (0.038) (0.038)
N effect sizes 231 221 216 231 231 231
N studies 86 80 79 86 86 86
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Between-study effects: Number of PD formats
Number of PD features 0.023 0.029 0.022 0.023 0.023 0.023
(0.024) (0.026) (0.024) (0.024) (0.024) (0.024)
N effect sizes 237 227 222 237 237 237
N studies 89 83 82 89 89 89
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Note: All models include only studies and/or treatment arms with a professional development component. All models include controls
for: RCT, state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
119
Table C12
Differences in regression-adjusted mean effect sizes by subject: Regression-adjusted mean effect sizes
Preferred model Excluding
non-US
studies
Excluding non-
RCT studies
Alternative values of correlation
between effect sizes within studies
Between-study effects
Implementation guidance 0.062 0.028 0.057 0.062 0.062 0.062
(0.056) (0.051) (0.060) (0.056) (0.056) (0.056)
Laboratory/hands-on experience, curriculum kits -0.026 -0.028 -0.026 -0.026 -0.026 -0.026
(0.054) (0.059) (0.056) (0.054) (0.054) (0.054)
Curriculum dosage
(Number of hours) 0.000 0.000 0.000 0.000 0.000 0.000
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
N effect sizes 193 185 175 197 197 197
N studies 77 73 69 78 78 78
𝜌a 0.80 0.80 0.80 0.50 0.70 0.90
Note: All models include only studies and/or treatment arms with new curriculum materials. All models include controls for: RCT,
state standardized test, other standardized test, grade-preschool, effect size adjusted for covariates, and subject matter. a 𝜌 represents the average correlation between all pairs of effect sizes within studies.
*p < .10, **p < .05, ***p < .01.
INSTRUCTIONAL IMPROVEMENT IN STEM
120
Table C13
Descriptions of studies published before and after 2004
Code Code presenta
Studies
published
pre-2004
(N = 9)
Studies
published
2004 or later
(N = 86)
p-value of
differencee
Effect Size (Hedges’ g) 0.360 0.224 0.023
Intervention Type
Professional development only 22% 22% 0.993
Professional development and curriculum materials 67% 76% 0.563
Curriculum materials only 22% 8% 0.173
Research Design and Sample Characteristics
RCT 56% 94% 0.001
Subject matter – Math 89% 62% 0.107
Preschool 11% 20% 0.533
Effect Size Type
Intervenor developed test 69% 51% 0.063
State standardized test 3% 19% 0.039
Other standardized test 28% 31% 0.743
Adjusted for covariates 10% 84% <0.001
PD Formatb
Same-school collaboration 67% 74% 0.620
Implementation meetings 33% 35% 0.927
Online professional development 0% 20% 0.144
Summer workshop 67% 53% 0.437
Expert coaching 0% 22% 0.117
PD lead by researchers/intervention developers 78% 65% 0.449
PD Timing and Durationc
Contact hours 56 hours 44 hours 0.371
Timespan over which professional development was
conducted:
Less than one week 25% 14% 0.419
One week 13% 3% 0.148
One month (8 days to 30 days) 0% 4% 0.578
One semester (31 days to 4 months) 13% 13% 0.980
One year 13% 53% 0.031
More than one year 38% 14% 0.090
INSTRUCTIONAL IMPROVEMENT IN STEM
121
Code presenta
Studies
published
pre-2004
(N = 9)
Studies
published
2004 or later
(N = 86)
p-value of
differencee
PD Focusb
Generic instructional strategies 22% 9% 0.234
How to use curriculum materials 78% 74% 0.828
Integrate technology 0% 12% 0.284
Content-specific formative assessment 22% 16% 0.654
Improve content knowledge/Pedagogical content
knowledge/How students learn 56% 55% 0.959
PD Activitiesb
Review sample student work 22% 15% 0.583
Observed demonstration 0% 38% 0.021
Solved problems/Worked through student materials 44% 42% 0.883
Developed curriculum/lesson plans 22% 19% 0.795
Reviewed own student work 0% 12% 0.281
Curriculum Materialsd
Laboratory/Hands-on experience or curriculum kits 22% 29% 0.669
Curriculum dosage (hours) 71.2 66.2 0.850
Curriculum proportion replaced (percent) 87% 91% 0.651
Implementation guidance 22% 45% 0.186
Note: N = 95 studies. a Figures in third column include percent of studies which feature the row code for binary
variables, or the sample average calculated at the study level for continuous variables (e.g.,
contact hours and curriculum dosage). For studies that had the feature present in one treatment
arm but not another treatment arm, the code is counted as present if it is present is any treatment
arm (e.g., a study with one treatment arm including only curriculum materials and a second
treatment arm including both professional development and curriculum materials would be
included in both rows). b Codes for PD focus, activities, and format were counted as “Not
Present” for studies that did not involve a PD component. c PD timing/duration excludes studies
without a PD component. d Codes for features of interventions involving new curriculum
materials were counted as “Not Present” for studies that did not involve new curriculum
materials. e p-value from t-tests. All t-tests were conducted at the study level, except for t-test for
variables in the Effect Size Type category which were conducted at the effect size level.
INSTRUCTIONAL IMPROVEMENT IN STEM
122