ORIGINAL ARTICLE
Revealing German primary school students’ achievement in measurement
Jasmin Hannighofer • Marja Van den Heuvel-Panhuizen •
Sebastian Weirich • Alexander Robitzsch
Accepted: 24 July 2011 / Published online: 28 August 2011
© FIZ Karlsruhe 2011
Abstract The focus of this study was to investigate pri-
mary school students’ achievement in the domain of
measurement. We analyzed a large-scale data set
(N = 6,638) from German third and fourth graders (8- to
10-year-olds). These data were collected in 2007 within the
framework of the ESMaG (Evaluation of the Standards in
Mathematics in Primary School) project carried out by the
Institute for Educational Quality Improvement (IQB) at
Humboldt University, Berlin, Germany. The data were
interpreted using a classification scheme based on a con-
ceptual–procedural distinction in measurement compe-
tence. The analyses with this classification revealed that
grade, gender, and in particular figural reasoning ability are
significantly related to overall measurement competence as
well as to the sub-competencies of Instrumental knowledge
and Measurement sense. The paper concludes with a
discussion of the implications of the findings of this study
for teaching and assessing measurement.
Keywords Mathematical competence · Measurement · Gender · Grade · Figural reasoning ability
1 Introduction
In many countries, assessments are carried out to measure
the effects of education on students’ achievement. Exam-
ples of such national assessments are the NAEP (National
Assessment of Educational Progress) in the USA, the
PPON (National Assessment of Educational Achievement)
in the Netherlands, the NAPLAN (National Assessment
Program—Literacy and Numeracy) in Australia, and the
PSLE (Primary School Leaving Examination) in Singa-
pore. These assessments of educational output are mostly
based on national achievement standards, which—begin-
ning with the standards published by the American
National Council of Teachers of Mathematics (NCTM
1989)—have been formulated since the late 1980s.
In Germany, standards were developed for primary
school mathematics in 2004 by the KMK (Standing Con-
ference of the Ministers of Education and Cultural Affairs
of the States in the Federal Republic of Germany) (KMK
2005). The standards describe what students are expected
to have achieved by the end of grade 4, which in Germany
is the end of primary school. At that time, the students are
about 10 years old.
The KMK standards for primary school mathematics
distinguish five general competencies (Problem Solving,
Communicating, Reasoning, Modeling, and Representing)
and five content-related mathematical competencies
(Numbers and Operations, Space and Shape, Patterns and
Structure, Measurement, and Probability). The latter set of
competencies relate to the structure of mathematical con-
tent as described, for example, in the NCTM (2000) stan-
dards and also reflected in the PISA framework for
assessing mathematics (OECD 2003).
Starting in 2004, the KMK standards were used to
evaluate primary school students’ achievement. The
J. Hannighofer (✉) · S. Weirich
Institute for Educational Quality Improvement (IQB),
Humboldt University, Unter den Linden 6,
10099 Berlin, Germany
e-mail: [email protected]
M. Van den Heuvel-Panhuizen
Freudenthal Institute for Science and Mathematics Education,
Utrecht University, Utrecht, The Netherlands
A. Robitzsch
Federal Institute for Education Research,
Innovation and Development of the Austrian School System,
Salzburg, Austria
ZDM Mathematics Education (2011) 43:651–665
DOI 10.1007/s11858-011-0357-y
Institute for Educational Quality Improvement (IQB) at the
Humboldt University is responsible for this evaluation and
carries out the assessment. For mathematics in primary
school, this was done in the ESMaG (Evaluation of the
Standards in Mathematics in Primary School) project. In
our study, we explored primary school students’ achieve-
ment in the mathematical domain of measurement.
Measurement competence is generally described as the
ability to assign a numerical value to an attribute of an
object or event (NCTM 2000). The mathematical domain
of measurement is considered as the most widely used
application of mathematics in everyday life and is regarded
as a foundation for many sciences (Lehrer 2003; Vasilyeva,
Casey, Dearing and Ganley 2009). Moreover, measurement
is regarded as one of the most challenging areas of math-
ematics in elementary school (Vasilyeva et al. 2009). Being
competent in measurement means that children have the
ability to grasp the physical world around them by
expressing its properties in numbers and reasoning about
them mathematically (van den Heuvel-Panhuizen and Buys
2008).
The KMK measurement standard is subdivided into two
competencies and further divided into a number of sub-
competencies (Table 1).
2 What is already known about German primary
school students’ achievement in measurement?
Knowledge about German primary school students’
achievement in the mathematical domain of measurement
is scarce. There are only two recent studies that give some
information about this. The latest findings come from
TIMSS 2007 (Mullis, Martin and Foy 2008; Bos et al.
2008). Of the 37 countries that participated in TIMSS
2007, the German fourth graders were ranked 12th, below
the USA and Lithuania and above Australia and Denmark.
However, compared to these countries, the
variance of performance scores was rather low in Germany
and the range from the best-performing to the lowest-per-
forming students was rather small. But these TIMSS results
offer hardly any specific information about measurement
achievement, because this domain forms one content
domain together with geometry and is named Geometric
shapes and measures.
Additional information about German students’
achievement in measurement is provided by Lobemeier
(2005), who carried out a secondary analysis of the data
collected in 2001 in the IGLU (Bos et al. 2003) and the
IGLU-E study (Lankes et al. 2003), in which 16 mea-
surement items were used from TIMSS 1995 (Mullis et al.
1997). Lobemeier (ibid.) classified these items into four
measurement-related categories named ordering, estimat-
ing, partitioning, and operating. Ordering includes tasks
that require, among other things, comparing measures such
as temperatures and time spans and arranging them in a
systematic order (e.g., ordering minute, hour, day, week,
and month from the shortest to the longest). Estimating
includes, for example, roughly determining the weight or
length of an object without using a measuring tool (e.g.,
estimating whether a pencil is 5-, 10-, 20-, or 30-cm long).
Partitioning areas, volumes, and weights is Lobemeier’s
third category. Here students, for example, had to identify
how a particular weight was composed and then had to use
this knowledge to determine the number of weights that
would balance a scale. The focus of tasks in the operating
category is on carrying out multi-step calculations with
measures.
Lobemeier’s (ibid.) results showed that German fourth
graders performed best on the ordering items. The success
rate for these items was between 95 and 66%. Lower scores
were found for the items on estimating. Here the percent-
age of correct answers ranged from 79 to 55%. For the
three items on partitioning, a similar range was found. A
really large range in success rate was found in the items
within the category operating. The easiest item was solved
Table 1 Measurement competencies as formulated in the KMK (Konferenz der Kultusminister der Länder in der Bundesrepublik Deutschland) (2005) standards
I Having conceptions of measures
a. Knowing standard units that belong to monetary values, lengths, durations, weights, and volumes
b. Comparing, measuring, and estimating measures
c. Knowing objects or events that are important in everyday life and that represent a particular standard unit
d. Converting measures
e. Knowing and understanding simple fractions in the context of measures from everyday life
II Dealing with measures in context situations
a. Measuring with appropriate measuring units and instruments
b. Using representatives (of standard units) from everyday life to solve context problems
c. Calculating in context problems with appropriate estimates of measures
d. Solving context problems that require dealing with measures
652 J. Hannighofer et al.
by 78% of the students. In this item, the students had to
identify what date it was 3 weeks after a particular date.
The most difficult item was one in which the perimeter
of a rectangle and its width were given and students had to
calculate its length. Only 19% of the students could solve
this item. These results correspond with the TIMSS 2007
findings that this topic is not included in the curriculum
until grade 5 and that only 55% of German students have
been taught this topic in grade 4.
Another area that has been studied is how students’
measurement competence develops during the primary
school years. In agreement with the findings of Winkel-
mann and van den Heuvel-Panhuizen (2009), we assume
that there will be progress in achievement from grade 3 to
4. Other information about students’ mathematical devel-
opment between grades 3 and 4 can be found in TIMSS
1995 (Mullis et al. 1997). The international results from
this study, in which Germany did not participate, showed
that for mathematics, in general, the international average
of the fourth-grade students (529) was approximately 60
points higher than the average of third-grade students
(470). For items with low difficulty, for example, estimating
a pencil’s length, the average percentage correct for
all countries was 77% for the fourth graders, and 69% for
third graders. However, such an increase could not be
found for difficult measurement items. For example, stu-
dents in both grades performed very similarly with 21 and
23% correct answers, respectively, on an item involving a
multi-step problem requiring students to apply their
knowledge of the perimeters of rectangles. These results
for measurement deviated from those for the other math-
ematical content domains where the differences between
grade 3 and 4 in the overall achievement in the domains
were often larger and ranged from at least 6 up to 21
percentage points.
A further issue is whether boys and girls differ in their
measurement competence. The grade 4 results from TIMSS
2007 (Mullis et al. 2008; Bos et al. 2008) did not show
significant gender differences in the domain Geometric
shapes and measures, whereas in the domains Number and
Data display German boys did significantly better than the
girls. Lobemeier’s (2005) findings were in agreement with
these TIMSS results. For all of the four categories of
measurement items that she distinguished, she found that
boys and girls scored equally well. Ratzka (2003), who also
used TIMSS items to investigate mathematical achieve-
ment of German fourth graders, also found no significant
gender differences on the scale for measurement. In con-
trast to these three studies, earlier analyses based on the
students’ responses to the items used in the ESMaG project
(Winkelmann, van den Heuvel-Panhuizen and Robitzsch
2008; Winkelmann and van den Heuvel-Panhuizen 2009)
showed that, out of the five mathematical content domains
as defined in the KMK Standards, the strongest differences
between boys and girls were found (d = -.36) in the
measurement items. Moreover, Kaiser and Steisel (2000)
found higher measurement scores for boys than for girls in
grade 8 in TIMSS 1997.
A further point of interest is the relationship between
students’ measurement ability and their figural reasoning
ability. Many studies have shown a strong connection
between spatial ability and students’ mathematics
achievement in general (see, e.g., Sherman 1980; Fennema
1979). This correlation increases with the complexity of
mathematical tasks (Kaufmann 1990). However, it is
unclear whether this connection also applies to the sub-
domain of measurement. Also, to our knowledge no
research exists that explores the relationship between stu-
dents’ measurement achievement and their figural reason-
ing ability. In the case of measurement, this latter
relationship is even more relevant than with spatial ability
in general, because in solving measurement tasks students
often have to reason with figures which are not necessarily
presented in a spatial context.
Knowing more about how the students’ measurement
ability is related to their figural reasoning ability might also
give a further insight into possible differences in mea-
surement achievement between girls and boys.
3 Research questions
The goal of this study was to learn more about German
primary school students’ achievement in the mathematical
domain of measurement. The following research questions
were investigated:
Q1. What can be said about the dimensionality structure
in the measurement competencies based on the
ESMaG data?
Q2. To what degree do fourth-grade students outperform
third-grade students in measurement achievement?
Q3. Does measurement achievement differ by gender?
Q4. Does figural reasoning ability correlate with
students’ achievement in measurement?
Our research questions are exploratory, because we
found hardly any previous research that gave indications
for formulating hypotheses. The only predictions that could
be made based on prior studies focused on gender differ-
ences. However, the difficulty here is that the findings from
these studies contradict each other. For example, Lobe-
meier (2005) and Mullis et al. (2008) did not detect
significant gender differences in the domain of measure-
ment, whereas Kaiser and Steisel (2000) found that eighth-
grade boys outperformed girls in their achievement in
measurement.
Primary school students’ achievement in measurement 653
4 Method
4.1 Sample and data collection
The data for this study were collected in 2007 from a
nationally representative sample of 6,638 German primary
school students (3,280 third-grade students and 3,358
fourth-grade students). In a multistage sampling procedure,
schools were randomly selected within each of the 16
German states and within each school one grade 4 and one
grade 3 class were randomly selected. Additionally, data
were collected about the figural reasoning ability of the
students by having them take the subscales for figural
analogy and figural classification of the Kognitiver
Fähigkeits-Test (KFT) (Heller and Perleth 2000).
4.2 Item development and classification for evaluating
the KMK standards
One of the requirements of the KMK for the evaluation of
the standards was that teachers should be heavily involved
in the evaluation process (Granzer 2009). This stipulation
meant that the development of items attended to what
teachers thought was important to be assessed when stu-
dents’ achievement was evaluated. Consequently, in the
ESMaG project psychometricians, didacticians and teach-
ers worked together in developing and classifying the
items. Firstly, teachers who specialized in mathematics
education and trained in item development were asked to
compile and design a collection of items for assessing
students’ competence in measurement, and to classify the
items in terms of the two measurement competencies as
formulated in the KMK standards. In a second step, the
items and their classifications were examined by a group of
mathematics didacticians.
As shown in Table 2, more than half of the developed
items were attributed to the competence Having concep-
tions of measures (Competence I), while almost one-third
were attributed to the competence Dealing with measures
in context situations (Competence II). Moreover, about
one-tenth of the items were ascribed to Competence I as
well as to Competence II.
Figures 1, 2, 3, and 4 show examples of the developed
items and the KMK competence categories to which they
were allocated. For test security reasons, we can only show
examples which have been released for publication. Nev-
ertheless, this restricted selection of items made clear that
although the developed items and their classifications had
support from the community of mathematics teachers and
didacticians, they did not produce a clear and focused
domain structure of measurement and the corresponding
competencies.
The item in Fig. 1 refers without doubt to conversion of
measures and is as such assigned to Competence I Having
conceptions of measures. Yet, one may wonder whether such
a technical sub-competence belongs to this competence.
The item in Fig. 2, about working with the map of a zoo,
is evidently about dealing with measurement context situ-
ations and consequently corresponds to Competence II.
However, for the item in Fig. 3, the assignment to Com-
petence II is questionable. This item does not assess stu-
dents’ measurement competence; it only requires reading
the text to identify the time sequence that fits best to the
described story. Similarly, questions can be raised with
respect to the item in Fig. 4. Although it is apparent that
this item compares measures and as such is a sub-compe-
tence of Competence I, it is not so obvious that this item
can be considered as an operationalization of Compe-
tence II, i.e., measuring with appropriate measurement
units and instruments.
4.3 Revisiting the KMK standards for measurement
A detailed consideration of the items and their classifica-
tions used in the ESMaG project revealed that the
Table 2 Measurement competencies to which the measurement items were attributed
Type of measurement competence based on the KMK standards: number of items
I Having conceptions of measures: 58
II Dealing with measures in context situations: 27
I+II Having conceptions of measures + dealing with measures in context situations: 12
Total: 97
Convert the time measurements as is shown in the example and fill in the gaps.
Example: 87 min = 1 h 27 min
a) 144 min = _____ h _____ min
b) __________ min = 3 h 54 min
c) __________ min = 6 h 40 min
KMK category: I. Having conceptions of measures
d. Converting measures
© Cornelsen Verlag 2008 (Bildungsstandards: Kompetenzen überprüfen, Mathematik Grundschule, Klasse 3/4)
Fig. 1 Measurement item concerning conversion of time
measurements
measurement sub-competencies as formulated in the KMK
standards are partly ambiguous. For example, the sub-
competence Knowing standard units that belong to mone-
tary values, lengths, durations, weights and volumes
(Competence Ia) overlaps with the sub-competence
Knowing objects or events that are important in everyday
life and that represent a particular standard unit (Com-
petence Ic). Although Competence Ia means to know and
understand terms like ‘‘grams’’ and ‘‘kilograms’’ and
Competence Ic means for example to know things that are
about 10 kg, the two sub-competencies are not easy to distinguish
because tasks often require both. Students, for example,
often have to compare different things and therefore have
to know how big, tall, heavy, etc., things are. Moreover,
this latter sub-competence and the sub-competence
Knowing and understanding simple fractions in the context
of measures from everyday life (Competence Ie) do not
differ from the sub-competencies in Competence II that
encompass all kinds of context situations in which students
have to deal with measures.
A framework of standards with partially overlapping
and ambiguous definitions does not clearly distinguish
the competencies; in turn, it resulted in items that
may not be precise enough to assess distinct aspects of
measurement competence.
4.4 Alternative classification based
on conceptual–procedural distinction
As Resnick and Ford (1981) pointed out, the distinction
between computational skills and conceptual understanding
is one of the oldest concerns in mathematics education.
Many researchers have addressed this splitting up into two
kinds of understanding, although they do not always use
the same terms for it and also differ in their interpretation.
Already in the 1970s, Skemp (1976) made mathematics
teachers aware that students should develop relational
understanding, meaning that students should understand
both what to do and why and that this understanding
deviates from what Skemp referred to as instrumental
understanding, which means learning rules without know-
ing why. Although using different terms, a similar dis-
tinction is made by Hiebert (1986) when discerning
conceptual and procedural knowledge. Comparably with
Skemp’s relational understanding, Hiebert’s conceptual
knowledge includes relationships between mathematical
objects, and Hiebert’s procedural knowledge, which refers
to knowledge of standard learned procedures, corresponds
with Skemp’s instrumental understanding. However,
Rittle-Johnson and Wagner Alibali (1999), who also made
the distinction between conceptual and procedural knowledge,
emphasize that both types of knowledge lie on a
continuum and it is not always possible to separate them.
In fact, the distinction between conceptual and proce-
dural knowledge is applicable in all domains of mathe-
matics, but it fits particularly well to the domain of
measurement in which instrumental as well as conceptual
knowledge plays unique roles. The KMK framework
reflects to some degree a division into an understanding of
basic facts and procedures related to measurement, on the
one hand, and the understanding of measurement concepts
and how they are related and the ability to apply this
knowledge in everyday contexts, on the other hand.
However, these two perspectives (the procedural or
instrumental understanding versus the conceptual or rela-
tional understanding) are not clearly distinguished in all the
sub-competencies. Therefore, we developed a framework
that could better distinguish the measurement items in such
a way that there is a division between items that refer to
using Instrumental knowledge (IK) and items that imply
Measurement sense (MS). The IK competence includes
Fig. 2 Measurement item concerning measurements in a map of a
zoo
having available straightforward and isolated measurement
knowledge and procedures, while the MS competence
involves having available knowledge about measures and
units of measurement in everyday life and being able to
apply all kinds of measurement knowledge in context sit-
uations. In Table 3, we have described sub-categories of
these competencies by referring to the types of items that
belong to IK and MS.
We classified all the items according to the new
framework. Contrary to the KMK classification, the alter-
native classification did not result in items that were
attributed to both IK and MS.
A closer examination of all the 97 items revealed that
there were quite a number of them that, although a part of
the collection of measurement items, should, in our
opinion, not have been attributed to measurement. In
total, we found 28 of these items, which we labeled as ‘‘No
measurement’’ items (see Table 3). They range from
comparison problems in which the measurement context is
not relevant to an item that merely requires reading
(Fig. 3).
A less detailed classification was obtained when we clustered
the items belonging to the different sub-categories
within the IK and MS competencies (see Table 4). The
collection of items that belong to the IK competence can be
grouped into items that are about times and items that
concern other measurement attributes (e.g., length, weight).
Similarly, the items that refer to the MS competence can be
divided into items that assess whether students have a
particular knowledge and items that are about problem
solving.
4.5 Statistical analyses
4.5.1 Estimation of students’ achievement
To estimate students’ achievement, the items were scaled
within the framework of Item Response Theory (IRT). As
the data were collected in a multi-matrix sampling design,
every student completed only a subsample of all available
items, resulting in many items with randomly missing
values (Rubin 1987). Neither the KMK competence cate-
gories nor the alternative competence categories were
evenly distributed over the booklets. This is caused by the
design of the ESMaG study, in which only main categories
(e.g., measurement) were balanced over the booklets.
Fig. 3 Measurement item
concerning identifying correct
time sequence
For the estimation of population parameters (means,
standard deviations) and for students’ achievement, we
used the plausible value technique (von Davier 2009;
Mislevy et al. 1992). Plausible values (PVs) are drawn
from the distribution of students’ latent abilities. Latent
regression models are specified for drawing PVs to reflect
all statistical relationships of ability variables with covariates
of interest that will be used in subsequent statistical
analyses. The variation between different PVs
reflects the uncertainty due to missing data (missing by
design) and measurement error modeled within the IRT.
Using this technique, data with a considerable amount of
missing data and measurement error can be analyzed as if
there were no missing values and no measurement error.
The analyses are repeatedly conducted for each set of PVs,
and the results are pooled and tested for significance (Little
and Rubin 2002).
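The pooling step can be illustrated with a minimal sketch, assuming that each plausible-value analysis yields a point estimate and its sampling variance. The function name `pool_rubin` and the numbers below are hypothetical and are not taken from the ESMaG data; the combination formula is the standard one described by Little and Rubin (2002).

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-PV estimates and sampling variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two entries")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance between PV estimates (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)      # estimate and its pooled standard error


# Hypothetical results from five sets of plausible values (illustration only)
est = [0.50, 0.52, 0.48, 0.51, 0.49]
var = [0.010, 0.011, 0.009, 0.010, 0.010]
pooled_est, pooled_se = pool_rubin(est, var)
```

The between-PV term is what carries the uncertainty due to missing data and measurement error into the pooled standard error.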
4.5.2 Analyses of dimensionality
To explore the structure of the measurement competence,
we conducted confirmatory factor analyses (CFA) for the
KMK classification as well as for the alternative classifi-
cation. This CFA allowed us to verify the latent factor
structure of a set of observed test responses and to test the
homogeneity of the domain to determine whether the dif-
ferentiation in the two competencies was reasonable from
an empirical point of view.
For both classifications, we specified a two-factor
model. The KMK classification includes items with within-
item dimensionality. Such items belong to both factors.
The alternative classification allows only between-item
dimensionality, i.e., each item belongs to only one factor.
If the proposed two-dimensional structure fits the
empirical data better than a one-dimensional structure, the latent
correlation between the two dimensions is expected to
differ significantly from one. Therefore, a model with a
fixed correlation of one should fit the data worse than a
model where the correlation is estimated freely. To test
this, the chi-square statistic provided by the Wald test was
used (Bollen 1989; Muthén and Muthén 1998–2007). In an
additional model constraint, the correlation was forced to
equal one, obtaining one additional degree of freedom. The
Wald test provides a chi-square statistic quantifying the
loss of model fit by this constraint. If the chi-square
statistic is significant at an alpha-level of .05, the loss of
model fit is substantial.
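The decision rule for a test with one degree of freedom can be sketched as follows. This is only an illustration of how a chi-square statistic is compared with the .05 level; it does not reproduce the structural equation software used for the analyses, and the function names are hypothetical.

```python
from math import erfc, sqrt


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if stat < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    # For df = 1 the statistic is a squared standard normal variable,
    # so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return erfc(sqrt(stat / 2.0))


def loss_of_fit_is_substantial(stat, alpha=0.05):
    """True if fixing the correlation to one worsens fit significantly."""
    return chi2_sf_df1(stat) < alpha


# Hypothetical statistic, for illustration only
p_value = chi2_sf_df1(4.0)
```

With one degree of freedom, any statistic above roughly 3.84 is significant at the .05 level.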
We tested the equality of latent factor correlations in
subgroups, i.e., for boys and girls and for third graders and
fourth graders. The analysis was done for the four disjoint
groups of third-grade girls, third-grade boys,
fourth-grade girls, and fourth-grade boys. The model was a
two-dimensional CFA, a multi-group analysis to differen-
tiate between the groups. Within each group, the correla-
tion between the two dimensions was estimated and tested
for being equal in all subgroups.
4.5.3 Effect of grade, gender, and figural reasoning ability
on measurement achievement
We investigated grade and gender differences with a two-
way analysis of variance (ANOVA). The dependent variable
is the PV score on the measurement items. This means
that all analyses involving PVs were conducted
five times (once for each set of PVs) and the results were
pooled according to Rubin’s rules (Little and Rubin 2002).
As stated earlier, all the factors used in each ANOVA
model also occurred in the latent regression model of the
PV imputation.
The same analysis was carried out separately for the
students with high and low figural reasoning ability scores.
The high ability group consisted of all students with a KFT
value of at least one standard deviation above the mean and
the low ability group consisted of all students with a KFT
value at least one standard deviation below the mean.
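The grouping rule can be written out as a short sketch. It assumes that the cut-offs are computed from the sample mean and the sample standard deviation of the KFT scores; the text does not say which variant of the standard deviation was used, and the function name and scores are hypothetical.

```python
from statistics import mean, stdev


def split_by_kft(kft_scores):
    """Return indices of students in the high and low ability groups."""
    centre = mean(kft_scores)
    spread = stdev(kft_scores)  # assumption: sample standard deviation
    high = [i for i, s in enumerate(kft_scores) if s >= centre + spread]
    low = [i for i, s in enumerate(kft_scores) if s <= centre - spread]
    return high, low


# Hypothetical KFT scores for seven students (illustration only)
scores = [10, 12, 14, 16, 18, 20, 22]
high_group, low_group = split_by_kft(scores)
```

Students whose scores lie within one standard deviation of the mean belong to neither group under this rule.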
KMK categories: I. Having conceptions of measures
b. Comparing, measuring and estimating measures
II. Dealing with measures in context situations
a. Measuring with appropriate measuring units and instruments
Tim and Jana are weighing four bags. Which bag is the lightest one?
Bag A
Bag B
Bag C
Bag D
© Cornelsen Verlag 2008 (Bildungsstandards: Kompetenzen überprüfen, Mathematik Grundschule, Klasse 3/4)
Fig. 4 Measurement item concerning comparing measures
To investigate whether gender differences in measure-
ment achievement were at least partially mediated by
gender differences in figural reasoning ability, we carried
out a mediation analysis. To test the indirect effect of the
figural reasoning ability, we used asymmetric confidence
intervals by applying a bias-corrected bootstrap method
(MacKinnon 2008) instead of the Sobel test, because
conventional tests of significance might lead to biased
results due to non-normality (MacKinnon et al. 2004;
MacKinnon 2008). Indirect effect parameters are computed
as the product of two regression coefficients and
are therefore often not normally distributed.
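A minimal sketch of a bias-corrected bootstrap interval for an indirect effect is given below. It assumes a simple single-mediator model estimated by ordinary least squares, with X standing for gender, M for figural reasoning ability, and Y for measurement achievement. The data are simulated and all function names are hypothetical; the sketch ignores plausible values and the sampling design, so it illustrates the technique rather than the analysis actually carried out.

```python
import random
from statistics import NormalDist, mean


def indirect_effect(x, m, y):
    """a * b, with a from M ~ X and b from Y ~ M + X (ordinary least squares)."""
    mx, mm, my = mean(x), mean(m), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (smm * sxx - sxm ** 2)
    return a * b


def bc_bootstrap_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    """Bias-corrected bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimate = indirect_effect(x, m, y)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        boots.append(indirect_effect([x[i] for i in idx],
                                     [m[i] for i in idx],
                                     [y[i] for i in idx]))
    boots.sort()
    nd = NormalDist()
    # Bias correction: share of bootstrap estimates below the sample estimate
    share = sum(b < estimate for b in boots) / n_boot
    share = min(max(share, 1 / n_boot), 1 - 1 / n_boot)  # keep inv_cdf finite
    z0 = nd.inv_cdf(share)
    z = nd.inv_cdf(1 - alpha / 2)
    lo_idx = min(n_boot - 1, max(0, int(nd.cdf(2 * z0 - z) * n_boot)))
    hi_idx = min(n_boot - 1, max(0, int(nd.cdf(2 * z0 + z) * n_boot)))
    return estimate, boots[lo_idx], boots[hi_idx]


# Simulated data with a built-in indirect effect (illustration only)
rng = random.Random(1)
x = [i % 2 for i in range(200)]                # hypothetical 0/1 group code
m = [2.0 * xi + rng.gauss(0, 1) for xi in x]   # hypothetical mediator scores
y = [1.5 * mi + 0.5 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
estimate, lower, upper = bc_bootstrap_ci(x, m, y)
```

The interval limits are read from shifted percentiles of the bootstrap distribution, which is why the interval can be asymmetric around the estimate. An indirect effect is regarded as significant when the interval excludes zero.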
5 Results
5.1 Measurement achievement
The easiest items were, for example, those in which stu-
dents had to say which was more: 6,000 m or 5 km? Other
easy items were those in which students had to decide
which unit (km, s, min, kg, t) was the right one, when they
had to complete the sentence: a duck weighs about 4___.
An example of an item of medium difficulty is shown in
Fig. 2. To answer this item, students had to use a map that
contained measures for determining a particular distance.
The most difficult items were those in which students had
to solve a problem like 2 � h = ___hours and ___minutes.
5.2 Inspection of the two classifications
Before answering the research questions about students’
achievement in measurement, we first investigated the
dimensionality of the two classifications. This was done to
assess which classification provided a better differentiation
within the structure of the measurement competence and
thus gave a better insight into achievement in measurement.
Our analyses showed that the correlation of the two
KMK competencies was r = .78. Comparing the underlying
Table 3 Alternative classification of measurement items
Type of measurement competence, with sub-categories: number of items (example)
Instrumental knowledge (IK), total 43
Conversion of measures: 9
Comparison of measures: 9 (Fig. 4)
Conversion of times: 17 (Fig. 1)
Comparison of times: 1
Reading clocks or timetables: 7
Measurement sense (MS), total 26
Knowledge of daily life sizes: 10
Knowing which unit of measurement belongs to an attribute: 1
Context problems about calculations with multiple attributes: 5
Context problems about additive calculation with one attribute and multiple units of measurement: 4 (Fig. 2)
Context problems about multiplicative calculation with one attribute and multiple units of measurement: 2
Context problems with one attribute and one unit of measurement: 4
No measurement, total 28
Comparison problems in which the measurement context is not relevant: 3
Context problems with years: 1
Context problems about calculation with money: 20
Context problems about fractions: 1
Bare number problems: 2
Just requires reading: 1 (Fig. 3)
Total: 97
Table 4 Alternative classification of measurement items with clustered sub-categories
Type of measurement competence, with clustered sub-categories: number of items
Instrumental knowledge (IK), total 43
Time items: 25
Items about other attributes: 18
Measurement sense (MS), total 26
Knowledge: 11
Problem solving: 15
Total: 69
658 J. Hannighofer et al.
123
two-dimensionality to a one-dimensional structure revealed
that Competence I Having conceptions of measures and
Competence II Dealing with measures in context situation
were distinguishable constructs (v2 = 4.74; df = 1;
p \ .05). The standard deviation of students’ achievement
was 1.29 for Competence I and .96 for Competence II.
The correlation of the competencies Instrumental
knowledge (IK) and Measurement sense (MS), which are
based on the alternative classification, was r = .70. Comparing
the underlying two-dimensionality to a one-dimensional
structure indicated that IK and MS were
distinguishable constructs (χ² = 4.92; df = 1; p < .05).
The standard deviation of students’ achievement was 1.67 for
IK and 1.18 for MS.
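The reported significance levels can be checked from the test statistics alone. The following is a minimal sketch, assuming the reported values are ordinary chi-square difference statistics with one degree of freedom; the function name `chi2_sf_df1` is introduced here for illustration and is not part of the original analysis.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    For df = 1 the chi-square variate is the square of a standard normal
    variate, so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Test statistics as reported in the text (df = 1 in both cases).
reported = {
    "KMK classification": 4.74,
    "alternative classification": 4.92,
}

for label, statistic in reported.items():
    p_value = chi2_sf_df1(statistic)
    print(f"{label}: chi2 = {statistic:.2f}, p = {p_value:.3f}")
```

Both p values fall below .05, in line with the reported result that each two-dimensional model fits better than the one-dimensional model.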
As both latent correlation coefficients were estimated in a
structural equation model, standard errors of the coefficients
were available. Yet, the two correlation coefficients cannot
be tested for equality, because they stem from different
subsets of items. However, the 95% confidence intervals (CI)
show only a minimal overlap (CI for the KMK classification =
[.73, .82]; CI for the alternative classification = [.66, .74]).
Therefore, we can assume that the correlation coefficients for
both classifications differ. Nevertheless, when comparing the
correlation coefficients obtained from the subset of 69 items (see
Table 4), classified once according to the two KMK competencies
and once according to the alternative competencies, we
did not find a significant difference between these correlation
coefficients (KMK classification r = .71; alternative
classification r = .70, t = .41, p = .69). In terms of the
log-likelihood value (logL) and the information criteria
(AIC, BIC), the alternative classification (logL = -24,068,
AIC = 48,413, BIC = 49,315) fitted the data better than the
KMK classification (logL = -24,129, AIC = 48,554,
BIC = 49,514), as indicated by the larger log-likelihood
value and the smaller values of AIC and BIC.
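The comparison of fit indices and confidence intervals can be retraced with a few lines of arithmetic. This is a sketch using only the values reported in the text; the class `ModelFit` and the helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelFit:
    name: str
    log_likelihood: float
    aic: float
    bic: float


# Fit indices as reported in the text.
kmk = ModelFit("KMK classification", -24129, 48554, 49514)
alternative = ModelFit("alternative classification", -24068, 48413, 49315)


def preferred(a: ModelFit, b: ModelFit) -> Optional[ModelFit]:
    """Return the model with the smaller AIC and BIC, or None if they disagree."""
    if a.aic < b.aic and a.bic < b.bic:
        return a
    if b.aic < a.aic and b.bic < a.bic:
        return b
    return None


def interval_overlap(ci_a: Tuple[float, float], ci_b: Tuple[float, float]) -> float:
    """Length of the overlap of two closed intervals (0.0 if they are disjoint)."""
    return max(0.0, min(ci_a[1], ci_b[1]) - max(ci_a[0], ci_b[0]))


best = preferred(kmk, alternative)
print("Preferred model:", best.name if best else "criteria disagree")
print("AIC difference:", kmk.aic - alternative.aic)
print("BIC difference:", kmk.bic - alternative.bic)

# 95% confidence intervals of the latent correlations, as reported.
overlap = interval_overlap((0.73, 0.82), (0.66, 0.74))
print(f"CI overlap: {overlap:.2f}")
```

Both criteria point in the same direction, and the two intervals share only the narrow range from .73 to .74.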
To ensure that the two-dimensional structure was not due
to characteristics of the items, we constructed a two-
dimensional structure that was based on a random allocation
of items and tested that structure for dimensionality. Two
dimensions consisting of randomly allocated items are
expected to correlate close to one. Hence, the two-dimensional
model is expected to fit the data no better than a
one-dimensional model. The correlation of both arbitrary main
domains was r = .99. The model did not fit the data better than the
one-dimensional model (χ² = .01, p = .97). In summary,
we have three possible models, each consisting of two
dimensions: the model based on the KMK classification, the
model based on the alternative classification, and the model
based on an arbitrary classification. On comparing the fit of
each model in relation to the one-dimensional model, we
found that the arbitrary model fitted worst to the data (i.e., it
did not fit better than the one-dimensional model). The
model based on the KMK classification fits significantly
better than the one-dimensional model (i.e., it separates the
items in two distinguishable constructs), and the model
based on alternative classification also fits better than the
one-dimensional model. More precisely, the model based
on the alternative classification separates the items into two
more clearly distinguishable constructs, as the correlation between
both dimensions in this model (r = .70) is lower than in the
model based on the KMK classification (r = .78).
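The random allocation used for the arbitrary model can be sketched as follows. This is an illustration only: the item identifiers, the split sizes (43 and 26, mirroring the IK and MS item counts), and the seed are assumptions, since the text does not report how the random dimensions were sized or generated.

```python
import random
from typing import List, Sequence, Tuple


def random_split(item_ids: Sequence[int], size_first: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly allocate items to two arbitrary dimensions of fixed sizes."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[:size_first]), sorted(shuffled[size_first:])


# Hypothetical identifiers for the 69 measurement items.
items = list(range(1, 70))
dimension_a, dimension_b = random_split(items, size_first=43)

print(len(dimension_a), "items in arbitrary dimension A")
print(len(dimension_b), "items in arbitrary dimension B")
```

Because the allocation ignores item content, the two resulting dimensions should measure the same construct, which is why a latent correlation near one is expected.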
The test of the invariance of dimensionality shows that
the dimensionality in the alternative classification also
holds in subgroups. Correlation coefficients for the four
groups were .69 for third-grade boys, .64 for fourth-grade
boys, .68 for third-grade girls, and .65 for fourth-grade
girls. The correlation coefficients were tested for equality.
No significant differences in the coefficients were found
(χ² = .15, df = 1, p = .70).
Because the alternative classification results in a clearer
description of the measurement competence from a
domain-specific didactical perspective and is a more dis-
tinctive classification from a psychometric perspective, the
reported findings in the remainder of the article refer only
to the alternative classification results.
5.3 Students’ competence structure in measurement
Figure 5 shows the average scores for all items belonging
to the IK competence (mean p value .45) and to the MS
competence (mean p value .45). On average, the students
scored equally well on both competencies.
However, comparing the average scores of the
items belonging to the sub-categories described in
Table 4 revealed remarkable differences. Within IK as
well as within MS, the sub-categories differed considerably
in difficulty. With respect to IK, the mean p value for
the time items was .36 and for the items about other
attributes .57. A similar difference was found for the MS
items. The mean p value for the knowledge items was .60
and for the problem solving items .34.

Fig. 5 Students’ competence structure in measurement based on an
alternative classification (horizontal axis: mean p value from 0 to 1;
bars for Instrumental knowledge: total IK items, items about other
attributes, time items; bars for Measurement sense: total MS items,
knowledge items, problem solving items)
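The reported sub-category means are consistent with the overall means of .45 when they are weighted by the item counts from Table 4. The sketch below makes this check explicit; it assumes that each competence mean is the item-weighted average of its sub-category means, and the function name is introduced here for illustration.

```python
from typing import List, Tuple


def weighted_mean(groups: List[Tuple[int, float]]) -> float:
    """Item-weighted mean p value from (number of items, mean p value) pairs."""
    total_items = sum(n_items for n_items, _ in groups)
    return sum(n_items * p for n_items, p in groups) / total_items


# (number of items, mean p value) per clustered sub-category, as reported.
ik_groups = [(25, 0.36), (18, 0.57)]  # time items, items about other attributes
ms_groups = [(11, 0.60), (15, 0.34)]  # knowledge items, problem solving items

print(f"IK: {weighted_mean(ik_groups):.2f}")
print(f"MS: {weighted_mean(ms_groups):.2f}")
```

Both weighted means round to the same two-decimal value, so the equal overall difficulty of IK and MS conceals clearly different difficulty profiles within each competence.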
5.4 Measurement competence and other student
characteristics
In Table 5, mean differences of the estimated students’
achievement (PVs) are shown for the total sample of stu-
dents as well as for subgroups specified by grade and
gender for each item subset, i.e., overall measurement
competence, MS, and IK.
As the subgroup means are arbitrary in the IRT frame-
work, we report mean differences relative to the standard
deviation of the respective item subsets. The reference
group was the fourth grade for each item subset.
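A standardized mean difference of this kind can be sketched as follows. The scores below are hypothetical, and the sketch uses the sample standard deviation of the reference group; the study itself works with plausible values from an IRT model, which this example does not reproduce.

```python
from statistics import mean, stdev
from typing import Sequence


def standardized_mean_difference(group: Sequence[float], reference: Sequence[float]) -> float:
    """Difference of the reference mean and the group mean in reference-group SD units."""
    return (mean(reference) - mean(group)) / stdev(reference)


# Hypothetical achievement estimates on an arbitrary logit scale.
grade3 = [-0.9, -0.4, 0.1, -0.2, -0.6]
grade4 = [0.3, -0.5, 0.9, 0.2, 0.6]

difference = standardized_mean_difference(grade3, grade4)
print(f"Grade 3 lies {difference:.2f} reference SDs below grade 4")
```

Dividing by the standard deviation of the reference group removes the arbitrary scale unit, so differences can be compared across item subsets.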
The standardized mean difference in the overall measurement
competence between grades 3 and 4 was .64. This
indicates that third-grade students’ mean achievement was
.64 standard deviations lower than fourth-grade students’
mean achievement. This difference was significant with
F(1, 2,098) = 459.5, p < .001, and ηp² = .09. The standardized
mean difference for MS between grades 3 and 4
was .71. The difference was significant with F(1,
26.5) = 348.2 and p < .001, resulting in a partial eta
squared of ηp² = .11. The standardized mean difference for
IK between grades 3 and 4 was .61. The difference was
significant with F(1, 42.6) = 279.4, p < .001, and
ηp² = .08.
With respect to gender, we found that in grade 3 as well
as in grade 4, boys significantly outperformed girls in
overall measurement competence. For grade 3, the mean
difference was .43 and for grade 4 .39. The difference
between the two grades was significant with F(1,
32.1) = 127.5, p < .001, and ηp² = .04.
For IK and MS, similar results were found as for the
overall measurement competence. For MS, we found a
mean difference of .37 in grade 3 and of .36 in grade 4; the
difference between the grades was significant with F(1,
74.3) = 119.7, p < .001, and ηp² = .03. The interaction
between grade and gender was not significant with F(1,
205.8) = .17, p = .68, and ηp² = .00. With respect to IK,
we found that in grade 3 the mean difference between boys
and girls was .44, and in grade 4 the mean difference was
.40. The difference between the two grades was significant
with F(1, 93.5) = 154.4, p < .001, and ηp² = .04. The
interaction between grade and gender was not significant
with F(1, 171.1) = .5, p = .47, and ηp² = .00.
Table 6 lists the regression coefficients of grade, gender,
and figural reasoning ability, and their two-way interac-
tions for the total sample. We found that the interaction
between grade and gender was not significant for the
overall measurement competence, as well as for the com-
petencies IK and MS. Of the three predictors, grade, gen-
der, and figural reasoning ability, the latter turned out to
have the largest effect on the overall measurement
achievement as well as on the achievement in IK and MS.
Table 5 Mean differences of estimated students’ achievement in
measurement for grade and gender (positive values attributed to grade
4 and to boys, respectively)
Competence | Grade | Gender | Total sample
Overall measurement competence | 3 | .43 | .64
 | 4 | .39 |
Measurement sense (MS) | 3 | .37 | .71
 | 4 | .36 |
Instrumental knowledge (IK) | 3 | .44 | .61
 | 4 | .40 |
Mean differences were standardized. Reference for standardization is
the standard deviation of boys and girls of fourth grade corresponding
to overall measurement competence, MS, or IK, respectively
Table 6 Regression coefficients for overall measurement competence, IK, and MS
Predictor | Overall measurement competence (69 items) B | SE | Instrumental knowledge (43 items) B | SE | Measurement sense (26 items) B | SE
Grade | .67*** | .05 | .78*** | .08 | .67*** | .06
Gender | -.62*** | .05 | -.76*** | .07 | -.47*** | .04
Figural reasoning ability | .36*** | .02 | .42*** | .02 | .29*** | .02
Grade × gender | .01 | .08 | .02 | .11 | -.04 | .07
Grade × figural reasoning ability | -.04 | .03 | -.07** | .03 | .00 | .03
Gender × figural reasoning ability | .02 | .02 | .04 | .03 | .04 | .03
Overall measurement competence: N = 4,850; IK: N = 3,031; MS: N = 3,031
Overall measurement competence: R² = .35; IK: R² = .31; MS: R² = .37
B regression coefficient, SE standard error
* p < .05; ** p < .01; *** p < .001
The significant negative interaction of figural reasoning
ability and grade for IK indicates that the effect
of figural reasoning ability on IK is larger in grade 3 than
in grade 4.
Table 7 gives an overview of the regression coefficients
of the three predictors, grade, gender, and figural reasoning
ability, and their two-way interactions for students with
high figural reasoning ability.
Similar to the results for the whole sample, a significant
grade effect was found for IK and MS in the group
of students with a high figural reasoning ability. However, this
was not the case for the overall measurement competence.
Furthermore, we found no gender effects for students with
a high figural reasoning ability, neither in the overall measurement
competence nor in MS and IK. A significant effect of the
figural reasoning ability was found only for IK.
In Table 8, the regression coefficients of the three pre-
dictors and their two-way interactions for students with a
low figural reasoning ability are shown. In contrast to the
results for the total sample, no significant grade effects
were found for students with a low figural reasoning ability,
neither for the overall measurement competence nor for IK
and MS. Furthermore, for these students we
found a gender effect only for the overall measurement
competence and for IK.
Table 7 Regression coefficients for overall measurement competence, MS, and IK for students with high figural reasoning ability
Predictor | Overall measurement competence (69 items) B | SE | Instrumental knowledge (43 items) B | SE | Measurement sense (26 items) B | SE
Grade | .83 | .43 | 1.34* | .59 | .82* | .34
Gender | -.65 | .42 | -.77 | .57 | -.38 | .43
Figural reasoning ability | .34 | .18 | .47* | .18 | .24 | .15
Grade × gender | -.10 | .19 | .04 | .28 | -.16 | .17
Grade × figural reasoning ability | -.08 | .15 | -.25 | .20 | -.01 | .12
Gender × figural reasoning ability | .05 | .15 | .04 | .20 | .02 | .16
The group with a high figural reasoning ability consists of N = 736 students (grade 3: N = 327; 49% girls; grade 4: N = 409; 56% girls)
Overall measurement competence: R² = .13; IK: R² = .18; MS: R² = .18
B regression coefficient, SE standard error
* p < .05; ** p < .01; *** p < .001

Table 8 Regression coefficients for overall measurement competence, MS, and IK for students with low figural reasoning ability
Predictor | Overall measurement competence (69 items) B | SE | Instrumental knowledge (43 items) B | SE | Measurement sense (26 items) B | SE
Grade | .20 | .39 | .40 | .54 | .46 | .36
Gender | -.87* | .43 | -1.22* | .48 | -.59 | .40
Figural reasoning ability | .35** | .10 | .40** | .13 | .23* | .10
Grade × gender | -.06 | .21 | -.02 | .26 | -.09 | .22
Grade × figural reasoning ability | -.21 | .15 | -.20 | .21 | -.10 | .12
Gender × figural reasoning ability | -.07 | .14 | -.13 | .18 | -.01 | .16
The group with a low figural reasoning ability consists of N = 641 students (grade 3: N = 396; 57% girls; grade 4: N = 245; 57% girls)
Overall measurement competence: R² = .20; IK: R² = .19; MS: R² = .22
B regression coefficient, SE standard error
* p < .05; ** p < .01; *** p < .001

Table 9 includes the results of the three mediation
analyses. In the first analysis, the dependent variable was
the students’ achievement in the overall measurement
domain. In the second analysis, it was the students’
achievement in IK. Finally, in the third analysis, the
dependent variable was students’ achievement in MS. The
independent variable for each analysis was gender, and the
mediation variable for each analysis was students’
achievement in figural reasoning ability. The values in
Table 9 indicate that the indirect effect of gender on
measurement is significant, but small. The question is now
what the role of the figural reasoning ability is. If figural
reasoning ability were a mediator, we would expect
the absolute value of the direct effect to be smaller
than that of the total effect. However, our results indicate
the opposite. Therefore, we may conclude that our mediation
analysis reveals a suppressor effect of the variable
figural reasoning ability, although it is only a very weak
effect. Summarizing these results, we found the following
relations: (1) being a boy increases achievement in measurement;
(2) being a girl increases achievement in figural
reasoning ability; (3) having a high figural reasoning ability
increases measurement achievement, independent
of gender. On the whole, this means that the effect of
gender on measurement achievement is underestimated if
the fact that figural reasoning ability suppresses this effect
is not taken into account.
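The suppressor pattern can be read directly from the effects reported in Table 9. The following sketch checks the decomposition total = direct + indirect and the sign pattern; the tolerance of .011 allows for rounding of the published two-decimal values, and the function names are introduced here for illustration.

```python
from typing import Dict

# Standardized effects of gender as reported in Table 9.
reported: Dict[str, Dict[str, float]] = {
    "overall measurement competence": {"total": -0.13, "indirect": 0.03, "direct": -0.16},
    "Instrumental knowledge": {"total": -0.15, "indirect": 0.03, "direct": -0.18},
    "Measurement sense": {"total": -0.11, "indirect": 0.03, "direct": -0.14},
}


def decomposition_holds(effects: Dict[str, float], tolerance: float = 0.011) -> bool:
    """Check total = direct + indirect up to rounding of the reported values."""
    return abs(effects["total"] - (effects["direct"] + effects["indirect"])) <= tolerance


def is_suppression(effects: Dict[str, float]) -> bool:
    """Direct and indirect effects have opposite signs, so |direct| exceeds |total|."""
    opposite_signs = effects["direct"] * effects["indirect"] < 0
    return opposite_signs and abs(effects["direct"]) > abs(effects["total"])


for outcome, effects in reported.items():
    print(outcome, decomposition_holds(effects), is_suppression(effects))
```

In all three analyses the direct effect is larger in absolute value than the total effect, which is the pattern described in the text as a weak suppressor effect.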
6 Discussion
6.1 Summary of our findings
The analysis of a large set of data collected in 2007 within
the framework of the ESMaG project produced interesting
insight into the measurement competencies of German
students in grades 3 and 4 and how these competencies
were assessed.
An inspection of the KMK standards for measurement
showed that the distinction into the competencies Having
conceptions of measures and Dealing with measures in
context situations that is reflected in these standards was
supported from an empirical point of view. However, the
latent correlation between the two dimensions was relatively
high compared to the correlation in an alternative
classification comprising the measurement competencies
Instrumental knowledge (IK) and Measurement sense
(MS). Because of these findings, we used this alternative
classification for answering our research questions.
Although we have to note here again that our item pool
was limited, our results showed that, on average, the stu-
dents solved an equal number of items on both compe-
tencies correctly. However, we found in both competencies
remarkable differences between the average mean scores of
particular item categories. Within the IK items, those
related to time were more difficult than the IK items about
other attributes. Within the MS items, the problem-solving
items were more difficult than the knowledge items.
The analyses carried out to investigate the role of grade,
gender, and figural reasoning ability showed that all these
predictors had significant effects on the overall measure-
ment competence, as well as on the competencies IK and
MS. Figural reasoning ability was found to have the largest
effect. Our study also confirmed gender differences. Male
students outperformed female students both in the overall
measurement competence and in the IK and MS compe-
tencies. These results agree with those reported by Win-
kelmann et al. (2008) and Winkelmann and van den
Heuvel-Panhuizen (2009). However, related to figural
reasoning ability, our results showed that girls outper-
formed boys.
When testing the relationship between gender and
measurement mediated by figural reasoning ability, we
found a small but significant indirect effect of gender on
measurement achievement. Moreover, the results indicated
a suppressor effect of figural reasoning ability, i.e., gender
differences in measurement are underestimated when
ignoring this mediation.
However, all main effects became less substantial or
even not significant within the subgroup of students with a
high figural reasoning ability and the subgroup with a low
figural reasoning ability. Because in this latter group, no
gain in achievement was found between grades 3 and 4, it
seems that for these students 1 year of extra instruction did
not have an effect on their measurement achievement.
Table 9 Results of mediation analysis
Effect | Dependent variable: overall measurement competence b | SE | Dependent variable: Instrumental knowledge b | SE | Dependent variable: Measurement sense b | SE
Total effect (b3 + b1·b2) | -.13*** | .01 | -.15*** | .02 | -.11*** | .02
Indirect effect (b1·b2) | .03*** | .01 | .03*** | .01 | .03*** | .01
Direct effect (b3) | -.16*** | .01 | -.18*** | .02 | -.14*** | .02
N (overall) = 4,850; N (IK) = 3,031; N (MS) = 3,031
All standard errors are estimated using bias-corrected bootstrap methods with 10,000 samples
b1 standardized regression coefficient of figural reasoning ability on gender, b2 standardized regression coefficient of the measurement ability on
figural reasoning ability, b3 standardized regression coefficient of the measurement ability on gender
* p < .05; ** p < .01; *** p < .001

Another finding was that within the group of students
with a high figural reasoning ability, we did not observe an
effect of gender on the measurement achievement, whereas
within the group with a low figural reasoning ability we
obtained a significant gender effect on the overall measurement
competence and on IK, with boys outperforming
girls.
6.2 Implications for instruction
The main message from our study is that the German pri-
mary school students’ achievement in the domain of
measurement shows on average a rather balanced competence
structure consisting of IK as well as MS. Having
both types of measurement knowledge available is a
good thing, but it is not always achieved in education.
Pesek and Kirshner (2000) found that in classroom practice
there is often no time to deal with relational learning
involving explaining, reasoning, reflecting, connecting, and
communicating. To prepare students for the standardized
tests, teachers often have a preference for teaching instru-
mental knowledge: that is, students must learn skills first
and foremost. Obviously, teachers in Germany do not only
pay attention to plain measurement skills, but also include
activities in their teaching that support the development of
understanding measurement.
When looking closer at the two competencies, it was
revealed that, referring to the IK competence, students
have more difficulties with tasks that deal with time measures,
for example, converting time measures such as
“1 min 20 s = ___sec”, than with tasks that deal with other
attributes. Hence, one implication from this analysis is that
mathematics teachers should focus especially on explana-
tion of how to solve time tasks. From the MS results, it
could be concluded that teachers should offer students
more opportunities to learn measurement-related problem
solving.
Other implications for instruction are connected to the
relation of measurement competencies with gender and
figural reasoning ability. Teachers should be aware of the
fact that girls, on the one hand, on average have higher
scores in figural reasoning ability, but, on the other hand,
have more difficulties with measurement tasks. A possible
reason could be that girls have fewer difficulties in working
with figures when they have to reason about figural patterns,
whereas boys are better at solving practical measurement
tasks related to figures. Therefore, we suggest
that teachers seek ways to increase classroom activities
that provide practice both for measurement abilities
and for abilities that appeal to figural reasoning.
6.3 Implications for assessing educational output
The quality of the above conclusions mainly depends on
the quality of the assessment. Assessment, in turn, is
determined by the quality of the standards that indicate
what competencies are the goals of education and which
are used to develop the items for assessing the students’
measurement achievement. Our study has shown how
essential it is to have a clear and focused structure of the
mathematical domain when evaluating students’ achieve-
ment. Using the initial data of the ESMaG project about
measurement achievement, we found a rather high corre-
lation between the two measurement competencies. Due to
this correlation, we assumed an overlap between them.
Support for this assumption can be seen in the description
of the sub-competencies within the two KMK competen-
cies for measurement. These descriptions are rather
ambiguous. Some of them are described by referring to the
same content of measurement and therefore will not result
in two distinct sub-competencies on which education can
focus. This ambiguity makes it rather difficult not only to
get a clear picture of the educational output of measure-
ment instruction, but also to give adequate feedback to
teachers. We expect that a classification that focuses on
conceptual and procedural measurement knowledge as the
two main distinctive but, of course, also related compe-
tencies would be easier to handle for test designers and will
inform teachers better.
6.4 Limitations of the study and suggestions for further research
While summarizing our results and stating implications for
teaching and assessing measurement as a mathematical
domain, we are quite aware of the limitations of our study.
An important point that should be kept in mind is that we
did a cross-sectional study and not a longitudinal one. This
means that in our study, progress over grades and, conse-
quently, influence of teaching were not established by
following the students. Therefore, prudence is required
when using the results of this study. To gain a better
understanding of how measurement competence develops over time,
research with a cohort study design is necessary.
A further weak point resulted from the rather ambiguous
description of the measurement competencies in the KMK
standards, which were taken as the basis for the item
development. Although we think that the conceptual–pro-
cedural distinction that we used to develop an alternative
classification gives better access to students’ achievement
in measurement and is more informative for educational
decision making, the analyses we did with this new clas-
sification were still based on items that were developed for
the KMK classification. In other words, the items used
were not fine-tuned to the alternative classification.
Therefore, we see our analyses as a first exploration of
using this new distinction. Further research should start
with a re-definition of the standards for the domain
measurement. Then, a new item development process
should be initiated based on these newly formulated stan-
dards, which make more explicit the different knowledge
types students should attain to acquire measurement com-
petencies. Any assessment should start with a clear view of
what should be assessed. We hope our study contributes to
this clarity.
References
Bollen, K. A. (1989). Structural equations with latent variables. New
York: Wiley.
Bos, W., Bonsen, M., Baumert, J., Prenzel, M., Selter, C., & Walther,
G. (Eds.) (2008). TIMSS 2007. Mathematische und naturwissenschaftliche
Kompetenzen von Grundschulkindern in Deutschland im internationalen
Vergleich. [TIMSS 2007. Mathematical
and scientific competencies of primary school students in
Germany in an international context]. Münster: Waxmann.
Bos, W., Lankes, E.-M., Prenzel, M., Schwippert, K., Valtin, R., &
Walther, G. (2003). Erste Ergebnisse aus IGLU. Schülerleistungen
am Ende der vierten Jahrgangsstufe im internationalen
Vergleich. [Fourth-Grade Students in an International Context:
First results from an International Reading Literacy Study
(PIRLS)]. Münster: Waxmann.
Fennema, E. (1979). Women and girls in mathematics—equity in
mathematics education. Educational Studies in Mathematics, 10,
389–401.
Granzer, D. (2009). Von Bildungsstandards zu ihrer Überprüfung:
Grundlagen der Item- und Testentwicklung. [From standards to
evaluation: Examination of educational standards: Background
of item and test development.] In D. Granzer, O. Köller, A.
Bremerich-Vos, M. van den Heuvel-Panhuizen, K. Reiss, & G.
Walther (Eds.), Bildungsstandards Deutsch und Mathematik (pp.
21–30). Weinheim/Basel: Beltz Verlag.
Heller, K. A., & Perleth, Ch. (2000). Kognitiver Fähigkeitstest für
4.-12. Klassen, Revision (KFT 4-12+ R). [Cognitive ability test for
grade 4-12, revision (KFT 4-12+ R).] Göttingen: Hogrefe.
Hiebert, J. (1986). Conceptual and procedural knowledge: The case
of mathematics. Hillsdale: Lawrence Erlbaum Associates.
Kaiser, G., & Steisel, T. (2000). Results of an analysis of the TIMS
study from a gender perspective. Zentralblatt für Didaktik der
Mathematik, 32(1), 18–24.
Kaufmann, G. (1990). Imagery effects on problem solving. In P.
J. Hampson, D. E. Marks, & J. T. E. Richardson (Eds.), Imagery:
Current developments (pp. 169–197). New York: Routledge.
KMK (Konferenz der Kultusminister der Länder in der Bundesrepublik
Deutschland). (2005). Bildungsstandards im Fach Mathematik
für den Primarbereich (Jahrgangsstufe 4) [Educational
standards in mathematics for primary school (fourth grade).]
München: Luchterhand/Wolters Kluwer Deutschland.
Lankes, E.-M., Bos, W., Mohr, I., Plaßmeier, N., Schwippert, K.,
Sibberns, H., & Voss, A. (2003). Anlage und Durchführung der
Internationalen Grundschul-Lese-Untersuchung (IGLU) und
ihrer Erweiterung um Mathematik und Naturwissenschaften
(IGLU-E). [Design and administering of the international
primary school reading research (PIRLS) and its expansion for
mathematics and science.] In W. Bos, E.-M. Lankes, M. Prenzel,
K. Schwippert, R. Valtin, & G. Walther (Eds.), Erste Ergebnisse
aus IGLU. Schülerleistungen am Ende der vierten Jahrgangsstufe
im internationalen Vergleich (pp. 7–28). Münster:
Waxmann.
Lehrer, R. (2003). Developing understanding of measurement. In J.
Kilpatrick, W. G. Martin, & D. Schifter (Eds.), A research
companion to principles and standards for school mathematics.
Reston: National Council of Teachers of Mathematics.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with
missing data. New York: Wiley.
Lobemeier, K. (2005). Welche Leistungen erbringen Viertklässler bei
Aufgaben zum Thema Größen? Untersuchungen zur
mathematisch-naturwissenschaftlichen Kompetenz im Grundschulalter im
Rahmen von IGLU. [How do fourth graders perform in tasks
dealing with attributes? Research of mathematical scientific
competence in primary school in the framework of IGLU.] Kiel,
Germany: Christian-Albrechts-Universität.
MacKinnon, D. P. (2008). Introduction to statistical mediation
analysis. Mahwah: Erlbaum.
MacKinnon, D. P., Lockwood, C. M., & Williams, J. (2004).
Confidence limits for the indirect effect: Distribution of the
product and resampling methods. Multivariate Behavioral
Research, 39, 99–128.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992).
Estimating population characteristics from sparse matrix samples of
item responses. Journal of Educational Measurement, 29, 133–161.
Mullis, I. V. S., Martin, M. O., Beaton, A. E., Gonzalez, E. J., Kelly,
D. L., & Smith, T. A. (1997). Mathematics achievement in the
primary school years. IEA’s third international mathematics and
science study. Chestnut Hill: Boston College.
Mullis, I. V. S., Martin, M. O., & Foy, P. (2008). TIMSS 2007
International Mathematics Report. Chestnut Hill: TIMSS &
PIRLS International Study Center, Lynch School of Education,
Boston College.
Muthén, L. K., & Muthén, B. (1998–2007). Mplus user’s guide.
Version 5. Los Angeles: Muthén & Muthén.
NCTM (National Council of Teachers of Mathematics) (1989).
Curriculum and evaluation standards for school mathematics.
Reston, VA: NCTM.
NCTM (National Council of Teachers of Mathematics) (2000).
Principles and standards for school mathematics. Reston, VA:
NCTM.
OECD (2003). The PISA 2003 assessment framework—Mathematics,
reading, science and problem solving, knowledge and skills.
Paris: OECD.
Pesek, D. D., & Kirshner, D. (2000). Interference of instrumental
instruction in subsequent relational learning. Journal for
Research in Mathematics Education, 31(5), 524–540.
Ratzka, N. (2003). Mathematische Fähigkeiten und Fertigkeiten am
Ende der Grundschulzeit. Empirische Analysen im Anschluss an
TIMSS [Mathematical achievement at the end of primary school.
Empirical analyses based on TIMSS]. Hildesheim: Franzbecker.
Resnick, L. B., & Ford, W. W. (1981). The psychology of
mathematics for instruction. Hillsdale: Erlbaum.
Rittle-Johnson, B., & Alibali, M. W. (1999). Conceptual and
procedural knowledge of mathematics: Does one lead to the
other? Journal of Educational Psychology, 91, 175–189.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys.
New York: Wiley.
Sherman, J. A. (1980). Predicting mathematics grades of high school
girls and boys: A further study. Contemporary Educational
Psychology, 5, 249–255.
Skemp, R. R. (1976). Relational understanding and instrumental
understanding. Mathematics Teaching, 77, 20–26.
Van den Heuvel-Panhuizen, M., & Buys, K. (2008). Young children
learn measurement and geometry. A learning–teaching trajectory
with intermediate attainment targets for the lower grades in
primary school. Rotterdam/Taipei: Sense Publishers.
Vasilyeva, M., Casey, B. M., Dearing, E., & Ganley, C. M. (2009).
Measurement skills in low-income elementary school students:
Exploring the nature of gender differences. Cognition and
Instruction, 27(4), 401–428.
von Davier, M. (2009). Some notes on the reinvention of latent
structure models as diagnostic classification models. Measurement:
Interdisciplinary Research and Perspectives, 7(1), 67–74.
Winkelmann, H., & van den Heuvel-Panhuizen, M. (2009).
Geschlechtsspezifische mathematische Kompetenzen. In D. Granzer,
O. Köller, A. Bremerich-Vos, M. van den Heuvel-Panhuizen, K.
Reiss, & G. Walther (Eds.), Bildungsstandards Deutsch und
Mathematik (pp. 142–156). Weinheim: Beltz Verlag.
Winkelmann, H., van den Heuvel-Panhuizen, M., & Robitzsch, A.
(2008). Gender differences in the mathematics achievements of
German primary school students: Results from a German large-
scale study. ZDM: The International Journal on Mathematics
Education, 40, 601–616.