Demonstrating the consequences of not taking into account sampling designs with TIMSS 2011 data

Dr. Christian Bokhove

Lecturer in Mathematics Education

University of Southampton

EARLI SIG

August 28th 2014

OUTLINE

• International studies

– IEA & OECD

– PISA, TIMSS, …

• Some aspects of their sampling design

– Two-stage sampling

– Weights

– Rotated test design

• What if you don’t take this into account?

– Simulation with TIMSS 2011 data

– Single level model

– Multilevel models

IEA & OECD

The International Association for the Evaluation of Educational Achievement (IEA) is an independent, international cooperative of national research institutions and governmental research agencies. It conducts large-scale comparative studies of educational achievement and other aspects of education.

The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.

http://www.iea.nl/

http://www.oecd.org/education/

PISA

“The Programme for International Student Assessment (PISA) is a triennial international survey which aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students. To date, students representing more than 70 economies have participated in the assessment.”

• The most recent results appeared in 2013, based on data collected in 2012

http://www.oecd.org/pisa/

TIMSS

“TIMSS 2011 is the fifth in IEA’s series of international assessments of student achievement dedicated to improving teaching and learning in mathematics and science. First conducted in 1995, TIMSS reports every four years on the achievement of fourth and eighth grade students.”

http://timssandpirls.bc.edu/timss2011/

Two-stage sampling in educational studies

● Simple random sampling of students is rarely used in educational surveys:

– Too expensive (e.g., training test administrators and travel costs), because the selected students would attend many different schools

– It is not practical to contact many schools

– A link with class, teacher, and school variables is sought

● Sampling is usually conducted in two stages

● First stage

– Schools are selected

● Second stage

– Students (PISA) or classes (TIMSS/PIRLS) are selected

● 35 students selected randomly (PISA)

● One or two intact classes (TIMSS/PIRLS)

Replicate weights

● Replicate weights or resampling techniques are used to calculate correct standard errors in two-stage sampling designs

● The idea behind them:

– There are many possible samples of schools and not all of them yield the same estimates

– Use different samples of schools to calculate estimates

– Take into account error of selecting one school and not another (sampling error)

● Each replicate weight represents one sample

● Variability between estimates reflects the sampling error

Two replication methods

● Jackknife

– TIMSS and PIRLS

– Schools are paired with other similar schools within zones

– A replicate is created for each zone or pair of schools

– One school is randomly removed within each zone and the weight of the other school is doubled

● Balanced repeated replication (BRR)

– Select one school at random within each stratum

– Set its weight to 0

– Double the weight of the other school

– PISA uses a variant of BRR (Fay's method) to avoid a smaller sample size

Source: OECD (2009). PISA Data Analysis Manual: SPSS (2nd ed.). Paris: OECD Publishing.
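The jackknife procedure described above can be sketched on toy data. This is a minimal illustration, not the TIMSS implementation: the scores, weights, zone labels and the `dropped` flags are invented, and the function names are hypothetical. It assumes the paired-school jackknife in which one replicate is built per zone and the sampling variance is the sum of squared deviations of the replicate estimates from the full-sample estimate.

```python
def weighted_mean(values, weights):
    """Weighted mean of the scores."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jk2_replicate_weights(weights, zones, dropped):
    """Build one set of replicate weights per zone.

    Within the replicate's own zone the dropped school gets weight 0 and
    its partner's weight is doubled; all other zones keep their weights.
    """
    replicates = []
    for zone in sorted(set(zones)):
        rep = []
        for w, z, is_dropped in zip(weights, zones, dropped):
            if z != zone:
                rep.append(w)
            elif is_dropped:
                rep.append(0.0)
            else:
                rep.append(2.0 * w)
        replicates.append(rep)
    return replicates


def jackknife_se(values, weights, replicates):
    """Full-sample estimate and its jackknife standard error."""
    full = weighted_mean(values, weights)
    variance = sum((weighted_mean(values, rep) - full) ** 2 for rep in replicates)
    return full, variance ** 0.5


# Toy data (invented): eight students, two zones, two schools per zone.
scores = [500, 520, 480, 510, 530, 490, 470, 505]
weights = [10, 10, 12, 12, 8, 8, 11, 11]
zones = [1, 1, 1, 1, 2, 2, 2, 2]
# Students in the school that is randomly removed within their zone.
dropped = [True, True, False, False, False, False, True, True]

replicates = jk2_replicate_weights(weights, zones, dropped)
estimate, se = jackknife_se(scores, weights, replicates)
print(f"mean = {estimate:.2f}, jackknife SE = {se:.2f}")
```

Fay's variant of BRR follows the same pattern, except that the weights in each replicate are scaled by factors strictly between 0 and 2, so no school is removed entirely.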

Weights

• In theory, the sampling design provides student samples with equal selection probabilities.

• But variation in the number of classes selected and differential patterns of nonresponse can result in varying selection probabilities, requiring a unique sampling weight for the students in each participating class in the study.

• Total weight (TOTWGT)

– Sums to the student population size in each country

• The overall student sampling weight is the product of the final weight components for schools, classes, and students

– Important in multilevel analyses

– School level: final school weight

– Student level: final student weight multiplied by final class weight
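The split of the overall weight into a school-level and a student-level part can be sketched as follows. The record and its key names (`school_weight`, `class_weight`, `pupil_weight`) are hypothetical placeholders for the final weight components mentioned above, and the numbers are invented.

```python
from math import isclose

# Hypothetical final weight components for one student record (invented values).
student = {
    "school_weight": 25.0,  # final school weight
    "class_weight": 2.0,    # final class weight
    "pupil_weight": 1.1,    # final student weight within the class
}


def multilevel_weights(record):
    """Return (student-level weight, school-level weight) for a two-level model."""
    level2 = record["school_weight"]
    level1 = record["class_weight"] * record["pupil_weight"]
    return level1, level2


level1, level2 = multilevel_weights(student)

# The product of the two level-specific weights is the overall student weight.
overall = level1 * level2
print(f"student level = {level1:.2f}, school level = {level2:.2f}, overall = {overall:.2f}")
```

Keeping the two parts separate matters in a multilevel model because each level is weighted by its own selection probability, whereas a single-level analysis only needs the overall product.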

Rotated test design

● The item pool should include a large number of items for domain validity (e.g., mathematical literacy)

● At the same time:

– Fatigue biases results of long tests

– Schools refuse to participate in lengthy studies

● Rotated test forms

– Students are assigned a subset of item pool

– Minimize testing time

Plausible values

● Rotated booklets introduce challenges for estimating academic achievement

– Students miss data on a number of items

● Plausible values methods are employed to obtain population estimates with rotated booklet designs

● Students do not answer all items but plausible scores are produced as if they had responded to all items based on

– Responses to test items

– Background characteristics

Plausible values

● Plausible values are random draws from the distribution of a student's ability

– Instead of a single point estimate, a set of plausible values is drawn for each student

● A single score cannot be calculated because data is missing for a number of items

● Plausible values account for imputation error

– Making inference on ability from small number of items

● Estimation should be conducted separately for each plausible value

– Typically five plausible values are considered

– The variability between estimates reflects the imputation error
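Combining the separate estimates can be sketched with Rubin's (1987) rules in their general form. This is a minimal sketch with invented numbers: the function name `combine_rubin` is hypothetical, and it assumes that each plausible value has already produced its own estimate and sampling variance.

```python
from statistics import mean, variance


def combine_rubin(estimates, sampling_variances):
    """Combine per-plausible-value results with Rubin's rules.

    estimates:          one point estimate per plausible value
    sampling_variances: the sampling variance of each of those estimates
    """
    m = len(estimates)
    point = mean(estimates)            # final estimate: average over plausible values
    within = mean(sampling_variances)  # average sampling variance
    between = variance(estimates)      # imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, total ** 0.5


# Invented results for five plausible values.
pv_estimates = [505.1, 507.3, 506.0, 508.2, 506.9]
pv_variances = [6.8, 7.1, 6.9, 7.0, 7.2]

score, se = combine_rubin(pv_estimates, pv_variances)
print(f"score = {score:.2f}, SE = {se:.2f}")
```

Using only one plausible value, or averaging the plausible values before the analysis, leaves out the between-imputation term, which is why those shortcuts tend to understate the standard error.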

Challenge

● Ignoring the complex design leads to wrong conclusions, such as different point estimates and/or underestimated standard errors; see Rutkowski et al. (2010)

– Variance estimation: not using the jackknife or BRR

– Weights: not taking weights into account (e.g. Rutkowski et al., 2010: in the Bulgarian TIMSS 2007 sample, students from vocational and profiled schools had a higher probability of selection), or, in a multilevel situation, choosing the wrong composite weights

– Treatment of plausible values: averaging the (five) plausible values or choosing only one plausible value instead of applying Rubin’s rules

● Drent et al. (2013) formulated quality criteria (low, satisfactory, high)

● Standard software cannot handle replicate weights and plausible values

Available software

● IDB Analyzer (SPSS)

● NAEP Data Explorer (web tool)

● PISA SPSS macros

● R package 'intsvy' (Daniel Caro, Oxford)

– Free

– Does not rely on commercial software like SPSS or SAS

– Open source

– Can be extended to perform other analyses

Available software

Multilevel software

● R

– Has multilevel package but no weights

– Can link to MLwiN

● MLwiN

– Have to combine plausible values manually

– No resampling

– Does handle weights

● HLM

– Combines plausible values

– Weights

– No resampling

Simulation with TIMSS 2011 data

• TIMSS 2011

• Three aspects: jackknife, weights, plausible values

• Five countries: England is chosen as the baseline, using the grade 8 ranking in TIMSS 2011. One arbitrary country significantly above England in the rankings (Singapore) and one significantly below England (Norway) are chosen. In addition, the countries one place above and one place below England are chosen (United States and Hungary, respectively).

Simulation with TIMSS 2011 data

• Data preparation:

– Publicly available TIMSS 2011 year 8 data files are used.

– Additional columns are calculated: the average of the five plausible values and different weighting columns.

• Two experiments: A. single level analyses, and B. multilevel analyses with students nested in schools.

• For experiment A, the open-source R package intsvy (Caro, 2014) is used.

• Experiment B looks at multilevel models by constructing null models in HLM 6.08 for five countries with student and school levels.

Single level

Different scenarios:

• Two conditions concern variance estimation with jackknife (JK): either jackknife is applied or isn’t applied.

• Two conditions concern weights (Wgt): either weights are applied or are not applied.

• Three final conditions for the maths achievement scores are used for plausible values:

– PVR denotes the correct approach using ‘plausible values with Rubin’s rules’.

– PVA denotes the ‘mean of the plausible values’.

– PV1 only uses ‘the first plausible value’.

A total of 2×2×3=12 cases are calculated, as shown in the table on the next slide. Case 1 replicates the values from the international report (Mullis, Martin, Foy, & Arora, 2012).
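The 2×2×3 design can be enumerated explicitly. The sketch below only reproduces the case numbering used in this presentation (Case 1 is PVR with jackknife and with weights); the dictionary keys are illustrative names, not part of any package.

```python
from itertools import product

pv_methods = ["PVR", "PVA", "PV1"]

# Within each plausible-value method the order is:
# (with JK, with Wgt), (no JK, with Wgt), (with JK, no Wgt), (no JK, no Wgt).
cases = [
    {"case": number, "pv": pv, "weights": use_weights, "jackknife": use_jackknife}
    for number, (pv, use_weights, use_jackknife) in enumerate(
        product(pv_methods, [True, False], [True, False]), start=1
    )
]

for case in cases:
    jk = "with JK" if case["jackknife"] else "no JK"
    wgt = "with Wgt" if case["weights"] else "no Wgt"
    print(f"Case {case['case']:>2}: {case['pv']}, {jk}, {wgt}")
```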

PV1         Case 9                Case 10               Case 11               Case 12
            With JK, with Wgt     No JK, with Wgt       With JK, no Wgt       No JK, no Wgt
Country     Score    SE    #      Score    SE    #      Score    SE    #      Score    SE    #
Singapore   609.71   3.68  1      609.71   1.08  1      606.22   3.63  1      606.22   1.08  1
USA         508.75   2.58  2      508.75   0.75  2      508.92   2.52  4      508.92   0.74  4
England     506.03   5.45  3      506.03   1.36  3      509.44   5.59  3      509.44   1.37  3
Hungary     504.75   3.44  4      504.75   1.22  4      513.38   2.96  2      513.38   1.16  2
Norway      475.24   2.38  5      475.24   1.03  5      477.04   2.62  5      477.04   1.03  5

PVA         Case 5                Case 6                Case 7                Case 8
            With JK, with Wgt     No JK, with Wgt       With JK, no Wgt       No JK, no Wgt
Country     Score    SE    #      Score    SE    #      Score    SE    #      Score    SE    #
Singapore   610.99   3.73  1      610.99   1.06  1      607.54   3.68  1      607.54   1.06  1
USA         509.48   2.59  2      509.48   0.73  2      509.68   2.53  4      509.68   0.72  4
England     506.76   5.48  3      506.76   1.34  3      509.99   5.64  3      509.99   1.35  3
Hungary     504.81   3.48  4      504.81   1.21  4      513.47   2.98  2      513.47   1.15  2
Norway      474.64   2.37  5      474.64   0.99  5      476.55   2.64  5      476.55   1.00  5

PVR         Case 1                Case 2                Case 3                Case 4
            With JK, with Wgt     No JK, with Wgt       With JK, no Wgt       No JK, no Wgt
Country     Score    SE    #      Score    SE    #      Score    SE    #      Score    SE    #
Singapore   610.99   3.77  1      610.99   0.83  1      607.54   3.74  1      607.54   0.87  1
USA         509.48   2.63  2      509.48   0.55  2      509.68   2.58  4      509.68   0.57  4
England     506.76   5.53  3      506.76   0.89  3      509.99   5.63  3      509.99   0.70  3
Hungary     504.81   3.48  4      504.81   0.47  4      513.47   2.98  2      513.47   0.40  2
Norway      474.64   2.44  5      474.64   0.55  5      476.55   2.66  5      476.55   0.50  5

Maths achievement scores and standard errors for five countries for twelve different cases with weights, jackknife and plausible values (# = rank among the five countries).

Observations

Differences in achievement results and standard errors:

• Not taking into account jackknife (highlighted in yellow on the slide)

– The average score stays the same.

– The standard error is underestimated.

– So the relative ranking is unchanged, but significance testing is affected.

• Not taking into account weights (highlighted in orange on the slide)

– Influences achievement scores: USA, England, Hungary and Norway score higher, and Singapore scores lower.

– Has an impact on relative rankings.

– Standard errors differ, some higher and some lower.

• Plausible values (highlighted in green on the slide)

– PVA and PVR give the same achievement score; PV1 gives a different one.

– PVA and PV1 underestimate the standard error.

– But there is no clear pattern between PVA and PV1 (which contradicts previous literature).
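The effect on significance testing can be made concrete with the USA and England rows of the table (Case 1 versus Case 2). This is a rough sketch: it assumes the two country samples are independent and uses 1.96 as the two-sided 5% critical value, and the function name is hypothetical.

```python
from math import sqrt


def z_for_difference(score_a, se_a, score_b, se_b):
    """z statistic for the difference between two independent estimates."""
    return (score_a - score_b) / sqrt(se_a ** 2 + se_b ** 2)


# USA versus England, PVR with weights, values taken from the table.
z_with_jk = z_for_difference(509.48, 2.63, 506.76, 5.53)     # Case 1: with jackknife
z_without_jk = z_for_difference(509.48, 0.55, 506.76, 0.89)  # Case 2: no jackknife

CRITICAL = 1.96  # assumed two-sided 5% critical value
for label, z in [("with JK", z_with_jk), ("no JK", z_without_jk)]:
    verdict = "significant" if abs(z) > CRITICAL else "not significant"
    print(f"{label}: z = {z:.2f} ({verdict})")
```

The point estimates are identical in both cases, so the change in the verdict comes entirely from the standard errors.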

Multilevel

• HLM was used, which does not offer jackknife variance estimation.

• Note that with MLwiN you need to combine plausible values manually.

• Three conditions concern weights: no weights, weights only at student level (see Willms & Smith, 2005) and final weights (Rutkowski et al., 2010).

• Three conditions for the maths achievement scores are used for plausible values:

– PVR denotes the correct approach using ‘plausible values with Rubin’s rules’.

– PVA denotes the ‘mean of the plausible values’.

– PV1 only uses ‘the first plausible value’.

• The 3×3 scenarios are reported in table 3.

Maths achievement scores and standard errors of five countries for multilevel null models in three different weighting scenarios S1, S4 and S6 and plausible values.

Observations

Differences in achievement results and standard errors:

• The different weighting methods greatly influence achievement scores and standard errors. This also has an impact on the relative rankings. There does not seem to be a pattern in over- or underestimation of scores and standard errors.

• For plausible values, PV1 yields a different average than PVA and PVR: lower in three countries, the exceptions being Hungary and Norway. For PVA and PV1, the standard error is underestimated with respect to PVR. However, the underestimation of SEs differs only slightly between PVA and PV1, with PVA in most cases being closer to, or just as close to, PVR as PV1.

• Singapore: PV1, PVA, PVR

• United States: PVA, PVR, PV1

• England: PV1, PVA, PVR

• Hungary: PV1, PVA, PVR

• Norway: PVA, PV1, PVR

Final thoughts

• Not taking into account three features of complex sample designs for large-scale assessments (LSAs) can have a big influence on achievement scores, standard errors and rankings.

• Confirms findings by Rutkowski et al. (2010).

• Not all ‘rules of thumb’ from previous literature (Drent et al., 2013; Rutkowski et al., 2010) seem to hold.

• Therefore, educational researchers should always take care when analysing LSA data; doing so will hopefully improve future LSA analyses.

• Need transparent methodology

THANK YOU

QUESTIONS/DISCUSSION

Relevant references

Beaton, A.E., & Gonzalez, E.J. (1995). NAEP Primer. Chestnut Hill, MA: Center for the Study of Testing, Evaluation and Educational Policy, Boston College.

Caro, D. (2014). intsvy: International Assessment Data Manager. R package version 1.3. http://CRAN.R-project.org/package=intsvy

Drent, M., Meelissen, M.R.M., & van der Kleij, F.M. (2013). The contribution of TIMSS to the link between school and classroom factors and student achievement. Journal of Curriculum Studies, 45(2), 198-224.

Goldstein, H. (2004). International comparisons of student attainment: some issues arising from the PISA study. Assessment in Education, 11(3), 319-330.

Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210-231.

Martin, M.O. & Mullis, I.V.S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS 2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.

Mullis, I.V.S., Martin, M.O., Foy, P., & Arora, A. (2012). TIMSS 2011 International results in mathematics. Lynch School of Education, Boston College.

Rubin, D. (1987). Multiple imputation for nonresponse in sample surveys. New York: John Wiley.

Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142-151.

Von Davier, M., Gonzalez, E., & Mislevy, R.J. (2009). Plausible values: What are they and why do we need them? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 9-36.

Willms, J.D., & Smith, T. (2005). A manual for conducting analyses with data from TIMSS and PISA. Report prepared for UNESCO Institute for Statistics.
