Investigating the Statistical and Cognitive Dimensions in Large-Scale Science Assessments

CESC-SSHRC Symposium 2005

Jacqueline P. Leighton

Acknowledgments

Canadian Education Statistics Council (CESC)

Social Sciences and Humanities Research Council (SSHRC)

Ms. Rebecca J. Gokiert, Ms. Ying Cui

CRAME colleagues

Overview

Rationale

Materials: SAIP Science 99

Methods & Results: Phase 1

Methods & Results: Phase 2

Implications for Policy

Rationale

To identify the dimensional structure of the School Achievement Indicators Program (SAIP) Science Assessment

To find support (or not) for the view that science performance is associated with multiple and distinct thinking skills

Materials—SAIP Science 99

A dichotomously scored two-stage test

Administered to students in both Grade 8 and Grade 11 (13- and 16-year-olds)

6 content domains

5 ability levels

Materials—SAIP Science 99

[Diagram: two-stage design. Routing Test A is followed by either Test B or Test C, giving the combined test sections AB and AC.]

Method—Phase 1: Exploratory

Dimensionality test or DIMTEST (Stout et al., 2001) is a nonparametric procedure used to test the null hypothesis that a set of test data is unidimensional

Methods—Phase 1: Exploratory

Exploratory factor analysis (EFA) of the tetrachoric correlations was conducted, using 5 recommended decision rules to determine the number of factors

The factors retained were rotated using orthogonal rotation procedures (i.e., quartimax, varimax) and an oblique transformation procedure (i.e., direct oblimin)
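Because the SAIP items are dichotomously scored, the EFA starts from tetrachoric rather than Pearson correlations. The sketch below illustrates the idea with the classical cosine-pi approximation computed from a 2x2 table of item-pair counts. It is only an approximation chosen for brevity; the slides do not say which estimator was used, and the function names are hypothetical.

```python
import math


def two_by_two_counts(item_x, item_y):
    """Cross-tabulate two lists of 0/1 item scores into counts (a, b, c, d).

    a = both correct, b = only the first item correct,
    c = only the second item correct, d = both incorrect.
    """
    a = b = c = d = 0
    for x, y in zip(item_x, item_y):
        if x == 1 and y == 1:
            a += 1
        elif x == 1 and y == 0:
            b += 1
        elif x == 0 and y == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    Assumption: this is an illustrative approximation, not the
    estimator used in the original analysis.
    """
    if a * d == 0 and b * c == 0:
        raise ValueError("table is too sparse to approximate a correlation")
    if b * c == 0:
        # No discordant pairs of one kind: the approximation tends to +1.
        return 1.0
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))


# Made-up counts for one pair of items (not SAIP data).
r_example = tetrachoric_cos_pi(40, 10, 10, 40)
print(round(r_example, 4))
```

Applying such a function to every pair of items would give the correlation matrix that is then factored and rotated.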

Results—Phase 1: DIMTEST

Test section AB

Group                  Sample size for FAC   Sample size for T   T        P
13-year-old, group 1   1000                  2054                4.8891   0.0000
13-year-old, group 2   1000                  2054                5.4745   0.0000
16-year-old, group 1   800                   1200                3.5009   0.0002
16-year-old, group 2   800                   1200                3.5425   0.0002

Test section AC

Group                  Sample size for FAC   Sample size for T   T        P
13-year-old, group 1   800                   2103                4.1602   0.0000
13-year-old, group 2   800                   2103                4.3294   0.0000
16-year-old, group 1   1000                  3010                4.2448   0.0000
16-year-old, group 2   1000                  3010                5.9118   0.0000
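The reported P values can be checked against the T statistics. The sketch below assumes that T is referred to a standard normal distribution and that the null hypothesis of unidimensionality is rejected for large positive T (a one-sided upper-tail test). The function name is hypothetical.

```python
import math


def upper_tail_p(t_stat):
    """One-sided upper-tail p-value for a standard normal statistic."""
    return 0.5 * math.erfc(t_stat / math.sqrt(2.0))


# T statistics for test section AB, taken from the results above.
section_ab_t = {
    "13-year-old, group 1": 4.8891,
    "13-year-old, group 2": 5.4745,
    "16-year-old, group 1": 3.5009,
    "16-year-old, group 2": 3.5425,
}

for group, t_stat in section_ab_t.items():
    print(f"{group}: T = {t_stat:.4f}, p = {upper_tail_p(t_stat):.4f}")
```

Under these assumptions the two 16-year-old values round to 0.0002 and the 13-year-old values to 0.0000, in line with the reported P values.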

Results—Phase 1: EFA

Decision rules indicated two factors

Oblique results were interpreted because the factors shared low to moderate correlations (range of .014 to .384)

Method—Phase 2: Confirmatory

A common shortcoming of EFA is the sparse description of the factors found to underlie the data (Haig, 2005)

For each item with a loading equal to or greater than 0.3, the following information was recorded:

First five to ten words of the test question

Specific factor on which the item loaded

Content standard or objective

Ability level of the item
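The recording step described above can be sketched as a simple filter over a loading matrix. The items, loadings, domain labels, and all names below are made up for illustration; only the 0.3 cut-off and the four recorded fields come from the slide.

```python
from dataclasses import dataclass

LOADING_THRESHOLD = 0.3  # cut-off stated in the text


@dataclass
class SalientItem:
    stem: str            # first words of the test question
    factor: int          # factor on which the item loaded
    content_domain: str  # content standard or objective
    ability_level: int   # ability level of the item


def first_words(question, limit=10):
    """Return at most `limit` leading words of a question."""
    return " ".join(question.split()[:limit])


def record_salient_items(items, loadings, threshold=LOADING_THRESHOLD):
    """Record one entry per (item, factor) pair whose loading meets the cut-off."""
    records = []
    for item, row in zip(items, loadings):
        for factor, loading in enumerate(row, start=1):
            if loading >= threshold:
                records.append(
                    SalientItem(
                        stem=first_words(item["question"]),
                        factor=factor,
                        content_domain=item["content_domain"],
                        ability_level=item["ability_level"],
                    )
                )
    return records


# Hypothetical items and two-factor loadings (not SAIP data).
items = [
    {"question": "Why does a metal spoon feel colder than a wooden spoon at room temperature?",
     "content_domain": "Domain 1", "ability_level": 3},
    {"question": "Which of the following is a mammal?",
     "content_domain": "Domain 2", "ability_level": 1},
    {"question": "Identify the planet closest to the Sun.",
     "content_domain": "Domain 3", "ability_level": 2},
]
loadings = [[0.45, 0.10], [0.05, 0.30], [0.12, 0.22]]

records = record_salient_items(items, loadings)
for record in records:
    print(record)
```

Reading the recorded stems factor by factor is what allows the factors to be described in substantive terms.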

Methods—Phase 2: Confirmatory

Preliminary analyses of the AB and AC tests suggested that the two factors tapped student reasoning about causes and effects and student reasoning about category membership


A recently published review article by Deanna Kuhn and David Dean Jr. (2004): In the ongoing process of managing and reducing the complexity of information from the external environment, individuals typically make use of two forms of inference, causal and non-causal


SAIP items were reviewed and coded according to whether they contained primarily causal or categorical-type key words

We used key introductory words such as “why,” “how,” “cause/effect,” “what,” “which,” or “identify” to code items as either primarily causal or primarily categorical
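A keyword-based coding rule of this kind can be sketched as follows. The assignment of "why," "how," and "cause/effect" to causal items and of "what," "which," and "identify" to categorical items is an assumption based on the order in which the words are listed. Letting the first key word in the stem decide, and the "uncoded" fallback, are also illustrative choices.

```python
import re

# Assumed mapping of key words to reasoning type (see lead-in).
CAUSAL_KEYWORDS = {"why", "how", "cause", "effect"}
CATEGORICAL_KEYWORDS = {"what", "which", "identify"}


def code_item(stem):
    """Code an item stem by the first key word it contains."""
    for word in re.findall(r"[a-z]+", stem.lower()):
        if word in CAUSAL_KEYWORDS:
            return "causal"
        if word in CATEGORICAL_KEYWORDS:
            return "categorical"
    return "uncoded"


# Made-up item stems (not SAIP items).
for stem in [
    "Why does ice float on water?",
    "Which of the following is a mammal?",
    "Calculate the density of the block.",
]:
    print(code_item(stem), "-", stem)
```

Stems that contain none of the key words are left uncoded here and would need to be resolved by human judgment.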


Influence of item format on students’ interpretation of the item as requiring causal versus categorical reasoning

SAIP items were also coded according to item format

Format might function as a proxy for invoking either causal or categorical reasoning


Linear factor analysis with LISREL to estimate the parameters for the 2-dimensional models associated with the Causal-Categorical Model (CCM) and the Item Format Model (IFM)

Linear factor analysis to estimate the parameters for a 6-dimensional model using item coding associated with the Test Specifications Model (TSM)
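For reference, the three models can be read as instances of the standard linear confirmatory factor model. The notation below is the conventional one and is not taken from the slides. It assumes that each item loads only on the factor it was coded to, so the models differ only in the number of factors $m$ and in the pattern of free loadings in $\Lambda$.

```latex
\begin{aligned}
x      &= \Lambda \xi + \delta, \\
\Sigma &= \Lambda \Phi \Lambda^{\top} + \Theta_{\delta}
\end{aligned}
```

Here $x$ is the vector of $p$ item scores, $\xi$ the vector of $m$ factors ($m = 2$ for the CCM and the IFM, $m = 6$ for the TSM), $\Lambda$ the $p \times m$ loading matrix, $\Phi$ the factor covariance matrix, and $\Theta_{\delta}$ the covariance matrix of the unique terms $\delta$.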

Results—Phase 2: Confirmatory

Using recommended fit indices (Gierl & Rogers, 1996), none of the models fit the AB test data adequately

For the AC data, the IFM provided a consistently better fit than the CCM and TSM

Policy Implications

Multidimensional latent structure of the SAIP Science Assessment

Distinct forms of thinking in science

Sub-scores might be a better form of score reporting for SAIP and similar large-scale assessments

Policy Implications

Superiority of the Item Format Model in confirmatory factor analysis

Item format may function to elicit distinct forms of reasoning in science: causal and categorical

Policy Implications

Use of SAIP sub-scores to measure and gauge improvements in specific forms of reasoning in students

Test design and feedback that is focused on cognitive skills as well as content
