

Riverside® Interim Assessments

Research Overview

Version 2


Copyright © 2012 by The Riverside Publishing Company. All rights reserved.

Permission is hereby granted to educational institutions to reprint or photocopy in classroom quantities the pages or sheets in this work that carry The Riverside Publishing Company copyright notice. These pages are designed to be reproduced by teachers for use in their classes, provided each copy made shows the copyright notice. Such copies may not be sold and further distribution is expressly prohibited. Riverside is not authorized to grant permissions for further uses of reprinted text without the permission of their owners. Permission must be obtained from the individual copyright owners as identified herein. Except as authorized above, prior written permission must be obtained from Riverside to reproduce or transmit this work or portions thereof in any other form or by any other electronic or mechanical means, including any information storage or retrieval system, unless expressly permitted by federal copyright law. Address inquiries to Permissions, Riverside, 3800 Golf Rd., Suite 200, Rolling Meadows, IL 60008.


Contents

Part 1 Introduction
    About This Guide
        Purpose
        How to Use This Guide
    About the Riverside Interim Assessments
    Purpose of the Common Core State Standards

Part 2 Test Development
    General Development
    Test Content
        Domains Measured by the Riverside Interim Assessments
    Scaling and Equating
        Scaling
        Equating
        Assessing Unidimensionality of the Data
        Assessing Local Independence of the Data
        Assessing Data Fit to the Model

Part 3 Scale Scores and Reporting
    Standards Setting and Domain Scale Scores
        Aggregate Proficiency Scores
    Total Scale Scores
    Development of the Total Score Scale
    Linking the Riverside Interim Assessments Growth Scale to Iowa Assessments, Form E

Part 4 Score Interpretation
    Using Domain-level Scores

Part 5 Administration Scenarios
    Iowa Assessments—Fall Model
    Iowa Assessments—Spring Model
    Iowa Assessments—Fall/Spring Model


Part 1 Introduction

About This Guide

Purpose

The Riverside® Interim Assessments Research Overview describes the research methods and standards employed in the development of the Riverside Interim Assessments.

How to Use This Guide

This guide focuses on activities that occur in the Adopt phase of the assessment life cycle. The full life cycle spans the following activities:

• Understand your options and make informed decisions

• Get organized and prepare for testing

• Administer the tests according to the directions

• Prepare answer documents for scoring

• Analyze test results and communicate with students, parents, and staff

About the Riverside Interim Assessments

The Riverside Interim Assessments have been designed to assess the Common Core State Standards (CCSS) in English Language Arts (ELA) and Mathematics. The Riverside Interim Assessments provide information to schools and districts that will help them make informed decisions about the progress being made by their students and the effectiveness of their instructional programs in teaching the CCSS.

The Riverside Interim Assessments can be administered independently or within a Balanced Assessment System.

• Administer the Riverside Interim Assessments independently as both a pretest for standards not yet covered and a posttest for standards for which instruction has been provided.

• Administer the Riverside Interim Assessments in a Balanced Assessment System with a summative assessment, such as the Iowa Assessments™, to provide data that help determine whether student progress during the year meets estimates.

The Riverside Interim Assessments include the following features:

• measurement of content covered by the CCSS

• three parallel, alternate forms (for each grade and content area) that can be used interchangeably

• vertical growth scale that can support longitudinal monitoring of student performance

• linkage to the Iowa Assessments for an enhanced assessment system

• subscore or domain reporting that can assist educators in identifying strengths and weaknesses in student performance

• flexible administration models that can serve the needs of many school systems

Purpose of the Common Core State Standards

The CCSS were developed to “provide a consistent, clear understanding of what students are expected to learn… The standards are designed to be robust and relevant to the real world, reflecting the knowledge and skills that [students] need for success in college and careers” (Common Core State Standards Initiative, 2011).

The CCSS provide a progression of ELA and mathematics skills toward college and career readiness. These broad standards encompass skills such as reading, writing, speaking, listening, language, problem solving, abstract reasoning, attention to precision, and many others. To implement the CCSS, much work will be required to develop both appropriate curricula and assessment systems.

Common Core State Standards Initiative. (2011). Retrieved November 11, 2011, from http://www.corestandards.org/

Part 2 Test Development

General Development

Creating assessments that deliver reliable performance information, such as the Riverside Interim Assessments, requires careful evaluation of test materials. It is important that students find the assessment relevant, interesting, and engaging but not offensive, troubling, or distracting. In an effort to achieve this balance, Riverside analyzed every item and the tests as a whole during the test development process for the following key elements:

• test content

• bias and sensitivity

• representational fairness

• language usage

• stereotyping

• controversial or emotionally charged subject matter

• historical context

The test design of the Riverside Interim Assessments takes into account the important diversity of society and avoids language, symbols, gestures, words, phrases, or examples that are generally regarded as sexist, racist, offensive, inappropriate, or negative toward any group.

Test Content

Items for the Riverside Interim Assessments have been written to measure the CCSS in English Language Arts (ELA) and Mathematics. All items were written to measure the specified content standards at the specified grade level.

Items are both standalone and passage driven. ELA items written for reading are typically passage driven, requiring information from the passage to answer the question. Most other items, including those covering mathematics standards and other ELA standards, were written to stand by themselves and do not require information from a passage or prompt to answer the question. Standalone items are not linked to any other items or stimulus material. All items are selected response (multiple choice); students select a response from a group of three (Grade 2) or four (Grades 3–11) answer choices. The Mathematics tests and the Language domain consist only of standalone, selected-response items. For the reading component, all items are passage driven.


Domains Measured by the Riverside Interim Assessments

The Riverside Interim Assessments have been designed to measure a portion of the CCSS. As noted earlier, these assessments are in a traditional, selected-response format and therefore do not address skills such as speaking, listening, and writing. Blueprints were drafted that first identified which standards could be assessed in this format and then established the emphasis, or weight, with which each standard would be assessed. Scores are provided for those domains that contain sufficient numbers of items.

Table 1 shows the domains that are reported on each of the Riverside Interim Assessments.

Table 1: Reporting Domains of the Riverside Interim Assessments, by Grade

Mathematics

| Domain | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|
| Operations and Algebraic Thinking | X | X | X | X |   |   |   |   |   |   |
| Number and Operations in Base Ten | X | X | X | X |   |   |   |   |   |   |
| Number and Operations—Fractions |   | X | X | X |   |   |   |   |   |   |
| Measurement and Data | X | X | X | X |   |   |   |   |   |   |
| Geometry | X | X | X | X | X | X | X |   |   |   |
| Ratio and Proportional Relationships |   |   |   |   | X | X |   |   |   |   |
| The Number System |   |   |   |   | X | X | X |   |   |   |
| Expressions and Equations |   |   |   |   | X | X | X |   |   |   |
| Statistics and Probability |   |   |   |   | X | X | X |   | X | X |
| Functions |   |   |   |   |   |   | X | X | X | X |
| Algebra |   |   |   |   |   |   |   | X | X | X |
| Numbers and Quantity |   |   |   |   |   |   |   | X |   |   |

English Language Arts and Literacy in History/Social Studies, Science, and Technical Subjects

| Domain | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|
| Reading Standards for Literature | X | X | X | X | X | X | X | X | X | X |
| Reading Standards for Informational Text | X | X | X | X | X | X | X | X | X | X |
| Foundational Skills | X | X |   |   |   |   |   |   |   |   |
| Language Standards |   |   | X | X | X | X | X | X | X | X |
| Reading Standards for Literacy in History/Social Studies |   |   |   |   | X | X | X | X | X | X |
| Reading Standards for Literacy in Science and Technical Subjects |   |   |   |   | X | X | X | X | X | X |

Scaling and Equating

The following sections provide details about the scaling and equating procedures used during the development of the Riverside Interim Assessments.

Scaling

Riverside used a pre-equating model (Kolen & Brennan, 2004) to produce equated forms for each grade and content area. The equating methods, described below, maintain the consistency of the assessment scale scores over time and ensure that the achievement levels are applied consistently across the three assessment forms for each grade and content area.

Riverside’s researchers used the Rasch Item Response Theory (IRT) model and WINSTEPS software (Linacre, 2006) to scale and equate the Riverside Interim Assessments. WINSTEPS is designed to produce a single scale by analyzing data from a set of students’ responses to the items. Rasch IRT is a modern test theory that expresses the probability of a correct response to an item as a function of two parameters: the ability of the person and the difficulty of the item. One key feature of the Rasch model is the placement of estimates of a test-taker’s ability and item difficulty on the same scale. This feature distinguishes the Rasch model from classical test theory (CTT), which can also be used to analyze test results but does not take item difficulty into account, defining a test-taker’s ability simply as the total score on the test.
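Concretely, the Rasch model gives the probability of a correct response as exp(theta − b) / (1 + exp(theta − b)), where theta is the person’s ability and b is the item’s difficulty, both expressed in logits. The short Python sketch below illustrates the relationship; it is a minimal illustration of the model, not Riverside’s or WINSTEPS’s implementation.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """P(correct) under the Rasch model, with ability theta and item
    difficulty b expressed on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is exactly 0.5;
# one extra logit of ability raises it to about 0.73.
print(rasch_probability(0.0, 0.0))  # 0.5
print(rasch_probability(1.0, 0.0))  # ~0.731
```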

Equating

IRT pre-equating involves using field test data to scale item parameters and equate test forms before the forms are administered operationally. The approach used to pre-equate the Riverside Interim Assessments is described in the following steps.

1. Calibrate all 2011 operational items from the field test concurrently.

2. Establish the overall base scale through calibration of the 2011 operational forms.

3. Establish each domain base scale through calibration of the 2011 operational items for each domain.

Step 1: Concurrent Calibration of 2011 Field Test Forms

In the 2011 field test, five test forms made up of selected-response items were spiraled within each classroom for each grade and content area. Assuming that randomly equivalent groups of students took every form, Riverside researchers calibrated the complete pool of items for each content area (all operational items) concurrently, placing all items on a common IRT scale. Table 2 below shows the number of items on each field test form and on each operational form.

Kolen, M. J., & Brennan, R. L. (2004). Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.). New York: Springer-Verlag.

Linacre, J. M. (2006). WINSTEPS [Computer software manual]. Chicago: MESA Press.

Page 10: Riverside Interim Assessments · Scaling and Equating Test Development 5 The following sections provide details about the scaling and equating procedures used during the development

6 Riverside Interim Assessments Research Overview

Table 2: 2011 Number of Items on Each Field Test and Operational Form

| Grade | ELA Field Test | ELA Operational | Mathematics Field Test | Mathematics Operational |
|---|---|---|---|---|
| 2 | 25 | 33 | 33 | 33 |
| 3 | 42 | 35 | 38 | 38 |
| 4 | 42 | 35 | 38 | 38 |
| 5 | 42 | 35 | 45 | 45 |
| 6 | 54 | 45 | 45 | 45 |
| 7 | 54 | 45 | 45 | 45 |
| 8 | 54 | 45 | 45 | 45 |
| 9 | 54 | 45 | 40 | 40 |
| 10 | 54 | 45 | 51 | 51 |
| 11 | 54 | 45 | 41 | 44 |

Step 2: Calibration of the 2011 Operational Forms

Three separate operational forms were constructed for each grade and content area. The forms were built to be consistent with the test blueprint and, based on statistics from the field test, parallel to one another. To establish the base scale for each grade and content area test, researchers created a table that converted raw scores to test-taker ability estimates, calculated with the Rasch IRT model, for each operational form.
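Because the raw score is a sufficient statistic for ability under the Rasch model, such a conversion table can be built by solving, for each raw score, for the ability whose expected test score equals that raw score. The sketch below does this with Newton-Raphson for a hypothetical five-item form; the difficulties are invented (operational forms contain 33 to 51 items, per Table 2), and it shows the general technique rather than Riverside’s exact procedure.

```python
import math

def rasch_p(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def theta_for_raw_score(raw: int, difficulties: list, tol: float = 1e-8) -> float:
    """Maximum-likelihood ability for a raw score: solve sum_i P_i(theta) = raw
    by Newton-Raphson. Finite estimates exist only for interior raw scores
    (0 < raw < number of items)."""
    theta = 0.0
    for _ in range(100):
        p = [rasch_p(theta, b) for b in difficulties]
        f = sum(p) - raw                          # expected minus observed score
        info = sum(pi * (1.0 - pi) for pi in p)   # test information (slope of f)
        step = f / info
        theta -= step
        if abs(step) < tol:
            break
    return theta

difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hypothetical item difficulties
table = {raw: round(theta_for_raw_score(raw, difficulties), 2)
         for raw in range(1, len(difficulties))}
print(table)  # raw score 2 maps to roughly -0.5 logits for these items
```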

Step 3: Calibration of the 2011 Operational Items for Each Domain

For each domain on every operational form, researchers completed another calibration and created tables that converted the raw scores to test-taker ability estimates for each domain.

In addition to the three steps described above, Riverside researchers completed an analysis to evaluate classical item statistics for the operational forms for each grade and content area, using the data from the 2011 field test administration. These analyses supported test development efforts by providing additional data that aided the selection of the highest-quality items for inclusion in each operational form.

Because the Rasch model is the basis for all scoring and scaling analyses associated with the Riverside Interim Assessments, the utility of the results from the 2011 field test administration depends on the degree to which the assumptions of the model are met and the degree to which the test data fit the model. The assumptions of the Rasch model are that (1) the data are unidimensional (a single trait influences the position of the items or students on the scale) and (2) the data are locally independent, meaning that responses to one item do not depend on responses to another item. The sections below address these assumptions and include evaluations of the dimensionality and local independence of the data, as well as fit indices.

Assessing Unidimensionality of the Data

Riverside researchers completed a residual-based, unrotated principal components analysis (PCA) to assess the unidimensionality assumption of the Rasch model. The purpose of this analysis was to reveal contrasts between opposing factors by examining the variance explained by factors not accounted for by the Rasch model. That is, the Rasch dimension was removed first, and the residual variance (the proportion of variation in the data set unaccounted for by the Rasch model) was then analyzed. For the model to hold, no second dimension should account for a practically significant amount of residual variance.

Analyses of the operational forms for each grade and content area generally showed that the secondary dimension accounted for less than 5% of the total variance; it was therefore considered of little practical importance.
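The following sketch shows the flavor of this check on simulated data: generate responses that fit the Rasch model, form standardized residuals, and inspect the leading eigenvalue of the residual correlation matrix. The sample sizes are invented, the generating parameters stand in for estimates, and the variance accounting differs in detail from WINSTEPS, so treat this as a conceptual sketch rather than a reproduction of Riverside’s analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Rasch-fitting data: 1,000 test takers by 40 items (invented sizes).
theta = rng.normal(0.0, 1.0, size=(1000, 1))
b = np.linspace(-2.0, 2.0, 40).reshape(1, -1)
p = 1.0 / (1.0 + np.exp(-(theta - b)))
x = rng.binomial(1, p)

# Standardized residuals: (observed - expected) / model standard deviation.
z = (x - p) / np.sqrt(p * (1 - p))

# Unrotated PCA of the residuals: eigenvalues of their correlation matrix,
# largest first.
eigvals = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]

# Share of residual variance captured by the first contrast; for data that
# fit a single Rasch dimension, this stays small.
print(eigvals[0] / eigvals.sum())
```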

Assessing Local Independence of the Data

Based on the principal component analysis, standardized residual correlations were produced to assess the local independence assumption of the Rasch model. The purpose of these analyses was to detect dependency between pairs of items. Results of these analyses generally supported the assumption of local independence; values for standardized residual correlations were generally low, indicating little dependency between pairs of items.
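A minimal sketch of the corresponding check, again on simulated data with the generating parameters standing in for estimates: compute each item’s standardized residuals and report the largest between-item residual correlation, which should sit near zero when local independence holds.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=(1000, 1))               # invented sample
b = np.linspace(-2.0, 2.0, 40).reshape(1, -1)    # invented item difficulties
p = 1.0 / (1.0 + np.exp(-(theta - b)))
x = rng.binomial(1, p)

# Correlations among items' standardized residuals; once the Rasch dimension
# is removed, locally independent items leave only noise-level correlations.
z = (x - p) / np.sqrt(p * (1 - p))
r = np.corrcoef(z, rowvar=False)
np.fill_diagonal(r, 0.0)

i, j = np.unravel_index(np.abs(r).argmax(), r.shape)
print(f"largest |residual correlation|: {r[i, j]:.3f} (items {i} and {j})")
```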

Assessing Data Fit to the Model

Two statistics were used to evaluate how well the data fit the Rasch model: infit and outfit. Infit (inlier-sensitive fit) is sensitive to aberrations in item response patterns at the test-taker’s ability level. High infit statistics indicate unexpected responses to items that are well targeted to the test-taker’s ability; for example, a test-taker incorrectly answers a number of items well matched to his or her ability. Low infit statistics, while not a threat to measurement, may indicate overfit of the data to the model, which can artificially inflate reliability statistics. Outfit (outlier-sensitive fit) is sensitive to outliers, that is, to aberrant responses to items with difficulty far from a test-taker’s ability. For example, the test-taker incorrectly answers items that should be easy or correctly answers items that should be difficult. High outfit values may indicate lucky guessing or careless mistakes. Relatively speaking, extremely high infit values are believed to be a greater threat to the measurement process than extreme outfit values.
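For dichotomous items these statistics have simple forms. With raw residual x − p and model variance p(1 − p) for each response, outfit mean square is the plain average of squared standardized residuals, while infit mean square divides summed squared raw residuals by summed variances, so responses near an item’s targeted ability carry more weight. A sketch on simulated data (invented sizes), flagging items against the 0.7–1.3 window described below:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(size=(1000, 1))
b = np.linspace(-2.0, 2.0, 40).reshape(1, -1)
p = 1.0 / (1.0 + np.exp(-(theta - b)))   # Rasch model probabilities
x = rng.binomial(1, p)                    # simulated responses

resid = x - p
var = p * (1 - p)                         # model variance of each response

# Outfit: unweighted mean of squared standardized residuals (outlier-sensitive).
outfit_ms = np.mean(resid**2 / var, axis=0)

# Infit: information-weighted, so well-targeted responses dominate.
infit_ms = np.sum(resid**2, axis=0) / np.sum(var, axis=0)

flagged = np.where((infit_ms < 0.7) | (infit_ms > 1.3) |
                   (outfit_ms < 0.7) | (outfit_ms > 1.3))[0]
print("items outside the 0.7-1.3 MS window:", flagged)
```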

Infit and outfit can be expressed as a mean square (MS) statistic or a standardized metric (z). Both ways of presenting these data can be useful because they provide different perspectives. Because MS values are more oriented toward practical significance, they are reported below. Rules of thumb regarding “practically significant” MS fit values vary; for the Riverside Interim Assessments, values below 0.7 and above 1.3 were considered outside the range of acceptable fit. Tables 3 through 6 provide item summary statistics, including summary fit statistics, for the Grade 4 and Grade 8 Riverside Interim Assessments in both ELA and Mathematics. The results presented for these grades are representative of those for all grades and both content areas. Note that the data are based on the operational test calibrations, which were used to establish the base scale for the assessments.

Table 3: Item Summary Statistics across Three Operational Forms for ELA Grade 4

| Statistic | Rasch Difficulty Estimate | p-value | MS Infit | MS Outfit | Point Biserial |
|---|---|---|---|---|---|
| # of Items | 105 | 105 | 105 | 105 | 105 |
| Mean | 0.00 | 0.57 | 0.99 | 1.03 | 0.42 |
| Standard Deviation (SD) | 0.97 | 0.17 | 0.11 | 0.32 | 0.11 |
| Minimum | –2.18 | 0.14 | 0.73 | 0.49 | –0.08 |
| 10th Percentile | –1.34 | 0.31 | 0.86 | 0.75 | 0.27 |
| 25th Percentile | –0.59 | 0.46 | 0.92 | 0.86 | 0.35 |
| 50th Percentile | 0.01 | 0.58 | 0.98 | 0.96 | 0.45 |
| 75th Percentile | 0.55 | 0.69 | 1.06 | 1.11 | 0.49 |
| 90th Percentile | 1.33 | 0.80 | 1.14 | 1.34 | 0.56 |
| Maximum | 2.46 | 0.90 | 1.28 | 2.74 | 0.61 |

Table 4: Item Summary Statistics across Three Operational Forms for ELA Grade 8

| Statistic | Rasch Difficulty Estimate | p-value | MS Infit | MS Outfit | Point Biserial |
|---|---|---|---|---|---|
| # of Items | 135 | 135 | 135 | 135 | 135 |
| Mean | 0.00 | 0.57 | 0.99 | 1.01 | 0.42 |
| SD | 0.86 | 0.16 | 0.12 | 0.28 | 0.11 |
| Minimum | –2.10 | 0.11 | 0.77 | 0.47 | 0.01 |
| 10th Percentile | –1.14 | 0.36 | 0.85 | 0.73 | 0.27 |
| 25th Percentile | –0.56 | 0.46 | 0.90 | 0.83 | 0.36 |
| 50th Percentile | 0.03 | 0.57 | 0.99 | 0.98 | 0.44 |
| 75th Percentile | 0.63 | 0.68 | 1.07 | 1.13 | 0.51 |
| 90th Percentile | 1.05 | 0.77 | 1.17 | 1.27 | 0.54 |
| Maximum | 2.74 | 0.89 | 1.34 | 2.75 | 0.59 |


Table 5: Item Summary Statistics across Three Operational Forms for Mathematics Grade 4

| Statistic | Rasch Difficulty Estimate | p-value | MS Infit | MS Outfit | Point Biserial |
|---|---|---|---|---|---|
| # of Items | 114 | 114 | 114 | 114 | 114 |
| Mean | –0.02 | 0.49 | 1.00 | 1.03 | 0.38 |
| SD | 1.13 | 0.19 | 0.10 | 0.23 | 0.10 |
| Minimum | –2.48 | 0.13 | 0.83 | 0.70 | 0.08 |
| 10th Percentile | –1.55 | 0.23 | 0.88 | 0.80 | 0.26 |
| 25th Percentile | –0.88 | 0.32 | 0.91 | 0.87 | 0.31 |
| 50th Percentile | –0.05 | 0.49 | 0.98 | 0.99 | 0.40 |
| 75th Percentile | 0.84 | 0.66 | 1.06 | 1.15 | 0.46 |
| 90th Percentile | 1.32 | 0.76 | 1.14 | 1.29 | 0.48 |
| Maximum | 3.02 | 0.82 | 1.38 | 2.35 | 0.54 |

Table 6: Item Summary Statistics across Three Operational Forms for Mathematics Grade 8

| Statistic | Rasch Difficulty Estimate | p-value | MS Infit | MS Outfit | Point Biserial |
|---|---|---|---|---|---|
| # of Items | 135 | 135 | 135 | 135 | 135 |
| Mean | –0.01 | 0.39 | 1.00 | 1.01 | 0.32 |
| SD | 0.64 | 0.13 | 0.08 | 0.12 | 0.11 |
| Minimum | –1.55 | 0.10 | 0.85 | 0.80 | 0.07 |
| 10th Percentile | –0.78 | 0.24 | 0.90 | 0.88 | 0.15 |
| 25th Percentile | –0.42 | 0.29 | 0.94 | 0.92 | 0.23 |
| 50th Percentile | –0.01 | 0.38 | 0.99 | 0.99 | 0.33 |
| 75th Percentile | 0.40 | 0.46 | 1.05 | 1.09 | 0.39 |
| 90th Percentile | 0.81 | 0.57 | 1.11 | 1.18 | 0.45 |
| Maximum | 1.70 | 0.71 | 1.21 | 1.29 | 0.53 |


Part 3 Scale Scores and Reporting

Standards Setting and Domain Scale Scores

Domain scale scores for the Riverside Interim Assessments for the reporting domains listed in Table 1 are based on a minimum of eight items per domain in Mathematics and a minimum of nine items per domain in ELA. Because these scores were equated, they enable comparisons of student performance on the domains across administrations of the three forms throughout a school year. To provide additional information about student performance, following the field testing and building of the operational forms, Riverside held a standards-setting workshop to determine recommended proficiency classifications for each reporting domain as well as for the total score.

To guide the standards setting, Riverside content specialists developed performance-level descriptors for each domain at each grade level. These descriptors were used during the standards setting to ensure that participants had a common understanding of a proficient student. A proficient student was expected to have been immersed in the CCSS for a minimum of the past three years of schooling, and the proficient descriptor was limited to the content covered on the blueprint of the assessment. The following descriptors for Grade 6, ELA, Reading for Literature and Grade 7, Math, Ratio and Proportional Relationships are representative of the descriptors written.

Grade 6, Domain: Reading for Literature The student’s overall performance in Reading for Literature meets the standard set for students in sixth grade. Students performing at the proficient level consistently demonstrate a clear understanding of grade-level literary texts. They demonstrate understanding of key ideas and details in literature by citing textual evidence, determining a theme or central idea, summarizing texts, and describing the evolution of plot and characters in a text.

Students exhibit an understanding of the craft and structure of literature by determining the meaning of words and phrases, including figurative and connotative meanings; analyzing how parts of a text contribute to the overall structure and meaning; and explaining how point of view is developed in a text. Students at this performance level integrate knowledge and ideas by comparing and contrasting the themes and topics of two or more texts in different forms or genres.

Grade 7, Domain: Ratio and Proportional Relationships The student’s overall performance in Ratio and Proportional Relationships meets the standard set for students in seventh grade. Students performing at the proficient level demonstrate a consistent ability to compute unit rates. They are able to represent and use proportional relationships to solve multistep ratio and percent problems.

Riverside staff members planned and facilitated the standards-setting workshops. Sixteen panelists, who are experienced educators with a strong understanding of the CCSS, participated in the standards-setting meeting. The panels were organized by content area and grade into four groups: Math grades 2–6 and grades 7–11; Reading grades 2–6 and grades 7–11. Each group included four panelists.

The direct consensus method was used to set the proficient cut score for each reporting domain. In the first round, panelists rated the items individually to determine a raw cut score for the domain. Results were compiled and reported, and the panelists then discussed their ratings together. In the second round, panelists considered that discussion as they reviewed and, as needed, revised their ratings. In the third round, participants reviewed the median of their ratings and again discussed their differences in an attempt to reach a common cut score for each reporting domain. If the panel reached consensus, that value was submitted as the final recommended cut score; if not, the panelists completed a final rating form and their ratings were averaged to produce the panel’s recommended cut score. In every case, participants reached consensus on the cut score for each reporting domain, so the averaging fallback was never needed.
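Mechanically, each round reduces to a simple aggregation of the panelists’ raw-score ratings; a toy sketch of that arithmetic, with all ratings invented:

```python
import statistics

# Final-round ratings from the four panelists in one group (hypothetical).
ratings = [6, 7, 6, 6]

panel_median = statistics.median(ratings)   # the summary shown back to the panel
consensus = len(set(ratings)) == 1          # did every panelist converge?
recommended_cut = ratings[0] if consensus else statistics.mean(ratings)
print(panel_median, consensus, recommended_cut)  # 6 False 6.25
```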

In addition to the rounds of ratings, panelists participated in vertical articulation meetings to ensure consistency in cut scores across grades. In those discussions, the results of each panel were reviewed together as a whole, and the participants discussed the ratings and had an opportunity to revisit the tests, items, and blueprints and make adjustments to the cut scores. Surveys submitted by the standard-setting participants confirmed that the cut scores produced by this process were representative of the proficient performance-level descriptors.

Aggregate Proficiency Scores

In addition to the domain proficiency cut scores, the content experts participating in the standards setting recommended proficiency cut scores for the aggregate content areas of ELA and Mathematics at each grade. Each aggregate cut score was calculated by summing the domain cut scores on the test form. These cut scores separate performance into three categories: Needs Improvement, Approaching Proficiency, and Proficient. It is important to note that these proficiency-level categories were set based on the Common Core State Standards and are recommended for educators who have a CCSS-based curriculum. However, they are recommendations only; users of the Riverside Interim Assessments are free to establish their own proficiency-level categories to best reflect the specific needs of their students.
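Under this rule, a content-area cut is simply the sum of the form’s domain cuts. A small illustration (domain names taken from Table 1; the cut values are invented):

```python
# Hypothetical domain-level raw cut scores for one Grade 2 Mathematics form.
domain_cuts = {
    "Operations and Algebraic Thinking": 6,
    "Number and Operations in Base Ten": 5,
    "Measurement and Data": 6,
    "Geometry": 5,
}
aggregate_proficient_cut = sum(domain_cuts.values())
print(aggregate_proficient_cut)  # 22
```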

Total Scale Scores

Students receive a total scale score for both the ELA and Mathematics tests. Within a grade and content area, total test scale scores and domain scale scores may be compared across the three forms. Raw scores and average raw scores are not comparable across forms and should not be used to compare student performance from form to form.

Riverside partnered with Iowa Testing Programs at The University of Iowa to create the scale for the total ELA and Mathematics scores, mapping each raw score point to the Riverside Interim Assessments scale score. The result was the creation of vertical scales designed to facilitate comparisons of student performance using any of the three forms at different points during the school year. The next section provides a summary of this process.

Development of the Total Score Scale

The Iowa Assessments Standard Score (SS) scale (Hoover, Dunbar, & Frisbie, 2003) was used as the foundation for the Riverside Interim Assessments total score scale. Essentially, the scale was developed to be consistent with the existing Iowa Assessments growth scale through the use of simple effect-size measures. The Riverside Interim Assessments span Grades 2 through 11. The grade-to-grade growth pattern was established by computing effect-size-like measures between adjacent grades (the difference between adjacent-grade medians divided by the pooled standard deviation [SD] for the adjacent grades) on the Iowa Assessments SS scale. As described earlier, item response theory (IRT) was used to establish an initial metric for the Riverside Interim Assessments score scale. Differences between the medians of adjacent grades were then defined by the effect-size measures from the Iowa Assessments SS scale. One advantage of this approach is that it requires no explicit assumptions about the characteristics of individual items (e.g., unidimensionality and local independence) or about construct equivalence. The method simply defines the distance between within-grade distributions in standard units, creating the spacing required to produce the growth feature of the Riverside Interim Assessments scale.
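A sketch of the effect-size computation described above, with invented Standard Score samples for two adjacent grades; statistics.variance and a degrees-of-freedom-weighted pooled SD stand in for whatever exact estimators were used operationally.

```python
import statistics

def pooled_sd(a: list, b: list) -> float:
    """Pooled SD of two samples: square root of the df-weighted variance."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    return (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5

def grade_to_grade_effect_size(lower: list, upper: list) -> float:
    """Adjacent-grade spacing: difference in medians in pooled-SD units."""
    gap = statistics.median(upper) - statistics.median(lower)
    return gap / pooled_sd(lower, upper)

# Hypothetical Iowa Assessments SS samples for two adjacent grades.
grade3 = [165, 172, 180, 188, 195, 201, 210]
grade4 = [180, 187, 195, 203, 210, 216, 225]
print(round(grade_to_grade_effect_size(grade3, grade4), 2))  # ~0.93
```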

The ranges for the Riverside Interim Assessments scale scores for each test and grade are shown in Table 7. Also shown in the table are the recommended cut scores for “proficient” performance. Specifically, these scale scores represent the minimum scores for classifying students as proficient based on recommendations from Riverside. Please note that these scale scores apply to all forms within the grade and content area designations. That is, regardless of which form (A, B, or C) is administered and scored, these score ranges are applicable.

Table 7: Scale Score Ranges for ELA and Mathematics

| Grade | ELA Lower Bound | ELA Proficient | ELA Upper Bound | Math Lower Bound | Math Proficient | Math Upper Bound |
|---|---|---|---|---|---|---|
| 2 | 75 | 120 | 157 | 84 | 120 | 142 |
| 3 | 76 | 133 | 181 | 85 | 136 | 175 |
| 4 | 77 | 144 | 203 | 86 | 146 | 194 |
| 5 | 78 | 154 | 222 | 87 | 167 | 213 |
| 6 | 82 | 170 | 238 | 88 | 182 | 232 |
| 7 | 86 | 192 | 254 | 89 | 209 | 251 |
| 8 | 90 | 197 | 270 | 93 | 229 | 267 |
| 9 | 94 | 207 | 286 | 97 | 239 | 283 |
| 10 | 98 | 224 | 302 | 104 | 250 | 296 |
| 11 | 105 | 249 | 315 | 111 | 258 | 309 |

Hoover, H. D., Dunbar, S. B., & Frisbie, D. A. (2003). The Iowa Tests: Guide to Research and Development. Rolling Meadows, IL: Riverside Publishing.

Linking the Riverside Interim Assessments Growth Scale to Iowa Assessments, Form E

For those educators who elect to use the Iowa Assessments with the Riverside Interim Assessments as part of a comprehensive Balanced Assessment System, the Riverside Interim Assessments total score scales for ELA and Mathematics have been aligned to associated Iowa Assessments scores. As noted above, the growth scale for the Riverside Interim Assessments has been defined using basic characteristics of the Iowa SS scale. Links between the growth scale and the Iowa Assessments scale itself have been established through a process referred to as “scale alignment” (Holland & Dorans, 2006). In this way, the Riverside Interim Assessments and the Iowa Assessments can be used together to monitor the growth of students over time. However, because the Riverside Interim Assessments and the Iowa Assessments measure related but slightly different constructs, estimated ranges of probable Iowa Assessments scores are provided for each Riverside Interim Assessments score in lieu of actual point estimates.

The combined Riverside Multimeasure Student Roster report presents both Riverside Interim Assessments scale scores and estimated score ranges for the Riverside Interim Assessments in the Iowa Assessments SS metric. Check with your Riverside Assessment Consultant for report availability.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational Measurement (4th ed.). Washington, DC: National Council on Measurement in Education and American Council on Education.


Part 4 Score Interpretation

Using Domain-level Scores

The Riverside Interim Assessments are designed as a criterion-referenced testing program with the Common Core State Standards (CCSS) as their foundation. The assessments are most useful for answering questions such as “How accomplished are my students in understanding the CCSS?” or “Which students are having more difficulty than others in understanding CCSS concepts?” Answering these questions can be aided by classifying students with respect to predetermined levels of proficiency. If student performance falls short of the scores associated with proficient performance, educators have an opportunity to address the deficiency.

Riverside Interim Assessments provide domain-level scores as well as an overall score for each content area. Domain scores provide more granularity than the total score and therefore give a better indication of students’ strengths and weaknesses within a content area. Domain-level scores can be more actionable for determining which instructional interventions may help students improve their understanding. Students who are “proficient” in a domain demonstrate an understanding of the CCSS content; students classified as “approaching proficiency” could benefit from more targeted instruction; and students classified as “needs improvement” are in most need of focused instruction. A review of student performance with respect to proficiency-level classifications early in the year can provide a significant opportunity for educators to help their students improve.

As mentioned earlier, for convenience Riverside has provided proficiency level recommendations that may be useful for many school systems. However, these recommendations may not meet the needs of all. Riverside Interim Assessments users should feel free to establish their own proficiency levels that better meet their specific student population and needs. If users have the resources and inclination to do their own standard-setting work, they are encouraged to do so; the process can be an excellent professional development opportunity for educators.

The proficiency recommendations from Riverside apply to both the domain scores and the overall (total) scores. When evaluating the domain scores, proficiency classification information can be determined directly from the score itself. Domain scores of 4 or 5 are considered to represent “proficient” performance; a domain score of 3 represents “approaching proficiency”; and domain scores of 2 or below represent the “needs improvement” category. These relationships apply to all domain scores, regardless of grade level or content area.

Proficiency level classifications for the total score scales are not as directly intuitive as they are for the domains. Tables 8 and 9, for ELA and Mathematics, respectively, convey the total score ranges that represent each proficiency category for each test area.


Table 8: Recommended Proficiency Level Score Ranges for ELA Total Score

| Grade | Needs Improvement | Approaching Proficiency | Proficient |
|---|---|---|---|
| 2 | 106 and below | 107–119 | 120 and above |
| 3 | 118 and below | 119–132 | 133 and above |
| 4 | 124 and below | 125–143 | 144 and above |
| 5 | 132 and below | 133–153 | 154 and above |
| 6 | 144 and below | 145–169 | 170 and above |
| 7 | 165 and below | 166–191 | 192 and above |
| 8 | 171 and below | 172–196 | 197 and above |
| 9 | 179 and below | 180–206 | 207 and above |
| 10 | 196 and below | 197–223 | 224 and above |
| 11 | 213 and below | 214–248 | 249 and above |

Table 9: Recommended Proficiency Level Score Ranges for Math Total Score

| Grade | Needs Improvement | Approaching Proficiency | Proficient |
|---|---|---|---|
| 2 | 112 and below | 113–119 | 120 and above |
| 3 | 122 and below | 123–135 | 136 and above |
| 4 | 129 and below | 130–145 | 146 and above |
| 5 | 148 and below | 149–166 | 167 and above |
| 6 | 153 and below | 154–181 | 182 and above |
| 7 | 180 and below | 181–208 | 209 and above |
| 8 | 191 and below | 192–228 | 229 and above |
| 9 | 205 and below | 206–238 | 239 and above |
| 10 | 204 and below | 205–249 | 250 and above |
| 11 | 222 and below | 223–257 | 258 and above |
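Taken together, the domain rule (scores of 4 or 5 proficient, 3 approaching proficiency, 2 or below needs improvement) and the total-score cuts in Tables 8 and 9 reduce to simple threshold lookups. A sketch using the ELA cuts from Table 8; the function and dictionary names are illustrative only.

```python
# (approaching_cut, proficient_cut) for each grade, from Table 8 (ELA).
ELA_TOTAL_CUTS = {
    2: (107, 120), 3: (119, 133), 4: (125, 144), 5: (133, 154),
    6: (145, 170), 7: (166, 192), 8: (172, 197), 9: (180, 207),
    10: (197, 224), 11: (214, 249),
}

def classify_ela_total(grade: int, scale_score: int) -> str:
    approaching, proficient = ELA_TOTAL_CUTS[grade]
    if scale_score >= proficient:
        return "Proficient"
    if scale_score >= approaching:
        return "Approaching Proficiency"
    return "Needs Improvement"

def classify_domain(domain_score: int) -> str:
    """Domain scores use the same rule at every grade and content area."""
    if domain_score >= 4:
        return "Proficient"
    if domain_score == 3:
        return "Approaching Proficiency"
    return "Needs Improvement"

print(classify_ela_total(4, 150))  # Proficient (Grade 4 ELA cut is 144)
print(classify_ela_total(4, 130))  # Approaching Proficiency
print(classify_domain(3))          # Approaching Proficiency
```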


Part 5 Administration Scenarios

Although the Riverside Interim Assessments have been designed to be administered independently, their value can be enhanced when they are used within an integrated Balanced Assessment System. One of the simplest models for using the tests in this way is to administer the three forms of the Riverside Interim Assessments at approximately the first quarter, the halfway point, and the third quarter of a school year. At each administration, the Riverside Interim Assessments serve as both a pretest for standards not yet covered and a posttest for standards for which instruction has already been provided. Educators might choose to incorporate other related assessments as well—perhaps something at the start of the school year and/or an assessment at the end of the school year. For public school educators, the latter test might be the state assessment.

The Riverside Interim Assessments have been designed to integrate with the Iowa Assessments, also published by Riverside. Three possible models for incorporating the Iowa Assessments with the Riverside Interim Assessments are presented below. Variations of these options are also possible and reasonable. Educators can and should determine the administration model that best addresses their local needs.

Iowa Assessments—Fall Model

• Iowa Assessments fall administration

• Riverside Interim Assessments 1—25% of school year (end of first quarter)

• Riverside Interim Assessments 2—50% of school year (end of first semester)

• Riverside Interim Assessments 3—75% of school year (end of third quarter)

Iowa Assessments—Spring Model

• Riverside Interim Assessments 1—25% of school year (end of first quarter)

• Riverside Interim Assessments 2—50% of school year (end of first semester)

• Riverside Interim Assessments 3—75% of school year (end of third quarter)

• Iowa Assessments spring administration (end of year)

Iowa Assessments—Fall/Spring Model

• Iowa Assessments fall administration

• Riverside Interim Assessments 1—25% of school year (end of first quarter)

• Riverside Interim Assessments 2—50% of school year (end of first semester)

• Riverside Interim Assessments 3—75% of school year (end of third quarter)

• Statewide criterion-referenced assessment
