
Large-scale testing: Uses and abuses

Richard P. Phelps

Universidad Finis Terrae, Santiago, Chile

January 7, 2014


1. Three types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected

1. Three types of large-scale tests

- Achievement
- Aptitude
- Non-cognitive

Achievement tests

Historically, achievement tests were larger versions of classroom tests.

~ 1900 - “scientific” achievement tests developed (Germany & USA)

SOURCE: Phelps, Standardized Testing Primer, 2007

J.M. Rice - systematically analyzed test structures & effects

E.L. Thorndike - developed scoring scales

Achievement tests

Purpose: to measure how much you know and can recall

Developed using: content coverage analysis

How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades)

Requires mastery of the content before the test.

Fairness assumes that all test takers have had the same opportunity to learn the content.

Coachable – specific content is known in advance


Aptitude tests

1890s – A. Binet & T. Simon (France)

- Pre-school children with mental disabilities
- Achievement test not possible
- Developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning)

1917 – Adapted by U.S. Army to select and assign soldiers in World War 1

1930s – Harvard University president J. Conant

- Wanted a new admission test to identify students from lower social classes with the potential to succeed at Harvard
- Developed the first Scholastic Aptitude Test (SAT)

Aptitude tests

Purpose: to predict how much can be learned

Developed using: skills/job analysis

How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)

Content independent. Measures:

- what the student does with the content provided
- how the student applies skills & abilities developed over a lifetime

Not easily coachable. The content is either:

- not known in advance, or
- basic, broad, commonly known by all, curriculum-free;
- less dependent on the quality of schools


Aptitude tests

Aptitude tests can identify:

- Students bored in school who study what interests them on their own

- Students not well adapted to high school, but well adapted to university

- Students of high ability stuck in poor schools


Comparing Achievement & Aptitude tests

              Achievement        Aptitude
Measure       past learning      potential
Development   content analysis   job/skills analysis
Validation    retrospective      predictive
Content       dependent          independent
Coachable?    very much          not much

Non-cognitive tests

More recently developed – measure values, attitudes, preferences

Types:

- integrity tests
- career exploration
- matchmaking
- employment “fit”

Non-cognitive tests

Purpose: to identify “fit” with others or a situation

Developed using: surveys, personal interviews

How validated? success rate in future activities

Content is personal, not learned

“Faking” can be an issue (e.g., “honesty” tests)

Comparing Achievement, Aptitude, & Non-Cognitive Tests

              Achievement        Aptitude              Non-Cognitive
Measure       past learning      potential             attitudes, values, preferences
Development   content analysis   job/skills analysis   surveys
Validation    retrospective      predictive            predictive
Content       dependent          independent           independent
Coachable?    very much          very little           can be faked

2. Measuring test quality

Three measures are important:

1. Predictive validity
2. Content coverage
3. Sub-group differences

Test reports can be “data dumps”

Predictive validity (values from -1.0 to +1.0)

Measures how well higher scores on an admission test match better outcomes at university (e.g., grades, completion).

A test with low predictive validity provides little information.

[Figures: three scatter plots showing a positive correlation between two measures, a negative correlation between two measures, and no correlation between two measures]

Source: NIST, Engineering Statistics Handbook

How does one measure predictive capacity?

Correlation coefficient:

|----------------------|----------------------|
-1                     0                     +1
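To make the correlation coefficient concrete, here is a minimal sketch in Python. It assumes the Pearson correlation is the coefficient meant (the slides do not say which one is used), and the function name `pearson_r` and the sample scores are hypothetical, invented only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption: this is the coefficient behind "predictive validity";
    the slides do not name a specific formula.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each measure
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a measure has no variation")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: admission test scores and later first-year grades
admission_scores = [480, 520, 560, 610, 650, 700]
first_year_grades = [4.1, 4.8, 4.5, 5.2, 5.0, 5.9]

print(round(pearson_r(admission_scores, first_year_grades), 2))
```

A result near +1 means higher admission scores go with better later outcomes, a result near 0 means the test says little about them, and a result near -1 means higher scores go with worse outcomes.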

Predictive validities: SAT and PSU

[Figure: bar chart comparing the SAT and the PSU 2010; vertical axis from 0 to 0.6; categories: Language, Mathematics, SAT Writing, PSU Social Science]

SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013

Predictive validities: SAT and PSU (faculty: Administracion)

[Figure: bar chart comparing the SAT and the PSU for the Administracion faculty; vertical axis from 0 to 0.6]

SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
