Large-scale testing: Uses and abuses
Richard P. Phelps
Universidad Finis Terrae, Santiago, Chile
January 7, 2014
1. 3 types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected
1. Three types of large-scale tests
- Achievement
- Aptitude
- Non-cognitive
Achievement tests
Historically, they were larger versions of classroom tests
~ 1900 – “scientific” achievement tests developed (Germany & USA)
J.M. Rice – systematically analyzed test structures & effects
E.L. Thorndike – developed scoring scales
SOURCE: Phelps, Standardized Testing Primer, 2007
Achievement tests
Purpose: to measure how much you know and can recall
Developed using: content coverage analysis
How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades)
Requires mastery of the content prior to the test.
Fairness assumes that all have the same opportunity to learn the content
Coachable – specific content is known in advance
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
1890s – A. Binet & T. Simon (France)
- Pre-school children with mental disabilities
- achievement test not possible
- developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning)
1917 – Adapted by U.S. Army to select, assign soldiers in World War I
1930s – Harvard University president J. Conant
- wanted new admission test to identify students from lower social classes with the potential to succeed at Harvard
- developed the first Scholastic Aptitude Test (SAT)
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
Purpose: to predict how much can be learned
Developed using: skills/job analysis
How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)
Content independent. Measures:
- what student does with content provided
- how student applies skills & abilities developed over a lifetime
Not easily coachable – the content is either:
- not known in advance,
- basic, broad, commonly known by all, curriculum-free;
- less dependent on the quality of schools
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
Aptitude tests can identify:
- Students bored in school who study what interests them on their own
- Students not well adapted to high school, but well adapted to university
- Students of high ability stuck in poor schools
SOURCE: Phelps, Standardized Testing Primer, 2007
Comparing Achievement & Aptitude tests

| | Achievement | Aptitude |
|---|---|---|
| Measure | past learning | potential |
| Development | content analysis | job/skills analysis |
| Validation | retrospective | predictive |
| Content | dependent | independent |
| Coachable? | very much | not much |
Non-cognitive tests
More recently developed – measure values, attitudes, preferences
Types:
- integrity tests
- career exploration
- matchmaking
- employment “fit”
Non-cognitive tests
Purpose: to identify “fit” with others or a situation
Developed using: surveys, personal interviews
How validated? success rate in future activities
Content is personal, not learned
“Faking” can be an issue (e.g., “honesty” tests)
Comparing Achievement, Aptitude, & Non-Cognitive Tests

| | Achievement | Aptitude | Non-Cognitive |
|---|---|---|---|
| Measure | past learning | potential | attitudes, values, preferences |
| Development | content analysis | job/skills analysis | surveys |
| Validation | retrospective | predictive | predictive |
| Content | dependent | independent | independent |
| Coachable? | very much | very little | can be faked |
2. Measuring test quality
3 measures are important:
1. Predictive validity
2. Content coverage
3. Sub-group differences
Test reports can be “data dumps”
Predictive validity (values from -1.0 to +1.0)
Measures how well higher scores on an admission test correspond to better outcomes at university (e.g., grades, completion)
A test with low predictive validity provides little information.
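As a rough illustration of how such a value might be computed, the sketch below treats predictive validity as the Pearson correlation between admission test scores and a later outcome. This is an assumption about the measure, and the function name `pearson_r` and the score and grade lists are hypothetical, invented only for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data (not from the presentation): one entry per student
admission_scores = [450, 520, 580, 610, 700, 730]
first_year_grades = [4.1, 4.0, 4.8, 5.2, 5.5, 6.3]

validity = pearson_r(admission_scores, first_year_grades)
print(f"predictive validity: {validity:.2f}")
```

A value near +1.0 would mean that higher test scores go closely with better later outcomes. A value near 0 would mean the test says almost nothing about them.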
Source: NIST, Engineering Statistics Handbook
A positive correlation between two measures
Source: NIST, Engineering Statistics Handbook
A negative correlation between two measures
Source: NIST, Engineering Statistics Handbook
No correlation between two measures
How does one measure predictive capacity?
Correlation coefficient: a scale running from -1 through 0 to +1
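The slides do not say which coefficient is meant. Assuming it is the usual Pearson product-moment correlation, it could be written as follows, where the symbols are introduced here only for illustration: $x_i$ is the admission test score of student $i$, $y_i$ is that student's later outcome, and $\bar{x}$, $\bar{y}$ are the means over all $n$ students.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad -1 \le r \le 1
$$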
Predictive validities: SAT and PSU
[Chart: series SAT and PSU 2010; vertical axis from 0 to 0.6]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
Predictive validities: SAT and PSU (faculty: Administracion)
[Chart: series SAT and PSU Administracion; categories Language, Mathematics, SAT Writing / PSU Social Science; vertical axis from 0 to 0.6]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013