
PRINCIPLES AND APPLICATIONS OF SPECIAL EDUCATION ASSESSMENT

CLASS 3: DESCRIPTIVE STATISTICS & RELIABILITY AND VALIDITY

FEBRUARY 2, 2015

OBJECTIVES

• Define basic terminology used in assessment, such as validity, reliability, standard deviation, etc.
• Understand how to evaluate the technical adequacy of tests, including the norms, reliability, and validity.
• Interpret information from formal and informal assessments.
• Describe the function of standardized assessment in the eligibility process.

TONIGHT’S SCHEDULE

4:30 – 4:45  Group Presentations – Utah SPED Rules
4:45 – 5:15  Problem-Solving Teams – Case Studies
5:15 – 6:00  Descriptive Statistics
6:00 – 6:15  Break
6:15 – 7:00  Reliability & Validity
7:00 – 7:20  Graduate Students – Annotated Bibliography

REVIEW

§ Children cannot be determined to have a disability because of what?
§ Describe each of the following:
  § RIOT/ICEL
  § RTI and its relationship to the medical model
  § LRE
  § Components of an IEP

CRITERION OR NORM-REFERENCED?

• WISC-IV (Intelligence Test)
• History test
• Correct words on a spelling test
• Woodcock-Johnson Achievement Test III
• Driving test
• Number of steps correctly performed in a dressing routine

WHY IS MEASUREMENT IMPORTANT?

Standardized assessment is heavily applied in the educational decision-making process.

Educators must understand:
§ Test-selection criteria
§ Basic principles of measurement
§ Administration techniques
§ Scoring procedures

CONCERNS IN THE FIELD

A high priority is placed on assessment, yet professionals make mistakes:
• Identifying students based upon referral information rather than testing.
• Presenting data that play little role in planning.
• Choosing poor-quality instruments.
• Taking recommendations at face value.
• Using quick assessments even when they do not address the areas of concern.
• Failing to establish effective rapport with the examinee.
• Failing to document behaviors during the examination that may be of diagnostic value.
• Failing to adhere to the administration rules.
• Making scoring errors.
• Interpreting assessment results ineffectively for educational use.

NUMERICAL SCALES

Nominal Scale
§ Used for identification purposes only; the numbers function like a name (e.g., an ID number)
§ Numbers cannot be used in mathematical operations
§ Least useful scale

Ordinal Scale
§ Used to rank the order of items
§ Numbers have the quality of identification and indicate greater or lesser quality (e.g., first place, second place, etc.)
§ Numbers are not equidistant (i.e., the distance between first and second place and between second and third place is not necessarily the same)

Ratio Scale
§ Used for direct comparisons and mathematical manipulations
§ Numbers are equidistant from each other
§ The scale has a true, absolute zero
§ Can be used in all mathematical operations (e.g., counts of behaviors, income, height, weight, etc.)

Interval Scale
§ Used for identification and for ranking greater or lesser quality or amount
§ Numbers are equidistant (e.g., degrees on a thermometer, IQ scores, rating scales)
§ Most data in education will be interval scale data
§ Does not have an absolute zero
§ Numbers cannot be used in other mathematical operations (e.g., multiplication)

RAW SCORES

• Scores an individual receives when individual items on a test are summed.
• To compute, subtract the number of items the student missed from the number of items presented.
• Raw scores convey very little meaning unless referenced to some standard.
• All other scores, called derived scores, are “derived” from the raw score.
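As a minimal sketch of that computation (the counts are hypothetical):

```python
def raw_score(items_presented: int, items_missed: int) -> int:
    """Raw score = number of items presented minus the number missed."""
    return items_presented - items_missed

# Hypothetical test: 30 items presented, 7 missed.
print(raw_score(30, 7))  # 23 -- conveys little until referenced to some standard
```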

DESCRIPTIVE STATISTICS

Large sets of data are organized and understood through methods known as descriptive statistics. Derived scores obtain their meaning from large sets of data or large samples of scores.

Scores derived from the raw score include:
§ Percentile rank
§ Standard score
§ Grade equivalent
§ Age equivalent

MEASURES OF CENTRAL TENDENCY

• Measures of central tendency are a way to organize data to see how the data cluster, or are distributed, around a numerical representation of the average score.
§ Use caution with this technique if your scores are widely scattered.
• A normal distribution represents the way test scores would fall if a test were given to every single student of the same age or grade in the population.

In a normal distribution:
§ Most students’ scores fall in the middle of the curve.
§ Fewer students’ scores fall at the edges of the curve.
§ The distribution is symmetric, or equal, on either side of the vertical line at the mean.

It is important to know how students performed as a group and what constitutes excellent, average, and poor performance.

AVERAGE PERFORMANCE

Frequency Distribution
§ Rank scores from highest to lowest.
§ Tally how many times each score was obtained.

Mode
§ The score that occurs the greatest number of times.

Bimodal Distribution
§ A distribution with two modes.

Multimodal Distribution
§ A distribution with three or more modes.

Frequency Polygon
§ A graph that represents a data set.

MODE & FREQUENCY POLYGON

[Figure: frequency polygon of a score distribution, with the mode at the peak.]
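A minimal sketch of finding the mode(s) from a frequency tally (the score set is hypothetical, chosen to be bimodal):

```python
from collections import Counter

# Hypothetical score set with two modes (a bimodal distribution)
scores = [70, 75, 75, 80, 80, 85, 90]

counts = Counter(scores)                 # frequency distribution: tally of each score
top = max(counts.values())               # the highest frequency
modes = [s for s, c in counts.items() if c == top]
print(modes)                             # [75, 80] -> bimodal
```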

MEDIAN

• Found by rank ordering the data set, writing each score the number of times it occurs.
• Count halfway down the list of scores; 50% of the scores are listed above the median and 50% are below.
• In a data set with an even number of scores, the median score may not actually exist in the data set.

MEDIAN EXAMPLES

Example 1 (11 ranked scores): 100, 97, 89, 85, 85, 78, 69, 69, 68, 62, 60
Median score = 78

Example 2 (11 ranked scores): 100, 96, 95, 90, 85, 84, 83, 82, 80, 78, 77
Median score = 84
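Both examples check out with Python’s statistics module, and the even-count case shows a median that is not in the data set:

```python
from statistics import median

example_1 = [100, 97, 89, 85, 85, 78, 69, 69, 68, 62, 60]
example_2 = [100, 96, 95, 90, 85, 84, 83, 82, 80, 78, 77]

print(median(example_1))  # 78 -- the middle (6th) of the 11 ranked scores
print(median(example_2))  # 84

# Even number of scores: the median may not exist in the data set.
print(median([80, 90]))   # 85.0
```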

MEAN

• One of the best measures of average performance is the mean.
• The mean is found by calculating a simple average.
• The mean can be affected by extreme scores, especially if the group is composed of only a few students.
§ This can be controlled by eliminating extreme scores (i.e., outliers).

Example:
Data set: 90, 80, 75, 60, 70, 65, 80, 100, 80, 80
780 ÷ 10 = 78
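The same calculation in Python, plus a hypothetical outlier to show how a single extreme score pulls the mean:

```python
from statistics import mean

data = [90, 80, 75, 60, 70, 65, 80, 100, 80, 80]
print(mean(data))                    # 78  (780 / 10)

# An extreme score pulls the mean, especially in a small group.
with_outlier = data + [5]            # hypothetical extreme low score
print(round(mean(with_outlier), 1))  # 71.4
```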

MEASURES OF DISPERSION

• Measures of dispersion describe how scores are spread around the mean.
• Variability is the way that scores in a set of data are spread apart.
• Range
§ Provides an idea of the spread.
§ Calculated by subtracting the lowest score from the highest score.
§ Example: top score = 100, lowest score = 45; range = 100 – 45 = 55.

VARIANCE

• Data are described as having variance.
• Variance is the degree or amount of variability or dispersion in a set of scores: the dispersion of the scores around the mean.
• Applicable for equal-interval and ratio data, not nominal or ordinal data.

STANDARD DEVIATION

• Standard deviation represents a typical unit of distance above or below the mean (e.g., above or below a mean score of 100 on many standardized tests).
• Standard deviation is one method of calculating the difference in scores, or variability of scores, known as dispersion.
• You must calculate variance before you can calculate standard deviation: Standard Deviation = √variance.
• Any test score that is 1 standard deviation above or below the mean score is considered significant.
• Applicable for equal-interval and ratio data, not nominal or ordinal data.
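A sketch using the data set from the mean example, treating it as a whole population (an assumption; sample statistics would divide by n − 1 instead):

```python
from statistics import pvariance, pstdev

scores = [90, 80, 75, 60, 70, 65, 80, 100, 80, 80]  # same data set as the mean slide

var = pvariance(scores)  # population variance: mean squared distance from the mean
sd = pstdev(scores)      # standard deviation = sqrt(variance)
print(var, sd)           # variance = 121, SD = sqrt(121) = 11
```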

STANDARD DEVIATION & NORMAL DISTRIBUTION

• In a normal distribution, the standard deviations mark off the percentages of scores shown on the bell curve.
• More than 68% of the scores fall within one standard deviation above or below the mean.
• Two standard deviations below the mean = Intellectual Disability
• Two standard deviations above the mean = Gifted
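A quick check of those percentages, assuming the common mean = 100, SD = 15 convention used by many standardized tests (the convention is an assumption, not stated on the slide):

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)        # assumed norming convention

within_1_sd = nd.cdf(115) - nd.cdf(85)   # proportion within +/- 1 SD
within_2_sd = nd.cdf(130) - nd.cdf(70)   # proportion within +/- 2 SD
print(round(within_1_sd, 4))             # 0.6827 -> just over 68%
print(round(within_2_sd, 4))             # 0.9545

# Cut points two SDs from the mean under this convention:
print(100 - 2 * 15, 100 + 2 * 15)        # 70 130
```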

SKEWED DISTRIBUTIONS

• When small samples or very restricted populations are used, test results may not distribute into a normal curve.
• Extreme scores can change the appearance of a set of scores and subsequently influence the way the data are described.
• Distributions can be skewed in a positive or negative direction.

§ Negatively skewed: a large number of scores occur above the mean.
§ Positively skewed: a large number of scores occur below the mean.

TYPES OF SCORES

Percentile Rank
§ Ranks each score on the continuum of the normal distribution.
§ Percentiles range from <1 to 99.9, with 50 being average.
§ A person who scores at the 75th percentile scored as well as or better than 75% of the students in that age/grade group.

For example: Jalen obtained a percentile rank of 42. This means that Jalen performed as well as or better than 42% of children his age on the test. Put another way, 42% of children Jalen’s age scored at or below Jalen’s score.
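The slide does not give Jalen’s underlying standard score; as a sketch, assuming the mean = 100, SD = 15 convention, a hypothetical standard score of 97 lands at about the 42nd percentile:

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)  # assumed norming convention

def percentile_rank(standard_score: float) -> float:
    """Percentage of the norm group scoring at or below this score."""
    return nd.cdf(standard_score) * 100

print(round(percentile_rank(100)))  # 50 -- the average
print(round(percentile_rank(97)))   # 42 -- a rank like Jalen's
```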

Descriptors for Percentile Ranges

Percentile Range       Descriptor
98th %ile and above    Upper Extreme
91st to 97th %ile      Well Above Average
75th to 90th %ile      Above Average
25th to 74th %ile      Average
9th to 24th %ile       Below Average
3rd to 8th %ile        Well Below Average
2nd %ile and below     Lower Extreme
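The table translates directly into a small lookup function (a sketch using the band boundaries above):

```python
def percentile_descriptor(pr: float) -> str:
    """Map a percentile rank to the descriptor ranges in the table above."""
    if pr >= 98: return "Upper Extreme"
    if pr >= 91: return "Well Above Average"
    if pr >= 75: return "Above Average"
    if pr >= 25: return "Average"
    if pr >= 9:  return "Below Average"
    if pr >= 3:  return "Well Below Average"
    return "Lower Extreme"

print(percentile_descriptor(42))  # Average -- Jalen's rank from the earlier example
print(percentile_descriptor(3))   # Well Below Average
```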

TYPES OF SCORES (CONTINUED)

T Scores
§ Have a mean of 50 and a standard deviation of 10.

Stanines
§ Scores are divided into 9 groups, with 5 being the mean and 2 being the standard deviation.

Deciles
§ Scores are divided into 10 groups, labeled 10 for the lowest group up to 100 for the highest.
§ Each group represents 10% of the obtained scores.

STANDARD SCORES

Standard scores are scores of relative standing with a set, fixed, predetermined mean and standard deviation.
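A minimal sketch of these conversions; the z-score helper is an added intermediate step (number of SDs from the mean), not named on the slides:

```python
def to_z(raw: float, mean: float, sd: float) -> float:
    """z score: how many standard deviations a score sits from the mean."""
    return (raw - mean) / sd

def to_t(z: float) -> float:
    """T score: fixed mean 50, SD 10."""
    return 50 + 10 * z

def to_stanine(z: float) -> int:
    """Stanine: mean 5, SD 2, clamped to the nine bands."""
    return max(1, min(9, round(5 + 2 * z)))

z = to_z(115, mean=100, sd=15)    # hypothetical score one SD above the mean
print(z, to_t(z), to_stanine(z))  # 1.0 60.0 7
```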

CHOICE OF TEST SCORES

Percentile Ranks
§ Preferable over age and grade equivalents.
§ Are considered “comparable scores.”
§ Straightforward indicators of an individual’s standing within a group.
§ Reported as a reference to the student’s standing relative to the group upon which the test was normed.

Standard Scores
§ Advantages:
§ Comparative.
§ Based upon a normal or normalized distribution of scores (bell curve).
§ Can be directly translated into percentile ranks.
§ Because of a uniform mean (bell curve), they can be compared from one subtest to the next and from one test administration to another.

CHOICE OF TEST SCORES (CONTINUED)

Age and Grade Equivalents
§ Appear to be the simplest scores but, in fact, can be the most misinterpreted.
§ Major limitations:
§ Do not provide information about whether a student’s performance is within average limits.
§ Do not describe a student’s current instructional level.
§ Do not indicate which test questions the student answered correctly.
§ A word of caution: findings should be reported and worded carefully to prevent misinterpretation.


GRADE EQUIVALENTS ARE OBTAINED FROM MEAN OR MEDIAN SCORES BY GRADE.

MISLEADING: FALSE IMPRESSION OF PROGRESS

Grade Placement   Grade Equivalent   Years Below   Percentile Rank
      2                 1.9              0.1            25th
      3                 2.4              0.6            25th
      4                 3.1              0.9            25th
      5                 3.9              1.1            25th
      6                 4.5              1.5            25th
      7                 5.3              1.7            25th

AGE AND GRADE EQUIVALENTS

What are the scores based on? Why is this a problem?

[Slide shows a 12-item math worksheet scored two ways:]
§ Chelyn gets only the even-numbered questions correct: raw score = 6
§ Lou gets only the odd-numbered questions correct: raw score = 6

DO THESE STUDENTS HAVE THE SAME SKILLS?

AGE & GRADE EQUIVALENTS

Two years below grade level has different meanings at different grades:
§ Kurt is at the 12.5 grade level and obtained a grade equivalent of 10.5 on the Reading Recognition subtest of the PIAT.
§ Mason is at the 3.5 grade level and obtained a grade equivalent of 1.5 on the same test.

Is their performance the same? Who performed better?
§ Kurt obtained a standard score of 93 (33rd percentile).
§ Mason obtained a standard score of 72 (3rd percentile).

GRADE EQUIVALENTS MEAN DIFFERENT THINGS ON DIFFERENT TESTS

• Billy, grade placement 7.5, obtained a grade equivalent of 5.5 on the WRMT.
• Bobby, grade placement 7.5, obtained a grade equivalent of 5.5 on the Reading subtest of the WRAT.

Is their performance the same? Who performed better?
§ Billy performed at the 18th percentile.
§ Bobby performed at the 34th percentile.

• At the same point on the scale and the same age level, identical grade equivalents mean different things on different tests.

RELIABILITY & VALIDITY

RELIABILITY & VALIDITY

These concepts aid in determining test accuracy and dependability:
§ Reliability: the dependability or consistency of an instrument across time or items.
§ Validity: the degree to which an instrument measures what it was designed to measure.

Instruments should have both properties; an instrument that has only one is not a strong instrument.

CORRELATION (r)

Correlation: the degree of relationship between two variables, for example between:
§ Two administrations of the same test
§ Administration of equivalent forms

The correlation coefficient ranges from +1.00 to –1.00:
§ Perfect positive correlation = +1.00
§ Perfect negative correlation = –1.00
§ No correlation = 0
§ Coefficients closer to ±1.00 represent stronger relationships.
§ The greater the degree of the relationship, the more reliable the instrument.
§ The + sign does not indicate strength, only direction.

SCATTERGRAM

Scattergrams provide a graphic representation of a data set and show a correlation. The more closely the dots on a scattergram approximate a straight line, the nearer to perfect the correlation.
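As a sketch with hypothetical scores from two administrations of the same test (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation

# Hypothetical scores from two administrations of the same test
first = [85, 90, 78, 92, 70, 88]
second = [83, 93, 80, 90, 72, 85]

r = correlation(first, second)  # Pearson's r, between -1.00 and +1.00
print(round(r, 2))              # 0.95 -- a strong positive relationship
```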

TYPES OF CORRELATION

Positive Correlation
§ Variables with a positive relationship move in the same direction: scores on both variables increase simultaneously.

Negative Correlation
§ High scores on one variable are associated with low scores on the other variable.

No Correlation
§ Data from the two variables are not associated and have no relationship.
§ No linear direction appears on a scattergram.

RELATIONSHIP BETWEEN RELIABILITY & VALIDITY

Suppose I have a faulty measuring tape and I use it to measure each student’s height. My tool is invalid, but it is still reliable: it gives the same (wrong) answer every time. On the other hand, if I have a correctly printed measuring tape, my tool is both valid and reliable.

RELIABILITY

Another way to think of reliability is to imagine a kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for the potatoes an hour later.

VALIDITY

Let’s imagine a bathroom scale that consistently tells you that you weigh 130 pounds. The reliability (consistency) of this scale is very good, but it is not accurate (valid) because you actually weigh 150 pounds (perhaps you reset the scale in a weak moment).

RELIABILITY CHECKS

§ Test-Retest (Stability)
§ Equivalent Forms
§ Inter-Rater (Agreement)

TEST-RETEST RELIABILITY

Test-retest reliability assumes the trait being measured is one that is stable over time. If the trait remains constant, re-administration of the instrument will result in scores similar to the first scores.
§ It is important to conduct the retest shortly after the first test to control for influencing variables.

Difficulties:
§ Too soon: students may remember test items (practice effect) and score higher the second time.
§ Too long: greater influence of time variables (e.g., learning, maturation, etc.).

EQUIVALENT (ALTERNATE) FORMS RELIABILITY

Equivalent forms reliability:
§ Two forms of the same instrument are used.
§ Items are matched for difficulty.

Advantage: two tests of the same difficulty level can be administered within a short time frame without the influence of practice effects.

INTERRATER RELIABILITY

Interrater reliability:
§ The consistency of a test across examiners.
§ One person administers and scores a test; a second person rescores it.
§ The scores are then correlated to determine how much variability exists between the scores.
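A minimal sketch of that correlation step with two hypothetical scorers (Python 3.10+ for statistics.correlation):

```python
from statistics import correlation

# Hypothetical totals assigned by two independent scorers to six students
scorer_a = [12, 18, 15, 20, 9, 16]
scorer_b = [13, 18, 14, 20, 10, 16]

r = correlation(scorer_a, scorer_b)
print(round(r, 2))  # 0.99 -- little variability between scorers, high interrater reliability
```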

ASSUMPTIONS OF TESTING

1. People involved are skilled
2. Error is always present
3. Acculturation is comparable
4. Behavior sample is adequate
5. Present behavior is observed

1. PEOPLE ARE SKILLED

• In administering the test, including establishing rapport
• In scoring the test
• In interpreting the results
• In utilizing the results

2. ERROR IS ALWAYS PRESENT

Obtained Score = True Score + Error

Random error is unreliability (e.g., lack of familiarity with tests, examiner fatigue, etc.). Do not make decisions based on error.
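A toy simulation of the formula above: repeated administrations of the same test scatter around the true score because of random error (the true score of 100 and the 5-point error SD are arbitrary assumptions):

```python
import random

random.seed(1)            # fixed seed so the sketch is repeatable

TRUE_SCORE = 100          # hypothetical true score

def administer() -> float:
    """Obtained score = true score + random error (unreliability)."""
    return TRUE_SCORE + random.gauss(0, 5)  # assumed error SD of 5 points

obtained = [round(administer(), 1) for _ in range(5)]
print(obtained)           # five obtained scores, none exactly equal to the true score
```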

3. ACCULTURATION IS COMPARABLE

The comparison group should have a comparable:
§ Experiential background (e.g., a test item asking how to get out of a forest may be unfair to an inner-city child).
§ Opportunity to learn (e.g., books available in the child’s home).

4. BEHAVIOR SAMPLE IS ADEQUATE

All tests are only samples of behavior: the behaviors sampled on a test are drawn from the larger domain of behaviors of interest.

5. PRESENT BEHAVIOR IS OBSERVED

Tests can only inform us directly about present behavior; future behavior is inferred.

TEST VALIDITY

Does the test actually measure what it is supposed to measure?
§ Criterion-related validity: comparing scores with other criteria known to be indicators of the same trait or skill.
§ Concurrent validity: two tests are given within a very short timeframe (often the same day); if scores are similar, the tests are said to be measuring the same trait.
§ Predictive validity: measures how well an instrument can predict performance on some other variable (e.g., ACT or GRE scores).

CONTENT VALIDITY

Content validity means ensuring that the items in a test are representative of the content the test purports to measure.
§ PROBLEM: Teachers often generalize and assume the test covers more than it does (e.g., the WRAT-3 reading subtest measures only word recognition, not phonemic awareness, phonics, vocabulary, reading comprehension, etc.).

Some variables of content validity may influence the manner in which results are obtained and can contribute to bias in testing:
§ Presentation format: the method by which items are presented to the student.
§ Response mode: the method by which the examinee answers items.

VALIDITY OF TEST vs. VALIDITY OF USE

§ Tests may be used inappropriately even though they are valid instruments.
§ Results obtained may be used in an invalid manner.
§ Tests may be biased and/or discriminate against different groups.
§ Item bias: an item is answered incorrectly a disproportionate number of times by one group compared to another.
§ Predictive validity may predict accurately for one group and not another.

NEXT WEEK

• Read Chapter 5
• Submit Online Self-Assessment

SOURCES

§ Overton, T. (2012). Assessing learners with special needs (7th ed.). Upper Saddle River, NJ: Pearson Education.
