  • Measurement in Psychology I: RELIABILITY

    Lawrence R. Gordon

  • Do you support the civil union legislation?

    What are some of the ways in which you can ask this question?
    How do you measure the response (operational definitions)?

  • Levels of Measurement

    Nominal scales: giving names to data, putting into categories
    Examples: sex, race labels; baseball uniform numbers

    Ordinal scales: numbers give order but not distance
    Examples: mailbox numbers; class rankings

  • Levels of Measurement (cont.)

    Interval scales: numbers indicate order and distance (they are separated by equal distances or intervals)
    Example: Fahrenheit temperature

    Ratio scales: numbers indicate order, distance, AND have a true zero point (zero = there isn't any)
    Examples: height; weight; miles per hour; time
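    One way to see the difference between interval and ratio scales is to change units and check whether a ratio survives. The sketch below is illustrative only: the temperatures (80 and 40 degrees Fahrenheit) and weights (200 and 100 lbs.) are assumed example values, and the helper function names are made up for this example.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert an interval-scale temperature from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def pounds_to_kilograms(lbs: float) -> float:
    """Convert a ratio-scale weight from pounds to kilograms."""
    return lbs * 0.45359237


# Interval scale: the "ratio" of two temperatures depends on the unit,
# because zero degrees does not mean "no temperature".
ratio_f = 80.0 / 40.0
ratio_c = fahrenheit_to_celsius(80.0) / fahrenheit_to_celsius(40.0)

# Ratio scale: zero pounds really is "no weight", so the ratio of two
# weights is the same in any unit.
ratio_lb = 200.0 / 100.0
ratio_kg = pounds_to_kilograms(200.0) / pounds_to_kilograms(100.0)

print(f"80 F / 40 F = {ratio_f:.2f}; same temperatures in Celsius = {ratio_c:.2f}")
print(f"200 lb / 100 lb = {ratio_lb:.2f}; same weights in kg = {ratio_kg:.2f}")
```

    The temperature ratio changes with the unit (2 in Fahrenheit, 6 in Celsius), so "twice as hot" is not a meaningful statement on an interval scale. The weight ratio stays at 2 in either unit.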

  • Levels of Measurement Example: Auto race which started at 2 pm

    [Table: Finish Order | Elapsed Time]

  • Closed vs. Open Responses

    Closed responses (a.k.a. forced choice)
    Examples (rate civil union support on a scale of 1 to 9)

    Advantages
      you know what the responses will be (or what they should be!) because of restrictions on choice
      easy to empirically evaluate (relatively)
      gives data that answer your question in a straightforward way
      coding not necessary, usually

  • Closed vs. Open Responses

    Closed responses (a.k.a. forced choice)

    Disadvantages
      may not be sensitive enough to get some interesting information
      will not give you as clear an indication of what participants think/feel/report

    Do you agree that same-sex couples should have the right to marry/civil union?

      1     2     3     4     5     6     7     8     9
      Disagree                                    Agree
      Completely                             Completely

  • Closed vs. Open Responses

    Open responses (a.k.a. free response)
    Examples (Do you support the civil union legislation? Why?)
    Example from the survey used the first day?
      Please describe yourself in 12 words or less
      more on this in a bit...

    Advantages
      gives any answer participant wants
      not restricted by choices

  • Closed vs. Open Responses

    Open responses (cont.)

    Disadvantages
      have to code to empirically evaluate (time intensive, need to find people who will do it)
      reliability issues!

  • Reliability

    Consistency (stays the same)
    Repeatable (get the same results again and again)
    Measures need to be reliable to be good measures
    Now, some nitty-gritty...

  • Reliability (cont.)

    Measuring closed responses
      you don't need to put things into categories
      reliable over time (do you get the same answers again and again?)
      if the answers vary greatly from one time of measurement to the next, the measurement is not reliable

  • Reliability (cont.)

    Measuring closed responses (cont.)
      scales (sets of questions designed to measure something) need to be given multiple times, or in multiple forms, and the answers must remain similar for the scale to be reliable
      Example (personality scale?)

    Types of reliability
      Stability (test-retest reliability)
      Equivalence (parallel forms reliability)
      Consistency (split-half reliability)
      Homogeneity (internal consistency reliability)

  • Reliability Quick Example

    Any test, scale, or inventory with items, e.g., a 50-item test, scored 0-50:

                   Form A          9/4               9/4, Form A
    Examinee       9/4    9/25     Form A   Form B   Odd    Even
    1 George       27     35       27       33       15     12
    2 Alice        49     46       49       40       30     19
    3 Mary         30     35       30       27       13     17
    4 Larry        16     10       16       19        7      9
    5 Linda        27     24       27       20       10     17
    6 Doug         40     42       40       48       22     18
    7 Chuck        21     18       21       35       10     11
    8 Judy         42     39       42       35       19     23

    Method             Comparison                         r      Type
    Test-retest        Form A, 9/4 vs 9/25                .92    Stability
    Parallel forms     Form A vs Form B, 9/4              .69    Equivalence
    Cross form         Form A 9/25 vs Form B 9/4          .72    Stability & Equivalence
    Split-half         Odd vs Even, Form A 9/4            .79    Consistency
    Alpha reliability  no example (needs all 50 items)           Internal consistency
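    The coefficients for this example can be checked from the eight examinees' scores. The sketch below is one way to do it, not necessarily how the slide's values were computed. It uses a hand-written Pearson correlation (the function and variable names are made up here). For the split-half value it applies the Spearman-Brown step-up to the odd-even correlation. The slide does not name that correction, so treat it as an assumption, although it does reproduce the reported .79.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Scores in examinee order: George, Alice, Mary, Larry, Linda, Doug, Chuck, Judy
form_a_sep4 = [27, 49, 30, 16, 27, 40, 21, 42]
form_a_sep25 = [35, 46, 35, 10, 24, 42, 18, 39]
form_b_sep4 = [33, 40, 27, 19, 20, 48, 35, 35]
odd_items = [15, 30, 13, 7, 10, 22, 10, 19]    # Form A, 9/4, odd-numbered items
even_items = [12, 19, 17, 9, 17, 18, 11, 23]   # Form A, 9/4, even-numbered items

test_retest = pearson_r(form_a_sep4, form_a_sep25)    # stability
parallel_forms = pearson_r(form_a_sep4, form_b_sep4)  # equivalence
cross_form = pearson_r(form_a_sep25, form_b_sep4)     # stability & equivalence

# Odd vs. even correlates two half-length tests. The Spearman-Brown formula
# (an assumption here, not named on the slide) estimates the reliability of
# the full-length test from that half-test correlation.
half_r = pearson_r(odd_items, even_items)
split_half = 2 * half_r / (1 + half_r)

for name, value in [("Test-retest", test_retest),
                    ("Parallel forms", parallel_forms),
                    ("Cross form", cross_form),
                    ("Odd-even (uncorrected)", half_r),
                    ("Split-half (Spearman-Brown)", split_half)]:
    print(f"{name}: r = {value:.2f}")
```

    Run on the table's data, this gives about .92, .69, .72, and .79, matching the slide. The uncorrected odd-even correlation is about .65, which is lower because each half is only a 25-item test.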

  • Reliability (cont.)

    Measuring open responses
      Will often code into categories (Examples)
      How do you assess reliability?

  • Reliability (cont.)

    Measuring open responses (cont.)
      Does everyone put the response into the same category? If yes, you have good inter-coder reliability
      more specific operational definitions will increase this reliability

    Coding personality responses into categories
      Using positive, negative, and neutral descriptors

  • Reliability (cont.)

    Measuring behavioral responses through observation
      special cases of open response, can't really control what participants do
      coding and/or rating what you observe
      reliability of ratings (interrater reliability? If all raters agree on the rating, then yes.)
      need to be very clear on operational definitions

    Baggage claim study (Scherer & Ceschi, 2000)

  • Assessing Reliability

    Steps
      decide on operational definitions of your variables and scale(s) of measurement
      train your coders/raters, answer questions, and alleviate confusion
      do the coding and rating
      compare responses
      were the measurements reliable?

  • Reliability Exercise

    Measuring your personality
    Looking for big traits: defining big traits and training coders

    The Big Five Personality Factors
      1. Open to Experience (O) vs. Closed to Experience (NO)
      2. Conscientious (C) vs. Nonconscientious (NC)
      3. Extraverted (E) vs. Introverted (NE)
      4. Agreeable (A) vs. Unagreeable (NA)
      5. Neurotic (N) vs. Nonneurotic (NN)

    Which one best fits the description?
    Do the coding!

  • Reliability Exercise

    Measuring your personality
    Looking for big traits: compare responses to other coders

    Intercoder reliability
      List number on which you agreed
      List number on which you disagreed
      Calculate the percentages
      were the measurements reliable?
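    The agree/disagree tally in this exercise amounts to simple percent agreement between two coders. A minimal sketch follows, assuming each coder assigned one Big Five code to each of ten descriptions. The codes below are hypothetical, not class data, and the function name is made up for illustration.

```python
def percent_agreement(codes_a, codes_b):
    """Return (number agreed, number disagreed, percent agreed) for two coders."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same, non-empty set of items")
    agreed = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
    disagreed = len(codes_a) - agreed
    return agreed, disagreed, 100.0 * agreed / len(codes_a)


# Hypothetical codes from two coders for the same ten self-descriptions
coder_1 = ["E", "A", "NN", "C", "O", "NE", "A", "N", "C", "E"]
coder_2 = ["E", "A", "N", "C", "O", "NE", "NA", "N", "C", "E"]

agreed, disagreed, pct = percent_agreement(coder_1, coder_2)
print(f"Agreed on {agreed}, disagreed on {disagreed}: {pct:.0f}% agreement")
```

    With these made-up codes the coders agree on 8 of 10 descriptions, or 80%. Percent agreement is the simplest index and does not adjust for agreement that would occur by chance.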

  • And for next time: is reliability enough?

    If your measurement is reliable, does that mean that it is good?
    Does being reliable make your measurement valid?

    Get a few different ideas about how this might be done, using free response, closed response, etc.

    When measuring something, one of the decisions you have to make is what level of measurement you are operating at.

    Nominal just puts into categories. Even if a number is used, that number does not represent an amount or quantity of anything.

    Ordinal gives order. You know that one is bigger but not by how much. From the first-day survey, you ranked the characteristics you prefer in a mate. If you put intelligence first and kind-understanding second, we don't know how much more you prefer one over the other.

    Interval scales may or may not use a zero point, but if they do, the zero does not mean that there is none of something. 0 degrees Fahrenheit does not mean there is no temperature.

    Ratio does have a true zero point that allows you to calculate ratios. If you weigh 200 lbs., then you are twice as heavy as someone who weighs 100 lbs.

    Names: what's most popular in this class (6 Erins)
    Cars: Domestic / Foreign

    The participants can only choose the options that you provide. Usually. Sometimes people get creative!

    This allows you to compare answers to one question without a lot of tinkering.

    If participants must choose among your options, those options might not capture their true feelings, and they may prefer an option that you didn't provide.

    Coding means that you need to give some organization to the chaos in the data. You want to group it, put it in categories, so that you can compare it!

    The specificity in your operational definitions will help remove some of the subjectivity in your judgments.

    Exercise:
      1. People tend to be biased in that they see themselves as good, worthy human beings. They may display this bias by describing themselves in positive ways.
      2. Operationally define:
           Positive terms (+)
           Negative terms (-)
           Neutral terms (0)
         NEED A MAJORITY OF THE CLASS TO AGREE TO LABEL A TERM INTO A CATEGORY!
      3. Go through and code the responses.
      4. Add up numbers of terms in each category to make the comparison.

    When observing behavior, you need to make judgments on the fly about what category the behavior fits into.
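    The self-description coding exercise (majority rule to label each term, then a tally per category) can be sketched as follows. The terms, the class votes, and the sample description are all hypothetical, and the function name is made up for illustration. A term with no majority is simply left uncoded here, which is one possible reading of the majority rule.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the voters, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical class votes: "+" positive, "-" negative, "0" neutral
class_votes = {
    "friendly": ["+", "+", "+", "0", "+"],
    "tall": ["0", "0", "+", "0", "0"],
    "lazy": ["-", "-", "0", "-", "-"],
    "quiet": ["0", "+", "-", "0", "+"],  # no majority, so it stays uncoded
}
codebook = {term: majority_label(votes) for term, votes in class_votes.items()}

# Hypothetical self-description, coded with the agreed codebook
description = ["friendly", "tall", "lazy", "quiet", "friendly"]
tally = Counter(codebook[term] for term in description if codebook[term] is not None)

print(codebook)
print(f"positive={tally['+']}, negative={tally['-']}, neutral={tally['0']}")
```

    Comparing the positive count with the negative count across many descriptions is what would show the positivity bias described in step 1.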

    Baggage claim (PSPB, 26(3), 327-339): rated feelings of passengers who did not get their luggage. Baggage claim personnel rated them on the fly. A trained (certified!) coder rated the feelings from videotapes of the interactions using the Facial Action Coding System, an operational definition that divides the face into several action units that need to operate in various combinations to be coded as enjoyment smiles or false smiles.

    Many 109 projects use this!

    Ex: What makes a driver good?
      Good: drives within the speed limit, courteous, uses directionals
      Bad: speeds, quick lane change, no directionals, cuts off, runs red lights

    Remember that we asked you to describe yourself in 12 words or less on the firs


