Evaluation - Virginia Tech (courses.cs.vt.edu/~cs3724/fall2007-bowman/lectures/evaluation.pdf)
TRANSCRIPT
Usability Evaluation
CS 3724, Doug Bowman
Usability Evaluation
• Any analysis or empirical study of the usability of a prototype or system
• Two different goals:
  • To provide feedback in software development in support of an iterative development process
    • Recognize problems, understand underlying causes, and plan changes
  • To determine how well a system meets usability specifications or compares to a competing system
Goals of Usability Evaluation
[Diagram: a timeline running from Design to Construction]
• Formative Evaluation: What and how to re-design?
• Summative Evaluation: How well did we do?
Formative and Summative Goals
• Formative: during development, guides process
• Summative: after development, or at a checkpoint
• What sorts of “test data” aid formative goals? What about summative?
• SBD relies on mediated usability evaluation
  • Claims analysis documents design features of concern
  • Users’ performance & reactions tied to these features, establishing a usability specification
Analytic and Empirical Methods
• Analytic: theory, models, guidelines (experts)
• Empirical: observations, surveys (users)
“If you want to evaluate a tool, say an axe, you might study the design of the bit, the weight distribution, the steel alloy used, the grade of hickory in the handle, etc., or you might just study the kind and speed of the cuts it makes in the hands of a good axeman.”
Which is more expensive? Why? Which carries more weight with developers? Why?
Analytic methods
• Usability inspection
• Heuristic evaluation
• Cognitive walkthrough
• GOMS analysis
Usability Inspection
• Expert walkthrough based on usability guidelines, often working from a checklist
  • Generally want more than one expert (if affordable!)
• Guidelines (and walkthrough) can be at many levels
  • e.g., screen layout, detailed analysis of cognitive states
• May or may not use a standard set of tasks
  • Depends on how comparable you want judgments to be
• Summarize by listing problems identified in each category, also often rating them for severity
Heuristic Evaluation
• Use simple and natural language
• Speak the users’ language
• Minimize memory load
• Be consistent
• Provide feedback
• Provide clearly marked exits
• Provide shortcuts
• Provide good error messages
• Prevent errors
• Include good help and documentation
Multiple experts review with respect to these issues
Can also include different classes of stakeholders, e.g. developers, users
Cognitive Walkthrough
• Form-based inspection of system/prototype
• Careful task selection, answer questions at each step; e.g. How obvious is the next action? Must competing goals be ignored? Is knowledge assumed?
• Check-list approach attractive to practitioners
• Concerns with how to select the tasks for analysis, i.e. complexity/realism vs. evaluation cost
• In practice, often can be used in more lightweight fashion, more of a “tuning” to issues
Excerpt From Cognitive Walkthrough Form
. . .
Step [B] Choosing the Next Correct Action:
[B.1] Correct Action: Describe the action that the user should take at this step.
[B.2] Knowledge Checkpoint: If you have assumed user knowledge or experience, update the USER ASSUMPTION FORM.
[B.3] System State Checkpoint: If the system state may influence the user, update the SYSTEM STATE FORM.
[B.4] Action Availability: Is it obvious to the user that this action is a possible choice here? If not, indicate why.
[B.5] Action Identifiability:
  [B.5.a] Identifier Location, Type, Wording, and Meaning:
    ______ No identifier is provided. (Skip to subpart [B.5.d])
    Identifier type: Label / Prompt / Description / Other (Explain)
    Identifier wording: _______________________________________
    Is the identifier’s location obvious? If not, indicate why.
  [B.5.b] Link Between Identifier and Action: Is the identifier clearly linked with this action? If not, indicate why.
  [B.5.c] Link Between Identifier and Goal: Is the identifier clearly linked with an active goal? If not, indicate why.
. . .
GOMS Analysis
• Build predictive model using scientific knowledge about human memory and behavior
  • like HTA, can analyze for complexity, consistency
  • or build computational version, to estimate task times for different design alternatives (see the sketch below)
  • if successful, can provide huge benefit... why?
• Extends general techniques of HTA
  • goals, subgoals, plans, actions
  • BUT adds model elements for mental activities such as goal creation and selection, memory retrieval, etc.
Downsides of Analytic Methods
• Usability inspections are rapid, relatively cheap
  • But may miss details only seen in realistic use contexts involving real users; say little about what caused the problems, or expected impact
• Model-based approaches have good scientific foundation, are credible, can be very powerful
  • But current theories have limited scope, and developing the models takes time/expertise
Running an Analytic Evaluation (1/2)
• Overall organization
  • Who to test and how to coordinate?
  • Where to test?
  • How many to test at the same time?
  • How long should it take?
  • Do you need informed consent forms?
• Demo the prototype, concerns:
  • What functionality to demonstrate?
  • What fidelity to use?
  • How to ensure all evaluators understand the same things about the interface?
Running an Analytic Evaluation (2/2)
• Obtaining evaluation results
  • How to balance participants using each evaluation type (project groups, gender, HCI experience)?
  • What types of results will heuristics and survey each provide?
  • What level of detail do you need & how can you ensure participants will provide it?
• Analyzing results
  • How will you compare the results you obtain with each method?
  • What will be the follow-on actions?
Empirical Evaluations: Validity
• Conclusions based on actual use, BUT...
  • Are the users representative?
  • Is the test population large, diverse enough?
  • Is the test system realistic (versus early prototype)?
  • Do the tasks match what happens in real use?
  • Do the data (measures) reveal real life impacts?
• These are the general concerns of “ecological validity”, the extent to which an investigation is a genuine reflection of real-world happenings
Empirical methods
• Field studies
• Interviews, introspection
• Lab-based usability evaluation
• Controlled experiments
Field Studies
• Variants of ethnographic methods we discussed during requirements analysis
  • Observation of realistic tasks, interviews, data files, etc.
  • Avoids the problem of ecological validity
• Summarize data through content classification
  • e.g., problem categories, as in themes analysis
  • Can also sort by severity, based on observed impacts
• Field data collection and analysis time-consuming
  • Also, much of the data simply reveals successful use!
Interviews, User Introspection
• Ask users to report on their use
  • More efficient access to data of interest: critical incidents
• Can enhance by making this collaborative, a discussion among usability personnel and multiple stakeholders
• BUT, human memory is biased
  • Wanting things to make sense
  • Assuming things work as they always have
Usability Evaluation in the Lab
• Carefully selected set of representative tasks
  • e.g., based on task analysis of the system, design goals
  • In SBD, claims are used to guide task selection
• Control aspects of situation that are uninteresting
  • e.g., experimenter, location, task order, instructions
• Collect multiple measures of usability impacts
  • Performance (time and errors), output quality
  • Satisfaction ratings or other subjective measures
• Interpretation comes back to validity of the test
  • Both ecological (realism) and internal (controls)
VT Usability Lab
• Partitioned into work, evaluation areas
• One-way mirrors allow observation
• Machines support logging and screen capturing
• Video recording and editing equipment
Controlled Experiments
• If asking a specific question, making a choice
• Operationalize independent and dependent variables
  • What is manipulated, what outcomes are measured
• Define hypotheses in advance of the test
  • Causal relation of independent and dependent variables
  • Testing these requires the use of inferential statistics
• Construct an effective and practical design
  • Within-subjects or between-subjects testing conditions
  • How many people to test, how to assign them to conditions
Controlled experiment issues
• Choosing independent variables
• Choosing dependent variables
• Controlling (holding constant) other variables
• Within- vs. between-subjects design
• Counterbalancing order of conditions (see the sketch below)
• Full factorial or partial designs
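One way to handle counterbalancing, sketched below: with only a few conditions, full counterbalancing cycles participants through every possible order. The condition names reuse the tools mentioned later in these slides; the participant list is hypothetical.

# Full counterbalancing of condition order in a within-subjects design.
from itertools import cycle, permutations

conditions = ["PerspWall", "Lifelines"]
orders = list(permutations(conditions))   # all n! possible orders

participants = ["P1", "P2", "P3", "P4"]
for p, order in zip(participants, cycle(orders)):
    print(p, "->", " then ".join(order))  # P1 gets one order, P2 the other, ...

With more conditions n! grows quickly, so a balanced Latin square (each condition appearing once in each serial position) is the usual partial design.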
Some Variations
• Usability testing with think-aloud instructions
  • Users comment as they work on their current goals, expectations, and reactions
  • BUT, thinking aloud takes capacity, changes task itself
  • Very useful in supporting formative evaluation goals
• Storefront testing: bring the prototype into the hall!
  • Fast, easy, quick cycle... but no control of users, tasks
• All of these can (should!) be supplemented with interviews and/or user reaction surveys
  • Objective measures of behavior not always correlated with subjective measures of experience or satisfaction
“Discount” Usability Evaluation
• Goal: get the most useful information for guiding re-design with the least cost
  • Pioneered by Jakob Nielsen (heuristic inspection)
• Do a little bit of each (analytic and empirical)
  • 3-4 experts find most of the guidelines issues
  • 4-6 users experience most of the actual use problems
  • Between the two, get a good sense of what to fix
• Not surprisingly, a popular strategy, pretty much what you find in practice
Conducting a Usability Test
• Recruiting of test participants
• Preparation of materials
  • informed consent, background & reaction questionnaires, general and task-specific instructions, data collection
• Test procedures
  • before, during, after; including assistance policy
• Summarizing and interpreting the results
Recruiting Test Participants
• Who are stakeholders, which ones are actors?
  • May mean different users for different tasks
  • Or, may mean users role-playing other stakeholders
• How do you get people to participate?!
  • Participatory design, but this has its own downsides
  • Offer stipends or other rewards
  • Make test seem interesting, emphasize novelty
  • Last resort, hire from a temp agency...
• Your project: OK to ask friends, classmates, students
  • Choose people who can role-play scenario context
Procedure
• Welcome
• Informed consent
• Demographic/background questionnaire
• Pre-testing
• Familiarize with equipment
• Exploration time with interface
• Tasks
• Questionnaires / post-testing
• Interviews
Subject “packets” are often useful for organizing information and data
Pilot testing should be used in most cases to:
• “debug” your procedure
• identify variables that can be dropped from the experiment
Informed Consent
• Always an issue when human subjects involved
  • The history: psychological research that deliberately deceives people, engages them in moral dilemmas, or is potentially harmful
• The fix: procedures must be approved by a committee
  • Ensures respect for individuals’ concerns and hesitations about participating
  • Full disclosure of procedures (except when necessary)
  • Clear statement of voluntary nature, participant’s rights
  • Signature indicating understanding and willingness
• See p. 256 for a model to use in developing your own informed consent form
User Background Questionnaire
• Characterize the user sample you end up with
  • Relevant experience, expectations, starting attitudes
  • The question: are these the users you need to test?
• But also, helps to interpret test results
  • e.g., experienced computer users will likely do better
  • Domain experts may be more critical, more specific
• A range of questions but not too long
  • Personal, demographic, experience, current attitudes
  • Shoot for one page, seems less intimidating, tedious
• See p. 258 for a model to use in developing your own user background survey
Developing User Rating Scales
• Convenient for gathering subjective reactions
  • often summarized numerically by mapping judgment categories to an ordinal variable (e.g. 1 → 5; see the sketch below)
  • flexible, can be very general or specific
  • can use to examine opinion change (post minus pre)
• Likert scale: measures strength of agreement to an assertion about the system or task domain

Shopping for groceries online is enjoyable.
Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree
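A minimal sketch of the numeric mapping just described: judgment categories become an ordinal variable (1 through 5), and because the result is ordinal, the median is a safer summary than the mean. The responses are hypothetical.

from statistics import median

# Map Likert categories to an ordinal variable (1 -> 5).
SCALE = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
         "Agree": 4, "Strongly Agree": 5}

responses = ["Agree", "Neutral", "Agree", "Strongly Agree", "Disagree"]
scores = [SCALE[r] for r in responses]
print(median(scores))  # 4 -> participants lean toward "Agree"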
Background testing
• You may want to correlate your findings with some standardized test score
• Examples
  • Spatial ability tests
  • Memorization tests
  • Domain knowledge tests
• Example result: “Users with high spatial ability preferred interface X, while others preferred interface Y”
Pre- and post-testing
• Some systems have a specific learning goal (i.e. education or training)
• Giving users the same test or type of test both before and after usage of the system is a way to measure learning
• Only used with very well-developed prototypes
User familiarization
• Make the user comfortable
• Tell them what you’re doing without giving everything away
• Show them the facilities and equipment
• Give them written instructions
• Tell them approximately how much time the evaluation will take
• Ask if they have any questions
System exploration
• User gets “free play” time with the system
• You can observe what they seem to understand easily and what is troublesome
• See if they “find” all the features or parts of the system
• Good with verbal protocol
Task Instructions
• General instructions that introduce overall test
• Two sorts of instructions, depending on test type
  • Open-ended and goal-directed, for scenario exploration
    • These participants will be doing think-aloud process
  • Usage context followed by very precise goals for subtasks
• Clear specification of the user’s goal
  • Avoid options or ambiguities unless part of the test
  • No step-by-step scripts: you are testing the system, not your ability to write complete instructions!
• See p. 254 & p. 259 for models to use in developing your own instructions
Task Instructions
Background to Tasks 1-4:
Imagine that you are a neighbor of a high school student (Jeff) who is participating in the VSF.
You have discovered that Jeff will be exhibiting his project tonight, and you log on to visit with
him. With you is your daughter Erin, who is a middle school student just getting interested in
science.
Task 1:
• Find Jeff’s exhibit and go to it in the VSF
Task 2:
• Locate Jeff’s position in his exhibit, and join him so that you are looking at the same material.
Task 3:
• Join the ongoing conversation (between Jeff and Sarah, another visitor). Let Jeff know that
Erin is with you, and ask him to show you around his exhibit.
Task 4:
• Follow Jeff’s directions about how to use the “asterisk tool” to mark the three Excel charts as
of interest to you and Erin.
Notice that these instructions make assumptions about system state at each point.
Metrics
• Task completion time
• Error rate
• Spoken comments
• Observation of user behavior
• “Critical incidents”
• User comfort (!)
Planning for Data Collection
• Be prepared: know in advance what and how
  • One evaluator interacts with the user, the other keeps track of what happens, collects times, etc.
  • A structured form or template can be very useful (see the sketch below)
• Take advantage of tools if available and easy to use
  • Video taping, screen capture, event logging, etc.
  • Particularly useful when collecting think-aloud data
• Know when and how to intervene if necessary
  • A three-stage assistance policy: “try again”, “look here”, and finally “just do this: ...”
  • Be ready to prompt (“what just happened?”) for users in the exploratory think-aloud condition
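One possible shape for that structured form, sketched as a small event log the observing evaluator fills in: one row per event makes times, errors, and critical incidents easy to tally later. The field names and event labels are assumptions for illustration, not a standard.

import csv
import time

FIELDS = ["participant", "task", "timestamp", "event", "note"]

def log_event(writer, participant, task, event, note=""):
    # One row per observed event, stamped with the current time.
    writer.writerow({"participant": participant, "task": task,
                     "timestamp": time.time(), "event": event, "note": note})

with open("session_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    log_event(writer, "P1", "task1", "start")
    log_event(writer, "P1", "task1", "error", "clicked wrong exhibit")
    log_event(writer, "P1", "task1", "assist", "stage 1: 'try again'")
    log_event(writer, "P1", "task1", "end")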
User Reaction Questionnaire
• Critical for gathering subjective reactions
  • For small tests, interviews can be even more useful
• Similar in structure to background questionnaire
  • But no demographics this time
  • May include change in opinion due to test experience
  • Specific rating scales tied directly to target outcomes in the usability specifications
  • The “three best” or “three worst” features
  • Don’t forget the “anything else?” at the end
• See p. 261 for a model to use in developing your own questionnaire
Task-Specific Usability Judgments

I was confused by commands used to synchronize and un-synchronize with others.
Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree

The procedure for uploading files into exhibit components is familiar to me.
Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree

Learning that I could not make permanent changes to project data increased my confidence.
Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree

Creating a new exhibit element that is nested behind another element is complex.
Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree
Debriefing
• Talk to the users about the session
• Assure them they did well
• Give more details about what you’re doing if they are interested
• Thank them
• Give them donuts (or $)
Results: Summarizing User Data
• How you summarize depends on variable type:
  • Categorical: responses are classified into groups
  • Ordinal: responses fall in groups, but natural order
  • Interval: a scale with equidistant values
  • Ratio: numerical scale with defined zero value
  • Qualitative: comments to organize and discuss
• Examples of each?
• What are appropriate summary treatments of these differing kinds of variables? (see the sketch below)
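A sketch of the standard answers, with hypothetical data: frequency counts (or the mode) for categorical responses, the median for ordinal responses, and the mean with standard deviation for interval/ratio measures. Qualitative comments are organized and discussed rather than computed on.

from statistics import mean, median, mode, stdev

problem_categories = ["navigation", "upload", "navigation", "sync"]
print(mode(problem_categories))             # categorical -> mode / counts

likert_scores = [2, 3, 4, 4, 5]
print(median(likert_scores))                # ordinal -> median

task_times = [72.0, 95.5, 60.2, 88.1]       # seconds: ratio scale (true zero)
print(mean(task_times), stdev(task_times))  # interval/ratio -> mean and SD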
Statistics
• t-test
  • Compares 1 dependent variable on 2 treatments of 1 independent variable
• ANOVA: ANalysis Of VAriance
  • Compares 1 dependent variable on n treatments of m independent variables
• Result: “significant difference” between treatments?
  • p = significance level (confidence)
  • typical cut-off: p < 0.05
Statistics in Microsoft Excel
• Enter data into a spreadsheet
• Go to Tools..., Data Analysis... (may need to choose Analysis ToolPak from Add-Ins first)
• Select appropriate analysis
t-tests in Excel
• Used to compare two groups of data (see the SciPy sketch below for an Excel-free equivalent)
• Most common is “t-test: two-sample assuming equal variances”
• Other t-tests:
  • Paired two-sample for means
  • Two-sample assuming unequal variances
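Outside Excel, the same two-sample test (equal variances assumed) is one call in SciPy; the completion times below are hypothetical.

from scipy import stats

perspwall = [62.1, 58.4, 70.3, 66.0, 59.8]  # task1 times in seconds
lifelines = [81.5, 77.2, 90.1, 74.8, 85.3]

t, p = stats.ttest_ind(perspwall, lifelines)  # equal_var=True is the default
print(f"t = {t:.2f}, p = {p:.4f}")
# Excel's "paired two-sample for means" corresponds to stats.ttest_rel;
# for unequal variances, pass equal_var=False.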
ANOVAs in Excel
• Allows for more than two groups of data to be compared (see the SciPy sketch below)
• Most common is “ANOVA: Single factor analysis”
• Other ANOVAs:
  • ANOVA: Two-factor with replication
  • ANOVA: Two-factor without replication
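The single-factor case in SciPy, for comparison; the three groups of times are hypothetical.

from scipy import stats

group_a = [62.1, 58.4, 70.3, 66.0]
group_b = [81.5, 77.2, 90.1, 74.8]
group_c = [55.0, 61.2, 58.7, 63.9]

f, p = stats.f_oneway(group_a, group_b, group_c)  # one-way ANOVA
print(f"F = {f:.2f}, p = {p:.4f}")
# Two-factor designs (with or without replication) are usually fitted
# with statsmodels (anova_lm) rather than scipy.stats.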
p < 0.05
• Found a “statistically significant difference”
• Averages determine which is ‘better’
• Conclusion:
  • Vis Tool has an “effect” on user performance for task1
  • PerspWall better user performance than Lifelines for task1
  • “95% confident that PerspWall better than Lifelines”
  • NOT “PerspWall beats Lifelines 95% of time”
• Found a counterexample to the null hypothesis
  • Null hypothesis: Lifelines = PerspWall
  • Hence: Lifelines ≠ PerspWall
p > 0.05
• Hence, same?
  • Vis Tool has no effect on user performance for task1?
  • Lifelines = PerspWall?
• Be careful!
  • We did not detect a difference, but the conditions could still be different
  • Did not find a counter-example to the null hypothesis
  • Provides evidence for Lifelines = PerspWall, but not proof
• Boring! Basically found nothing
• How?
  • Not enough users (other tests can verify this; see the power-analysis sketch below)
  • Need better tasks, data, ...
Reporting Results
• Often considered the most important section of professional papers
• Statistics NOT the most important part of the results section
• Statistics used to back up differences described in a figure or table
Reporting Means, SDs, t-tests
• Give means and standard deviations, then t-test
... the mean number was significantly greater in condition 1 (M=9.13, SD=2.52) than in condition 2 (M=5.66, SD=3.01), t(44)=3.45, p=.01
What Are Those Numbers?
... the mean number was significantly greater in condition 1 (M=9.13, SD=2.52) than in condition 2 (M=5.66, SD=3.01), t(44)=3.45, p=.01
• M is the mean
• SD is the standard deviation
• t is the t stat
• the number in parentheses is the degrees of freedom (df)
• p is the probability the difference occurred by chance (see the sketch below)
Reporting ANOVAs
... for the three conditions, F(2,52)=17.24, MSE=4528.75, p<.001
• F(x,y) -- F value for x between-groups and y within-groups degrees of freedom (df)
• MSE -- mean square error for the between-groups condition
• p -- probability that difference occurred by chance
Making Sense of the Results
• Relate to high-level goals: is the system useful, easy to learn and use, satisfying?
  • Which of these is hardest to judge in a lab study?
• But also compare directly to usability specs:
  • Did you miss, meet, or surpass the target level?
  • More importantly, can you figure out why?
• Guidance on how to change design comes from the details of the testing, not the summary values
  • Why was user confused (or not), why was an interaction difficult or awkward, etc.
Usability Specifications
• Quality objectives for final system usability
  • like any specification, must be precise
  • managed in parallel with other design specifications
• In SBD, these come from scenarios & claims
  • scenarios are analyzed as series of critical subtasks
  • reflect issues raised and tracked through claims analysis
  • each subtask has one or more measurable outcomes
  • tested repeatedly in development to assess how well project is doing (summative) as well as to direct design effort toward problem areas (formative)
Precise specification, but in a context of use
[Diagram relating four elements:]
• Design scenarios: extract motivation and context for subtasks to be tested
• Usability specifications: a list of subtasks with performance and satisfaction parameters
• Activity, information, and interaction claims: identify key design features to be tested
• Estimates of behavior: published or pilot data of expected user behavior
What about Generality?
• Salient risk in focusing only on design scenarios
  • may optimize for these usage situations
  • the “successful” quality measures then reflect this
• When possible, add contrasting scenarios
  • overlapping subtasks, but different user situations (user category, background, motivation)
  • assess performance and satisfaction across scenarios
• Motivation to construct functional prototypes as early as feasible in development cycle
• Where do targets come from? Serious, but not absolute
• Notice that we can also “test” overarching scenario
A Sample Usability Specification

Scenario & Subtasks | Worst Case | Planned | Best Case
Interaction Scenario: Mr. King coaches Sally | 2.5 on usefulness, ease of use, and satisfaction | 4 on usefulness, ease of use, and satisfaction | 5 on usefulness, ease of use, and satisfaction
1. Identify Sally’s view and synchronize | 1 minute, 1 error; 3 on confusion | 30 seconds, 0 errors; 2 on confusion | 10 seconds, 0 errors; 1 on confusion
2. Upload desktop file from the PC | 3 minutes, 2 errors; 3 on familiarity | 1 minute, 1 error; 4 on familiarity | 30 seconds, 0 errors; 5 on familiarity
3. Open, modify, try to save Excel file | 2 minutes, 1 error; 3 on confidence | 1 minute, 0 errors; 4.5 on confidence | 30 seconds, 0 errors; 5 on confidence
4. Create nested exhibit component | 5 minutes, 3 errors; 3 on complexity | 1 minute, 1 error; 2 on complexity | 30 seconds, 0 errors; 1 on complexity
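A small sketch of using such a specification summatively: compare an observed outcome for a subtask against its worst-case/planned/best-case targets. The time targets come from subtask 1 in the table; the observed value is hypothetical.

# Targets for subtask 1 from the table: (worst, planned, best) in seconds.
SPEC = {"identify view and synchronize": (60, 30, 10)}

def assess(subtask, observed_seconds):
    worst, planned, best = SPEC[subtask]
    if observed_seconds > worst:
        return "missed worst case -- redesign needed"
    if observed_seconds > planned:
        return "between worst case and planned -- investigate why"
    return "met or surpassed the planned target"

print(assess("identify view and synchronize", 42))  # between worst case and planned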