Introduction to Usability Studies, Presented to Baobab Health Trust

Post on 30-Nov-2014

DESCRIPTION

Introduction to usability studies, including basic statistical analyses. Presented to Baobab Health Trust, Lilongwe, Malawi, March 2014.

TRANSCRIPT

Baobab Health, March 2014. Harry Hochheiser, harryh@pitt.edu

Usability Studies and Empirical Studies

Harry Hochheiser

University of Pittsburgh, Department of Biomedical Informatics

harryh@pitt.edu

+1 410 648 9300

Attribution-ShareAlike CC BY-SA

Outline

Usability Studies

Think-Aloud

Summative Studies

Empirical Studies

Beyond Inspections

Inspections won't tell you which problems users will face in action

Might not identify mental models and confusions

... finding out where things go wrong.

No bright dividing line in process

[Figure: development stages, from Design to Paper Prototype to Fully-functional Prototype to Release]

Evaluation methods across these stages:

Usability Inspections

Usability Studies

Empirical User Studies, Case Studies, Longitudinal Studies, Acceptance Tests

Low cost, low validity -> Higher cost, higher validity

Formative Usability Studies: Goals

• Generally, to understand if the proposed design supports completion of intended tasks

• Be specific:

• Tasks and users

• Define success

• User Satisfaction?

• Do users like the tool?

• What are the important metrics?

Formative Usability Studies: Tasks

• Representative and specific

• What would users do?

• Realistic – given available time and resources

• Appropriate for assessment of goals

• Possibly some user-defined/suggested

• Particularly if participants were informants in earlier requirements-gathering

Formative Usability Studies: Which Tasks?

Bad: Give this a try?

Better: Try to send an email, find a contact, and file a response

Still better: Detailed scenario with multiple actions that require coordinated use of diverse components of an application's functionality

Formative Usability Studies: Conditions

• Usability Lab

• Two-way mirrors/separate rooms

• Workspace

• Online?

• Often video and/or audio-recorded

• Screen-capture

• Logs and instrumented software

• Goal: Ecological Validity

Formative Usability Studies: Measures

• Key question to answer: “can users complete tasks?”

• Generally, lists of usability problems

• Description of difficulty

• Severity

• Task completion times – depending on methods

• Error rates?

• User Satisfaction

• Quantitative results for measuring success

• Not comparative

Formative Usability Studies: Methodology

• Define Scope

• Users complete tasks

• Researchers observe process

• What happens?

• What goes right? What goes wrong?

• Note difficulties, confusions?

• Record – audio/video, screen capture

Formative Usability Studies: Participants

• Somewhat representative of likely users

• Willing guinea-pigs

• Need folks who are patient, willing to deal with problems

• Well-motivated

• Compensated

• Eager to use the tool

• Small numbers – repeat until diminishing returns

• How many?

Only 5 users – or maybe not

Nielsen – why you only need to test with 5 users http://www.useit.com/alertbox/20000319.html

Hwang & Salvendy (2010) – maybe need 10 +/- 2
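The "5 users" claim in the linked Nielsen article rests on a simple model: if each test user reveals a proportion L of the usability problems (about 31% on average in his data), then n users are expected to reveal 1 - (1 - L)^n of them. A rough Python sketch of that curve is below. The function name is illustrative, and the 31% rate is an average that may not hold for a given project, which is one reason other authors suggest larger numbers.

```python
def proportion_found(n_users, per_user_rate=0.31):
    """Expected share of usability problems found by n_users test users.

    per_user_rate is the assumed chance that a single user reveals a
    given problem (0.31 is the average reported in Nielsen's article).
    """
    return 1 - (1 - per_user_rate) ** n_users


for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users: {proportion_found(n):.0%} of problems")
```

With a lower per-user rate (say 0.15) the curve flattens much later, so the "right" number of users depends on how easily the problems show themselves.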

Two approaches

• Observation

• Subject performs tasks, researchers observe

• Ecological validity, but no insight into users

• “Think aloud”

• User describes mental state and goals

Think-Aloud Protocols

• User describes what they are doing and why as they try to complete a task

• Describe both goals and steps taken to achieve those goals.

• Observe

• Confusions – when steps taken don't lead to expected results

• Misinterpretations – when choices don't lead to expected outcomes

• Goal: identify both micro- and macro-level usability concerns

• Strong similarities with contextual inquiry, but..

• Focus specifically on tool

• Participant encouraged to narrate

• Evaluator generally doesn’t ask questions

Caveats

• Think-aloud is harder than it might sound

• What is the role of the investigator?

• How much feedback to provide?

• Very Little

• What (if anything) do you say when the user runs into problems?

• Not much

• What if it's a system that you built?

• How to identify/describe a usability problem?

Think-Aloud Protocols: A Comparison of Three Think-Aloud Protocols for use in Testing Data-Dissemination Web Sites for Usability. Olmsted-Hawala, et al. 2010

“... it is recommended that rather than writing a vague statement such as 'we had participants think aloud,' practitioners need to document their type of TA protocol more completely, including the kind and frequency of probing.”

Reporting Usability Problems

adapted from Mack & Montaniz, 1994

• Breakdowns in goal-directed behavior

• Correct action, noticeable effort

• To find

• To execute

• Confused by consequence

• Correct action, confusing outcome

• Incorrect action requires recovery

• Problem tangles

• Qualitative analysis by interface interactions

• Objects and actions

• Higher-level categorization of interface interactions

Gulf of Execution

Gulf of Evaluation

Reporting Usability Problems

adapted from Mack & Montaniz, 1994

• Inferring possible causes of problems

• Problem reports

• Design-relevant descriptions

• Quantitative analysis of problems by severity

Formative Usability Studies: Analysis

• Challenge – identify problems at the right level of granularity?

• When does a series of related difficulties lead to a need for redesign?

• What if these difficulties come from different tasks?

• When appropriate, relate usability observations back to contextual inquiry or other earlier investigations

• Does the implementation fail to line up with the needs?

• Perhaps in some unforeseen manner?

Formative Usability Studies: Analysis

• Multiple observers

• Calculate agreement metrics?

• Use audio, video, transcripts to illustrate difficulties

• Particularly useful for demonstrating problems to implementation folks

• Rate problem severity

• Which are show-stoppers and which are nuisances?

• Which require redesign vs. small changes?

• Must prioritize...
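One common agreement metric for two observers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which metric was intended, so the Python sketch below is only one possible choice, and the severity ratings in it are made up for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: based on each observer's own label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both observers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical severity ratings for ten observed problems.
observer_1 = ["major", "major", "minor", "minor", "major",
              "minor", "major", "minor", "minor", "major"]
observer_2 = ["major", "minor", "minor", "minor", "major",
              "minor", "major", "major", "minor", "major"]
print(round(cohens_kappa(observer_1, observer_2), 2))
```

Here the observers agree on 8 of 10 items, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw 80%.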

Completion – Summative User Studies

• Demonstrate successful execution of system

• With respect to

• Alternative system – even if straw man

• Stated performance goals – Acceptance Tests

• Generally empirical

Completion – Summative Studies of systems in use

• Case studies

• Descriptions of individual deployments

• Qualitative

• Longitudinal study of ongoing use

• Collect data regarding impact

• Similar to case studies, but potentially more quantitative.

• Use observations and interviews to see what works?

Summative Tests

After system is complete

More realistic conditions?

Acceptance tests

Usability tests aimed at measuring success

Does the tool do what the client wants?

• 95% task completion rate within 3 minutes, etc.?

Client has clearer idea – not just “user friendly”
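An acceptance criterion such as "95% task completion rate within 3 minutes" can be checked mechanically once completion times are recorded. The Python sketch below assumes one time per attempt in seconds, with None standing for an abandoned task. The function name, the data layout, and the sample numbers are all hypothetical.

```python
def meets_acceptance_criterion(times_sec, limit_sec=180, required_rate=0.95):
    """Return (passed, rate) for a list of task completion times.

    times_sec holds one entry per attempt; None means the participant
    did not finish the task at all.
    """
    if not times_sec:
        raise ValueError("no attempts recorded")
    on_time = sum(1 for t in times_sec if t is not None and t <= limit_sec)
    rate = on_time / len(times_sec)
    return rate >= required_rate, rate


# Hypothetical results: 20 attempts, one too slow and one abandoned.
attempts = [95, 120, 88, 150, 175, 60, 110, 130, 99, 140,
            101, 170, 85, 92, 160, 210, 145, 133, None, 118]
passed, rate = meets_acceptance_criterion(attempts)
print(f"completed within limit: {rate:.0%}, criterion met: {passed}")
```

Counting abandoned attempts in the denominator is a design choice; a client might define the rate differently, which is exactly why the criterion should be written down before testing.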

What: Empirical Studies

• Quantitative measure of some aspect of successful system use

• Task completion time (faster is better)

• Error rate

• Learnability

• Retention

• User satisfaction...

• Quality of output?

Tension in empirical studies

• Metrics that are easy to measure may not be most interesting

• Task completion time

• Error rate

• Great for repetitive data entry tasks, less so for complex tasks

• Analytics, writing...

Empirical User Studies: Goals

• I have two interfaces – A and B.

• Which is better? And how much better?

• Want to determine if there is a measurable, consistent difference in

• Task completion times

• Error rates

• Learnability

• Memorability

• Satisfaction

Running Example: Menu Structures

• Hierarchical Menu structures

• Multiple possibilities for any number of leaf nodes

• Broad/Shallow vs. Narrow/Deep

• which is faster?

Hypothesis

• Testable Theory about the world

• Galileo: The rate at which falling items fall is independent of their weight

• Menus

• Users will be able to find items more quickly with broad/shallow trees than with narrow/deep trees.

• Often stated as a “null hypothesis” that you expect to reject:

• There will be no difference in task performance time between broad/shallow trees and narrow/deep trees.

Background/Context

• Controlled experiments from cognitive psychology

• State a testable/falsifiable hypothesis

• Identify a small number of independent variables to manipulate

• hold all else constant

• choose dependent variables

• assign users to groups

• collect data

• statistically analyze & model

Other goals

• Strive for

• removal of bias

• replicable results

• Generalizable theory that can inform future work

• or, demonstrable evidence of preference for one design over another.

Empirical User Studies: Tasks

• Use variants of the design to complete some meaningful operation

• Usually relatively close-ended, well-defined

• Relatively clear success/failure

Empirical User Studies: Conditions

• Lab-like?

• Simulated realistic conditions?

Independent Variables

• What are you going to test?

• Condition that is “independent” of results

• independent of user's behaviors

• independent of what you're measuring.

• one of 2 (or 3 or 4) things you're comparing.

• can arise from subjects being classified into groups

• Examples

• Galileo: dropping a feather vs. bowling ball

• Menu structures – broad/shallow vs. narrow/deep

Dependent variable

• Values that the hypothesis tests

• falling time

• task performance time, etc.

• May have more than one

• Goal: show that changes in independent variable lead to measurable, reliable changes in dependent variables.

• With multiple independent variables, look for interactions

• Differences between interfaces increase with differences in task complexity

Controls

• In order to reliably say that independent variables are responsible for changes in dependent variables, we must control for possible confounds

• Control – keep other possible factors constant for each condition/value of independent variables

• types of users, contexts, network speeds, computing environments

• confound – uncontrolled factor that could lead to an alternate explanation for the results

• What happens if you don’t control as much as possible?

• Confounds, not independent variables, may be the cause of changes in dependent variables.

Examples of Controls

• Galileo:

• windy day vs. not windy?

• Menus

• network speed/delays? (do everything on one machine)

• skills of users? (more on participant selection later)

• font size, display information, etc.?

Bias

• Related to controls

• Experimenter can introduce biases that might influence outcomes

• Instructions?

• Choice of participants?

• more on this in a moment

• Protocols

• prepare scripts ahead of time

• Learning Effects?

Thanks to Jinjuan Feng for figure

Between-Groups vs. Within-Groups Design

• How do you assign participants to conditions?

• All people do all tasks/cells?

• Within-groups – compare within groups of individuals.

• one group of test participants

• Certain people for certain cells?

• between groups – compare between groups of individuals

• 2 or more groups

• Mixed models

Between Groups

• Pros

• Simpler design

• Avoid learning effect

• Don't have to worry about ordering

• Cons

• may need more participants

• to get enough data for statistical tests

• to avoid influence of some individuals.

Within-Groups

• Pros:

• Can be more powerful statistically

• same person uses each of multiple interfaces

• Fewer Participants

• Cons

• Learning effects require appropriate randomization of tasks/interfaces

• Fatigue is possible
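One way to handle learning effects in a within-groups design is to counterbalance: vary the order in which participants see the interfaces so that each order is used about equally often. The Python sketch below shows the idea for the two menu designs. The function name and participant labels are hypothetical, and with many interfaces a Latin square would normally replace the full set of permutations.

```python
from itertools import permutations


def counterbalanced_orders(participants, interfaces):
    """Cycle through every ordering of the interfaces so that each
    ordering is used about equally often across participants."""
    orders = list(permutations(interfaces))
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}


# Hypothetical study: six participants, the two menu designs.
participants = [f"P{i}" for i in range(1, 7)]
assignment = counterbalanced_orders(
    participants, ["broad/shallow", "narrow/deep"]
)
for person, order in assignment.items():
    print(person, "->", " then ".join(order))
```

Shuffling the participant list first (for example with random.shuffle) would keep the experimenter from choosing who gets which order.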

Mixed Models

• Elements of both

• 3 different interfaces

• Want to compare performance of different groups

• Docs vs. Nurses?

• Each interface a within-subject experiment

• Across professions is between-subjects.

Other Challenges

• Ordering tasks?

• How many?

• Want to avoid fatigue, boredom, and expense of long sessions

• How many users?

• 20 or more?

• Variability among subjects

• May be unforeseen.

• Bi-modal distribution of education or computer experience?

• Training materials

• Run a pilot

Procedure

• Users conduct tasks

• Measure

• record task completion times

• errors

• etc.

• Now what?

• Analyze data to see if there is support for the hypothesis

• alternatively, if the

Hypothesis Testing

• Not about proof or disproof

• Instead, examine data

• Find the probability of seeing data at least this extreme if the null hypothesis is true

• If this is small, reject the null hypothesis and say that we have support for the hypothesis

Data, Stats, and R

• Need to talk about

• data distributions

• statistical analyses

• to do hypothesis testing

• Tools:

• R - r-project.org

• RStudio - rstudio.org

Sampling

• Data sets come from some ideal universe

• all possible task performance times for a given menu selection task

• Compare two samples with given means and deviations

• Are they really different? Or do they just appear different by chance?

• Statistical testing gives us a p-value

• probability of seeing differences at least this large by chance alone, if both samples come from the same distribution

• low values are significant

The key questions

• Given two sets of measurements, or samples, did they come from the same underlying source or distribution?

• x = [29 33 89 56 86 85 7 84 67 78 59 28 10 76 11 12 97 61 66 9 40 95 90 4 31 18 24 48 45 82]

• y = [51 3 10 11 5 90 87 13 64 86 67 98 12 55 56 80 59 63 94 93 25 4 79 52 36 73 99 22 62 2]

• mean(x) = 50.67, sd(x)=31.01

• mean(y) = 51.7, sd(y) = 33.26

• are they from the same distribution?
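The summary values for x and y can be recomputed with a few lines of code. The slides use R for this; the Python sketch below uses only the standard library and should match the values on the slide to within rounding of the last digit.

```python
from statistics import mean, stdev

x = [29, 33, 89, 56, 86, 85, 7, 84, 67, 78, 59, 28, 10, 76, 11,
     12, 97, 61, 66, 9, 40, 95, 90, 4, 31, 18, 24, 48, 45, 82]
y = [51, 3, 10, 11, 5, 90, 87, 13, 64, 86, 67, 98, 12, 55, 56,
     80, 59, 63, 94, 93, 25, 4, 79, 52, 36, 73, 99, 22, 62, 2]

# stdev() is the sample standard deviation (divides by n - 1), like R's sd().
for name, sample in (("x", x), ("y", y)):
    print(f"mean({name}) = {mean(sample):.2f}, sd({name}) = {stdev(sample):.2f}")
```

The means differ by about one unit while each sample has a standard deviation above 30, which already hints that the difference is small relative to the spread.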

Boxplot

• Show quartiles

• Are they the same?

“Normal” distributions

• Given mean and standard deviation (measure of variation)

• About 95% of area under curve within 2 standard deviations of the mean

• If you take many samples from a space

• Their averages will tend toward a normal distribution

• Statistical testing -> comparison of distributions.

Histograms

Draw a subset of a population, 1000 times

Get the average of each subset

The averages approximate a normal distribution
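The resampling experiment can be tried directly. The Python sketch below is an illustration, not the code behind the original histogram: it builds a made-up, deliberately skewed population, draws 1000 subsets of 30 values, and looks at the subset averages. The population, the subset size, and the seed are all assumptions.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "population": 10,000 task times from a skewed
# (exponential) distribution with a mean of about 50 seconds.
population = [random.expovariate(1 / 50) for _ in range(10_000)]

# Draw 1000 subsets of 30 values each and keep the average of each subset.
sample_means = [mean(random.sample(population, 30)) for _ in range(1000)]

print(f"population mean:      {mean(population):.1f}")
print(f"mean of sample means: {mean(sample_means):.1f}")
print(f"sd of sample means:   {stdev(sample_means):.1f}")


def text_histogram(values, bin_width=5):
    """Print a rough text histogram; one '#' per 10 values in a bin."""
    counts = {}
    for v in values:
        start = int(v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    for start in sorted(counts):
        print(f"{start:3d}-{start + bin_width:3d} {'#' * (counts[start] // 10)}")


text_histogram(sample_means)
```

Even though the population is strongly skewed, the histogram of averages should look roughly bell-shaped and much narrower than the population itself.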

Hypothesis testing

• Test probability that there is no difference between two distributions

• Possible errors

• Type 1 Error: α – reject null hypothesis when it is true

• believe there is a difference when there is none

• False positive

• Type 2 Error: β – accept (fail to reject) null when false

• believe no difference when there is

• False Negative

Significance Levels and Errors

• Highly significant (p < 0.001)

• Don't believe there is a difference unless it's really clear

• low chance of false positive – Type 1

• Greater chance of false negative / Type 2

• Less significant (p < 0.05)

• More ready to believe there is a difference

• More false positive/type 1 errors

• fewer type 2 errors

• Usually use p=0.05 as cut-off.
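The link between the cut-off and Type 1 errors can be seen in a small simulation: run many experiments in which the null hypothesis is true (both samples come from the same distribution) and count how often the test still reports a difference. The Python sketch below uses a normal approximation for the p-value to stay within the standard library. That approximation, the sample size, and the distribution parameters are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

random.seed(1)  # fixed seed so the sketch is repeatable


def approx_p_value(a, b):
    """Two-sided p-value for a difference in means, using a normal
    approximation to the test statistic (reasonable for ~30+ per group)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, trials=2000, n=30):
    """Share of experiments that report a difference when none exists."""
    hits = 0
    for _ in range(trials):
        # Both samples come from the same distribution: the null is true.
        a = [random.gauss(50, 30) for _ in range(n)]
        b = [random.gauss(50, 30) for _ in range(n)]
        if approx_p_value(a, b) < alpha:
            hits += 1
    return hits / trials


rates = {alpha: false_positive_rate(alpha) for alpha in (0.05, 0.001)}
for alpha, rate in rates.items():
    print(f"cut-off {alpha}: false positive rate about {rate:.3f}")
```

The false positive rate should land near the chosen cut-off, so roughly one in twenty such null experiments "succeeds" at p < 0.05.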

Type 1 and Type 2 errors

Type 1 error: reject the null hypothesis when it is, in fact, true

Type 2 error: accept the null hypothesis when it is, in fact, false

[Table: Decision vs. Reality]

Statistical Methods – Crash Course

• Comparisons of samples

• t-tests: 2 alternatives to compare

• ANOVA: > 2 alternatives, multiple independent variables

• Correlation

• Regression

t-test

• x = [29 33 89 56 86 85 7 84 67 78 59 28 10 76 11 12 97 61 66 9 40 95 90 4 31 18 24 48 45 82]

• y = [51 3 10 11 5 90 87 13 64 86 67 98 12 55 56 80 59 63 94 93 25 4 79 52 36 73 99 22 62 2]

• t.test(x,y)

Results

Welch Two Sample t-test

data: x and y
t = -0.1245, df = 57.72, p-value = 0.9014
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17.65522 15.58855
sample estimates:
mean of x mean of y
50.66667 51.70000
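To see where the t and df values in this output come from, the Welch statistic can be computed by hand. The Python sketch below uses only the standard library and stops short of the p-value, which needs the t distribution (in Python, scipy.stats.ttest_ind(x, y, equal_var=False) would provide it, assuming SciPy is installed). The function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance

x = [29, 33, 89, 56, 86, 85, 7, 84, 67, 78, 59, 28, 10, 76, 11,
     12, 97, 61, 66, 9, 40, 95, 90, 4, 31, 18, 24, 48, 45, 82]
y = [51, 3, 10, 11, 5, 90, 87, 13, 64, 86, 67, 98, 12, 55, 56,
     80, 59, 63, 94, 93, 25, 4, 79, 52, 36, 73, 99, 22, 62, 2]


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    # Squared standard error contributed by each sample.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t, df = welch_t(x, y)
print(f"t = {t:.4f}, df = {df:.2f}")
```

A t value this close to zero means the difference in means is tiny compared with its standard error, which is why the reported p-value is so large (0.9014): there is no evidence that x and y come from different distributions.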

xkcd on significance testing http://xkcd.com/882/

Correlation

• Attributing causality

• a correlation does not imply cause and effect

• the relationship may be due to a third “hidden” variable related to both other variables

• drawing strong conclusions from small numbers

• unreliable with small groups

Regression

Calculates a line of “best fit”

Use the value of one variable to predict the value of the other

r = .82, r² = .67, p < 0.01
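A least-squares line and its correlation coefficient can be computed from a handful of sums. The Python sketch below shows the calculation on a small made-up data set. The variable names and numbers are hypothetical and are not the data behind the figure on this slide.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares slope, intercept, and Pearson r for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: menu depth vs. mean selection time in seconds.
depth = [1, 2, 3, 4, 5]
seconds = [2, 4, 5, 4, 5]
slope, intercept, r = fit_line(depth, seconds)
print(f"time = {intercept:.2f} + {slope:.2f} * depth")
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```

For a single predictor, r² is just the square of r: a correlation of .82 corresponds to r² of about .67, meaning roughly two thirds of the variation in one variable is accounted for by the line.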

Be careful http://xkcd.com/552/

User Modeling

Hourcade, et al. 2004

Predict performance characteristics?

Calculate index of difficulty

similar to Fitts' law: MT = a + b log2(A/W + 1)

Linear regression to see how well it fits
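The formula has two parts: the index of difficulty, log2(A/W + 1), computed from the distance to the target (A) and its width (W), and the coefficients a and b, which are estimated by linear regression on measured movement times. The Python sketch below illustrates both parts. The function names, the pixel units, and the coefficient values are assumptions, not figures from the cited study.

```python
from math import log2


def index_of_difficulty(amplitude, width):
    """ID = log2(A/W + 1), in bits: harder for far or small targets."""
    return log2(amplitude / width + 1)


def predicted_movement_time(amplitude, width, a, b):
    """MT = a + b * ID; a and b come from a regression on measured times."""
    return a + b * index_of_difficulty(amplitude, width)


# Hypothetical target: 300 px away and 20 px wide, so A/W + 1 = 16.
difficulty = index_of_difficulty(300, 20)
print(f"index of difficulty: {difficulty} bits")

# Made-up coefficients (seconds, and seconds per bit) for illustration.
print(f"predicted time: {predicted_movement_time(300, 20, 0.2, 0.1):.2f} s")
```

In a study, the measured times for each index of difficulty would be fed into a regression, and the r² of that fit shows how well the model describes the users.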

Longitudinal use

• Lab studies are artificial

• Many tools used over time.

• use and understanding evolve

• Longitudinal studies look at usage over time

• Expensive, but better data

• Techniques

• Interviews, usability tests with multiple sessions, continuous data logging, Instrumented software, Diaries

Case Studies

• In-depth work with small number of users

• Multiple sessions

• Describe scenarios

• Illustrate use of tool to accomplish goals

• Good for novel designs, expert users

• Formative evaluation – can be used to gather requirements

• Summative – show validity of idea

• Possibly less compelling than usability evaluations.

Informed Consent

• Research must be done in a way that protects participants

• Principles

• Respect for persons

• Beneficence – minimize possible harms, maximize possible benefits

• Justice – costs and benefits should not be limited to certain populations

• Institutional Review Board (IRB) – approves experiments and requires signatures on “informed consent” form.

• Crucial for responsible research

Other Metrics

What if task completion time is not the most important metric?

Insight?

Automated Usability Testing

Possible for defined criteria

Text complexity?

Accessibility

WCAG

Section 508

Example: wave.webaim.org.

Log File Analysis

• Use clickstream and usage data to study actual use

• Which parts of the system are people using?

• Which are they not using?

• Are they going in circles?

• Are they having problems?

• Rich data, but hard to interpret

• particularly without observations or interviews to provide context.
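A first pass over clickstream data can answer the simpler questions on this slide with basic counting. The Python sketch below assumes a log already reduced to (user, screen) events in time order. The screen names, the event format, and the "three visits means going in circles" threshold are all hypothetical, and such flags only suggest where to look.

```python
from collections import Counter, defaultdict

# Hypothetical list of screens the system offers.
ALL_SCREENS = {"home", "search", "patient", "report", "settings"}

# Hypothetical clickstream: (user, screen) events in time order.
events = [
    ("u1", "home"), ("u1", "search"), ("u1", "patient"),
    ("u2", "home"), ("u2", "search"), ("u2", "home"),
    ("u2", "search"), ("u2", "home"), ("u2", "search"),
    ("u3", "home"), ("u3", "report"),
]


def usage_summary(events, all_screens, loop_threshold=3):
    """Return (visits per screen, unused screens, users who may be looping)."""
    visits = Counter(screen for _, screen in events)
    unused = sorted(all_screens - set(visits))
    per_user = defaultdict(Counter)
    for user, screen in events:
        per_user[user][screen] += 1
    # A user who keeps returning to the same screen may be going in circles.
    loopers = sorted(
        user for user, counts in per_user.items()
        if max(counts.values()) >= loop_threshold
    )
    return visits, unused, loopers


visits, unused, loopers = usage_summary(events, ALL_SCREENS)
print("visits per screen:", dict(visits))
print("never used:", unused)
print("possibly going in circles:", loopers)
```

Whether user u2 is lost or simply doing repetitive work cannot be read from the log alone, which is the interpretation problem the slide points to.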

Shortcomings of User Studies

What happens in the lab may not be reflected in real use

Deployment/post-mortem, etc.

Case studies, qualitative work

How can we meaningfully evaluate a system in use

… when deployment presents a significant expense...
