
Page 1: Usability evaluation methods (part 2) and performance metrics

CN5111 – Week 4: Usability evaluation methods (part 2) and performance metrics Dr. Andres Baravalle

Page 2: Usability evaluation methods (part 2) and performance metrics

Lecture content
• Usability testing (review & scenarios)
• Usability inspection
• Usability inquiry
• Performance metrics

Page 3: Usability evaluation methods (part 2) and performance metrics

Usability testing (review and scenarios)

Page 4: Usability evaluation methods (part 2) and performance metrics

Usability testing
• When: common for comparison of products or prototypes
• Tasks & questions focus on how well users perform tasks with the product
  – Focus is on time to complete task & number & type of errors
• Data collected by video & interaction logging
• Experiments are central in usability testing
  – Usability inquiry tends to use questionnaires & interviews

Page 5: Usability evaluation methods (part 2) and performance metrics

Testing conditions
• Usability lab or other controlled space
• Emphasis on:
  – Selecting representative users
  – Developing representative tasks
• Small sample (5-10 users) typically selected
• Tasks usually last no longer than 30 minutes
• The test conditions should be the same for every participant

Page 6: Usability evaluation methods (part 2) and performance metrics

Some types of data
• Time to complete a task
• Time to complete a task after a specified time away from the product
• Number and type of errors per task
• Number of errors per unit of time
• Number of navigations to online help or manuals
• Number of users making a particular error
• Number of users completing task successfully

Page 7: Usability evaluation methods (part 2) and performance metrics

How many participants are enough for user testing?
• The number is a practical issue
• Depends on:
  – Schedule for testing
  – Availability of participants
  – Cost of running tests
• Typically 5-10 participants
  – Some experts argue that testing should continue with additional users until no new insights are gained

Page 8: Usability evaluation methods (part 2) and performance metrics

Examples
• The next slides describe 2 experiments: the one behind the book Prioritizing Web Usability and a fictional one on OpenSMSDroid
• Both use Thinking Aloud and video/screen recording for data collection

Page 9: Usability evaluation methods (part 2) and performance metrics

Prioritizing Web Usability
• Prioritizing Web Usability (Nielsen and Loranger, 2006) used the Thinking Aloud method to collect insight on user behaviour:
  – 69 users, all with at least one year of experience in using the web
  – Broad range of job backgrounds and web experience – but no one working in IT or marketing
  – 25 web sites tested with specific tasks
  – Windows desktops with 1024x768 resolution running Internet Explorer
  – Recordings of monitor and upper body for each session
  – Broadband speed between 1 and 3 Mbps

Page 10: Usability evaluation methods (part 2) and performance metrics

Prioritizing Web Usability (2)
• The tasks that the users were asked to perform included:
  – Go to ups.com and find how much it costs to send a postcard to China
  – You want to visit the Getty Museum this weekend. Go to getty.edu and find opening times/prices
  – Go to nestle.com and find a snack to eat during workouts
  – Go to bankone.com and find the best savings account if you have a $1,000 balance

Page 11: Usability evaluation methods (part 2) and performance metrics

Prioritizing Web Usability (3)
• The results of the research are presented as a book:
  – Organising the findings into categories (including searching, navigation, typography and writing style)
  – Using plenty of examples and screenshots to demonstrate the usability issues that were identified

Page 12: Usability evaluation methods (part 2) and performance metrics

Prioritizing Web Usability: findings
• People succeed 66% of the time when working on “single site” activities and 60% of the time when having to browse through the internet for information

Page 13: Usability evaluation methods (part 2) and performance metrics

Prioritizing Web Usability: findings (2)
• Experienced users spend about 25 seconds on a homepage and 45 on an interior page (35 and 60 for inexperienced users)
• Only 23% of users scroll on their first visit to a homepage
  – The number decreases after the first visit
  – The average scroll for a first visit is 0.8 of a screen

Page 14: Usability evaluation methods (part 2) and performance metrics

Prioritizing Web Usability: findings (3)
• 88% of users go to search engines to find information
• Font face and size: different font faces for print and screen
  – Different font size depending on target audience
• More in the book…

Page 15: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation
• You have been tasked to evaluate the usability of a new (fictional) Android application to write short text messages, OpenSMSDroid
• You have decided to set up an experiment
  – The next experiment is (loosely) adapted from “Experimental Evaluation of Techniques for Usability Testing of Mobile Systems in a Laboratory Setting” (Beck, Christiansen, Kjeldskov, Kolbe and Stage, 2003)

Page 16: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation
• Your test users will perform a set of tasks in specific configurations using the thinking aloud method for data collection
  – A constraint of 5 minutes has been set for each of the tasks
  – The usability researcher will record the session and take notes

Page 17: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation: testing configurations
• Configurations for the test (tentative list):
  – Sitting on a chair at a table
  – Walking on a treadmill at constant speed
  – Walking on a treadmill at varying speed
  – Walking on an 8-shaped course that is changing as obstructions are being moved, within 2 meters of a person that walks at constant speed
  – Walking on an 8-shaped course that is changing as obstructions are being moved, within 2 meters of a person that walks at varying speed
  – Walking in Westfield Stratford at 16:00 on Saturday

Page 18: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation: testing configurations (2)
• For practical reasons and after reviewing the literature, these settings have been selected for this evaluation:
  – Sitting on a chair at a table
  – Walking on a treadmill at constant speed
  – Walking in Westfield Stratford at 16:00 on Saturday

Page 19: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation: tasks
• Writing a new SMS containing the phrase “The quick brown fox jumps over the lazy dog” repeated 2 times to an existing contact (without using predictive text features)
• Writing a new SMS containing the phrase “The quick brown fox jumps over the lazy dog” repeated 2 times to an existing contact (using predictive text features)
• Taking a picture and sending it to an existing contact
• Taking a short 1 minute video and sending it to an existing contact

Page 20: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation: tasks (2)
• In each test, you can collect:
  – Quantitative data: time needed to perform the task, and whether the task has been completed
  – Qualitative data: asking the user to think aloud while interacting with the device and recording the interaction
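
As a concrete illustration of collecting the quantitative part, here is a minimal sketch in Python; the participant IDs, configuration and task names, and numbers are all hypothetical, not data from the actual experiment:

```python
# A sketch, not the study's actual tooling: recording each trial of the
# (fictional) OpenSMSDroid experiment and summarising it per configuration.
from collections import defaultdict
from statistics import mean

# (participant, configuration, task, seconds taken, completed?)
trials = [
    ("P1", "sitting",   "sms-no-predictive", 182, True),
    ("P1", "treadmill", "sms-no-predictive", 241, True),
    ("P2", "sitting",   "sms-no-predictive", 205, False),  # hit the 5-min limit
]

by_config = defaultdict(list)
for _, config, _, seconds, completed in trials:
    by_config[config].append((seconds, completed))

for config, results in by_config.items():
    rate = sum(done for _, done in results) / len(results)
    times = [s for s, done in results if done]   # completed trials only
    avg = f"{mean(times):.0f}s" if times else "n/a"
    print(f"{config}: completion {rate:.0%}, mean time {avg}")
```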

Page 21: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid evaluation: data analysis
• The evaluation will analyse the data collected and report on any findings, informing on any difference in performance and suggesting possible changes to the interface
  – An experiment can also generate further hypotheses, which can be tested in further experiments

Page 22: Usability evaluation methods (part 2) and performance metrics

OpenSmsDroid experiment: what's missing?
• Something to compare to!
  – Otherwise you cannot know if the interface is better or not

Page 23: Usability evaluation methods (part 2) and performance metrics

Usability inspections

Page 24: Usability evaluation methods (part 2) and performance metrics

Usability inspection methods
• Heuristic evaluation and walkthroughs are the most common usability inspection methods
  – We'll also see several other methods

Page 25: Usability evaluation methods (part 2) and performance metrics

Usability inspections and heuristics (2)
• Usability inspection methods are based on having evaluators inspect a user interface
• Usability inspection methods aim to examine usability-related aspects of a user interface, even if the interface has not yet been developed
  – Can be used to perform usability evaluations in the initial stages of development

Page 26: Usability evaluation methods (part 2) and performance metrics

Heuristic evaluations
• Heuristic evaluation is a method that requires usability specialists to judge whether each element of a user interface follows established usability principles and guidelines
  – E.g. Jakob Nielsen’s heuristics
• Heuristics are being developed for mobile devices, wearables, virtual worlds, etc.

Page 27: Usability evaluation methods (part 2) and performance metrics

Nielsen’s heuristics: discount evaluations
• A heuristic evaluation is referred to as a discount evaluation when 5 evaluators are used
  – Empirical evidence suggests that on average 5 evaluators identify 75-80% of usability problems on generalist web sites
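
The figure of 75-80% for 5 evaluators is usually traced back to problem-discovery curves such as the Nielsen-Landauer model. A minimal sketch, assuming the often-quoted average per-evaluator detection rate of about 31% (real rates vary by product and study):

```python
# Nielsen-Landauer style problem-discovery curve: the proportion of
# problems found by n evaluators, each independently finding a
# fraction p of all problems. p = 0.31 is an assumed average.
def proportion_found(n_evaluators: int, p: float = 0.31) -> float:
    return 1 - (1 - p) ** n_evaluators

for n in range(1, 8):
    print(f"{n} evaluators: {proportion_found(n):.0%}")
# With p = 0.31, 5 evaluators find about 84% of problems, in the
# same ballpark as the 75-80% quoted above.
```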

Page 28: Usability evaluation methods (part 2) and performance metrics

Heuristic evaluations: stages
• Briefing session to tell experts what to do.
• Evaluation period of 1-2 hours in which:
  – Each expert works separately
  – Take one pass to get a feel for the product
  – Take a second pass to focus on specific features
• Debriefing session in which experts work together to prioritize and categorise the problems

Page 29: Usability evaluation methods (part 2) and performance metrics

Relevant standards and guidelines
• Relevant standards and guidelines include:
  – W3C HTML and CSS standards
  – W3C WAI guidelines
  – Mobile Web Best Practices guidelines
  – ISO 9241

Page 30: Usability evaluation methods (part 2) and performance metrics

Heuristic evaluations: advantages and problems
• Few ethical & practical issues to consider because users are not involved
  – Can be difficult & expensive to find experts
  – Experts should have knowledge of the application domain & of the evaluation method used
• Critical points:
  – Important problems may get missed
  – Focus can be lost on trivial problems
  – Experts have biases

Page 31: Usability evaluation methods (part 2) and performance metrics

Cognitive walkthroughs
• Focus on ease of learning
• Designer presents an aspect of the design & usage scenarios
• Expert is told the assumptions about user population, context of use, task details.
• One or more experts walk through the design prototype with the scenario.

Page 32: Usability evaluation methods (part 2) and performance metrics

Cognitive walkthroughs (2)
• Experts are guided by 3 questions:
  – Will the correct action be sufficiently evident to the user?
  – Will the user notice that the correct action is available?
  – Will the user associate and interpret the response from the action correctly?
• As the experts work through the scenario they note problems.

Page 33: Usability evaluation methods (part 2) and performance metrics

Pluralistic walkthrough
• Variation on the cognitive walkthrough theme
  – Performed by a team
• The panel of experts begins by working separately
• Then there is a managed discussion that leads to agreed decisions
• The approach lends itself well to participatory design

Page 34: Usability evaluation methods (part 2) and performance metrics

Feature inspection
• Feature inspection is a technique that focuses on the features of a product or of a web site
  – A group of inspectors is given some use cases and asked to analyse each feature of the web site with regard to availability, understandability, and other aspects of usability
  – This technique works best in the middle stages of development, when features are known but the artefact cannot yet be evaluated with methods such as lab experiments

Page 35: Usability evaluation methods (part 2) and performance metrics

Standards inspection
• Standards inspection is a technique used to check the compliance of a web site with some standard
• A usability professional with extensive knowledge of the relevant standards inspects a web site for compliance
• Different standards inspections can be run on the same artefact
  – Nielsen’s heuristics include standards inspection

Page 36: Usability evaluation methods (part 2) and performance metrics

Usability inquiry

Page 37: Usability evaluation methods (part 2) and performance metrics

Truckers and mobile devices

[...] a major mobile device company was trying to understand why there were so many data entry errors on a mobile device for long-haul truck drivers. Many people in the company tended to blame the truckers, whom they assumed were uneducated. None of them had ever actually met a trucker, but they figured it couldn’t be too hard to type in a word or two. One winter, a senior user interface (UI) designer decided to see for himself. The designer spent a week at a truck stop watching truck drivers use the device and talking to them about it. He quickly discovered that the truckers could spell perfectly. Instead, the problem was the device. The truckers tended to be big men, with big fingers. To make matters worse, they often wore bulky gloves in the winter. The device had tiny buttons, making typing with big fingers in warm gloves frustrating. (Observing the User Experience; Goodman, Kuniavsky and Moed, 2012)

Page 38: Usability evaluation methods (part 2) and performance metrics

Truckers and mobile devices (2)
• The team realized it had been basing important design decisions on faulty assumptions
• The team redesigned the UI so that it required less typing and increased the size of buttons.
  – The error rates dropped dramatically

Page 39: Usability evaluation methods (part 2) and performance metrics

Truckers and mobile devices (3)
• That's an example of how usability inquiry methods work

Page 40: Usability evaluation methods (part 2) and performance metrics

Usability inquiry
• Usability inquiry methods focus (to different degrees) on analysing an artefact either from “the native point of view” or looking for “the native point of view”
  – Used to obtain information about users' likes, dislikes, needs, and understanding of the system

Page 41: Usability evaluation methods (part 2) and performance metrics

Usability inquiry (2)
• They may use one or more of these techniques:
  – Talking to users
  – Observing users using a system in a real working situation
  – Letting the users answer questions (verbally or in written form)

Page 42: Usability evaluation methods (part 2) and performance metrics

Data collection & analysis
• Data collection:
  – Observation & interviews (e.g. contextual inquiry)
  – Notes, pictures, recordings, diaries
  – Video
  – Logging
• Analysis:
  – Categorizing the findings
  – Using categories provided by pre-existing research
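
For the logging item above, a minimal sketch of a timestamped interaction logger; the file name, participant ID, and event names are illustrative, not a specific tool's API:

```python
# Append timestamped interaction events to a JSON Lines file so a
# session can be reconstructed and categorized during analysis.
import json
import time

LOG_PATH = "session_p1.jsonl"  # hypothetical per-participant log file

def log_event(participant: str, event: str, detail: str = "") -> None:
    record = {"t": time.time(), "participant": participant,
              "event": event, "detail": detail}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

log_event("P1", "task_start", "find_price")
log_event("P1", "click", "search_button")
log_event("P1", "task_end", "completed")
```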

Page 43: Usability evaluation methods (part 2) and performance metrics

Usability inquiry methods
• The next slides will cover some popular usability inquiry methods:
  – Diary
  – Contextual inquiry
  – Interviews and focus groups
  – Surveys

Page 44: Usability evaluation methods (part 2) and performance metrics

Diary method
• The diary method requires users to keep a diary of their interactions
• Diaries can be free form or structured
  – The diary method is best used when the researcher does not have the time, the resources or the possibility to use user monitoring methods, or when the level of detail provided by user monitoring methods is not needed

Page 45: Usability evaluation methods (part 2) and performance metrics

Contextual inquiry
• Contextual inquiry is a structured field interviewing method which typically evaluates:
  – User opinions
  – User experience
  – Motivation
  – Context
• It is a study based on dialogue and interaction between interviewer and user, and it is one of the best methods to use when researchers need to understand the users' work context.

Page 46: Usability evaluation methods (part 2) and performance metrics

Interviews and focus groups
• Interviews and focus groups are research methods based on interaction between researchers and users
  – The researcher facilitates the discussion about the issues raised by the questions
  – In focus groups (multiple users present), the interaction among the users may raise additional issues, or identify common problems that many people experience

Page 47: Usability evaluation methods (part 2) and performance metrics

Surveys
• Surveys are a quantitative research method, where a set list of questions is asked and the users' responses are recorded
  – When the questions are administered by a researcher, the survey is called a structured interview
  – When the questions are completed by the respondent, the survey is referred to as a questionnaire

Page 48: Usability evaluation methods (part 2) and performance metrics

And now…
• You have had an overview of a wide selection of usability evaluation methods
  – And you are ready to use them in your assignment

Page 49: Usability evaluation methods (part 2) and performance metrics

Measuring the User Experience
• The next slides are based on the core textbook for this module, "Measuring the User Experience"

Page 50: Usability evaluation methods (part 2) and performance metrics

Performance metrics

Page 51: Usability evaluation methods (part 2) and performance metrics

Types of performance metrics
• Task success
• Level of success
• Time on task
• Errors
• Efficiency
• Learnability

Page 52: Usability evaluation methods (part 2) and performance metrics

Task success
• Binary success
• Level of success

Page 53: Usability evaluation methods (part 2) and performance metrics

Task success
• It measures how effectively users are able to complete a given set of tasks
• To measure task success, each task that users are asked to perform must have a clear end state or goal

Page 54: Usability evaluation methods (part 2) and performance metrics

Task success (2)
• You can:
  – Ask users to articulate the answer verbally
  – Ask users to provide an answer in a structured way (e.g. using an online tool or paper)
  – Use proxy measures (e.g. when the correct solution is not easily identifiable, as it may depend on individual circumstances)

Page 55: Usability evaluation methods (part 2) and performance metrics

Task failure
• There are many different ways in which a participant might fail:
  – Giving up: participants indicate that they would not continue with the task if they were doing this on their own
  – Moderator “calls” it because the participant is not making any progress
  – Too long: the participant completed the task, but not within a predefined time
  – Wrong: participants thought that they completed the task successfully, but they actually did not

Page 56: Usability evaluation methods (part 2) and performance metrics

Binary success
• Binary success is the simplest and most common way of measuring task success
  – Each time users perform a task, they should be given a “success” or “failure” score (1 or 0)
  – Users either completed a task successfully or they didn’t
  – That's the case, for example, for e-commerce
  – Sometimes, you should measure perceived success rather than factual success (why?)

Page 57: Usability evaluation methods (part 2) and performance metrics

Binary success: analysing and presenting data
• The most common way to analyse and present binary success rates is by task
• This involves simply presenting the percentage of participants who completed each task successfully
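
A minimal sketch of this analysis; the task names and binary scores are hypothetical:

```python
# Binary success per task: 1 = success, 0 = failure, one score per participant.
results = {
    "find_price":   [1, 1, 0, 1, 1, 1, 0, 1],
    "send_message": [1, 0, 0, 1, 0, 1, 1, 0],
}

for task, scores in results.items():
    rate = sum(scores) / len(scores)
    print(f"{task}: {rate:.0%} of {len(scores)} participants succeeded")
# -> find_price: 75% of 8 participants succeeded
# -> send_message: 50% of 8 participants succeeded
```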

Page 58: Usability evaluation methods (part 2) and performance metrics

Binary success: analysing and presenting data by task

Page 59: Usability evaluation methods (part 2) and performance metrics

Binary success: presenting data by user
• Frequency of use (infrequent users versus frequent users)
• Previous experience using the product
• Domain expertise
• Age group

Page 60: Usability evaluation methods (part 2) and performance metrics

Levels of success
• Identifying levels of success is useful when there are reasonable shades of grey associated with task success.

Page 61: Usability evaluation methods (part 2) and performance metrics

Levels of success: complete, partial and failure
• Complete success
  – With assistance
  – Without assistance
• Partial success
  – With assistance
  – Without assistance
• Failure
  – User thought it was complete, but it wasn’t
  – User gave up

Page 62: Usability evaluation methods (part 2) and performance metrics

Levels of success: analysing and presenting data
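
One way to analyse levels of success is to map each level to a numeric weight and average the weights per task. A minimal sketch; the weights are an illustrative assumption, not values prescribed by the book:

```python
# Illustrative level-of-success weights, chosen so better outcomes score higher.
LEVEL_SCORES = {
    "complete_unassisted": 1.00,
    "complete_assisted":   0.75,
    "partial_unassisted":  0.50,
    "partial_assisted":    0.25,
    "failure":             0.00,
}

# Hypothetical outcomes for one task, one entry per participant.
outcomes = ["complete_unassisted", "partial_assisted", "failure",
            "complete_assisted", "complete_unassisted"]

avg = sum(LEVEL_SCORES[o] for o in outcomes) / len(outcomes)
print(f"Mean level-of-success score: {avg:.2f}")  # -> 0.60
```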

Page 63: Usability evaluation methods (part 2) and performance metrics

Time on task
• In most situations, the faster a user can complete a task, the better the experience
  – Time on task is particularly important for products where tasks are performed repeatedly by the user.

Page 64: Usability evaluation methods (part 2) and performance metrics

Time on task: analysing and presenting data
• The most common way is to look at the average amount of time spent on any particular task
  – You should always report a confidence interval to show the variability in the time data
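
A minimal sketch of mean time on task with a 95% confidence interval, assuming scipy is available for the t-distribution; the times are hypothetical:

```python
from statistics import mean, stdev
from scipy import stats

def time_on_task_ci(times, confidence=0.95):
    """Return (mean, lower, upper) for a list of task times in seconds."""
    n = len(times)
    m = mean(times)
    sem = stdev(times) / n ** 0.5                  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)
    return m, m - t_crit * sem, m + t_crit * sem

# Hypothetical data: seconds taken by 8 participants on one task.
print(time_on_task_ci([34, 41, 28, 55, 39, 47, 31, 62]))
```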

Page 65: Usability evaluation methods (part 2) and performance metrics

Time on task: range and threshold
• A variation is to create ranges and report the frequency of users who fall into each interval
• Another useful way to analyse task time data is by using a threshold
  – In many situations, the only thing that matters is whether users can complete certain tasks within an acceptable amount of time.
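
A minimal sketch of both variations, with hypothetical times and an assumed 60-second acceptability threshold:

```python
# Hypothetical task times (seconds) for ten participants on one task.
times = [34, 41, 28, 55, 39, 47, 31, 62, 58, 73]

# Ranges: report how many participants fall into each interval.
for lo, hi in [(0, 30), (30, 60), (60, 90)]:
    n = sum(lo <= t < hi for t in times)
    print(f"{lo}-{hi}s: {n} participants")

# Threshold: report the share of participants within an acceptable time.
THRESHOLD = 60  # seconds; an assumed acceptability limit
within = sum(t <= THRESHOLD for t in times) / len(times)
print(f"Within {THRESHOLD}s: {within:.0%}")  # -> 80%
```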

Page 66: Usability evaluation methods (part 2) and performance metrics

Errors
• Errors reflect the mistakes made during a task. Errors can be useful in pointing out particularly confusing or misleading parts of an interface.

Page 67: Usability evaluation methods (part 2) and performance metrics

Errors (2)
• Examples of errors include:
  – Entering incorrect data into a form field
  – Making the wrong choice in a menu or drop-down list
  – Taking an incorrect sequence of actions
  – Failing to take a key action

Page 68: Usability evaluation methods (part 2) and performance metrics

Errors: analysing and presenting data
• The most common approach is to look at average error rates per task or per participant
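
A minimal sketch of both views of the data; the error counts are hypothetical:

```python
from statistics import mean

# errors[task][participant] = number of errors observed in that trial.
errors = {
    "find_price":   {"P1": 0, "P2": 2, "P3": 1},
    "send_message": {"P1": 3, "P2": 1, "P3": 2},
}

# Average error rate per task.
for task, per_participant in errors.items():
    print(f"{task}: {mean(per_participant.values()):.1f} errors on average")

# Average error rate per participant, across tasks.
for p in ("P1", "P2", "P3"):
    print(f"{p}: {mean(e[p] for e in errors.values()):.1f} errors per task")
```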

Page 69: Usability evaluation methods (part 2) and performance metrics

Efficiency
• Efficiency can be assessed by examining the amount of effort a user expends to complete a task, such as the number of clicks in a website or the number of button presses on a mobile phone.

Page 70: Usability evaluation methods (part 2) and performance metrics

Efficiency (2)
• Efficiency typically measures the number of actions or steps that users took in performing each task:
  – Identify the action(s) to be measured: for websites, it's typically mouse clicks or page views
  – Define the start and end of an action
  – Count the actions
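
A minimal sketch of those three steps applied to a recorded event stream; the event names are hypothetical:

```python
# Hypothetical event stream for two tasks by one participant.
events = ["task_start", "click", "page_view", "click", "click", "task_end",
          "task_start", "click", "task_end"]

MEASURED = {"click"}          # step 1: the action(s) to be measured
counts, in_task, n = [], False, 0
for e in events:
    if e == "task_start":     # step 2: start of the measured window
        in_task, n = True, 0
    elif e == "task_end":     # step 2: end of the measured window
        in_task = False
        counts.append(n)
    elif in_task and e in MEASURED:
        n += 1                # step 3: count the actions

print(counts)                 # -> [3, 1] clicks for the two tasks
```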

Page 71: Usability evaluation methods (part 2) and performance metrics

Learnability
• Learnability is a way to measure how performance improves or fails to improve over time.

Page 72: Usability evaluation methods (part 2) and performance metrics

Learnability (2)
• Learnability is normally measured using performance metrics: time on task, errors, number of steps, or task success per minute.

Page 73: Usability evaluation methods (part 2) and performance metrics

Learnability: analysing and presenting data
• The most common way to analyse and present learnability data is by examining a specific performance metric (such as time on task, number of steps, or number of errors) by trial, for each task or aggregated across all tasks
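
A minimal sketch that aggregates one such metric (time on task) by trial across participants; the numbers are hypothetical:

```python
from statistics import mean

# times[participant] = time on task (seconds) for trials 1..4, in order.
times = {
    "P1": [95, 71, 60, 52],
    "P2": [120, 88, 74, 70],
    "P3": [80, 66, 61, 58],
}

# Mean across participants for each trial; a falling curve suggests
# that performance improves as the product is learned.
for trial, samples in enumerate(zip(*times.values()), start=1):
    print(f"Trial {trial}: mean {mean(samples):.0f}s")
```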

Page 74: Usability evaluation methods (part 2) and performance metrics

Learnability: analysing and presenting data (2)

Page 75: Usability evaluation methods (part 2) and performance metrics

References
• Beck, E., Christiansen, M., Kjeldskov, J., Kolbe, N. and Stage, J. (2003). ‘Experimental Evaluation of Techniques for Usability Testing of Mobile Systems in a Laboratory Setting’, OzCHI 2003.
• Goodman, E., Kuniavsky, M. and Moed, A. (2012). Observing the User Experience: A Practitioner’s Guide to User Research. 2nd ed.
• Nielsen, J. and Loranger, H. (2006). Prioritizing Web Usability.
• Tullis, T. and Albert, B. (2008). Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics.