IBM Almaden, Oct 2000

Automating Assessment of Web Site Usability

Marti Hearst, University of California, Berkeley


The Usability Gap

196M new Web sites in the next 5 years [Nielsen99]

~20,000 user interface professionals [Nielsen99]


The Usability Gap

Most sites have inadequate usability [Forrester, Spool, Hurst]

(users can’t find what they want 39-66% of the time)

196 M new Web sites in the next 5 years [Nielsen99]

A shortage of user interface professionals [Nielsen99]


Usability affects the bottom line

IBM case study [1999]
  Spent $millions to redesign site
  84% decrease in help usage
  400% increase in sales
  Attributed to improvements in information architecture

Creative Good Study [1999]
  Studied 10 e-commerce sites
  59% of attempts failed
  If 25% of these had succeeded -> estimated additional $3.9B in sales


Talk Outline

Web Site Design
Automated Usability Evaluation
Our approach
  WebTANGO
  Some Empirical Results
Wrap-up

Joint work with Melody Ivory & Rashmi Sinha


Web Site Design (Newman et al. 00)

Information design: structure, categories of information
Navigation design: interaction with information structure
Graphic design: visual presentation of information and navigation (color, typography, etc.)

Courtesy of Mark Newman


Web Site Design (Newman et al. 00)

Information Architecture includes management and more responsibility for content
User Interface Design includes testing and evaluation



Web Site Design Process

Discovery
  Assemble information relevant to project
Design Exploration
  Explore alternative design approaches (information, navigation, and graphic)
Design Refinement
  Select one approach and iteratively refine it
Production
  Create prototypes and specifications


[Figure: iterative design cycle. From Start, iterate: Design, Prototype, Evaluate]


Usability Evaluation: Standard Techniques

User studies
  Potential users use the interface to complete some tasks
  Requires an implemented interface
"Discount" Usability Evaluation
  Heuristic Evaluation
    Usability expert assesses guidelines


Automated UE

We looked at 124 methods
AUE is greatly under-explored
  Only 36% of all methods
  Fewer methods for the web (28%)
Most techniques require some testing
  Only 18% are free from user testing
  Only 6% for the web


Survey of Automated UE

Predominant methods (Web)
  Structural analysis (4)
    Bobby, Scholtz & Laskowski 98, Stein 97
  Guideline reviews (11)
  Log file analysis (9)
    Chi et al. 00, Drott 98, Fuller & de Graaff 96, Guzdial et al., Sullivan 97, Theng & Marsden 98
  Simulation (2)
    WebCriteria (Max), Chi et al. 00


Existing Metrics

Web metric analysis tools report on what is easy to measure
  Predicted download time
  Depth/breadth of site
We want to worry about
  Content
  User goals/tasks
We also want to compare alternative designs.


Web TANGO: Tool for Assessing NaviGation & Organization

Goal: automated support for comparing design alternatives
How:
  Assess usability of the information architecture
  Approximate information-seeking behavior
  Output quantitative usability metrics


Benefits/Tradeoffs

Benefits
  Less expensive than traditional methods
  Use early in design process
Tradeoffs
  Accuracy?
  Validate methodology with user studies
  Illustrate different problems than traditional methods
  For comparison purposes only
  Does not capture subjective measures


Information-Centric Sites

museum, history

news, magazines

government info


Guidelines

There are many usability guidelines
  A survey of 21 sets of web guidelines found little overlap (Ratner et al. 96)
  Why? Our hypothesis: not empirically validated
So … let’s figure out what works!


An Empirical Study:

Which features distinguish well-designed web pages?


Methodology

Collect quantitative measures from 2 groups
  Ranked: Sites rated favorably via expert review or user ratings
  Unranked: Sites that have not been rated favorably
Statistically compare the groups
Predict group membership


Quantitative Measures

Identified 42 aspects from the literature
  Page Composition (e.g., words, links, images)
  Page Formatting (e.g., fonts, lists, colors)
  Overall Page Characteristics (e.g., information & layout quality, download speed)


Metrics

Word Count
Body Text Percentage
Emphasized Body Text Percentage
Text Positioning Count
Text Cluster Count
Link Count
Page Size
Graphic Percentage
Graphics Count
Color Count
Font Count
Reading Complexity
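As an illustration only, a few of these metrics could be computed from raw HTML roughly as follows. This is a minimal sketch, not the tool used in the study: the class and function names are hypothetical, the syllable counter is a crude vowel-run heuristic, and reading complexity is taken to be the standard Gunning Fog Index (the "GFI" label used later in the deck).

```python
import re
from html.parser import HTMLParser


class PageMetrics(HTMLParser):
    """Collect a few page-composition counts from raw HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.graphics_count = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
        elif tag == "img":
            self.graphics_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Gunning Fog Index: 0.4 * (words/sentences + 100 * complex/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def measure(html):
    parser = PageMetrics()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "word_count": len(re.findall(r"[A-Za-z0-9']+", text)),
        "link_count": parser.link_count,
        "graphics_count": parser.graphics_count,
        "page_size_bytes": len(html.encode("utf-8")),
        "reading_complexity_gfi": gunning_fog(text),
    }


# Made-up page, for illustration only.
sample = (
    "<html><body><h1>Museum hours</h1>"
    "<p>The museum opens daily. Admission is free.</p>"
    '<a href="/visit">Visit</a> <a href="/map">Map</a>'
    '<img src="logo.gif"></body></html>'
)
metrics = measure(sample)
print(metrics)
```

Metrics such as text cluster count or text positioning count depend on layout analysis and are not attempted here.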


Data Collection

Collected data for 2,015 information-centric pages from 463 sites
  Education, government, newspaper, etc.
Data constraints
  At least 30 words
  No e-commerce pages
  Exhibit high self-containment (i.e., no style sheets, scripts, applets, etc.)
1,054 pages fit constraints (52%)


Data Collection

Ranked pages
  Favorably assessed by expert review (ER) or user rating (UR) on expert-chosen sites
  Sources:
    Yahoo! 101 (ER)
    Web 100 (UR)
    PC Mag Top 100 (ER)
    WiseCat’s Top 100 (ER)
    Webby Awards (ER) & People’s Voice (UR)


Data Collection

Unranked
  Not favorably assessed by expert review or user rating on expert-chosen sites
  Do not assume unranked = unfavorable
  Sources:
    WebCriteria’s Industry Benchmark
    Yahoo Business & Economy Category
    Others


Data Analysis

428 pages
  214 ranked pages
  840 unranked pages
    214 chosen randomly


Findings

Several features are significantly associated with ranked sites
Several pairs of features correlate strongly
  Correlations mean different things in ranked vs. unranked pages
Significant features are partially successful at predicting if site is ranked


Significant Differences

                           Mean                  Standard Deviation
Metric                     Ranked     Unranked   Ranked     Unranked   Sig.
Word Count                 790.5      585.8      1604.5     1315.7     0.150
Body Text %                73.7       73.2       22.4       24.5       0.824
Emphasized Body Text %     26.1       25         27.2       25.7       0.672
Text Positioning Count     4.4        5.4        4.8        11.2       0.244
Text Cluster Count         17.9       10.8       22.1       17.4       0.000
Link Count                 58.8       39.2       56.6       44.2       0.000
Page Size (Bytes)          57341.2    39614.9    72024.3    34312      0.001
Graphic %                  53.6       52.8       27.9       29.3       0.756
Graphics Count             25.1       17.5       28.1       22.5       0.002
Color Count                8.6        7.4        3.8        3.1        0.001
Font Count                 4.6        4.6        2.7        2.9        0.836
Reading Complexity (GFI)   15.8       19.6       7.8        21.1       0.014
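The slides do not say which test produced the Sig. column. As a rough sanity check, one could run a two-sample comparison of means directly from the summary statistics in the table. The sketch below uses an unequal-variance (Welch-style) statistic with a normal approximation for the p-value, and assumes 214 pages per group as stated on the Data Analysis slide. The function name and the choice of test are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def welch_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, two-sided p) for a difference in means from summary stats.

    The p-value uses a normal approximation, which is reasonable only for
    large samples such as the group sizes assumed below.
    """
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t = (mean_a - mean_b) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


N = 214  # assumed per-group sample size (Data Analysis slide)

# Means and standard deviations copied from the table: ranked, then unranked.
t_links, p_links = welch_from_summary(58.8, 56.6, N, 39.2, 44.2, N)
t_words, p_words = welch_from_summary(790.5, 1604.5, N, 585.8, 1315.7, N)

print(f"Link Count: t={t_links:.2f}, p={p_links:.4f}")
print(f"Word Count: t={t_words:.2f}, p={p_words:.4f}")
```

Under these assumptions, Link Count comes out clearly significant and Word Count does not, which agrees with the direction of the reported values.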


Significant Differences

Ranked pages
  More text clustering (facilitates scanning)
  More links (facilitates info-seeking)
  More bytes (more content facilitates info-seeking)
  More images (clustering graphics facilitates scanning)
  More colors (facilitates scanning)
  Lower reading complexity (close to best numbers in Spool study; facilitates scanning)


Metric Correlations

                  Ranked                                      Unranked
                  Emp. Body  T. Cluster  Link     Color       Emp. Body  T. Cluster  Link     Color
Metric            Text %     Count       Count    Count       Text %     Count       Count    Count
Link Count        -0.008     0.516       -        0.201       -0.077     0.548       -        0.540
Graphics Count    -0.040     0.370       0.305    0.331       -0.102     0.445       0.525    0.344
Color Count       -0.200     0.447       0.201    -           0.013      0.610       0.540    -
Font Count        -0.083     0.315       0.091    0.642       0.043      0.321       0.366    0.551
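Pairwise coefficients like those in the table can be computed per group with a plain Pearson correlation. The sketch below assumes Pearson's r (the slides do not name the coefficient) and uses made-up page data. The function names and metric keys are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(pages, metric_names):
    """Map each unordered pair of metrics to its correlation across pages."""
    return {
        (a, b): pearson([p[a] for p in pages], [p[b] for p in pages])
        for i, a in enumerate(metric_names)
        for b in metric_names[i + 1:]
    }


# Made-up measurements for four pages in one group, for illustration only.
pages = [
    {"link_count": 10, "color_count": 3, "font_count": 2},
    {"link_count": 40, "color_count": 6, "font_count": 4},
    {"link_count": 25, "color_count": 5, "font_count": 3},
    {"link_count": 60, "color_count": 9, "font_count": 4},
]
matrix = correlation_matrix(pages, ["link_count", "color_count", "font_count"])
for pair, r in matrix.items():
    print(pair, round(r, 3))
```

Running this once on the ranked pages and once on the unranked pages would give the two halves of a table like the one above.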


Metric Correlations

Created hypotheses based on correlations:
Ranked Pages
  Colored display text
  Link clustering
  Both patterns on all pages in random sample
Unranked Pages
  Display text coloring plus body text emphasis or clustering
  Link coloring or clustering
  Image links, simulated image maps, bulleted links
  At least 2 patterns in 70% of random sample
Confirmed by sampling


Two Examples

                          Ranked                         Unranked
Metric                    Example  Mean   Std. Dev.      Example  Mean   Std. Dev.
Emphasized Body Text %    7.2      26.1   27.2           46.7     25     25.7
Text Cluster Count        17       17.9   22.1           11       10.8   17.4
Link Count                59       58.8   56.6           24       39.2   44.2
Graphics Count            4        25.1   28.1           15       17.5   22.5
Color Count               10       8.6    3.8            6        7.4    3.1
Font Count                7        4.6    2.7            12       4.6    2.9


Ranked Page: colored display text, link clustering


Unranked Page: body text emphasis, image links


Predicting Web Page Rating

Linear Regression
  Explains 10% of difference between groups
  63% accuracy (better at unranked prediction)
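To make the idea concrete, here is a minimal sketch of using linear regression to predict group membership. It is reduced to a single predictor for brevity (the study used several metrics), the data are made up, and the 0.5 decision threshold and all function names are assumptions rather than details from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_ranked(x, intercept, slope, threshold=0.5):
    """Classify as ranked when the fitted value reaches the threshold."""
    return intercept + slope * x >= threshold


def accuracy(xs, labels, intercept, slope):
    hits = sum(
        predict_ranked(x, intercept, slope) == bool(label)
        for x, label in zip(xs, labels)
    )
    return hits / len(xs)


# Made-up link counts; label 1 = ranked, 0 = unranked.
link_counts = [80, 65, 20, 55, 30, 15, 45, 25]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

intercept, slope = fit_line(link_counts, labels)
acc = accuracy(link_counts, labels, intercept, slope)
print(f"slope={slope:.4f}, intercept={intercept:.4f}, accuracy={acc:.2f}")
```

Accuracy measured on the same pages used for fitting is optimistic; a held-out sample would give a fairer estimate.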


Predicting Web Page Rating

Home vs. non-home pages
  Text cluster count predicts home page ranking
    66% accuracy
    Consistent with primary goal of home pages
  Non-home page prediction
    Consistent with full sample results
    4 of 6 metrics (link count, text positioning count, color count, reading complexity)


Second study (new results)

Better rating data
  Webby Awards
  Sites organized into categories
New metrics computation tool
  More quantitative measures
  Process style sheets, inline frames
Larger sample of pages


Webby Awards 2000

27 categories
  We used finance, education, community, living, health, services
100 judges
6 criteria
3 rounds of judging
  We used first round only
2000 sites initially


Webby Awards 2000: 6 criteria

Content
Structure & navigation
Visual design
Functionality
Interactivity
Overall experience

Factor analysis: first factor accounted for 91% of the variance
Judgements somewhat normally distributed, with skew


New Metrics


Methodology

Data collection
  1108 pages
  163 sites
  3 levels per site
14 metrics
  About 85% accurate
  Text cluster and text positioning counts less accurate


Preliminary Results

Linear regression to predict Webby judges’ ratings
  Top 30% vs bottom 30%
Prediction accuracy:
  72% if categories not taken into account
  83% if categories assessed separately


Significant Metrics by Category


Category-based Profiles

K-means clustering of good sites, according to the metrics

Preliminary results suggest the sites do cluster

Can use clusters to create profiles of good and poor sites for each category

These can be used as empirically verified guidelines
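The clustering step can be sketched with a plain k-means loop. This is an illustrative version only: the two-metric points are made up, the starting centroids are picked by hand so the run is deterministic, and in practice the metrics would probably need to be scaled to comparable ranges first.

```python
from math import dist  # Python 3.8+


def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means; returns (centroids, cluster index for each point)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments


# Made-up (link count, color count) pairs for six highly rated sites.
sites = [(20, 4), (22, 5), (18, 4), (60, 9), (65, 10), (58, 8)]
centroids, assignments = kmeans(sites, initial_centroids=[sites[0], sites[3]])
print(centroids)
print(assignments)
```

Each resulting centroid is a candidate profile: a typical combination of metric values for one group of well-rated sites in a category.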


Ramifications

It is remarkable that such simple metrics predict so well
  Perhaps good design is good overall
  There may be other factors
A foundation for a new methodology
  Empirical, bottom up
  Does this reflect cognitive principles?
But, no one path to good design


Longer Term Goal: A Simulator for Comparing Site Design


Monte Carlo Simulation

Have a model of information structure
Have a set of user goals
Want to assess navigation structure
  Compare alternatives/tradeoffs
  Identify bottlenecks
  Identify critically important pages/links
  Check all pairs of start/end points
  Check overall reachability before and after a change
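The reachability checks listed above do not need simulation; a breadth-first search over the link graph is enough. The sketch below is an assumption about how such a check might look: the site is modelled as a dictionary from page to outgoing links, and the page names are made up (apart from borrowing "Renter Support" from the example figure).

```python
from collections import deque


def reachable_from(site, start):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in site.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_pairs(site):
    """List every (start, end) pair where end cannot be reached from start."""
    pages = set(site) | {t for targets in site.values() for t in targets}
    return sorted(
        (start, end)
        for start in pages
        for end in pages - reachable_from(site, start)
    )


# Hypothetical site structure: page -> pages it links to.
site_before = {
    "home": ["products", "support"],
    "products": ["home"],
    "support": ["home", "renter_support"],
    "renter_support": ["support"],
}
# Proposed change: remove the link from support to renter_support.
site_after = {**site_before, "support": ["home"]}

print(unreachable_pairs(site_before))
print(unreachable_pairs(site_after))
```

Comparing the two lists shows which start/end pairs a design change would break.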


One Monte Carlo simulation step for Design 1, Task 1. Simulation starts from the home page and the target information is at Renter Support.



Monte Carlo simulation results for Design 1, Task 1. Simulation runs start from all pages in the site. Average Navigation times are shown for Tasks 2 & 3.



Monte Carlo Simulation

At each step in the simulation
  Assume a probability distribution over a set of next choices
  The next choice is a function of:
    The current goal
    The understandability of the choice
    Prior interaction history
    The overall complexity of the page
Varying the distribution corresponds to varying properties of the links
Spot-check important choices
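A single simulation run, and an estimate over many runs, might be sketched as follows. This is a deliberately simplified stand-in, not the WebTANGO simulator: the factors listed above are collapsed into one hypothetical weight per link, interaction history is ignored, and the site graph and weights are made up.

```python
import random


def simulate_once(site, weights, start, target, rng, max_steps=50):
    """Random walk from start; return clicks needed to reach target, or None."""
    page = start
    steps = 0
    while page != target:
        links = site.get(page, [])
        if not links or steps == max_steps:
            return None  # dead end, or the simulated user gives up
        # Probability of each next choice is proportional to its weight.
        link_weights = [weights.get((page, link), 1.0) for link in links]
        page = rng.choices(links, weights=link_weights, k=1)[0]
        steps += 1
    return steps


def estimate(site, weights, start, target, runs=1000, seed=0):
    """Return (success rate, mean clicks over successful runs)."""
    rng = random.Random(seed)
    results = [simulate_once(site, weights, start, target, rng)
               for _ in range(runs)]
    successes = [r for r in results if r is not None]
    mean_steps = sum(successes) / len(successes) if successes else None
    return len(successes) / runs, mean_steps


# Hypothetical site: page -> pages it links to.
site = {
    "home": ["products", "support"],
    "products": ["home"],
    "support": ["home", "renter_support"],
    "renter_support": ["support"],
}
# Hypothetical weights: clearer link labels make the helpful links more likely.
clear_labels = {("home", "support"): 3.0, ("support", "renter_support"): 3.0}

rate_uniform, steps_uniform = estimate(site, {}, "home", "renter_support")
rate_clear, steps_clear = estimate(site, clear_labels, "home", "renter_support")
print(rate_uniform, steps_uniform)
print(rate_clear, steps_clear)
```

Changing the weights stands in for varying properties of the links, so comparing the two estimates gives a rough picture of how much a clearer label might shorten navigation for this task.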


In Summary

Automated Usability Assessment should help close the Web Usability Gap
We can empirically distinguish between highly rated web pages and other pages
  Empirical validation of design guidelines
  Can build profiles of good vs. poor sites
  Are validating expert judgements with usability assessments via a user study
Web use simulation is an under-explored and promising new approach


Current Projects

Automating Web Usability (Tango): Melody Ivory, Rashmi Sinha
Text Data Mining (Lindi): Barbara Rosario, Steve Tu
Metadata in Search Interfaces (Flamenco): Ame Elliott, Andy Chou
Web Intranet Search (Cha-Cha): Mike Chen, Jamie Laflen


More information:
  http://www.cs.berkeley.edu/~ivory/web
  http://www.sims.berkeley.edu/~hearst
