
Extracting Usability Information from User Interface Events

DAVID M. HILBERT AND DAVID F. REDMILES

University of California at Irvine

Modern window-based user interface systems generate user interface events as natural products of their normal operation. Because such events can be automatically captured and because they indicate user behavior with respect to an application’s user interface, they have long been regarded as a potentially fruitful source of information regarding application usage and usability. However, because user interface events are typically voluminous and rich in detail, automated support is generally required to extract information at a level of abstraction that is useful to investigators interested in analyzing application usage or evaluating usability.

This survey examines computer-aided techniques used by HCI practitioners and researchers to extract usability-related information from user interface events. A framework is presented to help HCI practitioners and researchers categorize and compare the approaches that have been, or might fruitfully be, applied to this problem. Because many of the techniques in the research literature have not been evaluated in practice, this survey provides a conceptual evaluation to help identify some of the relative merits and drawbacks of the various classes of approaches. Ideas for future research in this area are also presented.

This survey addresses the following questions: How might user interface events be used in evaluating usability? How are user interface events related to other forms of usability data? What are the key challenges faced by investigators wishing to exploit this data? What approaches have been brought to bear on this problem and how do they compare to one another? What are some of the important open research questions in this area?

Categories and Subject Descriptors: H.5.2 [Information Interfaces and Presentation]: User Interfaces—Evaluation/methodology

General Terms: Human factors, Measurement, Experimentation

Additional Key Words and Phrases: usability testing, user interface event monitoring, sequential data analysis, human-computer interaction

Authors’ address: Department of Information and Computer Science, University of California, Irvine, CA; e-mail: {dhilbert, redmiles}@ics.uci.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM Inc., 1515 Broadway, New York, NY 10036, USA, fax +1 (212) 869-0481, or permissions@acm.org.
©2001 ACM 0360-0300/01/1200-0384 $5.00
ACM Computing Surveys, Vol. 32, No. 4, December 2000, pp. 384–421.

1. INTRODUCTION

User interface events (UI events) are generated as natural products of the normal operation of window-based user interface systems such as those provided by the Macintosh Operating System [Lewis and Stone 1999], Microsoft Windows [Petzold 1998], the X Window System [Nye and O’Reilly 1992], and the Java Abstract Window Toolkit [Zukowski and Loukides 1997]. Such events indicate user behavior with respect to the components that make up an application’s user interface (e.g., mouse movements with respect to application windows, keyboard presses with respect to application input fields, mouse clicks with respect to application buttons, menus, and lists). Because such events can be automatically captured and because they indicate user behavior with respect to an application’s user interface, they have long been regarded as a potentially fruitful source of information regarding application usage and usability. However, because user interface events are typically extremely voluminous and rich in detail, automated support is generally required to extract information at a level of abstraction that is useful to investigators interested in analyzing application usage or evaluating usability.

While a number of potentially related techniques have been applied to the problem of analyzing sequential data in other domains, this paper primarily focuses on techniques that have been applied within the domain of HCI. Providing an in-depth treatment of all potentially related techniques would necessarily limit the amount of attention paid to characterizing the approaches that have in fact been brought to bear on the specific problems associated with analyzing HCI events. However, this survey attempts to characterize UI events and analysis techniques in such a way as to make comparison between techniques used in HCI and those used in other domains straightforward.

1.1 Goals and Method

The fundamental goal of this survey is to construct a framework to help HCI practitioners and researchers categorize, compare, and evaluate the relative strengths and limitations of approaches that have been, or might fruitfully be, applied to this problem. Because exhaustive coverage of all existing and potential approaches is impossible, we attempt to identify key characteristics of existing approaches that divide them into more or less natural categories. This allows classes of systems, not just instances, to be compared. The hope is that an illuminating comparison can be conducted at the class level and that classification of new instances into existing classes will prove to be unproblematic.

In preparing this survey, we searched the literature in both academic and professional computing forums for papers describing computer-aided techniques for extracting usability-related information from user interface events. We selected and analyzed an initial set of papers to identify key characteristics that distinguish the approaches applied by various investigators.

We then constructed a two-dimensional matrix with instances of existing approaches listed along one axis and characteristics listed along the other. This led to an initial classification of approaches based on clusters of related attributes. We then iteratively refined the comparison attributes and classification scheme based on further exploration of the literature. The resulting matrix indicates areas in which further research is needed and suggests synergistic combinations of currently isolated capabilities.

Ideally, an empirical evaluation of these approaches in practice would help elucidate more precisely the specific types of usability questions for which each approach is best suited. However, because many of the approaches have never been realized beyond the research prototype stage, little empirical work has been performed to evaluate their relative strengths and limitations. This survey attempts to provide a conceptual evaluation by distinguishing classes of approaches and illuminating their underlying nature. As a result, this survey should be regarded as a guide to understanding the research literature and not as a guide to selecting an already implemented approach for use in practice.

1.2 Comparison Framework

This subsection introduces the high level categories that have emerged as a result of the survey. We present the framework in more detail in Section 4.

- Techniques for synchronization and searching. These techniques allow user interface events to be synchronized and cross-indexed with other sources of data such as video recordings and coded observation logs. This allows searches in one medium to locate supplementary information in others. In some ways, this is the simplest (i.e., most mechanical) technique for exploiting user interface events in usability evaluation. However, it is quite powerful.

- Techniques for transforming event streams. Transformation involves selecting, abstracting, and recoding event streams to facilitate human and automated analysis (including counts, summary statistics, pattern detection, comparison, and characterization). Selection involves separating events and sequences of interest from the “noise.” Abstraction involves relating events to higher-level concepts of interest in analysis. Recoding involves generating new event streams based on the results of selection and abstraction so that selected and abstracted events can be subjected to the same types of manual and automated analysis techniques normally performed on raw event streams.

- Techniques for performing counts and summary statistics. Once user interface events have been captured, there are a number of counts and summary statistics that might be computed to summarize user behavior, for example, feature use counts, error frequencies, use of the help system, and so forth. Although most investigators rely on general-purpose spreadsheets and statistical packages to provide such functionality, some investigators have proposed specific “built-in” functions for calculating and reporting this sort of summary information.

- Techniques for detecting sequences. These techniques allow investigators to identify occurrences of concrete or abstractly defined target sequences within source sequences of events that may indicate potential usability issues. In some cases, target sequences are abstractly defined and are supplied by the developers of the technique. In other cases, target sequences are more specific to particular applications and are supplied by the users of the technique. Sometimes the purpose is to generate a list of matched source subsequences for further perusal by the investigator. Other times the purpose is to automatically recognize particular sequences that violate expectations about proper user interface usage. Finally, in some cases, the purpose is to perform transformation of the source sequence by abstracting and recoding instances of the target sequence into “abstract” events.

- Techniques for comparing sequences. These techniques help investigators compare source sequences against concrete or abstractly defined target sequences indicating the extent to which the sequences match one another. Some techniques attempt to detect divergence between an abstract model representing the target sequence and a source sequence. Others attempt to detect divergence between a concrete target sequence produced, for example, by an expert user, and a source sequence produced by some other user. Some produce diagnostic measures of distance to characterize the correspondence between target and source sequences. Others attempt to perform the best possible alignment of events in the target and source sequences and present the results to investigators in visual form. Still others use points of deviation between the target and input sequences to automatically indicate potential usability issues. In all cases, the purpose is to compare actual sequences of events against some model or trace of “ideal” or expected sequences to identify potential usability issues.

- Techniques for characterizing sequences. These techniques take source sequences as input and attempt to construct an abstract model to summarize, or characterize, interesting sequential features of those sequences. Some techniques compute probability matrices in order to produce process models with probabilities associated with transitions. Others construct grammatical models or finite state machines to characterize the grammatical structure of events in the source sequences.

- Visualization techniques. These techniques present the results of transformations and analyses in forms allowing humans to exploit their innate visual analysis capabilities to interpret results. These techniques can be particularly useful in linking results of analysis back to features of the interface.

- Integrated evaluation support. Evaluation environments that facilitate flexible composition of various transformation, analysis, and visualization capabilities provide integrated support. Some environments also provide built-in support for managing domain-specific artifacts such as evaluations, subjects, tasks, data, and results of analysis.

Fig. 1. Comparison framework.

Figure 1 illustrates how the framework might be arranged as a hierarchy. At the highest level, the surveyed techniques are concerned with: synchronization and searching, transformation, analysis, visualization, or integrated support. Transformation can be achieved through selection, abstraction, and recoding. Analysis can be performed using counts and summary statistics, sequence detection, sequence comparison, and sequence characterization.

1.3 Organization of the Survey

Section 2 establishes background and presents definitions of important terms. Section 3 discusses the nature and characteristics of UI events and provides examples to illustrate some of the difficulties involved in extracting usability-related information from such events. Section 4 presents a comparison of the approaches based on the framework outlined above. For each class of techniques, the following is provided: a brief description, examples, related work where appropriate, and strengths and limitations. Section 5 summarizes the most important points of the survey and outlines some directions for future research. Section 6 presents conclusions.

Many of the authors are not explicit regarding the event representations assumed by their approaches. We assume that UI events are data structures that include attributes indicating an event type (e.g., MOUSE PRESSED or KEY PRESSED), an event target (e.g., an ID indicating a particular button or text field in the user interface), a timestamp, and other attributes including various aspects of the state of input devices when the event was generated. However, to raise the level of abstraction in our discussion, we typically represent event streams as sequences of letters where each letter corresponds to a more detailed event structure as described above. When beneficial, and possible, we provide examples using this notation to illustrate how particular approaches operate.
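To make the event representation assumed above concrete, the following sketch shows one way such an event record and the letter-based shorthand might be realized. It is written in Python; the class and field names are illustrative choices of ours, not those of any particular window system or toolkit.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UIEvent:
    event_type: str                     # e.g., "MOUSE_PRESSED" or "KEY_PRESSED"
    target: str                         # e.g., an ID such as "PrintToolbarButton"
    timestamp_ms: int                   # when the event was generated
    attributes: Dict[str, str] = field(default_factory=dict)  # input device state, etc.

def to_letters(events: List[UIEvent]) -> str:
    """Recode a detailed event stream as a string of letters, assigning one
    letter per distinct (event_type, target) pair, in order of first appearance."""
    codes: Dict[tuple, str] = {}
    letters = []
    for e in events:
        key = (e.event_type, e.target)
        if key not in codes:
            codes[key] = chr(ord("A") + len(codes))
        letters.append(codes[key])
    return "".join(letters)

if __name__ == "__main__":
    stream = [
        UIEvent("MOUSE_PRESSED", "PrintMenuItem", 1000),
        UIEvent("MOUSE_PRESSED", "OkButton", 1450),
        UIEvent("MOUSE_PRESSED", "PrintToolbarButton", 2000),
    ]
    print(to_letters(stream))  # -> "ABC"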

2. BACKGROUND

This section serves three purposes. First, it establishes working definitions of key terms such as “usability,” “usability evaluation,” and “usability data.” Second, it situates observational usability evaluation within the broader context of HCI evaluation approaches, indicating some of the relative strengths and limitations of each. Finally, it isolates user interface events as one of the many types of data commonly collected in observational usability evaluation, indicating some of its strengths and limitations relative to other types. The definitions and frameworks presented here are not new and can be found in standard HCI texts [Preece et al. 1994; Nielsen 1993]. Those well acquainted with usability, usability evaluation, and user interface event data may wish to skip directly to Section 3 where the specific nature of user interface events and the reasons why analysis is complicated are presented.

2.1 Definitions

“Usability” is often thought of as referring to a single attribute of a system or device. However, it is more accurately characterized as referring to a large number of related attributes. Nielsen provides the following definition [Nielsen 1993]:

Usability has multiple components and is traditionally associated with these five usability attributes:

Learnability: The system should be easy to learn so that the user can rapidly start getting some work done with the system.

Efficiency: The system should be efficient to use, so that once the user has learned the system, a high level of productivity is possible.

Memorability: The system should be easy to remember, so that the casual user is able to return to the system after some period of not having used it, without having to learn everything all over again.

Errors: The system should have a low error rate, so that users make few errors during the use of the system, and so that if they do make errors they can easily recover from them. Further, catastrophic errors must not occur.

Satisfaction: The system should be pleasant to use, so that users are subjectively satisfied when using it; they like it.

“Usability evaluation” can be defined as the act of measuring (or identifying potential issues affecting) usability attributes of a system or device with respect to particular users, performing particular tasks, in particular contexts. The reason that users, tasks, and contexts are part of the definition is that the values of usability attributes can vary depending on the background knowledge and experience of users, the tasks for which the system is used, and the context in which it is used.

“Usability data” is any information that is useful in measuring (or identifying potential issues affecting) the usability attributes of a system under evaluation.

Usability and utility are regarded as subcategories of the more general term “usefulness” [Grudin 1992]. Utility is the question of whether the functionality of a system can, in principle, support the needs of users, while usability is the question of how satisfactorily users can make use of that functionality. Thus, system usefulness depends on both usability and utility.

While this distinction is theoretically clear, usability evaluations often identify both usability and utility issues, thus more properly addressing usefulness. However, to avoid introducing new terminology, this survey simply assumes that usability evaluations and usability data can address questions of utility as well as questions of usability.

2.2 Types of Usability Evaluation

This section contrasts the different types of approaches that have been brought to bear in evaluating usability in HCI.

First, a distinction is commonly drawn between formative and summative evaluation. Formative evaluation primarily seeks to provide feedback to designers to inform and evaluate design decisions. Summative evaluation primarily involves making judgements about “completed” products, to measure improvement over previous releases or to compare competing products. The techniques discussed in this survey can be applied in both sorts of cases.

Another important issue is the more specific motivation for evaluating. There are a number of practical motivations for evaluating. For instance, one may wish to gain insight into the behavior of a system and its users in actual usage situations in order to improve usability (formative) and to validate that usability has been improved (summative). One may also wish to gain further insight into users’ needs, desires, thought processes, and experiences (also formative and summative). One may wish to compare design alternatives, for example, to determine the most efficient interface layout or the best design representation for some set of domain concepts (formative). One may wish to compute usability metrics so that usability goals can be specified quantitatively and progress measured, or so that competing products can be compared (summative). Finally, one may wish to check for conformance to interface style guidelines and/or standards (summative). There are also academic motivations, such as the desire to discover features of human cognition that affect user performance and comprehension with regard to human-computer interfaces (potentially resulting in formative implications).

Table I. Types of Evaluation and Reasons for Evaluating

Reasons for Evaluating                        Predictive    Observational   Participative
                                              Evaluation    Evaluation      Evaluation
Understanding user behavior & performance                   X               x
Understanding user thoughts & experience                    x               X
Comparing design alternatives                 x             X               X
Computing usability metrics                   x             X               X
Certifying conformance w/ standards           X

There are a number of HCI evaluation approaches for achieving these goals that fall into three basic categories: predictive, observational, and participative.

Predictive evaluation usually involves making predictions about usability attributes based on psychological modeling techniques (e.g., the GOMS model [John and Kieras 1996a; 1996b] or the Cognitive Walkthrough [Lewis et al. 1992]), or based on design reviews performed by experts equipped with a knowledge of HCI principles and guidelines and past experience in design and evaluation (e.g., Heuristic Evaluation [Nielsen and Mack 1994]). A key strength of predictive approaches is their ability to produce results based on nonfunctioning design artifacts without requiring the involvement of actual users.

Observational evaluation involves measuring usability attributes based on observations of users actually interacting with prototypes or fully functioning systems. Observational approaches can range from formal laboratory experiments to less formal field studies. A key strength of observational techniques is that they tend to uncover aspects of actual user behavior and performance that are difficult to capture using other techniques.

Finally, participative evaluation involves collecting information regarding usability attributes directly from users based on their subjective reports. Methods for collecting such data range from questionnaires and interviews to more ethnographically inspired approaches involving joint observer/participant interpretation of behavior in context. A key benefit of participative techniques is their ability to capture aspects of users’ needs, desires, thought processes, and experiences that are difficult to obtain otherwise.

In practice, actual evaluations often combine techniques from multiple approaches. However, the methods for posing questions and for collecting, analyzing, and interpreting data vary from one category to the next. Table I provides a high level summary of the relationship between types of evaluation and typical reasons for evaluating. An upper-case ‘X’ indicates a strong relationship. A lower-case ‘x’ indicates a weaker relationship. An empty cell indicates little or no relationship.

The approaches surveyed here are primarily geared toward supporting observational evaluation, although some provide limited support for capturing participative data as well.


Table II. Data Collection Techniques and Usability Indicators

Usability Indicators           UI Event    Audio/Video   Post-hoc User   Questionnaire/   Survey/       Psychophysical
                               Recording   Recording     Comments        Interview        Test Scores   Recording
On-line behavior/performance   X           X
Off-line behavior (nonverbal)              X
Cognition/understanding                    X             X               X                X             X
Attitude/opinion                                          X               X                X
Stress/anxiety                                                                                           X

2.3 Types of Usability Data

Having situated observational usability evaluation within the broader context of predictive, observational, and participative approaches, user interface events can be isolated as just one of many possible sources of observational data.

Sweeny and colleagues [Sweeny et al. 1993] identify a number of indicators that might be used to measure (or indicate potential issues affecting) usability attributes:

- On-line behavior/performance: e.g., task times, percentage of tasks completed, error rates, duration and frequency of on-line help usage, range of functions used.

- Off-line behavior (nonverbal): e.g., eye movements, facial gestures, duration and frequency of off-line documentation usage, off-line problem solving activity.

- Cognition/understanding: e.g., verbal protocols, answers to comprehension questions, sorting task scores.

- Attitude/opinion: e.g., posthoc comments, questionnaire and interview comments and ratings.

- Stress/anxiety: e.g., galvanic skin response (GSR), heart rate (ECG), event-related brain potentials (ERPs), electroencephalograms (EEG), ratings of anxiety.

Table II summarizes the relationship between these indicators and various techniques for collecting observational data. This table is by no means comprehensive and is used only to indicate the rather specialized yet complementary nature of user interface event data in observational evaluation. UI events provide excellent data for quantitatively characterizing on-line behavior; however, the usefulness of UI events in providing data regarding the remaining indicators has not been demonstrated, although some investigators have used UI events to infer features of user knowledge and understanding [Kay and Thomas 1995; Guzdial et al. 1993].

Many of the surveyed approaches focus on event data exclusively. However, some also combine other sources of data including video recordings, coded observations, and subjective user reports.

3. THE NATURE OF UI EVENTS

This section discusses the nature and characteristics of HCI events in general and UI events specifically. We discuss the grammatical nature of UI events including some implications on analysis. We also discuss the importance of contextual information in interpreting the significance of events. Finally, we present a compositional model of UI events to illustrate how these issues manifest themselves in UI event analysis. We use this discussion to ground later discussion and to highlight some of the strengths and limitations of the surveyed approaches.

3.1 Spectrum of HCI Events

Before discussing the specific nature of UI events, this section introduces the broader spectrum of events of interest to researchers and practitioners in HCI. Figure 2, adapted from Sanderson and Fisher [1994], indicates the durations of different types of HCI events.


Fig. 2. A spectrum of HCI events. Adapted from [Sanderson and Fisher 1994].

The horizontal axis is a log scale indicating event durations in seconds. It ranges from durations of less than one second to durations of years. The durations of UI events fall in the range of 10 milliseconds to approximately one second. The range of possible durations for each “type” of event is between one and two orders of magnitude, and the ranges of different types of events overlap one another.

If we assume that events occur serially, then the possible frequencies of events are constrained by the duration of those events. So, by analogy with the continuous domain (e.g., analog signals), each event type will have a characteristic frequency band associated with it [Sanderson and Fisher 1994]. Event types of shorter duration, for example, UI events, can exhibit much higher frequencies when in sequence, and thus might be referred to as high-frequency band event types. Likewise, event types of longer duration, such as project events, exhibit much lower frequencies when in sequence and thus might be referred to as low-frequency band event types. Evaluations that attempt to address the details of interface design have tended to focus on high-frequency band event types, whereas research on computer supported cooperative work (CSCW) has tended to focus on mid- to low-frequency band event types [Sanderson and Fisher 1994].

Some important properties of HCI events that emerge from this characterization include the following:

1. Synchronous vs. Asynchronous Events: Sequences composed of high-frequency event types typically occur synchronously. For example, sequences of UI events, gestures, or conversational turns can usually be captured synchronously using a single recording. However, sequences composed of lower frequency event types, such as meeting or project events, may occur asynchronously, aided, for example, by electronic mail, collaborative applications, memos, and letters. This has important implications on the methods used to sample, capture, and analyze data, particularly at lower frequency bands [Sanderson and Fisher 1994].

2. Composition of Events: Events within a given frequency band are often composed of events from higher frequency bands. These same events typically combine to form events at lower frequency bands. Sanderson and Fisher offer this example: a conversational turn is typically composed of vocalizations, gestures, and eye movements, and a sequence of conversational turns may combine to form a topic under discussion within a meeting [Sanderson and Fisher 1994]. This compositional structure is also exhibited within frequency bands. For instance, user interactions with software applications occur and may be analyzed at multiple levels of abstraction, where events at each level are composed of events occurring at lower levels. (See Hilbert et al. [1997] for an early treatment.) This is discussed further below.

3. Inferences Across Frequency Band Boundaries: Low frequency band events do not directly reveal their composition from higher frequency events. As a result, recording only low frequency events will typically result in information loss. Likewise, high frequency events do not, in themselves, reveal how they combine to form events at lower frequency bands. As a result, either low frequency band events must be recorded in conjunction with high frequency band events or there must be some external model (e.g., a grammar) to describe how high frequency events combine to form lower frequency events. This too is discussed further below.

3.2 Grammatical Issues in Analysis

UI events are often grammatical in structure. Grammars have been used in numerous disciplines to characterize the structure of sequential data. The main feature of grammars that makes them useful in this context is their ability to define equivalence classes of patterns in terms of rewrite rules. For example, the following grammar (expressed as a set of rewrite rules) may be used to capture the ways in which a user can trigger a print job in a given application:

PRINT COMMAND -> "MOUSE PRESSED PrintToolbarButton" or
                 (PRINT DIALOG ACTIVATED then "MOUSE PRESSED OkButton")

PRINT DIALOG ACTIVATED -> "MOUSE PRESSED PrintMenuItem" or
                          "KEY PRESSED Ctrl-P"

Rule 1 simply states that the user can trigger a print job by either pressing the print toolbar button (which triggers the job immediately) or by activating the print dialog and then pressing the “OK” button. Rule 2 specifies that the print dialog may be activated by either selecting the print menu item in the “File” menu or by entering a keyboard accelerator, “Ctrl-P.”

Let us assume that the lexical elements used to construct sentences in this language are:

A: indicating “print toolbar button pressed”
B: indicating “print menu item selected”
C: indicating “print accelerator key entered”
D: indicating “print dialog okayed”

Then the following “sentences” constructed from these elements each indicate a series of four consecutive print job activations:

AAAA

CDAAA

ABDBDA

BDCDACD

CDBDCDBD

All occurrences of ‘A’ indicate an immediate print job activation while all occurrences of ‘BD’ or ‘CD’ indicate a print job activated by using the print dialog and then selecting “OK.”

Notice that each of these sequences contains a different number of lexical elements. Some of them have no lexical elements in common (e.g., AAAA and CDBDCDBD). The lexical elements occupying the first and last positions differ from one sequence to the next. In short, there are a number of salient differences between these sequences at the lexical level. Techniques for automatically detecting, comparing, and characterizing sequences are typically sensitive to such differences. Unless the data is transformed based on the above grammar, the fact that these sequences are semantically equivalent (in the sense that each indicates a series of four consecutive print job activations) will most likely go unrecognized, and even simple summary statistics such as “# of print jobs per session” may be difficult to compute.

Techniques for extracting usability-related information from UI events should take into consideration the grammatical relationships between lower level events and higher level events of interest.
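As an illustration of such a grammar-aware transformation, the following sketch recodes the lexical streams above so that every print job activation, however it was triggered, becomes a single abstract event that can then be counted. It is written in Python and uses a regular expression as a stand-in for the rewrite rules; the abstract symbol "P" is our own choice.

import re

# A = print toolbar button pressed, B = print menu item selected,
# C = print accelerator key entered, D = print dialog okayed.
# The grammar above reduces to: PRINT COMMAND -> A | (B | C) D
PRINT_COMMAND = re.compile(r"A|[BC]D")

def count_print_jobs(stream: str) -> int:
    """Number of print job activations in a lexical event stream."""
    return len(PRINT_COMMAND.findall(stream))

def recode(stream: str) -> str:
    """Replace each matched print-job idiom with a single abstract event 'P'."""
    return PRINT_COMMAND.sub("P", stream)

if __name__ == "__main__":
    for s in ["AAAA", "CDAAA", "ABDBDA", "BDCDACD", "CDBDCDBD"]:
        print(s, "->", recode(s), " print jobs:", count_print_jobs(s))
    # Every sequence recodes to "PPPP": four print jobs each, making the
    # semantic equivalence of the lexically different sequences explicit.

The recoded stream (or the resulting counts) can then be subjected to the same analyses that would otherwise be applied to the raw stream.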

3.3 Contextual Issues in Analysis

Another set of problems arises in attempting to interpret the significance of UI events based only on the information carried within events themselves. To illustrate the problem more generally, consider the analogous problem of interpreting the significance of utterances in transcripts of natural language conversation. Important contextual cues are often spread across multiple utterances or may be missing from the transcript altogether.

Let us assume we have a transcript of a conversation that took place between individuals A and B at a museum. The task is to identify A’s favorite paintings based on utterances in the transcript.

Example 1: “The Persistence of Memory, by Dali, is one of my favorites.”

In this case, everything we need to know in order to determine one of A’s favorite paintings is contained in a single utterance.

Example 2: “The Persistence of Memory, by Dali.”

In this case we need access to prior context. ‘A’ is most likely responding to a question posed by ‘B’. Information carried in the question is critical in interpreting the response. For example, the question could have been: “Which is your least favorite painting?”

Example 3: “That is one of my favorites.”

In this case, we need the ability to dereference an indexical. The information carried by the indexical “that” may not be available in any of the utterances in the transcript, but was clearly available to the interlocutors at the time of the utterance. Such contextual information was “there for the asking,” so to speak, and could have been noted had the transcriber been present and chosen to do so at the time of the utterance.

Example 4: “That is another one.”

In this case we would need access to both prior context and the ability to dereference an indexical.

The following examples illustrate analogous situations in the interpretation of user interface events:

Example 1′: “MOUSE PRESSED PrintToolbarButton”

This event carries with it enough information to indicate the action the user has performed.

Example 2′: “MOUSE PRESSED OkButton”

This event does not on its own indicate what action was performed. As in Example 2 previously, this event indicates a response to some prior event, for example, a prior “MOUSE PRESSED PrintMenuItem” event.

Example 3′: “WINDOW OPENED ErrorDialog”

The information needed to interpret the significance of this event may be available in prior events, but a more direct way to interpret its significance would be to query the dialog for its error message. This is similar to dereferencing an indexical, if we think of the error dialog as figuratively “pointing at” an error message that does not actually appear in the event stream.

Example 4′: “WINDOW OPENED ErrorDialog”

Assuming the error message is “Invalid Command,” then the information needed to interpret the significance of this event is not only found by dereferencing the indexical (the error message “pointed at” by the dialog) but must be supplemented by information available in prior events. It may also be desirable to query contextual information stored in user interface components to determine the combination of parameters (specified in a dialog, for example) that led to this particular error.

The basic insight here is that sometimes an utterance (or a UI event) does not carry enough information on its own to allow its significance to be properly interpreted. Sometimes critical contextual information is available elsewhere in the transcript, and sometimes that information is not available in the transcript, but was available, “for the asking,” at the time of the utterance, or event, but not afterwards. Therefore, techniques for extracting usability-related information from UI events should take into consideration the fact that context may be spread across multiple events, and that in some cases, important contextual information may need to be explicitly captured during data collection if meaningful interpretation is to be performed.
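The following sketch illustrates the kind of capture-time context gathering suggested here. It is written in Python; the dialog interface and the get_message call are hypothetical stand-ins for whatever query mechanism a real toolkit provides. The point is simply that the logger asks the error dialog for its message while that information is still available and stores the answer with the event record.

from dataclasses import dataclass
from typing import Dict, Protocol

class Dialog(Protocol):
    """Hypothetical stand-in for a toolkit dialog component."""
    def get_message(self) -> str: ...

@dataclass
class LoggedEvent:
    event_type: str
    target: str
    timestamp_ms: int
    context: Dict[str, str]

def log_window_opened(dialog: Dialog, target_id: str, now_ms: int) -> LoggedEvent:
    """Record a WINDOW_OPENED event, capturing contextual information that is
    'there for the asking' only at the time of the event."""
    context: Dict[str, str] = {}
    if target_id == "ErrorDialog":
        context["error_message"] = dialog.get_message()
    return LoggedEvent("WINDOW_OPENED", target_id, now_ms, context)

class FakeErrorDialog:
    """Takes the place of a real error dialog in this example."""
    def get_message(self) -> str:
        return "Invalid Command"

if __name__ == "__main__":
    event = log_window_opened(FakeErrorDialog(), "ErrorDialog", 120_000)
    print(event.context)  # {'error_message': 'Invalid Command'}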

3.4 Composition of Events

Finally, user interactions may be analyzed at multiple levels of abstraction. For instance, one may be interested in analyzing low-level mouse movement and timing information, or one may be more interested in higher-level information regarding the steps taken by users in completing tasks, such as placing an order or composing a business letter. Techniques for extracting usability information from UI events should be capable of addressing events at multiple levels of abstraction.

Figure 3 illustrates a multilevel model of events originally presented in Hilbert et al. [1997]. At the lowest level are physical events, for example, fingers depressing keys or a hand moving a pointing device such as a mouse. Input device events, such as key and mouse interrupts, are generated by hardware in response to physical events. UI events associate input device events with windows and other interface objects on the screen. Events at this level include button presses, list and menu selections, focus events in input fields, and window movements and resizing.

Fig. 3. Levels of abstraction in user interactions.

Abstract interaction events are not directly generated by the user interface system, but may be computed based on UI events and other contextual information such as UI state. Abstract interaction events are indicated by recurring, idiomatic patterns of UI events and indicate higher level concepts such as shifts in users’ editing attention or the act of providing values to an application by manipulating application components.

Consider the example of a user editing an input field at the top of a form-based interface, then pressing tab repeatedly to edit a field at the bottom of the form. In terms of UI events, input focus shifted several times between the first and last fields. In terms of abstract interaction events, the user’s editing attention shifted directly from the top field to the bottom field. Notice that detecting the occurrence of abstract interaction events such as “GOT EDIT” and “LOST EDIT” requires the ability to keep track of the last edited component and to notice subsequent editing events in other components.
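A simplified sketch of how such abstract interaction events might be computed from a raw event stream follows (Python; the event names and the rule that only key input counts as editing are assumptions made for this example, and a real detector would also need to distinguish navigation keys such as Tab from editing keys).

from typing import Iterable, List, Tuple

# Each raw event is represented here as an (event_type, target) pair.
RawEvent = Tuple[str, str]

def edit_attention_events(events: Iterable[RawEvent]) -> List[Tuple[str, str]]:
    """Derive LOST_EDIT / GOT_EDIT abstract interaction events: a shift is
    reported only when editing moves to a new component, not on every
    intermediate focus change."""
    abstract: List[Tuple[str, str]] = []
    last_edited = None
    for event_type, target in events:
        if event_type == "KEY_PRESSED" and target != last_edited:
            if last_edited is not None:
                abstract.append(("LOST_EDIT", last_edited))
            abstract.append(("GOT_EDIT", target))
            last_edited = target
    return abstract

if __name__ == "__main__":
    stream = [
        ("KEY_PRESSED", "TopField"),
        ("FOCUS_GAINED", "MiddleField"),   # tabbed through, never edited
        ("FOCUS_GAINED", "BottomField"),
        ("KEY_PRESSED", "BottomField"),
    ]
    print(edit_attention_events(stream))
    # [('GOT_EDIT', 'TopField'), ('LOST_EDIT', 'TopField'), ('GOT_EDIT', 'BottomField')]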

Another type of abstract interaction event might be associated with the act of providing a new value to an application by manipulating user interface components. In the case of a text field, this would mean that the field had received a number of key events, was no longer receiving key events, and now contains a new value. The patterns of window system events that indicate an abstract interaction event such as “VALUE PROVIDED” will differ from one type of interface component to another, and from one application to another, but will typically remain fairly stable within a given application. Notice that detecting the occurrence of an abstract interaction event such as “VALUE PROVIDED” requires the ability to access user interface state such as the component value before and after editing events.

Domain/task-related and Goal/problem-related events are at the highest levels. Unlike other levels, these events indicate progress in the user’s tasks and goals. Inferring these events based on lower level events can be straightforward when the user interface provides explicit support for structuring tasks or indicating goals. For instance, Wizards in Microsoft Word™ [Rubin 1999] lead users through a sequence of steps in a predefined task. The user’s progress can be recognized in terms of simple UI events such as button presses on the “Next” button. In other cases, inferring task and goal related events might require more complicated composite event detection. For instance, the goal of placing an order includes the task of providing address information. The task-related event “ADDRESS PROVIDED” may be recognized in terms of “VALUE PROVIDED” abstract interaction events occurring within each of the required fields in the address section of the form. Finally, in some cases, it may be impossible to infer events at these levels based only on lower level events.
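As a sketch of this kind of composite event detection, the following Python fragment emits a task-related "ADDRESS PROVIDED" event once a "VALUE PROVIDED" abstract interaction event has been observed for every required field; the field names and the required set are hypothetical.

from typing import Iterable, List, Set, Tuple

# Abstract interaction events of the form ("VALUE_PROVIDED", field_id).
AbstractEvent = Tuple[str, str]

ADDRESS_FIELDS: Set[str] = {"Street", "City", "State", "Zip"}  # illustrative

def detect_address_provided(events: Iterable[AbstractEvent]) -> List[str]:
    """Emit ADDRESS_PROVIDED once every required address field has received a value."""
    task_events: List[str] = []
    provided: Set[str] = set()
    for event_type, field_id in events:
        if event_type == "VALUE_PROVIDED" and field_id in ADDRESS_FIELDS:
            provided.add(field_id)
            if provided == ADDRESS_FIELDS:
                task_events.append("ADDRESS_PROVIDED")
                provided.clear()   # allow detection of further addresses
    return task_events

if __name__ == "__main__":
    stream = [("VALUE_PROVIDED", f) for f in ["Street", "City", "Name", "State", "Zip"]]
    print(detect_address_provided(stream))  # ['ADDRESS_PROVIDED']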

Techniques for extracting usability-related information from UI events should be sensitive to the fact that user interactions can occur and be analyzed at multiple levels of abstraction.

4. COMPARISON OF APPROACHES

This section introduces the approaches that have been applied to the problem of extracting usability-related information from UI events. The following subsections discuss the features that distinguish each class of approaches and provide examples of some of the approaches in each class. We mention related work where appropriate and discuss the relative strengths and limitations of each class. Figures 4 through 11 provide pictorial representations of the key features underlying each class. Table III (located at the end of this section) presents a categorization and summary of the features belonging to each of the surveyed techniques.

4.1 Synchronization and Searching

4.1.1 Purpose. User interface events provide detailed information regarding user behavior that can be captured, searched, counted, and analyzed using automated tools. However, it is often difficult to infer higher level events of interest from user interface events alone, and sometimes critical contextual information is simply missing from the event stream, making proper interpretation challenging at best.

Synch and Search techniques seek to combine the advantages of UI event data with the advantages provided by more semantically rich observational data, such as video recordings and experimenters’ observations.

By synchronizing UI events with other sources of data such as video or coded observations, searches in one medium can be used to locate supplementary information in others. Therefore, if an investigator wishes to review all segments of a video in which a user uses the help system or invokes a particular command, it is not necessary to manually search the entire recording. The investigator can: (a) search through the log of UI events for particular events of interest and use the timestamps associated with those events to automatically cue up the video recording, or (b) search through a log of observations (that were entered by the investigator either during or after the time of the recording) and use the timestamps associated with those observations to cue up the video. Similarly, segments of interest in the video can be used to locate the detailed user interface events associated with those episodes.
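At bottom, cueing is simple timestamp arithmetic. A minimal sketch follows (Python; the record layout and the assumption that the event log and the video share, or have been resynchronized to, a common clock are ours): it returns the video offsets at which to review every event matching a search predicate.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LoggedEvent:
    event_type: str
    target: str
    timestamp_ms: int     # on the same clock as the video recording

def video_offsets(log: List[LoggedEvent],
                  matches: Callable[[LoggedEvent], bool],
                  video_start_ms: int) -> List[float]:
    """Video offsets (in seconds) at which to cue the recording for each
    logged event satisfying the search predicate."""
    return [(e.timestamp_ms - video_start_ms) / 1000.0
            for e in log if matches(e)]

if __name__ == "__main__":
    log = [
        LoggedEvent("MOUSE_PRESSED", "HelpButton", 65_000),
        LoggedEvent("MOUSE_PRESSED", "PrintToolbarButton", 90_500),
        LoggedEvent("MOUSE_PRESSED", "HelpButton", 182_250),
    ]
    # Cue the video wherever the user opened the help system.
    print(video_offsets(log, lambda e: e.target == "HelpButton", video_start_ms=5_000))
    # -> [60.0, 177.25]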

Fig. 4. Synchronization and searching.
Fig. 5. Transformation.
Fig. 6. Counts and summary statistics.
Fig. 7. Sequence detection.
Fig. 8. Sequence comparison.
Fig. 9. Sequence characterization.
Fig. 10. Visualization.
Fig. 11. Integrated Support.

4.1.2 Examples. Playback is an early example of a system employing synch and search capabilities [Neal and Simmons 1983]. Playback captures UI events automatically and synchronizes them with coded observations and comments that are entered by experimenters either during or after the evaluation session. Instead of using video, Playback allows recorded events to be played back through the application interface to retrace the user’s actions. The evaluator can step through the playback based on events or coded observations as if using an interactive debugger. There are a handful of simple built-in analyses to automatically calculate counts and summary statistics. This technique captures less information than video-based techniques since video can also be used to record off-line behavior such as facial gestures, off-line documentation use, and verbalizations. Also, there can be problems associated with replaying user sessions accurately in applications where behavior is affected by events outside of user interactions. For instance, the behavior of some applications can vary depending on the state of networks and persistent data stores.

DRUM, the “Diagnostic Recorder for Usability Measurement,” is an integrated evaluation environment that supports video-based usability evaluation [Macleod et al. 1993]. DRUM was developed at the National Physical Laboratory as part of the ESPRIT Metrics for Usability Standards in Computing Project (MUSiC). DRUM features a module for recording and synchronizing events, observations, and video, a module for defining and managing observation coding schemes, a module for calculating predefined counts and summary statistics, and a module for managing and manipulating evaluation-related information regarding subjects, tasks, recording plans, logs, videos, and results of analysis.

Usability specialists at Microsoft, Apple, and SunSoft all report the use of tools that provide synch and search capabilities [Weiler 1993; Hoiem and Sullivan 1994]. The tools used at Microsoft include a tool for logging observations, a tool for tracking UI events, and a tool for synchronizing and reviewing data from the multiple sources. The tools used at Apple and SunSoft are essentially similar. All tools support some level of event selection as part of the capture process. Apple’s selection appears to be user-definable, while Microsoft’s and SunSoft’s selection appears to be programmed into the capture tools. Scripts and general-purpose analysis programs, such as Microsoft Excel™ [Dodge and Stinson 1999], are used to perform counts and summary statistics after capture. All tools support video annotations to produce “highlights” videos. Microsoft’s tools provide an API to allow applications to report application-specific events or events not readily available in the UI event stream.

I-Observe, the “Interface Observation, Evaluation, Recording, and Visualization Environment,” also provides synch and search capabilities [Badre et al. 1995].

ACM Computing Surveys, Vol. 32, No. 4, December 2000.

Table III. A Classification of Computer-Aided Techniques for Extracting Usability-Related Information from User Interface Events

Columns: Tool/Technique; Reference; Synch/Search; Event Capture; Use of Context; Select/Recode; Abstract/Recode; Counts & Stats; Sequence Detect; Sequence Compare; Sequence Character; Visualize; Data Manage; UI Platform. (See Figure 19 for a key to interpreting the values listed in the columns.)

Techniques compared (sorted by class, Sections 4.1 through 4.8): Playback [Neal & Simmons 1983]; Apple Lab [Weiler 1993]; SunSoft Lab [Weiler 1993]; Microsoft Lab [Hoiem & Sullivan 1994]; I-Observe [Badre et al. 1995]; Incident Monitoring [Chen 1990]; User-Identified CIs [Hartson et al. 1996]; CHIME [Badre & Santos 1991]; EDEM [Hilbert & Redmiles 1997]; UIMS [Buxton et al. 1983]; MIKE [Olsen & Halversen 1988]; KRI/AG [Lowgren & Nordqvist 1992]; Long-Term Monitoring [Kay & Thomas 1995]; AUS [Chang & Dillon 1997]; Aqueduct AppScope [Aqueduct Software 1998]; Full Circle Talkback [Full Circle Software 1998]; LSA [Sackett 1978]; Fisher's Cycles [Fisher 1991]; TOP/G [Hoppe 1988]; MRP [Siochi & Hix 1991]; Expectation Agents [Girgensohn et al. 1994]; USINE [Lecerof & Paterno 1998]; TSL [Rosenblum 1991]; Amadeus [Selby et al. 1991]; YEAST [Krishnamurthy & Rosenblum 1995]; EBBA [Bates 1995]; GEM [Mansouri-Samani & Sloman 1997]; ADAM [Finlay & Harrison 1990]; UsAGE [Ueling & Wolf 1995]; EMA [Balbo 1996]; Process Validation [Cook & Wolf 1997]; Markov-based [Guzdial 1993]; Grammar-based [Olson et al. 1994]; Process Discovery [Cook & Wolf 1995]; Hawk [Guzdial 1993]; DRUM [Macleod & Rengger 1993]; MacSHAPA [Sanderson et al. 1994]; ErgoLight EORS/EUVS [ErgoLight Usability Software 1998].


I-Observe is a set of loosely integrated tools for collecting, selecting, analyzing, and visualizing event data that has been synchronized with video recordings. Investigators can perform searches by specifying predicates over the attributes contained within a single event record, and can locate patterns of events by stringing together multiple search specifications into regular expressions. Intervals matched by such regular expressions (identified by begin and end events) can then be used to select data for visualization or to display the corresponding segments of the video recording.
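To make the idea concrete, the following is a minimal sketch (not I-Observe's actual implementation) of how per-event predicates can be strung together into a regular expression whose matches yield begin and end timestamps for selecting a video segment. The event types, attribute names, log contents, and pattern are all hypothetical.

import re
from dataclasses import dataclass

@dataclass
class Event:
    kind: str       # e.g. "MENU_SELECT", "DIALOG_OPEN" (hypothetical event types)
    target: str     # UI component that generated the event
    time: float     # timestamp in seconds, used to index the synchronized video

# Hypothetical captured log; in practice this is synchronized with a video recording.
log = [
    Event("MENU_SELECT", "File", 10.2),
    Event("DIALOG_OPEN", "PrintDialog", 10.9),
    Event("BUTTON_PRESS", "Cancel", 14.5),
    Event("MENU_SELECT", "File", 20.1),
]

# Each per-event search specification is a predicate over one event record;
# encoding each event as a single letter lets specifications be chained as a regex.
specs = {
    "O": lambda e: e.kind == "DIALOG_OPEN",
    "C": lambda e: e.kind == "BUTTON_PRESS" and e.target == "Cancel",
}

def encode(event):
    for letter, predicate in specs.items():
        if predicate(event):
            return letter
    return "x"   # 'x' stands for any event not matched by a specification

encoded = "".join(encode(e) for e in log)

# Pattern over event codes: a dialog opened and later canceled,
# with any number of other events in between.
for m in re.compile("Ox*C").finditer(encoded):
    begin, end = log[m.start()], log[m.end() - 1]
    print(f"matched events {m.start()}..{m.end() - 1}: "
          f"video segment {begin.time:.1f}s - {end.time:.1f}s")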

4.1.3 Strengths. The strengths of these techniques lie in their ability to integrate data sources with complementary strengths and weaknesses and to allow searches in one medium to locate related information in the others. UI events provide detailed performance information that can be searched, counted, and analyzed using automated techniques. However, UI events often leave out higher level contextual information that can more easily be captured using video recordings and coded observations.

4.1.4 Limitations. Techniques relying on synchronizing UI events with video and coded observations typically require video recording equipment and the presence of observers, which can make subjects self-conscious, affect performance, and may not be practical or permitted in certain circumstances. Furthermore, video-based evaluations tend to produce massive amounts of data that can be expensive to analyze: the ratio of time spent in analysis to the duration of the sessions being analyzed has been known to reach 10:1 [Sanderson and Fisher 1994; Nielsen 1993; Sweeny et al. 1993]. These matters are all serious limiting factors on evaluation size, scope, location, and duration.

Some researchers have begun investigating techniques for performing collaborative remote usability evaluations using video-conferencing software and application sharing technologies. Such techniques may help lift some of the limitations on evaluation location. However, more work must be done in the area of automating data collection and analysis if current restrictions on evaluation size, scope, and duration are to be addressed.

4.2 Transformation

4.2.1 Purpose. These techniques combine selection, abstraction, and recoding to transform event streams for various purposes, such as facilitating human pattern detection, comparison, and characterization, or preparing data for input into automatic techniques that perform these functions.

Selection operates by subtracting information from event streams, allowing events and sequences of interest to emerge from the "noise." Selection involves specifying constraints on event attributes to indicate events of interest to be separated from other events, or to indicate events to be separated from events of interest. For instance, one may elect to disregard all events associated with mouse movements in order to focus analysis on higher level actions such as button presses and menu selections. This can be accomplished by "positively" selecting button press and menu selection events or by "negatively" selecting, or filtering, mouse movement events.

Abstraction operates by synthesizing new events based on information in the event stream, supplemented (in some cases) by contextual information outside of the event stream. For instance, a pattern of events indicating that an input field had been edited, that a new value had been provided, and that the user's editing attention had since shifted to another component might indicate the abstract event "VALUE PROVIDED," which is not signified by any single event in the event stream. Furthermore, the same abstract event might be indicated by different events in different UI components; for example, mouse events typically indicate editing in nontextual components while keyboard events typically indicate editing in textual components. One may also wish to synthesize events to relate the use of particular UI components to higher level concepts such as the use of menus, toolbars, or dialogs to which those components belong.

Recoding involves producing new event streams based on the results of selection and abstraction. This allows the same manual or automated analysis techniques normally applied to raw event streams to be applied to selected and abstracted events, potentially leading to different results. Consider the example presented in Section 3. If the sequences representing four sequential print job activations were embedded within the context of a larger sequence, they might not be identified as being similar subsequences, particularly by automated techniques such as those presented below. However, after performing abstraction based on the grammar in that example, each of these sequences could be recoded as "AAAA," making them much more likely to be identified as common subsequences by automatic techniques.
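As a rough illustration of how the three operations fit together, the sketch below filters out mouse-move events (selection), synthesizes an abstract event when edits to a field are followed by the user's attention shifting elsewhere (abstraction), and emits a new stream containing the abstract events (recoding). The event representation and the VALUE_PROVIDED rule are simplified assumptions, not the format or rules used by any particular tool surveyed here.

# Raw events as (event_type, component) pairs -- a simplified, hypothetical format.
raw = [
    ("MOUSE_MOVE", "canvas"),
    ("KEY_PRESS", "name_field"),
    ("KEY_PRESS", "name_field"),
    ("FOCUS_OUT", "name_field"),
    ("MOUSE_MOVE", "canvas"),
    ("BUTTON_PRESS", "ok_button"),
]

# Selection: drop low-level mouse movements to focus on higher-level actions.
selected = [e for e in raw if e[0] != "MOUSE_MOVE"]

def abstract(events):
    """Abstraction: synthesize a VALUE_PROVIDED event when edits to a component
    are followed by editing attention shifting away from it (FOCUS_OUT here).
    The individual key presses are folded into the abstract event."""
    out = []
    edited = None
    for etype, comp in events:
        if etype == "KEY_PRESS":
            edited = comp
        elif etype == "FOCUS_OUT" and comp == edited:
            out.append(("VALUE_PROVIDED", comp))   # abstract event not in the raw stream
            edited = None
        else:
            out.append((etype, comp))
    return out

# Recoding: the resulting stream can be fed to the same analyses as a raw stream.
recoded = abstract(selected)
print(recoded)   # [('VALUE_PROVIDED', 'name_field'), ('BUTTON_PRESS', 'ok_button')]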

4.2.2 Examples. Chen presents an approach to user interface event monitoring that selects events based on the notion of "incidents" [Chen 1990]. Incidents are defined as only those events that actually trigger some response from the application and not just the user interface system. This technique was demonstrated by modifying the X Toolkit Intrinsics [Nye and O'Reilly 1992] to report events that trigger callback procedures registered by applications. This allows events not handled by the application to be selected "out" automatically. Investigators may further constrain event reporting by selecting specific incidents of interest to report. Widgets in the user interface toolkit were modified to provide a query procedure to return limited contextual information when an event associated with a given widget triggers a callback.

Hartson and colleagues report an approach to remote collaborative usability evaluation that relies on users to select events [Hartson et al. 1996]. Users identify potential usability problems that arise during the course of interacting with an application and report information regarding these "critical incidents" by pressing a "report" button that is supplied in the interface. The approach uses e-mail to report digitized video of the events leading up to and following critical incidents along with contextual information provided by users. In this case, selection is achieved by only reporting the n events leading up to, and m events following, user-identified critical incidents (where n and m are parameters that can be set in advance by investigators).[1]

[1] This approach actually focuses on capturing video and not events. However, the ideas embodied by the approach can equally well be applied to the problem of selecting events of interest surrounding user-identified critical incidents.

CHIME, the "Computer-Human Interaction Monitoring Engine," is similar, in some ways, to Chen's approach [Badre and Santos 1991a]. CHIME allows investigators to select ahead of time which events to report and which events to filter. An important difference is that CHIME also supports a limited notion of abstraction that allows a level of indirection to be built on top of the window system. The basic idea is that abstract "interaction units" (IUs) are defined that translate window system events into platform independent events upon which further monitoring infrastructure is built. The events to be recorded are then specified in terms of these platform independent IUs.[2]

[2] The paper also alludes to the possibility of allowing higher level IUs to be hierarchically defined in terms of lower level IUs (using a context-free grammar and pre-conditions) to provide a richer notion of abstraction. However, this appears to never have been implemented [Badre and Santos 1991a; 1991b].

Hawk is an environment for selecting, abstracting, and recoding events in log files [Guzdial 1993]. Hawk's main functionality is provided by a variant of the AWK programming language [Aho et al. 1988], and an environment for managing data files is provided by HyperCard™ [Goodman 1998]. Events appear in the event log one per line, and AWK pattern-action pairs are used to specify what is to be matched in each line of input (the pattern) and what is to be printed as output (the action). This allows fairly flexible selection, abstraction, and recoding to be performed.

MacSHAPA, which is discussed further later, supports selection and recoding via a database query and manipulation language that allows investigators to select event records based on attributes and define new streams based on the results of queries [Sanderson et al. 1994]. Investigators can also perform abstraction by manually entering new records representing abstract events and visually aligning them with existing events in a spreadsheet representation (see Figure 13).

EDEM, an "Expectation-Driven Event Monitoring" system, captures UI events and supports automated selection, abstraction, and recoding [Hilbert and Redmiles 1998a]. Selection is achieved in two ways: investigators specify ahead of time which events to report, and users can also cause events to be selected via a critical incident reporting mechanism akin to that reported in Hartson et al. [1996]. However, one important difference is that EDEM also allows investigators to define automated agents to help in the detection of "critical incidents," thereby lifting some of the burden from users, who often do not know when their actions are violating expectations about proper usage [Smilowitz et al. 1994]. Furthermore, investigators can use EDEM to define abstract events in terms of patterns of existing events. When EDEM detects a pattern of events corresponding to a predefined abstract event, it generates an abstract event and inserts it into the event stream. Investigators can configure EDEM to perform further hierarchical event abstraction by simply defining higher level abstract events in terms of lower level abstract events. All of this is done in context, so that contextual information can be used in selection and abstraction. EDEM also calculates a number of simple counts and summary statistics.

4.2.3 Strengths. The main strength of these approaches lies in their explicit support for selection, abstraction, and recoding, which are essential steps in preparing UI events for most types of analysis (as illustrated in Section 3). Chen and CHIME address issues of selection prior to reporting. Hartson and colleagues add to this a technique for accessing contextual information via the user. EDEM adds to these automatic detection of critical incidents and abstraction that is performed in context. All these techniques might potentially be used to collect UI events remotely.

Hawk and MacSHAPA, on the other hand, do not address event collection but provide powerful and flexible environments for transforming and analyzing already captured UI events.

4.2.4 Limitations. The techniques that select, abstract, and recode events while collecting them run the risk of throwing away data that might have been useful in analysis. Techniques that rely exclusively on users to select events are even more likely to drop useful information. With the exception of EDEM, the approaches that support flexible abstraction and recoding do so only after events have been captured, meaning contextual information that can be critical in abstraction is not available.

4.3 Counts and Summary Statistics

4.3.1 Purpose. As noted previously, one of the key benefits of UI events is how readily details regarding on-line behavior can be captured and manipulated using automated techniques. Most investigators rely on general-purpose analysis programs such as spreadsheets and statistical packages to compute counts and summary statistics based on collected data (e.g., feature use counts or error frequencies). However, some investigators have proposed systems with specific built-in facilities for performing and reporting such calculations. This section provides examples of some of the systems boasting specialized facilities for calculating usability-related metrics.


4.3.2 Examples. The MIKE user interface management system (UIMS) is an early example of a system offering built-in facilities for calculating and reporting metrics [Olsen and Halversen 1988]. Because MIKE controls all aspects of input and output activity, and because it has an abstract description that links user interface components to the application commands they trigger, MIKE is in a uniquely good position to monitor UI events and associate them with responsible interface components and application commands. Example metrics include:

- Performance time: How much time is spent completing tasks such as specifying arguments for commands?
- Mouse travel: Is the sum of the distances between mouse clicks unnecessarily high?
- Command frequency: Which commands are used most frequently or not at all?
- Command pair frequency: Which commands are used together frequently? Can they be combined or placed closer to one another?
- Cancel and undo: Which dialogs are frequently canceled? Which commands are frequently undone?
- Physical device swapping: Is the user switching back and forth between keyboard and mouse unnecessarily? Which features are associated with high physical swapping counts?

MIKE logs all UI events and associates them with the interface components triggering them and the application commands triggered by them. Event logs are written to files that are later read by a metric collection and report generation program. This program uses the abstract description of the interface to interpret the log and to generate human readable reports summarizing the metrics.
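As a simple illustration of the kinds of counts involved, the following sketch computes two such metrics, command frequency and mouse travel, from a hypothetical event log; it is not MIKE's metric collection program, and the log format is assumed.

from collections import Counter
from math import hypot

# Hypothetical log: command invocations and mouse-click coordinates.
commands = ["Open", "Print", "Print", "Undo", "Print", "Close"]
clicks = [(10, 20), (300, 25), (305, 400), (12, 398)]

# Command frequency: which commands are used most often or not at all?
print(Counter(commands).most_common())
# [('Print', 3), ('Open', 1), ('Undo', 1), ('Close', 1)]

# Mouse travel: sum of the distances between successive mouse clicks.
travel = sum(hypot(x2 - x1, y2 - y1)
             for (x1, y1), (x2, y2) in zip(clicks, clicks[1:]))
print(f"mouse travel: {travel:.1f} pixels")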

MacSHAPA also includes numerous built-in features to support computation and reporting of simple counts and summary statistics, including, for instance, the frequencies and durations of any selection of events specified by the user [Sanderson et al. 1994]. MacSHAPA's other more powerful analysis features are described in following sections.

Automatic Usability Software (AUS) is reported to provide a number of automatically computed metrics such as help system use, use of cancel and undo, mouse travel, and mouse clicks per window [Chang and Dillon 1997].

Finally, ErgoLight Operation Recording Suite (EORS) and Usability Validation Suite (EUVS) [ErgoLight Usability Software 1998] provide a number of built-in counts and summary statistics to characterize user interactions captured by the tools, either locally in the usability lab, or remotely over the Internet.

4.3.3 Related. A number of commercial tools such as Aqueduct AppScope™ [Aqueduct Software 1998] and Full Circle Talkback™ [Full Circle Software 1998] have recently become available for capturing data about application crashes over the Internet. These tools capture metrics about the operating system and application at the time crashes occur, and Talkback allows users to provide feedback regarding the actions leading up to crashes. These tools also provide APIs that application developers can use to report events of interest, such as application feature usage. The captured data is then sent via e-mail to developers' computers, where it is stored in a database and plotted using standard database plotting facilities.

4.3.4 Strengths. With the number of possible metrics, counts, and summary statistics that might be computed and that might be useful in usability evaluation, it is helpful that some systems provide built-in facilities to perform and report such calculations automatically.

4.3.5 Limitations. With the exception of MacSHAPA, the systems described above do not provide facilities to allow evaluators to modify built-in counts, statistics, and reports, or to add new ones of their own. Also, the computation of useful metrics is greatly simplified when the system computing the metrics has a model linking user interface components to application commands, as in the case of MIKE, or when the application code is manually instrumented to report the events to be analyzed, as in the case of AppScope and Talkback. AUS does not address application-specific features and thus is limited in its ability to relate metrics results to application features.

4.4 Sequence Detection

4.4.1 Purpose. These techniques detect occurrences of concrete or abstractly defined target sequences within source sequences.[3] In some cases target sequences are abstractly defined and are supplied by the developers of the technique (e.g., Fisher's cycles, lag sequential analysis, maximal repeating pattern analysis, and automatic chunk detection). In other cases, target sequences are more specific to a particular application and are supplied by the investigators using the technique (e.g., TOP/G, Expectation Agents, and EDEM). Sometimes the purpose is to generate a list of matched source subsequences for perusal by the investigator (e.g., Fisher's cycles, maximal repeating pattern analysis, and automatic chunk detection). Other times the purpose is to recognize sequences of UI events that violate particular expectations about proper UI usage (e.g., TOP/G, Expectation Agents, and EDEM). Finally, in some cases the purpose may be to perform abstraction and recoding of the source sequence based on matches of the target sequence (e.g., EDEM).

[3] The following sentences include names of approaches in parentheses to indicate how the examples explained in the next subsection fall into finer subcategories of the broader "sequence detection" category.

4.4.2 Examples. TOP/G, the "Task-Oriented Parser/Generator," parses sequences of commands from a command-line simulation and attempts to infer the higher level tasks that are being performed [Hoppe 1988]. Users of the technique model expected tasks in a notation based on Payne and Green's task-action grammars [Payne and Green 1986] and store this information as rewrite or production rules in a Prolog database. Investigators can define composite tasks hierarchically in terms of elementary tasks, which they must further decompose into "triggering rules" that map keystroke level events into elementary tasks. Investigators may also define rules to recognize "suboptimal" user behaviors that might be abbreviated by simpler compositions of commands. A later version attempted to use information about the side effects of commands in the environment to recognize when a longer sequence of commands might be replaced by a shorter sequence. In both cases, TOP/G's generator functionality could be used to generate the shorter, or more "optimal," command sequence.

Researchers involved in exploratory sequential data analysis (ESDA) have applied a number of techniques for detecting abstractly defined patterns in sequential data. For an in-depth treatment see [Sanderson and Fisher 1994]. These techniques can be subdivided into two basic categories:

Techniques sensitive to sequentially separated patterns of events, for example:
- Fisher's cycles
- Lag sequential analysis (LSA)

Techniques sensitive to strict transitions between events, for example:
- Maximal Repeating Pattern Analysis (MRP)
- Log linear analysis
- Time-series analysis

Fisher's cycles allow investigators to specify beginning and ending events of interest that are then used to automatically identify all occurrences of subsequences beginning and ending with those events (excluding those with further internal occurrences of those events) [Fisher 1991]. For example, assume an investigator is faced with a source sequence of events encoded using the letters of the alphabet, such as: ABACDACDBADBCACCCD. Suppose further that the investigator wishes to find out what happened between all occurrences of 'A' (as a starting point) and 'D' (as an ending point). Fisher's cycles produces the following analysis:

Source sequence: ABACDACDBADBCACCCD
Begin event: A
End event: D
Output:

Cycle #   Frequency   Cycle
1         2           ACD
2         1           AD
3         1           ACCCD

The investigator could then note that there were clearly no occurrences of B in any A->D cycle. Furthermore, the investigator might use a grammatical technique to recode repetitions of the same event into a single event, thereby revealing that the last cycle (ACCCD) is essentially equivalent to the first two (ACD). This is one way of discovering similar subsequences in "noisy" data.
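A minimal sketch of the cycle-extraction step, reimplemented here from the description above (it is not Fisher's own code), reproduces the analysis for the example sequence:

from collections import Counter

def fishers_cycles(sequence, begin, end):
    """Return all subsequences running from `begin` to `end` with no
    internal occurrences of either event."""
    cycles = []
    start = None
    for i, event in enumerate(sequence):
        if event == begin:
            start = i                       # restart on a later begin event
        elif event == end and start is not None:
            cycles.append(sequence[start:i + 1])
            start = None
    return cycles

source = "ABACDACDBADBCACCCD"
print(Counter(fishers_cycles(source, "A", "D")))
# Counter({'ACD': 2, 'AD': 1, 'ACCCD': 1})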

Lag sequential analysis (LSA) is another popular technique that identifies the frequency with which two events occur at various "removes" from one another [Sackett 1978; Allison and Liker 1982; Faraone and Dorfman 1987; Sanderson and Fisher 1994]. LSA takes one event as a 'key' and another event as a 'target' and reports how often the target event occurs at various intervals before and after the key event. If 'A' were the key and 'D' the target in the previous example, LSA would produce the following analysis:

Source sequence: ABACDACDBADBCACCCD
Key event: A
Target event: D
Lag(s): -4 through +4
Output:

Lag          -4  -3  -2  -1   1   2   3   4
Occurrences   0   1   1   1   1   2   0   1

The count of 2 at lag +2 corresponds to the ACD cycles identified by Fisher's cycles above. Assuming the same recoding operation performed above to collapse multiple occurrences of the same event into a single event, this count would increase to 3. The purpose of LSA is to identify correlations between events (that might be causally related to one another) that might otherwise have been missed by techniques more sensitive to the strict transitions between events.
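The lag counting itself can be sketched as follows; this is one straightforward reading of LSA, counting every co-occurrence at each lag, whereas published LSA procedures differ in details such as how overlapping key occurrences are treated.

def lag_counts(sequence, key, target, max_lag=4):
    """For each lag in -max_lag..+max_lag (excluding 0), count how often
    `target` occurs at that offset from an occurrence of `key`."""
    counts = {lag: 0 for lag in range(-max_lag, max_lag + 1) if lag != 0}
    for i, event in enumerate(sequence):
        if event != key:
            continue
        for lag in counts:
            j = i + lag
            if 0 <= j < len(sequence) and sequence[j] == target:
                counts[lag] += 1
    return counts

# e.g. the count at lag +2 corresponds to the ACD cycles found above
print(lag_counts("ABACDACDBADBCACCCD", "A", "D"))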

An example of a technique that is more sensitive to strict transitions is the Maximal Repeating Pattern (MRP) analysis technique [Siochi and Hix 1991]. MRP operates under the assumption that repetition of user actions can be an important indicator of potential usability problems. MRP identifies all patterns occurring repeatedly in the input sequence and produces a listing of those patterns sorted by length first, followed by frequency of occurrence in the source sequence. MRP applied to the sequence above would produce the following analysis:

Source sequence: ABACDACDBADBCACCCD
Output:

Pattern #   Frequency   Pattern
1           2           ACD
2           3           AC
3           3           CD
4           2           BA
5           2           DB

MRP is similar in spirit to Fisher's cycles and LSA; however, the investigator does not specify particular events of interest. Notice that the ACCCD subsequence identified in the previous examples is not identified by MRP since it only occurs once in the source sequence.
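The following sketch captures the flavor of MRP under simplifying assumptions (patterns of length two or more, non-overlapping occurrence counts, and a pattern is dropped if a longer repeating pattern with the same frequency contains it); it is not Siochi and Hix's algorithm, although on the example sequence it produces the listing shown above.

def maximal_repeating_patterns(seq, min_len=2):
    # All subsequences of length >= min_len occurring at least twice
    # (str.count gives non-overlapping occurrence counts).
    found = {}
    for length in range(min_len, len(seq) // 2 + 1):
        for i in range(len(seq) - length + 1):
            pat = seq[i:i + length]
            if pat not in found and seq.count(pat) >= 2:
                found[pat] = seq.count(pat)
    # Keep a pattern only if no longer repeating pattern contains it
    # with the same frequency.
    maximal = {p: c for p, c in found.items()
               if not any(p in q and c == found[q] and len(q) > len(p)
                          for q in found)}
    # Sort by length first, then by frequency.
    return sorted(maximal.items(), key=lambda pc: (-len(pc[0]), -pc[1]))

for pattern, count in maximal_repeating_patterns("ABACDACDBADBCACCCD"):
    print(pattern, count)
# ACD 2, AC 3, CD 3, BA 2, DB 2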

Markov-based techniques can be used to compute the transition probabilities from one or more events to the next event. Statistical tests can be applied to determine whether the probabilities of these transitions are greater than would be expected by chance [Sanderson and Fisher 1994]. Other related techniques include log linear analysis [Gottman and Roy 1990] and formal time-series analysis [Box and Jenkins 1976]. All of these techniques attempt to find strict sequential patterns in the data that occur more frequently than would be expected by chance.

Santos and colleagues have proposed an algorithm for detecting users' "mental chunks" based on pauses and flurries of activity in human computer interaction logs [Santos and Badre 1994]. The algorithm is based on an extension of Fitts' law [Fitts 1964] that predicts the expected time between events generated by a user who is actively executing plans, as opposed to engaging in problem solving and planning activity. For each event transition in the log, if the pause in interaction cannot be justified by the predictive model, then the lag is assumed to signify a transition from "plan execution phase" to "plan acquisition phase" [Santos et al. 1994]. The approach uses the results of the algorithm to segment the source sequence into plan execution chunks and chunks most probably associated with problem solving and planning activity. The assumption is that expert users tend to have longer, more regular execution chunks than novice users, so user expertise might be inferred on the basis of the results of this chunking algorithm.
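A highly simplified sketch of the segmentation step follows, using a fixed pause threshold in place of the Fitts'-law-based prediction of allowable inter-event times; the threshold value and the timestamp-only log format are assumptions made for illustration.

def chunk_by_pauses(timestamps, threshold=2.0):
    """Split a log (event timestamps in seconds) into 'execution chunks'
    wherever the pause between consecutive events exceeds the threshold.
    Santos and colleagues derive the allowable pause from a predictive
    model rather than a fixed constant."""
    chunks = [[timestamps[0]]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > threshold:
            chunks.append([curr])      # pause too long: assume planning occurred
        else:
            chunks[-1].append(curr)    # still within a plan-execution flurry
    return chunks

print(chunk_by_pauses([0.0, 0.4, 0.9, 5.2, 5.5, 9.9, 10.1, 10.4]))
# [[0.0, 0.4, 0.9], [5.2, 5.5], [9.9, 10.1, 10.4]]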

Finally, work done by Redmiles and colleagues on "Expectation Agents" (EAs) [Girgensohn et al. 1994] and "Expectation-Driven Event Monitoring" (EDEM) [Hilbert and Redmiles 1998b] relies on sequence detection techniques to trigger various actions in response to pre-specified patterns of events. These approaches employ an event pattern language to allow investigators to specify composite events to be detected. When a pattern of interest is detected, contextual information may also be queried before action is taken. Possible actions include notifying the user and/or investigator that a particular pattern was detected, collecting user feedback, and reporting UI state and events leading up to detected patterns. Investigators may also configure EDEM to abstract and recode event streams to indicate the occurrence of abstract events associated with pre-specified event patterns.

4.4.3 Related. EBBA is a debugging system that attempts to match the behavior of a distributed program against partial models of expected behavior [Bates 1995]. EBBA is similar to EDEM, particularly in its ability to abstract and recode the event stream based on hierarchically defined abstract events.[4]

[4] EBBA is sometimes characterized as a sequence comparison system since the information carried in a partially matched model can be used to help the investigator better understand where the program's behavior has gone wrong (or where a model is inaccurate). However, EBBA does not directly indicate that partial matches have occurred or provide any diagnostic measures of correspondence. Rather, the user must notice that a full match has failed, and then manually inspect the state of the pattern matching mechanism to see which events were matched and which were not.

Amadeus [Selby et al. 1991] and YEAST [Krishnamurthy and Rosenblum 1995] are event-action systems used to detect and take actions based on patterns of events in software processes. These systems are also similar in spirit to Expectation Agents and EDEM. Techniques that have been used to specify behavior of concurrent systems, such as the Task Sequencing Language (TSL) as described in [Rosenblum 1991], are also related. The event specification notations used in these approaches might be applicable to the problem of specifying and detecting patterns of UI events.

4.4.4 Strengths. The strength of these approaches lies in their ability to help investigators detect patterns of interest in events and not just perform analysis on isolated events. The techniques associated with ESDA help investigators detect patterns that may not have been anticipated. Languages for detecting patterns of interest in UI events based on extended regular expressions [Sanderson and Fisher 1994] or on more grammatically inspired techniques [Hilbert and Redmiles 1998b] can be used to locate patterns of interest and to transform event streams by recoding patterns of events into abstract events.

4.4.5 Limitations. The ESDA techniques described above tend to produce large amounts of output that can be difficult to interpret and that frequently do not lead to identification of usability problems [Cuomo 1994]. The non-ESDA techniques require investigators to know how to specify the patterns for which they are searching and to define them (sometimes painstakingly) before analysis can be performed.

4.5 Sequence Comparison

4.5.1 Purpose. These techniques compare source sequences against concrete or abstractly defined target sequences, indicating partial matches between the two.[5] Some techniques attempt to detect divergence between an abstract model of the target sequence and the source sequence (e.g., EMA and USINE). Others attempt to detect divergence between a concrete target sequence produced, for example, by an expert user and a source sequence produced by some other user (e.g., ADAM and UsAGE). Some produce diagnostic measures of distance to characterize the correspondence between target and source sequences (e.g., ADAM). Others attempt to perform the best possible alignment of events in target and source sequences and present the results visually (e.g., UsAGE and MacSHAPA). Still others use points of deviation between the target and input sequences to automatically indicate potential "critical incidents" (e.g., EMA and USINE). In all cases, the purpose is to compare actual usage against some model or trace of "ideal" or expected usage to identify potential usability problems.[6]

[5] The following sentences include names of approaches in parentheses to indicate how the examples explained in the next subsection fall into finer subcategories of the broader "sequence comparison" category.

[6] TOP/G, Expectation Agents, and EDEM (discussed above) are also intended to detect deviations between actual and expected usage to identify potential usability problems. However, these approaches are better characterized as detecting complete matches between source sequences and ("negatively defined") target patterns that indicate unexpected, or suboptimal, behavior, as opposed to partially matching, or comparing, source sequences against ("positively defined") target patterns indicating expected behavior.

4.5.2 Examples. ADAM, an "Advanced Distributed Associative Memory," compares fixed length source sequences against a set of target sequences that were used to train the memory [Finlay and Harrison 1990]. Investigators train ADAM by helping it associate example target sequences with "classes" of event patterns. After training, when a source sequence is input, the associative memory identifies the class that most closely matches the source sequence and outputs two diagnostic measures: a "confidence" measure that is 100% only when the source sequence is identical to one of the trained target sequences, and a "distance" measure, indicating how far the source pattern is from the next "closest" class. Investigators then use these measures to determine whether a source sequence is different enough from the trained sequences to be judged as a possible "critical incident." Incidentally, ADAM might also be trained on examples of "expected" critical incidents so that these might be detected directly.

MacSHAPA [Sanderson and Fisher 1994] provides techniques for aligning two sequences of events as optimally as possible based on maximal common subsequences [Hirschberg 1975]. The results are presented visually as cells in adjacent spreadsheet columns, with aligned events appearing in the same row and missing cells indicating events in one sequence that could not be aligned with events in the other (see Figure 17).
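The alignment step can be sketched with a standard common-subsequence alignment, which is the general idea behind maximal-common-subsequence alignment; MacSHAPA's actual algorithm and spreadsheet display are more elaborate, and the event sequences below are invented for illustration.

from difflib import SequenceMatcher

def align(a, b):
    """Align two event sequences on their common subsequence; unmatched
    events are paired with gaps ('--'), roughly as in a spreadsheet view."""
    rows, i, j = [], 0, 0
    for ai, bj, size in SequenceMatcher(None, a, b).get_matching_blocks():
        # Events before the next matching block could not be aligned.
        rows += [(x, "--") for x in a[i:ai]]
        rows += [("--", y) for y in b[j:bj]]
        rows += list(zip(a[ai:ai + size], b[bj:bj + size]))
        i, j = ai + size, bj + size
    return rows

expert = ["Open", "Edit", "Spell", "Save", "Close"]
novice = ["Open", "Edit", "Undo", "Edit", "Save", "Close"]
for left, right in align(expert, novice):
    print(f"{left:<8}{right}")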

UsAGE applies a related technique in which a source sequence of UI events (related to performance of a specific task) is aligned as optimally as possible with a target sequence produced by an "expert" performing the same task [Ueling and Wolf 1995]. UsAGE presents its alignment results in visual form.

EMA, an "automatic analysis mechanism for the ergonomic evaluation of user interfaces," requires investigators to provide a grammar-based model describing all the expected paths through a particular user interface [Balbo 1996]. An evaluation program then compares a log of events generated by use of the interface against the model, indicating in the log and the model where the user has taken "illegal" paths. EMA also detects and reports the occurrence of other simple patterns, for example, the use of cancel or repeated actions. The evaluator can then use this information to identify problems in the interface (or problems in the model).
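The comparison step can be reduced to a toy sketch in which the expected-paths model is just a table of allowed transitions and unexpected transitions are flagged; EMA's actual models and log formats are considerably richer, and the actions and model below are hypothetical.

# Expected paths reduced to a table of allowed transitions (hypothetical model).
allowed = {
    "start":     {"open_file"},
    "open_file": {"edit", "close"},
    "edit":      {"edit", "save", "close"},
    "save":      {"edit", "close"},
    "close":     set(),
}

def check_log(actions, model):
    """Flag transitions in the log that the model does not allow."""
    state = "start"
    for action in actions:
        if action not in model.get(state, set()):
            print(f"illegal path: {state} -> {action}")
        state = action

check_log(["open_file", "edit", "close", "save"], allowed)
# illegal path: close -> save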

USINE is a similar technique [Lecerof and Paterno 1998], in which investigators use a hierarchical task notation to specify how lower-level actions combine to form higher-level tasks, and to specify sequencing constraints on actions and tasks. The tool then compares logs of user actions against the task model. All actions not specified in the task model, or actions and tasks performed "out of order" according to the sequencing constraints specified in the task model, are flagged as potential errors. The tool then computes a number of built-in counts and summary statistics, including number of tasks completed, errors, and other basic metrics (e.g., window resizing and scrollbar usage), and generates simple graphs.

ErgoLight Usability Validation Suite (EUVS) also compares user interactions against hierarchical representations of user tasks [ErgoLight Usability Software 1998]. EUVS is similar in spirit to EMA and USINE, with the added benefit that it provides a number of built-in counts and summary statistics regarding general user interface use in addition to automatically-detected divergences between user actions and the task model.

4.5.3 Related. Process validation techniques are related in that they compare actual traces of events generated by a software process against an abstract model of the intended process [Cook and Wolf 1997]. These techniques compute a diagnostic measure of distance to indicate the correspondence between the trace and the closest acceptable trace produced by the model. Techniques for performing error correcting parsing are also related. See Cook and Wolf [1997] for further discussion and pointers to relevant literature.

4.5.4 Strengths. The strengths of these approaches lie in their ability to compare actual traces of events against expected traces, or models of expected traces, in order to identify potential usability problems. This is particularly appealing when expected traces can be specified "by demonstration," as in the case of ADAM and UsAGE.

4.5.5 Limitations. Unfortunately, all of these techniques have significant limitations.

A key limitation of any approach that compares source sequences against concrete target sequences is the underlying assumption that (a) source and target sequences can be easily segmented for piecemeal comparison, as in the case of ADAM, or (b) whole interaction sequences produced by different users will actually exhibit reasonable correspondence, as in the case of UsAGE.

Furthermore, the output of all these techniques (except in the case of perfect matches) requires expert human interpretation to determine whether the sequences are interestingly similar or different. In contrast to techniques that completely match patterns that directly indicate violations of expected patterns (e.g., as in the case of EDEM), these techniques produce output to the effect, "the source sequence is similar to a target sequence with a correspondence measure of 61%," leaving it up to investigators to decide on a case by case basis what exactly the correspondence measure means.

A key limitation of any technique comparing sequences against abstract models (e.g., EMA, USINE, ErgoLight EUVS, and the process validation techniques described by Cook and Wolf) is that in order to reliably categorize a source sequence as being a poor match, the model used to perform the comparison must be relatively complete in its ability to describe all possible, or rather, expected paths. This is all but impossible in most nontrivial interfaces. Furthermore, the model must somehow deal with "noise" so that low-level events, such as mouse movements, won't mask otherwise significant correspondence between source sequences and the abstract model. Because these techniques typically have no built-in facilities for performing transformations on input traces, this implies that either the event stream has already been transformed, perhaps by manually instrumenting the application (as with EMA), or complexity must be introduced into the model to avoid sensitivity to "noise." In contrast, techniques such as EDEM and EBBA use selection and abstraction to pick out patterns of interest from the noise. The models need not be complete in any sense and may ignore events that are not of interest.

4.6 Sequence Characterization

4.6.1 Purpose. These techniques take source sequences as input and attempt to construct an abstract model to summarize, or characterize, interesting sequential features of those sequences. Some techniques produce a process model with probabilities associated with transitions [Guzdial 1993]. Others construct models that characterize the grammatical structure of events in the input sequences [Olson et al. 1994].

4.6.2 Examples. Guzdial describes a technique, based on Markov Chain analysis, that produces process models with probabilities assigned to transitions to characterize user behavior with interactive applications [Guzdial 1993]. First, the investigator identifies abstract stages, or states, of application use. In Guzdial's example, a simple design environment was the object of study. The design environment provided functions supporting the following stages in a simple design process: "initial review," "decomposition," "composition," "debugging," and "final review." The investigator then creates a mapping between each of the operations in the interface and one of the abstract stages. For instance, Guzdial mapped all debugging related commands (which incidentally all appeared in a single "debugging" menu) to the "debugging" stage.

The investigator then uses the Hawk tool to abstract and recode the event stream, replacing low level events with the abstract stages associated with them (presumably dropping events not associated with stages). The investigator then uses Hawk to compute the observed probability of entering any stage from the stage immediately before it to yield a transition matrix. The investigator can then use the matrix to create a process diagram with probabilities associated with transitions. In Guzdial's example, one subject was observed to have transitioned from "debugging" to "composition" more often (52% of all transitions out of "debugging") than to "decomposition" (10%) (see Figure 18). Guzdial then computed a steady state vector to reflect the probability of any event chosen at random belonging to each particular stage. He could then compare this to an expected probability vector (computed by simply calculating the percentage of commands associated with each stage) to indicate user "preference" for classes of commands.
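The transition-probability computation itself can be sketched as follows; the stage names follow Guzdial's example, but the recoded observation sequence is made up for illustration, and this is not Guzdial's Hawk script.

from collections import Counter, defaultdict

# Recoded stream of abstract stages (hypothetical observations).
stages = ["initial review", "decomposition", "composition", "debugging",
          "composition", "debugging", "composition", "final review"]

# Count observed transitions between consecutive stages.
transitions = Counter(zip(stages, stages[1:]))
totals = defaultdict(int)
for (src, _), n in transitions.items():
    totals[src] += n

# Observed probability of entering each stage from the stage before it.
for (src, dst), n in sorted(transitions.items()):
    print(f"{src:>15} -> {dst:<15} {n / totals[src]:.2f}")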

Olson and colleagues describe an approach, based on statistical and grammatical techniques, for characterizing the sequential structure of verbal interactions between participants in design meetings [Olson et al. 1994]. They begin by mapping meeting verbalizations into event categories that are then used to manually encode the transcript into an event sequence representation. The investigator then applies statistical techniques, including log linear modeling and lag sequential analysis, to identify potential dependencies between events in the sequence. The investigator then uses the results of these pattern detection techniques to suggest rules that might be included in a definite clause grammar to summarize, or characterize, some of the sequential structure of the meeting interactions. The investigator then uses the resulting grammar rules to rewrite some of the patterns embedded in the sequence (i.e., abstraction and recoding), and the new sequence is subjected to the same statistical techniques, leading to further iterative refinement of the grammar. The result is a set of grammar rules that provide insight into the sequential structure of the meeting interactions.

4.6.3 Related. Process discovery techniques are related in that they attempt to automatically generate a process model, in the form of a finite state machine, that accounts for a trace of events produced by a particular software process [Cook and Wolf 1995]. It is not clear how well these techniques would perform with data as noisy as UI events. A more promising approach might be to perform selection, abstraction, and recoding of the event stream prior to submitting it for analysis.

4.6.4 Strengths. The strength of these techniques lies in their ability to help investigators discover sequential structure within event sequences and to characterize that structure abstractly.

4.6.5 Limitations. The technique described by Olson and colleagues requires extensive human involvement and can be very time-consuming [Olson et al. 1994]. On the other hand, the automated techniques suggested by Cook and Wolf appear to be sensitive to noise and are less likely to produce models that make sense to investigators [Olson et al. 1994].

In our opinion, Markov-based models, while relying on overly simplifying assumptions, are more likely than grammar-based techniques to tell investigators something about user interactions that they don't already know. Investigators often have an idea of the grammatical structure of interactions that may arise from the use of (at least portions of) a particular interface. Grammars are thus useful in transforming low level UI events into higher level events of interest, or to detect when actual usage patterns violate expected patterns. However, the value of generating a grammatical or FSM-based model to summarize use is more limited. More often than not, a grammar- or FSM-based model generated on the basis of multiple traces will be vacuous in that it will describe all observed patterns of usage of an interface without indicating which are most common. While this may be useful in defining paths for UI regression testing, investigators interested in locating usability problems will more likely be interested in identifying and determining the frequency of specific observed patterns than in seeing a grammar to summarize them all.

Fig. 12. Use of "Edit" menu operations is indicated in black. Use of "Font" menu operations is indicated in grey. Used to display results of transformations or sequence detection.

4.7 Visualization

4.7.1 Purpose. These techniques present the results of transformation and analysis in forms allowing humans to exploit their innate visual analysis capabilities to interpret results. Some of these techniques are helpful in linking results of analysis back to features of the interface.

4.7.2 Examples. Investigators have proposed a number of techniques to visualize data based on UI events. For a survey of such techniques see [Guzdial et al. 1994]. Below are a few examples of techniques that have been used in support of the analysis approaches described above.

Transformation: The results of performing selection or abstraction on an event stream can sometimes be visualized using a timeline in which bands of color indicate different selections of events in the event stream. For example, one might use red to highlight the use of "Edit" menu operations and blue to highlight the use of "Font" menu operations in the evaluation of a word processor (Figure 12).

MacSHAPA [Sanderson and Fisher 1994] visualizes events as occupying cells in a spreadsheet. Event streams are listed vertically (in adjacent columns) and correspondence of events in one stream with events in adjacent streams is indicated by horizontal alignment (across rows). A large cell in one column may correspond to a number of smaller cells in another column to indicate abstraction relationships (Figure 13).

Fig. 13. Correspondence between events at different levels of abstraction is indicated by horizontal alignment. Single "Key" events in large cells correspond to "LostEdit," "GotEdit," and "ValueProvided" abstract interaction events in smaller, horizontally aligned cells.

Counts and summary statistics: There are a number of visualizations that can be used to represent the results of counts and summary statistics, including static 2D and 3D graphs, static 2D effects superimposed on a coordinate space representing the interface, and static and dynamic 2D and 3D effects superimposed on top of an actual visual representation of the interface.

The following are examples of static 2D and 3D graphs:
- Graph of keystrokes per window [Chang and Dillon 1997].
- Graph of mouse clicks per window [Chang and Dillon 1997].
- Graph of relative command frequencies [Kay and Thomas 1995] (Figure 14).
- Graph of relative command frequencies as they vary over time [Kay and Thomas 1995] (Figure 15).

Fig. 14. Relative command frequencies ordered by "rank."

Fig. 15. Relative command frequencies over time.

The following are examples of static 2D effects superimposed on an abstract coordinate space representing the interface:
- Location of mouse clicks [Guzdial et al. 1994; Chang and Dillon 1997].
- Mouse travel patterns between clicks [Buxton et al. 1983; Chang and Dillon 1997].

The following are examples of static and dynamic 2D and 3D effects superimposed on top of a graphical representation of the interface:
- Static highlighting to indicate location and density of mouse clicks [Guzdial et al. 1994].
- Dynamic highlighting of mouse click activity as it varies over time [Guzdial et al. 1994].
- 3D representation of mouse click location and density [Guzdial et al. 1994] (Figure 16).

Fig. 16. A 3D representation of mouse click density superimposed over a graphical representation of the interface.

Sequence detection: The same technique illustrated in Figure 12 can be used to visualize the results of selecting subsequences of UI events based on sequence detection techniques.

EDEM provides a dynamic visualization of the occurrence of UI events by highlighting nodes in a hierarchical representation of the user interface being monitored. A similar visualization is provided to indicate the occurrence of abstract events, defined in terms of abstract patterns of events, by highlighting entries in a list of agents responsible for detecting those patterns. These visualizations help investigators inspect the dynamic behavior of events, thereby supporting the process of event pattern specification [Hilbert and Redmiles 1998a].

Sequence comparison: As described above, MacSHAPA provides facilities for aligning two sequences of events as optimally as possible and presenting the results visually as cells in adjacent spreadsheet columns [Sanderson and Fisher 1994]. Aligned events appear in the same row and missing cells indicate events in one sequence that could not be aligned with events in the other (Figure 17).

Fig. 17. Results of an automatic alignment of two separate event streams. Horizontal alignment indicates correspondence. Black spaces indicate where alignment was not possible.

UsAGE provides a similar visualization for comparing sequences based on drawing a connected graph of nodes [Ueling and Wolf 1995]. The "expert" series of actions is displayed linearly as a sequence of nodes across the top of the graph. The "novice" series of actions is indicated by drawing directed arcs connecting the nodes to represent the order in which the novice performed the actions. Out of sequence actions are indicated by arcs that skip expert nodes in the forward direction or that point backwards in the graph. Unmatched actions taken by the novice appear as nodes (with a different color) placed below the last matched expert node.

Sequence characterization: Guzdial uses a connected graph visualization to illustrate the results of his Markov-based analysis [Guzdial 1993]. The result is a process model with nodes representing process steps and arcs indicating the observed probabilities of transitions between process steps (Figure 18).

4.7.3 Strengths. The strengths of these techniques lie in their ability to present the results of analysis in forms allowing humans to exploit their innate visual analysis capabilities to interpret results. Particularly useful are the techniques that link results of analysis back to features of the interface, such as the techniques superimposing graphical representations of behavior over actual representations of the interface.

4.7.4 Limitations. With the exception of simple graphs (which can typically be generated using standard graphing capabilities provided by spreadsheets and statistical analysis packages), most of the visualizations above must be produced by hand. Techniques for accurately superimposing graphical effects over visual representations of the interface can be particularly problematic.

4.8 Integrated Support

4.8.1 Purpose. Environments that facilitate flexible composition of various transformation, analysis, and visualization capabilities provide integrated support. Some environments also provide built-in support for managing domain-specific artifacts such as evaluations, subjects, tasks, data, and results of analysis.

4.8.2 Examples. MacSHAPA is perhaps the most comprehensive environment designed to support all manner of exploratory sequential data analysis (ESDA) [Sanderson et al. 1994]. Features include: data import and export; video and coded observation synch and search capabilities; facilities for performing selection, abstraction, and recoding; a number of built-in counts and summary statistics; features supporting sequence detection, comparison, and characterization; a general-purpose database query and manipulation language; and a number of built-in visualizations and reports.

Fig. 18. A process model characterizing user behavior, with nodes representing process steps and arcs indicating observed probabilities of transitions between process steps.

DRUM provides integrated features for synchronizing events, observations, and video; for defining and managing observation coding schemes; for calculating pre-defined counts and summary statistics; and for managing and manipulating evaluation-related artifacts regarding subjects, tasks, recording plans, logs, videos, and results of analysis [Macleod et al. 1993].

Hawk provides flexible support for creating, debugging, and executing scripts to automatically select, abstract, and recode event logs [Guzdial 1993]. Management facilities are also provided to organize and store event logs and analysis scripts.

Finally, ErgoLight Operation Recording Suite (EORS) and Usability Validation Suite (EUVS) [ErgoLight Usability Software 1998] offer a number of facilities for managing usability evaluations, both local and remote, as well as facilities for merging data from multiple users.

4.8.3 Strengths. The task of extracting usability-related information from UI events typically requires the management of numerous files and media types as well as the creation and composition of various analysis techniques. Environments supporting the integration of such activities can significantly reduce the burden of data management and integration.

Fig. 19. A key to interpreting the values listed in columns of Table III.

4.8.4 Limitations. Most of the environments above possess important limitations.

While MacSHAPA is perhaps the most comprehensive integrated environment for analyzing sequential data, it is not specifically designed for analysis of UI events. As a result, it lacks support for event capture and focuses primarily on analysis techniques that, when applied to UI events, require extensive human involvement and interpretation. MacSHAPA provides many of the basic building blocks required for an "ideal" environment for capturing and analyzing UI events; however, selection, abstraction, and recoding cannot be easily automated. Furthermore, because the powerful features of MacSHAPA cannot be used during event collection, contextual information that might be useful in selection and abstraction is not available.

While providing features for managing and analyzing UI events, coded observations, video data, and evaluation artifacts, DRUM does not provide features for selecting, abstracting, and recoding data.

Finally, while Hawk addresses the problem of providing automated support for selection, abstraction, and recoding, like MacSHAPA, it does not address UI event capture, and as a result, contextual information cannot be used in selection and abstraction.

5. DISCUSSION

5.1 Summary of the State of the Art

Synch and search techniques are among the most mature technologies for exploiting UI event data in usability evaluations. Tools supporting these techniques are becoming increasingly common in usability labs. However, these techniques can be costly in terms of equipment, human observers, and data storage and analysis requirements. Furthermore, synch and search techniques generally exploit UI events as no more than convenient indices into video recordings. In some cases, events may be used as the basis for computing simple counts and summary statistics using spreadsheets or statistical packages. However, such analyses typically require investigators to perform selection and abstraction by hand.

The other, arguably more sophisticated, analysis techniques such as sequence detection, comparison, and characterization continue to remain denizens of the research lab for the most part. Those that are most compelling tend to require the most human intervention, interpretation, and effort (e.g., exploratory sequential data analysis techniques and the Markov- and Grammar-based sequence characterization techniques). Those that are most automated tend to be least compelling and most unrealistic in their assumptions (e.g., ADAM, UsAGE, and EMA). One of the main problems limiting the success of automated approaches may be their lack of focus on transformation, which appears to be a necessary prerequisite for meaningful analysis (for reasons articulated in Section 3 and discussed further below).
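
For readers unfamiliar with sequence characterization, the following minimal sketch computes first-order transition counts over a hypothetical trace of abstract events, which is the kind of raw material a Markov-based characterization starts from. It illustrates the general idea only and is not the implementation of any surveyed approach.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TransitionCounts {
    public static void main(String[] args) {
        // Hypothetical trace of abstract events.
        List<String> trace = List.of(
            "OPEN_DIALOG", "EDIT_FIELD", "EDIT_FIELD", "CANCEL",
            "OPEN_DIALOG", "EDIT_FIELD", "OK");

        // Count transitions between successive events: from -> (to -> count).
        Map<String, Map<String, Integer>> transitions = new HashMap<>();
        for (int i = 0; i + 1 < trace.size(); i++) {
            transitions
                .computeIfAbsent(trace.get(i), k -> new HashMap<>())
                .merge(trace.get(i + 1), 1, Integer::sum);
        }

        transitions.forEach((from, tos) ->
            tos.forEach((to, n) ->
                System.out.println(from + " -> " + to + " : " + n)));
    }
}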

Nevertheless, few investigators have attempted to address the problem of transformation realistically. Of the twenty-five plus approaches surveyed here, only a handful provide mechanisms that allow investigators to perform transformations at all (Microsoft, SunSoft, Apple, Chen, CHIME, EDEM, Hawk, and MacSHAPA). Of those, fewer still allow models to be constructed and reused in an automated fashion (CHIME, EDEM, and Hawk). Of those, fewer still allow transformation to be performed in context so that important contextual information can be used in selection and abstraction (EDEM).

5.2 Some Anticipated Challenges

There is very little data published regarding the relative utility of the surveyed approaches in supporting usability evaluations. As a result, we have focused on the technical capabilities of the surveyed approaches in order to classify, compare, and evaluate them. To take this analytical evaluation a step further: our understanding of the nature of UI events (based on extensive Java, Windows, and X Windows programming experience) leads us to conclude that more work will likely be needed in the area of transforming the "raw" data generated by such event-based systems in preparation for other types of analysis in order to increase the likelihood of useful results. This is because most other types of analysis (including simple counts and summary statistics as well as sequence analysis techniques) are sensitive to lexical-level differences in event streams that can be removed via transformation, as illustrated in Section 3.2.

There are a number of ways that investigators have successfully side-stepped the transformation problem. For instance, building data collection directly into a user interface management system or requiring applications to report events themselves can help ameliorate some of the issues. However, both of these approaches have important limitations.

User interface management systems (UIMSs) typically model the relationships between application features and UI events explicitly, so reasonable data collection and analysis mechanisms can be built directly in, as in the case of the MIKE UIMS [Olsen and Halversen 1988], KRI/AG [Lowgren and Nordqvist 1992], and UsAGE [Uehling and Wolf 1995]. Because UIMSs have dynamic access to most aspects of the user interface, contextual information useful in interpreting the significance of events is also available. However, many developers do not use UIMSs; thus, a more general technique that does not presuppose the use of a UIMS is needed.

Techniques that allow applications to report events directly via an event-reporting API provide a useful service, particularly in cases where events of interest cannot be inferred from UI events. This allows important application-specific events to be reported by applications themselves and provides a more general solution than a UIMS-based approach. However, this places an increased burden on application developers to capture and transform events of interest, for example, as in Kay and Thomas [1995] and Balbo [1996]. This can be costly, particularly if there is no centralized command dispatch loop, or similar mechanism, that can be tapped as a source of application events. This also complicates software evolution since data collection code is typically intermingled with application code. Furthermore, there is much usability-related information not typically processed by applications that can be easily captured by tapping into the UI event stream, for instance, shifts in input focus, mouse movements, and the specific user interface actions used to invoke application features. As a result, an event-reporting API is just part of a more comprehensive solution.
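
The following sketch illustrates what application-side event reporting through such an API can look like, and why the reporting calls end up intermingled with application logic. The UsageLog facade, class names, and event names are entirely hypothetical and are not drawn from any of the systems cited above.

public class SpellCheckCommand {
    // Hypothetical event-reporting facade; a real collector would buffer
    // these records and transmit them somewhere for later analysis.
    static final class UsageLog {
        static void report(String event) {
            System.out.println(System.currentTimeMillis() + " " + event);
        }
    }

    public void execute(String documentText) {
        UsageLog.report("SPELL_CHECK_INVOKED");              // instrumentation
        int misspellings = countMisspellings(documentText);  // application logic
        UsageLog.report("SPELL_CHECK_DONE misspellings=" + misspellings);
    }

    private int countMisspellings(String text) {
        // Placeholder for the real application logic.
        return 0;
    }

    public static void main(String[] args) {
        new SpellCheckCommand().execute("teh quick brown fox");
    }
}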

Thus, we conclude that more work is needed in the area of transformation and data collection to ensure that useful information can be captured in the first place, before automated analysis techniques, such as those surveyed above, can be expected to yield meaningful results (where "meaningful" means the results can be related, without undue hardship, to aspects of the user interface and application being studied as well as users' actions at higher levels of abstraction than simple key presses and mouse clicks). A reasonable approach would assume no more than a typical event-based user interface system, such as provided by the Macintosh Operating System, Microsoft Windows, the X Window System, or the Java Abstract Window Toolkit, and developers would not be required to adopt a particular UIMS nor call an API to report every potentially interesting event.
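
For example, in the case of the Java Abstract Window Toolkit, a collector can tap the window system's own event stream without a UIMS and without per-event reporting calls in application code. The sketch below uses the standard Toolkit.addAWTEventListener facility; how the captured events are then selected and abstracted is left open, and the class and method names are our own.

import java.awt.AWTEvent;
import java.awt.Toolkit;
import java.awt.event.AWTEventListener;

public class AwtEventTap {
    public static void install() {
        AWTEventListener tap = (AWTEvent event) ->
            // A real collector would apply selection and abstraction here,
            // using the source component as context.
            System.out.println(event.getID() + " on "
                + event.getSource().getClass().getName());

        Toolkit.getDefaultToolkit().addAWTEventListener(
            tap,
            AWTEvent.MOUSE_EVENT_MASK
                | AWTEvent.KEY_EVENT_MASK
                | AWTEvent.FOCUS_EVENT_MASK);
    }
}

An application, or an agent loaded alongside it, would call AwtEventTap.install() once at startup; no further changes to application code are required.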

5.3 Related Work and Future Directions

There are a number of related techniques that have been explored, both in academia and industry, that have the potential of providing useful insights into how to more effectively exploit UI events as a source of usability information.

A number of researchers and practitioners have addressed related issues in capturing and evaluating event data in the realm of software testing and debugging:

• Work in distributed event monitoring, e.g., GEM [Mansouri-Samani and Sloman 1997], and model-based testing and debugging, e.g., EBBA [Bates 1995] and TSL [Rosenblum 1991], have addressed a number of problems in the specification and detection of composite events and the use of context in interpreting the significance of events. The event specification notations, infrastructure, and experience that have come out of this work might provide useful insights that can be applied to the problem of capturing and analyzing UI event data.

• Automated user interface testing techniques, e.g., WinRunner™ [Mercury Interactive 1998] and JavaStar™ [Sun Microsystems 1998], are faced with the problem of robustly identifying user interface components in the face of user interface change, and evaluating events against specifications of expected UI behavior in test scripts. The same problem is faced in maintaining the relationships between UI components and higher-level specifications of application features and abstract events of interest in usability evaluations based on UI events.

• Monitoring of application programmatic interfaces (APIs), e.g., Hewlett Packard's Application Response-time Measurement API [Hewlett Packard 1998], addresses the problem of monitoring API usage to help software developers evaluate the fit between the design of an API and how it is actually used. Insights gained in this area may generalize to the problem of monitoring UI usage to evaluate the fit between the design of a UI and how it is actually used.

• Internet-based application monitoring systems, e.g., AppScope™ [Aqueduct Software 1998] and Talkback™ [Full Circle Software 1998], have begun to address issues of collecting application failure data on a potentially large and ongoing basis over the Internet. The techniques developed to make this practical for application failure monitoring could be applicable in the domain of large-scale, ongoing collection of user interaction data over the Internet.

A number of researchers have addressed problems in the area of mapping between lower level events and higher level events of interest:

• Work in the area of event histories, e.g., Kosbie and Myers [1994], and undo mechanisms has addressed issues involved in grouping lower level UI events into more meaningful units from the point of view of users' tasks (a minimal illustration of such a grouping follows this list). Insights gained from this work, and the actual event representations used to support undo mechanisms, might be exploited to capture events at higher levels of abstraction than are typically available at the window system level.

• Work in the area of user modeling [User Modeling 1998] is faced with the problem of inferring users' tasks and goals based on user background, interaction history, and current context in order to enhance human-computer interaction. The techniques developed in this area, which range from rule-based to statistically-oriented machine-learning techniques, might eventually be harnessed to infer higher level events from lower level events in support of usability evaluations based on UI events.

• Work in the area of programming by demonstration [Cypher 1993] and plan recognition and assisted completion [Cypher 1991] also addresses problems involved in inferring user intent based on lower level interactions. This work has shown that such inference is feasible in at least some structured and limited domains, and programming by demonstration appears to be a desirable method for specifying expected or unexpected patterns of events for sequence detection and comparison purposes.

• Layered protocol models of interaction, e.g., Nielsen [1986] and Taylor [1988a; 1988b], allow human-computer interactions to be modeled at multiple levels of abstraction. Such techniques might be useful in specifying how higher level events are to be inferred based on lower level events. Command language grammars (CLGs) [Moran 1981] and task-action grammars (TAGs) [Payne and Green 1986] are other potentially useful modeling techniques for specifying relationships between human-computer interactions and users' tasks and goals.
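
As a minimal illustration of the grouping idea (our own sketch, not the representation used by Kosbie and Myers), the following composite event record lets a higher level event own the lower level events from which it was inferred; all names are hypothetical.

import java.util.List;

public class EventGroupingExample {
    // A higher level event that retains the lower level events composing it.
    record UiEvent(String name, List<UiEvent> children) {
        static UiEvent leaf(String name) {
            return new UiEvent(name, List.of());
        }
    }

    public static void main(String[] args) {
        // Two window-system-level events grouped into one task-level unit.
        UiEvent deleteWord = new UiEvent("DELETE_WORD", List.of(
            UiEvent.leaf("DOUBLE_CLICK word"),
            UiEvent.leaf("KEY_PRESS Delete")));
        System.out.println(deleteWord);
    }
}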

Work in the area of automated discovery and validation of patterns in large corpora of event data might also provide valuable insights:

• Data mining techniques for discovering association rules, sequential patterns, and time-series similarities in large data sets [Agrawal et al. 1996] may be applicable in uncovering patterns relevant to investigators interested in evaluating usage and usability based on UI events.

• The process discovery techniques investigated by Cook and Wolf [1995] provide insights into problems involved in automatically generating models to characterize the sequential structure of event traces.

• The process validation techniques investigated by Cook and Wolf [1997] provide insights into problems involved in comparing traces of events against models of expected behavior (a toy validation check is sketched after this list).
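
As a toy illustration of the validation idea (and not of Cook and Wolf's actual techniques), the sketch below checks a trace of hypothetical abstract events against a simple model of expected behavior: a dialog must be dismissed with OK or CANCEL before it can be opened again.

import java.util.List;

public class TraceValidator {
    public static void main(String[] args) {
        // Hypothetical trace; the final OPEN_DIALOG violates the model below.
        List<String> trace = List.of(
            "OPEN_DIALOG", "EDIT_FIELD", "OK",
            "OPEN_DIALOG", "EDIT_FIELD", "OPEN_DIALOG");

        // Model of expected behavior: a dialog must be dismissed with OK or
        // CANCEL before being opened again.
        boolean dialogOpen = false;
        for (int i = 0; i < trace.size(); i++) {
            String e = trace.get(i);
            if (e.equals("OPEN_DIALOG")) {
                if (dialogOpen) {
                    System.out.println("Deviation from model at event " + i + ": " + e);
                }
                dialogOpen = true;
            } else if (e.equals("OK") || e.equals("CANCEL")) {
                dialogOpen = false;
            }
        }
    }
}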

Finally, there are numerous domains in which event monitoring has been used as a means of identifying and, in some cases, diagnosing and repairing breakdowns in the operation of complex systems. For example:

• Network and enterprise management tools for automating network and application administration, e.g., TIBCO Hawk™ [TIBCO 1998].

• Product condition monitoring, e.g., high-end photocopiers or medical devices that report data back to equipment manufacturers to allow performance, failures, and maintenance issues to be tracked remotely [Lee 1996].

6. CONCLUSIONS

We have surveyed a number of computer-aided techniques for extracting usability-related information from UI events. Our classification scheme includes the following categories: synch and search techniques; transformation techniques; techniques for performing simple counts and summary statistics; techniques for performing sequence detection, comparison, and characterization; visualization techniques; and finally, techniques that provide integrated evaluation support.

Table IV. Index into the References Based on the Categories Established by the Comparison Framework

Synchronization and Searching: Playback [Neal & Simons 1983], Apple [Weiler 1993], SunSoft [Weiler 1993], Microsoft [Hoiem & Sullivan 1994], I-Observe [Badre et al. 1995]

Transformation: Incident Monitoring [Chen 1990], User-Identified CIs [Hartson et al. 1996], CHIME [Badre & Santos 1991], EDEM [Hilbert & Redmiles 1997]

Counts and Summary Statistics: UIMS [Buxton et al. 1983], MIKE [Olsen & Halversen 1988], KRI/AG [Lowgren & Nordqvist 1992], Long-Term Monitoring [Kay & Thomas 1995], AUS [Chang & Dillon 1997], EORS & EUVS [ErgoLight Usability Software 1998]. Related: AppScope [Aqueduct Software 1998], Talkback [Full Circle Software 1998]

Sequence Detection: LSA [Sackett 1978], Fisher's Cycles [Fisher 1988], TOP/G [Hoppe 1988], MRP [Siochi & Hix 1991], Expectation Agents [Girgensohn et al. 1994], EDEM [Hilbert & Redmiles 1997], USINE [Lecerof & Paterno 1998]. Related: TSL [Rosenblum 1991], Amadeus [Selby et al. 1991], YEAST [Krishnamurthy & Rosenblum 1995], EBBA [Bates 1995], GEM [Mansouri-Samani & Sloman 1997]

Sequence Comparison: ADAM [Finlay & Harrison 1990], UsAGE [Uehling & Wolf 1995], EMA [Balbo 1996], USINE [Lecerof & Paterno 1998], EUVS [ErgoLight Usability Software 1998]. Related: Process Validation [Cook & Wolf 1997]

Sequence Characterization: Markov-based [Guzdial 1993], Grammar-based [Olson et al. 1994]. Related: Process Discovery [Cook & Wolf 1995]

Integrated Support: MacSHAPA [Sanderson et al. 1994], DRUM [Macleod & Rengger 1993], Hawk [Guzdial 1993], EORS & EUVS [ErgoLight Usability Software 1998]

Very few of the surveyed approaches support transformation, which we argue is a critical subprocess in the overall process of extracting meaningful usability-related information from UI events.

Our current research involves exploring techniques and infrastructure for performing transformation and analysis automatically and in context in order to greatly reduce the amount of data that must ultimately be reported. It is an open question whether such an approach might be scaled up to large-scale and ongoing use over the Internet. If so, we believe that automated techniques, such as those surveyed here, will be useful in capturing indicators of the "big picture" regarding application use in the field. However, we believe that such techniques may be less suited to identifying subtle, nuanced usability issues. Fortunately, these strengths and weaknesses nicely complement the strengths and weaknesses inherent in current usability testing practice, in which subtle usability issues are identified through careful human observation, but in which there is little sense of the "big picture" of how applications are used on a large scale.

A usability professional from a large software development organization recently reported to us that the usability team is often approached by design and development team members with questions such as "how often do users do X?" or "how often does Y happen?" This is obviously useful information for developers wishing to assess the impact of suspected problems or to focus development effort for the next version. However, it is not information that can be reliably collected in the usability lab. We believe that automated usage data collection techniques will eventually complement traditional usability evaluation practice, not only by supporting developers as described above, but also in helping assess the impact of, and focusing the efforts of, usability evaluations.

REFERENCES

ABBOTT, A. 1990. A primer on sequence methods. Organ. Sci. 4.


AGRAWAL, R., ARNING, A., BOLLINGER, T., MEHTA, M., SHAFER, J., AND SRIKANT, R. 1996. The Quest data mining system. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining.

AHO, A. V., KERNIGHAN, B. W., AND WEINBERGER, P. J. 1988. The AWK Programming Language. Addison-Wesley, Reading, MA.

ALLISON, P. D. AND LIKER, J. K. 1982. Analyzing sequential categorical data on dyadic interaction: A comment on Gottman. Psychological Bulletin 2.

AQUEDUCT SOFTWARE. 1998. AppScope Web Pages. URL: http://www.aqueduct.com/.

BADRE, A. N. AND SANTOS, P. J. 1991a. CHIME: A Knowledge-Based Computer-Human Interaction Monitoring Engine. Tech. Rept. GIT-GVU-91-06.

BADRE, A. N. AND SANTOS, P. J. 1991b. A Knowledge-Based System for Capturing Human-Computer Interaction Events: CHIME. Tech. Rept. GIT-GVU-91-21.

BADRE, A. N., GUZDIAL, M., HUDSON, S. E., AND SANTOS, P. J. 1995. A user interface evaluation environment using synchronized video, visualizations, and event trace data. J. of Software Qual. 4.

BAECKER, R. M., GRUDIN, J., BUXTON, W. A. S., AND GREENBERG, S., Eds. 1995. Readings in Human-Computer Interaction: Toward the Year 2000. Morgan Kaufmann, San Mateo, CA.

BALBO, S. 1996. EMA: Automatic Analysis Mechanism for the Ergonomic Evaluation of User Interfaces. CSIRO Tech. Rep.

BATES, P. C. 1995. Debugging heterogeneous distributed systems using event-based models of behavior. ACM Trans. on Comput. Syst. 13, 1.

BELLOTTI, V. 1990. A framework for assessing applicability of HCI techniques. In Proceedings of INTERACT '90.

BUXTON, W., LAMB, M., SHERMAN, D., AND SMITH, K. 1983. Towards a comprehensive user interface management system. In Proceedings of SIGGRAPH '83.

CHANG, E. AND DILLON, T. S. 1997. Automated usability testing. In Proceedings of INTERACT '97.

CHEN, J. 1990. Providing intrinsic support for user interface monitoring. In Proceedings of INTERACT '90.

COOK, J. E. AND WOLF, A. L. 1994. Toward metrics for process validation. In Proceedings of ICSP '94.

COOK, J. E. AND WOLF, A. L. 1995. Automating process discovery through event-data analysis. In Proceedings of ICSE '95.

COOK, J. E. AND WOLF, A. L. 1997. Software Process Validation: Quantitatively Measuring the Correspondence of a Process to a Model. Tech. Rep. CU-CS-840-97, Dept. of Computer Science, Univ. of Colorado at Boulder.

COOK, R., KAY, J., RYAN, G., AND THOMAS, R. C. 1995. A toolkit for appraising the long-term usability of a text editor. Software Qual. J. 4, 2.

CUOMO, D. L. 1994. Understanding the applicability of sequential data analysis techniques for analysing usability data. In Usability Laboratories Special Issue of Behavior and Information Technology, J. Nielsen, Ed., Vol. 13, No. 1 & 2.

CYPHER, A. 1991. Eager: Programming repetitive tasks by example. In Proceedings of CHI '91.

CYPHER, A., Ed. 1993. Watch What I Do: Programming by Demonstration. MIT Press, Cambridge, MA.

DODGE, M. AND STINSON, C. 1999. Running Microsoft Excel 2000. Microsoft Press.

DOUBLEDAY, A., RYAN, M., SPRINGETT, M., AND SUTCLIFFE, A. 1997. A comparison of usability techniques for evaluating design. In Proceedings of DIS '97.

ELGIN, B. 1995. Subjective usability feedback from the field over a network. In Proceedings of CHI '95.

ERGOLIGHT USABILITY SOFTWARE. 1998. Operation Recording Suite (EORS) and Usability Validation Suite (EUVS) Web pages. URL: http://www.ergolight.co.il/.

FARAONE, S. V. AND DORFMAN, D. D. 1987. Lag sequential analysis: Robust statistical methods. Psychological Bulletin 101.

FEATHER, M. S., NARAYANASWAMY, K., COHEN, D., AND FICKAS, S. 1997. Automatic monitoring of software requirements. Research Demonstration. In Proceedings of ICSE '97.

FICKAS, S. AND FEATHER, M. S. 1995. Requirements monitoring in dynamic environments. In IEEE International Symposium on Requirements Engineering.

FINLAY, J. AND HARRISON, M. 1990. Pattern recognition and interaction models. In Proceedings of INTERACT '90.

FISHER, C. 1987. Advancing the study of programming with computer-aided protocol analysis. In Empirical Studies of Programmers, 1987 Workshop, G. Olson, E. Soloway, and S. Sheppard, Eds. Ablex, Norwood, NJ.

FISHER, C. 1991. Protocol Analyst's Workbench: Design and Evaluation of Computer-Aided Protocol Analysis. Unpublished Ph.D. thesis, Carnegie Mellon University, Dept. of Psychology, Pittsburgh, PA.

FISHER, C. AND SANDERSON, P. 1996. Exploratory sequential data analysis: Exploring continuous observational data. Interactions 3, 2, ACM Press.

FITTS, P. M. 1964. Perceptual motor skill learning. In Categories of Human Learning, A. W. Melton, Ed. Academic Press, New York, NY.

FULL CIRCLE SOFTWARE. 1998. Talkback Web pages. URL: http://www.fullsoft.com/.

GIRGENSOHN, A., REDMILES, D. F., AND SHIPMAN, F. M. III. 1994. Agent-based support for communication between developers and users in software design. In Proceedings of the Knowledge-Based Software Engineering Conference '94. Monterey, CA, USA.


GOODMAN, D. 1998. Complete HyperCard 2.2 Handbook. ToExcel.

GOTTMAN, J. M. AND ROY, A. K. 1990. Sequential Analysis: A Guide for Behavioral Researchers. Cambridge University Press, Cambridge, England.

GRUDIN, J. 1992. Utility and usability: Research issues and development contexts. Interacting with Comput. 4, 2.

GUZDIAL, M. 1993. Deriving Software Usage Patterns from Log Files. Tech. Rept. GIT-GVU-93-41.

GUZDIAL, M., WALTON, C., KONEMANN, M., AND SOLOWAY, E. 1993. Characterizing Process Change Using Log File Data. Tech. Rep. GIT-GVU-93-44.

GUZDIAL, M., SANTOS, P., BADRE, A., HUDSON, S., AND GRAY, M. 1994. Analyzing and visualizing log files: A computational science of usability. Presented at HCI Consortium Workshop.

HARTSON, H. R., CASTILLO, J. C., KELSO, J., AND NEALE, W. C. 1996. Remote evaluation: The network as an extension of the usability laboratory. In Proceedings of CHI '96.

HELANDER, M., Ed. 1998. Handbook of Human-Computer Interaction. Elsevier Science Publishers B.V., North Holland.

HEWLETT PACKARD. 1998. Application Response Measurement API. URL: http://www.hp.com/openview/rpm/arm/.

HILBERT, D. M. AND REDMILES, D. F. 1998a. An approach to large-scale collection of application usage data over the Internet. In Proceedings of ICSE '98.

HILBERT, D. M. AND REDMILES, D. F. 1998b. Agents for collecting application usage data over the Internet. In Proceedings of Autonomous Agents '98.

HILBERT, D. M., ROBBINS, J. E., AND REDMILES, D. F. 1997. Supporting Ongoing User Involvement in Development via Expectation-Driven Event Monitoring. Tech. Rept. UCI-ICS-97-19, Dept. of Information and Computer Science, Univ. of California, Irvine.

HIRSCHBERG, D. S. 1975. A linear space algorithm for computing maximal common subsequences. Commun. of the ACM 18.

HOIEM, D. E. AND SULLIVAN, K. D. 1994. Designing and using integrated data collection and analysis tools: Challenges and considerations. In Usability Laboratories Special Issue of Behavior and Information Technology, J. Nielsen, Ed., Vol. 13, No. 1 & 2.

HOPPE, H. U. 1988. Task-oriented parsing: A diagnostic method to be used by adaptive systems. In Proceedings of CHI '88.

JOHN, B. E. AND KIERAS, D. E. 1996a. The GOMS family of user interface analysis techniques: Comparison and contrast. ACM Trans. on Comput.-Hum. Interaction 3, 4.

JOHN, B. E. AND KIERAS, D. E. 1996b. Using GOMS for user interface design and evaluation: Which technique? ACM Trans. on Comput.-Hum. Interaction 3, 4.

KAY, J. AND THOMAS, R. C. 1995. Studying long-term system use. Commun. of the ACM 38, 7.

KOSBIE, D. S. AND MYERS, B. A. 1994. Extending programming by demonstration with hierarchical event histories. In Proceedings of East-West Human Computer Interaction '94.

KRISHNAMURTHY, B. AND ROSENBLUM, D. S. 1995. Yeast: A general purpose event-action system. IEEE Trans. on Software Eng. 21, 10.

LECEROF, A. AND PATERNO, F. 1998. Automatic support for usability evaluation. IEEE Trans. on Software Eng. 24, 10.

LEE, B. 1996. Remote diagnostics and product lifecycle monitoring for high-end appliances: A new Internet-based approach utilizing intelligent software agents. In Proceedings of the Appliance Manufacturer Conference.

LEWIS, R. AND STONE, M., Eds. 1999. Mac OS in a Nutshell. O'Reilly and Associates.

LOWGREN, J. AND NORDQVIST, T. 1992. Knowledge-based evaluation as design support for graphical user interfaces. In Proceedings of CHI '92.

MACLEOD, M. AND RENGGER, R. 1993. The Development of DRUM: A Software Tool for Video-assisted Usability Evaluation. In Proceedings of HCI '93.

MANSOURI-SAMANI, M. AND SLOMAN, M. 1997. GEM: A generalized event monitoring language for distributed systems. IEE/BCS/IOP Distributed Syst. Eng. J. 4, 2.

MERCURY INTERACTIVE. 1998. WinRunner and XRunner Web Pages. URL: http://www.merc-int.com/.

MORAN, T. P. 1981. The command language grammar: A representation for the user interface of interactive computer systems. Int. J. of Man-Machine Studies 15.

NEAL, A. S. AND SIMONS, R. M. 1983. PLAYBACK: A method for evaluating the usability of software and its documentation. In Proceedings of CHI '83.

NIELSEN, J. 1986. A virtual protocol model for computer-human interaction. Int. J. of Man-Machine Studies 24.

NIELSEN, J. 1993. Usability Engineering. Academic Press/AP Professional, Cambridge, MA.

NIELSEN, J. AND MACK, R. L., Eds. 1994. Usability Inspection Methods. Wiley, New York.

NYE, A. AND O'REILLY, T. 1992. X Toolkit Intrinsics Programming Manual for X11, Release 5. O'Reilly and Associates.

OLSEN, D. R. AND HALVERSEN, B. W. 1988. Interface usage measurements in a user interface management system. In Proceedings of UIST '88.

OLSON, G. M., HERBSLEB, J. D., AND RUETER, H. H. 1994. Characterizing the sequential structure of interactive behaviors through statistical and grammatical techniques. Hum.-Comput. Interaction Special Issue on ESDA, Vol. 9.


PAYNE, S. G. AND GREEN, T. R. G. 1986. Task-action grammars: A model of the mental representation of task languages. Hum.-Comput. Interaction, Vol. 2.

PENTLAND, B. T. 1994. A grammatical model of organizational routines. Administrative Sci. Quarterly.

PENTLAND, B. T. 1994. Grammatical models of organizational processes. Organ. Sci.

PETZOLD, C. 1998. Programming Windows. Microsoft Press.

POLSON, P. G., LEWIS, C., RIEMAN, J., AND WHARTON, C. 1992. Cognitive walkthroughs: A method for theory-based evaluation of user interfaces. Int. J. Man-Mach. Studies 36, 5, 741-773.

PREECE, J., ROGERS, Y., SHARP, H., BENYON, D., HOLLAND, S., AND CAREY, T. 1994. Human-Computer Interaction. Addison-Wesley, Wokingham, UK.

ROSENBLUM, D. S. 1991. Specifying concurrent systems with TSL. IEEE Software 8, 3.

RUBIN, C. 1999. Running Microsoft Word 2000. Microsoft Press.

SACKETT, G. P. 1978. Observing Behavior, Vol. 2. University Park Press, Baltimore, MD.

SANDERSON, P. M. AND FISHER, C. 1994. Exploratory sequential data analysis: Foundations. Hum.-Comput. Interaction Special Issue on ESDA, 9.

SANDERSON, P. M., SCOTT, J. J. P., JOHNSTON, T., MAINZER, J., WATANABE, L. M., AND JAMES, J. M. 1994. MacSHAPA and the enterprise of Exploratory Sequential Data Analysis (ESDA). International J. of Hum.-Comput. Studies 41.

SANTOS, P. J. AND BADRE, A. N. 1994. Automatic chunk detection in human-computer interaction. In Proceedings of Workshop on Advanced Visual Interfaces AVI '94. Also available as Tech. Report GIT-GVU-94-4.

SCHIELE, F. AND HOPPE, H. U. 1990. Inferring task structures from interaction protocols. In Proceedings of INTERACT '90.

SELBY, R. W., PORTER, A. A., SCHMIDT, D. C., AND BERNEY, J. 1991. Metric-driven analysis and feedback systems for enabling empirically guided software development. In Proceedings of ICSE '91.

SIOCHI, A. C. AND EHRICH, R. W. 1991. Computer analysis of user interfaces based on repetition in transcripts of user sessions. ACM Trans. on Inf. Syst.

SIOCHI, A. C. AND HIX, D. 1991. A study of computer-supported user interface evaluation using maximal repeating pattern analysis. In Proceedings of CHI '91.

SMILOWITZ, E. D., DARNELL, M. J., AND BENSON, A. E. 1994. Are we overlooking some usability testing methods? A comparison of lab, beta, and forum tests. In Usability Laboratories Special Issue of Behavior and Information Technology, J. Nielsen, Ed., Vol. 13, No. 1 & 2.

SUN MICROSYSTEMS. 1998. SunTest JavaStar Web Pages. URL: http://www.sun.com/suntest/.

SWEENY, M., MAGUIRE, M., AND SHACKEL, B. 1993. Evaluating human-computer interaction: A framework. Int. J. of Man-Machine Stud. 38.

TAYLOR, M. M. 1988a. Layered protocols for computer-human dialogue I: Principles. Int. J. of Man-Machine Stud. 28.

TAYLOR, M. M. 1988b. Layered protocols for computer-human dialogue II: Some practical issues. Int. J. of Man-Machine Stud. 28.

TAYLOR, R. N. AND COUTAZ, J. 1994. Workshop on software engineering and human-computer interaction: Joint research issues. In Proceedings of ICSE '94.

TIBCO. 1998. HAWK Enterprise Monitor Web Pages. URL: http://www.tibco.com/.

UEHLING, D. L. AND WOLF, K. 1995. User Action Graphing Effort (UsAGE). In Proceedings of CHI '95.

USER MODELING INC. (UM Inc.). 1998. Home Page. URL: http://um.org/.

WEILER, P. 1993. Software for the usability lab: A sampling of current tools. In Proceedings of INTERCHI '93.

WHITEFIELD, A., WILSON, F., AND DOWELL, J. 1991. A framework for human factors evaluation. Behav. and Inf. Technol. 10, 1.

WOLF, A. L. AND ROSENBLUM, D. S. 1993. A study in software process data capture and analysis. In Proceedings of the Second International Conference on Software Process.

ZUKOWSKI, J. AND LOUKIDES, M., Ed. 1997. Java AWT Reference. O'Reilly and Associates.

Received October 1998; revised July 1999; accepted March 2000
