
Applicability of Usability Evaluation Techniques to Aviation Systems

Michael Clamann
Booz Allen Hamilton
McLean, Virginia

David B. Kaber
Department of Industrial Engineering
North Carolina State University

The International Journal of Aviation Psychology, 14(4), 395–420. Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Requests for reprints should be sent to David B. Kaber, Department of Industrial Engineering, North Carolina State University, 2401 Stinson Drive, 328 Riddick Labs, Raleigh, NC 27695–7906. E-mail: [email protected]

Research in the field of human–computer interaction (HCI) has shown that early usability evaluation of human interfaces can reduce operator errors by optimizing functions for a specific population. HCI research has produced many methods for evaluating usability, which have proven effective in developing highly complex computer systems. Given the importance of the human in the loop in aviation systems, it is possible that advanced commercial cockpit and air traffic control systems may benefit from systematic application of usability research. This article identifies the special requirements of the aviation domain that will affect a usability evaluation and the characteristics of evaluation methods that may make them effective in this context. Recommendations are made of usability evaluation techniques, or combinations of techniques, most appropriate for evaluating complex systems in aviation technology.

The role of any system interface is to provide a dialog between the operator and the device. This dialog directs the actions of the operator to the device in the form of controls and from device to operator by converting raw data to useful information. If a system is usable, the dialog is intuitive and natural and allows the user to work in harmony with the system. There are numerous ways to evaluate and enhance the usability of a system. Some are empirical, others require the support of a usability expert, and some are best performed by groups. Ideally, these methods are applied during the development of the system to ensure a usable product. Conducting a usability evaluation for any system poses unique challenges.

In the complex, dynamic, tightly regulated environment of aviation, the challenge of performing a usability evaluation expands considerably in comparison to evaluation of traditional human–computer interaction (HCI) applications. Dramatic advances in technology and pressures to increase the number of commercial flights are driving factors in the evolution of interface controls on the flight deck and in air traffic control (ATC) workstations. As technological functions are added to already highly complex interfaces, which may increase operator workload, it is important that the user be considered in the design of the system. Furthermore, as the consequences of an error in aircraft piloting or ATC may be catastrophic, the importance of the human operator in the decision process is even more evident. For example, in December 1995, the pilots of an American Airlines Boeing 757 bound for Cali, Colombia, altered their landing approach from one that required them to pass over the airport and reverse course to one that involved a shorter, straight-on approach. In the process of interacting with automated systems used to navigate the aircraft's flight path, the pilots became so distracted by display interfaces and functions that they failed to notice they were heading directly toward El Deluvio, a mountain peak 10 miles east of the airport in the San Jose mountain range. At 9:38 p.m., the 757 struck the mountain at an elevation of 8,900 ft, killing all but 4 of the 163 passengers on board (Aeronautica Civil, 1996; Strauch, 1997).

In an industry where the human operator plays such a crucial role in the system, the effectiveness of systems usability evaluation is of significant importance. Current avionics manufacturers realize the importance of addressing usability issues in the systems design process, and they have become more sophisticated in their capabilities to deal with usability problems within the recent past. These developments are in large part due to federal regulations requiring human factors test plans for the development of new avionics systems. For example, Rockwell Scientific and other aircraft cockpit systems designers and developers are integrating sophisticated human factors evaluation methods, comparable to experimental methods used in human factors lab research, into their design processes to develop new advanced cockpit weather displays (W. E. Kelly, Research Scientist, Rockwell Scientific, personal communication, November 18, 2003). Despite this fact, there is very little organized information on methods that designers and manufacturers like this can use for evaluating the usability of highly complex systems (Abbott, Wise, & Wise, 1999). The purpose of this article is to identify usability evaluation techniques, or combinations of techniques, most appropriate for evaluating the complex systems used in aviation technology.

In this article we present an overview of two major types of aviation systems: those designed for aircraft cockpits and ATC. This includes a discussion of the environments and usability issues common in existing aviation systems. Potential problems in the development of new systems using existing evaluation processes are also discussed. We identify the existing requirements of regulators in terms of human factors test plan development for avionics certification and workload (usability) assessment. Beyond this, we review several current usability evaluation testing techniques, including usability inspection methods and usability testing. Some of the more widely used methods are presented, along with their advantages and disadvantages, to provide a basis for comparisons of the different techniques. In a final section, the information on aviation technology and usability evaluation techniques is integrated to determine which of the techniques best fits the context of current aviation systems development. We conclude with a summary of how manufacturers can use the recommendations on formal usability evaluation methods to develop effective and cost-effective human factors test plans for avionics systems design and development.

THE AVIATION SYSTEMS ENVIRONMENT

On April 14, 1990, an Indian Airlines Airbus A320 aircraft crashed just short of the runway at the Bangalore, India, airport, destroying the aircraft and killing 90 people on board. The investigators determined that the probable cause of the accident was the failure of the pilots to realize the gravity of the situation and immediately apply thrust. The pilots spent the final seconds of the flight trying to understand why the autoflight system was in idle/open descent mode rather than taking appropriate action to avoid impact with the ground. (p. 863)

As demonstrated in this example, described by Funk, Suroteguh, Wilson, and Lyall (1998), aviation systems are complex, dynamic environments where the consequences of errors can be disastrous. Interface design problems include the amount of information being presented and integration of interfaces. In aircraft, the number of stimuli in the cockpit, such as the densely packed controls and the output produced by warning indicators, status displays, flight path displays, ATC data links, weather information, navigation information, and communications, combined with a constant requirement for situation awareness (SA), contributes to the possibility of excessive mental workload for the pilot (Stokes & Wickens, 1988). When new technology is added to a flight deck system that actually reduces pilot workload, that reduction is often exploited by the introduction of another new system (Billings, 1997; Smith, 1999). Billings said the number and choice of displays present in the advanced commercial aircraft cockpit may only be restricted by the amount of available space, and the expansion of automation in the cockpit seems to be only limited by the imagination of designers, with little consideration for pilot performance consequences.

Air traffic controllers also have their own specialized problems. Although in most conditions pilots receive instant feedback from their environment, air traffic controllers work through a system representation and experience a delay when observing the effects of their command sequences (Billings, 1997). They also face distinct cognitive challenges. To forecast numerous flight paths simultaneously, controllers construct complex three-dimensional mental models or "pictures" from a two-dimensional graphical and text-based representation of the current aircraft layout combined with expectations of their routes. If this model is lost, it must be reconstructed through available tools, including computer systems, paper flight strips, and memory—a process that can take the controller as much as 15 to 20 min (Billings, 1997; Garland, 1999). The consequences of losing this model are described in a summary of events from Garland.

On February 1, 1991, a USAir Boeing 737–300 collided with a SkyWest Fairchild Metroliner on the runway at Los Angeles International Airport. The 737 was attempting to land while the smaller Metroliner was on the same runway waiting for takeoff clearance. The crash and the resulting fire destroyed both airplanes, and many passengers on both planes were killed or injured. The cause of the accident according to the NTSB was "controller error." More specifically, it was due to the local controller forgetting she had cleared the Metroliner into position on the same runway that she subsequently cleared for the 737.

Despite accidents like this one, there is still relatively little research on working memory in ATC (Garland, 1999) and the implications of memory failures for ATC performance. Furthermore, very little of the research that had been published at the time the current ATC system was designed was consulted during the system development process (Benel, 1998).

In the near future, air traffic controllers can expect to see additional automation of their workstations as the ATC system is updated (Hopkin, 1998). Many of the design decisions and methodologies for how to deliver these updates will come from lessons learned from changes made to other complex systems, such as the aircraft cockpit. If these new systems are designed based on past conventions (technology-centric automation design practices), it will be important to perform exhaustive human factors evaluations during the design process.

In 1996, the Federal Aviation Administration (FAA) Human Factors Study Team Report on the interaction between flight crews and modern cockpits cited that about 70% of the fatal accidents involving new-generation aircraft are due to human error potentially induced by the design of cockpit interfaces (Singer, 1999). According to Woods and Sarter (1993), problems that occur in interactions between users and complex systems are due in large part to inadequate feedback on the status, behavior, and intentions of the system; lack of visibility that prevents the user from constructing an effective model of the system's behavior; and designs that replace, rather than reduce, cognitive load. These problems can produce failures in attention allocation that can ultimately contribute to a loss of SA (Sarter, Woods, & Billings, 1997). Advanced ATC and flight deck systems tend to consist of multiple layers of automated technology that separate the operator from the device. Like similar complex systems (e.g., nuclear power plants), a fundamental problem with aviation systems is that the operators do not fully understand what their automated systems are doing, or even why (Billings, 1997). This may also lead to SA errors and overall system failures.

One interface that appears frequently in the human factors literature when examining weaknesses in the usability of cockpits is the Flight Management System (FMS). The FMS has been shown to be one of the weakest links in high-workload, dynamic operational situations (Singer, 1999). This system was originally designed to optimize flight paths, but over its 20-year history the features of the FMS have grown substantially (calculations for position, speed, and wind; management of navigation data sources; trajectory calculation; flight path predictions; remaining fuel calculations; flight time calculations; error determination; steering and control command generation; map computations; etc.; see Billings, 1997). Despite the complexity of the information, the interface itself has remained relatively static, resulting in a device that alone requires about 1,200 hr of experience for a pilot to achieve a real understanding (Riley, 2001). In a survey conducted by Sarter and Woods (1992), 55% of pilots with over 1 year of experience reported unexpected behavior from the FMS, and 20% did not understand all of the modes and features. Even though pilots can make the FMS work, they are often not able to explain why.

One of the main problems identified with the FMS is that its command sequences do not integrate with the pilot's procedures (Riley, 2001). They must be considered and executed separately, in a distinct series of operations, in order to prepare a single flight plan. This can take 10 to 15 min, even for an experienced pilot. In general, the FMS is a good example of a system that has features defined from an engineering perspective rather than a user's perspective; that is, all of the required features are available, but they can be difficult to execute. The design of the FMS is only a single representation of the usability flaws in some aviation systems. Funk et al. (1998) also specifically cited autopilot and auto-thrust systems, controls for the brakes and spoilers, and the electronic horizontal situation indicator as devices playing a role in recent incidents and accidents.

Contemporary ATC workstations have significantly less automation than flight deck systems. However, given the workload requirements experienced by controllers, several opportunities exist for the application of new and emerging technologies. Benel (1998) listed new decision aids to support strategic planning, global positioning system technologies for more accurate aircraft position information, digital data links to the flight deck, voice as an input and output device, routing calculations based on fuel consumption, automated problem detection and resolution, airspace and runway sequencing and spacing (including runway conflict detection), and new methods for data visualization as prospects for enhancing ATC. Any number of new problems could emerge from these new systems if not examined carefully. For example, introducing a data link to automate information transfer between pilots and controllers may offload some controller workload, but it may also remove important nonverbal cues from the transmission that communicate urgency in a situation (Benel, 1998).

Lack of standardization of aviation system interfaces also contributes to errors (Singer, 1999). Interfaces can vary from one environment to another, making it difficult for pilots and air traffic controllers to transfer their skills to different cockpits or ATC sectors or stations. For example, different versions of the FMS software may exist across aircraft of one type or across types of vehicles, presenting a major challenge for pilots in terms of adapting to different interface states (Kaber, Riley, & Tan, 2002). In addition, the automation in many aviation systems does not necessarily transfer to a variety of types of operators and skill levels working under varied conditions. This lack of automation accommodation of all operators may be attributed to the fact that systems are generally evaluated by experts with several years' experience, so the needs of the less experienced operator may not be considered. With such high consequences for errors in system operation, combined with the ultimate responsibility being given to the operator, it is imperative that new systems and new components for existing systems be evaluated not only for their functions but also for their relative usability.

The Design Process

According to Sexton (1988), until the last 30 years, the modernization of aviation systems has often been a process in which controls and displays are updated piecemeal for individual systems. More displays are added to cockpits without re-engineering the control array and are fit where there is still available space. Over time, this has resulted in a complex array of knobs, switches, and displays that does not necessarily integrate intuitively. Until the 1970s, flight deck technology did not vary dramatically from that of the 1930s.

Due to the capabilities of new electronic technologies to integrate these displays and controls ("glass" cockpit technologies), a rapid increase in changes to cockpit controls and displays began in the mid- to late 1970s. Despite this sudden increase in technological advances, the design and implementation methodologies employed for introducing new components and retrofitting systems by many aircraft manufacturers during this period were similar to those used in the 1960s (Sexton, 1988). Complex integrated displays were designed using methodologies developed for analog gauges. These methods produced cockpit designs like that found in the Boeing 757 and 767, which use six integrated cathode ray tube displays instead of the clusters of single instruments of older models, but, according to Sexton (1988), they did not demonstrate any substantial reduction in pilot workload.

The same development environment also produced systems such as the FMS. Although the FMS usability problems just identified have existed for some time and still represent concerns for crews and for human factors and safety professionals working in the aviation industry, several manufacturers and the FAA have recognized the importance of dealing with such usability issues, particularly as a means for managing crew workload, and are currently conducting research and developing enhanced design processes for future cockpit development (e.g., Feary, Sherry, Polson, & Fennel, 2003).

ATC has experienced a similar design evolution. As described by Benel (1998), until the 1970s, enhancements to the ATC system took the form of a series of upgrades, including radar, flight strip printing, and automated processing of flight data. A traditional linear development cycle was used to design these legacy systems, which addressed the designers' requirement to produce a system efficiently rather than the needs of the users to have a usable system. Task analyses were based on user statements that the developers then used for the design, with only limited ongoing feedback from operators. Furthermore, in the early decisions to allocate functions between the operator and the system, human functions were not chosen because they were matched to human abilities but because they were infeasible for the existing technology (Hopkin, 1988).

With many traditional approaches to system design, one important factor that is frequently left out of the process is the early involvement of the end user (pilot or air traffic controller) or a usability expert in the design process (Stein, 2000). This is obviously not the case with all aviation systems manufacturers. Early involvement of users is crucial to the design process because the cost of making revisions increases dramatically as the development of the system proceeds. If domain experts find usability problems with the system after functional prototypes have been completed, it is much less likely that the problems will be corrected, and the burden will be passed to the users during the training process (Sarter et al., 1997; Stein, 2000).

Several contemporary aviation system designers have realized the need to consider the user in the design process and have proposed participatory design methodologies. Advances in new technologies encourage participatory design through the availability of high-fidelity prototyping. Several methodologies for participatory design of aviation systems, including complete feedback loops and testing, do exist (Benel, 1998; Sexton, 1988; Williges, Williges, & Fainter, 1988); however, they are not applied consistently across manufacturers and domains. For example, beginning in the 1980s, Lockheed-Georgia Company made use of its own participatory design process for new systems, retrofits, and simulators of modern aircraft. The design process, described by Sexton (1988), follows this methodology:

1. Design team selection: Include a pilot familiar with the proposed system, human factors engineers, mechanical design engineers, and avionics engineers. Each member of the design team must be involved full time with the project.

2. Mission analysis: Obtain, forecast, and determine information on the proposed aircraft with respect to user needs, operating environment, procedures, and new technologies. Engineers from each area present forecasts for technology levels for their area of expertise. This information is then used in preparation of documented mission scenarios, which detail the environments and situations in which the new equipment will be used.

3. Design: Team members conceptualize the aircraft design by using the mission scenarios as drivers. The forecasts are also used as design considerations. By involving all members of the team, technological trade-offs can be agreed on. For example, the human factors engineer can ensure that the design is not developed so as to benefit hardware efficiency at the cost of the interface. A full-scale mock-up is developed in this stage so the team can easily reconfigure it.

4. Test: The configuration is tested with real pilots using scripts that expose subjects to prescribed situations. The tested design is then integrated in a flight simulator.

The availability of powerful computer hardware and rapid prototyping tools has simplified the process of designing mock-ups. Foam and paper interfaces and scripts have been replaced with graphical simulations and virtual reality scenarios during the design and test phases (Sarter & Woods, 1992).

Despite the existence of such contemporary design processes that prescribe the input of experts, such as the one used by Lockheed, and the usefulness of high-fidelity prototypes, in general it is not apparent from the designs of contemporary cockpit and ATC interfaces, and the number of aviation mishaps and accidents attributed to pilot error, that formal usability methods are very widely used throughout industry. On the basis of published research, some other manufacturers do use rapid prototyping of user interfaces; walk-through evaluations with pilots; and more formal user evaluations, including human factors experiments on control configuration, panel layout, and menu design (e.g., Honeywell; Riley, 2001). NASA recently formulated a rapid usability evaluation approach for Boeing's future cockpit interface development (Feary et al., 2003). However, in those companies that do conduct usability testing, unfortunately, the process may be seen as one of the reasons for slow adoption of new technologies in the flight deck because of the time required for a thorough assessment of an interface. This further supports the need for effective and economical approaches for application in the aviation context.

Certification: Requirements of Regulators

Because of the public interest in civil aviation, the current certification requirements for aviation systems are fairly strict. According to Singer (1999), most countries follow the rules established by the FAA and the Joint Airworthiness Authorities. Both the airworthiness requirements (regulating design) and the operational requirements (defining proper use) must be fulfilled to legally operate a commercial aircraft. The documented set of airworthiness requirements is called Federal Aviation Regulations, Part 25 (U.S. Department of Transportation/FAA, 1995), and aircraft certification is a well-structured process summarized by the following steps (Singer, 1999):

1. A test description is provided to the certification authority. It defines new functions, lists the design requirements they fulfill, and lays out how compliance will be shown (e.g., simulation, flight test, pilot evaluation).

2. Iterative testing is performed in the development testing phase. Pilots and engineers test functions, system logic, and the display.

3. When testing is complete, the designers must submit a certification test plan describing the evaluation process.

4. Test pilots evaluate the fit, form, and function of the system, including the ergonomics of the interface and the overall layout. This process is very subjective.

5. A certification test report is then prepared based on the certification test plan, showing how each item has been successfully tested. Compliance can be shown either by testing in a flight simulator or by testing individual components in a lab.

6. Crew evaluation certification reports are then prepared for qualitative requirements not covered by the engineers.

When all of these documents are approved, the system is declared airworthy. One of the main reasons for the design of the current certification process is that the FAA is attempting to lead manufacturers toward more rigorous "usability" assessments of cockpit interface designs. There is some evidence of this in current certification process documents, which identify common usability issues (accessibility of controls, ease of use, salience of displays) as factors in pilot workload.

Although the airworthiness and operational requirements cover a broad scope of topics, the verbiage of the paragraphs defining human factors requirements is very general. In many cases, these paragraphs were written for cockpits consisting of levers and buttons, not dynamic electronic screens. Furthermore, the requirements primarily address concentration and fatigue issues but do not present tolerable levels for either, and they do not address the higher levels of pilot information processing that may be compromised by unusable interface design. Overall, the requirements are general, do not provide much assistance in making engineering decisions for designing aviation interfaces, and do not establish quantitative levels for compliance.

Singer (1999) said that workload, fatigue, ease, and simplicity are meaningless without a frame of reference to define their tolerable levels. Based on the content of the FAA's Advisory Circular (AC) 25.1523–1 (U.S. Department of Transportation/FAA, 1993), the agency considers workload to be defined by the functional requirements of flight, including the number, urgency, and complexity of procedures as well as the duration of mental and physical effort in normal operation. In addition, the FAA considers (usability) issues of accessibility and ease of use of controls to be factors in workload. Specific measures for assessing these facets of cockpit workload are not provided, and there are no criteria in terms of the number of tasks or levels of, for example, concentrated mental effort by which to make decisions regarding interface designs toward moderating pilot load. It is possible that the FAA has not defined threshold values for workload, concentration, fatigue, and so on because there are few, if any, measures of these factors that are completely theoretically and empirically defensible and for which threshold values can be set.

The certification process is defined for systems examined in isolation; it does not provide structure for including the human operator. Currently, certification authorities evaluate flight deck workload in comparative terms based on earlier airplane certifications, a process that is carried out by experienced FAA certification pilots who consider multiple flight conditions (Billings, 1997; U.S. Department of Transportation/FAA, 1993). This is the primary approach defined for establishing minimum flight crew requirements (and related human factors issues) as part of the certification process. However, the FAA does permit multiple methods for manufacturer compliance to allow a range of companies and methods to achieve the requirements. For example, analytical methods (e.g., task analyses, timeline evaluations) can be used to assess new devices primarily impacting flight operation procedures, and simulator demonstrations are considered acceptable for showing compliance, provided there are sufficient substantiating data on the validity and reliability of the assessment approach. Final decisions on workload implications of new models of equipment are normally based on testing with qualified and trained pilots ("line pilots," who fly the same aircraft, are recommended by the FAA because they can make direct comparisons with operational experience), unless traditional testing methods cannot be used with the new design.

Certification requires that a piece of hardware or software perform its intended function. This does not necessarily mean that the equipment is user friendly. Meeting minimum requirements does not ensure a good or even a safe system (Stein, 2000). Even though the requirements for performance of the system itself are well established, the usability evaluation of aviation system interfaces is not well defined, and each vendor is free to adopt an independent philosophy in showing compliance. The requirements do not identify specific usability evaluation methods, metrics, or acceptable and unacceptable levels of performance. Appendix D of AC 25.1523–1 does provide some additional detail on what the FAA considers significant criteria for determining crew make-up—for example, ensuring conspicuity of instruments, ensuring adequate instrument feedback to direct pilot behavior, identifying additional duties of crew members that remove them from their normal duties, and identifying the level of automation or operator monitoring required to guard against flight control and essential system failures. There is little supplemental guidance on methods to use and specific evaluation criteria to assess workload and usability to prevent human error with new cockpit devices.

In the United States, the FAA also has the responsibility of certifying ATC systems. Stein (2000) provided an overview of the certification process, in which experienced ATC specialists and evaluation personnel perform the certification with little input from human factors experts. If human factors experts are employed, it is usually toward the end of development, when they can do little to affect the system. The responsibilities of certification for ATC systems are defined in FAA Order 6000.39, which states that certification means only that the required system functions have been accomplished; it does not ensure well-designed or usable equipment, nor does it require human factors involvement. This does not mean that human factors input was not used in the design of some systems, but it does show that the burden of defining measures for usability will ultimately fall on those who design the system. To date, as with commercial cockpit systems design, specific usability measures have yet to be determined (Stein, 2000).

The certification process consists of numerous general guidelines for designing technologies that are expected to be valid for all types of aviation systems. The information presented really represents (workload/usability) guidelines rather than regulations, as there are no specific goals defined for new cockpit technologies as part of the "requirements." To be fully valid, the design guidelines should be proposed in the context of the particular system being designed to meet specific requirements, subject to specific constraints (Billings, 1997). Such requirements would be based either on established human factors guidelines or on system guidelines confirmed to be effective for human-centered design. In either case, the constraints for the requirements should be evaluated using validated metrics.

USABILITY EVALUATION TECHNIQUES

The role of a usability evaluation is to confirm that an interactive system is behaving as expected in a way that makes sense to the user. Typically, this means finding usability problems in an interface (Nielsen & Mack, 1994). The current literature on HCI describes many methods for this type of evaluation. In general, techniques can be grouped into two major classes, usability testing and usability inspection (Virzi, 1997). Testing involves formal experimental evaluation of an interface, whereas inspection covers a variety of informal methods that can also be classified as walkthrough and nonwalkthrough (Newman & Lamming, 1995). Walkthrough techniques typically analyze sequences of events in an interface, determined by user goals. Examples include the cognitive walkthrough and model-based evaluation. Nonwalkthrough techniques do not require an evaluation sequence based on the task. Review-based evaluation and heuristic analysis are included in this category. Usability tests may include aspects of both walkthrough and nonwalkthrough techniques, depending on the design of the experiment. Here we review the more commonly used techniques.

The cognitive walkthrough is a technique developed to show how easy a system is to learn. It is typically applied to systems with the expectation that users have little or no prior training (Newman & Lamming, 1995); however, the methodology has previously been adapted to the aircraft cockpit (Polson & Smith, 1999) by identifying and attempting to analyze categories of domain activities. The method coordinates analysis by several experts of the user's task procedures with a focus on interface usefulness, usability, visibility, and feedback (Dix, Finlay, Abowd, & Beale, 1998; Virzi, 1997). Experts simulate expected behaviors of users, interpret system responses to task performance, and determine whether users would choose the correct course of interface actions (Newman & Lamming, 1995). The method also has some variations—for example, the cognitive jogthrough, which streamlines the normally time-consuming process of evaluating the interface by using groups of experts in the process (Rowley & Rhoades, 1992).

Model-based evaluation techniques, such as Goals, Operators, Methods, and Selection rules (GOMS), use models of the user to predict performance on specific tasks using an interface (Virzi, 1997). A GOMS model is the result of a task decomposition fit into a specific structure. The models make assumptions about user information processing and are effective for describing expert performance in routine tasks (Dix et al., 1998). GOMS models have been used successfully to define improvements to task training protocols and interfaces in aviation systems (Irving, Polson, & Irving, 1994).
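
To make the modeling idea concrete, the sketch below estimates expert execution time for a hypothetical waypoint-entry task using a keystroke-level simplification of GOMS. The operator times are commonly cited approximations from the keystroke-level model literature, and the task sequence is an illustrative assumption rather than a decomposition of any particular FMS interface.

```python
# Minimal keystroke-level (simplified GOMS) sketch.
# Operator times are rough, commonly cited approximations; the task sequence
# for "enter a waypoint on an FMS-like keypad" is hypothetical.

OPERATOR_TIMES = {
    "K": 0.28,  # press one key on the keypad
    "P": 1.10,  # point to (visually locate) a key or line-select button
    "H": 0.40,  # move the hand between the keypad and line-select keys
    "M": 1.35,  # mental preparation (recall or decide the next step)
}

def execution_time(sequence: str) -> float:
    """Sum operator times for a sequence such as 'M K K K K K H M P K'."""
    return sum(OPERATOR_TIMES[op] for op in sequence.split())

# Hypothetical method: recall the waypoint, type a 5-character identifier,
# move to the line-select keys, decide where it goes, point, and insert it.
enter_waypoint = "M K K K K K H M P K"

if __name__ == "__main__":
    print(f"Predicted expert execution time: {execution_time(enter_waypoint):.2f} s")
```

Summing the listed operators gives a single predicted time for routine, error-free expert performance, which is exactly the kind of output (and the kind of limitation) the comparison studies cited later attribute to GOMS.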

Heuristic evaluation involves a systematic inspection of an interface, usually early in the design cycle, by a small group of evaluators. The evaluators use a general set of design principles (Dix et al., 1998), such as

1. Visibility of system status.
2. Match between system and real world.
3. User control and freedom.
4. Consistency and standards.
5. Error prevention.
6. Recognition rather than recall.
7. Flexibility and efficiency of use.
8. Aesthetic and minimalist design.
9. Help users recognize, diagnose, and recover from errors.
10. Help and documentation.

The evaluators produce lists of usability problems with the interface by comparison to the set of principles or heuristics (Nielsen & Phillips, 1993). This technique was originally designed as a means for nonexperts to perform a usability analysis, but the results can be enhanced by the involvement of a usability expert (Virzi, 1997). It can also be performed at any point within the development life cycle, with or without the use of the actual interface, although the more realistic the information, the more relevant and valuable the results. Heuristic evaluations have been used to produce what are called "cold," "warm," or "hot" estimates of system conformance with usability principles (Nielsen & Phillips, 1993), depending on the level of access the evaluators have to the system. Hot estimates are performed on the implementation, or final version, of a system. Warm estimates require a functioning prototype, and cold estimates can be conducted on a written specification. Because usability problems can be examined early in the design cycle, changes to a design may be less costly for a team, as compared to using other methods, like GOMS, that may require an actual interface prototype (Feary et al., 2003).
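
As one illustration of how such problem lists might be consolidated, the sketch below merges hypothetical reports from several evaluators and ranks them by how many evaluators flagged each problem and by an assumed severity rating. The heuristic names follow the list above, but the problems, severity scale, and ranking rule are illustrative assumptions, not a procedure prescribed by the sources cited here.

```python
# Consolidate hypothetical heuristic evaluation findings from several evaluators.
# Each finding: (evaluator, heuristic violated, problem description, severity 1-4).
from collections import defaultdict

findings = [
    ("evaluator_1", "Visibility of system status", "Mode change is not annunciated", 4),
    ("evaluator_2", "Visibility of system status", "Mode change is not annunciated", 3),
    ("evaluator_2", "Recognition rather than recall", "Page codes must be memorized", 3),
    ("evaluator_3", "Match between system and real world", "Labels use engineering jargon", 2),
    ("evaluator_3", "Visibility of system status", "Mode change is not annunciated", 4),
]

def consolidate(findings):
    """Group duplicate problems and rank by independent reports, then mean severity."""
    grouped = defaultdict(lambda: {"evaluators": set(), "severities": []})
    for evaluator, heuristic, problem, severity in findings:
        entry = grouped[(heuristic, problem)]
        entry["evaluators"].add(evaluator)
        entry["severities"].append(severity)
    return sorted(
        grouped.items(),
        key=lambda kv: (len(kv[1]["evaluators"]),
                        sum(kv[1]["severities"]) / len(kv[1]["severities"])),
        reverse=True,
    )

if __name__ == "__main__":
    for (heuristic, problem), info in consolidate(findings):
        mean_sev = sum(info["severities"]) / len(info["severities"])
        print(f"[{heuristic}] {problem}: {len(info['evaluators'])} report(s), "
              f"mean severity {mean_sev:.1f}")
```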

Of course, usability can also be evaluated through direct observation of users while they interact with the system (Dix et al., 1998). Users may be asked to describe interface actions to nonexperts during task performance (verbal protocols), as in cooperative evaluation, or after a task, as in posttask walkthroughs. Protocol analysis can also be used to evaluate behavior by capturing the users' actions through a variety of recording media, including audiovisual recordings, automated computer logs, or even paper-and-pencil notes and user diaries (Dix et al., 1998).

Differences Among Inspection Methods

Virzi (1997) categorized inspection methods along three dimensions: the characteristics of the evaluators, the number of evaluators required, and the goals of the inspection. With respect to the first dimension, the level and type of expertise of the judges affect the outcome of the evaluation. Factors include the amount of usability expertise, the degree of domain knowledge, prior design experience, and academic training. In all cases, a greater level of experience, specifically in the particular domain, will increase the value of the results of the inspection. Experts with usability experience and domain knowledge may be able to find more usability problems than experts with only usability experience. However, there is a trade-off in that the level of expert training usually has a direct effect on the cost of the evaluation.

Techniques such as GOMS, heuristic evaluation, and cooperative evaluation do not rely on a usability professional with expert-level experience, whereas the cognitive walkthrough and usability tests (discussed next) do require the involvement of one or more experts. Within these methods, especially those requiring less evaluator experience, expertise can also be combined to affect the outcome. In at least one study, researchers found that the involvement of a usability expert greatly improved the results of a usability inspection method designed for nonexperts (Jeffries, Miller, Wharton, & Uyeda, 1991).

The number of evaluators working during a single session is Virzi's (1997) second dimension. Some evaluation techniques require independent work, whereas others use groups. Combining the analyses of several evaluators after completing individual sessions can make the results of a usability inspection far more comprehensive, especially when the level of experience of the evaluators is lower. Nielsen (1994) said the number of errors observed during a heuristic evaluation increases as the number of expert evaluators increases; two evaluators will find 50% of the usability problems, and 3 will find about 60%. This figure starts to level off at about 5 evaluators, so to locate 90% of problems, approximately 15 expert evaluators would be required (Nielsen, 1994). The degree of complexity of a system will also affect the number of evaluators involved in an analysis, with more evaluators needed for a more complex interface.
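
Figures like these follow the general shape of the problem-discovery curves reported in the heuristic evaluation literature, in which the proportion of problems found by n evaluators is often modeled as 1 - (1 - p)^n, where p is the average proportion found by a single evaluator. The sketch below plots that relationship for an assumed p; because p varies by system and evaluator pool, the computed percentages are illustrative and will not reproduce every figure quoted above.

```python
# Problem-discovery model often used to reason about evaluator counts:
# proportion found by n evaluators = 1 - (1 - p) ** n,
# where p is the share of problems a single evaluator finds on average.
# p = 0.29 is an assumed, illustrative value, not a figure from Nielsen (1994).

def proportion_found(n_evaluators: int, p: float = 0.29) -> float:
    return 1.0 - (1.0 - p) ** n_evaluators

if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10, 15):
        print(f"{n:2d} evaluator(s) -> {proportion_found(n):.0%} of problems (modeled)")
```

The curve rises quickly for the first few evaluators and then flattens, which is the basis for the common recommendation to use small evaluator groups and accept diminishing returns beyond that.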

It can also be advantageous to combine the opinions of professionals in separate areas of a development process, such as computer scientists, graphic artists, and engineers, to see how the different aspects of a system interact. In group design reviews, such experts work together to locate usability problems (Virzi, 1997).

The last of Virzi's (1997) dimensions is the goal of the inspection. Many types of usability inspections attempt to uncover usability problems, whereas others focus on the extent to which the system supports specific aspects of use. For example, heuristic evaluation and cooperative evaluation are designed specifically to find usability problems. Heuristic evaluation, however, is traditionally not effective at evaluating user performance (Nielsen & Phillips, 1993). Walkthroughs, in contrast, are designed to evaluate ease of learning, whereas GOMS models evaluate performance times and disregard errors and learnability issues. The selection of an inspection method should be determined by the type of usability data required. This point demonstrates the importance of establishing usability goals early in the process, prior to selecting the technique. The measurements that define the system's usability should determine the methods employed and the evaluation team selected (Virzi, 1997). Selecting the method in advance will likely constrain the results.

Despite the advantages of usability inspections, there is no replacement for a formal laboratory test. Because inspections may not include the actual users, an important contributor is missing from this category of evaluation technique. Usability tests are also more efficient in terms of cost: compared to usability testing, inspections cost more on a per-problem basis, although inspections are more effective at finding smaller usability problems. Virzi (1997) said that instead of being used exclusively, usability inspection should be either a precursor or an enhancement to an empirical usability test.


Usability Testing

Testing involves a controlled experiment complete with a working hypothesis, test participants, formal procedures, and statistical analysis (Wixon & Wilson, 1997). The imagination and resources of the experimenter determine the scope and level of detail. Testing can be conducted on system use in either a lab or a real-world setting (Wixon & Wilson, 1997). The main purpose of testing is to answer a specific question about the interface, such as the amount of time required to perform a task or the number of errors encountered. Subjective data, such as user satisfaction, can also be collected and evaluated.

An experiment is designed to evaluate whether the interface meets the criteria established in the form of a usability goal defined at the outset of the project (e.g., satisfaction, learnability). Prior to beginning, all goals are assigned quantitative levels and matched to usability metrics in the context of the system being tested (Wixon & Wilson, 1997). The primary steps for conducting a usability test are very similar to those followed in human factors experiments (cf. Sanders & McCormick, 1993, chap. 2; Wixon & Wilson, 1997).
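
As an illustration of what assigning quantitative levels to goals can look like in practice, the sketch below defines hypothetical pass/fail criteria for task time, error count, and satisfaction and checks a set of invented session results against them. The metric names, thresholds, and data are assumptions for illustration only, not values prescribed by Wixon and Wilson (1997) or by any certification document.

```python
# Hypothetical usability goals for a single flight-plan entry task.
# Thresholds and participant data are invented for illustration only.
from statistics import mean

GOALS = {
    "task_time_s":  {"target": 120.0, "better": "lower"},   # finish within 2 min
    "errors":       {"target": 1.0,   "better": "lower"},   # at most 1 error on average
    "satisfaction": {"target": 5.0,   "better": "higher"},  # rating on a 1-7 scale
}

# Invented results from six test participants.
sessions = [
    {"task_time_s": 95,  "errors": 0, "satisfaction": 6},
    {"task_time_s": 140, "errors": 2, "satisfaction": 4},
    {"task_time_s": 110, "errors": 1, "satisfaction": 5},
    {"task_time_s": 88,  "errors": 0, "satisfaction": 6},
    {"task_time_s": 132, "errors": 1, "satisfaction": 5},
    {"task_time_s": 101, "errors": 0, "satisfaction": 7},
]

def evaluate(goals, sessions):
    """Compare the mean of each metric against its quantitative goal."""
    for metric, goal in goals.items():
        observed = mean(s[metric] for s in sessions)
        if goal["better"] == "lower":
            met = observed <= goal["target"]
        else:
            met = observed >= goal["target"]
        print(f"{metric}: mean={observed:.1f}, target={goal['target']}, "
              f"{'met' if met else 'NOT met'}")

if __name__ == "__main__":
    evaluate(GOALS, sessions)
```

In a full test, the comparison against each goal would of course rest on an appropriate statistical analysis rather than raw means, but the structure of matching each goal to a measurable criterion is the same.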

There are some important disadvantages to usability testing that must be considered in deciding whether it is appropriate for the task, including that the technique requires an experimenter to observe the user and/or record performance data. The method can also be quite intrusive to performance (Dix et al., 1998). Lab studies require staffing and equipment, which deter some companies from formal evaluation. Furthermore, the testing cycle places additional time at the end of the development process, which may delay the release of a system. In some cases, testing can be integrated into the development process, thereby reducing any unreasonable extension of time beyond the completion date (Wixon & Wilson, 1997).

Comparisons of Inspection and Testing Methods

Opinions on which evaluation methods are the most effective for finding usability problems, or fostering usability solutions, are mixed. Several studies have compared different techniques in terms of effectiveness, cost, and accuracy (Doubleday, Ryan, Springett, & Sutcliffe, 1997; Jeffries et al., 1991; Kantner & Rosenbaum, 1997; Nielsen & Phillips, 1993; Virzi, 1997). The following list includes examples of some of the results:

• Usability tests can find more problems than the cognitive walkthrough and identify problems believed to be more severe (Jeffries et al., 1991; Virzi, 1997).


• Heuristic analysis performed by experts can find more problems, faster, than think-aloud (observational) techniques (Virzi, 1997).

• Usability testing is most effective at identifying serious and recurring problems but not low-priority ones (Virzi, 1997).

• Usability testing produces the most accurate performance data, followed by GOMS, and then by heuristic evaluation (Nielsen & Phillips, 1993).

• Usability testing is more expensive than cognitive walkthroughs, heuristic evaluation, or GOMS modeling (Nielsen & Phillips, 1993).

Despite the distribution of results, these studies unanimously agree that (a) no single technique is superior for all usability evaluation tasks, (b) any single technique is better than none at all, and (c) the ideal combination is to use a usability test combined with another technique. For development teams interested in conducting a usability evaluation, there are several issues that need to be considered before selecting a method, or combination of methods. Each technique varies along several parameters, including the optimal phase in system development at which it can be applied, the training and expertise of the person who will administer it, the size of the team conducting the evaluation, the amount of time required, the type of information and comprehensiveness of the results, the subjectivity or objectivity of results, the degree of intrusiveness, and of course the cost of the evaluation (Dix et al., 1998). When comparing the techniques, no single method is superior in all areas. If a development team were to select only one technique, some aspect of the evaluation would ultimately be compromised. However, in many cases, usability evaluation techniques can be effectively combined to produce more thorough results. (We address this in the next section.)

The techniques reviewed here represent only a small number of the available methods. There is no shortage of techniques for evaluating the usability of an interface, but each method has specific strengths and weaknesses. Despite this, there has been only limited application of usability evaluation in aviation systems (Kaber et al., 2002).

USABILITY IN AN AVIATION CONTEXT

In general, use of contemporary usability techniques is often perceived as an effort-intensive activity with a limited return on investment (Ivory & Hearst, 2001). Aviation systems design teams need methods that are relatively inexpensive to use and provide some aid in HCI-related decisions (Feary et al., 2003). More specifically, current development procedures and the certification process dictate that effective techniques for evaluating usability in aviation systems have the following characteristics:


• Rapid execution: Technology is advancing faster than the capability to add it to existing or even new aircraft. Any usability evaluation technique should not slow this process further.

• Cost-effective: Current requirements in aviation can increase the cost of designing a new component by a factor of three in some cases (Abbott et al., 1999). The usability evaluation should not pose an excessive additional financial burden on vendors.

• Integrated in the full development life cycle: The need to build usability into the development process has been emphasized by several publications, including aviation human factors research (Abbott et al., 1999; Singer, 1999; Smith, 1999; Stager, 2000; Stein, 2000; Williges et al., 1988; Wixon & Wilson, 1997).

• Input from a variety of domain experts: As in designing any interactive system, it is crucial that aviation systems be designed for the intended user population. Pilots should be included early and continually in the development process (Stein, 2000; Virzi, 1997).

• Transferability and scalability to cockpit design: Any usability evaluation technique must be adaptable to the context of cockpit interface design or ATC workstation interface design and should be scalable for application to a single new component or an entire system.

Applicability of the Evaluation Techniques

On the basis of the criteria just presented, we analytically evaluated the applicability of the various evaluation techniques reviewed. There has been some support for usability testing in aviation systems design because of the quality of the results, the flexibility of the application, the preferences of usability experts, and the potential for wide acceptance (Williges et al., 1988). Although usability testing is more expensive than inspection methods, the advantage of accurate, quantifiable results is often considered to outweigh the costs. However, if cost becomes crucial to a project, there are procedures for reducing the money and time spent on a formal experiment.

For example, Miller and O'Donnell (1993) developed a methodology for usability testing that reduces the expected overhead of a lab environment: Low-overhead usability testing is designed to be executed in 1 or 2 days for a cost of, at most, $500. The procedure requires a spacious office or conference room and 12 (or more) participants, who are divided into two groups performing tests in two separate sessions. The test environment is still treated as formal by the evaluators, who continue to behave as if they were running a formal laboratory experiment. During the test itself, participants respond to questions on interface tasks using a predesigned, electronic feedback form. At the same time, evaluators use an "observer" form to record information on participant behaviors and performance. Both the feedback and observation forms contain subjective and objective data. After participants complete the evaluation session, they answer subjective questions as a group, and evaluators consolidate the data from all the forms and the group interview on an end-of-day checklist.

There are many parallels between the low-overhead method and the elaborate usability testing methodology described by Wixon and Wilson (1997), such as employing an administrator and observers, organizing prescribed test materials in advance, controlling the environment, selecting the participants, and basing data on the participants' performance in a defined set of tasks. Although they have cut several corners, Miller and O'Donnell (1993) adhered to the basic constructs of formal usability testing. The low-overhead method may be effective when the time to complete a project or the budget is constrained. The trade-off is that the limited time spent in the lab results in a corresponding reduction in the amount of data and the accuracy of the evaluation, as compared to actual usability testing. It is up to the aviation systems design team to determine which is more important.

For over 10 years, heuristic evaluation has been used effectively to perform rapid usability inspections. Heuristic evaluation fits all the unique requirements for usability evaluation in the cockpit. It can be performed at any phase of the development process, either with or without an interface, and it can be performed both faster and with less expense than the other inspection techniques. It can be performed by experts or nonexperts, but in the context of aviation systems it would likely need to be overseen by an expert trained both in usability and in the relevant operational domain, such as piloting.

It is also likely that heuristic evaluations are transferable to aviation. Kaber et al. (2002) conducted a warm heuristic-based analysis of the Multi-Control Display Unit (MCDU) component of the FMS to identify usability issues. In their study, the evaluators observed the use of an MCDU, based on the design implemented in the MD–11 passenger aircraft, in a hypothetical flight task. Seven experts participated in a group evaluation of the MCDU based on principles under the major headings of learnability, flexibility, and robustness. The evaluators concluded that the MCDU violated several of the principles in these categories.

We can also use the example of the reported problems with the FMS to conduct a cold heuristic evaluation. In review, Sarter and Woods (1992) cited weak feedback, incomplete mental models, and imprecise expectations of the outcome of some command sequences, whereas Riley (2001) noted inconsistency with the pilot's language and procedures as current problems in the FMS interface. When comparing these issues to the list of Nielsen's (1994) heuristic principles presented earlier, there appears to be some overlap. Table 1 shows one possible matching of usability problems with the FMS to specific heuristics.

TABLE 1
Matching of FMS Usability Problems With Heuristics

FMS problem: Weak feedback
Applicable heuristic: Visibility of system status
Definition of heuristic: "The system should always keep users informed about what is going on, through appropriate feedback within reasonable time" (p. 30).

FMS problem: Inconsistency with pilot language
Applicable heuristic: Match between system and the real world
Definition of heuristic: "The system should speak the users' language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order" (p. 30).

FMS problem: Incomplete mental models
Applicable heuristic: Recognition rather than recall
Definition of heuristic: "Make objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate" (p. 30).

Note. FMS = flight management system. Heuristic definitions are from Nielsen (1994).

It would be too simplistic to say that the existing problems with the FMS would have been prevented by application of a heuristic evaluation. However, given the overlap between the problems with the FMS and the usability principles, as demonstrated by Kaber et al. (2002) and shown here, it is likely that heuristic evaluation could have added some value to the design process and that design recommendations based on this type of evaluation could be useful for improving FMS usability.

Current design heuristics for usability may not be directly transferable to an aviation context; there may be a need to adapt existing heuristics and develop new ones for evaluations of aviation systems. Research has already been performed to identify new guidelines for modern aviation systems. Billings (1997) formulated 15 guidelines that could benefit the existing certification requirements for systems including some degree of automation. It may be possible to use these contemporary human factors principles as a basis for heuristic evaluation.

The direct observation technique, as part of a cooperative evaluation, also fits with the requirements for evaluating aviation interfaces. This method is quick and inexpensive, and it utilizes the input of users and multiple experts by design. It has also been used in the development of aviation systems (Williges et al., 1988), so its transferability and scalability may not be in question.

Combining Techniques

With respect to combining multiple techniques as part of a usability evaluation, given the importance of cost in developing aviation systems, the completeness and accuracy of results should justify the specific approach. Because of the high level of expert involvement, combining a formal usability test with a cognitive walkthrough, for example, would be fairly expensive and should only be considered if the results are far superior to less costly combinations. Given the scope of the walkthrough technique, it may not be necessary when usability tests are already being conducted. For similar reasons, a detailed GOMS model is not advantageous when usability testing is possible. As stated earlier, the cognitive walkthrough specifically evaluates ease of learning and the GOMS model produces performance data. Performance and learnability can be evaluated through empirical testing. That is, usability tests can be adapted to include the benefits of the cognitive walkthrough and GOMS model without the additional (and costly) involvement of experts required to perform the cognitive walkthrough, or the additional time required to develop a detailed GOMS model. In one experiment, the performance data derived from a usability test were determined to be more accurate than the data produced by a GOMS model, which supports the use of the formal test (Nielsen & Phillips, 1993). In contrast, neither the cognitive walkthrough nor the GOMS model can be adapted to produce results equivalent to usability testing.
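To make the trade-off concrete, the kind of performance estimate a GOMS-family model produces can be illustrated with a keystroke-level calculation. The sketch below sums commonly cited approximate operator times from the keystroke-level model (keystroke, homing, and mental preparation) for a hypothetical MCDU entry sequence; the sequence and the resulting figure are illustrative assumptions, not values reported in this article.

# Keystroke-level (KLM-GOMS) estimate for a hypothetical MCDU entry sequence.
# Operator times are commonly cited approximations (seconds); the task
# breakdown below is an illustrative assumption, not data from this article.
OPERATOR_TIMES = {
    "K": 0.28,  # press a key on the MCDU keypad
    "H": 0.40,  # move the hand to the keypad or line-select keys
    "M": 1.35,  # mental preparation before a unit of action
}

# Hypothetical sequence: home hands, think, type a 5-character waypoint
# identifier, think, press a line-select key to insert it.
sequence = ["H", "M", "K", "K", "K", "K", "K", "M", "K"]

estimate = sum(OPERATOR_TIMES[op] for op in sequence)
print(f"Predicted execution time: {estimate:.2f} s")  # about 4.8 s

An empirical usability test would instead measure this execution time (and the associated errors) directly with representative pilots, which is the basis of the comparison made above.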

Usability Trends in Aviation

There is some additional evidence to support the transfer of usability evaluation techniques to aviation interface design. In formulating suggestions for new certification requirements, Singer (1999) consulted existing HCI guidelines for ways to improve displays, such as using natural dialog; speaking the user's language; minimizing the user's memory load; being consistent; providing appropriate feedback, shortcuts, clear exits to processes, and understandable error messages; and preventing errors. He also suggested that pilots of varying backgrounds be used in simulation programs and tests instead of using seasoned testers to review new system prototypes. This evaluation approach is consistent with aspects of heuristic evaluation and cooperative evaluation.

Williges et al. (1988) also reviewed HCI guidelines and the software design process when proposing a development methodology for aircraft and ATC software. In their chapter, the authors described a 14-step design process (see Table 2). The methodology is similar to models recommended for user-centered iterative design, such as those described by Gould, Boies, and Ukelson (1997), which advocate an early and continual focus on users, validated by ongoing user testing, and an integrated and iterative design that reflects the user's input. Because of the comprehensive scope of the development process and application to software design in the context of aviation systems, methodologies like these provide a good framework for determining the stage of development where usability evaluation will have the greatest impact.


TABLE 2
Development Methodology for Aircraft and ATC Software (Williges et al., 1988)

Design objectives: The design goals of the aviation system must be stated before the software is designed. This includes user-oriented metrics, measures of learnability, usability, and human factors design principles.

Task-function analysis: This step includes identification of inputs and outputs; specifications for dialog sequences; and the control structure for interfaces, dialogs, and computations.

Focus on users: The initial design should involve input from end users of the system. This can include interviews, questionnaires, statistics, and literature reviews.

Dialogue design guidelines: Because each design situation can vary within a system, the structure of the human–computer interaction should also be defined in advance to ensure consistency and improve the initial design.

Structured walkthroughs: The final step of the initial design stage involves combining all the information into a description of the proposed control structure. Users and designers then walk through actual aircraft scenarios using the proposed design, either formally or informally.

Initial design modifications: Using feedback from the walkthrough, changes are often made to the initial design. The above steps are repeated until no modifications are required.

Rapid prototyping: Software prototypes are drafted using the initial design specifications.

User-defined interfaces: As with the initial design stage, the end users should be involved during the design of the software prototypes.

User-acceptance testing: Once prototypes are completed, they should be tested by the pilot or controller who will be using them. These evaluations can be done one on one, in small groups, or compared to field data.

Iterative redesign: As with the initial design, feedback from acceptance testing should be applied directly to the prototype, which is modified further until it receives approval from the operator.

Operational software interface: Production software is completed.

Benchmarking: Simulated tasks representing actual users' tasks should be used in conducting a summative evaluation.

Formal experimentation: An experiment testing the interface based on the requirements and the benchmark values should be conducted to validate the design.

Feed-forward results: The data obtained on the existing interface can be used in designing future interfaces.
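The early-design stages in Table 2 form an explicit loop: walkthroughs generate feedback, the design is modified, and the cycle repeats until no further changes are requested. The following is a schematic sketch of that loop only, with hypothetical function and field names; it is not an implementation prescribed by Williges et al. (1988).

# Schematic of the iterative early-design loop in Table 2 (hypothetical
# function and field names; not prescribed by Williges et al., 1988).
def structured_walkthrough(design):
    """Users and designers walk through aircraft scenarios; return change requests."""
    # Placeholder: a real walkthrough would return issues raised by participants.
    return design.pop("open_issues", [])

def apply_modifications(design, changes):
    """Fold walkthrough feedback back into the initial design."""
    design.setdefault("revisions", []).extend(changes)
    return design

design = {"control_structure": "proposed", "open_issues": ["unclear dialog sequence"]}

# Repeat walkthrough and modification until no further changes are requested.
while True:
    changes = structured_walkthrough(design)
    if not changes:
        break
    design = apply_modifications(design, changes)

print(design)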

DISCUSSION AND RECOMMENDATIONS

Based on our comparisons of the advantages and disadvantages of the various evaluation techniques, and on matching them to the needs of aviation systems design, usability testing, heuristic evaluation, and cooperative evaluation may be most applicable to the aviation context. These three techniques, applied at strategic intervals within the development process, may greatly reduce the potential for a system to go into production with usability flaws.

Heuristic Evaluation

A usability domain expert should perform a cold heuristic evaluation early in the design process, during the task-function analysis step and before any prototypes are developed. This can expose flaws in the design before the design team begins drafting the first prototype, resulting in less time lost to revisions, and the combination of usability and domain expertise can increase the quality of the evaluation.

An additional heuristic evaluation could be conducted each time a prototype is developed for review by the design team, in order to validate the design. The results of all evaluations should be communicated within the design team and carried forward to the next phase of evaluation.

Cooperative Evaluation

Similar to Williges et al. (1988), we recommend conducting cooperative usability evaluations, attended by representatives from various system development areas (computer programmer, graphic artist, human factors expert, etc.), throughout the development process subsequent to use of heuristic evaluation. Cooperative evaluations should be conducted after the results of the heuristic evaluations have been applied to the design or the prototypes to limit the amount of time the larger group will need to spend in evaluation meetings.

Usability Testing

Once a working model of the interface is complete and the inputs from the heuristic and cooperative evaluations have been applied, a formal usability test, involving a sample of the user population, should be conducted. Ideally, the project team should directly observe the process. In general, pilot performance data and errors should be recorded as well as data relevant to the new aspects of the interface. It is important to define usability metrics to assess the design principles of concern in advance of performing any tests. As noted by Singer (1999) in the domain of aviation, and Wixon and Wilson (1997) in describing the usability engineering framework, usability measures are most useful when combined with a quantitative frame of reference.

Last, any aviation systems usability testing protocol should be designed to reflect both the component that is changing and its effect on the system as a whole (Billings, 1997; Stein, 2000; Woods & Sarter, 1993). In addition to any performance data collected on the new component in isolation, there should also be a test that gathers quantitative data comparing overall system performance with the new component to performance in the same environment without it. Stein offered that conditions such as motivation, mental fatigue, and the complexities of human information processing may not be an issue when designing a piece of hardware or software in isolation but will be apparent when the system is used in the real world. When the evaluation is complete, results should be preserved to provide benchmark data for the next generation of the system. Several databases already exist that can give values for errors per flight hour, which could be used as benchmarks for initial testing (Singer, 1999).
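As a concrete illustration of such a quantitative frame of reference, the sketch below compares error rates per flight hour for simulated trials with and without a new component against a benchmark value. The counts, hours, and benchmark are hypothetical and are not figures drawn from the databases Singer (1999) mentions.

# Hypothetical comparison of error rates per flight hour, with and without a
# new interface component, against a benchmark. All values are illustrative.
def errors_per_flight_hour(error_count, flight_hours):
    return error_count / flight_hours

baseline = errors_per_flight_hour(error_count=12, flight_hours=40.0)   # legacy interface
candidate = errors_per_flight_hour(error_count=7, flight_hours=40.0)   # with new component
benchmark = 0.25  # hypothetical acceptable errors per flight hour

print(f"Baseline:  {baseline:.2f} errors/flight hour")
print(f"Candidate: {candidate:.2f} errors/flight hour")
print(f"Candidate {'meets' if candidate <= benchmark else 'exceeds'} "
      f"the benchmark of {benchmark:.2f}")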

CONCLUSIONS

Performing usability evaluations of highly complex and dynamic systems, such as flight decks or ATC workstations, poses significant challenges for usability experts. The classic problems of collaborating with a resistant design team, convincing managers of the importance of usability evaluation despite tight budgets, and struggling to assemble participants for empirical tests are still present but are overshadowed by the need to evaluate a very complex interface with narrow margins for error and high private and public visibility. Furthermore, the environment created by the certification process makes the evaluation process even more difficult by restricting expense and applying outdated and ambiguous human factors guidelines (Singer, 1999).

More usability problems can be found in a system by combining complementary usability evaluation techniques, each of which exposes usability problems in a different way. The effectiveness of these methods can be increased even further through the involvement of human factors and domain experts early in the development life cycle. Formal usability testing, heuristic evaluation, and cooperative evaluation can be applied to aviation systems in this manner.

As technology continues to drive changes in aviation, there is a need for detailed usability evaluations of new components added to aviation systems. At present, technology is evolving more rapidly than the certification process can accommodate. Usability evaluations of aviation systems need to be just as dynamic as the technology itself to meet the demand.


Finally, this research may have relevance to other contexts, including manufacturing organizations attempting to improve their human factors design process (e.g., advanced machine technologies, medical devices, etc.) in response to increasingly sophisticated human factors/ergonomics regulations and principles of good professional practice. As discussed, regulations in the avionics industry now require human factors test plans for certification, and they ask for information on specific measurement and evaluation techniques to be used, including objective human performance measures to address design features of concern. The formal usability evaluation methods we describe in this article may be applicable to the manufacturing domain, and companies should consider the advantages and drawbacks of each in developing test plans that are both effective and economical.

ACKNOWLEDGMENTS

Michael Clamann's work on this project was supported in part by a NASA Graduate Student Researchers Program Fellowship (Grant NGT–1–01004). The technical point of contact was Lance Prinzel. David Kaber's work on this article was supported by NASA Langley Research Center (Crew/Vehicle Integration Branch and the Psychological and Physiological Stressors and Factors Program) through Grants NAG1–01–002 and NAG–1–02056. The grant monitors were Anna Trujillo and Lance Prinzel. The views and conclusions presented in this article are those of the authors and do not necessarily reflect the views of NASA.

REFERENCES

Aeronautica Civil of the Republic of Colombia. (1996). Aircraft accident report: Controlled flight into terrain, American Airlines Flight 965, Boeing 757–223, N651AA, near Cali, Colombia, December 20, 1995. Santafe de Bogota, Colombia: Author.

Abbott, D. W., Wise, M. A., & Wise, J. A. (1999). Underpinnings of system evaluation. In D. J. Garland, J. A. Wise, & V. D. Hopkin (Eds.), Handbook of aviation human factors (pp. 51–66). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Benel, R. (1998). Workstation and software interface design in air traffic control. In M. Smolensky & E. Stein (Eds.), Human factors in air traffic control (pp. 341–390). San Diego, CA: Academic.

Billings, C. E. (1997). Aviation automation: The search for a human-centered approach. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Dix, A., Finlay, J., Abowd, G., & Beale, R. (1998). Human–computer interaction (2nd ed., pp. 406–441). London: Prentice Hall Europe.

Doubleday, A., Ryan, M., Springett, M., & Sutcliffe, A. (1997). A comparison of usability techniques for evaluating design. In Proceedings of DIS '97: Designing interactive systems: Processes, practices, methods, & techniques (pp. 101–110). New York: ACM.

Feary, M., Sherry, L., Polson, P., & Fennel, K. (2003). Incorporating cognitive usability into software design processes. In J. Jacko & C. Stephanidis (Eds.), Human–computer interaction: Theory and practice (Part I). Proceedings of the 10th International Conference on Human–Computer Interaction (pp. 427–431). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Funk, K., Suroteguh, C., Wilson, J., & Lyall, B. (1998). Flight deck automation and task management. Proceedings of 1998 IEEE International Conference on Systems, Man, and Cybernetics (Vol. 1, pp. 863–868). San Diego, CA: IEEE.

Garland, D. (1999). Air traffic controller memory: Capabilities, limitations, and volatility. In D. J. Garland, J. A. Wise, & V. D. Hopkin (Eds.), Handbook of aviation human factors (pp. 455–496). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Gould, J., Boies, S., & Ukelson, J. (1997). How to design usable systems. In M. Helander, T. Landauer, & P. Prabhu (Eds.), Handbook of human–computer interaction (pp. 231–255). New York: Elsevier.

Hopkin, V. D. (1988). Air traffic control. In E. Wiener & D. Nagel (Eds.), Human factors in aviation (pp. 639–663). San Diego, CA: Academic.

Hopkin, V. D. (1998). The impact of automation on air traffic control specialists. In M. Smolensky & E. Stein (Eds.), Human factors in air traffic control (pp. 391–419). San Diego, CA: Academic.

Irving, J., Polson, P. G., & Irving, J. (1994). Applications of formal methods of human computer interaction to training and use of the control and display unit (Tech. Rep. No. 94–08). Boulder: University of Colorado.

Ivory, M. Y., & Hearst, M. A. (2001). The state of the art in automating usability evaluation methods. ACM Computing Surveys, 33, 470–516.

Jeffries, R., Miller, J. R., Wharton, C., & Uyeda, K. M. (1991). User interface evaluation in the real world: A comparison of four techniques. In S. P. Robertson, G. M. Olson, & J. S. Olson (Eds.), Proceedings of ACM CHI '91 Conference on Human Factors in Computing Systems (pp. 119–124). New York: ACM.

Kaber, D., Riley, J., & Tan, K.-W. (2002). Improved usability of aviation automation through direct manipulation and graphical user interface design. The International Journal of Aviation Psychology, 12, 153–180.

Kantner, L., & Rosenbaum, S. (1997). Usability studies of WWW sites: Heuristic evaluation vs. laboratory testing. In Proceedings of SIGDOC '97: ACM 15th International Conference on Systems Documentation (pp. 153–160). New York: ACM.

Miller, M., & O'Donnell, C. (1993). Usability testing on a shoe string. In S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel, & T. White (Eds.), Interchi '93 Adjunct Proceedings (pp. 85–86). New York: ACM.

Newman, W., & Lamming, M. (1995). Interactive system design. Boston: Addison-Wesley.

Nielsen, J. (1994). Heuristic evaluation. In J. Nielsen & R. Mack (Eds.), Usability inspection methods (pp. 25–62). New York: Wiley.

Nielsen, J., & Mack, R. (1994). Usability inspection methods. New York: Wiley.

Nielsen, J., & Phillips, V. (1993). Estimating the relative usability of two interfaces: Heuristic, formal, and empirical methods compared. In S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel, & T. White (Eds.), Proceedings of ACM INTERCHI '93 Conference on Human Factors in Computing Systems (pp. 214–221). New York: ACM.

Polson, P., & Smith, N. (1999). The cockpit cognitive walkthrough. In R. S. Jensen (Ed.), Proceedings of the Tenth International Symposium on Aviation Psychology. Columbus: Ohio State University.

Riley, V. (2001, Spring). A new language for pilot interfaces. Ergonomics in Design, pp. 21–26.

Rowley, D. E., & Rhoades, D. G. (1992). The cognitive jogthrough: A fast-paced user interface evaluation procedure. In P. Bauersfeld, J. Bennett, & G. Lynch (Eds.), Proceedings of CHI '92 (pp. 389–395). New York: ACM.

Sanders, M. S., & McCormick, E. J. (1993). Human factors in systems design (7th ed.). New York: McGraw-Hill.

Sarter, N., & Woods, D. (1992). Pilot interaction with cockpit automation: Operational experiences with the flight management system. The International Journal of Aviation Psychology, 2, 303–321.

Sarter, N., Woods, D., & Billings, C. (1997). Automation surprises. In G. Salvendy (Ed.), Handbook of human factors and ergonomics (2nd ed., pp. 1926–1943). New York: Wiley.

Sexton, G. A. (1988). Cockpit-crew systems design and integration. In E. Wiener & D. Nagel (Eds.), Human factors in aviation (pp. 495–525). San Diego, CA: Academic.

Singer, G. (1999). Filling the gaps in the human factors certification net. In S. Dekker & E. Hollnagel (Eds.), Coping with computers in the cockpit (pp. 87–107). Brookfield, VT: Ashgate.

Smith, C. (1999). Design of the Eurofighter human machine interface. In Air and Space Europe (Vol. 1, No. 3, pp. 54–59). Amsterdam: Elsevier Science.

Stager, P. (2000). Achieving the objectives of certification through validation: Methodological issues. In J. Wise & V. Hopkin (Eds.), Human factors in certification (pp. 91–104). London: Lawrence Erlbaum Associates, Ltd.

Stein, E. S. (2000). A critical component for air traffic control systems. In J. Wise & V. Hopkin (Eds.), Human factors in certification (pp. 57–63). London: Lawrence Erlbaum Associates, Ltd.

Stokes, A., & Wickens, C. (1988). Aviation displays. In E. Wiener & D. Nagel (Eds.), Human factors in aviation (pp. 387–431). San Diego, CA: Academic.

Strauch, B. (1997). Automation and decision making—Lessons from the Cali accident. In Proceedings of the Human Factors and Ergonomics Society 41st Annual Meeting (pp. 195–199). Albuquerque, NM: Human Factors and Ergonomics Society.

U.S. Department of Transportation/Federal Aviation Administration. (1993). Advisory circular (AC) 25.1523–1. Washington, DC: Author.

U.S. Department of Transportation/Federal Aviation Administration. (1995). 14 CFR Part 25—Airworthiness standards: Transport category airplanes (Chap. 1, Subchap. C). Washington, DC: Author.

Virzi, R. A. (1997). Usability inspection methods. In M. Helander, T. Landauer, & P. Prabhu (Eds.), Handbook of human–computer interaction (pp. 705–715). New York: Elsevier.

Williges, R. C., Williges, B. H., & Fainter, R. G. (1988). Software interfaces for aviation systems. In E. Wiener & D. Nagel (Eds.), Human factors in aviation (pp. 463–493). San Diego, CA: Academic.

Wixon, D., & Wilson, C. (1997). The usability framework for product design and evaluation. In M. Helander, T. Landauer, & P. Prabhu (Eds.), Handbook of human–computer interaction (pp. 653–685). New York: Elsevier.

Woods, D., & Sarter, N. (1993). Human interaction with intelligent systems in complex dynamic environments. In D. Garland & J. Wise (Eds.), Human factors and advanced aviation technology (pp. 107–110). Daytona Beach, FL: Embry-Riddle Aeronautical University Press.

Manuscript first received June 2002
