developing operator capacity estimates for...

15
INTRODUCTION In both the military and commercial sectors, there has been significant research and develop- ment in the field of unmanned vehicles, which in- clude remotely piloted vehicles, unmanned air vehicles (UAVs), unmanned air combat vehicles, unmanned underwater vehicles, and a plethora of ground-based robotic systems. Unmanned sys- tems of the future are expected to become more autonomous with improvements in technology and to require less direct human input through manual control. Whereas current UAVs require relatively concentrated human input for flight con- trol, in the future it is likely that the human role for direct flight control will diminish and that the need for supervisory control, including higher lev- el cognitive reasoning, will become much more substantial. Because of the expected future use of clusters of unmanned vehicles that can both re- spond to human commands as well as operate au- tonomously, there is a clear need to explore the human requirements for supervisory control of autonomous air vehicles. In supervisory control, a human operator mon- itors a complex system state and intermittently ex- ecutes some level of control on a process, acting through some automated agent (Sheridan, 1992). This paper will detail the results of a research ef- fort designed to explore human supervisory con- trol issues in the management of multiple Tactical Tomahawk missiles. Because of new Global Posi- tioning System retargeting capabilities, Tactical Tomahawk missiles, which were previously fire- and-forget missiles, can now take on missions similar to those of UAVs in that their flight paths can be changed in flight to emergent, time-critical targets. As part of this new capability, Tactical Developing Operator Capacity Estimates for Supervisory Control of Autonomous Vehicles M. L. Cummings, Massachusetts Institute of Technology, Cambridge, Massachusetts, and Stephanie Guerlain, University of Virginia, Charlottesville, Virginia Objective: This study examined operators’ capacity to successfully reallocate highly autonomous in-flight missiles to time-sensitive targets while performing secondary tasks of varying complexity. Background: Regardless of the level of autonomy for unmanned systems, humans will be necessarily involved in the mission planning, higher level operation, and contingency interventions, otherwise known as human supervisory control. As a result, more research is needed that addresses the impact of dynamic decision support systems that support rapid planning and replanning in time- pressured scenarios, particularly on operator workload. Method: A dual screen simu- lation that allows a single operator the ability to monitor and control 8, 12, or 16 missiles through high level replanning was tested on 42 U.S. Navy personnel. Results: The most significant finding was that when attempting to control 16 missiles, participants’ performance on three separate objective performance metrics and their situation aware- ness were significantly degraded. Conclusion: These results mirror studies of air traf- fic control that demonstrate a similar decline in performance for controllers managing 17 aircraft as compared with those managing only 10 to 11 aircraft. Moreover, the re- sults suggest that a 70% utilization (percentage busy time) score is a valid threshold for predicting significant performance decay and could be a generalizable metric that can aid in manning predictions. Application: This research is relevant to human super- visory control of networked military and commercial unmanned vehicles in the air, on the ground, and on and under the water. Address correspondence to Mary L. Cummings, 77 Massachusetts Ave. 33-305, Cambridge, MA 02139; [email protected]. HUMAN F ACTORS, Vol. 49, No. 1, February 2007, pp. 1–15. Copyright © 2007, Human Factors and Ergonomics Society. All rights reserved.

Upload: phungdat

Post on 15-Sep-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

INTRODUCTION

In both the military and commercial sectorsthere has been significant research and develop-ment in the field of unmanned vehicles which in-clude remotely piloted vehicles unmanned airvehicles (UAVs) unmanned air combat vehiclesunmanned underwater vehicles and a plethora ofground-based robotic systems Unmanned sys-tems of the future are expected to become moreautonomous with improvements in technologyand to require less direct human input throughmanual control Whereas current UAVs requirerelatively concentrated human input for flight con-trol in the future it is likely that the human rolefor direct flight control will diminish and that theneed for supervisory control including higher lev-el cognitive reasoning will become much moresubstantial Because of the expected future use of

clusters of unmanned vehicles that can both re-spond to human commands as well as operate au-tonomously there is a clear need to explore thehuman requirements for supervisory control ofautonomous air vehicles

In supervisory control a human operator mon-itors a complex system state and intermittently ex-ecutes some level of control on a process actingthrough some automated agent (Sheridan 1992)This paper will detail the results of a research ef-fort designed to explore human supervisory con-trol issues in the management of multiple TacticalTomahawk missiles Because of new Global Posi-tioning System retargeting capabilities TacticalTomahawk missiles which were previously fire-and-forget missiles can now take on missionssimilar to those of UAVs in that their flight pathscan be changed in flight to emergent time-criticaltargets As part of this new capability Tactical

Developing Operator Capacity Estimates for SupervisoryControl of Autonomous Vehicles

M L Cummings Massachusetts Institute of Technology Cambridge Massachusetts andStephanie Guerlain University of Virginia Charlottesville Virginia

Objective This study examined operatorsrsquo capacity to successfully reallocate highlyautonomous in-flight missiles to time-sensitive targets while performing secondarytasks of varying complexity Background Regardless of the level of autonomy forunmanned systems humans will be necessarily involved in the mission planninghigher level operation and contingency interventions otherwise known as humansupervisory control As a result more research is needed that addresses the impact ofdynamic decision support systems that support rapid planning and replanning in time-pressured scenarios particularly on operator workload Method A dual screen simu-lation that allows a single operator the ability to monitor and control 8 12 or 16 missilesthrough high level replanning was tested on 42 US Navy personnel Results Themost significant finding was that when attempting to control 16 missiles participantsrsquoperformance on three separate objective performance metrics and their situation aware-ness were significantly degraded Conclusion These results mirror studies of air traf-fic control that demonstrate a similar decline in performance for controllers managing17 aircraft as compared with those managing only 10 to 11 aircraft Moreover the re-sults suggest that a 70 utilization (percentage busy time) score is a valid thresholdfor predicting significant performance decay and could be a generalizable metric thatcan aid in manning predictions Application This research is relevant to human super-visory control of networked military and commercial unmanned vehicles in the air onthe ground and on and under the water

Address correspondence to Mary L Cummings 77 Massachusetts Ave 33-305 Cambridge MA 02139 missycmiteduHUMAN FACTORS Vol 49 No 1 February 2007 pp 1ndash15 Copyright copy 2007 Human Factors and Ergonomics Society Allrights reserved

2 February 2007 ndash Human Factors

Tomahawk missiles could be stationed in loiterpatterns to await further orders much like what isenvisioned for both surveillance and combatUAVs Because of the similarities between thesemissions and the need to replan missions and re-allocate resources in real time under time pressurethe experimental results reported here are equallyrelevant to both Tactical Tomahawk missiles andunmanned vehicles

Human supervisory control of Tactical Toma-hawk missiles generally encompasses two prima-ry subtasks (a) monitoring the missiles in flightwhich requires tracking missile progress both intime and distance as well as monitoring subsystemintegrity such as navigation and communicationsand (b) retargeting missiles when an emergentsituation occurs Emergent situations for examplecan include the appearance of a mobile surface-to-air missile site or a missile system failure thatrequires redirection of an additional in-flight mis-sile to cover a high-priority target These tasks ofmonitoring and redirecting in-flight vehicles aregeneric to any command and control system thatseeks to reallocate resources in real time based ona priority-ranked need

The US Navy (2002) initially estimated thatin the domain of multiple missile control oneoperator should control at most four missiles Thisnumber was not empirically determined and wasonly an initial estimation The number of missilesa person can effectively monitor and retarget inflight is an important parameter but as stated isoverly simplistic and does not address all of thecomplex issues an operator faces in supervisorycontrol Many factors will affect how an individ-ual performs in a combat rapid real-time resourceallocation problem number of missiles (or UAVs)one person is assigned to monitor difficulty of de-cisions time stress secondary tasking situationalawareness available decision support character-istics of the evolving scenario and other factorssuch as fatigue training and experience BecauseTactical Tomahawk controllers were all novices inthis early stage of domain modeling design andtesting the human-in-the-loop experiment detailedin the next section focused on basic performanceand workload issues and not on any other metricsthat relied on training or on experience (althoughthese would be important parameters to model inmore mature systems) The goal of this experimentwas to develop better capacity predictions throughsimulation and experimentation given increasing

numbers of missiles varying workload levels andincreasing complexity of scenarios

The original hypothesis for the supervisorycontrol performance measures in this effort wasthat as the number of missiles and the rate of ar-rival increased for emergent situations decisiontimes would increase the number of incorrect de-cisions would increase and situational awarenesswould be negatively impacted Through experi-mental testing we set out to develop a humansupervisory control performance model to empir-ically provide appropriate manning estimates ndash forexample to measure the number of missiles thatone operator can effectively supervise during arange of representative retargeting scenarios Apreliminary experiment with a previous versionof the research test bed described here demon-strated that participants could monitor up to eightmissiles (Willis 2001) With the enhanced deci-sion support capabilities described in the nextsection we wanted to examine whether this num-ber could be further increased

METHODS

Apparatus

A dual screen simulation test bed was devel-oped (Figure 1) that allows a controller to monitorthe current status of all in-flight missiles while hav-ing a real-time depiction of all potential missile-target pairings in a matrix in the event of the needto possibly redirect missiles in flight The monitordisplay represents all missiles and targets in a geo-spatial display superimposed on a map of the localterrain The map allows the controller to perceivemissile information geographically and associat-ed time bars allow the controller to perceive im-portant temporal relationships such as missilelaunch time time of impact and time of fuel re-maining all in comparison with the actual timeand with each of the other missiles Adetailed dis-cussion of the domain analysis that led to the inter-face design and the specific details and the nuancesof the actual design of the interface are reportedelsewhere (Cummings 2003)

The retargeting display consists of a decisionmatrix missile and target information boxes anda communication box The decision matrix showsthe current status of all retargetable missiles aswell as all physically possible missile-target pair-ings Retargetable missiles are listed in the left-hand column and targets in a strike including

3

Fig

ure

1T

he s

uper

viso

ry c

ontr

ol in

terf

ace

for

mis

sile

man

agem

ent

4 February 2007 ndash Human Factors

emergent targets are represented across the topCurrent missile-target pairs and their associatedtime on targets are highlighted and empty cells inthe matrix demonstrate that no possible relation-ship exists between a missile and a target (eg dueto a wrong warhead type on the missile or the mis-sile is unable to reach the target) For the possiblemissile-target pairings the display shows the userwhen the missile can get to the target and the timeremaining left for the operator to make the deci-sion to retarget

The other significant display component for theretargeting display is a text-based communicationtool also referred to as the chat box which resem-bles current instant messaging programs in wide-spread use (Figure 2) The popularity of the chatbox is not just in mainstream pop culture as it isalready in use today aboard US Navy vessels Inthe context of this experiment operators wereldquochattingrdquo with superior officers who would dis-seminate information that required no action aswell as query for information that required actionon the part of the operator In addition the chatbox served as an embedded secondary workloadmeasurement tool as it allows for measuring boththe accuracy and latency of responses to queriesTraditional secondary tasking can be intrusive andintroduce an unrealistic artifact However embed-ded secondary tasks do not fundamentally changethe task or task performance and provide moresensitive measurements (Shingledecker 1987TsangampWilson1997WickensampHollands 2000)Because it is a tool that current US Navy person-nel are very familiar with the chat box providesfor ecologic and external validity

Training and testing was conducted using twodual monitor side-by-side stations as depicted inFigure 1 Each station used a Dell personal com-puter with two17-inch (43-cm) color monitors witha screen area of 1024 times 768 pixels 16-bit high-color resolution and a Pentium 3 processor Thedual monitors for each system were supported bya MATROXreg graphics card During testing alluser actions mouse movements and incomingand outgoing messages on both screens were re-corded by software into a text file In addition theusersrsquo interactions with the retargeting displaywere recorded using a scan converter and a VCRfor later data analysis

Participants

Forty-two US Navy personnel participated in the experiment The participant populationcontained both active duty officers and enlistedpersonnel who previously worked with the fire-and-forget versions of the Tomahawk missiles aswell as 13 retired US Navy personnel who eitherhad significant experience in operating older ver-sions of the Tomahawk or had experience in strikeplanning

Procedure

All participants received approximately 3 hr oftraining which included a slide presentation toexplain the prototype and two practice sessionseach of which lasted approximately 25 min Dur-ing the training participants interacted with dis-play elements and engaged in retargeting scenariossimilar to those that would be seen in test sessionsIn addition participants observed three additional

Figure 2 The test-bed chat interface

OPERATOR CAPACITY 5

practice sessions other than their own The rulesfor retargeting missiles were explained during allphases of training and reviewed before testingThe rules were relatively simple no missile pairedto a high-priority target should be retargeted loi-ter missiles were always preferred if they werecandidate missiles and communications throughthe chat box were always secondary to the higherpriority task of retargeting The first practice ses-sion consisted of a walk-through of all screen ele-ments and a demonstration of the retargetingprocess The second practice session mirrored anactual test session detailed in the next section Theonly difference between this second practice ses-sion and a test session was the possible pausingthat took place in case of mistakes or questionsParticipants were thoroughly debriefed and allquestions were addressed

After training all participants were tested ontwo separate experimental sessions with a breakof approximately 25 min between the two In bothtest sessions the participants had approximately3 min of warm-up in which they monitored mis-siles and incoming communication messages andwere presented with a simple retargeting sce-nario After the warm-up period each participantsaw three scenarios (a) an easy scenario (a singleemergent target arrival and only one candidatemissile available) (b) a medium-difficulty sce-nario (a participant had to redirect a missile fromits default target to another target in the strikebased on the instructions of a ldquosuperior officerrdquosimulated as a message coming through the chatbox) and (c) a hard retargeting scenario which con-sisted of the arrival of two emergent targets 15 sapart The arrival of the two emergent targets wasconsidered a dual retargeting scenario as both tar-gets were in competition for the same candidatemissiles In all scenarios there was one correct an-swer according to the rules However because thematrix presents all missiles that can physicallyreach a target it was possible for a participant toretarget a missile that was not the best choice butstill was viable

Communication messages were interspersedthroughout the test session including informationmessages such as status information of other shipsparticipating in the strike and health and statusmessages from missiles Both information andhealth and status messages required no responseParticipants also received action messages high-lighted in bold which by definition required a

response Six action messages appeared for eachparticipant in every test session three occurredduring monitoring periods and three occurred 30 safter commencing a retargeting scenario The testsessions concluded when participants retargeted amissile to all emergent targets and were no longerattempting to answer any action messages in thechat box After each test session a modified Situ-ation Awareness Global Assessment Technique(SAGAT) test (Endsley 2000) was administeredin the form of a single freeze-frame quiz In addi-tion the six action questions provided embeddedon-line situation awareness (SA) measurementswhich allowed for the collection of SAdata with-out the intrusion of freezes in the simulation Theuse of embedded SA measures in the chat box issimilar to the Situation Present Assessment Meth-od (SPAM Durso et al 1998) as well as to on-line SA probes used in air traffic control research(Endsley Sollenberger Nakata amp Stein 2000)

Experimental Design

There were two primary independent variablesof interest in this experiment (a) retargetable mis-sile level (8 12 or 16 missiles) and (b) operationaltempo (low or high) The three missile levels rep-resented the number of missiles that a participanthad to choose from during retargeting scenariosand was included to determine if and when par-ticipants would become overloaded because of anincreasing number of objects to cognitively pro-cess Participants were randomly assigned to thethree missile levels The second primary indepen-dent variable was tempo which represented thearrival rate of the retargeting problems simulat-ing either low- or high-tempo operations In thelow-tempo sessions retargeting scenarios werespaced approximately 4 min apart and in the high-tempo sessions the arrivals were spaced only 2 minapart Each low- and high-tempo testing sessionincluded the three scenarios (easy medium orhard) and all participants were tested on all threelevels of difficulty but in each they saw only onelevel of missiles The order of presentation of thefast and slow tempos was counterbalanced acrossparticipants

Several dependent variables were used deci-sion time utilization (percentage busy time) andSA In addition for each participant an overall per-formance score was calculated based on decisiontimes ability to manage the communications and

6 February 2007 ndash Human Factors

decision accuracy(These metrics will be discussedin detail in subsequent sections) For each of thedependent measures the experiment was a mixedfactorial design in that all participants experi-enced two test sessions in which they saw bothhigh and low tempos but for each session theyused the same level of missiles The retargetingscenarios in each of the two tempo sessions werecomparable in level of difficulty and presentationof possible correct choices although they were notexactly the same Screen objects were modifiedin name routings were slightly different and tar-gets were moved slightly to make it seem as if theproblems were different

A between-subjects secondary factor of ses-sion effect was included for all metrics as a mea-sure of experimental internal validity This factorwas represented by ldquolow firstrdquo and ldquohigh firstrdquowhich represent whether a participant was testedon the low-tempo or the high-tempo test sessionfirst One recognized threat to internal validity ismaturation in which participants can gain expe-rience or somehow change over the course of anexperiment (Adelman 1991) The session effectwas included as a secondary independent factorbecause participant training was limited and it waspossible that learning would take place betweenthe two test sessions and introduce a potential con-found By including session effect as a factor wecould detect any significant learning in the resultsand measure this threat to internal validity

For each of the dependent variables a singlecovariate of age was used whereas the use of co-variates can lead to a loss of power a repeated mea-sures design and the addition of a few meaningfulcovariates helps to increase power (Stevens1996)Two other covariates were explored video gameexperience and time in the US Navy but thesewere not significant and not used in the final analy-sis The age covariate was used because the par-ticipants ranged in age from 22 to 59 years Ingeneral the speed of mental processing slows withage and this slowing accounts for a significantportion of age-related variance in cognitive tasksmore so than any other cognitive mechanism(Zacks Hasher amp Li 2000) If age is suspected tobe a potential confound it should be included as acovariate so as not to compromise external valid-ity (Nichols Rogers amp Fisk 2003) Covariateassumptions of linearity of regression and homo-geneity of regression coefficients were met withthe exception of one secondary independent vari-

able (session effect) for the decision time variablewhich will be discussed further

RESULTS

Given the three levels of missiles (between sub-jects) the two levels of operational tempo (withinsubjects) and the two levels of the session effectsecondary factor (between subjects) the generallinear statistical model was a 3 times 2 times 2 repeatedmeasures linear mixed model Only two-factor in-teractions were considered in order to avoid a sat-urated model and to increase degrees of freedomOne participant was removed from the data poolbecause standardized scores consistently weregreater than 329 (Tabachnick amp Fidell 2001) Forall the reported results α = 05

Performance Measures

Two measures were explored to determine theimpact of the different levels of missiles and oper-ational tempos on human performance The firstmeasure examined was decision time for each re-targeting event but because this metric alone doesnot give a sense of how well participants were ableto manage an entire test session an additional per-formance measure was used called the figure ofmerit which was an aggregate score based on allavailable time and accuracy measurements

Decision time analysis The first performancemetric examined was decision time for each retar-geting scenario and as discussed previously therewere three scenarios (easy medium and hard)Because the rapid prototype was created in Macro-media Directorreg each testing session containingthe three scenarios was in effect a movie and assuch it was impossible to randomize the order ofthe three different scenarios To determine if theinability to randomize the scenario order wouldsignificantly affect the internal validity of the deci-sion time dependent variable a preliminary exper-iment was conducted to determine if the order inwhich participants saw the scenarios produced sig-nificantly different results for the decision time de-pendent variable The results demonstrated that theorder in which participants saw the scenarios wasnot significant (Cummings 2003) Thus the con-founding effect attributable to the lack of random-ization for scenarios for this primary experimentwas not considered to be a significant confound

Because of the additional factor of scenariolevel of difficulty (a within-subjects factor) the

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

2 February 2007 ndash Human Factors

Tomahawk missiles could be stationed in loiterpatterns to await further orders much like what isenvisioned for both surveillance and combatUAVs Because of the similarities between thesemissions and the need to replan missions and re-allocate resources in real time under time pressurethe experimental results reported here are equallyrelevant to both Tactical Tomahawk missiles andunmanned vehicles

Human supervisory control of Tactical Toma-hawk missiles generally encompasses two prima-ry subtasks (a) monitoring the missiles in flightwhich requires tracking missile progress both intime and distance as well as monitoring subsystemintegrity such as navigation and communicationsand (b) retargeting missiles when an emergentsituation occurs Emergent situations for examplecan include the appearance of a mobile surface-to-air missile site or a missile system failure thatrequires redirection of an additional in-flight mis-sile to cover a high-priority target These tasks ofmonitoring and redirecting in-flight vehicles aregeneric to any command and control system thatseeks to reallocate resources in real time based ona priority-ranked need

The US Navy (2002) initially estimated thatin the domain of multiple missile control oneoperator should control at most four missiles Thisnumber was not empirically determined and wasonly an initial estimation The number of missilesa person can effectively monitor and retarget inflight is an important parameter but as stated isoverly simplistic and does not address all of thecomplex issues an operator faces in supervisorycontrol Many factors will affect how an individ-ual performs in a combat rapid real-time resourceallocation problem number of missiles (or UAVs)one person is assigned to monitor difficulty of de-cisions time stress secondary tasking situationalawareness available decision support character-istics of the evolving scenario and other factorssuch as fatigue training and experience BecauseTactical Tomahawk controllers were all novices inthis early stage of domain modeling design andtesting the human-in-the-loop experiment detailedin the next section focused on basic performanceand workload issues and not on any other metricsthat relied on training or on experience (althoughthese would be important parameters to model inmore mature systems) The goal of this experimentwas to develop better capacity predictions throughsimulation and experimentation given increasing

numbers of missiles varying workload levels andincreasing complexity of scenarios

The original hypothesis for the supervisorycontrol performance measures in this effort wasthat as the number of missiles and the rate of ar-rival increased for emergent situations decisiontimes would increase the number of incorrect de-cisions would increase and situational awarenesswould be negatively impacted Through experi-mental testing we set out to develop a humansupervisory control performance model to empir-ically provide appropriate manning estimates ndash forexample to measure the number of missiles thatone operator can effectively supervise during arange of representative retargeting scenarios Apreliminary experiment with a previous versionof the research test bed described here demon-strated that participants could monitor up to eightmissiles (Willis 2001) With the enhanced deci-sion support capabilities described in the nextsection we wanted to examine whether this num-ber could be further increased

METHODS

Apparatus

A dual screen simulation test bed was devel-oped (Figure 1) that allows a controller to monitorthe current status of all in-flight missiles while hav-ing a real-time depiction of all potential missile-target pairings in a matrix in the event of the needto possibly redirect missiles in flight The monitordisplay represents all missiles and targets in a geo-spatial display superimposed on a map of the localterrain The map allows the controller to perceivemissile information geographically and associat-ed time bars allow the controller to perceive im-portant temporal relationships such as missilelaunch time time of impact and time of fuel re-maining all in comparison with the actual timeand with each of the other missiles Adetailed dis-cussion of the domain analysis that led to the inter-face design and the specific details and the nuancesof the actual design of the interface are reportedelsewhere (Cummings 2003)

The retargeting display consists of a decisionmatrix missile and target information boxes anda communication box The decision matrix showsthe current status of all retargetable missiles aswell as all physically possible missile-target pair-ings Retargetable missiles are listed in the left-hand column and targets in a strike including

3

Fig

ure

1T

he s

uper

viso

ry c

ontr

ol in

terf

ace

for

mis

sile

man

agem

ent

4 February 2007 ndash Human Factors

emergent targets are represented across the topCurrent missile-target pairs and their associatedtime on targets are highlighted and empty cells inthe matrix demonstrate that no possible relation-ship exists between a missile and a target (eg dueto a wrong warhead type on the missile or the mis-sile is unable to reach the target) For the possiblemissile-target pairings the display shows the userwhen the missile can get to the target and the timeremaining left for the operator to make the deci-sion to retarget

The other significant display component for theretargeting display is a text-based communicationtool also referred to as the chat box which resem-bles current instant messaging programs in wide-spread use (Figure 2) The popularity of the chatbox is not just in mainstream pop culture as it isalready in use today aboard US Navy vessels Inthe context of this experiment operators wereldquochattingrdquo with superior officers who would dis-seminate information that required no action aswell as query for information that required actionon the part of the operator In addition the chatbox served as an embedded secondary workloadmeasurement tool as it allows for measuring boththe accuracy and latency of responses to queriesTraditional secondary tasking can be intrusive andintroduce an unrealistic artifact However embed-ded secondary tasks do not fundamentally changethe task or task performance and provide moresensitive measurements (Shingledecker 1987TsangampWilson1997WickensampHollands 2000)Because it is a tool that current US Navy person-nel are very familiar with the chat box providesfor ecologic and external validity

Training and testing was conducted using twodual monitor side-by-side stations as depicted inFigure 1 Each station used a Dell personal com-puter with two17-inch (43-cm) color monitors witha screen area of 1024 times 768 pixels 16-bit high-color resolution and a Pentium 3 processor Thedual monitors for each system were supported bya MATROXreg graphics card During testing alluser actions mouse movements and incomingand outgoing messages on both screens were re-corded by software into a text file In addition theusersrsquo interactions with the retargeting displaywere recorded using a scan converter and a VCRfor later data analysis

Participants

Forty-two US Navy personnel participated in the experiment The participant populationcontained both active duty officers and enlistedpersonnel who previously worked with the fire-and-forget versions of the Tomahawk missiles aswell as 13 retired US Navy personnel who eitherhad significant experience in operating older ver-sions of the Tomahawk or had experience in strikeplanning

Procedure

All participants received approximately 3 hr oftraining which included a slide presentation toexplain the prototype and two practice sessionseach of which lasted approximately 25 min Dur-ing the training participants interacted with dis-play elements and engaged in retargeting scenariossimilar to those that would be seen in test sessionsIn addition participants observed three additional

Figure 2 The test-bed chat interface

OPERATOR CAPACITY 5

practice sessions other than their own The rulesfor retargeting missiles were explained during allphases of training and reviewed before testingThe rules were relatively simple no missile pairedto a high-priority target should be retargeted loi-ter missiles were always preferred if they werecandidate missiles and communications throughthe chat box were always secondary to the higherpriority task of retargeting The first practice ses-sion consisted of a walk-through of all screen ele-ments and a demonstration of the retargetingprocess The second practice session mirrored anactual test session detailed in the next section Theonly difference between this second practice ses-sion and a test session was the possible pausingthat took place in case of mistakes or questionsParticipants were thoroughly debriefed and allquestions were addressed

After training all participants were tested ontwo separate experimental sessions with a breakof approximately 25 min between the two In bothtest sessions the participants had approximately3 min of warm-up in which they monitored mis-siles and incoming communication messages andwere presented with a simple retargeting sce-nario After the warm-up period each participantsaw three scenarios (a) an easy scenario (a singleemergent target arrival and only one candidatemissile available) (b) a medium-difficulty sce-nario (a participant had to redirect a missile fromits default target to another target in the strikebased on the instructions of a ldquosuperior officerrdquosimulated as a message coming through the chatbox) and (c) a hard retargeting scenario which con-sisted of the arrival of two emergent targets 15 sapart The arrival of the two emergent targets wasconsidered a dual retargeting scenario as both tar-gets were in competition for the same candidatemissiles In all scenarios there was one correct an-swer according to the rules However because thematrix presents all missiles that can physicallyreach a target it was possible for a participant toretarget a missile that was not the best choice butstill was viable

Communication messages were interspersedthroughout the test session including informationmessages such as status information of other shipsparticipating in the strike and health and statusmessages from missiles Both information andhealth and status messages required no responseParticipants also received action messages high-lighted in bold which by definition required a

response Six action messages appeared for eachparticipant in every test session three occurredduring monitoring periods and three occurred 30 safter commencing a retargeting scenario The testsessions concluded when participants retargeted amissile to all emergent targets and were no longerattempting to answer any action messages in thechat box After each test session a modified Situ-ation Awareness Global Assessment Technique(SAGAT) test (Endsley 2000) was administeredin the form of a single freeze-frame quiz In addi-tion the six action questions provided embeddedon-line situation awareness (SA) measurementswhich allowed for the collection of SAdata with-out the intrusion of freezes in the simulation Theuse of embedded SA measures in the chat box issimilar to the Situation Present Assessment Meth-od (SPAM Durso et al 1998) as well as to on-line SA probes used in air traffic control research(Endsley Sollenberger Nakata amp Stein 2000)

Experimental Design

There were two primary independent variablesof interest in this experiment (a) retargetable mis-sile level (8 12 or 16 missiles) and (b) operationaltempo (low or high) The three missile levels rep-resented the number of missiles that a participanthad to choose from during retargeting scenariosand was included to determine if and when par-ticipants would become overloaded because of anincreasing number of objects to cognitively pro-cess Participants were randomly assigned to thethree missile levels The second primary indepen-dent variable was tempo which represented thearrival rate of the retargeting problems simulat-ing either low- or high-tempo operations In thelow-tempo sessions retargeting scenarios werespaced approximately 4 min apart and in the high-tempo sessions the arrivals were spaced only 2 minapart Each low- and high-tempo testing sessionincluded the three scenarios (easy medium orhard) and all participants were tested on all threelevels of difficulty but in each they saw only onelevel of missiles The order of presentation of thefast and slow tempos was counterbalanced acrossparticipants

Several dependent variables were used deci-sion time utilization (percentage busy time) andSA In addition for each participant an overall per-formance score was calculated based on decisiontimes ability to manage the communications and

6 February 2007 ndash Human Factors

decision accuracy(These metrics will be discussedin detail in subsequent sections) For each of thedependent measures the experiment was a mixedfactorial design in that all participants experi-enced two test sessions in which they saw bothhigh and low tempos but for each session theyused the same level of missiles The retargetingscenarios in each of the two tempo sessions werecomparable in level of difficulty and presentationof possible correct choices although they were notexactly the same Screen objects were modifiedin name routings were slightly different and tar-gets were moved slightly to make it seem as if theproblems were different

A between-subjects secondary factor of ses-sion effect was included for all metrics as a mea-sure of experimental internal validity This factorwas represented by ldquolow firstrdquo and ldquohigh firstrdquowhich represent whether a participant was testedon the low-tempo or the high-tempo test sessionfirst One recognized threat to internal validity ismaturation in which participants can gain expe-rience or somehow change over the course of anexperiment (Adelman 1991) The session effectwas included as a secondary independent factorbecause participant training was limited and it waspossible that learning would take place betweenthe two test sessions and introduce a potential con-found By including session effect as a factor wecould detect any significant learning in the resultsand measure this threat to internal validity

For each of the dependent variables a singlecovariate of age was used whereas the use of co-variates can lead to a loss of power a repeated mea-sures design and the addition of a few meaningfulcovariates helps to increase power (Stevens1996)Two other covariates were explored video gameexperience and time in the US Navy but thesewere not significant and not used in the final analy-sis The age covariate was used because the par-ticipants ranged in age from 22 to 59 years Ingeneral the speed of mental processing slows withage and this slowing accounts for a significantportion of age-related variance in cognitive tasksmore so than any other cognitive mechanism(Zacks Hasher amp Li 2000) If age is suspected tobe a potential confound it should be included as acovariate so as not to compromise external valid-ity (Nichols Rogers amp Fisk 2003) Covariateassumptions of linearity of regression and homo-geneity of regression coefficients were met withthe exception of one secondary independent vari-

able (session effect) for the decision time variablewhich will be discussed further

RESULTS

Given the three levels of missiles (between sub-jects) the two levels of operational tempo (withinsubjects) and the two levels of the session effectsecondary factor (between subjects) the generallinear statistical model was a 3 times 2 times 2 repeatedmeasures linear mixed model Only two-factor in-teractions were considered in order to avoid a sat-urated model and to increase degrees of freedomOne participant was removed from the data poolbecause standardized scores consistently weregreater than 329 (Tabachnick amp Fidell 2001) Forall the reported results α = 05

Performance Measures

Two measures were explored to determine theimpact of the different levels of missiles and oper-ational tempos on human performance The firstmeasure examined was decision time for each re-targeting event but because this metric alone doesnot give a sense of how well participants were ableto manage an entire test session an additional per-formance measure was used called the figure ofmerit which was an aggregate score based on allavailable time and accuracy measurements

Decision time analysis The first performancemetric examined was decision time for each retar-geting scenario and as discussed previously therewere three scenarios (easy medium and hard)Because the rapid prototype was created in Macro-media Directorreg each testing session containingthe three scenarios was in effect a movie and assuch it was impossible to randomize the order ofthe three different scenarios To determine if theinability to randomize the scenario order wouldsignificantly affect the internal validity of the deci-sion time dependent variable a preliminary exper-iment was conducted to determine if the order inwhich participants saw the scenarios produced sig-nificantly different results for the decision time de-pendent variable The results demonstrated that theorder in which participants saw the scenarios wasnot significant (Cummings 2003) Thus the con-founding effect attributable to the lack of random-ization for scenarios for this primary experimentwas not considered to be a significant confound

Because of the additional factor of scenariolevel of difficulty (a within-subjects factor) the

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

3

Fig

ure

1T

he s

uper

viso

ry c

ontr

ol in

terf

ace

for

mis

sile

man

agem

ent

4 February 2007 ndash Human Factors

emergent targets are represented across the topCurrent missile-target pairs and their associatedtime on targets are highlighted and empty cells inthe matrix demonstrate that no possible relation-ship exists between a missile and a target (eg dueto a wrong warhead type on the missile or the mis-sile is unable to reach the target) For the possiblemissile-target pairings the display shows the userwhen the missile can get to the target and the timeremaining left for the operator to make the deci-sion to retarget

The other significant display component for theretargeting display is a text-based communicationtool also referred to as the chat box which resem-bles current instant messaging programs in wide-spread use (Figure 2) The popularity of the chatbox is not just in mainstream pop culture as it isalready in use today aboard US Navy vessels Inthe context of this experiment operators wereldquochattingrdquo with superior officers who would dis-seminate information that required no action aswell as query for information that required actionon the part of the operator In addition the chatbox served as an embedded secondary workloadmeasurement tool as it allows for measuring boththe accuracy and latency of responses to queriesTraditional secondary tasking can be intrusive andintroduce an unrealistic artifact However embed-ded secondary tasks do not fundamentally changethe task or task performance and provide moresensitive measurements (Shingledecker 1987TsangampWilson1997WickensampHollands 2000)Because it is a tool that current US Navy person-nel are very familiar with the chat box providesfor ecologic and external validity

Training and testing was conducted using twodual monitor side-by-side stations as depicted inFigure 1 Each station used a Dell personal com-puter with two17-inch (43-cm) color monitors witha screen area of 1024 times 768 pixels 16-bit high-color resolution and a Pentium 3 processor Thedual monitors for each system were supported bya MATROXreg graphics card During testing alluser actions mouse movements and incomingand outgoing messages on both screens were re-corded by software into a text file In addition theusersrsquo interactions with the retargeting displaywere recorded using a scan converter and a VCRfor later data analysis

Participants

Forty-two US Navy personnel participated in the experiment The participant populationcontained both active duty officers and enlistedpersonnel who previously worked with the fire-and-forget versions of the Tomahawk missiles aswell as 13 retired US Navy personnel who eitherhad significant experience in operating older ver-sions of the Tomahawk or had experience in strikeplanning

Procedure

All participants received approximately 3 hr oftraining which included a slide presentation toexplain the prototype and two practice sessionseach of which lasted approximately 25 min Dur-ing the training participants interacted with dis-play elements and engaged in retargeting scenariossimilar to those that would be seen in test sessionsIn addition participants observed three additional

Figure 2 The test-bed chat interface

OPERATOR CAPACITY 5

practice sessions other than their own The rulesfor retargeting missiles were explained during allphases of training and reviewed before testingThe rules were relatively simple no missile pairedto a high-priority target should be retargeted loi-ter missiles were always preferred if they werecandidate missiles and communications throughthe chat box were always secondary to the higherpriority task of retargeting The first practice ses-sion consisted of a walk-through of all screen ele-ments and a demonstration of the retargetingprocess The second practice session mirrored anactual test session detailed in the next section Theonly difference between this second practice ses-sion and a test session was the possible pausingthat took place in case of mistakes or questionsParticipants were thoroughly debriefed and allquestions were addressed

After training all participants were tested ontwo separate experimental sessions with a breakof approximately 25 min between the two In bothtest sessions the participants had approximately3 min of warm-up in which they monitored mis-siles and incoming communication messages andwere presented with a simple retargeting sce-nario After the warm-up period each participantsaw three scenarios (a) an easy scenario (a singleemergent target arrival and only one candidatemissile available) (b) a medium-difficulty sce-nario (a participant had to redirect a missile fromits default target to another target in the strikebased on the instructions of a ldquosuperior officerrdquosimulated as a message coming through the chatbox) and (c) a hard retargeting scenario which con-sisted of the arrival of two emergent targets 15 sapart The arrival of the two emergent targets wasconsidered a dual retargeting scenario as both tar-gets were in competition for the same candidatemissiles In all scenarios there was one correct an-swer according to the rules However because thematrix presents all missiles that can physicallyreach a target it was possible for a participant toretarget a missile that was not the best choice butstill was viable

Communication messages were interspersedthroughout the test session including informationmessages such as status information of other shipsparticipating in the strike and health and statusmessages from missiles Both information andhealth and status messages required no responseParticipants also received action messages high-lighted in bold which by definition required a

response Six action messages appeared for eachparticipant in every test session three occurredduring monitoring periods and three occurred 30 safter commencing a retargeting scenario The testsessions concluded when participants retargeted amissile to all emergent targets and were no longerattempting to answer any action messages in thechat box After each test session a modified Situ-ation Awareness Global Assessment Technique(SAGAT) test (Endsley 2000) was administeredin the form of a single freeze-frame quiz In addi-tion the six action questions provided embeddedon-line situation awareness (SA) measurementswhich allowed for the collection of SAdata with-out the intrusion of freezes in the simulation Theuse of embedded SA measures in the chat box issimilar to the Situation Present Assessment Meth-od (SPAM Durso et al 1998) as well as to on-line SA probes used in air traffic control research(Endsley Sollenberger Nakata amp Stein 2000)

Experimental Design

There were two primary independent variablesof interest in this experiment (a) retargetable mis-sile level (8 12 or 16 missiles) and (b) operationaltempo (low or high) The three missile levels rep-resented the number of missiles that a participanthad to choose from during retargeting scenariosand was included to determine if and when par-ticipants would become overloaded because of anincreasing number of objects to cognitively pro-cess Participants were randomly assigned to thethree missile levels The second primary indepen-dent variable was tempo which represented thearrival rate of the retargeting problems simulat-ing either low- or high-tempo operations In thelow-tempo sessions retargeting scenarios werespaced approximately 4 min apart and in the high-tempo sessions the arrivals were spaced only 2 minapart Each low- and high-tempo testing sessionincluded the three scenarios (easy medium orhard) and all participants were tested on all threelevels of difficulty but in each they saw only onelevel of missiles The order of presentation of thefast and slow tempos was counterbalanced acrossparticipants

Several dependent variables were used deci-sion time utilization (percentage busy time) andSA In addition for each participant an overall per-formance score was calculated based on decisiontimes ability to manage the communications and

6 February 2007 ndash Human Factors

decision accuracy(These metrics will be discussedin detail in subsequent sections) For each of thedependent measures the experiment was a mixedfactorial design in that all participants experi-enced two test sessions in which they saw bothhigh and low tempos but for each session theyused the same level of missiles The retargetingscenarios in each of the two tempo sessions werecomparable in level of difficulty and presentationof possible correct choices although they were notexactly the same Screen objects were modifiedin name routings were slightly different and tar-gets were moved slightly to make it seem as if theproblems were different

A between-subjects secondary factor of ses-sion effect was included for all metrics as a mea-sure of experimental internal validity This factorwas represented by ldquolow firstrdquo and ldquohigh firstrdquowhich represent whether a participant was testedon the low-tempo or the high-tempo test sessionfirst One recognized threat to internal validity ismaturation in which participants can gain expe-rience or somehow change over the course of anexperiment (Adelman 1991) The session effectwas included as a secondary independent factorbecause participant training was limited and it waspossible that learning would take place betweenthe two test sessions and introduce a potential con-found By including session effect as a factor wecould detect any significant learning in the resultsand measure this threat to internal validity

For each of the dependent variables a singlecovariate of age was used whereas the use of co-variates can lead to a loss of power a repeated mea-sures design and the addition of a few meaningfulcovariates helps to increase power (Stevens1996)Two other covariates were explored video gameexperience and time in the US Navy but thesewere not significant and not used in the final analy-sis The age covariate was used because the par-ticipants ranged in age from 22 to 59 years Ingeneral the speed of mental processing slows withage and this slowing accounts for a significantportion of age-related variance in cognitive tasksmore so than any other cognitive mechanism(Zacks Hasher amp Li 2000) If age is suspected tobe a potential confound it should be included as acovariate so as not to compromise external valid-ity (Nichols Rogers amp Fisk 2003) Covariateassumptions of linearity of regression and homo-geneity of regression coefficients were met withthe exception of one secondary independent vari-

able (session effect) for the decision time variablewhich will be discussed further

RESULTS

Given the three levels of missiles (between sub-jects) the two levels of operational tempo (withinsubjects) and the two levels of the session effectsecondary factor (between subjects) the generallinear statistical model was a 3 times 2 times 2 repeatedmeasures linear mixed model Only two-factor in-teractions were considered in order to avoid a sat-urated model and to increase degrees of freedomOne participant was removed from the data poolbecause standardized scores consistently weregreater than 329 (Tabachnick amp Fidell 2001) Forall the reported results α = 05

Performance Measures

Two measures were explored to determine theimpact of the different levels of missiles and oper-ational tempos on human performance The firstmeasure examined was decision time for each re-targeting event but because this metric alone doesnot give a sense of how well participants were ableto manage an entire test session an additional per-formance measure was used called the figure ofmerit which was an aggregate score based on allavailable time and accuracy measurements

Decision time analysis The first performancemetric examined was decision time for each retar-geting scenario and as discussed previously therewere three scenarios (easy medium and hard)Because the rapid prototype was created in Macro-media Directorreg each testing session containingthe three scenarios was in effect a movie and assuch it was impossible to randomize the order ofthe three different scenarios To determine if theinability to randomize the scenario order wouldsignificantly affect the internal validity of the deci-sion time dependent variable a preliminary exper-iment was conducted to determine if the order inwhich participants saw the scenarios produced sig-nificantly different results for the decision time de-pendent variable The results demonstrated that theorder in which participants saw the scenarios wasnot significant (Cummings 2003) Thus the con-founding effect attributable to the lack of random-ization for scenarios for this primary experimentwas not considered to be a significant confound

Because of the additional factor of scenariolevel of difficulty (a within-subjects factor) the

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

4 February 2007 ndash Human Factors

emergent targets are represented across the topCurrent missile-target pairs and their associatedtime on targets are highlighted and empty cells inthe matrix demonstrate that no possible relation-ship exists between a missile and a target (eg dueto a wrong warhead type on the missile or the mis-sile is unable to reach the target) For the possiblemissile-target pairings the display shows the userwhen the missile can get to the target and the timeremaining left for the operator to make the deci-sion to retarget

The other significant display component for theretargeting display is a text-based communicationtool also referred to as the chat box which resem-bles current instant messaging programs in wide-spread use (Figure 2) The popularity of the chatbox is not just in mainstream pop culture as it isalready in use today aboard US Navy vessels Inthe context of this experiment operators wereldquochattingrdquo with superior officers who would dis-seminate information that required no action aswell as query for information that required actionon the part of the operator In addition the chatbox served as an embedded secondary workloadmeasurement tool as it allows for measuring boththe accuracy and latency of responses to queriesTraditional secondary tasking can be intrusive andintroduce an unrealistic artifact However embed-ded secondary tasks do not fundamentally changethe task or task performance and provide moresensitive measurements (Shingledecker 1987TsangampWilson1997WickensampHollands 2000)Because it is a tool that current US Navy person-nel are very familiar with the chat box providesfor ecologic and external validity

Training and testing was conducted using twodual monitor side-by-side stations as depicted inFigure 1 Each station used a Dell personal com-puter with two17-inch (43-cm) color monitors witha screen area of 1024 times 768 pixels 16-bit high-color resolution and a Pentium 3 processor Thedual monitors for each system were supported bya MATROXreg graphics card During testing alluser actions mouse movements and incomingand outgoing messages on both screens were re-corded by software into a text file In addition theusersrsquo interactions with the retargeting displaywere recorded using a scan converter and a VCRfor later data analysis

Participants

Forty-two US Navy personnel participated in the experiment The participant populationcontained both active duty officers and enlistedpersonnel who previously worked with the fire-and-forget versions of the Tomahawk missiles aswell as 13 retired US Navy personnel who eitherhad significant experience in operating older ver-sions of the Tomahawk or had experience in strikeplanning

Procedure

All participants received approximately 3 hr oftraining which included a slide presentation toexplain the prototype and two practice sessionseach of which lasted approximately 25 min Dur-ing the training participants interacted with dis-play elements and engaged in retargeting scenariossimilar to those that would be seen in test sessionsIn addition participants observed three additional

Figure 2 The test-bed chat interface

OPERATOR CAPACITY 5

practice sessions other than their own The rulesfor retargeting missiles were explained during allphases of training and reviewed before testingThe rules were relatively simple no missile pairedto a high-priority target should be retargeted loi-ter missiles were always preferred if they werecandidate missiles and communications throughthe chat box were always secondary to the higherpriority task of retargeting The first practice ses-sion consisted of a walk-through of all screen ele-ments and a demonstration of the retargetingprocess The second practice session mirrored anactual test session detailed in the next section Theonly difference between this second practice ses-sion and a test session was the possible pausingthat took place in case of mistakes or questionsParticipants were thoroughly debriefed and allquestions were addressed

After training all participants were tested ontwo separate experimental sessions with a breakof approximately 25 min between the two In bothtest sessions the participants had approximately3 min of warm-up in which they monitored mis-siles and incoming communication messages andwere presented with a simple retargeting sce-nario After the warm-up period each participantsaw three scenarios (a) an easy scenario (a singleemergent target arrival and only one candidatemissile available) (b) a medium-difficulty sce-nario (a participant had to redirect a missile fromits default target to another target in the strikebased on the instructions of a ldquosuperior officerrdquosimulated as a message coming through the chatbox) and (c) a hard retargeting scenario which con-sisted of the arrival of two emergent targets 15 sapart The arrival of the two emergent targets wasconsidered a dual retargeting scenario as both tar-gets were in competition for the same candidatemissiles In all scenarios there was one correct an-swer according to the rules However because thematrix presents all missiles that can physicallyreach a target it was possible for a participant toretarget a missile that was not the best choice butstill was viable

Communication messages were interspersedthroughout the test session including informationmessages such as status information of other shipsparticipating in the strike and health and statusmessages from missiles Both information andhealth and status messages required no responseParticipants also received action messages high-lighted in bold which by definition required a

response Six action messages appeared for eachparticipant in every test session three occurredduring monitoring periods and three occurred 30 safter commencing a retargeting scenario The testsessions concluded when participants retargeted amissile to all emergent targets and were no longerattempting to answer any action messages in thechat box After each test session a modified Situ-ation Awareness Global Assessment Technique(SAGAT) test (Endsley 2000) was administeredin the form of a single freeze-frame quiz In addi-tion the six action questions provided embeddedon-line situation awareness (SA) measurementswhich allowed for the collection of SAdata with-out the intrusion of freezes in the simulation Theuse of embedded SA measures in the chat box issimilar to the Situation Present Assessment Meth-od (SPAM Durso et al 1998) as well as to on-line SA probes used in air traffic control research(Endsley Sollenberger Nakata amp Stein 2000)

Experimental Design

There were two primary independent variablesof interest in this experiment (a) retargetable mis-sile level (8 12 or 16 missiles) and (b) operationaltempo (low or high) The three missile levels rep-resented the number of missiles that a participanthad to choose from during retargeting scenariosand was included to determine if and when par-ticipants would become overloaded because of anincreasing number of objects to cognitively pro-cess Participants were randomly assigned to thethree missile levels The second primary indepen-dent variable was tempo which represented thearrival rate of the retargeting problems simulat-ing either low- or high-tempo operations In thelow-tempo sessions retargeting scenarios werespaced approximately 4 min apart and in the high-tempo sessions the arrivals were spaced only 2 minapart Each low- and high-tempo testing sessionincluded the three scenarios (easy medium orhard) and all participants were tested on all threelevels of difficulty but in each they saw only onelevel of missiles The order of presentation of thefast and slow tempos was counterbalanced acrossparticipants

Several dependent variables were used deci-sion time utilization (percentage busy time) andSA In addition for each participant an overall per-formance score was calculated based on decisiontimes ability to manage the communications and

6 February 2007 ndash Human Factors

decision accuracy(These metrics will be discussedin detail in subsequent sections) For each of thedependent measures the experiment was a mixedfactorial design in that all participants experi-enced two test sessions in which they saw bothhigh and low tempos but for each session theyused the same level of missiles The retargetingscenarios in each of the two tempo sessions werecomparable in level of difficulty and presentationof possible correct choices although they were notexactly the same Screen objects were modifiedin name routings were slightly different and tar-gets were moved slightly to make it seem as if theproblems were different

A between-subjects secondary factor of ses-sion effect was included for all metrics as a mea-sure of experimental internal validity This factorwas represented by ldquolow firstrdquo and ldquohigh firstrdquowhich represent whether a participant was testedon the low-tempo or the high-tempo test sessionfirst One recognized threat to internal validity ismaturation in which participants can gain expe-rience or somehow change over the course of anexperiment (Adelman 1991) The session effectwas included as a secondary independent factorbecause participant training was limited and it waspossible that learning would take place betweenthe two test sessions and introduce a potential con-found By including session effect as a factor wecould detect any significant learning in the resultsand measure this threat to internal validity

For each of the dependent variables a singlecovariate of age was used whereas the use of co-variates can lead to a loss of power a repeated mea-sures design and the addition of a few meaningfulcovariates helps to increase power (Stevens1996)Two other covariates were explored video gameexperience and time in the US Navy but thesewere not significant and not used in the final analy-sis The age covariate was used because the par-ticipants ranged in age from 22 to 59 years Ingeneral the speed of mental processing slows withage and this slowing accounts for a significantportion of age-related variance in cognitive tasksmore so than any other cognitive mechanism(Zacks Hasher amp Li 2000) If age is suspected tobe a potential confound it should be included as acovariate so as not to compromise external valid-ity (Nichols Rogers amp Fisk 2003) Covariateassumptions of linearity of regression and homo-geneity of regression coefficients were met withthe exception of one secondary independent vari-

able (session effect) for the decision time variablewhich will be discussed further

RESULTS

Given the three levels of missiles (between sub-jects) the two levels of operational tempo (withinsubjects) and the two levels of the session effectsecondary factor (between subjects) the generallinear statistical model was a 3 times 2 times 2 repeatedmeasures linear mixed model Only two-factor in-teractions were considered in order to avoid a sat-urated model and to increase degrees of freedomOne participant was removed from the data poolbecause standardized scores consistently weregreater than 329 (Tabachnick amp Fidell 2001) Forall the reported results α = 05

Performance Measures

Two measures were explored to determine theimpact of the different levels of missiles and oper-ational tempos on human performance The firstmeasure examined was decision time for each re-targeting event but because this metric alone doesnot give a sense of how well participants were ableto manage an entire test session an additional per-formance measure was used called the figure ofmerit which was an aggregate score based on allavailable time and accuracy measurements

Decision time analysis The first performancemetric examined was decision time for each retar-geting scenario and as discussed previously therewere three scenarios (easy medium and hard)Because the rapid prototype was created in Macro-media Directorreg each testing session containingthe three scenarios was in effect a movie and assuch it was impossible to randomize the order ofthe three different scenarios To determine if theinability to randomize the scenario order wouldsignificantly affect the internal validity of the deci-sion time dependent variable a preliminary exper-iment was conducted to determine if the order inwhich participants saw the scenarios produced sig-nificantly different results for the decision time de-pendent variable The results demonstrated that theorder in which participants saw the scenarios wasnot significant (Cummings 2003) Thus the con-founding effect attributable to the lack of random-ization for scenarios for this primary experimentwas not considered to be a significant confound

Because of the additional factor of scenariolevel of difficulty (a within-subjects factor) the

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

OPERATOR CAPACITY 5

practice sessions other than their own The rulesfor retargeting missiles were explained during allphases of training and reviewed before testingThe rules were relatively simple no missile pairedto a high-priority target should be retargeted loi-ter missiles were always preferred if they werecandidate missiles and communications throughthe chat box were always secondary to the higherpriority task of retargeting The first practice ses-sion consisted of a walk-through of all screen ele-ments and a demonstration of the retargetingprocess The second practice session mirrored anactual test session detailed in the next section Theonly difference between this second practice ses-sion and a test session was the possible pausingthat took place in case of mistakes or questionsParticipants were thoroughly debriefed and allquestions were addressed

After training all participants were tested ontwo separate experimental sessions with a breakof approximately 25 min between the two In bothtest sessions the participants had approximately3 min of warm-up in which they monitored mis-siles and incoming communication messages andwere presented with a simple retargeting sce-nario After the warm-up period each participantsaw three scenarios (a) an easy scenario (a singleemergent target arrival and only one candidatemissile available) (b) a medium-difficulty sce-nario (a participant had to redirect a missile fromits default target to another target in the strikebased on the instructions of a ldquosuperior officerrdquosimulated as a message coming through the chatbox) and (c) a hard retargeting scenario which con-sisted of the arrival of two emergent targets 15 sapart The arrival of the two emergent targets wasconsidered a dual retargeting scenario as both tar-gets were in competition for the same candidatemissiles In all scenarios there was one correct an-swer according to the rules However because thematrix presents all missiles that can physicallyreach a target it was possible for a participant toretarget a missile that was not the best choice butstill was viable

Communication messages were interspersedthroughout the test session including informationmessages such as status information of other shipsparticipating in the strike and health and statusmessages from missiles Both information andhealth and status messages required no responseParticipants also received action messages high-lighted in bold which by definition required a

response Six action messages appeared for eachparticipant in every test session three occurredduring monitoring periods and three occurred 30 safter commencing a retargeting scenario The testsessions concluded when participants retargeted amissile to all emergent targets and were no longerattempting to answer any action messages in thechat box After each test session a modified Situ-ation Awareness Global Assessment Technique(SAGAT) test (Endsley 2000) was administeredin the form of a single freeze-frame quiz In addi-tion the six action questions provided embeddedon-line situation awareness (SA) measurementswhich allowed for the collection of SAdata with-out the intrusion of freezes in the simulation Theuse of embedded SA measures in the chat box issimilar to the Situation Present Assessment Meth-od (SPAM Durso et al 1998) as well as to on-line SA probes used in air traffic control research(Endsley Sollenberger Nakata amp Stein 2000)

Experimental Design

There were two primary independent variablesof interest in this experiment (a) retargetable mis-sile level (8 12 or 16 missiles) and (b) operationaltempo (low or high) The three missile levels rep-resented the number of missiles that a participanthad to choose from during retargeting scenariosand was included to determine if and when par-ticipants would become overloaded because of anincreasing number of objects to cognitively pro-cess Participants were randomly assigned to thethree missile levels The second primary indepen-dent variable was tempo which represented thearrival rate of the retargeting problems simulat-ing either low- or high-tempo operations In thelow-tempo sessions retargeting scenarios werespaced approximately 4 min apart and in the high-tempo sessions the arrivals were spaced only 2 minapart Each low- and high-tempo testing sessionincluded the three scenarios (easy medium orhard) and all participants were tested on all threelevels of difficulty but in each they saw only onelevel of missiles The order of presentation of thefast and slow tempos was counterbalanced acrossparticipants

Several dependent variables were used deci-sion time utilization (percentage busy time) andSA In addition for each participant an overall per-formance score was calculated based on decisiontimes ability to manage the communications and

6 February 2007 ndash Human Factors

decision accuracy(These metrics will be discussedin detail in subsequent sections) For each of thedependent measures the experiment was a mixedfactorial design in that all participants experi-enced two test sessions in which they saw bothhigh and low tempos but for each session theyused the same level of missiles The retargetingscenarios in each of the two tempo sessions werecomparable in level of difficulty and presentationof possible correct choices although they were notexactly the same Screen objects were modifiedin name routings were slightly different and tar-gets were moved slightly to make it seem as if theproblems were different

A between-subjects secondary factor of ses-sion effect was included for all metrics as a mea-sure of experimental internal validity This factorwas represented by ldquolow firstrdquo and ldquohigh firstrdquowhich represent whether a participant was testedon the low-tempo or the high-tempo test sessionfirst One recognized threat to internal validity ismaturation in which participants can gain expe-rience or somehow change over the course of anexperiment (Adelman 1991) The session effectwas included as a secondary independent factorbecause participant training was limited and it waspossible that learning would take place betweenthe two test sessions and introduce a potential con-found By including session effect as a factor wecould detect any significant learning in the resultsand measure this threat to internal validity

For each of the dependent variables a singlecovariate of age was used whereas the use of co-variates can lead to a loss of power a repeated mea-sures design and the addition of a few meaningfulcovariates helps to increase power (Stevens1996)Two other covariates were explored video gameexperience and time in the US Navy but thesewere not significant and not used in the final analy-sis The age covariate was used because the par-ticipants ranged in age from 22 to 59 years Ingeneral the speed of mental processing slows withage and this slowing accounts for a significantportion of age-related variance in cognitive tasksmore so than any other cognitive mechanism(Zacks Hasher amp Li 2000) If age is suspected tobe a potential confound it should be included as acovariate so as not to compromise external valid-ity (Nichols Rogers amp Fisk 2003) Covariateassumptions of linearity of regression and homo-geneity of regression coefficients were met withthe exception of one secondary independent vari-

able (session effect) for the decision time variablewhich will be discussed further

RESULTS

Given the three levels of missiles (between sub-jects) the two levels of operational tempo (withinsubjects) and the two levels of the session effectsecondary factor (between subjects) the generallinear statistical model was a 3 times 2 times 2 repeatedmeasures linear mixed model Only two-factor in-teractions were considered in order to avoid a sat-urated model and to increase degrees of freedomOne participant was removed from the data poolbecause standardized scores consistently weregreater than 329 (Tabachnick amp Fidell 2001) Forall the reported results α = 05

Performance Measures

Two measures were explored to determine theimpact of the different levels of missiles and oper-ational tempos on human performance The firstmeasure examined was decision time for each re-targeting event but because this metric alone doesnot give a sense of how well participants were ableto manage an entire test session an additional per-formance measure was used called the figure ofmerit which was an aggregate score based on allavailable time and accuracy measurements

Decision time analysis The first performancemetric examined was decision time for each retar-geting scenario and as discussed previously therewere three scenarios (easy medium and hard)Because the rapid prototype was created in Macro-media Directorreg each testing session containingthe three scenarios was in effect a movie and assuch it was impossible to randomize the order ofthe three different scenarios To determine if theinability to randomize the scenario order wouldsignificantly affect the internal validity of the deci-sion time dependent variable a preliminary exper-iment was conducted to determine if the order inwhich participants saw the scenarios produced sig-nificantly different results for the decision time de-pendent variable The results demonstrated that theorder in which participants saw the scenarios wasnot significant (Cummings 2003) Thus the con-founding effect attributable to the lack of random-ization for scenarios for this primary experimentwas not considered to be a significant confound

Because of the additional factor of scenariolevel of difficulty (a within-subjects factor) the

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

6 February 2007 ndash Human Factors

decision accuracy(These metrics will be discussedin detail in subsequent sections) For each of thedependent measures the experiment was a mixedfactorial design in that all participants experi-enced two test sessions in which they saw bothhigh and low tempos but for each session theyused the same level of missiles The retargetingscenarios in each of the two tempo sessions werecomparable in level of difficulty and presentationof possible correct choices although they were notexactly the same Screen objects were modifiedin name routings were slightly different and tar-gets were moved slightly to make it seem as if theproblems were different

A between-subjects secondary factor of ses-sion effect was included for all metrics as a mea-sure of experimental internal validity This factorwas represented by ldquolow firstrdquo and ldquohigh firstrdquowhich represent whether a participant was testedon the low-tempo or the high-tempo test sessionfirst One recognized threat to internal validity ismaturation in which participants can gain expe-rience or somehow change over the course of anexperiment (Adelman 1991) The session effectwas included as a secondary independent factorbecause participant training was limited and it waspossible that learning would take place betweenthe two test sessions and introduce a potential con-found By including session effect as a factor wecould detect any significant learning in the resultsand measure this threat to internal validity

For each of the dependent variables a singlecovariate of age was used whereas the use of co-variates can lead to a loss of power a repeated mea-sures design and the addition of a few meaningfulcovariates helps to increase power (Stevens1996)Two other covariates were explored video gameexperience and time in the US Navy but thesewere not significant and not used in the final analy-sis The age covariate was used because the par-ticipants ranged in age from 22 to 59 years Ingeneral the speed of mental processing slows withage and this slowing accounts for a significantportion of age-related variance in cognitive tasksmore so than any other cognitive mechanism(Zacks Hasher amp Li 2000) If age is suspected tobe a potential confound it should be included as acovariate so as not to compromise external valid-ity (Nichols Rogers amp Fisk 2003) Covariateassumptions of linearity of regression and homo-geneity of regression coefficients were met withthe exception of one secondary independent vari-

able (session effect) for the decision time variablewhich will be discussed further

RESULTS

Given the three levels of missiles (between sub-jects) the two levels of operational tempo (withinsubjects) and the two levels of the session effectsecondary factor (between subjects) the generallinear statistical model was a 3 times 2 times 2 repeatedmeasures linear mixed model Only two-factor in-teractions were considered in order to avoid a sat-urated model and to increase degrees of freedomOne participant was removed from the data poolbecause standardized scores consistently weregreater than 329 (Tabachnick amp Fidell 2001) Forall the reported results α = 05

Performance Measures

Two measures were explored to determine theimpact of the different levels of missiles and oper-ational tempos on human performance The firstmeasure examined was decision time for each re-targeting event but because this metric alone doesnot give a sense of how well participants were ableto manage an entire test session an additional per-formance measure was used called the figure ofmerit which was an aggregate score based on allavailable time and accuracy measurements

Decision time analysis The first performancemetric examined was decision time for each retar-geting scenario and as discussed previously therewere three scenarios (easy medium and hard)Because the rapid prototype was created in Macro-media Directorreg each testing session containingthe three scenarios was in effect a movie and assuch it was impossible to randomize the order ofthe three different scenarios To determine if theinability to randomize the scenario order wouldsignificantly affect the internal validity of the deci-sion time dependent variable a preliminary exper-iment was conducted to determine if the order inwhich participants saw the scenarios produced sig-nificantly different results for the decision time de-pendent variable The results demonstrated that theorder in which participants saw the scenarios wasnot significant (Cummings 2003) Thus the con-founding effect attributable to the lack of random-ization for scenarios for this primary experimentwas not considered to be a significant confound

Because of the additional factor of scenariolevel of difficulty (a within-subjects factor) the

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

OPERATOR CAPACITY 7

statistical model used for this portion was a 3 times 2times 2 times 3 repeated measures linear mixed modelBecause of deviations from normality the decisiontime data were transformed using a square roottransformation Of the main effects only scenarioand missiles were found to be significant scenarioF(1 197) = 20286 p lt 001 missiles F(2 198) =1094 p lt 001 However there was one marginalinteraction for the Tempo times Session term F(1195) = 380 p = 053 and a significant interactionfor the Scenario times Missiles term F(4 130) = 319p = 016 The age covariate was significant F(1196) = 7640 p = 006 however when checkingfor homogeneity of regression coefficients it wasnoted that age interacted with session (p = 041)Given that other linearity and homogeneity assump-tions were met for the remaining independent var-iables and also that the session independentvariable was a secondary factor included to test alearning effect this interaction was not consid-ered confounding

The significance of the Tempo times Session inter-action for decision time is important as it indicatesa possible learning effect Because of the limited

training time for participants and the use of a re-peated measures design one potential confoundfor this experiment was that learning could occurbetween the two sessions Repeated measures ex-perimental designs are useful for experiments thatseek to measure learning or practice effects how-ever if learning is not an explicit measure of thestudy it can introduce a possible confound calleda carry-over effect (Keppel 1982) In the case ofthe two test sessions participantsrsquo performancecould improve as a result of learning between thefirst and second sessions which is why the ses-sion effect factor was included in the design Onesolution to combat the carry-over effect is to coun-terbalance the order of presentation (Keppel1982)which was the case in these experiments How-ever this interaction was marginal (p = 053) sothe possible learning carry-over effect for onecombination (high session first) was not a signif-icant experimental confound

The significant interaction between scenarioand number of missiles also required investiga-tion Figure 3 demonstrates that although the deci-sion times were essentially the same for all missile

Figure 3 Scenario times Missiles interaction for decision time (in seconds)

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

8 February 2007 ndash Human Factors

levels for the easy scenarios the decision times in-creased for both the increasing levels of difficultyand the increasing levels of missiles for the medi-um and hard scenarios The scenario factor wassignificant as expected as the easy medium andhard scenarios were designed to produce differentresponse times and were initially tested to ensurethis effect

Because a primary area of exploration was toinvestigate the level of missiles that influencedcontroller performance the significant main ef-fect for missiles was examined ABonferroni com-parison between missile levels yielded interestingresults The contrast between the 8- and 12-missilecategories was not significant (p = 268) but thatfor both the 8- and 16-missile (p lt 001) and the12- and 16-missile comparisons (p = 015) wassignificant

Overall performance analysis Because thedecision time analysis examined only the perfor-mance of participants for specific retargeting sit-uations not taking into account their answeraccuracy or their abilities to attend to both prima-ry and secondary tasking an overall performancemeasure was needed To measure overall perfor-mance per session a figure of merit (FOM) scorewas developed Because each test session wascomposed of three test scenarios (easy mediumand hard) the FOM represented overall controllerperformance per test session as defined by thesummation of a performance score for each sce-nario Each scenario performance score was theproduct of weighted decision times (the quickesttimes were rewarded with higher points) answeraccuracy (1 for the best answer 05 for a satisfic-ing answer and 0 for completely wrong) and sce-nario complexity weight (thus easier scenarioswere not weighted as much as difficult scenarios)Answer accuracy in retargeting scenarios includ-ed satisficing responses Satisficing is a form ofbounded rationality in which a solution is selectedthat achieves a ldquogood enoughrdquo status In the con-text of selecting missiles for retargeting satisfic-ing occurs when a missile is selected that is not theoptimal missile but can achieve the mission none-theless In addition six secondary tasking perfor-mance terms were included in the FOM whichincorporated both the time taken to respond to sec-ondary tasking introduced through the chat boxaction messages as well as answer accuracy Thesewere also weighted but on a scale of one-third theweightings for the retargeting scenarios (The de-

tailed FOM derivation is given in Cummings2003)

Thus the FOM contained nine algebraic termsin two basic categories three for the weighted re-targeting scenario scores and six for the actionmessage secondary tasking scores Pilot testingwas used to determine the correct weights whichwere designed so that in the nominal conditionthe secondary tasking FOM component would beone third of the weight of the retargeting scenarioscores which indicate primary task performanceThus primary task performance was weightedtwice as much as secondary task performanceScores ranged from 67 to 346 with higher scoresindicating better overall performance For theexperimental data the average secondary taskcomponent of the FOM was 359 which wascommensurate with predictions However the sec-ondary task percentage of the total FOM rangedfrom 7 if participants performed the retarget-ing and secondary tasks up to 76 if participantsneglected to perform either or both of those tasksand this resulted in lower overall performancescores For the average FOM of 193 124 pointswere attributable to the three retargeting scenarioscores and 69 points were attributed to secondarytasking

The experimental design for the FOM variablewas a 3 times 2 times 2 repeated measures linear mixedmodel The scenario level of difficulty was omit-ted because the FOM was an aggregate score thatincluded these levels of difficulty As in the deci-sion time analysis the age covariate was used Forthe FOM overall performance metric missile levelwas significant F(2 71) = 508 p = 009 as wastempo F(1 71) = 658 p = 012 As in the deci-sion time analysis the age covariate was signifi-cant There was a single significant higher orderinteraction Tempo times Session F(1 71) = 814 p =006 The Bonferroni comparison results for theFOM dependent variable demonstrated the sameresults as in the decision time analysis Overallperformance was significantly degraded when themissile level increased from 8 to 16 (p = 028) andfrom 12 to 16 (p =018) whereas performance wasessentially the same between 8 and 12 missiles(p = 100) This result demonstrates further evi-dence that a significant relationship exists among8 12 and 16 missiles Figure 4 demonstrates therelationship among missile level tempo and theFOM There is a clear decrement in performanceat the 16-missile level

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

OPERATOR CAPACITY 9

The tempo significant result for the FOM de-pendent variable was unexpected because the ini-tial hypothesis was that as operational tempoincreased overall performance would decreaseSurprisingly these experimental results revealedthat overall performance increased as tempo in-creased However because there was a significantinteraction involving tempo the significance ofthe tempo main effect must be interpreted in lightof the significant interaction For the Tempo times Ses-sion interaction if participants saw the low-temposession first they did much better overall on thesubsequent high-tempo testing session Howeverthere was essentially no difference in performancebetween the two if the high-tempo session wasseen first Thus the significance of tempo was con-founded most likely by learning as discussed pre-viously and is not a conclusive result This resultsuggests that in complex testing scenarios that arelikely to introduce a learning effect in multiple testsessions the statistical need to randomize presen-tation should be balanced against the likely learn-ing effect

Workload Measurement

When designing a command and control deci-sion support system for rapid real-time replanningof in-flight vehicles consideration of workload andhow changes in system states impact workloadand subsequent mission success or failure is para-mount In this experimental analysis of workloadworkload was objectively measured through a uti-lization metric The utilization metric is defined asthe percentage of time the operator is busy ρ =operator busy timetotal operation time Busy wasdefined as the time the operator was actively en-gaged in either a retargeting decision or respond-ing to a communication message The time in bothcases was measured from notification of the eventto either the successful completion of the retarget-ing task decision or answering the incoming mes-sage Notification of events was highly salient inthat an auditory alarm sounded and in the case ofemergent targets pop-up windows appeared onboth screens The results from the experiment wereexpected to illustrate that for given arrival rates

Figure 4 Missile levels versus figure of merit overall performance

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

10 February 2007 ndash Human Factors

and missile levels operators would experiencespecific workload levels as expressed through uti-lization rates For example given an arrival rate of5 targets in 20 min an operator who can retargetmissiles on average in 2 min will experience lowerutilizationworkload than will an operator whoneeds 4 min to solve a problem To measure wheth-er or not the independent variables significantlyimpacted participant utilization we used the 3 times2 times 2 repeated measures linear mixed model in-cluding the age covariate Because of deviationsfrom normality the utilization data were trans-formed using a square root transformation

Tempo the arrival rate of retargeting scenarioswas significant F(1 62) = 1967 p lt 001 whichwas an expected result as was missile level F(263) = 734 p = 001 The covariate of age was sig-nificant F(1 64) = 650 p = 013 There were nosignificant interactions ABonferroni comparisonbetween missile levels demonstrated that the uti-lization dependent variable between 8 and 12 mis-siles was not significantly different however the16-missile level was significantly different fromboth the 8- and 12-missile levels (8 vs 12 p =100 8 vs 16 p = 001 12 vs 16 p = 019) Theseresults which mirror those of the decision timeand FOM missile comparisons illustrate that bothtempo and missile level significantly impacted theworkload of participants with 16 missiles pro-ducing significantly higher workload than thatproduced by 8 and 12 missiles

Prior human performance models using steadystate queuing models estimate that humans cannotsuccessfully perform at utilization rates greaterthan 70 (Rouse 1983 Schmidt 1978) To de-termine whether or not this prediction was true forthis series of experiments using the FOM overallperformance measure we performed a Pearson cor-relation (which measures the linear relationship be-tween two quantitative variables) which revealeda significant relationship between utilization ratesand overall performance r = ndash394 p lt 001 Asutilization rates increased overall performance de-clined Figure 5 representing the utilization per-cents from both test sessions for every participantillustrates that given 10 increases in utilizationrates there is a significant drop in overall perfor-mance when utilization rates rise above 70 Thisdifference in overall performance scores belowand above 70 was statistically confirmed witha nonparametric Mann-Whitney test (p lt 001 forboth test scenarios)

Interestingly for utilization rates below 40 itappears that overall performance scores also de-clined however no experimental data existed be-low 33 utilization As will be discussed in a latersection it is possible that under low-workload sit-uations situational awareness can decrease lead-ing to an overall degradation in performance Thisanalysis confirms that participants do performsignificantly worse at utilization rates greaterthan 70 but also it is possible that under low

Figure 5 Performance for increasing utilization scores

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

OPERATOR CAPACITY 11

utilizations performance will also degrade how-ever this degradation could not be tested for sig-nificance with an N of 13

Situational Awareness

SA has long been recognized as a critical hu-man factor in military command and control sys-tems Military command and control centers mustattempt to assimilate and reconstruct the battle pic-ture based on information from a variety of sensorsources such as weapons satellites and voice com-munications In fact Klein (2000) determined thatSA was a key element for those naval personnelengaged in tasks associated with air target track-ing aboard US Navy AEGIS cruisers (which arethe same platforms that will be responsible forTactical Tomahawk launches and could controlfuture UAV clusters) In a domain similar to thatof the Tomahawk poorly designed human inter-faces have led to reduced SA as well as degradedmission performance in remotely piloted vehicles(Ruff Narayanan amp Draper 2002) Because ofthe complexity and dynamic nature of the com-mand and control environment maintenance ofSAis considered to be of utmost importance (San-toro amp Amerson 1998)

SAcan be measured both through objective andsubjective measures However conclusions drawnfrom SAself-ratings are extremely limited and pro-vide insight into judgment processes rather than anobjective SA performance measurement (Jones2000) In order to avoid any confounds associatedwith subjective measures only objective SAmea-sures were used in this effort As discussed pre-viously a single freeze-frame quiz administeredat the conclusion of each testing session was usedto assess SA in addition to the on-line SA mea-surements introduced through the chat box Onlyresponse accuracy was considered for SA mea-surements as reaction times were indicative ofworkload not SA SAquestions introduced throughboth the on-line probes and the SAGAT quiz tar-geted comprehension and future projection andthus were testing levels 2 and 3 SA All partici-pants received practice SA queries both on lineand SAGAT prior to testing

Using the SA scores from both the on-lineprobes and the SAGAT quiz as the dependent var-iables we used the 3 times 2 times 2 repeated measureslinear mixed model to analyze any possible differ-ences using the same independent variables whichwere number of missiles arrival rate (tempo) and

session order For the on-line probes there weretwo significant results those for the session effectF(1 68) = 815 p = 002 and for missiles F(168) = 457 p = 014 Participants consistently hadhigher on-line probe SA scores for both test ses-sions when they experienced the lower workloadtest session first This result was not unexpectedbecause the tempo of the second test session wasessentially twice that of the first test session soparticipants had more time to assess and projectcorrectly during the lower workload session Thusthis is an example of an established significantlearning effect In addition the significance of theon-line probe SAmissiles factor showed an inter-esting trend not seen in the other variables In aBonferroni comparison SA as measured by theon-line probes was significantly different only be-tween 12 and 16 missiles (p = 022) whereas com-parisons between 8 and 12 missiles and between8 and 16 missiles were not (p = 056 a marginal-ly significant effect and p = 1000 respectively)This result suggests that SAdecreased from 12 to16 missiles but was unchanged between 8 and 16missiles Because of the learning effect stronglyexhibited for SAas measured by the on-line probesthe significance of the missiles factor is not ascompelling as it is for the other dependent vari-ables However in a study examining the SAof airtraffic controllers tasked to interact with up to 20aircraft performance significantly degraded be-yond 12 (Endsley amp Rogers 1996)

In a similar study on air traffic control usingmultiple SA measures on-line probes did not re-veal any significant effects although the metricwas reaction time and not accuracy because par-ticipants achieved an overall accuracy of 95(Endsley et al 2000) In this Tomahawk studyoverall participant accuracy for on-line probes was77 and thus this was a viable metric for deter-mining experimental effects Although the use ofan on-line probe as a valid SA measurement toolcontinues to show mixed results (Durso et al1998 Endsley et al 2000) this study demonstratesthat it might show important trends Howevermore research is needed to specifically addressconstruct and validity concerns

In the case of the SA dependent variable rep-resented by the SAGAT questions there was nosignificance for either the main effects or any inter-actions This lack of significance of main effectsfor SAGAT questions could be interpreted in dif-ferent ways First the interface design could be

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

12 February 2007 ndash Human Factors

effective in promoting SA so despite increasingworkload SAis not compromised as the workloadincreases However it is likely that exposing theparticipants to two 20-min scenarios was not longenough to allow a true SA picture to build In ad-dition because the SAGAT quiz was administeredonly twice it is likely that more measurementswould have been needed to capture any significanteffect

The SAGAT method has demonstrated a levelof predictive validity that links the SAGAT scoresto overall participant performance (Endsley 2000)however no such predictive ability has been de-monstrated for on-line probes (Endsleyetal 2000)In general improvement in overall performance isclosely linked with improvement in SA(Vidulich2000) To see if this relationship between perfor-mance and SA held true for these SA scores bothon-line and SAGAT SA scores were correlatedwith the participantsrsquo FOM scores which repre-sented overall performance There was no signif-icant correlation with the SAGAT scores howeverthe on-line probe SA measures were significantlycorrelated (Pearson correlation = 432 which wassignificant at the p lt 001 level)

DISCUSSION

Table 1 presents the aggregate data analysisresults for all the dependent variables and the pre-determined main effects The session effect inde-pendent variable was included to determine if alearning effect would confound the results andthreaten internal validity There was only one sig-nificant main session effect the SAon-line probedependent variable so this dependent measurewas sensitive to the learning that occurred betweentest sessions However there was a significant ses-sion effect interaction with the tempo independentvariable for the FOM dependent variable and a

marginally significant interaction for decision timeIn both cases it appears that a learning effect didinfluence the tempo significance result This analy-sis illustrates the importance of analyzing inter-action effects before interpreting significant maineffects

The remaining significant effects were notcompromised with interactions The significanceof the tempo independent variable for the utiliza-tion (percentage busy time) dependent variablewhich represented both high and low arrival ratesof retargeting problems was expected A prelim-inary discrete event simulation model predictedthat the tempo factor would be significant for uti-lization rates given the 2- and 4-min arrival inter-vals In addition the significance of the scenarioindependent variable for decision time was ex-pected because the scenarios were designed fordifferent levels of difficulty and thus would takelonger as complexity increased As expected be-cause of the age range of participants the age co-variate was significant for all metrics which heldtrue except for SA This result suggests that it isimportant for the age of the participant pool toclosely resemble that of the user population whichfor the Tactical Tomahawk community is expect-ed to be 20 to 40 years

Given the large range in participant ages from22 to 59 years it is important to ensure that thesignificance among 8 12 and 16 missiles is notconfounded by groupings of older participantswithin the between-subjects missile factor Table2 details the mean age in each missile category aswell as the minimum and maximum ages Thefinding that controlling 16 missiles as opposed to8 and 12 missiles caused degraded performanceis not confounded by age

The most important finding in the data sum-mary (Table 1) is that the level of missiles was notonly significant for three primary performance

TABLE 1 Independent Variable Significance Summary

Decision Figure of SAFactor Utilization Time Merit On Line

Tempo lt001 186 012 538Missiles lt001 lt001 009 014Session 334 338 235 006Scenario na lt001 na naAge (covariate) 013 006 lt001 108

Note Numbers in italics indicate statistical significance at alpha = 05

Interaction detected with the session independent variable

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

OPERATOR CAPACITY 13

measures (FOM utilization and decision time)but also more importantly in each of these casesthe 8- and 12-missile levels were statistically nodifferent whereas the 16-missile level producedsignificantly different results (Table 3)

These results have significant implications forhuman supervisory control of multiple airbornevehicles Ruff et al (2002) determined that at mostonly four UAVs could be effectively controlled bya human However this limitation was primarilyattributable to the fact that these four UAVs re-quired significantly more manual human controlthan do Tactical Tomahawk missiles which requireno manual input and for which the humanrsquos role isstrictly supervisory In the case of the Tomahawkmissiles and more autonomous UAVs of the fu-ture human intervention is required only for emer-gencies and higher level mission planning not formanual control Thus as systems become moreautonomous humans will be able to individuallycontrol a larger number

Interestingly in other air traffic control studiesinvestigating workload and performance as a func-tion of increasing numbers of aircraft similar re-sults have been reported In a study examiningsubjective and objective workload of air trafficcontrollers under free flight conditions (pilot self-separation) participants were presented with alow-workload condition of 10 aircraft on averageas well as a high-workload condition of 17 aircrafton average Just as in the Tomahawk study partic-ipantsrsquo objective performance was significantlydegraded at the 17-aircraft level as compared with10 aircraft (Hilburn Bakker Pekela amp Parasur-

aman 1997) In another study looking at air traf-fic controller performance in free flight conditionsparticipants were presented with either 11 or 17aircraft and as in the Hilburn et al (1997) studyperformance of participants was significantly de-graded at the higher level Under the moderatetraffic conditions of 11 aircraft conflict detectionperformance approached 100 but dropped to50 under the high-traffic condition (GalsterDuley Masalonis amp Parasuraman 2001)

The similarity between the numbers of missilesand commercial aircraft that could be effectivelycontrolled by controllers under free flight condi-tions is remarkable During free flight operationsaircraft are operated ldquoautonomouslyrdquo by the pi-lots ndash that is all navigation decisions are madeinternal to the aircraft whereas air traffic con-trollers monitor the progress of aircraft and are re-quired to intervene only when some unexpectedcondition (eg an emergency) or change in a goalstate occurs (eg the desire to land at another air-port because of weather) This is very similar tothe supervision of missiles or autonomous UAVsin that they navigate on their own and the operatorneeds to intervene only in an emergency or the oc-currence of an emergent target which representsa change in goal state Thus in two related humansupervisory control domains that require rapiddecision in real time under time pressure humanseffectively controlled 10 to 12 airborne vehiclesbut in both domains performance was significant-ly degraded at 16 to 17 airborne vehicles Thismeta-analysis is preliminary and this area deservesmore research as it suggests some clear design lim-itations for command and control operators of au-tonomous systems

Lastly another important finding is that thisstudy provides experimental evidence as pre-dicted by Rouse (1983) and Schmidt (1978) thatoperator performance beyond utilization rates(percentage busy times) of 70 will significantlydegrade These results appear to be in keeping withthe classic Yerkes-Dodson inverted U-shaped func-tion The original Yerkes-Dodson paper detailed arelationship between stimulus strength and learn-ing (Yerkes amp Dodson 1908) however a similarrelationship was proposed for the effects of arousallevel on performance (Hebb 1955) In this experi-ment the arousal level is represented by increasingutilization rates which reflect increasing stimulias well as stress The stimuli in this experimentdefined by increasing numbers of missiles and

TABLE 2 Missile Factor Age Statistics

Missiles

8 12 16

Mean age 39 36 33Minimum age 26 24 22Maximum age 59 57 51

TABLE 3 Missile Factor Bonferroni Comparison

Comparison 8 vs12 12 vs16 8 vs16

Utilization 1000 019 001Decision time 268 015 lt001Figure of merit 1000 018 028

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

14 February 2007 ndash Human Factors

increasing operational tempo represent increasesin overall mental workload Participants tended toachieve optimal performance between 50 and60 busy time (Figure 5) with performance sig-nificantly dropping as utilization rates increasedand possibly dropping under lower utilizations

This finding provides a tangible design guide-line in that systems that task operators beyond the70 utilization level will operate at a suboptimallevel Whereas the decision times and FOM per-formance scores are specific to the interface andexperimental design and thus cannot be effective-ly compared with other interfaces or other un-manned vehicle systems the utilization dependentvariable is not as limited Because it is based on theamount of time an operator is engaged in controltasks and the 70 ceiling applies across humansupervisory control settings it is a metric that canprovide insight for both interfaces and system de-signs This is important in that it can be used notonly to predict manning levels for a single systembut also to provide a generalizable metric that canbe compared across automation levels interfacedesigns and system architectures such as networksof heterogeneous vehicles

CONCLUSION

There is significant interest in the Department ofDefense to design and build networks of unmannedvehicles that will have the ability to operate auton-omously Not only will these changes affect themilitary command and control battle space theywill also affect the civilian airspace control pictureas many civilian agencies are considering the useof autonomous vehicles in domains such as cargoflights border patrol and other homeland defensepurposes With the influx of higher levels of con-trol autonomy into unmanned vehicles the natureof human supervisory control is changing and thusintroducing new layers of cognitive complexityIdentification of automated decision support issuesfor unmanned command and control systems iscritical because human operators must integratetemporal and spatial elements as well as solve prob-lems manage assets and perform contingencyplanning in a varying workload environment Tothis end this research effort was designed to builda human-in-the-loop simulation platform to mimicthe multiple-vehicle supervisory control domainand to investigate basic cognitive limitations inthe execution of human supervisory control tasksof autonomous vehicles

The most significant finding from the examina-tion of the impact of increasing levels of missileswith increasing operational tempo rates was thatregardless of operational tempo participantsrsquoper-formance on three separate objective performancemetrics was significantly degraded when control-ling 16 missiles as was their situation awarenessStrikingly these results are very similar to thoseof air traffic control studies that demonstrated asimilar decline in performance for controllers man-aging 17 aircraft These results highlight the needto develop a predictive model for controller capac-ity that is not based on a single interface design anda single level of automation as demonstrated inthis experiment but instead can be applied acrossdomains interfaces and autonomy levels Yet an-other significant finding was that a 70 utilization(percentage busy time) score was a valid thresholdfor predicting significant performance decay andcould be a generalizable metric that can aid in man-ning predictions as well as address system archi-tecture and decision support design questions

ACKNOWLEDGMENTS

This research was sponsored by the Naval Sur-face Warfare Center Dahlgren Division (NSWCDD) through a grant from the Office of Naval Re-search Knowledge and Superiority Future NavalCapabilities Program Any opinions findings andconclusions or recommendations expressed in thismaterial are those of the authors and do not neces-sarily reflect the views of the NSWCDD or the Of-fice of Naval Research In addition we would liketo thank Richard Jagacinski Mica Endsley andother anonymous reviewers for their insightfulcomments for both this and follow-on research

REFERENCES

Adelman L (1991) Experiments quasi-experiments and case studiesAreview of empirical methods for evaluating decision support sys-tems IEEE Transactions on Systems Man and Cybernetics 22293ndash301

Cummings M L (2003) Designing decision support systems for revo-lutionary command and control domains Unpublished doctoral dis-sertation University of Virginia Charlottesville

Durso F T Hackworth C A Truitt T R Crutchfield J Nikolic Damp Manning C A (1998) Situation awareness as a predictor ofperformance for en route air traffic controllers Air Traffic ControlQuarterly 6 1ndash20

Endsley M (2000) Direct measurement of situation awarenessValidity and use of SAGAT In M Endsley amp D J Garland (Eds)Situation awareness analysis and measurement (pp 73ndash112)Mahwah NJ Erlbaum

Endsley M amp Rogers M D (1996) Attention distribution and situ-ation awareness in air traffic control In Proceedings of the Human

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005

OPERATOR CAPACITY 15

Factors and Ergonomics Society 40th Annual Meeting (pp 82ndash85)Santa Monica CA Human Factors and Ergonomics Society

Endsley M Sollenberger R L Nakata A amp Stein E S (2000)Situation awareness in air traffic control Enhanced displays foradvanced operations Atlantic City NJ US Department ofTransportation

Galster S M Duley J A Masalonis A J amp Parasuraman R (2001)Air traffic controller performance and workload under mature freeflight Conflict detection and resolution of aircraft self-separationInternational Journal of Aviation Psychology 11 71ndash93

Hebb D O (1955) Drives and the CNS Psychological Review 62243ndash254

Hilburn B G Bakker M W P Pekela W D amp Parasuraman R(1997 October) The effect of free flight on air traffic controller men-tal workload monitoring and system performance Paper presentedat the 10th International Conference of European Aerospace Socie-ties Amsterdam

Jones D G (2000) Subjective measures of situation awareness In MEndsley amp D J Garland (Eds) Situation awareness analysis andmeasurement (pp 113ndash128) Mahwah NJ Erlbaum

Keppel G (1982) Design and analysis A researcherrsquos handbookEnglewood Cliffs NJ Prentice Hall

Klein G (2000) Analysis of situation awareness from critical incidentreports In M Endsley amp D J Garland (Eds) Situation awarenessanalysis and measurement (pp 51ndash71) Mahwah NJ Erlbaum

Nichols T A Rogers W A amp Fisk A D (2003) Do you know howold your participants are Ergonomics in Design 11(3) 22ndash26

Rouse W B (1983) Systems engineering models of human-machineinteraction New York North Holland

Ruff H A Narayanan S amp Draper M H (2002) Human interactionwith levels of automation and decision-aid fidelity in the supervi-sory control of multiple simulated unmanned air vehicles Presence11 335ndash351

Santoro T P amp Amerson T L (1998 June) Definition and measure-ment of situation awareness in the submarine attack center Paperpresented at the Command and Control Research and TechnologySymposium Monterey CA

Schmidt D K (1978) A queuing analysis of the air traffic controllerrsquosworkload IEEE Transactions on Systems Man and Cybernetics8 492ndash498

Sheridan T B (1992) Telerobotics automation and human supervi-sory control Cambridge MA MIT Press

Shingledecker C A (1987) In-flight workload assessment usingembedded secondary radio communication tasks In A Roscoe (Ed)The Practical Assessment of Pilot Workload (AGARDograph No282 pp 11ndash14) Neuilly sur Seine France Advisory Group forAerospace Research and Development

Stevens J (1996) Applied multivariate statistics for the social sciences(3rd ed) Mahwah NJ Erlbaum

Tabachnick B G amp Fidell L S (2001) Using multivariate statistics(4th ed) New York HarperCollins

Tsang P amp Wilson G F (1997) Mental Workload In G Salvendy(Ed) Handbook of Human Factors and Ergonomics (2nd ed pp417ndash489) New York John Wiley amp Sons Inc

US Navy (2002) Tomahawk weapons system (Baseline IV) opera-tional requirements document Washington DC Department ofDefense

Vidulich M A (2000) Testing the sensitivity of situation awarenessmetrics in interface evaluations In D J Garland (Ed) Situationawareness analysis and measurement (pp 227ndash246) Mahwah NJErlbaum

Wickens C D amp Hollands J G (2000) Engineering psychology andhuman performance (3rd ed) Upper Saddle River NJ Prentice-Hall

Willis R A (2001) Tactical Tomahawk weapon control system userinterface analysis and design Unpublished masterrsquos thesis Uni-versity of Virginia Charlottesville

Yerkes R M amp Dodson J D (1908) The relation of strength of stim-ulus to rapidity of habit-formation Journal of ComparativeNeurology and Psychology 1 459ndash482

Zacks R T Hasher L amp Li K Z H (2000) Human memory In FI M Craik amp T A Salthouse (Eds) The handbook of aging andcognition (pp 293ndash357) Mahwah NJ Erlbaum

Mary L ldquoMissyrdquo Cummings is an assistant professor ofaeronautics and astronautics and director of the Humansand Automation Laboratory at the Massachusetts In-stitute of Technology She received her PhD in systemsengineering in 2003 from the University of Virginia

Stephanie Guerlain is an associate professor of systemsand information engineering and director of the Human-Computer Interaction Laboratory at the University ofVirginia She received her PhD in industrial and systemsengineering in 1995 from the Ohio State University atColumbus

Date received May 12 2004Date accepted November 7 2005