visualization is better! a comparative evaluationjgoodall/goodall-vizsec09.pdf · user testing for...
TRANSCRIPT
Visualization is Better! A Comparative Evaluation
John R. Goodall *
Secure Decisions division of Applied Visions Inc.
ABSTRACT
User testing is an integral component of user-centered design, but
has only rarely been applied to visualization for cyber security
applications. This paper describes a comparative evaluation of a
visualization application and a traditional interface for analyzing
network packet captures, that was conducted as part of the user-
centered design process. Structured, well-defined tasks and
exploratory, open-ended tasks were completed with both tools.
Accuracy and efficiency were measured for the well-defined
tasks, number of insights was measured for exploratory tasks and
user perceptions were recorded for each tool. The results of this
evaluation demonstrated that users performed significantly more
accurately in the well-defined tasks, discovered a higher number
of insights and demonstrated a clear preference for the
visualization tool. The study presented here may be useful for
future visualization for network security visualization evaluation
designers. Some of the challenges and lessons learned are
described.
KEYWORDS: User testing, comparative evaluation, security visualization, user-centered design.
INDEX TERMS: H.5.2 [Information Interfaces and Presentation]
User Interfaces; I.3.8 [Computer Graphics] Applications
1 INTRODUCTION
Visualization for Cyber Security (VizSec), has rapidly matured
over the past several years and there are now many techniques and
tools applying information visualization to the problems of cyber
security, particularly in network traffic analysis [1-7]. However,
while the design of several of these tools are grounded in the tasks
that real world users face, these tools are rarely tested empirically.
This paper attempts to define a study design that is applicable to
user testing for network analysis applications for VizSec by
presenting a within subject comparative evaluation of a VizSec
application with a commonly used network traffic analysis tool.
TNV, shown in Figure 1, is a visualization tool to facilitate the
analysis of network packet capture data [8]. Usability guidelines
were followed during design work and usability problems were
fleshed out through multiple formative evaluations, including two
heuristic reviews and one round of usability testing. The
evaluation described in this paper tested the performance and
perception of users with TNV in comparison with a common tool
used for network analysis.
User testing can help to determine the utility and limitations of
systems. Information visualization evaluation practices vary, and
can be summarized into four areas [9]:
• Controlled experiments comparing design elements: a
comparison of specific widgets or information mappings.
• Usability evaluation of a tool: an evaluation of problems
users encounter when using a tool as part of the design
process.
• Controlled experiments comparing two or more tools: a
comparison of multiple visualizations or the state of the art
with a novel visualization.
• Case studies of tools in realistic settings: an evaluation of
a visualization tool in a natural setting with users using the
tool to accomplish real tasks.
The evaluation described in this paper falls into the third
category, controlled experiments comparing two tools.
2 RELATED WORK
User testing is an essential component of user-centered design.
Despite an increasing body of research, user testing is still
atypical in VizSec research. The following is a representative
sample of studies from the information visualization community
that empirically evaluated two or more tools.
Sebrechts, et al. [10] used a between-subject study design in a
comparison of text with both 2D and 3D visual representations of
search results. Subjects were quasi-randomly assigned to each of
the three visual conditions. Sixteen timed tasks were completed
using a think aloud protocol followed by a satisfaction
questionnaire. Results were examined qualitatively and
quantitatively.
Stasko, et al. [11] used a within-subject study design in a
comparison of two space-filling methods for visualizing
hierarchical data. Sixteen subjects performed sixteen timed tasks
on both tools using different hierarchies to avoid learning effects
due to working with the same data twice. The hierarchies were
approximately the same size, depth, and overall structure. The
ordering and conditions varied across participants. A subjective
evaluation followed the experiment and results were examined
qualitatively and quantitatively.
Plaisant, Grosjean, and Bederson [12] also compared methods
for viewing hierarchical data. They used a within-subject design
to compare a traditional interface (Windows Explorer), the
Hyberbolic tree browser, and SpaceTree with seven tasks. The
order of the interface and task sets was counterbalanced. The three
different task sets used different branches of the hierarchy that
were similar in size and complexity. A subjective evaluation
followed the experiment and results were examined qualitatively
and quantitatively.
Risden, et al. [13] used a within-subject study design in a
comparison of 3D and traditional browsers for directory
management typical of a web content developer. Subjects
performed directory management tasks using the 3D visualization
and one of the two traditional interfaces, with ordering split
evenly. The subjects timed tasks themselves by activating and
then stopping a timer. Results were examined quantitatively.
Although there are many other evaluations comparing a
visualization tool with a traditional interface or another
visualization in the literature, these are representative of the study
* email: [email protected]
57
6th International Workshop on Visualization for Cyber Security 200911 October, Atlantic City, New Jersey, USA978-1-4244-5413-6/09/$25.00 ©2009 IEEE
designs and methodologies used. The study design used here
draws on these existing studies.
3 STUDY DESIGN
The goal of this study was to compare user performance on TNV
and the current state of the art tool for network analysis. This
comparative evaluation followed a repeated measure within-
subject design where each participant performed the same series
of tasks with both of the tools. Tasks measured performance of
typical network analysis tasks using both TNV and a commonly
used tool for packet capture and analysis, Ethereal. The same data
sets were used for each of the tools, but the order of tool usage
was counterbalanced. Results were examined both quantitatively
and qualitatively (based on observations of the strategy used to
answer questions and exploratory tasks). The data sets and tasks
were pilot tested with an undergraduate student who was familiar
with both tools. The data set size was reduced as a result of this
pilot testing, and the tasks were made more specific (e.g., instead
of asking “which host,” wording was changed to “what is the IP
address of the host”) to prevent possibly ambiguous questions.
3.1 Participants
Eight Information Systems undergraduate and graduate students
participated in this study. Participants consisted of two females
and six males, with a mean age of 28.1 (standard deviation: 3.7).
All participants were familiar with the basics of computer
networking and had taken at least one class in networking (mean:
2.1 classes, standard deviation: 1.4). The mean self-reported
knowledge of computer networking, on a scale of 1 (lowest) to 10
was 4.6 (standard deviation: 2.3). Although participants were not
domain experts in networking or intrusion detection, three
participants had some experience with Ethereal.
The study tested with novice users for several reasons. First, the
problems of using current tools for learning were at the core of the
field study results. Novices used various strategies, but to learn
the basics of networking, many of the participants discussed
“playing” with various tools, including Ethereal, the tool
compared in this evaluation. Because TNV was specifically
designed to facilitate learning – both domain-level learning and
situated learning – using novices in the evaluation targeted one of
the targeted populations of TNV. Second, expert users would have
extensive experience with Ethereal coming into the study, with no
exposure to TNV. This experience with one tool would likely
skew the results of the study, since the population would be a
threat to validity, as differences in results may be due to previous
tool experience rather than to the differences in the tools and tasks
in the study. The last factor is a practical issue: repeated
solicitations of expert users to come to the lab to participate in the
study were initially met with interest, but no commitments.
3.2 Tools
This study compared two tools for network packet analysis, a
visualization tool and a traditional tool. The visualization, TNV,
presents packet capture data in a visually compact display that
emphasizes ‘local’ networks, the IP space that users are most
concerned with. The display is essentially split between three
areas. To the left is a narrow area that displays remote hosts, in
the center is the area that displays links between hosts, and the
large area to the right displays local hosts (those defined as being
“local” to the user), which is divided into a grid where each row
represents a unique local host and each column represents a time
interval, with each resulting cell color coded to the number of
packets to and from that host within that time period. Bisecting
the display to separately show local and remote hosts increased
the scalability of the visual display, so that many more hosts can
be displayed at once by dividing the available screen real estate
between local and remote hosts. In addition to being able to
display more hosts at a time, this partitioning also fits well with
analysts’ perceptions of what they deem to be important. Because
local hosts are of primary concern in most analysis tasks [14], the
majority of the display space is devoted to the local hosts. A time
slider is the primary navigation mechanism, and there are controls
for filtering and highlighting packets, drawn as triangles within
each cell.
Figure 1. TNV: visual network packet analysis tool.
Ethereal, now called Wireshark, presents packet capture
summary data as rows in a sortable table. Ethereal, shown in
Figure 2, was chosen because of its popularity for packet capture
analysis: 62% of survey respondents reported using Ethereal
frequently, and another 26% reported using the tool occasionally
[15].
Figure 2. Ethereal: traditional network packet analysis tool.
3.3 Data Sets
Three different data sets were used, each of which was used for
the same set of tasks for each tool. A very small data set was used
for training. The other two data sets were subsets of the HoneyNet
58
Project’s Scan of the Month data. The first of these consisted of
210 packets over a sixteen-hour period with eight local hosts and
thirteen remote hosts. The second consisted of 762 packets over a
nine-hour period with nine local hosts and eighteen remote hosts.
These data sets were kept intentionally small due to results from
pilot testing, in which the participant was completely
overwhelmed with more than 1,000 packets using Ethereal.
3.4 Procedure
The participants each followed the following format: a) a brief
introduction to the study and each of the tools and then requested
to sign a consent form, b) training using either TNV or Ethereal,
c) a series of timed tasks using that tool, d) training using the
second tool, e) a series of timed tasks using the second tool, f) a
satisfaction questionnaire was given. Half of the participants used
TNV first, and the other half used Ethereal first.
During the training period the participant was first introduced to
the tool. For Ethereal, the participant was given a “cheat sheet” of
commonly used filters and an explanation of three statistical
aggregation functions. Ethereal has a rich, but complex, filtering
syntax, so providing commonly used filters was a way to
minimize frustration and aid the novice users. Three of Ethereal’s
aggregation functions – Summary, Conversations, and Endpoints
– were introduced. Summary presents an overview of the data set.
Conversations lists traffic between two endpoints, and Endpoints
lists traffic to and from each IP address. The aggregation
functions were first briefly described, and each of these functions
was then used during the training tasks. For TNV, the participant
read the “Quick Start” which contained a screenshot of TNV with
labels identifying each of the major functions.
The participant and the evaluator then walked through a series
of tasks and associated questions for a very small data set.
Answers to those questions were printed on the evaluation script,
which was shared with the participant. Where there were multiple
possible methods to arrive at the same answer, each of these
methods was shown to the participant even if they arrived at the
correct answer through a different method. For example, the first
training question asked the participant to report the number of
packets in the data set; using TNV there are multiple ways to
answer this question, such as looking at the title bar, counting the
number of packets, summing the histogram totals, or looking at
the packet details. Regardless of whether the answer is correct or
incorrect, all of the possible methods were introduced, with the
participant driving the tool at all times. Participants could ask
questions and take as much time as they needed during this
training period. During this training period, participants were
encouraged to think-aloud as they interacted with the system.
Notes were taken on the strategies used to answer the questions
and the users’ think-aloud remarks.
After the training period the analyst performed a series of well-
defined tasks using two different data sets for each tool. These
tasks were timed and answers to these specific were recorded and
scored for correctness. Tasks were timed out after five minutes,
and noted as such.
Following these directed tasks, participants were asked to
explore and describe the data sets in detail for each of the tools.
Participants had five minutes to describe what they found
interesting in the data. The data set used for this open-ended task
was different for each tool depending on which tool the analysts
used first; the smaller data set was used for the first tool; for
example, if the analysts started with Ethereal, then they described
the smaller data set using Ethereal and the larger using TNV. This
was a largely arbitrary decision, but examining the same data set
in detail twice would have introduced learning effects. The time to
each of the insights described by participants was recorded, as
were the insights themselves and the total time used (up to five
minutes).
Following the training and tasks for each of the two tools, a
questionnaire was administered to the participants to measure
satisfaction and user perceptions. Each of the close-ended, Likert
scale questions in the questionnaire was repeated for both tools.
3.5 Tasks
Typical tasks were derived from the field study to be
representative of those performed during the course of ID
analysis. The tasks can be divided into two broad categories:
• Well-defined tasks are directed with one possible correct
answer;
• Exploratory tasks that ask subjects to draw open-ended
conclusions from the data.
Each of these will be discussed in turn.
3.5.1 Well-Defined Tasks
Each participant for each tool completed ten tasks consisting of a
total of sixteen questions (some tasks had multiple, related
questions). The first five tasks consisting of seven questions were
asked regarding the first data set, followed by five tasks consisting
of nine questions regarding the second, larger data set. The tasks
were of varying complexity and were chosen to be representative
of the kinds of typical tasks a user would perform during network
analysis. Wehrend and Lewis [16] defined multiple categories of
operational tasks for information visualization tools: identify,
locate, distinguish, categorize, cluster, distribution, rank, compare,
within and between relationships, associate, and correlate. To
simplify analysis and avoid ambiguous assignments of tasks to
categories, only two of these categories were used in this study.
Tasks were divided into the following high-level categories:
Compare and Identify. Comparison tasks refer to those that
require the user to make a judgment as to which of two or more
entities are larger. These tasks tended to ask higher-level
questions about the data and required the participant to use the
entire data set to make a decision. Identification tasks require the
user to locate and identify an entity based on its attributes. These
tasks were at a lower-level and participants could often focus their
attention on a small subset of the data to answer the questions.
There were five comparison and five identification tasks, although
some of these had multiple sub-questions.
3.5.2 Exploratory Tasks
The ability to explore and interact with data to draw meaningful
conclusions is an essential activity in information visualization
applications. Information visualization is “sometimes described as
a way to answer questions you didn’t know you had” [17].
However, it is impossible to measure the ability of a visualization
tool to answer these kinds of unknown questions with predefined,
directed tasks. However, exploratory tasks are very difficult to
measure quantitatively. The few attempts at doing so in previous
lab-based visualization evaluations are described below.
Mark, Kobsa, and Gonzalez [18], in a comparison of
collaborative versus individual discovery using information
visualization, described an experiment in which subjects were
asked to discover as many findings in population survey data as
they could. This task required no specific background. The
number of findings was counted and the correctness of the
findings was verified. The proportion of meaningful results within
this set was determined by two independent coders judging
whether the results constituted a meaningful finding, defined as
59
one that included a comparison between variables, indicated a
minimum or maximum, and/or had a “surprise” value.
Juarez, Hendrickson, and Garrett [19] argued for the importance
of measuring solution quality in their visualization evaluation
framework. They acknowledged the difficulty of measuring
something subjective such as quality and recommended that
quality be defined for specific domains by a group of experts for
each task solution. The experts in that study used a 1-10 rating
scale.
In the former, each result is examined to determine if it is
correct and if it is “meaningful” based on a coarse criterion of
meaningfulness; no domain expertise is required because the data
was generic. In the latter, domain experts rated each result on a
scale of “quality.” Neither of these is a perfect fit for this study.
Rather, incorrect and any insights that were part of the well-
defined task section were discarded, and the number of correct
insights found was measured. The exploratory tasks were also
examined qualitatively.
3.6 Independent and Dependent Variables
The independent variables were tool (TNV vs. Ethereal) and task
type (Comparison vs. Identification). The dependent measures
were accuracy (whether or not a given task answer was correct),
completion time (time taken to perform a successful task), and
user perceptions. User perceptions were measured through a post-
test questionnaire with a 7-point Likert scale of seven questions,
each of which was repeated for both tools, related to users’
perceptions of satisfaction, confidence, and performance.
3.7 Hypotheses
Because the visual paradigm used is expected to be more intuitive
to novice users, performance using TNV is anticipated to result in
more accurate and faster responses overall than Ethereal,
especially in comparison tasks. Specifically, this study examined
the following hypotheses:
Accuracy:
Hypothesis 1: TNV will result in fewer errors as compared to
Ethereal.
Hypothesis 1a: The advantage of fewer errors using TNV will
be more pronounced in comparison tasks as compared to Ethereal.
Hypothesis 1b: There will not be a significant advantage of
fewer errors using TNV in identification tasks as compared to
Ethereal.
Participants were expected to be more accurate and have fewer
errors across all tasks using TNV than Ethereal. Hypothesis 1 is
testing for a main effect of tool, while 1a and 1b are testing the
interaction effects between tool and task type. Accuracy was
expected to be more pronounced for comparison tasks, but
relatively comparable for identification tasks. This is expected
because of Ethereal’s powerful searching capability. The expected
performance differences across the types of tasks are shown in
Figure 3.
Figure 3. Expected performance by tool and task type.
Efficiency:
Hypothesis 2: TNV will result in shorter task completion times
as compared to Ethereal.
Hypothesis 2a: The advantage of shorter task completion times
using TNV will be more pronounced in comparison tasks as
compared to Ethereal.
Hypothesis 2b: There will not be a significant advantage of
shorter task completion times using TNV in identification tasks as
compared to Ethereal.
Participants were expected to complete tasks faster using TNV
than Ethereal. Hypothesis 2 is testing for a main effect of tool,
while 2a and 2b are testing the interaction effects between tool
and task type. Efficiency was expected to be more pronounced for
comparison tasks, but reasonably similar for identification tasks.
Exploration:
Hypothesis 3: TNV will result in a greater number of insights
during data exploration as compared to Ethereal.
During the exploratory tasks, participants are expected to
perform better, measured by the number of insights discovered,
using TNV.
User Perceptions:
Hypothesis 4: TNV will result in more positive user perceptions
as compared to Ethereal.
Finally, participants are expected to give higher satisfaction
ratings to TNV.
The order of tool usage is not expected to affect performance.
4 RESULTS AND DISCUSSION
Participants completed the training tasks in an average of 18
minutes for TNV and 10 minutes for Ethereal. This difference
could be due to greater interest in the visual tool over the textual
tool. Observationally, participants were quick to learn and eager
to explore TNV, and many expressed enthusiasm about using the
tool. While they grasped the concepts of Ethereal, most
participants had trouble understanding the tool’s aggregation
functions.
4.1 Well-Defined Tasks (Hypotheses 1 & 2)
The primary performance measures for the well-defined tasks
were the number of correctly answered questions and the time to
complete questions that were answered correctly as a function of
tool and task. Each task was measured for accuracy, with a 1
indicating correct and a 0 indicating incorrect, unable to answer,
or timed out. (For purposes of accuracy, each of these is treated
the same.) There were a total 10 tasks consisting of 16 questions
(some tasks had subtasks); subtasks were summed and divided by
the number of subtasks to derive an accuracy score for tasks that
consisted of multiple subtasks. All tasks were timed and analysis
was conducted on time to completion for successful tasks.
Table 1 summarizes the mean and standard deviation of correct
responses for each of the tools, broken down by task type,
implying a trend towards more accurate performance using TNV
for both types of tasks. The performance difference for
comparison tasks was more pronounced than for identification
tasks, but the mean for both was higher using TNV than using
Ethereal.
60
Table 1. Mean of total number of successfully completed tasks by
task type (maximum = 5), and for all individual tasks (max. = 10),
and tasks including individual subtasks (max. = 16). Standard
deviations are shown in parentheses.
TNV Ethereal
Comparison Tasks (max: 5) 4.250 (0.886) 2.750 (0.463)
Identification Tasks (max:
5)
4.458 (0.478) 3.646 (1.255)
Individual Tasks (max: 10) 8.708 (1.171) 6.396 (1.563)
Individual Subtasks (max:
16)
14.00 (1.604) 11.50 (3.251)
A repeated measures analysis of variance (RMANOVA) with
repeated measures for tool (TNV, Ethereal) and task type
(Comparison, Identification) was conducted. To ensure that
counterbalancing the tool order usage had no effect on
performance, order was treated as a between subject variable. The
between subject variable of tool order was not significant in any
of the tests.
As expected, there was a significant main effect of tool, F(1,6)
= 14.72, p = 0.009, with the mean number of correct responses
suggesting more accurate performance overall using TNV. Figure
4 plots the mean number of accurate responses across all tasks,
showing a higher average number of correct responses overall
using TNV. Hypothesis 1, that there would be a main effect of
tool for accuracy of responses, was supported by the data;
participants had significantly fewer errors using TNV than using
Ethereal.
Figure 4. Mean and 95% confidence interval of accurate responses
by tool. (maximum = 10)
Figure 5 plots the mean number of accurate responses by tool
and task type, graphically showing a marked difference between
tools for comparison tasks. Although there was no interaction
effect between tool and task type – F(1,6) = 2.139, p = 0.194 – the
graphical depiction of the data suggests that the differences
between tools was more pronounced for comparison tasks.
Figure 5. Mean and 95% confidence interval of accurate responses
by tool and task type. (max. = 5)
To further clarify the effect of task type on performance
accuracy, a paired sample t-test was conducted for the two
tool/task type pairs:
• TNV Comparison vs. Ethereal Comparison: t = 5.612, p =
0.001
• TNV Identification vs. Ethereal Identification: t = 1.860, p
= 0.105
Hypothesis 1a, that TNV users would be much more accurate
for comparison, was not supported. However, further examination
of the data stemming from the graph showing a marked difference
between the tools for comparison tasks, the t-test for comparison
tasks across tools was significant. Hypothesis 1b, that TNV users
and Ethereal users would perform about equally in identification
tasks, was supported. These results show significantly more
accurate performance using TNV than Ethereal, and suggest that
users performed much more accurately for comparison tasks than
identification tasks across the tools.
In addition to task performance, time to complete successful
tasks was also measured and evaluated. (A successful task is one
in which all questions were answered correctly, partially correct
answers were not considered successful.) Only successfully
completed tasks were analyzed because incorrect responses could
have been quick guesses or based on confusion of the tool.
Additionally, tasks that were timed out (after five minutes) or
tasks that participants gave up on were also not included in the
analysis, as these could have skewed the results.
Because the tasks were of varying levels of difficulty and the
average time for each task varied greatly, a standardized task
completion time was computed for each task. This standardization
permitted the analysis of all tasks regardless of the wide-ranging
levels of difficulty and differing average times. The standardized
time was computed by subtracting the average time for a task
(collapsed across both tools) from the participant’s time on that
task and dividing the result by the standard deviation. Thus, for
61
each successful task for each participant the following equation
was used:
Standardized_Time = (Participant_Time – Mean_Task_Time) /
Task_Standard_Deviation
For example, a participant who successfully completed a task
taking exactly the average time for that task would have a
standardized time of 0. A participant who took two standard
deviations longer than the average would have a standardized time
of +2.0. Negative number indicated that a participant completed a
task faster than the average, while positive numbers represent
slower than average completion times. This approach is similar to
the method used to compute a standardized time in an experiment
evaluating two visualization described in Stasko, et al. [11]. Table
2 lists the computed average standardized task times by tool and
by task type, showing a trend towards faster performance with
TNV.
Table 2. Standardized average total task completion times for
successfully completed tasks. Standard deviations are shown in
parentheses.
TNV Ethereal
Avg. Comparison Time –0.61 (0.63) 0.61 (0.94)
Avg. Identification Time –0.03 (0.84) 0.03 (1.20)
Avg. Time –0.19 (0.88) 0.19 (1.13)
A repeated measure ANOVA (tool and task type as repeated
measure variables) with a between subject variable of tool order
was conducted for standardized task completion time. There was a
main effect of tool approaching significance, F(1,6) = 5.581, p =
0.056 on task performance time. (With a larger sample size, it
seems likely that there would be an effect of tool for task time.)
Hypothesis 2, that there would be a main effect of tool for
successful task completion time, was not supported by the data,
but the mean trend suggests that participants performed faster
using TNV, as shown in Figure 6.
Figure 6. Mean and 95% confidence interval of standardized time to
successful tasks by tool.
Figure 7 plots the mean standardized time of successfully
completed responses by tool and task type, graphically showing a
much more sizeable difference between tools for comparison tasks
than for identification tasks, which were nearly identical.
Although there was no interaction effect between tool and task
type – F(1,6) = 2.558, p = 0.161– the graph suggests that the
differences between tools was more striking for comparison tasks.
Figure 7. Mean and 95% confidence interval of standardized time to
successful tasks by tool and task type.
Because of the graphical depiction of the data and to further
clarify the effect of task type on performance in terms of time to
complete successful tasks, a paired sample t-test was conducted
for the two tool/task type pairs:
• TNV Comparison vs. Ethereal Comparison: t = –4.615, p
= 0.002
• TNV Identification vs. Ethereal Identification: t = –0.085,
p = 0.934
Hypothesis 2a, that TNV users would have much faster task
completion times in comparison tasks, was not supported.
However, graphing the data suggests a marked difference between
the tools for comparison tasks, and the t-test for comparison tasks
across tools was significant. Hypothesis 2b, that TNV users and
Ethereal users would have similar task completion times in
identification tasks, was supported. These results suggest faster
performance using TNV than Ethereal, and suggest that users
performed much faster for comparison tasks than identification
tasks across the tools. The effect of task type on performance
warrants further explanation.
While task accuracy and completion time was pronounced
across tools for comparison tasks, performance – particularly
completion time – was nearly identical for identification tasks.
This was expected because of the sophisticated filtering
functionality of Ethereal, which participants frequently used for
identification tasks to filter out the noise to answer the questions.
Because the data sets were relatively small, the filters removed
nearly all non-relevant results, allowing participants to quickly
answer these questions using simple filters. For example, the
62
average task time for all three questions making up Task 5 was
94.5 seconds (standard deviation: 75.94) across the six correct
responses with Ethereal. This question was relatively easy to
answer when applying a filter, which is how most of the
participants answered the question quickly. The same average
standard task time for TNV was 193.5 (standard deviation:
228.51) for the four correct responses with TNV, as participants
could highlight the relevant traffic easily, but many times the
participants did not see exactly where the highlighted packets
were on the cluttered screen. This indicates that a similar filtering
function that removes irrelevant traffic – as opposed to only
highlighting it – may improve task times for identification tasks in
TNV. This filtering functionality, mirroring the highlighting
functionality present at the time of this evaluation, was added to
TNV as a product of these results (as shown in Figure 1).
Unlike identification tasks, there was a marked contrast in task
performance times on comparison tasks across tools, which may
be explained by the strategy used by participants. The comparison
tasks generally required users to make judgments of proportions
of the data across the entire data set. This type of overview
comparison was one of the driving factors in the design of TNV,
and is generally one of the advantages of information
visualization techniques. Ethereal has several statistical functions
that can aid in aggregating the data to make these comparisons,
but participants’ who used this strategy for Ethereal comparison
tasks held values in their head and mentally did math to perform
the necessary further aggregation. For example, participants who
were asked to determine the largest source port value for
incoming traffic in the data used Ethereal’s Endpoints function to
see which port had the most packets, but would then switch
between the TCP and UDP screens to see if there were any
duplicate ports and mentally add the numbers together.
The other common strategy for comparison tasks using Ethereal
was to sort the entire data set and then scroll through the data. For
example, when asked which local host (the host on the predefined
“home” network) had the most packets, the participants would
sort by source address and estimate the rows associated with each
host, then do the same when sorted by destination and try to put
those two mental estimations together to answer the question.
Comparison tasks were thus less accurate and also took longer
using Ethereal. Using TNV, however, many of the participants
would simply glance at the screen and be able to answer the
question in a few seconds. Some comparison tasks, particularly
those that were port related, required that participants examine a
visualization panel in TNV other than the main display, which
would often result in a large time increase while searching for the
functionality. This caused the two comparison tasks that required
port information to be answered slightly slower than other
comparison tasks within TNV.
Examining tasks – and the strategies that participants used to
answer the tasks – such as these in more detail can help explicate
the results. The percentage of correct answers by task for each
tool is shown in Figure 8. Except for Task 5, users were more or
equally accurate with TNV for every task. The marked differences
in the accuracy of Tasks 2, 3, and 4 require further explanation.
Figure 8. Percentage of correct responses by tool and task.
Tasks 2 and 3 required that the participant compare the port
numbers of all packets to judge which port numbers were most
prevalent. Since both tasks were asking different versions of a
similar question, it would be expected that participants would
perform in the same way for each tool. This was problematic with
TNV, however. The problem appeared to have been an issue of
visibility [20].
Figure 9. Port activity panel in TNV.
The port visualization, shown in Figure 9, is, by default, hidden
from the user’s view in TNV due to lack of screen space. The
highlighting panel and the port visualization panel are collocated
in the same tabbed panel, but the highlighting panel is shown by
default. The user must thus remember to either choose the menu
item to view all port activity for the data set or to highlight a local
host and right click to choose to see the port activity associated
with the host; in both cases the port visualization will
automatically be given focus. The problem was that neither of
these functions (main menu or popup menu) was visible, and the
63
port panel itself was likewise not visible. So participants ended up
looking on what was displayed for something related to ports and
were either unable to find it (four of eight in Task 2) or took a
long time to find it (mean: 81 seconds; standard deviation: 48.85
in Task 2).
The next task required the same type of information, and
participants learned from the previous experience. For Task 3, two
additional participants answered correctly and the mean time was
reduced from 81 to 22 seconds (four of the six correct answers
were under 20 seconds). This learning process was not
demonstrated using Ethereal for the same tasks.
Task 4 is defined as an Identification task, asking the user to
identify the remote host that communicated with every local host.
In TNV, this can be quickly accomplished through a visual
inspection of the links, although the large number of links and
resulting clutter make this difficult to detect. Participants
generally would look for remote hosts that had multiple links, and
select those hosts (thus highlighting the links and corresponding
local hosts) in turn to identify the remote host that had links going
to all of the local hosts. Using Ethereal, participants generally
used the scrolling method, where they would sort by source
address and try to identify the remote hosts that had packets going
to multiple local hosts; once found, they would repeat the process.
An easier method was to use the statistical aggregation functions,
but the two participants who attempted this approach were still
unable to answer. One participant gave up, another timed out after
five minutes. This was the most difficult identification task for
participants using Ethereal. Even the two correct answers took a
long time, both using the scrolling method. This highlights one of
the advantages of the visual search and identification capability of
TNV versus the mental note and comparison strategy used for
Ethereal: recognition is more accurate than recall. To minimize
the user’s memory load, computer interfaces should emphasize
recognition of information objects rather than require users to
remember them [21]. This is a recurring theme in information
visualization, which performance results for this particular task
highlight.
In addition to the contrasts between the accuracy in a few tasks,
there are also striking differences in the time to completion for
some tasks. In particular, participants performed Tasks 7
(compare directionality – incoming or outgoing – of all traffic)
and 8 (compare local hosts to determine which had the most
traffic) much faster when using TNV than when using Ethereal.
The full details of each task and the corresponding task accuracy
and completion time for each task are shown in Table 3, in the
appendix.
Using TNV, all participants were successful in Task 7, a
comparison task that asked participants to judge the most
prevalent direction of traffic. Participants completed this task
using TNV in an average of 5.63 seconds (standard deviation:
5.37). Using Ethereal, five participants were successful in Task 7,
which was completed in an average of 59.2 seconds (standard
deviation: 38.13). For Task 8, TNV average completion time for
all eight successful responses was 5.38 seconds (standard
deviation: 9.77), while average completion time for all eight
successful responses using Ethereal was 68.13 seconds (standard
deviation: 61.44). The strategy employed by participants for these
tasks using TNV was a visual search; no manipulation of the data
or the tool was required. The visual patterns were simply
compared; in the case of Task 7, participants compared the
number of packets pointing in both directions or the size of the
incoming and outgoing histograms for all local hosts, and in Task
8, participants compared either the density of packets or the
histograms for each local host. These tasks were much more
difficult for participants using Ethereal. Participants either used
the statistical aggregation functions or sorted the table of packets
and scrolled though making mental comparisons, both of which
participants had trouble performing. The visual interface made
these comparison tasks, which were complicated using textual
tools, relatively trivial.
4.2 Exploratory Tasks (Hypothesis 3)
In addition to performing the series of timed, well-defined tasks,
participants were also asked to spend five minutes exploring to
describe any insights they had into the data. This exploratory task
was repeated for both tools on separate data sets. The results were
mixed. One source of confusion was that the participants began
explaining their perception of the tools, rather than the data itself.
Once corrected, the participants had a difficult time describing
what they found interesting in the data, particularly with Ethereal.
Many participants would simply scroll up and down looking at the
packets, but had a difficult time articulating their interpretation.
Several of the participants gave up before the allotted time; one
participant gave up on Ethereal after 75 seconds saying “I can’t
get much information from this.”
In spite of this, there were several interesting trends in the
exploratory tasks. Each of the unique insights was recorded and
measured for correctness based on two criteria: first, was the
insight technically correct, and second, was the insight one that
was not derived from the previously conducted well-defined tasks.
On average, participants discovered more insights into the data
and spent a longer time exploring using TNV. This was expected,
as information visualization tools encourage exploration and have
the ability to yield insights into the data that would not be found
otherwise. A two-tailed paired sample t-test demonstrated that this
difference in the number of insights was significant, t = 2.986, p =
0.020. Figure 10 plots the number of insights found by tool.
Hypothesis 3, that participants would discover more insights with
TNV than Ethereal, was supported. In addition to the number of
insights, the kind of insights participants reported revealed a
pattern.
Figure 10. Mean and 95% confidence interval of the number of
insights discovered.
64
Not surprisingly, exploration of the data using Ethereal tended
to be at a lower level, while using TNV tended to be at a higher-
level. For example, one participant noticed that a certain remote
host was trying and failing to login over FTP to multiple hosts
using Ethereal. The same participant using TNV reported that
there was a substantial gap for a certain period of time in the data.
Another participant using Ethereal noticed that there were clear
text passwords (i.e., passwords that were not encrypted, but sent
over the network “in the clear”) in the FTP and Telnet packet
data, and noted the most commonly used port for incoming traffic
using TNV.
These different levels of insights reflect both the strengths and
the visibility of functionality of each of the tools. TNV was
designed to provide a “big picture” view of the data, while
allowing for low-level detailed analysis, but the main screen does
not provide these details. This was a change from earlier iterations
of TNV, in which the details were integrated into the display. This
was changed, however, to increase the amount of available screen
space for the visualization, thus moving the details into a new
window. This allowed for the maximum possible space to be used
by both the visual overview and the textual details, but meant that
the user had to flip back and forth between them when using a
single monitor, as was used for the evaluation. On the other hand,
this separation into two separate windows permits using dual
monitors to enable both the details and the visualization to be
present simultaneously with no overlapping. Additionally, the
user must actively choose to examine the details, which only one
participant did while exploring with TNV. By contrast, Ethereal
excels at row-by-row detailed analysis and the tabular structure of
the main screen reflects this. There are several useful aggregation
functions that help provide an overview of the data, but these
require the user actively choose the menu item. In both cases,
participants generally adhered to the level of analysis that is most
visible by default instead of choosing the more hidden, but more
appropriate, functionality. It is expected that as the users became
more familiar with each of the tools’ functionalities, they would
be able to make better progress in exploratory type tasks.
Lacking domain expertise probably also contributed to the
small number of insights participants reported about the data;
particularly with Ethereal, they were simply unsure of what they
were looking at. Using TNV, at least, they could see visual
patterns and anomalies, even if they were unable to articulate the
exact meanings of those trends.
4.3 User Perceptions (Hypothesis 4)
The tendency for greater success and faster task completion times
for TNV suggests that TNV is easier to learn than Ethereal. Self-
reported user perception data related to ease of learning supports
this, the mean score for TNV was 6.0 compared to 3.1 for
Ethereal on a 7-point Likert scale. (For consistency, the scales
were reversed here on one question that was worded negatively;
all scores have a minimum of 1, strongly disagree, and a
maximum of 7, strongly agree.) Figure 11 plots the mean response
for each of the questionnaire results, clearly showing that
participants favored TNV over Ethereal.
Figure 11. Mean satisfaction ratings on 7-point scale by tool.
A two-tailed paired samples t-test was conducted for each of
the pairs of ratings for TNV and Ethereal. All pairs were
significantly different:
• Overall satisfaction: t = –5.333, p = 0.001
• Ease of information processing: t = –6.565, p < 0.000
• Ease of searching: t = –3.365, p = 0.012
• Ease of learning: t = –5.578, p = 0.001
• Ease of seeing patterns: t = –20.579, p < 0.000
• Perceived level of confidence: t = –3.813, p = 0.007
• Perceived level of performance: t = –5.118, p = 0.001
Hypothesis 4, that TNV will result in more positive user
satisfaction perceptions as compared to Ethereal, was supported
for all given measures of satisfaction. This was expected due to
the generally intuitive nature of the visualization techniques used
in TNV, as compared to the more arcane interface presented by
Ethereal. However, there may have been social pressure to
respond positively to TNV, since the participants knew that the
evaluator was also the designer of the tool.
The user perceptions questionnaire included open-ended
questions asking participants to share their thoughts on what they
liked and did not like about using each of the tools. The
quantitative results described above are supported by these
responses. Related to searching, one participant noted that
“graphics help searching and analyzing task.” The ability to
process information and the ease of learning TNV as compared to
Ethereal was emphasized in responses such as: “TNV was
intuitive and kind of fun to play with” and “TNV is more
intuitive.” One participant responded that: “I think TNV probably
has a shorter learning curve because you can see the results of
actions as soon as you click on something and most of the tools
are not buried in the menu items.” Conversely, one participant
responded that when using Ethereal “it’s extremely difficult for
the novice to answer analyzing question or detect abnormal
behavior.”
The participants’ perception of being able to see patterns was
the most marked contrast between the two tools, and the open-
ended responses reflected this. Responses included “TNV is
definitely more helpful to find patterns” and using TNV it was
“easy to recognize patterns.” One participant wrote that when
using TNV “I felt I was more aware of the environment – the tool
made exploring activity very easy.” That statement summarizes
65
the design goals of TNV: to provide context to enhance awareness
and to encourage and support exploration. It was also hoped that
TNV would support novices in learning about their environment.
Although not the purpose of the evaluation, one participant
commented that “I learned a lot about networks” during the study,
which seems to support this design goal.
5 LESSONS LEARNED
Future researchers who would like to include user testing as part
of their design and evaluation methodology may benefit from the
study design presented here. While new tools will present their
own unique challenges and data sets, there are some common
threads that most network security visualization tools will follow.
This section highlights some of the challenges network security
visualization evaluation designers may face related to users,
training, data and tasks.
One problem is the difficulty in finding domain experts to serve
as users. For this evaluation, one of the design aspects that was
tested was the ability of novice users to grasp patterns and
anomalies in networking data without having vast domain
knowledge. This was an a priori study design decision because
one class of the target user population is junior security analysts
who are unlikely to have much domain experience. Tools that
target more advanced users should include representative users in
the evaluations. One way to encourage domain experts to
participate in user testing is to include them in the design process
early. This approach builds their stake in the design and can get
potential evaluation participants excited by tool and make them
eager to participate in the design.
Training is another challenging issue that evaluation designers
need to be solve. Network security visualizations do not need to
be intuitive, they need to be effective at supporting tasks that
network security analysts must accomplish. A powerful, complex
visual paradigm that can facilitate the discovery of novel patterns
may require some training to understand. This training may be
self-directed, as is often the case in network security [22]. But
time must be set aside for training both the visualization tool and
the baseline tool, if evaluation participants do not already know it.
This raises a problem that is intertwined with the domain expert
issue, domain experts probably already know the baseline tool,
while they are likely to be completely unfamiliar with the
visualization tool. This intimate knowledge of one of the
independent variables can skew the results – another reason that
novices were used in the study presented here – so care must be
taken when selecting participants, and the results of experts may
need to be analyzed separate from novices.
Another challenge is finding appropriate data sets. While there
are many data sets available, many are unlabeled. That is, they do
not provide ground truth and it can be difficult for the evaluation
designer to identify targets of specific tasks, which leads to a
related problem. Defining realistic tasks that can be answered in a
pre-determined, relatively brief time period is challenging on its
own. To compound the problem, tasks and data sets are
intrinsically linked; any given data set may not include the data
appropriate to a task. While synthetic data sets can be used, this
decreases the ecological validity of the evaluation. Determining
tasks and identifying appropriate data sets is an iterative process.
Identifying realistic tasks that can be solved with available data
sets is a problem in evaluations that attempt to quantify traditional
usability metrics, typically accuracy (correctness of a task) and
efficiency (speed to complete correct tasks). Another approach,
which was somewhat addressed in this evaluation, is to use open-
ended tasks. In this evaluation, we found that novices had trouble
in finding interesting patterns or anomalies in the data, but there
were some unexpected insights that were discovered. Insight-
based studies are becoming a popular technique for information
visualization evaluation [23, 24]. This is based on the
acknowledgement that information visualization can be very
effective at exploration. However, the metrics for quantifying
insights are still being developed. The approach taken in the
evaluation described in this paper was to combine traditional user
testing metrics with basic insight-based methods. This may be the
best approach in network security visualizations, since the goals
of the tool likely include both increasing the accuracy and
efficiency of performing of common tasks as well as increasing
users’ abilities to find novel insights in the data.
Finally, especially for open-ended tasks, instructions should be
very clear. One problem we found in the exploration tasks is that
the participants were focused on describing the tool, not the data.
This was not discovered in pilot testing, but several of the
participants used this approach, which was counter to the intent of
the study design.
6 CONCLUSION
This paper has presented a comparative evaluation between an
information visualization tool and a textual tool for network
packet analysis. In support of Hypothesis 1, participants
performed significantly more accurately using TNV than Ethereal
in the well-defined tasks. Concerning Hypothesis 2, time to
successful completion across tools was approaching significance,
but the means suggest faster performance using TNV. Although
there was no interaction effect between tool and task type for
accuracy or task completion time, graphing the data suggests
better performance for comparison tasks. In support of Hypothesis
3, participants discovered a higher number of insights using TNV
than Ethereal in exploratory tasks. In support of Hypothesis 4,
participants clearly preferred TNV to Ethereal, most strikingly in
the perceived ease of seeing patterns and anomalies in the data.
Especially for novice users attempting to learn the domain and
situated knowledge needed to accomplish intrusion detection
analysis, this evaluation has been in a first step in validating the
visual approach used in TNV as compared to the near ubiquitous
network analysis tool used today. This was especially true for
making accurate and timely judgments of proportion for given
attributes. Those who already have expertise in Ethereal may also
benefit from the added context that TNV provides, but this
requires further study.
While there are some limitations in the study – the relatively
small sample size, although this is consistent with much of the
information visualization literature, and the use of novices,
although several reasons for this were enumerated – it is important
that the visualization for cyber security community begin to
incorporate user testing into the design process. This is a first step
towards encouraging that in future work by outlining a potential
methodology for testing.
ACKNOWLEDGEMENTS
The author wishes to thank Liewei Dai and Medha Umarji for
their suggestions and feedback on this research.
REFERENCES
[1] G. Conti, and K. Abdullah, "Passive visual fingerprinting of network
attack tools." pp. 45-54.
66
[2] R. F. Erbacher, K. L. Walker, and D. A. Frincke, “Intrusion and
Misuse Detection in Large-Scale Systems,” IEEE Computer
Graphics and Applications, vol. 22, no. 1, pp. 38-48, 2002.
[3] K. Lakkaraju, R. Bearavolu, A. Slagell et al., "Closing-the-Loop:
Discovery and Search in Security Visualizations." pp. 58-63.
[4] C. P. Lee, and J. A. Copeland, "Flowtag: a collaborative attack-
analysis, reporting, and sharing tool for security researchers." pp.
103-108.
[5] J. McPherson, K.-L. Ma, P. Krystosk et al., "PortVis: a tool for port-
based detection of security events." pp. 73-81.
[6] D. Phan, J. Gerth, M. Lee et al., "Visual Analysis of Network Flow
Data with Timelines and Event Plots," VizSEC 2007: Proceedings of
the Workshop on Visualization for Computer Security, J. R. Goodall,
G. Conti and K. L. Ma, eds., pp. 85-99: Springer, 2008.
[7] W. A. Pike, C. Scherrer, and S. Zabriskie, "Putting Security in
Context: Visual Correlation of Network Activity with Real-World
Information," VizSEC 2007: Proceedings of the Workshop on
Visualization for Computer Security, J. R. Goodall, G. Conti and K.
L. Ma, eds., pp. 203-220: Springer, 2008.
[8] J. R. Goodall, W. G. Lutters, P. Rheingans et al., “Focusing on
Context in Network Traffic Analysis,” IEEE Computer Graphics
and Applications, vol. 26, no. 2, pp. 72-80, 2006.
[9] A. Komlodi, A. Sears, and E. Stanziola, Information Visualization
Evaluation Review, UMBC-ISRC-2004-1, 2004.
[10] M. M. Sebrechts, J. V. Cugini, S. J. Laskowski et al., "Visualization
of search results: a comparative evaluation of text, 2D, and 3D
interfaces." pp. 3-10.
[11] J. Stasko, R. Catrambone, M. Guzdial et al., “An evaluation of
space-filling information visualizations for depicting hierarchical
structures,” International Journal of Human Computer Studies, vol.
53, no. 5, pp. 663-694, 2000.
[12] C. Plaisant, J. Grosjean, and B. B. Bederson, "Space Tree:
Supporting Exploration in Large Node Link tree, Design Evolution
and Empirical Evaluation." pp. 57-64.
[13] K. Risden, M. P. Czerwinski, T. Munzner et al., “An initial
examination of ease of use for 2D and 3D information visualizations
of web content,” International Journal of Human Computer Studies,
vol. 53, no. 5, pp. 695-714, 2000.
[14] R. Ball, G. A. Fink, and C. North, "Home-centric visualization of
network traffic for security administration." pp. 55-64.
[15] J. R. Goodall, “Defending the Network: Visualizing Network Traffic
for Intrusion Detection Analysis,” Ph. D. Dissertation, University of
Maryland-Baltimore County, Department of Information Systems,
2007.
[16] S. Wehrend, and C. Lewis, "A problem-oriented classification of
visualization techniques." pp. 139-143.
[17] C. Plaisant, "The challenge of information visualization evaluation."
pp. 109-116.
[18] G. Mark, A. Kobsa, and V. Gonzalez, "Do four eyes see better than
two? Collaborative versus individual discovery in data visualization
systems." pp. 249-255.
[19] O. Juarez, C. Hendrickson, and J. H. Garrett, Jr., "Evaluating
visualizations based on the performed task." pp. 135-142.
[20] D. A. Norman, The design of everyday things, New York, NY:
Doubleday, 1988.
[21] J. Nielsen, Usability Engineering, San Francisco, CA, USA: Morgan
Kaufmann, 1993.
[22] J. R. Goodall, W. G. Lutters, and A. Komlodi, “Developing
Expertise for Network Intrusion Detection,” Information Technology
& People, vol. 22, no. 2, pp. 92-108, 2009.
[23] P. Saraiya, C. North, and K. Duca, "An Evaluation of Microarray
Visualization Tools for Biological Insight." pp. 1-8.
[24] P. Saraiya, C. North, V. Lam et al., “An Insight-Based Longitudinal
Study of Visual Analytics,” IEEE Transactions on Visualization and
Computer Graphics, vol. 12, no. 6, pp. 1511-1522, 2006.
67
APPENDIX
Table 3. Average task completion times in seconds for successfully completed tasks. Total number of correct responses per task indicated in
parentheses (maximum = 8).
Specific Task TNV Ethereal
1. What protocol (ICMP, TCP, UDP) is most predominant in the data? 16.5 (8) 40.9 (7)
2. Which source port number is the most active for incoming (ingress) traffic in this data set? 81.0 (4) 167.5 (2)
3. Which source port number is the most active for outgoing (egress) traffic in this data set? 20.8 (6) - (0)
4. Several of the remote hosts scanned more than one of the hosts on the local network, but which
remote host scanned every local host?
45.6 (7) 166.5 (2)
5. There is a burst of FTP (ports 20 and 21) traffic in this data set.
5a. About what time did this burst of activity begin? 106.5 (4) 43.5 (6)
5b. During this burst there are a number of large packets (a length of 1500). What local host is
associated with these large packets?
50.7 (7) 54.9 (7)
Dat
a S
et 1
(2
10
Pac
ket
s)
5c. What remote IP address is associated with this traffic? 59.0 (7) 3.0 (7)
7. What direction (incoming or outgoing) is most of the traffic going? 5.6 (8) 59.2 (5)
8. Which local host has the most traffic? 5.4 (8) 68.1 (8)
9. Several remote hosts scanned the entire local network (communicating with all local hosts).
9a. Of those remote hosts that scanned the entire local network, which remote host scanned the
entire network using ICMP packets?
21.9 (8) 16.9 (8)
9b. About what time of day did this ICMP scan start? 12.1 (7) 5.3 (7)
10. Only three local hosts sent outgoing packets to remote hosts.
10a. Which three local hosts sent outgoing packets? 16.0 (8) 55.1 (7)
10b. Of those, which was the only local host that sent out UDP packets? 9.7 (7) 29.7 (7)
10c. Which remote host were these packets sent to? 19.8 (8) 16.4 (7)
11. At the end of this data set a number of UDP packets came into the local network to a group of
local hosts.
11a. Did they all originate from the same remote host? (yes or no) 30.0 (8) 17.8 (5)
Dat
a S
et 2
(7
62
Pac
ket
s)
11b. All of these packets had the same data payload – what was it? 15.1 (7) 24.3 (7)
68