Fusion4D: 4D unencumbered direct manipulation and visualization
Roberto Sonnino, Keila Keiko Matsumura, Joao Luiz Bernardes Junior,
Ricardo Nakamura and Romero Tori
Department of Computer and Systems Engineering
Universidade de Sao Paulo
Sao Paulo, Brazil
[email protected], [email protected], [email protected],
[email protected], [email protected]
Abstract—It is possible to predict that sometime in the future, holographic interactive displays will be available as household commodities, requiring new interaction techniques. Fusion4D is our proposal for unencumbered direct manipulation interfaces involving three dimensions in physical space, as well as a fourth data-dependent dimension (such as time-varying information of an object). A proof-of-concept prototype has been developed and subjected to user testing. The results of the tests indicate that there are some points for improvement, such as the visualization technology, but the system is well accepted by users.
I. INTRODUCTION
An important paradigm change in human-machine interfaces
happened when we migrated from the “command line input -
alphanumeric output” model towards the “direct manipulation
- graphical output” model. When Ben Shneiderman [1] first
introduced the concept of direct manipulation almost three
decades ago, defining it as a “continuous representation of
objects of interest, and rapid, reversible, incremental actions
and feedback”, he was able to synthesize the essence of the
concept, without binding it to specific metaphors, models or
technologies. Because of that, direct manipulation has contin-
ued to evolve, even though its most popular implementations
are based on WIMP (“windows, icons, menus, pointers”) [2],
with signs that the mouse pointer may eventually be replaced
by multi-touch.
But how will this evolution continue? What will the next
paradigm shift be? Screens have already reached a resolution
comparable to that of human retinas. Graphics cards are able to
render 3D models in real time with high fidelity. Nevertheless,
we can only “touch” virtual objects when separated by a “glass
barrier”, through intermediate devices such as touch screens,
mice and data gloves or at a distance through gesture recog-
nition. It is possible to predict that sometime in the future we
will have holographic displays that will allow 3D projections
in space as household commodities, completely eliminating
all barriers and boundaries between our touch and the virtual
objects. At that moment, the concept of direct manipulation
will most likely acquire a literally new dimension.
The investigation of direct manipulation 3D interaction tech-
niques in holographic displays is the main goal of the work
presented in this paper. Our proposed system uses stereoscopic
projection and a depth map sensor (Microsoft Kinect) to sim-
ulate a future holographic display. We use this infrastructure
to implement a 3D hands-free direct manipulation method that
allows users to view and examine 3D virtual objects, without
intermediary devices, as if those objects were floating in the
space between their hands. In addition, the system incorporates
an interface that allows navigation in a fourth dimension (for
instance, analyzing a virtual model of a fetus while changing
its development stage, or switching between different versions
of an engine model while examining it).
To evaluate this method, a prototype was developed, and an
initial set of user tests was executed, the results of which are
presented and discussed here.
II. RELATED WORK
The desire to mimic or even surpass the real world and
interact with virtual objects in 3D comes from the pioneers
of human-computer interfaces who developed systems such
as Heilig’s Sensorama and Sutherland’s Ultimate Display and
HMD in the sixties. Later these interfaces incorporated mul-
timodal [3] and multidimensional [4] interactions. “Put-that-
there” [3] made use of the multimodal combination of pointing
gestures and speech recognition to interact in a 2D interface.
A virtual hand for object manipulation in virtual environments
viewed through an HMD was implemented in 1986 using a
dataglove [4], as were other techniques for navigation and
object selection. Charade [5] made use of symbolic hand
gestures as remote commands to computer applications such
as presentation viewers and discussed guidelines for these
interfaces as well as future applications, many of which have
currently become reality.
More recently, the availability of the low-cost depth sensor
Kinect has started a new wave of research in unencumbered
manipulation and augmented reality interfaces such as [6], [7],
[8], [9]. In general these works explore direct manipulation pri-
marily through mapping hand position to pointers and gesture
sequences that adapt traditional GUIs for gestural command
or through the use of separate modes and well-trained gesture
combinations for interaction.
Vermeer [6] allows interaction within the display volume of
a 360° viewable 3D display, much like the holographic displays
mentioned earlier. This display is based on spinning optics
placed at the bottom of a pair of parabolic mirrors, a setup
that exploits a well-known optical illusion causing the objects
shown in the spinning display to be reimaged “floating” just
above the mirrors. Thus users can touch and interact with the
virtual objects with their hands inside the display volume.
A few simple interaction techniques using the system are
described, based on tracking user fingertip positions in 3D
and either touching specific regions of the virtual objects to
issue commands or letting the fingertips collide with them and
using physics simulation to control their response.
The visualization of medical images is an important appli-
cation for unencumbered interfaces because during surgical
procedures touching the interface should be avoided due to
the risk of contamination. It is also the subject we chose for
our user tests, although we envision Fusion4D being used
more for educational and training purposes than in surgical
rooms. Reference [7] is an example of an application for the surgical room.
It uses an open-source solution for 3D reconstruction and
visualization of medical images and takes advantage of its
already existing mouse-based interface, simply mapping the
tracked hand positions to 2D mouse cursor positions. Two
techniques may be used to generate mouse clicks. In the
first, leaving the hand still in the gesture-detection volume
for 1s generates a left-button press and removing the hand
from the volume generates a release. The second maps the
movements of the user’s other hand into mouse left and right
button press and release events. Siemens is also developing a
similar system [8] on top of one of their existing visualization
solutions and demonstrated it at RSNA 2011. Their system
uses finger pointing to control mouse movement and image
rotation and mentions other gestures such as spreading the
hands apart to zoom out, but we could not find more details
about the interaction techniques used. [9] presents yet another
solution and describes the use of gestures in more detail.
Finger pointing is also used to control cursor position, and
mouse clicking uses a gesture (fist-palm-fist) with the user’s
other hand. Bimanual manipulation controls translation, zoom
and rotation (rotation is differentiated by requiring a closed
fist posture) while pre-defined gestures such as assuming a
folded arms posture or quickly moving the dominant hand as
if cleaning are used to issue commands such as selecting or
erasing a region of interest. The system also allows the control
of animations by moving the non-dominant hand alone in the
gesture-detection volume as if along a horizontal timeline
slider. Gestures are the only modality of interaction used
and no user tests are reported. All three of these solutions consider
primarily 2D displays.
One system that mentions multimodal interaction, combin-
ing voice and mostly pointing gestures, was applied to the task
of Urban Search and Rescue, i.e. coordinating geographically
distributed teams working in disaster relief [10]. Preliminary
Wizard-of-Oz user studies showed the importance of mul-
timodal interaction but at first only single-hand isolated or
continuous gestures were implemented, to interact with Google
Earth controlling 2D translation, two rotations and zoom.
The paper does not describe in more detail the interaction
techniques used.
A work with a manipulation interface similar to ours
was proposed to aid in particular Computer-Aided Design
tasks [11]. The system tracks the positions (but not the
orientation) of both hands without any gloves or markers
so the user can easily transition from working with mouse
and keyboard to using gestures. Direct hand manipulation
is used for 6 degrees of freedom (DOF) translation and
rotation of virtual objects and of the scene camera (the
scene is displayed in a 2D device), for which it is much
better suited than the conventional interface. But mouse and
keyboard are still used for other tasks. Object selection is
done with a pinching gesture and then movements with one
hand are mapped to translations of the virtual object, while
bimanual manipulation becomes either object or camera
rotation depending on whether an object is selected or not.
During the direct hand manipulation this is the only mode
of interaction used and other system functionality would be
offered through the conventional WIMP interface.
Three-dimensional direct hand manipulation is even being
researched in the context of 2D multi-touch devices. [12]
implements and discusses an interesting technique for direct
rigid-body manipulation of 3D objects in such devices using
three fingertips touching the object as points of reference and
constraints and [13] uses grasping or bimanual pinching for
the same task.
Even though those works have deeply impacted the realm of
unencumbered 3D interfaces, they still present some important
limitations. First, the sense of immersion that blends the real
and virtual worlds is still limited by the need for elements
such as icons and cursors, and often by the use of 2D displays.
In addition, the gestures chosen for interaction in those works
often require previous training, are the only mode of interaction,
and carry a significant cognitive load.
III. 3D INTERACTION PRINCIPLES IN FUSION4D
In this work, we adopt the distinction made in the OCGM
metaphor [14] between gestures (discrete in-
teraction with symbolic meaning) and manipulation (contin-
uous and direct interaction with interface elements) even if
both use hands. Many authors, such as [15], include both
forms of interaction in their definition of gestures and, while
we do not argue against such definitions, we believe it is
important to differentiate between them in the context of
interaction tasks and techniques because, while technologically
their recognition might demand similar resources, gestures and
manipulations as defined in OCGM can be used for interaction
in very distinct and often complementary ways. We often use
the expression “direct hand manipulation” herein to describe
this continuous, direct and unencumbered interaction with
the virtual objects (which some authors would perhaps call
manipulation gestures).
With that in mind, the main task in our system is the spatial
manipulation of one object (although it can be separated into
component parts) as a rigid body (although it can behave
otherwise when moving back or forward in time). Using the
methodology described in [16] we divided the manipulation
task into the four canonical subtasks of selection, positioning,
rotation and scaling and then chose appropriate techniques for
each.
Because we would usually have only one object to select and
deselect, we simply used voice commands to do so (we could
have used gesture commands as well, but we had decided to
reserve the gestural channel for manipulation only). Voice
commands are also used to view help, change
the object being displayed, navigate in the time dimension,
explode the object into its components and reassemble it.
We are thus combining voice commands and direct hand
manipulation in a multimodal interface. Bowman et al. [16] discuss some
advantages of multimodal interfaces that apply to this setup,
such as decoupling system control from manipulation into
different channels (thus decreasing cognitive load) and comparing
the inputs from both channels for disambiguation.
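To make this split concrete, the sketch below (illustrative Python; the helper and state names are hypothetical and not part of our implementation) routes discrete voice commands to system-control actions, while continuous hand data drives manipulation only while the object is grabbed:

    # Illustrative sketch only: voice handles discrete system control,
    # hands handle continuous manipulation. All names are hypothetical.
    DISCRETE_COMMANDS = {"explode", "assemble", "show labels", "hide labels",
                         "change model", "time", "help", "close help"}

    def dispatch(command, state):
        """Map a recognized voice command to a mode change or discrete action."""
        if command == "grab":
            state["grabbed"] = True
        elif command == "release":
            state["grabbed"] = False
        elif command in DISCRETE_COMMANDS:
            state["pending_action"] = command   # picked up by the scene/renderer
        return state

    def frame_update(state, right_hand_pos):
        """Continuous channel: hand positions move the object only in grab mode."""
        if state["grabbed"]:
            state["object_pos"] = right_hand_pos   # stand-in for the real mapping
        return state

    # Minimal usage with fake tracking data.
    state = {"grabbed": False, "pending_action": None, "object_pos": (0, 0, 0)}
    state = dispatch("grab", state)
    state = frame_update(state, (0.1, 0.0, 0.3))
    state = dispatch("release", state)
    print(state["object_pos"])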
Our application requires repositioning the object over only
relatively short distances, so we could choose an isomorphic
mapping between hand and object movement, similar to
simple virtual hand techniques, except that no virtual hand,
cursor or avatar has to be drawn since the user can view and
use his own hand as a reference. Bowman et al. [16] discuss
how isomorphic techniques are often more natural and precise,
but limited by device tracking ranges or the limits of the user’s
body. In our case, these limits can easily accommodate the
needs of positioning.
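A minimal sketch of this isomorphic mapping (illustrative Python, not our actual code): the object is displaced by exactly the same vector as the tracked hand since the moment the manipulation started.

    import numpy as np

    def isomorphic_translation(object_pos_at_grab, hand_pos_at_grab, hand_pos_now):
        """1:1 mapping: the object moves exactly as much as the hand has moved."""
        displacement = (np.asarray(hand_pos_now, float)
                        - np.asarray(hand_pos_at_grab, float))
        return np.asarray(object_pos_at_grab, float) + displacement

    # Example: the hand moved 10 cm to the right, so the object does too.
    print(isomorphic_translation([0.0, 0.0, 1.0], [0.2, 0.0, 0.5], [0.3, 0.0, 0.5]))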
Our system does not currently track hand orientation, thus
object rotation is implemented with a bimanual technique. By
simply “grabbing” the virtual object with both hands and using
them to rotate it as they would do with a large physical object,
users are free to perform the rotation as either a symmetric or
asymmetric (for instance, with the non-dominant hand serving
as a pivot while the dominant hand does the finer rotation)
bimanual action. Zooming in and out (or scaling the object)
also uses this bimanual strategy but in this case users tend to
prefer the symmetric movement as was confirmed in our tests.
In this way, we have a position control device with 6 DOF (3
from each hand) well-suited for object manipulation. While the
hands have independent control, using them in this integrated
manner is still a natural and well-coordinated task that we
often perform with larger physical objects.
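One way to derive both transforms from the two tracked hand positions is sketched below (an illustrative simplification in Python, not necessarily our exact formulation): the rotation aligns the inter-hand vector recorded at grab time with the current one, and the scale factor is the ratio between the current and initial distances between the hands.

    import numpy as np

    def bimanual_rotation_and_scale(left0, right0, left1, right1):
        """Rotation matrix and uniform scale from the inter-hand vectors
        at grab time (left0, right0) and now (left1, right1)."""
        u = np.asarray(right0, float) - np.asarray(left0, float)
        v = np.asarray(right1, float) - np.asarray(left1, float)
        scale = np.linalg.norm(v) / np.linalg.norm(u)
        u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
        axis = np.cross(u, v)
        s, c = np.linalg.norm(axis), np.dot(u, v)      # sin and cos of the angle
        if s < 1e-9:                                   # hands kept the same direction
            return np.eye(3), scale
        k = axis / s
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + s * K + (1 - c) * (K @ K)      # Rodrigues' formula
        return R, scale

    # Example: the hands rotate 90 degrees and move apart by 50%.
    R, scale = bimanual_rotation_and_scale([-1, 0, 0], [1, 0, 0],
                                           [0, 0, -1.5], [0, 0, 1.5])
    print(np.round(R, 3), round(scale, 3))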
The time navigation in our applications typically involves
short animations, so the use of a simple timeline slider,
instead of the more sophisticated solution presented in [17],
proves adequate. [18], however, shows a way to use manipula-
tion of objects in a video sequence as a sometimes better way
to advance or go back in an animation. We are currently using
the classic horizontal timeline slider to control navigation in
the time dimension.
While we briefly discussed here the basic 3D interaction
principles that guided us in developing Fusion4D, the interac-
tion techniques themselves and even some further motivation
for our decisions are discussed in greater detail in the next
section.
IV. BRIDGING THE GAP - INTERACTION
TECHNIQUES IN FUSION4D
Given the unencumbered 3D manipulation methods explored
previously, we noted that a gap between virtual object
manipulation and user gestures still exists, and that it is
currently bridged by the use of avatars or cursors. The proposal
presented here dissolves this boundary through the use of a
multimodal interface (body movement + voice) which, we
hypothesize, gives the user a stronger feeling that the virtual
objects they directly manipulate are immersed in the real
environment.

Aiming at eliminating boundaries between real and virtual
worlds as a guiding principle, we have taken as a basic re-
quirement to avoid any visual elements not directly connected
to the virtual object (such as icons and menus). This has
brought in some important challenges related to user feed-
back during manipulations, as well as challenges in enabling
easy instruction/discoverability of the actions available in the
system, which are described in this section.

In this first approach to the problem, we have focused our
efforts on enabling basic tasks that are common requirements
when visualizing isolated 3D objects for many applications.
Because of that, the designed tasks comprise: translation,
rotation and scaling of 3D objects in each spatial axis; separa-
tion and reassembling of the parts that compose a model; vi-
sualization of descriptive textual labels relative to each model;
and the choice of a model between several available options.
To implement these interaction techniques we took advantage
of Microsoft’s Kinect for Windows SDK.

Regarding the stereoscopic visualization of virtual 3D ob-
jects and seeking greater accessibility for users, we chose
the anaglyph technique. We are aware of this technique’s
negative points, such as the low color fidelity of the 3D model,
discomfort after prolonged use and its ineffectiveness for
color-blind users [19], but we assume they will not interfere
with our test results, since we did not consider virtual object
color as a test parameter and we selected no color-blind
subjects. Furthermore, we implemented methods to improve
color perception and to reduce ghosting in the visualization
[20], [21].
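For reference, a minimal red-cyan anaglyph composition is sketched below (illustrative Python/NumPy, assuming the left and right eye renders are available as HxWx3 arrays in [0, 1]); the color-recovery and ghosting-reduction steps we actually apply follow [20], [21] and are more elaborate than this.

    import numpy as np

    def red_cyan_anaglyph(left_rgb, right_rgb):
        """Basic anaglyph: the left view feeds the red channel, the right view
        the green and blue channels."""
        out = np.empty_like(left_rgb)
        # Use left-view luminance for red to keep some of the original tone.
        lum = (0.299 * left_rgb[..., 0] + 0.587 * left_rgb[..., 1]
               + 0.114 * left_rgb[..., 2])
        out[..., 0] = lum
        out[..., 1] = right_rgb[..., 1]
        out[..., 2] = right_rgb[..., 2]
        return np.clip(out, 0.0, 1.0)

    # Example with two tiny synthetic "renders".
    left, right = np.random.rand(4, 4, 3), np.random.rand(4, 4, 3)
    print(red_cyan_anaglyph(left, right).shape)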
A. Translation, Rotation and Scaling
The task of interactively viewing and analyzing a 3D object
usually includes manipulation in at least one of the spatial axes
and often in all three [16]. Aiming at virtual manipulations
that maintain a parallel with real-world interactions, for this
study we enabled every manipulation available in the physical
world for solid objects (translation and rotation along all three
axes) as well as a manipulation that is not available in the real
world but is useful for 3D analysis tasks (proportional scaling).

To do that, we developed a set of algorithms that map 3D
hand positions to virtual “contact points” in the object when
the manipulation starts, making the object “stick” to the user’s
hands during the manipulation. This design choice has enabled
a natural implementation of object scaling: after the object is
grabbed by the user’s hands, if the user increases or decreases
the distance between his/her hands the object has to scale to
keep itself connected to the virtual contact points. Figure 1
shows a user engaged in the spatial manipulation mode in
Fusion4D.

To enable manipulations to be performed in sequence with-
out losing previous manipulation results, we implemented a
way to enter and exit the manipulation mode through voice
commands — the “grab” command grabs the object, storing the
current hand positions as the starting reference, while the “release”
command exits this mode. While the manipulation is being
executed, all changes are computed relative to the starting
point, allowing for a more comfortable experience in which
the user can stop manipulating (and lower their hands) when
analyzing the object.
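A sketch of this clutch-like behavior (illustrative Python with hypothetical names; translation only, since rotation and scale follow the same pattern): on “grab” the current hand position and object transform become the reference, all updates are computed against that reference, and on “release” the accumulated result is kept for the next grab.

    import numpy as np

    class GrabSession:
        """Changes are relative to the state at 'grab' and kept at 'release'."""

        def __init__(self):
            self.object_pos = np.zeros(3)
            self._grab_hand = None
            self._grab_obj = None

        def grab(self, hand_pos):
            self._grab_hand = np.asarray(hand_pos, float)
            self._grab_obj = self.object_pos.copy()

        def update(self, hand_pos):
            if self._grab_hand is not None:
                self.object_pos = self._grab_obj + (np.asarray(hand_pos, float)
                                                    - self._grab_hand)

        def release(self):
            self._grab_hand = None   # keep object_pos as the new starting state

    # Example: two grab/release cycles; the second continues from the first.
    s = GrabSession()
    s.grab([0.0, 0.0, 0.0]); s.update([0.1, 0.0, 0.0]); s.release()
    s.grab([0.5, 0.0, 0.0]); s.update([0.5, 0.2, 0.0]); s.release()
    print(s.object_pos)   # -> [0.1, 0.2, 0.0]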
Fig. 1. User grabs a virtual object using Fusion4D (concept).
Another important issue regarding spatial manipulation is
giving the user feedback on the points of contact between
their hands and the virtual object before the manipulation
starts. To keep the interface free of any cursors and avatars
that are not part of the 3D objects being visualized, we
implemented a projection of the user’s hands in the shape
of a virtual spotlight cast directly over the
virtual objects. With this projection, the user can see the 3D
object without the interference of cursors while still having
full control of the point of interaction.
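One simple way to realize such a spotlight is sketched below (an illustrative Python approximation, not the exact rendering code): each vertex of the model is brightened according to its distance from the ray that goes from the camera through the tracked hand position.

    import numpy as np

    def spotlight_weights(vertices, camera_pos, hand_pos, radius=0.1):
        """Per-vertex highlight in [0, 1]: vertices close to the camera->hand
        ray are lit, fading out with distance (Gaussian falloff)."""
        v = np.asarray(vertices, float)
        o = np.asarray(camera_pos, float)
        d = np.asarray(hand_pos, float) - o
        d = d / np.linalg.norm(d)
        along = (v - o) @ d                      # projection onto the ray
        closest = o + np.outer(along, d)         # closest point on the ray
        dist = np.linalg.norm(v - closest, axis=1)
        return np.exp(-(dist / radius) ** 2)

    # Example: three vertices, the first one right on the ray.
    verts = [[0, 0, 1.0], [0.05, 0, 1.0], [0.5, 0, 1.0]]
    print(np.round(spotlight_weights(verts, [0, 0, 0], [0, 0, 0.5]), 3))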
B. Separation in Parts and Reassembling
Another common task in 3D model analysis is the separation
of models into subparts and the subsequent reassembling of those
parts. To enable this task — which can be executed in parallel
to a spatial manipulation — we adopted the same strategy
used to initiate spatial manipulations, which is the use of
voice commands that do not obstruct the object under study
and that do not interfere with the immersive effect. Therefore,
the user can say “explode” to separate the parts of an object
followed by “assemble” to regroup those parts. These actions
can be executed even when the object is grabbed for spatial
manipulation, as shown in Figure 2 — left.
C. Descriptive Labels
We believe that understanding and learning about complex
objects in 3D often also involves learning the names of these
objects and their component parts and that, depending on
model complexity, being able to refer back to these names
interactively can facilitate the comprehension of what the user
is visualizing. For this reason, we added a method to the
system for showing text labels for each model (shown in
Figure 2 — right), which is activated in the same way as
Fig. 2. Left: A user analyzes an exploded virtual object without leaving the spatial manipulation mode. Right: Showing descriptive labels for an object.
the “model explosion” task: the “show labels” voice command
shows the descriptive labels that can be dismissed by the “hide labels” command.
D. Choosing among Models
One last feature that is essential to any model analysis
system is the ability to choose different models for analysis.
In this proof-of-concept we developed a rudimentary system
for choosing between two models, using a voice command
to show the available models (“change model”), displayed in
Figure 3 — top. From that point, the choice is made using
the same visual feedback of a spatial manipulation: a spotlight
represents the virtual projection of the user’s hand; the user can
select the highlighted model with the “grab” voice command
(or alternatively “select”).
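A sketch of this selection step (illustrative Python, under the assumption that the hand and model positions are available in normalized screen coordinates): the candidate whose on-screen center lies closest to the projected hand position is highlighted and then confirmed by the voice command.

    def pick_model(hand_xy, model_centers_xy):
        """Index of the model whose on-screen center is closest to the projected
        hand position (the highlighted candidate for 'grab'/'select')."""
        best, best_d = 0, float("inf")
        for i, (x, y) in enumerate(model_centers_xy):
            d = (x - hand_xy[0]) ** 2 + (y - hand_xy[1]) ** 2
            if d < best_d:
                best, best_d = i, d
        return best

    # Example: two candidate models side by side; the hand hovers near the left one.
    print(pick_model((0.3, 0.5), [(0.25, 0.5), (0.75, 0.5)]))   # -> 0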
Fig. 3. Top: Choosing between two models. Bottom: Help sidebar is displayed alongside the 3D model.
E. Discoverability and Recall
The principle of reducing visual elements had an important
impact in the discoverability of the system. How would a user
discover the voice commands? How would she/he remember
those commands during usage?
To work around this issue, we added a “help” mode to
Fusion4D that temporarily shows up a list of available voice
commands in a sidebar, shown in Figure 3 — bottom. The 3D scene
remains interactive while the help sidebar is displayed, and
by hypothesis the user should feel like the help bar is in
the same depth as the screen while models float in front of
it. This solution conflicts with our basic requirement of no
visual elements that are not directly related to the object being
examined, but this choice was a good compromise between
usability and immersion, as the user can call or dismiss the
help sidebar anytime through voice commands (“help” / “close help”).
F. Navigation in the Fourth Dimension
Going beyond common 3D visualization tasks, during our
research we noted that the use of voice commands as an
additional input modality could enable other dimensions for
navigation that exceed the limits of 3D space. We therefore
decided to test this hypothesis by adding a 4th dimension
navigation to the proof-of-concept — such as time or other
varying attributes of the model — to verify the impact and
usability of an extra-spatial navigation in the context of an
object in augmented reality.
To develop this implementation of the 4th dimension, we
added another voice command that would start navigation in
a new dimension (in our case, “time”). For this study, the 4th
dimension was mapped to a one-dimensional horizontal line,
representing “time” (perpendicular to gravity), which the user
can manipulate with one hand, as shown in Figure 4.
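A minimal sketch of this mapping (illustrative Python; the hand range used below is an assumed calibration value, not a measurement): the horizontal displacement of the hand from where time navigation started is mapped linearly to an index over the available development stages and clamped at both ends.

    def time_index(hand_x_now, hand_x_start, n_stages, reach=0.6):
        """Map horizontal hand displacement (meters) to a stage index.
        'reach' is an assumed comfortable horizontal range for one hand."""
        t = (hand_x_now - hand_x_start) / reach + 0.5   # 0.5 = starting stage
        t = min(max(t, 0.0), 1.0)                       # clamp to the timeline
        return round(t * (n_stages - 1))

    # Example: a model with 8 development stages; moving the hand 15 cm to the
    # left of the starting position steps the timeline back.
    print(time_index(-0.15, 0.0, 8))   # earlier stage
    print(time_index(0.30, 0.0, 8))    # later stage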
Fig. 4. Navigating in the fourth dimension — time. Top: The fourth dimension is mapped to a one-dimensional timeline. Bottom: The user can change the model’s development stage using horizontal hand movements.
V. USER STUDY
In order to have a fair evaluation and understanding of the
techniques described in this paper we conducted a series of
user studies. Through these studies, we could evaluate the diffi-
culties perceived by the users and get their remarks concerning
the application and problems in this new interaction paradigm
for direct 3D manipulation.
We started by elaborating a test protocol for the experiment.
The following questions were addressed: Is the interaction
discoverable and predictable? How simple is it to perform the
proposed tasks? Which classes of manipulation tasks can be
best performed using our proposed interface paradigm? Which
aspects of the implemented prototype represent barriers to the
use of the proposed paradigm? What is the degree of user
satisfaction? The variables under study were: the time to execute
each task, the number of mistakes in each task, simplicity in
understanding and using the system, user’s opinion about the
system, ease of learning and user satisfaction.
The study exposed 28 participants to 7 simple tasks and
analyzed various behavioral and experimental aspects of inter-
action during task execution. All the participants came from a
student or academic background, all had normal or corrected to
normal vision and no mobility impairments, and they did not
have any prior knowledge of the system. They were evenly
distributed between male and female and their ages varied
between 17 and 48 years old, with a relatively young average age of 22.
While the objects used for this test were of a medical nature,
previous tests run by us with teachers and students in this area
showed us that enthusiasm with the subject often made these
users gloss over deficiencies. In this test, therefore, we chose to
have most of our users come from a background in interaction
design and they proved to be significantly more critical. The
experiment was conducted in a usability lab using a computer
(configuration: Quad 1.6GHz CPU, 2GB RAM, 1GB dedicated
graphics card), a Kinect for Windows sensor, red-blue
anaglyph glasses and a 50” Full HD LCD display (resolution:
1920x1080).
The tasks (shown in Figures 5, 6 and 7) were presented in
the same order to each participant.
In Task 1, the participant was requested to show the ob-
ject identification. In Task 2, the participant should show
the exploded view of the object. In Task 3, the participant
was requested to regroup the exploded object. The outcomes
expected from these tasks are shown in Figure 5.
Fig. 5. Task 1: Show the object identification. Task 2: Show the exploded view of the object. Task 3: Assemble the exploded object.
For Task 4, the participant should pick up the object and
translate it horizontally, releasing it when done. In Task 5, the
participant was requested to rotate the object by 45 degrees in
any axis and release it when done. For Task 6, the participant
should select a different 3D model. These tasks have the
expected outcomes shown in Figure 6.
Fig. 6. Task 4: Translate the object. Task 5: Rotate the object. Task 6: Select a different 3D model.
In Task 7, the participant was requested to enter the time
navigation mode and change the development week of a fetus
model from week 6 to week 5. The different weeks available
for the fetus model are shown in Figure 7.
Fig. 7. Task 7: Change the model’s development week.
During the experiment, participants were not given any
specific instruction for the tasks to perform. The instructions
given in the beginning of the test suite were: the user’s starting
position in the room; a basic explanation that users should
interact using their voices and bodies; and an explanation that
there was a “help” voice command that would show the voice
commands available in the system and that could be used
anytime if they were lost. The participants could try out each
new task as many times as they wanted.
A. Issues with Voice Recognition and Visual Feedback
Although we performed several tests to calibrate the voice
recognition, we questioned the efficacy of this feature in
preliminary tests performed with 11 users, since some users
found it difficult to complete tasks: they had to repeat the
correct command many times to proceed with the task and
sometimes had to voice a command too loudly.
Based on feedback from these preliminary participants and
since our objective is not to evaluate the performance of the
voice recognition, we decided to drop the voice recognition
feature during further testing and simulate it using the Wizard
of Oz technique [22]. In other words, the test participant
thought he or she was communicating with the device using a
speech interface, but the participant commands were actually
being entered into the computer by an operator and processed
as a text stream, rather than as an audio stream. The first
results (using full speech recognition) were not considered for
analysis in this paper.
VI. RESULTS AND DISCUSSION
After performing the 7 tasks in the tests, users answered
a questionnaire to determine their subjective opinion about
certain aspects of the system as well as to collect qualitative
information such as complaints, suggestions, explanations, fa-
vorite and worst aspects of the system etc.
When asked to evaluate specific aspects of the system,
users graded these aspects individually from 1 to 5, with 1
always indicating the least and 5 the greatest understanding,
satisfaction, performance etc. It was observed that users were
very reluctant to assign either the lowest or the highest score
for a given feature, even when in their qualitative answers and
personal interactions they clearly indicated high enthusiasm or
dissatisfaction.
The aspects subjectively evaluated by the users were: overall
impression about the system, how clear the 3D objects and the
texts (present in the labels and help, for instance) were in our
setup, how closely the virtual object followed the user hand
motions, how comfortable the system was to use, how easy to
perform the test tasks were and how well users could perceive
depth in our simple anaglyph stereo setup.
Taking into consideration the fact that users hardly ever
assigned the highest or the lowest scores to any questions
despite their clear enthusiasm or dissatisfaction, we decided
that an average score greater than 3.5 (i.e. closer to “good”
than to “average” in our scale) indicated a satisfactory result
for that aspect, while lower averages indicated problems.
For the overall impression about the system, we could refute
the hypothesis that the average was below 3.5 with a high
certainty (p = 0.00). The average for this score was 4.1 and
this was the aspect in which users most often assigned the
maximum score of 5. Given the results
for the other aspects, we question this average somewhat and
wonder if it indicates a wish of our users to encourage future
work. The clarity of images and text got the second highest
average score with 3.9, which also refutes the hypothesis
of problems with this aspect with p = 0.00. Users were
also very satisfied with how close the virtual object followed
their physical hand movements during direct manipulation.
This aspect had an average score of 3.7 and refuted the
hypothesis of problems with p = 0.01, which we consider
quite acceptable.
Another aspect that we are quite certain of is that our
strategy for visualization of depth using anaglyph stereo views,
low-cost glasses and no calibration based on each user has
problems. We actually consider the score for this aspect was
low enough to indicate there might be a problem with our
stereo implementation. The average score for this aspect was
2.9 and we can refute the hypothesis that this value is greater
than 3.5 with p = 0.00. Furthermore, 61% of all participants
reported they could not sense the object depth most of the time
and 6 of the 28 users reported discomfort caused by the stereo
visualization after the brief tests. We are currently searching
for possible implementation errors in the stereo visualization
and, whether they are found or not, we plan to explore more
sophisticated techniques for depth perception.
The two remaining aspects did get scores greater than or
equal to 3.5, but they were close enough to this value for
us not to refute the hypothesis of problems, due to the high
probability that this average would be lower with a different
sample space.
For the ease of execution of test tasks using the system, the
average score was 3.5, which yields a value of p as large as
0.5. We believe this was partially due to our test procedure and
partially due to our strategy for voice commands, particularly
the decision of not giving the users any training and only
the most basic of information prior to the test. Out of the
14 users who assigned a score lower than 4 to test ease,
4 complained about the lack of training in their qualitative
answers, 3 for some reason commented they did not know
which tasks to perform and 4 complained about having to
use voice commands. While we wished to test the system
discoverability, we counted on users taking advantage of the
help functionality that was explained to them and did not
implement a more flexible speech recognition strategy (even
when using the Wizard of Oz technique we stuck to the
original simple commands). Several users, however, resisted
using the help and instead kept trying to guess the correct
commands. This guessing would have been easier if we had
simply accepted more synonyms as options, including object
names for object selection, which would certainly have in-
creased discoverability and was even suggested by several
users. This is an improvement we will prioritize in future
work, even more than improving our stereo visualization (since
the current stereo implementation is just a stand-in for more
sophisticated solutions). Using voice commands for grabbing
and releasing the virtual object was also pointed out as more
cumbersome than the other commands. These commands were
not only repeated more often during the tests but it would
also probably have been more natural and closer to how
we interact with physical objects to grab the virtual one by
closing the hands instead of issuing a voice command. We are
currently implementing this improvement for the next version
of Fusion4D.
Finally, regarding comfort of use, the average score was
3.6 but we cannot refute the hypothesis of problems, either,
because p in this case was 0.25. Once again, 14 users
assigned scores below 4 to this aspect. The discomfort due
to the stereo visualization certainly contributed to the lower
scores in this aspect, since 6 users complained specifically
about it. We also suspect that, because users were not in-
structed to keep their arms and elbows closer to their body in a
more comfortable position during the manipulation, which was
certainly possible, several ended up using broader movements
than necessary which could be a cause of fatigue or discomfort.
In this prototype we took advantage of Kinect’s skeleton
tracking to obtain hand positions, so users had to be standing to
interact with the system. While no users explicitly mentioned
or commented on this need to be standing up in any of their
qualitative feedback, we believe this may have been another
cause for some of the lower scores in the comfort aspect and
plan to investigate this further in the future.
Our statistics for task completion time were sadly uninfor-
mative, with very large variance among the users even for
rather simple tasks such as those that simply required voicing
a one-word command. The number of errors recorded per user
and per task, however, was low (less than 1) and showed
much less variance among users. Remembering that they were
introduced to the tasks and the system with minimal instructions
and no training, this leads us to believe that this variation
chiefly represents a difference in the time each user took to
learn how to perform each task.
This conclusion is supported by observations of the tests
themselves. As an example, the very first task only required
the users to speak the command “label” once. Average task
completion time, however, was 15.9s with a standard devi-
ation of 15.5s. During the tests, however, we could clearly
distinguish three groups of users: most of them found out how
to execute this task (and the others) quite quickly and did so.
Some took moderately longer trying out alternatives or calling
up and reading the help system. Only a few took very long
times, usually due to insisting on errors or resisting calling up
the help. Applying k-means clustering with three classes to
the times recorded for this first task pointed out these groups
more precisely: 55% of users were in a cluster with average
task time of 5.9s and standard deviation of 2.9s, 30% averaged
18.2s with a deviation of 3.8s and 15% took 47.7s on average,
with a standard deviation of 11.7s.
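For reference, the clustering step can be reproduced with a small one-dimensional k-means such as the sketch below (illustrative Python/NumPy, shown here with synthetic times rather than the recorded data):

    import numpy as np

    def kmeans_1d(values, k=3, iters=100, seed=0):
        """Plain Lloyd's algorithm on one-dimensional data."""
        rng = np.random.default_rng(seed)
        x = np.asarray(values, float)
        centers = rng.choice(x, size=k, replace=False)
        for _ in range(iters):
            labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
            new = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return labels, centers

    # Synthetic completion times (seconds) with fast, medium and slow groups.
    times = [4, 5, 6, 7, 5, 6, 17, 19, 18, 20, 45, 50, 48]
    labels, centers = kmeans_1d(times)
    for j, c in enumerate(sorted(centers)):
        group = [t for t, l in zip(times, labels) if centers[l] == c]
        print(f"cluster {j}: mean {np.mean(group):.1f}s, n={len(group)}")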
Another piece of evidence pointing in this direction is that, while
in general the error rates were low, for the tasks that were the
first to introduce a new interaction element, such as the first
time speech or hand movements or the timeline slider were
used, average error rates were significantly higher than for
the tasks that immediately followed them and used the same
elements (around 7 times higher, with p = 0.00 in paired
T-tests). This even happened when the following task was
theoretically more complex, such as when rotation followed
translation. In fact, if it were not for these “first time discov-
ery” errors, our error rates for all tasks would be very close to
zero. Incidentally, this may present another argument in favor
of investigating the use of object manipulation to navigate back
and forward in time [18], as discussed earlier, so we can reduce
the number of additional interface elements and use previously
acquired skills for more tasks.
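The comparison itself is a standard paired test; a sketch with synthetic per-user error counts (illustrative values, not our data) using SciPy's paired t-test is shown below.

    import numpy as np
    from scipy import stats

    # Synthetic per-user error counts: the first task that introduces an
    # interaction element vs. the following task that reuses it.
    first_exposure = np.array([2, 1, 3, 0, 2, 1, 2, 3, 1, 2])
    follow_up      = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
    t, p = stats.ttest_rel(first_exposure, follow_up)
    print(f"paired t = {t:.2f}, p = {p:.4f}")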
Given this impact of the discovery and learning times on
task execution times, error rates and even, apparently, users'
subjective opinions, in the future we should be careful to test
discoverability in an isolated experiment and, when testing to
evaluate performance or satisfaction, provide adequate training
to the users, either by showing them a short instruction video
and letting them explore the system for a fixed period of
time before giving them the first task, or by combining the
instructions with training tasks (not to be evaluated as part of
the test) in a game-like tutorial.
Regarding qualitative feedback from the users, 50% de-
scribed considerable enthusiasm for the 3D object manipula-
tion using their own hands and chose it as their favorite system
feature. Only 2 of the 28 users disliked it, but we could not
determine why. The ability to navigate in time and to visualize
the object’s component parts was the favorite of 29% of all
users and only 1 of them disliked it. As for the use of voice
commands, only 14% of the users enjoyed this feature and
approximately an equal number had complaints about it but
we believe this is due, at least in part, to our implementation
of the grab and release commands using speech instead of
the gestures that would be more natural for this task (as
was already discussed) and that are being implemented in the
system’s next version.
VII. CONCLUSIONS
Direct manipulation interfaces may go through significant
changes with holographic displays. In this work we presented
a prototype of 3D interaction techniques implemented on a
simulated holographic display and the user tests performed
with it. The results show that the technology for our simulated
holographic display requires improvement, and the interface’s
discoverability requires further evaluation. Furthermore, while
we are currently using the classic horizontal timeline slider to
control navigation in the time dimension, in future works we
would like to investigate the technique based on the manip-
ulation of the objects themselves [18] as a potentially more
natural way to navigate in time in the context of our applica-
tion. Another future work we are interested in is comparing
user performance and satisfaction between a system like the one
we presented in this paper and a system using a video-avatar, i.e.
a system where the user’s own image is captured, segmented
[23] and integrated in a virtual environment as an avatar, to
manipulate virtual objects. Despite the improvements that are
still necessary, our test results were positive in terms of user
acceptance of the proposed techniques. This evidence moti-
vates us to continue along this line of research to investigate
and propose direct manipulation and visualization techniques
suitable for holographic displays.
VIII. ACKNOWLEDGMENTS
We would like to thank Centro Universitario Senac, Eduardo
Sonnino and Ian Muntoreanu.
REFERENCES
[1] B. Shneiderman, “Direct manipulation: A step beyond programming languages,” Computer, vol. 16, no. 8, pp. 57–69, Aug. 1983. [Online]. Available: http://dx.doi.org/10.1109/MC.1983.1654471
[2] J. Canny, “The future of human-computer interaction,” Queue, vol. 4, no. 6, pp. 24–32, Jul. 2006. [Online]. Available: http://doi.acm.org/10.1145/1147518.1147530
[3] R. A. Bolt, ““Put-that-there”: Voice and gesture at the graphics interface,” SIGGRAPH Comput. Graph., vol. 14, no. 3, pp. 262–270, July 1980. [Online]. Available: http://doi.acm.org/10.1145/965105.807503
[4] W. Robinett and R. Holloway, “Implementation of flying, scaling and grabbing in virtual worlds,” in Proceedings of the 1992 Symposium on Interactive 3D Graphics, ser. I3D ’92. New York, NY, USA: ACM, 1992, pp. 189–192. [Online]. Available: http://doi.acm.org/10.1145/147156.147201
[5] T. Baudel and M. Beaudouin-Lafon, “Charade: remote control of objects using free-hand gestures,” Commun. ACM, vol. 36, no. 7, pp. 28–35, Jul. 1993. [Online]. Available: http://doi.acm.org/10.1145/159544.159562
[6] A. Butler, O. Hilliges, S. Izadi, S. Hodges, D. Molyneaux, D. Kim, and D. Kong, “Vermeer: direct interaction with a 360° viewable 3D display,” in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’11. New York, NY, USA: ACM, 2011, pp. 569–576. [Online]. Available: http://doi.acm.org/10.1145/2047196.2047271
[7] G. Ruppert, P. Amorim, T. Moraes, and J. Silva, “Touchless gesture user interface for 3D visualization using the Kinect platform and open-source frameworks,” in Innovative Developments in Virtual and Physical Prototyping. Boca Raton: CRC Press, 2012, pp. 215–219. [Online]. Available: http://dx.doi.org/10.1201/b11341-35
[8] Siemens, “Game console technology in the operating room,” http://www.siemens.com/innovation/en/news/2012/e inno 1206 1.htm, 2012.
[9] L. Gallo, A. Placitelli, and M. Ciampi, “Controller-free exploration of medical image data: Experiencing the Kinect,” in Computer-Based Medical Systems (CBMS), 2011 24th International Symposium on, June 2011, pp. 1–6.
[10] Y. Yin and R. Davis, “Toward natural interaction in the real world: real-time gesture recognition,” in International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ser. ICMI-MLMI ’10. New York, NY, USA: ACM, 2010, pp. 15:1–15:8. [Online]. Available: http://doi.acm.org/10.1145/1891903.1891924
[11] R. Wang, S. Paris, and J. Popovic, “6D hands: markerless hand-tracking for computer aided design,” in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’11. New York, NY, USA: ACM, 2011, pp. 549–558. [Online]. Available: http://doi.acm.org/10.1145/2047196.2047269
[12] J. L. Reisman, P. L. Davidson, and J. Y. Han, “A screen-space formulation for 2D and 3D direct manipulation,” in Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’09. New York, NY, USA: ACM, 2009, pp. 69–78. [Online]. Available: http://doi.acm.org/10.1145/1622176.1622190
[13] O. Hilliges, S. Izadi, A. D. Wilson, S. Hodges, A. Garcia-Mendoza, and A. Butz, “Interactions in the air: adding further depth to interactive tabletops,” in Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’09. New York, NY, USA: ACM, 2009, pp. 139–148. [Online]. Available: http://doi.acm.org/10.1145/1622176.1622203
[14] R. George and J. Blake, “Objects, containers, gestures, and manipulations: Universal foundational metaphors of natural user interfaces,” in Proceedings of CHI ’10, 2010.
[15] V. Pavlovic, R. Sharma, and T. Huang, “Visual interpretation of hand gestures for human-computer interaction: a review,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, no. 7, pp. 677–695, July 1997.
[16] D. A. Bowman, E. Kruijff, J. J. LaViola, and I. Poupyrev, 3D User Interfaces: Theory and Practice. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc., 2004.
[17] S. Pongnumkul, J. Wang, G. Ramos, and M. Cohen, “Content-aware dynamic timeline for video browsing,” in Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’10. New York, NY, USA: ACM, 2010, pp. 139–142. [Online]. Available: http://doi.acm.org/10.1145/1866029.1866053
[18] D. B. Goldman, C. Gonterman, B. Curless, D. Salesin, and S. M. Seitz, “Video object annotation, navigation, and composition,” in Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’08. New York, NY, USA: ACM, 2008, pp. 3–12. [Online]. Available: http://doi.acm.org/10.1145/1449715.1449719
[19] L. Sharpe, A. Stockman, H. Jägle, and J. Nathans, “Opsin genes, cone photopigments, color vision, and color blindness,” in Color Vision: From Genes to Perception, 1999.
[20] H. Sanftmann and D. Weiskopf, “Anaglyph stereo without ghosting,” in Proceedings of the 2011 Eurographics Symposium on Rendering, 2011.
[21] I. Ideses and L. Yaroslavsky, “Three methods that improve the visual quality of colour anaglyphs,” Journal of Optics A: Pure and Applied Optics, vol. 7, 2005.
[22] J. Kelley, “An iterative design methodology for user-friendly natural language office information applications,” ACM Transactions on Office Information Systems, vol. 2, no. 1, pp. 26–41, 1984.
[23] S. R. R. Sanches, R. Nakamura, V. da Silva, and R. Tori, “Bilayer segmentation of live video in uncontrolled environments for background substitution: An overview and main challenges,” Latin America Transactions, IEEE (Revista IEEE America Latina), vol. 10, no. 5, pp. 2138–2149, 2012.