
Fusion4D: 4D unencumbered direct manipulation and visualization

Roberto Sonnino, Keila Keiko Matsumura, João Luiz Bernardes Junior, Ricardo Nakamura and Romero Tori
Department of Computer and Systems Engineering
Universidade de São Paulo
São Paulo, Brazil
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—It is possible to predict that sometime in the future, holographic interactive displays will be available as household commodities, requiring new interaction techniques. Fusion4D is our proposal for unencumbered direct manipulation interfaces involving three dimensions in physical space, as well as a fourth data-dependent dimension (such as time-varying information of an object). A proof-of-concept prototype has been developed and subjected to user testing. The results of the tests indicate that there are some points for improvement, such as the visualization technology, but the system is well accepted by users.

I. INTRODUCTION

An important paradigm change in human-machine interfaces happened when we migrated from the “command line input - alphanumeric output” model towards the “direct manipulation - graphical output” model. When Ben Shneiderman [1] first introduced the concept of direct manipulation almost three decades ago, defining it as a “continuous representation of objects of interest, and rapid, reversible, incremental actions and feedback”, he was able to synthesize the essence of the concept without binding it to specific metaphors, models or technologies. Because of that, direct manipulation has continued to evolve, even though its most popular implementations are based on WIMP (“windows, icons, menus, pointers”) [2], with signs of an eventual replacement of mouse pointers by multi-touch.

But how will this evolution continue? What will the next paradigm shift be? Screens have already reached a resolution comparable to that of human retinas. Graphics cards are able to render 3D models in real time with high fidelity. Nevertheless, we can only “touch” virtual objects when separated by a “glass barrier”, through intermediate devices such as touch screens, mice and data gloves, or at a distance through gesture recognition. It is possible to predict that sometime in the future we will have holographic displays that allow 3D projections in space as household commodities, completely eliminating all barriers and boundaries between our touch and the virtual objects. At that moment, the concept of direct manipulation will most likely literally acquire a new dimension.

The investigation of direct manipulation 3D interaction techniques on holographic displays is the main goal of the work presented in this paper. Our proposed system uses stereoscopic projection and a depth map sensor (Microsoft Kinect) to simulate a future holographic display. We use this infrastructure to implement a 3D hands-free direct manipulation method that allows users to view and examine 3D virtual objects, without intermediary devices, as if those objects were floating in the space between their hands. In addition, the system incorporates an interface that allows navigation in a fourth dimension (for instance, analyzing a virtual model of a fetus while changing its development stage, or changing between different versions of an engine model while examining it). To evaluate this method, a prototype was developed and an initial set of user tests was executed, the results of which are presented and discussed here.

II. RELATED WORK

The desire to mimic or even surpass the real world and interact with virtual objects in 3D comes from the pioneers of human-computer interfaces, who developed systems such as Heilig’s Sensorama and Sutherland’s Ultimate Display and HMD in the sixties. Later these interfaces incorporated multimodal [3] and multidimensional [4] interactions. “Put-that-there” [3] made use of the multimodal combination of pointing gestures and speech recognition to interact in a 2D interface. A virtual hand for object manipulation in virtual environments viewed through an HMD was implemented in 1986 using a dataglove [4], as were other techniques for navigation and object selection. Charade [5] made use of symbolic hand gestures as remote commands to computer applications such as presentation viewers and discussed guidelines for these interfaces as well as future applications, many of which have currently become reality.

More recently, the availability of the low-cost depth sensor Kinect has started a new wave of research in unencumbered manipulation and augmented reality interfaces such as [6], [7], [8], [9]. In general these works explore direct manipulation primarily through mapping hand position to pointers and gesture sequences that adapt traditional GUIs for gestural command, or through the use of separate modes and well-trained gesture combinations for interaction.

Vermeer [6] allows interaction within the display volume of a 360° viewable 3D display, much like the holographic displays mentioned earlier. This display is based on spinning optics placed at the bottom of a pair of parabolic mirrors, a setup that exploits a well-known optical illusion causing the objects shown in the spinning display to be reimaged “floating” just above the mirrors. Thus users can touch and interact with the virtual objects with their hands inside the display volume. A few simple interaction techniques using the system are described, based on tracking user fingertip positions in 3D and either touching specific regions of the virtual objects to issue commands or letting the fingertips collide with them and using physics simulation to control their response.

The visualization of medical images is an important application for unencumbered interfaces because, during surgical procedures, touching the interface should be avoided due to the risk of contamination. It is also the subject we chose for our user tests, although we envision Fusion4D being used more for educational and training purposes than in surgical rooms. [7] is an example of an application for the surgical room. It uses an open-source solution for 3D reconstruction and visualization of medical images and takes advantage of its already existing mouse-based interface, simply mapping the tracked hand positions to 2D mouse cursor positions. Two techniques may be used to generate mouse clicks. In the first, leaving the hand still in the gesture-detection volume for 1s generates a left-button press and removing the hand from the volume generates a release. The second maps the movements of the user’s other hand into mouse left and right button press and release events. Siemens is also developing a similar system [8] on top of one of their existing visualization solutions and demonstrated it at RSNA 2011. Their system uses finger pointing to control mouse movement and image rotation and mentions other gestures, such as spreading the hands apart to zoom out, but we could not find more details about the interaction techniques used. [9] presents yet another solution and describes the use of gestures in more detail. Finger pointing is also used to control cursor position, and mouse clicking uses a gesture (fist-palm-fist) with the user’s other hand. Bimanual manipulation controls translation, zoom and rotation (rotation is differentiated by requiring a closed fist posture), while pre-defined gestures such as assuming a folded-arms posture or quickly moving the dominant hand as if cleaning are used to issue commands such as selecting or erasing a region of interest. The system also allows the control of animations by moving the non-dominant hand alone in the gesture-detection volume as if along a horizontal timeline slider. Gestures are the only modality of interaction used and no user tests are reported. All three of these solutions consider primarily 2D displays.

One system that mentions multimodal interaction, combining voice and mostly pointing gestures, was applied to the task of Urban Search and Rescue, i.e. coordinating geographically distributed teams working in disaster relief [10]. Preliminary Wizard-of-Oz user studies showed the importance of multimodal interaction, but at first only single-hand isolated or continuous gestures were implemented to interact with Google Earth, controlling 2D translation, two rotations and zoom. The paper does not describe the interaction techniques used in more detail.

A work with a manipulation interface similar to ours was proposed to aid in particular Computer-Aided Design tasks [11]. The system tracks the positions (but not the orientation) of both hands without any gloves or markers, so the user can easily transition from working with mouse and keyboard to using gestures. Direct hand manipulation is used for 6 degrees of freedom (DOF) translation and rotation of virtual objects and of the scene camera (the scene is displayed in a 2D device), for which it is much better suited than the conventional interface. But mouse and keyboard are still used for other tasks. Object selection is done with a pinching gesture, and then movements with one hand are mapped to translations of the virtual object, while bimanual manipulation becomes either object or camera rotation depending on whether an object is selected or not. During direct hand manipulation this is the only mode of interaction used, and other system functionality would be offered through the conventional WIMP interface.

Three-dimensional direct hand manipulation is even being researched in the context of 2D multi-touch devices. [12] implements and discusses an interesting technique for direct rigid-body manipulation of 3D objects on such devices, using three fingertips touching the object as points of reference and constraints, and [13] uses grasping or bimanual pinching for the same task.

Even though those works have deeply impacted the realm of unencumbered 3D interfaces, they still present some important limitations. First of all, the effect of immersion that blends the real and virtual worlds is still limited by the need for elements such as icons and cursors, and often by the use of 2D displays. In addition, the gestures chosen for interaction in those works often require previous training, are the only mode of interaction, and their use carries a significant cognitive load.

III. 3D INTERACTION PRINCIPLES IN FUSION4D

In this work, we adopt the distinction presented in the OCGM metaphor [14] between gestures (discrete interaction with symbolic meaning) and manipulation (continuous and direct interaction with interface elements), even if both use the hands. Many authors, such as [15], include both forms of interaction in their definition of gestures and, while we do not argue against such definitions, we believe it is important to differentiate between them in the context of interaction tasks and techniques because, while technologically their recognition might demand similar resources, gestures and manipulations as defined in OCGM can be used for interaction in very distinct and often complementary ways. We often use the expression “direct hand manipulation” herein to describe this continuous, direct and unencumbered interaction with the virtual objects (which some authors would perhaps call manipulation gestures).

With that in mind, the main task in our system is the spatial manipulation of one object (although it can be separated into component parts) as a rigid body (although it can behave otherwise when moving back or forward in time). Using the methodology described in [16], we divided the manipulation task into the four canonical subtasks of selection, positioning, rotation and scaling, and then chose appropriate techniques for each.

Because we would usually have only one object to select and deselect, we simply used voice commands to do so (we could have used gesture commands as well, but we had decided to investigate using that channel for manipulation only). Voice commands are also used to view help, change the object being displayed, navigate in the time dimension, explode the object into its components and reassemble it. We are thus combining voice commands and direct hand manipulation in a multimodal interface. [16] discusses some advantages of multimodal interfaces that apply to this setup, such as decoupling system control from manipulation into different channels, thus decreasing cognitive load, and comparing the inputs from both channels for disambiguation.

Our application requires repositioning the object over only relatively short distances, so we could choose an isomorphic mapping between hand and object movement, similar to simple virtual hand techniques, except that no virtual hand, cursor or avatar has to be drawn, since the user can view and use his or her own hand as a reference. Bowman et al. [16] discuss how isomorphic techniques are often more natural and precise, but limited by device tracking ranges or the limits of the user’s body. In our case, these limits can easily accommodate the needs of positioning.

Our system does not currently track hand orientation, so object rotation is implemented with a bimanual technique. By simply “grabbing” the virtual object with both hands and using them to rotate it as they would do with a large physical object, users are free to perform the rotation as either a symmetric or an asymmetric bimanual action (for instance, with the non-dominant hand serving as a pivot while the dominant hand does the finer rotation). Zooming in and out (or scaling the object) also uses this bimanual strategy, but in this case users tend to prefer the symmetric movement, as was confirmed in our tests. In this way, we have a position control device with 6 DOF (3 from each hand) well suited for object manipulation. While the hands have independent control, using them in this integrated manner to manipulate larger objects is still a natural and well coordinated task we often perform with larger physical objects.
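As an illustration of how such a bimanual control can drive the object transform, the sketch below is our own simplification (not the actual Fusion4D code): the displacement of the midpoint between the hands gives a translation, the change of the hand-to-hand vector gives a rotation, and the ratio of hand distances gives a uniform scale factor.

```python
import numpy as np

def bimanual_transform(l0, r0, l1, r1):
    """Derive translation, rotation and scale from two tracked hands.

    l0, r0: left/right hand positions (3-vectors) when the object was grabbed.
    l1, r1: current left/right hand positions.
    Returns (translation, rotation_matrix, scale) relative to the grab pose.
    """
    l0, r0, l1, r1 = (np.asarray(p, dtype=float) for p in (l0, r0, l1, r1))

    # Translation: displacement of the midpoint between the hands.
    translation = (l1 + r1) / 2 - (l0 + r0) / 2

    # Scale: ratio between current and initial hand-to-hand distances.
    v0, v1 = r0 - l0, r1 - l1
    scale = np.linalg.norm(v1) / max(np.linalg.norm(v0), 1e-6)

    # Rotation: shortest rotation aligning the initial hand-to-hand
    # direction with the current one (Rodrigues' rotation formula).
    a = v0 / max(np.linalg.norm(v0), 1e-6)
    b = v1 / max(np.linalg.norm(v1), 1e-6)
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), float(np.clip(np.dot(a, b), -1.0, 1.0))
    if s < 1e-6:
        # Degenerate case: (anti)parallel directions, keep previous rotation.
        rotation = np.eye(3)
    else:
        k = axis / s
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        angle = np.arctan2(s, c)
        rotation = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

    return translation, rotation, scale
```

Note that two tracked points cannot disambiguate a rotation around the hand-to-hand axis itself; recovering that last rotational degree of freedom would require tracking hand orientation, which the system does not currently do.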

The time navigation in our applications typically involves short animations, so the use of a simple timeline slider, instead of the more sophisticated solution presented in [17], proves adequate. [18], however, shows a way to use manipulation of objects in a video sequence as a sometimes better way to advance or go back in an animation. We are currently using the classic horizontal timeline slider to control navigation in the time dimension.

While we have briefly discussed here the basic 3D interaction principles that guided us in developing Fusion4D, the interaction techniques themselves, and even some further motivation for our decisions, are discussed in greater detail in the next section.

IV. BRIDGING THE GAP - INTERACTION TECHNIQUES IN FUSION4D

Given the unencumbered 3D manipulation methods explored previously, we noted that a gap between virtual object manipulation and user gestures still exists, and that it is currently bridged by the use of avatars or cursors. The proposal presented here dissolves this boundary through the use of a multimodal interface (body movement + voice) which, by hypothesis, gives the user a greater feeling that the virtual objects are immersed in the real environment and can be manipulated directly.

With the elimination of boundaries between real and virtual worlds as a guiding principle, we have taken as a basic requirement avoiding any visual elements not directly connected to the virtual object (such as icons and menus). This has brought some important challenges related to user feedback during manipulations, as well as challenges in enabling easy instruction and discoverability of the actions available in the system, which are described in this section.

In this first approach to the problem, we have focused our efforts on enabling basic tasks that are common requirements when visualizing isolated 3D objects in many applications. Because of that, the designed tasks comprise: translation, rotation and scaling of 3D objects along each spatial axis; separation and reassembling of the parts that compose a model; visualization of descriptive textual labels for each model; and the choice of a model among several available options. To implement these interaction techniques we took advantage of Microsoft’s Kinect for Windows SDK.

Regarding the stereoscopic visualization of virtual 3D objects, and seeking greater accessibility for users, we chose the anaglyph technique. We are aware of this technique’s negative points, such as low color fidelity of the 3D model, discomfort after prolonged use and its ineffectiveness for color-blind users [19], but we assume they will not interfere with our test results, since we have not considered virtual object color as a parameter in them and we have selected no color-blind subjects. Furthermore, we implemented methods to improve color perception and to reduce ghosting in the visualization [20], [21].
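For reference, the basic anaglyph composition is only a channel reassignment of the two stereo renderings. The minimal sketch below builds a red-cyan image as an illustration of our own; the actual renderer additionally applied the ghosting-reduction and color-improvement methods of [20], [21], which mix and reweight the channels rather than copying them directly.

```python
import numpy as np

def red_cyan_anaglyph(left_rgb, right_rgb):
    """Compose a red-cyan anaglyph from left/right eye renderings.

    left_rgb, right_rgb: H x W x 3 uint8 images of the same size.
    Returns an H x W x 3 uint8 anaglyph image.
    """
    out = np.empty_like(left_rgb)
    out[..., 0] = left_rgb[..., 0]     # red channel comes from the left eye
    out[..., 1:] = right_rgb[..., 1:]  # green and blue come from the right eye
    return out
```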

A. Translation, Rotation and Scaling

The task of interactively viewing and analyzing a 3D object usually includes manipulation along at least one of the spatial axes and often along all three [16]. Aiming at virtual manipulations that parallel real-world interactions, for this study we enabled every manipulation available in the physical world for solid objects (translation along and rotation about all three axes) as well as a manipulation that is not available in the real world but is useful for 3D analysis tasks (proportional scaling).

To do that, we developed a set of algorithms that map 3D hand positions to virtual “contact points” on the object when the manipulation starts, making the object “stick” to the user’s hands during the manipulation. This design choice has enabled a natural implementation of object scaling: after the object is grabbed by the user’s hands, if the user increases or decreases the distance between his/her hands, the object has to scale to keep itself connected to the virtual contact points. Figure 1 shows a user engaged in the spatial manipulation mode in Fusion4D.

To enable manipulations to be performed in sequence without losing previous manipulation results, we implemented a way to enter and exit the manipulation mode through voice commands — the “grab” command grabs the object, storing the current hand position as the starting position, while the “release” command exits this mode. While the manipulation is being executed, all changes are computed relative to the starting point, allowing for a more comfortable experience in which the user can stop manipulating (and lower their hands) when analyzing the object.

Fig. 1. User grabs a virtual object using Fusion4D (concept).
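A minimal sketch of how this mode can be organized is shown below; it is a hypothetical structure of our own, not the actual Fusion4D code. The “grab” command snapshots the hand positions, each frame while grabbed the change relative to that snapshot is recomputed, and “release” commits the result so the next manipulation continues from where the previous one ended (rotation is omitted here for brevity; it composes in the same way).

```python
import numpy as np

class ManipulationSession:
    """Accumulates grab/release manipulations relative to each grab pose."""

    def __init__(self):
        self.base_translation = np.zeros(3)  # committed by previous "release"
        self.base_scale = 1.0
        self.grab_pose = None                # (left, right) hands at "grab"

    def on_voice_command(self, command, left_hand, right_hand):
        if command == "grab":
            self.grab_pose = (np.asarray(left_hand, dtype=float),
                              np.asarray(right_hand, dtype=float))
        elif command == "release" and self.grab_pose is not None:
            # Commit the current delta so the next grab continues from here.
            self.base_translation, self.base_scale = \
                self.current_transform(left_hand, right_hand)
            self.grab_pose = None

    def current_transform(self, left_hand, right_hand):
        """Translation and scale to apply to the object this frame."""
        if self.grab_pose is None:
            return self.base_translation, self.base_scale
        l0, r0 = self.grab_pose
        l1 = np.asarray(left_hand, dtype=float)
        r1 = np.asarray(right_hand, dtype=float)
        dt = (l1 + r1) / 2 - (l0 + r0) / 2                   # midpoint shift
        ds = np.linalg.norm(r1 - l1) / max(np.linalg.norm(r0 - l0), 1e-6)
        return self.base_translation + dt, self.base_scale * ds
```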

Another important issue regarding spatial manipulation is giving the user feedback showing the points of contact between his or her hands and the virtual object before the manipulation starts. To keep the interface free of any cursors and avatars that are not part of the 3D objects being visualized, we implemented a projection of the user’s hands in the shape of a virtual spotlight cast directly onto the virtual objects. With this projection, the user can see the 3D object without the interference of cursors while still having full control of the point of interaction.

B. Separation in Parts and Reassembling

Another common task in 3D model analysis is the separation of models into subparts and the subsequent reassembling of those parts. To enable this task — which can be executed in parallel with a spatial manipulation — we adopted the same strategy used to initiate spatial manipulations: voice commands that do not obstruct the object under study and do not interfere with the immersive effect. Therefore, the user can say “explode” to separate the parts of an object, followed by “assemble” to regroup those parts. These actions can be executed even when the object is grabbed for spatial manipulation, as shown in Figure 2 — left.
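One simple way to implement this effect, sketched below under our own assumptions (the actual offsets used in Fusion4D may differ), is to push each part away from the model centroid along the direction of its own centroid; “explode” animates the offset factor towards 1 and “assemble” animates it back to 0.

```python
import numpy as np

def exploded_offsets(part_centroids, factor):
    """Offset each component part away from the model centroid.

    part_centroids: N x 3 array with the centroid of each part.
    factor: 0.0 (assembled) .. 1.0 (fully exploded).
    Returns an N x 3 array of translation offsets, one per part.
    """
    centroids = np.asarray(part_centroids, dtype=float)
    model_center = centroids.mean(axis=0)
    directions = centroids - model_center   # outward direction per part
    return factor * directions
```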

C. Descriptive Labels

We believe that understanding and learning about complex objects in 3D often also involves learning the names of these objects and their component parts and that, depending on model complexity, being able to refer back to these names interactively can facilitate the comprehension of what the user is visualizing. For this reason, we added a method to the system for showing text labels for each model (shown in Figure 2 — right), which is activated in the same way as the “model explosion” task: the “show labels” voice command shows the descriptive labels, which can be dismissed by the “hide labels” command.

Fig. 2. Left: A user analyzes an exploded virtual object without leaving the spatial manipulation mode. Right: Showing descriptive labels for an object.

D. Choosing among Models

One last feature that is essential to any model analysis system is the ability to choose different models for analysis. In this proof of concept we developed a rudimentary mechanism for choosing between two models, using a voice command to show the available models (“change model”), displayed in Figure 3 — top. From that point, the choice is made using the same visual feedback as a spatial manipulation: a spotlight represents the virtual projection of the user’s hand, and the user can select the highlighted model with the “grab” voice command (or alternatively “select”).

Fig. 3. Top: Choosing between two models. Bottom: Help sidebar is displayed alongside the 3D model.

E. Discoverability and Recall

The principle of reducing visual elements had an important impact on the discoverability of the system. How would a user discover the voice commands? How would she/he remember those commands during usage?

To work around this issue, we added a “help” mode to Fusion4D that temporarily shows a list of available voice commands in a sidebar, shown in Figure 3 — bottom. The 3D scene remains interactive while the help sidebar is displayed, and by hypothesis the user should feel like the help bar is at the same depth as the screen while models float in front of it. This solution conflicts with our basic requirement of no visual elements that are not directly related to the object being examined, but it was a good compromise between usability and immersion, as the user can call or dismiss the help sidebar at any time through voice commands (“help” / “close help”).

F. Navigation in the Fourth Dimension

Going beyond common 3D visualization tasks, during our research we noted that the use of voice commands as an additional input modality could enable other dimensions for navigation that exceed the limits of 3D space. We therefore decided to test this hypothesis by adding navigation in a 4th dimension — such as time or another varying attribute of the model — to the proof of concept, to verify the impact and usability of an extra-spatial navigation in the context of an object in augmented reality.

To implement this 4th dimension, we added another voice command that starts navigation in a new dimension (in our case, “time”). For this study, the 4th dimension was mapped to a one-dimensional horizontal line representing “time” (perpendicular to gravity), which the user can manipulate with one hand, as shown in Figure 4.

Fig. 4. Navigating in the fourth dimension — time. Top: The fourth dimension is mapped to a one-dimensional timeline. Bottom: The user can change the model development stage using horizontal hand movements.
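The mapping from hand movement to the time dimension can be as simple as the sketch below, an illustration with assumed ranges rather than the exact implementation: the horizontal hand position, clamped to an interaction range in front of the sensor, is linearly mapped to an index into the sequence of development stages.

```python
def hand_to_stage(hand_x, num_stages, x_min=-0.4, x_max=0.4):
    """Map a horizontal hand position (meters, sensor space) to a stage index.

    x_min and x_max define the assumed horizontal interaction range.
    Returns an integer index in [0, num_stages - 1].
    """
    t = (hand_x - x_min) / (x_max - x_min)   # normalize to [0, 1]
    t = min(max(t, 0.0), 1.0)                # clamp to the timeline ends
    return round(t * (num_stages - 1))
```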

V. USER STUDY

In order to have a fair evaluation and understanding of the techniques described in this paper, we conducted a series of user studies. Through these studies, we could evaluate the difficulties perceived by the users and collect their remarks concerning the application and the problems in this new interaction paradigm for direct 3D manipulation.

We started by elaborating a test protocol for the experiment. The following questions were addressed: Is the interaction discoverable and predictable? How simple is it to perform the proposed tasks? Which classes of manipulation tasks can be best performed using our proposed interface paradigm? Which aspects of the implemented prototype represent barriers to the use of the proposed paradigm? What is the degree of user satisfaction? The variables under study were: the time to execute each task, the number of mistakes in each task, simplicity in understanding and using the system, the user’s opinion about the system, ease of learning and user satisfaction.

The study exposed 28 participants to 7 simple tasks and analyzed various behavioral and experimental aspects of interaction during task execution. All the participants came from a student or academic background, all had normal or corrected-to-normal vision and no mobility impairments, and none had any prior knowledge of the system. They were evenly distributed between male and female and their ages varied between 17 and 48 years, with an average age of 22. While the objects used for this test were of a medical nature, previous tests run by us with teachers and students in this area showed that enthusiasm for the subject often made these users gloss over deficiencies. In this test, therefore, we chose to have most of our users come from a background in interaction design, and they proved to be significantly more critical. The experiment was conducted in a usability lab using a computer (configuration: quad-core 1.6 GHz CPU, 2 GB RAM, 1 GB dedicated graphics card), a Kinect for Windows sensor, red-blue anaglyph glasses and a 50” Full HD LCD display (resolution: 1920x1080).

The tasks (shown in Figures 5, 6 and 7) were presented in the same order to each participant.

In Task 1, the participant was requested to show the object identification. In Task 2, the participant should show the exploded view of the object. In Task 3, the participant was requested to regroup the exploded object. The outcomes expected from these tasks are shown in Figure 5.

Fig. 5. Task 1: Show the object identification. Task 2: Show the exploded view of the object. Task 3: Assemble the exploded object.

For Task 4, the participant should pick up the object and translate it horizontally, releasing it when done. In Task 5, the participant was requested to rotate the object by 45 degrees around any axis and release it when done. For Task 6, the participant should select a different 3D model. These tasks have the expected outcomes shown in Figure 6.


Fig. 6. Task 4: Translate the object. Task 5: Rotate the object. Task 6: Select a different 3D model.

In Task 7, the participant was requested to enter the time navigation mode and change the development week of a fetus model from week 6 to week 5. The different weeks available for the fetus model are shown in Figure 7.

Fig. 7. Task 7: Change the model’s development week.

During the experiment, participants were not given any specific instructions for the tasks to perform. The instructions given at the beginning of the test suite were: the user’s starting position in the room; a basic explanation that users should interact using their voices and bodies; and an explanation that there was a “help” voice command that would show the voice commands available in the system and that could be used at any time if they were lost. The participants could try out each new task as many times as they wanted.

A. Issues with Voice Recognition and Visual Feedback

Although we had performed several tests to calibrate the voice recognition, we questioned the efficacy of this feature in preliminary tests performed with 11 users, since some users found it difficult to complete tasks: they had to repeat the correct command many times to proceed with the task and sometimes had to voice a command too loudly. Based on feedback from these preliminary participants, and since our objective was not to evaluate the performance of the voice recognition, we decided to drop the voice recognition feature during further testing and simulate it using the Wizard of Oz technique [22]. In other words, the test participant thought he or she was communicating with the device using a speech interface, but the participant’s commands were actually being entered into the computer by an operator and processed as a text stream, rather than as an audio stream. The first results (using full speech recognition) were not considered for analysis in this paper.

VI. RESULTS AND DISCUSSION

After performing the 7 tasks in the tests, users answered a questionnaire to determine their subjective opinion about certain aspects of the system as well as to collect qualitative information such as complaints, suggestions, explanations, and favorite and least favorite aspects of the system.

When asked to evaluate specific aspects of the system, users graded these aspects individually from 1 to 5, with 1 always indicating the least and 5 the greatest understanding, satisfaction, performance, etc. It was observed that users were very reluctant to assign either the lowest or the highest score to a given feature, even when their qualitative answers and personal interactions clearly indicated high enthusiasm or dissatisfaction.

The aspects subjectively evaluated by the users were: overall impression of the system, how clear the 3D objects and the texts (present in the labels and help, for instance) were in our setup, how closely the virtual object followed the user’s hand motions, how comfortable the system was to use, how easy the test tasks were to perform, and how well users could perceive depth in our simple anaglyph stereo setup.

Taking into consideration the fact that users hardly ever assigned the highest or the lowest scores to any question despite their clear enthusiasm or dissatisfaction, we decided that an average score greater than 3.5 (i.e. closer to “good” than to “average” on our scale) indicated a satisfactory result for that aspect, while lower averages indicated problems.

For the overall impression of the system, we could refute the hypothesis that the average was below 3.5 with high certainty (p = 0.00). The average for this score was 4.1, and this was the aspect for which users most often assigned the maximum score of 5. Given the results for the other aspects, we question this average somewhat and wonder whether it indicates a wish of our users to encourage future work. The clarity of images and text got the second highest average score, 3.9, which also refutes the hypothesis of problems with this aspect with p = 0.00. Users were also very satisfied with how closely the virtual object followed their physical hand movements during direct manipulation. This aspect had an average score of 3.7 and refuted the hypothesis of problems with p = 0.01, which we consider quite acceptable.
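These hypothesis tests correspond to a one-sided test of the mean score against the 3.5 threshold. The exact test is not named above, so the sketch below should be read only as an illustration of the kind of computation involved, using a one-sample t-test and made-up scores rather than the collected data (the alternative argument requires SciPy 1.6 or later). For aspects where the suspected problem is a high score, such as depth perception below, the direction of the alternative is reversed.

```python
from scipy import stats

# Illustrative 1-5 questionnaire scores for one aspect (not the real data).
scores = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4,
          3, 4, 4, 4, 5, 4, 3, 4, 4, 4, 5, 4, 4, 4]

# H0: mean <= 3.5 (problems); H1: mean > 3.5 (satisfactory).
t_stat, p_value = stats.ttest_1samp(scores, popmean=3.5, alternative="greater")
print(f"mean = {sum(scores) / len(scores):.2f}, p = {p_value:.3f}")
```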

Another aspect we are quite certain of is that our strategy for visualization of depth, using anaglyph stereo views, low-cost glasses and no per-user calibration, has problems. We actually consider the score for this aspect low enough to indicate there might be a problem with our stereo implementation. The average score for this aspect was 2.9 and we can refute the hypothesis that this value is greater than 3.5 with p = 0.00. Furthermore, 61% of all participants reported they could not sense the object depth most of the time, and 6 of the 28 users reported discomfort caused by the stereo visualization after the brief tests. We are currently searching for possible implementation errors in the stereo visualization and, whether they are found or not, we plan to explore more sophisticated techniques for depth perception.

The two remaining aspects did get scores greater than or equal to 3.5, but they were close enough to this value for us not to refute the hypothesis of problems, due to the high probability that this average would be lower with a different sample.

For the ease of execution of the test tasks using the system, the average score was 3.5, which yields a value of p as large as 0.5. We believe this was partially due to our test procedure and partially due to our strategy for voice commands, particularly the decision not to give the users any training and only the most basic information prior to the test. Out of the 14 users who assigned a score lower than 4 to test ease, 4 complained about the lack of training in their qualitative answers, 3, for some reason, commented that they did not know which tasks to perform, and 4 complained about having to use voice commands. While we wished to test the system’s discoverability, we counted on users taking advantage of the help functionality that was explained to them, and we did not implement a more flexible speech recognition strategy (even when using the Wizard of Oz technique we stuck to the original simple commands). Several users, however, resisted using the help and instead kept trying to guess correct commands. This guessing would have been easier if we had simply accepted more synonyms as options, including object names for object selection, which would certainly have increased discoverability and was even suggested by several users. This is an improvement we will prioritize in future work, even more than improving our stereo visualization (since the current stereo implementation is just a stand-in for more sophisticated solutions). Using voice commands for grabbing and releasing the virtual object was also pointed out as more cumbersome than the other commands. Not only were these commands repeated more often during the tests, but it would probably also have been more natural, and closer to how we interact with physical objects, to grab the virtual object by closing the hands instead of issuing a voice command. We are currently implementing this improvement for the next version of Fusion4D.

Finally, regarding comfort of use, the average score was 3.6, but we cannot refute the hypothesis of problems either, because p in this case was 0.25. Once again, 14 users assigned scores below 4 to this aspect. The discomfort due to the stereo visualization certainly contributed to the lower scores, since 6 users complained specifically about it. We also suspect that, because users were not instructed to keep their arms and elbows closer to their bodies in a more comfortable position during manipulation (which was certainly possible), several ended up using broader movements than necessary, which could be a cause of fatigue or discomfort. In this prototype we took advantage of Kinect’s skeleton tracking to obtain hand positions, so users had to be standing to interact with the system. While no users explicitly mentioned or commented on this need to stand in any of their qualitative feedback, we believe it may have been another cause of some of the lower scores in the comfort aspect, and we plan to investigate this further in the future.

Our statistics for task completion time were unfortunately uninformative, with very large variance among the users even for rather simple tasks such as those that simply required voicing a one-word command. The number of errors recorded per user and per task, however, was low (less than 1) and showed much less variance among users. Considering that participants were presented with the tasks and the system with minimal instructions and no training, this leads us to believe that this variation chiefly represents a difference in the time each user took to learn how to perform each task.

This conclusion is supported by observations of the tests themselves. As an example, the very first task only required the users to speak the command “label” once. Average task completion time, however, was 15.9s, with a standard deviation of 15.5s. During the tests we could clearly distinguish three groups of users: most of them found out how to execute this task (and the others) quite fast. Some took moderately longer, trying out alternatives or calling up and reading the help. Only a few took very long times, usually due to insisting on errors or resisting calling up the help. Applying k-means clustering with three classes to the times recorded for this first task identified these groups more precisely: 55% of users were in a cluster with an average task time of 5.9s and a standard deviation of 2.9s, 30% averaged 18.2s with a deviation of 3.8s, and 15% took 47.7s on average, with a standard deviation of 11.7s.
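The clustering mentioned above can be reproduced with standard k-means on the one-dimensional completion times; the sketch below uses scikit-learn and illustrative times rather than the recorded data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative task-1 completion times in seconds (not the recorded data).
times = np.array([4.2, 5.1, 6.3, 7.0, 5.5, 8.1, 3.9, 6.8, 5.0, 4.7, 9.2,
                  16.2, 19.5, 21.0, 17.8, 14.9, 22.3,
                  38.0, 47.5, 55.2, 61.0])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(times.reshape(-1, 1))
for label in range(3):
    cluster = times[kmeans.labels_ == label]
    print(f"cluster {label}: {len(cluster)} users, "
          f"mean {cluster.mean():.1f}s, std {cluster.std():.1f}s")
```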

Further evidence pointing in this direction is that, while in general the error rates were low, for the tasks that were the first to introduce a new interaction element, such as the first time speech, hand movements or the timeline slider were used, average error rates were significantly higher than for the tasks that immediately followed them and used the same elements (around 7 times higher, with p = 0.00 in paired t-tests). This even happened when the following task was theoretically more complex, such as when rotation followed translation. In fact, if it were not for these “first time discovery” errors, our error rates for all tasks would be very close to zero. Incidentally, this may present another argument in favor of investigating the use of object manipulation to navigate back and forward in time [18], as discussed earlier, so we can reduce the number of additional interface elements and use previously acquired skills for more tasks.
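This comparison corresponds to a paired test across users, pairing each user's error count on the task that first introduced an interaction element with the count on the following task that reused it; the sketch below illustrates the analysis with made-up numbers rather than the recorded data.

```python
from scipy import stats

# Per-user error counts (illustrative, not the recorded data): first task that
# introduced an interaction element vs. the following task reusing it.
first_time_errors = [2, 1, 3, 0, 2, 1, 1, 2, 0, 1, 2, 1, 1, 0]
follow_up_errors  = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

t_stat, p_value = stats.ttest_rel(first_time_errors, follow_up_errors)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```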

Given this impact of the discovery and learning times on task execution times, error rates and even, apparently, on user subjective opinions, in the future we should be careful to test discoverability in an isolated experiment and, when testing to evaluate performance or satisfaction, provide adequate training to the users, either by showing them a short instruction video and letting them explore the system for a fixed period of time before giving them the first task, or by combining the instructions with training tasks (not evaluated as part of the test) in a game-like tutorial.

Regarding qualitative feedback from the users, 50% described considerable enthusiasm for the 3D object manipulation using their own hands and chose it as their favorite system feature. Only 2 of the 28 users disliked it, but we could not determine why. The ability to navigate in time and to visualize the object’s component parts was the favorite of 29% of all users, and only 1 of them disliked it. As for the use of voice commands, only 14% of the users enjoyed this feature and approximately an equal number had complaints about it, but we believe this is due, at least in part, to our implementation of the grab and release commands using speech instead of the gestures that would be more natural for this task (as already discussed) and that are being implemented in the system’s next version.

VII. CONCLUSIONS

Direct manipulation interfaces may go through significant changes with holographic displays. In this work we presented a prototype of 3D interaction techniques implemented on a simulated holographic display and the user tests performed with it. The results show that the technology for our simulated holographic display requires improvement, and the interface’s discoverability requires further evaluation. Furthermore, while we are currently using the classic horizontal timeline slider to control navigation in the time dimension, in future work we would like to investigate the technique based on the manipulation of the objects themselves [18] as a potentially more natural way to navigate in time in the context of our application. Another future work we are interested in is comparing user performance and satisfaction between a system like the one presented in this paper and a system using a video avatar, i.e. a system where the user’s own image is captured, segmented [23] and integrated into a virtual environment as an avatar to manipulate virtual objects. Despite the improvements that are still necessary, our test results were positive in terms of user acceptance of the proposed techniques. This evidence motivates us to continue along this line of research, investigating and proposing direct manipulation and visualization techniques suitable for holographic displays.

VIII. ACKNOWLEDGMENTS

We would like to thank Centro Universitário Senac, Eduardo Sonnino and Ian Muntoreanu.

REFERENCES

[1] B. Shneiderman, “Direct manipulation: A step beyond programming languages,” Computer, vol. 16, no. 8, pp. 57–69, Aug. 1983. [Online]. Available: http://dx.doi.org/10.1109/MC.1983.1654471

[2] J. Canny, “The future of human-computer interaction,” Queue, vol. 4, no. 6, pp. 24–32, Jul. 2006. [Online]. Available: http://doi.acm.org/10.1145/1147518.1147530

[3] R. A. Bolt, ““Put-that-there”: Voice and gesture at the graphics interface,” SIGGRAPH Comput. Graph., vol. 14, no. 3, pp. 262–270, Jul. 1980. [Online]. Available: http://doi.acm.org/10.1145/965105.807503

[4] W. Robinett and R. Holloway, “Implementation of flying, scaling and grabbing in virtual worlds,” in Proceedings of the 1992 Symposium on Interactive 3D Graphics, ser. I3D ’92. New York, NY, USA: ACM, 1992, pp. 189–192. [Online]. Available: http://doi.acm.org/10.1145/147156.147201

[5] T. Baudel and M. Beaudouin-Lafon, “Charade: remote control of objects using free-hand gestures,” Commun. ACM, vol. 36, no. 7, pp. 28–35, Jul. 1993. [Online]. Available: http://doi.acm.org/10.1145/159544.159562

[6] A. Butler, O. Hilliges, S. Izadi, S. Hodges, D. Molyneaux, D. Kim, and D. Kong, “Vermeer: direct interaction with a 360° viewable 3D display,” in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’11. New York, NY, USA: ACM, 2011, pp. 569–576. [Online]. Available: http://doi.acm.org/10.1145/2047196.2047271

[7] G. Ruppert, P. Amorim, T. Moraes, and J. Silva, “Touchless gesture user interface for 3D visualization using the Kinect platform and open-source frameworks,” in Innovative Developments in Virtual and Physical Prototyping. Boca Raton: CRC Press, 2012, pp. 215–219. [Online]. Available: http://dx.doi.org/10.1201/b11341-35

[8] Siemens, “Game console technology in the operating room,” http://www.siemens.com/innovation/en/news/2012/e_inno_1206_1.htm, 2012.

[9] L. Gallo, A. Placitelli, and M. Ciampi, “Controller-free exploration of medical image data: Experiencing the Kinect,” in Computer-Based Medical Systems (CBMS), 2011 24th International Symposium on, Jun. 2011, pp. 1–6.

[10] Y. Yin and R. Davis, “Toward natural interaction in the real world: real-time gesture recognition,” in International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ser. ICMI-MLMI ’10. New York, NY, USA: ACM, 2010, pp. 15:1–15:8. [Online]. Available: http://doi.acm.org/10.1145/1891903.1891924

[11] R. Wang, S. Paris, and J. Popovic, “6D hands: markerless hand-tracking for computer aided design,” in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’11. New York, NY, USA: ACM, 2011, pp. 549–558. [Online]. Available: http://doi.acm.org/10.1145/2047196.2047269

[12] J. L. Reisman, P. L. Davidson, and J. Y. Han, “A screen-space formulation for 2D and 3D direct manipulation,” in Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’09. New York, NY, USA: ACM, 2009, pp. 69–78. [Online]. Available: http://doi.acm.org/10.1145/1622176.1622190

[13] O. Hilliges, S. Izadi, A. D. Wilson, S. Hodges, A. Garcia-Mendoza, and A. Butz, “Interactions in the air: adding further depth to interactive tabletops,” in Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’09. New York, NY, USA: ACM, 2009, pp. 139–148. [Online]. Available: http://doi.acm.org/10.1145/1622176.1622203

[14] R. George and J. Blake, “Objects, containers, gestures, and manipulations: Universal foundational metaphors of natural user interfaces,” in Proceedings of CHI 2010, 2010.

[15] V. Pavlovic, R. Sharma, and T. Huang, “Visual interpretation of hand gestures for human-computer interaction: a review,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, no. 7, pp. 677–695, Jul. 1997.

[16] D. A. Bowman, E. Kruijff, J. J. LaViola, and I. Poupyrev, 3D User Interfaces: Theory and Practice. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc., 2004.

[17] S. Pongnumkul, J. Wang, G. Ramos, and M. Cohen, “Content-aware dynamic timeline for video browsing,” in Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’10. New York, NY, USA: ACM, 2010, pp. 139–142. [Online]. Available: http://doi.acm.org/10.1145/1866029.1866053

[18] D. B. Goldman, C. Gonterman, B. Curless, D. Salesin, and S. M. Seitz, “Video object annotation, navigation, and composition,” in Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’08. New York, NY, USA: ACM, 2008, pp. 3–12. [Online]. Available: http://doi.acm.org/10.1145/1449715.1449719

[19] L. Sharpe, A. Stockman, H. Jägle, and J. Nathans, “Opsin genes, cone photopigments, color vision, and color blindness,” in Color Vision: From Genes to Perception, 1999.

[20] H. Sanftmann and D. Weiskopf, “Anaglyph stereo without ghosting,” in Proceedings of the 2011 Eurographics Symposium on Rendering, 2011.

[21] I. Ideses and L. Yaroslavsky, “Three methods that improve the visual quality of colour anaglyphs,” Journal of Optics A: Pure and Applied Optics, vol. 7, 2005.

[22] J. Kelley, “An iterative design methodology for user-friendly natural language office information applications,” ACM Transactions on Office Information Systems, vol. 2, no. 1, pp. 26–41, 1984.

[23] S. R. R. Sanches, R. Nakamura, V. da Silva, and R. Tori, “Bilayer segmentation of live video in uncontrolled environments for background substitution: An overview and main challenges,” Latin America Transactions, IEEE (Revista IEEE America Latina), vol. 10, no. 5, pp. 2138–2149, 2012.
