
Ubii: Towards Seamless Interaction between Digital and Physical Worlds

Zhanpeng Huang, Weikai Li, Pan Hui
HKUST-DT System and Media Laboratory
Hong Kong University of Science and Technology, Hong Kong, China
{soaroc, weikaili, panhui}@ust.hk

ABSTRACT

We present Ubii (Ubiquitous interface and interaction), an interface system that aims to expand people's perception and interaction from the digital space to the physical world. The centralized user interface is broken into pieces woven into the domain environment. The augmented user interface is paired with physical objects, so that physical and digital presentations are displayed in the same context. The augmented interface and the physical affordance respond as one control to provide seamless interaction. By connecting the digital interface with physical objects, the system presents a nearby embodiment that affords users a sense of awareness for interacting with domain objects. Integrated on wearable devices such as Google Glass, it affords a less intrusive and more convenient interaction. Our research illustrates the great potential of directly mapping interaction between digital interfaces and physical affordances by converging wearable devices and augmented reality (AR) technology.

Categories and Subject Descriptors

H.5.m. [Information Interfaces and Presentation (e.g., HCI)]: Augmented and Virtual Realities

General Terms

Design, Human Factors

Keywords

Physical affordance; augmented reality; wearable device; user interface; freehand interaction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. MM'15, October 26–30, 2015, Brisbane, Australia. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3459-4/15/10 ...$15.00. DOI: http://dx.doi.org/10.1145/2733373.2806266.

Figure 1: Ubii enables users to perform in-situ interactions with physical objects using pinch gestures. Users are able to 1) copy documents between computers, 2) print a document with a printer, 3) share documents on an architecture partition, and 4) display documents on a projector screen by picking, dragging, and dropping AR overlays aligned with ① computers, ② printers, ③ architecture partitions, and ④ projector screens at a distance without physical touch.

1. INTRODUCTION

In the past thirty years, we have relied primarily on screen-based Graphical User Interfaces (GUIs) to interact with the digital world. No matter whether the screen is placed on a desk (e.g., a PC screen), held in a hand (e.g., a mobile phone screen), or displayed on a wall (e.g., a projection screen), it has cultivated a predominantly visual paradigm of human-computer interaction [36]. However, a traditional centralized GUI displays all information in the foreground. The interaction is separated from the physical world that we live in and interact with, and it cannot leverage the rich senses and skills that we have acquired from a lifetime of interaction with the physical environment [20].

Efforts have been focused on developing direct interaction between the physical and digital worlds. As physical objects appeal to more of our senses, they have been redesigned as interaction metaphors. Architectural partitions [20], physical props [15], mobile devices [6], and the human body [13] are used to afford tangible interaction at a distance. However, using physical objects as manipulation metaphors introduces intrusion into user interaction. In particular, conventional user interfaces are designed independently of the physical affordances of the underlying objects, so users have to memorize manipulations just as in traditional GUI interaction.

As a new interface straddling the real and the digital, AR bridges the gap between physical and virtual worlds [18]. It supports a sense of shared presence and communal social space. AR has been widely used to help users access additional information for a task without getting distracted from the task [16][34]. Another line of work leverages AR to provide 3D menu interaction in the current view of the real world [24][30]. These works generate an egocentric virtual user interface in front of the current view, neither breaking the traditional centralized paradigm nor enabling interaction with physical objects.

Inspired by Weiser's argument that computational services should be delivered through different devices whose design and location are tailored to support a particular task or set of tasks [38], we designed Ubii, a ubiquitous AR-based system that pairs tailored interfaces with the physical affordances of objects. Ubii breaks the traditional centralized user interface into individual pieces aligned with physical objects. It is a directly mapped interface enabling users to perform in-situ interactions with physical objects at a distance. As illustrated in Figure 1, users are able to manipulate physical objects including computers, projector screens, printers, and architecture partitions to complete tasks such as document copy, printing, sharing, and projection display. Individual AR menu overlays are aligned with the physical objects to present their physical affordances. Developed on Google Glass, Ubii enables gesture-based freehand interaction with much less intrusion and more convenience.

2. RELATED WORK

Ubii builds on several technologies, including interaction at a distance, freehand interaction, and AR menus.

2.1 Interaction at a distance

Early systems such as PointRight [21] and Perspective Cursor [27] allow distant pointer motion across independently-driven display surfaces that are not organized in a single plane. These systems require indirect local input and complicated setups. Other works [39][25] overcome these problems using absolute pointing techniques, but require additional pointing devices. In these methods motion errors are amplified as distance increases.

Mobile devices have been widely used as physical interfaces that provide the foundation of a new interaction paradigm [1]. Relative pointing on a mobile device screen is transferred to distant screens to reduce interaction distance. The Point & Shoot [2] technique employs visual codes to set up an absolute coordinate system for object selection with mobile phones. To create a self-contained system without fiducial markers, Boring et al. leveraged preformed content arrangement to determine the spatial relationship between the mobile device and the display screen [5]. Chang and Li [10] allowed arbitrary content arrangement by matching the captured images with content on display screens. The method does not support continuous interaction, as content snapshots are required in advance for further analysis.

Some works support fluent interaction through the video stream captured by built-in cameras on mobile devices. Device screens act as portholes into the physical world, through which users can interact with physical objects. The world-in-miniature metaphor facilitates interaction at a distance. Touch Projector [6] affords bimanual interaction with one hand holding a mobile phone and the other hand interacting with remote screens through live video on the mobile phone. Virtual Projection [3] uses the mobile phone as a controllable projection metaphor to interact with content on stationary displays. More works can be found in the survey [26]. As mobile devices occupy users' hands when they are used as manipulation metaphors, these methods cannot provide freehand interaction. Additionally, the interaction context is scaled down on smaller display screens, impeding the precision of fine manipulations.

2.2 Freehand interaction

Freehand interaction has been explored to deliver natural and intuitive yet effective interaction [30], especially for non-desktop scenarios (e.g., mid-air interaction) where keyboard and mouse devices are not appropriate or available. Previous works [19][29] require fiducial markers or digital gloves to track hand gestures. Other works leverage Microsoft Kinect to detect hand poses and movements for freehand menu selection [12] and object manipulation [35]. However, these methods suffer from drawbacks: instrumented gloves are cumbersome and prone to induce fatigue, while fiducial markers and ambient sensors require delicate setups and calibrations.

Image-based methods [40][4] have also been proposed to detect and recognize hand gestures using image processing techniques, which are suitable for both closed and public environments [33]. As mobile devices become more powerful, image-based methods are promising for them, since built-in cameras can be used without resorting to additional devices or sensors.

2.3 AR menu

Menus are widely used in traditional GUIs to increase the communication bandwidth with computers. However, using menus in the domain of AR is not straightforward, especially when the context is extended to a three-dimensional environment. In order to support mid-air operations, non-pull-down menus such as the ring menu [23] and the tile menu [32] have been designed for manipulation in three-dimensional space.

As traditional mouse and keyboard devices have been discarded to reduce intrusion in AR applications, other inputs have been developed for menu manipulation. In Studierstube [41], a tablet-and-pen device is used for menu selection; however, users have to hold the tablet and pen in their hands. Physical props such as fiducial markers [31] and fingers are also used to select menu items: the props are recognized and tracked using image processing techniques and then mapped to menu manipulation. To capture diverse motions such as wrist tilt and multiple pinch gestures, digital gloves shipped with motion tracking sensors are used [30].

2.4 Rethinking

Early systems provide seamless interaction in order to share scarce computing resources. The problem is partially alleviated as computing devices (e.g., smartphones) become miniaturized and ubiquitous. Mobile devices are promising for friendly user interaction when direct touch is unavailable or cross-display interaction is required. Although embedded cameras on mobile devices have been shown to be a suitable input [7], device intrusion disturbs the user experience. Freehand gesture-based methods do not introduce much intrusion, but they suffer from low precision [4]. A gesture-based menu system is promising for freehand interaction with higher communication bandwidth, but it is difficult to select menu items even with advanced low-pass filters [30].

Figure 2: Left panel: a user picks a document item from a computer and drags it towards another computer. Right panel: the user drops the document item on the other computer to finish the document copy.

Our lifetime of interaction with the physical world tells us that a large part of our interaction consists of daily operations, which require minimal visual attention because the functionality of objects is fully understood and works as expected [17]. An ideal user interface should embrace the richness of human senses and the full set of skills acquired from our daily activities. It appears when and where it is required; otherwise it disappears to minimize visual attention and intrusion. Interaction should also be channeled directly to the physical objects rather than through a single workstation (as in traditional GUIs), so that it is not separated from the context where we reside and act.

To address these issues, we propose Ubii, a system designed to be invisible in the ecology of workplaces. Herein, invisibility means it does not intrude on user activities. Ubii is similar to the work in [17], with a series of improvements. Rather than reprogramming the functions of physical objects, we focus on seamless interaction with individual physical objects without intrusion. Wearable devices such as Google Glass replace mobile devices to enable freehand interaction. A 3D ring menu is used to support mid-air operations, while the pinch gesture is adopted to meet the required precision of menu manipulation. By converging these technologies, Ubii establishes a direct type of affordance, similar to the affordance of direct touch. The aim of our research is to illustrate a concrete interaction approach that moves beyond the current GUI paradigm, which is bound to computers with flat windows, mice, and keyboards.

3. SYSTEM DESIGN

Designing a system such as Ubii requires consideration of physical affordance, which determines the physical objects perceived by the users, the virtual interface presented to the users, and the interaction through which users bridge the virtual interface and the physical objects.

3.1 Physical affordance

All objects present some sort of physical affordance, but in our system we only focus on those that have relations with the digital world. For instance, a computer provides an affordance for entering cyberspace; a printer exhibits an obvious affordance for printing digital documents. Physical surfaces such as tabletops and architecture partitions (e.g., walls and ceilings) can also act as tangible interfaces. As an example, a blank wall can be treated as a virtual bulletin board for sharing information among group members. Authorized members are able to view notes and documents on the virtual bulletin board, while walk-in visitors only see the blank wall.

Figure 3: Left panel: a user is dragging a document item towards a printer. The capture zone of the printer is indicated by a virtual blue overlay. Right panel: the printer captures the document item and prints out the document.

Table 1: The applicable unidirectional (row to column) interactions between paired devices.

           Computer   Printer   Screen   Surface
Computer      X          X        X         X
Printer       ×          ×        ×         ×
Screen        ×          ×        X         X
Surface       X          X        ×         X

Users require an interface to interact with the objects. With Ubii, users perform the affordance through the interface at a distance without physical touch. Remote operations reduce manipulation distance and facilitate cross-device interaction, which is especially preferable for different groups of objects that are connected with each other. Table 1 lists the physical objects and the applicable interactions that Ubii is interested in. Although the physical objects may be connected through network communication, interaction between the objects is not mutual from the user's point of view. For instance, a user can drag and drop a document from a computer to a printer for printing, but it is impossible or meaningless to transfer anything from a printer to a computer. The applicable interactions between the objects are explained in detail below:

Computer: Computers are able to interact with all objects listed in Table 1. Ubii enables a computer to interact with itself for document and image manipulations (e.g., moving and zooming) at a distance. It can interact with other computers for document copy without portable mass storage devices (e.g., USB drives). A computer can ask a printer to print documents without the user physically touching either device. With a single gesture it sends documents for display on a large projector screen. In the Ubii system it can even "upload" digital documents to physical surfaces (e.g., walls) for document sharing among multiple users. Figure 2 shows an embodiment of transmitting documents between two computers: a user pinches a document item from one computer, drags it towards another computer, and drops it to finish transmitting the document, without physically operating either computer.

Printer: Printers are designed to work as passive receivers and do not have much interaction in general. In Ubii, printers do not actively interact with other objects. However, as important output devices, they can passively interact with other objects including computers and physical surfaces. As illustrated in Figure 3, a user drags a document item from a computer towards a printer. When the document item moves into the virtual capture zone of the printer's affordance, the printer captures the document and starts to print it.


Figure 4: Left panel: a user is dragging an image item from a computer with a pinch gesture. Right panel: the image item is captured by the projector screen and then displayed on it.

Projector Screen: The projector screen presents an implicit affordance of large-screen display. From the user's point of view, they interact directly with the projector screen, while it has to collaborate with a projector and a computer to complete the interaction. As a passive display device, the projector screen is unable to actively interact with the computer. However, it can interact with other projector screens and physical surfaces to afford cross-display interaction at a distance. It would be possible to ask a printer to print a document presented on a projector screen; however, considering that the document must already be stored on a computer, it is easier to turn to the computer and printer for the printing job, so Ubii disables direct interaction between the projector screen and the printer. Figure 4 gives an example of displaying an image on a projector screen: a user pinches an image item from a computer with his fingers, and when the image item reaches the capture zone of a projector screen, the projector screen captures and displays the image.

Physical Surface: Acting as a document sharing workspace in Ubii, the physical surface enables downloading documents to personal computers, the reverse of uploading documents from computers to the physical surface. Authorized group members are able to drag a document down to a printer to print it out. Ubii disconnects the interaction between the physical surface and the projector screen for privacy considerations; in the case of projecting a document on a projector screen, users can always resort to the computer from which the document was uploaded. Figure 5 shows a user manipulating a document on an architecture partition with his hand. Documents are oriented and aligned to the partition surface, so that users are able to move documents on the physical surface.
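The pairing rules of Table 1 amount to a small capability map that a drop target can be checked against before any network command is issued. The following is a minimal sketch of such a check; the type names and function are illustrative assumptions, not part of the Ubii implementation.

```python
# Illustrative encoding of Table 1's unidirectional pairing rules
# (type names and function are hypothetical, not from the paper).

ALLOWED_TARGETS = {
    "computer": {"computer", "printer", "screen", "surface"},
    "printer":  set(),                      # printers only receive, never initiate
    "screen":   {"screen", "surface"},
    "surface":  {"computer", "printer", "surface"},
}

def can_interact(source_type: str, target_type: str) -> bool:
    """True if dragging an item from source_type onto target_type is applicable."""
    return target_type in ALLOWED_TARGETS.get(source_type, set())

assert can_interact("computer", "printer")      # print a document
assert not can_interact("printer", "computer")  # meaningless direction
assert not can_interact("surface", "screen")    # disabled for privacy reasons
```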

By exploring the applicable physical affordances in a typical workspace, we derive some guidelines for selecting physical objects for our system:

• Physical objects should have a connection with the digital world. It is possible to add a virtual interface to physical objects isolated from the digital world to enhance their relations with cyberspace, but doing so requires additional actuators [17] to perform the interaction, which entails retrofitting the functions of the physical objects.

• Only simple and coarse-grained interactions (e.g., document copy and printing) are considered, which share the common features of short duration, frequent use, and simple input. Complicated and fine-grained interactions (e.g., text editing and fine art creation) that require long, precise manipulations or multiple inputs are not suitable for Ubii.

Figure 5: Left panel: a user is moving a document item on a physical surface with a pinch gesture. Right panel: the selected document moves along with his hand on the physical surface.

• The augmented interface should neither redefine the physical affordance nor desensitize users' awareness of the physical affordance; otherwise new users would require pre-training before using the system. It should be straightforward for most users to perform the operations.

3.2 Menu design

A menu system in the domain of AR applications does not necessarily have to be two-dimensional; it can feature depth, rotation, and position in 3D space. We follow the criteria in [8] to design an interactive menu for our system:

Placement: Menu placement is classified into four categories [11]. We adopted object-referenced placement to attach the menu to the physical objects. This is consistent with the design goal of breaking the centralized interface into tailored individual interfaces associated with physical objects. Visual markers tagged on the physical objects are used to anchor the individual menus, and the menus move along with the physical objects. Menu sizes vary according to the relative spatial relation between the user and the physical object to indicate depth and distance.

Orientation: Aligning the menu with the user's view makes it easy to read but may occlude the environment [8]. Instead, we align the menu with the physical object's surface, which is tagged with the marker, to improve readability. In addition, this achieves a better 3D spatial presence of the menu.

Trigger mechanism: A menu may be visible or hidden based on the active state of the corresponding object. The menu becomes visible if the physical object and its visual marker appear in the user's current view. A menu item is triggered whenever a user pinches the item with their fingers. A capture zone is activated if a menu item is dropped into its capturing region. All triggers are translated into socket messages that perform the underlying network communication.

Owing to its short target-seeking time and low selection error [9], the ring menu is adopted in Ubii and combined with hand-based interaction for freehand menu selection. Menu items are distributed on an invisible ring around an anchor determined by the visual marker. A traditional ring menu enlarges the ring or uses a hierarchical structure when a single ring is too crowded to display all menu items; neither method remains feasible as the number of items grows and space becomes tight. To solve the problem, we make menu items rotatable along the ring. As illustrated in Figure 6, only a few menu items are active and the others are folded; active items disappear and hidden items appear as the ring menu rotates clockwise.
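As a concrete illustration, such a layout can be computed by placing items at equal angular steps around the marker anchor and showing only the items whose slots fall inside a front arc; rotating the ring shifts that arc. The sketch below follows those assumptions; the class, parameters, and visibility rule are illustrative, not taken from the Ubii implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class MenuItem:
    label: str
    visible: bool = False
    x: float = 0.0  # position relative to the marker anchor
    y: float = 0.0

def layout_ring(items, radius, rotation_deg=0.0, max_active=4):
    """Place items at equal angular steps on an invisible ring around the anchor.

    Only items whose slot angle falls inside the front arc are shown;
    rotating the ring folds some items away and unfolds others.
    """
    step = 360.0 / len(items)
    for i, item in enumerate(items):
        slot = (i * step + rotation_deg) % 360.0
        item.x = radius * math.cos(math.radians(slot))
        item.y = radius * math.sin(math.radians(slot))
        item.visible = slot < max_active * step   # front arc of the ring
    return items

items = [MenuItem(f"doc{i}.pdf") for i in range(8)]
layout_ring(items, radius=1.0)                    # doc0..doc3 visible
layout_ring(items, radius=1.0, rotation_deg=-45)  # ring rotated: doc1..doc4 visible
```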


Figure 7: The supported hand gesture based interactions in Ubii.

Figure 6: A rotatable and foldable ring menu.

Rather than providing operating items (e.g., "translate" and "print"), operated items (e.g., documents and images) are defined as menu items, since the physical affordance of the object items is already understood by users. Users are able to perform actions directly on object items without invoking operating commands. For instance, dragging a document item implies copying, printing, or displaying the document according to the target onto which the document is dropped.

The capture zone is designed to decrease the required precision of interaction. Any menu item dropped within the range of a capture zone is captured by the receiver to trigger the underlying commands. For instance, when a user selects a menu item (e.g., a document) of a computer, all printers, projector screens, and predefined physical surfaces in the user's current view are overlaid with their capture zones. If the item is dropped into the capture zone of a printer, the printer captures the document and starts to print it out. When the item is dropped into the capture zone of a projector screen, the computer connected with the projector receives the document and then asks the projector screen to display it.
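In essence, resolving a drop is a hit test of the drop position against the screen-space regions of the overlaid capture zones. A minimal sketch of that resolution step follows; the class, fields, and coordinates are illustrative assumptions rather than the paper's implementation.

```python
from typing import List, Optional

class CaptureZone:
    """Screen-space overlay of a receiver's capture region (hypothetical fields)."""
    def __init__(self, owner_type: str, owner_ip: str, x: float, y: float, w: float, h: float):
        self.owner_type = owner_type      # "printer", "screen", "surface", ...
        self.owner_ip = owner_ip          # decoded from the object's QR code
        self.rect = (x, y, w, h)

    def contains(self, px: float, py: float) -> bool:
        x, y, w, h = self.rect
        return x <= px <= x + w and y <= py <= y + h

def resolve_drop(px: float, py: float, zones: List[CaptureZone]) -> Optional[CaptureZone]:
    """Return the capture zone that catches a drop at (px, py), if any."""
    for zone in zones:
        if zone.contains(px, py):
            return zone
    return None

# e.g. a document dropped at (150, 120) lands in a printer's capture zone:
zones = [CaptureZone("printer", "192.168.1.20", 100, 80, 160, 120)]
target = resolve_drop(150, 120, zones)
if target is not None:
    print(f"send document to {target.owner_type} at {target.owner_ip}")
```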

3.3 Interaction design

Ubii extends previous video-based interaction to hand gesture interaction without any digital gloves or fiducial marker tracking. However, the pointing gesture is difficult to use for direct positioning operations (e.g., menu selection) because it cannot achieve the same high precision as physical input devices (e.g., mouse and stylus) in free space [30]. The directional gesture is ambiguous and error-prone due to the lack of distinctive gesture delimiters [37].

Figure 8: The transition of interaction operations.

Our essential design principle is to leverage multiple discrete gesture inputs to reduce the required precision of continuous hand gestures. The pinch gesture is stable and precise owing to its discrete and unambiguous states of fingers together or apart [40]. In addition, it appears natural to users as it is evocative of grabbing a physical object [30]. The pinch gesture is suitable for interaction that requires high precision, such as menu picking and dropping. Combined with the directional gesture, it can also be used for coarse interaction, such as dragging. Ubii integrates multiple hand gestures to support several types of freehand interaction:

• Pick: a pinch gesture (as illustrated in Figure 7a) is interpreted as a pick operation. The pick operation should be performed on a menu item, otherwise it is invalid.

• Drop: an un-pinch gesture (as illustrated in Figure 7b) acts as dropping the selected item. A pick operation should always finish with a drop operation.

• Drag: triggered when a directional gesture is performed together with a pick operation (as illustrated in Figure 7c). The selected item moves along with the directional gesture as long as only a single drag operation is active. The drag interaction terminates whenever a drop operation happens.

• Rotate: a combo operation of two pinch gestures that move in reverse along the Y direction in screen space (as illustrated in Figure 7d). This interaction is used to rotate a ring menu located between the two pinch gestures. The relative distance is mapped to the rotation angle. The rotation stops whenever either or both pinch gestures end.

• Zoom: a combination of two drag operations performed on the same item (as illustrated in Figure 7e). The relative movement is translated into scaling the item up or down. As one or both of the drag operations end, the zoom operation is terminated.

The transitions between the different interactions are illustrated in Figure 8. Pick and drop are instantaneous interactions; every interaction starts with a pick and ends with a drop. Drag, rotate, and zoom start from the pick operation and end with the drop operation, and they also include self-transitions for continuous interaction.
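The transition logic of Figure 8 can be read as a small finite-state machine driven by discrete gesture events. The sketch below is illustrative only; the event and state names are assumptions, not the paper's implementation.

```python
# Illustrative state machine for Figure 8 (event and state names are hypothetical).

class InteractionFSM:
    def __init__(self):
        self.state = "idle"

    def on_event(self, event: str) -> str:
        if self.state == "idle":
            if event == "pinch":
                self.state = "picked"                    # pick: a menu item is selected
        elif self.state == "picked":
            if event == "move":
                self.state = "dragging"
            elif event == "second_pinch_reverse_y":
                self.state = "rotating"                  # two pinches moving reversely
            elif event == "second_drag_same_item":
                self.state = "zooming"                   # two drags on the same item
            elif event == "unpinch":
                self.state = "idle"                      # drop right after pick
        elif self.state in ("dragging", "rotating", "zooming"):
            if event == "unpinch":
                self.state = "idle"                      # drop ends the interaction
            # any other event keeps the state (self-transition, continuous input)
        return self.state

fsm = InteractionFSM()
for event in ["pinch", "move", "move", "unpinch"]:       # pick -> drag -> drag -> drop
    fsm.on_event(event)
assert fsm.state == "idle"
```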


Figure 9: The system workflow of Ubii.

4. IMPLEMENTATION

Implementing our system requires technologies from the fields of object tracking and hand gesture recognition. We employed a QR-code-based method for both object tracking and pose estimation, and proposed a double-threshold algorithm to accurately detect pinch gestures. Our work explores the synergetic benefits that go beyond the sum of the individual technologies to achieve our design goals.

Figure 9 shows the system workflow. The built-in camera on Google Glass captures live video of the scene around the user, which is streamed for image analysis of object tracking and hand recognition. The system extracts object information (IP address and object type) from the QR code, which is used for physical affordance recognition and network communication. The camera pose is also estimated by calculating the homography of the QR code image. The virtual camera is updated according to the real camera pose to render 3D menus from the corresponding viewpoint, making the virtual menus align with the surfaces of physical objects. Menus are rendered as AR overlays blended with the video stream, and then displayed on the Google Glass display prism.

Hand contours are extracted from the video stream using a modified, inexpensive method based on [40]. The contours are then filtered using double thresholds to determine pinch gestures. Users are able to interact with physical objects through virtual menus using pinch gestures, which are interpreted as socket messages to perform communication between the physical objects over the underlying network.
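For example, a completed pick-drag-drop could be turned into a small command message sent over a TCP socket to the source device, which then pushes the document to the target whose address was decoded from its QR code. The port and message format below are assumptions for illustration; the paper does not specify its wire protocol.

```python
# Illustrative sketch (hypothetical message format and port): a completed
# pick-drag-drop is translated into a small JSON command sent over a TCP
# socket to the source device.

import json
import socket

UBII_PORT = 5555  # assumed control port, not specified in the paper

def send_command(source_ip: str, target_ip: str, target_type: str, document: str) -> None:
    """Ask the source device to transfer `document` to the target device."""
    command = {
        "action": "transfer",        # copy / print / display / share, depending on target
        "document": document,
        "target_ip": target_ip,
        "target_type": target_type,
    }
    with socket.create_connection((source_ip, UBII_PORT), timeout=5) as sock:
        sock.sendall(json.dumps(command).encode("utf-8"))

# e.g. dropping "report.pdf" from a computer onto a printer's capture zone:
# send_command("192.168.1.10", "192.168.1.20", "printer", "report.pdf")
```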

4.1 Object tracking

Ubii relies heavily on the built-in camera of Google Glass to understand the domain environment. Object tracking involves recognizing objects and evaluating the spatial relationship between the objects and the user. Rather than non-intrusive markerless methods such as [22], we adopted the marker-based baseline, which is generally more accurate and more robust to image distortion and illumination variation [28]. In our system QR codes are used as fiducial markers to encode additional information for understanding object functions.

Figure 10: Object tracking and pose estimation with QR markers.

1 https://pypi.python.org/pypi/qrcode
2 https://github.com/ZBar/ZBar
3 http://docs.opencv.org/

As illustrated in Figure 10, object information including the IP address and object type is encoded into a QR code in advance using the qrcode tool1. For non-wired physical objects such as physical surfaces (e.g., walls and tabletops) and projector screens, the IP addresses of their counterparts are encoded, namely the computers storing the documents shown on the physical surfaces and the computers connected to the projectors for projection display. Camera calibration is also performed beforehand to calculate the intrinsic matrix K with the classic chessboard method [42]. At runtime, whenever one or more QR codes appear in the live video stream captured by the embedded camera, the system resorts to the integrated ZBar library2 to decode the object information from the QR codes, and it can then determine the menu item layout according to the extracted information. The large finder patterns in the QR codes are also extracted to obtain the homography matrix H by solving the coordinate vector-valued function, which is then used to evaluate the pose matrix [R|T] = K^(-1) * H. In practice, we adopted a more efficient and accurate implementation based on the solvePnP method from the OpenCV4Android library3. The intrinsic matrix K is mapped to Normalized Device Coordinates (NDC) in the OpenGL ES rendering system for perspective projection, while the pose matrix [R|T] should be flipped along the X axis to locate virtual menus consistently in the OpenGL coordinate frame.
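As a concrete illustration of this step, the sketch below generates a marker with the qrcode package and recovers the camera pose from the four outer corners of a detected code using OpenCV's solvePnP (here via the Python bindings rather than OpenCV4Android). The "type;ip" payload format, marker size, and corner source are assumptions made for illustration.

```python
# Illustrative sketch of marker generation and pose estimation (Python/OpenCV
# bindings for brevity; payload format "type;ip" and marker size are assumed).

import cv2
import numpy as np
import qrcode

# Offline: encode the object type and IP address into the marker.
qrcode.make("printer;192.168.1.20").save("printer_marker.png")

# Online: recover the camera pose [R|T] from the four detected outer corners
# of the QR code (pixel coordinates) and the calibrated intrinsic matrix K.
MARKER_SIZE = 0.10  # marker edge length in metres (assumed)
OBJECT_POINTS = np.array([
    [0, 0, 0], [MARKER_SIZE, 0, 0],
    [MARKER_SIZE, MARKER_SIZE, 0], [0, MARKER_SIZE, 0],
], dtype=np.float32)                 # marker corners in the marker's own frame

def estimate_pose(image_corners, K, dist_coeffs):
    """image_corners: 4x2 array of the QR code's outer corners in pixels."""
    ok, rvec, tvec = cv2.solvePnP(OBJECT_POINTS,
                                  np.asarray(image_corners, dtype=np.float32),
                                  K, dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)       # 3x3 rotation matrix
    return R, tvec                   # together they form the pose [R|T]
```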

In Ubii, QR markers are added to the objects, which may not be consistent with our vision of designing a self-contained system. We view the QR markers as a temporary and minor violation of our premise, and believe that advances in ubiquitous sensors and smart devices will help systems understand the environment without additional annotation in the future.

4.2 Hand gesture recognition

The pinch gesture provides a simple and reliable way to detect when interactions start and end without additional gesture delimiters. It can be detected from the thumb and index fingers touching together to form a hole, as illustrated in Figure 11a. We basically follow Wilson's vision-based method [40] to detect one- and two-handed pinch gestures, with several improvements in our system.

The first step is to distinguish the hands from the background using well-studied segmentation and connected-components techniques from the computer vision community. In Wilson's method the background of the interaction context is constrained to a black computer keyboard; as the background is consistently darker than the foreground hands, the method chooses to separate the background from the image. In Ubii the background varies according to the user's view angle, so we instead extract the foreground hands from the image using hand skin color as the reference color. For the same reason as in Wilson's method, it is reasonable to take the largest pixel components as the hand regions, because the hands generally occupy the largest region from the user-centered perspective.

Figure 11: The pinch gesture detection. a) a typical pinch gesture; b) the outer contour of the hand is extracted; c) and d) red closed regions are recognized as pinch holes; e) and f) green closed regions are eliminated as non-pinch holes.

Contours of the hand region are then extracted. The largest one is the outer contour of the hand, as illustrated in Figure 11b; any inner contour indicates a hole in the hand region. Ambient illumination and other non-pinch gestures may also produce inner contours, so our method uses contour size to filter out wrong contours: any inner contour whose size falls outside given bounds is discarded. The bounds are determined by the acceptance and elimination curves illustrated in Figure 12. The elimination curve represents the accuracy of excluding non-pinch gestures that have inner contours, as a function of the lower threshold on the inner contour's size. The inner contours of non-pinch gestures are generally small, so they are expelled more accurately as the lower threshold grows. Conversely, holes in pinch gestures are normally large, so most pinch gestures pass the detection when the threshold is small, as shown by the acceptance curve, whose recognition accuracy declines as the threshold gets larger. We chose a lower bound of 0.6 and an upper bound of 0.8 to achieve over 85% accuracy for both pinch gesture detection and non-pinch gesture elimination.

The method works for both one- and two-handed pinch gesture detection, as shown in Figure 11c and Figure 11d. It is also robust in eliminating other non-pinch gestures: Figure 11e and Figure 11f show two gestures with holes in the hand regions, which are eliminated by our method because the holes are smaller than the given lower bound. In practice, the upper bound could be omitted, as the result set of a small upper bound always includes the result set of a large upper bound; however, the size of a pinch hole is constrained by hand anatomy, and a reasonable upper bound guarantees the exclusion of outliers.
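Putting the segmentation and double-threshold steps together, a vision-only pinch test can be sketched with OpenCV's contour hierarchy as follows. The HSV skin range and the way the size bounds are normalised are assumptions here, since the paper does not spell them out.

```python
# Illustrative pinch test (OpenCV 4.x Python bindings): segment skin-coloured
# pixels, take the largest component as the hand, and accept an inner contour
# (hole) whose size lies within the double-threshold bounds. The HSV range and
# the area normalisation are assumptions, not the paper's exact values.

import cv2
import numpy as np

SKIN_LOW, SKIN_HIGH = (0, 40, 60), (25, 180, 255)   # illustrative HSV skin range
HOLE_MIN_FRAC, HOLE_MAX_FRAC = 0.02, 0.25           # hole area / hand area (assumed)

def detect_pinch(frame_bgr: np.ndarray) -> bool:
    """Return True if a pinch hole is found inside the largest skin-coloured region."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(SKIN_LOW), np.array(SKIN_HIGH))
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    if not contours or hierarchy is None:
        return False
    hierarchy = hierarchy[0]
    # the hand is taken as the largest top-level (outer) contour
    outer = [i for i in range(len(contours)) if hierarchy[i][3] == -1]
    hand = max(outer, key=lambda i: cv2.contourArea(contours[i]))
    hand_area = max(cv2.contourArea(contours[hand]), 1.0)
    # any child contour of the hand is a candidate hole; apply the double threshold
    for i in range(len(contours)):
        if hierarchy[i][3] == hand:
            frac = cv2.contourArea(contours[i]) / hand_area
            if HOLE_MIN_FRAC <= frac <= HOLE_MAX_FRAC:
                return True
    return False
```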

The pinch gesture detection relies solely on the hole between the thumb and index fingers. The method fails if the hole's background is occluded by other curled fingers [40]. Nor does it work when the hole is invisible to the camera, such as when the thumb and index fingers are horizontally coplanar with it. We regard this as a reasonable compromise for freehand interaction without device intrusion.

Figure 12: The inner contour acceptance and elimination curves.

5. EVALUATION

We designed several experiments to evaluate the system from different aspects. Participants were invited to perform the experiments; their task completion times and failed trials were measured and compared, and they were asked to answer paper questionnaires afterwards. A technology acceptance model (TAM) [14] was adopted to evaluate user experience.

5.1 Participants

We invited 10 participants (6 males and 4 females) from other departments on our campus, with ages between 19 and 31. They are frequent computer users and are familiar with common computer operations including document copy, printing, and projector display. Four have some experience with gesture interaction, including Microsoft Kinect and Wii for computer games. Three of them have experience with or have heard about AR techniques.

5.2 Experiment design

We designed four tasks associated with the physical affordances selected in Section 3.1. The tasks were conducted in a closed workspace with dimensions of about 10 m × 5 m. The workspace is equipped with several interconnected computers, printers, and a projector with a display screen. Several USB devices were available for document copy in our experiments. The computers had the instant messaging software Skype4 and the sharing tool Dropbox5 installed for transmitting and sharing documents; we registered several accounts in advance to use the software. The computers, printers, and projector were running, so participants did not need to log in or switch on devices, and the computer connected with the projector was configured beforehand for projection display. Computer names, printer names, and document names and paths on the computers were listed on task cards for reference.

4 http://www.skype.com/
5 https://www.dropbox.com/

Figure 13: Comparison of task completion times of the four experiments using Ubii and traditional methods.

Ex1: Copying documents between computers. Participants were required to copy specified documents between assigned computers with USB flash copy, Skype, and Ubii. Using USB flash copy, participants had to copy documents between the computers and USB devices. They also needed to find the other Skype accounts on the target computers in order to transmit documents through Skype. In Ubii, participants needed to perform pick, drag, and drop operations with their hands to finish the document copy. A rotate operation may be required when the specified document items are folded in the menus.

Ex2: Printing documents. Participants were asked to print specified documents on assigned printers using the functions provided by computers and using Ubii. In the traditional way, participants had to select the "print" menu item and fill in the configuration panel, for example assigning the printer name. In Ubii, participants had to pick, drag, and then drop the specified document item into the capture zone of the assigned printer.

Ex3: Displaying documents on projector screens. Participants completed the task of displaying specified documents on the projector screen with traditional USB flash copy and, alternatively, with Ubii. In the traditional way, participants copied documents to the computer connected with the projector using USB devices, and then opened the document for projection display. Using the Ubii system, participants directly interacted with the projector screen rather than with the computer connected to the projector.

Ex4: Sharing documents. Participants had to share documents with other participants through Dropbox and Ubii. Using Dropbox, participants had to copy specified documents to Dropbox shared folders or share a Dropbox link, and then wait for the documents to be updated or received. In Ubii, participants used an architecture wall as a virtual shared space for uploading and downloading documents with drag and drop interaction.

Each participant was required to perform the four tasks, and each task was performed 4 times. The tasks were presented in random order among the participants. We obtained a total of 10 (participants) × 4 (tasks) × 4 (repetitions) = 160 trials, with each task performed 40 times overall. Each participant completed the whole study within 60 minutes.

5.3 Results & discussion

Table 2: The accumulated failed trials of each experiment.

      Failed trials   Failed ratio
Ex1         2              5%
Ex2         3              7.5%
Ex3         3              7.5%
Ex4         4              10%

Figure 13 compares participants' task completion times using the Ubii system (green bars) and traditional methods such as document copy through USB devices and Skype, and document sharing through Dropbox. The time cost of Ubii is much lower than that of the traditional methods in all four tasks. In particular, in Ex1 Ubii (M=7440ms, SD=360ms) is almost 4 to 7 times faster than transferring via Skype (maroon bar, M=33780ms, SD=3320ms) and USB flash copy (brown bar, M=53475ms, SD=3530ms), respectively. It (M=7690ms, SD=300ms) also achieves a high performance gain compared with traditional projection display using USB devices (brown bar, M=56370ms, SD=6510ms). USB flash copy requires plugging in, plugging out, and copying documents between USB devices and computers; Ubii simplifies these manipulations without direct physical touch. For document printing, Ubii (M=10250ms, SD=380ms) eliminates the operations of clicking the "print..." menu item and selecting the specified printer from the printer list required by the traditional method (blue bar, M=20250ms, SD=574ms). We found that some participants referred to the task cards when they selected a printer from the configuration panel. Ubii (M=8740ms, SD=310ms) is also more efficient in document sharing compared with Dropbox sharing (yellow bar, M=13510ms, SD=1130ms), as it does not require confirming the target user's Dropbox account, which is normally not the same as the target computer's account.

We also compared the completion times (green bars) of the different tasks using Ubii. The task completion time includes both operation time and document transfer time. The most time-consuming task is the printing operation, followed by document copy, projection display, and sharing. Since document printing requires the most transfer time, namely the transfer between computers and printers (M=6100ms) takes almost three times longer than that between computers (M=2100ms), its operation part is in fact the fastest (M=4200ms, SD=200ms) compared to document copy (M=4840ms, SD=310ms), projection display (M=5200ms, SD=330ms), and document sharing (M=6650ms, SD=300ms).

We measured the failed trials and failure ratios of each task, as listed in Table 2. All failed trials were induced by failures of Ubii. Document sharing has the highest failure ratio, at 10%, and document copy the lowest, at 5%. No participant encountered more than two failures in a single task. The failures mainly lie in incorrect detection of hand gestures, caused by fast hand movements and background color.

Figure 14: User experience evaluation of the four experiments scored with the Likert scale.

6 http://en.wikipedia.org/wiki/Likert_scale

In addition to profiling the system's runtime performance, we evaluated the user experience of Ubii using the TAM model, which includes the four criteria of perceived enjoyment (PE), perceived usefulness (PU), perceived ease of use (PEOU), and intention to use (IOU). We used the Likert scale6 to quantitatively measure the four criteria, with scores from 1 to 5: strongly disagree (1), disagree (2), neutral (3), agree (4), and strongly agree (5). As illustrated in Figure 14, the four terms in each task are positive in general. In Ex2 (document printing) and Ex3 (projection display) the first quartiles (Q1) pass over 3, and the medians (Q2) are not lower than 4. The bottom whiskers also end at 3 or above, with a few outliers at 3 and 5. In particular, the boxes of Ex3 are all located between 4 and 5, indicating strong agreement on all terms. A similar result can be found in Ex2, with the exception of weaker agreement on PE, partly due to the long transfer time between computers and printers. Boxes in Ex1 (document copy) and Ex4 (document sharing) are mainly located between 3 and 4, with bottom whiskers ending above 3, indicating that fewer than a quarter of participants reserved their opinions on user experience. Two highlights are easily observed in Ex1 and Ex4: the relatively low and short box of PEOU in Ex1, and the bottom whiskers of IOU and PE in Ex4 ending at 2, implying less preference for document copy and sharing. Participants' opinions on Ex4 are much more diverse, ranging from strong agreement (5) to disagreement (2). From answers to the question "any comments?" on our questionnaires, we discovered that a few participants were used to traditional methods such as IM tools (e.g., Skype) and cloud storage (e.g., Dropbox) for document copy and sharing, and they thought the traditional methods were convenient enough for them.

Overall, the results show promising reductions in operation time compared with traditional methods. Most participants were satisfied with the system, despite occasional failures in their operations. Several said that operating directly on physical devices lets them avoid remembering device names. For instance, people normally have to know the IP address or device name to select the exact printer they need from the printer list; with our system they are able to assign a print job to any printer by directly dragging and dropping the document onto it, without knowing its name or IP address. The projection display task is popular among participants, as indicated in Figure 14, which reflects the fact that most people only care about getting the document displayed on the projection screen rather than where the document is located. Our system helps them simplify the procedure.

6. CONCLUSION

Ubii provides a decentralized interface system for manipulation between the physical and digital worlds. The user interface and interaction are carefully designed to afford a freehand user experience. The system has been evaluated from both the technical and the user experience perspectives. By interacting with objects at the high level of physical affordance rather than with the underlying digital counterparts, the system helps users perform operations in a much more convenient and natural way. It has proved useful for simple and frequently-performed manipulations with connected digital devices in closed environments.

In our experiments, we found that most failures were caused by hand gesture detection. The color-based method was chosen to reduce computational cost, but it is not accurate enough against backgrounds whose color is similar to hand skin. In the future, we plan to combine depth and visual cameras to improve the accuracy. Another problem is the limited computational capability and power capacity of Google Glass. We will try to leverage offloading strategies to outsource computationally intensive tasks to companion mobile devices, an approach that has been shown to improve both real-time performance and runtime sustainability.

7. ACKNOWLEDGMENTS

This work has been partially supported by the European Commission under the Horizon 2020 Program through the RAPID (H2020-ICT-644312) project.

8. REFERENCES

[1] R. Ballagas, J. Borchers, M. Rohs, and J. G. Sheridan. The smart phone: a ubiquitous input device. IEEE Pervasive Computing, 5(1):70–77, 2006.
[2] R. Ballagas, M. Rohs, and J. G. Sheridan. Sweep and point and shoot: Phonecam-based interactions for large public displays. In Proceedings of CHI Extended Abstracts, pages 1200–1203, 2005.
[3] D. Baur, S. Boring, and S. Feiner. Virtual projection: Exploring optical projection as a metaphor for multi-device interaction. In Proceedings of CHI, pages 1693–1702, 2012.
[4] H. Benko. Beyond flat surface computing: challenges of depth-aware and curved interfaces. In Proceedings of ACM Multimedia, pages 935–944, 2009.
[5] S. Boring, M. Altendorfer, G. Broll, O. Hilliges, and A. Butz. Shoot and copy: Phonecam-based information transfer from public displays onto mobile phones. In Proceedings of Mobility, pages 24–31, 2007.
[6] S. Boring, D. Baur, A. Butz, S. Gustafson, and P. Baudisch. Touch projector: mobile interaction through video. In Proceedings of CHI, pages 2287–2296. ACM New York, USA, 2010.
[7] S. Boring, M. Jurmu, and A. Butz. Scroll, tilt or move it: Using mobile phones to continuously control pointers on large public displays. In Proceedings of OzCHI, pages 161–168, 2009.
[8] F. Brudy. Interactive menus in augmented reality environments. Technical report, Media Informatics at the University of Munich, 2013.
[9] J. Callahan, D. Hopkins, M. Weiser, and B. Shneiderman. An empirical comparison of pie vs. linear menus. In Proceedings of CHI, pages 95–100, 1988.
[10] T. H. Chang and Y. Li. Deep shot: A framework for migrating tasks across devices using mobile phone cameras. In Proceedings of CHI, pages 2163–2172, 2011.
[11] R. Dachselt and A. Hubner. Virtual environments: Three-dimensional menus: A survey and taxonomy. Computers and Graphics, 31(1):53–65, 2007.
[12] F. Guimbretiere and C. Nguyen. Bimanual marking menu for near surface interactions. In Proceedings of CHI, pages 825–828, 2012.
[13] S. G. Gustafson, B. Rabe, and P. M. Baudisch. Understanding palm-based imaginary interfaces: the role of visual and tactile cues when browsing. In Proceedings of CHI, pages 889–898. ACM New York, USA, 2013.
[14] A. C. Haugstvedt and J. Krogstie. Mobile augmented reality for cultural heritage: A technology acceptance study. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 247–255, 2012.
[15] S. J. Henderson and S. Feiner. Opportunistic controls: leveraging natural affordances as tangible user interfaces for augmented reality. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pages 211–218. ACM New York, USA, 2008.
[16] S. J. Henderson and S. K. Feiner. Augmented reality in the psychomotor phase of a procedural task. In Proceedings of ISMAR, pages 191–200, 2011.
[17] V. Heun, S. Kasahara, and P. Maes. Smarter objects: using AR technology to program physical objects and their interactions. In Proceedings of CHI EA, pages 2817–2818, 2013.
[18] Z. Huang, P. Hui, C. Peylo, and D. Chatzopoulos. Mobile augmented reality survey: A bottom-up approach. Technical report, HKUST Technical Report-20130903-182109-4052 (cite as arXiv:1309.4413), 2013.
[19] W. Hurst and J. Dekker. Tracking-based interaction for object creation in mobile augmented reality. In Proceedings of ACM Multimedia, pages 93–101, 2013.
[20] H. Ishii and B. Ullmer. Tangible bits: towards seamless interfaces between people, bits and atoms. In Proceedings of ACM SIGCHI, pages 234–241. ACM Press, 1997.
[21] B. Johanson, G. Hutchins, T. Winograd, and M. Stone. PointRight: experience with flexible input redirection in interactive workspaces. In Proceedings of UIST, pages 227–234, 2002.
[22] A. Kanezaki, Y. Kuniyoshi, and T. Harada. Weakly-supervised multi-class object detection using multi-type 3D features. In Proceedings of ACM Multimedia, pages 605–608, 2013.
[23] J. Liang and M. Green. A highly interactive 3D modeling system. Computers & Graphics, 18(4):499–506, 1994.
[24] A. Marzo and O. Ardaiz. CollArt: a tool for creating 3D photo collages using mobile augmented reality. In Proceedings of ACM Multimedia, pages 585–588, 2013.
[25] B. Myers, R. Bhatnagar, J. Nichols, C. H. Peck, D. Miller, and A. C. Long. Interacting at a distance: measuring the performance of laser pointers and other devices. In Proceedings of CHI, pages 33–40, 2002.
[26] M. A. Nacenta, C. Gutwin, D. Aliakseyeu, and S. Subramanian. There and back again: Cross-display object movement in multi-display environments. Journal of Human-Computer Interaction, 24(1):170–229, 2009.
[27] M. A. Nacenta, S. Sallam, B. Champoux, S. Subramanian, and C. Gutwin. Perspective cursor: perspective-based interaction for multi-display environments. In Proceedings of CHI, pages 289–298, 2006.
[28] U. Neumann and S. Y. You. Natural feature tracking for augmented reality. IEEE Transactions on Multimedia, 1(1):53–64, 1999.
[29] T. Ni, D. Bowman, and C. North. AirStroke: bringing unistroke text entry to freehand gesture interfaces. In Proceedings of CHI, pages 2473–2476, 2011.
[30] T. Ni, D. A. Bowman, C. North, and R. P. McMahan. Design and evaluation of freehand menu selection interfaces using tilt and pinch gestures. International Journal of Human-Computer Studies, 69(9):551–562, 2011.
[31] I. Poupyrev, D. S. Tan, M. Billinghurst, H. Kato, H. Regenbrecht, and N. Tetsutani. Developing a generic augmented-reality interface. IEEE Computer, 35(3):44–50, 2002.
[32] M. Rahman, S. Gustafson, P. Irani, and S. Subramanian. Tilt techniques: investigating the dexterity of wrist-based input. In Proceedings of CHI, pages 1943–1952, 2009.
[33] G. Ren and E. O'Neill. 3D selection with freehand gesture. Computers & Graphics, 37(3):101–120, 2013.
[34] D. Schmalstieg and D. Wagner. Experiences with handheld augmented reality. In Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 3–18, 2007.
[35] P. Song, W. B. Goh, W. Hutama, C.-W. Fu, and X. Liu. A handle bar metaphor for virtual object manipulation with mid-air interaction. In Proceedings of CHI, pages 1297–1306, 2012.
[36] B. Ullmer and H. Ishii. Emerging frameworks for tangible user interfaces. IBM Systems Journal, 39(3):579–601, 2000.
[37] Y. Wang and C. MacKenzie. The role of contextual haptic and visual constraints on object manipulation in virtual environments. In Proceedings of CHI, pages 532–539, 2000.
[38] M. Weiser. The computer for the 21st century. Scientific American, 265(3):94–104, 1991.
[39] A. Wilson and S. Shafer. XWand: UI for intelligent spaces. In Proceedings of CHI, pages 545–552. ACM New York, USA, 2003.
[40] A. D. Wilson. Robust computer vision-based detection of pinching for one and two-handed gesture input. In Proceedings of UIST, pages 255–258, 2006.
[41] Z. Szalavari and M. Gervautz. The personal interaction panel: a two-handed interface for augmented reality. Computer Graphics Forum, 16(3):335–346, 1997.
[42] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.