Mixed Reality in Virtual World Teleconferencing
Tuomas Kantonen (1), Charles Woodward (1), Neil Katz (2)
(1) VTT Technical Research Centre of Finland, (2) IBM Corporation
ABSTRACT
In this paper we present a Mixed Reality (MR) teleconferencing application based on Second Life (SL) and the OpenSim virtual world. Augmented Reality (AR) techniques are used for displaying virtual avatars of remote meeting participants in real physical spaces, while Augmented Virtuality (AV), in the form of video-based gesture detection, enables capturing of human expressions to control avatars and to manipulate virtual objects in virtual worlds. The use of Second Life for creating a shared augmented space to represent different physical locations allows us to incorporate the application into existing infrastructure. The application is implemented using the open source Second Life viewer and the ARToolKit and OpenCV libraries.
KEYWORDS: mixed reality, virtual worlds, Second Life, teleconferencing, immersive virtual environments, collaborative augmented reality.
INDEX TERMS: H.4.3 [Information Systems Applications]: Communications Applications – computer conferencing, teleconferencing, and video conferencing; H.5.1 [Information Systems]: Multimedia Information Systems – artificial, augmented, and virtual realities.
1 INTRODUCTION
The need for effective teleconferencing systems is increasing, mainly for economic and environmental reasons, as transporting people to face-to-face meetings consumes a lot of time, money and energy. Massively multiuser virtual 3D worlds have lately gained popularity as teleconferencing environments. This interest is not only academic: one of the largest virtual conferences was held by IBM in late 2008, with over 200 participants. The conference, hosted in a private installation of the Second Life virtual world, was a great success, saving an estimated $320,000 compared to the expense of holding the conference in the physical world [1].
In this paper, we present a system for mixed reality teleconferencing where a mirror world of a conference room is created in Second Life and the virtual world is displayed in the real-life conference room using augmented reality techniques. The real people's gestures are reflected back to Second Life. The participants are also able to interact with shared virtual objects on the conference table. A synthetic illustration of such a setting is shown in Figure 1.
The structure of the paper is as follows. Section 2 describes the background and motivation for our work. Section 3 reviews previous work related to the subject. Section 4 explains the relevant Second Life technical details. Section 5 gives an overview of the system we are developing. Section 6 describes our prototype implementation. Section 7 provides a discussion of results, as well as items for future work. Conclusions are drawn in Section 8.
2 BACKGROUND
There are several existing teleconference systems, ranging from the old but still widely used audio teleconferencing and video teleconferencing to web-based conferencing applications. 2D groupware and even massively multiuser 3D virtual worlds have also been used for teleconferencing.
Each of these existing systems has its pros and cons. Conference calls are quick and easy to set up with no hardware other than a mobile phone, yet they are limited to audio only and require a separate channel, e.g. for document sharing. Videoconferencing adds a new modality, as pictures of participants are transferred, but it requires more hardware and bandwidth, becoming quite expensive at the high end. Web conferencing is lightweight and readily supports document and application sharing, but it lacks natural interaction between users.
We see several advantages of using a 3D virtual environment, such as Second Life or OpenSim among many other platforms, as an alternative means for real-time teleconferencing and collaboration. First, the users are able to see all meeting participants and get a sense of presence not possible in a traditional conference call. Second, the integrated voice capability of 3D virtual worlds provides spatial and stereo audio. Third, the 3D environment itself provides a visually appealing shared meeting environment that is simply not possible with other means of teleconferencing. However, the lack of natural gestures constitutes a major drawback for real interaction between the participants.
Figure 1. Illustration of a Mixed Reality teleconference: a Second Life avatar among real people wearing ultra-lightweight data glasses, sharing a virtual object on the table, inside a virtual room, displayed in a CAVE.
3 RELATED WORK
In our work, virtual reality and augmented reality are combined in a similar manner as in the original work by Piekarski et al. [2]. Their work was quite limited in the amount of augmented virtuality, as only the position and orientation of users were transferred into the virtual environment. Our work focuses on interaction between augmented reality and a virtual environment. Therefore our work is closely related to immersive telepresence environments such as [3, 4]. Several different immersive 3D video conferencing systems are described in [5].
Local collaboration in augmented reality has been studied, for example, in [6, 7]. Collaboration is achieved by presenting co-located users the same virtual scene from their respective viewpoints and providing the users simple collaboration tools such as virtual pointers. Remote AR collaboration has mostly been limited to augmenting live video, as in [8], or later to augmenting a 3D model reconstructed from multiple video cameras, as in [9]. Remote sharing of augmented virtual objects and applications has been studied, for example, in [10].
Our work uses Second Life and OpenSim, the open source implementation of the Second Life server, which are multiuser virtual worlds, as the virtual environment for presenting shared virtual objects. Using Second Life in AR has been previously studied by Lang et al. [11] as well as Stadon [12], although their work does not include augmented virtuality.
In the simplest case, augmented virtuality can be achieved by displaying real video inside a virtual environment, as in [13]. This approach has also been used for virtual videoconferencing in [14] and for augmenting avatar heads in [15]. Another form of augmented virtuality is avatar puppeteering, where human body gestures are recognized and used to control the avatar, either only the avatar's face as in [16] or the whole avatar body as in [17]. However, little previous work has been presented on augmenting Second Life avatars with real life gestures. The main exception is the VRWear system [18] for controlling an avatar's facial expressions.
4 SECOND LIFE VIRTUAL WORLD
Second Life is a free, massively multiuser, online, game-like 3D virtual world for social interaction. It is based on community-created content and it even has a thriving economy. The virtual world's users, called residents, are represented by customizable avatars and can take part in different activities provided by other residents.
For interaction, Second Life features spatial voice chat, text chat and avatar animations. Only the left hand of the avatar can be freely animated on the fly, while all other animations rely on pre-recorded skeletal animations that the user can create and upload to the SL server.
For non-expert SL users, however, meetings in SL can be quite static, with the 'who is currently speaking' indicator being the only active element. From our experience, actively animating the avatar while talking takes considerable training and directs the user's focus away from the discussion.
Second Life has a client-server architecture and each server is scalable to tens of thousands of concurrent users. The server is proprietary to Linden Lab, but there also exists the community-developed, SL-compatible server OpenSimulator [19].
5 SYSTEM OVERVIEW
In this project we developed a prototype and proof of concept of a video conference meeting taking place between Second Life and the real world. Our system combines an immersive virtual environment, collaborative augmented reality and human gesture recognition in a way that supports collaboration between real and virtual worlds. We call the system Augmented Collaboration in Mixed Environments (ACME).
In the ACME system, some participants of the meeting occupy a space in Second Life while others are located around a table in the real world. The physical meeting table is replicated in Second Life to support virtual object interactions as well as avatar occlusions. The people in the real world see the avatars augmented around a real-world table, displayed by video see-through glasses, immersive stereoscopic walls or within a video teleconference screen. Participants in Second Life see the real-world people as avatars around the meeting table, augmented with hand and body gestures. Both the avatars and real people can interact with virtual objects shared between them, on the virtual and physical conference tables respectively.
The main components of the system are: co-located users wearing video see-through HMDs, a laptop for each user running the modified SL client, a ceiling-mounted camera above each user for hand tracking, and remote users using the normal SL client.
The system is designed for restricted conference room environments where meeting participants are seated around a well-lit, uniformly colored table. As an alternative to HMDs, a CAVE-style stereo display environment or plain video screens can be used.
Figure 2 shows how the ACME system is experienced in a meeting between two participants, one attending the meeting in Second Life and the other one in real life. It should be noted that the system is designed for multiple simultaneous remote and co-located users. A video of the ACME system is available at [20].
6 IMPLEMENTATION
6.1 General
The ACME system is implemented by modifying the open source Second Life viewer [21]. The viewer is kept backward compatible with the original Second Life so that, even though more advanced features might require server side changes, all major ACME features are also available when the user is logged in to the original Second Life world.
The SL client was run on Dell Precision M6400 laptops (Intel Mobile Core 2 Duo 2.66 GHz, 4 GB DDR3 533 MHz). Logitech QuickCam Pro for Notebooks USB cameras (640x480 RGB, 30 FPS) were used for video see-through functionality, while a Unibrain Fire-i FireWire camera (640x480 YUV, 7.5 FPS) was used for hand tracking. eMagin Z800 (800x600, 40° diagonal FOV) and MyVu Crystal 701 (640x480, 22.5° diagonal FOV) HMDs were used as video see-through displays.
Usability studies of the system are currently limited to the project's internal testing of individual components. The author has evaluated the technical feasibility of each feature, and comments have been collected during multiple public demonstrations, including a demo at ISMAR 2009. We have been able to identify key points where the application has the potential to overcome limitations of current systems, and also points where improvements need to be made to create a truly usable system. A proper user study will be conducted during 2010 with HIT Lab NZ, comparing the ACME system with other means of telecommunication. Detailed plans of the study have not yet been made.
6.2 Augmenting reality
To be able to use SL for video see-through AR, three steps are required: video capture, camera pose estimation, and rendering of correctly registered virtual objects.
Currently the ACME system supports two different video sources: either ARToolKit [22] video capture routines for USB devices, or the CMU [23] FireWire camera driver API. ARToolKit OpenGL subroutines are used for video rendering.
The HMD camera pose is estimated by ARToolKit marker tracking subroutines. Multiple markers are placed around the walls of the conference room and on the table so that at least one marker is always seen by the user wearing an HMD. We experimented with 20 cm by 20 cm and 50 cm by 50 cm markers at distances of 1 to 3 meters from the user. The distance between markers was about three times the width of the marker.
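For illustration, the following is a minimal sketch of this per-frame pose estimation step using the classic ARToolKit 2.x C API. The marker table, pattern IDs, sizes and binarization threshold are our own placeholder assumptions, not the exact ACME configuration:

```cpp
// Sketch of per-frame camera pose estimation with the ARToolKit C API.
#include <AR/ar.h>

// One entry per marker placed on the walls and the table (IDs assumed).
struct RoomMarker { int pattId; double widthMm; double center[2]; };

bool estimateCameraPose(ARUint8* frame, RoomMarker* markers, int nMarkers,
                        double markerToCam[3][4])
{
    ARMarkerInfo* info;
    int           num;
    const int     threshold = 100;          // binarization threshold (assumed)

    if (arDetectMarker(frame, threshold, &info, &num) < 0) return false;

    // Pick the highest-confidence detection matching a known marker.
    int bestDet = -1, bestMk = -1;
    for (int i = 0; i < num; ++i)
        for (int m = 0; m < nMarkers; ++m)
            if (info[i].id == markers[m].pattId &&
                (bestDet < 0 || info[i].cf > info[bestDet].cf)) {
                bestDet = i;
                bestMk  = m;
            }
    if (bestDet < 0) return false;          // no known marker in view

    // 3x4 transform from the marker's frame to the camera frame;
    // chaining it with the marker's known room pose yields the HMD pose.
    arGetTransMat(&info[bestDet], markers[bestMk].center,
                  markers[bestMk].widthMm, markerToCam);
    return true;
}
```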
The real-world coordinate system is defined by a marker that lies on the conference table. Registration with SL coordinates is done by fixing one SL object to the real-world origin and using the object's coordinate axes as unit vectors. This anchor object is selected in the ACME configuration file. If the marker is not on the table, the anchor object must be transformed accordingly.
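As a hedged formalization of this registration (notation ours, not from the original implementation): if R_a and t_a denote the anchor object's rotation and position in SL coordinates, then a point p_w expressed in the marker-defined real-world frame maps to SL coordinates as p_SL = R_a p_w + t_a, and the camera pose estimated from the markers is transformed the same way before rendering.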
Occlusion is the ability of a physical object to cover those parts of virtual objects that are physically behind it. In the ACME system, occlusion is implemented by modeling the physical space in the virtual world and using the virtual model as a mask when rendering virtual objects. The virtual model itself is not visible in the augmented image, as otherwise it would cover the very physical objects we want to see. A similar method was used in [24].
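This masking is the standard "phantom object" rendering technique; the sketch below shows one way to express it in fixed-function OpenGL. The scene-drawing functions are placeholders for the corresponding viewer code, not the actual ACME implementation:

```cpp
// Sketch of phantom-object occlusion: the virtual room replica writes
// depth only, so virtual content behind real surfaces is clipped while
// the video image of those surfaces stays visible.
#include <GL/gl.h>

void drawVideoBackground();   // camera image as background (placeholder)
void drawOcclusionModel();    // virtual replica of the physical room
void drawVirtualObjects();    // avatars and shared virtual objects

void renderAugmentedFrame()
{
    drawVideoBackground();

    glEnable(GL_DEPTH_TEST);

    // Pass 1: phantom geometry writes depth but no color, leaving the
    // video visible where the real room is.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    drawOcclusionModel();

    // Pass 2: virtual objects are depth-tested against the phantom
    // geometry, so parts behind real objects are correctly hidden.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    drawVirtualObjects();
}
```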
The ACME system does not place any restrictions on what kind of virtual objects can be augmented. Any virtual object can also be used as an occlusion model. However, proper augmentation of transparent objects has not yet been implemented.
6.3 Hand tracking
For hand tracking, a camera is set up over the conference room table. The camera is oriented downwards so that the whole table is visible in the camera image. The current implementation supports only one hand tracking camera.
Hand tracking video capture and processing are done in a separate thread from rendering, so that a lower video frame rate can be used without affecting rendering of the augmented video.
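A minimal sketch of this decoupling, using C++11 threading primitives purely for illustration (the viewer has its own threading infrastructure, and all names here are ours):

```cpp
// Worker thread runs tracking at camera rate; the render thread polls
// the most recent result without blocking on capture.
#include <atomic>
#include <mutex>
#include <thread>

struct HandSample { float x, y; bool thumbVisible; };

class HandTracker {
public:
    void start() { worker_ = std::thread(&HandTracker::loop, this); }
    void stop()  { running_ = false; worker_.join(); }

    // Called from the render thread at full frame rate.
    HandSample latest() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return latest_;
    }

private:
    // Runs at the tracking camera's rate (7.5 FPS in our setup).
    void loop() {
        while (running_) {
            HandSample s = captureAndSegment();  // grab frame + HSV analysis
            std::lock_guard<std::mutex> lock(mutex_);
            latest_ = s;
        }
    }
    HandSample captureAndSegment() { return HandSample{}; } // stub

    mutable std::mutex mutex_;
    std::atomic<bool>  running_{true};
    std::thread        worker_;
    HandSample         latest_{};
};
```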
Hands are recognized from the video image by HSV (hue, saturation and value) segmentation. The HSV color space has been shown to perform well for skin detection [25]. Each HSV channel is thresholded, and the results are combined into a single binary mask. A calibration utility was created for calibrating the threshold limits to take different lighting conditions into account.
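For illustration, this segmentation step could look roughly as follows with the modern OpenCV C++ API (the 2009 implementation used the older C API, and the threshold values shown are placeholders that the calibration utility would supply):

```cpp
// Sketch of HSV skin segmentation: threshold the channels, combine
// into one binary mask, then clean it up morphologically.
#include <opencv2/imgproc.hpp>

cv::Mat segmentSkin(const cv::Mat& bgrFrame)
{
    cv::Mat hsv, mask;
    cv::cvtColor(bgrFrame, hsv, cv::COLOR_BGR2HSV);

    // Per-channel (H, S, V) limits; values come from calibration.
    cv::inRange(hsv, cv::Scalar(0, 40, 60), cv::Scalar(25, 180, 255), mask);

    // Remove speckle noise before locating the hand blob.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE,
                                               cv::Size(5, 5));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,  kernel);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, kernel);
    return mask;    // non-zero pixels = candidate skin
}
```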
The current implementation uses only a single camera for hand tracking, and therefore proper 3D hand tracking has not yet been implemented. The user's hand is always assumed to hover 15 cm above the table, so that the user can perform simple interactions with virtual objects on the table.
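With a single camera, the segmented hand position only defines a viewing ray; under the fixed-height assumption, the 3D hand position is the intersection of that ray with a plane 15 cm above the table. A sketch of this computation, with coordinate conventions and names assumed by us:

```cpp
// Intersect the viewing ray through the hand pixel with the plane
// z = 0.15 m (table frame, z up, z = 0 on the table surface).
struct Vec3 { float x, y, z; };

bool handOnPlane(const Vec3& rayOrigin, const Vec3& rayDir, Vec3& hand)
{
    const float planeZ = 0.15f;             // assumed hand height, meters
    if (rayDir.z == 0.0f) return false;     // ray parallel to the plane
    const float t = (planeZ - rayOrigin.z) / rayDir.z;
    if (t < 0.0f) return false;             // intersection behind camera
    hand.x = rayOrigin.x + t * rayDir.x;
    hand.y = rayOrigin.y + t * rayDir.y;
    hand.z = planeZ;
    return true;
}
```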
6.4 Gesture interaction
Interaction in the ACME system is divided into two categories: interacting with other avatars and interacting with virtual objects.
Avatar interaction is more relaxed, as the intent of body language is conveyed even when avatar movements do not precisely match the user's motion. Object interaction requires finer control, as objects can be small and in many cases the precise relative position of objects is important.
The orientation of the user's face is a strong cue to where the user is currently focusing. When the user is wearing a video see-through HMD, we use the orientation of the camera, already computed for augmented reality visualization, to rotate the avatar's head accordingly.
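A minimal sketch of deriving head yaw and pitch from the 3x4 camera pose already estimated for AR rendering. The conventions are our assumptions: R is the camera-to-world rotation, the world is z-up, the camera looks along its local +z axis, and roll is discarded:

```cpp
// Extract heading (yaw) and elevation (pitch) of the camera's forward
// axis; these drive the avatar's head rotation.
#include <cmath>

void headAngles(const double camToWorld[3][4], double& yaw, double& pitch)
{
    // Camera forward axis in world coordinates = third column of R.
    const double fx = camToWorld[0][2];
    const double fy = camToWorld[1][2];
    const double fz = camToWorld[2][2];
    yaw   = std::atan2(fy, fx);                           // about the up axis
    pitch = std::atan2(fz, std::sqrt(fx * fx + fy * fy)); // elevation
}
```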
User hands are tracked by the hand tracking camera as explained in Section 6.3. This hand position information is used to move the avatar's hand towards the same position. The Second Life viewer has simple built-in inverse kinematics (IK) logic to control the shoulder and elbow joints so that the palm of the avatar is placed at approximately the correct position. As the current implementation limits the hand to a plane over the table, interaction is restricted to simple pointing gestures. Other animations, for example waving goodbye, can still be used by manually triggering animations from the SL client.
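For intuition, two-joint IK of this kind can be solved analytically in the arm's plane; the sketch below is the generic textbook solution, not the viewer's actual IK code:

```cpp
// Analytic two-link IK: L1/L2 are upper-arm and forearm lengths,
// (x, y) the target palm position relative to the shoulder; outputs
// are shoulder and elbow angles in radians.
#include <cmath>

bool solveArmIK(float L1, float L2, float x, float y,
                float& shoulder, float& elbow)
{
    const float d2 = x * x + y * y;
    const float d  = std::sqrt(d2);
    if (d > L1 + L2 || d < std::fabs(L1 - L2)) return false; // unreachable

    // Elbow from the law of cosines; shoulder is the aim angle minus
    // the interior angle of the two-link triangle.
    const float cosElbow = (d2 - L1 * L1 - L2 * L2) / (2.0f * L1 * L2);
    elbow    = std::acos(cosElbow);
    shoulder = std::atan2(y, x)
             - std::atan2(L2 * std::sin(elbow), L1 + L2 * std::cos(elbow));
    return true;
}
```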
6.5 Object interaction
For easy interaction with objects, direct and accurate visual feedback is needed. This is achieved by moving a feedback object with the user's hand. Any SL object can be used as the feedback object by attaching the object to the avatar's hand. This feedback object is moved only locally, to avoid any network latency.
Currently we provide three object interaction techniques: pointing, grabbing and dragging. Interaction is controlled by two different gestures: thumb visible and thumb hidden. Gestures are interpreted from the point of view of the hand tracking camera; the hand must therefore be kept in a suitable pose.
If the user moves her hand inside an object, the object is highlighted by rendering a white silhouette around it. If there is a gesture transition from thumb visible to thumb hidden while an object is highlighted, the object is grabbed. The grabbed object is highlighted with a yellow silhouette. By moving the hand while an object is grabbed, the object can be dragged, that is, the object moves with the hand. Releasing the grabbed object is done with a gesture transition from thumb hidden to thumb visible (see Figure 3).
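This highlight/grab/drag/release behavior amounts to a small state machine driven by the per-frame hand sample. A sketch under our own naming assumptions (the hit-test helper is a placeholder):

```cpp
// Per-frame update of the pointing/grab/drag interaction state.
struct Vec3 { float x, y, z; };
struct VirtualObject { void moveTo(const Vec3&) { /* local move */ } };
VirtualObject* objectUnderHand(const Vec3& pos);   // hit test, placeholder

enum class GrabState { Idle, Hover, Grabbed };

void updateInteraction(const Vec3& handPos, bool thumbVisible,
                       bool prevThumbVisible, GrabState& state,
                       VirtualObject*& target)
{
    const bool grabEdge    = prevThumbVisible && !thumbVisible;
    const bool releaseEdge = !prevThumbVisible && thumbVisible;

    if (state == GrabState::Grabbed) {
        target->moveTo(handPos);          // dragging: object follows hand
        if (releaseEdge) {                // thumb reappears -> release
            state  = GrabState::Idle;
            target = nullptr;
        }
        return;
    }

    target = objectUnderHand(handPos);    // white silhouette if non-null
    if (target && grabEdge)
        state = GrabState::Grabbed;       // yellow silhouette while held
    else
        state = target ? GrabState::Hover : GrabState::Idle;
}
```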
7 RESULTS
The current implementation of the ACME system is still quite limited. Even when using multiple large markers there can be registration errors of tens of pixels, creating annoying visual effects, particularly at occlusion boundaries. Augmented objects also jerk noticeably when markers become visible or disappear from the view. Better vision-based tracking techniques, or fusion with inertial sensors, are clearly required for the system to be usable.
Visualizing virtual avatars with a head-mounted video see-through display is limited by current HMD technology. Affordable HMDs do not provide a wide enough field of view to be really usable in multi-user conferencing. On the other hand, when the augmentation is done into a video teleconferencing image, the user is able to follow virtual participants as easily as other video conference participants.
Hand tracking with non-adaptive HSV segmentation is extremely sensitive to lighting and skin color changes. Careful calibration is needed for each user, and recalibration needs to be done whenever the room lighting changes.
The current hand gesture recognition is prone to errors and lacks haptic feedback. This makes the interaction feel very unnatural and requires very fine control from the user. The current limitation of the hand motion to a 2D plane also makes any sensible interaction rather difficult.

Figure 2. User views of ACME: Second Life view (screenshot, left), real life view (augmented video, right).
It should be noted that most of these shortcomings can be fixed by applying existing, more advanced algorithms. The only major issue without a direct solution is the low quality of currently available affordable HMDs.
8 CONCLUSIONS
In this paper, we have presented a system called ACME for teleconferencing between virtual and physical worlds, including two-way interaction with shared virtual objects, using augmented reality and gesture detection in combination with the Second Life viewer and the ARToolKit and OpenCV libraries. Currently the ACME system provides augmentation of avatars and virtual objects based on marker tracking, visualization including occlusions, and, for interaction, head tracking, 2D hand tracking from a monocular camera, and grab-and-hold gesture-based interaction with virtual objects. Items for future work include enhanced AR visualization with markerless tracking, more elaborate hand gesture interactions and body language recognition, controlling avatar facial expressions, as well as various user interface issues.
Overall, we believe this early work with the ACME system has demonstrated the feasibility of using a mixed reality environment as a means to enhance a collaborative teleconference. Certainly, the ACME system is not a replacement for a face-to-face meeting, but it should simplify and even enhance the 3D meeting experience to the point where mixed world teleconference meetings could be a low-cost yet effective alternative for many business meetings. Our aim is to employ the ACME system within the next few months in our internal project meetings between overseas partners, which we have so far held in the purely virtual Second Life environment.
ACKNOWLEDGMENTS
The system has been developed in the project "MRConference", started in October 2008, with VTT as the main developer, IBM and Nokia Research Center as partner companies, and main funding provided by Tekes (the Finnish Funding Agency for Technology and Innovation). Various people in the project team helped us with their ideas and discussions; special thanks go to Suzy Deffeyes at IBM and Martin Schrader at Nokia Research Center.
REFERENCES
[1] "How Meeting In Second Life Transformed IBM's Technology Elite Into Virtual World Believers", http://secondlifegrid.net/casestudies/IBM.
[2] W. Piekarski, B. Gunther, B. Thomas (1999), "Integrating virtual and augmented realities in an outdoor application", Proc. IWAR 1999, pp. 45-49.
[3] P. Kauff and O. Schreer (2002), "An immersive 3D video-conferencing system using shared virtual team user environments", Proc. CVE '02, pp. 338-354.
[4] M. Gross et al. (2003), "blue-c: a spatially immersive display and 3D video portal for telepresence", ACM Transactions on Graphics 22(3), Jul 2003, pp. 819-827.
[5] P. Eisert (2003), "Immersive 3D video conferencing: challenges, concepts, and implementations", Proc. VCIP 2003, pp. 69-79.
[6] D. Schmalstieg, A. Fuhrmann, G. Hesina, Z. Szalavári, L. Encarnação, M. Gervautz, W. Purgathofer (2002), "The Studierstube Augmented Reality Project", Presence: Teleoperators and Virtual Environments, Feb 2002, pp. 33-54.
[7] M. Billinghurst, I. Poupyrev, H. Kato, R. May (2000), "Mixing realities in shared space: an augmented reality interface for collaborative computing", Proc. ICME 2000.
[8] M. Billinghurst and H. Kato (1999), "Real world teleconferencing", Proc. CHI '99, pp. 194-195.
[9] S. Prince et al. (2002), "Real-time 3D interaction for augmented and virtual reality", ACM SIGGRAPH 2002 Conference Abstracts and Applications, p. 238.
[10] D. Schmalstieg, G. Reitmayr, G. Hesina (2003), "Distributed applications for collaborative three-dimensional workspaces", Presence: Teleoperators and Virtual Environments 12(1), Feb 2003, pp. 52-67.
[11] T. Lang, B. MacIntyre, I. J. Zugaza (2008), "Massively Multiplayer Online Worlds as a Platform for Augmented Reality Experiences", Proc. IEEE VR '08, pp. 67-70.
[12] J. Stadon (2009), "Project SLARiPS: An investigation of mediated mixed reality", Arts, Media and Humanities Proc. of the 8th IEEE ISMAR, pp. 43-47.
[13] K. Simsarian, K. P. Åkesson (1997), "Windows on the World: An example of Augmented Virtuality", Proc. Interfaces 97: Man-Machine Interaction.
[14] H. Regenbrecht, C. Ott, M. Wagner, T. Lum, P. Kohler, W. Wilke, E. Mueller (2003), "An Augmented Virtuality Approach to 3D Videoconferencing", Proc. of the 2nd IEEE and ACM ISMAR, 2003.
[15] P. Quax, T. Jehaes, P. Jorissen, W. Lamotte (2003), "A Multi-User Framework Supporting Video-Based Avatars", Proc. of the 2nd Workshop on Network and System Support for Games, 2003, pp. 137-147.
[16] F. Pighin, R. Szeliski, D. Salesin (1999), "Resynthesizing Facial Animation through 3D Model-Based Tracking", Proc. of the 7th ICCV, 1999, pp. 143-150.
[17] J. Lee, J. Chai, P. Reitsma, J. Hodgins, N. Pollard (2002), "Interactive Control of Avatars Animated with Human Motion Data", ACM Transactions on Graphics 21, Jul 2002, pp. 491-500.
[18] VRWear SL head analysis viewer, http://sl.vrwear.com/, unpublished.
[19] OpenSimulator, http://opensimulator.org/.
[20] Video of the ACME system, http://www.youtube.com/watch?v=DNB0_c5TSk.
[21] Second Life Source Downloads, http://wiki.secondlife.com/wiki/Source_archive.
[22] ARToolKit homepage, http://www.hitl.washington.edu/artoolkit/.
[23] CMU 1394 Digital Camera Driver, http://www.cs.cmu.edu/~iwan/1394/.
[24] A. Fuhrmann et al. (1999), "Occlusion in Collaborative Augmented Environments", Computers and Graphics 23(6), pp. 809-819.
[25] B. D. Zarit, B. J. Super, F. K. H. Quek (1999), "Comparison of Five Color Models in Skin Pixel Classification", Proc. ICCV '99 International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp. 58-63.
Figure 3. Interaction with virtual objects: Second Life view (left) and real life view (right). The feedback object is shown as a red ball.