

Connecting users to virtual worlds within MPEG-V standardization

Seungju Han, Jae-Joon Han, James D.K. Kim, Changyeong Kim

Advanced Media Lab, Samsung Advanced Institute of Technology, Yongin, Republic of Korea

Signal Processing: Image Communication 28 (2013) 97–113

Article info

Available online 17 November 2012

Keywords:

MPEG-V

Virtual World

3D Manipulation

Gesture Recognition


Abstract

Virtual worlds such as Second Life and 3D internet/broadcasting services have become increasingly popular. A life-scale virtual world presentation and intuitive interaction between users and virtual worlds would provide a more natural and immersive experience for users. Emerging interaction technologies, such as facial-expression/body-motion tracking and remote interaction for virtual object manipulation, can provide a strong connection between users in the real world and avatars in the virtual world. For the wide acceptance and use of virtual worlds, the various types of novel interaction devices need a unified interaction format between the real world and the virtual world. Thus, MPEG-V Media Context and Control (ISO/IEC 23005) standardizes such connecting information. The paper provides an overview of MPEG-V from the real world to the virtual world (R2V) and usage examples of its interfaces for controlling avatars and virtual objects in the virtual world by real world devices. In particular, we investigate how the MPEG-V framework can be applied to facial animation and hand-based 3D manipulation using an intelligent camera. In addition, in order to intuitively manipulate objects in a 3D virtual environment, we present two interaction techniques using motion sensors: a two-handed spatial 3D interaction approach and a gesture-based interaction approach.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

How we interact with computers in the future is exciting to say the least. Some of the interaction technologies are already in place and accepted as natural interaction methods. For example, Nintendo's Wii motion controller adopts accelerometers so that users can control virtual objects with natural motions [1].

Virtual worlds, which are persistent online computer-generated environments such as Second Life, World of Warcraft and Lineage, particularly need such novel interaction technology, since people can interact in them, either for work or play, in a manner comparable to the real world. A strong connection between the real world and the virtual world would provide an immersive experience to users. Such a connection can be provided by a large-scale display, where the objects in the virtual world are represented at real-world life scale, and by natural interaction using the facial expression and body motion of the users to control an avatar, which is a user's representation of himself/herself or alter ego in the virtual world. Recently, Microsoft introduced the Xbox Kinect, which senses the full-body motion of users with a 3D-sensing camera. Using the sensed full-body motion, users can control a character in a game according to their own body movements. Even more precise facial expression and motion sensing technologies are expected to be developed.


Fig. 1. An example of virtual world service system architecture (MPEG-V from the real world devices to the virtual world).


Virtual world services adopting such precise and natural interaction technology would provide various experiences such as a virtual tour, which enables users to travel back in time to a virtual ancient Rome, or a simulated astrophysical space exploration in which users walk or fly through enormous space.

These virtual world services require a strong connection between the virtual and the real worlds so that any change in the environment produces simultaneous reactions in both worlds. Making the interfaces between them efficient, effective and intuitive is of crucial importance for their wide acceptance and use. A standardized interface between the real world and the virtual world is needed to provide unified interface formats in between and interoperability among virtual worlds [2]. Fig. 1 shows the need for standardization of the interface, which enables various virtual world services. The virtual world service providers should be able to communicate the interoperable metadata of virtual world objects with the console, while the console also needs to adapt the signal received from any real world input device to the virtual world object metadata and send the adapted signal to the virtual world.

MPEG-V (ISO/IEC 23005) provides such an architecture and specifies the associated information representations to enable interoperability between virtual worlds, e.g., digital content providers of a virtual world, gaming, simulation, DVD, and with the real world devices, e.g., sensors and actuators [3].

In this paper, we focus on one of the standardization areas of MPEG-V, real world to virtual world adaptation (R2V adaptation). Specifically, it covers control information, interaction information and virtual world object characteristics, which are essential ingredients for controlling virtual world objects by real world devices. Real world devices such as motion sensors and cameras capture motions/postures/expressions of humans and implicitly reflect them in the virtual world. The paper presents a 6-DOF motion sensor which estimates 3D position and 3D orientation, as well as an intelligent camera which is capable of recognizing facial feature points and hand postures/gestures.

In addition, the paper presents how the recognized output of such devices can be adapted to the virtual world by the R2V adaptation engine. Four different instantiations are presented: two use an intelligent camera for facial expression cloning and hand based interaction, respectively; the other two use motion sensors for 3D manipulation and virtual music conducting, respectively.

This paper is organized as follows: Section 2 reviews the system architecture of MPEG-V R2V and the metadata of the MPEG-V R2V systems, i.e., control information, interaction information, and virtual world object characteristics; Section 3 presents the architecture of the motion sensor based interaction of the paper and its instantiated examples, i.e., how the motion sensor can be adapted for the two instantiations; Section 4 presents the architecture of the intelligent camera based interaction of the paper and how to adapt the received information to specific virtual worlds. Finally, the paper is concluded in Section 5.

2. System architecture and metadata of MPEG-V R2V

The system architecture for the MPEG-V R2V framework is depicted in Fig. 2(a). It comprises an adaptation RV engine and three standardization parts: Control Information (MPEG-V part 2), Interaction Information (MPEG-V part 5), and Virtual World Object Characteristics (MPEG-V part 4). The individual elements of the architecture have the following specific functions.

Control Information concerns the description of the capabilities of real world devices such as sensors and input devices. The control information conveys intrinsic information such as the accuracy, resolution, and ranges of the sensed values from the real world devices.

Interaction Information specifies the syntax and semantics of the data formats for interaction devices, "Sensed Information," to provide common input commands or sensor data formats from any interaction device in the real world connected to the virtual world. It aims to provide data formats for industry-ready interaction devices (sensors).


Fig. 2. Use scenario with the MPEG-V R2V framework. (a) System architecture of MPEG-V R2V. (b) Body motion tracking with a motion sensor and facial expression with an intelligent camera.


Virtual World Object Characteristics describes a set of metadata to characterize a virtual world object, making it possible to migrate a virtual object from one virtual world to another and to control a virtual world object in a virtual world by real world devices.

The MPEG-V R2V supports interaction information and control information from interaction devices to a virtual world for the purpose of controlling one or more entities in the virtual world. Particularly, consider controlling the body motion and facial expression of an avatar in the virtual world. The motion of the avatar can be generated either by pre-recorded animation clips or by direct manipulation using motion capturing devices. Fig. 2(b) shows an example of a facial expression and body tracking application with an intelligent camera. The intelligent camera detects/tracks feature points of both the face and the body, and then analyzes the time series of the detected feature points to recognize a body gesture and/or a facial expression.

The detected feature points of the user provide the body motion and the facial expression information of the user in the real world. To control the motion of the avatar using such information, the avatar should also have similar feature points for rendering. In the simplest case, the sensed feature points of the user and the feature points of the avatar are identical. Therefore, the description of the avatar should provide the feature points for both the body and the face of the avatar [4,5].

In order to support direct manipulation of the avatar, the virtual world object characteristics contain the animation element and the control feature element in the avatar characteristics. The animation element contains a description of animation resources, and the control element contains a set of descriptions for body control and facial control of an avatar.

In order to efficiently render real world effects in the virtual world, the MPEG-V R2V also provides an architecture to capture and understand the current status of the real world environment. For example, the virtual world acquires the sensed temperature or light level of the room in the real world through the sensed information to render the same effect in the virtual world.


The adaptation RV engine receives the sensed information and the description of the sensor capability, and then understands/adjusts the sensed information appropriately based on the sensor capability. For example, the offset, one of the attributes in the sensor description, can be added to the sensor value in order to get the correct value. The SNR (signal-to-noise ratio), which is another attribute, gives a measure of how much the data can be trusted despite noise. By providing these attributes, the sensed information can be understood more precisely.
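As a simple illustration of how an adaptation engine might use these attributes (the exact adaptation logic is implementation-specific and outside the scope of the standard, so the following is only an assumed reading), a raw sensed value v could be corrected and interpreted as

v' = v + \text{offset}, \qquad \Delta = \frac{\text{maxValue} - \text{minValue}}{\text{numOfLevels} - 1}, \qquad \text{SNR} = \frac{P_{\text{signal}}}{P_{\text{noise}}},

where \Delta is the quantization step implied by numOfLevels, and P_{\text{signal}}, P_{\text{noise}} denote the signal and noise power, respectively.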

The current syntax and semantics of the control information, interaction information and virtual world object characteristics are specified in [6–8], respectively. However, the paper provides an EBNF (Extended Backus–Naur Form)-like overview of them due to the lack of space and the verbosity of XML [9].

A. Control Information for sensors and input devices

The Control Information Description Language (CIDL) is a description tool to provide the basic structure in XML schema for instantiations of control information tools, including sensor capabilities.

1) Sensor Capability Description

SensorCapabilityBaseType provides a base abstract type for a subset of types defined as part of the sensor device capability metadata types.

It contains an optional Accuracy element and a sensorCapabilityBaseAttributes attribute. The Accuracy describes the degree of closeness of a measured quantity to its actual value. The SensorCapabilityBaseAttributes is used to define a group of attributes for the sensor capabilities.

The sensorCapabilityBaseAttributes may have several optional attributes, which are defined as follows: unit describes the unit of the sensor's measured value; maxValue and minValue describe the maximum/minimum value that the sensor can perceive, respectively; offset describes the value to be added to a base value in order to get to a correct value; numOfLevels describes the number of value levels that the sensor can perceive between the maximum and minimum value; sensitivity describes the minimum magnitude of input signal required to produce a specified output signal, in the given unit; SNR describes the ratio of signal power to noise power.
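Following the EBNF-like overview style adopted in this paper, the structure just described can be summarized as follows; the grouping is inferred from the element and attribute lists above and is not a verbatim excerpt of the schema.

SensorCapabilityBaseType ::= [Accuracy] [sensorCapabilityBaseAttributes]
sensorCapabilityBaseAttributes ::= [unit] [maxValue] [minValue] [offset] [numOfLevels] [sensitivity] [SNR]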

2) Sensor Capability Vocabulary

The Sensor Capability Vocabulary (SCV) defines a clear set of actual sensor capabilities to be used with the Sensor Capability Description in an extensible and flexible way. That is, it can be easily extended with new capability information or by derivation of existing capabilities thanks to the extensibility feature of XML Schema.

Currently, the standard defines the capabilities of the following sensors: light, ambient noise, temperature, humidity, distance, atmospheric pressure, position, velocity, acceleration, orientation, angular velocity, angular acceleration, force, torque, pressure, motion sensor and intelligent camera. The main capabilities of all the sensors except the intelligent camera contain maxValue and minValue, which describe the maximum/minimum value that the sensor can perceive in terms of the specified unit, respectively. Also, the location element, which describes the location of the sensor with respect to the global coordinate system, is included in the light, ambient noise, temperature, humidity, distance, and atmospheric pressure sensors.

Two examples of specific sensor capabilities, the motion sensor capability and the intelligent camera capability, are provided since the paper mainly concerns those two sensors for RV interaction. The motion sensor is an aggregated sensor type which contains sensed information such as position, velocity, acceleration, orientation, angular velocity, and angular acceleration. It contains the base type as well as the capabilities of all the sensed information in the sensor.
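An EBNF-like sketch of the aggregated motion sensor capability is given below; the element names are illustrative assumptions inferred from the list of sensed information above, not the exact identifiers used in the standard.

MotionSensorCapabilityType ::= [SensorCapabilityBase] [PositionCapability] [VelocityCapability] [AccelerationCapability] [OrientationCapability] [AngularVelocityCapability] [AngularAccelerationCapability]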


Finally, the intelligent camera capability contains the base type; a description of whether the camera can capture feature points on the body and/or face; a description of whether the camera can recognize facial expressions and/or body gestures; the maximum number of detectable feature points; and the location of the feature points.

FeatureTrackingStatus describes whether feature tracking is possible or not. FacialExpressionTrackingStatus describes whether the intelligent camera can extract the facial animation or not. GestureTrackingStatus describes whether the intelligent camera can extract the body animation or not. maxBodyFeaturePoint describes the maximum number of body feature points that the intelligent camera can track. maxFaceFeaturePoint describes the maximum number of facial feature points that the intelligent camera can track. TrackedFeature describes what kind of feature points can be tracked, as given in FeatureType (that is, 1. face, 2. body, and 3. both). TrackedFacialFeaturePoints describes whether each of the facial feature points orderly listed in AvatarControlFeatures is tracked. TrackedBodyFeaturePoints describes whether each of the body feature points orderly listed in AvatarControlFeatures is tracked.
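In the same EBNF-like style, the intelligent camera capability can be summarized as follows (a sketch assembled from the elements listed above, not a quotation of the schema):

IntelligentCameraCapabilityType ::= [SensorCapabilityBase] [FeatureTrackingStatus] [FacialExpressionTrackingStatus] [GestureTrackingStatus] [maxBodyFeaturePoint] [maxFaceFeaturePoint] [TrackedFeature] [TrackedFacialFeaturePoints] [TrackedBodyFeaturePoints]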

B. Interaction Information for sensors and input devices

The Interaction Information Description Language (IIDL) is an XML-Schema-based language providing the basic structure for describing sensed information.

1) Sensed Information Description

SensedInfoBaseType provides the topmost type of the base type hierarchy, from which individual sensed information types can inherit.

It contains a TimeStamp element and a SensedInfoBaseAttributes attribute. The TimeStamp provides the time information at which the sensed information is acquired. The SensedInfoBaseAttributes describes a group of attributes for the sensed information.

The sensedInfoBaseAttributes may have several optional attributes, which are defined as follows: id is used to define a unique identifier for identifying individual sensed information; sensorIdRef references a sensor device that has generated the information included in this specific sensed information; linkedlist describes the multi-sensor structure that consists of a group of sensors, in which each record contains a reference to the ID of the next sensor; groupID describes the identifier of the multi-sensor group to which this specific sensor belongs; activate describes whether the sensor shall be activated; priority describes a priority for the sensed information with respect to other sensed information sharing the same point in time when the sensed information becomes adapted.
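The base type of sensed information can likewise be sketched in EBNF-like form, inferred from the elements and attributes above (not an excerpt from the standard):

SensedInfoBaseType ::= [TimeStamp] [sensedInfoBaseAttributes]
sensedInfoBaseAttributes ::= [id] [sensorIdRef] [linkedlist] [groupID] [activate] [priority]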

2) Sensed Information Vocabulary

The Sensed Information Vocabulary (SIV) defines a set of sensors to be used with the Sensed Information Description. It uses the same set of sensors as the Sensor Capabilities. The sensed information of all the sensors except the intelligent camera contains Value and Unit. Value describes the sensed intensity of the sensor in terms of the related unit. Unit specifies the unit of the sensed value, if a unit other than the default unit is used, as a reference to a classification scheme term. The sensed information of a motion sensor, MotionSensorType, contains that of the following six sensors: position, orientation, velocity, angular velocity, acceleration, and angular acceleration.
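A corresponding EBNF-like sketch of the aggregated motion sensor sensed information, assuming one optional element per constituent sensor (the element names are illustrative):

MotionSensorType ::= [SensedInfoBase] [Position] [Orientation] [Velocity] [AngularVelocity] [Acceleration] [AngularAcceleration]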


Next, the intelligent camera has the sensed information type, intelligentCameraType, which contains information such as facial/body animation (gesture) and facial/body feature points.

FacialAnimationID describes the ID referencing the facial expression animation clip. BodyAnimationID describes the ID referencing the body animation clip. FaceFeature describes the 3D position of each of the facial feature points detected by the camera. BodyFeature describes the 3D position of each of the body feature points detected by the camera.
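In the same EBNF-like style, where an asterisk denotes that an element may occur multiple times, the intelligent camera sensed information can be sketched as (inferred from the element descriptions above):

IntelligentCameraType ::= [SensedInfoBase] [FacialAnimationID] [BodyAnimationID] FaceFeature* BodyFeature*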

C. Virtual World Object Characteristics

The Virtual World Object Description Language (VWODL) is an XML-Schema-based language, called Virtual World Object Characteristics XSD, for describing an object by considering three main requirements:
– Easy to create importers/exporters from various Virtual Environment (VE) implementations.
– Easy to control an object within a VE.
– Possible to modify a local template of the object by using data contained in a Virtual World Object Characteristics file.

The schema deals only with metadata and does not include the representation of geometry, sound, scent, animation or texture. To represent the latter, references to media resources are used.

1) Virtual World Object Description

VWOBaseType describes the common types of elements and attributes of virtual world objects which are shared by both avatars and virtual objects. The common associated elements and attributes are composed of the following types of data.

Identification contains the identification descriptors of the virtual world object. VWOC describes a set of characteristics of the virtual world object. It contains a list of the effects associated with the virtual world object, such as SoundList, ScentList, ControlList and EventList. BehaviorModelList contains a list of descriptors defining the behavior information of the object according to input events. id is a unique identifier for identifying individual virtual world object information.
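An EBNF-like sketch of this common base, inferred from the elements and attribute listed above (not a quotation of the schema):

VWOBaseType ::= [Identification] [VWOC] [BehaviorModelList] [id]
VWOC ::= [SoundList] [ScentList] [ControlList] [EventList]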

2) Avatar Description

The AvatarType is composed of the following types of data in addition to the common characteristics type of the virtual world object.

AvatarBaseType contains a set of avatar descriptors based on the VWOBaseType defined in the common characteristics of the virtual world object. Appearance contains the high level description of the appearance and may refer to media containing the exact geometry and texture. Animation contains the description of a set of animation sequences that the avatar is able to perform and may refer to several media containing the exact (geometric transformation) animation parameters. CommunicationSkills contains a set of descriptors providing information on the different modalities with which an avatar is able to communicate. Personality contains a set of descriptors defining the personality of the avatar. Gender describes the gender of the avatar. ControlFeatures contains a set of descriptors defining possible place-holders for sensors on the body skeleton and facial feature points. As shown in Fig. 3, the facial feature points for the generic face model are defined in the standard. HapticPropertyList contains a list of high level descriptors of the haptic properties. A complete description of the avatar framework in MPEG-V can be found in Ref. [10].
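Mirroring the EBNF-like listing given below for VirtualObjectType, the avatar description can be sketched as follows (element order and optionality inferred from the descriptions above):

AvatarType ::= [AvatarBaseType] [Appearance] [Animation] [CommunicationSkills] [Personality] [ControlFeatures] [HapticPropertyList] [gender]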

Fig. 3. 3D motion tracking architecture (handheld device with IR LEDs and an inertial sensor; IR sensors and a processor with an analog intensity sensing circuit and non-linear optimization, estimating position (x, y, z) and orientation).


3) Virtual Object Description

VirtualObjectType provides a representation of a virtual object inside the environment. It is composed of the following types of data.

VirtualObjectType ::= [VirtualObjectBaseType] [Appearance] [Animation] [HapticProperty] [VirtualObjectComponents]

VirtualObjectBaseType contains a set of virtual object descriptors based on the VWOBaseType defined in the common characteristics of the virtual world object. Appearance contains one or more resource link(s) to appearance files describing the visual and tactile elements of the object. Animation contains a set of metadata describing pre-recorded animations associated with the object. HapticProperty contains a set of high level descriptors of the haptic properties defined as material property, dynamic force effect and tactile property. VirtualObjectComponents contains the list of the virtual objects which are concatenated to the virtual object as components.

Finally, the MPEG-V R2V framework includes the adaptation RV engine, which is outside the scope of the MPEG-V standard. The adaptation becomes necessary due to the possible mismatch of data representations in the virtual world and the real world. For example, since an avatar in the virtual world has a different shape and size compared to the user, the body of the user and the body of the avatar must somehow be connected: the sensed feature points of the user should be connected to the rendering feature points of the avatar. The adaptation RV engine maps the data representation of the virtual world to the real world devices. Note that it is due to the existence of the adaptation RV engine that the input events of the virtual world need not be adjusted depending on either the performance of individual interaction devices or the mismatch of data representations. The adaptation engine takes the device capabilities and the data provided by each interaction device, and generates input events based on the characteristics of the virtual world. Therefore, the MPEG-V system architecture provides independence among virtual world providers and interaction device manufacturers, i.e., each entity can develop its own best work without concern for the others. The following two sections provide different types of interaction, each of which introduces a novel interaction device (e.g., a motion sensor and an intelligent camera) and its corresponding adaptation RV engine depending upon the virtual world application.

3. Motion sensor-based interaction

Motion-based control is gaining popularity, and motion gestures form a complementary modality in human–computer interaction. The motion gestures considered here are composed only of the position and orientation of the hand or the handheld device, i.e., 6-DOF motion gestures. Motion can be tracked via either vision-based systems or sensors attached to the human body. To achieve high precision motion gesture recognition, we use a motion sensor-embedded handheld device.

3.1. 6-DOF motion sensor and the interaction information

Large vertical displays are effective at displaying 3D information. In addition, interfaces for vertical displays are designed based on the implicit assumption that users are located at a distance. Fig. 3 illustrates the overall structure of the proposed 3D remote interaction system. The system includes handheld devices containing IR LEDs and inertial sensors as well as an IR sensor-embedded display. The 6-DOF sensing architecture measures the emitted analog intensity value accurately using 950 nm wavelength IR light modulated by a 1 MHz carrier frequency and a super-heterodyne signal filtering circuit against possible internal/external noise signal sources within the sensing area. The LEDs in different directions emit their light signals in turn during a cycle.


For each light emission, multiple photo-receivers measure the received optical inputs at a time. For each cycle, all the measured signals are collected and used for estimating the 6-DOF motion of the target with reference to the coordinate frame of the photo-receivers by non-linear optimization [11] and a temporal filtering algorithm. Also, to handle possible deterioration of the signal-to-noise ratio due to static noise such as ambient light at a long distance, inertial sensors with a higher sampling rate are used. Basically, interpolation is performed over short time intervals using the inertial sensors, incorporated with a Kalman filter which provides the best estimates of the current states. The resultant 6-DOF tracking performance shows that an average absolute estimation error of 3.3 cm for 3D position and 3.6° for 3D orientation is achieved at 1.0–2.5 m, where the working range of the reference system is within 2.5 m. The details of the 6-DOF system can be found in Ref. [12].

The estimated data is then encapsulated with the data type of the motion sensor specified in ISO/IEC 23005-5 [7] as interaction information to the adaptation RV engine, as in Fig. 1. Although the output of the sensing architecture can generate all six types of information, the proposed system for interaction only uses 3D position and 3D orientation. Therefore, the MotionSensorType contains just position and orientation as the interaction information.
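For illustration, such a sensed information instance might look like the following; the sensor identifier "MS-1", the numeric values, and the exact Position/Orientation element names are illustrative assumptions written in the style of the IntelligentCameraType examples of Section 4, not an excerpt from the standard.

<iidl:SensedInfo xsi:type="siv:MotionSensorType" sensorIdRef="MS-1">
  <!-- only position and orientation are used by the proposed system; element names are illustrative -->
  <siv:Position>
    <mpegvct:X>0.50</mpegvct:X> <mpegvct:Y>1.20</mpegvct:Y> <mpegvct:Z>2.00</mpegvct:Z>
  </siv:Position>
  <siv:Orientation>
    <mpegvct:X>0.0</mpegvct:X> <mpegvct:Y>45.0</mpegvct:Y> <mpegvct:Z>0.0</mpegvct:Z>
  </siv:Orientation>
</iidl:SensedInfo>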

3.2. Adaptation for motion sensor based 3D manipulation

For object manipulation, we selected gestures for four major tasks: selection, translation, rotation and scaling. After a preliminary empirical test of various gestures, we designed one-handed tasks (selection, rotation and translation) and two-handed tasks (rotation and scaling) for intuitive and sophisticated 3D manipulation. The two-handed interactions give users a feeling similar to using multi-touch interactions at a distance. This design approach is possible because our sensing system can track the 3D movement and orientation of multiple devices simultaneously.

As for one-handed interaction, users need to identify a desired target in order to interact with virtual objects. This task has to be accomplished by the interaction technique itself. Selection is simply implemented by using image-plane techniques with a button on the handheld device [13].

As for translation, the system obtains the 3D position in the scene when the selection occurs. The (x, y) plane of the translation is then determined by the x, y coordinates of the acquired 3D position. The (x, y) movement of the handheld device in the sensor reference coordinate frame is then mapped one-to-one to the translation in the (x, y) plane of the scene. The z-axis translation is designed similarly. Overall, users can translate an object intuitively using translation gestures along any direction, like moving a real object. The rotation along the three axes (roll, pitch, yaw) of the 3D virtual space is designed using the orientation information estimated from the proposed system. Users can rotate the object intuitively using rotation gestures, as shown in Fig. 4.

Fig. 5 shows how the two-handed rotation is implemented. In principle, it performs a rotation around an arbitrary axis u in 3D space. In order to determine the axis u, one hand pointer determines the center point of rotation (a).

Fig. 4. 1-Handed rotation.

Fig. 5. 2-handed rotating gesture and rotation around an arbitrary axis.


The rotation axis is determined by the cross product of the two vectors ab and ac, and the amount of rotation (θ) is specified by the angle between the two vectors.

Scaling in 3D space is implemented using two-handed pinch gestures (a–c, b–d in Fig. 6). Moving the devices closer while selected is implemented as scaling down, with the scaling factor determined by the ratio between the distance before the gesture and the one after. Moving the devices farther apart is implemented as scaling up, analogous to scaling down. Fig. 7 shows our designed selection, translation, rotation, and scaling tasks for 3D manipulation.
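Writing a, b, c for the 3D positions referenced above and p_L, p_R (before) and p'_L, p'_R (after) for the two device positions in the scaling gesture, the quantities described in the last two paragraphs can be summarized as follows (a restatement of the text, not a formula taken from the paper):

\mathbf{u} = \frac{(\mathbf{b}-\mathbf{a})\times(\mathbf{c}-\mathbf{a})}{\lVert(\mathbf{b}-\mathbf{a})\times(\mathbf{c}-\mathbf{a})\rVert}, \qquad \theta = \arccos\frac{(\mathbf{b}-\mathbf{a})\cdot(\mathbf{c}-\mathbf{a})}{\lVert\mathbf{b}-\mathbf{a}\rVert\,\lVert\mathbf{c}-\mathbf{a}\rVert}, \qquad s = \frac{\lVert\mathbf{p}'_L-\mathbf{p}'_R\rVert}{\lVert\mathbf{p}_L-\mathbf{p}_R\rVert}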

As for the evaluation, 10 subjects (8 males and 2 females), aged 20–35, participated in the experiment. All participants stood 2.5 m away from the display. Before the user study, all subjects were asked to practice the seven gestures for 3D manipulation with the input device. After all the subjects had performed each gesture in 100 trials, we evaluated the success rate of the 3D manipulation tasks.

Fig. 6. Scaling. (a) Gesture (b) Scaling up (c) Scaling down (d) Non-scaling.

Fig. 7. 3D manipulation. (a) Demonstration, (b) pointing, (c) selection, (d) moving forward, (e) moving backward, (f) 2-handed X-rotation, (g) 2-handed Y-rotation, (h) 2-handed Z-rotation, (i) scaling up and (j) scaling down.

Fig. 8. Evaluation for 3D manipulation: (a) recognition rate and (b) user feedback.


As mentioned before, the selection and translation tasks were performed with a one-handed device, and the rotation and scaling tasks were performed with two handheld devices.

The average rate of correctly recognized gestures was 95.6%. The average recognition rate for each of the seven gestures is shown in Fig. 8(a). All the subjects completed the 3D manipulation tasks quickly and easily. In addition, the subjects were asked to leave feedback on a questionnaire after the 3D manipulation tasks. The questionnaire includes five items on mental effort, physical effort, operation speed, general comfort and overall ease. Fig. 8(b) shows that translation performs best over all the questions and scaling has positive results over all the questions. An important addition to the survey is the comparison of preferred rotation between one-handed and two-handed. One-handed rotation requires less mental effort and more physical effort than two-handed rotation, since one-handed rotation is a more intuitive gesture but has more physical limitations on the rotational hand movement. The details of the 3D manipulation can be found in Ref. [14].

In order to interact with virtual objects in the scene, the adaptation engine receives the metadata of each object in the scene from a virtual world and returns the updated metadata. For 3D manipulation of an object, the motion control feature type, one of the metadata items specified in ISO/IEC 23005-4 [8], would be updated by transforming the 3D motion tracking data of the input devices into the corresponding commands for the virtual environment. The following example shows the metadata to double the scale of a virtual object with the ID "VO-1" in all three axes.

<vwoc:VirtualObject xsi:type="vwoc:VirtualObjectType" id="VO-1">
  <vwoc:VWOC>
    <vwoc:ControlList>
      <vwoc:Control>
        <vwoc:MotionFeatureControl>
          <vwoc:ScaleFactor>
            <mpegvct:X>2.0</mpegvct:X>
            <mpegvct:Y>2.0</mpegvct:Y>
            <mpegvct:Z>2.0</mpegvct:Z>
          </vwoc:ScaleFactor>
        </vwoc:MotionFeatureControl>
      </vwoc:Control>
    </vwoc:ControlList>
  </vwoc:VWOC>
</vwoc:VirtualObject>

3.3. Adaptation for virtual music conducting

As shown in Fig. 9, gesture based virtual music conducting was implemented. The real-time gesture recognition enables a user to control a virtual orchestra using gestures of the user's baton, which is the sensor-embedded handheld device. Conducting gestures which define the musical beat are detected from the baton's trajectory, which is obtained from a sequence of 3D positions of the motion sensors and conveyed to a sound synthesis system as events that control the tempo and the phase of the music. Our system allows the user control over the beat by conducting four different beat-pattern gestures; over the tempo by making faster or slower gestures; over the volume by making larger or smaller gestures; and over instrument emphasis by directing the gestures towards specific areas of a video of the orchestra on a large display.

Fig. 9. Virtual Conducting System: (a) demonstration, (b) conducting gesture and (c) virtual orchestra.


The paper focuses on real-time gesture recognition for the four beat patterns (2/4, 3/4, 4/4, and 6/8), which is fundamental to the other controls.

One main concern of gesture recognition is how to segment meaningful gestures from a continuous sequence of motions. Generally, such gesture segmentation suffers from segmentation ambiguity [15] and spatio-temporal variability [16]. The segmentation ambiguity is caused by the uncertainty of where a gesture pattern starts and ends in a continuous sequence of motions. The spatio-temporal variability is caused by the fact that the same gesture varies in shape, duration, and trajectory even for the same person. To solve the above problems, researchers have used the HMM because it can model the spatial and temporal characteristics of gestures effectively [17–20].

The trajectory from the handheld device is converted to a spatio-temporal sequence of the moving direction of the handheld device, as shown in Fig. 10. The moving direction is then converted to one of eight directional codewords by a vector quantizer to make use of discrete HMM-based approaches.

Fig. 10. HMM model for conducting gestures: (a) directional codewords and (b) HMM model.

Fig. 11. Comparison of gesture segmentation between the sliding window HMM and the accumulative HMM.

Fig. 12. Gesture recognition rate of the sliding window HMM and the accumulative HMM.


Table 1. Confusion matrix of the accumulative HMM (%). Columns are the actual beat patterns; rows are the recognized classes.

Class   2/4    3/4    4/4    6/8
2/4     95.9   1.5    0.0    0.2
3/4     3.3    93.6   2.8    2.4
4/4     0.0    1.3    86.5   3.5
6/8     0.8    3.6    10.7   93.9


For each gesture, we design a model using the left–right HMM, utilizing the temporal characteristics of gesture signals. We apply both a sliding window HMM and an accumulative HMM [21] to our real-time conducting gesture recognition system, as shown in Fig. 11. The former uses a backward spotting scheme that first detects the end point, then traces back to the start point, and sends the extracted gesture segment to the hidden Markov model (HMM); this introduces an inevitable time delay. The latter improves the gesture recognition rate greatly by accepting all the accumulated gesture segments between the start and end points and deciding the gesture type by a majority vote over all the intermediate recognition results.
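A hedged formalization of the codeword conversion, assuming eight uniform 45° sectors centered on the principal directions (the exact quantizer boundaries of the system are not specified in the paper): given consecutive baton positions (x_{t-1}, y_{t-1}) and (x_t, y_t),

\theta_t = \operatorname{atan2}(y_t - y_{t-1},\, x_t - x_{t-1}), \qquad c_t = \left\lfloor \frac{\theta_t + \pi/8}{\pi/4} \right\rfloor \bmod 8,

where c_t \in \{0, \dots, 7\} is the directional codeword fed to the discrete HMM.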

As for the evaluation, the gesture recognition rates of the HMM methods were measured. The recognition rate is defined as the ratio of the number of correctly recognized gesture sequences to the total number of testing gesture sequences. The testing dataset consists of 200 gesture sequences, i.e., 50 sequences for each of the four gestures. Fig. 12 shows the recognition results of the two different segmentation methods for each of the beat pattern gestures. The overall recognition accuracy of the accumulative HMM method (92.5%) is 9.8% better than that of the sliding window HMM method (82.7%). Table 1 shows the confusion matrix of the accumulative HMM. We observe that the 4/4 beat gesture has comparably the lowest recognition rate and that it is most often confused with the 6/8 beat gesture.

4. Intelligent camera-based interaction

The camera is an important sensor in the sense that it captures an entire visible scene within its field of view and can be used to analyze the scene to obtain contextual information automatically. The full body gestures and facial expressions of users are of interest due to the importance of understanding the context. In addition, depth-sensing cameras overcome bad lighting conditions and simplify complex image processing procedures. The living room environment, where most large screen displays are located, especially suffers from low ambient light intensity and a wide distance between the sensor and the users [22]. In such an environment, depth cameras provide effective sensing means to support manipulation by hand gestures/postures without any legacy input devices. On the other hand, color cameras are beneficial in detecting facial expressions, because they provide higher resolution and more robust texture information than depth cameras. The detected facial expressions can be used as animation controls which drive the facial animation of a 3D avatar [23].

4.1. Intelligent camera for facial feature points & hand posture

For facial feature point tracking, the color image is used. The intelligent camera basically initializes feature points to personalize the facial model and then traces each of the feature points. In the initialization step, feature points are acquired from the user's frontal neutral face while optimizing the shape parameters of the 3D face model in Eq. (1),

F = \bar{F} + t_s S + e E    (1)

where \bar{F} is the mean shape of the 3D face model, S is a shape basis, t_s is a vector of shape parameters, E is an animation basis, and e is a vector of animation parameters.

An Active Appearance Model (AAM) [24] is exploited for acquiring the feature points of the user's face. After the facial feature points are located by the AAM, each point needs to be processed and possibly corrected. For feature points that must be located on edges, we move each feature point along the normal vector of the boundary toward the salient edge in the image. The parameterized 3D face model is then aligned with the 33 feature points acquired from the AAM while optimizing the vector of shape parameters, t_s, in Eq. (1).

After initialization, the tracking step tracks the lower and upper face separately. The lower facial feature points are tracked using an iterative gradient descent approach [25]. The upper facial features are tracked by a belief propagation algorithm with geometric constraints to avoid drifting of the tracked feature points [23].

These sensed feature points are then encapsulated by the intelligent camera sensor type as follows. To specify the intelligent camera, the capability type uses "featureTrackingStatus" indicating that feature tracking is possible; "TrackedFeature" indicating that the facial features are tracked; and "FacialFeatureMask" which flags only the 33 active feature points. It provides parameters to adaptation engines for processing the sensed information.


The sensed information is then encapsulated with the 33 feature points obtained from the intelligent camera with the sensor identification "IC_1". The sensed information instantiation of the example is shown in the following. Note that the first four feature points in the example represent the head outline of the face specified in ISO/IEC 23005-4 [8].

<iidl:SensedInfo xsi:type="siv:IntelligentCameraType" sensorIdRef="IC1">
  <siv:FaceFeature>
    <mpegvct:X>0.0</mpegvct:X> <mpegvct:Y>0.0</mpegvct:Y> <mpegvct:Z>0.0</mpegvct:Z>
  </siv:FaceFeature>
  <siv:FaceFeature>
    <mpegvct:X>0.01</mpegvct:X> <mpegvct:Y>0.0</mpegvct:Y> <mpegvct:Z>0.01</mpegvct:Z>
  </siv:FaceFeature>
  <siv:FaceFeature>
    <mpegvct:X>0.01</mpegvct:X> <mpegvct:Y>0.01</mpegvct:Y> <mpegvct:Z>0.01</mpegvct:Z>
  </siv:FaceFeature>
  <siv:FaceFeature>
    <mpegvct:X>-0.01</mpegvct:X> <mpegvct:Y>0.01</mpegvct:Y> <mpegvct:Z>0.01</mpegvct:Z>
  </siv:FaceFeature>
  …
</iidl:SensedInfo>

As for hand based interaction, the distance between the camera and the user is set to about 2 m. A push motion is chosen to initiate the hand interaction and extract the hand region. In order to detect the push motion, a sequence of depth images is stacked up to extract blobs, each of which has a negative depth difference between frames for all the consecutive frames. After performing connected component labeling on the extracted blobs, the decision whether the detected motion is a push motion or not is made by a real AdaBoost classifier with seven features concerning the shape, size, and moving direction of the blob [22]. This push blob extraction enables hand interaction without background subtraction even when multiple users are present in a cluttered background.

After detecting the push motion, the hand region is extracted by the appropriate depth range while maintaining the characteristics of the hand such as its size and aspect ratio [26]. The extracted hand region can then be used not only for recognizing the hand posture, but also for estimating the 3D position of the hand using the moments of the hand region. The level set based approach provides the 2D image at a certain depth [27]. Invariant moments for each level are then used in generating the feature vector on a set of hand images. Fig. 13 shows the original depth image and the resulting contour images of three different levels. The Random Forest algorithm is then applied to the obtained feature vectors to classify each of the postures.

In order to evaluate the performance of the algorithm, a dataset was acquired with a PrimeSense camera 2 m away. The resolution of the depth image is 640×480 pixels. The training data contains eight different hand postures from 10 people with 200 different viewpoint images for each posture per person, i.e., 16,000 images. For the evaluation data, we collected five datasets; each dataset is composed of 1000 images of different postures with various viewpoints from a single person. As for the feature, the depth difference between neighboring levels is set to 5 cm. Different numbers of trees for the random forest are used for evaluation.

Fig. 13. Resulting contour image of the level set.


Table 2. Results on hand posture recognition.

Evaluation dataset   Correct rate (%)
                     100 trees   200 trees   400 trees
DS-1                 90.1        91.0        92.4
DS-2                 87.5        90.8        90.2
DS-3                 88.8        91.3        93.8
DS-4                 88.5        92.0        92.8
DS-5                 91.4        92.4        91.6
Average              89.3        91.5        92.2


The performance of the proposed algorithm is shown in Table 2. As the number of trees increases, the performance of the algorithm gradually increases. Note that the result of the paper is view-independent, in contrast to other methods based on depth images [28–30], and achieves more than 90% accuracy with 200 or more trees.

The estimated 3D position and the recognized posture of each hand are then sent to an adaptation engine as sensed information of the intelligent camera. To specify the intelligent camera, the capability type uses "featureTrackingStatus" indicating that feature tracking is possible; "TrackedFeature" indicating that the body features are tracked; "BodyFeatureMask" which contains active flags only for both hands; and finally, "gestureTrackingStatus" indicating that hand posture detection is available. It provides parameters to adaptation engines for processing the sensed information.

The sensed information containing the two hand postures and the two 3D positions is then generated as an input to the adaptation engine. For example, suppose the user forms the right hand posture of thumb up at the position (1.0, 2.0, 1.0) and an unknown left hand posture at the position (1.0, 1.0, 1.0). Then, the sensed information instantiation of the example is shown in the following.

<iidl:SensedInfo xsi:type="siv:IntelligentCameraType">
  <siv:BodyAnimationID>Left Hand Unknown</siv:BodyAnimationID>
  <siv:BodyAnimationID>Right Hand Thumb Up</siv:BodyAnimationID>
  <siv:BodyFeature>
    <mpegvct:X>1.0</mpegvct:X> <mpegvct:Y>1.0</mpegvct:Y> <mpegvct:Z>1.0</mpegvct:Z>
  </siv:BodyFeature>
  <siv:BodyFeature>
    <mpegvct:X>1.0</mpegvct:X> <mpegvct:Y>2.0</mpegvct:Y> <mpegvct:Z>1.0</mpegvct:Z>
  </siv:BodyFeature>
</iidl:SensedInfo>

4.2. Adaptation for facial cloning

The feature points in the generic face model are well-defined and sufficient for all kinds of facial expressions. However, when an intelligent camera provides a limited set of feature points, controlling the facial expression of avatars in virtual worlds may require a different number of control points, as in Fig. 14. For example, the intelligent camera is capable of detecting feature points on obvious edges such as eyebrows, eyes, nose, mouth, and so on, whereas those detected points are sometimes insufficient for sophisticated facial animation under a control-point-based animation system. Therefore, the adaptation engine should generate additional control points to support such facial animation.

Consider some additional control points over the cheeks and forehead for realistic facial animation. The white dots in Fig. 14(c) represent the additional control points. Note that the motions of the additional control points are highly correlated with the movements of the mouth and eyebrows. Therefore, the movement of the points on the face can be parameterized using a mass spring model, which characterizes the facial muscle behavior with a mass and its stiffness [31].

The stiffness constants of the mass spring model can be trained offline using the movement of motion capture (Mocap) data. Then the additional control points can be estimated using the trained mass spring model and the positions of the neighboring detected feature points obtained by the intelligent camera.


Fig. 14. Experimental result: (a) an input raw image, (b) detected feature points from the intelligent camera, and (c) up-sampling case: white dots are generated using interpolation of the detected neighboring feature points.


The estimated position of each additional point using the mass spring model is approximated as

\hat{x}_j^t \approx \frac{\sum_{i \in N(x_j)\setminus\{x_j\}} k_{ij}\left( x_i^t - \lVert L_{ij}^0 \rVert\, L_{ij}^{t-1} / \lVert L_{ij}^{t-1} \rVert \right)}{\sum_{i \in N(x_j)\setminus\{x_j\}} k_{ij}}    (2)

where \hat{x}_j^t is the estimated position of the j-th control point at time t, k_{ij} is a stiffness constant between the i-th point and the j-th point, x_i^t is the position of the i-th point linked to the j-th point at time t, L_{ij}^t is the vector between x_i^t and x_j^t at time t (with \lVert L_{ij}^t \rVert the corresponding Euclidean distance), and N(x_j) represents the index set of the neighboring points of x_j.

Fig. 15 depicts the up-sampling approach, which creates the additional control points using the mass spring model and the detected feature points from the intelligent camera. The black dots are the detected feature points from the intelligent camera. The mass spring model, which is trained on the Mocap database, receives the detected feature points and generates the additional control points. The resultant up-sampled facial animation is obtained by combining the additional points and the detected feature points.

The up-sampled feature points are then mapped to the feature points of the avatar by adjusting scales. Then, the adaptation engine sends the metadata related to the avatar facial features to update the facial expression of the avatar. The following example partly shows the avatar facial feature control information.

<vwoc:Avatar xsi:type="vwoc:AvatarType">
  <vwoc:ControlFeatures>
    <vwoc:ControlFaceFeatures>
      <vwoc:HeadOutline>
        <vwoc:Outline4Points>
          <vwoc:Point1 xsi:type="vwoc:Physical3DPointType" x="10" y="10" z="0"/>
          <vwoc:Point2 xsi:type="vwoc:Physical3DPointType" x="0" y="20" z="0"/>
          <vwoc:Point3 xsi:type="vwoc:Physical3DPointType" x="-10" y="10" z="0"/>
          <vwoc:Point4 xsi:type="vwoc:Physical3DPointType" x="0" y="0" z="0"/>
        </vwoc:Outline4Points>
      </vwoc:HeadOutline>
    </vwoc:ControlFaceFeatures>
  </vwoc:ControlFeatures>
</vwoc:Avatar>

4.3. Adaptation for hand based 3D manipulation

3D manipulation using 3D positions and postures is proposed to support bare-hand interaction. From the encapsulated sensed information, 3D drag and drop, 3D rotation, and 3D scaling are proposed for 3D manipulation, as in Fig. 16.


Fig. 15. Overview diagram of the up-sampling approach. Tracked points (black dots) come out of the intelligent camera sensor. Additional points (white dots) come out of the spring mass model generated from Mocap data. White lines in the up-sampled model show the links between the points.

Fig. 16. Hand interaction set: (a) one-hand interaction set and (b) two-handed interaction.


In addition, thumb-up and thumb-down postures are used as enter/select and escape/cancel, respectively. The other postures are not used. The 3D position in the real world space is then mapped into the scene by considering an effective region of interaction (ROI), which depends on whether the user is standing or sitting. Our experiments suggest that the average width and height of the ROI when standing are 45.7 cm and 38.0 cm, respectively, and the average width and height of the ROI when sitting are 32.6 cm and 26.8 cm, respectively. This information is then used to map the ROI into the screen region [20]. Once the push motion is initiated, the ROI is created by taking the starting position of the push motion as the center of the ROI.
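A sketch of the ROI-to-screen mapping, under the assumption of a simple linear scaling (the exact mapping function is not given in the paper): with the hand position p = (p_x, p_y), the ROI centered at the push start c = (c_x, c_y) with width W_{ROI} and height H_{ROI}, and a screen of size W_S × H_S,

s_x = \frac{p_x - (c_x - W_{ROI}/2)}{W_{ROI}}\, W_S, \qquad s_y = \frac{p_y - (c_y - H_{ROI}/2)}{H_{ROI}}\, H_S.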

Note that the XML instance generated for 3D manipulation and sent toward the virtual world servers is similar to the remote multi-touch example, because from the virtual world point of view the interaction is the same.

5. Conclusion

The paper presented an overview of MPEG-V from the real world to the virtual world (R2V) on interfaces for controlling avatars and virtual objects by real world devices. In particular, it described the details of control information, which concerns the description of the capabilities of real world devices.


It also described interaction information, which includes common input commands or sensor data formats from the real world devices, and virtual world object characteristics, which describe a set of metadata to characterize the avatar and the virtual object.

The paper presented the proposed 6-DOF motion sensor architecture with an evaluation of its performance; the two-handed spatial 3D interaction approach using handheld devices, validated through user tests; and the virtual music conducting application, which recognizes beat pattern gestures with an HMM as a basis for controlling the beat, tempo, volume, and instrument emphasis of the music.

The paper also presented the vision-based facial feature extraction method using different tracking algorithms on the lower and upper face; the vision-based recognition method for hand postures using a random forest with level set based features; adaptation for facial feature control by the up-sampling method utilizing the spring-mass model; and adaptation for hand based interaction by identifying the region of interaction using the push motion.

The presented MPEG-V R2V system architecture is expected to contribute to a more immersive experience for users through standardized interaction technologies between the real world and the virtual world, supporting their wide acceptance and use.

References

[1] Wii Remote. Available from: www.nintendo.com/wii.
[2] ISO/MPEG, MPEG-V Requirements V3.2, ISO/IEC MPEG/w10498, Lausanne, Switzerland, February 2009. Available from: http://mpeg.chiariglione.org/working_documents.htm#MPEG-V.
[3] ISO/IEC 23005-1 Information technology—Media context and control—Part 1: Architecture, August 2011, ISO Publication.
[4] J.-J. Han, H. Lee, B. Lee, S. Kim, K.H. Kim, W.C. Bang, D.J. Kim, Real-time 3D full-body motion tracking architecture for home CE devices, in: Samsung Technology Conference, Kiheung, Korea, October 2009.
[5] J.A.Y. Zepeda, F. Davoine, M. Charbit, A linear estimation method for 3D pose and facial animation tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2007.
[6] ISO/IEC 23005-2 Information technology—Media context and control—Part 2: Control information, August 2011, ISO Publication.
[7] ISO/IEC 23005-5 Information technology—Media context and control—Part 5: Data formats for interaction devices, August 2011, ISO Publication.
[8] ISO/IEC 23005-4 Information technology—Media context and control—Part 4: Virtual world object characteristics, August 2011, ISO Publication.
[9] ISO/IEC 14977 Information technology—Syntactic metalanguage—Extended BNF, 1996, ISO Publication.
[10] M. Preda, B. Jovanova, Avatar interoperability and control in virtual worlds, Signal Processing: Image Communication, http://dx.doi.org/10.1016/j.image.2012.10.012, this issue.
[11] Y. Yuan, A review on trust region algorithms for optimization, in: Proceedings of ICIAM, 2000, pp. 271–282.
[12] H.-E. Lee, S. Kim, C. Choi, W.-C. Bang, J.D.K. Kim, C. Kim, High-precision 6 DOF motion tracking architecture with compact low-cost sensors for 3D manipulation, in: Proceedings of Consumer Electronics (ICCE), 2012, pp. 193–194.
[13] J.S. Pierce, A.S. Forsberg, M.J. Conway, S. Hong, R.C. Zeleznik, M.R. Mine, Image plane interaction techniques in 3D immersive environments, in: Proceedings of the Symposium on Interactive 3D Graphics (I3D), 1997, pp. 39–43.
[14] S. Han, H. Lee, J. Park, W. Chang, C. Kim, Remote interaction for 3D manipulation, in: Proceedings of ACM SIGCHI, Atlanta, USA, April 2010.
[15] K. Takahashi, S. Seki, R. Oka, Spotting recognition of human gestures from motion images, Technical Report IE92-134, The Institute of Electronics, Information, and Communication Engineers, Japan, 1992, pp. 9–16 (in Japanese).
[16] T. Baudel, M. Beaudouin-Lafon, CHARADE: remote control of objects using free-hand gestures, Communications of the ACM 36 (7) (1993) 28–35.
[17] J. Wilpon, C. Lee, L. Rabiner, Application of hidden Markov models for recognition of a limited set of words in unconstrained speech, in: Proceedings of ICASSP, 1989, pp. 254–257.
[18] H. Lee, J. Kim, An HMM-based threshold model approach for gesture recognition, IEEE Transactions on PAMI 21 (10) (1999) 961–973.
[19] J. Deng, H. Tsui, An HMM-based approach for gesture segmentation and recognition, in: Proceedings of the 15th International Conference on Pattern Recognition, 2000, pp. 679–682.
[20] H. Kang, C. Lee, K. Jung, Recognition-based gesture spotting in video games, Pattern Recognition Letters 25 (15) (2004) 1701–1714.
[21] D. Kim, J. Song, D. Kim, Simultaneous gesture segmentation and recognition based on forward spotting accumulative HMMs, Pattern Recognition 40 (11) (2007) 3012–3026.
[22] B. Yoo, J.-J. Han, C. Choi, H.-S. Ryu, D. Park, C. Kim, 3D remote interface for smart displays, in: Proceedings of CHI EA '11, 2011, pp. 551–560.
[23] T. Rhee, Y. Hwang, J.D. Kim, C. Kim, Real-time facial animation from live video tracking, in: Proceedings of the ACM SIGGRAPH Symposium on Computer Animation, 2011, pp. 215–224.
[24] I. Matthews, S. Baker, Active appearance models revisited, International Journal of Computer Vision 60 (2) (2004) 135–164.
[25] F. Dornaika, F. Dovoine, On appearance based face and facial action tracking, IEEE Transactions on Circuits and Systems for Video Technology 16 (9) (2006) 1107–1124.
[26] K. Fujimura, X. Lui, Sign recognition using depth image streams, in: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR 2006), April 2006, pp. 381–386.
[27] S.J. Osher, R.P. Fedkiw, Level Set Methods and Dynamic Implicit Surfaces, Springer-Verlag New York, Inc., 2002, ISBN 0-387-95482-1.
[28] O. Rashid, A. Al-Hamadi, A. Panning, B. Michaelis, Posture recognition using combined statistical and geometrical feature vectors based on SVM, World Academy of Science, Engineering and Technology 56 (2009) 590–598.
[29] X. Lui, K. Fujimura, Hand gesture recognition using depth data, in: Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition (FGR 2004), May 2004, pp. 529–534.
[30] Z. Mo, U. Neumann, Real-time hand pose recognition using low-resolution depth images, in: Proceedings of CVPR, June 2006, pp. 1499–1505.
[31] X. Provot, Deformation constraints in a mass-spring model to describe rigid cloth behavior, in: Proceedings of Graphics Interface, 1995, pp. 147–154.