
3D Object Transfer Between Non-Overlapping Videos

Jiangjian Xiao∗
Sarnoff Corporation

Xiaochun Cao†

University of Central Florida

Hassan Foroosh‡

University of Central Florida


Figure 1: Transfer of a 3D “sit-talking” person between two non-overlapping videos captured by moving cameras. (a) Two original frames from the input source video. (b) Two result frames from a naive cut-and-paste process [7] without 3D alignment of the camera trajectories. The transfer is clearly geometrically incorrect: the pose, sitting location, and size of the inserted person are not consistent with the target scene, since the source and target cameras have different motion. (c) Homography-based transfer by 2D reference-plane alignment (the floor in (a) and the desk in (b)), where the transferred person does “sit” in the same location on the desk throughout the whole target video, but the 3D points outside these registered planes appear distorted. (d) The final results of our method, which are both geometrically correct and visually realistic.

ABSTRACT

Given two video sequences of different scenes acquired with moving cameras, it is interesting to seamlessly transfer a 3D object from one sequence to the other. In this paper, we present a video-based approach to extract the alpha mattes of rigid or approximately rigid 3D objects from one or more source videos, and then transfer them, in a geometrically correct way, into another target video of a different scene. Our framework builds upon techniques in camera pose estimation, 3D spatiotemporal video alignment, depth recovery, key-frame editing, natural video matting, and image-based rendering. Based on the explicit camera pose estimation, the camera trajectories of the source and target videos are aligned in 3D space. Combined with the estimated dense depth information, this allows us to significantly relieve the burden of key-frame editing and efficiently improve the quality of video matting. During the transfer, our approach not only correctly restores the geometric deformation of the 3D object due to the different camera trajectories, but also effectively retains the soft shadow and environmental lighting properties of the object to ensure that the augmenting object is in harmony with the target scene.

∗e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual reality; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—stereo and time varying imagery

Keywords: video-based augmented reality, image-based rendering, depth-driven video matting and compositing, computer vision.

1 INTRODUCTION

In commercial film and television production, video matting and compositing operations make it possible for a director to transfer part of a scene between two video sequences. However, most of the past work on video matting and compositing focuses on the matting side and assumes that the camera poses in both the source and target scenes are the same or differ by a fixed 2D translation or scaling [9, 7, 2]. Therefore, with simple temporal video alignment, alpha blending is sufficient to composite the foreground object with the target view.

However, these methods cannot handle situations where the source and target video cameras have different motion. In this case, the two videos must be aligned in 3D space to reduce global geometric distortions, and the 3D foreground object must be re-rendered from a different target viewpoint. If the foreground object is directly cut and pasted into the target view, the 3D appearance of the object may not be consistent with the target scene due to the different viewing trajectories and lighting conditions. Fig.1.b shows a naive result of cutting and pasting with Chuang's video matting method [7], where the spatial artifact is clearly visible when the two videos have different poses (please see the supplemental video).

In order to ensure that the composited/transferred virtual objects are consistent with the existing ones in the target scene, our method explicitly estimates the poses of the source and target cameras, and aligns the videos along the computed camera trajectories in 3D space. Fig.2 gives an overview of the 3D object transfer process. The inputs are two videos of different scenes acquired by two moving cameras: one is the source video containing the foreground object, the other is the target video to be augmented by the object. First, we estimate the camera poses for both the source and target video sequences. Next, the algorithm aligns these two videos along their 3D camera trajectories for the object transfer, by minimizing the global differences between the source and transformed target viewing directions. In addition, after the pose estimation of the source video sequence, an initial depth of the foreground object is recovered, and an intuitive graphical user interface (GUI) is designed to remove the depth errors due to small non-rigid motion or specular reflection. Then, we combine the depth information with the video matting process to extract high-quality alpha mattes of the object layer and shadow mattes of the shadow layer from the source video separately. Finally, using the optimal target viewing position estimated from the 3D video alignment, both layers are re-rendered and blended into the corresponding target frame. As a result, we not only correctly restore the perspective deformation of the object during the transfer, but also render an object that is spatiotemporally consistent with the target scene, with realistic 3D effects as shown in Fig.1.d.

The proposed framework makes the following contributions. First, we perform 3D spatiotemporal alignment given only 2D imagery as input, and provide an effective solution to transfer 3D objects between videos of distinct scenes captured by two moving cameras, a case for which existing methods introduce noticeable artifacts. Second, we utilize the 3D information of the source scene and design a depth-driven GUI to effectively reduce the user interaction required for object segmentation and alpha matting. Third, our approach is flexible enough to handle small non-rigid motion and specular reflection. Finally, the proposed approach does not require the computation of the lighting conditions, yet is able to render a high-quality video with realistic environmental illumination. Our work not only advances the video matting and compositing problem from 2D to 3D, but also provides a feasible alternative to the expensive camera control systems that have been broadly used in the film industry.

The remainder of this paper is organized as follows. In Section 2 we review the related work and address the limitations of the existing methods. The core of our algorithm, 3D video alignment, is presented in Section 3. Section 4 describes the depth estimation process and illustrates how to handle small non-rigid motion and specular reflection. Section 5 demonstrates alpha matting of the object and shadow layers, and the final rendering process. Several results are then given in Section 6. Finally, we discuss the contributions and limitations of our approach, and provide future directions, in Section 7.

2 RELATED WORK

In digital matting, the observed color of the pixels around layer boundaries can be considered a mixture of foreground and background colors. For single-image matting, once a trimap (unknown, definitely foreground, and definitely background regions) is manually specified, the alpha values, foreground colors, and background colors of the unknown regions can be estimated under certain constraints [23, 19, 8]. Recently, Rother et al. [18] and Li et al. [14] presented approaches that reduce the manual work for foreground cutout and trimap generation by interactively specifying a small number of feature points or regions.

Figure 2: The flow chart of the 3D video transfer process: pose estimation of the source and target videos, key-frame depth estimation, object cutout and matting, soft shadow matting, 3D video alignment, and 3D object transfer.

Using the motion information of a video sequence, several approaches have been proposed to reduce the manual work of trimap generation and have achieved good matting results for 2D video sequences [7, 2]. After specifying the trimaps for a set of key-frames, Chuang et al. use the estimated 2D optical flow to interpolate the trimaps at the remaining frames. If the computed alpha mattes based on the interpolated trimaps are not satisfactory, an additional interactive process is needed to refine the interpolated trimaps. Recently, Apostoloff and Fitzgibbon presented a Bayesian background subtraction approach to automatically obtain the trimaps and recover the background [2]. In their approach, a priori knowledge is required to train the foreground model from a selected image set. Agarwala et al. propose a tracking-based approach to obtain rotoscoping for a dynamic scene with user interaction [1]. The rotoscoping method is able to create cartoon animation by attaching user-drawn strokes to the tracked contours, but it has difficulty handling intricate silhouettes, e.g., hair.

Most importantly, all of the previous work on video matting and compositing focuses only on the matting problem, and ignores the 3D relation between the source and target cameras by implicitly assuming that the corresponding source and target frames can be approximately related by a 2D similarity transformation (translation, in-plane rotation, and isotropic scaling). In order to transfer a 3D object between two videos of distinct scenes captured by two moving cameras along potentially different trajectories, a video alignment process is necessary. In the computer vision and image processing literature, 2D video alignment and registration have been investigated for a long time [29]. For the 2D case, given two largely overlapping video sequences, a set of corresponding features is used to align the videos based on a 2D transformation [20]. However, this approach does not work for our case, since there are no overlapping areas at all.

For the case of non-overlapping videos, some researchers attempt to exploit 3D information from the videos for the alignment [17, 6]. Given a set of user-specified corresponding features, Rao et al. propose an approach to align human actions captured by two static cameras based on fundamental matrices [17]. Caspi and Irani present a homography-based approach to temporally align non-overlapping videos taken by two moving cameras [6]. However, they require a fixed pose between these two cameras during video capture and assume that the cameras have the same motion. Different from the existing methods, we seek to handle a more challenging case, where the videos are taken from two different scenes with more flexible camera trajectories, and to perform an accurate 3D object transfer between the videos.


Figure 3: Transfer of the object from one arbitrary source frame to the target view. (a) One arbitrary frame from the source video. (b) One target frame to be augmented by the object. The world 3D coordinates are superimposed in the source and target frames. Bottom: comparison of transfer results with and without 3D alignment. (c) Without 3D alignment, the object is directly transferred from the source viewpoint Cs to the target view Ct (see Fig.4.a), where some distortions occur due to the imperfect reference depth. (d) Our result after applying an optimal 3D alignment transformation H on the target coordinates, where the distortion is eliminated.

3 3D VIDEO ALIGNMENT

Before 3D video alignment, we need to calibrate the cameras and estimate the camera poses for each input video sequence. Recent advances in computer vision can be utilized to estimate the camera parameters given an image or video. As reported in the computer vision literature, mainly in [11], a variety of objects, including buildings, spheres, symmetric objects, circles, and rectangles, can be used for calibration in real images. In this paper, we utilize the parallel structure of the scene to determine the vanishing points, and then estimate both the internal and external camera parameters for the entire video sequence [5, 15]. However, other methods can also be applied; for example, if the video acquisition is controllable, we can simply place a calibration board in the scene to perform calibration and camera pose estimation [27].
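For concreteness, the following NumPy sketch (our own naming, not the authors' implementation) shows the standard way to recover the focal length and rotation from three mutually orthogonal vanishing points, assuming square pixels and a known principal point:

import numpy as np

def calibrate_from_vanishing_points(v1, v2, v3, principal_point):
    # Focal length from two orthogonal vanishing points (Caprile-Torre):
    #   f^2 = -((x1-u0)(x2-u0) + (y1-v0)(y2-v0)),
    # assuming square pixels and the given principal point.
    u0, v0 = principal_point
    (x1, y1), (x2, y2) = v1, v2
    f = np.sqrt(-((x1 - u0) * (x2 - u0) + (y1 - v0) * (y2 - v0)))
    K = np.array([[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]])
    # Each column of R is the unit direction obtained by back-projecting the
    # corresponding vanishing point (sign/orthogonality fixes omitted here).
    dirs = [np.linalg.inv(K) @ np.array([x, y, 1.0]) for (x, y) in (v1, v2, v3)]
    R = np.stack([d / np.linalg.norm(d) for d in dirs], axis=1)
    return K, R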

Theoretically, we have the freedom to choose the world coordinate frames. Suppose they are simply chosen as shown in Fig.3.a-b for the source and target scenes respectively, so that the points on the floor and on the desk (where the person is and will be sitting) have the same z-coordinates. Then the foreground object can be directly transferred from the reference plane (X-Y) of the source view to the reference plane of the target view. However, this direct transfer may produce a rendering problem, since the viewing positions of the source and target cameras may be at very different locations after their world coordinates are overlapped, as shown in Fig.4.a. If we simply render the object at the target viewing position Ct using the source view captured at Cs, obvious unpleasant distortions may occur due to the imperfect reference depth and occlusion, as shown in Fig.3.c. In order to minimize these distortions, we aim to choose the 3D world coordinate systems for the source and target scenes such that the source and target camera trajectories are globally matched. In other words, we need to find a 4×4 matrix H that transforms the chosen target world coordinate system (Fig.3.b) into a new one.

Figure 4: Four degrees of freedom of the 3D object transfer. (a) Applying a transformation H to the original target camera Ct, the camera center is transferred to Co, which is close to the reference source viewpoint Cs. (b) The inverse transformation on the object can be separated into four parts: Ts, Rθ, S, and Tt. After the transformation, the object has an appearance close to that directly observed from Cs.

The intention is that the new target viewing position, Co = H Ct, should be close to the source viewpoint Cs (Fig.4.a), so that the matted object can be rendered with the least global distortion, as shown in Fig.3.d. Toward this goal, the transformation H must satisfy the following two properties:

Least Distortion: H should make Co close to the source viewpoint Cs, or minimize the viewing orientation difference $\theta$ between the cameras Co and Cs, as shown in Fig.4.a.

Correct Augmentation: H should guarantee that the object will be correctly augmented onto the target reference plane or surface.

According to the relative motion principle, applying a transformation H to the target camera is equivalent to transforming the object in the inverse direction, i.e., applying $H^{-1}$ to the 3D points, as shown in Fig.4.b.

Generally, the global coordinate transformation matrix H can be approximately decomposed into four sub-transformations, as in Eq.3, with four degrees of freedom: translations along the X or Y direction, a global scaling, and a rotation around the Z axis. As a result, the object is able to move on the reference plane of the target environment with a flexible scale and rotation. Given camera matrices $P_s^i$ and $P_t^j$ for the source frame i and the target frame j respectively, the 3D centroid of the object, X, is projected as

$$x_s^i = P_s^i X, \qquad (1)$$

$$x_o^j = P_t^j\, T_t S R_\theta T_s X = P_o^j X, \qquad (2)$$

$$H^{-1} = T_t S R_\theta T_s, \qquad (3)$$

where $T_s$ is a translation matrix that moves the object to the 3D origin of the source video, $R_\theta$ is a rotation matrix around the Z axis, $S$ is a global scaling matrix, $T_t$ encodes the object's destination in the target scene, $x_o^j$ and $x_s^i$ are the 2D projections of X, and $P_o^j$ is the transformed target camera matrix.
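As an illustration of Eqs. (1)-(3), the following NumPy sketch (parameter names are ours, not the paper's) assembles $H^{-1} = T_t S R_\theta T_s$ and projects a 3D point through the transformed target camera:

import numpy as np

def translation(tx, ty, tz=0.0):
    # 4x4 homogeneous translation.
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def rotation_z(theta):
    # 4x4 rotation about the Z axis by angle theta (radians).
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

def scaling(s):
    # 4x4 uniform scaling.
    return np.diag([s, s, s, 1.0])

def inverse_alignment(theta, scale, t_src, t_dst):
    # H^{-1} = T_t S R_theta T_s (Eq. 3): move the object to the source origin,
    # rotate about Z, scale, then place it at its destination in the target scene.
    return translation(*t_dst) @ scaling(scale) @ rotation_z(theta) @ translation(*t_src)

# Example of Eq. (2): with P_t_j a 3x4 target camera matrix and X_h the
# homogeneous 4-vector of the object centroid,
#   x_o = P_t_j @ inverse_alignment(theta, s, t_src, t_dst) @ X_h
# gives the 2D projection (up to the homogeneous scale).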


Figure 5: 3D camera trajectory alignment. Top: alignment between the source “Sit-talking” and the target “table” sequences (see Fig.1). Bottom: alignment between the source “Beetle” and the target “flower-floor” sequences (see Fig.14). The left side is the 3D view and the right side is the top view. The red curve is the source camera trajectory. The blue and green curves are the target camera trajectories before and after alignment.

However, in order to obtain the best visual effect with the least perspective distortion, we also need to enforce the least-distortion property, i.e., minimize the viewing orientation differences. Given a camera matrix $P = [M\,|\,t]$, the 3D camera location is $C = -M^{-1}t$. Here we use the vector $\vec{C}$ as the viewing direction of camera P. Therefore, the viewing orientation difference between $P_o^j$ and $P_s^i$ is computed as

$$\theta_{i,j} = \arccos\!\left(\frac{\vec{C}_s^i \cdot \vec{C}_o^j}{\|\vec{C}_s^i\| \cdot \|\vec{C}_o^j\|}\right), \qquad (4)$$

where $\vec{C}_s^i$ and $\vec{C}_o^j$ are the viewing directions of $P_s^i$ and $P_o^j$, respectively.

In practice, we want to transfer the object into the target scene at a specified location, $T_t$, and with a desired scale, $S$, while keeping a maximal viewing field for the synthetic video sequence. Therefore, we have only one degree of freedom, the rotation angle $\theta$, with which to make the source and target sequences overlap as much as possible. This angle can be estimated by minimizing the viewing angles between the ending source and target frames, or equivalently by maximizing the dot products

$$\arg\max_{\theta}\left(\frac{\vec{C}_o^1 \cdot \vec{C}_s^1}{\|\vec{C}_o^1\| \cdot \|\vec{C}_s^1\|} + \frac{\vec{C}_o^m \cdot \vec{C}_s^n}{\|\vec{C}_o^m\| \cdot \|\vec{C}_s^n\|}\right), \qquad (5)$$

where n and m are the lengths of the source and target sequences, respectively. Since $P_o^j = P_t^j T_t S R_\theta T_s$, the best $\theta$ is obtained by a global search over the range $[0, 360^\circ]$. Fig.5 shows the trajectories before and after 3D alignment. Once $\theta$ is determined, the closest source view i can be found for each target frame j by minimizing $\theta_{i,j}$.
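A possible NumPy implementation of the search in Eqs. (4)-(5) is sketched below (our own naming; it assumes the camera matrices are given as 3x4 arrays and reuses a builder such as the inverse_alignment sketch above for $T_t S R_\theta T_s$):

import numpy as np

def camera_center(P):
    # Camera location C = -M^{-1} t for a 3x4 camera matrix P = [M | t].
    M, t = P[:, :3], P[:, 3]
    return -np.linalg.inv(M) @ t

def viewing_cosine(P_a, P_b):
    ca, cb = camera_center(P_a), camera_center(P_b)
    return float(np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))

def orientation_difference(P_a, P_b):
    # Eq. (4): angle between the two viewing directions.
    return np.arccos(np.clip(viewing_cosine(P_a, P_b), -1.0, 1.0))

def best_rotation_angle(P_source, P_target, make_H_inv):
    # Eq. (5): 1-degree grid search over theta, maximizing the alignment of the
    # first and last viewing directions of the two sequences.
    # make_H_inv(theta) should return the 4x4 product T_t S R_theta T_s.
    best_theta, best_score = 0.0, -np.inf
    for theta in np.deg2rad(np.arange(0.0, 360.0)):
        H_inv = make_H_inv(theta)
        score = (viewing_cosine(P_target[0] @ H_inv, P_source[0]) +
                 viewing_cosine(P_target[-1] @ H_inv, P_source[-1]))
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta

def nearest_source_view(P_o_j, P_source):
    # For one transformed target camera, pick the source frame minimizing theta_{i,j}.
    return int(np.argmin([orientation_difference(P_i, P_o_j) for P_i in P_source]))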

4 KEY-FRAME DEPTH ESTIMATION

Without the reference depth information, the object may not be correctly rendered, e.g., Fig.1.c.

Figure 6: Depth refinement under non-rigid motion and occlusion. (a) One key-frame from the “sit-talking” sequence. (b) The initial estimated depth of the object using the multi-way cut algorithm. The red pixels have not been assigned a depth label due to occlusion or non-rigid hand movement (inside the dotted green circle). (c) The refined depth map, where the discontinuities are reduced and the unassigned pixels obtain a smooth depth by enforcing a depth-consistency constraint.

Therefore, a rough depth estimate or depth proxy is necessary for rendering the objects from a different viewpoint [13, 4, 22, 24]. Currently, a number of stereo algorithms have been developed to recover the depth information from a stereo pair [21] or from a set of calibrated images [28, 10, 12]. Instead of using all of the video frames to compute the depth of the scene, a set of key-frames is selected from the source video for depth estimation to reduce the computational complexity. These selected key-frames cover the entire viewing field of the video with an approximately uniform gap. For example, in the “Sit-talking” sequence (Fig.1), we obtained 13 key-frames out of a total of 252 source frames, with approximately a 3° viewing gap between neighboring key-frames.
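One simple way to realize this uniform-gap selection (our interpretation of the description above; the paper does not give an explicit algorithm) is a greedy sweep over the estimated per-frame viewing directions:

import numpy as np

def select_keyframes(view_dirs, gap_deg=3.0):
    # Keep a frame whenever its viewing direction deviates from the last kept
    # key-frame by more than gap_deg degrees; view_dirs are unit vectors.
    keyframes = [0]
    for i in range(1, len(view_dirs)):
        cosang = np.clip(np.dot(view_dirs[keyframes[-1]], view_dirs[i]), -1.0, 1.0)
        if np.degrees(np.arccos(cosang)) >= gap_deg:
            keyframes.append(i)
    return keyframes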

Using every five consecutive key-frames, an initial depth map is computed by a multi-way cut algorithm adapted from a multi-frame segmentation framework [25, 26]. However, the initial estimated depth (Fig.6.b) has sharp changes between the discrete depth labels, since the multi-way cut algorithm is a labeling-based approach and the depth dimension is quantized into several labels (here we used 15-20 depth labels). Moreover, from the results (Fig.6.b), we can see that there are some apparent errors around the non-rigid regions, such as the moving hand, and around the specular reflection regions¹. Therefore, we need to refine the depth map to reduce the discontinuities and fix the depth errors in those non-rigid regions.

We first determine the highly discontinuous regions based on the gradients of the depth map and smooth those regions with a Gaussian kernel. Then, we enforce a depth-consistency constraint to ensure that the depth maps from different key-frames agree with each other. Intuitively, this means that given a pixel p with depth d in key-frame i, its 2D projection into key-frame j, $\pi_{j,i}(p,d)$, should have the same depth d. Therefore, the new depth of this pixel can be updated by iteratively enforcing the constraint

$$D^{t+1}_i(p) = \sum_{i-1 \le j \le i+1} g(j-i)\, D^t_j\big(\pi_{j,i}(p, D^t_i(p))\big), \qquad (6)$$

where $D_i(\cdot)$ is the depth map of frame i, $D^t_i(p)$ is the depth of pixel p in frame i at iteration t, $\pi_{j,i}(p,d)$ is the projection of pixel p with depth d from key-frame i into key-frame j, and $g(\cdot)$ is a Gaussian weight function. Fig.6 shows the refined depth map of one key-frame of the “sit-talking” sequence.
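A direct (unoptimized) sketch of the iterative update in Eq.6 is given below; it assumes a hypothetical helper project(j, i, p, d) that maps pixel p of key-frame i at depth d into key-frame j using the recovered cameras, and normalizes the Gaussian weights so each update is a convex combination:

import numpy as np

def refine_depth(depths, project, iters=5, sigma=1.0):
    # depths: list of HxW depth maps, one per key-frame.
    # project(j, i, p, d): hypothetical projection helper (not defined here).
    H, W = depths[0].shape
    for _ in range(iters):
        updated = [d.copy() for d in depths]
        for i in range(len(depths)):
            nbrs = [j for j in (i - 1, i, i + 1) if 0 <= j < len(depths)]
            w = np.array([np.exp(-(j - i) ** 2 / (2.0 * sigma ** 2)) for j in nbrs])
            w /= w.sum()
            for y in range(H):
                for x in range(W):
                    d = depths[i][y, x]
                    samples = []
                    for j in nbrs:
                        qx, qy = project(j, i, (x, y), d)
                        qx = int(np.clip(qx, 0, W - 1))
                        qy = int(np.clip(qy, 0, H - 1))
                        samples.append(depths[j][qy, qx])
                    updated[i][y, x] = float(np.dot(w, samples))
        depths = updated
    return depths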

¹ Due to the fact that the specular reflection areas are view-variant, the corresponding motion of these regions is not consistent with the camera motion and its epipolar geometry. Here we call this the non-stationary or non-rigid property of the specular reflection.


Figure 7: Object segmentation and alpha matting. (a) The first frame with two kinds of feature points. The blue curves are used to exclude shadow areas from the foreground cutout, and the yellow curve marks a rough shadow region. (b) Top: the feature points projected into the succeeding key-frame using the estimated depth; bottom: the corresponding foreground segmentation. (c) The trimap of the object layer (top) and its corresponding alpha mattes (bottom).

Figure 8: Matting of complex objects. Left: a set of feature points in a key-frame of the “doll” sequence, where two large unknown regions are marked by the green curves. Middle: the corresponding trimap. Right: the final matting result.

Once the correct depth map is obtained for each key-frame, we interpolate the depth maps for the remaining non-key-frames by again applying Eq.6.

5 VIDEO MATTING WITH SOFT SHADOWS

Objects in natural scenes are often accompanied by shadows, which have been ignored by most previous video matting approaches [7, 2]. In [9], Chuang et al. point out that shadow is an essential element for realistic natural scene compositing. However, they only study the simplified case where scenes are illuminated by one dominant, point-like light source, and they require a static camera. More importantly, they also require that the relationship between the dominant light sources, the reference planes, and the cameras be matched in the source and target scenes. Here we investigate a more general case where the shadow mattes vary over a wide range under multiple light sources.

5.1 Object cutout and matting

We decompose our matting problem into a two-layer system: a shadow layer and an object layer. However, it is not trivial to automatically separate these two layers, especially around mixed or low-contrast regions such as the bumpers of the Beetle.

Figure 9: Object matting refinement after the temporal consistency enforcement. Top: the results before refinement. Bottom: the results after refinement. The left side shows results from the “doll” sequence; the right side shows results from the “sit-talking” sequence.

Figure 11: Local shadow matte editing. (a) The temporally smoothed shadow together with the manipulated feature points. After marking the undesired shadow region (inside the red curves), we remove it and obtain the final result in (b). (c) The shadow mattes warped into another frame.

To separate these two layers, the user only needs to mark a few lines in our GUI to roughly specify the shadow region along its boundaries. Then, combining the background difference map with the specified feature points, a precise segmentation of the foreground object is obtained, as shown in Fig.7.a. In the subsequent key-frame j (Fig.7.b), these feature points are reused for the next frame's segmentation via the 3D projection $\pi_{j,i}(p,d)$, as in the depth correction step. For the remaining frames, the foreground cutouts are interpolated from the neighboring key-frames using the 3D projection. Then, a trimap is created by slightly expanding and shrinking the boundary, and the object matting is obtained as shown in Fig.7.c. If the object has a complicated silhouette (e.g., hair), a large unknown region may need to be specified in the key-frame using the GUI to estimate the object matting, as shown in Fig.8. Then, applying the Poisson matting technique [23], the alpha mattes of the foreground object are computed.
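The trimap construction by boundary expansion and shrinking can be illustrated with simple morphology (SciPy here; the band width is illustrative, not a value from the paper):

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def trimap_from_cutout(mask, band=5):
    # Definite foreground = eroded cutout, definite background = outside the
    # dilated cutout, unknown = the band in between (encoded as 0.5).
    fg = binary_erosion(mask, iterations=band)
    bg = ~binary_dilation(mask, iterations=band)
    trimap = np.full(mask.shape, 0.5, dtype=np.float32)
    trimap[fg] = 1.0
    trimap[bg] = 0.0
    return trimap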

However, the consecutive alpha mattes may not be consistent in the temporal domain. Therefore, we again apply a Gaussian kernel to each pixel p of the alpha channel to ensure that the alpha mattes are temporally coherent:

$$\alpha_i(p) = \sum_{i-5 \le j \le i+5} g(j-i)\, \alpha_j\big(\pi_{j,i}(p, D_i(p))\big), \qquad (7)$$

where $\alpha_i(\cdot)$ is the alpha map of frame i, and $g(\cdot)$ is a 1D Gaussian weight function. After one pass of temporal filtering, some boundary noise is removed, particularly for irregular boundaries such as hair. Some refined matting results are shown in Fig.9.
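A sketch of the temporal alpha filtering in Eq.7, reusing the same hypothetical project(j, i, p, d) helper as in the depth-refinement sketch above:

import numpy as np

def smooth_alpha_temporally(alphas, depths, project, radius=5, sigma=2.0):
    # Gaussian-weighted average of the alpha values of the +/- radius neighboring
    # frames, sampled at the depth-based correspondences.
    n = len(alphas)
    H, W = alphas[0].shape
    out = []
    for i in range(n):
        nbrs = [j for j in range(i - radius, i + radius + 1) if 0 <= j < n]
        w = np.array([np.exp(-(j - i) ** 2 / (2.0 * sigma ** 2)) for j in nbrs])
        w /= w.sum()
        acc = np.zeros((H, W), dtype=np.float32)
        for y in range(H):
            for x in range(W):
                d = depths[i][y, x]
                vals = []
                for j in nbrs:
                    qx, qy = project(j, i, (x, y), d)
                    qx = int(np.clip(qx, 0, W - 1))
                    qy = int(np.clip(qy, 0, H - 1))
                    vals.append(alphas[j][qy, qx])
                acc[y, x] = float(np.dot(w, vals))
        out.append(acc)
    return out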

5.2 Shadow matting and editing

For the shadow layer, a rough shadow region needs to be marked in the first key-frame, as shown by the yellow curve in Fig.7.a. A planar homography then projects this shape into the other frames to cut out the shadow boundary. Instead of using a trimap to estimate the alpha matting, we propose a new bi-map to extract the soft shadow matting.


Figure 10: Shadow matting of one frame (Fig.7.a) of the “sit-talking” sequence. (a) The bi-map of the shadow layer. (b) After removing the foreground object, the observation I is estimated by Poisson filling. (c) The lit image L obtained from a reference image without the object. (d) The initial shadow matting $\alpha_0$. (e) The enhanced shadow mattes using the membrane model.

In this bi-map, there are only two parts: the definite background B and the unknown region U, which are separated by the specified region boundary. Given an estimated observed color I (Fig.10.b) and the lit color L without the shadow (Fig.10.c), the initial shadow channel $\alpha_0$ for each pixel can be estimated as

$$\alpha_0 = \frac{I \cdot L}{\|L\|^2 + \epsilon}, \qquad (8)$$

where $\epsilon$ is a small constant to prevent division by zero. However, this matting equation does not enforce the boundary condition that the alpha values at the shadow boundary $\partial U$ are 0. To enforce this boundary condition, we employ $\lambda \nabla \alpha_0$ as a guidance field with the membrane model to re-estimate the alpha value $\alpha$, such that

$$\min_{\alpha} \iint_U \|\nabla\alpha - \lambda \nabla\alpha_0\|^2 \quad \text{with} \quad \alpha|_{\partial U} = 0, \qquad (9)$$

where $\alpha|_{\partial U} = 0$ enforces the boundary condition, and $\lambda$ is a constant coefficient used to scale the guidance field. Using this approach, we not only enhance the alpha mattes through the guidance scale, but also effectively smooth the noise with the membrane model, as shown in Fig.10.e. Once the shadow mattes are obtained for each frame, we further enforce temporal consistency to refine them, based on their planar homography invariance, such that

$$\alpha_1(p) = \frac{1}{N}\sum_{i=1}^{N} \alpha_i(H_{1,i}\, p), \qquad (10)$$

where $\alpha_i(\cdot)$ is the alpha map of frame i, and $H_{1,i}$ is the homography from frame 1 to frame i with respect to the plane on which the shadow is cast. Fig.11.a shows the temporally smoothed shadow mattes. However, the smoothed results still contain some errors due to noise and the imperfectly estimated I. In order to achieve the desired shadow matting, we introduce a local editing process that allows users to locally manipulate the shadow matting by specifying a set of regions, as shown in Fig.11.
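The shadow-layer computations can be sketched as follows (NumPy/OpenCV; the membrane solve is a plain Jacobi iteration of the Euler-Lagrange equation of Eq.9, and the λ and iteration values are illustrative, not from the paper):

import numpy as np
import cv2

def initial_shadow_matte(I, L, eps=1e-6):
    # Eq. (8): per-pixel ratio of the observed color I to the lit color L.
    return np.sum(I * L, axis=-1) / (np.sum(L * L, axis=-1) + eps)

def membrane_refine(alpha0, unknown, lam=1.2, iters=500):
    # Jacobi iteration for the Euler-Lagrange equation of Eq. (9):
    # laplacian(alpha) = lam * laplacian(alpha0) inside U, alpha = 0 elsewhere.
    # (np.roll wraps at the image border; acceptable for a sketch only.)
    lap0 = (np.roll(alpha0, 1, 0) + np.roll(alpha0, -1, 0) +
            np.roll(alpha0, 1, 1) + np.roll(alpha0, -1, 1) - 4.0 * alpha0)
    alpha = np.zeros_like(alpha0)
    for _ in range(iters):
        neighbors = (np.roll(alpha, 1, 0) + np.roll(alpha, -1, 0) +
                     np.roll(alpha, 1, 1) + np.roll(alpha, -1, 1))
        alpha = np.where(unknown, (neighbors - lam * lap0) / 4.0, 0.0)
    return alpha

def temporal_shadow_average(mattes, homographies):
    # Eq. (10): average the mattes after sampling each frame i at H_{1,i} p.
    # WARP_INVERSE_MAP makes dst(p) = src(H p), matching the equation.
    h, w = mattes[0].shape
    warped = [cv2.warpPerspective(m, np.asarray(H, dtype=np.float64), (w, h),
                                  flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
              for m, H in zip(mattes, homographies)]
    return np.mean(warped, axis=0)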

5.3 Layer compositing with image-based rendering

After performing the 3D camera trajectory alignment (Section 3), we have already determined the closest source view for each target frame. Then, using image-based rendering, we re-render the shadow layer and the object layer with their alpha channels separately, and composite them into the target view. The detailed steps are described as follows.

Figure 12: Layer compositing by image-based rendering. (a) The source frame. (b) The final result after compositing the rendered layers onto the target view. (c) The rendering result of the object layer. (d) The rendering result of the shadow layer. (e) Layer compositing.

Given a new target view j, the nearest source frame i and its matting data are projected into the virtual view by the projection $\pi_{j,i}(p, D_i(p))$. The results for both the shadow and the foreground object are stored in separate buffers, each containing color, depth, and opacity. For each layer, a 3D mesh is created from its depth map. Then, these two layers are rendered and blended using the opacity channel to generate the final image, as shown in Fig.12. Even though there are some gaps between the two projection matrices $P_o^j$ and $P_s^i$, as shown in Fig.5 (e.g., the viewing orientation difference $\theta_{i,j}$ between the closest $P_o^j$ and $P_s^i$ may be up to 10-15°), our rendering shows that the proposed method produces very promising results.
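The per-frame blending can be illustrated with a simple "over" composite; the shadow attenuation model below is our own simplification for illustration (the paper composites the rendered shadow layer via its opacity channel without specifying this exact formula):

import numpy as np

def composite_frame(target, object_rgb, object_alpha, shadow_alpha, shadow_strength=0.5):
    # Attenuate the target where the shadow matte is strong, then place the
    # re-rendered object on top with a standard "over" composite.
    # shadow_strength is an illustrative parameter, not a value from the paper.
    sh = shadow_alpha[..., None]
    ob = object_alpha[..., None]
    shadowed = target * (1.0 - shadow_strength * sh)
    return ob * object_rgb + (1.0 - ob) * shadowed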

6 EXPERIMENTAL RESULTS

We applied our approach to a number of video sequences, all of which were captured by a hand-held camera without any assisting equipment. The resolution of all our video sequences is 640×480. To illustrate the soft shadow effects, our video sequences were taken in indoor scenes illuminated by multiple light sources. To test our approach, two groups of objects were selected.

The first group contains objects with highly specular materials, such as the car and the mug. The second group includes objects with soft and irregular hair or fur, such as the doll, a real human, and the toy dog.


Figure 13: Sample results for the other source video sequences. From left to right, the columns are source frames, depth maps, object mattes, and shadow mattes.

Fig.13 shows sample frames of the source objects, the corresponding estimated depth, alpha mattes, and shadow mattes. The number of key-frames required for each sequence is given in Table 1.

With the strong support of the estimated depth map, our framework also provides the flexibility to change the target destination, $T_t$, and the rotation angle, $\theta$, within a certain range, which allows the user to create special effects, such as moving or duplicating objects in the target video.

Multiple Objects in One Scene: The first interesting application is to transfer multiple objects into one target video, even though these objects come from different sources.

Object Moving, Colorizing, and Deforming: Due to the four degrees of freedom, we can translate, rotate, scale, or even deform the objects in the process of the video transfer.

Object Duplication: Another interesting application is duplicating the object into multiple copies, each of which may have a slightly different appearance according to its own pose and destination.

Fig.14 shows four synthetic video sequences that demonstrate the above applications. To appreciate the correct 3D visual effects, the reader is strongly encouraged to watch our supplemental video. Using the proposed approach, we not only correctly recover the perspective deformation to yield a consistent and realistic 3D effect, but also implicitly exploit the lighting information recorded in the source frames. Our approach avoids the challenges of complicated lighting computation, and effectively preserves the subtle variations of the specular reflections so that the rendered objects plausibly respond to the lighting environment of the target scene.

7 CONCLUSION AND DISCUSSION

In this paper, we have presented a new system for geometrically correct and visually realistic 3D object transfer that is capable of compositing 3D objects from source videos into a distinct target video captured by a moving camera along a different trajectory. The proposed 3D spatiotemporal alignment approach provides a robust solution for aligning two non-overlapping videos of different scenes. With the assistance of our intuitive GUI, our approach is able to efficiently handle small non-rigid motion and specular reflection during depth recovery, and also tackles the difficulties of extracting accurate object mattes and the corresponding soft-shadow mattes.

Table 1: The number of key-frames used in our video sequences.

Name          Key-frames   Total frames
Sit-talking       13           252
Beetle            20           405
Doll              21           433
Toy dog           17           400
Blue car          23           504
Mug               16           300

The experimental results strongly demonstrate that our approach can generate a realistic 3D video with plausible environmental illumination from multiple video sources without expensive lighting computation.

While we have achieved a realistic and correct 3D video transfer between two different scenes, some limitations of our framework remain to be addressed. Transferring an object from one video to another is an extremely difficult problem in the general case. Here we impose some constraints to simplify the problem and provide a feasible solution. One constraint is that the transferred object is located on a plane, where the shadow is relatively easy to extract and transfer. This constraint holds for most natural videos, which contain surfaces such as the ground, the sea surface, or man-made planar structures. The second constraint is that we require the scene to be nearly static, which provides more robust 3D information when the object is transferred between 3D scenes. However, we also illustrate that our approach has some flexibility to handle small non-rigid motion of the objects, such as hand movement. For large non-rigid motion, such as a human walking or running, the current framework does not work. One possible and interesting solution is to extend our framework by combining it with the optical flow technique [7] for non-rigid object transfer between 3D videos. Another limitation of our approach is that we need a sufficient amount of background texture or a wide field of view to perform camera calibration and pose estimation.

REFERENCES

[1] A. Agarwala, A. Hertzmann, D. Salesin, and S. Seitz. Keyframe-based tracking for rotoscoping and animation. In Proceedings of ACM SIGGRAPH, 2004.

[2] N. Apostoloff and A. Fitzgibbon. Bayesian video matting using learnt image priors. In Proceedings of IEEE CVPR, 2004.

[3] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In Proceedings of IEEE ICCV, 2001.

[4] C. Buehler, M. Bosse, S. Gortler, L. McMillan, and M. Cohen. Unstructured lumigraph rendering. In Proceedings of ACM SIGGRAPH, 2001.

[5] B. Caprile and V. Torre. Using vanishing points for camera calibration. Int. J. Comput. Vision, 4(2):127-140, 1990.

[6] Y. Caspi and M. Irani. Alignment of non-overlapping sequences. In Proceedings of IEEE ICCV, 2001.

[7] Y. Chuang, A. Agarwala, B. Curless, D. Salesin, and R. Szeliski. Video matting of complex scenes. In Proceedings of ACM SIGGRAPH, 2002.

[8] Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In Proceedings of IEEE CVPR, 2001.

[9] Y. Chuang, D. Goldman, B. Curless, D. Salesin, and R. Szeliski. Shadow matting and compositing. In Proceedings of ACM SIGGRAPH, 2003.

[10] A. Fitzgibbon, Y. Wexler, and A. Zisserman. Image-based rendering using image-based priors. In Proceedings of IEEE ICCV, 2003.


Figure 14: Four synthetic video sequences with one or more objects transferred from different source video sequences. (a) The Beetle moves on the floor and its color keeps changing during the motion. (b) A blue car and a toy dog are transferred into the scene with an apparent scale change, while the calibration grid is gradually removed. (c) The “sit-talking” person is duplicated during the transfer. (d) One Beetle is split into two with quite different poses and locations, while the doll and the mug are also correctly augmented into the target scene simultaneously.

[11] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.

[12] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In Proceedings of ECCV, 2002.

[13] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of ACM SIGGRAPH, 1996.

[14] Y. Li, J. Sun, C. Tang, and H. Shum. Lazy snapping. In Proceedings of ACM SIGGRAPH, 2004.

[15] D. Liebowitz and A. Zisserman. Combining scene and auto-calibration constraints. In IEEE International Conference on Computer Vision, pages 293-300, 1999.

[16] P. Perez, M. Gangnet, and A. Blake. Poisson image editing. In Proceedings of ACM SIGGRAPH, 2003.

[17] C. Rao, A. Gritai, and M. Shah. View-invariant alignment and matching of video sequences. In Proceedings of IEEE ICCV, 2003.

[18] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. In Proceedings of ACM SIGGRAPH, 2004.

[19] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In Proceedings of IEEE CVPR, 2000.

[20] P. Sand and S. Teller. Video matching. In Proceedings of ACM SIGGRAPH, 2004.

[21] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1), 2002.

[22] J. Shade, S. Gortler, L. He, and R. Szeliski. Layered depth images. In Proceedings of ACM SIGGRAPH, 1998.

[23] J. Sun, J. Jia, C. Tang, and H. Shum. Poisson matting. In Proceedings of ACM SIGGRAPH, 2004.

[24] J. Xiao and M. Shah. Tri-view morphing. Computer Vision and Image Understanding, 96(3), 2004.

[25] J. Xiao and M. Shah. Accurate motion layer segmentation and matting. In Proceedings of IEEE CVPR, 2005.

[26] J. Xiao and M. Shah. Motion layer extraction in the presence of occlusion using graph cuts. IEEE Transactions on PAMI, 27(10), 2005.

[27] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on PAMI, 22(11), 2000.

[28] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. In Proceedings of ACM SIGGRAPH, 2004.

[29] B. Zitova and J. Flusser. Image registration methods: a survey. Image and Vision Computing, 21, 2003.
