論文紹介"dynamicfusion: reconstruction and tracking of non-‐rigid scenes in...


Page 1

DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time

Richard A. Newcombe, Dieter Fox, Steven M. Seitz

CVPR 2015, Best Paper Award

Paper introduction by Ken Sakurada (Tokyo Institute of Technology), June 23, 2015

Page 2

Presenter

Ken Sakurada ( http://www.vision.is.tohoku.ac.jp/us/member/sakurada )

• Postdoctoral researcher, Tokyo Institute of Technology (April 2015–)
• Graduated from the Okatani Lab, Tohoku University (March 2015)
• Twitter ID: @sakuDken
• Research interests
  – Spatio-temporal modeling of cities using vehicle-mounted cameras
  – SfM, MVS, CNN, ...

• Main publications
  – CVPR (Poster)  1
  – BMVC (Poster)  1
  – ACCV (Oral)    1

• "Best Application Paper Honorable Mention Award"
  – IROS ...

If you notice any issues with the content, I would appreciate hearing from you.
Email: [email protected]

[email protected]

Page 3

About the authors of DynamicFusion

University of Washington

• Richard Newcombe (PhD)
  – Author of KinectFusion and DTAM

• Dieter Fox (Professor)
  – Author of "Probabilistic Robotics" (the standard reference on SLAM)

• Steven Seitz (Professor)
  – A leading figure in SfM and related fields

Page 4

DynamicFusion

Dense SLAM system
• Fuses depth images to reconstruct dynamic scenes in 3D in real time
  – Extends KinectFusion to dynamic scenes

Video: https://www.youtube.com/watch?v=i1eZekcc_lM

(a) Initial Frame at t = 0s  (b) Raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s  (c) Node Distance  (d) Canonical Model  (e) Canonical model warped into its live frame  (f) Model Normals

Figure 2: DynamicFusion takes an online stream of noisy depth maps (a,b) and outputs a real-time dense reconstruction of the moving scene (d,e). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d,f). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample of model vertices over the last 1 second of scene motion together with a coordinate frame showing the rigid body component of the scene motion. In (c) we render the nearest node to model surface distance where increased distance is mapped to a lighter value.

of objects with both translation and rotation results in significantly better tracking and reconstruction. For each canonical point $\mathbf{v}_c \in \mathcal{S}$, $T_{lc} = \mathcal{W}(\mathbf{v}_c)$ transforms that point from canonical space into the live, non-rigidly deformed frame of reference.

Since we will need to estimate the warp function for each new frame, $\mathcal{W}_t$, its representation must be efficiently optimisable. One possibility is to densely sample the volume, e.g. representing a quantised SE(3) field at the resolution of the truncated signed distance (TSDF) geometric representation. However, a typical TSDF volume reconstruction at a relatively low resolution of 256³ voxels would require the solution of $6 \times 256^3$ parameters per frame, about 10 million times more than in the original KinectFusion algorithm, which only estimates a single rigid transformation. Clearly, a completely dense parametrisation of the warp function is infeasible. In reality, surfaces tend to move smoothly in space, and so we can instead use a sparse set of transformations as bases and define the dense volumetric warp function through interpolation. Due to its computational efficiency and high quality interpolation capability we use dual-quaternion blending (DQB) [11] to define our warp function:

$$\mathcal{W}(\mathbf{x}_c) \equiv SE3(\mathbf{DQB}(\mathbf{x}_c)), \quad (1)$$

where the weighted average over unit dual quaternion transformations is simply

$$\mathbf{DQB}(\mathbf{x}_c) \equiv \frac{\sum_{k \in N(\mathbf{x}_c)} w_k(\mathbf{x}_c)\,\hat{q}_{kc}}{\left\| \sum_{k \in N(\mathbf{x}_c)} w_k(\mathbf{x}_c)\,\hat{q}_{kc} \right\|},$$

with each unit dual-quaternion $\hat{q}_{kc} \in \mathbb{R}^8$. Here, $N(\mathbf{x})$ are the k-nearest transformation nodes to the point $\mathbf{x}$ and $w_k : \mathbb{R}^3 \mapsto \mathbb{R}$ defines a weight that alters the radius of influence of each node, and $SE3(\cdot)$ converts from quaternions back to an SE(3) transformation matrix. The state of the warp-field $\mathcal{W}_t$ at time $t$ is defined by the values of a set of $n$ deformation nodes $\mathcal{N}^t_{warp} = \{\mathbf{dg}_v, \mathbf{dg}_w, \mathbf{dg}_{se3}\}_t$. Each of the $i = 1..n$ nodes has a position in the canonical frame $\mathbf{dg}^i_v \in \mathbb{R}^3$, its associated transformation $T_{ic} = \mathbf{dg}^i_{se3}$, and a radial basis weight $\mathbf{dg}_w$ that controls the extent of the transformation $w_i(\mathbf{x}_c) = \exp\left(-\|\mathbf{dg}^i_v - \mathbf{x}_c\|^2 / \left(2 (\mathbf{dg}^i_w)^2\right)\right)$. Each radius parameter $\mathbf{dg}^i_w$ is set to ensure the node's influence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in detail in section (3.4). Since the warp function defines a rigid body transformation for all supported space, both position and any associated orientation of space is transformed, e.g., the vertex $\mathbf{v}_c$ from a surface with orientation or normal $\mathbf{n}_c$ is transformed into the live frame as $(\mathbf{v}_t, 1)^\top = \mathcal{W}_t(\mathbf{v}_c)(\mathbf{v}_c^\top, 1)^\top$ and $(\mathbf{n}_t, 0)^\top = \mathcal{W}_t(\mathbf{v}_c)(\mathbf{n}_c^\top, 0)^\top$. We note that scaling of space can also be represented with this warp function, since compression and expansion of space are represented by neighbouring points moving in converging and diverging directions. Finally, we note that we can factor out any rigid body transformation common to all points in the volume, e.g., due to camera motion. We therefore introduce the explicit warped model to live camera transform, $T_{lw}$, and compose this onto the volumetric warp function;
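To make the warp-field evaluation concrete, below is a minimal NumPy sketch of $\mathcal{W}(\mathbf{x}_c) = SE3(\mathbf{DQB}(\mathbf{x}_c))$: blend the unit dual quaternions of the k-nearest deformation nodes with the Gaussian weights $w_i$, renormalise, and convert back to a 4×4 rigid transform. This is not the authors' GPU implementation; the array layouts, the choice k = 4, and the helper names are illustrative assumptions.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def dqb_warp(x_c, node_pos, node_dq, node_radius, k=4):
    """Evaluate W(x_c) = SE3(DQB(x_c)) at a canonical-space point.

    node_pos:    (n, 3) node positions dg_v
    node_dq:     (n, 8) unit dual quaternions dg_se3, stored as
                 (real w,x,y,z, dual w,x,y,z)
    node_radius: (n,)   radial basis weights dg_w
    Returns a 4x4 rigid transform.
    """
    d = np.linalg.norm(node_pos - x_c, axis=1)
    nn = np.argsort(d)[:k]                        # k-nearest nodes N(x_c)
    w = np.exp(-d[nn]**2 / (2.0 * node_radius[nn]**2))
    blended = (w[:, None] * node_dq[nn]).sum(axis=0)
    blended /= np.linalg.norm(blended[:4])        # normalise by the real part

    q_r, q_d = blended[:4], blended[4:]
    R = quat_to_rot(q_r)
    # translation of a unit dual quaternion: t = 2 * q_d * conj(q_r)
    t = 2.0 * quat_mul(q_d, q_r * np.array([1, -1, -1, -1]))[1:]
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

Applying the returned matrix to the homogeneous vertex $(\mathbf{v}_c, 1)^\top$ and, with the translation having no effect, to the normal $(\mathbf{n}_c, 0)^\top$ reproduces the per-point warp described above.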

Page 5

KinectFusion

Dense SLAM system for static scenes
• Builds a dense surface model from multiple depth images
• Estimates the camera pose by registering the latest depth image against the model obtained so far

"KinectFusion: Real-Time Dense Surface Mapping and Tracking" (ISMAR 2011)
Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Andrew Fitzgibbon

KinectFusion: Real-Time Dense Surface Mapping and Tracking*

Richard A. Newcombe (Imperial College London), Shahram Izadi (Microsoft Research), Otmar Hilliges (Microsoft Research), David Molyneaux (Microsoft Research / Lancaster University), David Kim (Microsoft Research / Newcastle University), Andrew J. Davison (Imperial College London), Pushmeet Kohli (Microsoft Research), Jamie Shotton (Microsoft Research), Steve Hodges (Microsoft Research), Andrew Fitzgibbon (Microsoft Research)

Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).

ABSTRACT

We present a system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware. We fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real-time. The current sensor pose is simultaneously obtained by tracking the live depth frame relative to the global model using a coarse-to-fine iterative closest point (ICP) algorithm, which uses all of the observed depth data available. We demonstrate the advantages of tracking against the growing full surface model compared with frame-to-frame tracking, obtaining tracking and mapping results in constant time within room sized scenes with limited drift and high accuracy. We also show both qualitative and quantitative results relating to various aspects of our tracking and mapping system. Modelling of natural scenes, in real-time with only commodity sensor and GPU hardware, promises an exciting step forward in augmented reality (AR), in particular, it allows dense surfaces to be reconstructed in real-time, with a level of detail and robustness beyond any solution yet presented using passive computer vision.

Keywords: Real-Time, Dense Reconstruction, Tracking, GPU, SLAM, Depth Cameras, Volumetric Representation, AR

Index Terms: I.3.3 [Computer Graphics] Picture/Image Generation - Digitizing and Scanning; I.4.8 [Image Processing and Computer Vision] Scene Analysis - Tracking, Surface Fitting; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Artificial, augmented, and virtual realities

*This work was performed at Microsoft Research.

1 INTRODUCTION

Real-time infrastructure-free tracking of a handheld camera whilst simultaneously mapping the physical scene in high-detail promises new possibilities for augmented and mixed reality applications.

In computer vision, research on structure from motion (SFM) and multi-view stereo (MVS) has produced many compelling results, in particular accurate camera tracking and sparse reconstructions (e.g. [10]), and increasingly reconstruction of dense surfaces (e.g. [24]). However, much of this work was not motivated by real-time applications.

Research on simultaneous localisation and mapping (SLAM) has focused more on real-time markerless tracking and live scene reconstruction based on the input of a single commodity sensor—a monocular RGB camera. Such 'monocular SLAM' systems such as MonoSLAM [8] and the more accurate Parallel Tracking and Mapping (PTAM) system [17] allow researchers to investigate flexible infrastructure- and marker-free AR applications. But while these systems perform real-time mapping, they were optimised for efficient camera tracking, with the sparse point cloud models they produce enabling only rudimentary scene reconstruction.

In the past year, systems have begun to emerge that combine PTAM's handheld camera tracking capability with dense surface MVS-style reconstruction modules, enabling more sophisticated occlusion prediction and surface interaction [19, 26]. Most recently in this line of research, iterative image alignment against dense reconstructions has also been used to replace point features for camera tracking [20]. While this work is very promising for AR, dense scene reconstruction in real-time remains a challenge for passive monocular systems which assume the availability of the right type of camera motion and suitable scene illumination.

But while algorithms for estimating camera pose and extracting geometry from images have been evolving at pace, so have the camera technologies themselves. New depth cameras based either on time-of-flight (ToF) or structured light sensing offer dense measurements of depth in an integrated device. With the arrival of Microsoft's Kinect, such sensing has suddenly reached wide consumer-level accessibility. The opportunities for SLAM and AR

Video: https://www.youtube.com/watch?v=quGhaggn3cQ

Page 6

KinectFusion: Processing pipeline

Page 7

KinectFusion: Camera motion
• The camera pose is represented as a 6-DoF rigid body transformation
  – Transforms the local depth map into the global surface frame

Page 8

KinectFusion: From depth map to vertices and normals

• Apply a bilateral filter to the depth map (noise reduction)
• Vertices (3D points)

  $\mathbf{v}_k = D_k(\mathbf{u})\, K^{-1}\, (u, v, 1)^\top$

  $D_k(\mathbf{u})$: depth at pixel $\mathbf{u} = (u, v)^\top$
  $K$: camera intrinsic matrix

• Transform the 3D points from camera coordinates to world coordinates

  $\mathbf{v}_w = T_w\, \mathbf{v}_k$

• Unit normal vectors
  – Estimated from the 3D points of neighbouring pixels (see the sketch below)

  $\mathbf{N}(x, y) = \dfrac{\mathbf{a} \times \mathbf{b}}{\|\mathbf{a} \times \mathbf{b}\|}$
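A minimal NumPy sketch of the two steps on this slide, under assumed conventions (pinhole intrinsics $K$, invalid depths stored as 0): back-project each pixel with $K^{-1}$ to get a vertex map, then take cross products of neighbouring vertices for unit normals.

```python
import numpy as np

def depth_to_vertex_map(depth, K):
    """Back-project a depth map into camera-frame 3D points:
    v = D(u) * K^-1 * (u, v, 1)^T. Pixels with depth 0 stay at the origin."""
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # K^-1 (u, v, 1)^T for every pixel
    return rays * depth[..., None]

def vertex_to_normal_map(vmap):
    """Unit normals N = (a x b) / ||a x b||, where a and b are the
    right-neighbour and down-neighbour vertex differences."""
    a = np.zeros_like(vmap)
    b = np.zeros_like(vmap)
    a[:, :-1] = vmap[:, 1:] - vmap[:, :-1]   # V(u+1, v) - V(u, v)
    b[:-1, :] = vmap[1:, :] - vmap[:-1, :]   # V(u, v+1) - V(u, v)
    n = np.cross(a, b)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return np.where(norm > 0, n / np.where(norm > 0, norm, 1.0), 0.0)

# World-frame vertices then follow as v_w = T_w @ (v_k, 1)^T.
```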

Page 9

KinectFusion: Processing pipeline

Joint estimation problem: the camera motion and the surface are estimated simultaneously

(Pipeline diagram: Raw Depth → Depth Map Conversion → Model-Frame Camera Track (ICP) → Volumetric Integration (TSDF Fusion) → Model Rendering (TSDF Raycast); the predicted vertex and normal maps are matched against the measured vertex and normal maps, ICP outliers are discarded, and the 6DoF pose and raw depth feed the integration step.)

From Depth to a Dense Oriented Point Cloud

Each valid depth map value D(u) at pixel u = (x, y)⊤ provides a 3D point measurement in the camera frame. This requires the known camera intrinsics K, with focal lengths f0, f1 and principal point p0, p1. The depth map at time k therefore provides a scale-correct 3D point measurement at each pixel: a vertex map Vk.

Transformation of the 3D point from the camera to the world frame is $\mathbf{v}_w = T_{w,k}\,\mathbf{v}_k$.

Approximating surface normals

We can estimate the surface normal from neighbouring pairs of 3D points by exploiting the regular grid structure. Unit normal: $N(x, y) = \dfrac{\mathbf{a} \times \mathbf{b}}{\|\mathbf{a} \times \mathbf{b}\|}$.

Joint Estimation Problem: What is the camera motion and surface geometry?

(Slides from Richard A. Newcombe (RSE and GRAIL labs), CSE 576, "KinectFusion: Real-Time Dense Surface Mapping and Tracking")

Page 10

3D reconstruction is possible once the camera motion is known

Page 11

Once the camera motion is known...

Page 12

Once the camera motion is known...

Page 13

Once the camera motion is known...

Page 14

...the measurements can be fused (surface generation)...

Page 15

...and conversely, once the 3D shape is known...

Page 16

...a new surface can be registered against it...

Page 17

...and by minimizing the surface measurement error...

Page 18

...the camera pose is obtained and the depth maps can be fused...

Page 19

KinectFusion: Surface representation

Dense Mapping as Surface Reconstruction

(Pipeline diagram as on page 9: Depth Map Conversion → Model-Frame Camera Track (ICP) → Volumetric Integration (TSDF Fusion) → Model Rendering (TSDF Raycast).)

Implicit Surface Representation

We can implicitly define the geometry as the level-set of a function.

Truncated Signed Distance Function surface representations

We use a truncated signed distance function representation, $F(\mathbf{x}) : \mathbb{R}^3 \mapsto \mathbb{R}$, for the estimated surface, which lies where $F(\mathbf{x}) = 0$.

Figure: A cross section through a 3D Signed Distance Function of the surface shown.

Signed Distance Function surfaces

(Slides from Richard A. Newcombe (RSE and GRAIL labs), CSE 576, "KinectFusion: Real-Time Dense Surface Mapping and Tracking")
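To make the level-set idea concrete, here is a tiny sketch (not from the paper or the lecture slides): a truncated signed distance function of a sphere sampled on a voxel grid, whose zero crossing is exactly the surface. Grid size, radius, and the truncation distance mu are arbitrary illustrative values.

```python
import numpy as np

def truncated_sdf_sphere(points, center, radius, mu):
    """Truncated signed distance to a sphere: negative inside, positive
    outside, clamped to [-mu, mu]. The surface is the zero level set."""
    d = np.linalg.norm(points - center, axis=-1) - radius
    return np.clip(d, -mu, mu)

# Sample a small voxel grid and locate voxels near the zero crossing.
axes = [np.linspace(-1, 1, 64)] * 3
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
F = truncated_sdf_sphere(grid, center=np.zeros(3), radius=0.5, mu=0.1)
near_surface = np.abs(F) < (2.0 / 63)   # voxels within about one cell of the surface
print(near_surface.sum(), "voxels straddle the implicit surface")
```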

Page 20

KinectFusion: Surface representation

Problems

• Depth map measurements have large errors
• Depth maps have holes (missing measurements)

Solution

• Use an implicit surface representation

Reference: http://slideplayer.com/slide/3892185/#

Page 21

KinectFusion: Surface representation

Truncated Signed Distance Function (TSDF)

Voxel grid

Reference: http://slideplayer.com/slide/3892185/#

Page 22

KinectFusion: Surface representation

Reference: http://slideplayer.com/slide/3892185/#

Page 23

KinectFusion: Surface representation

[voted value] = [pixel depth] − [distance from the sensor to the voxel]

Reference: http://slideplayer.com/slide/3892185/#

Page 24

KinectFusion: Surface representation

[voted value] = [pixel depth] − [distance from the sensor to the voxel]

Reference: http://slideplayer.com/slide/3892185/#

Page 25

KinectFusion: Surface representation

[voted value] = [pixel depth] − [distance from the sensor to the voxel]

Reference: http://slideplayer.com/slide/3892185/#

Page 26

KinectFusion: Surface representation

[voted value] = [pixel depth] − [distance from the sensor to the voxel]

Reference: http://slideplayer.com/slide/3892185/#

Page 27

KinectFusion: Surface representation

Updating the TSDF $(F_k, W_k)$ ($F_k$: voted value, $W_k$: weight)
• Simply setting $W_{R_k}(\mathbf{p}) = 1$ gives good results

stored at each location of the TSDF: the current truncated signed distance value $F_k(\mathbf{p})$ and a weight $W_k(\mathbf{p})$,

$$S_k(\mathbf{p}) \mapsto [F_k(\mathbf{p}),\, W_k(\mathbf{p})]. \quad (5)$$

A dense surface measurement (such as the raw depth map $R_k$) provides two important constraints on the surface being reconstructed. First, assuming we can truncate the uncertainty of a depth measurement such that the true value lies within $\pm\mu$ of the measured value, then for a distance $r$ from the camera center along each depth map ray, $r < (\lambda R_k(\mathbf{u}) - \mu)$ is a measurement of free space (here $\lambda = \|K^{-1}\dot{\mathbf{u}}\|_2$ scales the measurement along the pixel ray). Second, we assume that no surface information is obtained in the reconstruction volume at $r > (\lambda R_k(\mathbf{u}) + \mu)$ along the camera ray. Therefore the SDF need only represent the region of uncertainty where the surface measurement exists, $|r - \lambda R_k(\mathbf{u})| \leq \mu$. A TSDF allows the asymmetry between free space, uncertain measurement and unknown areas to be represented. Points that are within visible space at distance greater than $\mu$ from the nearest surface interface are truncated to a maximum distance $\mu$. Non-visible points farther than $\mu$ from the surface are not measured. Otherwise the SDF represents the distance to the nearest surface point.

Although efficient algorithms exist for computing the true discrete SDF for a given set of point measurements (complexity is linear in the number of voxels) [22], sophisticated implementations are required to achieve top performance on GPU hardware, without which real-time computation is not possible for a reasonable size volume. Instead, we use a projective truncated signed distance function that is readily computed and trivially parallelisable. For a raw depth map $R_k$ with a known pose $T_{g,k}$, its global frame projective TSDF $[F_{R_k}, W_{R_k}]$ at a point $\mathbf{p}$ in the global frame $g$ is computed as,

$$F_{R_k}(\mathbf{p}) = \Psi\!\left(\lambda^{-1}\|\mathbf{t}_{g,k} - \mathbf{p}\|_2 - R_k(\mathbf{x})\right), \quad (6)$$
$$\lambda = \|K^{-1}\dot{\mathbf{x}}\|_2, \quad (7)$$
$$\mathbf{x} = \left\lfloor \pi\!\left(K\,T_{g,k}^{-1}\,\mathbf{p}\right) \right\rfloor, \quad (8)$$
$$\Psi(\eta) = \begin{cases} \min\!\left(1, \dfrac{\eta}{\mu}\right)\operatorname{sgn}(\eta) & \text{iff } \eta \geq -\mu \\ \text{null} & \text{otherwise} \end{cases} \quad (9)$$

We use a nearest neighbour lookup $\lfloor\cdot\rfloor$ instead of interpolating the depth value, to prevent smearing of measurements at depth discontinuities. $1/\lambda$ converts the ray distance to $\mathbf{p}$ to a depth (we found no considerable difference in using SDF values computed using distances along the ray or along the optical axis). $\Psi$ performs the SDF truncation. The truncation function is scaled to ensure that a surface measurement (zero crossing in the SDF) is represented by at least one non-truncated voxel value in the discretised volume either side of the surface. Also, the support is increased linearly with distance from the sensor center to support correct representation of noisier measurements. The associated weight $W_{R_k}(\mathbf{p})$ is proportional to $\cos(\theta)/R_k(\mathbf{x})$, where $\theta$ is the angle between the associated pixel ray direction and the surface normal measurement in the local frame.

The projective TSDF measurement is only correct exactly at the surface $F_{R_k}(\mathbf{p}) = 0$ or if there is only a single point measurement in isolation. When a surface is present the closest point along a ray could be another surface point not on the ray associated with the pixel in Equation 8. It has been shown that for points close to the surface, a correction can be applied by scaling the SDF by $\cos(\theta)$ [11]. However, we have found that approximation within the truncation region for 100s or more fused TSDFs from multiple viewpoints (as performed here) converges towards an SDF with a pseudo-Euclidean metric that does not hinder mapping and tracking performance.
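A scalar NumPy sketch of the projective TSDF sample for a single voxel (cf. Eqs. 6–9). It follows the sign convention stated on the earlier slides ([voted value] = pixel depth − distance from the sensor to the voxel, so free space is positive), uses the floor operation of Eq. 8 as the nearest-neighbour pixel lookup, and returns None for the unmeasured "null" case. This is an illustration, not the paper's parallel GPU kernel.

```python
import numpy as np

def projective_tsdf(p_g, depth, K, T_gk, mu):
    """Projective TSDF sample for one global-frame point p_g.

    depth: H x W raw depth map R_k in metres, K: 3x3 intrinsics,
    T_gk: 4x4 camera-to-global pose, mu: truncation distance.
    """
    # Project the point into the current depth image (Eq. 8), nearest-
    # neighbour (floor) pixel lookup; camera looks along +z by assumption.
    p_cam = (np.linalg.inv(T_gk) @ np.append(p_g, 1.0))[:3]
    if p_cam[2] <= 0.0:
        return None                              # behind the camera
    u = K @ (p_cam / p_cam[2])
    x, y = int(np.floor(u[0])), int(np.floor(u[1]))
    H, W = depth.shape
    if not (0 <= x < W and 0 <= y < H) or depth[y, x] <= 0.0:
        return None                              # no valid measurement

    # Eq. 7: lambda converts distance along the pixel ray into depth.
    lam = np.linalg.norm(np.linalg.inv(K) @ np.array([x, y, 1.0]))
    t_gk = T_gk[:3, 3]                           # camera centre, global frame
    eta = depth[y, x] - np.linalg.norm(t_gk - p_g) / lam

    # Eq. 9: points more than mu behind the surface are unmeasured;
    # otherwise clamp the signed distance to [-1, 1] in units of mu.
    if eta < -mu:
        return None
    return float(np.clip(eta / mu, -1.0, 1.0))
```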

The global fusion of all depth maps in the volume is formed by the weighted average of all individual TSDFs computed for each depth map, which can be seen as de-noising the global TSDF from multiple noisy TSDF measurements. Under an L2 norm the de-noised (fused) surface results as the zero-crossings of the point-wise SDF F minimising:

$$\min_{F \in \mathcal{F}} \sum_k \left\| W_{R_k}\!\left( F_{R_k} - F \right) \right\|_2. \quad (10)$$

Given that the focus of our work is on real-time sensor tracking and surface reconstruction we must maintain interactive frame-rates. (For a 640x480 depth stream at 30fps the equivalent of over 9 million new point measurements are made per second). Storing a weight $W_k(\mathbf{p})$ with each value allows an important aspect of the global minimum of the convex L2 de-noising metric to be exploited for real-time fusion; that the solution can be obtained incrementally as more data terms are added using a simple weighted running average [7], defined point-wise on $\{\mathbf{p} \mid F_{R_k}(\mathbf{p}) \neq \text{null}\}$:

$$F_k(\mathbf{p}) = \frac{W_{k-1}(\mathbf{p})\,F_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p})\,F_{R_k}(\mathbf{p})}{W_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p})} \quad (11)$$

$$W_k(\mathbf{p}) = W_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p}) \quad (12)$$

No update on the global TSDF is performed for values resulting from unmeasurable regions specified in Equation 9. While $W_k(\mathbf{p})$ provides weighting of the TSDF proportional to the uncertainty of surface measurement, we have also found that in practice simply letting $W_{R_k}(\mathbf{p}) = 1$, resulting in a simple average, provides good results. Moreover, by truncating the updated weight over some value $W_\eta$,

$$W_k(\mathbf{p}) \leftarrow \min\!\left(W_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p}),\; W_\eta\right), \quad (13)$$

a moving average surface reconstruction can be obtained enabling reconstruction in scenes with dynamic object motion.

Although a large number of voxels can be visited that will not project into the current image, the simplicity of the kernel means operation time is memory, not computation, bound and with current GPU hardware over 65 gigavoxels/second (≈ 2 ms per full volume update for a 512³ voxel reconstruction) can be updated. We use 16 bits per component in $S(\mathbf{p})$, although experimentally we have verified that as few as 6 bits are required for the SDF value. Finally, we note that the raw depth measurements are used for TSDF fusion rather than the bilateral filtered version used in the tracking component, described later in section 3.5. The early filtering removes desired high frequency structure and noise alike which would reduce the ability to reconstruct finer scale structures.
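The weighted running average of Eqs. 11–13 is a per-voxel one-liner; below is a NumPy sketch over whole arrays, with NaN standing in for the "null" case of Eq. 9 and an assumed weight cap $W_\eta = 64$.

```python
import numpy as np

def fuse_tsdf(F_prev, W_prev, F_meas, W_meas=1.0, w_max=64.0):
    """Weighted running-average TSDF fusion (cf. Eqs. 11-13).

    F_prev, W_prev: current global TSDF values and weights (arrays).
    F_meas: projective TSDF of the new depth map; NaN marks the "null"
            (unmeasured) case and leaves the voxel untouched.
    W_meas: per-measurement weight; 1.0 reproduces the simple average the
            paper reports to work well. w_max caps the weight so the
            reconstruction behaves as a moving average (assumed value).
    """
    valid = ~np.isnan(F_meas)
    F_new, W_new = F_prev.copy(), W_prev.copy()
    F_new[valid] = (W_prev[valid] * F_prev[valid] + W_meas * F_meas[valid]) / (
        W_prev[valid] + W_meas
    )
    W_new[valid] = np.minimum(W_prev[valid] + W_meas, w_max)   # Eq. 13 cap
    return F_new, W_new
```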

3.4 Surface Prediction from Ray Casting the TSDF

With the most up-to-date reconstruction available comes the ability to compute a dense surface prediction by rendering the surface encoded in the zero level set $F_k = 0$ into a virtual camera with the current estimate $T_{g,k}$. The surface prediction is stored as a vertex and normal map $\hat{V}_k$ and $\hat{N}_k$ in frame of reference k and is used in the subsequent camera pose estimation step.

As we have a dense surface reconstruction in the form of a global SDF, a per pixel raycast can be performed [21]. Each pixel's corresponding ray, $T_{g,k} K^{-1}\dot{\mathbf{u}}$, is marched starting from the minimum depth for the pixel and stopping when a zero crossing (+ve to −ve for a visible surface) is found indicating the surface interface. Marching also stops if a −ve to +ve back face is found, or ultimately when exiting the working volume, both resulting in a non surface measurement at the pixel $\mathbf{u}$.

For points on or very close to the surface interface $F_k(\mathbf{p}) = 0$ it is assumed that the gradient of the TSDF at $\mathbf{p}$ is orthogonal to the zero level set, and so the surface normal for the associated pixel $\mathbf{u}$


Figure 3: Overall system workflow.

Surface reconstruction update: The global scene fusion process, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into the scene model maintained with a volumetric, truncated signed distance function (TSDF) representation.

Surface prediction: Unlike frame-to-frame pose estimation as performed in [15], we close the loop between mapping and localisation by tracking the live depth frame against the globally fused model. This is performed by raycasting the signed distance function into the estimated frame to provide a dense surface prediction against which the live depth map is aligned.

Sensor pose estimation: Live sensor tracking is achieved using a multi-scale ICP alignment between the predicted surface and current sensor measurement. Our GPU based implementation uses all the available data at frame-rate.

3.1 Preliminaries

We represent the live 6DOF camera pose estimated for a frame at time k by a rigid body transformation matrix:

$$T_{g,k} = \begin{bmatrix} R_{g,k} & \mathbf{t}_{g,k} \\ \mathbf{0}^\top & 1 \end{bmatrix} \in \mathbb{SE}_3, \quad (1)$$

where the Euclidean group $\mathbb{SE}_3 := \{R, \mathbf{t} \mid R \in \mathbb{SO}_3, \mathbf{t} \in \mathbb{R}^3\}$. This maps the camera coordinate frame at time k into the global frame g, such that a point $\mathbf{p}_k \in \mathbb{R}^3$ in the camera frame is transferred into the global co-ordinate frame via $\mathbf{p}_g = T_{g,k}\,\mathbf{p}_k$.

We will also use a single constant camera calibration matrix K that transforms points on the sensor plane into image pixels. The function $\mathbf{q} = \pi(\mathbf{p})$ performs perspective projection of $\mathbf{p} \in \mathbb{R}^3 = (x, y, z)^\top$ including dehomogenisation to obtain $\mathbf{q} \in \mathbb{R}^2 = (x/z, y/z)^\top$. We will also use a dot notation to denote homogeneous vectors $\dot{\mathbf{u}} := (\mathbf{u}^\top | 1)^\top$.

3.2 Surface Measurement

At time k a measurement comprises a raw depth map $R_k$ which provides calibrated depth measurements $R_k(\mathbf{u}) \in \mathbb{R}$ at each image pixel $\mathbf{u} = (u, v)^\top$ in the image domain $\mathbf{u} \in \mathcal{U} \subset \mathbb{R}^2$ such that $\mathbf{p}_k = R_k(\mathbf{u})\,K^{-1}\dot{\mathbf{u}}$ is a metric point measurement in the sensor frame of reference k. We apply a bilateral filter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise $D_k$,

$$D_k(\mathbf{u}) = \frac{1}{W_p} \sum_{\mathbf{q} \in \mathcal{U}} \mathcal{N}_{\sigma_s}\!\left(\|\mathbf{u} - \mathbf{q}\|_2\right)\, \mathcal{N}_{\sigma_r}\!\left(\|R_k(\mathbf{u}) - R_k(\mathbf{q})\|_2\right)\, R_k(\mathbf{q}), \quad (2)$$

where $\mathcal{N}_\sigma(t) = \exp(-t^2\sigma^{-2})$ and $W_p$ is a normalizing constant.

We back-project the filtered depth values into the sensor's frame of reference to obtain a vertex map $V_k$,

$$V_k(\mathbf{u}) = D_k(\mathbf{u})\,K^{-1}\dot{\mathbf{u}}. \quad (3)$$

Since each frame from the depth sensor is a surface measurement on a regular grid, we compute, using a cross product between neighbouring map vertices, the corresponding normal vectors,

$$N_k(\mathbf{u}) = \nu\!\left[\left(V_k(u+1, v) - V_k(u, v)\right) \times \left(V_k(u, v+1) - V_k(u, v)\right)\right], \quad (4)$$

where $\nu[\mathbf{x}] = \mathbf{x}/\|\mathbf{x}\|_2$.

(Figure 4: A slice through the truncated signed distance volume showing the truncated function F > μ (white), the smooth distance field around the surface interface F = 0 and voxels that have not yet had a valid measurement (grey), as detailed in Eqn. 9.)

We also define a vertex validity mask: $M_k(\mathbf{u}) \mapsto 1$ for each pixel where a depth measurement transforms to a valid vertex; otherwise if a depth measurement is missing $M_k(\mathbf{u}) \mapsto 0$. The bilateral filtered version of the depth map greatly increases the quality of the normal maps produced, improving the data association required in tracking described in Section 3.5.
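A direct (and deliberately slow) NumPy sketch of the discontinuity-preserving bilateral depth filter of Eq. (2). The window radius and the $\sigma_s$, $\sigma_r$ values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def bilateral_filter_depth(raw_depth, radius=3, sigma_s=3.0, sigma_r=0.03):
    """Bilateral smoothing of a depth map in the spirit of Eq. (2).
    Pixels with depth 0 are treated as missing and left as holes."""
    depth = np.asarray(raw_depth, dtype=np.float64)
    H, W = depth.shape
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / sigma_s**2)        # N_sigma_s term
    pad = np.pad(depth, radius, mode="edge")
    for v in range(H):
        for u in range(W):
            centre = depth[v, u]
            if centre <= 0.0:
                continue
            patch = pad[v:v + 2 * radius + 1, u:u + 2 * radius + 1]
            rng = np.exp(-((patch - centre) ** 2) / sigma_r**2)  # N_sigma_r term
            w = spatial * rng * (patch > 0.0)               # ignore missing depths
            out[v, u] = (w * patch).sum() / w.sum()         # 1/W_p normalisation
    return out
```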

We compute an L = 3 level multi-scale representation of the surface measurement in the form of a vertex and normal map pyramid. First a depth map pyramid $D^{l \in [1...L]}$ is computed. Setting the bottom depth map pyramid level equal to the original bilateral filtered depth map, the sub-sampled version $D^{l+1}$ is computed from $D^l$ by block averaging followed by sub-sampling to half the resolution. Depth values are used in the average only if they are within $3\sigma_r$ of the central pixel to ensure smoothing does not occur over depth boundaries. Subsequently each level in a vertex and normal map pyramid $V^{l \in [1...L]}$, $N^{l \in [1...L]}$ is computed with Equations 3 and 4 using the corresponding depth map level. We note that given the camera to global co-ordinate frame transform $T_{g,k}$ associated with the surface measurement, the global frame vertex is $V^g_k(\mathbf{u}) = T_{g,k}\dot{V}_k(\mathbf{u})$ and the equivalent mapping of normal vectors into the global frame is $N^g_k(\mathbf{u}) = R_{g,k} N_k(\mathbf{u})$.

3.3 Mapping as Surface Reconstruction

Each consecutive depth frame, with an associated live camera pose estimate, is fused incrementally into one single 3D reconstruction using the volumetric truncated signed distance function (TSDF) [7]. In a true signed distance function, the value corresponds to the signed distance to the closest zero crossing (the surface interface), taking on positive and increasing values moving from the visible surface into free space, and negative and decreasing values on the non-visible side. The result of averaging the SDFs of multiple 3D point clouds (or surface measurements) that are aligned into a global frame is a global surface fusion.

An example given in Figure 4 demonstrates how the TSDF allows us to represent arbitrary genus surfaces as zero crossings within the volume. We will denote the global TSDF that contains a fusion of the registered depth measurements from frames 1...k as $S_k(\mathbf{p})$ where $\mathbf{p} \in \mathbb{R}^3$ is a global frame point in the 3D volume to be reconstructed. A discretization of the TSDF with a specified resolution is stored in global GPU memory where all processing will reside. From here on we assume a fixed bijective mapping between voxel/memory elements and the continuous TSDF representation and will refer only to the continuous TSDF S. Two components are stored at each location of the TSDF, as described in the excerpt quoted above: the value $F_k(\mathbf{p})$ and the weight $W_k(\mathbf{p})$.

(Labels on the slide's TSDF update figure: "up to now" and "new observation".)

Page 28

KinectFusion: Sensor pose estimation

Minimize the point-plane energy (the distance between the point cloud and the surface)

Figure 5: Reconstruction of a scene showing raycasting of the TSDF (left) without and (middle and right) with interpolation of the TSDF at the surface interface using Eqn. 15.

Figure 6: Demonstration of the space skipping ray casting. (Left) pixel iteration counts are shown where for each pixel the ray is traversed in steps of at most one voxel (white equals 480 increments and black 60). (Middle) ray marching steps are drastically reduced by skipping empty space according to the minimum truncation μ (white equals 70 iterations and black 10, ≈ 6× speedup). Marching steps can be seen to increase around the surface interface where the signed distance function has not been truncated. (Right) Normal map at the resulting surface intersection.

along which $\mathbf{p}$ was found can be computed directly from $F_k$ using a numerical derivative of the SDF:

$$R_{g,k} N_k = N^g_k(\mathbf{u}) = \nu\!\left[\nabla F(\mathbf{p})\right], \quad \nabla F = \left[\frac{\partial F}{\partial x}, \frac{\partial F}{\partial y}, \frac{\partial F}{\partial z}\right]^{\!\top} \quad (14)$$

Further, this derivative is scaled in each dimension to ensure correct isotropy given potentially arbitrary voxel resolutions and reconstruction dimensions.

Since the rendering of the surface is restricted to provide physically plausible measurement predictions, there is a minimum and maximum rendering range the ray needs to traverse corresponding to conservative minimum and maximum sensor range (≈ [0.4, 8] meters for the Kinect). This results in the desirable property of the proposed surface prediction requiring a bounded time per pixel computation for any size or complexity of scene with a fixed volumetric resolution.

Classically a min/max block acceleration structure [21] can be used to speed up marching through empty space. However, due to continual updating of the TSDF (which would require a constant update to the min/max macro blocks) we found that simple ray skipping provides a more useful acceleration. In ray skipping we utilise the fact that near F(p) = 0 the fused volume holds a good approximation to the true signed distance from p to the nearest surface interface. Using our known truncation distance we can march along the ray in steps with size < μ while values of F(p) have +ve truncated values, as we can assume a step μ must pass through at least one non-truncated +ve value before stepping over the surface zero crossing. The speed-up obtained is demonstrated in Figure 6 by measuring the number of steps required for each pixel to intersect the surface relative to standard marching.

Higher quality intersections can be obtained by solving a ray/trilinear cell intersection [21] that requires solving a cubic polynomial. As this is expensive we use a simple approximation. Given a ray has been found to intersect the SDF where $F^+_t$ and $F^+_{t+\Delta t}$ are trilinearly interpolated SDF values either side of the zero crossing at points along the ray $t$ and $t + \Delta t$ from its starting point, we find the parameter $t^*$ at which the intersection occurs more precisely:

$$t^* = t - \frac{\Delta t\, F^+_t}{F^+_{t+\Delta t} - F^+_t}. \quad (15)$$

The predicted vertex and normal maps are computed at the interpolated location in the global frame. Figure 5 shows a typical reconstruction; the interpolation scheme described achieves high quality occlusion boundaries at a fraction of the cost of full interpolation.
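A single-ray sketch of the surface prediction step: march along the pixel ray, detect a +ve to −ve zero crossing, and refine the hit with Eq. 15. The TSDF sampler interface, the fixed step of 0.5μ (standing in for the paper's truncation-aware ray skipping), and the ray bounds are assumptions for illustration.

```python
import numpy as np

def refine_zero_crossing(t, dt, F_t, F_t_dt):
    """Linear zero-crossing refinement along a ray (Eq. 15):
    t* = t - dt * F_t / (F_{t+dt} - F_t), with TSDF samples of opposite
    sign at ray parameters t and t + dt."""
    return t - dt * F_t / (F_t_dt - F_t)

def raycast_tsdf(origin, direction, sample_fn, t_min, t_max, mu):
    """March one ray through a TSDF and return the refined surface hit
    distance, or None. sample_fn(p) returns the (interpolated) TSDF value
    at a 3D point, or NaN where the volume is unmeasured."""
    step = 0.5 * mu
    t_prev = t_min
    f_prev = sample_fn(origin + t_prev * direction)
    t = t_min + step
    while t <= t_max:
        f = sample_fn(origin + t * direction)
        if not np.isnan(f) and not np.isnan(f_prev):
            if f_prev > 0.0 and f <= 0.0:        # +ve to -ve: visible surface
                return refine_zero_crossing(t_prev, t - t_prev, f_prev, f)
            if f_prev < 0.0 and f >= 0.0:        # back face: stop with no hit
                return None
        t_prev, f_prev = t, f
        t += step
    return None
```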

3.5 Sensor Pose Estimation

Live camera localisation involves estimating the current camera pose $T_{w,k} \in \mathbb{SE}_3$ (Equation 1) for each new depth image. Many tracking algorithms use feature selection to improve speed by reducing the number of points for which data association need be performed. In this work, we take advantage of two factors to allow us instead to make use of all of the data in a depth image for a dense iterated closest point based pose estimation. First, by maintaining a high tracking frame-rate, we can assume small motion from one frame to the next. This allows us to use the fast projective data association algorithm [4] to obtain correspondence and the point-plane metric [5] for pose optimisation. Second, modern GPU hardware enables a fully parallelised processing pipeline, so that the data association and point-plane optimisation can use all of the available surface measurements.

The point-plane error metric in combination with correspondences obtained using projective data association was first demonstrated in a real time modelling system by [23] where frame-to-frame tracking was used (with a fixed camera) for depth map alignment. In our system we instead track the current sensor frame by aligning a live surface measurement $(V_k, N_k)$ against the model prediction from the previous frame $(\hat{V}_{k-1}, \hat{N}_{k-1})$. We note that frame-to-frame tracking is obtained simply by setting $(\hat{V}_{k-1}, \hat{N}_{k-1}) \leftarrow (V_{k-1}, N_{k-1})$, which is used in our experimental section for a comparison between frame-to-frame and frame-model tracking.

Utilising the surface prediction, the global point-plane energy, under the L2 norm for the desired camera pose estimate $T_{g,k}$, is:

$$E(T_{g,k}) = \sum_{\substack{\mathbf{u} \in \mathcal{U} \\ \Omega_k(\mathbf{u}) \neq \text{null}}} \left\| \left( T_{g,k}\,\dot{V}_k(\mathbf{u}) - \hat{V}^g_{k-1}(\hat{\mathbf{u}}) \right)^{\!\top} \hat{N}^g_{k-1}(\hat{\mathbf{u}}) \right\|_2, \quad (16)$$

where each global frame surface prediction is obtained using the previous fixed pose estimate $T_{g,k-1}$. The projective data association algorithm produces the set of vertex correspondences $\{V_k(\mathbf{u}), \hat{V}_{k-1}(\hat{\mathbf{u}}) \mid \Omega(\mathbf{u}) \neq \text{null}\}$ by computing the perspectively projected point, $\hat{\mathbf{u}} = \pi\!\left(K\,\tilde{T}_{k-1,k}\,\dot{V}_k(\mathbf{u})\right)$, using an estimate for the frame-frame transform $\tilde{T}^z_{k-1,k} = T^{-1}_{g,k-1}\,\tilde{T}^z_{g,k}$ and testing the predicted and measured vertex and normal for compatibility. A threshold on the distance of vertices and difference in normal values suffices to reject grossly incorrect correspondences, also illustrated in Figure 7:

$$\Omega(\mathbf{u}) \neq \text{null iff} \begin{cases} M_k(\mathbf{u}) = 1, \text{ and} \\ \left\|\tilde{T}^z_{g,k}\,\dot{V}_k(\mathbf{u}) - \hat{V}^g_{k-1}(\hat{\mathbf{u}})\right\|_2 \leq \epsilon_d, \text{ and} \\ \left\langle \tilde{R}^z_{g,k} N_k(\mathbf{u}),\, \hat{N}^g_{k-1}(\hat{\mathbf{u}}) \right\rangle \leq \epsilon_\theta, \end{cases} \quad (17)$$

where $\epsilon_d$ and $\epsilon_\theta$ are threshold parameters of our system. $\tilde{T}^{z=0}_{g,k}$ is initialised with the previous frame pose $T_{g,k-1}$. An iterative solution, $\tilde{T}^z_{g,k}$ for z > 0, is obtained by minimising the energy of a linearised version of (16) around the previous estimate $\tilde{T}^{z-1}_{g,k}$. Using the small angle assumption for an incremental transform:

$$\tilde{T}^z_{inc} = \left[\,\tilde{R}^z \;\middle|\; \tilde{\mathbf{t}}^z\,\right] = \begin{bmatrix} 1 & \alpha & -\gamma & t_x \\ -\alpha & 1 & \beta & t_y \\ \gamma & -\beta & 1 & t_z \end{bmatrix}, \quad (18)$$

Technical Report TR04-004, Department of Computer Science, University of North Carolina at Chapel Hill, February 2004.


Linear Least-Squares Optimization for Point-to-Plane ICP Surface Registration

Kok-Lim Low

Department of Computer Science University of North Carolina at Chapel Hill

Email: [email protected]

ABSTRACT

The Iterative Closest Point (ICP) algorithm that uses the point-to-plane error metric has been shown to converge much faster than one that uses the point-to-point error metric. At each iteration of the ICP algorithm, the change of relative pose that gives the minimal point-to-plane error is usually solved using standard nonlinear least-squares methods, which are often very slow. Fortunately, when the relative orientation between the two input surfaces is small, we can approximate the nonlinear optimization problem with a linear least-squares one that can be solved more efficiently. We detail the derivation of a linear system whose least-squares solution is a good approximation to that obtained from a nonlinear optimization.

1 INTRODUCTION

3D shape alignment is an important part of many applications. It is used for object recognition in which newly acquired shapes in the environment are fitted to model shapes in the database. For reverse engineering and building real-world models for virtual reality, it is used to align multiple partial range scans to form models that are more complete. For autonomous range acquisition, 3D registration is used to accurately localize the range scanner, and to align data from multiple scans for view-planning computation.

Since its introduction by Besl and McKay [Besl92], the ICP (Iterative Closest Point) algorithm has become the most widely used method for aligning three-dimensional shapes (a similar algorithm was also introduced by Chen and Medioni [Chen92]). Rusinkiewicz and Levoy [Rusinkiewicz01] provide a recent survey of the many ICP variants based on the original ICP concept.

In the ICP algorithm described by Besl and McKay [Besl92], each point in one data set is paired with the closest point in the other data set to form correspondence pairs. Then a point-to-point error metric is used in which the sum of the squared distance between points in each correspondence pair is minimized. The process is iterated until the error becomes smaller than a threshold or it stops changing. On the other hand, Chen and Medioni [Chen92] used a point-to-plane error metric in which the object of minimization is the sum of the squared distance between a point and the tangent plane at its correspondence point. Unlike the point-to-point metric, which has a closed-form solution, the point-to-plane metric is usually solved using standard nonlinear least squares methods, such as the Levenberg-Marquardt method [Press92]. Although each iteration of the point-to-plane ICP algorithm is generally slower than the point-to-point version, researchers have observed significantly better convergence rates in the former [Rusinkiewicz01]. A more theoretical explanation of the convergence of the point-to-plane metric is described by Pottmann et al [Pottmann02].

In [Rusinkiewicz01], it was suggested that when the relative orientation (rotation) between the two input surfaces is small, one can approximate the nonlinear least-squares optimization problem with a linear one, so as to speed up the computation. This approximation is simply done by replacing sin θ by θ and cos θ by 1 in the rotation matrix.

In this technical report, we describe in detail the derivation of a system of linear equations to approximate the original nonlinear system, and demonstrate how the least-squares solution to the linear system can be obtained using SVD (singular value decomposition). A 3D rigid-body transformation matrix is then constructed from the linear least-squares solution.

2 POINT-TO-PLANE ICP ALGORITHM

Given a source surface and a destination surface, each iteration of the ICP algorithm first establishes a set of pair-correspondences between points in the source surface and points in the destination surfaces. For example, for each point on the source surface, the nearest point on the destination surface is chosen as its correspondence [Besl92] (see [Rusinkiewicz01] for other approaches to find point correspondences). The output of an ICP iteration is a 3D rigid-body transformation M that transforms the source points such that the total error between the corresponding points, under a certain chosen error metric, is minimal.

When the point-to-plane error metric is used, the object of minimization is the sum of the squared distance between each source point and the tangent plane at its corresponding destination point (see Figure 1). More specifically, if $\mathbf{s}_i = (s_{ix}, s_{iy}, s_{iz}, 1)^\top$ is a source point, $\mathbf{d}_i = (d_{ix}, d_{iy}, d_{iz}, 1)^\top$ is the corresponding destination point, and $\mathbf{n}_i = (n_{ix}, n_{iy}, n_{iz}, 0)^\top$ is the unit normal vector at $\mathbf{d}_i$, then the goal of each ICP iteration is to find $M_{\text{opt}}$ such that

$$M_{\text{opt}} = \underset{M}{\arg\min} \sum_i \left( \left( M\,\mathbf{s}_i - \mathbf{d}_i \right) \cdot \mathbf{n}_i \right)^2 \quad (1)$$

where M and Mopt are 4×4 3D rigid-body transformation matrices.

Figure 1: Point-to-plane error between two surfaces. Source points s1, s2, s3 on the source surface are paired with destination points d1, d2, d3 on the destination surface; n1, n2, n3 are the unit normals at the destination points, and l1, l2, l3 are the point-to-plane distances to the corresponding tangent planes.

Figure reference: Low, Kok-Lim. "Linear least-squares optimization for point-to-plane ICP surface registration." Chapel Hill, University of North Carolina (2004).

Error between the two surfaces
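Both the KinectFusion tracker (Eqs. 16–18) and Low's report solve a linearised point-to-plane problem under a small-angle assumption. The sketch below builds the 6-parameter linear least-squares system for one iteration, given correspondences that projective data association (or a nearest-neighbour search) has already produced; the function and variable names are assumptions for illustration.

```python
import numpy as np

def point_to_plane_icp_step(src, dst, nrm):
    """One linearised point-to-plane ICP update (small-angle approximation).

    src: (N, 3) source points already transformed by the current estimate,
    dst: (N, 3) corresponding destination points,
    nrm: (N, 3) unit normals at the destination points.
    Returns a 4x4 incremental rigid transform to compose onto the pose.
    """
    # Small-angle residual per correspondence:
    #   r_i = ((I + [w]x) s_i + t - d_i) . n_i
    #       = [ (s_i x n_i)^T, n_i^T ] [w; t] - n_i . (d_i - s_i)
    A = np.hstack([np.cross(src, nrm), nrm])    # (N, 6) Jacobian rows
    b = np.einsum("ij,ij->i", nrm, dst - src)   # (N,) target values
    x, *_ = np.linalg.lstsq(A, b, rcond=None)   # solve for (wx, wy, wz, tx, ty, tz)

    wx, wy, wz, tx, ty, tz = x
    T = np.array([
        [1.0, -wz,  wy, tx],
        [ wz, 1.0, -wx, ty],
        [-wy,  wx, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ])
    return T
```

In practice this step is repeated, coarse to fine, with correspondences re-associated after each update, and the returned incremental transform is composed onto the current pose estimate.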

Page 29

KinectFusion: Sensor pose estimation

Minimize the point-to-plane distance between the point cloud and the surface

Page 30

KinectFusion: Sensor pose estimation

Minimize the point-to-plane distance between the point cloud and the surface

Page 31

KinectFusion: Sensor pose estimation

Minimize the point-to-plane distance between the point cloud and the surface

Page 32

KinectFusion: Sensor pose estimation

Point-to-plane correspondence and pose optimization

• Uses vertex and normal map pyramids
  – Depth and normal maps are built at $L = 3$ resolutions: $V^{l \in [1 \ldots L]}$, $N^{l \in [1 \ldots L]}$
• Coarse-to-fine optimization (see the sketch below)

Figure reference: http://razorvision.tumblr.com/post/15039827747/how-kinect-and-kinect-fusion-kinfu-work
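A sketch of the depth pyramid used for coarse-to-fine alignment: block averaging followed by sub-sampling to half resolution, keeping only depths within $3\sigma_r$ of a reference pixel so smoothing does not cross depth boundaries. Using the top-left pixel of each 2×2 block as the reference and the value of $\sigma_r$ are simplifying assumptions.

```python
import numpy as np

def build_depth_pyramid(depth, levels=3, sigma_r=0.03):
    """Depth map pyramid for coarse-to-fine ICP. Invalid depths are 0."""
    pyramid = [np.asarray(depth, dtype=np.float64)]
    for _ in range(1, levels):
        d = pyramid[-1]
        H, W = (d.shape[0] // 2) * 2, (d.shape[1] // 2) * 2
        # Group pixels into 2x2 blocks: shape (H/2, W/2, 4).
        blocks = d[:H, :W].reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3)
        blocks = blocks.reshape(H // 2, W // 2, 4)
        ref = blocks[..., :1]                    # top-left pixel as the reference
        keep = (np.abs(blocks - ref) < 3 * sigma_r) & (blocks > 0)
        summed = (blocks * keep).sum(-1)
        count = keep.sum(-1)
        pyramid.append(np.where(count > 0, summed / np.maximum(count, 1), 0.0))
    return pyramid   # pyramid[0] is full resolution, pyramid[-1] the coarsest
```

The alignment then runs ICP on the coarsest level first and uses each result to initialise the next finer level.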

Page 33

KinectFusion: Experimental results

Frame-to-frame tracking

• Accumulated error causes drift

Page 34

KinectFusion: Experimental results

Frame-to-model tracking

• Scan matching against the global model
• Drift-free
• More accurate than frame-to-frame tracking

Page 35

KinectFusion: Experimental results

Voxel resolution and processing time

In order from the top of the plot:

• Depth map integration
• Ray casting for surface generation
• Camera pose optimization using the pyramid maps
• Multi-scale data association across the pyramid levels
• Depth map pre-processing

(a) M=1 (b) M=2 (c) M=4

Figure 10: Close up view of loop closing frames in circular experiment as the data from a single loop is repeatedly fed to our system. We see (a) initially good alignment after one pass improving through (b) two passes to finally (c) the frames are extremely closely registered after four passes.

(a) Frame to frame tracking (b) Frame to model tracking

Figure 11: (a) Frame to frame vs. (b) frame to model tracking, both using every 8th frame. There is a drastic reduction in drift compared to Figure 8(a) where all frames are used. But the frame-to-model tracking results in drift-free operation without explicit global optimisation.

4.2 Processing Time
Figure 13 shows results from an experiment where timings were taken of the main system components and the reconstruction voxel resolution was increased in steps. We note the constant time operation of tracking and mapping for a given voxel resolution.

4.3 Observations and Failure Modes
Our system is robust to a wide range of practical conditions in terms of scene structure and camera motion. Most evidently, by using only depth data it is completely robust to indoor lighting scenarios. Our video demonstrates a good variety of agile motion tracking successfully through even rapid motion. The main failure case in standard indoor scenes is when the sensor is faced by a large planar scene which fills most of its field of view. A planar scene leaves three of the sensor's 6DOF motion unconstrained in the linear system's null space, resulting in tracking drifting or failure.

5 GEOMETRY AWARE AR
The dense accurate models obtained in real-time open up many new possibilities for augmented reality, human-computer-interaction, robotics and beyond. For example, the ability to reason about changes in the scene, utilising outliers from ICP data association (see Figure 7), allows for new object segmentation methods; these segmented objects can be tracked independently using other instances of ICP allowing piece-wise rigid tracking techniques; and physics can be simulated in real-time on acquired models directly

Figure 12: A reconstruction result using 1/64 the memory (64³ voxels) of the previous figures, and using only every 6th sensor frame, demonstrating graceful degradation with drastic reductions in memory and processing resources.

[Figure 13 plot: y-axis Time (ms); x-axis Voxel Resolution, 64³ to 512³]

Figure 13: Real-time cumulative timing results of system components, evaluated over a range of resolutions (from 64³ to 512³ voxels) as the sensor reconstructs inside a volume of 3m³. Timings are shown (from bottom to top of the plot) for: pre-processing raw data, multi-scale data-associations; multi-scale pose optimisations; raycasting the surface prediction and finally surface measurement integration.

from the TSDF volumetric representation (see Figure 14). For AR, the dense model also provides an ability to handle truer occlusion boundaries between real and virtual for rendering. In [16] we discuss all these possibilities in detail.

Figure 14: Thousands of particles interact live with surfaces as they are reconstructed. Notice how fine-grained occlusions are handled between the real and virtual. Simulation works on the TSDF volumetric representation, and runs on the GPU alongside tracking and mapping, all in real-time.

6 CONCLUSIONS

The availability of commodity depth sensors such as Kinect has the potential to revolutionise the fields of robotics and human-computer interaction. In this work we have taken a step towards bringing the ability to reconstruct and interact with a 3D environment to the masses. The key concepts in our real-time tracking and mapping system are (1) always up-to-date surface representation fusing all registered data from previous scans using the truncated signed distance function; (2) accurate and robust tracking of the camera pose

Page 36: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

KinectFusion: open issues

• Drift over long trajectories
– explicit loop closure is required

• Sufficient geometric constraints are required
– e.g., a single plane constrains only 3 degrees of freedom

• Extension to large-scale scenes
– a uniform voxel grid needs enormous memory and computation
– since much of the volume is empty, octree-based SDFs are used

36

Page 37: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

From KinectFusion to DynamicFusion

Assumption behind KinectFusion
• The observed scene is largely static

DynamicFusion
• Extends KinectFusion to dynamic, non-rigid scenes while keeping real-time performance

37

KinectFusion: Real-Time Dense Surface Mapping and Tracking*

Richard A. Newcombe (Imperial College London), Shahram Izadi (Microsoft Research), Otmar Hilliges (Microsoft Research), David Molyneaux (Microsoft Research / Lancaster University), David Kim (Microsoft Research / Newcastle University), Andrew J. Davison (Imperial College London), Pushmeet Kohli (Microsoft Research), Jamie Shotton (Microsoft Research), Steve Hodges (Microsoft Research), Andrew Fitzgibbon (Microsoft Research)

Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).

ABSTRACT

We present a system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware. We fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real-time. The current sensor pose is simultaneously obtained by tracking the live depth frame relative to the global model using a coarse-to-fine iterative closest point (ICP) algorithm, which uses all of the observed depth data available. We demonstrate the advantages of tracking against the growing full surface model compared with frame-to-frame tracking, obtaining tracking and mapping results in constant time within room sized scenes with limited drift and high accuracy. We also show both qualitative and quantitative results relating to various aspects of our tracking and mapping system. Modelling of natural scenes, in real-time with only commodity sensor and GPU hardware, promises an exciting step forward in augmented reality (AR), in particular, it allows dense surfaces to be reconstructed in real-time, with a level of detail and robustness beyond any solution yet presented using passive computer vision.

Keywords: Real-Time, Dense Reconstruction, Tracking, GPU, SLAM, Depth Cameras, Volumetric Representation, AR

Index Terms: I.3.3 [Computer Graphics] Picture/Image Generation - Digitizing and Scanning; I.4.8 [Image Processing and Computer Vision] Scene Analysis - Tracking, Surface Fitting; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Artificial, augmented, and virtual realities

* This work was performed at Microsoft Research.

1 INTRODUCTION

Real-time infrastructure-free tracking of a handheld camera whilst simultaneously mapping the physical scene in high-detail promises new possibilities for augmented and mixed reality applications.

In computer vision, research on structure from motion (SFM) and multi-view stereo (MVS) has produced many compelling results, in particular accurate camera tracking and sparse reconstructions (e.g. [10]), and increasingly reconstruction of dense surfaces (e.g. [24]). However, much of this work was not motivated by real-time applications.

Research on simultaneous localisation and mapping (SLAM) has focused more on real-time markerless tracking and live scene reconstruction based on the input of a single commodity sensor, a monocular RGB camera. Such 'monocular SLAM' systems such as MonoSLAM [8] and the more accurate Parallel Tracking and Mapping (PTAM) system [17] allow researchers to investigate flexible infrastructure- and marker-free AR applications. But while these systems perform real-time mapping, they were optimised for efficient camera tracking, with the sparse point cloud models they produce enabling only rudimentary scene reconstruction.

In the past year, systems have begun to emerge that combine PTAM's handheld camera tracking capability with dense surface MVS-style reconstruction modules, enabling more sophisticated occlusion prediction and surface interaction [19, 26]. Most recently in this line of research, iterative image alignment against dense reconstructions has also been used to replace point features for camera tracking [20]. While this work is very promising for AR, dense scene reconstruction in real-time remains a challenge for passive monocular systems which assume the availability of the right type of camera motion and suitable scene illumination.

But while algorithms for estimating camera pose and extracting geometry from images have been evolving at pace, so have the camera technologies themselves. New depth cameras based either on time-of-flight (ToF) or structured light sensing offer dense measurements of depth in an integrated device. With the arrival of Microsoft's Kinect, such sensing has suddenly reached wide consumer-level accessibility. The opportunities for SLAM and AR

DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time

Richard A. Newcombe, Dieter Fox, Steven M. Seitz
University of Washington, Seattle

Figure 1: Real-time reconstructions of a moving scene with DynamicFusion; both the person and the camera are moving. The initially noisy and incomplete model is progressively denoised and completed over time (left to right).

Abstract

We present the first dense SLAM system capable of reconstructing non-rigidly deforming scenes in real-time, by fusing together RGBD scans captured from commodity sensors. Our DynamicFusion approach reconstructs scene geometry whilst simultaneously estimating a dense volumetric 6D motion field that warps the estimated geometry into a live frame. Like KinectFusion, our system produces increasingly denoised, detailed, and complete reconstructions as more measurements are fused, and displays the updated model in real time. Because we do not require a template or other prior scene model, the approach is applicable to a wide range of moving objects and scenes.

3D scanning traditionally involves separate capture and off-line processing phases, requiring very careful planning of the capture to make sure that every surface is covered. In practice, it's very difficult to avoid holes, requiring several iterations of capture, reconstruction, identifying holes, and recapturing missing regions to ensure a complete model. Real-time 3D reconstruction systems like KinectFusion [18, 10] represent a major advance, by providing users the ability to instantly see the reconstruction and identify regions that remain to be scanned. KinectFusion spurred a flurry of follow up research aimed at robustifying the tracking [9, 32] and expanding its spatial mapping capabilities to larger environments [22, 19, 34, 31, 9].

However, as with all traditional SLAM and dense reconstruction systems, the most basic assumption behind KinectFusion is that the observed scene is largely static. The core question we tackle in this paper is: How can we generalise KinectFusion to reconstruct and track dynamic, non-rigid scenes in real-time? To that end, we introduce DynamicFusion, an approach based on solving for a volumetric flow field that transforms the state of the scene at each time instant into a fixed, canonical frame. In the case of a moving person, for example, this transformation undoes the person's motion, warping each body configuration into the pose of the first frame. Following these warps, the scene is effectively rigid, and standard KinectFusion updates can be used to obtain a high quality, denoised reconstruction. This progressively denoised reconstruction can then be transformed back into the live frame using the inverse map; each point in the canonical frame is transformed to its location in the live frame (see Figure 1).

Defining a canonical "rigid" space for a dynamically moving scene is not straightforward. A key contribution of our work is an approach for non-rigid transformation and fusion that retains the optimality properties of volumetric scan fusion [5], developed originally for rigid scenes. The main insight is that undoing the scene motion to enable fusion of all observations into a single fixed frame can be achieved efficiently by computing the inverse map alone. Under this transformation, each canonical point projects along a line of sight in the live camera frame. Since the optimality arguments of [5] (developed for rigid scenes) depend only on lines of sight, we can generalize their optimality results to the non-rigid case.

Our second key contribution is to represent this volumetric warp efficiently, and compute it in real time. Indeed, even a relatively low resolution, 256³ deformation volume would require 100 million transformation variables to be computed at frame-rate. Our solution depends on a combination of adaptive, sparse, hierarchical volumetric basis functions, and innovative algorithmic work to ensure a real-

Page 38: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: overview

• Input: a stream of noisy depth images

• Output: dense 3D geometry of the dynamic scene, in real time

38


Page 39: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: overview

• The warp field is represented as a weighted average of the warps (6 DoF each) of sparse nodes

• Every voxel of the canonical space is warped into the live frame

39


Interpolation / sparse nodes

Estimate the warp field $\mathcal{W}_t$

Page 40: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: overview

The TSDF is fused via the live frame
• rays of the live frame become curved (bent) in the canonical space

40

Non-rigid scene deformation | Introducing an occlusion

(a) Live frame t = 0 (b) Live Frame t = 1 (c) Canonical ↦ Live (d) Live frame t = 0 (e) Live Frame t = 1 (f) Canonical ↦ Live

Figure 3: An illustration of how each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame when observing a deforming scene. In (a) the first view of a dynamic scene is observed. In the corresponding canonical frame, the warp is initialized to the identity transform and the three rays shown in the live frame also map as straight lines in the canonical frame. As the scene deforms in the live frame (b), the warp function transforms each point from the canonical and into the corresponding live frame location, causing the corresponding rays to bend (c). Note that this warp can be achieved with two 6D deformation nodes (shown as circles), where the left node applies a clockwise twist. In (d) we show a new scene that includes a cube that is about to occlude the bar. In the live frame (e), as the cube occludes a portion of the bar, the points in the canonical frame (f) are warped to correctly pass through the cube.

We obtain an initial estimate for data-association (correspondence) between the model geometry and the live frame by rendering the warped surface $\hat{\mathcal{V}}_w$ into the live frame shaded with canonical frame vertex positions using a rasterizing rendering pipeline. This results in a prediction of the canonical frame's geometry that is currently predicted to be visible in the live frame: $P(\hat{\mathcal{V}}_c)$. We store this prediction as a pair of images $\{\mathbf{v}, \mathbf{n}\} : \Omega \mapsto P(\hat{\mathcal{V}}_c)$, where $\Omega$ is the pixel domain of the predicted images, storing the rendered canonical frame vertices and normals.

Given optimal transformation parameters for the current time frame, the predicted-to-be-visible geometry should transform close, modulo observation noise, to the live surface $\mathbf{vl} : \Omega \mapsto \mathbb{R}^3$, formed by back projection of the depth image $[\mathbf{vl}(u)^{\top}, 1]^{\top} = K^{-1} D_t(u) [u^{\top}, 1]^{\top}$. This can be quantified by a per pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function $\psi_{\mathrm{data}}$, summed over the predicted image domain $\Omega$:

$$\mathrm{Data}(\mathcal{W}, \mathcal{V}, D_t) \equiv \sum_{u \in \Omega} \psi_{\mathrm{data}}\big(\hat{\mathbf{n}}_u^{\top}(\hat{\mathbf{v}}_u - \mathbf{vl}_u)\big). \qquad (7)$$

The transformed model vertex $\mathbf{v}(u)$ is simply $\tilde{\mathbf{T}}_u = \mathcal{W}(\mathbf{v}(u))$, producing the current canonical to live frame point-normal predictions $\hat{\mathbf{v}}_u = \tilde{\mathbf{T}}_u \mathbf{v}(u)$ and $\hat{\mathbf{n}}_u = \tilde{\mathbf{T}}_u \mathbf{n}(u)$, and data-association of that model point-normal is made with a live frame point-normal through perspective projection into the pixel $\tilde{u} = \pi(K \hat{\mathbf{v}}_u)$.

We note that, ignoring the negligible cost of rendering the geometry $\hat{\mathcal{V}}_w$, the ability to extract, predict, and perform projective data association with the currently visible canonical geometry leads to a data-term evaluation that has a computational complexity with an upper bound in the number of pixels in the observation image. Furthermore, each data-term summand depends only on a subset of the n transformations when computing $\mathcal{W}$, and the region over which each node has a numerically significant impact on the error function is compact. In practice, the result is a computational cost similar to a single rigid body dense projective point-plane data-term evaluation (as used in KinectFusion).

3.3.2 Warp-field Regularization

It is crucial for our non-rigid TSDF fusion technique to estimate a deformation not only of currently visible surfaces, but over all space within S. This enables reconstruction of new regions of the scene surface that are about to come into view. However, nodes affecting canonical space within which no currently observed surface resides will have no associated data term. In any case, noise, missing data and insufficient geometric texture in the live frame – an analogue to the aperture problem in optical-flow – will result in optimisation of the transform parameters being ill-posed. How should we constrain the motion of non-observed geometry? Whilst the fully correct motion depends on object dynamics and, where applicable, the subject's volition, we make use of a simpler model of unobserved geometry: that it deforms in a piece-wise smooth way.

We use a deformation graph based regularization defined between transformation nodes, where an edge in the graph between nodes i and j adds a rigid-as-possible regularisation term to the total error being minimized, under the discontinuity preserving Huber penalty $\psi_{\mathrm{reg}}$. The total regularisation term sums over all pair-wise connected nodes:

$$\mathrm{Reg}(\mathcal{W}, \mathcal{E}) \equiv \sum_{i=0}^{n} \sum_{j \in \mathcal{E}(i)} \alpha_{ij}\, \psi_{\mathrm{reg}}\big(\mathbf{T}_{ic}\,\mathbf{dg}_v^{j} - \mathbf{T}_{jc}\,\mathbf{dg}_v^{j}\big), \qquad (8)$$

where $\mathcal{E}$ defines the regularisation graph topology, and $\alpha_{ij}$ defines the weight associated with the edge, which we set to $\alpha_{ij} = \max(\mathbf{dg}_w^{i}, \mathbf{dg}_w^{j})$.

Page 41: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: geometric representation

Truncated  Signed  Distance  Function  (TSDF)

For a 256³ voxel volume:

• KinectFusion
– only the camera pose (6 parameters) is estimated

• DynamicFusion
– 6 × 256³ parameters must be estimated per frame
– roughly 10 million times more than KinectFusion

Estimating a warp field for every voxel is infeasible

41
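For concreteness, the counts behind these bullets work out as:

$$6 \times 256^3 = 100{,}663{,}296 \approx 10^8 \ \text{warp parameters per frame}, \qquad \frac{6 \times 256^3}{6} = 256^3 \approx 1.7 \times 10^7,$$

i.e. on the order of ten million times the six pose parameters KinectFusion estimates per frame.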

Page 42: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: dense non-rigid warp field

Warp function

Basis: transformations of sparse nodes

n transformation nodes

$\mathbf{dg}_v$: position in the canonical frame

$\mathbf{dg}_{se3}$: the node's (rigid) transformation parameters

$\mathbf{dg}_w$: radial basis weight (radius of influence of the node)

Influence of node $i$ at a point $\mathbf{x}_c$: $w_i(\mathbf{x}_c) = \exp\!\big(-\|\mathbf{dg}_v^i - \mathbf{x}_c\|^2 / (2\,(\mathbf{dg}_w^i)^2)\big)$

$S$: canonical space, $\mathbf{x}_c$: center of each voxel

42

Page 43: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: dense non-rigid warp field

Warp function

Interpolation: a dense per-voxel warp function

$$\mathcal{W}(\mathbf{x}_c) \equiv SE3\big(\mathbf{DQB}(\mathbf{x}_c)\big)$$

$$\mathbf{DQB}(\mathbf{x}_c) = \frac{\sum_{k \in N(\mathbf{x}_c)} \mathbf{w}_k(\mathbf{x}_c)\,\mathbf{q}_{kc}}{\big\|\sum_{k \in N(\mathbf{x}_c)} \mathbf{w}_k(\mathbf{x}_c)\,\mathbf{q}_{kc}\big\|}$$

Unit dual quaternions: $\mathbf{q}_{kc} \in \mathbb{R}^8$

$SE3(\cdot)$: converts a dual quaternion into a rigid transformation matrix in 3D Euclidean space

43

Page 44: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

Background: dual quaternions

Quaternions (William Rowan Hamilton, 1843)

• Rotation only: $\mathbf{q} = \cos\frac{\theta}{2} + (u_x \mathbf{i} + u_y \mathbf{j} + u_z \mathbf{k}) \sin\frac{\theta}{2}$

$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{ijk} = -1$, with unit vector $\mathbf{u} = u_x \mathbf{i} + u_y \mathbf{j} + u_z \mathbf{k}$

• Rotating a 3D point $\mathbf{p} = p_x \mathbf{i} + p_y \mathbf{j} + p_z \mathbf{k}$:

$\mathbf{p}' = \mathbf{q}\,\mathbf{p}\,\mathbf{q}^{-1}$

44

Page 45: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

Background: dual quaternions

Dual quaternions (William Kingdon Clifford, 1873)

• Rotation + translation: $\hat{\mathbf{q}} = \mathbf{q}_r + \epsilon\,\mathbf{q}_t$, with $\epsilon^2 = 0$, and $\mathbf{p}' = \hat{\mathbf{q}}\,\mathbf{p}\,\hat{\mathbf{q}}^{-1}$

Rotation part:
$\mathbf{q}_r = r_w + r_x \mathbf{i} + r_y \mathbf{j} + r_z \mathbf{k} = \cos\frac{\theta}{2} + \sin\frac{\theta}{2}\,\mathbf{u}$

Translation part:
$\mathbf{q}_t = 0 + \frac{t_x}{2}\mathbf{i} + \frac{t_y}{2}\mathbf{j} + \frac{t_z}{2}\mathbf{k}$

45

Page 46: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

Background: Dual Quaternion Blending (DQB)

Linear Blend Skinning (LBS)

$$\mathbf{p}_i' = \sum_j w_{ij}\, T_j\, \mathbf{p}_i$$

• Volume-shrinkage problem
– caused by scale changes in the blended transformation

46

Reference: Kavan, Ladislav, et al. "Skinning with dual quaternions." Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007.

Page 47: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

Background: Dual Quaternion Blending (DQB)

Dual Quaternion Skinning (DQS)

$$\hat{\mathbf{q}} = \frac{\sum_j w_{ij}\,\hat{\mathbf{q}}_j}{\big\|\sum_j w_{ij}\,\hat{\mathbf{q}}_j\big\|}$$

47

Reference: Kavan, Ladislav, et al. "Skinning with dual quaternions." Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007.

LBS DQS

Page 48: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

Warp field

$$\mathcal{W}_t(\mathbf{x}_c) = T_{lw}\, SE3\big(\mathbf{DQB}(\mathbf{x}_c)\big)$$

DynamicFusion: dense non-rigid warp field

48

Rigid transformation (camera motion): $T_{lw}$

Per-voxel (per-node) warp field: $SE3(\mathbf{DQB}(\mathbf{x}_c))$ (see the sketch below)
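A minimal numpy sketch of this warp field, assuming the nodes store their transforms as unit dual quaternions and using k-nearest-node Gaussian weights as on the earlier slides. The helper names (`qmul`, `dualquat`, `to_se3`, `warp`) are mine, not from the paper or any released code.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def dualquat(q_r, t):
    """Unit dual quaternion (q_r, q_d) for rotation q_r followed by translation t."""
    q_d = 0.5 * qmul(np.array([0.0, *t]), q_r)
    return np.concatenate([q_r, q_d])

def to_se3(dq):
    """Convert a (possibly unnormalised) dual quaternion to a 4x4 rigid transform."""
    q_r, q_d = dq[:4], dq[4:]
    s = np.linalg.norm(q_r)
    q_r, q_d = q_r / s, q_d / s                 # DQB normalisation
    w, x, y, z = q_r
    R = np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                  [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                  [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
    # translation t = 2 * q_d * conj(q_r), taking the vector part
    t = 2.0 * qmul(q_d, q_r * np.array([1.0, -1.0, -1.0, -1.0]))[1:]
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def warp(x_c, node_pos, node_dq, node_w, T_lw, k=4):
    """Evaluate W_t(x_c) = T_lw * SE3(DQB(x_c)) for one canonical point x_c.

    node_pos: (n, 3) node positions dg_v; node_dq: (n, 8) node transforms as
    unit dual quaternions; node_w: (n,) radial-basis radii dg_w.
    """
    d2 = np.sum((node_pos - x_c)**2, axis=1)
    nn = np.argsort(d2)[:k]                              # k-nearest nodes N(x_c)
    w = np.exp(-d2[nn] / (2.0 * node_w[nn]**2))          # w_k(x_c)
    blended = (w[:, None] * node_dq[nn]).sum(axis=0)     # weighted dual-quaternion sum
    return T_lw @ to_se3(blended)                        # normalised inside to_se3
```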

Page 49: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: dense non-rigid surface fusion

Sampled TSDF: $\mathcal{V}(\mathbf{x}) \mapsto \big[v(\mathbf{x}) \in \mathbb{R},\ w(\mathbf{x}) \in \mathbb{R}\big]$

$v(\mathbf{x})$: weighted average of all projective TSDF values

$w(\mathbf{x})$: sum of the weights associated with $\mathbf{x}$

49

Page 50: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: dense non-rigid surface fusion

After the latest depth image is acquired:
• each voxel center of the canonical space is transformed into the live frame's coordinate system

• Projective Signed Distance Function (psdf)

$u_c$: the pixel onto which the (warped) voxel center projects

50

Depth measurement point | voxel center

Page 51: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: dense non-rigid surface fusion (updating the TSDF values)

$$\mathcal{V}(\mathbf{x})_t = \begin{cases} \big[v'(\mathbf{x}),\, w'(\mathbf{x})\big]^{\top}, & \text{if } \mathbf{psdf}\big(\mathbf{dc}(\mathbf{x})\big) > -\tau \\[4pt] \mathcal{V}(\mathbf{x})_{t-1}, & \text{otherwise} \end{cases}$$

$\mathbf{dc}(\cdot)$: maps a voxel ID to the coordinates of the voxel center (in the TSDF volume)
$\tau > 0$: truncation distance threshold

$$v'(\mathbf{x}) = \frac{v(\mathbf{x})_{t-1}\, w(\mathbf{x})_{t-1} + \min(\rho, \tau)\, w(\mathbf{x})}{w(\mathbf{x})_{t-1} + w(\mathbf{x})}, \qquad \rho = \mathbf{psdf}\big(\mathbf{dc}(\mathbf{x})\big)$$

$$w'(\mathbf{x}) = \min\big(w(\mathbf{x})_{t-1} + w(\mathbf{x}),\ w_{\max}\big)$$

$w(\mathbf{x})$: fusion weight of the new measurement, set from the distances between $\mathbf{x}_c$ and its $k$-nearest deformation nodes

51
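A per-voxel sketch of this update rule in plain Python/numpy. Assumptions: the new measurement's weight is taken as a constant `w_new` rather than the node-distance-based weight mentioned above, and `x_t` is assumed to already be the voxel center warped into the live camera frame.

```python
import numpy as np

def fuse_voxel(tsdf, weight, x_t, depth, K, tau=0.05, w_max=100.0, w_new=1.0):
    """Non-rigid TSDF update for one voxel (a sketch of the rule on this slide).

    tsdf, weight: current v(x), w(x) for the voxel
    x_t:          voxel centre dc(x) already warped into the live frame
    depth:        live depth map D_t; K: 3x3 camera intrinsics
    Returns the updated (v'(x), w'(x)), or the old values when the voxel is
    further than tau behind the observed surface or unobserved.
    """
    # Project the warped voxel centre into the live depth image: u_c = pi(K x_t)
    u = K @ x_t
    px, py = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
    if not (0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]) or depth[py, px] <= 0:
        return tsdf, weight                      # no measurement along this ray
    # Projective signed distance: measured depth minus the voxel's depth
    rho = depth[py, px] - x_t[2]
    if rho <= -tau:                              # far behind the surface: unchanged
        return tsdf, weight
    v_new = (tsdf * weight + min(rho, tau) * w_new) / (weight + w_new)
    w_out = min(weight + w_new, w_max)
    return v_new, w_out
```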

Page 52: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: estimating the warp field $\mathcal{W}_t$

Energy function

$$E(\mathcal{W}_t, \mathcal{V}, D_t, \mathcal{E}) = \mathbf{Data}(\mathcal{W}_t, \mathcal{V}, D_t) + \lambda\,\mathbf{Reg}(\mathcal{W}_t, \mathcal{E})$$

52

Data term: dense model-to-live-frame ICP cost

Regularization term: penalty on non-smooth motion fields

$\mathcal{V}$: current 3D geometry, $D_t$: latest depth map, $\mathcal{E}$: set of regularization-graph edges

Page 53: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: estimating the warp field $\mathcal{W}_t$

Current surface model

• Extracted from the TSDF $\mathcal{V}$ with the marching cubes algorithm
– the surface is generated from the zero-level isosurface

• Stored as a polygon mesh in the canonical space
– pairs of vertices and normals: $\hat{\mathcal{V}}_c \equiv \{V_c, N_c\}$

53
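For this mesh-extraction step, recent scikit-image versions ship a marching cubes implementation that can be applied directly to the TSDF volume; a small sketch (the `voxel_size` value is an arbitrary example):

```python
import numpy as np
from skimage import measure

def extract_canonical_mesh(tsdf, voxel_size=0.005):
    """Extract the zero-level isosurface of a TSDF volume as a triangle mesh."""
    verts, faces, normals, _ = measure.marching_cubes(
        tsdf, level=0.0, spacing=(voxel_size,) * 3)
    return verts, faces, normals   # vertices are in canonical (volume) coordinates
```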

Page 54: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: estimating the warp field $\mathcal{W}_t$

Data term

$$\mathbf{Data}(\mathcal{W}, \mathcal{V}, D_t) \equiv \sum_{u \in \Omega} \psi_{\mathbf{data}}\big(\hat{\mathbf{n}}_u^{\top}(\hat{\mathbf{v}}_u - \mathbf{vl}_u)\big)$$

$\psi_{\mathbf{data}}$: robust Tukey penalty function (see the sketch below)

54

$\hat{\mathbf{v}}_u,\ \hat{\mathbf{n}}_u$: predicted vertex and normal (transformed from the canonical space into the live frame)

$\mathbf{vl}_u$: measured point (depth)

The residual is the error along the vertex normal in the live frame

• This gives the "solution" as a simple least-squares problem:

$$\mathbf{a} = \Big(\sum_i w_i \mathbf{x}_i \mathbf{x}_i^{\top}\Big)^{-1} \sum_i w_i y_i \mathbf{x}_i. \qquad (8)$$

Note that this solution depends on the $w_i$ values, which in turn depend on $\mathbf{a}$.

• The idea is to alternate calculating $\mathbf{a}$ and recalculating $w_i = w\big((y_i - \mathbf{a}^{\top}\mathbf{x}_i)/\sigma_i\big)$.

• Here are the weight functions associated with the two estimates. For the Cauchy ρ function,

$$w_C(u) = \frac{1}{1 + (u/c)^2} \qquad (9)$$

[plot of the weight function for u in [-6, 6]]

and, for the Beaton-Tukey ρ function,

$$w_T(u) = \begin{cases} \big(1 - (u/a)^2\big)^2 & |u| \le a \\ 0 & |u| > a. \end{cases} \qquad (10)$$

Beaton-Tukey ρ function


Example of the Beaton-Tukey ρ function
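A small sketch of the Beaton-Tukey penalty and its IRLS weight, with an arbitrary cut-off value; this illustrates the generic robust-estimation form above, not the exact constants used in the paper.

```python
import numpy as np

def tukey_psi(r, c=4.685):
    """Beaton-Tukey robust penalty (up to scale) and its IRLS weight.

    Residuals beyond the cut-off c contribute a constant penalty and zero weight,
    so gross outliers in the depth data stop influencing the warp-field estimate.
    """
    u = np.asarray(r, dtype=float) / c
    inside = np.abs(u) <= 1.0
    penalty = np.where(inside, (c**2 / 6.0) * (1.0 - (1.0 - u**2)**3), c**2 / 6.0)
    weight = np.where(inside, (1.0 - u**2)**2, 0.0)   # w_T(u) = (1 - (u/c)^2)^2
    return penalty, weight
```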

Page 55: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: estimating the warp field $\mathcal{W}_t$

Regularization term
• constrains the motion of regions that are not observed in the live frame

$$\mathbf{Reg}(\mathcal{W}, \mathcal{E}) \equiv \sum_{i} \sum_{j \in \mathcal{E}(i)} \alpha_{ij}\, \psi_{\mathbf{reg}}\big(\mathbf{T}_{ic}\,\mathbf{dg}_v^{j} - \mathbf{T}_{jc}\,\mathbf{dg}_v^{j}\big)$$

$\psi_{\mathbf{reg}}$: Huber penalty (see the sketch below)

$\alpha_{ij} = \max\big(\mathbf{dg}_w^{i}, \mathbf{dg}_w^{j}\big)$

55

A constraint that keeps the edge between nodes $i$ and $j$ as rigid as possible

Huber loss function:

$$L_\delta(a) = \begin{cases} \frac{1}{2} a^2, & |a| \le \delta \\[2pt] \delta\big(|a| - \frac{1}{2}\delta\big), & \text{otherwise} \end{cases}$$

[Plot legend: Huber loss vs. squared-error loss]

Example of the Huber loss function
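A corresponding sketch of the Huber penalty used as ψ_reg (the δ value is an arbitrary example):

```python
import numpy as np

def huber(a, delta=0.01):
    """Huber penalty L_delta(a): quadratic near zero, linear for |a| > delta.

    Used so that mostly-rigid relative node motion is smoothed while genuine
    articulations (large relative motion) are not over-penalised.
    """
    a = np.abs(np.asarray(a, dtype=float))
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))
```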

Page 56: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: estimating the warp field $\mathcal{W}_t$

56

Minimize the energy function $E$
• Gauss-Newton method (see the sketch below)

Approximate Hessian: $\mathbf{J}^{\top}\mathbf{J} = \mathbf{J}_d^{\top}\mathbf{J}_d + \lambda\,\mathbf{J}_r^{\top}\mathbf{J}_r$
– block arrow-head matrix
– solved efficiently with a block Cholesky factorization

Arrow-head matrix (figure)
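A generic, dense Gauss-Newton sketch of the optimisation described on this slide. The real system exploits the block arrow-head sparsity of J^T J with a block Cholesky solver on the GPU, which this toy loop does not attempt; `residuals_jac` is a hypothetical callback that supplies the data and regularisation residuals and Jacobians.

```python
import numpy as np

def gauss_newton(params, residuals_jac, lam=200.0, iters=5):
    """Minimise E = Data + lambda * Reg with plain (dense) Gauss-Newton steps.

    residuals_jac(params) must return (r_d, J_d, r_r, J_r).
    A small diagonal jitter keeps the Cholesky factorisation well posed.
    """
    x = params.copy()
    for _ in range(iters):
        r_d, J_d, r_r, J_r = residuals_jac(x)
        H = J_d.T @ J_d + lam * (J_r.T @ J_r)        # approximate Hessian J^T J
        g = J_d.T @ r_d + lam * (J_r.T @ r_r)        # gradient J^T r
        L = np.linalg.cholesky(H + 1e-6 * np.eye(H.shape[0]))
        step = np.linalg.solve(L.T, np.linalg.solve(L, -g))
        x = x + step
    return x
```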

Page 57: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: inserting new nodes

• Unsupported surface vertices:
$$\min_{k \in N(\mathbf{v}_c)} \frac{\|\mathbf{dg}_v^{k} - \mathbf{v}_c\|}{\mathbf{dg}_w^{k}} \ge 1$$

• New nodes $\mathbf{dg}_v^{*} \in \widetilde{\mathbf{dg}}_v$
– their transformation parameters are initialized from the surrounding nodes using DQB:
$$\mathbf{dg}_{se3}^{*} \leftarrow \mathcal{W}_t(\mathbf{dg}_v^{*})$$

• Update: $\mathcal{N}_{\mathbf{warp}}^{t} = \mathcal{N}_{\mathbf{warp}}^{t-1} \cup \{\widetilde{\mathbf{dg}}_v, \widetilde{\mathbf{dg}}_{se3}, \widetilde{\mathbf{dg}}_w\}$ (see the sketch below)

57

Surface vertices

Node positions and radii of influence
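A rough sketch of the node-insertion step. Names and the ε spacing value are illustrative; `warp_fn(v)` stands for evaluating the current warp field at v (e.g. via DQB) in whatever representation the node transforms are stored in, and tying the new radius to the sampling spacing is a simplification.

```python
import numpy as np

def insert_nodes(vertices, node_pos, node_w, node_dq, warp_fn, eps=0.025):
    """Insert new deformation nodes to cover unsupported surface vertices.

    A vertex v_c is unsupported when min_k ||dg_v^k - v_c|| / dg_w^k >= 1.
    New nodes are subsampled from the unsupported vertices with spacing eps and
    their transforms are initialised by evaluating the current warp there.
    """
    d = np.linalg.norm(node_pos[None, :, :] - vertices[:, None, :], axis=2)  # (V, n)
    unsupported = vertices[(d / node_w[None, :]).min(axis=1) >= 1.0]

    new_pos, new_dq = [], []
    for v in unsupported:
        # Greedy eps-subsampling: skip vertices already near an accepted new node
        if new_pos and np.min(np.linalg.norm(np.array(new_pos) - v, axis=1)) < eps:
            continue
        new_pos.append(v)
        new_dq.append(warp_fn(v))          # dg*_se3 <- W_t(dg*_v)
    if not new_pos:
        return node_pos, node_w, node_dq
    new_pos = np.array(new_pos)
    new_w = np.full(len(new_pos), eps)     # radius tied to the sampling density
    return (np.vstack([node_pos, new_pos]),
            np.concatenate([node_w, new_w]),
            np.vstack([node_dq, np.array(new_dq)]))
```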

Page 58: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: experimental results

58

Canonical Model for “drinking from a cup”

(a) Canonical model warped into the live frame for “drinking from a cup”

Canonical Model for “Crossing fingers”

(b) Canonical model warped into the live frame for “crossing fingers”

Figure 5: Real-time non-rigid reconstructions for two deforming scenes. Upper rows of (a) and (b) show the canonical models as they evolve over time, lower rows show the corresponding warped geometries tracking the scene. In (a) complete models of the arm and the cup are obtained. Note the system's ability to deal with large motion and add surfaces not visible in the initial scene, such as the bottom of the cup and the back side of the arm. In (b) we show full body motions including clasping of the hands where we note that the model stays consistent throughout the interaction.

tracking scenes with more fluid deformations than shown in the results, but the long term stability can degrade and tracking will fail when the observed data term is not able to constrain the optimisation sufficiently. Finally, we show reconstruction results that are at the current limit of what can be obtained in real-time with DynamicFusion, and there are two limitations for scaling further. As in KinectFusion, volumetric TSDF memory limits geometric extent; but there are many solutions to this in the literature. More challenging is the estimation of a growing warp field. As the size and complexity of the scene increases, proportionally more is occluded from the camera, and the problem of predicting the motion of occluded areas becomes much more challenging. This is a subject of our ongoing research.

5. Conclusions

In this paper we introduced DynamicFusion, the first real-time dense dynamic scene reconstruction system, removing the static scene assumption pervasive across real-time 3D reconstruction and SLAM systems. We achieved this by generalising the volumetric TSDF fusion technique to the non-rigid case, as well as developing an efficient approach to estimate a volumetric 6D warp field in real-time. DynamicFusion obtains reconstructions of objects whilst they deform and provides dense correspondence across time. We believe that DynamicFusion, like KinectFusion, will open up a number of interesting applications of real-time 3D scanning and SLAM systems in dynamic environments.

Page 59: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: experimental results

59


Page 60: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

DynamicFusion: limitations

• Fragile under abrupt changes of the graph (topology) structure
– tends to fail when the scene suddenly transitions from a topologically closed state to an open one
– nodes that have been connected cannot suddenly be separated

• Failure modes shared with other real-time incremental trackers
– failed loop closure
– overly large motion between frames

60

Page 61: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

Summary

• Proposed a dense, real-time 3D reconstruction method for dynamic scenes
– no static-scene assumption

• Generalized volumetric TSDF fusion to the non-rigid case

• Estimates a volumetric 6-DoF warp field in real time

61

Page 62: 論文紹介"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"

END

62