Real-Time Large-Scale Fusion of High Resolution 3D Scans with Details Preservation
Hicham Sekkati, Jonathan Boisvert, Guy Godin, and Louis Borgeat
Digital Technologies/Computer Vision and Graphics
National Research Council Canada (NRC)
Ottawa, Canada
Email: [email protected]
Figure 1: (a) On-the-fly mapping of the knurled head screw scene using our anisotropic TSDF model. (b) Zoom on different parts, showing how our model preserves the details present in the original scans. (c) Mapping of the same scene using the full pipeline but with the standard TSDF. (d) Zoom on the same parts as in (b); notice how the averaging TSDF model over-smooths fine details at larger scale compared to the anisotropic TSDF.
Abstract—This paper presents a real-time 3D shape fusion system that faithfully integrates very high resolution 3D scans with the goal of maximizing details preservation. The system fully maps complex shapes while allowing free movement, similarly to dense SLAM systems in robotics, where sensor fusion techniques map large environments. We propose a novel framework to integrate shapes into a volume while preserving fine details of the reconstructed shape, an important aspect in many applications, especially industrial inspection. The truncated signed distance function is generalized with a global variational scheme that controls edge preservation and leads to cumulative updating rules adapted for GPU implementation. The framework also embeds a map deformation method to deform the shape online and correct the system trajectory drift to within a few microns. Results are presented from the integrated system on two mechanical objects and illustrate the benefits of the proposed approach.
Keywords-RGB-D; Real-time Fusion; Large-Scale 3D Scans; Volumetric Representation; Dense SLAM; GPU.
I. INTRODUCTION
New computational paradigms are emerging with the rapid advancement of high resolution 3D scanners and real-time visualization platforms. With scanners that can reconstruct 3D shape at resolutions ranging from a few microns up to 1 mm, a number of methods have been published over the last years that allow offline registration of multiple 3D data scans into a full high definition 3D model [19][1][6][15][3][11].
For applications requiring real-time interactivity such as robotics,
industrial inspection and augmented reality, mapping 3D data must
be done in real time, but often at the cost of lowering the resolution.
Producing large-scale 3D scans with high resolution is useful
for online industrial inspection, but requires complex engineering
and calibration to assemble 3D laser scanners with robotic arms
[27] or using very expensive commercial products. For all these
applications, 3D scanners based on structured-light sensors are preferred, because they can produce highly dense point clouds at higher speed.
However, the big challenge with 3D structured light systems is how
to fuse the 3D data on-the-fly and without alteration while moving
the system, especially if the system offers the capability to scan
millions of points per second. Unlike 3D reconstruction of objects
at small scales, which requires the object to be entirely or mostly
in the field of view of the camera, 3D reconstruction of complex
objects at large scales must be fused from views acquired along
a free arbitrary trajectory, each view exposing only a very small
part of the object. For inspection purposes, these views must be
fused online and without oversmoothing to produce an interactive
3D model that remains faithful to the input data.
Similarly, in robotics, recent RGB-D large-scale dense SLAM systems (LSD-SLAM) try to solve a similar problem, but at human environment scales [16][8][25][10][4][26][5]. LSD-SLAM methods perform well with low resolution sensors such as the Kinect, but they are not adapted to map on-the-fly the very high resolution point clouds streamed by modern fast 3D scanners. On the other hand, camera paths that thoroughly image all small patches at very close range lead to significant odometry drift, so the different views must be matched and registered globally. In the context of LSD-SLAM, online loop closure, which corrects drifts along the camera trajectory while mapping the environment, is very challenging, and only a few works have addressed this problem [12][23][26][18]. Even more challenging is online correction of the drifts at micron scale while mapping very high resolution scans.
We propose a global framework for a large-scale 3D scan system
having the following capabilities:
• Mapping, at the scanner frame rate, very high resolution 3D scans while preserving the fine details of each individual scan. This is
2018 15th Conference on Computer and Robot Vision
978-1-5386-6481-0/18/$31.00 ©2018 Crown. DOI 10.1109/CRV.2018.00019
Figure 2: System architecture diagram with its three main components each drawn with a different plain color.
done by integrating depth maps into a volume according to a
generalized truncated signed distance function allowing fine
details preservation.
• Embedding in our system an online deformation component
to allow free movement of the scanner while correcting on-the-fly camera drifts at micron scale.
The goal of this paper is not to thoroughly compare the performance of our large-scale 3D scanning system to Kinect-like LSD-SLAMs, mainly because of the large difference in input resolution between the two acquisition systems. However, we can still emulate an LSD-SLAM using our high resolution input data, and compare its output with that of the proposed pipeline.
The rest of the paper is organized in three main sections. The next section outlines the pipeline and describes the main components of our large-scale 3D scanning system. Section III presents the general framework of the anisotropic truncated signed distance function, from which global updating rules suitable for GPU implementation are derived. Section IV describes the method used in our system for online global consistency. The last section presents an evaluation of the proposed method, including both qualitative and quantitative evaluation of trajectory estimation performance, surface reconstruction quality and computational performance.
II. APPROACH OVERVIEW
The input to our system is a sequence of high resolution depth
maps that can be acquired using fast structured light systems (SLS).
The system is allowed to move freely relative to an object, or vice versa. Figure 2 provides an overview of the system components in
the form of three blocks that all run on GPU independently. The
blocks are described as follows:
• Tracker: To perform SLS tracking, we follow a similar
scheme used by Newcombe in [16]. Depth maps are integrated
into a volume using the truncated signed distance function
(TSDF) while a predicted surface is extracted by raycasting
the volume. The predicted surface is registered with the next captured depth map using the ICP algorithm. Our main extension to this pipeline is our generalized real-time TSDF model, which we describe in more detail in the following section. As will be seen there, this model is capable of extracting the finest details of a high resolution surface from a volume while keeping the same computational cost as the standard cumulative TSDF model used in most dense SLAM systems. Figure 1 shows the output of the tracker using our anisotropic TSDF model and the standard TSDF respectively. Note that computational costs are compared on the SLS acquisition system, because the frame rates and depth resolutions of the SLS differ from those of the Kinect-like sensors used by dense SLAM systems.
• Moving Volume and Cloud Extraction: In KinectFusion
[16], the camera is restricted to move within the initialized volume, and hence the resolution of the extracted surface is limited by the volume resolution. An extension to this volume representation was introduced by Whelan et al. [25], allowing the camera to move outside the volume and shifting the volume to recenter it around the camera pose. For scanning large areas, a cubic volume representation is sufficient. In that configuration, only the translational component of the camera pose was used to update the global pose of the TSDF (Figure 3a). To allow scanning objects of
different forms, we use a Rectangular Parallelepiped Volume
(RPV) representation. This permits resizing the volume for
each direction independently to fit object dimensions. In our
volume motion, a rotational component of the camera pose
was also used to update the TSDF volume position. An
optimal area is spanned using both translational and rotational
components of camera pose (Figure 3c) compared to only
translational component (Figure 3b). The operation of adding
a rotational component requires a re-sampling of the volume
at each move but its cost on GPU was insignificant and worth
the benefit of building a denser point cloud. After moving the volume, the points that fall outside the volume are extracted and shipped to CPU memory.
• Online deformation: Like in all SLAM systems, drift will
accumulate over time, thus requiring map adjustment. To achieve global consistency, localized non-rigid surface correction is of great interest, specifically in the realm of volumetric
Figure 3: Illustration of dynamic volume positioning along the camera trajectory using (a) a translational cubic volume, (b) a translational RPV, and (c) a translational and rotational RPV. The volume spans a larger area, shown with red lines, when rotation is also used to position the RPV.
fusion. To that end, pose graph optimization is suitable to
non-rigidly deform a shape using a graph deformation while
preserving details of the shape [22]. An important question
was raised in [26], whether or not it is optimal to attach highly
dense maps to a sparse pose graph structure as in feature-based visual SLAM systems. In Section IV, we will show how to extend the deformation model of Sumner [22] to adjust the maps online by pose graph deformation.
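As an illustration of the volume repositioning in the second component above, the following sketch centres the RPV a fixed standoff distance along the camera optical axis, using both the rotational and translational components of the camera pose (Figure 3c). It assumes the camera looks along its local +z axis; `standoff` is a hypothetical parameter, and the subsequent volume resampling step is not shown.

```python
import numpy as np

def volume_pose(R_cam, t_cam, standoff):
    """Return the RPV centre in world coordinates, placed `standoff`
    units along the camera optical axis (+z of the camera frame).
    Uses both rotation and translation of the pose, unlike the
    translation-only shifting of [25]."""
    optical_axis = R_cam @ np.array([0.0, 0.0, 1.0])
    return t_cam + standoff * optical_axis
```

With an identity pose the volume simply sits in front of the camera; once the camera rotates, the volume follows the viewing direction rather than only the camera position.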
III. GENERALIZED ANISOTROPIC TSDF MODEL
A. Background
Data are represented as depth maps that are converted into
3D truncated signed distance fields (TSDF). When depth errors are relatively high (> 1 mm), as is the case for Kinect-like sensors, extracting a surface from a TSDF volume has proven reliable for representing objects at large scales. The TSDF averages depths between the current frame point and the model point. However, when objects bear details at much finer scales, depth averaging smooths out the details, specifically over surface edges. The real-time anisotropic TSDF model presented in [13] takes into account anisotropic and inhomogeneous localization errors, similar to the way they were originally handled to generalize point-based ICP registration [14]. This model updates the TSDF for each point independently to remove noise, but can generate smooth surfaces that do not preserve the details of the geometry. Hence, a regularization process that smooths the TSDF while preserving the geometry of the surface has to be considered.
A more general and accurate model was presented in [20], which regularizes a variational signed distance field with an anisotropic smoothing term to obtain smooth surfaces while preserving ridges and corners. The directional signed distance is very fast to compute compared to the suggested closest signed distance. Even though a coarse-to-fine architecture can be implemented on GPU, the method is not meant for online mapping of high resolution depth maps: the algorithm was tested on low resolution data maps with a CPU implementation requiring hours of processing.
B. Anisotropic variational signed distance field
The goal of this section is to present an anisotropic TSDF
scheme that leads to on-line updating rules with numerical GPU
implementation. We follow a scheme similar to the one presented
in [20] but with different anisotropic regularization leading to GPU
implementation. This is a very important feature in large-scale 3D
scan systems: the goal is not only to reconstruct very high quality 3D models, but also to update the 3D model online for interactivity purposes while preserving the details mapped by each scan individually.
A depth image acquired by a high resolution 3D scanner is converted into a truncated 3D signed distance field f_l by computing the signed distance d_l(x) of a 3D point x in the volume domain Ω³ along the optical axis of the camera, scaling by a factor 1/δ, and then truncating to the interval [−1, 1]:

f_l(x) = F(d_l(x)),  with  F(d_l(x)) = sgn(d_l(x)/δ) if |d_l/δ| > 1,  and  d_l(x)/δ otherwise.   (1)
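Since the sign branch of (1) applies exactly when |d/δ| > 1, the truncation F is equivalent to clamping d_l(x)/δ to [−1, 1]. A minimal numpy sketch (illustrative only; the paper evaluates this per voxel on GPU):

```python
import numpy as np

def truncate_sdf(d, delta):
    """Truncated signed distance F(d) of Eq. (1): scale the signed
    distance by 1/delta, then clamp to [-1, 1] (the sign branch of
    Eq. (1) is exactly this clamping)."""
    return np.clip(np.asarray(d, dtype=float) / delta, -1.0, 1.0)
```

Points far in front of the surface saturate at +1, and points deep behind it at −1.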
Similarly to [28], we introduce a binary weighting ω_l(x) ∈ {0, 1} that controls the width of the occluded region behind the surface (d_l(x) < −η, where η > δ). A regularized field u that approximates all the truncated signed distance fields (f_l, l = 0, …, n) can be computed by minimizing the functional

∫_{Ω³} [ (Σ_l ω_l)⁻¹ Σ_l ω_l (u − f_l)² + λ φ(|∇u|) ] dx,   (2)

where λ is a positive real parameter that controls the smoothing, and the function φ permits anisotropic diffusion while preserving discontinuities of the approximated field u. The discrete Euler–Lagrange conditions corresponding to the minimization of (2) yield a large system of non-linear equations that is difficult to solve and cannot be parallelized for GPU implementation.
1) Approximation by half-quadratic algorithm: The half-quadratic algorithm, originally used in image restoration and optical flow estimation [2], and later extended to 3D interpretation [21], allows minimization of the non-convex model (2) by introducing a dual field b into an alternate model:

E(u, b) = ∫_{Ω³} [ (Σ_l ω_l)⁻¹ Σ_l ω_l (u − f_l)² + λ ( b |∇u|² + ψ(b) ) ] dx,   (3)

where the convex decreasing function ψ is related to φ through the minimization

φ(s) = inf_b ( b s² + ψ(b) ).   (4)
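The excerpt leaves φ and ψ generic. To make (4) and the resulting dual coefficient concrete, one classical edge-preserving choice (an assumption on our part, not stated in the paper) is the Perona-Malik potential φ(s) = κ² log(1 + (s/κ)²), for which b_s = φ′(s)/(2s) reduces to the familiar diffusivity 1/(1 + (s/κ)²):

```python
import numpy as np

def diffusivity(grad_mag, kappa):
    """b_s = phi'(s)/(2s) for the (assumed) Perona-Malik potential
    phi(s) = kappa^2 * log(1 + (s/kappa)^2). Near-zero gradients
    give b close to 1 (full smoothing); strong gradients drive
    b toward 0, which is what preserves edges."""
    s = np.asarray(grad_mag, dtype=float)
    return 1.0 / (1.0 + (s / kappa) ** 2)
```

Any other convex-dual pair (φ, ψ) satisfying (4) would slot into the same scheme.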
Minimization of the alternate model E(u, b) proceeds in two steps, by successively minimizing it with respect to each variable. The sequence alternating these two minimizations converges to a unique minimum (a proof can be found in [2]). The advantage of using model (3) instead of (2) is that in each step the objective function is convex, so optimality conditions for reaching the minimum can be derived. More precisely, the two-step minimization is as follows:
• Fixing u, the value that minimizes E(u, b) with respect to b is given by the analytic expression

b_s = φ′(s) / (2s).   (5)

• Fixing b = b_s, the Euler–Lagrange condition corresponding to minimizing E(u, b_s) with respect to u is

λ div(b_s ∇u) = (Σ_l ω_l)⁻¹ Σ_l ω_l (u − f_l),   (6)

where ∇ = (∂/∂x, ∂/∂y, ∂/∂z) denotes the vector of partial derivatives along the three Cartesian axes of the volume.
2) On-line updating rules: The goal here is to derive the up-
dating rules that approximate the cumulative signed distance fields
while integrating new surface measurements into the current TSDF
volume. These rules will be implemented using GPU memory,
hence they should efficiently use memory during the fusion without
adding an extra cost to the original fusion implementation in [16].
The discretization of the div term in (6), which is responsible for the anisotropic diffusion, was formulated early on in the Perona–Malik model [17] for 2D images and later extended to 3D in [9]. A more sophisticated discretization was proposed with a different scheme in [20], but with the drawback that it is not obvious how to derive efficient on-line updating rules from it. The discrete 3D extension of the divergence operator is straightforward and leads to
to
[div(bs∇)
]i,j,k
= β∑ξ
bξs∇ξ, (7)
where 0 < β ≤ 16 is a normalizing coefficient and ξ ∈
{x+, x−, y+, y−, z+, z−} defines the discrete directional deriva-
tives along all axes. Hence, the operator ∇ξ applied to u(i, j, k)is defined by,
∇ξu(i, j, k) = uξ(i, j, k)− u(i, j, k). (8)
where uξ(i, j, k) is the nearest-neighbor of u(i, j, k) along the
direction ξ. Using the above definition, for example the positive
directional derivative along the x axis is defined by,
∇x+u(i, j, k) = ux+(i, j, k)− u(i, j, k), (9)
with ux+(i, j, k) = u(i+ 1, j, k) and bx+
s can be computed using
the formula in (5),
bx+
s =φ′(|∇x+
u|)
2|∇x+u|(10)
By plugging equation (7) into (6), we get

u(i, j, k) = (α b(i, j, k) + 1)⁻¹ ( α ū(i, j, k) + Σ_l ω_l f_l / Σ_l ω_l ),   (11)

where α = λβ, b = Σ_ξ b_s^ξ, and ū = Σ_ξ b_s^ξ u^ξ. Let us denote the cumulative TSDFs and weights by the capital symbols U_n and W_n respectively; B_n and Ū_n are computed by the same formulas as b and ū, using the field U_n. To lighten notation, we drop the subscripts indexing voxels. Equation (11) can then be rewritten as

W_n U_n = (α B_n + 1)⁻¹ ( α Ū_n + W_{n−1} U_{n−1} + ω_n f_n ).   (12)
Using an iterative Gauss–Seidel scheme, we can derive the updating rules as follows:

W_n U_n^{(t)} = (α B_n^{(t−1)} + 1)⁻¹ ( α Ū_n^{(t−1)} + W_{n−1} U_{n−1} + ω_n f_n ),
W_n = W_{n−1} + ω_n.   (13)
By setting α = 0 in the updating rules (13), we recover the averaging TSDF updates of [16]. Notice also that the new online TSDF scheme does not require any additional GPU memory, as the auxiliary field b in equation (10) can be computed analytically from directional variations of the field u. The cumulative field U_{n−1} and the field f_n have already been registered by the ICP algorithm, hence only a few iterations are needed; in all our experiments we run 3 or 4 iterations of these updating rules.
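The updating rules above can be sketched on a numpy volume. This is illustrative only: the paper runs the update per voxel on GPU, `kappa` parameterizes an assumed Perona-Malik diffusivity for b (the paper leaves φ generic), and np.roll gives periodic borders, a simplification of the real boundary handling.

```python
import numpy as np

def _neighbors(U):
    """Yield the six nearest-neighbor volumes U^xi,
    xi in {x+, x-, y+, y-, z+, z-} (periodic via np.roll)."""
    for axis in range(3):
        for shift in (+1, -1):
            yield np.roll(U, shift, axis=axis)

def fuse_frame(U_prev, W_prev, f, w, alpha=0.1, kappa=0.5, iters=3):
    """Integrate one truncated field f (binary weights w) into the
    cumulative TSDF following the updating rules (13).
    alpha = 0 recovers plain TSDF averaging as in [16]."""
    W = W_prev + w                      # W_n = W_{n-1} + omega_n
    data = W_prev * U_prev + w * f      # W_{n-1} U_{n-1} + omega_n f_n
    U = U_prev.copy()
    for _ in range(iters):              # a few sweeps suffice after ICP
        B = np.zeros_like(U)            # B = sum_xi b^xi
        Ubar = np.zeros_like(U)         # Ubar = sum_xi b^xi u^xi
        for Uxi in _neighbors(U):
            b = 1.0 / (1.0 + ((Uxi - U) / kappa) ** 2)  # Eq. (10), assumed phi
            B += b
            Ubar += b * Uxi
        U = (alpha * Ubar + data) / ((alpha * B + 1.0) * W)  # Eq. (13)
    return U, W
```

With alpha = 0 the loop body reduces to the familiar running weighted average (W_{n−1}U_{n−1} + ω_n f_n)/W_n.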
For our SLS tracking implementation, we used the KinectFusion implementation [16] with several adaptations to cope with SLS high resolution scans: changing the camera model in both raycasting and surface integration to fit the SLS projection model, changing the TSDF data structure to incorporate anisotropic diffusion, and adopting a rotational TSDF voxel buffer, followed by volume resampling, to cope with the arbitrarily moving volume. We also adopted the RGB-D variant of the ICP implementation, as in [25].
IV. ONLINE GLOBAL CONSISTENCY
A. Background
Existing dense SLAM systems that close occasional loops along corridor-like trajectories perform poorly when many small localized loops have to be adjusted online, as with hand-held systems. Applying loop closures early and often, as in [12][23][26][18], has shown higher performance than existing dense SLAM systems that rely on offline pose graph optimization [24][25]. All these methods use point-based fusion instead of a volumetric representation and are more suitable for representing objects at large scale with low resolution. On the other hand, as mentioned earlier, at finer scales a volumetric representation handles well the topological changes often encountered in prefabricated objects bearing very sharp details. To the best of our knowledge, no online global consistency method using the latter representation has been explored yet. The goal of this section is to present an online scheme with both capabilities: adjusting small localized drifts by continually deforming a graph model, while constraining the model to occasionally close large loops when detected or needed. In the experimental section, we will compare our full pipeline with and without online deformation. Our pipeline without online deformation emulates the LSD-SLAM presented in [25], which deforms the camera trajectory and maps only when a loop closure is detected.
B. Online deformation graph
The Sumner model [22] represents the tracked camera poses by one affine transformation per pose, embeds the poses in a graph, and subsequently deforms a shape according to the graph deformation. The graph node positions are denoted by {g_j}_{j=1}^m, and each node j is assigned an affine transformation {R_j, t_j}. A rigidity term expresses that each matrix R_j must preserve lengths, i.e., each R_j should be a rotation matrix satisfying an orthonormality constraint:

E_rig({R_j}_{j=1}^m) = Σ_j ||R_j R_j^T − I||_F².   (14)
A second term regularizes the graph positions, ensuring a smooth graph, by minimizing the energy

E_reg({R_j, t_j}_{j=1}^m) = Σ_{j=1}^m Σ_{k∈N_j} ||R_j(g_k − g_j) + g_j + t_j − (g_k + t_k)||².   (15)
We define a cost that deforms the graph incrementally, whenever a new pose estimate is added to the graph. This online cost jointly minimizes the two terms E_rig and E_reg:

E_onl = E_rig + γ E_reg.   (16)
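The three energies can be evaluated directly from a set of node transformations. A small numpy sketch of (14)-(16) follows; it only evaluates the cost (the sparse linear solve that minimizes it is not shown), and the list-of-neighbors encoding of N_j is our own convention.

```python
import numpy as np

def e_rig(Rs):
    """Rigidity term (14): sum_j ||R_j R_j^T - I||_F^2."""
    return sum(float(np.sum((R @ R.T - np.eye(3)) ** 2)) for R in Rs)

def e_reg(Rs, ts, g, neighbors):
    """Regularization term (15); g is the (m, 3) array of node
    positions and neighbors[j] lists the indices in N_j."""
    e = 0.0
    for j, Nj in enumerate(neighbors):
        for k in Nj:
            r = Rs[j] @ (g[k] - g[j]) + g[j] + ts[j] - (g[k] + ts[k])
            e += float(r @ r)
    return e

def e_onl(Rs, ts, g, neighbors, gamma=1.0):
    """Combined online cost (16): E_rig + gamma * E_reg."""
    return e_rig(Rs) + gamma * e_reg(Rs, ts, g, neighbors)
```

Identity rotations and zero translations give a zero cost, as expected: the graph is undeformed and perfectly rigid.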
Figure 4: Fusion of the knurled knob scene without running our online deformation model. Mapping of the point cloud is shown in (a),
and with texture mapping in (b). The red rectangle shows the part where loop closure occurs and a close up view of it is shown in (c).
The loop closure part is compared to the original scan and errors are displayed in (d) as a color mapping with an average RMSE of
0.085 mm.
where γ is a constant that balances graph smoothness against rigidity. Contrary to the method presented in [25], which adds a non-linear third term that takes into account user-defined vertex positions, minimizing the cost E_onl is very cheap, as the minimum is obtained by solving a sparse linear system. At this stage, avoiding an iterative optimization significantly alleviates the computational burden without delaying the front-end loop. The graph connectivity, determined by the sets N_j, can be defined by the k nearest nodes to node j as in [22], or by following the temporal sampling of the graph as in [25][26]. Neither of these connectivities is optimal: the former connects only nodes that are spatially close to each other, while the latter connects nodes in temporal order without considering the spatial distribution of the neighborhood. We use a combined algorithm that connects the graph by following the temporal sampling while remaining spatially aware of each node's neighborhood. A one-step refinement of the graph poses is run if a loop closure is detected.
C. Loop closure
Loop closure can be triggered manually or automatically by a place recognition module. Conventional LSD-SLAM uses a heavy place recognition module to detect when a loop is closed. Here we choose to trigger loop closure manually after completing the full scan. Place recognition modules with features such as SURF, Ferns or other 3D descriptors can work well for sparse feature-based SLAMs, or even for dense SLAMs with low resolution depth maps. In our case, we stream depth maps at nearly 4K resolution, and applying any place recognition module would drastically slow down the front-end process. Once the scan is completed, the graph deformation can be finely tuned by adding a third term to the energy optimization (16):

E_con = Σ_l ||v_l − q_l||,   (17)

where the q_l are features extracted in the first frame and the v_l are the deformed positions of the corresponding features in the triggered frame.
D. Map deformation
Graph deformation is used to deform each shape vertex v_i by the following linear blending [22]:

v̂_i = Σ_{j=1}^m ω_j(v_i) [ R_j(v_i − g_j) + g_j + t_j ],   (18)

where the weights ω_j(v_i) are computed by

ω_j(v_i) = (1 − ||v_i − g_j|| / d_max)².   (19)
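A sketch of the blending (18)-(19) for a single vertex follows. Two conventions are assumed on our part (the excerpt leaves them implicit, but they match Sumner et al. [22]): the inner term of (19) is clamped at zero so nodes farther than d_max get zero weight, and the weights are normalized to sum to one.

```python
import numpy as np

def deform_vertex(v, g, Rs, ts, d_max):
    """Deform vertex v by linear blending (18) with weights (19);
    g is the (m, 3) array of node positions, Rs/ts the per-node
    affine transformations. Weights are clamped and normalized
    (assumed conventions, following Sumner et al.)."""
    w = np.maximum(0.0, 1.0 - np.linalg.norm(v - g, axis=1) / d_max) ** 2
    w = w / w.sum()
    out = np.zeros(3)
    for wj, gj, Rj, tj in zip(w, g, Rs, ts):
        out += wj * (Rj @ (v - gj) + gj + tj)  # Eq. (18)
    return out
```

With identity rotations and zero translations every vertex maps to itself, and a common translation of all nodes translates the vertex rigidly.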
Figure 5: The volume shown after fusing 36 frames. The partial camera trajectory is shown in blue, and the camera axes are displayed in three different colors (RGB). The volume size is 10 × 8 × 6 mm³ with a resolution of 960³ voxels.
V. EXPERIMENTS
We have conducted several experiments to evaluate the per-
formance of our system. RGB-D frames are acquired using the structured light system (SLS) described in [7], with a frame rate on the order of 3 Hz and a frame resolution of 4000 × 3000 pixels. A point cloud is generated by applying an undistortion model to the corresponding depth map, followed by reprojection to the 3D scene using standard triangulation. To evaluate different aspects of the system performance, we use two machined steel objects that were painted to reduce inter-reflections. The first object is a cylindrical
Figure 6: Fusion of the knurled knob scene with our online deformation model. Mapping of the point cloud is shown in (a), and with
texture mapping in (b). The red rectangle shows the part where loop closure occurs and a close up view of it is shown in (c). The loop
closure part is compared to the original scan and errors are displayed in (d) as a color mapping with an average RMSE of 0.012 mm.
knurled head screw with a straight pattern, of height H = 14.8 mm and diameter D = 32.2 mm, with a distance between knurled lines of d = 1 mm. The second object is a cylindrical knurled knob with a wave pattern, of height H = 7.25 mm and exterior diameter De = 34 mm. Instead of moving the system around the objects, which would require a precise positioning system because of the narrow depth of field (∼ 10 mm), we fix the system and put the objects on a turntable. The RGB-D frames are fed to our fusion system as they are captured while the objects spin through a full revolution in front of the SLS. At a frame rate of 3 Hz, the SLS acquires 440 frames to cover the full revolution. The volume was initialized to 10 × 8 × 6 mm³ with a resolution of 960³ voxels (Figure 5).
To show the effect of our generalized model (GTSDF), depth maps are fused with and without anisotropic diffusion (Figure 1). Contrary to the smoothing effect of the TSDF, the generalized TSDF preserves almost all details of the original scans. In both figures, some of the details are magnified in Figures 1-b and 1-d respectively. As ground truth is not available for the full scan, the error in each magnified area can be assessed by computing cloud distances to the original scan, taken as the reference cloud. The cloud-to-cloud (C2C) average distances in each area are given in Table I. We emphasize that the errors in Table I do not take the online deformation process into account. Next, global error is assessed for the tracker and the full pipeline.
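The C2C measure used throughout this section can be sketched as a nearest-neighbor distance averaged over the evaluated cloud. A brute-force numpy version follows (clouds of realistic size would need a k-d tree; this is an illustrative sketch, not the evaluation code used in the paper):

```python
import numpy as np

def c2c_average(cloud, reference):
    """Average cloud-to-cloud (C2C) distance: for each point of
    `cloud`, take the distance to its nearest neighbor in
    `reference`, then average. Both inputs are (n, 3) arrays."""
    d2 = np.sum((cloud[:, None, :] - reference[None, :, :]) ** 2, axis=-1)
    return float(np.mean(np.sqrt(d2.min(axis=1))))
```

Identical clouds give a zero distance; a cloud rigidly offset from its reference reports the offset magnitude.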
Table I: Comparison of the average C2C errors on the zoomed parts of the knurled head screw scene in Figures 1-b and 1-d, using both TSDF and GTSDF.

Zoomed area     Error TSDF (μm)   Error GTSDF (μm)
Upper-Left          178.88             20.15
Upper-Right         168.01             21.85
Lower-Left          188.50             23.15
Lower-Right         190.44             25.33
We compare loop closure errors and map deformation errors using two methods: i) the pipeline with only offline deformation, which emulates the output of the LSD-SLAM method in [25] if used with our high resolution input; and ii) our full pipeline with the online deformation process, as shown in Figure 2. With an accurate turntable, we can easily evaluate the camera trajectory by computing the root-mean-square error (RMSE) between the initial and final camera poses. Depth map deformation, on the other hand, can be evaluated in the same way we evaluate the TSDFs, using C2C distances with the original scans as references. For visual inspection, we also display the C2C distance errors on a color scale. Figures 4 and 6 show the fusion results of the knurled knob scene using the two methods mentioned above. The same fusion results are shown in Figure 7 for the knurled head screw scene. Figure 8 shows in blue the camera trajectory for the knurled knob, including all camera poses embedded in the graph; the last camera pose is drawn in red (Figure 8 (c) and (f)). Errors on camera trajectories and map deformations are summarized in Table II for both examples. The processing time to fuse all frames is 145 seconds with online map deformation and 147 seconds without, which matches the acquisition time. The processing times with TSDF and GTSDF are also nearly the same. These results show that the computational costs of the GTSDF and of the drift correction are both too small to affect real-time processing. The computational performance of the system was evaluated on a Windows desktop PC with an Intel Xeon E5-1650 v3 CPU at 3.50 GHz, 24 GB of RAM, and a GeForce GTX 980 Ti graphics card with 6 GB of GPU memory.
Table II: Comparing errors on loop closure trajectory and map
deformation using two methods.
Figure 7: Fusions of the knurled head screw scene without and with the online deformation model are shown in (a) and (d) respectively. The red rectangle shows the part where loop closure occurs; a close-up view of it is shown in (b) ((e) resp.). The loop closure part is compared to the original scan, and errors are displayed in (c) ((f) resp.) as a color mapping with an average RMSE of 0.105 mm (0.015 mm resp.).
                             knurled head screw         knurled knob
Errors                       Method [25]  New method    Method [25]  New method
Camera trajectory (μm)          95.80        5.70         165.23        6.20
C2C average distance (μm)      105.10       15.22          85.56       12.25
VI. CONCLUSION
We presented in this paper an approach to perform large-scale
data fusion on 3D measurements acquired from high-resolution
structured-light sensors that are becoming more and more common
for industrial inspection applications. A generalized truncated
signed distance function (TSDF) integration model that preserves
more details because of its anisotropy was presented as well as
an efficient numerical scheme that is compatible with real-time performance. An online deformation component that allows free
movement of the scanner while correcting camera drifts at microns
scale was also introduced. In order to illustrate practical gains, we
chose two different objects and performed data fusion on the results
of 3D scans performed at a lateral resolution of approximately 10
μm. Our approach was able to successfully integrate the 3D data
stream while preserving more details than a conventional approach.
This was illustrated by point cloud comparison between the fused
data and registered raw sensor data. Cloud-to-cloud distance was
reduced by a factor of 5 or greater.
Future work includes performance optimization, detailed validation based on traceable artifacts, and the integration of a noise
model for the 3D sensor to discriminate between high-quality and
low-quality 3D points (based on factors such as distance to the
focal plane, angles between surface normal and optical axes, etc.).
REFERENCES
[1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust pairwise surface registration. ACM Trans. Graph., 27(3):85:1–85:10, 2008.
[2] G. Aubert, R. Deriche, and P. Kornprobst. Computing optical flow via variational techniques. SIAM Journal of Applied Mathematics, 60(1):156–182, 1999.
[3] M. Brophy, A. Chaudhury, S. S. Beauchemin, and J. L. Barron. A method for global non-rigid registration of multiple thin structures. In CRV, pages 214–221. IEEE Computer Society, 2015.
[4] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In CVPR, pages 5556–5565. IEEE Computer Society, 2015.
[5] A. Dai, M. Nießner, M. Zollhofer, S. Izadi, and C. Theobalt. BundleFusion: real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. ACM Trans. Graph., 36(4), 2017.
[6] J. Digne, J.-M. Morel, N. Audfray, and C. Lartigue. High fidelity scan merging. Comput. Graph. Forum, 29(5):1643–1651, 2010.
[7] M.-A. Drouin, F. Blais, and G. Godin. High resolution projector for 3D imaging. In 3D Vision (3DV), 2014 2nd International Conference on, volume 1, pages 337–344. IEEE, 2014.
[8] N. Fioraio, J. Taylor, A. W. Fitzgibbon, L. di Stefano, and S. Izadi. Large-scale and drift-free surface reconstruction using online subvolume registration. In CVPR, pages 4475–4483. IEEE Computer Society, 2015.
[9] G. Gerig, O. Kubler, R. Kikinis, and F. A. Jolesz. Nonlinear anisotropic filtering of MRI data. IEEE Trans. Med. Imaging, 11(2):221–232, 1992.
[10] D. Holz and S. Behnke. Approximate surface reconstruction and registration for RGB-D SLAM. In ECMR, pages 1–8. IEEE, 2015.
[11] W. Kehl, T. Holl, F. Tombari, S. Ilic, and N. Navab. An octree-based approach towards efficient variational range data fusion. CoRR, abs/1608.07411, 2016.
[12] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In 3DV, pages 1–8. IEEE Computer Society, 2013.
Figure 8: Camera trajectory of the knurled knob scene without the online deformation model, shown with blue poses in (a). The red rectangle shows the part where loop closure occurs; a close-up view of it is shown in (b). The white circled area in (b), zoomed in (c), shows the last camera pose in red with the camera axes in RGB. The loop closure error (0.165 mm) is computed as the distance between the first and last camera poses. This is compared to the scenario using the online deformation model in the second column (Figures (d), (e), and (f)), with a loop closure error of 0.006 mm.
[13] D. Lefloch, T. Weyrich, and A. Kolb. Anisotropic point-based fusion. In FUSION, pages 2121–2128. IEEE, 2015.
[14] L. Maier-Hein, A. M. Franz, T. R. dos Santos, M. Schmidt, M. Fangerau, H.-P. Meinzer, and J. M. Fitzpatrick. Convergent iterative closest-point algorithm to accomodate anisotropic and inhomogenous localization error. IEEE Trans. Pattern Anal. Mach. Intell., 34(8):1520–1532, 2012.
[15] N. Mellado, D. Aiger, and N. J. Mitra. Super 4PCS fast global pointcloud registration via smart indexing. Comput. Graph. Forum, 33(5):205–215, 2014.
[16] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136. IEEE Computer Society, 2011.
[17] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Machine Intell., 12(7):629–639, July 1990.
[18] P. Puri, D. Jia, and M. Kaess. GravityFusion: Real-time dense mapping without pose graph using deformation and orientation. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, IROS, September 2017.
[19] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3D model acquisition. In J. Hughes, editor, SIGGRAPH 2002 Conference Proceedings, Annual Conference Series, pages 438–446. ACM Press/ACM SIGGRAPH, 2002.
[20] C. Schroers, H. Zimmer, L. Valgaerts, A. Bruhn, O. Demetz, and J. Weickert. Anisotropic range image integration. In A. Pinz, T. Pock, H. Bischof, and F. Leberl, editors, Pattern Recognition - Joint 34th DAGM and 36th OAGM Symposium, Graz, Austria, August 28-31, 2012. Proceedings, volume 7476 of Lecture Notes in Computer Science, pages 73–82. Springer, 2012.
[21] H. Sekkati and A. Mitiche. Dense 3D interpretation of image sequences: A variational approach using anisotropic diffusion. In CIAP, pages 424–429, 2003.
[22] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. ACM Trans. Graph., 26(3):80, 2007.
[23] B. Ummenhofer and T. Brox. Point-based 3D reconstruction of thin objects. In ICCV, pages 969–976. IEEE Computer Society, 2013.
[24] T. Weise, T. Wismer, B. Leibe, and L. V. Gool. In-hand scanning with online loop closure. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 1630–1637, Sept 2009.
[25] T. Whelan, M. Kaess, H. Johannsson, M. F. Fallon, J. J. Leonard, and J. McDonald. Real-time large-scale dense RGB-D SLAM with volumetric fusion. I. J. Robotics Res., 34(4-5):598–626, 2015.
[26] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. I. J. Robotics Res., 35(14):1697–1716, 2016.
[27] S. Yin, Y. Ren, Y. Guo, J. Zhu, S. Yang, and S. Ye. Development and calibration of an integrated 3D scanning system for high-accuracy large-scale metrology. Measurement, 54:65–76, 2014.
[28] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In ICCV, pages 1–8. IEEE Computer Society, 2007.