Real-Time Large-Scale Fusion of High Resolution 3D Scans with Details Preservation
Hicham Sekkati, Jonathan Boisvert, Guy Godin, and Louis Borgeat
Digital Technologies/Computer Vision and Graphics
National Research Council Canada (NRC)
Ottawa, Canada
Email: [email protected]
Figure 1: (a) On-the-fly mapping of the knurled head screw scene using our anisotropic TSDF model. (b) Zoom on different parts, showing how our model preserves the details present in the original scans. (c) Mapping of the same scene using the full pipeline but with the standard TSDF. (d) Zoom on the same parts as in (b); notice how the averaging TSDF model over-smooths fine details at larger scale compared to the anisotropic TSDF.
Abstract—This paper presents a real-time 3D shape fusion system that faithfully integrates very high resolution 3D scans with the goal of maximizing details preservation. The system fully maps complex shapes while allowing free movement, similarly to dense SLAM systems in robotics, where sensor fusion techniques map large environments. We propose a novel framework to integrate shapes into a volume while preserving fine details of the reconstructed shape, an important aspect in many applications, especially industrial inspection. The truncated signed distance function is generalized with a global variational scheme that controls edge preservation and leads to cumulative updating rules adapted for GPU implementation. The framework also embeds a map deformation method to deform the shape online and correct the system trajectory drift to within a few microns. Results are presented from the integrated system on two mechanical objects and illustrate the benefits of the proposed approach.
Keywords-RGB-D; Real-time Fusion; Large-Scale 3D Scans; Volumetric Representation; Dense SLAM; GPU.
I. INTRODUCTION
New computational paradigms are emerging with the rapid advancement of high resolution 3D scanners and real-time visualization platforms. With scanners that can reconstruct 3D shape at resolutions ranging from a few microns up to 1 mm, a number of methods have been published over the last years that allow offline registration of multiple 3D data scans into a full high definition 3D model [19][1][6][15][3][11].
For applications requiring real-time interactivity such as robotics,
industrial inspection and augmented reality, mapping 3D data must
be done in real time, but often at the cost of lowering the resolution.
Producing large-scale 3D scans with high resolution is useful
for online industrial inspection, but requires complex engineering
and calibration to assemble 3D laser scanners with robotic arms
[27] or using very expensive commercial products. For all these
applications, 3D scanners based on structured-light sensors are preferred, because they can produce highly dense point clouds at higher speed.
However, the big challenge with 3D structured light systems is how
to fuse the 3D data on-the-fly and without alteration while moving
the system, especially if the system offers the capability to scan
millions of points per second. Unlike 3D reconstruction of objects
at small scales, which requires the object to be entirely or mostly
in the field of view of the camera, 3D reconstruction of complex
objects at large scales must be fused from views acquired along
a free arbitrary trajectory, each view exposing only a very small
part of the object. For inspection purposes, these views must be
fused online and without oversmoothing to produce an interactive
3D model that remains faithful to the input data.
Similarly, in robotics, recent RGB-D large-scale dense SLAM systems (LSD-SLAM) try to solve a similar problem, but at human environment scales [16][8][25][10][4][26][5]. LSD-SLAM methods perform well with low resolution sensors such as the Kinect, but they are not adapted to map on-the-fly the very high resolution point clouds streamed by modern fast 3D scanners. On the other hand, camera paths that thoroughly image all small patches at very close range lead to significant odometry drift, so the different views must be matched and registered globally. In the context of LSD-SLAM, online loop closure, which corrects drifts along the camera trajectory while mapping the environment, is very challenging, and only a few works have addressed this problem [12][23][26][18]. Even more challenging is online correction of the drifts at micron scale while mapping very high resolution scans.
We propose a global framework for a large-scale 3D scan system
having the following capabilities:
• Mapping, at the scanner frame rate, very high resolution 3D scans while preserving the fine details of each individual scan. This is
2018 15th Conference on Computer and Robot Vision
978-1-5386-6481-0/18/$31.00 ©2018 Crown. DOI 10.1109/CRV.2018.00019
Figure 2: System architecture diagram with its three main components each drawn with a different plain color.
done by integrating depth maps into a volume according to a
generalized truncated signed distance function allowing fine
details preservation.
• Embedding in our system an online deformation component
to allow free movement of the scanner while correcting on-the-fly camera drifts at micron scale.
The goal of this paper is not to thoroughly compare the performance of our large-scale 3D scanning system to Kinect-like LSD-SLAMs, mainly because of the large difference in input resolution between the two acquisition systems. However, we can still emulate an LSD-SLAM using our high resolution input data, and compare its output with that of the proposed pipeline.
The rest of the paper is organized in three main sections. The next section outlines the pipeline and describes the main components of our large-scale 3D scanning system. Section III presents the general framework of the anisotropic truncated signed distance function, from which global updating rules suitable for GPU implementation are derived. Section IV describes the method used in our system for online global consistency. The last section presents an evaluation of the proposed method, including both qualitative and quantitative evaluation of trajectory estimation performance, surface reconstruction quality and computational performance.
II. APPROACH OVERVIEW
The input to our system is a sequence of high resolution depth
maps that can be acquired using fast structured light systems (SLS).
The system is allowed to move freely relative to an object, or vice versa. Figure 2 provides an overview of the system components in
the form of three blocks that all run on GPU independently. The
blocks are described as follows:
• Tracker: To perform SLS tracking, we follow a similar
scheme used by Newcombe in [16]. Depth maps are integrated
into a volume using the truncated signed distance function
(TSDF) while a predicted surface is extracted by raycasting
the volume. The predicted surface is registered with the next captured depth map using the ICP algorithm. Our main extension to this pipeline is our generalized real-time TSDF model, which we describe in more detail in the following section. As will be seen there, this model is capable of extracting the finest details of a high resolution surface from a volume while keeping the same computational cost as the standard cumulative TSDF model used in most dense SLAM systems. Figure 1 shows the output of the tracker using our anisotropic TSDF model and the standard TSDF respectively. Note that computational costs are compared on the SLS acquisition system, because the frame rates and depth resolutions of the SLS differ from those of the Kinect-like sensors used by dense SLAM systems.
• Moving Volume and Cloud Extraction: In KinectFusion
[16], the camera is restricted to move within the initialized volume, and hence the resolution of the extracted surface is limited by the volume resolution. An extension to this volume representation was introduced by Whelan et al. [25], allowing the camera to move outside the volume and shifting the volume to recenter it around the camera pose. For scanning large areas, a cubic volume representation is sufficient. In that configuration, only the translational component of the camera pose was used to update the global pose of the TSDF (Figure 3a). To allow scanning objects of
different forms, we use a Rectangular Parallelepiped Volume
(RPV) representation. This permits resizing the volume for
each direction independently to fit object dimensions. In our
volume motion, a rotational component of the camera pose
was also used to update the TSDF volume position. An
optimal area is spanned using both translational and rotational
components of camera pose (Figure 3c) compared to only
translational component (Figure 3b). The operation of adding
a rotational component requires a re-sampling of the volume
at each move but its cost on GPU was insignificant and worth
the benefit of building a denser point cloud. After moving the volume, the points that fall outside the volume are extracted and shipped to CPU memory.
• Online deformation: Like in all SLAM systems, drift will
accumulate over time, thus requiring map adjustment. To achieve global consistency, localized non-rigid surface correction is of great interest, specifically in the realm of volumetric
Figure 3: Illustration of dynamic volume positioning along the camera trajectory using (a) a translational cubic volume, (b) a translational RPV, and (c) a translational and rotational RPV. The volume spans a larger area, shown with red lines, when rotation is also used to position the RPV.
fusion. To that end, pose graph optimization is suitable to
non-rigidly deform a shape using a graph deformation while
preserving details of the shape [22]. An important question
was raised in [26], whether or not it is optimal to attach highly
dense maps to a sparse pose graph structure as in feature-based visual SLAM systems. In Section IV, we will show how to extend the deformation model of Sumner [22] to adjust the maps online by pose graph deformation.
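As an illustration of the volume repositioning in the second component above, the following sketch centres the RPV a fixed standoff distance along the camera optical axis, using both the rotational and translational components of the camera pose (Figure 3c). It assumes the camera looks along its local +z axis; `standoff` is a hypothetical parameter, and the subsequent volume resampling step is not shown.

```python
import numpy as np

def volume_pose(R_cam, t_cam, standoff):
    """Return the RPV centre in world coordinates, placed `standoff`
    units along the camera optical axis (+z of the camera frame).
    Uses both rotation and translation of the pose, unlike the
    translation-only shifting of [25]."""
    optical_axis = R_cam @ np.array([0.0, 0.0, 1.0])
    return t_cam + standoff * optical_axis
```

With an identity pose the volume simply sits in front of the camera; once the camera rotates, the volume follows the viewing direction rather than only the camera position.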
III. GENERALIZED ANISOTROPIC TSDF MODEL
A. Background
Data are represented as depth maps that are converted into
3D truncated signed distance fields (TSDF). When depth errors are relatively high (> 1 mm), as is the case for Kinect-like sensors, extracting a surface from a TSDF volume has proven reliable for representing objects at large scales. The TSDF averages depths between the current frame point and the model point. However, when objects bear details at much finer scales, depth averaging smooths out the details, specifically over surface edges. The real-time anisotropic TSDF model presented in [13] takes into account anisotropic and inhomogeneous localization errors, similar to the way they were originally handled to generalize point-based ICP registration [14]. This model updates the TSDF for each point independently to remove noise, but can generate smooth surfaces that do not preserve the details of the geometry. Hence, a regularization process that smooths the TSDF while preserving the geometry of the surface has to be considered.
A more general and accurate model was presented in [20], which regularizes a variational signed distance field with an anisotropic smoothing term to obtain smooth surfaces while preserving ridges and corners. The directional signed distance is very fast to compute compared to the suggested closest signed distance. Even though a coarse-to-fine architecture can be implemented on GPU, the method is not meant for online mapping of high resolution depth maps: the algorithm was tested on low resolution data maps with a CPU implementation requiring hours of processing.
B. Anisotropic variational signed distance field
The goal of this section is to present an anisotropic TSDF
scheme that leads to on-line updating rules with numerical GPU
implementation. We follow a scheme similar to the one presented
in [20] but with different anisotropic regularization leading to GPU
implementation. This is a very important feature in large-scale 3D
scan systems: the goal is not only to reconstruct very high quality 3D models, but also to update the 3D model online for interactivity purposes while preserving the details mapped by each scan individually.
A depth image acquired by a high resolution 3D scanner is converted into a truncated 3D signed distance field f_l by computing the signed distance d_l(x) of a 3D point x in the volume domain Ω³ along the optical axis of the camera, scaling by a factor 1/δ, and then truncating to the interval [−1, 1]:

f_l(x) = F(d_l(x)),  with  F(d_l(x)) = sgn(d_l(x)/δ) if |d_l/δ| > 1,  and  d_l(x)/δ otherwise.   (1)
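Since the sign branch of (1) applies exactly when |d/δ| > 1, the truncation F is equivalent to clamping d_l(x)/δ to [−1, 1]. A minimal numpy sketch (illustrative only; the paper evaluates this per voxel on GPU):

```python
import numpy as np

def truncate_sdf(d, delta):
    """Truncated signed distance F(d) of Eq. (1): scale the signed
    distance by 1/delta, then clamp to [-1, 1] (the sign branch of
    Eq. (1) is exactly this clamping)."""
    return np.clip(np.asarray(d, dtype=float) / delta, -1.0, 1.0)
```

Points far in front of the surface saturate at +1, and points deep behind it at −1.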
Similarly to [28], we introduce a binary weighting ω_l(x) ∈ {0, 1} that controls the width of the occluded region behind the surface (d_l(x) < −η, where η > δ). A regularized field u that approximates all the truncated signed distance fields (f_l, l = 0, …, n) can be computed by minimizing the functional

∫_{Ω³} [ (Σ_l ω_l)⁻¹ Σ_l ω_l (u − f_l)² + λ φ(|∇u|) ] dx,   (2)

where λ is a positive real parameter that controls the smoothing, and the function φ permits anisotropic diffusion while preserving discontinuities of the approximated field u. The discrete Euler–Lagrange conditions corresponding to the minimization of (2) yield a large system of non-linear equations that is difficult to solve and cannot be parallelized for GPU implementation.
1) Approximation by half-quadratic algorithm: The half-quadratic algorithm, originally used in image restoration and optical flow estimation [2], and later extended to 3D interpretation [21], allows minimization of the non-convex model (2) by introducing a dual field b into an alternate model:

E(u, b) = ∫_{Ω³} [ (Σ_l ω_l)⁻¹ Σ_l ω_l (u − f_l)² + λ ( b |∇u|² + ψ(b) ) ] dx,   (3)

where the convex decreasing function ψ is related to φ through the minimization

φ(s) = inf_b ( b s² + ψ(b) ).   (4)
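The excerpt leaves φ and ψ generic. To make (4) and the resulting dual coefficient concrete, one classical edge-preserving choice (an assumption on our part, not stated in the paper) is the Perona-Malik potential φ(s) = κ² log(1 + (s/κ)²), for which b_s = φ′(s)/(2s) reduces to the familiar diffusivity 1/(1 + (s/κ)²):

```python
import numpy as np

def diffusivity(grad_mag, kappa):
    """b_s = phi'(s)/(2s) for the (assumed) Perona-Malik potential
    phi(s) = kappa^2 * log(1 + (s/kappa)^2). Near-zero gradients
    give b close to 1 (full smoothing); strong gradients drive
    b toward 0, which is what preserves edges."""
    s = np.asarray(grad_mag, dtype=float)
    return 1.0 / (1.0 + (s / kappa) ** 2)
```

Any other convex-dual pair (φ, ψ) satisfying (4) would slot into the same scheme.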
Minimization of the alternate model E(u, b) proceeds in two steps, by successively minimizing it with respect to each variable. The sequence alternating these two minimizations converges to a unique minimum (a proof can be found in [2]). The advantage of using model (3) instead of (2) is that in each step the objective function is convex, so optimality conditions for reaching the minimum can be derived. More precisely, the two-step minimization is as follows:
• Fixing u, the value that minimizes E(u, b) with respect to b is given by the analytic expression

b_s = φ′(s) / (2s).   (5)

• Fixing b = b_s, the Euler–Lagrange condition corresponding to minimizing E(u, b_s) with respect to u is

λ div(b_s ∇u) = (Σ_l ω_l)⁻¹ Σ_l ω_l (u − f_l),   (6)

where ∇ = (∂/∂x, ∂/∂y, ∂/∂z) denotes the vector of partial derivatives along the three Cartesian axes of the volume.
2) On-line updating rules: The goal here is to derive the up-
dating rules that approximate the cumulative signed distance fields
while integrating new surface measurements into the current TSDF
volume. These rules will be implemented using GPU memory,
hence they should efficiently use memory during the fusion without
adding an extra cost to the original fusion implementation in [16].
The discretization of the div term in (6), which is responsible for the anisotropic diffusion, was formulated early on in the Perona–Malik model [17] for 2D images and later extended to 3D in [9]. A more sophisticated discretization was proposed with a different scheme in [20], but with the drawback that it is not obvious how to derive efficient on-line updating rules from it. The discrete 3D extension of the divergence operator is straightforward and leads to
to
[div(bs∇)
]i,j,k
= β∑ξ
bξs∇ξ, (7)
where 0 < β ≤ 16 is a normalizing coefficient and ξ ∈
{x+, x−, y+, y−, z+, z−} defines the discrete directional deriva-
tives along all axes. Hence, the operator ∇ξ applied to u(i, j, k)is defined by,
∇ξu(i, j, k) = uξ(i, j, k)− u(i, j, k). (8)
where uξ(i, j, k) is the nearest-neighbor of u(i, j, k) along the
direction ξ. Using the above definition, for example the positive
directional derivative along the x axis is defined by,
∇x+u(i, j, k) = ux+(i, j, k)− u(i, j, k), (9)
with ux+(i, j, k) = u(i+ 1, j, k) and bx+
s can be computed using
the formula in (5),
bx+
s =φ′(|∇x+
u|)
2|∇x+u|(10)
By plugging equation (7) into (6), we get

u(i, j, k) = (α b(i, j, k) + 1)⁻¹ ( α ū(i, j, k) + Σ_l ω_l f_l / Σ_l ω_l ),   (11)

where α = λβ, b = Σ_ξ b_s^ξ, and ū = Σ_ξ b_s^ξ u^ξ. Let us denote the cumulative TSDFs and weights by the capital symbols U_n and W_n respectively; B_n and Ū_n are computed by the same formulas as b and ū, using the field U_n. To lighten notation, we drop the subscripts indexing voxels. Equation (11) can then be rewritten as

W_n U_n = (α B_n + 1)⁻¹ ( α Ū_n + W_{n−1} U_{n−1} + ω_n f_n ).   (12)
Using an iterative Gauss–Seidel scheme, we can derive the updating rules as follows:

W_n U_n^{(t)} = (α B_n^{(t−1)} + 1)⁻¹ ( α Ū_n^{(t−1)} + W_{n−1} U_{n−1} + ω_n f_n ),
W_n = W_{n−1} + ω_n.   (13)
By setting α = 0 in the updating rules (13), we recover the averaging TSDF updates of [16]. Notice also that the new online TSDF scheme does not require any additional GPU memory, as the auxiliary field b in equation (10) can be computed analytically from directional variations of the field u. The cumulative field U_{n−1} and the field f_n have already been registered by the ICP algorithm, hence only a few iterations are needed; in all our experiments we run 3 or 4 iterations of these updating rules.
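The updating rules above can be sketched on a numpy volume. This is illustrative only: the paper runs the update per voxel on GPU, `kappa` parameterizes an assumed Perona-Malik diffusivity for b (the paper leaves φ generic), and np.roll gives periodic borders, a simplification of the real boundary handling.

```python
import numpy as np

def _neighbors(U):
    """Yield the six nearest-neighbor volumes U^xi,
    xi in {x+, x-, y+, y-, z+, z-} (periodic via np.roll)."""
    for axis in range(3):
        for shift in (+1, -1):
            yield np.roll(U, shift, axis=axis)

def fuse_frame(U_prev, W_prev, f, w, alpha=0.1, kappa=0.5, iters=3):
    """Integrate one truncated field f (binary weights w) into the
    cumulative TSDF following the updating rules (13).
    alpha = 0 recovers plain TSDF averaging as in [16]."""
    W = W_prev + w                      # W_n = W_{n-1} + omega_n
    data = W_prev * U_prev + w * f      # W_{n-1} U_{n-1} + omega_n f_n
    U = U_prev.copy()
    for _ in range(iters):              # a few sweeps suffice after ICP
        B = np.zeros_like(U)            # B = sum_xi b^xi
        Ubar = np.zeros_like(U)         # Ubar = sum_xi b^xi u^xi
        for Uxi in _neighbors(U):
            b = 1.0 / (1.0 + ((Uxi - U) / kappa) ** 2)  # Eq. (10), assumed phi
            B += b
            Ubar += b * Uxi
        U = (alpha * Ubar + data) / ((alpha * B + 1.0) * W)  # Eq. (13)
    return U, W
```

With alpha = 0 the loop body reduces to the familiar running weighted average (W_{n−1}U_{n−1} + ω_n f_n)/W_n.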
For our SLS tracking implementation, we used the KinectFusion implementation [16] with several adaptations to cope with SLS high resolution scans: changing the camera model in both raycasting and surface integration to fit the SLS projection model, changing the TSDF data structure to incorporate anisotropic diffusion, and adopting a rotational TSDF voxel buffer, followed by volume resampling, to cope with the arbitrarily moving volume. We also adopted the RGB-D variant of the ICP implementation, as in [25].
IV. ONLINE GLOBAL CONSISTENCY
A. Background
Existing dense SLAM systems that close occasional loops along corridor-like trajectories perform poorly when many small localized loops have to be adjusted online, as with hand-held systems. Applying loop closures early and often, as in [12][23][26][18], has shown higher performance than existing dense SLAM systems that rely on offline pose graph optimization [24][25]. All these methods use point-based fusion instead of a volumetric representation and are more suitable for representing objects at large scale with low resolution. On the other hand, as mentioned earlier, at finer scales a volumetric representation handles well the topological changes often encountered in prefabricated objects bearing very sharp details. To the best of our knowledge, no online global consistency method using the latter representation has been explored yet. The goal of this section is to present an online scheme with both capabilities: adjusting small localized drifts by continually deforming a graph model, while constraining the model to occasionally close large loops when detected or needed. In the experimental section, we will compare our full pipeline with and without online deformation. Our pipeline without online deformation emulates the LSD-SLAM presented in [25], which deforms the camera trajectory and maps only when a loop closure is detected.
B. Online deformation graph
The Sumner model [22] represents the tracked camera poses by one affine transformation per pose, embeds the poses in a graph, and subsequently deforms a shape according to the graph deformation. The graph node positions are denoted by {g_j}_{j=1}^m, and each node j is assigned an affine transformation {R_j, t_j}. A rigidity term expresses that each matrix R_j must preserve lengths, i.e., each R_j should be a rotation matrix satisfying an orthonormality constraint:

E_rig({R_j}_{j=1}^m) = Σ_j ||R_j R_j^T − I||_F².   (14)
A second term regularizes the graph positions, ensuring a smooth graph, by minimizing the energy

E_reg({R_j, t_j}_{j=1}^m) = Σ_{j=1}^m Σ_{k∈N_j} ||R_j(g_k − g_j) + g_j + t_j − (g_k + t_k)||².   (15)
We define a cost that deforms the graph incrementally, whenever a new pose estimate is added to the graph. This online cost jointly minimizes the two terms E_rig and E_reg:

E_onl = E_rig + γ E_reg.   (16)
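The three energies can be evaluated directly from a set of node transformations. A small numpy sketch of (14)-(16) follows; it only evaluates the cost (the sparse linear solve that minimizes it is not shown), and the list-of-neighbors encoding of N_j is our own convention.

```python
import numpy as np

def e_rig(Rs):
    """Rigidity term (14): sum_j ||R_j R_j^T - I||_F^2."""
    return sum(float(np.sum((R @ R.T - np.eye(3)) ** 2)) for R in Rs)

def e_reg(Rs, ts, g, neighbors):
    """Regularization term (15); g is the (m, 3) array of node
    positions and neighbors[j] lists the indices in N_j."""
    e = 0.0
    for j, Nj in enumerate(neighbors):
        for k in Nj:
            r = Rs[j] @ (g[k] - g[j]) + g[j] + ts[j] - (g[k] + ts[k])
            e += float(r @ r)
    return e

def e_onl(Rs, ts, g, neighbors, gamma=1.0):
    """Combined online cost (16): E_rig + gamma * E_reg."""
    return e_rig(Rs) + gamma * e_reg(Rs, ts, g, neighbors)
```

Identity rotations and zero translations give a zero cost, as expected: the graph is undeformed and perfectly rigid.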
Figure 4: Fusion of the knurled knob scene without running our online deformation model. Mapping of the point cloud is shown in (a),
and with texture mapping in (b). The red rectangle shows the part where loop closure occurs and a close up view of it is shown in (c).
The loop closure part is compared to the original scan and errors are displayed in (d) as a color mapping with an average RMSE of
0.085 mm.
where γ is a constant that balances graph smoothness against rigidity. Contrary to the method presented in [25], which adds a non-linear third term that takes into account user-defined vertex positions, minimizing the cost E_onl is very cheap, as the minimum is obtained by solving a sparse linear system. At this stage, avoiding an iterative optimization significantly alleviates the computational burden without delaying the front-end loop. The graph connectivity, determined by the sets N_j, can be defined by the k nearest nodes to node j as in [22], or by following the temporal sampling of the graph as in [25][26]. Neither of these connectivities is optimal: the former connects only nodes that are spatially close to each other, while the latter connects nodes in temporal order without considering the spatial distribution of the neighborhood. We use a combined algorithm that connects the graph by following the temporal sampling while remaining spatially aware of each node's neighborhood. A one-step refinement of the graph poses is run if a loop closure is detected.
C. Loop closure
Loop closure can be triggered manually or automatically by a place recognition module. Conventional LSD-SLAM uses a heavy place recognition module to detect when a loop is closed. Here we choose to trigger loop closure manually after completing the full scan. Place recognition modules with features such as SURF, Ferns or other 3D descriptors can work well for sparse feature-based SLAMs, or even for dense SLAMs with low resolution depth maps. In our case, we stream depth maps at nearly 4K resolution, and applying any place recognition module would drastically slow down the front-end process. Once the scan is completed, the graph deformation can be finely tuned by adding a third term to the energy optimization (16):

E_con = Σ_l ||v_l − q_l||,   (17)

where the q_l are features extracted in the first frame and the v_l are the deformed positions of the corresponding features in the triggered frame.
D. Map deformation
Graph deformation is used to deform each shape vertex v_i by the following linear blending [22]:

v̂_i = Σ_{j=1}^m ω_j(v_i) [ R_j(v_i − g_j) + g_j + t_j ],   (18)

where the weights ω_j(v_i) are computed by

ω_j(v_i) = (1 − ||v_i − g_j|| / d_max)².   (19)
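A sketch of the blending (18)-(19) for a single vertex follows. Two conventions are assumed on our part (the excerpt leaves them implicit, but they match Sumner et al. [22]): the inner term of (19) is clamped at zero so nodes farther than d_max get zero weight, and the weights are normalized to sum to one.

```python
import numpy as np

def deform_vertex(v, g, Rs, ts, d_max):
    """Deform vertex v by linear blending (18) with weights (19);
    g is the (m, 3) array of node positions, Rs/ts the per-node
    affine transformations. Weights are clamped and normalized
    (assumed conventions, following Sumner et al.)."""
    w = np.maximum(0.0, 1.0 - np.linalg.norm(v - g, axis=1) / d_max) ** 2
    w = w / w.sum()
    out = np.zeros(3)
    for wj, gj, Rj, tj in zip(w, g, Rs, ts):
        out += wj * (Rj @ (v - gj) + gj + tj)  # Eq. (18)
    return out
```

With identity rotations and zero translations every vertex maps to itself, and a common translation of all nodes translates the vertex rigidly.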
Figure 5: The volume shown after fusing 36 frames. The partial camera trajectory is shown in blue, and the camera axes are displayed in three different colors (RGB). The volume size is 10 × 8 × 6 mm³ with a resolution of 960³ voxels.
V. EXPERIMENTS
We have conducted several experiments to evaluate the per-
formance of our system. RGB-D frames are acquired using the structured light system (SLS) described in [7], with a frame rate on the order of 3 Hz and a frame resolution of 4000 × 3000 pixels. A point cloud is generated by applying an undistortion model to the corresponding depth map, followed by reprojection to the 3D scene using standard triangulation. To evaluate different aspects of the system performance, we use two machined steel objects that were painted to reduce inter-reflections. The first object is a cylindrical
Figure 6: Fusion of the knurled knob scene with our online deformation model. Mapping of the point cloud is shown in (a), and with
texture mapping in (b). The red rectangle shows the part where loop closure occurs and a close up view of it is shown in (c). The loop
closure part is compared to the original scan and errors are displayed in (d) as a color mapping with an average RMSE of 0.012 mm.
knurled head screw with a straight pattern, of height H = 14.8 mm and diameter D = 32.2 mm, with a distance between knurled lines of d = 1 mm. The second object is a cylindrical knurled knob with a wave pattern, of height H = 7.25 mm and exterior diameter De = 34 mm. Instead of moving the system around the objects, which would require a precise positioning system because of the narrow depth of field (∼ 10 mm), we fix the system and put the objects on a turntable. The RGB-D frames are fed to our fusion system as they are captured while the objects spin through a full revolution in front of the SLS. At a frame rate of 3 Hz, the SLS acquires 440 frames to cover the full revolution. The volume was initialized to 10 × 8 × 6 mm³ with a resolution of 960³ voxels (Figure 5).
To show the effect of our generalized model (GTSDF), depth maps are fused with and without anisotropic diffusion (Figure 1). Contrary to the smoothing effect of the TSDF, the generalized TSDF preserves almost all details of the original scans. In both figures, some of the details are magnified in Figures 1-b and 1-d respectively. As ground truth is not available for the full scan, the error in each magnified area can be assessed by computing cloud distances to the original scan, taken as the reference cloud. The cloud-to-cloud (C2C) average distances in each area are given in Table I. We emphasize that the errors in Table I do not take the online deformation process into account. Next, global error is assessed for the tracker and the full pipeline.
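The C2C measure used throughout this section can be sketched as a nearest-neighbor distance averaged over the evaluated cloud. A brute-force numpy version follows (clouds of realistic size would need a k-d tree; this is an illustrative sketch, not the evaluation code used in the paper):

```python
import numpy as np

def c2c_average(cloud, reference):
    """Average cloud-to-cloud (C2C) distance: for each point of
    `cloud`, take the distance to its nearest neighbor in
    `reference`, then average. Both inputs are (n, 3) arrays."""
    d2 = np.sum((cloud[:, None, :] - reference[None, :, :]) ** 2, axis=-1)
    return float(np.mean(np.sqrt(d2.min(axis=1))))
```

Identical clouds give a zero distance; a cloud rigidly offset from its reference reports the offset magnitude.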
Table I: Comparison of the average C2C errors on the zoomed parts of the knurled head screw scene in Figures 1-b and 1-d, using both TSDF and GTSDF.

Zoomed area     Error TSDF (μm)   Error GTSDF (μm)
Upper-Left          178.88             20.15
Upper-Right         168.01             21.85
Lower-Left          188.50             23.15
Lower-Right         190.44             25.33
We compare loop closure errors and map deformation errors using two methods: i) the pipeline with only offline deformation, which emulates the output of the LSD-SLAM method in [25] if used with our high resolution input; and ii) our full pipeline with the online deformation process, as shown in Figure 2. With an accurate turntable, we can easily evaluate the camera trajectory by computing the root-mean-square error (RMSE) between the initial and final camera poses. Depth map deformation, on the other hand, can be evaluated in the same way we evaluate the TSDFs, using C2C distances with the original scans as references. For visual inspection, we also display the C2C distance errors on a color scale. Figures 4 and 6 show the fusion results of the knurled knob scene using the two methods mentioned above. The same fusion results are shown in Figure 7 for the knurled head screw scene. Figure 8 shows in blue the camera trajectory for the knurled knob, including all camera poses embedded in the graph; the last camera pose is drawn in red (Figure 8 (c) and (f)). Errors on camera trajectories and map deformations are summarized in Table II for both examples. The processing time to fuse all frames is 145 seconds with online map deformation and 147 seconds without, which matches the acquisition time. The processing times with TSDF and GTSDF are also nearly the same. These results show that the computational costs of the GTSDF and of the drift correction are both too small to affect real-time processing. The computational performance of the system was evaluated on a Windows desktop PC with an Intel Xeon E5-1650 v3 CPU at 3.50 GHz, 24 GB of RAM, and a GeForce GTX 980 Ti graphics card with 6 GB of GPU memory.
Table II: Comparing errors on loop closure trajectory and map
deformation using two methods.
Figure 7: Fusions of the knurled head screw scene without and with the online deformation model are shown in (a) and (d) respectively. The red rectangle shows the part where loop closure occurs; a close-up view of it is shown in (b) ((e) resp.). The loop closure part is compared to the original scan, and errors are displayed in (c) ((f) resp.) as a color mapping with an average RMSE of 0.105 mm (0.015 mm resp.).
                             knurled head screw         knurled knob
Errors                       Method [25]  New method    Method [25]  New method
Camera trajectory (μm)          95.80        5.70         165.23        6.20
C2C average distance (μm)      105.10       15.22          85.56       12.25
VI. CONCLUSION
We presented in this paper an approach to perform large-scale
data fusion on 3D measurements acquired from high-resolution
structured-light sensors that are becoming more and more common
for industrial inspection applications. A generalized truncated
signed distance function (TSDF) integration model that preserves
more details because of its anisotropy was presented as well as
an efficient numerical scheme that is compatible with real-time performance. An online deformation component that allows free
movement of the scanner while correcting camera drifts at microns
scale was also introduced. In order to illustrate practical gains, we
chose two different objects and performed data fusion on the results
of 3D scans performed at a lateral resolution of approximately 10
μm. Our approach was able to successfully integrate the 3D data
stream while preserving more details than a conventional approach.
This was illustrated by point cloud comparison between the fused
data and registered raw sensor data. Cloud-to-cloud distance was
reduced by a factor of 5 or greater.
Future work includes performance optimization, detailed validation based on traceable artifacts, and the integration of a noise
model for the 3D sensor to discriminate between high-quality and
low-quality 3D points (based on factors such as distance to the
focal plane, angles between surface normal and optical axes, etc.).
REFERENCES
[1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust pairwise surface registration. ACM Trans. Graph., 27(3):85:1–85:10, 2008.
[2] G. Aubert, R. Deriche, and P. Kornprobst. Computing optical flow via variational techniques. SIAM Journal of Applied Mathematics, 60(1):156–182, 1999.
[3] M. Brophy, A. Chaudhury, S. S. Beauchemin, and J. L. Barron. A method for global non-rigid registration of multiple thin structures. In CRV, pages 214–221. IEEE Computer Society, 2015.
[4] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In CVPR, pages 5556–5565. IEEE Computer Society, 2015.
[5] A. Dai, M. Nießner, M. Zollhofer, S. Izadi, and C. Theobalt. BundleFusion: real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. ACM Trans. Graph., 36(4), 2017.
[6] J. Digne, J.-M. Morel, N. Audfray, and C. Lartigue. High fidelity scan merging. Comput. Graph. Forum, 29(5):1643–1651, 2010.
[7] M.-A. Drouin, F. Blais, and G. Godin. High resolution projector for 3D imaging. In 3D Vision (3DV), 2014 2nd International Conference on, volume 1, pages 337–344. IEEE, 2014.
[8] N. Fioraio, J. Taylor, A. W. Fitzgibbon, L. di Stefano, and S. Izadi. Large-scale and drift-free surface reconstruction using online subvolume registration. In CVPR, pages 4475–4483. IEEE Computer Society, 2015.
[9] G. Gerig, O. Kubler, R. Kikinis, and F. A. Jolesz. Nonlinear anisotropic filtering of MRI data. IEEE Trans. Med. Imaging, 11(2):221–232, 1992.
[10] D. Holz and S. Behnke. Approximate surface reconstruction and registration for RGB-D SLAM. In ECMR, pages 1–8. IEEE, 2015.
[11] W. Kehl, T. Holl, F. Tombari, S. Ilic, and N. Navab. An octree-based approach towards efficient variational range data fusion. CoRR, abs/1608.07411, 2016.
[12] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In 3DV, pages 1–8. IEEE Computer Society, 2013.
Figure 8: Camera trajectory of the knurled knob scene without the online deformation model, shown with blue poses in (a). The red rectangle shows the part where loop closure occurs; a close-up view of it is shown in (b). The white circled area in (b), zoomed in (c), shows the last camera pose in red with the camera axes in RGB. The loop closure error (0.165 mm) is computed as the distance between the first and last camera poses. This is compared to the scenario using the online deformation model in the second column (Figures (d), (e), and (f)), with a loop closure error of 0.006 mm.
[13] D. Lefloch, T. Weyrich, and A. Kolb. Anisotropic point-based fusion. In FUSION, pages 2121–2128. IEEE, 2015.
[14] L. Maier-Hein, A. M. Franz, T. R. dos Santos, M. Schmidt, M. Fangerau, H.-P. Meinzer, and J. M. Fitzpatrick. Convergent iterative closest-point algorithm to accomodate anisotropic and inhomogenous localization error. IEEE Trans. Pattern Anal. Mach. Intell., 34(8):1520–1532, 2012.
[15] N. Mellado, D. Aiger, and N. J. Mitra. Super 4PCS fast global pointcloud registration via smart indexing. Comput. Graph. Forum, 33(5):205–215, 2014.
[16] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136. IEEE Computer Society, 2011.
[17] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Machine Intell., 12(7):629–639, July 1990.
[18] P. Puri, D. Jia, and M. Kaess. GravityFusion: Real-time dense mapping without pose graph using deformation and orientation. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, IROS, September 2017.
[19] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3D model acquisition. In J. Hughes, editor, SIGGRAPH 2002 Conference Proceedings, Annual Conference Series, pages 438–446. ACM Press/ACM SIGGRAPH, 2002.
[20] C. Schroers, H. Zimmer, L. Valgaerts, A. Bruhn, O. Demetz, and J. Weickert. Anisotropic range image integration. In A. Pinz, T. Pock, H. Bischof, and F. Leberl, editors, Pattern Recognition - Joint 34th DAGM and 36th OAGM Symposium, Graz, Austria, August 28-31, 2012. Proceedings, volume 7476 of Lecture Notes in Computer Science, pages 73–82. Springer, 2012.
[21] H. Sekkati and A. Mitiche. Dense 3D interpretation of image sequences: A variational approach using anisotropic diffusion. In CIAP, pages 424–429, 2003.
[22] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. ACM Trans. Graph., 26(3):80, 2007.
[23] B. Ummenhofer and T. Brox. Point-based 3D reconstruction of thin objects. In ICCV, pages 969–976. IEEE Computer Society, 2013.
[24] T. Weise, T. Wismer, B. Leibe, and L. V. Gool. In-hand scanning with online loop closure. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 1630–1637, Sept 2009.
[25] T. Whelan, M. Kaess, H. Johannsson, M. F. Fallon, J. J. Leonard, and J. McDonald. Real-time large-scale dense RGB-D SLAM with volumetric fusion. I. J. Robotics Res., 34(4-5):598–626, 2015.
[26] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. I. J. Robotics Res., 35(14):1697–1716, 2016.
[27] S. Yin, Y. Ren, Y. Guo, J. Zhu, S. Yang, and S. Ye. Development and calibration of an integrated 3D scanning system for high-accuracy large-scale metrology. Measurement, 54:65–76, 2014.
[28] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In ICCV, pages 1–8. IEEE Computer Society, 2007.