
StickyPillars: Robust feature matching on point clouds using Graph Neural Networks

Martin Simon * 1 Kai Fischer * 1 Stefan Milz * 2 Christian Tobias Witt 1 Horst-Michael Gross 2

Abstract

StickyPillars introduces a sparse feature matching method on point clouds. It is the first approach applying Graph Neural Networks on point clouds to stick points of interest. The feature estimation and assignment relies on the optimal transport problem, where the cost is based on the neural network itself. We utilize a Graph Neural Network for context aggregation with the aid of multi-head self- and cross-attention. In contrast to image-based feature matching methods, the architecture learns feature extraction in an end-to-end manner. Hence, the approach does not rely on handcrafted features. Our method outperforms state-of-the-art matching algorithms, while providing real-time capability.

1. Introduction

Point cloud registration is an essential computer vision problem and necessary for a wide range of tasks in the domain of real-time scene understanding or applied robotics. Furthermore, new generations of 3D sensors, like depth cameras or LiDARs (light detection and ranging), enable dense perception including distance information recorded within a large field of view. This also increases the requirements for point cloud based feature matching applicable to various tasks, namely perception, mapping, re-localization or SLAM (simultaneous localization and mapping). Even more fundamental operations like multi-sensor calibration rely on exact matching of feature correspondences.

*Equal contribution. 1 Valeo Schalter und Sensoren GmbH, Germany. 2 Ilmenau University of Technology, Germany. Correspondence to: Martin Simon <[email protected]>, Kai Fischer <[email protected]>, Stefan Milz <[email protected]>.

Figure 1. Point cloud registration using StickyPillars. Assuming two disjoint sets of point clouds (e.g. LiDAR scan A & B), we propose a novel 3D feature matching approach. The algorithm relies on the dynamics of the representation itself, where each point cloud is represented by a sparse list of key-points (coordinates). Key-point features are learned in an end-to-end manner. With the aid of attention based graph neural networks, partial assignment is solved for correspondence. This enables powerful post-processing steps (e.g. odometry, SLAM).

Figure 2. Registration using StickyPillars. The example shows a challenging registration task (KITTI odometry) for two point clouds (green and blue) solved by StickyPillars (a - top) and ICP (Zhang, 1994) (b - bottom). StickyPillars qualitatively outperforms ICP in a very robust way while retaining real-time capability.

Point cloud registration using 3D sensors is commonly solved using local describable features combined with a global optimization step (Shan & Englot, 2018; Zhang & Singh, 2014; Lin & Zhang, 2019). These real-time approaches achieve state-of-the-art performance on odometry challenges like KITTI (Geiger et al., 2012), although they are free from modern machine learning algorithms, because unbiased depth values of the sensors enable safe distance estimations processed by classical algorithms. However, recent research towards neural network based point cloud processing, e.g. classification and segmentation (Qi et al., 2017a;b; Lang et al., 2019; Zhou & Tuzel, 2018), opened up new possibilities for perception, registration, mapping and odometry and has shown impressive results (Engel et al., 2019; Li et al., 2019). The downside of all those methods is that they tackle odometry estimation itself based on a global rigid body operation. Hence, the target is the calculation of a robust coordinate transformation assuming many static objects within the environment and proper viewpoints. This leads to instabilities in challenging situations like a vast number of dynamic objects, widely varying viewpoints or small overlapping areas. Furthermore, the mapping quality itself suffers and is often not evaluated qualitatively; examples are the blurring of dynamic objects in the map. To address these disadvantages satisfactorily, we introduce a novel registration strategy for point clouds utilizing graph neural networks. Inspired by (DeTone et al., 2018; Vaswani et al., 2017), we solve the feature correspondence instead of a direct odometry estimation. StickyPillars is a robust real-time registration method for point clouds (see Fig. 1) and confident under challenging conditions, like dynamic environments, challenging viewpoints or small overlapping areas. This enables more powerful odometry estimation, mapping or SLAM (example in Fig. 2). We verify our technique on the KITTI odometry benchmark (Geiger et al., 2012) using recalculation, significantly outperforming state-of-the-art approaches.

2. Related Work

Point cloud registration was fundamentally investigated and influenced by (Besl & McKay, 1992; Zhang, 1994; Rusinkiewicz & Levoy, 2001). The iterative closest point (ICP) algorithm is a powerful method for calculating the displacement between two point sets. Its convergence and speed depend on the matching accuracy itself. Given correct data associations (e.g. similar viewpoints, large overlap, other constraints), the transformation can be computed efficiently. This method is still a common technique and used in a wide range of odometry, mapping and SLAM algorithms (Shan & Englot, 2018; Zhang & Singh, 2014; Lin & Zhang, 2019). However, its convergence and accuracy depend largely on the similarity of the given point sets. (Rusinkiewicz & Levoy, 2001) reports its error susceptibility for challenging tasks like small overlap between point sets or differing viewpoints.

Local feature correspondence has been more widely used in the domain of image processing. Similar to ICP, the classical idea is composed of several steps, i.e. point detection, feature calculation and matching; on top, the geometric transformation calculation is performed. Standard pipelines were proposed by FLANN (Muja & Lowe, 2009) or SIFT (Lowe, 2004). Models based on neighborhood consensus were evaluated by (Bian et al., 2017; Sattler et al., 2009; Tuytelaars & Van Gool, 2000; Cech et al., 2010), or in a more robust way combined with a solver called RANSAC (Fischler & Bolles, 1981; Raguram et al., 2008).

Recently, Deep Learning based approaches, i.e. Convolutional Neural Networks (CNNs), were utilized to learn local descriptors and sparse correspondences (Dusmanu et al., 2019; Ono et al., 2018; Revaud et al., 2019; Yi et al., 2016). Most of those approaches operate on sets of matches and ignore the assignment structure. In contrast, (Sarlin et al., 2019) focuses on bundling aggregation, matching and filtering based on Graph Neural Networks.

Deep Learning on point clouds is a novel field. Research has been proposed by (Chen et al., 2017; Simon et al., 2018; 2019) utilizing CNNs. Points are usually not ordered, influenced by the interaction amongst each other, and viewpoint invariant. Hence, a more specific architecture was needed and first investigated for segmentation and classification by (Qi et al., 2017a), capable of handling large point sets (Qi et al., 2017b). For registration, some work has been proposed utilizing Deep Learning on 3D point clouds to approximate ICP (Lu et al., 2019; Li & Lee, 2019) or for image generation (Milz et al., 2019), where the former focuses on rigid transformations and the latter on the key-point descriptor itself. However, these approaches lack accuracy and robustness when real-time capability is demanded.

Optimal transport is related to the graph matching problem and therefore utilized in this work. In general, it describes a transportation plan between two probability distributions. Numerically, this can be solved with the Sinkhorn algorithm (Sinkhorn & Knopp, 1967; Cuturi, 2013; Peyre et al., 2019) and its derivatives. We approximate graph matching using optimal transport based on multi-head (Vaswani et al., 2017) attention (self- and cross-wise) to learn a robust registration, not relying on handcrafted features or specific costs, but approximated by the network itself.



Figure 3. The StickyPillars architecture is composed of three layers: 1. Pillar layer, 2. Graph Neural Network layer and 3. Optimal Transport layer. With the aid of 1, we learn 3D features (pillar encoder) and spatial clues (positional encoder) directly. Self- and cross multi-head attention is performed in a graph architecture to perform contextual aggregation for both domains. The resulting matching scores are used to generate an assignment matrix for key-point correspondences via numerical optimal transport.

3. The StickyPillars Architecture

The idea behind StickyPillars is the development of a robust point-cloud registration algorithm to replace ICP as the most common matcher for environmental perception on point clouds. ICP has fundamental drawbacks in terms of stability and predictable runtime, and depends on a good initialization. We identify a need for an approach that works in real-time under challenging conditions: small overlapping area and diverging viewpoints. Hence, a good registration algorithm should match corresponding key-points (even dynamic, without a rigid global pose) or discard occlusions respectively non-points. We propose such a model that does not rely on an encoder-decoder system, but instead uses a graph based on multi-head (Vaswani et al., 2017) self- and cross attention to learn the context aggregation in an end-to-end manner (see Fig. 3). Our three-dimensional point cloud features (pillars) are flexible and fully composed of learnable parameters.

Problem description Let $\mathcal{P}^K$ and $\mathcal{P}^L$ be two point clouds to be registered. The key-points of those point clouds will be denoted as $\pi^K_i$ and $\pi^L_j$ with $\pi^K_0, \dots, \pi^K_n \subset \mathcal{P}^K$ and $\pi^L_0, \dots, \pi^L_m \subset \mathcal{P}^L$, while other points will be denoted as $x^K_k \in \mathcal{P}^K$ and $x^L_l \in \mathcal{P}^L$. Each key-point with index $i$ is associated to a point pillar, which can be pictured as a sphere having a centroid position $\pi^K_i$ and a center of gravity $\bar{\pi}^K_i$ (the mean of the pillar's points). All points ($\mathcal{P}^K_i$) within a pillar $i$ are associated with a pillar feature stack $f^K_i \in \mathbb{R}^D$, with $D$ as pillar encoder input depth. The same applies for $\pi^L_j$. $c_{i,j}$ and $f_{i,j}$ compose the input for the graph. The overall goal is to find partial assignments $\langle \pi^K_i, \pi^L_j \rangle$ for the optimal re-projection $P$ with $P := f_{\pi^L_j \to \pi^K_i}(\mathcal{P}^L) \approx \mathcal{P}^K$.

3.1. Pillar Layer

Key-Point Selection is the initial basis of the pillar layer. Most common 3D sensors deliver dense point clouds having more than 120k points that need to be sparsified. Inspired by (Zhang & Singh, 2014), we place the centroid pillar coordinates on sharp edges and planar surface patches. A smoothness term $c$ identifies smooth or sharp areas. For a point cloud $\mathcal{P}^K$ the corresponding smoothness term $c^K$ is defined by:

$$c^K_k = \frac{1}{|S| \cdot \left\| x^K_k \right\|} \cdot \left\| \sum_{k' \in S, \, k' \neq k} \left( x^K_k - x^K_{k'} \right) \right\| \qquad (1)$$


$k$ and $k'$ denote point indices within the point cloud $\mathcal{P}^K$ having coordinates $x^K_k, x^K_{k'} \in \mathbb{R}^3$. $S$ is a set of neighboring points of $k$ and $|S|$ is the cardinality of $S$. With the aid of the sorted smoothness factors in $\mathcal{P}^K$, we define two thresholds $c^K_{min}$ and $c^K_{max}$ to pick a fixed number $n$ of key-points $\pi^K_i$ in sharp ($c^K_k > c^K_{max}$) and planar ($c^K_k < c^K_{min}$) regions. The same applies for $c^L$ on $\mathcal{P}^L$, selecting $m$ points with index $j$.
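A minimal sketch of this key-point selection is shown below, assuming a NumPy point cloud and a SciPy k-d tree for the neighborhood set $S$. Instead of explicit thresholds $c_{min}$ and $c_{max}$ it simply keeps the most planar and the sharpest points; all names (select_keypoints, k_neighbors, num_keypoints) are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def select_keypoints(points, k_neighbors=10, num_keypoints=100):
    """Smoothness term of Eq. (1) per point, then pick the most planar
    (smallest c) and sharpest (largest c) points as key-points."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k_neighbors + 1)   # first neighbor is the point itself
    neighbors = points[idx[:, 1:]]                   # (N, k, 3)
    diff_sum = (points[:, None, :] - neighbors).sum(axis=1)
    c = np.linalg.norm(diff_sum, axis=1) / (
        k_neighbors * np.linalg.norm(points, axis=1) + 1e-9)
    order = np.argsort(c)
    planar = order[:num_keypoints // 2]                    # c below an implicit c_min
    sharp = order[-(num_keypoints - num_keypoints // 2):]  # c above an implicit c_max
    return np.concatenate([planar, sharp])                 # indices of selected key-points

# Usage: keypoint_idx = select_keypoints(scan_xyz)   # scan_xyz: (N, 3) array
```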

Pillar Encoder is inspired by (Qi et al., 2017a; Lang et al., 2019). Any selected key-point, $\pi^K_i$ and $\pi^L_j$, is associated with a point pillar $i$ and $j$ describing a set of specific points $\mathcal{P}^K_i$ and $\mathcal{P}^L_j$. We sample points into a pillar using a Euclidean distance function:

$$\mathcal{P}^K_i := \left\{ x^K_0, x^K_\Omega, \dots, x^K_z \;\middle|\; \left\| \pi^K_i - x^K_\Omega \right\| < d \right\} \qquad (2)$$

Similar equations apply for $\mathcal{P}^L_j$. Due to a fixed input size of the pillar encoder, we draw a maximum of $z$ points per pillar, where $z = 100$ is used in our experiments. $d$ is the distance threshold defining the pillar size (e.g. 50 cm). For computational reasons, the point clouds are organized within a k-d tree (Bentley, 1975). Based on $\pi^K_i$ the $z$ closest samples $x^K_\Omega$ are drawn into the pillar $\mathcal{P}^K_i$, whereas points with a distance greater than $d$ are rejected and replaced by zero tuples instead.

To compose a sufficient feature input stack for the pillar encoder $f^K_i \in \mathbb{R}^D$, we stack for each sampled point $x^K_\Omega$ with $\Omega \in \{1, \dots, z\}$ in the style of (Lang et al., 2019):

$$f^K_i = \left[ x^K_\Omega, \; i^K_\Omega, \; (x^K_\Omega - \bar{\pi}^K_i), \; \|x^K_\Omega\|_2, \; (x^K_\Omega - \pi^K_i) \right], \dots \qquad (3)$$

$x^K_\Omega \in \mathbb{R}^3$ denotes the sampled point's coordinates $(x, y, z)^T$. $i^K_\Omega \in \mathbb{R}$ is a scalar and represents the intensity value, $(x^K_\Omega - \bar{\pi}^K_i) \in \mathbb{R}^3$ is the difference to the pillar's center of gravity and $(x^K_\Omega - \pi^K_i) \in \mathbb{R}^3$ is the difference to the pillar's key-point. $\|x^K_\Omega\|_2 \in \mathbb{R}$ is the L2 norm of the point itself. This leads to an overall input depth $D = z \times 11$. The pillar encoder is a single linear projection layer with shared weights across all pillars and frames, followed by a batchnorm and ReLU layer:

$$f'^K_i = W_f \cdot f^K_i \quad \forall i \in \{1, \dots, n\}, \qquad f'^K_i \in \mathbb{R}^{D'}$$
$$f'^L_j = W_f \cdot f^L_j \quad \forall j \in \{1, \dots, m\}, \qquad f'^L_j \in \mathbb{R}^{D'} \qquad (4)$$

The output has a depth of $D'$, e.g. 32 values in our experiments.
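The following sketch illustrates one plausible reading of Eqs. (2)-(4): building the 11-dimensional per-point feature stack and projecting the flattened stack ($D = z \times 11$) to $D'$ values with a shared linear layer, batchnorm and ReLU. The class and argument names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def build_feature_stack(pillar_points, intensities, keypoint):
    """Per-point 11-dim features of Eq. (3) for one pillar.
    pillar_points: (z, 3), intensities: (z, 1), keypoint: (3,)."""
    center_of_gravity = pillar_points.mean(dim=0, keepdim=True)
    return torch.cat([
        pillar_points,                                     # x
        intensities,                                       # intensity i
        pillar_points - center_of_gravity,                 # offset to center of gravity
        pillar_points.norm(dim=-1, keepdim=True),          # ||x||_2
        pillar_points - keypoint,                          # offset to the key-point
    ], dim=-1)                                             # (z, 11)

class PillarEncoder(nn.Module):
    """Shared linear projection + BatchNorm + ReLU mapping the flattened
    stack (D = z * 11) of every pillar to D' values, one reading of Eq. (4)."""
    def __init__(self, z=100, d_out=32):
        super().__init__()
        self.linear = nn.Linear(z * 11, d_out, bias=False)
        self.bn = nn.BatchNorm1d(d_out)

    def forward(self, stacks):                             # (n_pillars, z, 11)
        return torch.relu(self.bn(self.linear(stacks.flatten(1))))
```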

Positional Encoder is introduced to learn contextual aggregation without applying pooling operations. The positional encoder is inspired by (Qi et al., 2017a) and utilizes a single Multi-Layer-Perceptron (MLP) shared across $\mathcal{P}^L$ and $\mathcal{P}^K$ and all pillars, including batchnorm and ReLU. From the key-points' centroid coordinates $\pi^K_i, \pi^L_j \in \mathbb{R}^3$ we derive again an output with a depth of $D'$:

$$\pi'^K_i = \mathrm{MLP}_\pi(\pi^K_i) \quad \forall i \in \{1, \dots, n\}, \qquad \pi'^K_i \in \mathbb{R}^{D'}$$
$$\pi'^L_j = \mathrm{MLP}_\pi(\pi^L_j) \quad \forall j \in \{1, \dots, m\}, \qquad \pi'^L_j \in \mathbb{R}^{D'} \qquad (5)$$
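A minimal sketch of the shared positional encoder MLP of Eq. (5); the hidden layer sizes are assumptions, only the 3 to $D'$ mapping with batchnorm and ReLU is taken from the text. The trailing comment indicates the node initialization of Eq. (6).

```python
import torch.nn as nn

class PositionalEncoder(nn.Module):
    """Shared MLP with BatchNorm and ReLU mapping a centroid (x, y, z) to a
    D'-dimensional embedding as in Eq. (5); hidden sizes are assumptions."""
    def __init__(self, d_out=32, hidden=(32, 64)):
        super().__init__()
        layers, d_in = [], 3
        for h in hidden:
            layers += [nn.Linear(d_in, h), nn.BatchNorm1d(h), nn.ReLU()]
            d_in = h
        layers.append(nn.Linear(d_in, d_out))
        self.mlp = nn.Sequential(*layers)

    def forward(self, centroids):                  # (n_pillars, 3)
        return self.mlp(centroids)                 # (n_pillars, d_out)

# Node initialization of Eq. (6): add positional embedding and pillar feature,
# e.g. n0_K = positional_encoder(pi_K) + pillar_encoder(stack_K)
```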

3.2. Graph Neural Network Layer

Graph Architecture assumes two complete graphs $\mathcal{G}^L$ and $\mathcal{G}^K$, whose nodes correspond to the selected pillars and are equivalent in quantity. We derive the initial node conditions $^{(0)}n^K_i$ and $^{(0)}n^L_j$ in the following way:

$$^{(0)}n^K_i = \pi'^K_i + f'^K_i, \qquad {}^{(0)}n^K_i \in \mathbb{R}^{D'}$$
$$^{(0)}n^L_j = \pi'^L_j + f'^L_j, \qquad {}^{(0)}n^L_j \in \mathbb{R}^{D'} \qquad (6)$$

The overall composed graph $(\mathcal{G}^L, \mathcal{G}^K)$ is a multiplex graph inspired by (Mucha et al., 2010; Nicosia et al., 2013). It is composed of intra-frame edges, i.e. self edges connecting each key-point within $\mathcal{G}^L$ and each key-point within $\mathcal{G}^K$ respectively. Additionally, to perform global matching using context aggregation, inter-frame edges are introduced, i.e. cross edges that connect all nodes of $\mathcal{G}^K$ with $\mathcal{G}^L$ and vice versa.

Multi-Head Self- and Cross-Attention is all we need to integrate contextual cues intuitively and to increase distinctiveness by considering the spatial and 3D relationship with other co-visible pillars, such as those that are salient, self-similar or statistically co-occurring (Sarlin et al., 2019). An attention function $A$ (Vaswani et al., 2017) maps a query and a set of key-value pairs to an output, where the query $q$, the keys $k$, and the values $v$ are simply vectors. It is defined by:

$$A(q, k, v) = \mathrm{softmax}\left( \frac{q^T \cdot k}{\sqrt{D'}} \right) \cdot v \qquad (7)$$

$D'$ describes the feature depth, analogous to the depth of every node. We apply the Multi-Head Attention function to each node $^{(l)}n^K_i$, $^{(l)}n^L_j$ at state $l$ to calculate its next condition $l + 1$. The node conditions $l \in \{0, 1, \dots, l_{max}\}$ are represented as network layers to propagate information throughout the graph:

$$^{(l+1)}n^K_i = {}^{(l)}n^K_i + {}^{(l)}M^K(q^K_i, v^\Omega_\alpha, k^\Omega_\alpha)$$
$$^{(l+1)}n^L_j = {}^{(l)}n^L_j + {}^{(l)}M^L(q^L_j, v^\Omega_\beta, k^\Omega_\beta) \qquad (8)$$

We alternate the indices $\alpha$ and $\beta$ to perform self and cross attention alternately with increasing depth $l$ throughout the network, where $\Omega \in \{K, L\}$ applies:

$$\alpha, \beta := \begin{cases} i, j & \text{if } l \equiv \text{even} \\ j, i & \text{if } l \equiv \text{odd} \end{cases} \qquad (9)$$


The Multi-Head Attention function is defined by:

$$^{(l)}M^K(q^K_i, v^\Omega_\alpha, k^\Omega_\alpha) = {}^{(l)}W_0 \cdot {}^{(l)}\left( \mathrm{head}^K_1 \,\|\, \dots \,\|\, \mathrm{head}^K_h \right) \qquad (10)$$

with $\|$ being the concatenation operator. A single head is composed of the attention function as follows:

$$^{(l)}\mathrm{head}^K_h = {}^{(l)}A(q^K_i, v^\Omega_\alpha, k^\Omega_\alpha) = {}^{(l)}A\left( W_{1h} \cdot n^K_i, \; W_{2h} \cdot n^\Omega_\alpha, \; W_{3h} \cdot n^\Omega_\alpha \right) \qquad (11)$$

The same applies for $^{(l)}M^L$:

$$^{(l)}M^L(q^L_j, v^\Omega_\beta, k^\Omega_\beta) = {}^{(l)}W_0 \cdot {}^{(l)}\left( \mathrm{head}^L_1 \,\|\, \dots \,\|\, \mathrm{head}^L_h \right)$$
$$^{(l)}\mathrm{head}^L_h = {}^{(l)}A\left( W_{1h} \cdot n^L_j, \; W_{2h} \cdot n^\Omega_\beta, \; W_{3h} \cdot n^\Omega_\beta \right) \qquad (12)$$

All weights $^{(l)}W_0, {}^{(l)}W_{11}, \dots, {}^{(l)}W_{3h}$ are shared throughout all pillars and both graphs $(\mathcal{G}^L, \mathcal{G}^K)$ within a single layer $l$.

Final predictions are computed by the last layer within the Graph Neural Network, designed as a single linear projection with shared weights across both graphs $(\mathcal{G}^L, \mathcal{G}^K)$ and pillars:

$$m^K_i = W_m \cdot {}^{(l_{max})}n^K_i, \qquad m^K_i \in \mathbb{R}^{D'}$$
$$m^L_j = W_m \cdot {}^{(l_{max})}n^L_j, \qquad m^L_j \in \mathbb{R}^{D'} \qquad (13)$$

3.3. Optimal Transport Layer

Intensive research regarding the optimal transport problem has been done for decades (Sinkhorn & Knopp, 1967; Vallender, 1974; Cuturi, 2013). In general, there are some requirements to propagate data throughout the network with subsequent back-propagation of the error. To compute a soft-assignment matrix $\mathbf{P} \in \mathbb{R}^{(n+1) \times (m+1)}$, defining correspondences between pillars in $\mathcal{P}^K$ and $\mathcal{P}^L$, the optimal transport layer has to be fully differentiable, parallelizable (run-time), responsible for 2D normalization (row- and column-wise) and has to handle invisible key-points (occlusion or non-overlap) sufficiently.

To design an optimal transport plan, the achieved matching matrices, which include all learned contextual information ($m^K \in \mathbb{R}^{n \times D'}$ and $m^L \in \mathbb{R}^{m \times D'}$), are combined in the following way:

$$M = m^K \cdot (m^L)^T, \qquad M \in \mathbb{R}^{n \times m} \qquad (14)$$

Figure 4. The StickyPillars tensor graph identifies the data flow throughout the network architecture, especially during self- and cross attention, where $b$ describes the batch size, $n$ and $m$ the number of pillars, $h$ is the number of heads and $l_{max}$ the maximum layer depth. $D'$ is the feature depth per node. The result is an assignment matrix $\mathbf{P}$ with an extra column and row for invisible pillars.

This design enables a cross-wise scalar multiplication of all features in $\mathcal{P}^K$ with all in $\mathcal{P}^L$, whereby similar features reveal higher score entries than unequal ones. In order to reconcile non-visible pillars in both frames with an adequate loss function, we concatenate a single learnable weight parameter $W_v$ at the end of the matching matrix, shared across all columns and rows ($\bar{n} = n + 1$, $\bar{m} = m + 1$):

$$\bar{M} = \begin{bmatrix} M_{11} & \cdots & W_v \\ \vdots & M_{nm} & \vdots \\ W_v & \cdots & W_v \end{bmatrix} \in \mathbb{R}^{\bar{n} \times \bar{m}} \qquad (15)$$

Thereby, each key-point that is occluded or not visible in the other point cloud should be assigned to this value and vice versa. Starting from $t = 0$ to $t = 100$ we perform the Sinkhorn algorithm in the following simple way, which is highly parallelizable and fully differentiable:

$$^{(t+1)}\bar{M} = \begin{bmatrix} {}^{(t)}\bar{M}_{11} - R_1 - C_1 & \cdots & {}^{(t)}\bar{M}_{1\bar{m}} - R_1 - C_{\bar{m}} \\ \vdots & & \vdots \\ {}^{(t)}\bar{M}_{\bar{n}1} - R_{\bar{n}} - C_1 & \cdots & {}^{(t)}\bar{M}_{\bar{n}\bar{m}} - R_{\bar{n}} - C_{\bar{m}} \end{bmatrix} \qquad (16)$$

with the row- and column-wise normalization functions:

$$R_i = \log \sum_j e^{\bar{M}_{ij}}, \qquad C_j = \log \sum_i e^{\bar{M}_{ij}} \qquad (17)$$

We approximate our soft-assignment matrix with 100 iterations, $^{(100)}\bar{M} \approx \mathbf{P}$. The overall tensor graph is shown in Fig. 4, including architectural details from the pillar layer to the optimal transport layer.
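A minimal sketch of the optimal transport layer as described by Eqs. (15)-(17): the matching matrix is padded with the learnable dustbin score $W_v$ and normalized in the log domain; the simultaneous row/column update follows the equations as written (classic Sinkhorn would alternate them). Names are illustrative.

```python
import torch

def optimal_transport(match_matrix, dustbin_weight, iterations=100):
    """Pad the score matrix with the learnable dustbin weight (Eq. 15) and run
    the log-domain normalization of Eqs. (16)-(17) for a fixed number of
    iterations, returning a soft assignment matrix of size (n+1) x (m+1)."""
    n, m = match_matrix.shape
    bin_col = dustbin_weight.view(1, 1).expand(n, 1)
    bin_row = dustbin_weight.view(1, 1).expand(1, m + 1)
    M = torch.cat([torch.cat([match_matrix, bin_col], dim=1), bin_row], dim=0)
    for _ in range(iterations):
        R = torch.logsumexp(M, dim=1, keepdim=True)   # row normalizers, Eq. (17)
        C = torch.logsumexp(M, dim=0, keepdim=True)   # column normalizers, Eq. (17)
        M = M - R - C                                 # simultaneous update, Eq. (16)
    return M

# Usage: P = optimal_transport(mk @ ml.t(), torch.nn.Parameter(torch.tensor(1.0)))
```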

3.4. Loss

The overall architecture with its three layer types, Pillar Layer, Graph Neural Network Layer and Optimal Transport Layer, is fully differentiable. Hence, the network is trained in a supervised manner. The ground truth is the set $\mathcal{GT}$ including all index tuples $(i, j)$ with pillar correspondences in our datasets, and also ground truth unmatched labels $(\bar{n}, j)$ and $(i, \bar{m})$, with $(\bar{n}, \bar{m})$ being nonsense. We apply three different losses to compare their results in the ablation study, e.g. the negative log-likelihood loss:

$$\mathcal{L}_{NLL} = - \sum_{i,j \in \mathcal{GT}} \log \mathbf{P}_{ij} \qquad (18)$$

During training we occasionally detected, especially if non-visible matches were underrepresented within $\mathcal{GT}$, a poor ability to generalize well for invisible key-points. Hence, according to Sinkhorn's methodology, we introduce an extra penalty term exclusively where unmatched points occur within the ground truth, $U \subset \mathcal{GT}$ with $j = \bar{m}$. We observed it is sufficient to apply it only in row direction:

$$\mathcal{L}_{NLLP} = - \sum_{i,j \in \mathcal{GT}} \log \mathbf{P}_{ij} + \sum_{i \in U} \left( \mathbf{P}_{i\bar{m}} + \log \sum_j^{\bar{m}} e^{\mathbf{P}_{ij}} \right) \qquad (19)$$

The overall problem could be seen as a binary classification problem. Hence, we also apply the dual cross entropy loss for an integral penalization of false matches and full reward of correct matches:

$$\mathcal{L}_{DCE} = \sum_{(i=1) \wedge (j \in \mathcal{GT})}^{n} \left( \mathbf{P}_{ij} + \log \sum_{j'}^{\bar{m}} e^{\mathbf{P}_{ij'}} \right) + \sum_{(j=1) \wedge (i \in \mathcal{GT})}^{m} \left( \mathbf{P}_{ij} + \log \sum_{i'}^{\bar{n}} e^{\mathbf{P}_{i'j}} \right) \qquad (20)$$
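As a simple illustration, the sketch below computes the negative log-likelihood loss of Eq. (18), assuming the optimal transport layer returns log-domain assignment scores and that ground truth tuples pointing at the extra row or column encode unmatched key-points; names are illustrative.

```python
import torch

def nll_loss(log_assignment, gt_pairs):
    """Negative log-likelihood of Eq. (18): sum of negated log-assignment
    scores over all ground truth tuples (i, j), including tuples that point
    at the extra 'invisible' row or column for unmatched key-points."""
    rows = torch.tensor([i for i, _ in gt_pairs], dtype=torch.long)
    cols = torch.tensor([j for _, j in gt_pairs], dtype=torch.long)
    return -log_assignment[rows, cols].sum()

# Usage: loss = nll_loss(P, gt_pairs)   # gt_pairs, e.g. [(0, 3), (1, 7), ...]
```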

4. Experiments

4.1. Implementation Details

Model: For key-point extraction, we use variable $c_{min}$ and $c_{max}$ to achieve $n = m = 100$ key-points $\pi_i$ as inputs for the pillar layer. Each point pillar is sampled with up to $z = 100$ points $x_\Omega$ using a Euclidean distance threshold of $d = 0.5$ m. Our implemented feature depth is $D' = 32$. The key-point encoder has five layers with the dimensions set to 32, 64, 128, 256 channels respectively. The graph is composed of $l_{max} = 6$ self and cross attention layers with $h = 8$ heads each. Overall, this results in 33 linear layers. Our model is implemented in PyTorch (Paszke et al., 2017) (v1.4), using Python 3.7. A forward pass of the model described above takes an average of 27/26 ms (37/38 fps) on a Nvidia RTX 2080 Ti / RTX Titan GPU for one pair of point clouds (see Fig. 5).


Figure 5. Performance (fps) for varying numbers of key-points on a Nvidia RTX 2080 Ti.

Training details: We process all sequences 00 to 10 of the KITTI (Geiger et al., 2012) odometry dataset, using the smoothness function (1), and identify key-points as described in Section 3.1. Ground truth correspondences and unmatched sets are generated using the existing odometry ground truth. Both point clouds are transformed into a shared coordinate system. Ground truth correspondences are key-point pairs with a nearest neighbor distance smaller than 0.1 m, while invisible matches, i.e. all pairs with distances greater than 0.5 m, are unmatched. We ignore all associations with a distance in the range 0.1 m to 0.5 m, to ensure variance in the resulting features. The entire pre-processing is repeated three times with temporal distances 1, 5, 10, i.e. consecutive frame index distances. For our training, we use the Adam optimizer (Kingma & Ba, 2014) with a constant learning rate of $10^{-4}$ and a batch size of 16. We split the dataset for ablation studies into a subset of three training and three evaluation datasets, resulting in T1, T5, T10 for training and V1, V5 and V10 for validation with varying temporal distance pairs, trained for 300 epochs.
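The following sketch shows how such ground truth correspondences and unmatched labels could be generated from the odometry poses, using the 0.1 m / 0.5 m thresholds described above; the function and variable names are illustrative and not from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def make_ground_truth(kp_a, kp_b, pose_a, pose_b, t_match=0.1, t_unmatched=0.5):
    """Transform key-points of both scans into a shared frame and label pairs:
    nearest neighbours closer than t_match are matches, key-points whose nearest
    neighbour is farther than t_unmatched are unmatched; anything in between is
    ignored during training."""
    def to_world(kp, pose):                      # pose: 4x4 odometry ground truth
        homo = np.hstack([kp, np.ones((len(kp), 1))])
        return (pose @ homo.T).T[:, :3]

    a_w, b_w = to_world(kp_a, pose_a), to_world(kp_b, pose_b)
    dist, nn = cKDTree(b_w).query(a_w, k=1)
    matches = [(i, int(j)) for i, (d, j) in enumerate(zip(dist, nn)) if d < t_match]
    unmatched_a = [i for i, d in enumerate(dist) if d > t_unmatched]
    return matches, unmatched_a
```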

4.2. Transformation Estimation

Matching Score is introduced in order to validate the results of our method; we compare our predictions based on two different metrics and in comparison to various state-of-the-art methods. Thereby, $M_s = (\sum^N P_F) / N$ is a metric depicting the mean percentage of correctly predicted matches compared to the total amount of correct matches in the test sequence ($P_F$ denotes the per-frame percentage and $N$ the total number of frames). The matching score is used to compare the amount of correct predictions against two benchmark methods (Tab. 1): a simple nearest neighbour search for the 3D coordinates of our key-points based on a k-d tree (Bentley, 1975), and Point Feature Histograms (PFH (Rusu et al., 2009)) descriptors to find a high-dimensional representation of our key-points, which can be used to find corresponding key-points in the associated frame based on a high-dimensional k-d tree search. Based on the predicted correspondences of the different methods, it is possible to deduce a transform estimation using singular value decomposition (SVD).
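A minimal sketch of the SVD-based transform estimation mentioned above (the standard Kabsch procedure applied to matched key-point pairs); it is an illustration, not the authors' exact implementation.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform aligning matched key-points src -> dst
    via SVD (Kabsch algorithm); src, dst: (k, 3) arrays of matched pairs."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                         # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T                                         # 4x4 homogeneous transform
```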

Transformation Error is calculated by comparing the ground truth odometry poses $T_{GT} \in \mathbb{R}^{4 \times 4}$ of the KITTI odometry dataset with the transform estimation based on the correspondences predicted by StickyPillars, $T_{pred} \in \mathbb{R}^{4 \times 4}$: $T = T^{-1}_{pred} \cdot T_{GT}$. $T$ describes the transformation difference between ground truth and estimation for two related frames. Based on this, we compute the translational $T_\delta$ and rotational $T_\theta$ error values used in (Geiger et al., 2012):

$$T_\delta = \left\| (T_{41}, T_{42}, T_{43})^T \right\|$$
$$T_\theta = \arccos f_\theta\left( 0.5 \cdot (T_{11} + T_{22} + T_{33} - 1) \right)$$
$$f_\theta(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } -1 \le x \le 1 \\ -1 & \text{if } x < -1 \end{cases} \qquad (21)$$
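The error metrics of Eq. (21) can be computed as sketched below; the sketch assumes 4x4 poses with the translation in the last column, so the indexing differs from the printed equation if the transposed convention is used.

```python
import numpy as np

def transform_error(T_gt, T_pred):
    """Translational and rotational error of Eq. (21) between ground-truth and
    predicted poses; assumes 4x4 matrices with translation in the last column."""
    T = np.linalg.inv(T_pred) @ T_gt                     # relative error transform
    t_delta = np.linalg.norm(T[:3, 3])                   # translational error
    trace_term = 0.5 * (T[0, 0] + T[1, 1] + T[2, 2] - 1.0)
    t_theta = np.arccos(np.clip(trace_term, -1.0, 1.0))  # rotational error, clip = f_theta
    return t_delta, t_theta
```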

We estimate the transformation based on predicted correspondences and subsequent SVD from nearest neighbour search (NN (Muja & Lowe, 2009)), Point Feature Histograms (PFH (Rusu et al., 2009)), StickyPillars (OURS) and also from all possible valid matches in the ground truth labels (VM). We use VM to set a baseline for the transformation error if all correspondences were found and correct. Furthermore, our results are compared to the Iterative Closest Point algorithm (ICP (Zhang, 1994)), which is applied to our source and target key-points and exploited to iteratively refine the intermediate rigid transformation. The results for our validation metrics for the frame distances 1, 5 and 10 (V1, V5 and V10) can be found in Tab. 1.

Table 1. Transformation results.

MATCHER        V1       V5       V10

MATCHING SCORE
NN             0.485    0.106    0.048
PFH            0.143    0.056    0.014
OURS           0.909    0.722    0.559

TRANSLATIONAL ERROR (Tδ)
NN             0.039    0.717    4.451
ICP            0.073    0.393    3.264
PFH            0.298    1.376    5.879
OURS           0.025    0.056    0.548
VM             0.025    0.049    0.169

ROTATIONAL ERROR (Tθ)
NN             0.003    0.086    0.366
ICP            0.012    0.014    0.068
PFH            0.048    0.099    0.674
OURS           0.002    0.004    0.091
VM             0.002    0.003    0.009

Table 2. Loss comparison. Confusion matrix with precision P and accuracy A for our training and validation subsets.

DATA        V1: P    A      V5: P    A      V10: P    A

NLL - LNLL
T1          86.9     76.8   64.0     47.1   46.8      30.5
T5          85.1     74.1   72.0     56.3   53.7      36.7
T10         82.9     70.9   71.0     55.0   55.8      38.7

NLL + PENALTY - LNLLP
T1          89.6     81.2   70.5     54.5   52.6      35.7
T5          84.4     73.0   70.5     54.5   52.9      36.0
T10         83.4     71.5   71.3     55.4   56.5      39.4

DUAL CROSS ENTROPY - LDCE
T1          86.7     76.6   68.2     51.8   51.2      34.4
T5          85.9     75.3   72.2     56.6   56.1      39.0
T10         84.4     73.0   72.4     56.8   57.9      40.8

Based on our robust feature matching, the results show that our method can also be used to estimate the ego motion based on features extracted from LiDAR point cloud scans. We are able to find corresponding features with a high matching score even for far apart scans. Therefore we reach the highest matching score in all experiments and hence also the lowest translational and rotational errors. For frame differences of 1 and 5 we are even close to on par with VM, which uses all valid correspondences from the ground truth to estimate the desired transformation. Using nearest neighbour correspondences with SVD outperforms ICP in V1 because we solely use valid matches to perform the transformation estimation; for higher distances it fails.


Figure 6. Qualitative results on KITTI odometry with three different difficulty levels, matching pillars from two different point clouds with a frame distance of 1 (blue - top row), 5 (red - middle row), 10 (purple - bottom row). Ground truth and model were generated according to the experiments section using LNLLP. The figure shows samples from the validation set (V1, V5, V10), which were not seen during training and come from a different sequence. Green lines identify correct matches and red lines incorrect ones. Even very challenging key-points are matched in a sufficient manner.

4.3. Ablation Study

Table 2 shows a confusion matrix with precision and accuracy results of our model trained on the subsets T1, T5, T10 and validated on V1, V5, V10. When using LNLL, we saw many nearly equally distributed probabilities for unmatched key-points, which expresses uncertainty. LNLLP and LDCE work slightly better than LNLL, because they contain additional penalties for unmatched key-points. Nevertheless, our model has an exceptional matching performance, independent of the underlying loss. We see constant biases towards the underlying training data, i.e. the model trained on Ti performs best on Vi. Still, all differences are minor, indicating good generalization.

5. Conclusion

We present a novel model for point-cloud registration using Deep Learning. Thereby, we introduce a three stage model composed of a point cloud encoder, an attention-based graph and an optimal transport algorithm. Our neural network performs local and global feature matching at once using contextual aggregation. We achieve significantly better results compared to the state of the art in a very robust manner. We have shown our results on the KITTI odometry dataset.

Acknowledgements

We would like to thank Valeo, especially Driving Assistance Research Kronach, Germany, for making this work possible. Further, we want to thank Prof. Patrick Mader from the Ilmenau University of Technology and the Institute for Software Engineering for Safety-Critical Systems (SECSY) for supporting this work.


References

Bentley, J. L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Besl, P. J. and McKay, N. D. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pp. 586–606. International Society for Optics and Photonics, 1992.

Bian, J., Lin, W.-Y., Matsushita, Y., Yeung, S.-K., Nguyen, T.-D., and Cheng, M.-M. GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4181–4190, 2017.

Cech, J., Matas, J., and Perdoch, M. Efficient sequential correspondence selection by cosegmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1568–1581, 2010.

Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915, 2017.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

DeTone, D., Malisiewicz, T., and Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236, 2018.

Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., and Sattler, T. D2-Net: A trainable CNN for joint detection and description of local features. arXiv preprint arXiv:1905.03561, 2019.

Engel, N., Hoermann, S., Horn, M., Belagiannis, V., and Dietmayer, K. DeepLocalization: Landmark-based self-localization with deep neural networks. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 926–933. IEEE, 2019.

Fischler, M. A. and Bolles, R. C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

Geiger, A., Lenz, P., and Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE, 2012.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., and Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705, 2019.

Li, J. and Lee, G. H. USIP: Unsupervised stable interest point detection from 3D point clouds. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

Li, Q., Chen, S., Wang, C., Li, X., Wen, C., Cheng, M., and Li, J. LO-Net: Deep real-time lidar odometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8473–8482, 2019.

Lin, J. and Zhang, F. Loam livox: A fast, robust, high-precision lidar odometry and mapping package for lidars of small FoV. arXiv preprint arXiv:1909.06700, 2019.

Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

Lu, W., Wan, G., Zhou, Y., Fu, X., Yuan, P., and Song, S. DeepVCP: An end-to-end deep neural network for point cloud registration. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

Milz, S., Simon, M., Fischer, K., and Gross, H.-M. Points2Pix: 3D point-cloud to image translation using conditional GANs. In Pattern Recognition: 41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, September 10–13, 2019, Proceedings, volume 11824, pp. 387. Springer Nature, 2019.

Mucha, P. J., Richardson, T., Macon, K., Porter, M. A., and Onnela, J.-P. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876–878, 2010.

Muja, M. and Lowe, D. G. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340):2, 2009.

Nicosia, V., Bianconi, G., Latora, V., and Barthelemy, M. Growing multiplex networks. Physical Review Letters, 111(5):058701, 2013.

Ono, Y., Trulls, E., Fua, P., and Yi, K. M. LF-Net: Learning local features from images. In Advances in Neural Information Processing Systems, pp. 6234–6244, 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. NIPS Workshops, 2017.

Peyre, G., Cuturi, M., et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660, 2017a.

Qi, C. R., Yi, L., Su, H., and Guibas, L. J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108, 2017b.

Raguram, R., Frahm, J.-M., and Pollefeys, M. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In European Conference on Computer Vision, pp. 500–513. Springer, 2008.

Revaud, J., Weinzaepfel, P., De Souza, C., Pion, N., Csurka, G., Cabon, Y., and Humenberger, M. R2D2: Repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195, 2019.

Rusinkiewicz, S. and Levoy, M. Efficient variants of the ICP algorithm. In Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152. IEEE, 2001.

Rusu, R. B., Blodow, N., and Beetz, M. Fast point feature histograms (FPFH) for 3D registration. In 2009 IEEE International Conference on Robotics and Automation, pp. 3212–3217. IEEE, 2009.

Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. arXiv preprint arXiv:1911.11763, 2019.

Sattler, T., Leibe, B., and Kobbelt, L. SCRAMSAC: Improving RANSAC's efficiency with a spatial consistency filter. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2090–2097. IEEE, 2009.

Shan, T. and Englot, B. LeGO-LOAM: Lightweight and ground-optimized lidar odometry and mapping on variable terrain. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4758–4765. IEEE, 2018.

Simon, M., Milz, S., Amende, K., and Gross, H.-M. Complex-YOLO: An Euler-region-proposal for real-time 3D object detection on point clouds. In European Conference on Computer Vision, pp. 197–209. Springer, 2018.

Simon, M., Amende, K., Kraus, A., Honer, J., Samann, T., Kaulbersch, H., Milz, S., and Michael Gross, H. Complexer-YOLO: Real-time 3D object detection and tracking on semantic point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

Tuytelaars, T. and Van Gool, L. J. Wide baseline stereo matching based on local, affinely invariant regions. In BMVC, volume 412, 2000.

Vallender, S. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784–786, 1974.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. LIFT: Learned invariant feature transform. In European Conference on Computer Vision, pp. 467–483. Springer, 2016.

Zhang, J. and Singh, S. LOAM: Lidar odometry and mapping in real-time. In Proceedings of Robotics: Science and Systems Conference, July 2014.

Zhang, Z. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2):119–152, 1994.

Zhou, Y. and Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499, 2018.