

Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency

Zhipeng Luo†1,2 Zhongang Cai†1,2 Changqing Zhou†1,2 Gongjie Zhang1 Haiyu Zhao2

Shuai Yi2 Shijian Lu∗1 Hongsheng Li3 Shanghang Zhang4 Ziwei Liu1

1 Nanyang Technological University 2 Sensetime Research 3 Chinese University of Hong Kong 4 UC Berkeley

{zhipeng001, zhou0365}@e.ntu.edu.sg, {caizhongang, zhaohaiyu, yishuai}@sensetime.com, [email protected], [email protected], [email protected], [email protected]

[Figure 1 panel labels: Ground Truth, Prediction, Ground Truth, Prediction]

Figure 1: Visualization of detection results for domain adaptation from KITTI to Waymo dataset. Left: Predictions of baseline model trained on KITTI dataset and directly tested on Waymo dataset. The model can classify and localize the objects, but produces inaccurate box scale due to geometric mismatch. The predicted boxes are therefore noticeably smaller than the ground truth. Right: Predictions of our domain-adaptive MLC-Net, which demonstrates accurate bounding box scale even though MLC-Net is trained without access to any target domain annotation or statistical information. Best viewed in color.

Abstract

Deep learning-based 3D object detection has achieved unprecedented success with the advent of large-scale autonomous driving datasets. However, drastic performance degradation remains a critical challenge for cross-domain deployment. In addition, existing 3D domain adaptive detection methods often assume prior access to the target domain annotations, which is rarely feasible in the real world. To address this challenge, we study a more realistic setting, unsupervised 3D domain adaptive detection, which only utilizes source domain annotations. 1) We first comprehensively investigate the major underlying factors of the domain gap in 3D detection. Our key insight is that geometric mismatch is the key factor of domain shift. 2) Then, we propose a novel and unified framework, Multi-Level Consistency Network (MLC-Net), which employs a teacher-student paradigm to generate adaptive and reliable pseudo-targets. MLC-Net exploits point-, instance- and neural statistics-level consistency to facilitate cross-domain transfer. Extensive experiments demonstrate that MLC-Net outperforms existing state-of-the-art methods (including those using additional target domain information) on standard benchmarks. Notably, our approach is detector-agnostic, which achieves consistent gains on both single- and two-stage 3D detectors. Our code will be released soon.

1. Introduction

With the prevalent use of LiDARs for autonomous vehicles and mobile robots, 3D object detection on point clouds has drawn increasing research attention. Large-scale 3D object detection datasets [10, 28, 3] in recent years have empowered deep learning-based models [25, 34, 33, 15, 24, 35, 18, 27, 26, 40, 37] to achieve remarkable success. However, deep learning models trained on one dataset (source domain) often suffer tremendous performance degradation when evaluated on another dataset (target domain). We investigate the bounding box scale mismatch problem (e.g., vehicle size in the U.S. is noticeably larger than that in Germany), which is found to be a major contributor to the domain gap, in alignment with previous work [30]. This is unique to 3D detection: compared to 2D bounding boxes, which can have a large variety of sizes depending on the distance of the object from the camera, 3D bounding boxes have a more consistent size within the same dataset, regardless of the objects' location relative to the LiDAR sensor. Hence, the detector tends to memorize a narrow, dataset-specific distribution of bounding box size from the source domain (Figure 2).

† denotes equal contribution. ∗ denotes corresponding author.

arXiv:2107.11355v1 [cs.CV] 23 Jul 2021

Figure 2: A study on the domain shift for 3D detection. Here we take KITTI as the source dataset and Waymo as the target dataset. Our key insights include: 1) the distribution of object dimensions varies drastically across datasets, indicating geometric mismatch can be a key factor for the domain gap; 2) directly applying a model trained on KITTI to Waymo (referred to as the baseline in the figure) is ineffective: the model continues to predict box dimensions close to the source domain; 3) our MLC-Net is effective in addressing the geometric mismatch, and the distributions of its predictions on the target domain accurately align with the ground truth. Best viewed in color.

Unfortunately, existing works are inadequate to address the domain gap with a realistic setup. Recent methods on domain adaptive 3D detection either require some labeled data from the target domain for finetuning or utilize some additional statistics (such as the mean size) of the target domain [30]. However, such knowledge of the target domain is not always available. In addition, popular 2D unsupervised domain adaptation methods that leverage feature alignment techniques [7, 22, 38, 5, 32, 16] to mitigate domain shift are not readily transferable to 3D detection. While these methods are effective in handling domain gaps due to lighting, color, and texture variations, such information is unavailable in point clouds. Instead, point clouds pose unique challenges such as the geometric mismatch discussed above.

Therefore, we propose MLC-Net for unsupervised domain adaptive 3D detection. MLC-Net is designed to tackle two major challenges. The first is to create meaningful scale-adaptive targets to facilitate the learning. Specifically, MLC-Net employs the mean teacher [29] learning paradigm. The teacher model is essentially a temporal ensemble of student models: the parameters of the teacher model are updated by an exponential moving average over the student models of preceding iterations. Our analyses show that the mean teacher produces accurate and stable supervision for the student model without any prior knowledge of the target domain. To the best of our knowledge, we are the first to introduce the mean teacher paradigm in unsupervised domain adaptive 3D detection. The second is to design scale-related consistency losses and construct useful correspondences of teacher-student predictions to initiate gradient flow. To this end, we design MLC-Net to enforce consistency at three levels. 1) Point-level. As point clouds are unstructured, point-based region proposals or equivalents [25, 34] are common. Hence, we sample the same subset of points and share them between the teacher and student. We retain the indices of the points, which allows 3D augmentation methods to be applied without losing the correspondences. 2) Instance-level. Matching region proposals can be erroneous, especially at the initial stage when the quality of region proposals is substandard. Hence, we resort to transferring teacher region proposals to the student to circumvent the matching process. 3) Neural statistics-level. As the teacher model only accesses the target domain input, the mismatch between the batch statistics hinders effective learning. We thus transfer the student's statistics, which are gathered from both the source and the target domain, to the teacher to achieve a more stable training behavior.

MLC-Net shows remarkable compatibility with popular mainstream 3D detectors, allowing us to implement it on both two-stage [25] and single-stage [34] detectors. Moreover, we verify our design through rigorous experiments across multiple widely used 3D object detection datasets [10, 28, 3]. Our method outperforms baselines by convincing margins, even surprisingly surpassing existing methods that utilize additional information. In summary, our main contributions are:


• We formulate and study unsupervised domain adaptive 3D detection, a pragmatic, yet underexplored task that requires no information of the target domain. We comprehensively investigate the major underlying factors of the domain gap in 3D detection and find geometric mismatch is the key factor.

• We propose a concise yet effective mean-teacher paradigm that leverages three levels of consistency to facilitate cross-domain transfer, achieving a significant performance boost that is consistent on various mainstream detectors and across multiple popular public datasets.

• We validate our hypothesis on the unique challenges associated with point clouds and verify our proposed approach with comprehensive evaluations, which we hope would lay a strong foundation for future research.

2. Related Works

LiDAR-based 3D Detection. LiDAR-based 3D detection methods mainly fall into two categories, namely grid-based methods and point-based methods. Grid-based approaches convert the whole point cloud scene to grids of fixed size and process the input with 2D or 3D CNNs. MV3D [6] first projects point clouds to bird's-eye view images to generate proposals. PointPillars [15] performs voxelization on point clouds and converts the representation to 2D. VoxelNet [39] obtains voxel representations by applying PointNet [20] to points and processes the features with 3D convolution. SECOND [33] applies 3D sparse convolution [11] to improve the efficiency. PV-RCNN [24] proposes to combine voxelization and point-based set abstraction to obtain more discriminative features. On the other hand, point-based methods directly extract features from raw point cloud input. F-PointNet [19] applies PointNet [20] to perform 3D detection based on 2D bounding boxes. PointRCNN proposes a two-stage framework to generate bounding box proposals from the whole point cloud and refine them with feature pooling. 3DSSD proposes to use F-FPS for better point sampling to achieve single-stage detection. In this work, we conduct a focused discussion with PointRCNN [25] as the base model, but we show our method is also compatible with a single-stage detector (3DSSD) in the Supplementary Material.

Point Cloud Domain Adaptation. While extensive research has been conducted on domain adaptation tasks with 2D image data, the 3D point cloud domain adaptation field has a relatively small literature. PointDAN [21] proposes to jointly align local and global features using discrepancy loss and adversarial training for point cloud classification. Achituve et al. [1] introduce an additional self-supervised reconstruction task to improve the classification performance on the target domain. [36] designs a sparse voxel completion network to perform point cloud completion for domain adaptive semantic segmentation. [14] leverages multi-modal information by projecting point clouds to 2D images and training models jointly. For object detection, [30] identifies the major domain gap among autonomous driving datasets and proposes to mitigate the gap by leveraging target statistical information. SF-UDA [23] computes motion coherence over consecutive frames to select the best scale for the target domain. Our proposed method works under a similar setup to [30] but does not require target domain statistical information.

Mean Teacher. The mean teacher framework [29] was first proposed for semi-supervised learning tasks. Many variants [8, 2, 31] have been proposed to further improve its performance. Furthermore, the framework has also been applied to other fields such as domain adaptation [9, 4] and self-supervised learning [13, 12, 17], where labeled data is scarce or unavailable. Specifically, the mean teacher framework incorporates one trainable student model and a non-trainable teacher model whose weights are obtained from the exponential moving average of the student model's weights. The student model is optimized based on the consistency loss between the student and teacher network predictions. In particular, although [4] also employs the mean teacher paradigm for the detection task by aligning region-level features, point cloud detection models are substantially different from 2D detectors, and our proposed method differs by incorporating multi-level consistency.

3. Our Approach

In this section, we formulate the 3D point cloud domain adaptive detection problem (Section 3.1), and provide an overview of our MLC-Net (Section 3.2), followed by the details of our mean-teacher paradigm (Section 3.3). Finally, we explain the details of the point-level (Section 3.4), instance-level (Section 3.5), and statistics-level (Section 3.6) consistency of our MLC-Net.

3.1. Problem Definition

Under the unsupervised domain adaptation setting, we have access to point cloud data from one labeled source domain $D_s = \{x_s^i, y_s^i\}_{i=1}^{N_s}$ and one unlabeled target domain $D_t = \{x_t^i\}_{i=1}^{N_t}$, where $N_s$ and $N_t$ are the number of samples from the source and target domains, respectively. Each point cloud scene $x^i \in \mathbb{R}^{n \times 3}$ consists of $n$ points with their 3D coordinates while $y^i$ denotes the label of the corresponding training sample from the source domain. $y$ is in the form of object class $k$ and 3D bounding box parameterized by the center location of the bounding box $(c_x, c_y, c_z)$, the size in each dimension $(d_x, d_y, d_z)$, and the orientation $\eta$. The goal of the domain adaptive detection task is to train a model $F$ based on $D_s$ and $D_t$ and maximize the performance on $D_t$.
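To make the notation concrete, the following is a minimal sketch of how a labeled source sample and an unlabeled target sample could be represented. The names `Box3D`, `SourceSample`, `TargetSample` and the `scaled` helper are hypothetical and chosen only for illustration; they are not part of any released code.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) coordinates of one LiDAR point


@dataclass(frozen=True)
class Box3D:
    """One object label: class k plus the seven box parameters from the text."""
    k: str      # object class, e.g. "Car"
    cx: float   # center location
    cy: float
    cz: float
    dx: float   # size in each dimension
    dy: float
    dz: float
    eta: float  # orientation

    def scaled(self, s: float) -> "Box3D":
        # Scaling a scene about the origin by s multiplies centers and sizes
        # by s and leaves the orientation untouched (illustrative helper).
        return replace(self, cx=self.cx * s, cy=self.cy * s, cz=self.cz * s,
                       dx=self.dx * s, dy=self.dy * s, dz=self.dz * s)


@dataclass
class SourceSample:
    points: List[Point]  # x_s^i
    boxes: List[Box3D]   # y_s^i


@dataclass
class TargetSample:
    points: List[Point]  # x_t^i, no labels available


car = Box3D("Car", 10.0, 2.0, -1.0, 4.0, 1.6, 1.5, 0.3)
bigger = car.scaled(2.0)
```

The only asymmetry between the two domains is the missing `boxes` field on the target side, which is what makes the setting unsupervised.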


[Figure 3 diagram. Labels: Student Model, Teacher Model, Region Proposal Network, Box Refinement Network, Source Input, Target Input, Point-level Predictions, Instance-level Predictions, Input Augmentation, RoI Augmentation, Point Cloud Region Pooling, Exponential Moving Average, Point-Level Consistency, Instance-Level Consistency, Neural Statistics-Level Consistency]

Figure 3: The network architecture of our proposed MLC-Net. MLC-Net leverages the mean-teacher [29] paradigm where the teacher is the exponential moving average (hence the name mean-teacher) of the student model and is updated at every iteration. This mean-teacher design provides high-quality pseudo labels to facilitate smooth learning of the student model. Towards the goal, we design consistency enforced at three levels. First, at point-level, 3D proposals are associated based on point correspondences, which are established by sampling the same set of points from the target domain for both the student and teacher models; second, at instance-level, the teacher 3D proposals are passed to the student Box Refinement Network, and the correspondences between 3D box predictions from two models are naturally maintained. Third, at neural statistics-level, we discover non-learnable parameters in batch normalization layers demonstrate significant domain shift; we thus align the teacher's parameters with the student's. We highlight the efficacy of MLC-Net and further discuss our design motivations in Section 3. Best viewed in color.

3.2. Framework Overview

We illustrate MLC-Net in Figure 3. The labeled source input $x_s$ is used for standard supervised training of the student model $F$ with loss $L_{source}$. For each unlabeled target domain example $x_t$, we perturb it by applying random augmentation $h$ to obtain $\hat{x}_t$. The perturbed and original point cloud inputs are passed to the student model and teacher model respectively to get their point-level box proposals $\hat{R}_t$ and $R_t$ where point-level consistency is applied. Subsequently, teacher proposals are passed to the student model for box refinement, to obtain $\hat{S}_t$. Together with teacher's instance-level predictions $S_t$, the instance-level consistency is applied. The overall consistency loss $L_{consist}$ is computed as:

$$L_{consist} = L_{pt,cls} + L_{pt,box} + L_{ins,cls} + L_{ins,box} \tag{1}$$

where pt, ins, cls and box stand for point-level, instance-level, classification and box regression respectively. These loss components are elaborated in Section 3.4 and 3.5. In each iteration, the student model is updated through gradient descent with the total loss $L$, which is a weighted sum of $L_{source}$ and $L_{consist}$:

$$L = \lambda L_{source} + L_{consist} \tag{2}$$

where $\lambda$ is the weight coefficient. The learnable parameters of the student model are then used to update the corresponding teacher model parameters, where the details can be found in Section 3.3. In addition, we enforce non-learnable parameters to be aligned between the teacher and the student via neural statistics-consistency (Section 3.6).
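As a small worked example of Equations (1) and (2), the sketch below combines four already-computed consistency terms with a supervised source loss. The function names and the numeric values are made up for illustration; the text does not state the value of $\lambda$, so `lam` is left as a free parameter.

```python
def consistency_loss(pt_cls, pt_box, ins_cls, ins_box):
    """Eq. (1): unweighted sum of the four consistency terms."""
    return pt_cls + pt_box + ins_cls + ins_box


def total_loss(source_loss, consist_loss, lam=1.0):
    """Eq. (2): lam weights the supervised source loss only."""
    return lam * source_loss + consist_loss


# Illustrative numbers, not taken from any experiment.
l_consist = consistency_loss(pt_cls=0.5, pt_box=1.0, ins_cls=0.25, ins_box=0.75)
l_total = total_loss(source_loss=2.0, consist_loss=l_consist, lam=0.5)
```

Note that in this formulation the weight sits on the source term, so lowering `lam` shifts emphasis towards the target-domain consistency terms.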

MLC-Net achieves two major design goals towards effective unsupervised 3D domain adaptive detection. The first is to generate accurate and robust pseudo targets without any access to the target domain annotation or statistical information. MLC-Net leverages a mean teacher paradigm where the teacher model can be regarded as a temporal ensemble of student models, allowing it to produce high-quality predictions and guide the learning of the student. The second is to design effective consistency losses at point-, instance- and neural statistics-level that enhance adaptability to scale variation, and to construct the teacher-student correspondences that allow the back-propagated gradient to flow through the correct routes. Although we conduct most analysis on PointRCNN [25] as the representative of two-stage 3D detectors, we highlight that our method is generic and can be easily extended to single-stage detection models such as 3DSSD [34] with modest modifications (see Supplementary Material).

3.3. Mean Teacher

Motivated by the success of the mean teacher paradigm [29] in semi-supervised learning and self-supervised learning, we apply it to our point cloud domain adaptive detection task as illustrated in Figure 3. The framework consists of a student model $F$ and a teacher model $F'$ with the same network architecture but different weights $\theta$ and $\theta'$, respectively. The weights of the teacher model are updated by taking the exponential moving average of the student model weights:

$$\theta' = m\theta' + (1 - m)\theta \tag{3}$$

where $m$ is known as the momentum which is usually a number close to 1, e.g. 0.99. Figure 5 shows that the teacher model constantly provides effective supervision to the student model via high-quality pseudo targets. Hence, by enforcing the consistency between the student and the teacher, the student learns domain-invariant representations to adapt to the unlabeled target domain, guided by the pseudo labels. We show in Table 5 that the mean teacher significantly improves model performance compared to baseline.
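The update in Equation (3) can be sketched in a few lines. This is a minimal illustration that stores parameters as plain Python lists keyed by name; the helper `ema_update` and the toy parameter `"w"` are hypothetical, and a real implementation would operate on the tensors of a deep learning framework.

```python
def ema_update(teacher, student, m=0.99):
    """Apply theta' <- m * theta' + (1 - m) * theta to every parameter in place."""
    for name, theta in student.items():
        teacher[name] = [m * t + (1.0 - m) * s
                         for t, s in zip(teacher[name], theta)]
    return teacher


# Toy example: the student is held fixed, so the teacher drifts towards it.
teacher = {"w": [0.0, 0.0]}
student = {"w": [1.0, -2.0]}
for _ in range(100):
    ema_update(teacher, student, m=0.99)
```

With a fixed student and a teacher initialized at zero, the teacher after $n$ updates equals $(1 - m^n)\theta$, which shows why a momentum close to 1 makes the teacher a slow-moving, smoothed version of the student.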

3.4. Point-Level Consistency

The point-level consistency loss is calculated between the first-stage box proposals of the student and teacher models. One of the key challenges for formulating consistency is to find the correspondence between the student and the teacher. Unlike image pixels that are arranged in regular lattices, points reside in continuous 3D space which lacks structure [20]. Hence, constructing point correspondences can be problematic (Table 3). Instead, we circumvent the difficulty by feeding the teacher and the student two identical sets of points at the very beginning and trace the point indices to maintain correspondences.

Specifically, for each target domain example, we sample $M$ points from the point cloud scene to obtain the teacher input $x_t$ and apply random augmentation $h$ on a replicated set to obtain $\hat{x}_t$ with $\hat{x}_t = h(x_t)$. $h$ consists of random global scaling of the point cloud scenes and can be regarded as applying displacements on individual points, without disrupting the point correspondences. As a result, each point $\hat{p} \in \hat{x}_t$ corresponds to a point $p \in x_t$, and this relationship holds for the point-level predictions of the region proposal network $F_{RPN}$. We denote the first stage prediction as $R = F_{RPN}(x)$. Note that the point correspondences are transferred to box proposals as each point generates one box proposal. $R$ consists of class prediction $R^c$ and box regression $R^b$. For the class predictions, we define the consistency loss as the Kullback-Leibler (KL) divergence between each point pair from $\hat{x}_t$ and $x_t$:

$$L_{pt,cls} = \frac{1}{|x_t|} \sum D_{KL}(\hat{R}^c_t \,\|\, R^c_t) \tag{4}$$

where $|x_t|$ stands for the number of points in $x_t$.

More importantly, we enforce consistency between bounding box regression predictions to address geometric mismatch. For the bounding box predictions, we only compute the consistency over points belonging to the objects because the background points do not generate meaningful bounding boxes. We obtain a set of points $P_{pos}$ which fall inside the bounding boxes of the final predictions of both the student and teacher networks with $P_{pos} = \{(p \in \mathrm{NMS}(\hat{S}_t)) \cap (p \in \mathrm{NMS}(S_t))\}$, where $\hat{S}_t$ and $S_t$ are the refined bounding box predictions after the second stage (see Section 3.5). We then compute the point-level box consistency loss as:

$$L_{pt,box} = \frac{1}{|P_{pos}|} \sum_{p_i \in P_{pos}} d(\hat{R}^{b(i)}_t, h(R^{b(i)}_t)) \tag{5}$$

where $d$ is the smooth L1 loss and $h$ is the random augmentation applied to the input $x_t$. We apply the same augmentation to the teacher bounding box predictions to align with the scale of the student point cloud scene before computing the consistency.
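The two point-level terms in Equations (4) and (5) can be illustrated with a small standard-library sketch. Several simplifications are assumptions made here and are not stated in the text: boxes are plain 7-tuples `(cx, cy, cz, dx, dy, dz, eta)` rather than the residual encoding a real detector regresses, $h$ is a single global scale factor `s`, smooth L1 is summed over the box parameters with `beta = 1.0`, and the positive index set $P_{pos}$ is passed in rather than derived from NMS. All function names are hypothetical.

```python
import math


def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete class distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def smooth_l1(a, b, beta=1.0):
    """Smooth L1 distance summed over box parameters (summation is an assumption)."""
    total = 0.0
    for ai, bi in zip(a, b):
        diff = abs(ai - bi)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def scale_box(box, s):
    """Apply global scaling h to a box: centers and sizes scale, heading does not."""
    cx, cy, cz, dx, dy, dz, eta = box
    return (cx * s, cy * s, cz * s, dx * s, dy * s, dz * s, eta)


def point_cls_consistency(student_probs, teacher_probs):
    """Eq. (4): mean KL divergence over index-aligned point pairs."""
    pairs = list(zip(student_probs, teacher_probs))
    return sum(kl_div(sp, tp) for sp, tp in pairs) / len(pairs)


def point_box_consistency(student_boxes, teacher_boxes, pos_idx, s):
    """Eq. (5): smooth L1 between student boxes and scaled teacher boxes on P_pos."""
    if not pos_idx:
        return 0.0
    return sum(smooth_l1(student_boxes[i], scale_box(teacher_boxes[i], s))
               for i in pos_idx) / len(pos_idx)


# Index i refers to the same sampled point in both models.
teacher_boxes = [(10.0, 2.0, -1.0, 4.0, 1.6, 1.5, 0.3),
                 (20.0, -3.0, -1.0, 4.2, 1.7, 1.6, 1.2)]
s = 1.05
consistent_student = [scale_box(b, s) for b in teacher_boxes]
loss_consistent = point_box_consistency(consistent_student, teacher_boxes, [0, 1], s)
loss_unscaled = point_box_consistency(teacher_boxes, teacher_boxes, [0, 1], s)
```

A student that predicts exactly the scaled teacher boxes incurs zero box loss, while a student that ignores the scale change (`loss_unscaled`) is penalized, which is the mechanism that pushes the student towards scale-aware predictions.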

3.5. Instance-Level Consistency

In the second stage, NMS is performed on $R$ to obtain $N$ high-confidence region proposals denoted as $G$ for each point cloud scene. We highlight that the association between region proposals from the student and teacher models is lost in the NMS due to the differences between $\hat{R}_t$ and $R_t$. To match the instance-level predictions for consistency computation, a common method is to perform greedy matching based on IoU between teacher and student region proposals. However, such matching is not robust due to the large number of noisy predictions, which leads to ineffective learning as shown experimentally in Table 3. Hence, we adopt a simple approach by replicating the teacher region proposals to the student model and applying the input augmentation $h$ to match the scale of the student model. Subsequently, we disturb the region proposals by applying random RoI augmentation $\xi$ to the sets of region proposals before they are used for feature pooling. The motivation of this operation is to force the models to output consistent predictions given non-identical region proposals and prevent convergence to trivial solutions. Formally, the above process can be described as $\hat{f}_t = \mathrm{pool}(\xi(h(G_t)))$ and $f_t = \mathrm{pool}(\xi'(G_t))$ for the student and teacher models, respectively, where $f$ denotes the instance-level features obtained from feature pooling as described in [25]. The pooled features are then passed to the box refinement network $F_{BRN}$ for box refinement to obtain the second stage predictions $S = F_{BRN}(f)$. Similar to the first stage prediction $R$, $S$ consists of the class prediction $S^c$ as well as the bounding box prediction $S^b$. We define the instance-level class consistency as the difference between $\hat{S}^c_t$ and $S^c_t$:

$$L_{ins,cls} = \frac{1}{|G_t|} \sum D_{KL}(\hat{S}^c_t \,\|\, S^c_t) \tag{6}$$

where $|G_t|$ denotes the number of region proposals. On the other hand, to compute the instance-level box consistency loss, we first obtain a set of positive predictions $S_{pos} = \{(\hat{S}^c_t > \varepsilon) \cap (S^c_t > \varepsilon)\}$ by selecting bounding boxes with classification predictions larger than a probability threshold $\varepsilon$. We then apply $h$ to $S^b_t$ to match the scale and compute the instance-level box consistency loss based on the discrepancy between $\hat{S}^b_t$ and $S^b_t$ for the selected predictions:

$$L_{ins,box} = \frac{1}{|S_{pos}|} \sum_{S^{(i)}_t \in S_{pos}} d(\hat{S}^{b(i)}_t, S^{b(i)}_t) \tag{7}$$
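The proposal transfer and the positive-set selection described above can be sketched as follows. This is an illustration under stated assumptions: boxes are 7-tuples `(cx, cy, cz, dx, dy, dz, eta)`, $h$ is a single global scale factor `s`, and the RoI augmentation $\xi$ is modeled as a small random center shift plus a random size jitter. The jitter ranges and the default threshold `eps=0.5` are placeholders, since the text does not give these values, and all function names are hypothetical.

```python
import random


def scale_box(box, s):
    """Input augmentation h applied to a box (global scaling)."""
    cx, cy, cz, dx, dy, dz, eta = box
    return (cx * s, cy * s, cz * s, dx * s, dy * s, dz * s, eta)


def roi_augment(box, rng, max_shift=0.2, max_scale=0.1):
    """Hypothetical RoI augmentation xi: jitter the center and the size."""
    cx, cy, cz, dx, dy, dz, eta = box
    k = 1.0 + rng.uniform(-max_scale, max_scale)
    return (cx + rng.uniform(-max_shift, max_shift),
            cy + rng.uniform(-max_shift, max_shift),
            cz + rng.uniform(-max_shift, max_shift),
            dx * k, dy * k, dz * k, eta)


def prepare_rois(teacher_proposals, s, rng, **aug):
    """Build index-aligned RoIs: xi(h(G_t)) for the student, xi'(G_t) for the teacher."""
    student_rois = [roi_augment(scale_box(g, s), rng, **aug) for g in teacher_proposals]
    teacher_rois = [roi_augment(g, rng, **aug) for g in teacher_proposals]
    return student_rois, teacher_rois


def positive_indices(student_scores, teacher_scores, eps=0.5):
    """S_pos: proposals that both models classify above the threshold eps."""
    return [i for i, (a, b) in enumerate(zip(student_scores, teacher_scores))
            if a > eps and b > eps]


rng = random.Random(0)
proposals = [(10.0, 2.0, -1.0, 4.0, 1.6, 1.5, 0.3),
             (20.0, -3.0, -1.0, 4.2, 1.7, 1.6, 1.2)]
student_rois, teacher_rois = prepare_rois(proposals, 1.05, rng)
pos = positive_indices([0.9, 0.2, 0.8], [0.7, 0.9, 0.4])
```

Because the student RoIs are derived from the teacher proposals one by one, the i-th student prediction and the i-th teacher prediction refer to the same instance, so no IoU-based matching is needed before computing Equations (6) and (7).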

3.6. Neural Statistics-Level Consistency

While the student model takes both source domain data $x_s$ and target domain data $\hat{x}_t$ as input, the teacher model only has access to the target data $x_t$. The distribution shift lying between source and target data could lead to mismatched batch statistics between the batch normalization (BN) layers of the student and teacher models. This mismatch could cause misaligned normalization and, in turn, lead to an unstable training process with degraded performance or even divergence. We provide an in-depth analysis regarding this matter in Section 4.4.

To mitigate this issue, we propose to use the running statistics of the student model BN layers for the teacher model during the training process. Specifically, for each BN layer in the student model, the batch mean $\mu$ and variance $\sigma$ are used to update the running statistics at every iteration:

$$\mu' = (1 - \alpha)\mu' + \alpha\mu \tag{8}$$

$$\sigma' = (1 - \alpha)\sigma' + \alpha\sigma \tag{9}$$

where $\mu'$ and $\sigma'$ are the running mean of $\mu$ and $\sigma$ and $\alpha$ is the BN momentum that controls the speed of batch statistics updating the running statistics. For the teacher model, we use $\mu'$ and $\sigma'$ instead of the batch statistics for all the BN layers to normalize the layer inputs. We argue that this modification closes the gap caused by domain mismatch and leads to more stable training behavior. We empirically demonstrate the effectiveness by comparing the performance under different BN settings in Section 4.3.
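A minimal single-channel sketch of Equations (8) and (9) is given below. The class `RunningStats`, its initial values (mean 0, variance 1), the use of the population variance, and the toy activations are assumptions made for illustration; a real BN layer tracks one pair of statistics per channel.

```python
import math


class RunningStats:
    """Running mean/variance of one BN channel, updated as in Eq. (8) and (9)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha   # BN momentum
        self.mean = 0.0      # mu'
        self.var = 1.0       # sigma'

    def update(self, batch):
        # Batch statistics of the student (mixed source and target activations).
        mu = sum(batch) / len(batch)
        sigma = sum((v - mu) ** 2 for v in batch) / len(batch)
        self.mean = (1.0 - self.alpha) * self.mean + self.alpha * mu
        self.var = (1.0 - self.alpha) * self.var + self.alpha * sigma
        return mu, sigma

    def normalize(self, values, eps=1e-5):
        # The teacher normalizes with the running statistics, not batch statistics.
        return [(v - self.mean) / math.sqrt(self.var + eps) for v in values]


student_bn = RunningStats(alpha=0.5)
student_bn.update([1.0, 3.0])                # student batch: mean 2.0, variance 1.0
teacher_out = student_bn.normalize([1.0])    # teacher reuses the student's statistics
```

In this toy run the running mean moves from 0.0 to 1.0 and the running variance stays at 1.0, so a teacher activation of 1.0 is normalized to 0.0 using statistics that reflect both domains rather than the target batch alone.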

4. Experiments

We first introduce the popular autonomous driving datasets used in the experiments, including KITTI [10], Waymo Open Dataset [28], and nuScenes [3] (Section 4.1). We then benchmark MLC-Net across datasets, where it achieves a consistent performance boost (Section 4.2). Moreover, we ablate MLC-Net to give a comprehensive assessment of its submodules and justify our design choices (Section 4.3). Finally, we further investigate the challenges of unsupervised domain adaptive 3D detection and show that MLC-Net successfully addresses them (Section 4.4). Due to the space constraint, we include the implementation details in the Supplementary Material.

4.1. Datasets

We follow [30] to evaluate MLC-Net on various source-target combinations with the following datasets.

KITTI. KITTI [10] is a popular autonomous driving dataset that consists of 3,712 training samples and 3,769 validation samples. The 3D bounding box annotations are only provided for objects within the Field of View (FoV) of the front camera. Therefore, points outside of the FoV are ignored during training and evaluation. We use the official KITTI evaluation metrics, where the objects are categorized into three levels (Easy, Moderate, and Hard) based on the number of pixels, occlusion and truncation levels.

Waymo Open Dataset. The Waymo Open Dataset (referred to as Waymo) [28] is a large-scale benchmark that contains 122,000 training samples and 30,407 validation samples. We subsample one-tenth of the training and validation sets. To align the input convention, we apply the same front camera FoV as the KITTI dataset. The official Waymo evaluation metrics are used to benchmark the performance.

nuScenes. The nuScenes [3] dataset consists of 28,130 training samples and 6,019 validation samples. We subsample the training dataset by 50% and use the entire validation set. We also apply the same FoV on the input as for the other datasets. We adopt the official evaluation metrics of translation, scale, and orientation errors, with the addition of the commonly used average precision based on 3D IoU with a threshold of 0.7 to reflect the overall detection accuracy.

4.2. Benchmarking Results

As an emerging research area, the cross-domain point cloud detection topic has relatively small literature. To the best of our knowledge, [30] is the most relevant work that has a similar setting as our study. We compare our method with two normalization methods proposed in [30], namely Output Transformation (OT) and Statistical Normalization (SN), where the former transforms the predictions by an offset and the latter trains the detector with scale-normalized input. Moreover, we also compare with the adversarial feature alignment method, which is commonly used on image-based tasks, by adapting DA-Faster [7] to our PointRCNN [25] base model. We also provide Direct Transfer and Wide-Range Augmentation as baselines. More results can be found in the Supplementary Material.

Table 1 reports the cross-domain detection performance on four source-target domain pairs. MLC-Net outperforms all unsupervised baselines by convincing margins. We highlight that our method adapts the scale for each instance instead of applying a global shift, allowing it to surpass state-of-the-art methods that utilize target domain statistical information.

Table 1: Performance of MLC-Net on four source-target pairs in comparison with various baselines and state-of-the-art methods. MLC-Net outperforms all baselines and even surpasses SOTA methods that utilize target domain annotation information (indicated by †). Direct Transfer: the model trained on the source domain is directly tested on the target domain. Wide-Range Aug: baseline method with random scaling augmentation of a wide range which potentially includes the target domain scales. The results confirm that the drastic performance degradation cannot be fully mitigated by simple data augmentation. DA-Faster: we also compare with adversarial feature alignment [7], a common technique used in 2D domain adaptation. # indicates the implementation is adapted from 2D to 3D. However, feature alignment is unable to solve the geometric mismatch, which we argue is unique to 3D detection. The state-of-the-art work [30] proposes to perform output transformation (OT) to scale predictions and statistical normalization (SN) for scale-adjusted training examples. Both OT and SN require known target domain statistics. MLC-Net, albeit being fully unsupervised, even surpasses these methods on key metrics: APH/L2 (Waymo), AP3D (nuScenes), and AP Moderate (KITTI).

KITTI → Waymo

| Methods | AP/L1 | APH/L1 | AP/L2 | APH/L2 |
|---|---|---|---|---|
| Direct Transfer | 0.0917 | 0.0899 | 0.0794 | 0.0778 |
| Wide-Range Aug | 0.1861 | 0.1818 | 0.1677 | 0.1640 |
| DA-Faster [7]# | 0.0696 | 0.0687 | 0.0642 | 0.0633 |
| OT [30]† | 0.2648 | 0.2584 | 0.2385 | 0.2329 |
| SN [30]† | 0.3069 | 0.3006 | 0.2723 | 0.2667 |
| Ours | 0.3821 | 0.3774 | 0.3446 | 0.3404 |

Waymo → KITTI

| Methods | Easy | Moderate | Hard |
|---|---|---|---|
| Direct Transfer | 20.2213 | 21.4261 | 20.4927 |
| Wide-Range Aug | 30.2341 | 31.4959 | 32.8531 |
| DA-Faster [7]# | 4.4248 | 5.5510 | 5.5296 |
| OT [30]† | 39.7762 | 37.8212 | 39.5546 |
| SN [30]† | 61.9289 | 58.0656 | 58.4406 |
| Ours | 69.3518 | 59.4454 | 56.2913 |

KITTI → nuScenes

| Methods | ATE | ASE | AOE | AP3D |
|---|---|---|---|---|
| Direct Transfer | 0.207 | 0.248 | 0.212 | 13.0073 |
| Wide-Range Aug | 0.200 | 0.228 | 0.211 | 16.0081 |
| DA-Faster [7]# | 0.247 | 0.253 | 0.292 | 10.7661 |
| OT [30]† | 0.207 | 0.220 | 0.212 | 14.6650 |
| SN [30]† | 0.227 | 0.168 | 0.368 | 23.1491 |
| Ours | 0.197 | 0.179 | 0.197 | 23.4720 |

nuScenes → KITTI

| Methods | Easy | Moderate | Hard |
|---|---|---|---|
| Direct Transfer | 49.1303 | 39.5565 | 35.5127 |
| Wide-Range Aug | 58.7072 | 45.3730 | 43.0254 |
| DA-Faster [7]# | 52.2501 | 40.6209 | 35.9015 |
| OT [30]† | 23.1286 | 27.2584 | 29.0979 |
| SN [30]† | 44.8135 | 45.1496 | 47.5991 |
| Ours | 71.2648 | 55.4152 | 48.9880 |
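The Wide-Range Aug baseline in Table 1 relies on random scaling augmentation. A minimal sketch of such an augmentation is given below. It assumes that one scale factor is drawn per scene and applied to all points and boxes; the sampling range (`low`, `high`), the box layout (x, y, z, l, w, h, yaw), and the function name are hypothetical, since the exact configuration is not stated here.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[float, float, float, float, float, float, float]  # assumed layout


def wide_range_scale(points: Sequence[Point], boxes: Sequence[Box],
                     low: float = 0.7, high: float = 1.3,
                     rng: Optional[random.Random] = None
                     ) -> Tuple[List[Point], List[Box], float]:
    """Scale a scene and its boxes by one random factor drawn from [low, high]."""
    rng = rng or random.Random()
    s = rng.uniform(low, high)  # hypothetical range, wide enough to cover target scales
    scaled_points = [(x * s, y * s, z * s) for x, y, z in points]
    # Centres and sizes scale together; the heading angle is unaffected.
    scaled_boxes = [(x * s, y * s, z * s, l * s, w * s, h * s, yaw)
                    for x, y, z, l, w, h, yaw in boxes]
    return scaled_points, scaled_boxes, s
```

Such an augmentation widens the range of object sizes seen during training, but it does not tell the detector which size is correct in the target domain, which is consistent with the limited gains of this baseline in Table 1.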

4.3. Ablation Study

To evaluate the effectiveness of the components of MLC-Net, we conduct ablation studies on KITTI → Waymo transfer with PointRCNN as the base model.

Effectiveness of Point/Instance-Level Consistency. We study the effects of different components of the proposed consistency loss. Table 2 reports the experimental results when different combinations of loss components are applied. For both point-level and instance-level consistency, the box consistency clearly contributes more than the class consistency. This observation indicates that the scale difference is a major source of the domain gap between source and target domains with different object size distributions, which is in line with previous work [30]. It also shows that our proposed box consistency regularization effectively mitigates this gap. In addition, all losses are complementary to one another: the best result is achieved when all four are used.

Furthermore, we compare MLC-Net with two alternative approaches for point and box matching, respectively, in Table 3. In contrast to these baseline approaches, MLC-Net replicates the input point clouds and the region proposals before they are passed to the student and teacher models, eliminating any noise that arises from inaccurate matching. The results highlight the importance of correspondence in constructing meaningful consistency losses for effective unsupervised learning.

Table 2: Ablation study of point-level and instance-level consistency loss components. Results show that the loss components are highly complementary; the joint use of all four losses at two levels achieves the best performance. More importantly, we find that the bounding box regression loss, which is directly associated with bounding box scale, benefits the performance more than the classification loss. This further validates our stance that geometric mismatch is a key domain gap for 3D detection.

| $L_{pt,cls}$ | $L_{pt,box}$ | $L_{ins,cls}$ | $L_{ins,box}$ | AP/L1 | APH/L1 | AP/L2 | APH/L2 |
|---|---|---|---|---|---|---|---|
| | | | | 0.1861 | 0.1818 | 0.1677 | 0.1640 |
| ✓ | | | | 0.2034 | 0.1991 | 0.1807 | 0.1770 |
| | ✓ | | | 0.3034 | 0.2969 | 0.2708 | 0.2649 |
| ✓ | ✓ | | | 0.3100 | 0.3039 | 0.2764 | 0.2709 |
| | | ✓ | | 0.2112 | 0.2087 | 0.1879 | 0.1857 |
| | | | ✓ | 0.3321 | 0.3244 | 0.2995 | 0.2926 |
| | | ✓ | ✓ | 0.3495 | 0.3453 | 0.3143 | 0.3105 |
| ✓ | ✓ | ✓ | ✓ | 0.3821 | 0.3774 | 0.3446 | 0.3404 |
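The difference between matching by nearest neighbour and matching by replication can be illustrated with a toy sketch. It assumes that the student input is a uniformly scaled copy of the teacher input, so that the i-th student point corresponds to the i-th teacher point by construction; the function names and the toy coordinates are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_point_match(student_pts: Sequence[Point],
                        teacher_pts: Sequence[Point]) -> List[int]:
    """Baseline: pair each student point with the closest teacher point."""
    matches = []
    for sp in student_pts:
        dists = [math.dist(sp, tp) for tp in teacher_pts]
        matches.append(dists.index(min(dists)))
    return matches


def replicated_match(num_points: int) -> List[int]:
    """Replication: point i of the student is point i of the teacher."""
    return list(range(num_points))


teacher = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
student = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in teacher]  # scaled copy

print(nearest_point_match(student, teacher))  # [1, 2, 2]
print(replicated_match(len(student)))         # [0, 1, 2]
```

In this toy case the nearest-neighbour baseline pairs every student point with the wrong teacher point, whereas the index-based correspondence is exact regardless of the scale applied.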

Effectiveness of Neural Statistics-Level Consistency. We also evaluate the effectiveness of neural statistics-level consistency by comparing the performance with this alignment enabled and disabled. Table 4 shows that the model performance drops severely when neural statistics-level consistency is disabled. As analyzed in Section 3.6, when neural statistics-level consistency is not in place, the teacher model's BN layers normalize the input features using batch statistics obtained from target data only, while the student model performs BN with statistics from both source and target domains. This misalignment creates a significant gap and, as a result, invalidates the consistency computation between the student and teacher predictions. We also compare with an approach in which the student model performs separate BN for source and target data. In this case, although the normalization of target input is performed with target statistics in both models, the mismatched normalization of source and target inputs leads to suboptimal performance compared to MLC-Net.

Table 3: Ablation study of point-level and instance-level matching methods. Nearest Point: a baseline for point matching where a point in the student input is matched to the nearest point in the teacher input using Euclidean distance. Max IoU Box: a baseline for box matching where a student box prediction is matched to the teacher pseudo label with the largest IoU. Ours: input point clouds or region proposals of the student are replicated from the teacher. We highlight that our matching method ensures accurate one-to-one correspondence, which is critical to effective teacher-student learning.

| Matching Method | AP/L1 | APH/L1 | AP/L2 | APH/L2 |
|---|---|---|---|---|
| Nearest Point | 0.0293 | 0.0286 | 0.0265 | 0.0258 |
| Max IoU Box | 0.2695 | 0.2666 | 0.2418 | 0.2392 |
| Ours | 0.3821 | 0.3774 | 0.3446 | 0.3404 |

Table 4: Ablation study of neural statistics-level consistency. The results indicate that MLC-Net effectively closes the domain gap due to neural statistics mismatch. Disabled: no consistency is enforced. Separate: the student model performs BN separately for source and target domain inputs to align with the teacher model. Enabled: our proposed neural statistics-level alignment.

| Setting | AP/L1 | APH/L1 | AP/L2 | APH/L2 |
|---|---|---|---|---|
| Disabled | 0.0279 | 0.0274 | 0.0254 | 0.0249 |
| Separate | 0.2988 | 0.2945 | 0.2685 | 0.2648 |
| Enabled | 0.3821 | 0.3774 | 0.3446 | 0.3404 |

Table 5: Ablation study of the exponential moving average (EMA) update scheme in the mean teacher paradigm. The performance degrades significantly when the exponential moving average update is disabled, highlighting the importance of the mean teacher design in producing meaningful targets.

| EMA | AP/L1 | APH/L1 | AP/L2 | APH/L2 |
|---|---|---|---|---|
| Disabled | 0.0895 | 0.0866 | 0.0835 | 0.0808 |
| Enabled | 0.3821 | 0.3774 | 0.3446 | 0.3404 |

Effectiveness of Mean Teacher. The teacher model is essentially a temporal ensemble of student models at different time stamps. We study the effectiveness of the mean teacher paradigm by comparing the performance with the exponential moving average update enabled and disabled. Table 5 shows that the moving average update mechanism is important for the teacher to generate meaningful supervision to guide the student model; removing it leads to performance deterioration.
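The exponential moving average update of the mean teacher paradigm [29] can be sketched as follows. Parameters are represented as a plain dictionary of floats for readability; the decay value `alpha` and the function name are assumptions for illustration and do not reflect the settings used in our experiments.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               alpha: float = 0.999) -> Dict[str, float]:
    """In-place update: teacher <- alpha * teacher + (1 - alpha) * student."""
    for name, student_value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student_value
    return teacher


# A student weight that oscillates between 0 and 1 at every iteration,
# and the teacher weight that tracks it through the moving average.
teacher = {"w": 0.5}
trajectory = []
for step in range(6):
    student = {"w": float(step % 2)}
    ema_update(teacher, student, alpha=0.9)
    trajectory.append(teacher["w"])

print(trajectory)
```

With `alpha=0.9`, the teacher weight stays within a narrow band around 0.5 while the student weight jumps between 0 and 1, which is the smoothing effect that makes the teacher a temporal ensemble of past students.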

Figure 4: Neural statistics mismatch across domains. We plot the distributions of batch mean and batch variance. Significant misalignment is observed, which highlights the necessity of neural statistics-level consistency.

Figure 5: Teacher and student model performance against iteration. Not only does the teacher model consistently outperform the student, but its performance curve is also smoother. Hence, the teacher model, which can be regarded as a temporal ensemble of the student model, is able to produce more stable and accurate pseudo labels to supervise the student model.

4.4. Further Analysis

Analysis of Distribution Shift. We highlight that geometric mismatch is a significant issue for cross-domain deployment of 3D detection models. In Figure 2, the object dimension (length, width, and height) distributions differ drastically across domains, with relatively small overlap. The baseline, trained on the source domain, is not able to generalize to the target domain, as the distribution of its dimension predictions remains close to that of the source domain. In contrast, MLC-Net adapts to the new domain by predicting a geometric distribution highly similar to that of the target domain.

Analysis of Neural Statistics Mismatch. Figure 4 shows that inputs from different domains have very different distributions of batch statistics, which explains the tremendous performance drop when our proposed neural statistics-level consistency is not applied to align the statistics (Table 4).
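The effect of mismatched batch statistics can be reproduced with a toy sketch. It normalizes the same target features once with target-only statistics (as the teacher would without alignment) and once with statistics pooled over source and target (as the student would). The feature values and the function name are made up for illustration, and the sketch shows only the mismatch, not the alignment mechanism itself.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def batch_normalize(values: Sequence[float], stats_from: Sequence[float],
                    eps: float = 1e-5) -> List[float]:
    """Normalize `values` with the mean and variance computed on `stats_from`."""
    mean = fmean(stats_from)
    var = pvariance(stats_from, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical one-dimensional features from each domain.
source_feats = [0.0, 1.0, 2.0]
target_feats = [4.0, 5.0, 6.0]

# Teacher without alignment: statistics from target data only.
teacher_view = batch_normalize(target_feats, target_feats)
# Student: statistics from the mixed source + target batch.
student_view = batch_normalize(target_feats, source_feats + target_feats)

print(teacher_view)
print(student_view)
```

Although both calls receive identical target features, the normalized outputs differ, so any consistency loss computed between the two branches would penalize a discrepancy that comes from the normalization rather than from the predictions.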

Analysis of Teacher/Student Paradigm. In Figure 5, the teacher model in MLC-Net demonstrates stronger performance than the student throughout training until convergence. Moreover, the teacher model exhibits a smoother learning curve. This validates the effectiveness of our mean-teacher paradigm in creating accurate and reliable supervision for robust optimization of the student model.


5. Conclusion

We study unsupervised 3D domain adaptive detection that requires no target domain annotation or statistics. We validate that geometric mismatch is a major contributor to the domain shift and propose MLC-Net, which leverages a teacher-student paradigm for robust and reliable pseudo label generation via point-, instance- and neural statistics-level consistency to enforce effective transfer. MLC-Net outperforms all the baselines by convincing margins, and even surpasses methods that require additional target information.


References

[1] Idan Achituve, Haggai Maron, and Gal Chechik. Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 123–133, 2021.

[2] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.

[3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

[4] Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Lingyu Duan, and Ting Yao. Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11457–11466, 2019.

[5] Chaoqi Chen, Zebiao Zheng, Xinghao Ding, Yue Huang, and Qi Dou. Harmonizing transferability and discriminability for adapting object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8869–8878, 2020.

[6] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.

[7] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3339–3348, 2018.

[8] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[9] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208, 2017.

[10] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[11] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9224–9232, 2018.

[12] Jean-Bastien Grill, Florian Strub, Florent Altche, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.

[13] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.

[14] Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emilie Wirbel, and Patrick Perez. xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12605–12614, 2020.

[15] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.

[16] Congcong Li, Dawei Du, Libo Zhang, Longyin Wen, Tiejian Luo, Yanjun Wu, and Pengfei Zhu. Spatial attention pyramid network for unsupervised domain adaptation. In European Conference on Computer Vision, pages 481–497. Springer, 2020.

[17] Songtao Liu, Zeming Li, and Jian Sun. Self-emd: Self-supervised object detection without imagenet. arXiv preprint arXiv:2011.13677, 2020.

[18] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

[19] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.

[20] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[21] Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. Pointdan: A multi-scale 3d domain adaption network for point cloud representation. arXiv preprint arXiv:1911.02744, 2019.

[22] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6956–6965, 2019.

[23] Cristiano Saltori, Stephane Lathuiliere, Nicu Sebe, Elisa Ricci, and Fabio Galasso. Sf-uda 3d: Source-free unsupervised domain adaptation for lidar-based 3d object detection. arXiv preprint arXiv:2010.08243, 2020.

[24] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020.

[25] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–779, 2019.

[26] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[27] Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. Mvx-net: Multimodal voxelnet for 3d object detection. In 2019 International Conference on Robotics and Automation (ICRA), pages 7276–7282. IEEE, 2019.

[28] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

[29] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780, 2017.

[30] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11713–11723, 2020.

[31] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.

[32] Chang-Dong Xu, Xing-Ran Zhao, Xin Jin, and Xiu-Shen Wei. Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11724–11733, 2020.

[33] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.

[34] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040–11048, 2020.

[35] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1951–1960, 2019.

[36] Li Yi, Boqing Gong, and Thomas Funkhouser. Complete & label: A domain adaptation approach to semantic segmentation of lidar point clouds. arXiv preprint arXiv:2007.08488, 2020.

[37] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275, 2020.

[38] Yangtao Zheng, Di Huang, Songtao Liu, and Yunhong Wang. Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13766–13775, 2020.

[39] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.

[40] Xinge Zhu, Yuexin Ma, Tai Wang, Yan Xu, Jianping Shi, and Dahua Lin. Ssn: Shape signature networks for multi-class object detection from point clouds. In Proceedings of the European Conference on Computer Vision, 2020.