
Improving Panoptic Segmentation at All Scales

Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder
Facebook

{porzi,rotabulo,pkontschieder}@fb.com

Figure 1: Panoptic segmentation on high-resolution natural images is challenged with recognizing objects at a wide range of scales. Standard approaches (left) can struggle when dealing with very small (zoomed detail) or very large objects (bus on the left). By introducing a novel instance scale-uniform sampling strategy and a crop-aware bounding box loss, we are able to improve panoptic segmentation results at all scales (right).

Abstract

Crop-based training strategies decouple training resolution from GPU memory consumption, allowing the use of large-capacity panoptic segmentation networks on multi-megapixel images. Using crops, however, can introduce a bias towards truncating or missing large objects. To address this, we propose a novel crop-aware bounding box regression loss (CABB loss), which promotes predictions to be consistent with the visible parts of the cropped objects, while not over-penalizing them for extending outside of the crop. We further introduce a novel data sampling and augmentation strategy which improves generalization across scales by counteracting the imbalanced distribution of object sizes. Combining these two contributions with a carefully designed, top-down panoptic segmentation architecture, we obtain new state-of-the-art results on the challenging Mapillary Vistas (MVD), Indian Driving and Cityscapes datasets, surpassing the previously best approach on MVD by +4.5% PQ and +5.2% mAP.

1. Introduction

Panoptic segmentation [16] is the task of generating per-pixel semantic labels for an image, together with object-specific segmentation masks. It is thus a combination of semantic segmentation and instance segmentation, i.e. two long-standing tasks in computer vision that have traditionally been tackled separately. Due to its importance for tasks like autonomous driving or scene understanding, it has recently attracted a lot of interest in the research community.

The majority of deep-learning-based panoptic segmentation architectures [15, 23, 17, 29, 21] propose a combination of specialized segmentation branches, one for conventional semantic segmentation and another for instance segmentation, followed by a combination strategy to generate the final panoptic segmentation result. Instance segmentation branches in top-down panoptic architectures are predominantly designed on top of Mask R-CNN [12], i.e. a segmentation extension of Faster R-CNN [24] generating state-of-the-art mask predictions for given bounding boxes. In contrast and more recently, bottom-up panoptic architectures [6, 26] have emerged, but they still lag behind in terms of instance segmentation performance.

Panoptic segmentation networks typically solve multiple tasks (object detection, instance segmentation and semantic segmentation), and are trained on batches of full-sized images. However, with increasing task complexity and growing capacity of the network backbone, full-image training is quickly inhibited by the available GPU memory, despite the availability of memory-saving training strategies like [25, 20, 11, 14]. Obvious mitigation strategies include a reduction of the training batch size, downsizing of high-resolution training images, or building on backbones with lower capacity. These workarounds unfortunately introduce other limitations:


i) Small batch sizes can lead to higher variance in the gradients, which reduces the effectiveness of Batch Normalization [13] and consequently the performance of the resulting model. ii) Reducing the image resolution leads to a loss of fine structures, which are known to strongly correlate with objects belonging to the long tail of the label distribution. Downsampling the images consequently amplifies already existing performance issues on small and usually underrepresented classes. iii) A number of recent works [28, 5, 31] have shown that larger backbones with sophisticated strategies for maintaining high-resolution features boost panoptic segmentation results in comparison to those with reduced capacity.

A possible strategy to overcome the aforementioned issues is to move from full-image-based training to crop-based training. This was successfully used for conventional semantic segmentation [25, 3, 2], which is, however, an easier task, being limited to per-pixel classification. By fixing a certain crop size, the details of fine structures can be preserved and, at a given memory budget, multiple crops can be stacked to form reasonably sized training batches. For more complex tasks like panoptic segmentation, the simple cropping strategy also affects the performance on object detection and consequently on instance segmentation. In particular, extracting fixed-size crops from images during training introduces a bias towards truncating large objects, with the likely consequence of underestimating their actual bounding box sizes during inference on full images (see, e.g., Fig. 1 left). Indeed, Fig. 2 (left) shows that the distribution of box sizes during crop-based training on the high-resolution Mapillary Vistas [22] dataset does not match the one derived from full-image training data. In addition, Fig. 2 (right) shows that large objects (in terms of number of pixels) are drastically underrepresented, which may lead to over-fitting and thus further harm generalization.

In this paper we overcome these issues by introducing two novel contributions: 1) a crop-based training strategy exploiting a crop-aware loss function (CABB) to address the problem of cropping large objects, and 2) Instance Scale-Uniform Sampling (ISUS) as a data augmentation strategy to combat the imbalance of object scales in the training data. Our solution enjoys all the benefits of crop-based training discussed above. In addition, our crop-aware loss incentivizes the model to predict bounding boxes that are consistent with the visible parts of cropped objects, while not over-penalizing predictions outside of the crop. The underlying intuition is simple: even if an object's bounding box was modified through cropping, the actual object bounding box may be larger than what is visible to the network during training. By not penalizing hypothetical predictions beyond the visible area of a crop but still within their actual sizes, we can better model the bounding box size distribution given by the original training data.

With ISUS we introduce an effective data augmentation strategy to improve feature-pyramid-like representations as used for object detection at multiple scales. It aims at more evenly distributing the supervision of object instances across pyramid scales during training, leading to improved recognition performance for instances at all scales during inference. In our experimental analyses we find that our crop-aware loss function is particularly effective on high-resolution images as available in the challenging Mapillary Vistas [22], Indian Driving [27], or Cityscapes [8] datasets.

Contributions. We summarize our contributions to the panoptic segmentation research community as follows.

• We introduce a novel, crop-aware training loss applicable to improving bounding box detection in panoptic segmentation networks when training them in a crop-based way. At negligible computational overhead (∼10 ms per batch) we show how our new loss addresses issues of crop-based training, considerably improving the performance on disproportionately often truncated bounding boxes.

• We describe a novel Instance Scale-Uniform Sampling approach to smooth the distribution of object sizes observed by a network at training time, improving its generalization across scales.

• We significantly push the state-of-the-art results on the high-resolution Mapillary Vistas dataset, improving on multiple evaluation metrics like Panoptic Quality [16] (+4.5%) and mean average precision (mAP) for mask segmentation (+5.2%). We also obtain remarkable performance gains on IDD and Cityscapes, improving PQ by +0.6% on both, and mAP by +4.1% and +1.5%, respectively.

2. Technical Contributions

In this section we present our main methodological contributions. In particular, in Sec. 2.1 we describe a novel Instance Scale-Uniform Sampling (ISUS) approach aimed at reducing the object scale imbalance inherent in high-resolution panoptic datasets. Sections 2.2, 2.3 and 2.4 describe the Crop-Aware Bounding Box (CABB) loss, which we propose as a mitigation to the bias imposed by crop-based training on the detection of large objects.

2.1. Instance Scale-Uniform Sampling (ISUS)

Most top-down panoptic segmentation networks build on top of backbones that produce a "pyramid" of features at multiple scales. At training time, some heuristic rule [15] is applied to split the ground truth instances across the available scales, such that the network is trained to detect small objects using high-resolution features and large objects using low-resolution features. By sharing the parameters of the prediction modules (e.g. the RPN and ROI heads of [23]) across all scales, the network is incentivized to learn scale-invariant features.
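For concreteness, a rule of the kind cited above, commonly used with feature pyramids (e.g. in [15]), assigns an instance to a level based on the square root of its area. Since the exact heuristic is not spelled out here, the following Python sketch is an illustrative assumption rather than our actual implementation:

```python
import math

def assign_pyramid_level(box_w, box_h, k_min=2, k_max=6, k0=4, s0=224.0):
    """Illustrative FPN-style heuristic: instances around s0 x s0 pixels go
    to the canonical level k0, larger/smaller ones to coarser/finer levels.
    All parameter values here are assumptions, not the paper's settings."""
    k = k0 + math.log2(math.sqrt(box_w * box_h) / s0)
    return int(min(max(math.floor(k), k_min), k_max))
```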

Figure 2: Left: average intersection over union of cropped bounding boxes w.r.t. their original extent, computed using the Mapillary Vistas training settings in Sec. 4.1. Right: distribution of object scales in the Mapillary Vistas training set.

When dealing with high-resolution images, however, this approach encounters two major issues: i) the range of object scales can greatly exceed the range of scales available in the feature pyramid, and ii) the distribution of object scales is markedly non-uniform (see Fig. 2). While (i) can be partially addressed by adding more feature scales, at the cost of increased memory and computation, (ii) leads to a strong imbalance in the amount of supervision received by each level of the feature pyramid.

In order to mitigate this imbalance, we propose an extension to the Class-Uniform Sampling (CUS) approach introduced in [25], which we coin Instance Scale-Uniform Sampling (ISUS). The standard CUS data preparation process follows four steps: 1) sample a semantic class with uniform probability; 2) load an image that contains that class and re-scale it such that its shortest side matches a predefined size s0; 3) apply any data augmentation (e.g. flipping, random scaling); and 4) produce a random crop from an area of the image where the selected class is visible. In ISUS, we follow the same steps as in CUS, except that the scale augmentation procedure is made instance-aware. In particular, when a "thing" class is selected in step 1 and after completing step 2, we also sample a random instance of that class from the image and a random feature pyramid level. Then, in step 3 we compute a scaling factor σ such that the selected instance will be assigned to the selected level according to the heuristic adopted by the network being trained. In order to avoid excessively large or small scale factors, we clamp σ to a limited range rth. Conversely, when a "stuff" class is selected in step 1, we follow the standard scale augmentation procedure, i.e. we uniformly sample σ from a range rst. In the long run, ISUS has the effect of smoothing out the object scale distribution, providing more uniform supervision across all scales.
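The following Python sketch illustrates the sampling logic described above. The dataset accessors (sample_class, load_image_with_class, rescale, random_crop_containing) and the instance-to-level mapping are hypothetical placeholders, not our actual implementation:

```python
import random

def isus_sample(dataset, s0, pyramid_sizes, r_th=(0.25, 4.0), r_st=(0.5, 2.0)):
    """Sketch of Instance Scale-Uniform Sampling (ISUS)."""
    # 1) sample a semantic class with uniform probability
    cls = dataset.sample_class()
    # 2) load an image containing that class, shortest side rescaled to s0
    image, instances = dataset.load_image_with_class(cls, shortest_side=s0)

    if dataset.is_thing(cls):
        # pick a random instance of the class and a random pyramid level
        inst = random.choice([i for i in instances if i.cls == cls])
        target_size = random.choice(pyramid_sizes)  # canonical size of the level
        # 3) scale factor mapping the instance to the chosen level
        sigma = target_size / max(inst.scale, 1.0)  # inst.scale ~ sqrt(area)
        sigma = min(max(sigma, r_th[0]), r_th[1])   # clamp to r_th
    else:
        # "stuff" class: standard uniform scale augmentation in r_st
        sigma = random.uniform(*r_st)

    image, instances = dataset.rescale(image, instances, sigma)
    # 4) random crop from an area where the selected class is visible
    return dataset.random_crop_containing(image, instances, cls)
```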

2.2. Bounding box regression

Most top-down panoptic segmentation approaches encode object bounding boxes in terms of offsets with respect to a set of reference boxes [17, 23, 21]. These reference boxes can be fixed, e.g. the "anchors" in the region proposal stage, or be the output of a different network section, e.g. the "proposals" in the detection stage. The goal of a network component that predicts bounding boxes is to regress these offset values given the input image (or derived features thereof).

A ground-truth bounding box G is encoded in terms of a center cG ∈ R² and dimensions dG ∈ R². Each ground-truth box is assigned a reference (or anchor) bounding box A with center cA ∈ R² and dimensions dA ∈ R². The ground truth for the training procedure is then encoded in relative terms and is specifically given by ∆G = (δG, ωG), where

    δG = (cG − cA) / dA ∈ R²   and   ωG = dG / dA ∈ R² .

Here and later, we implicitly assume for notational convenience that operations and functions applied to vectors work element-wise unless otherwise stated. We will also use the notation ⊖ to denote the operation above that returns ∆G given bounding boxes G and A, i.e. ∆G = G ⊖ A. Similarly, given an anchor bounding box A and ∆P = (δP, ωP), we can recover the predicted bounding box P with center cP and dimensions dP as

    cP = cA + δP dA   and   dP = ωP dA .

Standard bounding box loss [24]. To train the network, the following per-box loss is minimized over the training dataset:

    LBB(∆P; ∆G) = ‖ℓβ(δP − δG) + ℓβ(log ωP − log ωG)‖₁ ,   (1)

where ‖·‖₁ is the 1-norm and ℓβ denotes the Huber (a.k.a. smooth-L1) norm with parameter β > 0, i.e.

    ℓβ(z) = z² / (2β)  if |z| ≤ β ,   and   ℓβ(z) = |z| − β/2  otherwise,

where |z| gives the absolute value of z.
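As a reference, a minimal NumPy sketch of the encoding ∆ = G ⊖ A and of the standard per-box loss in (1); it is a plain transcription of the formulas above, not our code:

```python
import numpy as np

def encode(box_c, box_d, anchor_c, anchor_d):
    """Relative encoding Delta = (delta, omega) of a box w.r.t. an anchor."""
    delta = (box_c - anchor_c) / anchor_d
    omega = box_d / anchor_d
    return delta, omega

def huber(z, beta=1.0):
    """Element-wise Huber (smooth-L1) norm with parameter beta."""
    z = np.abs(z)
    return np.where(z <= beta, 0.5 * z ** 2 / beta, z - 0.5 * beta)

def l_bb(delta_p, omega_p, delta_g, omega_g, beta=1.0):
    """Standard bounding box loss of Eq. (1)."""
    return np.sum(huber(delta_p - delta_g, beta) +
                  huber(np.log(omega_p) - np.log(omega_g), beta))
```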


2.3. Crop-Aware Bounding Box (CABB)

In standard crop-based training, a ground-truth bounding box G from the original image that overlaps with the cropping area C is typically cropped, yielding a new bounding box denoted by G|C.¹ Accordingly, the actual ground truth ∆G that is used in the loss (1) is the result of ∆G = G|C ⊖ A. Training with this modified ground truth, however, poses some issues, namely a bias towards cutting or missing big objects at inference time (see, e.g., Fig. 1 and 6).

The solution we propose in this work consists in relaxing the notion of ground-truth bounding box G into a set of ground-truth boxes that coincide with G|C after the cropping operation. We denote by ρ(G, C) the function that computes this set for a given ground-truth box G and cropping area C, i.e.

    ρ(G, C) = {X ∈ B : X|C = G|C} ,

where X runs over all possible bounding boxes B. We refer to ρ(G, C) as a Crop-Aware Bounding Box (CABB), which in fact is a set of bounding boxes (see also Fig. 3). If the ground-truth bounding box G is strictly contained in the crop area then our CABB boils down to the original ground truth, since ρ(G, C) = {G} in that case.² Since we will use a representation for bounding boxes relative to some anchor box A, we also introduce the notation ρA(G, C), which returns the same set as above but with elements expressed relative to A, i.e. ρA(G, C) = {X ⊖ A : X ∈ ρ(G, C)}.

Crop-aware bounding box loss. In order to exploit the proposed, relaxed notion of ground-truth bounding box, we introduce the following new loss function for a given ground-truth box G, anchor box A and crop area C:

    LCABB(∆P) = min∆ LBB(∆P; ∆)   s.t.  ∆ ∈ ρA(G, C) .   (2)

Any bounding box in ρ(G, C) is compatible with the cropped ground-truth box we observe and thus could potentially be a valid prediction. To disambiguate, our new loss favours the solution closest to the actual prediction of the network, in order to enforce a smoother training dynamic. Since the ground-truth box that is typically adopted for the standard loss in (1) belongs to the feasible set of the minimization in our new loss, LCABB lower-bounds LBB.

¹ When masks are available, as in instance or panoptic segmentation, the cropping operation is performed at the mask level and the bounding box is recomputed a posteriori. We implicitly assume that this is the case if a ground-truth mask is available for G.

² To simplify the description, we deliberately neglect the fact that a bounding box strictly contained in the original image and touching the boundary of the crop area should not be extended beyond the crop. However, our approach can be easily adapted to address these edge cases.

Figure 3: Example of Crop-Aware Bounding Boxes (CABB). We show 4 ground-truth boxes, three of which fall partially outside the crop area. The corresponding set ρ(G, C), a.k.a. CABB, consists of all rectangular bounding boxes that can be formed by moving the white-bordered corners within the feasible areas (depicted in blue). Note that the areas extend to infinity but are truncated here.

2.4. Computational Aspects

This section focuses on the computational aspects of our new loss. In particular, we address the problem of evaluating it by solving the internal minimization, as well as computing its gradient.

The minimization problem that is nested into our new loss has no straightforward solution, since it is neither convex nor quasi-convex and, in general, local, non-global solutions might exist. Its feasible set is convex in ∆ = (δ, ω), since it can be written in terms of linear equalities and inequalities. Each dimension gives rise to an independent set of constraints and, since the objective function is also separable with respect to dimension-specific variables, the whole minimization problem can be separated into two independent minimization problems, each involving only dimension-specific variables.

Feasible set. Assume without loss of generality that the cropping area C is a box with top-left coordinate (0, 0) and bottom-right coordinate dC ∈ R². Then the feasible set of each dimension-specific minimization problem can be written as:

• δ − ω/2 ≤ −cA/dA  if cG ≤ dG/2,  else  δ − ω/2 = δG − ωG/2 , and

• δ + ω/2 ≥ (dC − cA)/dA  if cG ≥ dC − dG/2,  else  δ + ω/2 = δG + ωG/2 ,

where we dropped the boldface style from the vector-valued variables to emphasize that the constraint is specified for a single dimension.

Optimization problem. We will now enumerate the different cases characterizing the feasible set and, for each of them, provide the dimension-specific optimization problem that should be solved. Akin to the feasible set above, all variables involved from here on refer implicitly to a single dimension.

• If dG/2 < cG < dC − dG/2 then ∆⋆ = (δG, ωG) is the solution to the minimization problem in (2) for the dimension under consideration, since the feasible set is a singleton in this case.

• If cG > dG/2 and cG ≥ dC − dG/2, we obtain an optimization problem in the variable ω of the form

    min_ω  ℓβ((ω − ω̄)/2) + ℓβ(log(ω) − log(ωP))   s.t.  ω ≥ b1 − a1 ,   (O1)

where a1 = δG − ωG/2, b1 = (dC − cA)/dA and ω̄ = 2(δP − a1). If ω⋆ is a solution to (O1) then ∆⋆ = (a1 + ω⋆/2, ω⋆) is a solution to the minimization problem in (2) for the dimension under consideration.

• If cG ≤ dG/2 and cG < dC − dG/2, we obtain an optimization problem like (O1) but with a1 = −cA/dA, b1 = δG + ωG/2 and ω̄ = 2(b1 − δP). If ω⋆ is a solution to (O1) under this parametrization then ∆⋆ = (b1 − ω⋆/2, ω⋆) is a solution to the minimization problem in (2) for the dimension under consideration.

• If dC − dG/2 ≤ cG ≤ dG/2 then we obtain an optimization problem of the form

    min_{δ,ω}  ℓβ(δ − δP) + ℓβ(log(ω) − log(ωP))   s.t.  δ − ω/2 ≤ a2 ,  δ + ω/2 ≥ b2 ,   (O2)

where a2 = −cA/dA and b2 = (dC − cA)/dA. Solutions to (O2) map directly to solutions of (2) for the dimension under consideration.

We focus now on finding the solution to the optimization problems (O1) and (O2).

Solution to (O1). As mentioned before, the optimization problem in (2) is in general non-convex and might have multiple local minima. The same holds true for the problem in (O1), despite it having a single variable. Nonetheless, we devised an ad-hoc solver for this problem that quickly converges to a global solution with the desired precision. We provide the details in Appendix B.

Solution to (O2). To solve this problem we break it down into cases. We start by noting that the solution to the unconstrained optimization problem is trivially given by δ⋆ = δP and ω⋆ = ωP, because 0 is the minimizer of ℓβ. The solution ∆⋆ = (δ⋆, ω⋆) is valid for (O2) if it satisfies the constraints, which is easy to check by substitution. If this is the case, we have found the solution; otherwise no solution exists in the interior of the feasible set (see Prop. 2 in Appendix A), and the solution lies on the boundary of the feasible set. Accordingly, we start by forcing the first constraint to be active. This yields an instance of (O1) with a1 = a2, b1 = b2 and ω̄ = 2(δP − a2), which can be solved using the algorithm from Appendix B, yielding ω⋆₁. By substituting it into the activated constraint we obtain the other variable δ⋆₁ = a2 + ω⋆₁/2. Next, we move to activating the second constraint. This yields again an instance of the same optimization problem, with the only difference being ω̄ = 2(b2 − δP). Again we solve it, obtaining ω⋆₂, and by substitution into the activated constraint we get δ⋆₂ = b2 − ω⋆₂/2. We finally retain the solution among (δ⋆₁, ω⋆₁) and (δ⋆₂, ω⋆₂) yielding the lowest objective. See Alg. 2 for further details.

Gradient. For the sake of training a neural network, we are interested in computing gradients of the new loss function, which exhibits a nested optimization problem. The following result shows that the derivative of the new loss function is equivalent to the derivative of the original one, with the ground-truth box replaced (as a constant) by the solution to the internal minimization problem. In general, the solution to the internal minimization problem is a function of ∆P, but the following result states that no gradient term originates from this dependency. This is indeed a direct consequence of the envelope theorem [1].

Proposition 1. Let φ be a function returning the minimizer in (2) given ∆P, i.e. LCABB(∆P) = LBB(∆P; φ(∆P)) holds for any ∆P. Then

    d/d∆P LCABB(∆P) = ∂/∂∆P LBB(∆P; ∆) |_{∆ = φ(∆P)} .
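In a deep-learning framework, Prop. 1 translates into solving the inner problem outside of the autodiff graph and then applying the standard loss against the detached minimizer. A minimal PyTorch-style sketch, assuming a hypothetical per-dimension solver solve_cabb_target standing in for the case analysis of Sec. 2.4:

```python
import torch
import torch.nn.functional as F

def cabb_loss(delta_p, omega_p, crop_constraints, beta=1.0):
    """Crop-aware loss following Prop. 1: the inner minimizer is treated as
    a constant w.r.t. the prediction. `solve_cabb_target` is a hypothetical
    stand-in for Algs. 1-2 of Appendix B."""
    with torch.no_grad():  # no gradient flows through the inner minimizer
        delta_star, omega_star = solve_cabb_target(
            delta_p, omega_p, crop_constraints)
    loss = (F.smooth_l1_loss(delta_p, delta_star,
                             beta=beta, reduction="none")
            + F.smooth_l1_loss(torch.log(omega_p), torch.log(omega_star),
                               beta=beta, reduction="none"))
    return loss.sum(dim=-1)
```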

3. Related Works

After scrutinizing the literature, we have found no other work directly addressing the specific challenges of training panoptic segmentation networks on high-resolution data, nor the bias introduced by crop-based training. Indeed, to our knowledge, we are tackling these issues for the first time. In the literature we find several methods for panoptic segmentation that are architecture-wise compatible with our CABB loss and ISUS, among which are EfficientPS [21], AUNet [18], TASCNet [17], Panoptic-FPN [15], UPSNet [29] and Seamless Scene Segmentation [23], to mention a few. Indeed, those approaches rely on the computation of bounding boxes at some stage, and employ network backbones that produce multi-scale feature pyramids.


Figure 4: Overview of the main functional blocks of our network. Red: network body, i.e. HRNet-W48+. Green: instance segmentation section, composed of an FPN module followed by a Region Proposal Head (RPH) and a mask segmentation head. Blue: semantic segmentation section, i.e. DeepLabv3 head. Yellow: final panoptic fusion step.

Among them, only the first two report crop-based training results in their original works, while the remaining ones report full-image training results. This, however, does not mean that the latter approaches would not benefit from crop-based training. Indeed, in this work we perform experiments using Seamless Scene Segmentation as a baseline and show that significant improvements derive from a crop-based training protocol. Other panoptic segmentation methods that benefit from crop-based training are AdaptIS [26], DeeperLab [30], SSAP [10] and Panoptic-Deeplab [6]. The latter approaches, however, are neither based on bounding boxes nor do they employ feature pyramids; thus our contributions do not directly apply to them. More broadly, recent works dealing with high-resolution image data include RefineNet [19] or CascadePSP [7], which, however, address the task of conventional semantic segmentation rather than panoptic segmentation.

4. Experimental Results

We evaluate our proposed CABB loss on the three largest publicly available, high-resolution panoptic segmentation datasets: Mapillary Vistas [22] (MVD), the Indian Driving Dataset [27] (IDD) and Cityscapes [9] (CS). MVD comprises 18k training, 2k validation, and 5k testing images, with resolutions ranging from 2 to 22 Mpixels and averaging 8.8 Mpixels, and annotations covering 65 semantic classes, 37 of which instance-specific. IDD comprises 7k training, 1k validation, and 2k testing images, most captured at a 2 Mpixels resolution and annotated with 26 semantic classes, 9 of them instance-specific. Cityscapes comprises 3k training, 500 validation, and 1.5k testing images, captured at 2 Mpixels resolution and annotated with 19 classes, 8 of which instance-specific. Next, we present detailed ablation studies and a comparison with recent state-of-the-art panoptic segmentation approaches.

4.1. Network and Training Details

Our CABB loss and ISUS, described in Sec. 2, can be used in most top-down panoptic segmentation networks. To evaluate their effects, however, we focus our attention on a specific architecture, carefully crafted to achieve state-of-the-art performance on high-resolution datasets already without using either. In particular, we follow the general framework of Seamless Scene Segmentation [23], with several modifications described below (see Fig. 4). First, we replace the ResNet-50 "body" with HRNetV2-W48+ [28, 6], a specialized backbone which preserves high-resolution information from the image to the final stages of the network. Second, we replace the "Mini-DL" segmentation head from [23] with a DeepLabV3+ [4] module, connected to the HRNetV2-W48+ body as described in [6]. As in [23], we apply synchronized InPlace-ABN [25] throughout the network. Finally, the CABB loss is used to replace the standard bounding box regression loss in both the region proposal and object detection modules.
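As a rough structural summary of Fig. 4 (not our code; the sub-module classes passed to the constructor are illustrative assumptions), the network composes the following blocks:

```python
import torch.nn as nn

class PanopticNet(nn.Module):
    """Illustrative skeleton of the architecture in Fig. 4; the concrete
    sub-modules are placeholders for the components described above."""
    def __init__(self, body, fpn, rpn_head, mask_head, sem_head, fusion):
        super().__init__()
        self.body = body            # HRNetV2-W48+ backbone
        self.fpn = fpn              # feature pyramid for the instance branch
        self.rpn_head = rpn_head    # region proposals (CABB loss on boxes)
        self.mask_head = mask_head  # detection + mask segmentation
        self.sem_head = sem_head    # DeepLabV3+ semantic head
        self.fusion = fusion        # panoptic fusion of the two branches

    def forward(self, image):
        feats = self.body(image)
        pyramid = self.fpn(feats)
        proposals = self.rpn_head(pyramid)
        instances = self.mask_head(pyramid, proposals)
        semantic = self.sem_head(feats)
        return self.fusion(instances, semantic)
```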

We train our networks with stochastic gradient descent on 8 NVIDIA V100 GPUs with 32 GB of memory. The HRNetV2-W48+ backbone is initialized from ImageNet pre-training in the MVD and IDD experiments, while the Cityscapes networks are fine-tuned from their MVD-trained counterparts. We fix the crop size to 1024 × 1024 for MVD, and to 512 × 512 for IDD and Cityscapes due to their lower resolution, while inference is always performed on full images. Average inference time on MVD is ∼1.2 s per image. To reduce inter-run variability and obtain more comparable results, we fix all sources of randomness that can be easily controlled, resulting in the same sequence of images and initial network weights across all our trainings. For a detailed breakdown of the training hyper-parameters refer to Appendix D.

4.2. Comparison with State of the Art

We provide a comparison of results in Table 1, with baselines including methods trained on full images (TASCNet [17], Seamless [23]) and crops (AdaptIS [26], EfficientPS [21], Panoptic Deeplab [6]), as well as multiple different backbones (EfficientNet in EfficientPS, ResNet-50 in Seamless and TASCNet, ResNeXt-101 in AdaptIS, Xception-71 and HRNet-W48+ in Panoptic Deeplab). We consider several different variants of our network: (i) one using the standard bounding box regression loss and CUS, trained either on full images (FULL) or crops (CROP); (ii) one using our CABB loss and CUS, trained on crops (CROP + CABB); (iii) one using the standard bounding box regression loss and ISUS, trained on crops (CROP + ISUS); and finally (iv) one using both our CABB loss and ISUS, trained on crops (CROP + CABB + ISUS).

The MVD results at the top of Table 1 show that CROP outperforms FULL on all metrics, attesting to the advantages of crop-based training.


Network                     | C | Pre-training | PQ   | PQth | PQst | mAP  | mIoU | PC   | PCth | PCst | PQ†
TASCNet [17]                | ✗ | I            | 32.6 | 31.1 | 34.4 | 18.6 | –    | –    | –    | –    | –
AdaptIS [26]                | ✓ | I            | 35.9 | 31.5 | –    | –    | –    | –    | –    | –    | –
Seamless [23]               | ✗ | I            | 37.7 | 33.8 | 42.9 | 16.4 | 50.4 | –    | –    | –    | –
Deeplab, X71 [6]            | ✓ | I            | 37.7 | 30.4 | 47.4 | 14.9 | 55.3 | –    | –    | –    | –
EfficientPS [21]            | ✓ | I            | 38.3 | 33.9 | 44.2 | 18.7 | 52.6 | –    | –    | –    | –
Deeplab, HR48 [6]           | ✓ | I            | 40.6 | –    | –    | 17.8 | 57.6 | –    | –    | –    | –
Seamless [23] + CROP        | ✓ | I            | 39.2 | 36.5 | 42.8 | 19.0 | 50.8 | 48.8 | 41.2 | 59.0 | 41.5
Seamless [23] + CABB + ISUS | ✓ | I            | 40.5 | 38.0 | 43.7 | 19.4 | 51.0 | 50.7 | 43.1 | 60.8 | 42.9
FULL                        | ✗ | I            | 39.4 | 34.0 | 46.5 | 16.2 | 54.4 | 55.2 | 49.7 | 62.4 | 39.5
CROP                        | ✓ | I            | 43.6 | 41.9 | 45.9 | 22.3 | 54.9 | 56.2 | 52.4 | 61.2 | 45.7
CROP + CABB                 | ✓ | I            | 44.5 | 42.5 | 47.0 | 23.0 | 55.4 | 57.4 | 54.2 | 61.6 | 46.3
CROP + ISUS                 | ✓ | I            | 44.7 | 43.1 | 46.9 | 23.0 | 56.3 | 59.4 | 56.1 | 63.7 | 46.9
CROP + CABB + ISUS          | ✓ | I            | 45.1 | 43.4 | 47.4 | 23.9 | 56.3 | 60.4 | 57.2 | 64.6 | 47.2

Seamless [23]               | ✗ | I            | 47.7 | 48.9 | 47.1 | 30.1 | 69.6 | –    | –    | –    | –
EfficientPS [21]            | ✓ | I            | 50.1 | 50.7 | 49.8 | 31.6 | 71.3 | –    | –    | –    | –
FULL                        | ✗ | I            | 49.1 | 51.0 | 48.1 | 32.3 | 69.0 | 71.0 | 76.2 | 68.3 | 50.5
CROP                        | ✓ | I            | 50.3 | 52.5 | 49.1 | 35.3 | 69.7 | 70.8 | 73.8 | 69.2 | 51.4
CROP + CABB + ISUS          | ✓ | I            | 50.7 | 52.9 | 49.5 | 35.7 | 70.4 | 72.8 | 78.1 | 70.0 | 51.9

Seamless [23]               | ✗ | I, V         | 65.0 | 60.7 | 68.0 | –    | 80.7 | –    | –    | –    | –
Deeplab, X71 [6]            | ✓ | I, V         | 65.3 | –    | –    | 38.8 | 82.5 | –    | –    | –    | –
EfficientPS [21]            | ✓ | I, V         | 66.1 | 62.7 | 68.5 | 41.9 | 81.0 | –    | –    | –    | –
FULL                        | ✗ | I, V         | 66.0 | 61.7 | 69.1 | 39.5 | 64.2 | 80.8 | 79.9 | 81.4 | 64.2
CROP                        | ✓ | I, V         | 66.6 | 61.1 | 69.5 | 42.2 | 81.7 | 81.3 | 80.0 | 82.3 | 64.4
CROP + CABB + ISUS          | ✓ | I, V         | 66.7 | 62.4 | 69.9 | 43.4 | 82.6 | 82.6 | 82.4 | 82.7 | 65.1

Table 1: State-of-the-art results on Mapillary Vistas (top), the Indian Driving Dataset (middle), and Cityscapes (bottom), compared with variants of our network. A ✓ symbol in column "C" indicates crop-based training. "Deeplab" abbreviates Panoptic Deeplab [6]. "I" and "V" indicate pre-training on ImageNet and Mapillary Vistas, respectively.

Both our CABB loss and ISUS separately lead to consistent improvements w.r.t. CROP on all aggregate and pure recognition metrics. The effects of CABB and ISUS are explored in more detail in Sec. 4.3. We also see that even the weakest among our network variants surpasses all PQ baselines, the only exception being the HRNet-W48-based version of Panoptic Deeplab. After introducing all of our contributions in CROP + CABB + ISUS, we establish a new state of the art on Mapillary Vistas, surpassing existing approaches by wide margins (e.g. +4.5% PQ, +5.2% mAP).

The IDD experiments in the middle of Table 1 show similar results: CROP outperforms FULL on most metrics, while CABB + ISUS bring further improvements, most pronounced in PC. Compared to prior works, we observe much improved mAP scores and state-of-the-art PQ, while segmentation metrics lag a bit behind. One possible explanation could be the advanced panoptic fusion strategy adopted in EfficientPS, which particularly aims at improving instance segmentation.

We observe the same trends in the Cityscapes results reported at the bottom of Table 1, although with reduced margins. While Cityscapes is smaller than IDD and MVD, and some metrics are already quite saturated, we still obtain a notable +1.5% mAP gain with our CROP + CABB + ISUS setting over the previous state of the art.

4.3. Detailed Analysis

After presenting our new high scores for MVD, IDD and Cityscapes in the previous section, we now provide in-depth analyses of CABB and ISUS. First, to validate the generality of our proposals, we evaluate crop-based training, our CABB loss, and ISUS when applied to the approach of Porzi et al. [23]. We report the results in Table 1 under two settings, both trained on 1024 × 1024 crops: the unmodified network from [23], reproduced from their original code (Seamless + CROP), and the same network combined with our CABB loss and ISUS (Seamless + CABB + ISUS).


Figure 5: Mean Average Precision results on Mapillary Vistas (box mAP, left; mask mAP, right), averaged over different size-based subdivisions of the validation instances, for the FULL, CROP, CROP + CABB, CROP + ISUS and CROP + CABB + ISUS variants. The reported ranges are percentiles of the distribution of instance areas in the validation set.

Figure 6: Ground truth (first row) and panoptic segmentation results on the Mapillary Vistas validation set obtained with CROP (second row) and CROP + CABB + ISUS (third row). Notice how CROP + CABB + ISUS is able to detect very big instances which are completely missed by CROP. This figure is best viewed on screen and at magnification.

Consistent with our other results, the introduction of crop-based training brings clear improvements over the baseline, particularly in detection metrics, while the CABB loss and ISUS further boost the scores, achieving a +2.8% improvement in PQ w.r.t. Seamless. Further ablations on ISUS are reported in Appendix E.

As discussed in Sec. 1 and 2, we expect crop-based training to have a negative impact on large objects, which we aim to mitigate with our CABB loss, while our ISUS should bring improvements across all scales by smoothing out the object size imbalance. To verify this, in Fig. 5 we plot box (left) and mask (right) mAP scores as a function of object size (i.e. area), splitting the validation instances into five categories according to size percentiles.

As expected, CROP outperforms FULL by a wide margin on smaller objects, as it is able to work at almost double the input resolution. On the other hand, the gap between CROP and FULL shrinks as object size increases, with FULL finally surpassing CROP on the largest objects. By adding CABB, the crop-based network is able to close the gap with FULL when dealing with objects in the 99th size percentile, while maintaining strong performance in all other size categories. ISUS brings generalized improvements over CROP at most scales, with the exception of the smallest one. More surprisingly, ISUS appears to be similarly beneficial to CABB on the largest objects.


A possible explanation is that, by increasing generalization across scales, ISUS allows the network to properly infer the sizes and positions of objects that are bigger than the training crop. Finally, when CABB and ISUS are combined, we observe consistent improvements at all sizes.

In Table 1 we report additional comparisons between our network variants, based on PC and PQ† (see Appendix C). On all datasets, we observe a clear improvement in these metrics when the CABB loss and ISUS are introduced in the network. In particular, the gap between CROP and CROP + CABB + ISUS in PCth is markedly larger than in PQth. This is unsurprising, as the PC metrics weight image segments proportionally to their size, clearly highlighting how the CABB loss is able to boost the network's accuracy on large instances. This is also visible in the qualitative results in Fig. 6, showing a comparison between the outputs of CROP and CROP + CABB + ISUS on 12 Mpixels Mapillary Vistas validation images featuring large objects.

5. Conclusions

In this paper we have tackled the problem of training panoptic segmentation networks on high-resolution images, using crop-based training strategies to enable the use of modern, high-capacity architectures. Training on crops has a negative impact on the detection of large objects, which we addressed by introducing a novel crop-aware bounding box regression loss. To counteract the imbalanced distribution of object sizes, we further proposed a novel data sampling and augmentation strategy which we have shown to improve generalization across scales. By combining these with a state-of-the-art panoptic segmentation architecture we achieved new top scores on the Mapillary Vistas dataset, surpassing the previously best performing approaches by +4.5% PQ and +5.2% mAP. We also showed state-of-the-art results on the Indian Driving and Cityscapes datasets on multiple detection and segmentation metrics.

A. Proof of Results

Let ω0 = b1 − a1 and let ξ(ω) be the objective of (O1), with first-order derivative

    ξ′(ω) = (1/2) ℓ′β((ω − ω̄)/2) + (1/ω) ℓ′β(log(ω) − log(ωP)) .   (3)

The first-order derivative of the Huber loss is given by ℓ′β(x) = max(min(β⁻¹ x, 1), −1). We assume that ωP > 0 and ω0 > 0.

Proposition 2. If a strictly feasible local solution (δ⋆, ω⋆) of (O2) exists, then δ⋆ = δP and ω⋆ = ωP.

Proof. Let δλ = λδP + (1 − λ)δ⋆ and ωλ = λωP + (1 − λ)ω⋆. By contradiction, assume that a strictly feasible local solution (δ⋆, ω⋆) exists such that (δ⋆, ω⋆) ≠ (δP, ωP). Then we expect d/dλ ϕ(δλ, ωλ)|λ=0 = 0, where ϕ(δ, ω) denotes the objective of (O2). However,

    d/dλ ϕ(δλ, ωλ)|λ=0 = d/dλ ℓβ(δλ − δP)|λ=0 + d/dλ ℓβ(log(ωλ) − log(ωP))|λ=0
                       = (δP − δ⋆) ℓ′β(δ⋆ − δP) + ((ωP − ω⋆)/ω⋆) ℓ′β(log(ω⋆) − log(ωP))

is negative because ℓ′β(x) < 0 if x < 0 and ℓ′β(x) > 0 if x > 0, the logarithm is an order-preserving mapping and ω⋆ > 0. This yields a contradiction, thus proving the result.

Proposition 3. ξ′(ω) < 0 for all 0 < ω < min{ω̄, ωP} and ξ′(ω) > 0 for all ω > max{ω̄, ωP}.

Proof. For 0 < ω < ω̄ we have that ℓ′β((ω − ω̄)/2) < 0 and for 0 < ω < ωP we have that ℓ′β(log(ω) − log(ωP)) < 0. Accordingly, ξ′(ω) < 0 for 0 < ω < min{ω̄, ωP}.

Similarly, for ω > ω̄ we have that ℓ′β((ω − ω̄)/2) > 0 and for ω > ωP we have that ℓ′β(log(ω) − log(ωP)) > 0. Accordingly, ξ′(ω) > 0 for ω > max{ω̄, ωP}.

Proposition 4. If max{ω̄, ωP} ≤ ω0 then ω0 is the solution to (O1).

Proof. By Prop. 3, ξ′(ω) > 0 for all ω > max{ω̄, ωP}. Accordingly, the same holds true for all ω ≥ ω0, which implies that ξ(ω0) yields the lowest feasible objective value.

Proposition 5. A solution to (O1) exists in [max{ω0, min{ω̄, ωP}}, max{ω̄, ωP}] if ω0 ≤ max{ω̄, ωP}.

Proof. A feasible solution ω < max{ω0, min{ω̄, ωP}} exists only if ω0 ≤ ω < min{ω̄, ωP}. If this is the case, ξ′(ω) < 0 holds in the latter interval by Prop. 3. Accordingly, for ω ≤ min{ω̄, ωP} the best objective is attained at min{ω̄, ωP}. Similarly, by Prop. 3, ξ′(ω) > 0 if ω > max{ω̄, ωP} and, therefore, for ω ≥ max{ω̄, ωP} the best objective is attained at max{ω̄, ωP}. Hence, a solution to (O1) exists in the required interval.

Proposition 6. If max{ω0, ω̄} < ωP then a solution to (O1) exists in [max{ω0, ω̄}, ωP], where ξ′ is strictly increasing.

Proof. For all ω̄ ≤ ω < ω′ we have that 0 ≤ ℓ′β((ω − ω̄)/2) ≤ ℓ′β((ω′ − ω̄)/2). Moreover, for all 0 < ω < ω′ ≤ ωP, both ℓ′β(log(ω) − log(ωP)) ≤ ℓ′β(log(ω′) − log(ωP)) ≤ 0 and 1/ω > 1/ω′ > 0 hold, which imply (1/ω) ℓ′β(log(ω) − log(ωP)) < (1/ω′) ℓ′β(log(ω′) − log(ωP)) ≤ 0. It follows that ξ′(ω) < ξ′(ω′) holds in the required interval.


Proposition 7. η(ω) = (1/ω) ℓ′β(log(ω) − log(ωP)) is

• strictly increasing in (0, e^min{β,1} ωP], and

• strictly decreasing for ω ≥ e ωP.

Proof. η(ω) is strictly increasing for 0 < ω < e^{−β} ωP, because in this case η(ω) = −1/ω, and strictly decreasing for ω > e^β ωP, because in this case η(ω) = 1/ω. For e^{−β} ωP ≤ ω ≤ e^β ωP we have η(ω) = (1/(ωβ)) (log(ω) − log(ωP)) and

    η′(ω) = (1/(ω²β)) [1 − log(ω) + log(ωP)] .

Since η′(e^{−β} ωP) > 0, η′(e^β ωP) < 0 and η′(ω) = 0 only at ω = e ωP, it follows by continuity of η′ that η′(ω) > 0 for ω < e ωP and η′(ω) < 0 for ω > e ωP. Accordingly, η(ω) is strictly increasing in [e^{−β} ωP, e^min{β,1} ωP]. Since the same holds for 0 < ω < e^{−β} ωP as shown before, by continuity of η, we conclude that η(ω) is strictly increasing along the whole interval (0, e^min{β,1} ωP]. Similarly, we have that η(ω) is strictly decreasing in [e ωP, e^β ωP] and for ω > e^β ωP as shown before. Hence, by continuity of η, we conclude that η(ω) is strictly decreasing for ω ≥ e ωP.

Proposition 8. If max{ω0, ωP} < ω̄ then a solution to (O1) is either ω0 or ω̄, or it belongs to one of the following intervals:

(i) J1 = [max{ω0, ωP}, min{e^min{β,1} ωP, ω̄}],

(ii) J2 = [max{ω0, 2√β, ω̄ − 2β, e^β ωP}, ω̄],

(iii) J3 = [max{ω0, ω̄ − 2β, e ωP}, min{e^β ωP, ω̄}] if ω̄ ≤ 4√2,

(iv) J4 = [max{ω0, ω̄ − 2β, e ωP}, min{e^β ωP, ω̄, ν1}] if ω̄ > 4√2,

(v) J5 = [max{ω0, ω̄ − 2β, e ωP, ν2}, min{e^β ωP, ω̄}] if ω̄ > 4√2,

where ν1,2 = (ω̄/4)(1 ∓ √(1 − 32/ω̄²)), with ν1 ≤ ν2.

Moreover, ξ′ is strictly increasing in (i)-(ii) and σ(ω) = ω ξ′(ω) is strictly increasing in (iii)-(v).

Proof. By Prop. 5 a solution to (O1) exists in I = [max{ω0, ωP}, ω̄]. We partition I into sections where ξ′ or σ are either strictly increasing or strictly decreasing. We work by cases:

• J1. In this interval ℓ′β((ω − ω̄)/2) is increasing in ω and η is strictly increasing by Prop. 7. Hence, ξ′(ω) = (1/2) ℓ′β((ω − ω̄)/2) + η(ω) is strictly increasing as well.

• [max{ω0, e ωP}, ω̄ − 2β]. In this interval ℓ′β((ω − ω̄)/2) is constant and η is strictly decreasing by Prop. 7. Hence, ξ′ is strictly decreasing as well.

• [max{ω0, ω̄ − 2β, e^β ωP}, ω̄]. In this interval, ξ′(ω) = (1/(4β))(ω − ω̄) + 1/ω, and ξ′′(ω) = 0 holds only at the feasible point ω = 2√β, while ξ′′(ω) < 0 for 0 < ω < 2√β and ξ′′(ω) > 0 for ω > 2√β. Accordingly, ξ′(ω) is strictly decreasing in the interval [max{ω0, ω̄ − 2β, e^β ωP}, min{2√β, ω̄}] and strictly increasing in the interval J2.

• J3. In this interval, ξ′(ω) = (1/(4β))(ω − ω̄) + (1/(ωβ))(log(ω) − log(ωP)), and by setting σ′(ω) = 0 we find at most two solutions, namely ν1,2, which are real and distinct for ω̄ > 4√2. Both solutions might potentially belong to the interval under consideration. The sign of σ′(ω) is negative for ν1 < ω < ν2 and positive for ω < ν1 and ω > ν2. Accordingly, σ(ω) is strictly increasing in the intervals J4 and J5, while it is strictly decreasing in the interval [max{ω0, ω̄ − 2β, e ωP, ν1}, min{e^β ωP, ω̄, ν2}]. If ω̄ ≤ 4√2 then σ′(ω) ≥ 0 in the whole interval J3, with equality only if ω̄ = 4√2 and ω = ν1 = ν2. Accordingly, if ω̄ ≤ 4√2 we have that σ(ω) is strictly increasing in J3.

Since σ(ω) and ξ′(ω) share the same sign, given an interval J where ξ′ or σ is strictly decreasing, we have one of the following cases: a) ξ′ is strictly positive, b) ξ′ is strictly negative, or c) ξ′ transitions once from a positive to a negative sign. In all three cases, a solution to (O1) cannot exist in the interior of J, but can be at most at one endpoint of J. For the same reason, no solution can be at the junction of two intervals where ξ′ or σ are strictly decreasing. Hence, the endpoint has to be either an endpoint of I or be in common with an interval where either ξ′ or σ is strictly increasing, which proves the result.

B. Optimization Algorithms

In this section we provide the optimization algorithms used to solve (O2) and (O1), which exploit the theoretical results given in Appendix A.

Algorithm for (O2). Alg. 2 provides a solution to the optimization problem (O2). The idea of the algorithm is also sketched in Sec. 2.4 of the main paper. The global, unconstrained solution to (O2) is attained at (δP, ωP). Accordingly, if this solution is feasible, it is also the global solution to the constrained version of the problem (line 5). If it is not feasible, we have two options: the solution lies in the interior of the feasible set, or on its boundary. However, by Prop. 2 the former case is not possible, because the only solution would be the one we already excluded, namely (δP, ωP). Hence, the solution has to lie on the boundary of the feasible set.


Algorithm 1 Solves the optimization problem (O1).

1: function SOLVE_O1(ωP, ω̄, a1, b1)
2:     ω0 = b1 − a1
3:     S = {ω0}    ▷ used to collect potential solutions
4:     if max{ω0, ω̄} < ωP then    ▷ Prop. 6
5:         return FIND_MIN([max{ω0, ω̄}, ωP], ξ′)
6:     else if max{ω0, ωP} < ω̄ then    ▷ Prop. 8
7:         S = S ∪ {ω̄, FIND_MIN(J1, ξ′), FIND_MIN(J2, ξ′)}
8:         if ω̄ ≤ 4√2 then
9:             S = S ∪ {FIND_MIN(J3, σ)}
10:        else
11:            S = S ∪ {FIND_MIN(J4, σ), FIND_MIN(J5, σ)}
12:        end if
13:    end if
14:    return arg min_{ω ∈ S} ξ(ω)
15: end function

Algorithm 2 Solves the optimization problem (O2).

1: function SOLVE_O2(δP, ωP, a2, b2)
2:     ω̄1 = 2(δP − a2)
3:     ω̄2 = 2(b2 − δP)
4:     if ωP ≥ max{ω̄1, ω̄2} then
5:         return (δP, ωP)
6:     end if
7:     ω1 = SOLVE_O1(ωP, ω̄1, a2, b2)
8:     ω2 = SOLVE_O1(ωP, ω̄2, a2, b2)
9:     if ξ(ω1) ≤ ξ(ω2) then
10:        return (a2 + ω1/2, ω1)
11:    else
12:        return (b2 − ω2/2, ω2)
13:    end if
14: end function

Since we have only two constraints, we can apply a brute-force approach and explore two cases, where we assume that the solution activates the first constraint (line 7) or the second one (line 8). In both cases, we boil down to solving an instance of the optimization problem (O1) with ω̄ = 2(δP − a2) and ω̄ = 2(b2 − δP), respectively. The solution to each of those problems, denoted by ω1 and ω2 in the algorithm, is given by applying Alg. 1, which is discussed later. Among those two solutions, we retain the one minimizing the objective of (O1), where the objective is denoted by ξ in the algorithm. If ω1 is the best one, then the solution to (O2) is given by (δ1, ω1) in line 10, where δ1 = a2 + ω1/2 is obtained by substituting ω1 into the first constraint. Otherwise, the solution is given by (δ2, ω2) in line 12, where δ2 = b2 − ω2/2 is obtained similarly from the second constraint.

Algorithm for (O1). Alg. 1 provides a solution to the optimization problem (O1).

Algorithm 3 Finds the minimum of an objective ξ in a given interval [u, v] by leveraging an increasing, continuous function ϕ, whose sign agrees with the sign of ξ′.

1: function FIND_MIN([u, v], ϕ)
2:     if ϕ(u) ≥ 0 then
3:         return u
4:     else if ϕ(v) ≤ 0 then
5:         return v
6:     else
7:         m = (u + v)/2
8:         if v − u < ε then    ▷ ε is a tolerance
9:             return m
10:        else if ϕ(m) ≥ 0 then
11:            return FIND_MIN([u, m], ϕ)
12:        else
13:            return FIND_MIN([m, v], ϕ)
14:        end if
15:    end if
16: end function

According to Prop. 4, if max{ω̄, ωP} ≤ ω0 then ω0 is the solution. Indeed, in this case we return ω0 in line 14, since it is the only element of S. If the condition in line 4 is hit, then by Prop. 6 we can search for a solution in the interval [max{ω0, ω̄}, ωP] by leveraging the monotonicity of ξ′. We do so by exploiting Alg. 3 in line 5, which is discussed below. If the condition in line 6 is hit instead, then according to Prop. 8 we need to search for the best solution within the intervals Ji, i ∈ {1, . . . , 5}, that satisfy the corresponding conditions. Moreover, we also need to include ω0 and ω̄ in the pool. The search over each interval Ji is performed via Alg. 3, leveraging the monotonicity of ξ′ or σ. All potential solutions are collected into S and the best one in terms of the objective is retained in line 14.

Alg. 3 finds the minimum of an objective ξ in a given interval [u, v] by leveraging an increasing, continuous function ϕ whose sign agrees with the sign of ξ′. This can be done by searching for the element of [u, v] that is closest to a zero of ϕ. Since the function is increasing, if ϕ(u) is non-negative then the closest element to a zero is u, while if ϕ(v) is non-positive then the closest element is v. Otherwise, we perform a dichotomic search on the half-interval whose endpoints have discordant signs of ϕ, until we reach the zero with sufficient accuracy.
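A direct Python transcription of Alg. 3 (a minimal sketch; the tolerance value and the iterative rather than recursive form are implementation choices):

```python
def find_min(u, v, phi, eps=1e-6):
    """Bisection search of Alg. 3: returns the point of [u, v] closest to a
    zero of the increasing, continuous function `phi`, whose sign agrees
    with the sign of the objective's derivative."""
    if phi(u) >= 0:       # objective is non-decreasing on [u, v]
        return u
    if phi(v) <= 0:       # objective is non-increasing on [u, v]
        return v
    while v - u >= eps:   # phi changes sign inside (u, v)
        m = 0.5 * (u + v)
        if phi(m) >= 0:
            v = m
        else:
            u = m
    return 0.5 * (u + v)
```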

C. Evaluation Metrics

Panoptic Quality (PQ), originally described in [16], is the most commonly adopted metric to evaluate panoptic segmentation results. We report it together with semantic Intersection over Union (mIoU) and mask mean Average Precision (mAP), in order to measure our network's segmentation and detection performance in detail.


Some recent works [23, 30] have proposed alternatives to PQ aimed at highlighting different aspects of the panoptic predictions, or at overcoming potential pitfalls of PQ. Note that we denote by PQth and PQst the PQ scores computed only on thing and stuff classes, respectively (a similar notation is also used for PC).

Parsing Covering. PQ assigns equal importance to all image segments, a choice which is not always desirable, e.g. autonomous driving systems might care more about objects that are closer to the vehicle, and thus appear bigger in the image, than about far away ones. Motivated by this observation, [30] proposed Parsing Covering (PC) as an alternative panoptic metric that weights image segments in proportion to their areas. Since our CABB loss focuses on improving detection results for large objects, PC helps highlight its impact.

PQ†. Porzi et al. [23] discussed a potential limitation of PQ: it handles all classes in a uniform way, imposing a hard 0.5 threshold on the IoUs of both things and stuff. While this is strictly necessary to obtain a unique matching between thing segments and their respective ground truth, it can result in a strong over-penalization of stuff segments. To solve this, they propose PQ† as a direct modification of PQ which avoids the thresholding issue, giving a more faithful representation of the quality of stuff predictions.

D. Training hyper-parameters

All our networks are trained using stochastic gradient descent with momentum 0.9 and weight decay 10⁻⁴. The training schedule starts with a warm-up phase, where the learning rate is increased linearly from 0 to a value lr0 in the first 200 training steps. Then, the learning rate follows a linear decay schedule given by lri = lr0 (1 − i / #steps), where lri is the value at training step i. In all of our experiments we augment the data with random horizontal flipping, and in those involving ISUS we fix the maximum "things" scale augmentation range to rth = [0.25, 4]. The scale augmentation range used in CUS always matches the rst of the corresponding ISUS experiments on the same dataset. In the following we list the dataset-specific hyper-parameters. Note that all schedules used for a particular dataset result in approximately the same number of training iterations.

Mapillary Vistas. All MVD experiments use a "stuff" scale augmentation range of rst = [0.8, 1.25]. When utilizing full images we set s0 = 1344, lr0 = 0.02, and we train for 75 epochs on batches including a single image per GPU. In all other experiments we set s0 = 2400, lr0 = 0.04, take crops of size 1024 × 1024, and train for 300 epochs using batches of 4 crops per GPU.

Indian Driving Dataset. In the IDD experiments we fix s0 = 1080 and rst = [0.5, 2]. We train for 75 epochs with a batch size of 1 per GPU and lr0 = 0.02 when using full images, and for 600 epochs with a batch size of 8 per GPU, lr0 = 0.08 and crop size 512 × 512 when using crops.

Cityscapes. Finally, in the Cityscapes experiments we pre-train our networks on Mapillary Vistas, and fix s0 = 1024 and rst = [0.5, 2]. When using full images, we train for 20 epochs with a batch size of 1 per GPU and lr0 = 0.01. When using crops, we train for 150 epochs with a batch size of 8 per GPU, lr0 = 0.04 and crop size 512 × 512.
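As a concrete reference, the warm-up plus linear-decay schedule described above can be written as the following small Python helper (a sketch; the 200-step warm-up and lr0 values follow the numbers reported in this appendix):

```python
def learning_rate(step, lr0, total_steps, warmup_steps=200):
    """Warm-up followed by linear decay: lr rises linearly from 0 to lr0
    over `warmup_steps`, then decays as lr_i = lr0 * (1 - i / total_steps)."""
    if step < warmup_steps:
        return lr0 * step / warmup_steps
    return lr0 * (1.0 - step / total_steps)
```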

E. Additional ISUS ablations

In order to validate the efficacy of ISUS, we perform an additional ablation experiment where we train our CROP network variant (with CUS) using standard scale augmentation in the range [0.25, 4]. Note that this is the same range as the rth used in the ISUS experiments. The aim here is to verify whether the instance-aware scale sampling in ISUS has any impact on detection compared to uniform sampling over the same range. When training on MVD, we obtain the following results: PQth = 42.3, mAP = 22.8. Compare these to the corresponding CROP + ISUS results: PQth = 43.1, mAP = 23.0.

F. Qualitative Results

In the following we visualize sample outputs of our best performing CROP + CABB + ISUS networks on Mapillary Vistas (Fig. 7), Cityscapes (Fig. 8) and the Indian Driving Dataset (Fig. 9).

Figure 7: Sample outputs of CROP + CABB + ISUS on Mapillary Vistas. Best viewed on screen.

Figure 8: Sample outputs of CROP + CABB + ISUS on Cityscapes. Best viewed on screen.

Figure 9: Sample outputs of CROP + CABB + ISUS on the Indian Driving Dataset. Best viewed on screen.

References

[1] S. N. Afriat. Theory of maxima and the method of Lagrange. SIAM J. Appl. Math., 20(3):343–357, 1971.
[2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. (PAMI), 40(4):834–848, 2018.
[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, September 2018.
[4] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation, 2018.
[5] Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab. arXiv:1910.04751, 2019.
[6] Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12475–12485, 2020.
[7] Ho Kei Cheng, Jihoon Chung, Yu-Wing Tai, and Chi-Keung Tang. CascadePSP: Toward class-agnostic and very high-resolution segmentation via global and local refinement. In (CVPR), 2020.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[10] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang. SSAP: Single-shot instance segmentation with affinity pyramid. In (ICCV), 2019.
[11] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In (NIPS), December 2017.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[14] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. Breaking the memory wall with optimal tensor rematerialization. In Proceedings of Machine Learning and Systems 2020, 2020.
[15] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In (CVPR), pages 6399–6408, 2019.
[16] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In (CVPR), pages 9404–9413, 2019.
[17] Jie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, and Adrien Gaidon. Learning to fuse things and stuff. CoRR, abs/1812.01192, 2018.
[18] Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In (CVPR), 2019.
[19] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In (CVPR), 2017.
[20] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In (ICLR), 2018.
[21] Rohit Mohan and Abhinav Valada. EfficientPS: Efficient panoptic segmentation. arXiv preprint arXiv:2004.02307, 2020.
[22] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In (ICCV), 2017.
[23] Lorenzo Porzi, Samuel Rota Bulò, Aleksander Colovic, and Peter Kontschieder. Seamless scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In (NIPS), 2015.
[25] Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[26] Konstantin Sofiiuk, Olga Barinova, and Anton Konushin. AdaptIS: Adaptive instance selection network. In Proceedings of the IEEE International Conference on Computer Vision, pages 7355–7363, 2019.
[27] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and C. V. Jawahar. Indian driving dataset (IDD): A dataset for exploring problems of autonomous navigation in unconstrained environments. In (WACV), 2019.
[28] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. TPAMI, 2019.
[29] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8818–8826, 2019.
[30] Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen. DeeperLab: Single-shot image parser. CoRR, abs/1902.05093, 2019.
[31] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. arXiv:1909.11065, 2020.