Densely Supervised Grasp Detector (DSGD)

Umar Asif, Jianbin Tang, and Stefan Harrer

Umar Asif, Jianbin Tang, and Stefan Harrer are with IBM Research Australia. Email: {umarasif, jbtang, sharrer}@au1.ibm.com.

Abstract— This paper presents Densely Supervised Grasp Detector (DSGD), a deep learning framework which combines CNN structures with layer-wise feature fusion and produces grasps and their confidence scores at different levels of the image hierarchy (i.e., global-, region-, and pixel-levels). Specifically, at the global-level, DSGD uses the entire image information to predict a grasp. At the region-level, DSGD uses a region proposal network to identify salient regions in the image and predicts a grasp for each salient region. At the pixel-level, DSGD uses a fully convolutional network and predicts a grasp and its confidence at every pixel. During inference, DSGD selects the most confident grasp as the output. This selection from hierarchically generated grasp candidates overcomes limitations of the individual models. DSGD outperforms state-of-the-art methods on the Cornell grasp dataset in terms of grasp accuracy. Evaluation on a multi-object dataset and real-world robotic grasping experiments show that DSGD produces highly stable grasps on a set of unseen objects in new environments. It achieves 97% grasp detection accuracy and a 90% robotic grasping success rate with real-time inference speed.

I. INTRODUCTION

Grasp detection is a crucial task in robotic grasping because errors in this stage affect grasp planning and execution. A major challenge in grasp detection is generalization to unseen objects in the real world. Recent advancements in deep learning have produced Convolutional Neural Network (CNN) based grasp detection methods which achieve higher grasp detection accuracy compared to methods based on hand-crafted features. Methods such as [?], [?], [1], [2] focused on learning grasps in a global context (i.e., the model predicts one grasp considering the whole input image), through regression-based approaches (which directly regress the grasp parameters defined by the location, width, height, and orientation of a 2D rectangle in image space). Other methods such as [3] focused on learning grasps at the patch-level by extracting patches (of different sizes) from the image and predicting a grasp for each patch. Recently, methods such as [4], [5] used auto-encoders to learn grasp parameters at each pixel in the image. They showed that one-to-one mapping (of image data to ground truth grasps) at the pixel-level can effectively be learnt using small CNN structures to achieve fast inference speed. These studies show that grasp detection performance is strongly influenced by three main factors: i) the choice of the CNN structure used for feature learning, ii) the objective function used to learn grasp representations, and iii) the image hierarchical context at which grasps are learnt (e.g., global or local).

In this work, we explore the advantages of combining multiple global and local grasp detectors and a mechanism to select the best grasp out of the ensemble. We also explore the benefits of learning grasp parameters using a combination of regression and classification objective functions. Finally, we explore different CNN structures as base networks to identify the best performing architecture in terms of grasp detection accuracy. The main contributions of this paper are summarized below:

1) We present Densely Supervised Grasp Detector (DSGD), an ensemble of multiple CNN structures which generate grasps and their confidence scores at different levels of image hierarchy (i.e., global-level, region-level, and pixel-level).

2) We propose a region-based grasp network, which learns to extract salient parts (e.g., handles or boundaries) from the input image, and uses the information about these parts to learn class-specific grasps (i.e., each grasp is associated with a probability with respect to a graspable class and a non-graspable class).

3) We perform an ablation study of our DSGD by varying its critical parameters and present a grasp detector that achieves real-time speed and high grasp accuracy.

4) We demonstrate the robustness of DSGD for producing stable grasps for unseen objects in real-world environments using a multi-object dataset and robotic grasping experiments.

II. RELATED WORK

In the context of deep learning based grasp detection, methods such as [1], [6], [7] trained sliding window based grasp detectors. However, their high inference times limit their application for real-time systems. Other methods such as [8]–[10] reduced inference time by processing a discrete set of grasp candidates, but these methods ignore some potential grasps. Alternatively, methods such as [2], [11], [12] proposed end-to-end CNN-based approaches which regress a single grasp for an input image. However, these methods tend to produce average grasps which are invalid for certain symmetric objects [2]. Recently, methods such as [4], [5], [13]–[15] used auto-encoders to generate grasp poses at every pixel. They demonstrated higher grasp accuracy compared to the global methods. Another stream of work focused on learning a mapping between images of objects and robot motion parameters using reinforcement learning, where the robot iteratively refines grasp poses through real-world experiments. In this context, the method of [3] learned visual-motor control by performing more than 40k grasping trials on a real robot. The method of [16] learned image-to-gripper pose mapping using over 800k motor commands generated by 14 robotic arms performing grasping trials for over 2 months.

In this paper, we present a grasp detector which has several key differences from the current grasp detection methods. First, our detector generates multiple global and local grasp candidates and selects the grasp with the highest quality. This allows our detector to effectively recover from the errors of the individual global or local models. Second, we introduce a region-based grasp network which learns grasps using information about salient parts of objects (e.g., handles, extrusions, or boundaries), and produces more accurate grasps compared to global [11] or local detectors [3]. Finally, we use layer-wise dense feature fusion [17] within the CNN structures. This maximizes variation in the information flow across the networks and produces better image-to-grasp mappings compared to the models of [2], [4].

III. PROBLEM FORMULATION

Given an image of an object as input, the goal is to generate grasps at different image hierarchical levels (i.e., global-, region- and pixel-levels), and select the most confident grasp as the output. We define the global grasp by a 2D oriented rectangle on the target object in the image space. It is given by:

Gg = [xg, yg, wg, hg, θg, ρg], (1)

where xg and yg represent the centroid of the rectangle. The terms wg and hg represent the width and the height of the rectangle. The term θg represents the angle of the rectangle with respect to the x-axis. The term ρg is the grasp confidence and represents the quality of a grasp. Our region-level grasp is defined by a class-specific representation, where the parameters of the rectangle are associated with n classes (a graspable class: n = 1, and a non-graspable class: n = 0). It is given by:

Gr = [x^n_r, y^n_r, w^n_r, h^n_r, θ^n_r, ρ^n_r],  n ∈ {0, 1}.   (2)

Our pixel-level grasp is defined as:

Gp = [Mxy, Mw, Mh, Mθ] ∈ R^(s×W×H),   (3)

where Mxy, Mw, Mh, and Mθ represent s×W×H-dimensional heatmaps (s = 1 for Mxy, Mw, and Mh, and s = Nθ for Mθ), which encode the position, width, height, and orientation of grasps at every pixel of the image, respectively. The terms W and H represent the width and the height of the input image, respectively. We learn the grasp representations (Gg, Gr, and Gp) using joint regression-classification based objective functions. Specifically, we learn the position, width, and height parameters using a Mean Squared Loss, and learn the orientation parameter using a Cross Entropy Loss with respect to Nθ = 50 classes (angular bins).
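As a concrete illustration of the angular binning described above, the sketch below (our own illustration, not the authors' code) shows one way to quantize a continuous grasp angle into one of Nθ = 50 classes and to recover an angle from a predicted class. The assumption that grasp angles live in [0, π), i.e., that a rectangle is symmetric under 180° rotation, is ours.

```python
import numpy as np

N_THETA = 50  # number of angular bins (classes), as stated above


def angle_to_bin(theta_rad: float) -> int:
    """Map a grasp angle to one of N_THETA discrete classes."""
    theta = theta_rad % np.pi                 # assume 180-degree symmetry of grasp rectangles
    return int(theta / (np.pi / N_THETA))


def bin_to_angle(bin_idx: int) -> float:
    """Recover the bin-centre angle (radians) for a predicted class index."""
    return (bin_idx + 0.5) * (np.pi / N_THETA)


# Example: an angle of 30 degrees falls into bin 8 of 50.
b = angle_to_bin(np.deg2rad(30))
print(b, np.rad2deg(bin_to_angle(b)))
```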

IV. THE PROPOSED DSGD (FIG. 1)

Our DSGD is composed of four main modules as shown in Fig. 1: a base network for feature extraction, a Global Grasp Network (GGN) for producing a grasp at the image-level, a Region Grasp Network (RGN) for producing grasps using salient parts of the image, and a Pixel Grasp Network (PGN) for generating grasps at each image pixel. In the following, we describe in detail the various modules of DSGD.

A. Base Network

The purpose of the base network is to act as a feature extractor. We extract features from the intermediate layers of a CNN such as DenseNets [17], and use the features to learn grasp representations at different hierarchical levels. The basic building block of DenseNets [17] is a Dense block: bottleneck convolutions interconnected through dense connections. Specifically, a dense block consists of Nl layers, termed Dense Layers, which share information from all the preceding layers connected to the current layer through skip connections [17]. Fig. 3 shows the structure of a dense block with Nl = 6 dense layers. Each dense layer consists of 1×1 and 3×3 convolutions followed by Batch Normalization [19] and a Rectified Linear Unit (ReLU). The output of the lth dense layer (Xl) in a dense block can be written as:

Xl = [X0, ...,Xl−1], (4)

where [· · ·] represents concatenation of the features produced by the layers 0, ..., l − 1.
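For readers unfamiliar with dense connectivity, the following PyTorch sketch (a minimal illustration under our own assumptions, not the authors' implementation) shows a dense layer with the 1×1 and 3×3 convolutions, Batch Normalization, and ReLU described above, and a dense block that concatenates the outputs of all preceding layers as in Eq. 4. The bottleneck width of 4× the growth rate is a common DenseNet choice, not something stated in the paper.

```python
import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    """One dense layer: 1x1 conv -> BN -> ReLU -> 3x3 conv -> BN -> ReLU."""

    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        bottleneck = 4 * growth_rate  # assumed bottleneck width (common DenseNet setting)
        self.layer = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, growth_rate, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(growth_rate),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layer(x)


class DenseBlock(nn.Module):
    """Nl dense layers; each layer sees the concatenation of all earlier features (Eq. 4)."""

    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate) for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)


# Example: a block with Nl = 6 layers and growth rate W = 32.
block = DenseBlock(in_channels=64, growth_rate=32, num_layers=6)
out = block(torch.randn(1, 64, 56, 56))  # -> [1, 64 + 6 * 32, 56, 56]
```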

B. Global Grasp Network (GGN)

Our GGN structure is composed of two sub-networks as shown in Fig. 1-A: a Global Grasp Prediction Network (GGPN) for generating the grasp pose ([xg, yg, wg, hg, θg]) and a Grasp Evaluation Network (GEN) for predicting the grasp confidence (ρg). The GGPN structure is composed of a dense block, an averaging operation, a 4-dimensional fully connected layer for predicting the parameters [xg, yg, wg, hg], and a 50-dimensional fully connected layer for predicting θg. The GEN structure is similar to GGPN except that GEN has a single 2-dimensional fully connected layer for predicting ρg. The input to GEN is a grasp image, which is produced by replacing the blue channel of the input image with a binary rectangle image generated from the output of GGPN, as shown in Fig. 2.
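A minimal sketch of the grasp-image construction described above. The paper does not specify the tooling; the OpenCV-based rasterization, the angle-in-degrees convention, and the RGB channel ordering below are our assumptions.

```python
import cv2
import numpy as np


def make_grasp_image(rgb: np.ndarray, x: float, y: float, w: float, h: float, theta_deg: float) -> np.ndarray:
    """Replace the blue channel of an H x W x 3 uint8 image with a binary mask of the grasp rectangle."""
    mask = np.zeros(rgb.shape[:2], dtype=np.uint8)
    box = cv2.boxPoints(((x, y), (w, h), theta_deg))   # 4 corners of the oriented rectangle
    cv2.fillPoly(mask, [np.int32(box)], color=255)
    grasp_image = rgb.copy()
    grasp_image[:, :, 2] = mask                        # index 2 is blue, assuming RGB channel order
    return grasp_image


# Example: a 100x60 rectangle centred at (160, 120), rotated by 30 degrees.
img = np.zeros((224, 224, 3), dtype=np.uint8)
grasp_img = make_grasp_image(img, 160, 120, 100, 60, 30)
```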

Let Rgi = [xgi, ygi, wgi, hgi], θgi, and ρgi denote the predicted values of a global grasp for the ith image. We define the loss of the GGPN and the GEN models over K images as:

Lggpn = Σ_{i∈K} ((1 − λ1) Lreg(Rgi, R∗gi) + λ1 Lcls(θgi, θ∗gi)),   (5)

Lgen = Σ_{i∈K} Lcls(ρgi, ρ∗gi),   (6)

where R∗gi, θ∗gi, and ρ∗gi represent the ground truths. The term Lreg is a regression loss defined as:

Lreg(R, R∗) = ||R − R∗|| / ||R∗||2.   (7)

The term Lcls is a classification loss defined as:

Lcls(x, c) = − Σ_{c=1}^{Nθ} Yx,c log(px,c),   (8)

where Yx,c is a binary indicator of whether class label c is the correct classification for observation x, and px,c is the predicted probability that observation x belongs to class c.
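The sketch below (a simplified single-image illustration under our own assumptions about tensor shapes) shows how the regression loss of Eq. 7 and the classification loss of Eq. 8 might be combined into the GGPN objective of Eq. 5, with λ1 set to 0.4 as reported in the implementation details.

```python
import torch
import torch.nn.functional as F

LAMBDA_1 = 0.4  # weighting reported in the implementation details


def l_reg(pred_rect: torch.Tensor, gt_rect: torch.Tensor) -> torch.Tensor:
    """Eq. 7: ||R - R*|| / ||R*||_2 over the [x, y, w, h] parameters."""
    return torch.norm(pred_rect - gt_rect) / torch.norm(gt_rect)


def l_ggpn(pred_rect, gt_rect, theta_logits, gt_theta_bin):
    """Eq. 5 for one image: weighted sum of rectangle regression and angle classification."""
    reg = l_reg(pred_rect, gt_rect)
    cls = F.cross_entropy(theta_logits.unsqueeze(0), gt_theta_bin.view(1))  # Eq. 8 over the angle bins
    return (1.0 - LAMBDA_1) * reg + LAMBDA_1 * cls


# Example with dummy values: a predicted rectangle, its ground truth, and 50 angle logits.
loss = l_ggpn(torch.tensor([110.0, 98.0, 60.0, 30.0]),
              torch.tensor([112.0, 100.0, 64.0, 28.0]),
              torch.randn(50), torch.tensor(12))
```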


Fig. 1: Overview of our DSGD architecture. Given an image as input, DSGD uses a base network to extract features which are fed into a Global Grasp Network (A), a Pixel Grasp Network (B), and a Region Grasp Network (C), to produce grasp candidates. The global model produces a single grasp per image and uses an independent Grasp Evaluation Network (D) to produce grasp confidence. The pixel-level model uses a fully convolutional network and produces grasps at every pixel. The region-level model uses a Salient Region Network (E) to extract salient parts of the image and uses information about these parts to produce grasps. During inference, DSGD switches between the GGN, the PGN, and the RGN models based on their confidence scores.


Fig. 2: Left: A grasp image is generated by replacing the blue channel of the input image with a binary rectangle image produced from a grasp pose. Right: Our Grasp Evaluation Network is trained using grasp images labelled in terms of valid (1) and invalid (0) grasp rectangles.

C. Region Grasp Network (RGN)

The RGN structure is composed of two sub-networks as shown in Fig. 1-C: a Salient Region Network (SRN) for extracting salient parts of the image, and a Region Grasp Prediction Network (RGPN) for predicting a grasp for each candidate salient part.

1) Salient Region Network (SRN): Here, we use the features extracted from the base network to generate regions defined by the location (xsr, ysr), width (wsr), height (hsr), and confidence (ρsr) of non-oriented rectangles which encompass salient parts of the image (e.g., handles, extrusions, or boundaries). For this, we first generate a fixed number of rectangles using the Region of Interest (ROI) method of [20]. Next, we use the features from the base network and optimize a Mean Squared Loss on the rectangle coordinates and a Cross Entropy Loss on the rectangle confidence scores.

Let Ti = [xsr, ysr, wsr, hsr] denote the parameters of the ith predicted rectangle, and ρsri denote the probability that it belongs to a graspable rather than a non-graspable region. The loss of SRN over I proposals is given by:

Lsrn = Σ_{i∈I} ((1 − λ2) Lreg(Ti, T∗i) + λ2 Lcls(ρsri, ρ∗sri)),   (9)

where ρ∗sri = 0 for a non-graspable region and ρ∗sri = 1 for a graspable region. The term T∗i represents the ground-truth candidate corresponding to ρ∗sri.


Fig. 3: Detailed architecture of our DSGD with a DenseNet [17] as its base network.

2) Region Grasp Prediction Network (RGPN): Here, we produce grasp poses for the salient regions predicted by SRN (k = 128 in our implementation). For this, we crop features from the output feature maps of the base network using the Region of Interest (ROI) pooling method of [20]. The cropped features are then fed to Dense block 4, which produces feature maps of dimensions k × 1664 × 7 × 7 as shown in Fig. 3. These feature maps are then squeezed to k × 1664 dimensions through global average pooling, and fed to three fully connected layers fcR ∈ R^(k×4×n), fcθ ∈ R^(k×50×n), and fcρ ∈ R^(k×2), which produce class-specific grasps. RGPN also has a segmentation branch (with two upsampling layers SR1 ∈ R^(k×256×14×14) and SR2 ∈ R^(k×14×14×n)), which produces a segmentation mask for each salient region as shown in Fig. 1-F.
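As a rough sketch of the class-specific prediction heads described above (our own simplification: the dense block, ROI pooling, and the segmentation branch are omitted), the ROI-pooled features are averaged and fed to three fully connected heads for the rectangle, the angle bins, and the graspable/non-graspable score. The feature width of 1664, k = 128 regions, n = 2 classes, and Nθ = 50 bins are taken from the text and Fig. 3.

```python
import torch
import torch.nn as nn


class RGPNHeads(nn.Module):
    """Class-specific grasp heads over k ROI-pooled feature maps (simplified sketch)."""

    def __init__(self, feat_dim: int = 1664, n_classes: int = 2, n_theta: int = 50):
        super().__init__()
        self.fc_R = nn.Linear(feat_dim, 4 * n_classes)             # [x, y, w, h] per class
        self.fc_theta = nn.Linear(feat_dim, n_theta * n_classes)   # angle bins per class
        self.fc_rho = nn.Linear(feat_dim, n_classes)               # graspable vs. non-graspable score

    def forward(self, roi_feats):                                   # roi_feats: [k, feat_dim, 7, 7]
        pooled = roi_feats.mean(dim=(2, 3))                         # global average pooling -> [k, feat_dim]
        return self.fc_R(pooled), self.fc_theta(pooled), self.fc_rho(pooled)


heads = RGPNHeads()
rects, thetas, scores = heads(torch.randn(128, 1664, 7, 7))         # k = 128 salient regions
```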

Let Rri = [xri, yri, wri, hri], θri, and ρri denote the predicted values of a region-level grasp for the ith salient region, and Si ∈ R^(14×14×n) denote the corresponding predicted segmentation. The loss of the RGPN model is defined over I salient regions as:

Lrgpn = Σ_{i∈I} (Lreg(Rri, R∗ri) + λ3 Lcls(θri, θ∗ri) + λ3 Lcls(ρri, ρ∗ri) + Lseg(Si, S∗i)),   (10)

where R∗, θ∗, ρ∗, and S∗ represent the ground truths. The term ρ∗ri = 0 for a non-graspable region and ρ∗ri = 1 for a graspable region. The term Lseg represents a pixel-wise binary cross-entropy loss used to learn segmentations of salient regions. It is given by:

Lseg = −(1/|Si|) Σ_{j∈Si} (yj log(ŷj) + (1 − yj) log(1 − ŷj)),   (11)

where yj represents the ground-truth value and ŷj denotes the predicted value for a pixel j ∈ Si. Learning segmentations in parallel with grasp poses helps the network to produce better localization results [20]. The total loss of our RGN model is given by:

Lrgn = Lsrn + Lrgpn. (12)

The terms λ1, λ2, and λ3 in Eq. 5, Eq. 9, and Eq. 10 control the relative influence of classification over regression in the combined objective functions. In our experiments, we set λ1, λ2, and λ3 to 0.4.

D. Pixel Grasp Network (PGN)

Here, we feed the features extracted from the base network into Dense block 7, followed by a group of upsampling layers which increase the spatial resolution of the features and produce feature maps of the size of the input image. These feature maps encode the parameters of the grasp pose at every pixel of the image.

Let Mxyi, Mwi, Mhi, and Mθi denote the predicted feature maps of the ith image. We define the loss of the PGN model over K images as:

Lpgn = Σ_{i∈K} (Lreg(Mxyi, M∗xyi) + Lreg(Mwi, M∗wi) + Lreg(Mhi, M∗hi) + Lcls(Mθi, M∗θi)),   (13)

where M∗xyi, M∗wi, M∗hi, and M∗θi represent the ground truths.
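To make the pixel-level output concrete, the following sketch (our own illustration of one plausible decoding rule, not taken from the paper) picks the pixel with the highest value in the position map and reads off the width, height, and angle bin at that location; treating the position-map value as the grasp confidence is an assumption on our part.

```python
import torch


def decode_pgn_grasp(m_xy, m_w, m_h, m_theta, n_theta: int = 50):
    """Decode a single grasp from PGN heatmaps.

    m_xy, m_w, m_h: [1, H, W] maps; m_theta: [n_theta, H, W] angle logits.
    Returns (x, y, w, h, theta_radians, confidence).
    """
    conf, flat_idx = m_xy.view(-1).max(dim=0)           # most confident pixel in the position map
    h_img, w_img = m_xy.shape[-2:]
    y, x = divmod(int(flat_idx), w_img)
    theta_bin = int(m_theta[:, y, x].argmax())
    theta = (theta_bin + 0.5) * (torch.pi / n_theta)     # bin centre, assuming [0, pi) binning as before
    return float(x), float(y), float(m_w[0, y, x]), float(m_h[0, y, x]), theta, float(conf)


# Example with random maps at the input resolution used in the paper (224 x 224).
grasp = decode_pgn_grasp(torch.rand(1, 224, 224), torch.rand(1, 224, 224),
                         torch.rand(1, 224, 224), torch.randn(50, 224, 224))
```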

E. Training and Implementation

For the global model, we trained the GGPN and the GEN sub-networks independently. For the region-based and the pixel-based models, we trained the networks in an end-to-end manner. Specifically, we initialized the weights of the base network with weights pre-trained on ImageNet. For the Dense blocks (4-7), the fully connected layers of GGPN, GEN, SRN, and RGPN, and the fully convolutional layers of PGN, we initialized the weights from zero-mean Gaussian distributions (standard deviation set to 0.01, biases set to 0), and trained the networks using the loss functions in Eq. 5, Eq. 6, Eq. 11, Eq. 12, and Eq. 13, respectively, for 150 epochs. The starting learning rate was set to 0.01 and divided by 10 at 50% and 75% of the total number of epochs. The parameter decay was set to 0.0005 on the weights and biases.
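A minimal sketch of this training schedule (a skeleton under our own assumptions, with a stand-in model and a placeholder data pass); only the optimizer choice, learning-rate schedule, epoch count, and decay value are taken from the text.

```python
import torch
import torch.nn as nn

EPOCHS = 150
model = nn.Linear(10, 4)  # stand-in for a DSGD sub-network (assumption for this sketch)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.0005)
# Divide the learning rate by 10 at 50% and 75% of the total number of epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[EPOCHS // 2, int(0.75 * EPOCHS)], gamma=0.1)

for epoch in range(EPOCHS):
    # ... one pass over the training data with the relevant loss (Eq. 5, 6, 12, or 13) ...
    optimizer.step()      # placeholder step so the schedule advances in this sketch
    scheduler.step()
```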

Our implementation is based on the Torch framework [21]. Training was performed using the ADAM optimizer and data parallelism on four Nvidia Tesla K80 GPU devices. For grasp selection during inference, DSGD selects the most confident region-level grasp if its confidence score is greater than a confidence threshold (δrgn); otherwise, DSGD switches to the PGN branch and selects the most confident pixel-level grasp. If the most confident pixel-level grasp has a confidence score less than δpgn, DSGD switches to the GGN branch and selects the global grasp as the output. Experimentally, we found that δrgn = 0.95 and δpgn = 0.90 produced the best grasp detection results.
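The confidence-based switching described above can be summarized by the following sketch (our own condensation; grasp candidates are assumed to arrive as (grasp, confidence) pairs already produced by the three branches).

```python
DELTA_RGN = 0.95  # region-level confidence threshold
DELTA_PGN = 0.90  # pixel-level confidence threshold


def select_grasp(rgn_candidates, pgn_candidates, ggn_grasp):
    """Pick the output grasp from the RGN, PGN, and GGN branches.

    rgn_candidates / pgn_candidates: lists of (grasp, confidence) tuples.
    ggn_grasp: the single global grasp, used as the fallback.
    """
    if rgn_candidates:
        best_rgn = max(rgn_candidates, key=lambda c: c[1])
        if best_rgn[1] > DELTA_RGN:
            return best_rgn[0]
    if pgn_candidates:
        best_pgn = max(pgn_candidates, key=lambda c: c[1])
        if best_pgn[1] >= DELTA_PGN:
            return best_pgn[0]
    return ggn_grasp


# Example: RGN is not confident enough, so the most confident PGN grasp wins.
print(select_grasp([("rgn_grasp", 0.93)], [("pgn_grasp", 0.95)], "ggn_grasp"))
```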

V. EXPERIMENTS

We evaluated DSGD for grasp detection on the popular Cornell grasp dataset [1], which contains 885 RGB-D images of 240 objects. The ground truth is available in the form of grasp rectangles. We also evaluated DSGD for multi-object grasp detection in new environments. For this, we used the multi-object dataset of [?], which consists of 6896 RGB-D images of indoor scenes containing multiple objects placed in different locations and orientations. The dataset was generated using an extended version of the scene labeling framework of [?] and [?]. For evaluation, we used the object-wise splitting criterion [1] for both the Cornell grasp dataset and our multi-object dataset. Object-wise splitting divides the object instances randomly into train and test subsets (i.e., the training set and the test set do not share any images from the same object). This strategy evaluates how well the model generalizes to unseen objects. For comparison purposes, we followed the procedure of [2] and substituted the blue channel with the depth image, where the depth values are normalized between 0 and 255. We also performed data augmentation through random rotations.

TABLE I: Grasp evaluation on the Cornell grasp dataset in terms of average grasp detection accuracy.

Method                               Accuracy (%)
(Jiang et al. 2011) Fast search      58.3
(Lenz et al. 2015) Deep learning     75.6
(Redmon et al. 2015) MultiGrasp      87.1
(Wang et al. 2015) Multi-modal       -
(Kumra et al. 2017) ResNets          88.9
(Guo et al. 2017) Hybrid-Net         89.1
(this work) DSGD                     97.5

TABLE II: Comparison of the individual networks of the proposed DSGD in terms of grasp accuracy (%) on the Cornell grasp dataset.

Base network   GGN (global)   PGN (local)   RGN (local)   DSGD
ResNet50       86.8           94.1          96.1          96.7
DenseNet       88.9           95.4          96.8          97.5

For grasp evaluation, we used the "rectangle metric" proposed in [7]. A grasp is considered to be correct if: i) the difference between the predicted grasp angle and the ground truth is less than 30◦, and ii) the Jaccard index of the predicted grasp and the ground truth is higher than 25%. The Jaccard index for a predicted rectangle R and a ground-truth rectangle R∗ is defined as:

J(R∗, R) = |R∗ ∩ R| / |R∗ ∪ R|.   (14)
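A sketch of how the rectangle metric could be computed. The paper does not specify the tooling; using the shapely library for the polygon intersection and union, and comparing angles modulo 180°, are our assumptions.

```python
import numpy as np
from shapely.geometry import Polygon


def rect_corners(x, y, w, h, theta_deg):
    """Corners of an oriented grasp rectangle given centre, size, and angle in degrees."""
    theta = np.deg2rad(theta_deg)
    dx = np.array([np.cos(theta), np.sin(theta)])       # width direction
    dy = np.array([-np.sin(theta), np.cos(theta)])      # height direction
    c = np.array([x, y])
    return [tuple(c + 0.5 * (sx * w * dx + sy * h * dy))
            for sx, sy in [(-1, -1), (1, -1), (1, 1), (-1, 1)]]


def is_correct(pred, gt, angle_tol_deg=30.0, jaccard_thresh=0.25):
    """Rectangle metric: angle difference below 30 degrees and Jaccard index (Eq. 14) above 25%."""
    (px, py, pw, ph, ptheta), (gx, gy, gw, gh, gtheta) = pred, gt
    angle_diff = abs((ptheta - gtheta + 90) % 180 - 90)  # compare angles modulo 180 degrees (assumption)
    p_poly = Polygon(rect_corners(px, py, pw, ph, ptheta))
    g_poly = Polygon(rect_corners(gx, gy, gw, gh, gtheta))
    jaccard = p_poly.intersection(g_poly).area / p_poly.union(g_poly).area
    return angle_diff < angle_tol_deg and jaccard > jaccard_thresh


# Example: a slightly shifted and rotated prediction still counts as correct.
print(is_correct((112, 100, 64, 28, 35), (110, 98, 60, 30, 30)))
```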

A. Single-Object Grasp Detection

Table I shows that our DSGD achieved a considerable improvement of around 8% in mean accuracy compared to the best performing model of [12] on the Cornell grasp dataset. We attribute this improvement to two main reasons: First, the proposed hierarchical grasp generation enables DSGD to produce grasps and their confidence scores from both global and local contexts. This enables DSGD to effectively recover from the errors of the global [11] or local methods [12]. Second, the use of dense feature fusion enables the networks to learn more discriminative features compared to the models used in [11], [12], respectively. Fig. 4 shows grasps produced by our DSGD on some images of the Cornell grasp dataset.

1) Significance of Combining Global and Local Models: Table II shows a quantitative comparison of the individual models of our DSGD in terms of grasp accuracy on the Cornell grasp dataset, for different CNN structures as the base network. The base networks we tested include ResNets [18] and DenseNets [17]. Table II shows that on average the local models (PGN and RGN) produced higher grasp accuracy compared to the global model (GGN).

Fig. 4: Grasp detection results of our DSGD (B) on some challenging objects of the Cornell grasp dataset. Ground truths are shown in A.

The global and the local models have their own pros and cons. The global model learns an average of the ground-truth grasps from the global perspective. Although the global grasps are accurate for most of the objects, they tend to lie in the middle of circular symmetric objects, resulting in localization errors as highlighted in red in Fig. 5-B. The PGN model, on the other hand, operates at the pixel-level and produces correct grasp localizations for these challenging objects as shown in Fig. 5-C. However, the pixel-based model is susceptible to outliers in the position prediction maps, which result in localization errors as highlighted in red in Fig. 5-D. Our RGN model works at a semi-global level while maintaining large receptive fields. It produces predictions using features extracted from the salient parts of the image, which are highly likely to encode graspable parts of an object (e.g., handles or boundaries), as shown in Fig. 5-F. Consequently, RGN is less susceptible to pixel-level outliers and does not suffer from global averaging errors, as shown in Fig. 5-G. Our DSGD takes advantage of both the global context and local predictions and produces highly accurate grasps as shown in Fig. 5-H.

Fig. 5: Qualitative comparison of grasps produced by the proposed GGN, PGN, RGN, and DSGD models (A: ground truth, B: GGN output, C: PGN localization, D: PGN top predictions, E: RGN localization, F: RGN segmented regions, G: RGN top predictions, H: DSGD output). The results show that our DSGD effectively recovers from the errors of the individual models. Incorrect predictions are highlighted in red.

2) Ablative Study of the Proposed DSGD: The growth rate parameter W refers to the number of output feature maps of each dense layer and therefore controls the depth of the network. Table III shows that a larger growth rate and wider dense blocks (i.e., more layers in the dense blocks) increase the average accuracy from 97.1% to 97.5% at the expense of lower runtime speed due to the overhead from additional channels. Table III also shows that a lite version of our detector (DSGD-lite) can run at 12 fps, making it suitable for real-time applications.

TABLE III: Ablation study of our DSGD (with DenseNet as the base network) on the Cornell grasp dataset in terms of the growth rate (W) and the number of dense layers (Nl) of the GGN, PGN, and RGN sub-networks.

Model       GGN (W / Nl1 / Nl2 / Nl3 / Nl5)   PGN (W / Nl1 / Nl2 / Nl3 / Nl7)   RGN (W / Nl1 / Nl2 / Nl3 / Nl4)   Accuracy (%)   Speed (fps)
DSGD-lite   32 / 6 / 12 / 24 / 16             32 / 6 / 12 / 24 / 16             32 / 6 / 12 / 24 / 16             97.1           12
DSGD-A      32 / 6 / 12 / 32 / 32             32 / 6 / 12 / 32 / 32             32 / 6 / 12 / 24 / 16             97.1           11
DSGD-B      32 / 6 / 12 / 48 / 32             32 / 6 / 12 / 48 / 32             32 / 6 / 12 / 24 / 16             97.3           10
DSGD-C      48 / 6 / 12 / 36 / 24             48 / 6 / 12 / 36 / 24             32 / 6 / 12 / 24 / 16             97.5           9

B. Multi-Object Grasp Detection

Table IV shows our grasp evaluation on the multi-object dataset. The results show that, on average, our DSGD improves grasp detection accuracy by 9% and 2.4% compared to the pixel-level and region-level models, respectively. Fig. 6 shows qualitative results on the images of our multi-object dataset. The results show that our DSGD successfully generates correct grasps for multiple objects in real-world scenes containing background clutter. The generalization capability of our model is attributed to the proposed hierarchical image-to-grasp mappings, where the proposed region-level network and the proposed pixel-level network learn to associate grasp poses to salient regions and salient pixels in the image data, respectively. These salient regions and pixels encode object graspable parts (e.g., boundaries, corners, handles, extrusions) which are generic (i.e., have similar appearance and structural characteristics) across a large variety of objects generally found in indoor environments. Consequently, the proposed hierarchical mappings learned by our models successfully generalize to new object instances during testing. This justifies the practicality of our DSGD for real-world robotic grasping.

Fig. 6: Grasp evaluation on our multi-object dataset. The train and the test sets do not share any images from the same object instance. The localization outputs of our pixel-level (PGN) and region-level (RGN) grasp models are shown in (A) and (B), respectively. The segmented regions produced by our RGN model are shown in (C). Note that we only show grasps with confidence scores higher than 90% in (D).

TABLE IV: Grasp evaluation on our multi-object dataset.

Base network   PGN      RGN      DSGD     Robotic grasp success
ResNet         86.5%    93.4%    95.8%    89%
DenseNet       87.4%    94.7%    97.2%    90%

C. Robotic Grasping

Our robotic grasping setup consists of a Kinect for image acquisition and a 7-degree-of-freedom robotic arm, which is tasked to grasp, lift, and take away the objects placed within the robot workspace. For each image, DSGD generates multiple grasp candidates as shown in Fig. 7. For grasp execution, we select a random candidate which is located within the robot workspace and has confidence greater than 90%. A robotic grasp is considered successful if the robot grasps a target object (verified through force sensing in the gripper), holds it in the air for 3 seconds, and takes it away from the robot workspace. The objects are placed in random positions and orientations to remove bias related to the object pose. Table IV shows the success rates computed over 200 grasping trials. The results show that we achieved a grasp success rate of 90% with DenseNet as the base network. Some failure cases include objects with non-planar grasping surfaces (e.g., a brush). However, this can be improved by multi-finger grasps. We leave this for future work as our robotic arm only supports parallel grasps.

Fig. 7: Experimental setting for real-world robotic grasping.

VI. CONCLUSION AND FUTURE WORK

We presented Densely Supervised Grasp Detector (DSGD), which generates grasps and their confidence scores at different image hierarchical levels (i.e., global-, region-, and pixel-levels). Experiments show that our proposed hierarchical grasp generation produces superior grasp accuracy compared to the state-of-the-art on the Cornell grasp dataset. Our evaluations on videos from a Kinect and robotic grasping experiments show the capability of our DSGD for producing stable grasps for unseen objects in new environments. In the future, we plan to reduce the computational burden of our DSGD through parameter pruning for low-powered GPU devices.

REFERENCES

[1] Lenz, I., Lee, H., Saxena, A.: Deep learning for detecting robotic grasps. IJRR 34(4-5) (2015) 705–724

[2] Redmon, J., Angelova, A.: Real-time grasp detection using convolutional neural networks. In: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE (2015) 1316–1322


[3] Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE (2016) 3406–3413

[4] Morrison, D., Corke, P., Leitner, J.: Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172 (2018)

[5] Zeng, A., Song, S., Yu, K.T., Donlon, E., Hogan, F.R., Bauza, M., Ma, D., Taylor, O., Liu, M., Romo, E., et al.: Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. arXiv preprint arXiv:1710.01330 (2017)

[6] Saxena, A., Driemeyer, J., Ng, A.Y.: Robotic grasping of novel objects using vision. IJRR 27(2) (2008) 157–173

[7] Jiang, Y., Moseson, S., Saxena, A.: Efficient grasping from rgbd images: Learning using a new rectangle representation. In: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE (2011) 3304–3311

[8] Mahler, J., Pokorny, F.T., Hou, B., Roderick, M., Laskey, M., Aubry, M., Kohlhoff, K., Kroger, T., Kuffner, J., Goldberg, K.: Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on, IEEE (2016) 1957–1964

[9] Mahler, J., Liang, J., Niyaz, S., Laskey, M., Doan, R., Liu, X., Ojea, J.A., Goldberg, K.: Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. Robotics: Science and Systems (RSS) (2017)

[10] Johns, E., Leutenegger, S., Davison, A.J.: Deep learning a grasp function for grasping under gripper pose uncertainty. In: IROS, IEEE (2016) 4461–4468

[11] Kumra, S., Kanan, C.: Robotic grasp detection using deep convolutional neural networks. In: IROS, IEEE (2017)

[12] Guo, D., Sun, F., Liu, H., Kong, T., Fang, B., Xi, N.: A hybrid deep architecture for robotic grasp detection. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 1609–1614

[13] Zeng, A., Song, S., Yu, K.T., Donlon, E., Hogan, F.R., Bauza, M., Ma, D., Taylor, O., Liu, M., Romo, E., Fazeli, N., Alet, F., Dafle, N.C., Holladay, R., Morona, I., Nair, P.Q., Green, D., Taylor, I., Liu, W., Funkhouser, T., Rodriguez, A.: Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In: Proceedings of the IEEE International Conference on Robotics and Automation. (2018)

[14] Myers, A., Teo, C.L., Fermuller, C., Aloimonos, Y.: Affordance detection of tool parts from geometric features. In: Proceedings of the IEEE International Conference on Robotics and Automation. (2015) 1374–1381

[15] Varley, J., Weisz, J., Weiss, J., Allen, P.: Generating multi-fingered robotic grasps via deep learning. In: Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, IEEE (2015) 4415–4420

[16] Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., Quillen, D.: Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR (2016) 0278364917710318

[17] Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 1. (2017) 3

[18] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016) 770–778

[19] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML. (2015) 448–456

[20] He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE (2017) 2980–2988

[21] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. (2017)

[22] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5mb model size. arXiv preprint arXiv:1602.07360 (2016)

[23] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

[24] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)