TRANSCRIPT
MONOCULAR DEPTH PERCEPTION AND ROBOTIC
GRASPING OF NOVEL OBJECTS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Ashutosh Saxena
June 2009
© Copyright by Ashutosh Saxena 2009
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Andrew Y. Ng) Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Stephen Boyd)
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Sebastian Thrun)
Approved for the University Committee on Graduate Studies.
Abstract
The ability to perceive the 3D shape of the environment is fundamental for a robot.
We present an algorithm that converts standard digital pictures into 3D models.
This is a challenging problem, since an image is formed by a projection of the 3D
scene onto two dimensions, thus losing the depth information. We take a supervised
learning approach to this problem, and use a Markov Random Field (MRF) to model
the scene depth as a function of the image features. We show that, even on unstruc-
tured scenes of a large variety of environments, our algorithm is frequently able to
recover accurate 3D models.
We then apply our methods to robotics applications: (a) obstacle avoidance for
autonomously driving a small electric car, and (b) robot manipulation, where we
develop vision-based learning algorithms for grasping novel objects. This enables our
robot to perform tasks such as opening new doors, clearing cluttered tables, and
unloading items from a dishwasher.
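To make the abstract's approach concrete, here is a toy sketch of MAP inference in a Gaussian MRF over depths, in the spirit of the model summarized above. This is illustrative only, not the dissertation's implementation: the 1-D chain, the function name, and the weights s1 and s2 are my own simplifications.

```python
import numpy as np

# Toy Gaussian MRF over a 1-D "scanline" of patch depths. Each patch i
# has a feature-based depth estimate mu[i] (standing in for x_i . theta),
# and neighboring depths are encouraged to be similar.
#
# Energy:  sum_i (d_i - mu_i)^2 / s1  +  sum_i (d_i - d_{i+1})^2 / s2
# The MAP depths solve the linear system (I/s1 + L/s2) d = mu/s1,
# where L is the chain-graph Laplacian.

def map_depths(mu, s1=1.0, s2=0.5):
    mu = np.asarray(mu, dtype=float)
    n = len(mu)
    L = np.zeros((n, n))
    for i in range(n - 1):          # pairwise smoothness terms
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    A = np.eye(n) / s1 + L / s2
    return np.linalg.solve(A, mu / s1)

# One noisy feature-based estimate: the smoothness prior pulls the
# outlier at index 2 toward its neighbors.
mu = np.array([1.0, 1.2, 5.0, 1.1, 1.0])
d = map_depths(mu)
```

In the dissertation the graph is 2-D and multiscale and the Laplacian-model potentials are non-Gaussian, but the basic structure, a data term from image features plus pairwise smoothness terms, is the same.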
Acknowledgements
First and foremost, I would like to thank my advisor Andrew Y. Ng for his continuous
support and the stream of ideas that kept me occupied. In addition to being instru-
mental in suggesting amazing research directions to pursue, he has been very helpful
in providing the resources and the motivation to execute them. Finally, he has been
a very patient mentor who also provided strong moral support.
I also thank Sebastian Thrun, Daphne Koller, Oussama Khatib, Kenneth Salis-
bury and Stephen Boyd for their ideas. The discussions with Sebastian Thrun and
Oussama Khatib have been very motivational.
I would like to thank Isaac Asimov and a number of other science fiction writers
for getting me excited about the field of robotics and artificial intelligence.
I owe much to the students I worked with, specifically Sung Chung, Justin Driemeyer,
Brad Gulko, Arvind Sujeeth, Justin Kearns, Chioma Osondu, Sai Soundararaj, Min
Sun, and Lawson Wong. I really enjoyed working with Min Sun, Justin Driemeyer
and Lawson Wong over a number of projects; their ideas and tenacity have been
instrumental in much of the research presented in this dissertation.
I also thank Stephen Gould, Geremy Heitz, Kaijen Hsiao, Ellen Klingbeil, Quoc
Le, Andrew Lookingbill, Jeff Michels, Paul Nangeroni, and Jamie Schulte. I give warm
thanks to Jamie Schulte and Morgan Quigley for help with the robotic hardware and
the communication software.
I would like to thank Pieter Abbeel, Adam Coates, Chuong Do, Zico Kolter,
Honglak Lee, Rajat Raina, Olga Russakovsky, and Suchi Saria for useful discussions.
I am indebted to Olga for her help in expressing my ideas in the papers I wrote.
Parts of the work in this thesis were supported by the DARPA LAGR program
under contract number FA8650-04-C-7134, DARPA transfer learning program un-
der contract number FA8750-05-2-0249, by the National Science Foundation under
award number CNS-0551737, and by the Office of Naval Research under MURI
N000140710747. Support from PixBlitz Studios and Willow Garage is also grate-
fully acknowledged.
Finally, I would like to thank my father Anand Prakash for cultivating my curiosity
when I was a kid, and my mother Shikha Chandra, who encouraged me and patiently
nurtured it. I also thank my brother Abhijit Saxena for his continuous
support during my PhD.
Contents
Abstract v
Acknowledgements vi
1 Introduction 1
1.1 Depth perception from a single image . . . . . . . . . . . . . . . . . . 1
1.2 Robot applications using vision . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Robot grasping . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 First published appearances of the described contributions . . . . . . 10
2 Single Image Depth Reconstruction 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Monocular cues for depth perception . . . . . . . . . . . . . . . . . . 18
2.4 Point-wise feature vector . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Features for absolute depth . . . . . . . . . . . . . . . . . . . 20
2.4.2 Features for boundaries . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Point-wise probabilistic model . . . . . . . . . . . . . . . . . . . . . . 23
2.5.1 Gaussian model . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.2 Laplacian model . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 MAP inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.1 Gaussian model . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.2 Laplacian model . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Plane parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Segment-wise features . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.9.1 Monocular image features . . . . . . . . . . . . . . . . . . . . 31
2.9.2 Features for boundaries . . . . . . . . . . . . . . . . . . . . . . 32
2.10 Probabilistic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10.1 Parameter learning . . . . . . . . . . . . . . . . . . . . . . . . 40
2.10.2 MAP inference . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.12 Experiments: depths . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.13 Experiments: visual . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.14 Large-scale web experiment . . . . . . . . . . . . . . . . . . . . . . . 58
2.15 Incorporating object information . . . . . . . . . . . . . . . . . . . . 58
2.16 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3 Multi-view Geometry using Monocular Vision 64
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Improving stereo vision using monocular cues . . . . . . . . . . . . . 66
3.2.1 Disparity from stereo correspondence . . . . . . . . . . . . . . 66
3.2.2 Modeling uncertainty in stereo . . . . . . . . . . . . . . . . 67
3.2.3 Probabilistic model . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.4 Results on stereo . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Larger 3D models from multiple images . . . . . . . . . . . . . . . . . 72
3.3.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Probabilistic model . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.3 Triangulation matches . . . . . . . . . . . . . . . . . . . . . . 74
3.3.4 Phantom planes . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4 Obstacle Avoidance using Monocular Vision 81
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3 Vision system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.1 Feature vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3.3 Synthetic graphics data . . . . . . . . . . . . . . . . . . . . . 89
4.3.4 Error metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.5 Combined vision system . . . . . . . . . . . . . . . . . . . . . 91
4.4 Control and reinforcement learning . . . . . . . . . . . . . . . . . . . 92
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . 94
4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 Robotic Grasping: Identifying Grasp-point 98
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Learning the grasping point . . . . . . . . . . . . . . . . . . . . . . . 105
5.3.1 Grasping point . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.2 Synthetic data for training . . . . . . . . . . . . . . . . . . . . 106
5.3.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3.4 Probabilistic model . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.5 MAP inference . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4 Predicting orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5 Robot platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.6 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.7.1 Experiment 1: Synthetic data . . . . . . . . . . . . . . . . . . 125
5.7.2 Experiment 2: Grasping novel objects . . . . . . . . . . . . . . 125
5.7.3 Experiment 3: Unloading items from dishwashers . . . . . . . 127
5.7.4 Experiment 4: Grasping kitchen and office objects . . . . . . . 133
5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6 Robotic Grasping: Full Hand Configuration 136
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2 3D sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.3 Grasping Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.3.1 Probabilistic model . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.4.1 Grasping single novel objects . . . . . . . . . . . . . . . . . . 144
6.4.2 Grasping in cluttered scenarios . . . . . . . . . . . . . . . . . 146
6.5 Optical proximity sensors . . . . . . . . . . . . . . . . . . . . . . . . 148
6.5.1 Optical Sensor Hardware . . . . . . . . . . . . . . . . . . . . . 151
6.5.2 Sensor model . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.5.3 Reactive grasp closure controller . . . . . . . . . . . . . . . . . 160
6.5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.6 Applications to opening new doors . . . . . . . . . . . . . . . . . . . 165
6.6.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.6.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.6.3 Identifying visual keypoints . . . . . . . . . . . . . . . . . . . 168
6.6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7 Conclusions 175
List of Tables
2.1 Effect of multiscale and column features on accuracy. The average
absolute errors (RMS errors gave very similar trends) are on a log
scale (base 10). H1 and H2 represent summary statistics for k = 1, 2.
S1, S2 and S3 represent the 3 scales. C represents the column features.
Baseline is trained with only the bias term (no features). . . . . . . . 48
2.2 Results: Quantitative comparison of various methods. . . . . . . . . . 53
2.3 Percentage of images for which HEH is better, our PP-MRF is better,
or it is a tie. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.1 The average errors (RMS errors gave very similar trends) for various
cues and models, on a log scale (base 10). . . . . . . . . . . . . . . . . 70
4.1 Real driving tests under different terrains . . . . . . . . . . . . . . . . 95
4.2 Test error on 3124 images, obtained with different feature vectors after
training on 1561 images. For the case of “None” (without any fea-
tures), the system predicts the same distance for all stripes, so relative
depth error has no meaning. . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Test error on 1561 real images, after training on 667 graphics images,
with different levels of graphics realism using Laws features. . . . . . 97
5.1 Mean absolute error in locating the grasping point for different objects,
as well as grasp success rate for picking up the different objects using
our robotic arm. (Although training was done on synthetic images,
testing was done on the real robotic arm and real objects.) . . . . . . 123
5.2 Grasp-success rate for unloading items from a dishwasher. . . . . . . 128
6.1 STAIR 2. Grasping single objects. (150 trials.) . . . . . . . . . . . . . 146
6.2 STAIR 2. Grasping in cluttered scenes. (40 trials.) . . . . . . . . . . 148
6.3 STAIR 1. Dishwasher unloading results. (50 trials.) . . . . . . . . . . 148
6.4 Model Test Experiment Error . . . . . . . . . . . . . . . . . . . . . . 161
6.5 Visual keypoints for some manipulation tasks. . . . . . . . . . . . . . 167
6.6 Accuracies for recognition and localization. . . . . . . . . . . . . . . . 171
6.7 Error rates obtained for the robot opening the door in a total number
of 34 trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
List of Figures
1.1 (a) A single still image, and (b) the corresponding depths. Our goal is
to estimate the depths given a single image. . . . . . . . . . . . . . . 2
1.2 The remote-controlled car driven autonomously in various cluttered
unconstrained environments, using our algorithm. . . . . . . . . . . . 6
2.1 (a) A single still image, and (b) the corresponding (ground-truth)
depthmap. Colors in the depthmap indicate estimated distances from
the camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 (a) An original image. (b) Oversegmentation of the image to obtain
“superpixels”. (c) The 3D model predicted by the algorithm. (d) A
screenshot of the textured 3D model. . . . . . . . . . . . . . . . . . . 14
2.3 In Single view metrology [21], heights are measured using parallel lines.
(Image reproduced with permission from [21].) . . . . . . . . . . . . . 17
2.4 The convolutional filters used for texture energies and gradients. The
first nine are 3×3 Laws’ masks. The last six are the oriented edge
detectors spaced at 30° intervals. The nine Laws’ masks are used to
perform local averaging, edge detection and spot detection. . . . . . 19
2.5 The absolute depth feature vector for a patch, which includes features
from its immediate neighbors and its more distant neighbors (at larger
scales). The relative depth features for each patch use histograms of
the filter outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 The multiscale MRF model for modeling relation between features and
depths, relation between depths at same scale, and relation between
depths at different scales. (Only 2 out of 3 scales, and a subset of the
edges, are shown.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 The log-histogram of relative depths. Empirically, the distribution of
relative depths is closer to Laplacian than Gaussian. . . . . . . . . . . 27
2.8 (Left) An image of a scene. (Right) Oversegmented image. Each small
segment (superpixel) lies on a plane in the 3d world. (Best viewed in
color.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 A 2-d illustration to explain the plane parameter α and rays R from
the camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 (Left) Original image. (Right) Superpixels overlaid with an illustration
of the Markov Random Field (MRF). The MRF models the relations
(shown by the edges) between neighboring superpixels. (Only a subset
of nodes and edges shown.) . . . . . . . . . . . . . . . . . . . . . . . . 33
2.11 (Left) An image of a scene. (Right) Inferred “soft” values of yij ∈ [0, 1].
(yij = 0 indicates an occlusion boundary/fold, and is shown in black.)
Note that even with the inferred yij being not completely accurate, the
plane parameter MRF will be able to infer “correct” 3D models. . . . 35
2.12 Illustration explaining effect of the choice of si and sj on enforcing (a)
Connected structure and (b) Co-planarity. . . . . . . . . . . . . . . . 36
2.13 A 2-d illustration to explain the co-planarity term. The distance of the
point sj on superpixel j to the plane on which superpixel i lies along
the ray Rj,sj is given by d1 − d2. . . . . . . . . . . . . . . . . . . . . 38
2.14 Co-linearity. (a) Two superpixels i and j lying on a straight line in the
2-d image, (b) An illustration showing that a long straight line in the
image plane is more likely to be a straight line in 3D. . . . . . . . . . 39
2.15 The feature vector. (a) The original image, (b) Superpixels for the
image, (c) An illustration showing the location of the neighbors of
superpixel S3C at multiple scales, (d) Actual neighboring superpixels
of S3C at the finest scale, (e) Features from each neighboring superpixel
along with the superpixel-shape features give a total of 524 features for
the superpixel S3C. (Best viewed in color.) . . . . . . . . . . . . . . . 43
2.16 Results for a varied set of environments, showing (a) original image, (b)
ground truth depthmap, (c) predicted depthmap by Gaussian model,
(d) predicted depthmap by Laplacian model. (Best viewed in color). 44
2.17 The 3D scanner used for collecting images and the corresponding depthmaps. 46
2.18 The custom built 3D scanner for collecting depthmaps with stereo im-
age pairs, mounted on the LAGR robot. . . . . . . . . . . . . . . . . 47
2.19 (a) original image, (b) ground truth depthmap, (c) “prior” depthmap
(trained with no features), (d) features only (no MRF relations), (e)
Full Laplacian model. (Best viewed in color). . . . . . . . . . . . 48
2.20 Typical examples of the predicted depths for images downloaded from
the internet. (Best viewed in color.) . . . . . . . . . . . . . . . . 50
2.21 (a) Original Image, (b) Ground truth depths, (c) Depth from image
features only, (d) Point-wise MRF, (e) Plane parameter MRF. (Best
viewed in color.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.22 Typical depths predicted by our plane parameterized algorithm on
hold-out test set, collected using the laser-scanner. (Best viewed in
color.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.23 Typical results from our algorithm. (Top row) Original images, (Bot-
tom row) depths (shown in log scale, yellow is closest, followed by red
and then blue) generated from the images using our plane parameter
MRF. (Best viewed in color.) . . . . . . . . . . . . . . . . . . . . . . 52
2.24 Typical results from HEH and our algorithm. Row 1: Original Image.
Row 2: 3D model generated by HEH, Row 3 and 4: 3D model
generated by our algorithm. (Note that the screenshots cannot be
simply obtained from the original image by an affine transformation.)
In image 1, HEH makes mistakes in some parts of the foreground rock,
while our algorithm predicts the correct model; with the rock occluding
the house, giving a novel view. In image 2, HEH algorithm detects
a wrong ground-vertical boundary; while our algorithm not only finds
the correct ground, but also captures a lot of non-vertical structure,
such as the blue slide. In image 3, HEH is confused by the reflection;
while our algorithm produces a correct 3D model. In image 4, HEH
and our algorithm produce roughly equivalent results—HEH is a bit
more visually pleasing and our model is a bit more detailed. In image
5, both HEH and our algorithm fail; HEH just predicts one vertical
plane at an incorrect location. Our algorithm predicts correct depths of
the pole and the horse, but is unable to detect their boundary; hence
making it qualitatively incorrect. . . . . . . . . . . . . . . . . . . . . 54
2.25 Typical results from our algorithm. Original image (top), and a screen-
shot of the 3D flythrough generated from the image (bottom of the
image). 16 images (a-g,l-t) were evaluated as “correct” and 4 (h-k)
were evaluated as “incorrect.” . . . . . . . . . . . . . . . . . . . . . . 56
2.26 (Left) Original Images, (Middle) Snapshot of the 3D model without
using object information, (Right) Snapshot of the 3D model that uses
object information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.27 (top two rows) three cases where CCM improved results for all tasks.
In the first, for instance, the presence of grass allows the CCM to
remove the boat detections. The next four rows show four examples
where detections are improved and four examples where segmentations
are improved. (Best viewed in color.) . . . . . . . . . . . . . . . . . . 62
3.1 Two images taken from a stereo pair of cameras, and the depthmap
calculated by a stereo system. . . . . . . . . . . . . . . . . . . . . . . 65
3.2 Results for a varied set of environments, showing one image of the
stereo pairs (column 1), ground truth depthmap collected from 3D laser
scanner (column 2), depths calculated by stereo (column 3), depths
predicted by using monocular cues only (column 4), depths predicted
by using both monocular and stereo cues (column 5). The bottom row
shows the color scale for representation of depths. Closest points are
1.2 m, and farthest are 81m. (Best viewed in color) . . . . . . . . . . 69
3.3 The average errors (on a log scale, base 10) as a function of the distance
from the camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 An illustration of the Markov Random Field (MRF) for inferring 3D
structure. (Only a subset of edges and scales shown.) . . . . . . . . . 73
3.5 An image showing a few matches (left), and the resulting 3D model
(right) without estimating the variables y for confidence in the 3D
matching. The noisy 3D matches reduce the quality of the model.
(Note the cones erroneously projecting out from the wall.) . . . . . . 75
3.6 Approximate monocular depth estimates help to limit the search area
for finding correspondences. For a point (shown as a red dot) in image
B, the corresponding region to search in image A is now a rectangle
(shown in red) instead of a band around its epipolar line (shown in
purple) in image A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7 (a) Bad correspondences, caused by repeated structures in the world.
(b) Use of monocular depth estimates results in better correspondences.
Note that these correspondences are still fairly sparse and slightly noisy,
and are therefore insufficient for creating a good 3D model if we do not
additionally use monocular cues. . . . . . . . . . . . . . . . . . . . . . 76
3.8 (a,b,c) Three original images from different viewpoints; (d,e,f) Snap-
shots of the 3D model predicted by our algorithm. (f) shows a top-down
view; the top part of the figure shows portions of the ground correctly
modeled as lying either within or beyond the arch. . . . . . . . . . . . 77
3.9 (a,b) Two original images with only a little overlap, taken from the
same camera location. (c,d) Snapshots from our inferred 3D model. . 78
3.10 (a,b) Two original images with many repeated structures; (c,d) Snap-
shots of the 3D model predicted by our algorithm. . . . . . . . . . . . 78
3.11 (a,b,c,d) Four original images; (e,f) Two snapshots shown from a larger
3D model created using our algorithm. . . . . . . . . . . . . . . . . . 79
4.1 (a) The remote-controlled car driven autonomously in various cluttered
unconstrained environments, using our algorithm. (b) A view from the
car, with the chosen steering direction indicated by the red square; the
estimated distances to obstacles in the different directions are shown
by the bar graph below the image. . . . . . . . . . . . . . . . . . . . 82
4.2 Laser range scans overlaid on the image. Laser scans give one range
estimate per degree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Rig for collecting correlated laser range scans and real camera images. 86
4.4 For each overlapping window Wsr, statistical coefficients (Laws’ texture
energy, Harris angle distribution, Radon) are calculated. The feature
vector for a stripe consists of the coefficients for itself, its left column
and right column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Graphics images in order of increasing level of detail. (In color, where
available.) (a) Uniform trees, (b) Different types of trees of uniform
size, (c) Trees of varying size and type, (d) Increased density of trees,
(e) Texture on trees only, (f) Texture on ground only, (g) Texture on
both, (h) Texture and Shadows. . . . . . . . . . . . . . . . . . . . . . 90
4.6 Experimental setup showing the car and the instruments to control it
remotely. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.7 Reduction in hazard rate by combining systems trained on synthetic
and real data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.8 An image sequence showing the car avoiding a small bush, a difficult
obstacle to detect in the presence of trees that appear larger visually
yet are more distant. (a) First row shows the car driving to avoid a series
of obstacles (b) Second row shows the car view. (Blue square is the
largest distance direction. Red square is the chosen steering direction
by the controller.) (Best viewed in color) . . . . . . . . . . . . . . . . 96
5.1 Our robot unloading items from a dishwasher. . . . . . . . . . . . . 99
5.2 Some examples of objects on which the grasping algorithm was tested. 100
5.3 The images (top row) with the corresponding labels (shown in red in
the bottom row) of the five object classes used for training. The classes
of objects used for training were martini glasses, mugs, whiteboard
erasers, books and pencils. . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 An illustration showing the grasp labels. The labeled grasp for a mar-
tini glass is on its neck (shown by a black cylinder). For two predicted
grasping points P1 and P2, the error would be the 3D distance from
the grasping region, i.e., d1 and d2 respectively. . . . . . . . . . . . . 107
5.5 (a) An image of textureless/transparent/reflective objects. (b) Depths
estimated by our stereo system. The grayscale value indicates the
depth (darker being closer to the camera). Black represents areas where
stereo vision failed to return a depth estimate. . . . . . . . . . . . . . 109
5.6 Grasping point classification. The red points in each image show the
locations most likely to be a grasping point, as predicted by our logistic
regression model. (Best viewed in color.) . . . . . . . . . . . . . . . . 110
5.7 (a) Diagram illustrating rays from two images C1 and C2 intersecting
at a grasping point (shown in dark blue). (Best viewed in color.) . . . 113
5.8 Predicted orientation at the grasping point for a pen. Dotted line rep-
resents the true orientation, and solid line represents the predicted
orientation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.9 STAIR 1 platform. This robot is equipped with a 5-dof arm and a
parallel plate gripper. . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.10 STAIR 2 platform. This robot is equipped with a 7-dof Barrett arm and
three-fingered hand. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.11 The robotic arm picking up various objects: screwdriver, box, tape-
roll, wine glass, a solder tool holder, coffee pot, powerhorn, cellphone,
book, stapler and coffee mug. (See Section 5.7.2.) . . . . . . . . . . . 124
5.12 Example of a real dishwasher image, used for training in the dishwasher
experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.13 Grasping point detection for objects in a dishwasher. (Only the points
in top five grasping regions are shown.) . . . . . . . . . . . . . . . . . 129
5.14 Dishwasher experiments (Section 5.7.3): Our robotic arm unloads items
from a dishwasher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.15 Dishwasher experiments: Failure case. For some configurations of cer-
tain objects, it is impossible for our 5-dof robotic arm to grasp them
(even if a human were controlling the arm). . . . . . . . . . . . . . . 132
5.16 Grasping point classification in kitchen and office scenarios: (Left Col-
umn) Top five grasps predicted by the grasp classifier alone. (Right
Column) Top two grasps for three different object-types, predicted by
the grasp classifier when given the object types and locations. The red
points in each image are the predicted grasp locations. (Best viewed
in color.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.1 Image of an environment (left) and the 3D point-cloud (right) returned
by the Swissranger depth sensor. . . . . . . . . . . . . . . . . . . . . 138
6.2 (Left) Barrett 3-fingered hand. (Right) Katana parallel plate gripper. 139
6.3 Features for symmetry / evenness. . . . . . . . . . . . . . . . . . . . . 143
6.4 Demonstration of PCA features. . . . . . . . . . . . . . . . . . . . . . 144
6.5 Snapshots of our robot grasping novel objects of various sizes/shapes. 145
6.6 Barrett arm grasping an object using our algorithm. . . . . . . . . . . 147
6.7 Three-fingered Barrett Hand with our optical proximity sensors mounted
on the finger tips. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.8 Normalized Voltage Output vs. Distance for a TCND5000 emitter-
detector pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.9 Possible fingertip sensor configurations. . . . . . . . . . . . . . . . . . 152
6.10 Front view of sensor constellation. Hatches show the crosstalk between
adjacent sensor pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.11 Calibration Data. (Top row) plots xrot vs. energy for all three sen-
sors, with binned distances represented by different colors (distance bin
centers in mm: green=5, blue=9, cyan=13, magenta=19, yellow=24,
black=29, burlywood=34). (Bottom row) shows zrot vs. energy for all
three sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.12 Locally weighted linear regression on calibration data. The plot shows
energy values for one sensor at a fixed distance and varying x and
z orientation. Green points are recorded data; red points show the
interpolated and extrapolated estimates of the model. . . . . . . . . . 157
6.13 A sequence showing the grasp trajectory chosen by our algorithm. Ini-
tially, the fingers are completely open; as more data comes, the esti-
mates get better, and the hand turns and closes the finger in such a
way that all the fingers touch the object at the same time. . . . . . . 160
6.14 Different objects on which we present our analysis of the model pre-
dictions. See Section 6.5.4 for details. . . . . . . . . . . . . . . . . . . 161
6.15 Failure cases: (a) Shiny can, (b) Transparent cup . . . . . . . . . . . 163
6.16 Snapshots of the robot picking up different objects. . . . . . . . . . . 164
6.17 Results on test set. The green rectangles show the raw output from
the classifiers, and the blue rectangle is the one after applying context. 170
6.18 Some experimental snapshots showing our robot opening different types
of doors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Chapter 1
Introduction
Modern-day robots can be carefully hand-programmed to perform many amazing
tasks, such as assembling cars in a factory. However, it remains a challenging problem
for robots to operate autonomously in unconstrained and unknown environments,
such as the home or office. For example, consider the task of getting a robot to
unload novel items from a dishwasher. While it is possible today to manually joystick
(tele-operate) a robot to do this task, it is well beyond the state of the art to do
this autonomously. For example, even having a robot recognize that a mug should be
picked up by its handle is difficult, particularly if the mug has not been seen before by
the robot. A key challenge in robotics is thus the ability to perceive novel environ-
ments well enough to act in them.
One of the most basic perception abilities for a robot is perceiving the 3D shape
of an environment. This is useful for both navigation and manipulation. In this
dissertation, we will first describe learning algorithms for depth perception using a
single still image.
1.1 Depth perception from a single image
Cameras are inexpensive and widely available. Most work on depth perception has
focused on methods that use multiple images and triangulation, such as stereo vision.
In this work, however, we were interested in the problem of estimating depths from
Figure 1.1: (a) A single still image, and (b) the corresponding depths. Our goal is to estimate the depths given a single image.
a single image (see Fig. 1.1). This has traditionally been considered an impossible
problem in computer vision; and while this problem is indeed impossible in a narrow
mathematical sense, it is nonetheless a problem that humans solve fairly well. We
would like our robots to have a similar ability to perceive depth from one image.
In addition to being a problem that is interesting and important in its own right,
single-image depth perception also has significant advantages over stereo in terms of
cost and of scaling to significantly greater distances.
In order to estimate depths from an image, humans use a number of monocular
(single image) cues such as texture variations, texture gradients, interposition, occlu-
sion, known object sizes, light and shading, haze, defocus, etc. [9] For example, many
objects’ texture will look different at different distances from the viewer. In another
example, a tiled floor with parallel lines will appear to have tilted lines in an image.
The distant patches will have larger variations in the line orientations, and nearby
patches with almost parallel lines will have smaller variations in line orientations.
Similarly, a grass field when viewed at different distances will have different texture
gradient distributions. There are a number of other such patterns from which humans
have learned to estimate depth. For example, a large blue patch in the upper part
of a scene is more likely to be sky. In the past, researchers have designed algorithms
to estimate 3D structure from a single image for a few very specific settings.
Nagai et al. [98] used Hidden Markov Models to perform surface reconstruction
from single images for known, fixed objects such as hands and faces. Criminisi et
al. [21] computed aspects of the geometry of the scene using vanishing points and
parallel lines in the image. However, this method works only for images with clear
parallel lines, such as Manhattan-type environments, and fails on the general scenes
that we consider.
In order to learn some of these patterns, we have developed a machine learning
algorithm. Our method uses a supervised learning algorithm, which learns to estimate
depths directly as a function of visual image features. These features are obtained
from a number of low-level cues computed at each point in the image. We used a
3D scanner to collect training data, which comprised a large set of images and their
corresponding ground-truth depths. Using this training set, we model the conditional
distribution of the depth at each point in the image given the monocular image
features.
However, local image cues alone are usually insufficient to infer the 3D structure.
For example, both blue sky and a blue object would give similar local features; hence
it is difficult to estimate depths from local features alone.
Our ability to “integrate information” over space, i.e., to understand the relation
between different parts of the image, is crucial to understanding the scene’s 3D struc-
ture. For example, even if part of an image is a homogeneous, featureless, gray patch,
one is often able to infer its depth by looking at nearby portions of the image, so as
to recognize whether this patch is part of a sidewalk, a wall, etc. Therefore, in our
model we will also capture relations between different parts of the image. For this,
we use a machine learning technique called a Markov Random Field (MRF), which
models the statistical correlations between the labels of neighboring regions.
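The combination of per-point predictions with neighborhood relations can be sketched as a small Gaussian MRF over a depth grid. This is an illustrative toy only: the single scale, the hand-set smoothness weight `lam`, and the synthetic data are all assumptions, not the multiscale model developed in Chapter 2.

```python
import numpy as np

# Toy Gaussian MRF over an H x W depth grid:
#   E(d) = sum_i (d_i - p_i)^2  +  lam * sum_{i~j} (d_i - d_j)^2
# where p_i is the local (per-pixel) depth prediction and i~j ranges over
# 4-connected neighbors. The energy is quadratic, so coordinate descent
# (Gauss-Seidel) converges to the MAP depths.
def mrf_map_depths(local_pred, lam=2.0, iters=200):
    d = local_pred.astype(float).copy()
    H, W = d.shape
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                nbrs = [d[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= a < H and 0 <= b < W]
                # stationarity condition of E with respect to d[i, j]
                d[i, j] = (local_pred[i, j] + lam * sum(nbrs)) / (1 + lam * len(nbrs))
    return d

rng = np.random.default_rng(0)
true_depth = np.tile(np.linspace(2.0, 10.0, 8), (8, 1))           # smooth synthetic scene
noisy_pred = true_depth + rng.normal(0.0, 1.0, true_depth.shape)  # noisy "local cue" estimates
smoothed = mrf_map_depths(noisy_pred)
```

The pairwise term pulls each noisy per-pixel estimate toward its neighbors, mirroring the intuition that adjacent image patches tend to lie at similar depths.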
Our algorithm was able to accurately estimate depths from images of a wide
variety of environments. We also studied the performance of our algorithm through
a live deployment in the form of a web service called Make3D. Tens of thousands of
users have successfully used it to automatically convert their own pictures into 3D
models.
A 3D model built from a single image will almost invariably be an incomplete
model of the scene, because many portions of the scene will be missing or occluded.
We then consider 3D reconstruction from multiple views. Such views are available
in many scenarios. For example, robot(s) may be able to take images from different
viewpoints while performing a task. As another example, consider the handful of
pictures taken by a tourist of a scene from different vantage points.
Given such a sparse set of images of a scene, it is sometimes possible to construct
a 3D model using techniques such as structure from motion (SFM) [33, 117], which
start by taking two or more photographs, then find correspondences between the
images, and finally use triangulation to obtain 3D locations of the points. If the
images are taken from nearby cameras (i.e., if the baseline distance is small), then
these methods often suffer from large triangulation errors for points far-away from
the camera. (I.e., the depth estimates will tend to be inaccurate for objects at large
distances, because even small errors in triangulation will result in large errors in
depth.) If, conversely, one chooses images taken far apart, then often the change of
viewpoint causes the images to become very different, so that finding correspondences
becomes difficult, sometimes leading to spurious or missed correspondences. (Worse,
the large baseline also means that there may be little overlap between the images, so
that few correspondences may even exist.) These difficulties make purely geometric
3D reconstruction algorithms fail in many cases, specifically when given only a small
set of images.
However, when tens of thousands of pictures are available—for example, for frequently-
photographed tourist attractions such as national monuments—one can use the in-
formation present in many views to reliably discard images that have only few corre-
spondence matches. Doing so, one can use only a small subset of the images available,
and still obtain a “3D point cloud” for points that were matched using SFM. This
approach has been very successfully applied to famous buildings such as the Notre
Dame; however, the computational cost of this approach was significant, requiring
about a week on a cluster of computers [164].
In the specific case of images taken from a pair of stereo cameras, the depths
are estimated by triangulation using the two images. Over the past few decades,
researchers have developed very good stereo vision systems (see [155] for a review).
Although these systems work well in many environments, stereo vision is fundamen-
tally limited by the baseline distance between the two cameras. Specifically, their
depth estimates tend to be inaccurate when the distances considered are large (be-
cause even very small triangulation/angle estimation errors translate to very large
errors in distances). Further, stereo vision also tends to fail for textureless regions
of images where correspondences cannot be reliably found. One of the reasons that
such methods (based on purely geometric triangulation) often fail is that they do not
take the monocular cues (i.e., information present in the single image) into account.
On the other hand, humans perceive depth by seamlessly combining monocular
cues with stereo cues. We believe that monocular cues and (purely geometric) stereo
cues give largely orthogonal, and therefore complementary, types of information about
depth. Stereo cues are based on the difference between two images and do not depend
on the content of the image. Even if the images were entirely random, they would still
generate a pattern of disparities (e.g., random dot stereograms [9]). On the other
hand, depth estimates from monocular cues are entirely based on the evidence about
the environment presented in a single image.
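The baseline limitation can be made concrete with the standard pinhole-stereo relation Z = f·b/d (depth Z, focal length f in pixels, baseline b, disparity d in pixels). The camera numbers below are assumed purely for illustration, not taken from any system in this dissertation.

```python
# Illustrative stereo rig: 700 px focal length, 12 cm baseline (assumed values).
f_px = 700.0
baseline_m = 0.12

def depth_from_disparity(d_px):
    # pinhole stereo: Z = f * b / d
    return f_px * baseline_m / d_px

# Effect of a fixed 1-pixel matching error at different ranges:
# the depth error grows roughly as Z^2 / (f * b).
for Z in (2.0, 10.0, 50.0):
    d = f_px * baseline_m / Z                 # true disparity at depth Z
    err = depth_from_disparity(d - 1.0) - Z   # depth error from a 1-px mismatch
    print(f"Z = {Z:5.1f} m   disparity = {d:6.2f} px   1-px depth error = {err:6.2f} m")
```

With these numbers, a one-pixel matching error costs about 5 cm at 2 m but tens of meters at 50 m, which is why monocular cues become especially valuable at range.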
In this dissertation, we investigate how monocular cues can be integrated with
any reasonable stereo system, to obtain better depth estimates than the stereo system
alone. Our learning algorithm models the depths as a function of single image features
as well as the depth obtained from stereo triangulation if available, and combines
them to obtain better depth estimates. We then also consider creating 3D models
from multiple views (which are not necessarily from a calibrated stereo pair) to create
a larger 3D model.
1.2 Robot applications using vision
Consider the problem of a small robot trying to navigate through a cluttered en-
vironment, for example, a small electric car through a forest (see Figure 1.2). For
small-sized robots, size places strict restrictions on payload weight and power
supply. In these robots, only small, lightweight sensors with low power consumption
can be used. Cameras are an attractive option because they are small, passive sensors with
low power requirements.
For high-speed navigation of a small car, we have applied our ideas of single-image
depth perception to obstacle avoidance. We
had two other motivations for this work: First, we believe that single-image monoc-
ular cues have not been effectively used in most obstacle detection systems; thus we
consider it an important open problem to develop methods for obstacle detection that
exploit these monocular cues. Second, the dynamics and inertia of high-speed driving
(5 m/s on a small remote-control car) mean that obstacles must be perceived at a
distance if we are to avoid them; the distance at which standard binocular vision
algorithms can perceive obstacles is fundamentally limited by the “baseline” distance
between the two cameras and by the noise in the observations, making such algorithms
difficult to apply to this problem.
Figure 1.2: The remote-controlled car driven autonomously in various cluttered unconstrained environments, using our algorithm.
For obstacle avoidance for a car, the problem simplifies to estimating the distances
to the nearest obstacles in each driving direction (i.e., each column in the image).
We use supervised learning to train such a system, and feed the output of this vision
system into a higher level controller trained using reinforcement learning. Our method
was able to avoid obstacles and drive autonomously in novel environments in various
locations.
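A toy version of this pipeline can be sketched as follows. The stripe statistics, the linear regressor, and its weights are stand-ins chosen for illustration, not the actual learned features or controller.

```python
import numpy as np

# The image is split into vertical stripes (one per steering direction);
# crude per-stripe statistics stand in for real texture/edge features, a
# linear model maps them to the log of the nearest-obstacle distance, and
# the controller steers toward the stripe predicted to be clearest.
def stripe_features(img, n_stripes=8):
    H, W, _ = img.shape
    feats = []
    for k in range(n_stripes):
        s = img[:, k * W // n_stripes : (k + 1) * W // n_stripes]
        feats.append([s.mean(), s.std(), 1.0])   # brightness, texture proxy, bias
    return np.array(feats)

def predict_log_distances(feats, w):
    return feats @ w                             # weights w would be learned offline

def pick_heading(log_dists):
    return int(np.argmax(log_dists))             # steer toward the clearest stripe

rng = np.random.default_rng(1)
frame = rng.random((48, 64, 3))                  # stand-in camera frame
w = np.array([0.5, 2.0, 0.1])                    # stand-in learned weights
heading = pick_heading(predict_log_distances(stripe_features(frame), w))
```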
1.2.1 Robot grasping
We then consider the problem of grasping novel objects. Grasping is an important
ability for robots and is necessary in many tasks, such as for household robots cleaning
up a room or unloading items from a dishwasher.
If we are trying to grasp a previously known object for which a full and accurate
3D model is available, one can build physics-based models that use the known geometry
of the hand and the object to model the contact forces and compute stable grasps,
i.e., grasps with complete restraint. A grasp with complete restraint prevents loss of
contact and is thus very secure. A form-closure of a solid object is a finite set of
wrenches (force-moment combinations) applied on the object, with the property that
any other wrench acting on the object can be balanced by a positive combination
of the original ones [85]. Intuitively, form closure is a way of defining the notion of
a “firm grip” of the object [82]. It guarantees maintenance of contact as long as the
links of the hand and the object are well approximated as rigid and as long as the
joint actuators are sufficiently strong [123]. The authors of [75] proposed an efficient
algorithm for computing all n-finger form-closure grasps on a polygonal object, based
on a set of necessary and sufficient conditions for form-closure.
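For a planar object (where a wrench is (f_x, f_y, torque)), the positive-combination condition above can be checked exactly when there are exactly four frictionless point contacts: the contact wrenches must span wrench space, and the then one-dimensional nullspace of the wrench matrix must be strictly one-signed. The contact placements below are a textbook-style example, not taken from this dissertation.

```python
import numpy as np

# Minimal 2D form-closure check: wrench space for a planar object is 3D
# (f_x, f_y, torque).

def wrench(p, n):
    # planar wrench of a unit contact force n applied at point p: (f, p x f)
    return np.array([n[0], n[1], p[0] * n[1] - p[1] * n[0]])

def form_closure_4(contacts):
    W = np.column_stack([wrench(p, n) for p, n in contacts])  # 3 x 4
    if np.linalg.matrix_rank(W) < 3:
        return False                      # wrenches do not span wrench space
    a = np.linalg.svd(W)[2][-1]           # null direction: W @ a = 0
    # strictly positive (or strictly negative) => 0 is a positive combination
    return bool(np.all(a > 1e-9) or np.all(a < -1e-9))

# Contacts offset along the edges of a unit square: form closure holds.
offset = [((1, 0.5), (-1, 0)), ((-1, -0.5), (1, 0)),
          ((0.5, 1), (0, -1)), ((-0.5, -1), (0, 1))]
# Contacts at the edge midpoints: every wrench has zero torque, so the
# square is still free to rotate and form closure fails.
midpoints = [((1, 0), (-1, 0)), ((-1, 0), (1, 0)),
             ((0, 1), (0, -1)), ((0, -1), (0, 1))]
```

The midpoint configuration fails precisely because nothing resists rotation, which is the intuition behind needing the wrenches, torques included, to positively span the whole wrench space.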
The primary difference between form-closure and force-closure grasps is that force-
closure relies on contact friction. This translates into requiring fewer contacts to
achieve force-closure than form-closure [73, 47, 90]. A number of researchers have
studied the conditions necessary (and sufficient) for form- and force-closure, e.g., [131,
102, 119, 120, 6]. Some of these methods also analyze the dynamics using precise
friction modeling at the contact point (called friction cones, [85]).
However, in practical scenarios it is often very difficult to obtain a full and accurate
3D reconstruction of an object through vision, especially for objects seen for the first time.
For stereo systems, 3D reconstruction is difficult for objects without texture, and even
when stereopsis works well, it would typically reconstruct only the visible portions of
the object. Even if more specialized sensors such as laser scanners (or active stereo)
are used to estimate the object’s shape, we would still only have a 3D reconstruction
of the front face of the object.
In contrast to these approaches, we propose a learning algorithm that neither
(a) Our robot unloading items from a dishwasher.
(b) Grasping in a cluttered environment with a multi-fingered hand.
requires, nor tries to build, a 3D model of the object. Instead it predicts, directly
as a function of the images, a point at which to grasp the object. Informally, the
algorithm takes two or more pictures of the object, and then tries to identify a point
within each 2-d image that corresponds to a good point at which to grasp the object.
(For example, if trying to grasp a coffee mug, it might try to identify the mid-point
of the handle.) Given these 2-d points in each image, we use triangulation to obtain
a 3D position at which to actually attempt the grasp. Thus, rather than trying to
triangulate every single point within each image in order to estimate depths—as in
dense stereo—we only attempt to triangulate one (or at most a small number of)
points corresponding to the 3D point where we will grasp the object. This allows
us to grasp an object without ever needing to obtain its full 3D shape, and applies
even to textureless, translucent or reflective objects on which standard stereo 3D
reconstruction fares poorly.
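The triangulation step can be sketched as intersecting the two predicted grasp rays; since the rays rarely meet exactly, a natural choice is the midpoint of the shortest segment between them. The camera placement below is invented for illustration.

```python
import numpy as np

# Each camera contributes a ray (origin o, direction v) through its 2-d
# grasp prediction. We solve for the ray parameters t1, t2 minimizing
# |(o1 + t1 v1) - (o2 + t2 v2)|^2 and return the segment midpoint.
def triangulate_rays(o1, v1, o2, v2):
    v1 = v1 / np.linalg.norm(v1)
    v2 = v2 / np.linalg.norm(v2)
    # normal equations of the 2-variable least-squares problem
    A = np.array([[v1 @ v1, -v1 @ v2],
                  [v1 @ v2, -v2 @ v2]])
    b = np.array([(o2 - o1) @ v1, (o2 - o1) @ v2])
    t1, t2 = np.linalg.solve(A, b)
    return 0.5 * ((o1 + t1 * v1) + (o2 + t2 * v2))

# Two cameras 20 cm apart, both predicting a grasp point at (0, 0, 1) m.
grasp_3d = triangulate_rays(np.array([-0.1, 0.0, 0.0]), np.array([0.1, 0.0, 1.0]),
                            np.array([ 0.1, 0.0, 0.0]), np.array([-0.1, 0.0, 1.0]))
```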
We test our method on grasping a number of common office and household objects
such as toothbrushes, pens, books, cellphones, mugs, martini glasses, jugs, keys, knife-
cutters, duct-tape rolls, screwdrivers, staplers and markers. We also test our method
on the task of unloading novel items from dishwashers (see Figure 1.2.1a).
We then consider grasping with more general robot hands, where one has to not
only infer the “point” at which to grasp the object, but also address the high-dof
problem of choosing the positions of all the fingers in a multi-fingered hand. This work
also focused on the problem of grasping novel objects, in the presence of significant
amounts of clutter. (See Figure 1.2.1b.)
Our approach begins by computing a number of features of grasp quality, using
both 2-d image features and 3D point-cloud features. For example, the 3D data (obtained
from 3D sensors such as stereo vision) is used to compute a number of grasp quality
metrics, such as the degree to which the fingers are exerting forces normal to the
surfaces of the object, and the degree to which they enclose the object. Using such
features, we then apply a supervised learning algorithm to estimate the degree to
which different configurations of the full arm and fingers reflect good grasps.
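One such feature, the degree to which a finger force is normal to the object's surface, can be sketched from the point cloud alone. The PCA-based normal estimate and the synthetic patch below are illustrative assumptions, not the exact feature computation used in our system.

```python
import numpy as np

# Estimate the surface normal at a candidate contact as the smallest
# principal axis of the local point-cloud neighborhood, then score how well
# the finger's force direction aligns with it (1.0 = purely normal force,
# 0.0 = purely tangential).

def surface_normal(neighborhood):
    X = neighborhood - neighborhood.mean(axis=0)
    w, v = np.linalg.eigh(X.T @ X)        # eigenvalues in ascending order
    return v[:, 0]                        # smallest-variance axis = normal

def normal_alignment(force_dir, neighborhood):
    f = force_dir / np.linalg.norm(force_dir)
    return float(abs(f @ surface_normal(neighborhood)))

# Points sampled near the plane z = 0, whose normal is (0, 0, 1).
rng = np.random.default_rng(0)
patch = np.c_[rng.uniform(-1, 1, (100, 2)), rng.normal(0.0, 0.01, 100)]
score = normal_alignment(np.array([0.0, 0.0, -1.0]), patch)  # pressing straight down
```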
Because our grasp planning algorithm relied on long-range vision sensors (such as
cameras and the Swiss Ranger), errors and uncertainty, ranging from small deviations
in the object’s location to occluded surfaces, significantly limited the reliability of
these open-loop grasping strategies. Therefore, we also present new optical proximity
sensors mounted on the robot hand’s fingertips, together with learning algorithms that enable the robot to
reactively adjust the grasp. These sensors are a very useful complement to long-range
sensors (such as vision and 3D cameras), and provide a sense of “pre-touch” to the
robot, thus improving grasping performance.
We also present an application of our algorithms to opening new doors. Most
prior work in door opening (e.g., [122, 110]) assumes that a detailed 3D model of the
door (and door handle) is available, and focuses on developing the control actions
required to open one specific door. In practice, a robot must rely on only its sensors
to perform manipulation in a new environment. In our work, we present a learning
algorithm to enable our robot to autonomously navigate anywhere in a building by
opening doors, even those it has never seen before.
1.3 First published appearances of the described
contributions
Most contributions described in this thesis have first appeared as various publications:
• Chapter 2: Saxena, Chung, and Ng [134, 135], Saxena, Sun, and Ng [149, 152,
151], Heitz, Gould, Saxena, and Koller [43]
• Chapter 3: Saxena, Schulte, and Ng [146], Saxena, Sun, and Ng [150], Saxena,
Chung, and Ng [135], Saxena, Sun, and Ng [152]
• Chapter 4: Michels, Saxena, and Ng [88], Saxena, Chung, and Ng [135]
• Chapter 5: Saxena, Driemeyer, Kearns, Osondu, and Ng [138], Saxena, Driemeyer,
Kearns, and Ng [136], Saxena, Wong, Quigley, and Ng [154], Saxena, Driemeyer,
and Ng [141, 140]
• Chapter 6: Saxena, Wong, and Ng [153], Klingbeil, Saxena, and Ng [64], Hsiao,
Nangeroni, Huber, Saxena, and Ng [51]
The following contributions have appeared as various publications: Saxena, Driemeyer,
and Ng [141], Saxena and Ng [144], Saxena, Gupta, and Mukerjee [142], Saxena,
Gupta, Gerasimov, and Ourselin [143], Saxena and Singh [147], Saxena, Srivatsan,
Saxena, and Verma [148], Saxena, Ray, and Varma [145], Soundararaj, Sujeeth, and
Saxena [165]. However, they are beyond the scope of this dissertation, and therefore
are not discussed here.
Chapter 2
Single Image Depth Reconstruction
2.1 Introduction
Upon seeing an image such as Fig. 2.1a, a human has no difficulty in estimating the 3D
depths in the scene (Fig. 2.1b). However, inferring such 3D depth remains extremely
challenging for current computer vision systems. Indeed, in a narrow mathematical
sense, it is impossible to recover 3D depth from a single image, since we can never
know if it is a picture of a painting (in which case the depth is flat) or if it is a picture
of an actual 3D environment. Yet in practice people perceive depth remarkably well
given just one image; we would like our computers to have a similar sense of depths
in a scene.
Recovering 3D depth from images has important applications in robotics, scene
understanding and 3D reconstruction. Most work on visual 3D reconstruction has
focused on binocular vision (stereopsis, i.e., methods using two images) [155] and on
other algorithms that require multiple images, such as structure from motion [33]
and depth from defocus [23]. These algorithms consider only the geometric (tri-
angulation) differences. Beyond stereo/triangulation cues, there are also numerous
monocular cues—such as texture variations and gradients, defocus, color/haze, etc.—
that contain useful and important depth information. Even though humans perceive
depth by seamlessly combining many of these stereo and monocular cues, most work
on depth estimation has focused on stereo vision.
Figure 2.1: (a) A single still image, and (b) the corresponding (ground-truth) depthmap. Colors in the depthmap indicate distances from the camera.
Depth estimation from a single still image is a difficult task, since depth typically
remains ambiguous given only local image features. Thus, our algorithms must take
into account the global structure of the image, as well as use prior knowledge about
the scene. We also view depth estimation as a small but crucial step towards the
larger goal of image understanding, in that it will help in tasks such as understanding
the spatial layout of a scene, finding walkable areas in a scene, detecting objects, etc.
We apply supervised learning to the problem of estimating depths (Fig. 2.1b) from a
single still image (Fig. 2.1a) of a variety of unstructured environments, both indoor
and outdoor, containing forests, sidewalks, buildings, people, bushes, etc.
Our approach is based on modeling depths and relationships between depths at
multiple spatial scales using a hierarchical, multiscale Markov Random Field (MRF).
Taking a supervised learning approach to the problem of depth estimation, we used a
3D scanner to collect training data, which comprised a large set of images and their
corresponding ground-truth depths. (This data has been made publicly available
on the internet at: http://make3d.stanford.edu) Using this training set, we model the
conditional distribution of the depth at each point in the image given the monocular
image features.
Next, we present another approach in which we use the insight that most 3D
scenes can be segmented into many small, approximately planar surfaces.1 Our algo-
rithm begins by taking an image, and attempting to segment it into many such small
planar surfaces. Using a segmentation algorithm [32], we find an over-segmentation
of the image that divides it into many small regions (superpixels). An example of
such a segmentation is shown in Fig. 2.2b. Because we use an over-segmentation,
planar surfaces in the world may be broken up into many superpixels; however, each
superpixel is likely to (at least approximately) lie entirely on only one planar surface.
For each superpixel, our algorithm then tries to infer the 3D position and orien-
tation of the 3D surface that it came from. This 3D surface is not restricted to just
vertical and horizontal directions, but can be oriented in any direction. Inferring 3D
position from a single image is non-trivial, and humans do it using many different
visual depth cues, such as texture (e.g., grass has a very different texture when viewed
close up than when viewed far away); color (e.g., green patches are more likely to be
grass on the ground; blue patches are more likely to be sky). Our algorithm uses
supervised learning to learn how different visual cues like these are associated with
different depths. Our learning algorithm uses a Markov random field model, which
is also able to take into account constraints on the relative depths of nearby super-
pixels. For example, it recognizes that two adjacent image patches are more likely to
be at the same depth, or even to be coplanar, than to be very far apart. Though
learning in our MRF model is approximate, MAP inference is tractable via linear
programming.
Having inferred the 3D position of each superpixel, we can now build a 3D mesh
model of a scene (Fig. 2.2c). We then texture-map the original image onto it to build
a textured 3D model (Fig. 2.2d) that we can fly through and view at different angles.
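The per-superpixel geometry can be sketched with one convenient plane parameterization: a plane {x : α·x = 1}, for which the depth along a unit ray r from a camera at the origin is 1/(α·r). The least-squares fit and synthetic points below are an illustrative sketch, not the full model described later in this chapter.

```python
import numpy as np

# Fit a plane {x : alpha . x = 1} to the 3D points of one superpixel by
# least squares; depth along a unit viewing ray r is then 1 / (alpha . r).
def fit_plane(points):
    alpha, *_ = np.linalg.lstsq(points, np.ones(len(points)), rcond=None)
    return alpha

rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, (50, 2))
pts = np.c_[xy, 4.0 - 0.5 * xy[:, 0]]    # points on the tilted plane z = 4 - 0.5 x
alpha = fit_plane(pts)                   # this plane has alpha = (0.125, 0, 0.25)
depth_along_ray = 1.0 / (alpha @ np.array([0.0, 0.0, 1.0]))   # ray straight ahead
```

Because every superpixel is summarized by a single α, rendering the textured 3D mesh reduces to intersecting each pixel's viewing ray with its superpixel's plane.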
This chapter is organized as follows. Section 2.2 gives an overview of various meth-
ods used for 3D depth reconstruction. Section 2.3 describes some of the visual cues
used by humans for depth perception, and Section 2.4 describes the image features
1 Indeed, modern computer graphics using OpenGL or DirectX models extremely complex scenes this way, using triangular facets to model even very complex shapes.
Figure 2.2: (a) An original image. (b) Oversegmentation of the image to obtain “superpixels”. (c) The 3D model predicted by the algorithm. (d) A screenshot of the textured 3D model.
used to capture monocular cues. We describe our first probabilistic model, which
models depths at each point in the image, in Section 2.5. We then present a second
representation, based on planes, in Section 2.8, followed by its probabilistic model
in Section 2.10. In Section 2.11, we describe our setup for collecting aligned image
and laser data. The results of depth prediction on single images are presented in
Sections 2.12, 2.13 and 2.14. We then also describe how to incorporate object information
in Section 2.15.
2.2 Related work
Although our work mainly focuses on depth estimation from a single still image, there
are many other 3D reconstruction techniques, such as explicit measurement with
laser or radar sensors [124], using two (or more) images [155], and using video
sequences [20]. Among the vision-based approaches, most work has focused on stereo
vision (see [155] for a review), and on other algorithms that require multiple images,
such as optical flow [3], structure from motion [33] and depth from defocus [23]. Frueh
and Zakhor [35] constructed 3D city models by merging ground-based and airborne
views. A large class of algorithms reconstructs the 3D shape of known objects, such
as human bodies, from images and laser data [175, 2]. Structured lighting [156] offers
another method for depth reconstruction.
For a few specific settings, several authors have developed methods for depth
estimation from a single image. Examples include shape-from-shading [186, 79] and
shape-from-texture [81, 74]; however, these methods are difficult to apply to surfaces
that do not have fairly uniform color and texture. Nagai et al. [98] used Hidden
Markov Models to perform surface reconstruction from single images for known,
fixed objects such as hands and faces. Hassner and Basri [41] used an example-based
approach to estimate depth of an object from a known object class. Han and Zhu [40]
performed 3D reconstruction for known specific classes of objects placed in untextured
areas.
In work contemporary to ours, Delage, Lee and Ng (DLN) [25, 26] and Hoiem,
Efros and Hebert (HEH) [46] assumed that the environment is made of a flat ground
with vertical walls. DLN considered indoor images, while HEH considered outdoor
scenes. They classified the image into horizontal/ground and vertical regions (also
possibly sky) to produce a simple “pop-up” type fly-through from an image. Their
method, which assumes a simple “ground-vertical” structure of the world, fails on
many environments that do not satisfy this assumption and also does not give accurate
metric depthmaps.
There are some algorithms that can perform depth reconstruction from single im-
ages in very specific settings. Methods such as shape from shading [186, 79] and shape
from texture [74, 81, 80] generally assume uniform color and/or texture,2 and hence
would perform very poorly on the complex, unconstrained, highly textured images
that we consider. Hertzmann et al. [44] reconstructed high quality 3D models from
several images, but they required that the images also contain “assistant” objects of
known shapes next to the target object. Torresani et al. [178] worked on reconstruct-
ing non-rigid surface shapes from video sequences. Torralba and Oliva [177] studied
the Fourier spectrum of the images to compute the mean depth of a scene.
Single view metrology [21] describes techniques that can be used to compute as-
pects of 3D geometry of a scene from a single perspective image. It assumes that
vanishing lines and points are known in a scene, and calculates angles between par-
allel lines to infer 3D structure from images of Manhattan-type environments. (See
Fig. 2.3.) On the other hand, our method is not restricted to Manhattan-type envi-
ronments and considers images of a large variety of environments.
Building on these concepts of single image 3D reconstruction, Hoiem, Efros and
Hebert [45] and Sudderth et al. [168] integrated learning-based object recognition
with 3D scene representations.
Our approach draws on a large number of ideas from computer vision such as
feature computation and multiscale representation of images. A variety of image
features and representations have been used by other authors, such as Gabor fil-
ters [100], wavelets [167], SIFT features [96], etc. Many of these image features are
2Also, most of these algorithms assume Lambertian surfaces, which means the appearance of the surface does not change with viewpoint.
Figure 2.3: In Single view metrology [21], heights are measured using parallel lines. (Image reproduced with permission from [21].)
used for purposes such as recognizing objects [97, 160], faces [187], facial expres-
sions [133], grasps [138]; image segmentation [68], computing the visual gist of a
scene [103] and computing sparse representations of natural images [104]. Stereo and
monocular image features have been used together for object recognition and image
segmentation [67].
Beyond vision, machine learning has been successfully used for a number of other
ill-posed problems. In [144], we considered the problem of estimating the incident
angle of a sound, using a single microphone.
Our approach is based on learning a Markov Random Field (MRF) model. MRFs
are a workhorse of machine learning, and have been successfully applied to numerous
problems in which local features were insufficient and more contextual information
had to be used. Examples include image denoising [92], stereo vision [155], image
segmentation [32], range sensing [27], text segmentation [71], object classification [97],
and image labeling [42]. For the application of identifying man-made structures in
natural images, Kumar and Hebert used a discriminative random fields algorithm [70].
Since MRF learning is intractable in general, most of these models are trained using
pseudo-likelihood; sometimes the models’ parameters are also hand-tuned.
There is also ample prior work in 3D reconstruction from multiple images, as in
stereo vision and structure from motion. It is impossible for us to do this literature
justice here, but recent surveys include [155] and [78], and we discuss this work further
in Section 3.3.
2.3 Monocular cues for depth perception
Images are formed by a projection of the 3D scene onto two dimensions. Thus, given
only a single image, the true depths are ambiguous, in that an image might represent
an infinite number of 3D structures. However, not all of these possible depths are
equally likely. The environment we live in is reasonably structured, and thus humans
are usually able to infer a (nearly) correct 3D structure, using prior experience.
Humans use numerous visual cues to perceive depth. Such cues are typically
grouped into four distinct categories: monocular, stereo, motion parallax, and focus
cues [76, 158]. Humans combine these cues to understand the 3D structure of the
world [181, 121, 185, 76]. Below, we describe these cues in more detail. Our proba-
bilistic model will attempt to capture a number of monocular cues (Section 2.5), as
well as stereo triangulation cues (Section 3.2).
There are a number of monocular cues such as texture variations, texture gradi-
ents, interposition, occlusion, known object sizes, light and shading, haze, defocus,
etc. [9] For example, many objects’ texture will look different at different distances
from the viewer. Texture gradients, which capture the distribution of the direction of
edges, also help to indicate depth [80]. For example, a tiled floor with parallel lines
will appear to have tilted lines in an image [21]. The distant patches will have larger
variations in the line orientations, and nearby patches with almost parallel lines will
have smaller variations in line orientations. Similarly, a grass field when viewed at
different distances will have different texture gradient distributions. Haze is another
depth cue, and is caused by atmospheric light scattering [99].
However, we note that local image cues alone are usually insufficient to infer the
3D structure. For example, both blue sky and a blue object would give similar local
features; hence it is difficult to estimate depths from local features alone.
The ability of humans to “integrate information” over space, i.e. understand the
relation between different parts of the image, is crucial to understanding the scene’s
3D structure [180, chap. 11]. For example, even if part of an image is a homogeneous,
featureless, gray patch, one is often able to infer its depth by looking at nearby
portions of the image, so as to recognize whether this patch is part of a sidewalk,
Figure 2.4: The convolutional filters used for texture energies and gradients. The first nine are 3x3 Laws’ masks. The last six are the oriented edge detectors, spaced at 30° intervals. The nine Laws’ masks are used to perform local averaging, edge detection and spot detection.
a wall, etc. Therefore, in our model we will also capture relations between different
parts of the image.
Humans recognize many visual cues, such that a particular shape may be a build-
ing, that the sky is blue, that grass is green, that trees grow above the ground and
have leaves on top of them, and so on. In our model, both the relation of monocular
cues to the 3D structure, as well as relations between various parts of the image,
will be learned using supervised learning. Specifically, our model will be trained to
estimate depths using a training set in which the ground-truth depths were collected
using a laser scanner.
2.4 Point-wise feature vector
Inspired in part by humans, we design features to capture some of the monocular
cues discussed in Section 2.3.
In our first approach, we divide the image into small rectangular patches (arranged
in a uniform grid), and estimate a single depth value for each patch.3 Note that
dividing an image into rectangular patches violates natural depth discontinuities,
therefore in our second method we will use the idea of over-segmented patches (see
Section 2.7). We use two types of features: absolute depth features—used to estimate
the absolute depth at a particular patch—and relative features, which we use to
estimate relative depths (magnitude of the difference in depth between two patches).
These features try to capture two processes in the human visual system: local feature
3The resolution of the grid could be made arbitrarily high, even equal to the resolution of the image, thus giving a depth value for each point in the image.
processing (absolute features), such as that the sky is far away; and continuity features
(relative features), a process by which humans understand whether two adjacent
patches are physically connected in 3D and thus have similar depths.4
We chose features that capture three types of local cues: texture variations, tex-
ture gradients, and color. Texture information is mostly contained within the image
intensity channel [180],5 so we apply Laws’ masks [24, 88] to this channel to compute
the texture energy (Fig. 2.4). Haze is reflected in the low frequency information in
the color channels, and we capture this by applying a local averaging filter (the first
Laws’ mask) to the color channels.
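As a concrete sketch, the nine Laws' masks and the resulting texture energies can be computed as follows (a NumPy/SciPy illustration with a toy random patch, not the thesis implementation):

```python
import numpy as np
from scipy.signal import convolve2d

# The nine 3x3 Laws' masks are outer products of three 1-D kernels:
# L3 (local averaging), E3 (edge detection), and S3 (spot detection).
L3 = np.array([1.0, 2.0, 1.0])
E3 = np.array([-1.0, 0.0, 1.0])
S3 = np.array([-1.0, 2.0, -1.0])
laws_masks = [np.outer(a, b) for a in (L3, E3, S3) for b in (L3, E3, S3)]

def texture_energy(intensity, mask):
    # Sum of absolute filter responses over the patch (the k = 1 statistic).
    return np.abs(convolve2d(intensity, mask, mode='same')).sum()

rng = np.random.default_rng(0)
patch = rng.random((8, 8))                     # toy intensity patch
energies = [texture_energy(patch, m) for m in laws_masks]
```

The first mask (L3 x L3) is the local-averaging filter applied to the color channels in the text.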
Studies on monocular vision in humans strongly indicate that texture gradients
are an important cue in depth estimation [185]. When well-defined edges exist (e.g.,
scenes having regular structure like buildings and roads, or indoor scenes), vanishing
points can be calculated with a Hough transform to get a sense of distance. However,
outdoor scenes are highly irregular. To compute an estimate of texture gradient that
is robust to noise, we convolve the intensity channel with six oriented edge filters
(shown in Fig. 2.4). We thus rely on a large number of different types of features to
make our algorithm more robust and to make it generalize even to images that are
very different from the training set.
One can envision including more features to capture other cues. For example, to
model atmospheric effects such as fog and haze, features computed from the physics
of light scattering [99] could also be included. Similarly, one can also include features
based on surface-shading [79].
2.4.1 Features for absolute depth
We first compute summary statistics of a patch i in the image I(x, y) as follows. We use the output of each of the 17 filters (9 Laws’ masks, 2 color channels and 6 texture gradients) F_n, n = 1, ..., 17, as:

E_i(n) = \sum_{(x,y) \in \text{patch}(i)} |I * F_n|^k,

where k ∈ {1, 2, 4} gives the sum absolute energy, sum squared energy and kurtosis respectively.6

4If two neighboring patches of an image display similar features, humans would often perceive them to be parts of the same object, and therefore to have similar depth values.
5We represent each image in YCbCr color space, where Y is the intensity channel, and Cb and Cr are the color channels.

Figure 2.5: The absolute depth feature vector for a patch, which includes features from its immediate neighbors and its more distant neighbors (at larger scales). The relative depth features for each patch use histograms of the filter outputs.
These are commonly used filters that capture the texture of a 3x3 patch and the edges
at various orientations. This gives us an initial feature vector of dimension 34.
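A minimal sketch of this per-patch statistic, assuming NumPy/SciPy and a random stand-in filter bank (only the filter count matters for the dimensionality):

```python
import numpy as np
from scipy.signal import convolve2d

def patch_features(intensity, filters, ks=(1, 2)):
    """Summary statistics E_i(n) = sum over the patch of |I * F_n|^k.

    With 17 filters and k in {1, 2}, this yields the 34-dimensional
    initial feature vector for one patch described in the text.
    """
    feats = []
    for F in filters:
        resp = np.abs(convolve2d(intensity, F, mode='same'))
        for k in ks:
            feats.append((resp ** k).sum())
    return np.array(feats)

# Stand-in bank of 17 random 3x3 filters (the real bank is the 9 Laws'
# masks, 2 color-channel averages, and 6 oriented edge detectors).
rng = np.random.default_rng(1)
filters = [rng.standard_normal((3, 3)) for _ in range(17)]
x = patch_features(rng.random((10, 10)), filters)
```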
To estimate the absolute depth at a patch, local image features centered on the
patch are insufficient, and one has to use more global properties of the image. We
attempt to capture this information by using image features extracted at multiple
spatial scales (image resolutions).7 (See Fig. 2.5.) Objects at different depths exhibit
very different behaviors at different resolutions, and using multiscale features allows
us to capture these variations [182]. For example, blue sky may appear similar at
different scales, but textured grass would not. In addition to capturing more global
information, computing features at multiple spatial scales also helps to account for
different relative sizes of objects. A closer object appears larger in the image, and
6We used k ∈ {1, 2} for point-wise features, but used k ∈ {2, 4} for segment-based features in Section 2.9.1 because they gave better results.
7The patches at each spatial scale are arranged in a grid of equally sized non-overlapping regions that cover the entire image. We use 3 scales in our experiments.
hence will be captured in the larger scale features. The same object when far away
will be small and hence be captured in the small scale features. Features capturing
the scale at which an object appears may therefore give strong indicators of depth.
To capture additional global features (e.g. occlusion relationships), the features
used to predict the depth of a particular patch are computed from that patch as well as
the four neighboring patches. This is repeated at each of the three scales, so that the
feature vector at a patch includes features of its immediate neighbors, its neighbors
at a larger spatial scale (thus capturing image features that are slightly further away
in the image plane), and again its neighbors at an even larger spatial scale; this is
illustrated in Fig. 2.5. Lastly, many structures (such as trees and buildings) found in
outdoor scenes show vertical structure, in the sense that they are vertically connected
to themselves (things cannot hang in empty air). Thus, we also add to the features
of a patch additional summary features of the column it lies in.
For each patch, after including features from itself and its 4 neighbors at 3 scales,
and summary features for its 4 column patches, our absolute depth feature vector x
is 19 ∗ 34 = 646 dimensional.
2.4.2 Features for boundaries
We use a different feature vector to learn the dependencies between two neighboring
patches. Specifically, we compute a 10-bin histogram of each of the 17 filter outputs |I * F_n|, giving us a total of 170 features y_{is} for each patch i at scale s. These
features are used to estimate how the depths at two different locations are related. We
believe that learning these estimates requires less global information than predicting
absolute depth, but more detail from the individual patches. For example, given
two adjacent patches of a distinctive, unique, color and texture, we may be able to
safely conclude that they are part of the same object, and thus that their depths
are close, even without more global features. Hence, our relative depth features y_{ijs} for two neighboring patches i and j at scale s will be the differences between their histograms, i.e., y_{ijs} = y_{is} − y_{js}.
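These histogram-difference features can be sketched as follows (a NumPy illustration; the histogram range and binning below are assumptions, not taken from the thesis):

```python
import numpy as np

def relative_depth_features(resp_i, resp_j, n_bins=10, lo=0.0, hi=1.0):
    """Difference of 10-bin histograms of |I * F_n| for two patches.

    resp_i, resp_j: lists of filter-response arrays (one per filter).
    Returns y_ijs = y_is - y_js; with 17 filters this is 170-dimensional.
    """
    def hist_vec(responses):
        return np.concatenate(
            [np.histogram(r, bins=n_bins, range=(lo, hi))[0] for r in responses])
    return hist_vec(resp_i) - hist_vec(resp_j)

rng = np.random.default_rng(2)
resp_i = [rng.random((6, 6)) for _ in range(17)]   # toy filter responses
resp_j = [rng.random((6, 6)) for _ in range(17)]
y_ij = relative_depth_features(resp_i, resp_j)
```

Identical patches give the zero vector, consistent with the intuition that similar patches should be judged likely to lie at similar depths.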
Figure 2.6: The multiscale MRF model for modeling relation between features and depths, relation between depths at the same scale, and relation between depths at different scales. (Only 2 out of 3 scales, and a subset of the edges, are shown.)
2.5 Point-wise probabilistic model
In this model, we use points in the image as the basic unit, and infer only their 3D
location. The nodes in this MRF are a dense grid of points in the image, where
the value of each node represents its depth. We convert the depths to log scale for
emphasizing fractional (relative) errors in depth.
Since local image features are by themselves usually insufficient for estimating
depth, the model needs to reason more globally about the spatial structure of the
scene. We capture the spatial structure of the image by modeling the relationships
between depths in different parts of the image. Although the depth of a particular
patch depends on the features of the patch, it is also related to the depths of other
parts of the image. For example, the depths of two adjacent patches lying in the
same building will be highly correlated. We will use a hierarchical multiscale Markov
Random Field (MRF) to model the relationship between the depth of a patch and
the depths of its neighboring patches (Fig. 2.6). In addition to the interactions with
the immediately neighboring patches, there are sometimes also strong interactions
between the depths of patches which are not immediate neighbors. For example,
consider the depths of patches that lie on a large building. All of these patches will
be at similar depths, even if there are small discontinuities (such as a window on
the wall of a building). However, when viewed at the smallest scale, some adjacent
patches are difficult to recognize as parts of the same object. Thus, we will also model
interactions between depths at multiple spatial scales.
2.5.1 Gaussian model
Our first model will be a jointly Gaussian Markov Random Field (MRF) as shown
in Eq. 2.1. To capture the multiscale depth relations, we will model the depths di(s)
for multiple scales s = 1, 2, 3. In our experiments, we enforce a hard constraint
that depths at a higher scale are the average of the depths at the lower scale.8 More
formally, we define d_i(s+1) = (1/5) \sum_{j \in N_s(i) \cup \{i\}} d_j(s). Here, N_s(i) are the 4 neighbors of patch i at scale s.9
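On a regular grid this averaging constraint can be sketched as follows (a NumPy illustration that keeps the grid size fixed and clamps neighbors at the image boundary; both are simplifying assumptions not specified in the text):

```python
import numpy as np

def coarsen(d):
    """d_i(s+1) = (1/5) * sum of d_j(s) over patch i and its 4 neighbors."""
    d = np.asarray(d, dtype=float)
    p = np.pad(d, 1, mode='edge')          # clamp neighbors at the boundary
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0

d_fine = np.array([[1.0, 1.0, 5.0],
                   [1.0, 1.0, 5.0],
                   [1.0, 1.0, 5.0]])
d_coarse = coarsen(d_fine)
```

A constant depth map is a fixed point of this averaging, as expected of a smoothing constraint.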
In Eq. 2.1, M is the total number of patches in the image (at the lowest scale); Z
is the normalization constant for the model; xi is the absolute depth feature vector
for patch i; and θ and σ = {σ1, σ2} are parameters of the model.10 In detail, we use
different parameters (θr, σ1r, σ2r) for each row r in the image, because the images
we consider are taken from a horizontally mounted camera, and thus different rows
of the image have different statistical properties. For example, a blue patch might
8One can instead have soft constraints relating the depths at a higher scale to depths at a lower scale. One can also envision putting more constraints in the MRF, such as that points lying on a long straight edge in an image should lie on a straight line in the 3D model, etc.
9Our experiments using 8-connected neighbors instead of 4-connected neighbors yielded minor improvements in accuracy at the cost of a much longer inference time.
10Here each σ² represents the (empirical) variance of each term, and θ are the weights parameterizing the space of linear functions.
represent sky if it is in the upper part of the image, and might be more likely to be water if in the lower part of the image.
P_G(d|X; \theta, \sigma) = \frac{1}{Z_G} \exp\left( -\sum_{i=1}^{M} \frac{(d_i(1) - x_i^T \theta_r)^2}{2\sigma_{1r}^2} - \sum_{s=1}^{3} \sum_{i=1}^{M} \sum_{j \in N_s(i)} \frac{(d_i(s) - d_j(s))^2}{2\sigma_{2rs}^2} \right)   (2.1)

P_L(d|X; \theta, \lambda) = \frac{1}{Z_L} \exp\left( -\sum_{i=1}^{M} \frac{|d_i(1) - x_i^T \theta_r|}{\lambda_{1r}} - \sum_{s=1}^{3} \sum_{i=1}^{M} \sum_{j \in N_s(i)} \frac{|d_i(s) - d_j(s)|}{\lambda_{2rs}} \right)   (2.2)
Our model is a conditionally trained MRF, in that its model of the depths d is
always conditioned on the image features X; i.e., it models only P (d|X). We first
estimate the parameters θr in Eq. 2.1 by maximizing the conditional log likelihood
ℓ(θ_r) = log P(d|X; θ_r) of the training data. Since the model is a multivariate Gaussian,
the maximum likelihood estimate of parameters θr is obtained by solving a linear least
squares problem.
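On synthetic data, this least-squares step looks as follows (all sizes and names here are illustrative, not the thesis data):

```python
import numpy as np

# Toy supervised fit of theta_r: each row of X_r stacks one patch's
# absolute-depth features, d_r the corresponding log ground-truth depths.
rng = np.random.default_rng(3)
n_patches, n_feats = 200, 10
X_r = rng.standard_normal((n_patches, n_feats))
theta_true = rng.standard_normal(n_feats)
d_r = X_r @ theta_true + 0.01 * rng.standard_normal(n_patches)

# For the Gaussian model, the ML estimate is ordinary least squares.
theta_r, *_ = np.linalg.lstsq(X_r, d_r, rcond=None)
```

With low label noise, the fit recovers the generating parameters closely.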
The first term in the exponent above models depth as a function of multiscale
features of a single patch i. The second term in the exponent places a soft “constraint”
on the depths to be smooth. If the variance term σ²_{2rs} is a fixed constant, the effect of this term is that it tends to smooth depth estimates across nearby patches. However, in practice the dependencies between patches are not the same everywhere, and our expected value for (d_i − d_j)² may depend on the features of the local patches.
Therefore, to improve accuracy we extend the model to capture the “variance” term σ²_{2rs} in the denominator of the second term as a linear function of the patches i and j’s relative depth features y_{ijs} (discussed in Section 2.4.2). We model the variance as σ²_{2rs} = u_{rs}^T |y_{ijs}|. This helps determine which neighboring patches are likely to have similar depths; for example, the “smoothing” effect is much stronger if neighboring patches are similar. This idea is applied at multiple scales, so that we learn different σ²_{2rs} for the different scales s (and rows r of the image). The parameters u_{rs} are learned to fit σ²_{2rs} to the expected value of (d_i(s) − d_j(s))², with a constraint that u_{rs} ≥ 0 (to keep the estimated σ²_{2rs} non-negative), using a quadratic program (QP).
Similar to our discussion on σ²_{2rs}, we also learn the variance parameter σ²_{1r} = v_r^T x_i as a linear function of the features. Since the absolute depth features x_i are non-negative, the estimated σ²_{1r} is also non-negative. The parameters v_r are chosen to fit σ²_{1r} to the expected value of (d_i(1) − θ_r^T x_i)², subject to v_r ≥ 0. This σ²_{1r} term gives a measure of the uncertainty in the first term, and depends on the features. This is motivated by the observation that in some cases, depth cannot be reliably estimated from the local features. In this case, one has to rely more on neighboring patches’ depths, as modeled by the second term in the exponent.
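The non-negative fit of these variance parameters can be sketched with non-negative least squares standing in for the QP described above (an assumption for illustration; the sizes and data are synthetic):

```python
import numpy as np
from scipy.optimize import nnls

# Fit sigma^2_2rs = u_rs^T |y_ijs| to observed squared depth differences,
# with u_rs >= 0 so the estimated variance stays non-negative.
rng = np.random.default_rng(4)
n_pairs, n_feats = 300, 8
Y = np.abs(rng.standard_normal((n_pairs, n_feats)))   # |y_ijs| features
u_true = np.abs(rng.standard_normal(n_feats))
sq_diff = Y @ u_true                                  # targets (d_i - d_j)^2
u_rs, _ = nnls(Y, sq_diff)
```

Because the synthetic targets are generated with non-negative weights, the constrained fit recovers them exactly; on real data the constraint simply projects the fit into the feasible set.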
After learning the parameters, given a new test-set image we can find the MAP
estimate of the depths by maximizing Eq. 2.1 in terms of d. Since Eq. 2.1 is Gaussian,
logP (d|X; θ, σ) is quadratic in d, and thus its maximum is easily found in closed form
(taking at most 1-2 seconds per image). More details are given in Appendix 2.6.1.
2.5.2 Laplacian model
We now present a second model (Eq. 2.2) that uses Laplacians instead of Gaussians
to model the posterior distribution of the depths. Our motivation for doing so is
three-fold. First, a histogram of the relative depths (di − dj) is empirically closer
to Laplacian than Gaussian (Fig. 2.7, see [52] for more details on depth statistics),
which strongly suggests that it is better modeled as one.11 Second, the Laplacian
distribution has heavier tails, and is therefore more robust to outliers in the image
features and to errors in the training-set depthmaps (collected with a laser scanner; see
Section 2.11). Third, the Gaussian model was generally unable to give depthmaps
with sharp edges; in contrast, Laplacians tend to model sharp transitions/outliers
better [129].
This model is parametrized by θr (similar to Eq. 2.1) and by λ1r and λ2rs, the
Laplacian spread parameters. Maximum-likelihood parameter estimation for the
11Although the Laplacian distribution fits the log-histogram of multiscale relative depths reasonably well, there is an unmodeled peak near zero. This arises due to the fact that the neighboring depths at the finest scale frequently lie on the same object.
Figure 2.7: The log-histogram of relative depths. Empirically, the distribution of relative depths is closer to Laplacian than Gaussian.
Laplacian model is not tractable (since the partition function depends on θ_r). However, by analogy to the Gaussian case, we approximate this by solving a linear system of equations X_r θ_r ≈ d_r to minimize the L1 (instead of L2) error, i.e., \min_{\theta_r} \|d_r − X_r θ_r\|_1.
Here Xr is the matrix of absolute depth features. Following the Gaussian model, we
also learn the Laplacian spread parameters in the denominator in the same way, except that instead of estimating the expected values of (d_i − d_j)² and (d_i(1) − θ_r^T x_i)², we estimate the expected values of |d_i − d_j| and |d_i(1) − θ_r^T x_i|, as linear functions parametrized by u_{rs} and v_r respectively. This is done using a Linear Program (LP), with u_{rs} ≥ 0 and v_r ≥ 0.
Even though maximum likelihood (ML) parameter estimation for θr is intractable
in the Laplacian model, given a new test-set image, MAP inference for the depths d is
tractable and convex. Details on solving the inference problem as a Linear Program
(LP) are given in Appendix 2.6.2.
Remark. We can also extend these models to combine Gaussian and Laplacian terms in the exponent, for example by using an L2 norm term for absolute depth, and an L1 norm term for the interaction terms. MAP inference remains tractable in this setting,
and can be solved using convex optimization as a QP (quadratic program).
2.6 MAP inference
2.6.1 Gaussian model
We can rewrite Eq. 2.1 as a standard multivariate Gaussian,

P_G(d|X; \theta, \sigma) = \frac{1}{Z_G} \exp\left( -\frac{1}{2}(d - X_a \theta_r)^T \Sigma_a^{-1} (d - X_a \theta_r) \right)

where X_a = (\Sigma_1^{-1} + Q^T \Sigma_2^{-1} Q)^{-1} \Sigma_1^{-1} X, with Σ_1 and Σ_2 representing the matrices of the variances σ²_{1,i} and σ²_{2,i} in the first and second terms in the exponent of Eq. 2.1, respectively.12 Q is a matrix such that the rows of Qd give the differences of the depths in the neighboring patches at multiple scales (as in the second term in the exponent of Eq. 2.1). Our MAP estimate of the depths is, therefore, d* = X_a θ_r.
During learning, we iterate between learning θ and estimating σ. Empirically,
σ1 ≪ σ2, and Xa is very close to X; therefore, the algorithm converges after 2-3
iterations.
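The closed-form MAP estimate can be checked numerically on a toy one-dimensional chain (illustrative sizes; here Q simply takes adjacent differences at a single scale):

```python
import numpy as np

rng = np.random.default_rng(5)
M, n_feat = 6, 3
X = rng.standard_normal((M, n_feat))      # absolute depth features
theta = rng.standard_normal(n_feat)       # learned weights
Q = np.eye(M - 1, M) - np.eye(M - 1, M, k=1)   # rows of Qd give d_i - d_{i+1}
Sigma1_inv = np.eye(M) / 0.25             # sigma_1^2 = 0.25 everywhere
Sigma2_inv = np.eye(M - 1) / 4.0          # sigma_2^2 = 4 everywhere

# X_a = (Sigma_1^{-1} + Q^T Sigma_2^{-1} Q)^{-1} Sigma_1^{-1} X,  d* = X_a theta
A = Sigma1_inv + Q.T @ Sigma2_inv @ Q
Xa = np.linalg.solve(A, Sigma1_inv @ X)
d_map = Xa @ theta
```

The solution satisfies the normal equations A d* = Σ_1^{-1} X θ, i.e., it is the stationary point of the quadratic exponent.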
2.6.2 Laplacian model
Exact MAP inference of the depths d ∈ RM can be obtained by maximizing logP (d|X; θ, λ)
in terms of d (Eq. 2.2). More formally,
d* = \arg\max_d \log P(d|X; \theta, \lambda) = \arg\min_d c_1^T |d − Xθ_r| + c_2^T |Qd|

where c_1 ∈ R^M with c_{1,i} = 1/λ_{1,i}, and c_2 ∈ R^{6M} with c_{2,i} = 1/λ_{2,i}. Our features are given by X ∈ R^{M×k} and the learned parameters are θ_r ∈ R^k, which give a naive estimate d̃ = Xθ_r ∈ R^M of the depths. Q is a matrix such that the rows of Qd give the differences of the depths in the neighboring patches at multiple scales (as in the second term in the exponent of Eq. 2.2).

12Note that if the variances at each point in the image are constant, then X_a = (I + (σ²_1/σ²_2) Q^T Q)^{-1} X; i.e., X_a is essentially a smoothed version of X.
We add auxiliary variables ξ1 and ξ2 to pose the problem as a Linear Program
(LP):
d* = \arg\min_{d, \xi_1, \xi_2} c_1^T \xi_1 + c_2^T \xi_2
s.t.  −\xi_1 ≤ d − d̃ ≤ \xi_1
      −\xi_2 ≤ Qd ≤ \xi_2
In our experiments, MAP inference takes about 7-8 seconds for an image.
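This LP can be posed directly with an off-the-shelf solver; the sketch below uses scipy.optimize.linprog on a toy chain, with weights chosen so the data term dominates (all values illustrative, not from the thesis):

```python
import numpy as np
from scipy.optimize import linprog

# Toy 1-D chain with M = 5 depths; Q takes adjacent differences.
M = 5
d_tilde = np.array([1.0, 1.1, 5.0, 5.1, 5.2])   # naive estimate X theta_r
Q = np.eye(M - 1, M) - np.eye(M - 1, M, k=1)
c1 = np.ones(M)               # 1 / lambda_1 weights (data term)
c2 = np.full(M - 1, 0.2)      # 1 / lambda_2 weights (smoothness term)

# Variables z = [d, xi_1, xi_2]; minimize c1^T xi_1 + c2^T xi_2.
n2 = M - 1
c = np.concatenate([np.zeros(M), c1, c2])
I = np.eye(M)
A_ub = np.block([
    [ I, -I,                  np.zeros((M, n2))],   #  d - xi1 <=  d_tilde
    [-I, -I,                  np.zeros((M, n2))],   # -d - xi1 <= -d_tilde
    [ Q,  np.zeros((n2, M)), -np.eye(n2)],          #  Qd - xi2 <= 0
    [-Q,  np.zeros((n2, M)), -np.eye(n2)],          # -Qd - xi2 <= 0
])
b_ub = np.concatenate([d_tilde, -d_tilde, np.zeros(2 * n2)])
bounds = [(None, None)] * M + [(0, None)] * (M + n2)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
d_map = res.x[:M]
```

With these weights the data term outweighs the smoothing term, so the solution reproduces d̃; a larger c2 would instead pull the solution toward piecewise-constant depths.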
2.7 Plane parametrization
In our first method, our patches were uniform in size and were arranged in a uniform
grid. However, often the structures in the image can be grouped into meaningful
regions (or segments). For example, all the patches on a large uniformly colored wall
would lie on the same plane, while other regions consist of many planes at different angles (such as trees). Therefore, we use the insight that most 3D scenes can
be segmented into many small, approximately planar surfaces (patches). Our method
starts by taking an image, and attempting to segment it into many such small planar surfaces (see Section 2.8 for details). Note that in this case the patches are variable in size
and are not arranged in a uniform grid. (See Fig. 2.8.) Each of these small patches will be our basic unit of representation (instead of the uniform rectangular patches used earlier). Such a representation follows natural depth discontinuities better, and hence models the image better for the task of depth estimation.
In this method, other than assuming that the 3D structure is made up of a number
of small planes, we make no explicit assumptions about the structure of the scene.
This allows our approach to even apply to scenes with significantly richer structure
than only vertical surfaces standing on a horizontal ground, such as mountains, trees,
Figure 2.8: (Left) An image of a scene. (Right) Oversegmented image. Each small segment (superpixel) lies on a plane in the 3D world. (Best viewed in color.)
etc.
2.8 Representation
Our goal is to create a full photo-realistic 3D model from an image. Following most
work on 3D models in computer graphics and other related fields, we will use a
polygonal mesh representation of the 3D model, in which we assume the world is
made of a set of small planes.13 In detail, given an image of the scene, we first find
small homogeneous regions in the image, called “Superpixels” [32]. Each such region
represents a coherent region in the scene with all the pixels having similar properties.
(See Fig. 2.8.) Our basic unit of representation will be these small planes in the world,
and our goal is to infer the location and orientation of each one.
More formally, we parametrize both the 3D location and orientation of the infinite
plane on which a superpixel lies by using a set of plane parameters α ∈ R3. (Fig. 2.9)
(Any point q ∈ R3 lying on the plane with parameters α satisfies αT q = 1.) The value
1/|α| is the distance from the camera center to the closest point on the plane, and
13This assumption is reasonably accurate for most artificial structures, such as buildings. Some natural structures such as trees could perhaps be better represented by a cylinder. However, since our models are quite detailed, e.g., about 2000 planes for a small scene, the planar assumption works quite well in practice.
Figure 2.9: A 2-d illustration to explain the plane parameter α and rays R from the camera.
the unit vector \hat{α} = α/|α| gives the orientation (normal) of the plane. If R_i is the unit vector (also called the ray R_i) from the camera center to a point i lying on a plane with parameters α, then d_i = 1/(R_i^T α) is the distance of point i from the camera center.
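A small numeric example of this parametrization (a fronto-parallel plane at depth 4, chosen purely for illustration):

```python
import numpy as np

# Plane z = 4 facing the camera: alpha^T q = 1 for every point q on it.
alpha = np.array([0.0, 0.0, 0.25])
dist = 1.0 / np.linalg.norm(alpha)       # distance from camera to the plane
n_hat = alpha / np.linalg.norm(alpha)    # unit normal giving the orientation

R = np.array([0.0, 0.0, 1.0])            # unit ray along the optical axis
d = 1.0 / (R @ alpha)                    # depth of the intersection point
q = d * R                                # the 3D point itself
```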
2.9 Segment-wise features
In the plane-parameter representation, our basic unit of representation is a superpixel (a planar segment). Therefore, we compute features for an irregular segment S_i (compared to a rectangular patch in Section 2.4).
For each superpixel, we compute a battery of features to capture some of the
monocular cues discussed in Section 2.3. We also compute features to predict mean-
ingful boundaries in the images, such as occlusion and folds.
2.9.1 Monocular image features
To compute the segment-based features, we sum the filter outputs over the pixels in superpixel S_i as E_i(n) = \sum_{(x,y) \in S_i} |I * F_n|^k. (Note that in the point-wise representation we did the summation over a uniform rectangular patch instead of an irregular superpixel.)
The shape of the segment gives information about depth as well. For example, a large superpixel is more likely to be part of a large planar surface such as a wall, whereas a small irregular superpixel is more likely to belong to an irregular object, such as part of a tree. Our superpixel shape and location based features (14, computed only for the superpixel) included the shape and location based features in Section 2.2 of [46], and also the eccentricity of the superpixel. (See Fig. 2.15.)
We attempt to capture more “contextual” information by also including features
from neighboring superpixels (we pick the largest four in our experiments), and at
multiple spatial scales (three in our experiments). The features, therefore, contain
information from a larger portion of the image, and thus are more expressive than just
local features. This makes the feature vector x_i of a superpixel 34 × (4 + 1) × 3 + 14 = 524 dimensional.
2.9.2 Features for boundaries
Another strong cue for 3D structure perception is boundary information. If two
neighboring superpixels of an image display different features, humans would often
perceive them to be parts of different objects; therefore an edge between two superpixels with distinctly different features is a candidate for an occlusion boundary or a
fold. To compute the features ǫij between superpixels i and j, we first generate 14
different segmentations for each image for 2 different scales for 7 different properties
based on textures, color, and edges. We found that features based on multiple seg-
mentations are better for indicating boundaries. The groupings (segments) produced
by the over-segmentation algorithm were empirically found to be better indicators than the histogram comparisons of Section 2.4.2. We modified [32] to create segmentations based on these
properties. Each element of our 14-dimensional feature vector ǫ_ij is then an indicator of whether two superpixels i and j lie in the same segment of the corresponding segmentation. For example, if two superpixels belong to the same segments in all the 14 segmentations, then it is more likely
that they are coplanar or connected. Relying on multiple segmentation hypotheses
instead of one makes the detection of boundaries more robust. The features ǫij are
the input to the classifier for the occlusion boundaries and folds.
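These indicator features can be sketched as follows (a NumPy illustration with random stand-in segment labels; the real labels come from the texture-, color-, and edge-based segmentations described above):

```python
import numpy as np

def boundary_features(seg_labels, i, j):
    """epsilon_ij: one indicator per segmentation, 1 if superpixels i and j
    received the same segment label under that segmentation, else 0."""
    return np.array([int(labels[i] == labels[j]) for labels in seg_labels])

# Toy stand-in: 14 segmentations of 5 superpixels, labels drawn at random.
rng = np.random.default_rng(6)
seg_labels = rng.integers(0, 3, size=(14, 5))
eps_01 = boundary_features(seg_labels, 0, 1)
```

A pair that agrees under every segmentation yields the all-ones vector, the strongest evidence for coplanarity or connectedness.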
Figure 2.10: (Left) Original image. (Right) Superpixels overlaid with an illustration of the Markov Random Field (MRF). The MRF models the relations (shown by the edges) between neighboring superpixels. (Only a subset of nodes and edges shown.)
2.10 Probabilistic model
It is difficult to infer 3D information of a region from local cues alone (see Section 2.3),
and one needs to infer the 3D information of a region in relation to the 3D information
of other regions.
In our MRF model, we try to capture the following properties of the images:
• Image Features and depth: The image features of a superpixel bear some
relation to the depth (and orientation) of the superpixel.
• Connected structure: Except in case of occlusion, neighboring superpixels
are more likely to be connected to each other.
• Co-planar structure: Neighboring superpixels are more likely to belong to
the same plane, if they have similar features and if there are no edges between
them.
• Co-linearity: Long straight lines in the image plane are more likely to be
straight lines in the 3D model. For example, edges of buildings, sidewalk, win-
dows.
Note that no single one of these four properties is enough, by itself, to predict the
3D structure. For example, in some cases, local image features are not strong indica-
tors of the depth (and orientation) (e.g., a patch on a blank feature-less wall). Thus,
our approach will combine these properties in an MRF, in a way that depends on our
“confidence” in each of these properties. Here, the “confidence” is itself estimated
from local image cues, and will vary from region to region in the image.
Our MRF is composed of five types of nodes. The input to the MRF occurs
through two variables, labeled x and ǫ. These variables correspond to features computed from the image pixels (see Section 2.9 for details) and are always observed;
thus the MRF is conditioned on these variables. The variables ν indicate our degree of
confidence in a depth estimate obtained only from local image features. The variables
y indicate the presence or absence of occlusion boundaries and folds in the image.
These variables are used to selectively enforce coplanarity and connectivity between
superpixels. Finally, the variables α are the plane parameters that are inferred using
the MRF, which we call “Plane Parameter MRF.”14
Occlusion Boundaries and Folds: We use the variables yij ∈ {0, 1} to indicate
whether an “edgel” (the edge between two neighboring superpixels) is an occlusion
boundary/fold or not. The inference of these boundaries is typically not completely
accurate; therefore we will infer soft values for yij. (See Fig. 2.11.) More formally,
for an edgel between two superpixels i and j, y_ij = 0 indicates an occlusion boundary/fold, and y_ij = 1 indicates none (i.e., a planar surface).
In many cases, strong image gradients do not correspond to an occlusion boundary/fold; e.g., the shadow of a building falling on a ground surface may create an edge
between the part with a shadow and the one without. An edge detector that relies
just on these local image gradients would mistakenly produce an edge. However, there
are other visual cues beyond local image gradients that better indicate whether two
planes are connected/coplanar or not. Using learning to combine a number of such
14For comparison, we also present an MRF that only models the 3D location of the points in the image ("Point-wise MRF," see Appendix).
Figure 2.11: (Left) An image of a scene. (Right) Inferred "soft" values of y_ij ∈ [0, 1]. (y_ij = 0 indicates an occlusion boundary/fold, and is shown in black.) Note that even with the inferred y_ij being not completely accurate, the plane parameter MRF will be able to infer "correct" 3D models.
visual features makes the inference more accurate. In [84], Martin, Fowlkes and Malik
used local brightness, color and texture for learning segmentation boundaries. Here,
our goal is to learn occlusion boundaries and folds. In detail, we model yij using a
logistic response as P(y_ij = 1 | ǫ_ij; ψ) = 1/(1 + exp(−ψ^T ǫ_ij)), where ǫ_ij are features of the superpixels i and j (Section 2.9.2), and ψ are the parameters of the model.
During inference, we will use a mean field-like approximation, where we replace yij
with its mean value under the logistic model.
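As a minimal sketch of this boundary classifier (in Python; the feature values and weights below are made up for illustration, not the dissertation's trained model), the logistic response and its mean-field "soft" value can be computed as:

```python
import numpy as np

def p_edge(eps_ij, psi):
    """P(y_ij = 1 | eps_ij; psi): logistic response on the edgel features."""
    return 1.0 / (1.0 + np.exp(-psi @ eps_ij))

# Hypothetical 3-dimensional boundary features for one edgel, and made-up weights.
eps_ij = np.array([0.2, -1.3, 0.7])
psi = np.array([1.5, 0.8, -0.4])

# During inference, y_ij is replaced by its mean under the logistic model,
# giving a "soft" value in [0, 1] (values near 0 indicate an occlusion boundary/fold).
y_soft = p_edge(eps_ij, psi)
```

In the full model, ψ is learned from hand-labeled boundaries (Section 2.10.1), and the soft y_ij then weights the pairwise terms of the MRF.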
Now, we will describe how we model the distribution of the plane parameters α,
conditioned on y.
Fractional depth error: For 3D reconstruction, the fractional (or relative) error in depths is most meaningful; it is used in structure from motion, stereo reconstruction, etc. [65, 155] For ground-truth depth d and estimated depth d̂, the fractional error is defined as (d̂ − d)/d = d̂/d − 1. Therefore, we will penalize fractional errors in our MRF.
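For instance (with made-up depths), a ground-truth depth of 4 m and an estimated depth of 5 m give a fractional error of 0.25, and the two forms of the definition agree:

```python
d_true, d_hat = 4.0, 5.0              # hypothetical ground-truth and estimated depth
frac_err = (d_hat - d_true) / d_true  # fractional (relative) error
assert frac_err == d_hat / d_true - 1 # the identity used in the text
assert abs(frac_err - 0.25) < 1e-12
```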
MRF Model: To capture the relation between the plane parameters and the image features, and other properties such as co-planarity, connectedness and co-linearity, we formulate our MRF as

P(α | X, ν, y, R; θ) = (1/Z) ∏_i f1(α_i | X_i, ν_i, R_i; θ) ∏_{i,j} f2(α_i, α_j | y_ij, R_i, R_j)    (2.3)
where α_i is the plane parameter of the superpixel i. For a total of S_i points in the superpixel i, we use x_{i,si} to denote the features for point s_i in the superpixel i. X_i = {x_{i,si} ∈ R^524 : s_i = 1, ..., S_i} are the features for the superpixel i (Section 2.9.1). Similarly, R_i = {R_{i,si} : s_i = 1, ..., S_i} is the set of rays for superpixel i.15 ν is the "confidence" in how good the (local) image features are at predicting depth (more details later).
Figure 2.12: Illustration explaining the effect of the choice of s_i and s_j on enforcing (a) connected structure and (b) co-planarity.
The first term f1(·) models the plane parameters as a function of the image features x_{i,si}. We have R^T_{i,si} α_i = 1/d_{i,si} (where R_{i,si} is the ray that connects the camera to the 3D location of point s_i), and if the estimated depth is d̂_{i,si} = x^T_{i,si} θ_r, then the fractional error would be

(d̂_{i,si} − d_{i,si}) / d_{i,si} = (1/d_{i,si}) d̂_{i,si} − 1 = R^T_{i,si} α_i (x^T_{i,si} θ_r) − 1

15The rays are obtained by making a reasonable guess on the camera intrinsic parameters—that the image center is the origin and the pixel-aspect-ratio is one—unless known otherwise from the image headers.
Therefore, to minimize the aggregate fractional error over all the points in the superpixel, we model the relation between the plane parameters and the image features as

f1(α_i | X_i, ν_i, R_i; θ) = exp( − Σ_{si=1}^{Si} ν_{i,si} | R^T_{i,si} α_i (x^T_{i,si} θ_r) − 1 | )    (2.4)
The parameters of this model are θ_r ∈ R^524. We use different parameters (θ_r) for rows r = 1, ..., 11 in the image, because the images we consider are roughly aligned upwards (i.e., the direction of gravity is roughly downwards in the image); this allows our algorithm to learn some regularities in the images—that different rows of the image have different statistical properties. E.g., a blue superpixel might be more likely to be sky if it is in the upper part of the image, or water if it is in the lower part of the image; and in the images of environments available on the internet, the horizon is more likely to be in the middle one-third of the image. (In our experiments, we obtained very similar results using a number of rows ranging from 5 to 55.) Here, ν_i = {ν_{i,si} : s_i = 1, ..., S_i} indicates the confidence of the features in predicting the depth d_{i,si} at point s_i.16 If the local image features are not strong enough to predict depth for point s_i, then ν_{i,si} = 0 turns off the effect of the term |R^T_{i,si} α_i (x^T_{i,si} θ_r) − 1|.
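A minimal sketch of this data term (Eq. 2.4) as an energy, with a hypothetical plane, rays, and toy one-dimensional "features" chosen so the checks are easy to follow; −log f1 is a confidence-weighted sum of absolute fractional errors:

```python
import numpy as np

def neg_log_f1(alpha_i, rays, x, theta_r, nu):
    """-log f1 (Eq. 2.4): sum_s nu_s * | R_s^T alpha_i * (x_s^T theta_r) - 1 |."""
    err = (rays @ alpha_i) * (x @ theta_r) - 1.0
    return np.sum(nu * np.abs(err))

# Hypothetical plane and rays (3 points in one superpixel).
alpha = np.array([0.0, 0.0, 0.5])    # plane parameters: R^T alpha = 1/depth
rays = np.array([[0.1, 0.0, 1.0],
                 [0.0, 0.1, 1.0],
                 [0.0, 0.0, 1.0]])
depths = 1.0 / (rays @ alpha)        # true depths implied by the plane
x = depths.reshape(-1, 1)            # toy 1-d "features": the depth itself
theta = np.array([1.0])              # so x @ theta predicts depth exactly
nu = np.ones(3)                      # full confidence at every point

e0 = neg_log_f1(alpha, rays, x, theta, nu)          # zero: perfect prediction
e_bad = neg_log_f1(alpha, rays, x, 1.1 * theta, nu) # 10% error at 3 points: 0.3
```

Note how a 10% multiplicative depth error contributes 0.1 per point, independent of the actual depth, which is exactly the fractional-error behavior the text argues for.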
The second term f2(·) models the relation between the plane parameters of two superpixels i and j. It uses pairs of points s_i and s_j to do so:

f2(·) = ∏_{{si,sj} ∈ N} h_{si,sj}(·)    (2.5)

We will capture co-planarity, connectedness and co-linearity by different choices of h(·) and {s_i, s_j}.
16The variable ν_{i,si} is an indicator of how good the image features are in predicting depth for point s_i in superpixel i. We learn ν_{i,si} from the monocular image features, by estimating the expected value of |d_i − x_i^T θ_r| / d_i as φ_r^T x_i with a logistic response, with φ_r as the parameters of the model, x_i the features, and d_i the ground-truth depths.
Figure 2.13: A 2-d illustration to explain the co-planarity term. The distance of the point s_j on superpixel j to the plane on which superpixel i lies, along the ray R_{j,sj}, is given by d1 − d2.
Connected structure: We enforce this constraint by choosing s_i and s_j to be on the boundary of the superpixels i and j. As shown in Fig. 2.12a, penalizing the distance between two such points ensures that they remain fully connected. The relative (fractional) distance between points s_i and s_j is penalized by

h_{si,sj}(α_i, α_j, y_ij, R_i, R_j) = exp( −y_ij | (R^T_{i,si} α_i − R^T_{j,sj} α_j) d̂ | )    (2.6)

In detail, R^T_{i,si} α_i = 1/d_{i,si} and R^T_{j,sj} α_j = 1/d_{j,sj}; therefore, the term (R^T_{i,si} α_i − R^T_{j,sj} α_j) d̂ gives the fractional distance |(d_{i,si} − d_{j,sj}) / √(d_{i,si} d_{j,sj})| for d̂ = √(d̂_{si} d̂_{sj}). Note that in case of occlusion, the variable y_ij = 0, and hence the two superpixels will not be forced to be connected.
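The claim that multiplying the difference of inverse depths by d̂ = √(d1 d2) yields a fractional distance can be checked numerically (the depths below are made up):

```python
d1, d2 = 3.0, 4.5               # hypothetical depths of s_i and s_j along a shared ray
inv_diff = 1.0 / d1 - 1.0 / d2  # (R^T alpha_i - R^T alpha_j) for points on one ray
d_hat = (d1 * d2) ** 0.5        # geometric mean of the two depths
frac_dist = inv_diff * d_hat
# Matches (d2 - d1) / sqrt(d1 * d2), the relative (fractional) distance in the text.
assert abs(frac_dist - (d2 - d1) / d_hat) < 1e-12
```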
Co-planarity: We enforce the co-planar structure by choosing a third pair of points s″_i and s″_j in the center of each superpixel, along with the ones on the boundary (Fig. 2.12b). To enforce co-planarity, we penalize the relative (fractional) distance of point s″_j from the plane on which superpixel i lies, along the ray R_{j,s″j} (see Fig. 2.13):

h_{s″j}(α_i, α_j, y_ij, R_{j,s″j}) = exp( −y_ij | (R^T_{j,s″j} α_i − R^T_{j,s″j} α_j) d̂_{s″j} | )    (2.7)

with h_{s″i,s″j}(·) = h_{s″i}(·) h_{s″j}(·). Note that if the two superpixels are coplanar, then h_{s″i,s″j} = 1. To enforce co-planarity between two distant planes that are not connected, we can choose three such points and use the above penalty.

Figure 2.14: Co-linearity. (a) Two superpixels i and j lying on a straight line in the 2-d image. (b) An illustration (3D world, top view) showing that a long straight line in the image plane is more likely to be a straight line in 3D.
Co-linearity: Consider two superpixels i and j lying on a long straight line in a 2-d
image (Fig. 2.14a). There are an infinite number of curves that would project to a
straight line in the image plane; however, a straight line in the image plane is more
likely to be a straight one in 3D as well (Fig. 2.14b). In our model, therefore, we
will penalize the relative (fractional) distance of a point (such as sj) from the ideal
straight line.
In detail, consider two superpixels i and j that lie on planes parameterized by αi
and αj respectively in 3D, and that lie on a straight line in the 2-d image. For a point
sj lying on superpixel j, we will penalize its (fractional) distance along the ray Rj,sj
from the 3D straight line passing through superpixel i, i.e.,

h_{sj}(α_i, α_j, y_ij, R_{j,sj}) = exp( −y_ij | (R^T_{j,sj} α_i − R^T_{j,sj} α_j) d̂ | )    (2.8)

with h_{si,sj}(·) = h_{si}(·) h_{sj}(·). In detail, R^T_{j,sj} α_j = 1/d_{j,sj} and R^T_{j,sj} α_i = 1/d′_{j,sj}; therefore, the term (R^T_{j,sj} α_i − R^T_{j,sj} α_j) d̂ gives the fractional distance |(d_{j,sj} − d′_{j,sj}) / √(d_{j,sj} d′_{j,sj})| for d̂ = √(d_{j,sj} d′_{j,sj}). The "confidence" y_ij depends on the length of the line and its curvature—a long straight line in 2-d is more likely to be a straight line in 3D.
2.10.1 Parameter learning
Exact parameter learning based on the conditional likelihood for the Laplacian models is intractable; therefore, we use Multi-Conditional Learning (MCL) [15, 87] to divide the learning problem into smaller learning problems for each of the individual densities.
MCL is a framework for optimizing graphical models based on a product of several
marginal conditional likelihoods each relying on common sets of parameters from an
underlying joint model and predicting different subsets of variables conditioned on
other subsets.
In detail, we will first focus on learning θ_r given the ground-truth depths d (obtained from our 3D laser scanner, see Section 2.11) and the values of y_ij and ν_{i,si}. For this, we maximize the conditional pseudo log-likelihood log P(α | X, ν, y, R; θ_r) as

θ_r* = arg max_{θr} Σ_i log f1(α_i | X_i, ν_i, R_i; θ_r) + Σ_{i,j} log f2(α_i, α_j | y_ij, R_i, R_j)

Now, from Eq. 2.3 note that f2(·) does not depend on θ_r; therefore the learning problem simplifies to minimizing the L1 norm, i.e.,

θ_r* = arg min_{θr} Σ_i Σ_{si=1}^{Si} ν_{i,si} | (1/d_{i,si}) (x^T_{i,si} θ_r) − 1 |

This is efficiently solved using a Linear Program (LP).
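As an illustrative sketch (not the dissertation's solver; the features, depths, and confidences below are synthetic), this weighted-L1 fit can be posed as an LP in the standard way, introducing one slack variable per point, here via scipy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy problem (all data made up): recover theta from noiseless "depths".
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.uniform(1.0, 2.0, size=(n, p))   # features x_s (positive, so depths are positive)
theta_true = np.array([0.5, 1.0, 2.0])
d = X @ theta_true                       # ground-truth depths d_s
nu = np.ones(n)                          # confidences nu_s

# minimize sum_s nu_s * t_s   subject to   -t_s <= (x_s / d_s)^T theta - 1 <= t_s
A = X / d[:, None]
c = np.concatenate([np.zeros(p), nu])    # variables z = [theta; t]
A_ub = np.block([[A, -np.eye(n)],
                 [-A, -np.eye(n)]])
b_ub = np.concatenate([np.ones(n), -np.ones(n)])
bounds = [(None, None)] * p + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
theta_hat = res.x[:p]
```

With noiseless synthetic depths the LP attains zero objective at the true parameters, so `theta_hat` recovers `theta_true`.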
In the next step, we learn the parameters φ of the logistic regression model for estimating ν in footnote 16. Parameters of a logistic regression model can be estimated by maximizing the conditional log-likelihood [8]. The parameters ψ of the logistic regression model P(y_ij | ǫ_ij; ψ) for occlusion boundaries and folds are similarly estimated, using the hand-labeled ground-truth training data, by maximizing its conditional log-likelihood.
2.10.2 MAP inference
MAP inference of the plane parameters α, i.e., maximizing the conditional likelihood P(α | X, ν, y, R; θ), is efficiently performed by solving an LP. We implemented an efficient method that uses the sparsity in our problem, so that inference can be performed in about 4-5 seconds for an image having about 2000 superpixels, on a single-core Intel 3.40GHz CPU with 2 GB RAM.
Details: When given a new test-set image, we find the MAP estimate of the plane parameters α by maximizing the conditional log-likelihood log P(α | X, ν, y, R; θ). Note that we solve for α as a continuous variable optimization problem, which is unlike many other techniques where discrete optimization is more popular, e.g., [155]. From Eq. 2.3, we have

α* = arg max_α log P(α | X, ν, y, R; θ_r)
   = arg max_α log (1/Z) ∏_i f1(α_i | X_i, ν_i, R_i; θ_r) ∏_{i,j} f2(α_i, α_j | y_ij, R_i, R_j)

Note that the partition function Z does not depend on α. Therefore, from Eq. 2.4, 2.6 and 2.7, and for d̂ = x^T θ_r, we have

α* = arg min_α Σ_{i=1}^{K} ( Σ_{si=1}^{Si} ν_{i,si} | (R^T_{i,si} α_i) d̂_{i,si} − 1 |
     + Σ_{j∈N(i)} Σ_{si,sj∈Bij} y_ij | (R^T_{i,si} α_i − R^T_{j,sj} α_j) d̂_{si,sj} |
     + Σ_{j∈N(i)} Σ_{sj∈Cj} y_ij | (R^T_{j,sj} α_i − R^T_{j,sj} α_j) d̂_{sj} | )
where K is the number of superpixels in each image; N(i) is the set of "neighboring" superpixels of superpixel i—ones whose relations are modeled; B_ij is the set of pairs of points on the boundary of superpixels i and j that model connectivity; C_j is the center point of superpixel j that models co-linearity and co-planarity; and d̂_{si,sj} = √(d̂_{si} d̂_{sj}).
Note that each of the terms is an L1 norm of a linear function of α; therefore, this is an L1-norm minimization problem [13, chap. 6.1.1] and can be compactly written as

arg min_x ‖Ax − b‖_1 + ‖Bx‖_1 + ‖Cx‖_1

where x ∈ R^{3K×1} is a column vector formed by rearranging the three x-y-z components of α_i ∈ R^3 as x_{3i−2} = α_ix, x_{3i−1} = α_iy and x_{3i} = α_iz; A is a block diagonal matrix such that A[(Σ_{l=1}^{i−1} S_l) + s_i, (3i−2):3i] = R^T_{i,si} d̂_{i,si} ν_{i,si}, and b is the column vector formed from the ν_{i,si} (one entry per point). B and C are also block diagonal matrices composed of the rays R, d̂ and y; they represent the cross terms modeling the connected structure, co-planarity and co-linearity properties.
In general, finding the global optimum in a loopy MRF is difficult. However, in our case the minimization problem is a Linear Program (LP), and therefore can be solved exactly using any linear programming solver. (In fact, any greedy method, including loopy belief propagation, would reach the global minimum.) For fast inference, we implemented our own optimization method, one that captures the sparsity pattern in our problem and approximates the L1 norm with a smooth function:

‖x‖_1 ≅ Υ_β(x) = (1/β) [ log(1 + exp(−βx)) + log(1 + exp(βx)) ]

Note that ‖x‖_1 = lim_{β→∞} Υ_β(x), and the approximation can be made arbitrarily close by increasing β during steps of the optimization. We then wrote a customized Newton-method-based solver that computes the Hessian efficiently by utilizing the sparsity. [13]
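A quick sketch of the smooth approximation Υ_β and its convergence to |x| (the values of x and β below are made up):

```python
import numpy as np

def upsilon(x, beta):
    """Smooth approximation of |x|: (1/beta)[log(1+e^(-bx)) + log(1+e^(bx))]."""
    # log(1 + exp(t)) written stably as logaddexp(0, t)
    return (np.logaddexp(0.0, -beta * x) + np.logaddexp(0.0, beta * x)) / beta

x = 0.7
for beta in (1.0, 10.0, 100.0):
    assert upsilon(x, beta) >= abs(x)   # Upsilon_beta is an upper bound on |x|
# The approximation tightens monotonically as beta grows.
assert upsilon(x, 100.0) < upsilon(x, 10.0) < upsilon(x, 1.0)
assert abs(upsilon(x, 100.0) - abs(x)) < 1e-3
```

Because Υ_β is smooth, a Newton-style solver can be applied, increasing β over the course of the optimization as the text describes.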
Figure 2.15: The feature vector. (a) The original image. (b) Superpixels for the image. (c) An illustration showing the location of the neighbors of superpixel S3C at multiple scales. (d) Actual neighboring superpixels of S3C at the finest scale. (e) Features from each neighboring superpixel, along with the superpixel-shape features, give a total of 524 features for the superpixel S3C. (Best viewed in color.)
2.11 Data collection
We used a 3D laser scanner to collect images and their corresponding depthmaps
(Fig. 2.17). The scanner uses a laser device (SICK LMS-291) which gives depth
readings in a vertical column, with a 1.0◦ resolution. To collect readings along the
other axis (left to right), the SICK laser was mounted on a panning motor. The motor
rotates after each vertical scan to collect laser readings for another vertical column,
with a 0.5◦ horizontal angular resolution. We reconstruct the depthmap using the vertical laser scans, the motor readings, and the known relative position and pose of the laser device and the camera.

Figure 2.16: Results for a varied set of environments, showing (a) original image, (b) ground-truth depthmap, (c) predicted depthmap by the Gaussian model, (d) predicted depthmap by the Laplacian model. (Best viewed in color.)

We also collected data of stereo pairs with corresponding depthmaps (Section 3.2), by mounting the laser range-finding equipment on a LAGR (Learning Applied to Ground Robotics) robot (Fig. 2.18). The LAGR vehicle is equipped with sensors, an onboard computer, and Point Grey Research Bumblebee stereo cameras, mounted with a baseline distance of 11.7 cm. [146]
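The geometry of assembling a depthmap from such pan/tilt range readings can be sketched as follows (geometry only; the angles, ranges, and axis convention are made up for illustration, and the actual rig calibration is more involved):

```python
import math

def laser_point(pan_deg, tilt_deg, range_m):
    """Convert one pan/tilt laser reading to a 3D point (x right, y up, z forward)."""
    pan, tilt = math.radians(pan_deg), math.radians(tilt_deg)
    x = range_m * math.cos(tilt) * math.sin(pan)
    y = range_m * math.sin(tilt)
    z = range_m * math.cos(tilt) * math.cos(pan)
    return (x, y, z)

# One hypothetical reading: straight ahead at 10 m.
x, y, z = laser_point(0.0, 0.0, 10.0)

# A vertical column of readings at 1.0-degree vertical resolution at one pan
# position; the motor then steps the pan angle by 0.5 degrees for the next column.
column = [laser_point(0.5, t, 10.0) for t in range(-10, 11)]
```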
2.12 Experiments: depths
We collected a total of 425 image+depthmap pairs, with an image resolution of
1704x2272 and a depthmap resolution of 86x107. In the experimental results reported
here, 75% of the images/depthmaps were used for training, and the remaining 25%
for hold-out testing. The images comprise a wide variety of scenes including natural
environments (forests, trees, bushes, etc.), man-made environments (buildings, roads,
sidewalks, trees, grass, etc.), and purely indoor environments (corridors, etc.). Due
to limitations of the laser, the depthmaps had a maximum range of 81m (the maximum range of the laser scanner), and had minor additional errors due to reflections,
missing laser scans, and mobile objects. Prior to running our learning algorithms,
we transformed all the depths to a log scale so as to emphasize multiplicative rather
than additive errors in training. Data used in the experiments is available at:
http://ai.stanford.edu/∼asaxena/learningdepth/
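The log-scale preprocessing and its multiplicative interpretation can be sketched as follows (the depth values here are made up):

```python
import numpy as np

d_true = np.array([2.0, 10.0, 50.0])   # hypothetical ground-truth depths (m)
d_pred = np.array([2.5, 8.0, 65.0])    # hypothetical predictions

# Train and evaluate in log10 depth: errors become multiplicative, not additive.
abs_log_err = np.mean(np.abs(np.log10(d_pred) - np.log10(d_true)))

# An error of eps in log10 corresponds to a 10**eps multiplicative error in depth;
# e.g., an average log10 error of .132 is a 10**0.132 (about 1.355x, i.e. 35.5%) error.
mult_err = 10 ** abs_log_err
```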
We tested our model on real-world test-set images of forests (containing trees,
bushes, etc.), campus areas (buildings, people, and trees), and indoor scenes (such as
corridors).
Table 2.1 shows the test-set results with different feature combinations of scales,
summary statistics, and neighbors, on three classes of environments: forest, campus,
and indoor. The Baseline model is trained without any features, and predicts the
mean value of depth in the training depthmaps. We see that multiscale and column
features improve the algorithm’s performance. Including features from neighboring
patches, which help capture more global information, reduces the error from .162
Figure 2.17: The 3D scanner used for collecting images and the corresponding depthmaps.
Figure 2.18: The custom-built 3D scanner for collecting depthmaps with stereo image pairs, mounted on the LAGR robot.
Figure 2.19: (a) Original image, (b) ground-truth depthmap, (c) "prior" depthmap (trained with no features), (d) features only (no MRF relations), (e) full Laplacian model. (Best viewed in color.)
Table 2.1: Effect of multiscale and column features on accuracy. The average absolute errors (RMS errors gave very similar trends) are on a log scale (base 10). H1 and H2 represent summary statistics for k = 1, 2. S1, S2 and S3 represent the 3 scales. C represents the column features. Baseline is trained with only the bias term (no features).

Feature                                    All    Forest  Campus  Indoor
Baseline                                   .295   .283    .343    .228
Gaussian (S1,S2,S3, H1,H2, no neighbors)   .162   .159    .166    .165
Gaussian (S1, H1,H2)                       .171   .164    .189    .173
Gaussian (S1,S2, H1,H2)                    .155   .151    .164    .157
Gaussian (S1,S2,S3, H1,H2)                 .144   .144    .143    .144
Gaussian (S1,S2,S3, C, H1)                 .139   .140    .141    .122
Gaussian (S1,S2,S3, C, H1,H2)              .133   .135    .132    .124
Laplacian                                  .132   .133    .142    .084
orders of magnitude to .133 orders of magnitude.17 We also note that the Laplacian model performs better than the Gaussian one, reducing error to .084 orders of
magnitude for indoor scenes, and .132 orders of magnitude when averaged over all
scenes. Empirically, the Laplacian model does indeed give depthmaps with significantly sharper boundaries (as in our discussion in Section 2.5.2; also see Fig. 2.16).
Fig. 2.19 shows that modeling the spatial relationships in the depths is important.
Depths estimated without using the second term in the exponent of Eq. 2.2, i.e.,
depths predicted using only image features with row-sensitive parameters θr, are
17Errors are on a log10 scale. Thus, an error of ε means a multiplicative error of 10^ε in actual depth. E.g., 10^0.132 = 1.355, which thus represents a 35.5% multiplicative error.
very noisy (Fig. 2.19d).18 Modeling the relations between the neighboring depths at
multiple scales through the second term in the exponent of Eq. 2.2 also gave better
depthmaps (Fig. 2.19e). Finally, Fig. 2.19c shows the model’s “prior” on depths;
the depthmap shown reflects our model’s use of image-row sensitive parameters. In
our experiments, we also found that many features/cues were given large weights;
therefore, a model trained with only a few cues (e.g., the top 50 chosen by a feature
selection method) was not able to predict reasonable depths.
Our algorithm works well in a varied set of environments, as shown in Fig. 2.16
(last column). A number of vision algorithms based on “ground finding” (e.g., [36])
appear to perform poorly when there are discontinuities or significant luminance
variations caused by shadows, or when there are significant changes in the ground
texture. In contrast, our algorithm appears to be robust to luminance variations,
such as shadows (Fig. 2.16, 4th row) and camera exposure (Fig. 2.16, 2nd and 5th
rows).
Some of the errors of the algorithm can be attributed to errors or limitations of
the training set. For example, the maximum value of the depths in the training and
test set is 81m; therefore, far-away objects are all mapped to the distance of 81m.
Further, laser readings are often incorrect for reflective/transparent objects such as
glass; therefore, our algorithm also often estimates depths of such objects incorrectly.
Quantitatively, our algorithm appears to incur the largest errors on images which
contain very irregular trees, in which most of the 3D structure in the image is dominated by the shapes of the leaves and branches. However, arguably even human-level
performance would be poor on these images.
We note that monocular cues rely on prior knowledge, learned from the training set, about the environment. This is because monocular 3D reconstruction is an
inherently ambiguous problem. Thus, the monocular cues may not generalize well
to images very different from ones in the training set, such as underwater images or
aerial photos.
To test the generalization capability of the algorithm, we also estimated depthmaps
of images downloaded from the Internet (images for which camera parameters are not
18This algorithm gave an overall error of .181, compared to our full model’s error of .132.
Figure 2.20: Typical examples of the predicted depths for images downloaded from the internet. (Best viewed in color.)
known).19 The model (using monocular cues only) was able to produce reasonable
depthmaps on most of the images (Fig. 2.20). Informally, our algorithm appears to
predict the relative depths quite well (i.e., their relative distances to the camera);20
even for scenes very different from the training set, such as a sunflower field, an oil-
painting scene, mountains and lakes, a city skyline photographed from sea, a city
during snowfall, etc.
2.13 Experiments: visual
We collected a total of 534 images+depths, with an image resolution of 2272x1704
and depths with resolution of 55x305. These images were collected during daytime in
a diverse set of urban and natural areas in the city of Palo Alto and its surrounding
regions.
Figure 2.21: (a) Original image, (b) ground-truth depths, (c) depth from image features only, (d) Point-wise MRF, (e) Plane parameter MRF. (Best viewed in color.)
We tested our model on the remaining 134 images (collected using our 3D scanner),
19Since we do not have ground-truth depthmaps for images downloaded from the Internet, we are unable to give a quantitative comparison on these images. Further, in the extreme case of orthogonal cameras or very wide-angle perspective cameras, our algorithm would need to be modified to take into account the field of view of the camera.
20For most applications, such as object recognition using knowledge of depths, robotic navigation, or 3D reconstruction, relative depths are sufficient. The depths could be rescaled to give accurate absolute depths, if the camera parameters are known or are estimated.
Figure 2.22: Typical depths predicted by our plane parameterized algorithm on the hold-out test set, collected using the laser scanner. (Best viewed in color.)
Figure 2.23: Typical results from our algorithm. (Top row) Original images. (Bottom row) Depths (shown in log scale; yellow is closest, followed by red and then blue) generated from the images using our plane parameter MRF. (Best viewed in color.)
Table 2.2: Results: Quantitative comparison of various methods.
Method           % correct   % planes correct   log10   Rel(%)
SCN              NA          NA                 0.198   0.530
HEH              33.1%       50.3%              0.320   1.423
Baseline-1       0%          NA                 0.300   0.698
No priors        0%          NA                 0.170   0.447
Point-wise MRF   23%         NA                 0.149   0.458
Baseline-2       0%          0%                 0.334   0.516
No priors        0%          0%                 0.205   0.392
Co-planar        45.7%       57.1%              0.191   0.373
PP-MRF           64.9%       71.2%              0.187   0.370
and also on 588 internet images. The internet images were collected by issuing keywords on Google image search. To collect data and to perform the evaluation of
the algorithms in a completely unbiased manner, a person not associated with the
project was asked to collect images of environments (greater than 800x600 size). The
person chose the following keywords to collect the images: campus, garden, park,
house, building, college, university, church, castle, court, square, lake, temple, scene.
The images thus collected were from places all over the world, and contained environments that were significantly different from the training set, e.g., hills, lakes,
night scenes, etc. The person chose only those images which were of “environments,”
i.e. she removed images of the geometrical figure ‘square’ when searching for keyword
‘square’; no other pre-filtering was done on the data.
In addition, we manually labeled 50 images with ‘ground-truth’ boundaries to
learn the parameters for occlusion boundaries and folds.
We performed an extensive evaluation of our algorithm on 588 internet test images,
and 134 test images collected using the laser scanner.
In Table 2.2, we compare the following algorithms:
(a) Baseline: Both for the point-wise MRF (Baseline-1) and the plane parameter MRF (Baseline-2). The Baseline MRF is trained without any image features, and thus reflects "prior" depths of sorts.
Figure 2.24: Typical results from HEH and our algorithm. Row 1: Original image. Row 2: 3D model generated by HEH. Rows 3 and 4: 3D model generated by our algorithm. (Note that the screenshots cannot be simply obtained from the original image by an affine transformation.) In image 1, HEH makes mistakes in some parts of the foreground rock, while our algorithm predicts the correct model, with the rock occluding the house, giving a novel view. In image 2, the HEH algorithm detects a wrong ground-vertical boundary, while our algorithm not only finds the correct ground, but also captures a lot of non-vertical structure, such as the blue slide. In image 3, HEH is confused by the reflection, while our algorithm produces a correct 3D model. In image 4, HEH and our algorithm produce roughly equivalent results—HEH is a bit more visually pleasing and our model is a bit more detailed. In image 5, both HEH and our algorithm fail; HEH just predicts one vertical plane at an incorrect location. Our algorithm predicts correct depths of the pole and the horse, but is unable to detect their boundary, hence making it qualitatively incorrect.
(b) Our Point-wise MRF (described in [152]): with and without constraints (connectivity, co-planarity and co-linearity).
(c) Our Plane Parameter MRF (PP-MRF): without any constraint, with co-planar
constraint only, and the full model.
(d) Saxena et al. (SCN) [134, 135], applicable for quantitative errors only.
(e) Hoiem et al. (HEH) [46]. For fairness, we scale and shift their depths before computing the errors to match the global scale of our test images. Without the scaling and shifting, their error is much higher (7.533 for relative depth error).
We compare the algorithms on the following metrics:
(a) % of models qualitatively correct,
(b) % of major planes correctly identified,21
(c) Depth error |log d − log d̂| on a log-10 scale, averaged over all pixels in the hold-out test set,
(d) Average relative depth error |d − d̂| / d. (We give these two numerical errors on only the 134 test images that we collected, because ground-truth laser depths are not available for internet images.)
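A small sketch of the two numerical error metrics (log-10 depth error and average relative depth error) on made-up depth values:

```python
import numpy as np

def depth_errors(d_true, d_pred):
    """Metrics (c) and (d): mean |log10 d - log10 d_hat| and mean |d - d_hat| / d."""
    log10_err = np.mean(np.abs(np.log10(d_true) - np.log10(d_pred)))
    rel_err = np.mean(np.abs(d_true - d_pred) / d_true)
    return log10_err, rel_err

# Hypothetical depths: predictions exactly right except one 2x overestimate.
d_true = np.array([4.0, 4.0, 4.0, 4.0])
d_pred = np.array([4.0, 4.0, 4.0, 8.0])
log10_err, rel_err = depth_errors(d_true, d_pred)
```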
Table 2.2 shows that both of our models (Point-wise MRF and Plane Parameter
MRF) outperform the other algorithms in quantitative accuracy in depth prediction.
Plane Parameter MRF gives better relative depth accuracy and produces sharper
depth discontinuities (Fig. 2.21, 2.22 and 2.23). Table 2.2 also shows that by capturing the image properties of connected structure, co-planarity and co-linearity, the models produced by the algorithm become significantly better. In addition to reducing quantitative errors, PP-MRF does indeed produce significantly better 3D models. When producing 3D flythroughs, even a small number of erroneous planes makes the 3D model visually unacceptable, even though the quantitative numbers may still show small errors.
Our algorithm gives qualitatively correct models for 64.9% of images as compared
21For the first two metrics, we define a model as correct when, for 70% of the major planes in the image (major planes occupy more than 15% of the area), the plane is in a correct relationship with its nearest neighbors (i.e., the relative orientation of the planes is within 30 degrees). Note that changing the numbers, such as 70% to 50% or 90%, 15% to 10% or 30%, and 30 degrees to 20 or 45 degrees, gave similar trends in the results.
Figure 2.25: Typical results from our algorithm. Original image (top), and a screenshot of the 3D flythrough generated from the image (bottom). 16 images (a-g, l-t) were evaluated as "correct" and 4 (h-k) were evaluated as "incorrect."
Table 2.3: Percentage of images for which HEH is better, our PP-MRF is better, or it is a tie.

Algorithm   % better
Tie         15.8%
HEH         22.1%
PP-MRF      62.1%
to 33.1% by HEH. The qualitative evaluation was performed by a person not associated with the project, following the guidelines in Footnote 21. Delage, Lee and
Ng [26] and HEH generate a popup effect by folding the images at “ground-vertical”
boundaries—an assumption which is not true for a significant number of images;
therefore, their method fails in those images. Some typical examples of the 3D models are shown in Fig. 2.24. (Note that all the test cases shown in Fig. 5.1, 2.23, 2.24
and 2.25 are from the dataset downloaded from the internet, except Fig. 2.25a which
is from the laser-test dataset.) These examples also show that our models are often
more detailed, in that they are often able to model the scene with a multitude (over
a hundred) of planes.
We performed a further comparison. Even when both algorithms are evaluated
as qualitatively correct on an image, one result could still be superior. Therefore, we
asked the person to compare the two methods, and decide which one is better, or is
a tie.22 Table 2.3 shows that our algorithm outputs the better model in 62.1% of the
cases, while HEH outputs the better model in 22.1% of the cases (tied in the rest).
Full documentation describing the details of the unbiased human judgment process, along with the 3D flythroughs produced by our algorithm, is available online at:
http://make3d.stanford.edu/research
22To compare the algorithms, the person was asked to count the number of errors made by each algorithm. We define an error as when a major plane in the image (occupying more than 15% of the area in the image) is in the wrong location with respect to its neighbors, or if the orientation of the plane is more than 30 degrees wrong. For example, if HEH folds the image at an incorrect place (see Fig. 2.24, image 2), then it is counted as an error. Similarly, if we predict the top of a building as far and the bottom part of the building as near, making the building tilted—it would count as an error.
58 CHAPTER 2. SINGLE IMAGE DEPTH RECONSTRUCTION
Some of our models, e.g. in Fig. 2.25j, have cosmetic defects—e.g. stretched tex-
ture; better texture rendering techniques would make the models more visually pleas-
ing. In some cases, a small mistake (e.g., one person being detected as far-away in
Fig. 2.25h, and the banner being bent in Fig. 2.25k) makes the model look bad, and
hence be evaluated as “incorrect.”
2.14 Large-scale web experiment
Finally, in a large-scale web experiment, we allowed users to upload their photos on
the internet, and view a 3D flythrough produced from their image by our algorithm.
About 23846 unique users uploaded (and rated) about 26228 images.23 Users rated
48.1% of the models as good. If we consider the images of scenes only, i.e., exclude
images such as company logos, cartoon characters, closeups of objects, etc., then this
percentage was 57.3%. We have made the following website available for downloading
datasets/code, and for converting an image to a 3D model/flythrough:
http://make3d.stanford.edu
Our algorithm, trained on images taken in daylight around the city of Palo
Alto, was able to predict qualitatively correct 3D models for a large variety of
environments—for example, ones that have hills or lakes, ones taken at night, and
even paintings. (See Fig. 2.25 and the website.) We believe, based on our experiments
with varying the number of training examples (not reported here), that having a larger
and more diverse set of training images would improve the algorithm significantly.
2.15 Incorporating object information
In this section, we will demonstrate how our model can also incorporate other infor-
mation that might be available, for example, from object recognizers. In prior work,
Sudderth et al. [168] showed that knowledge of objects could be used to get crude
23No restrictions were placed on the type of images that users can upload. Users can rate the models as good (thumbs-up) or bad (thumbs-down).
Figure 2.26: (Left) Original Images, (Middle) Snapshot of the 3D model without using object information, (Right) Snapshot of the 3D model that uses object information.
depth estimates, and Hoiem et al. [45] used knowledge of objects and their location
to improve the estimate of the horizon. In addition to estimating the horizon, the
knowledge of objects and their location in the scene give strong cues regarding the
3D structure of the scene. For example, that a person is more likely to be on top of
the ground, rather than under it, places certain restrictions on the 3D models that
could be valid for a given image.
Here we give some examples of such cues that arise when information about objects
is available, and describe how we can encode them in our MRF:
(a) “Object A is on top of object B”
This constraint could be encoded by restricting the points si ∈ R3 on object A to
be on top of the points sj ∈ R3 on object B, i.e., si^T z ≥ sj^T z (if z denotes the
“up” vector). In practice, we actually use a probabilistic version of this constraint.
We represent this inequality in plane-parameter space (si = Ri di = Ri/(αi^T Ri)). To
penalize the fractional error ξ = (Ri^T z · αj^T Rj − Rj^T z · αi^T Ri) d (the constraint
corresponds to ξ ≥ 0), we choose an MRF potential h_si,sj(·) = exp(−yij (ξ + |ξ|)),
where yij represents the uncertainty in the object recognizer output. Note that for
yij → ∞ (corresponding to certainty in the object recognizer), this becomes a “hard”
constraint Ri^T z/(αi^T Ri) ≥ Rj^T z/(αj^T Rj).
In fact, we can also encode other similar spatial-relations by choosing the vector
z appropriately. For example, a constraint “Object A is in front of Object B” can be
encoded by choosing z to be the ray from the camera to the object.
(b) “Object A is attached to Object B”
For example, if the ground-plane is known from a recognizer, then many objects would
be more likely to be “attached” to the ground plane. We easily encode this by using
our connected-structure constraint.
(c) Known plane orientation
If the orientation of a plane is roughly known, e.g. that a person is more likely to be
“vertical”, then it can be easily encoded by adding to Eq. 2.3 a term f(αi) =
exp(−wi |αi^T z|); here, wi represents the confidence, and z represents the up vector.
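These soft object-constraint terms are simple to compute. The sketch below is our own illustration, not the thesis code; the function names, and the practice of passing the scalar ξ in directly (rather than computing it from the plane parameters), are our assumptions:

```python
import numpy as np

def one_sided_potential(xi, y_ij):
    """Soft one-sided MRF potential exp(-y_ij * (xi + |xi|)).

    The term (xi + |xi|) equals 2*xi when xi > 0 and 0 otherwise, so the
    potential is 1 on one side of the constraint and decays exponentially
    on the other; y_ij encodes the recognizer's confidence, and
    y_ij -> infinity recovers a hard constraint.
    """
    return np.exp(-y_ij * (xi + np.abs(xi)))

def orientation_prior(alpha_i, z, w_i):
    """Orientation prior f(alpha_i) = exp(-w_i * |alpha_i^T z|):
    favors planes whose parameter vector is orthogonal to the up
    vector z, i.e., roughly vertical surfaces such as people."""
    return np.exp(-w_i * np.abs(alpha_i @ z))
```

Both terms are potentials (factors multiplied into the MRF), so larger values mean a more plausible configuration.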
We implemented a recognizer (based on the features described in Section 5.3.3)
for ground-plane, and used the Dalal-Triggs Detector [22] to detect pedestrians. For
these objects, we encoded the (a), (b) and (c) constraints described above. Fig. 2.26
shows that using the pedestrian and ground detector improves the accuracy of the 3D
model. Also note that using “soft” constraints in the MRF (Section 2.15), instead of
“hard” constraints, helps in estimating correct 3D models even if the object recognizer
makes a mistake.
In an extension of our work [43], we addressed the problem of jointly solving ob-
ject detection and depth reconstruction. We proposed learning a set of related models
in such a way that they both solve their own problem and help each other. We developed
a framework called Cascaded Classification Models (CCM), where repeated instan-
tiations of these classifiers are coupled by their input/output variables in a cascade
that improves performance at each level. Our method requires only a limited “black
box” interface with the models, allowing us to use very sophisticated, state-of-the-art
classifiers without having to look under the hood. We demonstrated the effectiveness
of our method on a large set of natural images by combining the subtasks of scene
categorization, object detection, multiclass image segmentation, and 3D reconstruction.
See Fig. 2.27 for a few examples.
2.16 Discussion
The problem of depth perception is fundamental to computer vision, one that has
enjoyed the attention of many researchers and seen significant progress in the last
few decades. However, the vast majority of this work, such as stereopsis, has used
multiple image geometric cues to infer depth. In contrast, single-image cues offer
a largely orthogonal source of information, one that has heretofore been relatively
underexploited. Given that depth and shape perception appears to be an impor-
tant building block for many other applications, such as object recognition [176, 45],
grasping [136], navigation [88], image compositing [58], and video retrieval [31], we
believe that monocular depth perception has the potential to improve all of these
applications, particularly in settings where only a single image of a scene is available.
We presented an algorithm for inferring detailed 3D structure from a single still
Figure 2.27: (top two rows) three cases where CCM improved results for all tasks. In the first, for instance, the presence of grass allows the CCM to remove the boat detections. The next four rows show four examples where detections are improved and four examples where segmentations are improved. (Best viewed in color.)
image. Compared to previous approaches, our algorithm creates detailed 3D models
which are both quantitatively more accurate and visually more pleasing. Our ap-
proach begins by over-segmenting the image into many small homogeneous regions
called “superpixels” and uses an MRF to infer the 3D position and orientation of each.
Other than assuming that the environment is made of a number of small planes, we
do not make any explicit assumptions about the structure of the scene, such as the
assumption by Delage et al. [26] and Hoiem et al. [46] that the scene comprises verti-
cal surfaces standing on a horizontal floor. This allows our model to generalize well,
even to scenes with significant non-vertical structure.
Chapter 3
Multi-view Geometry using
Monocular Vision
3.1 Introduction
A 3D model built from a single image will almost invariably be an incomplete model
of the scene, because many portions of the scene will be missing or occluded. In this
section, we will use both the monocular cues and multi-view triangulation cues to
create better and larger 3D models.
Given a sparse set of images of a scene, it is sometimes possible to construct a 3D
model using techniques such as structure from motion (SFM) [33, 117], which start
by taking two or more photographs, then find correspondences between the images,
and finally use triangulation to obtain 3D locations of the points. If the images are
taken from nearby cameras (i.e., if the baseline distance is small), then these methods
often suffer from large triangulation errors for points far-away from the camera.1 If,
conversely, one chooses images taken far apart, then often the change of viewpoint
causes the images to become very different, so that finding correspondences becomes
difficult, sometimes leading to spurious or missed correspondences. (Worse, the large
baseline also means that there may be little overlap between the images, so that
1I.e., the depth estimates will tend to be inaccurate for objects at large distances, because evensmall errors in triangulation will result in large errors in depth.
Figure 3.1: Two images taken from a stereo pair of cameras, and the depthmap calculated by a stereo system.
few correspondences may even exist.) These difficulties make purely geometric 3D
reconstruction algorithms fail in many cases, specifically when given only a small set
of images.
However, when tens of thousands of pictures are available—for example, for frequently
photographed tourist attractions such as national monuments—one can use the in-
formation present in many views to reliably discard images that have only a few
correspondence matches. Doing so, one can use only a small subset of the images available
(∼15%), and still obtain a “3D point cloud” for points that were matched using
SFM. This approach has been very successfully applied to famous buildings such
as the Notre Dame; the computational cost of this algorithm was significant, and
required about a week on a cluster of computers [164].
In the specific case of estimating depth from two images taken from a pair of
stereo cameras (Fig. 3.1), the depths are estimated by triangulation using the two
images. Over the past few decades, researchers have developed very good stereo
vision systems (see [155] for a review). Although these systems work well in many
environments, stereo vision is fundamentally limited by the baseline distance between
the two cameras. Specifically, their depth estimates tend to be inaccurate when the
distances considered are large (because even very small triangulation/angle estimation
errors translate to very large errors in distances). Further, stereo vision also tends to
fail for textureless regions of images where correspondences cannot be reliably found.
On the other hand, humans perceive depth by seamlessly combining monocular
cues with stereo cues.2 We believe that monocular cues and (purely geometric) stereo
cues give largely orthogonal, and therefore complementary, types of information about
depth. Stereo cues are based on the difference between two images and do not depend
on the content of the image. Even if the images are entirely random, stereo would still
generate a pattern of disparities (e.g., random dot stereograms [9]). On the other
hand, depth estimates from monocular cues are entirely based on the evidence about
the environment presented in a single image. In this chapter, we investigate how
monocular cues can be integrated with any reasonable stereo system, to obtain better
depth estimates than the stereo system alone.
The reason that many geometric “triangulation-based” methods sometimes fail
(especially when only a few images of a scene are available) is that they do not make
use of the information present in a single image. Therefore, we will extend our MRF
model to seamlessly combine triangulation cues and monocular image cues to build
a full photo-realistic 3D model of the scene. Using monocular cues will also help us
build a 3D model of the parts that are visible in only one view.
3.2 Improving stereo vision using monocular cues
3.2.1 Disparity from stereo correspondence
Depth estimation using stereo vision from two images (taken from two cameras sep-
arated by a baseline distance) involves three steps: First, establish correspondences
between the two images. Then, calculate the relative displacements (called “dispar-
ity”) between the features in each image. Finally, determine the 3D depth of the
feature relative to the cameras, using knowledge of the camera geometry.
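For a rectified stereo pair, the final triangulation step reduces to depth = focal length × baseline / disparity. A minimal sketch under that assumption (the numbers in the comment are illustrative, not from the hardware used here):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth for a rectified stereo pair:
    depth = focal_length * baseline / disparity.
    Returns None where no valid disparity was found
    (e.g., textureless regions with no correspondence)."""
    if disparity_px is None or disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px

# e.g., a 10 px disparity with f = 500 px and a 0.1 m baseline gives 5 m.
```

The inverse relation between disparity and depth is what makes distant points so sensitive to small disparity errors.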
Stereo correspondences give reliable estimates of disparity, except when large por-
tions of the image are featureless (i.e., correspondences cannot be found). Further,
2Stereo cues: Each eye receives a slightly different view of the world and stereo vision combines the two views to perceive 3D depth [180]. An object is projected onto different locations on the two retinae (cameras in the case of a stereo system), depending on the distance of the object. The retinal (stereo) disparity varies with object distance, and is inversely proportional to the distance of the object. Disparity is typically not an effective cue for estimating small depth variations of objects that are far away.
for a given baseline distance between cameras, the accuracy decreases as the depth
values increase. In the limit of very distant objects, there is no observable disparity,
and depth estimation generally fails. Empirically, depth estimates from stereo tend
to become unreliable when the depth exceeds a certain distance.
Our stereo system finds good feature correspondences between the two images by
rejecting pixels with little texture, or where the correspondence is otherwise ambigu-
ous. More formally, we reject any feature where the best match is not significantly
better than all other matches within the search window. We use the sum-of-absolute-
differences correlation as the metric score to find correspondences [33]. Our cameras
(and algorithm) allow sub-pixel interpolation accuracy of 0.2 pixels of disparity. Even
though we use a fairly basic implementation of stereopsis, the ideas in this chapter
can just as readily be applied together with other, perhaps better, stereo systems.
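A toy version of such a matcher might look as follows. This is our illustrative sketch, not the system's implementation; the window size, disparity range, and 0.8 ambiguity threshold are arbitrary choices:

```python
import numpy as np

def match_row_sad(left_row, right_row, x, half_win=3, max_disp=16, ratio=0.8):
    """Find the disparity for pixel x in left_row by minimizing the
    sum-of-absolute-differences (SAD) over a window along the scan line,
    rejecting ambiguous matches whose best score is not clearly better
    than the second best."""
    patch = left_row[x - half_win : x + half_win + 1]
    scores = []
    for d in range(0, max_disp + 1):
        xr = x - d
        if xr - half_win < 0:
            break
        cand = right_row[xr - half_win : xr + half_win + 1]
        scores.append(np.abs(patch - cand).sum())
    if len(scores) < 2:
        return None
    order = np.argsort(scores)
    best, second = scores[order[0]], scores[order[1]]
    if best >= ratio * second:   # ambiguous: runner-up is nearly as good
        return None
    return int(order[0])
```

On a featureless (constant) row every candidate scores equally, so the ratio test rejects the pixel, mirroring the texture-rejection behavior described above.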
3.2.2 Modeling uncertainty in stereo
The errors in disparity are often modeled as either Gaussian [23] or via some other,
heavier-tailed distribution (e.g., [171]). Specifically, the errors in disparity have two
main causes: (a) Assuming unique/perfect correspondence, the disparity has a small
error due to image noise (including aliasing/pixelization), which is well modeled by a
Gaussian. (b) Occasional errors in correspondence cause larger errors, which results
in a heavy-tailed distribution for disparity [171].
If the standard deviation is σg in computing disparity g from stereo images
(because of image noise, etc.), then the standard deviation of the depths3 will be
σd,stereo ≈ σg/g. For our stereo system, we have that σg is about 0.2 pixels;4 this is
then used to estimate σd,stereo. Note therefore that σd,stereo is a function of the esti-
mated depth, and specifically, it captures the fact that variance in depth estimates is
larger for distant objects than for closer ones.
3Using the delta rule from statistics: Var(f(x)) ≈ (f′(x))²Var(x), derived from a second-order Taylor series approximation of f(x). The depth d is related to disparity g as d = log(C/g), with camera parameters determining C.
4One can also envisage obtaining a better estimate of σg as a function of a match metric used during stereo correspondence [14], such as normalized sum of squared differences; or learning σg as a function of disparity/texture based features.
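Propagating the disparity noise to log-depth via the delta rule in Footnote 3 is a one-line computation; a hedged sketch:

```python
def log_depth_sigma(disparity_px, sigma_g=0.2):
    """Delta-rule propagation of disparity noise to log-depth:
    for d = log(C/g), |d/dg log(C/g)| = 1/g, so sigma_d ~ sigma_g / g.
    Small disparities (distant points) give large depth uncertainty."""
    return sigma_g / disparity_px
```

For example, a point with 20 px of disparity has a tenth of the log-depth uncertainty of one with 2 px, which is exactly the distance-dependent weighting the MRF exploits.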
3.2.3 Probabilistic model
We use our Markov Random Field (MRF) model, which models relations between
depths at different points in the image, to incorporate both monocular and stereo
cues. Therefore, the depth of a particular patch depends on the monocular features
of the patch, on the stereo disparity, and is also related to the depths of other parts
of the image.
In our Gaussian and Laplacian MRFs (Eq. 3.1 and 3.2), we now have an additional
term di,stereo, which is the depth estimate obtained from disparity.5 This term models
the relation between the depth and the estimate from stereo disparity. The other
terms in the models are similar to Eq. 2.1 and 2.2 in Section 2.5.
PG(d|X; θ, σ) = (1/ZG) exp( −(1/2) Σi=1..M [ (di(1) − di,stereo)²/σi,stereo² + (di(1) − xi^T θr)²/σ1r² + Σs=1..3 Σj∈Ns(i) (di(s) − dj(s))²/σ2rs² ] )    (3.1)

PL(d|X; θ, λ) = (1/ZL) exp( −Σi=1..M [ |di(1) − di,stereo|/λi,stereo + |di(1) − xi^T θr|/λ1r + Σs=1..3 Σj∈Ns(i) |di(s) − dj(s)|/λ2rs ] )    (3.2)
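For concreteness, the (unnormalized) negative log-likelihood of the Laplacian model in Eq. 3.2 can be evaluated as below. This is our simplified sketch at a single spatial scale, treating λ1 and λ2 as scalars rather than estimating them from image features:

```python
import numpy as np

def laplacian_energy(d, d_stereo, x_feats, theta, lam_stereo, lam1,
                     neighbors, lam2):
    """Negative log of the unnormalized Laplacian MRF (cf. Eq. 3.2),
    at one spatial scale: stereo term + monocular feature term +
    neighborhood smoothness term. Lower energy = more likely depths."""
    stereo = np.sum(np.abs(d - d_stereo) / lam_stereo)      # stereo cue
    mono = np.sum(np.abs(d - x_feats @ theta) / lam1)       # monocular cue
    smooth = sum(abs(d[i] - d[j]) / lam2 for i, j in neighbors)
    return stereo + mono + smooth
```

Because every term is an absolute value of a linear function of d, MAP inference over this energy is a linear program, which is why the Laplacian variant is attractive computationally.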
3.2.4 Results on stereo
For these experiments, we collected 257 stereo pairs+depthmaps in a wide variety
of outdoor and indoor environments, with an image resolution of 1024x768 and a
depthmap resolution of 67x54. We used 75% of the images/depthmaps for training,
and the remaining 25% for hold-out testing.
5In this work, we directly use di,stereo as the stereo cue. In [136], we use a library of features created from stereo depths as the cues for identifying a grasp point on objects.
(a) Image (b) Laser (c) Stereo (d) Mono (Lap) (e) Stereo+Mono
Figure 3.2: Results for a varied set of environments, showing one image of the stereo pairs (column 1), ground truth depthmap collected from 3D laser scanner (column 2), depths calculated by stereo (column 3), depths predicted by using monocular cues only (column 4), depths predicted by using both monocular and stereo cues (column 5). The bottom row shows the color scale for representation of depths. Closest points are 1.2 m, and farthest are 81 m. (Best viewed in color)
Table 3.1: The average errors (RMS errors gave very similar trends) for various cues and models, on a log scale (base 10).

Algorithm           All    Campus  Forest  Indoor
Baseline            .341   .351    .344    .307
Stereo              .138   .143    .113    .182
Stereo (smooth)     .088   .091    .080    .099
Mono (Gaussian)     .093   .095    .085    .108
Mono (Lap)          .090   .091    .082    .105
Stereo+Mono (Lap)   .074   .077    .069    .079
We quantitatively compare the following classes of algorithms that use monocular
and stereo cues in different ways:
(i) Baseline: This model, trained without any features, predicts the mean value of
depth in the training depthmaps.
(ii) Stereo: Raw stereo depth estimates, with the missing values set to the mean
value of depth in the training depthmaps.
(iii) Stereo (smooth): This method performs interpolation and region filling, using
the Laplacian model without the second term in the exponent in Eq. 3.2, and also
without using monocular cues to estimate λ2 as a function of the image.
(iv) Mono (Gaussian): Depth estimates using only monocular cues, without the
first term in the exponent of the Gaussian model in Eq. 3.1.
(v) Mono (Lap): Depth estimates using only monocular cues, without the first term
in the exponent of the Laplacian model in Eq. 3.2.
(vi) Stereo+Mono: Depth estimates using the full model.
Table 3.1 shows that although the model is able to predict depths using monocular
cues only (“Mono”), the performance is significantly improved when we combine both
mono and stereo cues. The algorithm is able to estimate depths with an error of .074
orders of magnitude (i.e., a multiplicative error of 18.6%, because 10^0.074 = 1.186), which
represents a significant improvement over the stereo (smooth) performance of .088.
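The error metric here is the mean absolute difference of log10 depths; a small sketch of the computation and of the conversion to a multiplicative error:

```python
import numpy as np

def mean_log10_error(pred_depths, true_depths):
    """Average absolute error on a log10 scale; an error of e orders
    of magnitude corresponds to a multiplicative depth error of 10**e."""
    return np.mean(np.abs(np.log10(pred_depths) - np.log10(true_depths)))

# An error of 0.074 orders of magnitude is a 10**0.074 ~ 1.19x
# multiplicative error in depth.
```

Working on a log scale makes a factor-of-two error at 2 m count the same as a factor-of-two error at 40 m, which matches the multiplicative nature of depth uncertainty.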
Fig. 3.2 shows that the model is able to predict depthmaps (column 5) in a variety
of environments. It also demonstrates how the model takes the best estimates from
[Plot: errors on a log scale (base 10) vs. ground-truth distance (meters), for Stereo+Mono, Mono (Lap), and Stereo (smooth).]
Figure 3.3: The average errors (on a log scale, base 10) as a function of the distancefrom the camera.
both stereo and monocular cues to estimate more accurate depthmaps. For example,
in row 6 (Fig. 3.2), the depthmap generated by stereo (column 3) is very inaccurate;
however, the monocular-only model predicts depths fairly accurately (column 4).
The combined model uses both sets of cues to produce a better depthmap (column
5). In row 3, stereo cues give a better estimate than monocular ones. We again see
that our combined MRF model, which uses both monocular and stereo cues, gives
an accurate depthmap (column 5) correcting some mistakes of stereo, such as some
far-away regions that stereo predicted as close.
In Fig. 3.3, we study the behavior of the algorithm as a function of the 3D distance
from the camera. At small distances, the algorithm relies more on stereo cues, which
are more accurate than the monocular cues in this regime. However, at larger dis-
tances, the performance of stereo degrades, and the algorithm relies more on monoc-
ular cues. Since our algorithm models uncertainties in both stereo and monocular
cues, it is able to combine stereo and monocular cues effectively.
We note that monocular cues rely on prior knowledge, learned from the train-
ing set, about the environment. This is because monocular 3D reconstruction is an
inherently ambiguous problem. Thus, the monocular cues may not generalize well
to images very different from ones in the training set, such as underwater images or
aerial photos. In contrast, the stereopsis cues we used are purely geometric, and
therefore should work well even on images taken from very different environments.
For example, the monocular algorithm fails sometimes to predict correct depths for
objects which are only partially visible in the image (e.g., Fig. 3.2, row 2: tree on the
left). For a point lying on such an object, most of the point’s neighbors lie outside the
image; hence the relations between neighboring depths are less effective here than for
objects lying in the middle of an image. However, in many of these cases, the stereo
cues still allow an accurate depthmap to be estimated (row 2, column 5).
3.3 Larger 3D models from multiple images
In the previous section, we showed that combining monocular cues and triangulation
cues gives better depth estimates. However, the method was applicable to stereo pairs
that are calibrated and placed closely together.
Now, we consider creating 3D models from multiple views (which are not necessar-
ily from a calibrated stereo pair) to create a larger 3D model. This problem is harder
because the cameras could be farther apart, and many portions could be visible from
one camera only.
One of our other goals is to create 3D models that are quantitatively correct as well
as visually pleasing. We will use the plane parameter representation for this purpose.
3.3.1 Representation
Given the small-plane (superpixel) segmentations of two images, there is no guarantee
that the two segmentations are “consistent,” in the sense of the small planes (on a
specific object) in one image having a one-to-one correspondence to the planes in the
second image of the same object. Thus, at first blush it appears non-trivial to build a
3D model using these segmentations, since it is impossible to associate the planes in
one image to those in another. We address this problem by using our MRF to reason
Figure 3.4: An illustration of the Markov Random Field (MRF) for inferring 3D structure. (Only a subset of edges and scales shown.)
simultaneously about the position and orientation of every plane in every image. If
two planes lie on the same object, then the MRF will (hopefully) infer that they have
exactly the same 3D position. More formally, in our model, the plane parameters
αi^n of each small plane i in the nth image are represented by a node in our Markov
Random Field (MRF). Because our model uses L1 penalty terms, our algorithm will
be able to infer models for which αi^n = αj^m, which results in the two planes exactly
overlapping each other.
3.3.2 Probabilistic model
In addition to the image features/depth, co-planarity, connected structure, and co-
linearity properties, we will also consider the depths obtained from triangulation
(SFM)—the depth of the point is more likely to be close to the triangulated depth.
Similar to the probabilistic model for 3D model from a single image, most of these
cues are noisy indicators of depth; therefore our MRF model will also reason about
our “confidence” in each of them, using latent variables yT (Section 3.3.3).
Let Qn = [Rotation, Translation] ∈ R3×4 (technically SE(3)) be the camera pose
when image n was taken (w.r.t. a fixed reference, such as the camera pose of the first
image), and let dT be the depths obtained by triangulation (see Section 3.3.3). We
formulate our MRF as
P(α|X, Y, dT; θ) ∝ Πn f1(α^n | X^n, ν^n, R^n, Q^n; θ^n) · Πn f2(α^n | y^n, R^n, Q^n) · Πn f3(α^n | dT^n, yT^n, R^n, Q^n)    (3.3)
where the superscript n is an index over the images. For an image n, αi^n is the plane
parameter of superpixel i in image n. Sometimes, we will drop the superscript for
brevity, and write α in place of α^n when it is clear that we are referring to a particular
image.
The first term f1(·) and the second term f2(·) capture the monocular properties,
and are the same as in Eq. 2.3. We use f3(·) to model the errors in the triangulated depths,
and penalize the (fractional) error between the triangulated depth dTi and di = 1/(Ri^T αi).
For the Kn points for which the triangulated depths are available, we therefore have
f3(α | dT, yT, R, Q) ∝ Πi=1..Kn exp( −yTi |dTi Ri^T αi − 1| ).    (3.4)
This term places a “soft” constraint on a point in the plane to have its depth
equal to its triangulated depth.
MAP Inference: For MAP inference of the plane parameters, we need to maximize
the conditional log-likelihood log P(α|X, Y, dT; θ). All the terms in Eq. 3.3 are L1
norms of linear functions of α; therefore MAP inference is efficiently solved using a
Linear Program (LP).
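The standard reduction from an L1 objective to an LP introduces one slack variable per absolute-value term. A sketch of that reduction for a generic objective min_α Σ|Aα − b| (our illustration using SciPy, not the solver used in the thesis):

```python
import numpy as np
from scipy.optimize import linprog

def l1_map(A, b):
    """Solve min_alpha sum|A @ alpha - b| as a linear program:
    introduce slacks t with -t <= A@alpha - b <= t and minimize sum(t).
    This is the same reduction that makes MAP inference in the
    L1 (Laplacian) MRF an LP."""
    m, n = A.shape
    # Variables: [alpha (n), t (m)]; objective 0*alpha + 1*t.
    c = np.concatenate([np.zeros(n), np.ones(m)])
    # Constraints:  A@alpha - t <= b   and   -A@alpha - t <= -b.
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]
```

For a single scalar variable, this recovers the familiar fact that the L1 minimizer of residuals to a set of values is their median.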
3.3.3 Triangulation matches
In this section, we will describe how we obtained the correspondences across images,
the triangulated depths dT and the “confidences” yT in the f3(·) term in Section 3.3.2.
We start by computing 128 SURF features [4], and then calculate matches based
on the Euclidean distances between the features found. Then to compute the camera
Figure 3.5: An image showing a few matches (left), and the resulting 3D model (right) without estimating the variables y for confidence in the 3D matching. The noisy 3D matches reduce the quality of the model. (Note the cones erroneously projecting out from the wall.)
poses Q = [Rotation, Translation] ∈ R3×4 and the depths dT of the points matched,
we use bundle adjustment [77] followed by using monocular approximate depths to
remove the scale ambiguity. However, many of these 3D correspondences are noisy;
for example, local structures are often repeated across an image (e.g., Fig. 3.5, 3.8
and 3.10).6 Therefore, we also model the “confidence” yT i in the ith match by using
logistic regression to estimate the probability P (yT i = 1) of the match being correct.
For this, we use neighboring 3D matches as a cue. For example, a group of spatially
consistent 3D matches is more likely to be correct than a single isolated 3D match.
We capture this by using a feature vector that counts the number of matches found
6Increasingly many cameras and camera-phones come equipped with GPS, and sometimes also accelerometers (which measure gravity/orientation). Many photo-sharing sites also offer geo-tagging (where a user can specify the longitude and latitude at which an image was taken). Therefore, we could also use such geo-tags (together with a rough user-specified estimate of camera orientation), together with monocular cues, to improve the performance of correspondence algorithms. In detail, we compute the approximate depths of the points using monocular image features as d = x^T θ; this requires only computing a dot product and hence is fast. Now, for each point in an image B for which we are trying to find a correspondence in image A, typically we would search in a band around the corresponding epipolar line in image A. However, given an approximate depth estimated from monocular cues, we can limit the search to a rectangular window that comprises only a subset of this band. (See Fig. 3.6.) This would reduce the time required for matching, and also improve the accuracy significantly when there are repeated structures in the scene, e.g., in Fig. 3.7. (See [150] for more details.)
in the present superpixel and in larger surrounding regions (i.e., at multiple spatial
scales), as well as measures the relative quality between the best and second best
match.
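A hedged sketch of such a confidence estimator follows; the toy training data and the three feature columns are our invention, chosen only to mirror the cues described above (spatially consistent neighbor counts at two scales, plus best/second-best match quality):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [matches in the superpixel, matches in a larger surrounding
# region, best/second-best match distance ratio]; label 1 = correct match.
X_train = np.array([
    [8, 30, 0.40], [6, 25, 0.50], [7, 28, 0.30], [5, 20, 0.45],  # correct
    [0,  2, 0.95], [1,  3, 0.90], [0,  1, 0.98], [1,  2, 0.85],  # spurious
])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Estimated P(y_Ti = 1): a spatially supported, distinctive match should
# score high; an isolated, ambiguous one should score low.
p_good = clf.predict_proba([[7, 27, 0.35]])[0, 1]
p_bad = clf.predict_proba([[0, 1, 0.97]])[0, 1]
```

The resulting probability is then used directly as the weight yTi in the f3(·) term, so spurious matches are softly discounted rather than hard-rejected.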
Figure 3.6: Approximate monocular depth estimates help to limit the search area for finding correspondences. For a point (shown as a red dot) in image B, the corresponding region to search in image A is now a rectangle (shown in red) instead of a band around its epipolar line (shown in purple) in image A.
3.3.4 Phantom planes
This cue enforces occlusion constraints across multiple cameras. Concretely, each
small plane (superpixel) comes from an image taken by a specific camera. Therefore,
Figure 3.7: (a) Bad correspondences, caused by repeated structures in the world. (b) Use of monocular depth estimates results in better correspondences. Note that these correspondences are still fairly sparse and slightly noisy, and are therefore insufficient for creating a good 3D model if we do not additionally use monocular cues.
there must be an unoccluded view between the camera and the 3D position of that
small plane—i.e., the small plane must be visible from the camera location where
its picture was taken, and it is not plausible for any other small plane (one from a
different image) to have a 3D position that occludes this view. This cue is important
because often the connected structure terms, which informally try to “tie” points
in two small planes together, will result in models that are inconsistent with this
occlusion constraint, and result in what we call “phantom planes”—i.e., planes that
are not visible from the camera that photographed them. We penalize the distance
between the offending phantom plane and the plane that occludes its view from the
camera by finding additional correspondences. This tends to make the two planes lie
in exactly the same location (i.e., have the same plane parameter), which eliminates
the phantom/occlusion problem.
Figure 3.8: (a,b,c) Three original images from different viewpoints; (d,e,f) Snapshots of the 3D model predicted by our algorithm. (f) shows a top-down view; the top part of the figure shows portions of the ground correctly modeled as lying either within or beyond the arch.
Figure 3.9: (a,b) Two original images with only a little overlap, taken from the same camera location. (c,d) Snapshots from our inferred 3D model.
Figure 3.10: (a,b) Two original images with many repeated structures; (c,d) Snapshots of the 3D model predicted by our algorithm.
Figure 3.11: (a,b,c,d) Four original images; (e,f) Two snapshots shown from a larger 3D model created using our algorithm.
3.3.5 Experiments
In this experiment, we create a photo-realistic 3D model of a scene given only a few
images (with unknown location/pose), even ones taken from very different viewpoints
or with little overlap. Fig. 3.8, 3.9, 3.10 and 3.11 show snapshots of some 3D models
created by our algorithm. Using monocular cues, our algorithm is able to create
full 3D models even when large portions of the images have no overlap (Fig. 3.8, 3.9
and 3.10). In Fig. 3.8, monocular predictions (not shown) from a single image gave
approximate 3D models that failed to capture the arch structure in the images. However,
using both monocular and triangulation cues, we were able to capture this 3D
arch structure. The models are available at:
http://make3d.stanford.edu/research
3.4 Discussion
We have presented a learning algorithm based on MRFs that captures monocular cues
and incorporates them into a stereo system so as to obtain significantly improved
depth estimates. The monocular cues and the (purely geometric) stereo cues provide
largely orthogonal, and therefore complementary, types of information about depth. We
show that by using both monocular and stereo cues, we obtain significantly more
accurate depth estimates than is possible using either monocular or stereo cues alone.

We then presented an algorithm that builds large 3D reconstructions of outdoor
environments, given only a small number of images. Our algorithm segments each
of the images into a number of small planes, and simultaneously infers the 3D position
and orientation of each plane in every image. Using an MRF, it reasons about
both monocular depth cues and triangulation cues, while further taking into account
various properties of the real world, such as occlusion, co-planarity, and others.
Chapter 4
Obstacle Avoidance using Monocular Vision
4.1 Introduction
Consider the problem of a small robot trying to navigate through a cluttered environment,
for example, a small electric car through a forest (see Figure 4.1). In this
chapter, we consider the problem of obstacle avoidance for a remote control car in
unstructured outdoor environments.
We present an approach in which supervised learning is first used to estimate
depths from a single monocular image. For the purposes of driving a car, we do not
need to estimate depths at each point in the image (as we did in chapter 2). It is
sufficient to be able to estimate a single depth value for each steering direction (i.e.,
distance to the obstacle in each steering direction). Our learning algorithm is trained
using images labeled with ground-truth distances to the closest obstacles. Since we
are estimating depth only in each steering direction (a 1D problem) instead of each
point in the image (a 2D problem), the learning model is significantly simplified. The
resulting algorithm is able to learn monocular vision cues that accurately estimate
the relative depths of obstacles in a scene. Reinforcement learning/policy search is
then applied within a simulator that renders synthetic scenes. This learns a control
policy that selects a steering direction as a function of the vision system’s output.
We present results evaluating the predictive ability of the algorithm both on held-out
test data and in actual autonomous driving experiments.
Figure 4.1: (a) The remote-controlled car driven autonomously in various cluttered unconstrained environments, using our algorithm. (b) A view from the car, with the chosen steering direction indicated by the red square; the estimated distances to obstacles in the different directions are shown by the bar graph below the image.
Most work on obstacle detection using vision has focused on binocular vision/stereopsis.
In this chapter, we present a monocular vision obstacle detection algorithm based on
supervised learning. Our motivation for this is two-fold: First, we believe that single-image
monocular cues have not been effectively used in most obstacle detection systems;
thus we consider it an important open problem to develop methods for obstacle
detection that exploit these monocular cues. Second, the dynamics and inertia of
high-speed driving (5 m/s on a small remote control car) mean that obstacles must
be perceived at a distance if we are to avoid them, and the distance at which standard
binocular vision algorithms can perceive them is fundamentally limited by the
"baseline" distance between the two cameras and the noise in the observations [24],
making such methods difficult to apply to our problem.1
1While it is probably possible, given sufficient effort, to build a stereo system for this task, the one commercial system we evaluated (because of vibrations and motion blur) was unable to reliably detect obstacles even at 1m range.
We collected a dataset of several thousand images, each correlated with a laser
range scan that gives the distance to the nearest obstacle in each direction. After
training on this dataset (using the laser range scans as the ground-truth target labels),
a supervised learning algorithm is then able to accurately estimate the distances to
the nearest obstacles in the scene. This becomes our basic vision system, the output
of which is fed into a higher level controller trained using reinforcement learning.
An additional motivation for our work stems from the observation that the majority
of successful robotic applications of RL (including autonomous helicopter flight, some
quadruped/biped walking, snake robot locomotion, etc.) have relied on model-based
RL [59], in which an accurate model or simulator of the MDP is first built.2 In control
tasks in which the perception problem is non-trivial—such as when the input sensor
is a camera—the quality of the simulator we can build for the sensor will be limited by
the quality of the computer graphics we can implement. We were interested in asking:
Does model-based reinforcement learning still make sense in these settings?
Since many of the cues that are useful for estimating depth can be re-created
in synthetic images, we implemented a graphical driving simulator. We ran experiments
in which the real images and laser scan data were replaced with synthetic
images. Our learning algorithm can be trained either on real camera images labeled
with ground-truth distances to the closest obstacles, or on a training set consisting of
synthetic graphics images. This is interesting in scenarios where real data for training
might not be readily available. We also used the graphical simulator to train a
reinforcement learning algorithm, and ran experiments in which we systematically
varied the level of graphical realism. We show that, surprisingly, even using low- to
medium-quality synthetic images, it is often possible to learn depth estimates that
give reasonable results on real camera test images. We also show that, by combining
information learned from both the synthetic and the real datasets, the resulting
vision system performs better than one trained on either of the two alone. Similarly,
the reinforcement learning controller trained in the graphical simulator also performs
well in real-world autonomous driving. Videos showing the system's performance
in real-world driving experiments (using the vision system built with supervised
2Some exceptions to this include [63, 66].
learning, and the controller learned using reinforcement learning) are available at
http://ai.stanford.edu/~asaxena/rccar/
4.2 Related work
The most commonly studied approach for this problem is stereo vision [155]. Depth-from-motion,
or optical flow, is based on motion parallax [3]. Both methods require finding
correspondences between points in multiple images separated over space (stereo) or
time (optical flow). Assuming that accurate correspondences can be established,
both methods can generate very accurate depth estimates. However, the process of
searching for image correspondences is computationally expensive and error-prone,
which can dramatically degrade the algorithm's overall performance.
Gini [36] used single-camera vision to drive an indoor robot, but relied heavily
on the known color and texture of the ground; hence the method does not generalize well
and will fail in unstructured outdoor environments. In [118], monocular vision and
apprenticeship learning (also called imitation learning) were used to drive a car, but
only on highly structured roads and highways with clear lane markings, in which the
perception problem is much easier. [72] also successfully applied imitation learning
to driving in richer environments, but relied on stereo vision.
4.3 Vision system
For the purposes of driving a car, we do not need to estimate depths at each point in
the image. It is sufficient to be able to estimate a single depth value for each steering
direction (i.e., distance to the obstacle in each steering direction).
Therefore, we formulate the vision problem as one of depth estimation over vertical
stripes of the image. The output of this system will be fed as input to a higher level
control algorithm.
In detail, we divide each image into vertical stripes. These stripes can informally
be thought of as corresponding to different steering directions. In order to learn to
estimate distances to obstacles in outdoor scenes, we collected a dataset of several
Figure 4.2: Laser range scans overlaid on the image. Laser scans give one range estimate per degree.
thousand outdoor images (Fig. 4.2). Each image is correlated with a laser range
scan (using a SICK laser range finder, along with a mounted webcam of resolution
352x288; Fig. 4.3) that gives the distance to the nearest obstacle in each stripe of the
image. We create a spatially arranged vector of local features capturing monocular
depth cues. To capture more of the global context, the feature vector of each stripe
is augmented with the features of the stripes to its left and right. We use linear
regression on these features to learn to predict the relative distance to the nearest
obstacle in each of the vertical stripes.
Since many of the cues that we used to estimate depth can be re-created in synthetic
images, we also developed a custom driving simulator with a variable level
of graphical realism. By repeating the above learning method on a second dataset
consisting of synthetic images, we learn depth estimates from synthetic images alone.
Additionally, we combine information learned from the synthetic dataset with that
learned from real images, to produce a vision system that performs better than either
does individually.
Figure 4.3: Rig for collecting correlated laser range scans and real camera images.
4.3.1 Feature vector
Each image is divided into 16 stripes, each one labeled with the distance to the
nearest obstacle. In order to emphasize multiplicative rather than additive errors,
we converted each distance to a log scale. Early experiments training with linear
distances gave poor results (details omitted due to space constraints).
Each stripe is divided into 11 vertically overlapping windows (Fig. 4.4). We denote
each window as W_sr, for s = 1, ..., 16 stripes and r = 1, ..., 11 windows per stripe. For each
window, coefficients representing texture energies and texture gradients are calculated
as described below. Ultimately, the feature vector for a stripe consists of the coefficients
for that stripe, the stripe to its left, and the stripe to its right. Thus, the spatial
arrangement of the windows in the feature vector allows some measure of global
structure to be encoded in it.
Figure 4.4: For each overlapping window W_sr, statistical coefficients (Laws' texture energy, Harris angle distribution, Radon) are calculated. The feature vector for a stripe consists of the coefficients for itself, its left column and its right column.
Computation speed is the limiting criterion when designing features for this problem—
the vision system has to operate at more than about 10 fps in order for the car to
successfully avoid obstacles. Therefore, we use only the first nine of the filters in
Fig. 2.4. We apply these 9 masks to the intensity channel image I_Y, and for the
color channels C_b and C_r, we calculate the local averaging mask M_1 only; this gives
a total of 11 texture energy coefficients for each window. Lastly, we estimate the
texture energy E_n for the rth window in the sth stripe (denoted W_sr) as
$$ E_n(s, r) = \sum_{(x, y) \in W_{sr}} \left| F_n(x, y) \right| \qquad (4.1) $$
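To make Eq. 4.1 concrete, the coefficients for one window might be computed as below. This is a minimal sketch, assuming the nine filters of Fig. 2.4 are the 3x3 Laws masks formed from the standard L3/E3/S3 vectors; the function names and the zero-padded filtering are our own choices.

```python
import numpy as np

def laws_masks():
    """The nine 3x3 Laws masks, outer products of the 1D L3/E3/S3 vectors."""
    L3 = np.array([1.0, 2.0, 1.0])    # local averaging
    E3 = np.array([-1.0, 0.0, 1.0])   # edge detection
    S3 = np.array([-1.0, 2.0, -1.0])  # spot detection
    return [np.outer(a, b) for a in (L3, E3, S3) for b in (L3, E3, S3)]

def filter2d(img, mask):
    """'Same'-size 2D correlation with zero padding (3x3 masks only)."""
    padded = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += mask[dy, dx] * padded[dy:dy + img.shape[0],
                                         dx:dx + img.shape[1]]
    return out

def texture_energies(Y, Cb, Cr, window):
    """The 11 texture-energy coefficients of one window W_sr (Eq. 4.1):
    E_n = sum over (x,y) in W_sr of |F_n(x,y)|, where F_n is the image
    filtered with the nth mask. Nine coefficients come from the
    intensity channel Y; the color channels Cb, Cr use only the
    averaging mask M1."""
    masks = laws_masks()
    feats = [np.abs(filter2d(Y, m))[window].sum() for m in masks]
    M1 = masks[0]  # L3^T L3, the local averaging mask
    feats += [np.abs(filter2d(C, M1))[window].sum() for C in (Cb, Cr)]
    return np.array(feats)
```

In practice one would filter each image once and then sum over all windows, rather than re-filtering per window as in this sketch.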
In order to calculate a texture gradient that is robust to noise in the image and that
can capture a greater variety of textures, we use a variant of the Radon transform3 and
a variant of the Harris corner detector.4 These features measure how the directions
of edges in each window are distributed.
4.3.2 Training
Using the labeled data and the feature vector for each column as described above, we
trained linear models to estimate the log distance to the nearest obstacle in a stripe
of the image. In the simplest case, standard linear regression was used in order to
3The Radon transform is a standard method for estimating the density of edges at various orientations [24]. We found the distribution of edges for each window W_sr using a variant of this method. The Radon transform maps an image I(x, y) into a new (θ, ρ) coordinate system, where θ corresponds to possible edge orientations. Thus the image is now represented as I_radon(θ, ρ) instead of I(x, y). For each of 15 discretized values of θ, we pick the highest two values of I_radon(θ, ρ) by varying ρ, i.e., R(θ) = toptwo_ρ(I_radon(θ, ρ)).
4A second approach to finding distributions of directional edges over windows in the image is to use a corner detector [24]. More formally, given an image patch p (in our case 5x5 pixels), the Harris corner detector first computes the 2x2 matrix M of intensity gradient covariances, and then compares the relative magnitudes of the matrix's two eigenvalues [24]. We use a slightly different variant of this algorithm, in which rather than thresholding on the smaller eigenvalue, we first calculate the angle represented by each eigenvector and put the eigenvalues into bins for each of 15 angles. We then sum up all of the eigenvalues in each bin over a window and use the sums as the input features to the learning algorithm.
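The eigenvalue-binning variant of footnote 4 might be sketched as follows; the 5x5 patches and 15 angle bins are from the text, while the non-overlapping patch tiling and everything else in the code are illustrative assumptions.

```python
import numpy as np

def harris_angle_features(window, n_bins=15, patch=5):
    """Variant of the Harris detector described above: for every 5x5
    patch in the window, build the 2x2 matrix M of intensity-gradient
    covariances, then accumulate each eigenvalue into one of 15 bins
    indexed by the angle of its eigenvector. The per-bin sums are the
    feature values."""
    gy, gx = np.gradient(window.astype(float))
    bins = np.zeros(n_bins)
    h, w = window.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            px = gx[y:y + patch, x:x + patch].ravel()
            py = gy[y:y + patch, x:x + patch].ravel()
            M = np.array([[px @ px, px @ py],
                          [px @ py, py @ py]])
            vals, vecs = np.linalg.eigh(M)
            for val, vec in zip(vals, vecs.T):
                angle = np.arctan2(vec[1], vec[0]) % np.pi  # orientation in [0, pi)
                bins[int(angle / np.pi * n_bins) % n_bins] += val
    return bins
```

Since M is positive semi-definite, both eigenvalues are non-negative, so each bin is a non-negative "edge strength" at that orientation.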
find weights w for N total images and S stripes per image:

$$ w = \arg\min_{w} \sum_{i=1}^{N} \sum_{s=1}^{S} \left( w^T x_{is} - \ln(d_i(s)) \right)^2 \qquad (4.2) $$

where x_is is the feature vector for stripe s of the ith image, and d_i(s) is the distance
to the nearest obstacle in that stripe.
Additionally, we experimented with outlier-rejection methods such as robust regression
(iteratively re-weighted using the "fair response" function; Huber, 1981). For
this task, this method did not provide significant improvements over linear regression,
and simple minimization of the sum of squared errors produced nearly identical
results to the more complex methods. All of the results below are given for linear
regression.
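The minimization in Eq. 4.2 is an ordinary least-squares problem in w, so a direct sketch is as follows (X stacking the per-stripe feature vectors x_is row-wise is our assumption about data layout):

```python
import numpy as np

def fit_depth_weights(X, d):
    """Solve Eq. 4.2: w = argmin_w sum_{i,s} (w^T x_is - ln d_i(s))^2.
    X: (N*S, k) matrix of stacked stripe feature vectors x_is;
    d: (N*S,) vector of ground-truth distances to the nearest obstacle."""
    w, *_ = np.linalg.lstsq(X, np.log(d), rcond=None)
    return w

def predict_log_distances(X, w):
    """Predicted log-distance for each stripe's feature vector."""
    return X @ w
```

Fitting in log-distance directly implements the multiplicative-error emphasis discussed in Section 4.3.1.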
4.3.3 Synthetic graphics data
We created graphics datasets of synthetic images of typical scenes that we expect
to see while driving the car. Graphics data with various levels of detail were created,
ranging from simple two-color trees of uniform height without texture, to
complex scenes with five different kinds of trees of varying heights, along with texture,
haze and shadows (Fig. 4.5). The scenes were created by placing trees randomly
throughout the environment (by sampling tree locations from a uniform distribution).
The width and height of each tree were again chosen uniformly at random between a
minimum and a maximum value.
There are two reasons why it is useful to supplement real images with synthetic
ones. First, a very large amount of synthetic data can be inexpensively created,
spanning a variety of environments and scenarios. Comparably large and diverse
amounts of real-world data would have been much harder to collect. Second, in
the synthetic data, there is no noise in the labels of the ground-truth distances to
obstacles.
Figure 4.5: Graphics images in order of increasing level of detail. (In color, where available.) (a) Uniform trees, (b) Different types of trees of uniform size, (c) Trees of varying size and type, (d) Increased density of trees, (e) Texture on trees only, (f) Texture on ground only, (g) Texture on both, (h) Texture and shadows.
4.3.4 Error metrics
Since the ultimate goal of these experiments is to be able to drive a remote control
car autonomously, the estimation of the distance to the nearest obstacle is really only a proxy
for the true goal of choosing the best steering direction. Successful distance estimation
allows a higher-level controller to navigate in a dense forest of trees by simply steering
in the direction that is the farthest away. The real error metric to optimize in this
case would be the mean time to crash. However, since the vehicle will be driving on
unstructured terrain, experiments in this domain are not easily repeatable.
In the ith image, let α be a possible steering direction (with each direction corresponding
to one of the vertical stripes in the image), let α_chosen be the steering direction
chosen by the vision system (chosen by picking the direction corresponding to the
farthest predicted distance), let d_i(α) be the actual distance to the obstacle in direction
α, and let d̂_i(α) be the distance predicted by the learning algorithm. We use the
following error metrics:
Depth. The mean error in the log-distance estimates of the stripes is defined as

$$ E_{depth} = \frac{1}{N}\frac{1}{S} \sum_{i=1}^{N} \sum_{s=1}^{S} \left| \ln(d_i(s)) - \ln(\hat{d}_i(s)) \right| \qquad (4.3) $$
Relative Depth. To calculate the relative depth error, we remove the mean from the
true and estimated log-distances for each image. This allows the depth estimates a
free overall scaling, reducing the penalty for errors in estimating the scale of the
scene as a whole.
Chosen Direction Error. When several steering directions α have nearly identical
d_i(α), it is unreasonable to expect the learning algorithm to reliably pick out the
single best direction. Instead, we might wish only to ensure that d_i(α_chosen) is nearly
as good as the best possible direction. To measure the degree to which this holds, we
define the error metric
$$ E_{\alpha} = \frac{1}{N} \sum_{i=1}^{N} \left| \ln\left(\max_{s} d_i(s)\right) - \ln(d_i(\alpha_{chosen})) \right| \qquad (4.4) $$
This gives us the difference between the true distance in the chosen direction and the
true distance in the actual best direction.
Hazard Rate. Because the car is driving at a high speed (about 5 m/s), it will crash
into an obstacle when the vision system chooses a column containing obstacles less
than about 5 m away. Letting d_Hazard = 5m denote the distance at which an obstacle
becomes a hazard (and writing 1{·} to denote the indicator function), we define the
hazard rate as
$$ \mathrm{HazardRate} = \frac{1}{N} \sum_{i=1}^{N} 1\{ d_i(\alpha_{chosen}) < d_{Hazard} \} \qquad (4.5) $$
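Assuming the true and predicted distances are stored as (N, S) arrays, the four metrics above can be computed in a few lines (a sketch, not the original implementation):

```python
import numpy as np

def error_metrics(d_true, d_pred, d_hazard=5.0):
    """Eqs. 4.3-4.5 computed over a dataset. d_true, d_pred: (N, S)
    arrays of true and predicted distances per image and stripe.
    Returns (E_depth, E_rel, E_alpha, hazard_rate)."""
    lt, lp = np.log(d_true), np.log(d_pred)
    e_depth = np.abs(lt - lp).mean()                       # Eq. 4.3
    # Relative depth: subtract each image's mean log-distance first
    e_rel = np.abs((lt - lt.mean(1, keepdims=True))
                   - (lp - lp.mean(1, keepdims=True))).mean()
    chosen = lp.argmax(1)              # steer toward the farthest prediction
    d_chosen = d_true[np.arange(d_true.shape[0]), chosen]
    e_alpha = np.abs(np.log(d_true.max(1)) - np.log(d_chosen)).mean()  # Eq. 4.4
    hazard_rate = (d_chosen < d_hazard).mean()             # Eq. 4.5
    return e_depth, e_rel, e_alpha, hazard_rate
```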
4.3.5 Combined vision system
We combined the system trained on synthetic data with the one trained on real images
in order to reduce the hazard-rate error. The system trained on synthetic images alone
had a hazard rate of 11.0%, while the best-performing real-image system had a hazard
rate of 2.69%. Training on a combined dataset of real and synthetic images did not
produce any improvement over the real-image system, even though the two separately
trained algorithms make the same mistake only 1.01% of the time. Thus, we chose
a simple voting heuristic to improve the accuracy of the system. Specifically, we ask
each system to output its top two steering directions, and a vote is taken among these
four outputs to select the best steering direction (breaking ties by defaulting to the
algorithm trained on real images). This results in a reduction of the hazard rate to
2.04% (Fig. 4.7).5
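One way to read this voting heuristic is the following sketch; the exact tie-breaking rule beyond "default to the real-image system" is our assumption:

```python
import numpy as np

def vote_direction(real_pred, synth_pred):
    """Combine the real-image and synthetic-image systems: each nominates
    its two farthest-predicted stripes, and a four-way vote picks the
    steering direction, with ties broken in favor of the real-image
    system's choices."""
    top2 = lambda p: list(np.argsort(p)[-2:][::-1])   # best stripe first
    real2 = top2(np.asarray(real_pred))
    synth2 = top2(np.asarray(synth_pred))
    votes = {}
    for s in real2 + synth2:
        votes[s] = votes.get(s, 0) + 1
    best = max(votes.values())
    for s in real2:            # tie-break: defer to the real-image system
        if votes[s] == best:
            return s
    return max(votes, key=votes.get)  # safety fallback (normally unreachable)
```

A stripe nominated by both systems gets two votes and wins outright; if all four nominations differ, the real-image system's first choice is used.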
4.4 Control and reinforcement learning
In order to convert the output of the vision system into actual steering commands for
the remote control car, a control policy must be developed. The policy controls how
aggressively to steer, what to do if the vision system predicts a very near distance for
all of the possible steering directions in the camera view, when to slow down, etc. We
model the RC car control problem as a Markov decision process (MDP) [169]. We
then used the Pegasus policy search algorithm [101] (with locally greedy search) to
learn the parameters of our control policy. Here is a description of the six learned
parameters6 of our control policy:
• θ1: σ of the Gaussian used for spatial smoothing of the predicted distances.
• θ2: if di(αchosen) < θ2, take evasive action rather than steering towards αchosen.
• θ3: the maximum change in steering angle at any given time step.
• θ4, θ5: parameters used to choose which direction to turn if no location in the
image is a good steering direction (using the current steering direction and the
predicted distances of the left-most and right-most stripes of the image).
• θ6: the percent of max throttle to use during an evasive turn.
5The other error metrics are not directly applicable to this combined output.
6Since the simulator did not accurately represent the complex ground contact forces that act on the car when driving outdoors, two additional parameters were set manually: the maximum throttle, and a turning scale parameter.
The reward function was given as

$$ R(s) = -\left| v_{desired} - v_{actual} \right| - K \cdot \mathrm{Crashed} \qquad (4.6) $$
where v_desired and v_actual are the desired and actual speeds of the car, and Crashed is a
binary variable indicating whether or not the car crashed in that time step. In our
experiments, we used K = 1000. Thus, the vehicle attempts to maintain the desired
forward speed while minimizing contact with obstacles.
Each trial began with the car initialized at (0,0) in a randomly generated environment,
and ran for a fixed time horizon. The learning algorithm converged after 1674
iterations of policy search.
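A minimal sketch of this training loop is below. It is our own illustration of locally greedy search combined with the Pegasus idea of evaluating every candidate policy on the same fixed simulation scenarios; the step size, scenario count, and `utility` interface are all assumptions, not the thesis implementation.

```python
import numpy as np

def locally_greedy_search(utility, theta0, step=0.1, iters=2000, seed=0):
    """Locally greedy policy search. `utility(theta, scenario)` runs one
    simulated trial from the fixed random seed `scenario`; reusing the
    same scenarios for every candidate makes the Monte Carlo utility
    estimate deterministic in theta (the Pegasus idea), so candidate
    policies are directly comparable."""
    scenarios = range(10)                      # fixed simulated scenarios
    U = lambda th: np.mean([utility(th, s) for s in scenarios])
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    best = U(theta)
    for _ in range(iters):
        j = rng.integers(len(theta))           # perturb one parameter at a time
        for delta in (step, -step):
            cand = theta.copy()
            cand[j] += delta
            u = U(cand)
            if u > best:                       # keep the change only if it helps
                theta, best = cand, u
                break
    return theta, best
```

With the reward of Eq. 4.6 as the per-trial utility, this climbs to a local optimum of the six policy parameters.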
Figure 4.6: Experimental setup showing the car and the instruments to control it remotely.
4.5 Experiments
4.5.1 Experimental setup
Our test vehicle is based on an off-the-shelf remote control car (the Traxxas Stampede,
measuring 12"x16"x9" with a wheel clearance of about 2"). We mounted on it a DragonFly
spy camera (from PointGrey Research, 320x240 pixel resolution, 4mm focal
length). The camera transmits images at up to 20 frames per second to a receiver
attached to a laptop running a 1.6GHz Pentium. Steering and throttle commands are
sent back to the RC transmitter from the laptop via a custom-built serial interface.
Finally, the RC transmitter sends the commands directly to the car. The system ran
at about 10 frames per second (fps).
4.5.2 Results
Table 4.2 shows the test-set error rates obtained using different feature vectors when
training on real camera images. As a baseline for comparing the performance of
different features, the first row in the table shows the error obtained by using no
features at all (i.e., always predicting the mean distance and steering randomly). The results
indicate that texture energy and texture gradient carry complementary information,
with their combination giving better performance than either alone. Table 4.3 shows
the error rates for synthetic images at various degrees of detail. The performance of
the vision system trained on synthetic images alone first improves as we add texture
to the images and add variety by having different types and sizes of
trees. However, it drops when we add shadows and haze. We attribute this to the
fact that real shadows and haze are very different from those in the synthetic images,
and the learning algorithm significantly over-fits these images (showing virtually no
error at all on synthetic test images).

Figure 4.7 shows hazard rates for the vision system relying on graphics-trained
weights alone, on real-image-trained weights alone, and the improvement after combining
the two systems.
We drove the car in four different locations and under different weather conditions
Figure 4.7: Reduction in hazard rate by combining systems trained on synthetic and real data.
Table 4.1: Real driving tests under different terrains

    Obstacle Density | Obstacle Type                    | Ground                       | Mean Time (sec) | Max Time (sec) | Avg Speed (m/s)
  1 High             | Sculptures, Trees, Bushes, Rocks | Uneven ground, rocks, leaves | 19              | 24             | 4
  2 Low              | Trees, man-made benches          | Even, with leaves            | 40              | 80             | 6
  3 Medium           | Trees                            | Rough, with leaves           | 80              | >200           | 2
  4 Medium           | Uniform trees, benches           | Tiles                        | 40              | 70             | 5
(see Fig. 4.8 for a few snapshots). The learned controller parameters were directly
ported from the simulator to the real world (rescaled to fit the range of steering and
throttle commands allowed by the real vehicle). For each location, the mean time7
7Some of the test locations had nearby roads, and it was unsafe to allow the car to continue driving outside the bounds of our testing area. In these cases, the car was given a restart and the
Figure 4.8: An image sequence showing the car avoiding a small bush, a difficult obstacle to detect in the presence of trees that are visually larger yet more distant. (a) The first row shows the car driving to avoid a series of obstacles; (b) the second row shows the car's view. (The blue square is the largest-distance direction; the red square is the steering direction chosen by the controller.) (Best viewed in color)
before crash is given in Table 4.1. All the terrain types had man-made structures
(like cars and buildings) in the background. The unstructured testing sites were
limited to areas where no training or development images were taken. Videos of the
learned controller driving autonomously are available on the web at the URL given in
Section 4.1. Performance while driving in the simulator was comparable to the real
driving tests.
4.6 Discussion
We have used supervised learning to learn monocular depth estimates, and used them
to drive an RC car at high speeds through the environment. Further, the experiments
with the graphical simulator show that model-based RL holds great promise even in
settings involving complex environments and complex perception. In particular, a
vision system trained on computer graphics was able to give reasonable depth estimates
on real image data, and a control policy trained in a graphical simulator worked well
in real autonomous driving.
time was accumulated.
Table 4.2: Test error on 3124 images, obtained with different feature vectors after training on 1561 images. For the case of "None" (without any features), the system predicts the same distance for all stripes, so relative depth error has no meaning.
  Feature        | E_depth | RelDepth | E_α  | HazardRate
  None           | .900    | -        | 1.36 | 23.8%
  Y (intensity)  | .748    | .578     | .792 | 9.58%
  Laws only      | .648    | .527     | .630 | 2.82%
  Laws           | .640    | .520     | .594 | 2.75%
  Radon          | .785    | .617     | .830 | 6.47%
  Harris         | .687    | .553     | .713 | 4.55%
  Laws+Harris    | .612    | .508     | .566 | 2.69%
  Laws+Radon     | .626    | .519     | .581 | 2.88%
  Harris+Radon   | .672    | .549     | .649 | 3.20%
  Laws+Har+Rad   | .604    | .508     | .546 | 2.69%
Table 4.3: Test error on 1561 real images, after training on 667 graphics images with different levels of graphics realism, using Laws features.
  Train          | E_depth | RelDepth | E_α  | HazardRate
  None           | .900    | -        | 1.36 | 23.8%
  Laser Best     | .604    | .508     | .546 | 2.69%
  Graphics-8     | .925    | .702     | 1.23 | 15.6%
  Graphics-7     | .998    | .736     | 1.10 | 12.8%
  Graphics-6     | .944    | .714     | 1.04 | 14.7%
  Graphics-5     | .880    | .673     | .984 | 11.0%
  Graphics-4     | 1.85    | 1.33     | 1.63 | 34.4%
  Graphics-3     | .900    | .694     | 1.78 | 38.2%
  Graphics-2     | .927    | .731     | 1.72 | 36.4%
  Graphics-1     | 1.27    | 1.00     | 1.56 | 30.3%
  G-5 (L+Harris) | .929    | .713     | 1.11 | 14.5%
Chapter 5
Robotic Grasping: Identifying Grasp-point
5.1 Introduction
We consider the problem of grasping novel objects, specifically ones that are being
seen for the first time through vision. The ability to grasp a previously unknown
object is a challenging yet important problem for robots. For example, a household
robot could use this ability to perform tasks such as cleaning up a cluttered room
and unloading items from a dishwasher. In this chapter, we will present a learning
algorithm that enables a robot to grasp a wide variety of commonly found objects,
such as plates, tape rolls, jugs, cellphones, keys, screwdrivers, staplers, a thick coil of
wire, a strangely shaped power horn, and others.
Modern-day robots can be carefully hand-programmed or "scripted" to carry out
many complex manipulation tasks, ranging from using tools to assemble complex
machinery, to balancing a spinning top on the edge of a sword [162]. However, autonomously
grasping a previously unknown object still remains a challenging problem.
If we are trying to grasp a previously known object, or if we are able to obtain a full 3D
model of the object, then various approaches, such as ones based on friction cones [85],
form- and force-closure [7], pre-stored primitives [89], or other methods, can be applied.
However, in practical scenarios it is often very difficult to obtain a full and
Figure 5.1: Our robot unloading items from a dishwasher.
accurate 3D reconstruction of an object seen for the first time through vision. For
stereo systems, 3D reconstruction is difficult for objects without texture, and even
when stereopsis works well, it would typically reconstruct only the visible portions of
the object. Even if more specialized sensors such as laser scanners (or active stereo)
are used to estimate the object’s shape, we would still only have a 3D reconstruction
of the front face of the object.
In contrast to these approaches, we propose a learning algorithm that neither
requires, nor tries to build, a 3D model of the object. Instead it predicts, directly
as a function of the images, a point at which to grasp the object. Informally, the
Figure 5.2: Some examples of objects on which the grasping algorithm was tested.
algorithm takes two or more pictures of the object, and then tries to identify a point
within each 2D image that corresponds to a good point at which to grasp the object.
(For example, if trying to grasp a coffee mug, it might try to identify the mid-point
of the handle.) Given these 2D points in each image, we use triangulation to obtain
a 3D position at which to actually attempt the grasp. Thus, rather than trying to
triangulate every single point within each image in order to estimate depths—as in
dense stereo—we only attempt to triangulate one (or at most a small number of)
points corresponding to the 3D point where we will grasp the object. This allows
us to grasp an object without ever needing to obtain its full 3D shape, and applies
even to textureless, translucent or reflective objects on which standard stereo 3D
reconstruction fares poorly (see Figure 5.5).
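The triangulation step has a simple closed form once each predicted 2D grasp point is back-projected to a viewing ray: take the midpoint of the shortest segment between the two rays. A sketch follows (camera calibration, which supplies the camera centers and ray directions, is assumed):

```python
import numpy as np

def triangulate_grasp_point(c1, r1, c2, r2):
    """Triangulate one 3D grasp point from two camera rays. c1, c2 are
    camera centers; r1, r2 are ray directions through the predicted 2D
    grasp points. Solves for the ray parameters t1, t2 minimizing
    ||(c1 + t1 r1) - (c2 + t2 r2)|| and returns the segment midpoint."""
    r1 = np.asarray(r1, float) / np.linalg.norm(r1)
    r2 = np.asarray(r2, float) / np.linalg.norm(r2)
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    A = np.stack([r1, -r2], axis=1)        # 3x2 system [r1 -r2] t = c2 - c1
    t, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1 = c1 + t[0] * r1                    # closest point on ray 1
    p2 = c2 + t[1] * r2                    # closest point on ray 2
    return 0.5 * (p1 + p2)
```

Because only one point (or a handful of points) is triangulated, this step is cheap and does not require the dense correspondences that full stereo reconstruction would.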
To the best of our knowledge, our work represents the first algorithm capable of
grasping novel objects (ones where a 3D model is not available), including ones from
novel object classes, that we are perceiving for the first time using vision.
This chapter focuses on the task of grasp identification, and thus we will consider
only objects that can be picked up without performing complex manipulation.1 We
1For example, picking up a heavy book lying flat on a table might require a sequence of complex manipulations, such as first sliding the book slightly past the edge of the table so that the manipulator can place its fingers around the book.
will attempt to grasp a number of common office and household objects such as
toothbrushes, pens, books, cellphones, mugs, martini glasses, jugs, keys, knife-cutters,
duct-tape rolls, screwdrivers, staplers and markers (see Figure 5.2). We will also
address the problem of unloading items from dishwashers.
The remainder of this chapter is structured as follows. In Section 5.2, we describe
related work. In Section 5.3, we describe our learning approach, as well as our probabilistic
model for inferring the grasping point. In Section 5.5, we describe our robotic
manipulation platforms. In Section 5.6, we describe the motion planning/trajectory
planning for moving the manipulator to the grasping point. In Section 5.7, we report
the results of extensive experiments performed to evaluate our algorithm, and
Section 5.8 concludes.
5.2 Related work
Most work in robot manipulation assumes availability of a complete 2D or 3D model
of the object, and focuses on designing control and planning methods to achieve
a successful and stable grasp. Here, we will discuss in detail prior work that uses
learning or vision for robotic manipulation, and refer the reader to [7, 85, 161] for a
more general survey of past work in robotic manipulation.
In simulation environments (without real world experiments), learning has been
applied to robotic manipulation for several different purposes. For example, [107]
used Support Vector Machines (SVM) to estimate the quality of a grasp given a
number of features describing the grasp and the object. [48, 49] used partially ob-
servable Markov decision processes (POMDP) to choose optimal control policies for
two-fingered hands. They also used imitation learning to teach a robot whole-body
grasps. [89] used heuristic rules to generate and evaluate grasps for three-fingered
hands by assuming that the objects are made of basic shapes such as spheres, boxes,
cones, and cylinders, each with pre-computed grasp primitives. All of these methods
assumed full knowledge of the 3D model of the object. Further, these methods were
not tested through real-world experiments, but were instead modeled and evaluated
in a simulator.
102 CHAPTER 5. ROBOTIC GRASPING: IDENTIFYING GRASP-POINT
If a 3D model of the object is given, then one can use techniques such as form-
and force-closure to compute stable grasps, i.e., the grasps with complete restraint.
A grasp with complete restraint prevents loss of contact and thus is very secure. A
form-closure of a solid object is a finite set of wrenches (force-moment combinations)
applied on the object, with the property that any other wrench acting on the object
can be balanced by a positive combination of the original ones [85]. Intuitively, form
closure is a way of defining the notion of a “firm grip” of the object [82]. It guarantees
maintenance of contact as long as the links of the hand and the object are well
approximated as rigid and as long as the joint actuators are sufficiently strong [123].
[75] proposed an efficient algorithm for computing all n-finger form-closure grasps on a
polygonal object based on a set of sufficient and necessary conditions for form-closure.
The primary difference between form-closure and force-closure grasps is that force-
closure relies on contact friction. This translates into requiring fewer contacts to
achieve force-closure than form-closure [73, 47, 90]. [131] demonstrated that a neces-
sary and sufficient condition for force-closure is that the primitive contact wrenches
resulting from the contact forces positively span the entire wrench space. [102] considered
force-closure for planar grasps using friction points of contact. Ponce et al. presented
several algorithms for computing force-closure grasps. In [119], they presented al-
gorithms for computing force-closure grasps for two fingers for curved 2D objects.
In [120], they considered grasping 3D polyhedral objects with a hand equipped with
four hard fingers and assumed point contact with friction. They proved necessary
and sufficient conditions for equilibrium and force-closure, and presented a geometric
characterization of all possible types of four-finger equilibrium grasps. [6] investi-
gated the form and force closure properties of the grasp, and proved the equivalence
of force-closure analysis with the study of the equilibria of an ordinary differential
equation. All of these methods focused on inferring the configuration of the hand
that results in a stable grasp. However, they assumed that the object’s full 3D model
is available in advance.
Some work has been done on using vision for real-world grasping experiments;
however, most of it was limited to grasping 2D planar objects. For uniformly colored
planar objects lying on a uniformly colored table top, one can find the 2D contour
(a) Martini glass (b) Mug (c) Eraser (d) Book (e) Pencil
Figure 5.3: The images (top row) with the corresponding labels (shown in red in the bottom row) of the five object classes used for training. The classes of objects used for training were martini glasses, mugs, whiteboard erasers, books and pencils.
of the object quite reliably. Using local visual features (based on the 2D contour)
and other properties such as form- and force-closure, the methods discussed below
decide the 2D locations at which to place (two or more) fingertips to grasp the object.
[113, 19] estimated 2D hand orientation using K-means clustering for simple objects
(specifically, square, triangle and round “blocks”). [94, 95] calculated 2D positions of
three-fingered grasps from 2D object contours based on feasibility and force closure
criteria. [12] also considered the grasping of planar objects and chose the location of
the three fingers of a hand by first classifying the object as circle, triangle, square or
rectangle from a few visual features, and then using pre-scripted rules based on fuzzy
logic. [56] used Q-learning to control the arm to reach towards a spherical object to
grasp it using a parallel plate gripper.
If the desired location of the grasp has been identified, techniques such as visual
servoing that align the gripper to the desired location [69] or haptic feedback [109] can
be used to pick up the object. [114] learned to sequence together manipulation gaits
for four specific, known 3D objects. However, they considered fairly simple scenes,
and used online learning to associate a controller with the height and width of the
bounding ellipsoid containing the object. For grasping known objects, one can also
use Learning-by-Demonstration [54], in which a human operator demonstrates how
to grasp an object, and the robot learns to grasp that object by observing the human
hand through vision.
The task of identifying where to grasp an object (of the sort typically found in the
home or office) involves solving a difficult perception problem. This is because the
objects vary widely in appearance, and because background clutter (e.g., dishwasher
prongs or a table top with a pattern) makes it even more difficult to understand the
shape of a scene. There are numerous robust learning algorithms that can infer useful
information about objects, even from a cluttered image. For example, there is a large
amount of work on recognition of known object classes (such as cups, mugs, etc.),
e.g., [157]. The performance of these object recognition algorithms could probably be
improved if a 3D model of the object were available, but they typically do not require
such models. For example, that an object is cup-shaped can often be inferred directly
from a 2D image. Our approach takes a similar direction, and will attempt to infer
grasps directly from 2D images, even ones containing clutter. [134, 149] also showed
that given just a single image, it is often possible to obtain the 3D structure of a
scene. While knowing the 3D structure by no means implies knowing good grasps,
this nonetheless suggests that most of the information in the 3D structure may already
be contained in the 2D images, and suggests that an approach that learns directly
from 2D images holds promise. Indeed, [83] showed that humans can grasp an object
using only one eye.
Our work also takes inspiration from [16], which showed that cognitive cues and
previously learned knowledge both play major roles in visually guided grasping in
humans and in monkeys. This indicates that learning from previous knowledge is an
important component of grasping novel objects. Further, [38] showed that there is a
dissociation between recognizing objects and grasping them, i.e., there are separate
neural pathways that recognize objects and that direct spatial control to reach and
grasp the object. Thus, given only a quick glance at almost any rigid object, most
primates can quickly choose a grasp to pick it up, even without knowledge of the
object type. Our work represents perhaps a first step towards designing a vision
grasping algorithm which can do the same.
5.3 Learning the grasping point
We consider the general case of grasping objects—even ones not seen before—in 3D
cluttered environments such as in a home or office. To address this task, we will use
an image of the object to identify a location at which to grasp it.
Because even very different objects can have similar sub-parts, there are certain
visual features that indicate good grasps, and that remain consistent across many
different objects. For example, jugs, cups, and coffee mugs all have handles; and
pens, white-board markers, toothbrushes, screw-drivers, etc. are all long objects
that can be grasped roughly at their mid-point (Figure 5.3). We propose a learning
approach that uses visual features to predict good grasping points across a large range
of objects.
In our approach, we will first predict the 2D location of the grasp in each image;
more formally, we will try to identify the projection of a good grasping point onto
the image plane. Then, given two (or more) images of an object taken from different
camera positions, we will predict the 3D position of a grasping point. If each of these
points can be perfectly identified in each image, then we can easily “triangulate” from
these images to obtain the 3D grasping point. (See Figure 5.7a.) In practice it is
difficult to identify the projection of a grasping point into the image plane (and, if
there are multiple grasping points, then the correspondence problem—i.e., deciding
which grasping point in one image corresponds to which point in another image—
must also be solved). This problem is further exacerbated by imperfect calibration
between the camera and the robot arm, and by uncertainty in the camera position if
the camera was mounted on the arm itself. To address all of these issues, we develop a
probabilistic model over possible grasping points, and apply it to infer a good position
at which to grasp an object.
5.3.1 Grasping point
For most objects, there is typically a small region that a human (using a two-fingered
pinch grasp) would choose to grasp it; with some abuse of terminology, we will infor-
mally refer to this region as the “grasping point,” and our training set will contain
labeled examples of this region. Examples of grasping points include the center region
of the neck for a martini glass, the center region of the handle for a coffee mug, etc.
(See Figure 5.3.)
For testing purposes, we would like to evaluate whether the robot successfully
picks up an object. For each object in our test set, we define the successful grasp
region to be the region where a human/robot using a two-fingered pinch grasp would
(reasonably) be expected to successfully grasp the object. The error in predicting the
grasping point (reported in Table 5.1) is then defined as the distance of the predicted
point from the closest point lying in this region. (See Figure 5.4; this region is
usually somewhat larger than that used in the training set, defined in the previous
paragraph.) Since our gripper (Figure 5.1) has some passive compliance because of
attached foam/rubber, and can thus tolerate about 0.5cm error in positioning, the
successful grasping region may extend slightly past the surface of the object. (E.g.,
the radius of the cylinder in Figure 5.4 is about 0.5cm greater than the actual neck
of the martini glass.)
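The error metric of this paragraph (the d1 and d2 of Figure 5.4) can be stated in a few lines. In this sketch (the function name and the point-sampled representation of the region are our assumptions), the successful-grasp region is approximated by a dense set of 3D points, and the error is the distance to the nearest of them:

```python
import numpy as np

def grasp_error(predicted, region_points):
    """Error in the sense of Table 5.1: distance from a predicted 3D grasp
    point to the closest point of the labeled successful-grasp region, here
    approximated by a dense sampling `region_points` of shape (n, 3)."""
    return float(np.min(np.linalg.norm(region_points - predicted, axis=1)))
```

A point that falls inside the (densely sampled) region gets an error near zero; points outside it are penalized by their distance to the region boundary.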
5.3.2 Synthetic data for training
We apply supervised learning to identify patches that contain grasping points. To
do so, we require a labeled training set, i.e., a set of images of objects labeled with
the 2D location of the grasping point in each image. Collecting real-world data of
this sort is cumbersome, and manual labeling is prone to errors. Thus, we instead
chose to generate, and learn from, synthetic data that is automatically labeled with
the correct grasps. (We also used synthetic data to learn distances in natural scenes
for our obstacle avoidance system; see Section 4.3.3.)
In detail, we generate synthetic images along with correct grasps (Figure 5.3)
Figure 5.4: An illustration showing the grasp labels. The labeled grasp for a martini glass is on its neck (shown by a black cylinder). For two predicted grasping points P1 and P2, the error would be the 3D distance from the grasping region, i.e., d1 and d2 respectively.
using a computer graphics ray tracer.2 The accuracy of the algorithm is related to
the quality of the synthetically generated images: the greater their graphical realism,
the better the accuracy of the algorithm. Therefore, we used a ray tracer instead of
faster, but cruder, OpenGL-style graphics. Because OpenGL-style graphics have less
realism, the learning performance sometimes decreased when graphical detail was
added to the rendered images (see 4.3.3).
The advantages of using synthetic images are multi-fold. First, once a synthetic
model of an object has been created, a large number of training examples can be
automatically generated by rendering the object under different (randomly chosen)
lighting conditions, camera positions and orientations, etc. In addition, to increase
the diversity of the training data generated, we randomized different properties of
the objects such as color, scale, and text (e.g., on the face of a book). The time-
consuming part of synthetic data generation was the creation of the mesh models of
the objects. However, there are many objects for which models are available on the
internet that can be used with only minor modifications. We generated 2500 examples
from synthetic data, comprising objects from five object classes (see Figure 5.3).
Using synthetic data also allows us to generate perfect labels for the training set
with the exact location of a good grasp for each object. In contrast, collecting and
manually labeling a comparably sized set of real images would have been extremely
time-consuming.
We have made the data available online at:
http://ai.stanford.edu/~asaxena/learninggrasp/data.html
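To make the randomization step concrete, the following sketch writes one randomized PovRay scene file of the kind described above. It is an illustration only: the mesh include file `mug_mesh.inc`, the declared object name `mug`, and the parameter ranges are all hypothetical, not the dissertation's actual scene files.

```python
import pathlib
import random

POV_TEMPLATE = """// auto-generated training scene (hypothetical mug model)
camera {{ location <{cx:.2f}, {cy:.2f}, {cz:.2f}> look_at <0, 0, 0> }}
light_source {{ <{lx:.2f}, {ly:.2f}, {lz:.2f}> color rgb <1, 1, 1> }}
#include "mug_mesh.inc"
object {{ mug pigment {{ color rgb <{r:.2f}, {g:.2f}, {b:.2f}> }} scale {s:.2f} }}
"""

def write_random_scene(path):
    """Write one randomized scene: camera pose, light position, object color
    and scale are all drawn at random, as described in Section 5.3.2."""
    params = dict(
        cx=random.uniform(-3, 3), cy=random.uniform(1, 3), cz=random.uniform(-3, 3),
        lx=random.uniform(-5, 5), ly=random.uniform(3, 6), lz=random.uniform(-5, 5),
        r=random.random(), g=random.random(), b=random.random(),
        s=random.uniform(0.8, 1.2),
    )
    pathlib.Path(path).write_text(POV_TEMPLATE.format(**params))
    return params
```

Each such file would be rendered with the `povray` command-line tool; because the 3D grasp label is attached to the model, projecting it through the same randomized camera yields the 2D training label automatically.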
5.3.3 Features
In our approach, we begin by dividing the image into small rectangular patches, and
for each patch predict if it contains a projection of a grasping point onto the image
2Ray tracing [37] is a standard image rendering method from computer graphics. It handles many real-world optical phenomena such as multiple specular reflections, textures, soft shadows, smooth curves, and caustics. We used PovRay, an open source ray tracer.
(a) (b)
Figure 5.5: (a) An image of textureless/transparent/reflective objects. (b) Depths estimated by our stereo system. The grayscale value indicates the depth (darker being closer to the camera). Black represents areas where stereo vision failed to return a depth estimate.
plane.
Instead of relying on a few visual cues such as presence of edges, we will compute
a battery of features for each rectangular patch. By using a large number of different
visual features and training on a huge training set (Section 5.3.2), we hope to obtain
a method for predicting grasping points that is robust to changes in the appearance
of the objects and is also able to generalize well to new objects.
We start by computing features for three types of local cues: edges, textures, and
color [146, 135]. We transform the image into YCbCr color space, where Y is the
intensity channel, and Cb and Cr are color channels. We compute features representing
edges by convolving the intensity channel with 6 oriented edge filters (Figure 2.4).
Texture information is mostly contained within the image intensity channel, so we
apply 9 Laws’ masks to this channel to compute the texture energy. For the color
channels, low frequency information is most useful for identifying grasps; our color
features are computed by applying a local averaging filter (the first Laws mask) to
the 2 color channels. We then compute the sum-squared energy of each of these filter
outputs. This gives us an initial feature vector of dimension 17.
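A sketch of how such a 17-dimensional energy vector could be computed is given below (numpy only). The 9 Laws' masks are the standard outer products of the 1-D L5/E5/S5 vectors; the 6 oriented edge kernels here are simple steerable first-derivative filters standing in for the filters of Figure 2.4, which are not reproduced in this chapter, so the exact feature values will differ from the system's:

```python
import numpy as np

def conv2(img, k):
    """Minimal 'valid' 2-D correlation (numpy only, for self-containment)."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# 1-D Laws vectors; the 9 pairwise outer products give the 2-D Laws' masks.
L5 = np.array([1., 4., 6., 4., 1.])
E5 = np.array([-1., -2., 0., 2., 1.])
S5 = np.array([-1., 0., 2., 0., -1.])
LAWS = [np.outer(a, b) for a in (L5, E5, S5) for b in (L5, E5, S5)]

# 6 oriented edge filters: steerable first-derivative kernels (a stand-in
# for the oriented filters of Figure 2.4).
SX = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
EDGE = [np.cos(t) * SX + np.sin(t) * SX.T for t in np.arange(6) * np.pi / 6]

def patch_features(y, cb, cr):
    """17 energy features for one patch: 6 edge + 9 Laws' masks on the
    intensity channel Y, plus the local-averaging mask (L5L5, the first
    Laws' mask) on each of the Cb and Cr color channels."""
    feats = [np.sum(conv2(y, k) ** 2) for k in EDGE]             # 6 edge energies
    feats += [np.sum(conv2(y, k) ** 2) for k in LAWS]            # 9 texture energies
    feats += [np.sum(conv2(c, LAWS[0]) ** 2) for c in (cb, cr)]  # 2 color energies
    return np.array(feats)                                       # shape (17,)
```

Applied to one image patch (in YCbCr), this returns the 17 sum-squared filter energies described above.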
To predict if a patch contains a grasping point, local image features centered on
the patch are insufficient, and one has to use more global properties of the object.
We attempt to capture this information by using image features extracted at multiple
spatial scales (3 in our experiments) for the patch. Objects exhibit different behav-
iors across different scales, and using multi-scale features allows us to capture these
variations. In detail, we compute the 17 features described above from that patch as
well as the 24 neighboring patches (in a 5×5 window centered around the patch of
interest). This gives us a feature vector x of dimension 1 × 17 × 3 + 24 × 17 = 459.
(a) Coffee Pot (b) Duct tape (c) Marker (d) Mug (e) Synthetic Martini Glass
Figure 5.6: Grasping point classification. The red points in each image show the locations most likely to be a grasping point, as predicted by our logistic regression model. (Best viewed in color.)
Although we rely mostly on image-based features for predicting the grasping point,
some robots may be equipped with range sensors such as a laser scanner or a stereo
camera. In these cases (Section 5.7.3), we also compute depth-based features to
improve performance. More formally, we apply our texture based filters to the depth
image obtained from a stereo camera, append them to the feature vector used in
classification, and thus obtain a feature vector xs ∈ R^918. Applying these texture
based filters this way has the effect of computing relative depths, and thus provides
information about 3D properties such as curvature. However, the depths given by a
stereo system are sparse and noisy (Figure 5.5) because many objects we consider are
textureless or reflective. Even after normalizing for the missing depth readings, these
features improved performance only marginally.
5.3.4 Probabilistic model
Using our image-based features, we will first predict whether each region in the image
contains the projection of a grasping point. Then in order to grasp an object, we will
statistically “triangulate” our 2D predictions to obtain a 3D grasping point.
In detail, on our manipulation platforms (Section 5.5), we have cameras mounted
either on the wrist of the robotic arm (Figure 5.11) or on a frame behind the robotic
arm (Figure 5.9). When the camera is mounted on the wrist, we command the arm
to move the camera to two or more positions, so as to acquire images of the object
from different viewpoints. However, there are inaccuracies in the physical positioning
of the arm, and hence there is some slight uncertainty in the position of the camera
when the images are acquired. We will now describe how we model these position
errors.
Formally, let Ĉ be the image that would have been taken if the actual pose of
the camera were exactly equal to the measured pose (e.g., if the robot had moved
exactly to the commanded position and orientation, in the case of the camera being
mounted on the robotic arm). However, due to positioning error, an image C is
instead taken from a slightly different location. Let (u, v) be a 2D position in image
Ĉ, and let (û, v̂) be the corresponding image position in C. Thus Ĉ(u, v) = C(û, v̂),
where C(û, v̂) is the pixel value at (û, v̂) in image C. The errors in camera position/pose
should usually be small,3 and we model the difference between (u, v) and (û, v̂) using
an additive Gaussian model: û = u + εu, v̂ = v + εv, where εu, εv ∼ N(0, σ²).
Now, to predict which locations in the 2D image are grasping points (Figure 5.6),
we define the class label z(u, v) as follows. For each location (u, v) in an image Ĉ,
z(u, v) = 1 if (u, v) is the projection of a grasping point onto the image plane, and
z(u, v) = 0 otherwise. For a corresponding location (û, v̂) in image C, we similarly
define ẑ(û, v̂) to indicate whether position (û, v̂) represents a grasping point in the
image C. Since (u, v) and (û, v̂) are corresponding pixels in Ĉ and C, we assume
3The robot position/orientation error is typically small (position is usually accurate to within 1mm), but it is still important to model this error. From our experiments (see Section 5.7), if we set σ² = 0, the triangulation is highly inaccurate, with average error in predicting the grasping point being 15.4 cm, as compared to 1.8 cm when an appropriate σ² is chosen.
z(u, v) = ẑ(û, v̂). Thus:

P(z(u, v) = 1|Ĉ) = P(ẑ(û, v̂) = 1|C)
                 = ∫_{εu} ∫_{εv} P(εu, εv) P(ẑ(u + εu, v + εv) = 1|C) dεu dεv        (5.1)
Here, P(εu, εv) is the (Gaussian) density over εu and εv. We then use logistic regression
to model the probability of a 2D position (u + εu, v + εv) in C being a good grasping
point:

P(ẑ(u + εu, v + εv) = 1|C) = P(ẑ(u + εu, v + εv) = 1|x; θ)
                           = 1/(1 + e^(−xᵀθ))        (5.2)

where x ∈ R^459 are the features for the rectangular patch centered at (u + εu, v + εv)
in image C (described in Section 5.3.3). The parameter of this model, θ ∈ R^459, is
learned using standard maximum likelihood for logistic regression: θ* = arg max_θ
∏_i P(zi|xi; θ), where (xi, zi) are the synthetic training examples (image patches
and labels), as described in Section 5.3.2. Figure 5.6a-d shows the result
of applying the learned logistic regression model to some real (non-synthetic) images.
3D grasp model: Given two or more images of a new object from different camera
positions, we want to infer the 3D position of the grasping point. (See Figure 5.7.)
Because logistic regression may have predicted multiple grasping points per image,
there is usually ambiguity in the correspondence problem (i.e., which grasping point
in one image corresponds to which grasping point in another). To address this while
also taking into account the uncertainty in camera position, we propose a probabilistic
model over possible grasping points in 3D space. In detail, we discretize the 3D work-
space of the robotic arm into a regular 3D grid G ⊂ R3, and associate with each grid
element j a random variable yj, so that yj = 1 if grid cell j contains a grasping point,
and yj = 0 otherwise.
From each camera location i = 1, ..., N, one image is taken. In image Ci, let the
ray passing through (u, v) be denoted Ri(u, v). Let Gi(u, v) ⊂ G be the set of
grid-cells through which the ray Ri(u, v) passes. Let r1, ..., rK ∈ Gi(u, v) be the indices of
Figure 5.7: (a) Diagram illustrating rays from two images C1 and C2 intersecting at a grasping point (shown in dark blue). (Best viewed in color.)
the grid-cells lying on the ray Ri(u, v).
We know that if any of the grid-cells rj along the ray represents a grasping point,
then its projection is a grasping point. More formally, zi(u, v) = 1 if and only if
yr1 = 1 or yr2 = 1 or . . . or yrK = 1. For simplicity, we use an (arguably unrealistic)
naive Bayes-like assumption of independence, and model the relation between
P(zi(u, v) = 1|Ci) and P(yr1 = 1 or . . . or yrK = 1|Ci) as

P(zi(u, v) = 0|Ci) = P(yr1 = 0, ..., yrK = 0|Ci)
                   = ∏_{j=1}^{K} P(yrj = 0|Ci)        (5.3)

Assuming that any grid-cell along a ray is equally likely to be a grasping point, this
therefore gives

P(yrj = 1|Ci) = 1 − (1 − P(zi(u, v) = 1|Ci))^(1/K)        (5.4)
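Numerically, Eq. 5.3 and Eq. 5.4 form an invertible pair: re-composing K per-cell probabilities recovers the image-level probability. A two-line sketch (the function name is ours) makes this easy to sanity-check:

```python
def cell_prob_from_image_prob(p_image, K):
    """Eq. 5.4: probability that a single grid-cell on the ray is a grasping
    point, given the image-level probability and K cells along the ray."""
    return 1.0 - (1.0 - p_image) ** (1.0 / K)
```

Composing the independence model of Eq. 5.3 over the K cells, 1 − (1 − p_cell)^K, returns the original image-level probability.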
Next, using another naive Bayes-like independence assumption, we estimate the
probability of a particular grid-cell yj ∈ G being a grasping point as:

P(yj = 1|C1, ..., CN) = P(yj = 1) P(C1, ..., CN |yj = 1) / P(C1, ..., CN)
                      = (P(yj = 1) / P(C1, ..., CN)) ∏_{i=1}^{N} P(Ci|yj = 1)
                      = (P(yj = 1) / P(C1, ..., CN)) ∏_{i=1}^{N} P(yj = 1|Ci) P(Ci) / P(yj = 1)
                      ∝ ∏_{i=1}^{N} P(yj = 1|Ci)        (5.5)
where P (yj = 1) is the prior probability of a grid-cell being a grasping point (set to a
constant value in our experiments). One can envision using this term to incorporate
other available information, such as known height of the table when the robot is asked
to pick up an object from a table. Using Equations 5.1, 5.2, 5.4 and 5.5, we can now
compute (up to a constant of proportionality that does not depend on the grid-cell)
the probability of any grid-cell yj being a valid grasping point, given the images.4
Stereo cameras: Some robotic platforms have stereo cameras (e.g., the robot in
Figure 5.10); therefore we also discuss how our probabilistic model can incorporate
stereo images. From a stereo camera, since we also get a depth value w(u, v) for
each location (u, v) in the image,5 we now obtain a 3D image C(u, v, w(u, v)) for 3D
positions (u, v, w).
However, because of camera positioning errors (as discussed before), we obtain the
image C(û, v̂, ŵ(û, v̂)) instead of the ideal image Ĉ(u, v, w(u, v)). We again model the difference between
4In about 2% of the trials described in Section 5.7.2, grasping failed because the algorithm found points in the images that did not actually correspond to each other. (E.g., in one image the point selected may correspond to the midpoint of a handle, and in a different image a different point may be selected that corresponds to a different part of the same handle.) Thus, triangulation using these points results in identifying a 3D point that does not lie on the object. By ensuring that the pixel values in a small window around each of the corresponding points are similar, one would be able to reject some of these spurious correspondences.
5The depths estimated from a stereo camera are very sparse, i.e., the stereo system finds valid points only for a few pixels in the image (see Figure 5.5). Therefore, we still mostly rely on image features to find the grasp points. The pixels where the stereo camera was unable to obtain a depth are treated as regular (2D) image pixels.
(u, v, w) and (û, v̂, ŵ) using an additive Gaussian: û = u + εu, v̂ = v + εv, ŵ = w + εw,
where εu, εv ∼ N(0, σ²uv) and εw ∼ N(0, σ²w). Now, for our class label z(u, v, w) in the
3D image, we have:
P(z(u, v, w) = 1|Ĉ) = P(ẑ(û, v̂, ŵ) = 1|C)
                    = ∫_{εu} ∫_{εv} ∫_{εw} P(εu, εv, εw) P(ẑ(u + εu, v + εv, w + εw) = 1|C) dεu dεv dεw        (5.6)
Here, P(εu, εv, εw) is the (Gaussian) density over εu, εv and εw. Now our logistic
regression model is

P(ẑ(u, v, w(u, v)) = 1|C) = P(ẑ(u, v, w(u, v)) = 1|xs; θs)
                          = 1/(1 + e^(−xsᵀθs))        (5.7)

where xs ∈ R^918 are the image and depth features for the rectangular patch centered at
(u, v) in image C (described in Section 5.3.3). The parameter of this model, θs ∈ R^918,
is learned similarly.
Now, to use a stereo camera in estimating the 3D grasping point, we use

P(yj = 1|Ci) = P(zi(u, v, w(u, v)) = 1|Ci)        (5.8)

in Eq. 5.5 in place of Eq. 5.4 when Ci represents a stereo camera image with depth
information at (u, v).
This framework allows predictions from both regular and stereo cameras to be
used together seamlessly, and also allows predictions from stereo cameras to be useful
even when the stereo system failed to recover depth information at the predicted
grasp point.
5.3.5 MAP inference
Given a set of images, we want to infer the most likely 3D location of the grasping
point. Therefore, we will choose the grid cell j in the 3D robot workspace that
maximizes the conditional log-likelihood log P(yj = 1|C1, ..., CN) in Eq. 5.5. More
formally, let there be NC regular cameras and N − NC stereo cameras. Now, from
Eq. 5.4 and 5.8, we have:

arg max_j log P(yj = 1|C1, ..., CN)
  = arg max_j log ∏_{i=1}^{N} P(yj = 1|Ci)
  = arg max_j [ ∑_{i=1}^{NC} log(1 − (1 − P(zi(u, v) = 1|Ci))^(1/K))
              + ∑_{i=NC+1}^{N} log P(zi(u, v, w(u, v)) = 1|Ci) ]        (5.9)

where P(zi(u, v) = 1|Ci) is given by Eq. 5.1 and 5.2, and P(zi(u, v, w(u, v)) = 1|Ci) is
given by Eq. 5.6 and 5.7.
A straightforward implementation that explicitly computes the sum above for
every single grid-cell would give good grasping performance, but would be extremely
inefficient (over 110 seconds). Since there are only a few places in an image where
P (zi(u, v) = 1|Ci) is significantly greater than zero, we implemented a counting al-
gorithm that explicitly considers only grid-cells yj that are close to at least one ray
Ri(u, v). (Grid-cells that are more than 3σ distance away from all rays are highly
unlikely to be the grid-cell that maximizes the summation in Eq. 5.9.) This counting
algorithm efficiently accumulates the sums over the grid-cells by iterating over all N
images and rays Ri(u, v),6 and results in an algorithm that identifies a 3D grasping
position in 1.2 sec.
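The counting idea can be sketched as follows (regular cameras only). This is our illustration, not the dissertation's implementation: in particular, the floor probability `p_floor` stands in for the near-zero P(yj = 1|Ci) of grid-cells that a camera's grasp ray misses, so that cells supported by more rays accumulate higher scores, and the marching step along each ray is a crude voxel traversal.

```python
import numpy as np
from collections import defaultdict

def map_grasp_cell(rays, cell_size=0.25, K=100, max_range=3.0, p_floor=1e-4):
    """Sketch of the counting algorithm for the MAP inference of Eq. 5.9.

    rays : list of (origin, direction, p_image), where p_image is
           P(zi(u, v) = 1 | Ci) for the predicted grasp point in image i.
    Only grid-cells that some ray actually passes through are ever scored,
    rather than every cell in the workspace.
    """
    scores = defaultdict(float)
    for origin, direction, p in rays:
        if p <= 0.1:                        # prune weak predictions (cf. footnote 6)
            continue
        o = np.asarray(origin, float)
        d = np.asarray(direction, float)
        d = d / np.linalg.norm(d)
        log_term = np.log(1.0 - (1.0 - p) ** (1.0 / K))    # log of Eq. 5.4
        visited = set()
        for t in np.arange(0.0, max_range, cell_size / 2):  # march along the ray
            visited.add(tuple(np.floor((o + t * d) / cell_size).astype(int)))
        for cell in visited:                # each ray contributes once per cell
            scores[cell] += log_term - np.log(p_floor)
    return max(scores, key=scores.get)
```

Cells near the intersection of several high-probability rays accumulate several positive contributions and win the arg max, mirroring the efficient accumulation over rays described above.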
6In practice, we found that restricting attention to rays where P(zi(u, v) = 1|Ci) > 0.1 allows us to further reduce the number of rays to be considered, with no noticeable degradation in performance.
5.4 Predicting orientation
In [141], we presented an algorithm for predicting the 3D orientation of an object
from its image. Here, for the sake of simplicity, we will present the learning algorithm
in the context of grasping using our 5-dof arm on STAIR 1.
As discussed in Section 5.6, our task is to predict the 2D wrist orientation θ of
the gripper (at the predicted grasping point) given the image. For example, given a
picture of a closed book (which we would like to grasp at its edge), we should choose
an orientation in which the robot’s two fingers are parallel to the book’s surfaces,
rather than perpendicular to the book’s cover.
Since our robotic arm has a parallel plate gripper comprising two fingers that
close in parallel, a rotation of π results in a similar configuration of the gripper. This
results in a discontinuity at θ = π, in that the orientation of the gripper at θ = π is
equivalent to θ = 0. Therefore, to handle this symmetry, we will represent angles via
y(θ) = [cos(2θ), sin(2θ)] ∈ R^2. Thus, y(θ + Nπ) = [cos(2θ + 2Nπ), sin(2θ + 2Nπ)] =
y(θ).
Now, given image features x, we model the conditional distribution of y as a
multi-variate Gaussian:

P(y|x; w, K) = (2π)^(−n/2) |K|^(1/2) exp[ −(1/2) (y − wᵀx)ᵀ K (y − wᵀx) ]

where K^(−1) is the covariance matrix. The parameters of this model, w and K, are
learned by maximizing the conditional log likelihood log ∏_i P(yi|xi; w, K).
Now, when given a new image, our MAP estimate for y is given as follows. Since
||y||2 = 1, we will choose

y* = arg max_{y: ||y||2 = 1} log P(y|x; w, K) = arg max_{y: ||y||2 = 1} yᵀKq

for q = wᵀx. (This derivation assumed K = σ²I for some σ², which will roughly hold
true if θ is chosen uniformly in the training set.) The closed-form solution of this is
y* = Kq/||Kq||2.
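The two steps of this section, the symmetry-respecting representation y(θ) and the closed-form MAP estimate, can be sketched together (a minimal numpy illustration; the function names and array shapes are our assumptions):

```python
import numpy as np

def encode_angle(theta):
    """y(theta) = [cos 2θ, sin 2θ]: maps θ and θ + π (equivalent for a
    parallel plate gripper) to the same point on the unit circle."""
    return np.array([np.cos(2 * theta), np.sin(2 * theta)])

def predict_orientation(x, w, K):
    """Closed-form MAP estimate of Section 5.4: y* = Kq/||Kq|| for q = wᵀx,
    then the double angle is inverted with atan2 to recover θ in [0, π)."""
    q = w.T @ x                 # predicted mean of y, shape (2,)
    y = K @ q
    y = y / np.linalg.norm(y)   # project back onto the unit circle
    return 0.5 * np.arctan2(y[1], y[0]) % np.pi
```

With w fitted so that wᵀx predicts y(θ) and K the learned precision matrix, the recovered θ is defined only up to π, exactly matching the gripper's symmetry.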
In our robotic experiments, typically 30° accuracy is required to successfully grasp
an object, which our algorithm almost always attains. As an example, Figure 5.8
shows the predicted orientation for a pen.
Figure 5.8: Predicted orientation at the grasping point for a pen. The dotted line represents the true orientation, and the solid line represents the predicted orientation.
5.5 Robot platforms
Our experiments were performed on two robots built for the STAIR (STanford AI Robot)
project.7 Each robot has an arm and other equipment such as cameras, computers,
etc. (See Fig. 5.9 and 5.10.) The STAIR platforms were built as part of a project
whose long-term goal is to create a general purpose household robot that can navigate
in indoor environments, pick up and interact with objects and tools, and carry out
tasks such as tidying up a room or preparing simple meals. Our algorithms for grasping
novel objects represent perhaps a small step towards achieving some of these goals.
STAIR 1 uses a harmonic arm (Katana, by Neuronics), and is built on top of
a Segway robotic mobility platform. Its 5-dof arm is position-controlled and has a
parallel plate gripper. The arm has a positioning accuracy of ±1 mm, a reach of
62cm, and can support a payload of 500g. Our vision system used a low-quality
webcam (Logitech Quickcam Pro 4000) mounted near the end effector and a stereo
7See http://www.cs.stanford.edu/group/stair for details.
camera (Bumblebee, by Point Grey Research). In addition, the robot has a laser
scanner (SICK LMS-291) mounted approximately 1m above the ground for navigation
purposes. (We used the webcam in the experiments on grasping novel objects, and the
Bumblebee stereo camera in the experiments on unloading items from dishwashers.)
STAIR 2 sits atop a holonomic mobile base, and its 7-dof arm (WAM, by Barrett
Technologies) can be position- or torque-controlled, is equipped with a three-fingered
hand, and has a positioning accuracy of ±0.6 mm. It has a reach of 1m and can
support a payload of 3kg. Its vision system uses a stereo camera (Bumblebee2, by
Point Grey Research) and a Swissranger camera [170].
We used a distributed software framework called Switchyard [125] to route mes-
sages between different devices such as the robotic arms, cameras and computers.
Switchyard allows distributed computation using TCP message passing, and thus
provides networking and synchronization across multiple processes on different hardware platforms. We now use its second version, ROS [126].
5.6 Planning
After identifying a 3D point at which to grasp an object, we need to find an arm pose
that realizes the grasp, and then plan a path to reach that arm pose so as to pick up
the object.
Given a grasping point, there are typically many end-effector orientations consis-
tent with placing the center of the gripper at that point. The choice of end-effector
orientation should also take into account other constraints, such as location of nearby
obstacles, and orientation of the object.
5-dof arm. When planning in the absence of obstacles, we found that even fairly
simple methods for planning worked well. Specifically, on our 5-dof arm, one of the
degrees of freedom is the wrist rotation, which therefore does not affect planning
to avoid obstacles. Thus, we can separately consider planning an obstacle-free path
using the first 4-dof, and deciding the wrist rotation. To choose the wrist rotation,
using a simplified version of our algorithm in [141], we learned the 2D projection of the 3D grasp orientation onto the image plane (see Appendix). Thus, for example,
Figure 5.9: STAIR 1 platform. This robot is equipped with a 5-dof arm and a parallel plate gripper.
Figure 5.10: STAIR 2 platform. This robot is equipped with a 7-dof Barrett arm and a three-fingered hand.
if the robot is grasping a long cylindrical object, it should rotate the wrist so that the
parallel-plate gripper’s inner surfaces are parallel (rather than perpendicular) to the
main axis of the cylinder. Further, we found that using simple heuristics to decide
the remaining degrees of freedom worked well.8
When grasping in the presence of obstacles, such as when unloading items from
a dishwasher, we used a full motion planning algorithm for the 5-dof as well as for
the opening of the gripper (a 6th degree of freedom). Specifically, having identified
possible goal positions in configuration space using standard inverse kinematics [85],
we plan a path in 6-dof configuration space that takes the end-effector from the
starting position to a goal position, avoiding obstacles. For computing the goal ori-
entation of the end-effector and the configuration of the fingers, we used a criterion
that attempts to minimize the opening of the hand without touching the object be-
ing grasped or other nearby obstacles. Our planner uses Probabilistic Road-Maps
(PRMs) [159], which start by randomly sampling points in the configuration space.
It then constructs a “road map” by finding collision-free paths between nearby points,
and finally finds a shortest path from the starting position to possible target positions
in this graph. We also experimented with a potential field planner [61], but found
the PRM method gave better results because it did not get stuck in local optima.
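As an illustration of the PRM idea, here is a toy planner in a 2-d configuration space with circular obstacles (a drastic simplification of the 6-dof planner used on the robot; all parameters and names are our illustrative choices):

```python
import random, math, heapq

def collision_free(p, obstacles):
    """A point is valid if it lies outside every circular obstacle (center, radius)."""
    return all(math.dist(p, c) > r for c, r in obstacles)

def prm_path(start, goal, obstacles, n_samples=300, radius=0.3, seed=0):
    """Minimal Probabilistic Road-Map in the unit square.

    1) sample collision-free points, 2) connect nearby pairs whose straight
    segment is collision-free, 3) run Dijkstra from start (node 0) to goal (node 1).
    """
    rng = random.Random(seed)
    nodes = [start, goal]
    while len(nodes) < n_samples:
        p = (rng.random(), rng.random())
        if collision_free(p, obstacles):
            nodes.append(p)

    def segment_free(a, b, steps=20):
        # Check evenly spaced points along the straight segment a -> b.
        return all(
            collision_free((a[0] + (b[0] - a[0]) * t / steps,
                            a[1] + (b[1] - a[1]) * t / steps), obstacles)
            for t in range(steps + 1))

    edges = {i: [] for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = math.dist(nodes[i], nodes[j])
            if d < radius and segment_free(nodes[i], nodes[j]):
                edges[i].append((j, d))
                edges[j].append((i, d))

    # Dijkstra shortest path on the road map.
    dist, prev, queue = {0: 0.0}, {}, [(0.0, 0)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == 1:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(queue, (nd, v))
    if 1 not in dist:
        return None
    path, u = [], 1
    while u != 0:
        path.append(nodes[u])
        u = prev[u]
    path.append(nodes[0])
    return path[::-1]
```

Because the road map is built by global random sampling rather than gradient descent on a potential, it does not suffer from the local-optima problem noted above.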
7-dof arm. On the STAIR 2 robot, which uses a 7-dof arm, we use the full algorithm
in [141], for predicting the 3D orientation of a grasp, given an image of an object.
This, along with our algorithm to predict the 3D grasping point, determines six of the
seven degrees of freedom (i.e., the end-effector location and orientation). For deciding
the seventh degree of freedom, we use a criterion that maximizes the distance of the
8 Four degrees of freedom are already constrained by the end-effector 3D position and the chosen wrist angle. To decide the fifth degree of freedom in uncluttered environments, we found that most grasps reachable by our 5-dof arm fall in one of two classes: downward grasps and outward grasps. These arise as a direct consequence of the shape of the workspace of our 5-dof robotic arm (Figure 5.11). A "downward" grasp is used for objects that are close to the base of the arm, which the arm will grasp by reaching in a downward direction (Figure 5.11, first image), and an "outward" grasp is for objects further away from the base, for which the arm is unable to reach in a downward direction (Figure 5.11, second image). In practice, to simplify planning, we first plan a path towards an approach position, which is set to be a fixed distance away from the predicted grasp point towards the base of the robot arm. Then we move the end-effector in a straight line forward towards the target grasping point. Our grasping experiments in uncluttered environments (Section 5.7.2) were performed using this heuristic.
arm from the obstacles. Similar to the planning on the 5-dof arm, we then apply a
PRM planner to plan a path in the 7-dof configuration space.
5.7 Experiments
Table 5.1: Mean absolute error in locating the grasping point for different objects, as well as grasp success rate for picking up the different objects using our robotic arm. (Although training was done on synthetic images, testing was done on the real robotic arm and real objects.)
Objects Similar to Ones Trained on

Tested on            Mean Abs. Error (cm)   Grasp-Success rate
Mugs                 2.4                    75%
Pens                 0.9                    100%
Wine Glass           1.2                    100%
Books                2.9                    75%
Eraser/Cellphone     1.6                    100%
Overall              1.80                   90.0%

Novel Objects

Tested on            Mean Abs. Error (cm)   Grasp-Success rate
Stapler              1.9                    90%
Duct Tape            1.8                    100%
Keys                 1.0                    100%
Markers/Screwdriver  1.1                    100%
Toothbrush/Cutter    1.1                    100%
Jug                  1.7                    75%
Translucent Box      3.1                    75%
Powerhorn            3.6                    50%
Coiled Wire          1.4                    100%
Overall              1.86                   87.8%
Figure 5.11: The robotic arm picking up various objects: screwdriver, box, tape-roll, wine glass, a solder tool holder, coffee pot, powerhorn, cellphone, book, stapler and coffee mug. (See Section 5.7.2.)
5.7.1 Experiment 1: Synthetic data
We first evaluated the predictive accuracy of the algorithm on synthetic images (not
contained in the training set). (See Figure 5.6e.) The average accuracy for classifying
whether a 2D image patch is a projection of a grasping point was 94.2% (evaluated
on a balanced test set comprising the five objects in Figure 5.3). Even though the
accuracy in classifying 2D regions as grasping points was only 94.2%, the accuracy
in predicting 3D grasping points was higher because the probabilistic model for in-
ferring a 3D grasping point automatically aggregates data from multiple images, and
therefore “fixes” some of the errors from individual classifiers.
5.7.2 Experiment 2: Grasping novel objects
We tested our algorithm on STAIR 1 (5-dof robotic arm, with a parallel plate gripper)
on the task of picking up an object placed on an uncluttered table top in front of the
robot. The location of the object was chosen randomly (and we used cardboard boxes
to change the height of the object, see Figure 5.11), and was completely unknown to
the robot. The orientation of the object was also chosen randomly from the set of
orientations in which the object would be stable, e.g., a wine glass could be placed
vertically up, vertically down, or in a random 2D orientation on the table surface
(see Figure 5.11). (Since the training was performed on synthetic images of objects
of different types, none of these scenarios were in the training set.)
In these experiments, we used a web-camera, mounted on the wrist of the robot,
to take images from two or more locations. Recall that the parameters of the vision
algorithm were trained from synthetic images of a small set of five object classes,
namely books, martini glasses, white-board erasers, mugs/cups, and pencils. We
performed experiments on coffee mugs, wine glasses (empty or partially filled with
water), pencils, books, and erasers—but all of different dimensions and appearance
than the ones in the training set—as well as a large set of objects from novel object
classes, such as rolls of duct tape, markers, a translucent box, jugs, knife-cutters,
cellphones, pens, keys, screwdrivers, staplers, toothbrushes, a thick coil of wire, a
strangely shaped power horn, etc. (See Figures 5.2 and 5.11.) We note that many of
these objects are translucent, textureless, and/or reflective, making 3D reconstruction
difficult for standard stereo systems. (Indeed, a carefully calibrated Point Grey stereo system, the Bumblebee BB-COL-20, with higher-quality cameras than our webcam, fails to accurately reconstruct the visible portions of 9 out of 12 objects.
See Figure 5.5.)
In extensive experiments, the algorithm for predicting grasps in images appeared
to generalize very well. Despite being tested on images of real (rather than synthetic)
objects, including many very different from ones in the training set, it was usually
able to identify correct grasp points. We note that test set error (in terms of average
absolute error in the predicted position of the grasp point) on the real images was only
somewhat higher than the error on synthetic images; this shows that the algorithm
trained on synthetic images transfers well to real images. (Over all 5 object types used
in the synthetic data, average absolute error was 0.81cm in the synthetic images; and
over all the 14 real test objects, average error was 1.84cm.) For comparison, neonate humans can grasp simple objects with an average accuracy of 1.5cm [11].
Table 5.1 shows the errors in the predicted grasping points on the test set. The
table presents results separately for objects which were similar to those we trained on
(e.g., coffee mugs) and those which were very dissimilar to the training objects (e.g.,
duct tape). For each entry in the table, a total of four trials were conducted except
for staplers, for which ten trials were conducted. In addition to reporting errors in
grasp positions, we also report the grasp success rate, i.e., the fraction of times the
robotic arm was able to physically pick up the object. For a grasp to be counted as
successful, the robot had to grasp the object, lift it up by about 1ft, and hold it for
30 seconds. On average, the robot picked up the novel objects 87.8% of the time.
For simple objects such as cellphones, wine glasses, keys, toothbrushes, etc., the
algorithm performed perfectly in our experiments (100% grasp success rate). How-
ever, grasping objects such as mugs or jugs (by the handle) allows only a narrow
trajectory of approach—where one “finger” is inserted into the handle—so that even
a small error in the grasping point identification causes the arm to hit and move the
object, resulting in a failed grasp attempt. Although it may be possible to improve
the algorithm’s accuracy, we believe that these problems can best be solved by using
a more advanced robotic arm that is capable of haptic (touch) feedback.
In many instances, the algorithm was able to pick up completely novel objects
(a strangely shaped power-horn, duct-tape, solder tool holder, etc.; see Figures 5.2
and 5.11). Perceiving a transparent wine glass is a difficult problem for standard
vision (e.g., stereopsis) algorithms because of reflections, etc. However, as shown in
Table 5.1, our algorithm successfully picked it up 100% of the time. Videos showing
the robot grasping the objects are available at
http://ai.stanford.edu/~asaxena/learninggrasp/
5.7.3 Experiment 3: Unloading items from dishwashers
The goal of the STAIR project is to build a general purpose household robot. As
a step towards one of STAIR’s envisioned applications, in this experiment we con-
sidered the task of unloading items from dishwashers (Figures 5.1 and 5.14). This
is a difficult problem because of the presence of background clutter and occlusion
between objects—one object that we are trying to unload may physically block our
view of a second object. Our training set for these experiments also included some
hand-labeled real examples of dishwasher images (Figure 5.12), including some im-
ages containing occluded objects; this helps prevent the algorithm from identifying
grasping points on the background clutter such as dishwasher prongs. Along with the
usual features, these experiments also used the depth-based features computed from
the depth image obtained from the stereo camera (Section 5.3.3). Further, in these
experiments we did not use color information, i.e., the images fed to the algorithm
were grayscale.
In detail, we asked a person to arrange several objects neatly (meaning inserted
over or between the dishwasher prongs, and with no pair of objects lying flush against
each other; Figures 10 and 11 show typical examples) in the upper tray of the dish-
washer. To unload the items from the dishwasher, the robot first identifies grasping
points in the image. Figure 5.13 shows our algorithm correctly identifying grasps on
multiple objects even in the presence of clutter and occlusion. The robot then uses
Figure 5.12: Example of a real dishwasher image, used for training in the dishwasher experiments.
these grasps and the locations of the obstacles (perceived using a stereo camera) to
plan a path while avoiding obstacles, to pick up the object. When planning a path
to a grasping point, the robot chooses the grasping point that is most accessible to
the robotic arm using a criterion based on the grasping point’s height, distance to
the robot arm, and distance from obstacles. The robotic arm then removes the first
object (Figure 5.14) by lifting it up by about 1ft and placing it on a surface on its
right using a pre-written script, and repeats the process above to unload the next
Table 5.2: Grasp-success rate for unloading items from a dishwasher.
Tested on     Grasp-Success rate
Plates        100%
Bowls         80%
Mugs          60%
Wine Glass    80%
Overall       80%
Figure 5.13: Grasping point detection for objects in a dishwasher. (Only the points in the top five grasping regions are shown.)
item. As objects are removed, the visual scene also changes, and the algorithm will
find grasping points on objects that it had missed earlier.
We evaluated the algorithm quantitatively for four object classes: plates, bowls,
mugs and wine glasses. (We have successfully unloaded items from multiple dish-
washers; however we performed quantitative experiments only on one dishwasher.)
We performed five trials for each object class (each trial used a different object). We
achieved an average grasping success rate of 80.0% in a total of 20 trials (see Ta-
ble 5.2). Our algorithm was able to successfully pick up objects such as plates and
wine glasses most of the time. However, due to the physical limitations of the 5-dof
arm with a parallel plate gripper, it is not possible for the arm to pick up certain
objects, such as bowls, if they are in certain configurations (see Figure 5.15). For
mugs, the grasp success rate was low because of the problem of narrow trajectories
discussed in Section 5.7.2.
We also performed tests using silverware, specifically objects such as spoons and
forks. These objects were often placed (by the human dishwasher loader) against the
corners or walls of the silverware rack. The handles of spoons and forks are only
about 0.5cm thick; therefore a larger clearance than 0.5cm was needed for them to
be grasped using our parallel plate gripper, making it physically extremely difficult
to do so. However, if we arrange the spoons and forks with part of the spoon or fork
at least 2cm away from the walls of the silverware rack, then we achieve a grasping
success rate of about 75%.
Some of the failures were because some parts of the object were not perceived;
therefore the arm would hit and move the object resulting in a failed grasp. In such
cases, we believe an algorithm that uses haptic (touch) feedback would significantly
increase the grasp success rate. Other failures occurred in cases where our algorithm correctly predicted a grasping point, but the arm was physically unable to reach that grasp. Therefore, we believe that using an arm/hand with more degrees of freedom will significantly improve performance on the task of unloading a dishwasher.
Figure 5.14: Dishwasher experiments (Section 5.7.3): Our robotic arm unloads items from a dishwasher.
Figure 5.15: Dishwasher experiments: Failure case. For some configurations of certain objects, it is impossible for our 5-dof robotic arm to grasp them (even if a human were controlling the arm).
5.7.4 Experiment 4: Grasping kitchen and office objects
Our long-term goal is to create a useful household robot that can perform many
different tasks, such as fetching an object in response to a verbal request and cooking
simple kitchen meals. In situations such as these, the robot would know which object
it has to pick up. For example, if the robot was asked to fetch a stapler from an
office, then it would know that it needs to identify grasping points for staplers only.
Therefore, in this experiment we study how we can use information about object type
and location to improve the performance of the grasping algorithm.
Consider objects lying against a cluttered background such as a kitchen or an
office. If we predict the grasping points using our algorithm trained on a dataset
containing all five objects, then we typically obtain a set of reasonable grasping point
predictions (Figure 5.16, left column). Now suppose we know the type of object we
want to grasp, as well as its approximate location in the scene (such as from an
object recognition algorithm [39]). We can then restrict our attention to the area of
the image containing the object, and apply a version of the algorithm that has been
trained using only objects of a similar type (i.e., using object-type specific parameters,
such as using bowl-specific parameters when picking up a cereal bowl, using spoon-
specific parameters when picking up a spoon, etc). With this method, we obtain
object-specific grasps, as shown in Figure 5.16 (right column).
Achieving larger goals, such as cooking simple kitchen meals, requires that we
combine different algorithms such as object recognition, navigation, robot manipula-
tion, etc. These results demonstrate how our approach could be used in conjunction
with other complementary algorithms to accomplish these goals.
5.8 Discussion
We proposed an algorithm for enabling a robot to grasp an object that it has never
seen before. Our learning algorithm neither tries to build, nor requires, a 3D model
of the object. Instead it predicts, directly as a function of the images, a point at
which to grasp the object. In our experiments, the algorithm generalizes very well to
Kitchen
Office
Figure 5.16: Grasping point classification in kitchen and office scenarios: (Left Column) Top five grasps predicted by the grasp classifier alone. (Right Column) Top two grasps for three different object-types, predicted by the grasp classifier when given the object types and locations. The red points in each image are the predicted grasp locations. (Best viewed in color.)
novel objects and environments, and our robot successfully grasped a wide variety of objects in different environments, such as dishwashers, offices, and kitchens.
Chapter 6
Robotic Grasping: Full Hand
Configuration
6.1 Introduction
Consider the problem of grasping novel objects, in the presence of significant amounts
of clutter. A key challenge in this setting is that a full 3D model of the scene is
typically not available. Instead, a robot’s sensors (such as a camera or a depth
sensor) can usually estimate only the shape of the visible portions of the scene. In
this chapter, we present an algorithm that, given such partial models of the scene,
selects a grasp—that is, a configuration of the robot’s arm and fingers—to try to pick
up an object.
Our algorithm in chapter 5 applied supervised learning to identify visual properties
that indicate good grasps, given a 2-d image of the scene. However, it only chose a
3D “grasp point”—that is, the 3D position (and 3D orientation; Saxena et al. [139])
of the center of the end-effector. Thus, it did not generalize well to more complex
arms and hands, such as to multi-fingered hands where one has to not only choose
the 3D position (and orientation) of the hand, but also address the high-dof problem
of choosing the positions of all the fingers.
If a full 3D model (including the occluded portions of a scene) were available,
then methods such as form- and force-closure [86, 7, 116] and other grasp quality
metrics [107, 48, 18] can be used to try to find a good grasp. However, given only
the point cloud returned by stereo vision or other depth sensors, a straightforward
application of these ideas is impossible, since we do not have a model of the occluded
portions of the scene.
Most prior work that used vision for real-world grasping experiments was limited to grasping 2-d planar objects. For a uniformly colored planar object lying on
a uniformly colored table top, one can find the 2-d contour of the object quite reliably.
Using local visual features (based on the 2-d contour) and other properties such as
form- and force-closure, [19, 17, 12, 93] computed the 2-d locations at which to place
(two or three) fingertips to grasp the object. In more general settings (i.e., non-planar
grasps), Edsinger and Kemp [28] grasped cylindrical objects with a power grasp using visual servoing, and Platt et al. [115] used schema-structured learning for
grasping simple objects (spherical and cylindrical) using power grasps; however, this
does not apply to grasping general shapes (e.g., a cup by its handle) or to grasping
in cluttered environments.
In detail, we will consider a robot that uses a camera, together with a depth
sensor, to perceive a scene. The depth sensor returns a “point cloud,” corresponding
to 3D locations that it has found on the front unoccluded surfaces of the objects.
(See Fig. 6.1.) Such point clouds are typically noisy (because of small errors in the
depth estimates); but more importantly, they are also incomplete.1
Our approach begins by computing a number of features of grasp quality, using
both 2-d image features and 3D point-cloud features. For example, the 3D data is used to
compute a number of grasp quality metrics, such as the degree to which the fingers
are exerting forces normal to the surfaces of the object, and the degree to which
they enclose the object. Using such features, we then apply a supervised learning
algorithm to estimate the degree to which different configurations of the full arm and
fingers reflect good grasps.
Because our grasp planning algorithm relied on long-range vision sensors (such as cameras and the Swissranger), errors and uncertainty, ranging from small deviations in the
1 For example, standard stereo vision fails to return depth values for textureless portions of the object; thus its point clouds are typically very sparse. Further, the Swissranger returns only a few points because of its low spatial resolution of 144 × 176.
Figure 6.1: Image of an environment (left) and the 3D point-cloud (right) returned by the Swissranger depth sensor.
object’s location to occluded surfaces significantly limited the reliability of these open-loop grasping strategies. (We found that approximately 65% of the grasp failures occurred because we used only long-range sensors and lacked a reactive controller with sufficient local surface pose information.) Therefore, we also present new optical proximity sensors mounted on the robot hand’s fingertips, along with learning algorithms that enable the robot to reactively adjust the grasp. These sensors are a very useful complement to long-range sensors (such as vision and 3D cameras), and provide a sense of “pre-touch” to the robot, thus improving grasping performance.
We test our algorithm on two robots, on a variety of objects of shapes very different
from ones in the training set, including a ski boot, a coil of wire, a game controller,
and others. Even when the objects are placed amidst significant clutter, our algorithm
often selects successful grasps.
We present an additional application of our algorithms to opening new doors.
This capability enables our robot to navigate in spaces that were earlier blocked by
doors and elevators.
This chapter is organized as follows. We first give a brief overview of the 3D
sensor used in Section 6.2. We then describe our method for grasping using this 3D
data in Section 6.3. We describe our experiments and results in Section 6.4. We then
describe the optical proximity sensors in Section 6.5. Finally, we describe the door
Figure 6.2: (Left) Barrett 3-fingered hand. (Right) Katana parallel plate gripper.
opening application in Section 6.6.
6.2 3D sensor
The SwissRanger camera is a time-of-flight depth sensor that returns a 144 × 176
array of depth estimates, spanning a 47.5◦×39.6◦ field of view, with a range of about
8m. Using an infrared structured light source, each pixel in the camera independently
measures the arrival time of the light reflected back by the objects. Its depth estimates
are typically accurate to about 2cm. However, it also suffers from systematic errors,
in that it tends not to return depth estimates for dark objects, nor for surfaces lying
at a large angle relative to the camera’s image plane. (See Fig. 6.1 for a sample 3D
scan.)
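Under a simple pinhole approximation, the stated field of view determines the focal lengths, so the 144 × 176 depth array can be back-projected into a point cloud. The sketch below is our simplification, not the manufacturer's calibration model:

```python
import math

def depth_to_points(depth, fov_h_deg=47.5, fov_v_deg=39.6):
    """Convert a depth image (list of rows) into 3D points via a pinhole model.

    The focal lengths are derived from the stated field of view; pixels for
    which the sensor returned no depth are skipped, mirroring the sensor's
    failure to report depth for dark or steeply inclined surfaces.
    """
    rows, cols = len(depth), len(depth[0])
    fx = (cols / 2) / math.tan(math.radians(fov_h_deg) / 2)
    fy = (rows / 2) / math.tan(math.radians(fov_v_deg) / 2)
    cx, cy = (cols - 1) / 2, (rows - 1) / 2
    points = []
    for v in range(rows):
        for u in range(cols):
            z = depth[v][u]
            if z is None or z <= 0:        # no depth returned at this pixel
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```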
6.3 Grasping Strategy
There are many different properties that make certain grasps preferable to others.
Examples of such properties include form- and force-closure (to minimize slippage),
sufficient contact with the object, distance to obstacles (to increase robustness of
the grasp), and distance between the center of the object and the grasping point
(to increase stability). In real world grasping, however, such properties are difficult
to compute exactly, because of the quality of sensor data. Our algorithm will first
compute a variety of features that attempt to capture some of these properties. Using
these features, we then apply supervised learning to predict whether or not a given
arm/finger configuration reflects a good grasp.
Definition of grasp: We will infer the full goal configuration of the arm/fingers that
is required to grasp an object. For example, STAIR 1 uses an arm with 5 joints and
a parallel plate gripper (with one degree of freedom); for this robot, the configuration
is given by α ∈ R6. The second robot STAIR 2 uses an arm with 7 joints, equipped
with a three-fingered hand that has 4 joints (Fig. 6.2); for this robot, our algorithm
infers a configuration α ∈ R11. We will informally refer to this goal configuration as
a “grasp.”
We will then use a motion planner algorithm [159] to plan a path (one that avoids
obstacles) from the initial configuration to this goal configuration.
6.3.1 Probabilistic model
We will use both the point-cloud R and the image I taken of a scene to infer a goal
configuration α of the arm/fingers.
In chapter 5, we classified each 2-d point in a given image as a 1 (candidate grasp)
or 0. For example, for an image of a mug, it would try to classify the handle and
rim as candidate grasping points. In our approach, we use a similar classifier that
computes a set of image features and predicts the probability P (y = 1|α, I) ∈ [0, 1]
of each point in the image being a candidate grasping point. However, a remaining
difficulty is that this algorithm does not take into account the arm/finger kinematics;
thus, many of the 2-d points it selects are physically impossible for the robot to reach.
To address this problem, we use a second classifier that, given a configuration α
and a point-cloud R, predicts the probability P (y|α,R) that the grasp will succeed.
This classifier will compute features that capture a variety of properties that are
indicative of grasp quality. Our model will combine these two classifiers to estimate
the probability P (y|α,R, I) of a grasp α succeeding. Let y ∈ {0, 1} indicate whether
{α,R, I} is a good grasp. We then have:
P(y|α, R, I) ∝ P(R, I|y, α) P(y, α)    (6.1)
We assume conditional independence of R and I, and uniform priors P (α), P (I),
P (R) and P (y). Hence,
P(y|α, R, I) ∝ P(R|y, α) P(I|y, α) P(y, α)
             ∝ P(y|α, R) P(y|α, I)    (6.2)
Here, P(y|α, I) is the 2-d image classifier term, similar to the one in chapter 5. We also use the logistic model P(y|α, R; θ) = 1/(1 + exp(ψ(R,α)^T θ)), where ψ(R,α) are the features discussed in the next section.
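The combination in Eq. 6.2 amounts to summing two log-probabilities; a minimal sketch (names are ours, and the sign convention inside the logistic follows the formula as printed):

```python
import math

def grasp_log_prob(psi, theta, log_p_image):
    """Combined log-probability of a grasp succeeding, as in Eq. 6.2.

    psi:         point-cloud feature vector psi(R, alpha)
    theta:       learned logistic weights
    log_p_image: log P(y=1 | alpha, I) from the 2-d image classifier
    """
    z = sum(p * t for p, t in zip(psi, theta))
    log_p_cloud = -math.log1p(math.exp(z))   # log of 1 / (1 + exp(z))
    return log_p_cloud + log_p_image
```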
Inference: From Eq. 6.2, the inference problem is:

α∗ = arg max_α log P(y = 1|α, R, I) = arg max_α [ log P(y = 1|α, R) + log P(y = 1|α, I) ]
Now, we note that a configuration α has a very small chance of being the optimal
configuration if either one of the two terms is very small. Thus, for efficiency, we
implemented an image-based classifier P (y|I, α) that returns only a small set of 3D
points with a high value. We use this to restrict the set of configurations α to
only those in which the hand-center lies at one of the 3D locations output by the
image-based classifier. Given one such 3D location, finding a full configuration α
for it now requires solving only an n − 3 dimensional problem (where n = 6 or 11,
depending on the arm). Further, we found that it was sufficient to consider only
a few locations of the fingers, which further reduces the search space; e.g., in the
goal configuration, the gap between the finger and the object is unlikely to be larger
than a certain value. By sampling randomly from this space of “likely” configurations
(similar to sampling in PRMs) and evaluating the grasp quality only of these samples,
we obtain an inference algorithm that is computationally tractable and that also
typically obtains good grasps.
Algorithm 1 Grasping an Object
1: Get a set of candidate 2-d grasp points in the image using a classifier based on [137]
2: Use depth sensors (stereo/swissranger) to find the corresponding 3D grasp points
3: for each grasp point in the candidate 3D grasp point set do
4:   Use arm inverse kinematics to generate configuration(s) with hand center near the 3D grasp point
5:   Add configuration(s) for which the arm/fingers do not hit obstacles to the candidate valid configuration set
6: end for
7: for each configuration in the candidate valid configuration set do
8:   Compute features using the point cloud and the configuration
9:   Use the features to classify whether the grasp will be good/bad
10:  Score[grasp] ⇐ ScoreForCandidateGrasp (from classifier)
11: end for
12: Execute grasp∗ = arg max Score[grasp]
13: Plan a path for grasp∗ using the PRM motion planner
6.3.2 Features
Below, we describe the features that make up the feature vector ψ(R,α) used to
estimate a good grasp. The same features were used for both robots.
Presence / Contact: For a given finger configuration, some part of an object
should be inside the volume enclosed by the fingers. Intuitively, more enclosed points
indicate that the grasp contains larger parts of the object, which generally decreases
the difficulty of grasping it (less likely to miss). For example, for a coil of wire, it
is better to grasp a bundle rather than a single wire. To robustly capture this, we
calculate a number of features—the number of points contained in spheres of different
sizes located at the hand’s center, and also the number of points located inside the
volume enclosed within the finger-tips and the palm of the hand.
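These counting features can be illustrated with a short sketch; the sphere radii below are arbitrary illustrative values, not those used on the robot.

```python
import numpy as np

def presence_features(points, hand_center, radii=(0.03, 0.06, 0.09)):
    """Count point-cloud points within spheres of several radii around the
    hand center; more enclosed points suggest the grasp captures a larger
    part of the object. Radii (in metres) are illustrative only."""
    d = np.linalg.norm(points - hand_center, axis=1)
    return [int(np.sum(d <= r)) for r in radii]
```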
Symmetry / Center of Mass: Even if many points are enclosed by the hand, their
distribution is also important. For example, a stick should be grasped at the middle
instead of at the tip, as slippage might occur in the latter case due to a greater torque
(a) (b)
Figure 6.3: Features for symmetry / evenness.
induced by gravity. To capture this property, we calculate a number of features based
on the distribution of points around the hand’s center along an axis perpendicular to
the line joining the fingers. To ensure grasp stability, an even distribution (1:1 ratio)
of points on both sides of the axis is desirable. More formally, if there are N points
on one side and N ′ on the other side, then our feature would be |N −N ′|/(N +N ′).
Again, to increase robustness, we use several counting methods, such as counting all
the points, and counting only those points not enclosed by the hand.
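The evenness feature |N − N′|/(N + N′) described above is a one-liner; a guarded sketch:

```python
def symmetry_feature(n_left, n_right):
    """Evenness of the point distribution about the axis: 0 for a perfectly
    balanced grasp (1:1 ratio), approaching 1 for a grasp near the tip."""
    total = n_left + n_right
    return abs(n_left - n_right) / total if total else 0.0
```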
Local Planarity / Force-Closure: One needs to ensure a few properties to avoid
slippage, such as a force closure on the object (e.g., to pick up a long tube, a grasp
that lines up the fingers along the major axis of the tube would likely fail). Further,
a grasp in which the finger direction (i.e., direction in which the finger closes) is
perpendicular to the local tangent plane is more desirable, because it is more likely to
lie within the friction cone, and hence is less likely to slip. For example, when grasping
a plate, it is more desirable that the fingers close in a direction perpendicular to the
plate surface.
To capture such properties, we start with calculating the principal directions of
a 3D point cloud centered at the point in question. This gives three orthonormal
component directions ui, with u1 being the component with largest variance, followed
by u2 and u3. (Let σi be the corresponding variances.) For a point on the rim of a
Figure 6.4: Demonstration of PCA features.
circular plate, u1 and u2 would lie in the plane in which the plate lies, with u1 usually
tangent to the edge, and u2 facing inwards. Ideally, the finger direction should be
orthogonal to large variance directions and parallel to the small variance ones. With
fj as the finger direction (j = 1 for the parallel gripper, and j = 1, 2 for the
three-fingered hand), we calculate the following features: (a) Directional similarity,
sij = |ui · fj|, and (b) Difference from ideal, ((σ1 − σi)/(σ1 − σ3) − sij)².
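Assuming the local point cloud is available as an N x 3 array, the PCA-based features can be sketched as follows (a hypothetical helper, not the thesis code):

```python
import numpy as np

def pca_grasp_features(points, finger_dirs):
    """Compute (a) directional similarity s_ij = |u_i . f_j| and
    (b) the 'difference from ideal' ((sig_1 - sig_i)/(sig_1 - sig_3) - s_ij)^2
    from the principal directions u_i (variances sig_i) of a local point cloud."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    sig, u = np.linalg.eigh(cov)           # eigenvalues in ascending order
    sig, u = sig[::-1], u[:, ::-1]         # sig_1 >= sig_2 >= sig_3
    feats = []
    for f in finger_dirs:                  # one closing direction per finger pair
        for i in range(3):
            s = abs(u[:, i] @ f)
            ideal = (sig[0] - sig[i]) / (sig[0] - sig[2] + 1e-12)
            feats.append((s, (ideal - s) ** 2))
    return feats
```

For points lying on a plane, a finger direction normal to that plane scores s ≈ 1 against the smallest-variance component, with a near-zero "difference from ideal," as desired.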
6.4 Experiments
We performed three sets of extensive experiments: grasping with our three-fingered
7-dof arm in uncluttered as well as cluttered environments, and on our 5-dof arm
with a parallel plate gripper for unloading items in a cluttered dishwasher.
6.4.1 Grasping single novel objects
We considered several objects from 13 novel object classes in a total of 150 experi-
ments. These object classes varied greatly in shape, size, and appearance, and are
very different from the plates, bowls, and rectangular blocks used in the training set.
During the experiments, objects were placed at a random location in front of our
robot. Table 6.1 shows the results: “Prediction” refers to the percentage of cases in
which the final grasp and plan were good, and “Grasp success” is the percentage of cases
in which the robot predicted a good grasp and actually picked up the object as well
Figure 6.5: Snapshots of our robot grasping novel objects of various sizes/shapes.
(i.e., if the object slipped and was not picked up, then it counted as failure).
In Chapter 5, we required that objects be neatly placed on a “rack.” Here, however, we consider
the significantly harder task of grasping randomly placed objects in any orientation.
Further, many objects such as ski boots, helmets, etc. require more intricate grasps
(such as inserting a finger in the ski boot because it is too big to hold as a power
grasp).
For each object class, we performed 10-20 trials, with different instances of each
object for each class (e.g., different plates for the “plates” class). The average “Predic-
tion” accuracy was 81.3%, and the actual grasp success rate was 76%. The success
rate was different depending on the size of the object.2 Both prediction and actual
grasping were best for medium sizes, with success rates of 92% and 86% respectively.
Even though handicapped by a significantly harder experimental setup (i.e., objects
2We defined an object to be small if it could be enclosed within the robot hand, medium if it was approximately 1.5-3 times the size of the hand, and large otherwise (some objects were even 4 ft long).
Table 6.1: STAIR 2. Grasping single objects. (150 trials.)
Object Class     Size    Prediction   Grasp success
Ball             small   80%          80%
Apple            small   90%          80%
Gamepad          small   85%          80%
CD container     small   70%          60%
Helmet           med     100%         100%
Ski boot         med     80%          80%
Plate bundle     med     100%         80%
Box              med     90%          90%
Robot arm link   med     90%          80%
Cardboard tube   large   70%          65%
Foam             large   60%          60%
Styrofoam        large   80%          70%
Coil of wire     large   70%          70%
not neatly stacked, but thrown in random places and a larger variation of object
sizes/shapes considered) as compared to Saxena et al., our algorithm surpasses their
success rate by 6%.
6.4.2 Grasping in cluttered scenarios
In this case, in addition to the difficulty in perception, manipulation and planning
became significantly harder in that the arm had to avoid all other objects while
reaching for the predicted grasp; this significantly reduced the number of feasible
candidates and increased the difficulty of the task.
In each trial, more than five objects were placed in random locations (even where
objects touched each other, see some examples in Fig. 6.5). Using only the 2-d image-
based method of Saxena et al. (with PRM motion planning), but not our algorithm
that considers finding all arm/finger joints from partial 3D data, success was below
5%. In a total of 40 experiments, our success rate was 75% (see Table 6.2). We
believe our robot is the first one to be able to automatically grasp objects, of types
never seen before, placed randomly in such heavily cluttered environments.
Figure 6.6: Barrett arm grasping an object using our algorithm.
Table 6.2: STAIR 2. Grasping in cluttered scenes. (40 trials.)
Environment   Object   Prediction   Grasp Success
Terrain       Tube     87.5%        75%
Terrain       Rock     100%         75%
Kitchen       Plate    87.5%        75%
Kitchen       Bowl     75%          75%
Table 6.3: STAIR 1. Dishwasher unloading results. (50 trials.)
Object Class   Prediction Good   Actual Success
Plate          100%              85%
Bowl           80%               75%
Mug            80%               60%
We have made our grasping movies available at:
http://stair.stanford.edu/multimedia.php
6.5 Optical proximity sensors
In our grasp planning algorithm that relied on long range vision sensors (such as
cameras, swissranger), errors and uncertainty ranging from small deviations in the
object’s location to occluded surfaces significantly limited the reliability of these open-
loop grasping strategies. We found that approximately 65% of the grasp failures were
because we used only long range sensors and lacked a reactive controller with sufficient
local surface pose information.
Tactile sensing has been employed as a means to augment the initial grasp and
manipulation strategies by addressing inconsistencies in the contact forces during ob-
ject contact and manipulation [173]. However, tactile sensors have to actually touch
the object in order to provide useful information. Because current sensor technologies
are not sensitive enough to detect finger contacts before causing significant object mo-
tion, their use is limited to either minor adjustments of contact forces at pre-computed
Figure 6.7: Three-fingered Barrett Hand with our optical proximity sensors mounted on the finger tips.
grasp configurations, or to planning algorithms that require iterative re-grasping of
the object in order to grasp successfully [109, 172, 50]. While the latter approach has
shown substantial improvements in grasp reliability, it requires a significant amount
of time and frequently causes lighter objects to be knocked over during the repeated
grasp attempts.
The limitations of tactile-sensing-augmented grasp planning can be overcome by
‘pre-touch’ sensing. This modality has recently become a popular means of bridging
the gap in performance between long range vision and tactile sensors. In pre-touch
sensing, gripper-mounted, short-range (0-4cm) proximity sensors are used to estimate
the absolute distance and orientation (collectively called surface pose) of a desired
contact location without requiring the robot to touch the object [163]. The vast ma-
jority of these pre-touch proximity sensors use optical methods because of their high
precision [10, 106, 179, 111, 112]. Optical sensors, however, are highly sensitive to
surface reflection properties. Alternatively, capacitive-based proximity sensors have
also been used [183]. While invariant to surface properties, these capacitive-based
sensors have difficulty detecting materials with low dielectric contrast, such as fab-
rics and thin plastics. Unfortunately, in both cases, present sensor calibration and
modeling techniques have yet to produce pose estimates that are robust enough to be
useful across the range of surface textures, materials, and geometries encountered in
unstructured environments. Furthermore, the finger tips of typical robotic grippers
are too small to accommodate the sensors used in previous work.
Therefore, in [51] we developed an optical proximity sensor, embedded in the
fingers of the robot (see Fig. 6.7), and showed how it can be used to estimate local
object geometries and perform better reactive grasps. This allows for on-line grasp
adjustments to an initial grasp configuration without the need for premature object
contact or re-grasping strategies. Specifically, a design for a low cost, pose-estimating
proximity sensor is presented that meets the small form factor constraints of a typical
robotic gripper.
The data from these sensors is interpreted using empirically derived models and
the resulting pose estimates are provided to a reactive grasp closure controller that
regulates contact distances even in the absence of reliable surface estimates. This
allows the robot to move the finger tip sensors safely into configurations where sensor
data for the belief-state update can be gathered. Simultaneously, the pose estimates
of the sensor interpretation approach can provide the necessary information to adjust
the grasp configuration to match the orientation of the object surface, increasing the
likelihood of a stable grasp.
We perform a series of grasping experiments to validate the system using a dex-
terous robot consisting of a 7-DOF Barrett arm and multi-fingered hand. In these
tests, we assume that an approximate location of the object to be grasped has al-
ready been determined either from long range vision sensors [136], or through human
interaction [60]. From the initial grasp positions, the system exhibits improved grasp-
ing on a variety of household objects with varying materials, surface properties, and
geometries.
Figure 6.8: Normalized Voltage Output vs. Distance for a TCND5000 emitter-detector pair.
6.5.1 Optical Sensor Hardware
The purpose of the optical sensor hardware is to provide data that can be used by
the modeling algorithm to construct pose estimates of nearby surfaces. The design
is driven by a series of constraints, including size, sensor response, and field of view.
The hardware design was mostly due to Paul Nangeroni [51]; however, we provide the
full design details for completeness.
A basic optical sensor consists of an emitter, photo-receiver, and signal processing
circuitry. The light from the emitter is reflected by nearby surfaces and received by
the photo-receiver. The amplitude and phase of the light vary as a function of the
distance to the surface, its orientation, and other properties of the surface material
(reflectance, texture, etc.) [10]. In amplitude-modulated proximity sensor designs,
the most common approach, these variations in amplitude can be converted
Figure 6.9: Possible fingertip sensor configurations.
into pose estimates by measuring the response from constellations of at least three
receivers focused at the same point on the target surface [10, 127, 179].
Although conceptually simple, modeling the pose of unknown surfaces is difficult
because of the non-monotonic behavior of the proximity sensor receivers. The re-
sponse of a single sensor is a function of the distance to the target surface and the
baseline between the emitter and receiver, as shown in Fig. 6.8. While the response
in the far-field varies as the inverse square of the distance [105], the response of the
sensor in the near field is far more complex. The decrease in received light energy in
the near field is governed not only by the reflective properties of the surface, but also
by the geometric baseline between the emitter and receiver. As distance approaches
zero in the near field, the amount of energy that can be reflected between the emitter
and receiver decreases because the overlap between the emitter and receiver cones
decreases. This results in a sharp drop-off of received light intensity. To avoid com-
plications in modeling the sensor data, most approaches, including ours, offset the
sensors from the surface of the gripper to maximize the far field sensor response. Al-
though this sacrifices the higher sensitivity of the near field, the longer range of the
far field is better suited to grasping applications [10].
The selection of specific sensor hardware and signal processing electronics is severely
constrained by the limited space in typical robotic fingers (for instance, our Barrett
fingers have a 1cm x 2cm x 1cm cavity volume). Bonen and Walker used optical
fibers to achieve the desired geometric arrangement of emitters and receivers in their
respective sensors. However, the bend radius and large terminations of the fibers vi-
olate the space constraints in this application. Instead, our design uses four low-cost,
off-the-shelf infrared emitter/receiver pairs (Vishay TCND5000) on each finger. The
small size (6 x 3.7 x 3.7 mm) meets the volume requirements of the design and the
range (2-40mm) is ideal for pre-touch sensing.
The arrangement of these sensors represents yet another design tradeoff between
field of view and pose estimation accuracy. Large fields of view, both out from and
in front of the face of the fingertip, as shown in Fig. 6.9a, are advantageous for
detecting oncoming objects. Unfortunately, the sensor spacing needed to achieve this
reduces the overlap between adjacent sensor pairs and lowers the signal to noise ratio.
Conversely, arranging the sensors to focus the emitters and maximize crosstalk, as
shown in Fig. 6.9b, improves local pose estimation accuracy at the expense of broader
range data to nearby objects. The final configuration, shown in Fig. 6.9c, consists of
three sensors arranged in a triangle on the face of the finger to estimate pose with a
fourth sensor at a 45◦ angle to the finger tip to increase the field of view. Although
surface area exists for placing additional sensors, the quantity is limited to four by
the available space in the Barrett finger cavity for pre-amplifier circuitry. The sensors
are inset into the finger to minimize near-field effects, and the aluminum housing
is matte anodized to decrease internal specular reflection. The crosstalk between
Figure 6.10: Front view of sensor constellation. Hatches show the crosstalk between adjacent sensor pairs.
adjacent sensors is illustrated by the hatched area in Fig. 6.10.3
The complete proximity sensing suite consists of twelve total emitter-detector
pairs, four on each finger. The emitters are pulsed in sequence by a PIC 18F4550
micro-controller located on the wrist of the robot (as shown in Fig. 6.7) so that 16
readings are generated for each finger on each cycle. The collected sensor data is pre-
amplified by circuitry in the finger tip and then sampled by a 10-bit A/D converter
before streaming the data back to the robot over a serial link. In spite of the 950nm
operating wavelength, raw sensor readings remain sensitive to ambient light effects.
Background subtraction was used to remove this ambient component and increase
the signal to noise ratio.
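The background subtraction step amounts to sampling each receiver with its emitter off and again with it pulsed on; `read_sensor` and `set_emitter` below are hypothetical driver callbacks, not the actual firmware interface.

```python
def ambient_subtract(read_sensor, set_emitter):
    """Remove the ambient-light component from a raw reading by
    differencing emitter-off and emitter-on samples."""
    set_emitter(False)
    ambient = read_sensor()     # ambient light only
    set_emitter(True)
    signal = read_sensor()      # ambient + reflected emitter light
    set_emitter(False)
    return signal - ambient
```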
3While the crosstalk between sensors 1-3 can be useful when the primary sensor values saturate, which occurs on many light-colored surfaces, experimental testing showed the crosstalk to be below the noise floor on many of the object surfaces we encountered. Since this work assumes that the nature of the surface is unknown a priori, our current work ignores the crosstalk and only uses the primary sensor values.
6.5.2 Sensor model
Given a series of observations from the sensors on each finger, our goal is to estimate
the surface pose (distance and orientation) of the surface in front of the finger.
More formally, let o = (o1, o2, o3) ∈ R3 be the readings from the three sensors
grouped in a triangle,4 and s = (d, xrot, zrot) be the surface pose of the local surface
(approximated as a plane). Here, d is the straight-line distance from the center of the
three sensors to the surface of the object (along the vector pointing outwards from
the finger surface). xrot and zrot are the relative orientations of the surface
around the finger’s x-axis (pitch) and z-axis (roll) respectively, as shown in Fig. 6.10.
(They are equal to 90◦ when the object surface is parallel to the finger surface.)
One of the major challenges in the use of optical sensors is that intrinsic surface
properties (such as reflectivity, diffusivity, etc.) cause the relationship between the
raw sensor signal and the surface pose to vary significantly across different surface
types. For that reason, prior work using short-range proximity sensors to find surface
pose has focused on using multiple direct models obtained by performing regression
on empirical data. This data is gathered by recording sensor array readings for each
relevant surface in turn, placed at a comprehensive set of known surface poses [10].
In particular, [10, 106, 179] all use a least-squares polynomial fit of data taken for
a known surface or group of similar surfaces to directly estimate surface pose given
sensor readings. However, acquiring enough data to successfully model a new surface
is extremely time-consuming, and having to do so for every potential surface that
might be encountered is prohibitive. In practical grasping scenarios, we need to be
able to deal with unknown and never-before-encountered surfaces.
Reference Forward Model
As opposed to fully characterizing every surface with a separate model, we use a
single reference model that is scaled with an estimate of the object’s IR reflectivity
4Sensor 4, which is offset by 45◦ to increase the field of view, is not included in this sensor model because it rarely focuses on the same flat surface as the other three. As a stand-alone sensor, it is nevertheless independently useful, particularly when the peak expected value (object surface reflectivity) is known. For instance, it can be used to prevent unexpected collisions with objects while moving, or even to move to a fixed distance from a table or other surface with known orientation.
parameter in order to obtain an approximate forward model for the specific object
of interest.5 By scaling with the estimated peak value, the model becomes roughly
invariant to surface brightness.
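The scaling itself is a simple ratio; the peak values below are hypothetical calibration constants, not measured ones.

```python
import numpy as np

def scale_to_reference(readings, object_peak, reference_peak):
    """Scale raw readings by the ratio of the grey-card reference peak to
    the object's estimated IR-reflectivity peak, so the single reference
    model applies approximately to surfaces of different brightness."""
    return np.asarray(readings, dtype=float) * (reference_peak / object_peak)
```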
For this work, the calibration data was taken using a Kodak grey card, which is
a surface often used to adjust white balance in photography. The data consists of
1904 samples of o taken at distance values ranging from 0.5 cm to 3.4 cm, xrot values
ranging from 30◦ to 110◦, and zrot values ranging from 40◦ to 140◦. Fig. 6.11 shows
the calibration data.
Figure 6.11: Calibration Data. (Top row) plots xrot vs. energy for all three sensors, with binned distances represented by different colors (distance bin centers in mm: green=5, blue=9, cyan=13, magenta=19, yellow=24, black=29, burlywood=34). (Bottom row) shows zrot vs. energy for all three sensors.
5The surface reflectivity value is not particularly onerous to collect: a single grasp of the object, particularly one that uses the raw values to align at least one finger with the object surface, is sufficient. Alternatively, an estimate of the peak value could be obtained by observing the object with an IR camera or with a laser rangefinder that provides IR intensity values.
Figure 6.12: Locally weighted linear regression on calibration data. The plot shows energy values for one sensor at a fixed distance and varying x and z orientation. Green points are recorded data; red points show the interpolated and extrapolated estimates of the model.
Belief State Model
In order to address the sensitivity of the direct model to the object surface char-
acteristics, we use an empirical forward sensor model and a probabilistic estimation
process to derive the current best estimate for s. In our setting, as our robot hand is
executing a grasp trajectory, we take many sensor readings with fingers at different
(known) locations, and use the sensor readings, the finger positions, and our empirical
sensor model to update a belief state (a probability distribution over possible values
of s) at each time step.
Here, we assume that the object is static during a grasp. This is a reasonable
assumption because we are using proximity or ‘pre-touch’ sensors: the actual grasp
and sensing procedure does not make contact with the object prior to complete
closure of the grasp and thus does not actively cause any object
displacements. In addition, we assume that the surface seen under each finger is
locally planar throughout the entire grasp process.6
For each finger, let S be the set of all possible states s, and let st, ot and at be the
state, sensor readings (observation), and hand pose (actions) at time t, respectively.
At all times, we will track the belief state bt := P (s0|o1...ot, a1...at), the probability
distribution over all possible surface poses in S seen at the first time step, given
all observations and hand poses since. We assume that our sensor observations at
different times are conditionally independent given the surface pose:
P(s0 | o1...oT, a1...aT) = ∏_{t=1}^{T} P(s0 | ot, at)    (6.3)
Observation Model
An essential part of finding the most likely surface pose is having a model for how
likely it is that we could have seen the current sensor readings if that pose were true.
Specifically, the quantity we are looking for is P (ot|at, st), the probability of seeing
the sensor readings we obtained at time t (ot) given the current hand configuration
(at) and a particular surface pose (st). For this, we need a function mapping states
to observations, C(s) = o. We use locally weighted linear regression on our scaled
calibration data set to estimate o given s. An example of the values obtained using
locally weighted linear regression is shown in Fig. 6.12, where the green points are
actual data and the red points are estimated values. Each estimated point uses a plane
derived from only the 8 closest actual data points. Also, because extrapolation using
locally weighted linear regression is extremely poor, the estimated point is clipped to
be no greater than the highest of those 8 values, and no less than the lowest.
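The k-nearest-neighbour linear fit with clipping can be sketched as follows, assuming the calibration inputs X and outputs y are arrays (an illustration, not the thesis code):

```python
import numpy as np

def lwr_predict(X, y, query, k=8):
    """Fit an affine model to the k closest calibration samples and clip
    the prediction to the range of those samples, since extrapolation
    with locally weighted regression is extremely poor."""
    d = np.linalg.norm(X - query, axis=1)
    idx = np.argsort(d)[:k]                      # k nearest calibration points
    A = np.c_[X[idx], np.ones(len(idx))]         # affine design matrix
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    pred = np.append(query, 1.0) @ coef
    return float(np.clip(pred, y[idx].min(), y[idx].max()))
```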
We then assume that the estimated model values are correct (for a given st, we
expect to see the model-estimated ot), and that any deviations we see in the actual
sensor readings, ε = ot − C(st), are due to Gaussian noise, ε ∼ N(0, σ²). This
assumption is wildly inaccurate, as the errors are in fact systematic, but it
6Even when the surface under the finger changes and the old data becomes erroneous, the estimates would get progressively better as more data from the new surface is observed. It is also simple to place a higher weight on new data, to make the adjustment faster.
nonetheless allows one to find the closest alignment of the observed o points to the
scaled calibration model without any assumptions about surface characteristics. For
our experiments, sensor readings were scaled to vary from 1 to 100, and σ was set to
be 25, since we expect significant deviations from our model values.
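Under this Gaussian-noise assumption, the observation likelihood (up to normalization) is:

```python
import numpy as np

def observation_likelihood(o, model_o, sigma=25.0):
    """P(o_t | a_t, s_t) up to a constant: deviations eps = o_t - C(s_t) of
    the readings from the model prediction are treated as N(0, sigma^2)
    noise (sigma = 25 on the 1-100 reading scale used here)."""
    eps = np.asarray(o, dtype=float) - np.asarray(model_o, dtype=float)
    return float(np.exp(-np.sum(eps ** 2) / (2.0 * sigma ** 2)))
```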
Inference
At each time step, we wish to compute an estimate st of the current state. The states,
S, are discretized to a uniform grid. We can represent bt as a grid of probabilities
that sum to 1, and update the belief state probabilities at each time step using our
actions and observations as we would any Bayesian filter.
More specifically, the belief state update is performed as follows:
bt = P(s0 | o1...ot, a1...at)
   = P(ot | at, st) P(s0 | o1...ot−1, a1...at−1) / P(ot)
   = P(ot | at, st) bt−1 / P(ot)    (6.4)
We assume a uniform prior on the sensor readings, and thus the denominator can be
normalized out. We then find the expected value of the state s0 as follows:
ŝ0 = E(s0) = Σ_{s0} P(s0 | o1...ot, a1...at) s0    (6.5)
We can compute st from s0 using the hand kinematics and the known hand positions
at and a0.
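On a discretized state grid, the update of Eq. 6.4 and the expectation of Eq. 6.5 reduce to elementwise operations; a minimal sketch:

```python
import numpy as np

def belief_update(belief, likelihood):
    """One Bayesian filter step (Eq. 6.4): multiply the prior grid by the
    observation likelihood evaluated at each cell, then renormalize (the
    uniform prior on sensor readings makes P(o_t) a constant)."""
    b = belief * likelihood
    return b / b.sum()

def expected_state(belief, states):
    """Expected surface pose (Eq. 6.5): probability-weighted mean of the
    discretized states (rows of `states`)."""
    return states.T @ belief
```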
The advantage of combining observations in this manner is that, while the actual
sensor readings we expect to see vary greatly from surface to surface, readings across
different sensors in the array vary in a similar way with respect to orientation changes
for all surfaces. Thus, a number of observations over a wide range of different values
of s can be used to align a new set of observations with a model that is not quite right.
The s with expected (model) observations that align best with the actual observations
is generally the one with the highest likelihood, even with large model error.
Figure 6.13: A sequence showing the grasp trajectory chosen by our algorithm. Initially, the fingers are completely open; as more data comes in, the estimates get better, and the hand turns and closes the fingers in such a way that all the fingers touch the object at the same time.
6.5.3 Reactive grasp closure controller
To verify the usefulness of our ‘pre-grasp’ proximity sensors, a reactive grasp closure
controller was designed and implemented that uses the proximity data to move the
fingers such that: 1) they do not collide with the object, and 2) they close simulta-
neously, forming a symmetric grasp configuration around a pre-specified grasp point.
This ability is aimed at achieving successful grasps even in the presence of significant
uncertainty about the object geometry, and at allowing the sensor interpretation ap-
proach to safely collect data to improve the surface estimates.
The hierarchical control architecture used here composes two reactive control ele-
ments that run asynchronously and that control separate aspects of the grasp closure
process. At the bottom level, a finger distance controller controls the fingers to main-
tain approximately equal distances to the object surface. On top of this control
element, a kinematic conditioning controller controls the arm to center the object
within the hand and to cause the fingers to be parallel to the surface.
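The bottom-level finger distance controller can be sketched as a simple proportional law; the gain is an illustrative value, not the one used on the robot.

```python
def finger_velocity(distances, target_gap, gain=2.0):
    """Command each finger's closing velocity in proportion to how far its
    estimated finger-to-surface distance is from a common target gap, so
    all fingers approach the surface together and close simultaneously."""
    return [gain * (d - target_gap) for d in distances]
```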
Figure 6.14: Different objects on which we present our analysis of the model predictions. See Section 6.5.4 for details.
Table 6.4: Model Test Experiment Error

Object               dist (cm)   xrot (◦)   zrot (◦)
Brown Wood Bowl      0.35        4.4        13.5
Beige Plastic Bowl   0.34        4.4        23.0
Black Plastic Box    0.49        10.3       16.5
Blue Plastic Box     0.59        5.3        21.7
Cardboard Box        0.20        3.8        14.4
Yellow Box           0.43        3.6        17.3
Average              0.40        5.3        17.7
6.5.4 Experiments
We first evaluate the ability of our algorithm to model sensor data, and then use
the model to perform grasping experiments.
The sensors described in Section 6.5.1 were mounted onto the finger tips of a
Barrett hand. This manipulator has a total of 4 degrees of freedom (three fingers
each with one degree of freedom, and a finger “spread”). The hand was attached to
a 7-dof arm (WAM, by Barrett Technologies) mounted on the STAIR (STanford AI
Robot) platform.
Prediction Model
As a test of our estimation method, we placed the Barrett hand at known positions
relative to a number of commonly found objects of various materials. (See Fig. 6.14
for their pictures.) Our test data consisted of three different orientations for each
object: parallel to the fingers, +10◦ rotation, and −20◦ rotation. The fingers were
then made to close slowly around the object until each finger nearly touched the
surface. Sets of 100 raw sensor readings were taken every 1.2 degrees of finger base
joint bend, and for every 1.2 degrees of finger spread (up to at most 17 degrees) and
averaged. Results showing errors in the final estimates of (d, xrot, zrot) are in Table
6.4.
Table 6.4 shows that our model is able to predict the distances with an average
error of 0.4cm. We are also able to estimate xrot reasonably accurately, with an
average error of 5.3◦. The high error in xrot in the case of the black plastic box can be
attributed to the fact that the first finger in one run did not see the same surface the
entire time, which is a requirement for reasonable predictions with our model. Note
that these objects have significantly different surface properties, and other than the
peak sensor value, no other surface characteristics were assumed. High-reflectance
surfaces tended to do worse than matte surfaces due to significant deviation from
the calibration model. Nonetheless, our experiments show that we are able to make
reasonable predictions by the time the fingers touch the surface.
The higher errors in zrot are a consequence of the movements used during the
data collection and estimation process. Since the sensor interpretation approach uses
belief updates to determine the maximum likelihood orientation, the quality of the
resulting estimates depends on the proximity data encountered along the fingers’
trajectories, with larger variations in the local geometry resulting in more accurate
estimates. In the procedure used here to gather the test data, a wide range of xrot
angles was encountered within the useful range of the sensors due to the strong
curling of the Barrett fingers. On the other hand, only a very limited range of
zrot angles were correctly observed due to the significantly smaller variation available
using the hand’s spread angle. Furthermore, most spread actions moved the fingers to
positions significantly further from the object, often resulting in sensor readings that
were no longer observable. While this is a problem in the test data and illustrates the
advantage of active sensing strategies, its cause should be largely alleviated when using
the grasp closure controller to establish the initial grasp, due to the ability of the finger
distance controller to maintain the fingers at a distance that provides usable results
throughout the finger spread operation. Additionally, the inclusion of arm motions
6.5. OPTICAL PROXIMITY SENSORS 163
through the use of the kinematic conditioning controller should further enhance the
range of zrot angles encountered during a grasp, and thus allow for somewhat better
zrot estimates.
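The belief-update scheme described above can be sketched as a discrete Bayes filter over candidate orientations. The Gaussian sensor model and the linear `predict_fn` in this toy example are hypothetical stand-ins for the calibrated proximity-sensor model; only the update structure reflects the approach.

```python
import numpy as np

def gaussian_likelihood(reading, predicted, sigma=0.05):
    """Hypothetical sensor model: likelihood of a proximity reading
    given the reading predicted from a candidate surface pose."""
    return np.exp(-0.5 * ((reading - predicted) / sigma) ** 2)

def update_orientation_belief(belief, angles, readings, predict_fn):
    """One discrete Bayes-filter update over candidate surface angles.

    belief     : prior probability over each candidate angle
    angles     : candidate rotation values, in degrees
    readings   : raw proximity readings from the fingertip sensors
    predict_fn : maps an angle to the readings predicted for that pose
    """
    for i, angle in enumerate(angles):
        predicted = predict_fn(angle)
        for r, p in zip(readings, predicted):
            belief[i] *= gaussian_likelihood(r, p)
    belief /= belief.sum()          # renormalize the posterior
    return belief

# toy usage: the surface is truly at +10 degrees
angles = np.arange(-30.0, 31.0, 1.0)
belief = np.ones_like(angles) / len(angles)
predict = lambda a: np.array([0.2 + 0.01 * (a - 10.0)])
belief = update_orientation_belief(belief, angles, np.array([0.2]), predict)
print(angles[np.argmax(belief)])   # maximum-likelihood angle: 10.0
```

As in the experiments above, the quality of the posterior depends entirely on how much the predicted readings vary across candidate angles along the fingers' trajectories.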
Grasping Experiments
Our goal was to focus on the final approach of grasping using proximity sensors. Our
system could be used in a variety of settings, including the point-and-click approach
of [60], where a laser pointer is used to highlight an object for the robot to pick up,
or combined with long range vision sensors that select optimal grasp points [136].
In this experiment, we placed a variety of objects (weighing less than 3 pounds)
in known locations on a table (see Fig. 6.16), with some objects flush against each
other. These objects are of a variety of shapes, ranging from simple boxes or bowls,
to more complicated shapes such as a ski boot.
The robot moves the hand to the approximate center of the object and executes a
grasp strategy,7 using our controller to move the hand and the fingers in response to
estimated distances from the sensors and the pose estimation algorithm. The robot
then picks up the object and moves it to verify that the grasp is stable. (See Fig. 6.13
and Fig. 6.16 for some images of the robot picking up the objects.)
Figure 6.15: Failure cases: (a) Shiny can, (b) Transparent cup
7. One could envision using better strategies, such as those based on vision-based learning [140, 141], for moving the hand to a grasping point on the object.
Figure 6.16: Snapshots of the robot picking up different objects.
Note that our objects are of a variety of shapes and made of a variety of mate-
rials. Out of the 26 grasping trials performed on 21 unique objects,8 our grasping
system failed three times. Two of these failures were due to extreme surface types: a
transparent cup and a highly reflective aluminum can (Fig. 6.15). To address these
cases, our optical-based proximity sensors could be combined with a capacitive-based
system that is good at detecting metallic or glass surfaces. The third object was
a ski-boot, for which our approach worked perfectly; however, due to the object’s
weight, our robot was unable to lift it up.
A video of the grasp experiments is available at the following url:
http://stair.stanford.edu/proximitygrasping/
In the video, the hand rotates the fingers in many cases to be approximately
8. Procedure used to choose the objects: We asked a person not associated with the project to bring objects larger than about 8 inches in length and less than 3 pounds from our lab and different offices. No other selection criterion was specified, and therefore we believe that our objects represented a reasonably unbiased sample of commonly found objects.
6.6. APPLICATIONS TO OPENING NEW DOORS 165
parallel to the object surface and causes the fingers to contact at nearly the same
time, thereby improving the grasp.
6.6 Applications to opening new doors
Recently, there has been growing interest in using robots not only in controlled factory
environments but also in unstructured home and office environments. In the past,
successful navigation algorithms have been developed for robots in these environ-
ments. However, to be practically deployed, robots must also be able to manipulate
their environment to gain access to new spaces. In Klingbeil, Saxena, and Ng [64],
we presented our work towards enabling a robot to autonomously navigate anywhere
in a building by opening doors, even those it has never seen before.
Most prior work in door opening (e.g., [122, 110]) assumes that a detailed 3D
model of the door (and door handle) is available, and focuses on developing the
control actions required to open one specific door. In practice, a robot must rely on
only its sensors to perform manipulation in a new environment.
Our methods for grasping, described in previous chapters, use a vision-based learning
approach to choose a point at which to grasp an object. However, a task such
as opening a door is more involved in that it requires a series of manipulation tasks;
the robot must first plan a path to reach the handle and then apply a series of
forces/torques (which may vary in magnitude and direction) to open the door. The
vision algorithm must be able to infer more information than a single grasp point to
allow the robot to plan and execute such a manipulation task.
In this chapter, we focus on the problem of manipulation in novel environments
where a detailed 3D model of the object is not available. We note that objects, such
as door handles, vary significantly in appearance, yet similar function implies similar
form. Therefore, we will design vision-based learning algorithms that attempt to
capture the visual features shared across different objects that have similar functions.
To perform a manipulation task, such as opening a door, the robot end-effector must
move through a series of way-points in Cartesian space, while achieving the desired
orientation at each way-point, in order to turn the handle and open the door. A
small number of keypoints such as the handle’s axis of rotation and size provides
sufficient information to compute such a trajectory. We use vision to identify these
“visual keypoints,” which are required to infer the actions needed to perform the
task. To open doors or elevators, there are various types of actions a robot can
perform; the appropriate set of actions depends on the type of control object the
robot must manipulate, e.g., left/right turn door handle, spherical doorknob, push-
bar door handle, elevator button, etc. Our algorithm learns the visual features that
indicate the appropriate type of control action to use.
Finally, to demonstrate the robustness of our algorithms, we provide results from
extensive experiments on 20 different doors in which the robot was able to reliably
open new doors in new buildings, even ones which were seen for the first time by the
robot (and the researchers working on the algorithm).
6.6.1 Related work
There has been a significant amount of work done in robot navigation [174]. Many
of these works use a SLAM-like algorithm with a laser scanner for robot navigation. Some
of these works have, in fact, even identified doors [34, 184, 166, 5, 1]. However, all of
these works assumed a known map of the environment (where they could annotate
doors); and more importantly none of them considered the problem of enabling a
robot to autonomously open doors.
In robotic manipulation, most work has focused on developing control actions for
different tasks, such as grasping objects [7], assuming a perfect knowledge of the envi-
ronment (in the form of a detailed 3D model). Recently, some researchers have started
using vision-based algorithms for some applications, e.g. [69]. These algorithms do
not apply to manipulation problems where one needs to estimate a full trajectory of
the robot and also consider interactions with the object being manipulated.
There has been some recent work in opening doors using manipulators [128, 108,
62, 122, 110, 55]; however, these works focused on developing control actions assuming
a pre-surveyed location of a known door handle. In addition, these works implicitly
assumed some knowledge of the type of door handle, since a turn lever door handle
must be grasped and manipulated differently than a spherical door knob or a push-bar
door.
Little work has been done in designing autonomous elevator-operating robots.
Notably, [91] demonstrated their robot navigating to different floors using an elevator,
but their training phase (which requires that a human must point out where the
appropriate buttons are and the actions to take for a given context) used the same
elevator as the one used in their test demonstration. Other researchers have addressed
the problem of robots navigating in elevators by simply having the robot stand and
wait until the door opens and then ask a human to press the correct floor button [29].
Table 6.5: Visual keypoints for some manipulation tasks.
Manipulation task           Visual keypoints
Turn a door handle          1. Location of the handle   2. Its axis of rotation
                            3. Length of the handle     4. Type (left-turn, right-turn, etc.)
Press an elevator button    1. Location of the button   2. Normal to the surface
Open a dishwasher tray      1. Location of the tray     2. Direction to pull or push it
In the application of elevator-operating robots, some robots have been deployed
in places such as hospitals [130, 30]. However, expensive modifications must be made
to the elevators, so that the robot can use wireless communication to command the
elevator. For opening doors, one can also envision installing automatic doors, but our
work removes the need to make these expensive changes to the many elevators and
doors in a typical building.
In contrast to many of these previous works, our work does not assume existence
of a known model of the object (such as the door, door handle, or elevator button) or
a precise knowledge of the location of the object. Instead, we focus on the problem of
manipulation in novel environments, in which a model of the objects is not available,
and one needs to rely on noisy sensor data to identify visual keypoints. Some of
these keypoints need to be determined with high accuracy for successful manipulation
(especially in the case of elevator buttons).
6.6.2 Algorithm
Consider the task of pressing an elevator button. If our perception algorithm is able
to infer the location of the button and a direction to exert force in, then one can
design a control strategy to press it. Similarly, in the task of pulling a drawer, our
perception algorithm needs to infer the location of a point to grasp (e.g., a knob or a
handle) and a direction to pull. In the task of turning a door handle, our perception
algorithm needs to infer the size of the door handle, the location of its axis, and a
direction to push, pull or rotate.
More generally, for many manipulation tasks, the perception algorithm needs to
identify a set of properties, or “visual keypoints” which define the action to be taken.
Given these visual keypoints, we use a planning algorithm that considers the kinemat-
ics of the robot and the obstacles in the scene, to plan a sequence of control actions
for the robot to carry out the manipulation task.
Dividing a manipulation task into these two parts: (a) an algorithm to identify
visual keypoints, and (b) an algorithm to plan a sequence of control actions, allows
us to easily extend the algorithm to new manipulation tasks, such as opening a
dishwasher. To open a dishwasher tray, the visual keypoints would be the location of
the tray and the desired direction to move it. This division acts as a bridge between
state of the art methods developed in computer vision and the methods developed in
robotics planning and control.
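As an illustration of this division, the perception output can be packaged as a small set of keypoints that the planner consumes. The field names and the stub planner below are illustrative, not part of our system; a real implementation would hand the keypoints to the PRM planner described later.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VisualKeypoints:
    """Illustrative container for the perception output of one task."""
    task: str                              # e.g. "turn_door_handle"
    location: Tuple[float, float, float]   # 3D position of the object
    axis: Tuple[float, float, float] = (0.0, 0.0, 1.0)  # axis of rotation
    action_type: str = "press"             # left-turn, right-turn, press, ...

def plan_waypoints(kp: VisualKeypoints) -> List[Tuple[float, float, float]]:
    """Stub planner: returns an approach point followed by the target.
    A real system would plan a collision-free path between them."""
    x, y, z = kp.location
    approach = (x - 0.10, y, z)            # back off 10 cm along the x axis
    return [approach, kp.location]

kp = VisualKeypoints("press_elevator_button", (1.0, 0.2, 1.3))
print(plan_waypoints(kp))
```

The point of the interface is that new tasks (e.g., opening a dishwasher tray) only require defining new keypoints, not a new planner.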
6.6.3 Identifying visual keypoints
Objects such as door handles vary significantly in appearance, yet similar function
implies similar form. Our learning algorithms will, therefore, try to capture the visual
features that are shared across different objects having similar function.
In this chapter, the tasks we consider require the perception algorithm to: (a) locate
the object, (b) identify the particular sub-category of object (e.g., we consider door
handles of left-turn or right-turn types), and (c) identify some properties such as the
surface normal or door handle axis of rotation. An estimate of the surface normal
helps indicate a direction to push or pull, and an estimate of the door handle’s axis
of rotation helps in determining the action to be performed by the arm.
In the field of computer vision, a number of algorithms have been developed that
achieve good performance on tasks such as object recognition [160, 132]. Perception
for robotic manipulation, however, goes beyond object recognition in that the robot
not only needs to locate the object but also needs to understand what task the object
can perform and how to manipulate it to perform that task. For example, if the
intention of the robot is to enter a door, it must determine the type of door handle
(i.e., left-turn or right-turn) and an estimate of its size and axis of rotation, in order
to compute the appropriate action (i.e., to turn the door handle left and push/pull).
Manipulation tasks also typically require more accuracy than what is currently
possible with most classifiers. For example, to press an elevator button, the 3D loca-
tion of the button must be determined within a few millimeters (which corresponds to
a few pixels in the image), or the robot will fail to press the button. Finally, another
challenge in designing perception algorithms is that different sensors are suitable for
different perception tasks. For example, a laser range finder is more suitable for build-
ing a map for navigation, but a 2D camera is a better sensor for finding the location
of the door handle. We will first describe our image-based classifier.
Object Recognition
To capture the visual features that remain consistent across objects of similar function
(and hence appearance), we start with a 2D sliding window classifier. We use a
supervised learning algorithm that employs boosting to compute a dictionary of Haar
features.
In detail, the supervised training procedure first randomly selects ten small win-
dows to produce a dictionary of Haar features [176]. In each iteration, it trains decision
trees using these features to produce a model while removing irrelevant features from
the dictionary.
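A minimal sketch of the Haar-feature machinery underlying such a classifier, using an integral image for constant-time box sums. The window coordinates and the two-rectangle feature are illustrative; the actual dictionary is selected by boosting as described above.

```python
import numpy as np

def integral_image(img):
    """Summed-area table, padded so ii[r, c] = sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_two_rect(ii, r, c, h, w):
    """Two-rectangle Haar feature: left half minus right half."""
    return box_sum(ii, r, c, r + h, c + w // 2) - \
           box_sum(ii, r, c + w // 2, r + h, c + w)

# toy usage: bright left half, dark right half -> large positive response
img = np.hstack([np.ones((8, 4)), np.zeros((8, 4))])
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 8, 8))   # 32.0
```

Because every feature is two box sums, evaluating thousands of candidate windows per image remains cheap, which is what makes the sliding-window search practical.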
There are a number of contextual properties that we take advantage of to improve
the classification accuracy. Proximity of objects to each other and spatial cues, such
Figure 6.17: Results on the test set. The green rectangles show the raw output from the classifiers, and the blue rectangle is the one after applying context.
as that a door handle is less likely to be found close to the floor, can be used to learn
a location based prior (partly motivated by [176]).
We experimented with several techniques to capture the fact that the labels (i.e.,
the category of the object found) have correlation. An algorithm that simply uses
non-maximal suppression of overlapping windows for choosing the best candidate
locations resulted in many false positives—12.2% on the training set and 15.6% on
the test set. Thus, we implemented an approach that takes advantage of the context
for the particular objects we are trying to identify. For example, we know that doors
(and elevator call panels) will always contain at least one handle (or button) and
never more than two handles. We can also expect that if there are two objects, they
will lie in close proximity to each other and they will likely be horizontally aligned (in
the case of door handles) or vertically aligned (in the case of elevator call buttons).
This approach resulted in much better recognition accuracy.
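The combination of non-maximal suppression with these pairwise context constraints can be sketched as follows. The overlap threshold, the alignment tolerance, and the anchoring on the best-scoring detection are illustrative stand-ins for the constraints described above.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms_with_context(dets, iou_thresh=0.3, max_objects=2, y_align=20):
    """Greedy NMS, then keep at most `max_objects` detections that are
    roughly horizontally aligned (door handles lie at similar heights)."""
    dets = sorted(dets, key=lambda d: -d[1])      # (box, score), best first
    kept = []
    for box, score in dets:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    # context: anchor on the best detection's vertical position
    top = kept[0]
    aligned = [d for d in kept if abs(d[0][1] - top[0][1]) <= y_align]
    return aligned[:max_objects]

dets = [((10, 100, 40, 120), 0.9),    # true handle
        ((12, 102, 42, 122), 0.7),    # overlapping duplicate
        ((200, 103, 230, 123), 0.6),  # second handle, same height
        ((50, 300, 80, 320), 0.8)]    # false positive near the floor
print(nms_with_context(dets))
```

In this toy example the duplicate is suppressed by overlap and the near-floor false positive is rejected by the alignment constraint, leaving the two aligned handles.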
Figure 6.17 shows some of the door handles identified using our algorithm. Table 6.6
Table 6.6: Accuracies for recognition and localization.
Recognition Localization
Door handle 94.5% 93.2%
Elevator buttons 92.1% 91.5%
shows the recognition and localization accuracies. “Localization accuracy” is com-
puted by assigning a value of 1 to a case where the estimated location of the door
handle or elevator button was within 2 cm of the correct location and 0 otherwise. An
error of more than 2 cm would cause the robot arm to fail to grasp the door handle
(or push the elevator button) and open the door.
Once the robot has identified the location of the object in an image, it needs to
identify the object type and infer control actions from the object properties to know
how to manipulate it. Given a rectangular patch containing an object, we classify
what action to take. In our experiments, we considered three types of actions: turn
left, turn right, and press. The accuracy of the classifier that distinguishes left-turn
from right-turn handles was 97.3%.
Estimates from 3D data
Our object recognition algorithms give a 2D location in the image for the visual
keypoints. However, we need their corresponding 3D locations to be able to plan a
path for the robot.
In particular, once an approximate location of the door handle and its type is
identified, we use 3D data from the stereo camera to estimate the axis of rotation
of the door handle. Since the axis of a right-turn (left-turn) door handle is the left-most
(right-most) 3D point on the handle, we build a logistic classifier for the door axis using
two features—distance of the point from the door and its distance from the center
(towards left or right). Similarly, we use PCA on the local 3D point cloud to estimate
the orientation of the surface—required in cases such as elevator buttons and doors
for identifying the direction to apply force.
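The PCA step for estimating a surface orientation from a local point cloud can be sketched as follows: the normal is the eigenvector of the covariance matrix with the smallest eigenvalue, i.e., the direction of least variance. The toy point cloud below is synthetic.

```python
import numpy as np

def surface_normal(points):
    """Estimate the surface normal of a local 3D point cloud as the
    eigenvector of the covariance matrix with the smallest eigenvalue."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, 0]                     # smallest-variance direction

# toy usage: noisy points on the plane z = 0 -> normal close to (0, 0, 1)
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-1, 1, 200),
                       rng.uniform(-1, 1, 200),
                       rng.normal(0, 0.01, 200)])
n = surface_normal(pts)
print(np.abs(n))   # close to [0, 0, 1] up to sign
```

The sign of the normal is ambiguous; in practice it is disambiguated by orienting it toward the sensor.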
However, the data obtained from a stereo sensor is often noisy and sparse in that
the stereo sensor fails to give depth measurements when the areas considered are
textureless, e.g., blank elevator walls [146]. Therefore, we also present a method
to fuse the 2D image location (inferred by our object recognition algorithm), with a
horizontal laser scanner (available on many mobile robots) to obtain the 3D location of
the object. Here, we make a ground-vertical assumption—that every door is vertical
to a ground-plane [26].9 This enables our approach to be used on robots that do not
have a 3D sensor, such as a stereo camera (which is often more expensive).
Planning and Control
Given the visual keypoints and the goal, we build upon a Probabilistic RoadMap
(PRM) [159] motion planning algorithm for obtaining a smooth, collision-free path
for the robot to execute.
6.6.4 Experiments
We used a Voronoi-based global planner for navigation; this enabled the robot to
localize itself in front of and facing a door within 20cm and ±20 degrees. An experi-
ment began with the robot starting at a random location within 3m of the door. It
used lasers to navigate to the door, and our vision-based classifiers to find the handle.
In the experiments, our robot saw all of our test locations for the first time. The
training images for our vision-based learning algorithm were collected in completely
separate buildings, with different doors and door handle shapes, structure, decoration,
ambient lighting, etc. We tested our algorithm on two different buildings on a total
9. In detail, a location in the image corresponds to a ray in 3D, which intersects the plane in which the door lies. Let the planar laser readings be denoted l_i = (x(θ_i), y(θ_i)). Let the origin of the camera be at c ∈ R^3 in the arm's frame, and let r ∈ R^3 be the unit ray passing from the camera center through the predicted location of the door handle in the image plane; i.e., in the robot frame, the door handle lies on the line connecting c and c + r.
Let T ∈ R^{2×3} be the projection matrix that maps 3D points in the arm frame into the plane of the laser. In the laser plane, therefore, the door handle is likely to lie on a line passing through Tc and T(c + r). We choose

    t* = arg min_t  Σ_{i ∈ Ψ} || T(c + r t) − l_i ||_2^2        (6.6)

where Ψ is a small neighborhood around the ray r. The location of the 3D point to move the end-effector to is then given by s = c + r t*.
6.7. DISCUSSION 173
Table 6.7: Error rates obtained for the robot opening the door in a total of 34 trials.

Door type   Num. of trials   Recog. (%)   Class. (%)   Localization (cm)   Success rate
Left        19               89.5         94.7         2.3                 84.2%
Right       15               100          100          2.0                 100%
Total       34               94.1         97.1         2.2                 91.2%
of five different floors (about 20 different doors). Many of the test cases were also run
where the robot localized at different angles, typically between -30 and +30 degrees
with respect to the door, to verify the robustness of our algorithms.
In a total of 34 experiments, our robot was able to successfully open the doors
31 out of 34 times. Table 6.7 details the results; we achieved an average recognition
accuracy of 94.1% and a classification accuracy of 97.1%. We define the localization
error as the mean error (in cm) between the predicted and actual location of the door
handle. This led to a success-rate (fraction of times the robot actually opened the
door) of 91.2%. Notable failures among the test cases included glass doors (erroneous
laser readings), doors with numeric keypads, and very dim/poor lighting conditions.
In future work, we plan to use active vision [57], which takes visual feedback into
account while objects are being manipulated and thus provides complementary infor-
mation that would hopefully improve the performance on these tasks.
Videos of the robot opening new doors are available at:
http://ai.stanford.edu/~asaxena/openingnewdoors
6.7 Discussion
We presented a learning algorithm that, given partial models of the scene from 3D
sensors, selects a grasp—that is, a configuration of the robot's arm and fingers—to
try to pick up an object. We tested our approach on a robot grasping a number of
objects in cluttered backgrounds.
We also presented a novel gripper-mounted optical proximity sensor that enables
stable grasping of common objects using pre-touch pose estimation. We
Figure 6.18: Some experimental snapshots showing our robot opening different typesof doors.
converted the data provided by these sensors into pose estimates of nearby surfaces
by a probabilistic model that combines observations over a wide range of finger/hand
configurations. We validated the system with experiments using a Barrett hand,
which showed improvements in reactive grasping.
To navigate and perform tasks in unstructured environments, robots must be able
to perceive their environments to identify what objects to manipulate and how they
can be manipulated to perform the desired tasks. We presented a framework that
identifies some visual keypoints using our vision-based learning algorithms. Our robot
was then able to use these keypoints to plan and execute a path to perform the desired
task. This strategy enabled our robot to open a variety of doors, even ones it had
not seen before.
Chapter 7
Conclusions
One of the most basic perception abilities for a robot is perceiving the 3D shape of
an environment. This is also useful in a number of robotics applications such as in
navigation and in manipulation.
In this dissertation, we described learning algorithms for depth perception from
a single still image. The problem of depth perception is fundamental to computer
vision, one that has enjoyed the attention of many researchers and seen significant
progress in the last few decades. However, the vast majority of this work, such as
stereopsis, has used multiple image geometric cues to infer depth. In contrast, single-
image cues offer a largely orthogonal source of information, one that had heretofore
been relatively underexploited.
We proposed an algorithm for single image depth perception that used a machine
learning technique called Markov Random Field (MRF). It successfully modeled the
spatial dependency between different regions in the image. The estimated 3D models
were both quantitatively accurate and visually pleasing. Further, our algorithm was
able to accurately estimate depths from images of a wide variety of environments.
We also studied the performance of our algorithm through a live deployment in the
form of a web service called Make3D. Tens of thousands of users successfully used it
to automatically convert their own pictures into 3D models.
Given that depth and shape perception appears to be an important building block
for many other applications, such as object recognition, image compositing and video
retrieval, we believe that monocular depth perception has the potential to improve
all of these applications, particularly in settings where only a single image of a scene
is available.
One such application that we described in this dissertation was to improve the
performance of 3D reconstruction from multiple views. In particular, we considered
stereo vision, where the depths are estimated by triangulation using two images. Over
the past few decades, researchers have developed very good stereo vision systems. Al-
though these systems work well in many environments, stereo vision is fundamentally
limited by the baseline distance between the two cameras. We found that monocular
cues and (purely geometric) stereo cues give largely orthogonal, and therefore com-
plementary, types of information about depth. Stereo cues are based on the difference
between two images and do not depend on the content of the image. On the other
hand, depth estimates from monocular cues are entirely based on the evidence about
the environment presented in a single image.
We investigated how monocular cues can be integrated with any reasonable stereo
system, to obtain better depth estimates than the stereo system alone. Our learning
algorithm (again using Markov Random Fields) modeled the depth as a function of
both single image features as well as the depths obtained from stereo triangulation if
available, and combined them to obtain better depth estimates. We then also combined
3D models from multiple views (which are not necessarily from a calibrated stereo
pair) into a larger 3D model.
We applied these ideas of depth perception to a challenging problem of obstacle
avoidance for a small robot navigating through a cluttered environment. Small-sized
robots have severe payload restrictions on weight and power supply, so only small,
light-weight sensors that consume little power can be used.
Cameras are an attractive option because they are small, passive sensors with low
power requirements.
We used supervised learning to learn monocular depth estimates (distances to
obstacles in each steering direction) from a single image captured by the onboard
camera, and used them to drive an RC car at high speeds through the environment.
The vision system was able to give reasonable depth estimate on real image data,
and a control policy trained in a graphical simulator worked well on real autonomous
driving. Our method was able to avoid obstacles and drive autonomously in novel
environments in various locations.
We then considered the problem of grasping novel objects. Grasping is an impor-
tant ability for robots and is necessary in many tasks, such as for household robots
cleaning up a room or unloading items from a dishwasher. Most of the prior work
had assumed that a known (and registered) full 3D model of the object is available.
In reality, it is quite hard to obtain such a model. We presented a supervised learning
algorithm that predicts, directly as a function of the images, a point at which to grasp
the object. In our experiments, the algorithm generalized very well to novel objects
and environments, and our robot successfully grasped a wide variety of objects in
different environments, such as dishwashers, offices, and kitchens.
Next, we considered more complex arms and hands (e.g., multi-fingered hands),
where one has to not only choose a 3D position at which to grasp the object, but also
address the high degree-of-freedom problem of choosing the positions of all the fingers. This
is further complicated by presence of significant amounts of clutter. A key challenge
in this setting is that a full 3D model of the scene is typically not available. Instead,
a robot’s sensors (such as a camera or a depth sensor) can usually estimate only the
shape of the visible portions of the scene. We presented an algorithm that, given such
partial models of the scene, selects a grasp—that is, a configuration of the robot’s
arm and fingers—to try to pick up an object.
In order to improve performance of grasping, we also presented a novel gripper-
mounted optical proximity sensor that enabled stable grasping of common objects
using pre-touch pose estimation. These sensors are a nice complement to the long-
range vision sensors (such as the cameras).
A household robot not only needs to pick up objects, but it also needs to navigate
and perform other manipulation tasks. One such manipulation task is the task of
opening new doors. We presented a method that identifies some visual keypoints
using vision. This enabled our robot to navigate to places inaccessible before by
opening doors, even the ones it had not seen before.
Bibliography
[1] D. Anguelov, D. Koller, E. Parker, and S. Thrun. Detecting and modeling doors
with mobile robots. In ICRA, 2004.
[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis.
SCAPE: shape completion and animation of people. ACM Trans. Graph., 24
(3):408–416, 2005. ISSN 0730-0301.
[3] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow
techniques. IJCV, 12:43–77, 1994.
[4] H. Bay, T. Tuytelaars, and L.V. Gool. Surf: Speeded up robust features. In
European Conference on Computer Vision (ECCV), 2006.
[5] P. Biber and T. Duckett. Dynamic maps for long-term operation of mobile
service robots. In RSS, 2005.
[6] A. Bicchi. On the closure properties of robotic grasping. IJRR, 14(4):319–334,
1995.
[7] A. Bicchi and V. Kumar. Robotic grasping and contact: a review. In Interna-
tional Conference on Robotics and Automation (ICRA), 2000.
[8] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[9] I. Bülthoff, H. Bülthoff, and P. Sinha. Top-down influences on stereoscopic
depth perception. Nature Neuroscience, 1:254–257, 1998.
BIBLIOGRAPHY 179
[10] A. Bonen. Development of a robust electro-optical proximity-sensing system.
Ph.D. Dissertation, Univ. Toronto, 1995.
[11] T. G. R. Bower, J. M. Broughton, and M. K. Moore. Demonstration of intention
in the reaching behaviour of neonate humans. Nature, 228:679–681, 1970.
[12] D.L. Bowers and R. Lumia. Manipulation of unmodeled objects using intelligent
grasping schemes. IEEE Transactions on Fuzzy Systems, 11(3), 2003.
[13] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
[14] M.Z. Brown, D. Burschka, and G.D. Hager. Advances in computational stereo.
IEEE Trans on Pattern Analysis and Machine Intelligence (PAMI), 25(8):993–
1008, 2003. ISSN 0162-8828.
[15] C. Paul, X. Wang, M. Kelm, and A. McCallum. Multi-conditional learning for
joint probability models with latent variables. In NIPS Workshop on Advances in
Structured Learning for Text and Speech Processing, 2006.
[16] U. Castiello. The neuroscience of grasping. Nature Reviews Neuroscience, 6:
726–736, 2005.
[17] E. Chinellato, R.B. Fisher, A. Morales, and A.P. del Pobil. Ranking planar
grasp configurations for a three-finger hand. In ICRA, 2003.
[18] M. Ciocarlie, C. Goldfeder, and P. Allen. Dimensionality reduction for hand-
independent dexterous robotic grasping. In IROS, 2007.
[19] J. Coelho, J. Piater, and R. Grupen. Developing haptic and visual perceptual
categories for reaching and grasping with a humanoid robot. Robotics and
Autonomous Systems, 37:195–218, 2001.
[20] N. Cornelis, B. Leibe, K. Cornelis, and L. Van Gool. 3d city modeling using
cognitive loops. In Video Proceedings of CVPR (VPCVPR), 2006.
[21] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. International
Journal of Computer Vision (IJCV), 40:123–148, 2000.
[22] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
Computer Vision and Pattern Recognition (CVPR), 2005.
[23] S. Das and N. Ahuja. Performance analysis of stereo, vergence, and focus as
depth cues for active vision. IEEE Trans Pattern Analysis & Machine Intelli-
gence (IEEE-PAMI), 17:1213–1219, 1995.
[24] E.R. Davies. Laws’ texture energy in texture. In Machine Vision: Theory,
Algorithms, Practicalities 2nd Edition. Academic Press, San Diego, 1997.
[25] E. Delage, H. Lee, and A.Y. Ng. Automatic single-image 3d reconstructions
of indoor Manhattan world scenes. In International Symposium on Robotics
Research (ISRR), 2005.
[26] E. Delage, H. Lee, and A.Y. Ng. A dynamic Bayesian network model for au-
tonomous 3d reconstruction from a single indoor image. In Computer Vision
and Pattern Recognition (CVPR), 2006.
[27] J. Diebel and S. Thrun. An application of Markov random fields to range
sensing. In NIPS, 2005.
[28] A. Edsinger and C. Kemp. Manipulation in human environments. In Int’l Conf
Humanoid Robotics, 2006.
[29] R. Simmons et al. GRACE and GEORGE: Autonomous robots for the AAAI
robot challenge. In AAAI Robot Challenge, 2004.
[30] J.M. Evans. Helpmate: an autonomous mobile robot courier for hospitals. In
IROS, 1992.
[31] R. Ewerth, M. Schwalb, and B. Freisleben. Using depth features to retrieve
monocular video shots. In ACM International Conference on Image and Video
Retrieval, 2007.
[32] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmenta-
tion. IJCV, 59, 2004.
[33] D.A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice
Hall, 2003.
[34] D. Fox, W. Burgard, and S. Thrun. Markov localization for mobile robots in
dynamic environments. JAIR, 1999.
[35] C. Frueh and A. Zakhor. Constructing 3D city models by merging ground-based
and airborne views. In Computer Vision and Pattern Recognition (CVPR),
2003.
[36] G. Gini and A. Marchi. Indoor robot navigation with single camera vision. In
PRIS, 2002.
[37] A. S. Glassner. An Introduction to Ray Tracing. Morgan Kaufmann Publishers,
Inc., San Francisco, 1989.
[38] M. A. Goodale, A. D. Milner, L. S. Jakobson, and D. P. Carey. A neurological
dissociation between perceiving objects and grasping them. Nature, 349:154–
156, 1991.
[39] S. Gould, J. Arfvidsson, A. Kaehler, B. Sapp, M. Meissner, G. Bradski,
P. Baumstarck, S. Chung, and A. Y. Ng. Peripheral-foveal vision for real-time
object recognition and tracking in video. In International Joint Conference on
Artificial Intelligence (IJCAI), 2007.
[40] F. Han and S.C. Zhu. Bayesian reconstruction of 3d shapes and scenes from
a single image. In ICCV Workshop on Higher-Level Knowledge in 3D Modeling
and Motion Analysis, 2003.
[41] T. Hassner and R. Basri. Example based 3d reconstruction from single 2d
images. In CVPR workshop on Beyond Patches, 2006.
[42] X. He, R.S. Zemel, and M.A. Carreira-Perpinan. Multiscale conditional random
fields for image labeling. In CVPR, 2004.
[43] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models:
Combining models for holistic scene understanding. In Neural Information
Processing Systems (NIPS), 2008.
[44] A. Hertzmann and S.M. Seitz. Example-based photometric stereo: Shape re-
construction with general, varying brdfs. IEEE Trans Pattern Analysis and
Machine Intelligence (PAMI), 27(8):1254–1264, 2005.
[45] D. Hoiem, A.A. Efros, and M. Hebert. Putting objects in perspective. In
Computer Vision and Pattern Recognition (CVPR), 2006.
[46] D. Hoiem, A.A. Efros, and M. Hebert. Geometric context from a single image.
In International Conference on Computer Vision (ICCV), 2005.
[47] J. Hong, G. Lafferiere, B. Mishra, and X. Tan. Fine manipulation with multi-
finger hands. In ICRA, 1990.
[48] K. Hsiao, L. Kaelbling, and T. Lozano-Perez. Grasping POMDPs. In Interna-
tional Conference on Robotics and Automation (ICRA), 2007.
[49] K. Hsiao and T. Lozano-Perez. Imitation learning of whole-body grasps. In
IEEE/RJS International Conference on Intelligent Robots and Systems (IROS),
2006.
[50] K. Hsiao, T. Lozano-Perez, and L. P. Kaelbling. Robust belief-based execution
of manipulation programs. In WAFR, 2008.
[51] K. Hsiao, P. Nangeroni, M. Huber, A. Saxena, and A.Y. Ng. Reactive grasping
using optical proximity sensors. In International Conference on Robotics and
Automation (ICRA), 2009.
[52] J. Huang, A.B. Lee, and D. Mumford. Statistics of range images. Computer
Vision and Pattern Recognition (CVPR), 2000.
[53] P.J. Huber. Robust Statistics. Wiley, New York, 1981.
[54] M. Hueser, T. Baier, and J. Zhang. Learning of demonstrated grasping skills by
stereoscopic tracking of human hand configuration. In International Conference
on Robotics and Automation (ICRA), 2006.
[55] A. Jain and C.C. Kemp. Behaviors for robust door opening and doorway traver-
sal with a force-sensing mobile manipulator. In RSS Workshop on Robot Ma-
nipulation: Intelligence in Human Environments, 2008.
[56] I. Kamon, T. Flash, and S. Edelman. Learning to grasp using visual information.
In International Conference on Robotics and Automation (ICRA), 1996.
[57] D. Katz and O. Brock. Manipulating articulated objects with interactive per-
ception. In ICRA, 2008.
[58] M. Kawakita, K. Iizuka, T. Aida, T. Kurita, and H. Kikuchi. Real-time three-
dimensional video image composition by depth information. In IEICE Elec-
tronics Express, 2004.
[59] M. Kearns and S. Singh. Finite-sample rates of convergence for Q-learning and
indirect methods. In NIPS 11, 1999.
[60] C.C. Kemp, C. Anderson, H. Nguyen, A. Trevor, and Z. Xu. A point-and-click
interface for the real world: Laser designation of objects for mobile manipula-
tion. In HRI, 2008.
[61] O. Khatib. The potential field approach and operational space formulation in
robot control. Adaptive and Learning Systems: Theory and Applications, pages
367–377, 1986.
[62] D. Kim, J.H. Kang, C.S. Hwang, and G.T. Park. Mobile robot for door opening
in a house. LNAI, 3215:596–602, 2004.
[63] M. Kim and W. Uther. Automatic gait optimisation for quadruped robots. In
Proc. Australasian Conf. on Robotics and Automation, pages 1–9, 2003.
[64] E. Klingbeil, A. Saxena, and A.Y. Ng. Learning to open new doors. In Robotics
Science and Systems (RSS) workshop on Robot Manipulation, 2008.
[65] R. Koch, M. Pollefeys, and L. Van Gool. Multi viewpoint stereo from uncalibrated
video sequences. In European Conference on Computer Vision (ECCV), 1998.
[66] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast
quadrupedal locomotion. In Proc. IEEE Int’l Conf. Robotics and Automation,
2004.
[67] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Probabilistic
fusion of stereo with color and contrast for bilayer segmentation. IEEE Pattern
Analysis and Machine Intelligence (PAMI), 28(9):1480–1492, 2006.
[68] S. Konishi and A. Yuille. Statistical cues for domain specific image segmenta-
tion with performance analysis. In Computer Vision and Pattern Recognition
(CVPR), 2000.
[69] D. Kragic and H. I. Christensen. Robust visual servoing. International Journal
of Robotics Research (IJRR), 22:923–939, 2003.
[70] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependen-
cies in natural images. In Neural Information Processing Systems (NIPS) 16,
2003.
[71] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In ICML, 2001.
[72] Y. LeCun. Presentation at Navigation, Locomotion and Articulation workshop.
Washington DC, 2003.
[73] J.W. Li, M.H. Jin, and H. Liu. A new algorithm for three-finger force-closure
grasp of polygonal objects. In ICRA, 2003.
[74] T. Lindeberg and J. Garding. Shape from texture from a multi-scale perspective.
In ICCV, 1993.
[75] Y.H. Liu. Computing n-finger form-closure grasps on polygonal objects. IJRR,
19(2):149–158, 2000.
[76] J.M. Loomis. Looking down is looking up. Nature News and Views, 414:155–
156, 2001.
[77] M. Lourakis and A. Argyros. A generic sparse bundle adjustment C/C++ pack-
age based on the Levenberg-Marquardt algorithm. Technical report, Foundation
for Research and Technology - Hellas, 2006.
[78] Y. Lu, J.Z. Zhang, Q.M.J. Wu, and Z.N. Li. A survey of motion-parallax-based
3-d reconstruction algorithms. IEEE Tran on Systems, Man and Cybernetics,
Part C, 34:532–548, 2004.
[79] A. Maki, M. Watanabe, and C. Wiles. Geotensity: Combining motion and
lighting for 3d surface reconstruction. International Journal of Computer Vision
(IJCV), 48(2):75–90, 2002.
[80] J. Malik and P. Perona. Preattentive texture discrimination with early vision
mechanisms. Journal of the Optical Society of America A, 7(5):923–932, 1990.
[81] J. Malik and R. Rosenholtz. Computing local surface orientation and shape
from texture for curved surfaces. International Journal of Computer Vision
(IJCV), 23(2):149–168, 1997.
[82] X. Markenscoff, L. Ni, and C.H. Papadimitriou. The geometry of grasping.
IJRR, 9(1):61–74, 1990.
[83] J. J. Marotta, T. S. Perrot, D. Nicolle, P. Servos, and M. A. Goodale. Adapting
to monocular vision: grasping with one eye. Experimental Brain Research, 104:107–114, 1995.
[84] D.R. Martin, C.C. Fowlkes, and J. Malik. Learning to detect natural image
boundaries using local brightness, color and texture cues. IEEE Trans Pattern
Analysis and Machine Intelligence, 26, 2004.
[85] M. T. Mason and J. K. Salisbury. Manipulator grasping and pushing opera-
tions. In Robot Hands and the Mechanics of Manipulation. The MIT Press,
Cambridge, MA, 1985.
[86] M.T. Mason and J.K. Salisbury. Robot Hands and the Mechanics of Manipula-
tion. MIT Press, Cambridge, MA, 1985.
[87] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning:
generative/discriminative training for clustering and classification. In AAAI,
2006.
[88] J. Michels, A. Saxena, and A.Y. Ng. High speed obstacle avoidance using
monocular vision and reinforcement learning. In International Conference on
Machine Learning (ICML), 2005.
[89] A. T. Miller, S. Knoop, P. K. Allen, and H. I. Christensen. Automatic grasp
planning using shape primitives. In International Conference on Robotics and
Automation (ICRA), 2003.
[90] B. Mishra, J.T. Schwartz, and M. Sharir. On the existence and synthesis of
multifinger positive grips. Algorithmica, Special Issue: Robotics, 2(4):541–558,
1987.
[91] J. Miura, Y. Yano, K. Iwase, and Y. Shirai. Task model-based interactive
teaching. In IROS, 2004.
[92] T.M. Moldovan, S. Roth, and M.J. Black. Denoising archival films using
a learned Bayesian model. In International Conference on Image Processing
(ICIP), 2006.
[93] A. Morales, E. Chinellato, P.J. Sanz, A.P. del Pobil, and A.H. Fagg. Learning
to predict grasp reliability for a multifinger robot hand by using visual features.
In Int’l Conf AI Soft Comp., 2004.
[94] A. Morales, P. J. Sanz, and A. P. del Pobil. Vision-based computation of three-
finger grasps on unknown planar objects. In IEEE/RSJ Intelligent Robots and
Systems Conference, 2002.
[95] A. Morales, P. J. Sanz, A. P. del Pobil, and A. H. Fagg. An experiment in
constraining vision-based finger contact selection with gripper geometry. In
IEEE/RSJ Intelligent Robots and Systems Conference, 2002.
[96] E.N. Mortensen, H. Deng, and L. Shapiro. A SIFT descriptor with global
context. In Computer Vision and Pattern Recognition (CVPR). 2005.
[97] K. Murphy, A. Torralba, and W.T. Freeman. Using the forest to see the trees:
A graphical model relating features, objects, and scenes. In Neural Information
Processing Systems (NIPS) 16, 2003.
[98] T. Nagai, T. Naruse, M. Ikehara, and A. Kurematsu. HMM-based surface
reconstruction from single images. In Proc IEEE International Conf Image
Processing (ICIP), volume 2, 2002.
[99] S.G. Narasimhan and S.K. Nayar. Shedding light on the weather. In Computer
Vision and Pattern Recognition (CVPR), 2003.
[100] O. Nestares, R. Navarro, J. Portilla, and A. Tabernero. Efficient spatial-domain
implementation of a multiscale image representation based on Gabor functions.
Journal of Electronic Imaging, 7(1):166–173, 1998.
[101] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs
and POMDPs. In Proc. 16th Conf. UAI, 2000.
[102] V.D. Nguyen. Constructing stable force-closure grasps. In ACM Fall joint
computer conference, 1986.
[103] A. Oliva and A. Torralba. Building the gist of a scene: the role of global image
features in recognition. Progress in Brain Research, 155:23–36, 2006.
[104] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set:
A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[105] O. Partaatmadja, B. Benhabib, E. Kaizerman, and M. Q. Dai. A two-
dimensional orientation sensor. Journal of Robotic Systems, 1992.
[106] O. Partaatmadja, B. Benhabib, A. Sun, and A. A. Goldenberg. An electroop-
tical orientation sensor for robotics. Robotics and Automation, IEEE Transac-
tions on, 8:111–119, 1992.
[107] R. Pelossof, A. Miller, P. Allen, and T. Jebara. An svm learning approach
to robotic grasping. In International Conference on Robotics and Automation
(ICRA), 2004.
[108] L. Petersson, D. Austin, and D. Kragic. High-level control of a mobile manip-
ulator for door opening. In IROS, 2000.
[109] A. Petrovskaya, O. Khatib, S. Thrun, and A. Y. Ng. Bayesian estimation
for autonomous object manipulation based on tactile sensors. In International
Conference on Robotics and Automation, 2006.
[110] A. Petrovskaya and A.Y. Ng. Probabilistic mobile manipulation in dynamic
environments, with application to opening doors. In IJCAI, 2007.
[111] G. Petryk and M. Buehler. Dynamic object localization via a proximity sen-
sor network. In Int’l Conf Multisensor Fusion and Integration for Intelligent
Systems, 1996.
[112] G. Petryk and M. Buehler. Robust estimation of pre-contact object trajectories.
In Fifth Symposium on Robot Control, 1997.
[113] J. H. Piater. Learning visual features to predict hand orientations. In ICML
Workshop on Machine Learning of Spatial Knowledge, 2002.
[114] R. Platt, A. H. Fagg, and R. Grupen. Reusing schematic grasping policies.
In IEEE-RAS International Conference on Humanoid Robots, Tsukuba, Japan,
2005.
[115] R. Platt, R. Grupen, and A. Fagg. Improving grasp skills using schema struc-
tured learning. In ICDL, 2006.
[116] N.S. Pollard. Closure and quality equivalence for efficient synthesis of grasps
from examples. IJRR, 23(6), 2004.
[117] M. Pollefeys. Visual modeling with a hand-held camera. International Journal
of Computer Vision (IJCV), 59, 2004.
[118] D. Pomerleau. An autonomous land vehicle in a neural network. In NIPS 1.
Morgan Kaufmann, 1989.
[119] J. Ponce, D. Stam, and B. Faverjon. On computing two-finger force-closure
grasps of curved 2d objects. IJRR, 12(3):263–273, 1993.
[120] J. Ponce, S. Sullivan, A. Sudsang, J.D. Boissonnat, and J.P. Merlet. On com-
puting four-finger equilibrium and force-closure grasps of polyhedral objects.
IJRR, 16(1):11–35, 1997.
[121] J. Porrill, J.P. Frisby, W.J. Adams, and D. Buckley. Robust and optimal use
of information in stereo vision. Nature, 397:63–66, 1999.
[122] M. Prats, P. Sanz, and A.P. del Pobil. Task planning for intelligent robot
manipulation. In IASTED Artificial Intel App, 2007.
[123] D. Prattichizzo and J.C. Trinkle. Springer Handbook of Robotics, pages 671–698.
Springer-Verlag New York, LLC, 2008.
[124] M. Quartulli and M. Datcu. Bayesian model based city reconstruction from
high resolution ISAR data. In IEEE/ISPRS Joint Workshop Remote Sensing
and Data Fusion over Urban Areas, 2001.
[125] M. Quigley. Switchyard. Technical report, Stanford University, 2007. URL
http://ai.stanford.edu/~mquigley/switchyard/.
[126] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger,
R. Wheeler, and A. Y. Ng. ROS: an open-source robot operating system. In
Open-Source Software workshop of ICRA, 2009.
[127] P. P. L. Regtien. Accurate optical proximity detector. Instrumentation and
Measurement Technology Conference, 1990. IMTC-90. Conference Record., 7th
IEEE, pages 141–143, 1990.
[128] C. Rhee, W. Chung, M. Kim, Y. Shim, and H. Lee. Door opening control using
the multi-fingered robotic hand for the indoor service robot. In ICRA, 2004.
[129] S. Kotz, T. Kozubowski, and K. Podgorski. The Laplace Distribution and Gen-
eralizations: A Revisit with Applications to Communications, Economics, En-
gineering, and Finance. Birkhäuser, 2001.
[130] T. Sakai, H. Nakajima, D. Nishimura, H. Uematsu, and K. Yukihiko. Au-
tonomous mobile robot system for delivery in hospital. In MEW Technical
Report, 2005.
[131] J.K. Salisbury and B. Roth. Kinematic and force analysis of articulated hands.
ASME J. Mechanisms, Transmissions, Automat. Design, 105:33–41, 1982.
[132] B. Sapp, A. Saxena, and A.Y. Ng. A fast data collection and augmentation
procedure for object recognition. In AAAI, 2008.
[133] A. Saxena, A. Anand, and A. Mukerjee. Robust facial expression recognition
using spatially localized geometric model. In International Conf Systemics,
Cybernetics and Informatics (ICSCI), 2004.
[134] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular
images. In Neural Information Processing Systems (NIPS) 18, 2005.
[135] A. Saxena, S.H. Chung, and A.Y. Ng. 3-D depth reconstruction from a single
still image. International Journal of Computer Vision (IJCV), 76(2):157–173,
2007.
[136] A. Saxena, J. Driemeyer, J. Kearns, and A.Y. Ng. Robotic grasping of novel
objects. In Neural Information Processing Systems (NIPS) 19, 2006.
[137] A. Saxena, J. Driemeyer, J. Kearns, and A.Y. Ng. Robotic grasping of novel
objects. In NIPS, 2006.
[138] A. Saxena, J. Driemeyer, J. Kearns, C. Osondu, and A. Y. Ng. Learning to grasp
novel objects using vision. In 10th International Symposium of Experimental
Robotics (ISER), 2006.
[139] A. Saxena, J. Driemeyer, and A. Y. Ng. Learning 3-d object orientation from
images. In NIPS workshop on Robotic Challenges for Machine Learning, 2007.
[140] A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using
vision. IJRR, 27:157–173, 2008.
[141] A. Saxena, J. Driemeyer, and A. Y. Ng. Learning 3-d object orientation from
images. In ICRA, 2009.
[142] A. Saxena, A. Gupta, and A. Mukerjee. Non-linear dimensionality reduction
by locally linear isomaps. In Proc 11th Int’l Conf on Neural Information Pro-
cessing, 2004.
[143] A. Saxena, G. Gupta, V. Gerasimov, and S. Ourselin. In use parameter esti-
mation of inertial sensors by detecting multilevel quasi-static states. In KES,
2005.
[144] A. Saxena and A.Y. Ng. Learning sound location from a single microphone. In
ICRA, 2009.
[145] A. Saxena, S. Ray, and R.K. Varma. A novel electric shock protection system
based on contact currents on skin surface. In NPSC, 2002.
[146] A. Saxena, J. Schulte, and A.Y. Ng. Depth estimation using monocular and
stereo cues. In International Joint Conference on Artificial Intelligence (IJCAI),
2007.
[147] A. Saxena and A. Singh. A microprocessor based speech recognizer for isolated
hindi digits. In IEEE ACE, 2002.
[148] A. Saxena, S.G. Srivatsan, V. Saxena, and S. Verma. Bioinspired modification
of polystyryl matrix: Single-step chemical evolution to a moderately conducting
polymer. Chemistry Letters, 33(6):740–741, 2004.
[149] A. Saxena, M. Sun, and A. Y. Ng. Learning 3-d scene structure from a single
still image. In ICCV workshop on 3D Representation for Recognition (3dRR-
07), 2007.
[150] A. Saxena, M. Sun, and A.Y. Ng. 3-d reconstruction from sparse views using
monocular vision. In ICCV workshop on Virtual Representations and Modeling
of Large-scale environments (VRML), 2007.
[151] A. Saxena, M. Sun, and A.Y. Ng. Make3D: Depth perception from a single still
image. In AAAI, 2008.
[152] A. Saxena, M. Sun, and A.Y. Ng. Make3D: Learning 3D scene structure from a
single still image. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence (PAMI), 31(5):824–840, 2009.
[153] A. Saxena, L. Wong, and A.Y. Ng. Learning grasp strategies with partial shape
information. In AAAI, 2008.
[154] A. Saxena, L. Wong, M. Quigley, and A.Y. Ng. A vision-based system for
grasping novel objects in cluttered environments. In ISRR, 2007.
[155] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame
stereo correspondence algorithms. International Journal of Computer Vision
(IJCV), 47, 2002.
[156] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured
light. In Computer Vision and Pattern Recognition (CVPR), 2003.
[157] H. Schneiderman and T. Kanade. Probabilistic modeling of local appearance
and spatial relationships for object recognition. In Computer Vision and Pattern
Recognition (CVPR), 1998.
[158] S.H. Schwartz. Visual Perception 2nd ed. Appleton and Lange, Connecticut,
1999.
[159] F. Schwarzer, M. Saha, and J.-C. Latombe. Adaptive dynamic collision check-
ing for single and multiple articulated robots in complex environments. IEEE
Transactions on Robotics and Automation, 21:338–353, 2005.
[160] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by
visual cortex. In Computer Vision and Pattern Recognition (CVPR), 2005.
[161] K. Shimoga. Robot grasp synthesis: a survey. International Journal of Robotics
Research (IJRR), 15:230–266, 1996.
[162] T. Shin-ichi and M. Satoshi. Living and working with robots. Nipponia, 2000.
[163] J.R. Smith, E. Garcia, R. Wistort, and G. Krishnamoorthy. Electric field imag-
ing pretouch for robotic graspers. In International Conference on Intelligent
Robots and Systems (IROS), 2007.
[164] N. Snavely, S.M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collec-
tions in 3d. ACM SIGGRAPH, 25(3):835–846, 2006.
[165] S.P. Soundararaj, A.K. Sujeeth, and A. Saxena. Autonomous indoor helicopter
flight using a single onboard camera. In IROS, 2009.
[166] C. Stachniss and W. Burgard. Mobile robot mapping and localization in non-
static environments. In AAAI, 2005.
[167] G. Strang and T. Nguyen. Wavelets and filter banks. Wellesley-Cambridge
Press, Wellesley, MA, 1997.
[168] E.B. Sudderth, A. Torralba, W.T. Freeman, and A.S. Willsky. Depth from
familiar objects: A hierarchical model for 3D scenes. In Computer Vision and
Pattern Recognition (CVPR), 2006.
[169] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.
[170] SwissRanger. Mesa Imaging. http://www.mesa-imaging.ch, 2005.
[171] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. In Interna-
tional Conference on Computer Vision (ICCV), 1990.
[172] T. Takahashi, T. Tsuboi, T. Kishida, Y. Kawanami, S. Shimizu, M. Iribe,
T. Fukushima, and M. Fujita. Adaptive grasping by multi fingered hand with
tactile sensor based on robust force and position control. In ICRA, 2008.
[173] J. Tegin and J. Wikander. Tactile sensing in intelligent robotic manipulation -
a review. Industrial Robot: An International Journal, pages 264–271, 2005.
[174] S. Thrun and M. Montemerlo. The GraphSLAM algorithm with applications
to large-scale mapping of urban structures. International Journal of Robotics
Research, 2006.
[175] S. Thrun and B. Wegbreit. Shape from symmetry. In International Conf on
Computer Vision (ICCV), 2005.
[176] A. Torralba. Contextual priming for object detection. International Journal of
Computer Vision, 53(2):161–191, 2003.
[177] A. Torralba and A. Oliva. Depth estimation from image structure. IEEE Trans
Pattern Analysis and Machine Intelligence (PAMI), 24(9):1–13, 2002.
[178] L. Torresani and A. Hertzmann. Automatic non-rigid 3D modeling from video.
In European Conference on Computer Vision (ECCV), 2004.
[179] S. Walker, K. Loewke, M. Fischer, C. Liu, and J. K. Salisbury. An optical fiber
proximity sensor for haptic exploration. ICRA, 2007.
[180] B.A. Wandell. Foundations of Vision. Sinauer Associates, Sunderland, MA,
1995.
[181] A.E. Welchman, A. Deubelius, V. Conrad, H.H. Bülthoff, and Z. Kourtzi. 3D
shape perception from combined depth cues in human visual cortex. Nature
Neuroscience, 8:820–827, 2005.
[182] A.S. Willsky. Multiresolution Markov models for signal and image processing.
Proceedings of the IEEE, 90(8):1396–1458, 2002.
[183] R. Wistort and J.R. Smith. Electric field servoing for robotic manipulation. In
IROS, 2008.
[184] D. Wolf and G. S. Sukhatme. Online simultaneous localization and mapping in
dynamic environments. In ICRA, 2004.
[185] B. Wu, T.L. Ooi, and Z.J. He. Perceiving distance accurately by a directional
process of integrating ground information. Letters to Nature, 428:73–77, 2004.
[186] R. Zhang, P.S. Tsai, J.E. Cryer, and M. Shah. Shape from shading: A survey.
IEEE Trans Pattern Analysis & Machine Intelligence (IEEE-PAMI), 21:690–
706, 1999.
[187] W. Zhao, R. Chellappa, P.J. Phillips, and A. Rosenfeld. Face recognition: A
literature survey. ACM Computing Surveys, 35:399–458, 2003.