Structured Output Prediction and Binary Code
Learning in Computer Vision
by
Guosheng Lin
A thesis submitted in fulfillment of the requirements for the
degree of Doctor of Philosophy
in the
Faculty of Engineering, Computer and Mathematical Sciences
School of Computer Science
December 2014
Declaration
I certify that this work contains no material which has been accepted for the award of
any other degree or diploma in any university or other tertiary institution and, to the
best of my knowledge and belief, contains no material previously published or written
by another person, except where due reference has been made in the text. In addition,
I certify that no part of this work will, in the future, be used in a submission for any
other degree or diploma in any university or other tertiary institution without the prior
approval of the University of Adelaide and where applicable, any partner institution
responsible for the joint-award of this degree.
I give consent to this copy of my thesis when deposited in the University Library, being
made available for loan and photocopying, subject to the provisions of the Copyright
Act 1968.
I also give permission for the digital version of my thesis to be made available on the web,
via the University's digital research repository, the Library catalogue and also through
web search engines, unless permission has been granted by the University to restrict
access for a period of time.
Signed:
Date:
Acknowledgements
Above all, I would like to express my profound gratitude to my principal supervisor Prof.
Chunhua Shen for his continued guidance and strong support throughout my time as
his student. He taught me not only fundamental and advanced knowledge of machine
learning, but also scientific thinking and writing skills. He gave me insightful guidance
and creative suggestions on my research problems. He made himself available to me at
any time for discussing research problems. This thesis would not have been possible
without his guidance.
I would like to thank my co-supervisor Prof. David Suter. He encouraged me to work
on my research interests, helped me develop scientific thinking, inspired me to think
outside the box, and improved my presentation skills. I would also like to thank my
co-supervisor Dr. Tat-Jun Chin. The questions he raised in meetings helped me deliver
clear and convincing presentations of my ideas.
I would like to thank Prof. Anton van den Hengel for his great support and invaluable
collaboration. He gave me important suggestions on presentation, and offered great help with
paper writing and revision. I am also grateful to Dr. Qinfeng (Javen) Shi. His insightful
viewpoints and suggestions helped me improve my research methodologies and paper
writing.
Many thanks go to ACVT researchers (current or previous) for their kindness and
beneficial discussions: Dr. Xi Li, Dr. Sakrapee (Paul) Paisitkriangkrai, Dr. Peng Wang, Dr.
Lingqiao Liu and Dr. Jin Yu. I would also like to thank all of my lab mates, especially
Quoc-Huy Tran, Zhenhua Wang, Rui Yao, Fumin Shen and Zhen Zhang.
Finally, my special appreciation is dedicated to my parents, brother and sister for their
everlasting support, and to my girlfriend with whom I share my success and frustration.
Publications
This thesis is based on the content of the following peer-reviewed conference and journal
papers:
1. Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel and David
Suter; “Fast Supervised Hashing with Decision Trees for High-Dimensional Data”;
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2014.
2. Guosheng Lin, Chunhua Shen, Jianxin Wu;
“Optimizing Ranking Measures for Compact Binary Code Learning”; In Proceedings
of the European Conference on Computer Vision (ECCV), 2014.
3. Chunhua Shen, Guosheng Lin, Anton van den Hengel;
“StructBoost: Boosting Methods For Predicting Structured Output Variables”; IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
4. Guosheng Lin, Chunhua Shen, David Suter and Anton van den Hengel;
“A General Two-Step Approach to Learning-Based Hashing”; In Proceedings of the
International Conference on Computer Vision (ICCV), 2013.
5. Xi Li*, Guosheng Lin*, Chunhua Shen, Anton van den Hengel and Anthony Dick
(* indicates equal contribution);
“Learning Hash Functions Using Column Generation”; In Proceedings of the
International Conference on Machine Learning (ICML), 2013.
6. Guosheng Lin, Chunhua Shen and Anton van den Hengel;
“Approximate Constraint Generation for Efficient Structured Boosting”; In
Proceedings of the International Conference on Image Processing (ICIP), 2013.
7. Guosheng Lin, Chunhua Shen, Anton van den Hengel and David Suter;
“Fast Training of Effective Multi-class Boosting Using Coordinate Descent
Optimization”; In Proceedings of the Asian Conference on Computer Vision (ACCV), 2012.
THE UNIVERSITY OF ADELAIDE
Abstract
Faculty of Engineering, Computer and Mathematical Sciences
School of Computer Science
Doctor of Philosophy
by Guosheng Lin
Machine learning techniques play essential roles in many computer vision applications.
This thesis is dedicated to two types of machine learning techniques that are important
to computer vision: structured learning and binary code learning. Structured learning
predicts complex structured outputs whose components are inter-dependent. Structured
outputs are common in real-world applications; the image segmentation mask is one
example. Binary code learning learns hash functions that map data points to binary
codes. The binary code representation is popular for large-scale similarity search,
indexing and storage. This thesis makes practical and theoretical contributions to
both types of learning techniques.
The first part of this thesis focuses on boosting-based structured output prediction.
Boosting is a family of methods that learn a single accurate predictor by linearly
combining a set of less accurate weak learners. Addressing multi-class classification
as a special case of structured learning, we first propose an efficient boosting method
that can be applied to image classification. Unlike many existing multi-class boosting
methods, we train class-specific weak learners by learning a separate set of weak
learners for each class. We also develop a fast coordinate descent method for solving
the optimization problem, in which each coordinate update has a closed-form solution.
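To make the closed-form coordinate update idea concrete, here is a minimal illustrative sketch on a much simpler objective than the boosting problem above: cyclic coordinate descent for ridge regression, where each one-dimensional subproblem has an exact minimizer. This is not the solver developed in Chapter 3; the function name, objective and parameters are chosen purely for illustration.

```python
import numpy as np

# Illustrative only: cyclic coordinate descent with closed-form updates,
# shown on ridge regression rather than the boosting objective of Chapter 3.
# Minimize f(w) = 0.5*||X w - y||^2 + 0.5*lam*||w||^2; fixing all but one
# coordinate leaves a one-dimensional quadratic with an exact minimizer.
def coordinate_descent_ridge(X, y, lam=1.0, n_sweeps=200):
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w            # maintained incrementally below
    col_sq = (X ** 2).sum(axis=0)   # precomputed squared column norms
    for _ in range(n_sweeps):
        for j in range(d):
            # Exact minimizer over w_j with the other coordinates fixed.
            rho = X[:, j] @ residual + col_sq[j] * w[j]
            w_new = rho / (col_sq[j] + lam)
            residual += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w
```

Because the residual is maintained incrementally, each coordinate step costs only O(n); the appeal of closed-form coordinate updates is that no line search or generic solver is needed.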
For general structured output prediction, we propose a new boosting-based method,
which we refer to as StructBoost. StructBoost supports nonlinear structured learning
by combining a set of weak structured learners, and generalizes standard boosting
approaches such as AdaBoost and LPBoost to structured learning. The resulting
optimization problem is challenging in that it may involve exponentially many
variables and constraints. We develop cutting-plane and column generation based
algorithms to solve the optimization efficiently. We show the versatility and usefulness
of StructBoost on a range of problems, such as optimizing the tree loss for hierarchical
multi-class classification, optimizing the Pascal overlap criterion for robust visual
tracking, and learning conditional random field parameters for image segmentation.
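The cutting-plane idea can be sketched on a toy problem: repeatedly solve a restricted master problem over a small working set of constraints, query a separation oracle for the most violated constraint, and add it to the working set. The sketch below is illustrative only; it minimizes the maximum of a large family of affine functions, which merely stands in for the 1-slack structured objective, and all names are invented for the example.

```python
import numpy as np

# Illustrative cutting-plane loop (not the StructBoost solver): minimize
# g(x) = max_i (a[i]*x + b[i]) over x in [-B, B], where the family of
# affine constraints is too large to enumerate in the master problem.
def solve_restricted(a_w, b_w, B):
    # Exact minimizer of the restricted problem over the working set: a
    # convex piecewise-linear function attains its minimum either at a
    # breakpoint (intersection of two lines) or at the interval boundary.
    candidates = [-B, B]
    for i in range(len(a_w)):
        for j in range(i + 1, len(a_w)):
            if a_w[i] != a_w[j]:
                candidates.append((b_w[j] - b_w[i]) / (a_w[i] - a_w[j]))
    xs = np.clip(np.array(candidates), -B, B)
    vals = np.max(np.outer(xs, a_w) + b_w, axis=1)
    k = int(np.argmin(vals))
    return xs[k], vals[k]

def cutting_plane_minimax(a, b, B=10.0, tol=1e-9, max_iter=300):
    working = [int(np.argmax(b))]        # start from a single constraint
    for _ in range(max_iter):
        x, t = solve_restricted(a[working], b[working], B)
        # Separation oracle: most violated constraint at the current x.
        worst = int(np.argmax(a * x + b))
        if a[worst] * x + b[worst] <= t + tol:
            return x, t                  # no violation: (x, t) is optimal
        working.append(worst)
    return x, t
```

Each iteration either terminates or adds a constraint not yet in the working set, so the loop converges after touching only a small fraction of the constraint family; this is the same mechanism that makes the exponentially constrained problem above tractable.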
The last part of this thesis focuses on hashing methods for binary code learning. We
develop three novel hashing methods which focus on different aspects of binary code
learning. We first present a column generation based hash function learning method
for preserving triplet-based relative similarity. Given a set of triplets that encode
pairwise similarity comparisons, our method learns hash functions within the
large-margin learning framework. At each iteration of the column generation
procedure, the best hash function is selected. We show that our method, with its
triplet-based formulation and large-margin learning, is able to learn high-quality hash
functions.
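For intuition, a triplet (q, p, n) asserts that query q should be closer to p than to n; on binary codes, closeness is Hamming distance. The following sketch, with hypothetical helper names (this is not the column generation optimizer of Chapter 5), evaluates the large-margin hinge penalty over such triplets.

```python
import numpy as np

# Hypothetical helpers, for intuition only. A triplet (q, p, n) states that
# query q should be closer in Hamming distance to p (similar) than to n
# (dissimilar).
def hamming(a, b):
    return int(np.count_nonzero(a != b))

def triplet_hinge_loss(codes, triplets, margin=1):
    # Large-margin style penalty: zero when the similar pair is at least
    # `margin` bits closer than the dissimilar pair, linear otherwise.
    loss = 0.0
    for q, p, n in triplets:
        loss += max(0, margin + hamming(codes[q], codes[p])
                              - hamming(codes[q], codes[n]))
    return loss
```

For example, with codes 000, 001 and 111, the triplet (000, 001, 111) incurs zero loss, since the similar pair differs in 1 bit and the dissimilar pair in 3.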
The second hashing method in this thesis is a flexible and general method with
a two-step learning scheme. Most existing approaches to hashing apply a single form
of hash function, and an optimization process that is typically deeply coupled to this
specific form. This tight coupling restricts the flexibility of the method to respond to the
data, and can result in complex optimization problems that are difficult to solve. We
propose a flexible yet simple framework that is able to accommodate different
types of loss functions and hash functions. This framework allows a number of existing
approaches to hashing to be placed in context, and simplifies the development of new
problem-specific hashing methods. Our framework decomposes the hashing learning
problem into two steps: hash bit learning and hash function learning based on the
learned bits. The first step can typically be formulated as binary quadratic problems,
and the second step can be accomplished by training standard binary classifiers. These
two steps can be easily solved by leveraging sophisticated algorithms in the literature.
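A deliberately simplified sketch of the two-step scheme follows. Step 1 infers binary codes from a pairwise affinity matrix via a spectral relaxation, standing in for the binary quadratic problems mentioned above; Step 2 fits one simple predictor per bit, with least squares followed by a sign standing in for a standard binary classifier. All names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

# A deliberately simplified sketch of the two-step scheme (illustrative
# only, not the actual implementation).
# Step 1: infer codes from a pairwise affinity matrix S (+1 similar,
#   -1 dissimilar) by a spectral relaxation of max_Z tr(Z^T S Z), which
#   stands in for the binary quadratic problems described above.
# Step 2: fit one predictor per bit to reproduce the inferred codes;
#   least squares plus a sign stands in for a binary classifier.
def two_step_hashing(X, S, n_bits=4):
    _, eigvecs = np.linalg.eigh(S)
    Z = np.sign(eigvecs[:, -n_bits:])   # binarize the top eigenvectors
    Z[Z == 0] = 1
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)

    def hash_fn(Xq):                    # out-of-sample hash function
        return np.where(Xq @ W >= 0, 1, -1)

    return Z, hash_fn
```

The point of the decomposition is that Step 2 is plain supervised binary classification, so the classifier family (linear, kernel, decision trees) can be swapped without touching Step 1.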
The third hashing method aims at efficient and effective hash function learning
on large-scale, high-dimensional data, and extends our general two-step hashing
method. Non-linear hash functions have demonstrated their advantage over linear
ones due to their powerful generalization capability. In the literature, kernel functions
are typically used to achieve non-linearity in hashing; they achieve encouraging
retrieval performance at the price of slow training and evaluation. We propose to
use boosted decision trees to achieve non-linearity in hashing; they are fast to train
and evaluate, and hence more suitable for hashing with high-dimensional data. In our
approach, we first propose sub-modular formulations for the binary code inference
problem and an efficient GraphCut based block search method for solving large-scale
inference. We then learn hash functions by training boosted decision trees to fit the
binary codes. We show that our method significantly outperforms most existing
methods in both retrieval precision and training time, especially for high-dimensional
data.
Dedicated to my family.
Contents
Declaration iii
Acknowledgements v
Publications vii
Abstract ix
Contents xiii
List of Figures xvii
List of Tables xxi
1 Introduction 1
1.1 Structured Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Binary Code Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background Literature 9
2.1 Structured Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Structured SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Column generation boosting . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Column generation for multi-class boosting . . . . . . . . . . . . . 15
2.2.2.1 MultiBoost with hinge loss . . . . . . . . . . . . . . . . . 16
2.2.2.2 MultiBoost with exponential loss . . . . . . . . . . . . . . 17
2.3 Binary Code Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Locality-sensitive hashing . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Spectral hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Self-taught hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Supervised hashing with kernel . . . . . . . . . . . . . . . . . . . . 24
3 Fast Training of Effective Multi-class Boosting Using Coordinate Descent 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 The Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Column generation for MultiBoostcw . . . . . . . . . . . . . . . . . 33
3.2.2 Fast coordinate descent . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 UCI datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Handwritten digit recognition . . . . . . . . . . . . . . . . . . . . . 42
3.3.3 Three Image datasets: PASCAL07, LabelMe, CIFAR10 . . . . . . 43
3.3.4 Scene recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.5 Traffic sign recognition . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.6 FCD evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 StructBoost: Boosting Methods for Predicting Structured Output Variables 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.1 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Structured boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 1-slack formulation for fast optimization . . . . . . . . . . . . . . . 56
4.2.2 Cutting-plane optimization for the 1-slack primal . . . . . . . . . . 58
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Examples of StructBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Ordinal regression and AUC optimization . . . . . . . . . . . . . . 61
4.3.3 Multi-class boosting . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Hierarchical classification with taxonomies . . . . . . . . . . . . . . 62
4.3.5 Optimization of the Pascal image overlap criterion . . . . . . . . . 64
4.3.6 CRF parameter learning . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 AUC optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 Multi-class classification . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Hierarchical multi-class classification . . . . . . . . . . . . . . . . . 72
4.4.4 Visual tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.5 CRF parameter learning for image segmentation . . . . . . . . . . 76
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 Learning Hash Functions Using Column Generation 85
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.1 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Learning hash functions with squared hinge loss . . . . . . . . . . 89
5.3 Hashing with general smooth convex loss functions . . . . . . . . . . . . . 93
5.3.1 Hashing with logistic loss . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Hashing with ℓ∞ norm regularization . . . . . . . . . . . . . . . . . . . . 96
5.5 Extension of regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.1 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.2 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.3 Competing methods . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.4 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.5 Quantitative comparison results . . . . . . . . . . . . . . . . . . . . 103
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6 A General Two-Step Approach to Learning-Based Hashing 105
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Two-Step Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3 Solving binary quadratic problems . . . . . . . . . . . . . . . . . . . . . . 109
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.1 Comparing methods . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.2 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.3 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.4 Using different loss functions . . . . . . . . . . . . . . . . . . . . . 124
6.4.5 Training time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.6 Using different hash functions . . . . . . . . . . . . . . . . . . . . . 125
6.4.7 Results on large datasets . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7 Fast Supervised Hashing with Decision Trees for High-Dimensional Data 127
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2.1 Step 1: Binary code inference . . . . . . . . . . . . . . . . . . . . . 132
7.2.2 Step 2: Learning boosted trees as hash functions . . . . . . . . . . 135
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3.1 Comparison with KSH . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3.2 Comparison with TSH . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3.3 Experiments on different features . . . . . . . . . . . . . . . . . . . 145
7.3.4 Comparison with dimension reduction . . . . . . . . . . . . . . . . 147
7.3.5 Comparison with unsupervised methods . . . . . . . . . . . . . . . 148
7.3.6 More features and more bits . . . . . . . . . . . . . . . . . . . . . . 149
7.3.7 Large dataset: SUN397 . . . . . . . . . . . . . . . . . . . . . . . . 150
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8 Conclusion 153
8.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
A Appendix for MultiBoostcw 157
A.1 Dual problem of MultiBoostcw . . . . . . . . . . . . . . . . . . . . . . . . 157
A.2 MultiBoostcw with the hinge loss . . . . . . . . . . . . . . . . . . . . . . . 158
B Appendix for StructBoost 161
B.1 Dual formulation of n-slack . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.2 Dual formulation of 1-slack . . . . . . . . . . . . . . . . . . . . . . . . . . 162
C Appendix for CGHash 165
C.1 Learning hashing functions with the hinge loss . . . . . . . . . . . . . . . 165
C.1.1 Using ℓ1 norm regularization . . . . . . . . . . . . . . . . . . . . 165
C.1.2 Using ℓ∞ norm regularization . . . . . . . . . . . . . . . . . . . 166
Bibliography 169
List of Figures
1.1 An example of the image segmentation task. The first row shows the input images and the second row the segmentation label masks. The label mask is the structured output that we aim to predict, which identifies target objects (cars here) from the background. . . . . . . . . . . . . 2
1.2 An illustration of hashing based similarity preserving . . . . . . . . . . . . 4
1.3 An illustration of image retrieval. The first column shows query images, and the rest are retrieved images in the database. These retrieved images are expected to be semantically similar to the corresponding query images. 4
3.1 Results on 2 UCI datasets: VOWEL and ISOLET. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. The number after the method name is the mean value with standard deviation of the last iteration. Our methods converge much faster and achieve competitive test accuracy. The total training time and the solver time of our methods are both less than those of MultiBoost [1]. . . . . . . . . . . . . 41
3.2 Experiments on 2 handwritten digit recognition datasets: USPS, PENDIGITS. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on all datasets. . . . . . . . . . . . 42
3.3 Experiments on a handwritten digit recognition dataset: MNIST. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on this dataset. . . . . . . . . . . . 43
3.4 Results on a traffic sign dataset: GTSRB. CW and CW-1 (stage-wise setting) are our methods. Our methods converge much faster, achieve the best test error and use less training time. . . . . . . . . . . . . . . . . . . 44
3.5 Experiments on 3 image datasets: PASCAL07, LabelMe and CIFAR10. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Experiments on 2 scene recognition datasets: SCENE15 and a subset of SUN. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Solver comparison between FCD with different parameter settings and LBFGS-B [2]. One column for one dataset. The number after “FCD” is the setting for the maximum iteration (τmax) of FCD. The stage-wise setting of FCD is the fastest one. See the text for details. . . . . . . . . . 48
4.1 The hierarchy structures of two selected subsets of the SUN dataset [3] used in our experiments for hierarchical image classification. . . . . . . . . 63
4.2 Classification with taxonomies (tree loss), corresponding to the first example in Figure 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 AUC optimization on two UCI datasets. The objective values and optimization time are shown in the figure by varying boosting (or column generation) iterations. It shows that 1-slack achieves similar objective values as n-slack but needs less running time. . . . . . . . . . . . . . . . . 70
4.4 Test performance versus the number of boosting iterations of multi-class classification. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. The results of SSVM and SVM are shown as straight lines in the plots. The values shown in the legend are the error rates of the final iteration for each method. Our methods perform better than SSVM in most cases. . . . 71
4.5 Bounding box overlap in frames of several video sequences. Our StructBoost often achieves higher scores of box overlap compared with other trackers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Bounding box center location error in frames of several video sequences. Our StructBoost often achieves lower center location errors compared with other trackers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Some tracking examples for several video sequences: “coke”, “david”, “walk”, “bird” and “tiger2” (best viewed on screen). The output bounding boxes of our StructBoost overlap the ground truth better than those of the compared methods. . . . . . . . . . . . . . . . . . . . . . . . 77
4.8 Some segmentation results on the Graz-02 dataset (car). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 78
4.9 Some segmentation results on the Graz-02 dataset (bicycle). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 79
4.10 Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 80
4.11 Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 81
5.1 Results of precision-recall (using 64 bits). It shows that our CGHash performs the best in most cases. . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Precision of top-50 retrieved examples using different numbers of bits. It shows that our CGHash performs the best in most cases. . . . . . . . . . 99
5.3 Nearest-neighbor classification error using different numbers of bits. It shows that our CGHash performs the best in most cases. . . . . . . . . . 100
5.4 Performance of CGHash using different values of K (K ∈ {3, 10, 20, 30}) on the SCENE-15 dataset. Results of precision-recall (using 60 bits), precision of top-50 retrieved examples and nearest-neighbor classification are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5 Two retrieval examples for CGHash on the LABELME and MNIST datasets. Query examples are shown in the left column, and the retrieved examples are shown in the right part. . . . . . . . . . . . . . . . . . . . . 104
6.1 An illustration of Two-Step Hashing . . . . . . . . . . . . . . . . . . . . . 110
6.2 Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are top 40 retrieved images in the database. False predictions are marked by red boxes. . . . . . . . . . 115
6.3 Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are top 40 retrieved images in the database. False predictions are marked by red boxes. . . . . . . . . . 116
6.4 Results on 2 datasets of supervised methods. Results show that TSH outperforms others, usually by a large margin. The runner-up methods are STHs-RBF and KSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.5 Results on 2 datasets for comparing unsupervised methods. Results show that TSH outperforms others, usually by a large margin. . . . . . . . . . 118
6.6 Results on SCENE15, USPS and ISOLET for comparing with supervised and unsupervised methods. Our TSH performs the best. . . . . . . . . . 119
6.7 Results on 2 datasets of our method using different hash functions. Results show that using kernel hash functions (TSH-RBF and TSH-KF) achieves the best performance. . . . . . . . . . . . . . . . . . . . . . . . . 120
6.8 Code compression time using different hash functions. Results show that using the kernel transferred feature (TSH-KF) is much faster than SVM with the RBF kernel (TSH-RBF). Linear SVM is the fastest one. . . . . 121
6.9 Comparison of supervised methods on 2 large scale datasets: Flickr1M and Tiny580k. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform other methods. . . . . . . . . . . . . . 121
6.10 Comparison of unsupervised methods on 2 large scale datasets: Flickr1M and Tiny580k. The first row shows the results of supervised methods and the second row those of unsupervised methods. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform other methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.1 Some retrieval examples of our method FastHash on CIFAR10. The first column shows query images, and the rest are retrieved images in the database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Some retrieval examples of our method FastHash on ESPGAME. The first column shows query images, and the rest are retrieved images in the database. False predictions are marked by red boxes. . . . . . . . . . . . 138
7.3 Comparison of KSH and our FastHash on all datasets. The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last 2 rows. The number after “KSH” is the number of support vectors. Both our FastHash and FastHash-Full outperform KSH by a large margin. . . . . . . . . . . . . . 141
7.4 Comparison of various combinations of hash functions and binary inference methods. Note that the proposed FastHash uses decision trees as hash functions. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.5 Results on high-dimensional codebook features. The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last 2 rows. Both our FastHash and FastHash-Full outperform their comparators by a large margin. . . . 146
7.6 The retrieval precision results of unsupervised methods. Unsupervised methods perform poorly for preserving label based similarity. Our FastHash outperforms others by a large margin. . . . . . . . . . . . . . . 150
7.7 The precision curve of top 2000 retrieved examples on the large image dataset SUN397 using 1024 bits. Here we compare with those methods which can be efficiently trained up to 1024 bits on the whole training set. Our FastHash outperforms others by a large margin. . . . . . . . . . . . 151
List of Tables
4.1 AUC maximization. We compare the performance of n-slack and 1-slack formulations. “−” means that the method is not able to converge within a memory and time limit. We can see that 1-slack can achieve similar AUC results on training and testing data as n-slack while 1-slack is significantly faster than n-slack. . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Multi-class classification test errors (%) on several UCI and MNIST datasets. 1-v-a SVM is the one-vs-all SVM. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. StructBoost outperforms SSVM in most cases and achieves competitive performance compared with other multi-class classifiers. . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Hierarchical classification. Results of the tree loss and the 1/0 loss (classification error rate) on subsets of the SUN dataset. StructBoost-tree uses the hierarchy class formulation with the tree loss, and StructBoost-flat uses the standard multi-class formulation. StructBoost-tree, which minimizes the tree loss, performs best. . . . . . . . 72
4.4 Average bounding box overlap scores on benchmark videos. Struck50 [4]is structured SVM tracking with a buffer size of 50. Our StructBoostperforms the best in most cases. Struck performs the second best, whichconfirms the usefulness of structured output learning. . . . . . . . . . . . 73
4.5 Average center errors on benchmark videos. Struck50 [4] is structuredSVM tracking with a buffer size of 50. We observe similar results as inTable 4.4: Our StructBoost outperforms others on most sequences, andStruck is the second best. . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6 Image segmentation results on the Graz-02 dataset. The results showthe the pixel accuracy, intersection-union score (including the foregroundand background) and precision = recall value (as in [5]). Our methodStructBoost for nonlinear parameter learning performs better than SSVMand other methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1 Summary of the 6 datasets used in the experiments. . . . . . . . . . . . . 102
6.1 Results (using hash codes of 32 bits) of TSH using different loss functions, and a selection of other supervised and unsupervised methods, on 3 datasets. The upper part reports the results on training data and the lower part on testing data. The results show that Step 1 of our method is able to generate effective binary codes that outperform those of competing methods on the training data. On the testing data our method also outperforms others by a large margin in most cases. . . . 114
6.2 Training time (in seconds) for TSH using different loss functions, and several other supervised methods, on 3 datasets. The value in brackets is the time used in the first step for inferring the binary codes. The results show that our method is efficient. Note that the second step of learning the hash functions can be easily parallelised. . . . 114
7.1 Comparison of KSH and our FastHash. KSH results with different numbers of support vectors are shown. Both our FastHash and FastHash-Full outperform KSH by a large margin in terms of training time, binary encoding time (test time) and retrieval precision. . . . 139
7.2 Comparison of TSH and our FastHash for binary code inference in Step 1. The proposed Block GraphCut (Block-GC) achieves a much lower objective value and takes less inference time than the spectral method, and thus performs much better. . . . 140
7.3 Comparison of combinations of hash functions and binary inference methods. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2. . . . 144
7.4 Comparison of TSH and our FastHash. Results of TSH with the linear SVM and the budgeted RBF kernel [6] hash functions (TSH-BRBF) for Step 2 are presented. Our FastHash outperforms TSH by a large margin in both training speed and retrieval performance. . . . 145
7.5 Results using two types of features: low-dimensional GIST features and high-dimensional codebook features. Our FastHash and FastHash-Full outperform the comparators by a large margin on both feature types. In terms of training time, our FastHash is also much faster than others on the high-dimensional codebook features. . . . 147
7.6 Results of methods with dimension reduction. KSH, SPLH and STHs are trained with PCA feature reduction. Our FastHash outperforms others by a large margin in retrieval performance. . . . 148
7.7 Performance of our FastHash with more features (22400 dimensions) and more bits (1024 bits). It shows that FastHash can be efficiently trained on high-dimensional features with a large bit length. The training and binary coding time (test time) of FastHash increases only linearly with the bit length. . . . 149
7.8 Results on the large image dataset SUN397 using 11200-dimensional codebook features. Our FastHash can be efficiently trained to a large bit length (1024 bits) on this large training set. FastHash outperforms other methods by a large margin in retrieval performance. . . . 149
Chapter 1
Introduction
The general goal of computer vision research is to make computers understand the visual
world. Typical computer vision tasks include image classification, object detection,
image segmentation, visual tracking and image retrieval. Newly emerged tasks include
action recognition, event detection and object retrieval. Machine learning techniques play essential roles in these computer vision tasks. In this thesis we focus on two
types of machine learning techniques: structured learning and binary code learning,
which are important to many computer vision applications.
1.1 Structured Learning
Conventional supervised learning problems, such as classification and regression, aim to learn a function that predicts the best output value y ∈ R for an input vector x ∈ Rd. In many applications, however, the outputs are complex and cannot be well represented by a single scalar; the most appropriate outputs are instead objects (vectors, sequences, trees, etc.) whose components are inter-dependent. Such problems are referred to as structured output prediction.
Let y denote a structured output, and let Y denote the domain of outputs. The structured output y can be any kind of object, and it varies across applications. There are complex interactions/dependencies among the components of the output. These interactions/dependencies within y are usually modeled by directed/undirected graphical models.
For example, in the application of image segmentation, the input x is an image and the
output y is a label matrix (also called a label mask). Each element in y is the label
value of the corresponding pixel in the input image. A simple image segmentation task
Figure 1.1: An example of the image segmentation task. The first row shows the input images and the second row the segmentation label masks. The label mask is the structured output that we aim to predict, which identifies target objects (cars here) from the background.
is shown in Figure 1.1, in which the task is to predict a segmentation label mask. A
label mask identifies target objects from the background.
In structured learning, we aim to find a prediction function g which has a structured
output: y = g(x). This prediction function is assumed to take the following form:
y* = g(x) = argmax_y f(x, y). (1.1)
Here f is a function that measures the consistency of the input and output. The predic-
tion is achieved by maximizing f(x,y) over all possible y ∈ Y. A well-known structured
learning method is Structured SVM (SSVM) [7].
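The prediction rule in (1.1) can be sketched in code. The following is a minimal illustration assuming a toy binary output space small enough to enumerate exhaustively; the compatibility score f here is hypothetical, and real applications replace the enumeration with an efficient inference algorithm such as GraphCut.

```python
from itertools import product

def f(x, y):
    # Hypothetical compatibility score: agreement with a noisy observation x
    # plus a small smoothness bonus for equal neighbouring labels.
    unary = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    pairwise = sum(0.4 for a, b in zip(y, y[1:]) if a == b)
    return unary + pairwise

def predict(x, labels=(0, 1)):
    # y* = g(x) = argmax_y f(x, y), enumerating all |labels|^len(x) outputs.
    return max(product(labels, repeat=len(x)), key=lambda y: f(x, y))

print(predict((1, 0, 1, 1)))  # (1, 0, 1, 1)
```

The exhaustive search makes the argmax explicit; its exponential cost is exactly why structured prediction relies on problem-specific inference algorithms.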
Structured learning is widely applied in computer vision applications, such as image
segmentation [8, 9], object detection [10], visual tracking [11, 12], event detection [13],
and action recognition [14]. The survey of [15] provides a comprehensive review of structured learning and its applications in computer vision.
Most of these structured learning applications learn a linear model for structured output prediction. Non-linear models usually have better generalization performance than linear ones; hence a practical non-linear structured learning method is very desirable. However, non-linear structured learning has received much less attention than linear learning, because learning non-linear models for structured prediction is a great challenge. For example, using kernels in SSVM usually leads to very expensive computation, which greatly limits the applications of kernel SSVM. In this thesis, we propose a boosting based structured learning method for efficient non-linear structured prediction. This boosting based method combines a set of weak structured learners to form a strong structured learner.
1.2 Binary Code Learning
In binary code learning, we aim to learn a set of mapping functions that turn the original
high-dimensional data into binary codes. We refer to these mapping functions as hash
functions, and the corresponding learning methods as hashing methods. Suppose the
learning task is to find m hash functions for mapping input examples into m-bit binary
codes. Let x ∈ Rd denote a data point, and Φ(x) denote the output of m hash functions.
Φ(x) can be written as:
Φ(x) = [h1(x), h2(x), . . . , hm(x)]. (1.2)
Each hash function outputs a binary value:
h(x) ∈ {−1, 1}. (1.3)
The outputs of these hash functions form an m-bit binary code: Φ(x) ∈ {−1, 1}^m. Generally
these hash functions are able to preserve some kind of data similarity in the Hamming
space.
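As an illustration of the mapping Φ, the sketch below uses m linear hash functions of the form h_j(x) = sign(w_j^T x + b_j); the randomly drawn parameters are placeholders for what a hashing method would actually learn.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 16                       # feature dimension, number of bits
W = rng.standard_normal((m, d))    # one weight vector w_j per hash function
b = rng.standard_normal(m)

def encode(x):
    # Phi(x) = [h_1(x), ..., h_m(x)] in {-1, 1}^m
    return np.where(W @ x + b >= 0, 1, -1)

code = encode(rng.standard_normal(d))
print(code.shape, set(code.tolist()) <= {-1, 1})  # (16,) True
```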
The term “hashing” here refers to compact binary encoding [16]. Hashing methods have been particularly successful for similarity search (also known as nearest neighbor search). Hashing based similarity search has been applied in a variety of computer vision applications, such as object recognition [16], image retrieval [17], image classification [18], large-scale object detection [19], image matching [20], fast pose estimation [21] and compact local descriptor learning [22–24]. Content based image retrieval is an important application of hashing methods, in which the goal is to search a large collection of images for images similar to a query based on visual content. An example of the image retrieval application is shown in Figure 1.3. Similarity search techniques are essential to image retrieval applications.
One main advantage of hashing methods for similarity search is their high computational
efficiency. The similarity between two images can be measured by calculating the Hamming distance between their binary codes. The Hamming distance of two binary codes is the number of bits with different values. The Hamming distance requires only simple bit operations, which makes it extremely efficient to compute.
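Concretely, when codes are stored as bit strings, the Hamming distance is an XOR followed by a population count; the codes below are the ones shown in Figure 1.2.

```python
def hamming(a: int, b: int) -> int:
    # XOR marks the differing bits; counting the set bits gives the distance.
    return bin(a ^ b).count("1")   # (a ^ b).bit_count() on Python >= 3.10

print(hamming(0b11001100, 0b11001101))  # 1 (similar pair)
print(hamming(0b11001100, 0b00110011))  # 8 (dissimilar pair)
```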
After obtaining binary codes, there are two common approaches for similarity search [17]: one is using hash look-up tables [25, 26], and the other is Hamming distance ranking. In the first approach, look-up tables are constructed using binary codes as keys. Given a query, data points within a fixed Hamming distance are returned as the retrieved result. In the second approach, given a query, all data points in the database are ranked based
Figure 1.2: An illustration of hashing based similarity preservation. Similar images (e.g. codes 11001100 and 11001101) have a small Hamming distance, while dissimilar images (e.g. 11001100 and 00110011) have a large Hamming distance.
Figure 1.3: An illustration of image retrieval. The first column shows query images, and the rest are retrieved images in the database. These retrieved images are expected to be semantically similar to the corresponding query images.
on their Hamming distance to the query, and then a desired number of top ranked data points are returned as the retrieved result. Both approaches are computationally efficient.
Additionally, compact binary codes are extremely efficient for large-scale data storage.
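The Hamming distance ranking approach can be sketched as follows, assuming the codes are packed into uint8 arrays so that the XOR and popcount run vectorised; the function and variable names are illustrative.

```python
import numpy as np

def top_k(query_bits, db_bits, k):
    # query_bits: (m/8,) uint8; db_bits: (n, m/8) uint8.
    diff = np.bitwise_xor(db_bits, query_bits)        # differing bits per item
    dists = np.unpackbits(diff, axis=1).sum(axis=1)   # popcount per row
    return np.argsort(dists, kind="stable")[:k]       # indices of top-k items

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)  # 1000 64-bit codes
q = db[42].copy()                                          # query = a db item
print(top_k(q, db, 3)[0])  # 42: the identical code ranks first
```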
In general, hash functions are generated with the aim of preserving some notion of
similarity between data points. It is expected that the Hamming distance between two similar data points is smaller than that between two dissimilar data points. An illustrative example
is shown in Figure 1.2.
In computer vision applications, the notion of “similarity” can be roughly categorized into visual similarity and semantic similarity. Visually similar images have similar visual appearance; for example, images with small Euclidean distances in pixel space are visually similar. In contrast, semantically similar images are relevant in terms of high-level semantic content; for example, images containing objects from the same category are semantically similar. Clearly, semantically similar images are not necessarily visually similar, and vice versa. The visual differences arise for many reasons. For example, the visual appearance of an object can change dramatically with variations of viewpoint, illumination, scaling, translation and rotation. Moreover, two objects may come from related sub-categories yet look visually different. In real-world applications, semantic similarity is usually much more desirable.
Generally, to preserve semantic similarity, similarity labels are provided for hash function learning. Hashing methods which aim to preserve label based similarity are referred to as supervised methods [17, 27–31]. Correspondingly, hashing methods that preserve similarity in the original feature space are referred to as unsupervised methods [32–40]. Supervised methods are able to preserve user provided or domain-specific similarities, and are not restricted to Euclidean similarities defined on the feature space. Hence they are more flexible and favorable for real-world applications. We focus on developing supervised hashing methods in this thesis.
The performance of existing supervised hashing methods is still far from satisfactory. These methods usually have difficulty in accurate similarity preservation and large-scale learning.
Nowadays, a large number of images or videos can be easily obtained in many applications. This huge amount of data brings great challenges for hashing learning. Most existing supervised hashing methods become impractically slow when trained on large-scale data; they are only able to learn hash functions from relatively small training sets (up to tens of thousands of examples) and low-dimensional features (up to hundreds of dimensions).
Many supervised methods aim to learn linear hash functions, while non-linear hash function learning has received much less attention. Non-linear hash functions can improve similarity preservation accuracy over linear ones. Kernel hash functions are typically applied to achieve non-linear learning. However, evaluating and learning kernel functions is computationally expensive, especially on high-dimensional data. Overall, it remains a great challenge to learn non-linear hash functions for accurate similarity preservation and large-scale learning.
In this thesis, we aim to propose practical hashing methods for semantic similarity
search, which have good search precision and can be efficiently trained and evaluated on
large-scale high-dimensional datasets.
1.3 Contributions
The contributions of this thesis are in the fields of structured learning, binary code learning and their applications in computer vision. In Chapter 2, we present the literature background of structured learning and binary code learning. Multi-class classification is a special case of structured output prediction, and we first propose an efficient boosting method for multi-class classification in Chapter 3. Then in Chapter 4, we propose a boosting based structured learning method for general structured output prediction. In Chapters 5, 6 and 7, we focus on binary code learning and propose three novel hashing methods. Finally, in Chapter 8, we conclude the thesis and discuss future work. We describe the contributions in more detail as follows:
Chapter 3
We present a novel column generation based boosting method for multi-class classifi-
cation [41]. Our multi-class boosting is formulated in a single optimization problem.
Different from most existing multi-class boosting methods, which use the same set of
weak learners for all the classes, we learn a separate set of weak learners for each class.
We show that using these class-specific weak learners leads to fast convergence, without introducing additional computational overhead in the training procedure. To further
make the training more efficient and scalable, we also propose a fast coordinate descent
method for solving the optimization problem at each boosting iteration. The proposed
coordinate descent method is conceptually simple and easy to implement in that it has
a closed-form solution for each coordinate update. Experimental results on a variety
of datasets show that, compared to a range of existing multi-class boosting methods,
the proposed method has much faster convergence rate and better generalization per-
formance in most cases. We also empirically show that the proposed fast coordinate
descent algorithm is able to efficiently solve the optimization.
Chapter 4
Recently, structured learning has found many applications in computer vision. Inspired
by structured support vector machines (SSVM), here we propose a new boosting algo-
rithm for structured output prediction, which we refer to as StructBoost [42]. Struct-
Boost supports nonlinear structured learning by combining a set of weak structured
learners.
As SSVM generalizes SVM, our StructBoost generalizes standard boosting approaches such as AdaBoost and LPBoost to structured learning. The resulting optimization problem of StructBoost is more challenging than that of SSVM in the sense that it may involve
exponentially many variables and constraints. In contrast, for SSVM one usually has
an exponential number of constraints and a cutting-plane method is used. In order to
efficiently solve StructBoost, we formulate an equivalent 1-slack formulation and solve
it using a combination of cutting planes and column generation. We show the versa-
tility and usefulness of StructBoost on a range of problems such as optimizing the tree
loss for hierarchical multi-class classification, optimizing the Pascal overlap criterion
for robust visual tracking and learning conditional random field parameters for image
segmentation.
Chapter 5
In this chapter, we propose a column generation based method [43] for learning data-
dependent hash functions based on relative pairwise similarity information. Given a set
of triplets that encode the pairwise similarity comparison information, our method learns
hash functions that preserve the relative comparison relations in the data within the
large-margin learning framework. The learning procedure is implemented using column
generation and hence is named CGHash. At each iteration of the column generation
procedure, the best hash function is selected. Unlike many other hashing methods, our
method generalizes to new data points naturally. We show that our method with triplet
based formulation and large-margin learning is able to learn high quality hash functions
for similarity search.
Chapter 6
In this chapter, we propose a flexible and general method [44] with a two-step learning
scheme. Most existing approaches to hashing apply a single form of hash function, and
an optimization process which is typically deeply coupled to this specific form. This
tight coupling restricts the flexibility of the method to respond to the data, and can
result in complex optimization problems that are difficult to solve. Here we propose
a flexible yet simple framework that is able to accommodate different types of loss
functions and hash functions. This framework allows a number of existing approaches
to hashing to be placed in context, and simplifies the development of new problem-
specific hashing methods. Our framework decomposes the hashing learning problem
into two steps: hash bit learning and hash function learning based on the learned bits.
The first step can typically be formulated as binary quadratic problems, and the second
step can be accomplished by training standard binary classifiers. Both problems have
been extensively studied in the literature. Our extensive experiments demonstrate that
the proposed framework is effective, flexible and outperforms the state-of-the-art.
Chapter 7
In this chapter, we propose a hashing method [45] for efficient and effective learning
on large-scale and high-dimensional data, which is an extension of our general two-
step hashing method. Supervised hashing aims to map the original features to compact
binary codes that are able to preserve label based similarity in the Hamming space. Non-
linear hash functions have demonstrated their advantage over linear ones due to their
powerful generalization capability. In the literature, kernel functions are typically used
to achieve non-linearity in hashing, which achieve encouraging retrieval performance at
the price of slow evaluation and training time. Here we propose to use boosted decision
trees for achieving non-linearity in hashing, which are fast to train and evaluate, hence
more suitable for hashing with high dimensional data. In our approach, we first propose
sub-modular formulations for the hashing binary code inference problem and an efficient
GraphCut based block search method for solving large-scale inference. Then we learn
hash functions by training boosted decision trees to fit the binary codes. Experiments
demonstrate that our proposed method significantly outperforms most state-of-the-art
methods in retrieval precision and training time. Especially for high-dimensional data,
our method is orders of magnitude faster than many methods in terms of training time.
Chapter 2
Background Literature
This chapter provides the background of structured learning and binary code learning,
which the later chapters are based on. We explain some basic notations and review some
popular existing methods which are related to the focus of this thesis.
2.1 Structured Learning
Structured learning aims to learn a prediction function g which has a structured output:
y = g(x). This prediction function is assumed to take the following form:
y* = g(x) = argmax_y f(x, y). (2.1)
Here f is a function that measures the consistency of the input and output. The pre-
diction is achieved by solving an inference problem which is finding an output y? that
maximizes f(x,y). Algorithms for solving the prediction in (2.1) depend on applica-
tions. For example, in the application of image segmentation [8], the GraphCut [46]
algorithm is applied to solve the inference problem for prediction. The survey of [15]
provides a comprehensive review of structured learning and its application in computer
vision.
Existing structured learning methods fall into two categories: probabilistic approaches and max-margin approaches. Probabilistic approaches estimate the
distribution of underlying data, hence require an expensive normalization step. Popu-
lar existing methods in this category include Conditional Random Fields (CRFs) [47],
Maximum Entropy Discrimination Markov Networks [48], and so on.
In contrast with probabilistic approaches, max-margin approaches directly learn a discriminative function, and only require solving maximum a posteriori (MAP) inference
problems. These MAP inference problems are usually similar to the structured predic-
tion problem in (2.1), thus similar inference algorithms can be applied. Popular existing
methods in this category include Structured SVM [7], Max-Margin Markov Networks
[49] and so on.
2.1.1 Structured SVM
Structured SVM (SSVM) [7] is a well-known max-margin structured learning method.
Our boosting based structured learning method, described in Chapter 4, can be seen as an extension of SSVM for efficient nonlinear learning.
In structured learning, a prediction function g can be learned by minimizing the regu-
larized structural empirical risk functional, which can be written as:
min_g J(g) := νΩ(g) + Remp(g), (2.2)
where Remp(g) := (1/m) ∑_{i=1}^m ∆(yi, g(xi)). (2.3)
Here g is the prediction function which we aim to learn; Remp is the empirical risk, defined on the input-output pairs (x1, y1), (x2, y2), . . . , (xm, ym) ∈ X × Y; ∆(·) is a structured loss which measures how well the prediction matches the ground truth; and νΩ(g) is a regularization term for controlling the model complexity, in which ν > 0 is a predefined trade-off parameter.
Directly solving the optimization in (2.2) is difficult, because the term ∆(yi, g(xi)) is piecewise constant. SSVM replaces the ∆-loss in (2.3) by a convex upper bound. As shown in [50, 51], minimizing a convex upper bound of (2.3) is sufficient for learning a good prediction function. In SSVM, the prediction function is defined as:
g(x) = argmax_y f(x, y; w) = argmax_y w>Ψ(x, y), (2.4)
where Ψ(x,y) is a joint feature mapping of the input-output pair. The optimization
problem of SSVM is written as:
min_w (1/2)‖w‖² + (C/m) ∑_{i=1}^m l(xi, yi, w), (2.5)
where
l(xi, yi, w) = max_{y∈Y} [∆(yi, y) − f(xi, yi; w) + f(xi, y; w)]. (2.6)
Here `2 norm is used for regularization and C is a regularization trade-off parameter.
Clearly we have:
f(xi, g(xi); w) = max_y f(xi, y; w) ≥ f(xi, yi; w); (2.7)
hence we have the following relations:
∆(yi, g(xi)) ≤ ∆(yi, g(xi)) − f(xi, yi; w) + f(xi, g(xi); w) (2.8a)
≤ max_{y∈Y} [∆(yi, y) − f(xi, yi; w) + f(xi, y; w)] (2.8b)
= l(xi, yi, w). (2.8c)
The above relations show that l(xi, yi, w) is an upper bound of ∆(yi, g(xi)). The loss
formulation of SSVM in (2.5) is equivalent to the conventional slack formulation of
SSVM, which is written as:
min_{w,ξ} (1/2)‖w‖² + (C/m) ∑_{i=1}^m ξi (2.9a)
s.t. ∀i: ξi ≥ 0, (2.9b)
∀i = 1, . . . , m and ∀y ∈ Y: w>Ψ(xi, yi) − w>Ψ(xi, y) ≥ ∆(yi, y) − ξi. (2.9c)
The definition of the prediction loss ∆(yi, y) depends on the application. In multi-class classification, as a special case of structured learning, ∆ is defined as the zero-one loss. In image segmentation, ∆ is usually defined as the Hamming loss between two label masks [8].
The optimization problem of SSVM is convex. However, the loss function in SSVM is not everywhere differentiable, as a result of the max operation. SSVM can be solved by sub-gradient descent, cutting-plane methods [7, 52, 53], online sub-gradient descent [54–56], and so on.
The cutting-plane algorithm constructs a working set of constraints and iteratively adds constraints into this set. Specifically, in each iteration, it finds the most violated constraints in (2.9c) and adds them to the working set, then solves the optimization in (2.9) with only the constraints in the working set. Finding the most violated constraints is
to solve the following MAP inference problem for each input xi:
y* = argmax_y w>Ψ(xi, y) + ∆(yi, y). (2.10)
This MAP inference problem is similar to the prediction inference problem in (2.1);
hence it can be solved in a similar way. This inference step is also required in many
other methods for solving SSVM.
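For the multi-class special case, the most-violated-constraint search in (2.10) can be sketched directly: the joint feature map places x in the block of class y, and ∆ is the zero-one loss. The code below is an illustrative sketch, not the thesis implementation.

```python
import numpy as np

def psi(x, y, K):
    # Joint feature map for multi-class: x is copied into the block of class y.
    out = np.zeros(K * x.size)
    out[y * x.size:(y + 1) * x.size] = x
    return out

def most_violated(w, x, y_true, K):
    # y* = argmax_y  w^T Psi(x, y) + Delta(y_true, y), with the zero-one loss.
    scores = [w @ psi(x, y, K) + (0.0 if y == y_true else 1.0)
              for y in range(K)]
    return int(np.argmax(scores))

w = np.array([2.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # toy model favouring class 0
x = np.array([1.0, 1.0])
print(most_violated(w, x, 1, 3))  # 0: class 0 gives the most violated constraint
```

In richer output spaces the same search is performed by a MAP inference algorithm rather than by enumerating the labels.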
2.2 Boosting
In this thesis, we explore boosting methods for multi-class classification and general
structured learning. We explain the background of boosting learning in this section.
Basically, boosting methods construct a strong learner by combining a set of weak learners. For classification problems, a typical boosting method aims to learn a set of weak classifiers and their corresponding weightings. Each weak classifier on its own has only moderate classification accuracy. The weak classifiers are then combined according to the learned weightings to form a single strong classifier. A well-known example of boosting methods is AdaBoost [57].
2.2.1 Column generation boosting
There are a variety of boosting methods in the literature. These boosting methods can usually be explained by general boosting learning frameworks. For example, the AnyBoost framework [58] shows that many boosting methods perform gradient descent in function space. In this thesis we focus on the column generation framework [59, 60] for boosting learning. In this framework, boosting learning is formulated as an optimization problem which usually involves infinitely many variables, and this optimization can be solved by the column generation technique. Column generation based boosting methods are able to update the weightings of all weak learners when adding new weak learners in each iteration. This kind of boosting learning, which updates the weights of all weak learners in each iteration, is referred to as totally-corrective learning.
Here we describe LPBoost [60] as a simple example of column generation based boosting methods. LPBoost is a boosting method for binary classification, which solves a linear programming optimization problem.
First we describe some basic notation. Inequality between two vectors means element-wise inequality. 1 is a vector with all elements being one; its dimension should be clear from the context. Similarly, 0 is a vector with all elements being zero.
A training example is denoted by x and its class label is denoted by y. In binary
classification, we have y ∈ {−1, +1}. A weak classifier φ is a function that maps an example x to {−1, +1}:
φ(x) ∈ {−1, +1}. (2.11)
The domain of all possible weak learners is denoted by C: φ(·) ∈ C. The outputs of all weak classifiers are collected in a column vector:
Φ(x) = [φ1(x), φ2(x), · · · , φm(x)]>. (2.12)
The weak learner weighting vector is denoted by w, and w ≥ 0. Weak classifiers are
linearly combined to form the final strong classifier:
f(x) = sign[w>Φ(x)] = sign[∑_{j=1}^m wj φj(x)]. (2.13)
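As a toy illustration of (2.13), the sketch below evaluates a strong classifier built from three hypothetical weak classifiers with hand-picked weightings; both stand in for what LPBoost would actually select and learn.

```python
import numpy as np

# Hypothetical weak classifiers (decision stumps on a scalar input).
weak = [lambda x: 1 if x > 0.5 else -1,
        lambda x: 1 if x > 1.5 else -1,
        lambda x: 1 if x < 2.5 else -1]
w = np.array([0.5, 0.3, 0.2])   # nonnegative weightings, as LPBoost requires

def strong(x):
    votes = np.array([phi(x) for phi in weak])   # Phi(x) in {-1, +1}^m
    return int(np.sign(w @ votes))               # f(x) = sign(w^T Phi(x))

print(strong(1.0), strong(0.0))  # 1 -1
```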
LPBoost is a max-margin learning method with the hinge loss. The optimization prob-
lem for LPBoost is written as:
min_{w,ξ} 1>w + (C/n) ∑_{i=1}^n ξi (2.14a)
s.t. w ≥ 0, ξ ≥ 0, (2.14b)
∀i = 1, . . . , n: yi w>Φ(xi) ≥ 1 − ξi. (2.14c)
Here `1 norm is used for regularization; C is a regularization trade-off parameter; n is
the number of training examples. By solving the above optimization, we can learn a set
of weak classifiers Φ(·) and their corresponding weightings w.
From the viewpoint of column generation boosting, all possible weak learners are considered in the optimization in (2.14). The number of weak learners in (2.14) is the size of the weak learner domain, |C|, which can be infinitely large; thus the dimension of the weighting vector w can also be infinitely large. The weightings of all weak learners are initialized to 0, and a weak classifier with zero weighting is not included in the final strong classifier. By solving the optimization in (2.14), we are able to obtain a sparse solution for the weightings w. In this way, we obtain a small number of weak classifiers with nonzero weightings, and these weak classifiers construct the strong classifier.
Column generation is a technique for solving optimization problems which have a large number of variables and cannot be solved directly. Starting from an empty working
set of weak classifiers, column generation iteratively generates new weak classifiers and adds them to the working set. New weak classifiers are generated by finding the most violated constraints in the dual problem of (2.14). The dual problem of (2.14) can be derived as:
max_µ ∑_{i=1}^n µi (2.15a)
s.t. ∀i = 1, . . . , n: 0 ≤ µi ≤ C/n, (2.15b)
∀φ(·) ∈ C: ∑_{i=1}^n µi yi φ(xi) ≤ 1. (2.15c)
Here µi is the dual variable associated with one constraint in (2.14c). In each column generation iteration, we perform the following two steps:
1. Generate a new weak classifier by finding the most violated constraint in the dual problem (2.15), and add it to the weak classifier working set Wφ.
2. Solve the primal problem (2.14) or the dual problem (2.15) using only the weak classifiers in the working set Wφ, obtaining the primal and dual solutions w, µ.
The learning algorithm repeats these two steps until it reaches a predefined number of
iterations.
In the first step, generating a new weak classifier amounts to solving the following optimization:
φ*(·) = argmax_{φ(·)∈C} ∑_{i=1}^n µi yi φ(xi), (2.16)
which finds a most violated constraint in the dual problem (2.15). The above optimization is a weighted binary classification problem, in which the dual solution µ serves as the example weightings. Typically, we can train a decision stump or decision tree classifier as the weak classifier solution.
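The weak-learner search in (2.16) can be sketched with decision stumps as the domain C: among all single-feature thresholds and polarities, pick the stump maximising the µ-weighted edge ∑_i µi yi φ(xi). This is an illustrative brute-force sketch, not the thesis implementation.

```python
import numpy as np

def best_stump(X, y, mu):
    # Exhaustively search stumps phi(x) = s if x[j] <= t else -s, returning
    # the one maximising the weighted edge sum_i mu_i * y_i * phi(x_i).
    best, best_edge = None, -np.inf
    for j in range(X.shape[1]):               # feature index
        for t in np.unique(X[:, j]):          # candidate threshold
            for s in (1, -1):                 # polarity
                phi = s * np.where(X[:, j] <= t, 1, -1)
                edge = float(np.sum(mu * y * phi))
                if edge > best_edge:
                    best, best_edge = (j, t, s), edge
    return best, best_edge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
mu = np.full(4, 0.25)                          # uniform dual weights
(j, t, s), edge = best_stump(X, y, mu)
print(j, t, s, edge)  # 0 1.0 -1 1.0
```

An edge equal to ∑_i µi means the stump perfectly separates the weighted data, i.e. the corresponding dual constraint in (2.15c) is maximally violated.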
We refer to the primal problem (2.14) restricted to the weak classifiers in the working set Wφ as the reduced primal problem. Similarly, we refer to the dual problem (2.15) restricted to the working set Wφ as the reduced dual problem.
In the second step, the reduced primal problem of (2.14) or the reduced dual problem of
(2.15) is a linear program (LP); hence we can use MOSEK or any other off-the-shelf LP
solver to obtain the primal solution w and the dual solution µ. The dual solution
is required for generating new weak classifiers. The weights w of all weak classifiers
in the working set are updated in each iteration; thus LPBoost is a totally-corrective
boosting method.
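A minimal sketch of solving the reduced primal with an off-the-shelf LP solver, here SciPy's `linprog` in place of MOSEK. We assume the standard LPBoost primal form for (2.14), i.e. minimizing 1ᵀw + (C/n)Σᵢξᵢ subject to yᵢΣⱼwⱼφⱼ(xᵢ) ≥ 1 − ξᵢ and w, ξ ≥ 0; the function name is ours:

```python
import numpy as np
from scipy.optimize import linprog

def solve_reduced_primal(H, y, C):
    """Solve the reduced LPBoost primal over the working set with an LP solver.

    H: (n, J) matrix with H[i, j] = phi_j(x_i) for the J weak classifiers in
    the working set; y: (n,) labels in {-1, +1}. Variables are stacked as
    [w_1..w_J, xi_1..xi_n]; returns the weak-classifier weights w.
    """
    n, J = H.shape
    # Objective: 1'w + (C/n) * sum(xi)
    c = np.concatenate([np.ones(J), (C / n) * np.ones(n)])
    # Margin constraints y_i * (H_i @ w) >= 1 - xi_i, rewritten in the
    # A_ub @ x <= b_ub form expected by linprog.
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    # The dual solution mu of (2.15), needed to generate the next weak
    # classifier, can be read from the solver's constraint multipliers
    # (res.ineqlin.marginals in recent SciPy versions).
    return res.x[:J]
```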
2.2.2 Column generation for multi-class boosting
Shen and Hao [1] propose a column generation based boosting method for multi-class
classification, which is an extension of LPBoost (described in Section 2.2.1). We refer
to this method as MultiBoost. In Chapter 4, we extend MultiBoost to boosting based
structured learning. MultiBoost follows the max-margin learning formulation of the
multi-class SVM proposed by Crammer and Singer [61].
We denote the number of classes by K. The class label y takes a value from 1 to K:
y ∈ {1, 2, . . . , K}. MultiBoost learns a weighting vector for each class; hence it learns K
models. We denote by w_y the weighting vector of class y. The classification
score of class y is then:

$$f_y(\mathbf{x}) = \mathbf{w}_y^\top \Phi(\mathbf{x}) = \sum_{j=1}^{m} w_{y,j}\, \phi_j(\mathbf{x}). \tag{2.17}$$
Given a test data point x, the prediction function is:

$$y^\star = \operatorname*{argmax}_{y} f_y(\mathbf{x}) = \operatorname*{argmax}_{y} \mathbf{w}_y^\top \Phi(\mathbf{x}), \tag{2.18}$$

which finds the class label with the largest confidence.
MultiBoost is a flexible framework in which a variety of loss functions can be applied.
Here we describe the hinge loss and the exponential loss as examples. For a training
example, we expect the classification score of the ground-truth class to be higher than
that of any other class. The multi-class margin associated with the training example
(x_i, y_i) is defined as

$$\gamma_{(i,y)} = \mathbf{w}_{y_i}^\top \Phi(\mathbf{x}_i) - \mathbf{w}_{y}^\top \Phi(\mathbf{x}_i). \tag{2.19}$$

Intuitively, γ_(i,y) is the difference between the classification scores of the ground-truth
class and another class. The training of MultiBoost encourages this margin to be large.
2.2.2.1 MultiBoost with hinge loss
The training of MultiBoost with the hinge loss is to solve the following optimization
problem:
$$
\begin{aligned}
\min_{\mathbf{w}, \boldsymbol{\xi}} \quad & \mathbf{1}^\top \mathbf{w} + \frac{C}{n} \sum_{i=1}^{n} \xi_i && \text{(2.20a)}\\
\text{s.t.} \quad & \mathbf{w} \geq 0, \; \boldsymbol{\xi} \geq 0, && \text{(2.20b)}\\
& \forall i = 1, \dots, n \text{ and } \forall y \in \{1, \dots, K\} \setminus \{y_i\}: \\
& \quad \mathbf{w}_{y_i}^\top \Phi(\mathbf{x}_i) - \mathbf{w}_{y}^\top \Phi(\mathbf{x}_i) \geq 1 - \xi_i. && \text{(2.20c)}
\end{aligned}
$$
The corresponding dual problem is written as:
$$
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \neq y_i} \mu_{(i,y)} && \text{(2.21a)}\\
\text{s.t.} \quad & \forall \phi \text{ and } \forall c = 1, \dots, K: \\
& \quad \sum_{i:\, y_i = c} \sum_{y \neq y_i} \mu_{(i,y)} \phi(\mathbf{x}_i) - \sum_i \sum_{y \neq y_i,\, y = c} \mu_{(i,y)} \phi(\mathbf{x}_i) \leq 1, && \text{(2.21b)}\\
& \forall i = 1, \dots, n: \; 0 \leq \sum_{y \neq y_i} \mu_{(i,y)} \leq \frac{C}{n}. && \text{(2.21c)}
\end{aligned}
$$
The column generation algorithm for MultiBoost is similar to that of LPBoost, which
is described in Section 2.2.1. The sub-problem for learning a new weak classifier is to
find the most violated constraint in the dual problem, which is written as:
$$[\phi^\star(\cdot), c^\star] = \operatorname*{argmax}_{\phi(\cdot) \in \mathcal{C},\; c \in \{1, \dots, K\}} \; \sum_{i:\, y_i = c} \sum_{y \neq y_i} \mu_{(i,y)} \phi(\mathbf{x}_i) - \sum_i \sum_{y \neq y_i,\, y = c} \mu_{(i,y)} \phi(\mathbf{x}_i). \tag{2.22}$$
Recall that the reduced primal problem is the primal problem (2.20) restricted to weak
classifiers from the working set. The reduced primal problem can be solved by LP
solvers to obtain the primal and dual solutions, similar to LPBoost.
2.2.2.2 MultiBoost with exponential loss
The training of MultiBoost with the exponential loss is to solve the following optimiza-
tion problem:
$$
\begin{aligned}
\min_{\mathbf{w}} \quad & \mathbf{1}^\top \mathbf{w} + \frac{C}{p} \sum_{i=1}^{n} \sum_{y \neq y_i} \exp(-\gamma_{(i,y)}) && \text{(2.23a)}\\
\text{s.t.} \quad & \mathbf{w} \geq 0, && \text{(2.23b)}\\
& \forall i = 1, \dots, n \text{ and } \forall y \in \{1, \dots, K\} \setminus \{y_i\}: \\
& \quad \mathbf{w}_{y_i}^\top \Phi(\mathbf{x}_i) - \mathbf{w}_{y}^\top \Phi(\mathbf{x}_i) = \gamma_{(i,y)}. && \text{(2.23c)}
\end{aligned}
$$
Here p is the number of constraints: p = n(K − 1). In the above optimization, one
constraint corresponds to one margin variable, which is different from the hinge loss
formulation in (2.20). The corresponding dual problem is written as:
$$
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \neq y_i} \mu_{(i,y)} \Big[ 1 - \log \frac{p}{C} - \log \mu_{(i,y)} \Big] && \text{(2.24a)}\\
\text{s.t.} \quad & \forall \phi(\cdot) \in \mathcal{C} \text{ and } \forall c = 1, \dots, K: \\
& \quad \sum_{i:\, y_i = c} \sum_{y \neq y_i} \mu_{(i,y)} \phi(\mathbf{x}_i) - \sum_i \sum_{y \neq y_i,\, y = c} \mu_{(i,y)} \phi(\mathbf{x}_i) \leq 1, && \text{(2.24b)}\\
& \forall i = 1, \dots, n: \; 0 \leq \sum_{y \neq y_i} \mu_{(i,y)} \leq \frac{C}{p}. && \text{(2.24c)}
\end{aligned}
$$
Similar to the case of the hinge loss, the sub-problem for learning a new weak classifier is
to find the most violated constraint in the dual problem, which is written in (2.22). The
primal problem (2.23) has a smooth objective, which is different from the hinge loss
formulation in (2.20); hence we are able to apply LBFGS-B [2] to solve the reduced
primal problem. The dual solution µ* can be calculated from the primal solution w*.
According to the KKT conditions, the dual solution is written as:

$$\mu^\star_{(i,y)} = \frac{C}{p} \exp\!\big[ \mathbf{w}_y^{\star\top} \Phi(\mathbf{x}_i) - \mathbf{w}_{y_i}^{\star\top} \Phi(\mathbf{x}_i) \big]. \tag{2.25}$$
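The KKT recovery in (2.25) is a simple element-wise computation. A minimal numpy sketch (a helper of ours, using 0-based class labels for indexing convenience):

```python
import numpy as np

def dual_from_primal(F, y, C):
    """Recover the dual variables via (2.25).

    F: (n, K) matrix of classification scores, F[i, y] = w_y^T Phi(x_i);
    y: (n,) ground-truth labels in {0, ..., K-1}. Returns mu of shape (n, K),
    with the unused entries mu[i, y_i] set to 0.
    """
    n, K = F.shape
    p = n * (K - 1)                              # number of constraints
    margins = F - F[np.arange(n), y][:, None]    # w_y^T Phi - w_{y_i}^T Phi
    mu = (C / p) * np.exp(margins)
    mu[np.arange(n), y] = 0.0                    # no dual variable for y = y_i
    return mu
```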
2.3 Binary Code Learning
In binary code learning, we aim to learn a set of mapping functions that turn the
original high-dimensional data into binary codes. We refer to these mapping functions as
hash functions, and the corresponding learning methods as hashing methods. Suppose
the learning task is to find m hash functions for mapping input examples into m-bit
binary codes. Let x ∈ R^d denote a data point, and Φ(x) denote the output of the m
hash functions. Φ(x) is written as:

$$\Phi(\mathbf{x}) = [h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_m(\mathbf{x})]. \tag{2.26}$$

Each hash function outputs a binary value:

$$h(\mathbf{x}) \in \{-1, 1\}. \tag{2.27}$$

The output of these hash functions is an m-bit binary code: Φ(x) ∈ {−1, 1}^m. These
hash functions are designed to preserve some kind of data similarity in the Hamming space.
Various types of loss functions and hash functions have been used in existing methods.
Generally, the formulation of hashing learning encourages small Hamming distances
for similar data pairs and large Hamming distances for dissimilar data pairs. The
Hamming distance between two binary codes is the number of bits taking different values:

$$d_{\mathrm{hm}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r=1}^{m} \big[ 1 - \delta(h_r(\mathbf{x}_i), h_r(\mathbf{x}_j)) \big], \tag{2.28}$$
in which δ(·, ·) ∈ {0, 1} is an indicator function: it outputs 1 if its two inputs are
equal, and 0 otherwise. Closely related to the Hamming distance, the Hamming affinity
is calculated by the inner product of two binary codes:

$$s_{\mathrm{hm}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r=1}^{m} h_r(\mathbf{x}_i)\, h_r(\mathbf{x}_j). \tag{2.29}$$
In terms of Hamming affinity, hashing learning encourages positive Hamming affinity
values for similar data pairs and negative values for dissimilar data pairs. Loss functions
are typically defined on the basis of the Hamming distance or Hamming affinity of similar
and dissimilar data pairs.
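For m-bit ±1 codes the two quantities are linked by s_hm = m − 2 d_hm, so a loss expressed in one can be rewritten in the other. A minimal numpy sketch of (2.28) and (2.29):

```python
import numpy as np

def hamming_distance(zi, zj):
    """Number of disagreeing bits between two codes in {-1,+1}^m, as in (2.28)."""
    return int(np.sum(zi != zj))

def hamming_affinity(zi, zj):
    """Inner product of two codes in {-1,+1}^m, as in (2.29)."""
    return int(zi @ zj)
```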
Existing hashing methods can be roughly categorized into unsupervised and supervised
methods. Unsupervised methods [32–40] aim to preserve the similarity calculated in
the original feature space, while supervised methods [17, 27–31] aim to preserve
label based similarity. Supervised methods require similarity labels (usually pairwise or
triplet based) for supervised learning.
Various types of loss functions have been used in existing methods. For example, the
Laplacian affinity loss is used in SPH [34], STH [33], AGH [38]; the quantization loss
is used in ITQ [36] and K-means Hashing [37]; the Hamming distance or Hamming
affinity based similarity regression loss is used in MDSH [35], KSH [31], BRE [28], semi-
supervised hashing [17]; a hinge-like loss is used in MLH [29].
Different types of hash functions have been used in existing methods. For example, linear
perceptron functions are used in random projection LSH [32, 62], semi-supervised hashing
[17] and ITQ [36]; kernel functions are used in KSH [31] and KLSH [63]; eigenfunctions
are used in SPH [34] and MDSH [35].
A variety of techniques have been proposed for hashing learning. For example, random
projection is used in LSH [32, 62] and KLSH [63]; spectral graph analysis for exploring
the data manifold is used in SPH [34], MDSH [35], STH [33], AGH [38] and inductive
hashing [64]; vector quantization is used in ITQ [36], K-means Hashing [37] and bilinear
hashing [40]; kernel methods are used in KSH [31] and KLSH [63].
Next, we review a number of popular existing unsupervised and supervised hashing
methods, and discuss their strengths and weaknesses.
2.3.1 Locality-sensitive hashing
Locality-sensitive hashing (LSH) [32] is pioneering work among hashing methods. The
LSH family generates random hash functions for mapping data points into binary codes.
A popular class of LSH uses random projection to generate hash functions [62, 65], which
preserves the cosine similarity calculated in the input feature space. The hash
function generated by random projection is written as:

$$h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). \tag{2.30}$$
Here the vector w and the scalar b are randomly generated. When hash table look-up
is applied for nearest neighbor retrieval, using multiple hash tables improves the
retrieval precision of LSH. LSH has been extended to preserve a variety of similarity
measures, such as the p-norm distance [66], the Mahalanobis distance [67], and kernel
similarity [63].
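A minimal sketch of generating random-projection hash functions of the form (2.30); the function name and tie-breaking convention are ours:

```python
import numpy as np

def make_lsh(d, m, seed=0):
    """Generate m random-projection hash functions of the form (2.30)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, m))   # random projection directions w
    b = rng.standard_normal(m)        # random offsets b
    def hash_fn(X):
        codes = np.sign(X @ W + b)    # (n, m) codes
        codes[codes == 0] = 1         # break ties toward +1
        return codes
    return hash_fn
```

Because the functions are data-independent, nearby points agree on most bits only in expectation, which is why LSH typically needs long codes or multiple tables for good precision.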
Randomly generated hash functions are data-independent. A drawback of LSH is that
it usually requires long binary codes to achieve good retrieval precision. Rather than
randomly generating hash functions as in LSH, data-dependent methods learn meaningful
hash functions from the data, and are thus able to generate more compact and effective
binary codes. Data-dependent methods can be divided into unsupervised and supervised
methods. Next we describe some popular data-dependent methods.
2.3.2 Spectral hashing
Spectral hashing (SPH) [34] is a data-dependent unsupervised method for learning
uncorrelated binary codes that preserve Gaussian similarity. Let {x_1, x_2, . . . , x_n}
denote a set of training examples, and {z_1, z_2, . . . , z_n} denote the corresponding
binary codes. We have z ∈ {−1, +1}^m, where m is the number of bits. The pairwise
similarity information is provided by a similarity matrix Y, whose element y_ij measures
the similarity between the examples x_i and x_j. Here y_ij is calculated by a Gaussian
kernel on the input features:

$$y_{ij} = \exp\Big( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma^2} \Big). \tag{2.31}$$
The Hamming distances between similar neighbors should be small. SPH aims to learn
uncorrelated binary codes which minimize the Hamming distances. It solves the follow-
ing optimization problem to generate the binary codes of the training examples:
$$
\begin{aligned}
\min_{\mathbf{Z}} \quad & \sum_{i,j=1}^{n} y_{ij} \|\mathbf{z}_i - \mathbf{z}_j\|^2 && \text{(2.32a)}\\
\text{s.t.} \quad & \forall i = 1, \dots, n: \; \mathbf{z}_i \in \{-1, +1\}^m, && \text{(2.32b)}\\
& \sum_{i=1}^{n} \mathbf{z}_i = \mathbf{0}, && \text{(2.32c)}\\
& \frac{1}{n} \sum_{i=1}^{n} \mathbf{z}_i \mathbf{z}_i^\top = \mathbf{I}. && \text{(2.32d)}
\end{aligned}
$$
Here Z is the matrix of the binary codes of all training examples, and I is an identity
matrix. The constraint (2.32c) requires balanced binary codes, meaning that the values
of each bit are evenly distributed over the training set; the constraint (2.32d) requires
the bits to be uncorrelated. Uncorrelated codes help to reduce redundancy, and balanced
codes help to distribute examples evenly into hash buckets. The above optimization is
difficult to solve due to the binary constraint (2.32b). After spectral relaxation, which
drops the binary constraints in (2.32b), the optimization can be solved by finding graph
Laplacian eigenvectors [68]. The final binary codes can be generated by thresholding
the Laplacian eigenvectors.
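The relaxed training-code computation can be sketched as follows: build the Gaussian similarity (2.31), form the graph Laplacian, take the eigenvectors with the smallest non-trivial eigenvalues and threshold them. This is an illustrative dense-matrix sketch of the relaxation only (it omits SPH's eigenfunction-based extension to new examples; the function name is ours):

```python
import numpy as np

def sph_train_codes(X, m, sigma=1.0):
    """Spectral-relaxation sketch of (2.32): codes from Laplacian eigenvectors.

    X: (n, d) training data; m: number of bits. Thresholds at zero the m
    eigenvectors with the smallest non-trivial eigenvalues.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Y = np.exp(-sq / sigma ** 2)            # Gaussian similarity (2.31)
    L = np.diag(Y.sum(axis=1)) - Y          # graph Laplacian L = D - Y
    vals, vecs = np.linalg.eigh(L)          # ascending eigenvalues
    V = vecs[:, 1:m + 1]                    # skip the trivial constant eigenvector
    return np.where(V >= 0, 1, -1)          # threshold to binary codes
```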
Solving the above optimization only yields the binary codes of the training examples.
More importantly, we need to know how to generate binary codes for new examples;
in other words, we need to learn hash functions that can be efficiently evaluated on new
examples. In manifold learning, the problem of generating representations for new
examples is called out-of-sample extension. Nyström methods [69, 70] are often applied
for out-of-sample extension, but they are computationally expensive for large datasets.
To enable efficient out-of-sample extension, SPH assumes that the data follow a uniform
distribution. When the data follow a uniform distribution and the pairwise similarity is
defined by a Gaussian kernel, the hash function solutions are a type of eigenfunctions
that have an outer product form and can be efficiently evaluated on new examples. A
new example is evaluated by the learned eigenfunctions and then thresholded to output
binary codes.
SPH has a number of disadvantages. For real-world data, the assumption of a uniform
distribution is usually not true. SPH optimizes the Laplacian affinity loss, which only
pulls similar data pairs together but does not push dissimilar data pairs apart; as shown
in manifold learning, this may lead to inferior performance [71]. Moreover, the constraint
of uncorrelated binary codes may not help to improve similarity search accuracy. Indeed,
in some recent supervised methods, binary codes learned without this constraint show
good similarity preserving performance [17, 29, 31, 44, 45].
2.3.3 Self-taught hashing
Self-taught hashing (STH) [33] applies a two-step learning scheme to hash function
learning. In the first step, STH infers the binary codes of all training examples; in
the second step, it trains binary classifiers as hash functions to fit those binary codes.
For binary code inference, STH optimizes the Laplacian affinity loss, similar to
spectral hashing (SPH) [34], which was described in the previous section.
Let {x_1, x_2, . . . , x_n} denote a set of training examples, and {z_1, z_2, . . . , z_n} denote
the corresponding binary codes. We have z ∈ {−1, +1}^m, where m is the number of
bits. A similarity matrix Y provides the pairwise similarity information; its element
y_ij measures the similarity between the examples x_i and x_j. STH uses local similarity
information [72]: the similarity relations of an example are defined only on a small
number of its nearest neighbors, so the similarity matrix Y is highly sparse. The
pairwise similarity value is calculated using the cosine similarity of the input features
[72]. Let S denote the set of defined pairwise relations. The pairwise similarity values
are written as:

$$\forall (i,j) \in S: \; y_{ij} = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\| \|\mathbf{x}_j\|}, \tag{2.33}$$

$$\forall (i,j) \notin S: \; y_{ij} = 0. \tag{2.34}$$
We introduce a diagonal n × n matrix D whose diagonal entries are given by:

$$D(i,i) = \sum_{j=1}^{n} y_{ij}. \tag{2.35}$$
In the first step, for binary code inference, STH minimizes the following objective:

$$\sum_{i,j=1}^{n} y_{ij} \|\mathbf{z}_i - \mathbf{z}_j\|^2. \tag{2.36}$$

We define a matrix L as:

$$\mathbf{L} = \mathbf{D} - \mathbf{Y}. \tag{2.37}$$

We refer to L as the graph Laplacian [68]. With some simple algebraic manipulation,
the objective (2.36) can be rewritten as:

$$\operatorname{trace}(\mathbf{Z}^\top \mathbf{L} \mathbf{Z}). \tag{2.38}$$
Overall, the optimization for binary code inference is written as:

$$
\begin{aligned}
\min_{\mathbf{Z}} \quad & \operatorname{trace}(\mathbf{Z}^\top \mathbf{L} \mathbf{Z}) && \text{(2.39a)}\\
\text{s.t.} \quad & \mathbf{Z} \in \{-1, +1\}^{n \times m}, && \text{(2.39b)}\\
& \mathbf{Z}^\top \mathbf{D} \mathbf{1} = \mathbf{0}, && \text{(2.39c)}\\
& \mathbf{Z}^\top \mathbf{D} \mathbf{Z} = \mathbf{I}. && \text{(2.39d)}
\end{aligned}
$$

Here Z is the matrix of the binary codes of all training examples, with one row per
example (so that the constraints (2.39c) and (2.39d) are well defined), and I is an
identity matrix. The above optimization looks similar to that of SPH in (2.32), except
that the diagonal matrix D appears in the constraints. The problem can be relaxed by
dropping the binary constraints in (2.39b); the relaxed problem is the Laplacian
eigenmap (LapEig) problem [73] in manifold learning. We solve the generalized
eigenvalue problem

$$\mathbf{L}\mathbf{v} = \lambda \mathbf{D}\mathbf{v} \tag{2.40}$$

to obtain the m eigenvectors [v_1, . . . , v_m] corresponding to the m smallest eigenvalues;
these eigenvectors are the solution of the relaxed problem. The final binary code
solution Z is obtained by thresholding the eigenvectors, with threshold values given by
the medians of the eigenvectors, which results in balanced binary codes.
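The binary code inference step can be sketched as follows, using SciPy's generalized symmetric eigensolver for (2.40) and median thresholding (a dense-matrix illustration; the function name is ours, and we assume a similarity matrix with strictly positive row sums so that D is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def sth_infer_codes(Y, m):
    """Relaxed solution of (2.39): solve Lv = lambda * Dv, then median-threshold.

    Y: (n, n) symmetric similarity matrix (dense here for simplicity).
    Returns (n, m) binary codes; median thresholding balances each bit.
    """
    D = np.diag(Y.sum(axis=1))
    L = D - Y
    vals, vecs = eigh(L, D)                 # generalized eigenproblem (2.40)
    V = vecs[:, 1:m + 1]                    # m smallest non-trivial eigenvectors
    med = np.median(V, axis=0)              # per-bit median threshold
    return np.where(V >= med, 1, -1)
```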
In the second step, STH trains linear SVM binary classifiers as hash functions to fit
the binary codes obtained in the first step. The SVM classification problem is written
as:
$$
\begin{aligned}
\min_{\mathbf{w}, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i && \text{(2.41a)}\\
\text{s.t.} \quad & \forall i = 1, \dots, n: \; z_i \mathbf{w}^\top \mathbf{x}_i \geq 1 - \xi_i, && \text{(2.41b)}\\
& \boldsymbol{\xi} \geq 0. && \text{(2.41c)}
\end{aligned}
$$
We need to train m classifiers in the second step. The learned hash functions have the
following form:

$$h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). \tag{2.42}$$
In contrast to SPH, STH trains simple classifiers as hash functions to solve the
out-of-sample extension problem, and makes no restrictive assumption about the data
distribution. It has been shown empirically that this simple approach to hash function
learning significantly outperforms SPH.
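The second step can be sketched as fitting one linear classifier per bit. To keep the sketch dependency-free we use a ridge-regression stand-in for the linear SVM of (2.41); in practice any linear SVM solver would be used, and the function names are ours:

```python
import numpy as np

def sth_train_hash_functions(X, Z, lam=1e-3):
    """Second STH step: fit one linear classifier per bit to the inferred codes.

    A ridge-regression stand-in for the linear SVM in (2.41). X: (n, d),
    Z: (n, m) codes in {-1,+1}. Returns weights W (d+1, m) incl. a bias row.
    """
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])    # append a bias feature
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d + 1), Xb.T @ Z)
    return W

def sth_hash(X, W):
    """Hash functions of the form (2.42): sign of the linear scores."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    codes = np.sign(Xb @ W)
    codes[codes == 0] = 1                   # break ties toward +1
    return codes
```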
SPH requires Gaussian kernel similarity in order to obtain efficient eigenfunctions. In
contrast, STH is not limited to a particular type of similarity definition. The work in
[30] extends STH to supervised hashing with a label based similarity definition; the
training of supervised STH accepts any user-defined similarity matrix.
The two-step learning scheme in STH provides an interesting paradigm for hashing
learning. Compared to SPH, this approach has a better and simpler solution to the
out-of-sample extension problem. It also avoids the complex optimization of highly
non-convex problems found in some recent methods [28, 29, 31] that directly learn
hash functions.
STH optimizes the Laplacian affinity loss, which only pulls similar data pairs together
but does not push dissimilar data pairs apart; as shown in manifold learning, this may
lead to inferior performance [71]. Moreover, it is not clear how to infer binary codes
under other meaningful loss functions, such as the KSH [31], BRE [28] and MLH [29]
loss functions. In Chapter 6, we propose a general two-step learning framework that
extends STH to a much more general setting. Our approach can easily incorporate any
Hamming distance or Hamming affinity based loss function. For example, using the
KSH or BRE loss function, we are able to outperform STH with its Laplacian affinity
loss function.
2.3.4 Supervised hashing with kernels
Preserving semantic similarity is desirable in many computer vision applications. Images
from the same category or of the same object may be far apart (e.g., in Euclidean
distance) in the feature space because of variations in viewpoint, illumination, scale
and so on. Similarity measured in the input feature space (e.g., Euclidean distance,
cosine similarity, Gaussian affinity) may therefore fail to reveal semantic similarity.
Supervised hashing is designed to preserve label based similarity.
Supervised hashing with kernels (KSH) [31] is a supervised method for learning kernel
hash functions. KSH has shown better performance than many other supervised
methods, e.g., MLH [29] and BRE [28]. KSH minimizes a Hamming affinity loss function
for hash function learning. The Hamming affinity is closely related to the Hamming
distance; however, Hamming distance based loss functions usually require complex
optimization, as in MLH and BRE. As shown in KSH, using a Hamming affinity loss
simplifies the optimization.
Let m denote the number of hash functions to be learned. The output of these m hash
functions is denoted by Φ(x):

$$\Phi(\mathbf{x}) = [h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_m(\mathbf{x})]. \tag{2.43}$$

Each hash function h(x) has binary output: h(x) ∈ {−1, +1}. The Hamming affinity is
calculated by the inner product of two binary codes:

$$s_{\mathrm{hm}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r=1}^{m} h_r(\mathbf{x}_i)\, h_r(\mathbf{x}_j). \tag{2.44}$$
Intuitively, the optimization should encourage positive affinity values for similar data
pairs and negative affinity values for dissimilar data pairs. A similarity matrix Y
encoding pairwise similarity information is provided for supervised learning. Its element
y_ij measures the similarity of the data points x_i and x_j: if y_ij = 1, the two data
points are similar; if y_ij = −1, they are dissimilar; if y_ij = 0, their relation is
undefined. KSH solves the following optimization problem:
$$\min_{\Phi(\cdot)} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ m\, y_{ij} - \sum_{r=1}^{m} h_r(\mathbf{x}_i)\, h_r(\mathbf{x}_j) \Big]^2. \tag{2.45}$$
KSH uses kernel based hash functions. One hash function is written as:

$$h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{q=1}^{Q} w_q\, \kappa(\mathbf{x}'_q, \mathbf{x}) + b \Big). \tag{2.46}$$

Here X' = {x'_1, . . . , x'_Q} is a set of Q support vectors, κ(·, ·) is a kernel function,
w_q is a weighting coefficient and b is a bias term. The implementation of KSH uses
the RBF kernel:

$$\kappa(\mathbf{x}'_q, \mathbf{x}) = \exp\Big( -\frac{\|\mathbf{x} - \mathbf{x}'_q\|^2}{2\sigma^2} \Big), \tag{2.47}$$
in which σ is a predefined kernel parameter. Similar to the Kernelized Locality-Sensitive
Hashing (KLSH) [63] method, these support vectors are uniformly sampled from training
examples.
Before the optimization step, KSH evaluates the kernel responses of all training examples
at the predefined support vectors; in this way the input features are transformed into
kernel features, whose dimension is clearly the number of support vectors. To simplify
the optimization, the kernel features are centered by subtracting their means in all
dimensions, and the bias term in (2.46) is removed. The centered kernel feature for the
q-th support vector is defined as:

$$
\begin{aligned}
\psi_q(\mathbf{x}) &= \bar{\kappa}(\mathbf{x}'_q, \mathbf{x}) && \text{(2.48)}\\
&= \kappa(\mathbf{x}'_q, \mathbf{x}) - \frac{1}{n} \sum_{i=1}^{n} \kappa(\mathbf{x}'_q, \mathbf{x}_i), && \text{(2.49)}
\end{aligned}
$$

where $\bar{\kappa}$ denotes the centered kernel. The kernel features at all support
vectors are collected in Ψ(x):

$$\Psi(\mathbf{x}) = [\psi_1(\mathbf{x}), \psi_2(\mathbf{x}), \dots, \psi_Q(\mathbf{x})]. \tag{2.50}$$
The hash function formulation becomes:

$$h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{q=1}^{Q} w_q\, \psi_q(\mathbf{x}) \Big) = \operatorname{sign}(\mathbf{w}^\top \Psi(\mathbf{x})). \tag{2.51}$$
With this hash function formulation, the optimization (2.45) can be rewritten as:

$$\min_{\mathbf{W}} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ m\, y_{ij} - \sum_{r=1}^{m} \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2. \tag{2.52}$$
We need to solve the above optimization to obtain the kernel weighting coefficients w
of all hash functions. KSH solves it greedily, optimizing one bit at a time based on the
solution of the previous bits. When solving for the r-th bit, we define a residual value
a_ijr as:

$$a_{ijr} = m\, y_{ij} - \sum_{p=1}^{r-1} \operatorname{sign}(\mathbf{w}_p^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_p^\top \Psi(\mathbf{x}_j)). \tag{2.53}$$
Thus the optimization for the r-th bit is written as:

$$\min_{\mathbf{w}_r} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ a_{ijr} - \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2. \tag{2.54}$$
To further simplify the optimization, we have the following relations:

$$
\begin{aligned}
& \Big[ a_{ijr} - \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2 && \text{(2.55)}\\
={}& -2 a_{ijr}\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) + \Big[ \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2 + a_{ijr}^2 && \text{(2.56)}\\
={}& -2 a_{ijr}\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) + \mathrm{const}, && \text{(2.57)}
\end{aligned}
$$

where the last step uses the fact that the squared sign product is always 1 and that
$a_{ijr}^2$ does not depend on $\mathbf{w}_r$.
Hence, the optimization in (2.54) is equivalent to:

$$\max_{\mathbf{w}_r} \; \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ijr}\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)). \tag{2.58}$$
To solve the above optimization, KSH first applies spectral relaxation, which essentially
drops the sign function, and solves the relaxed problem to obtain an initial solution w_0.
The relaxed problem is a standard generalized eigenvalue problem whose solution is the
eigenvector corresponding to the largest eigenvalue. KSH then forms a smoothed
problem by replacing the sign function with a sigmoid function, and finally, starting
from the initial solution w_0, solves the smoothed problem with an accelerated gradient
descent algorithm.
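The greedy, bit-by-bit procedure with the spectral-relaxation step only can be sketched as follows. Dropping the sign functions in (2.58) gives max_w wᵀ(ΨᵀAΨ)w, with A the matrix of residuals a_ijr, whose maximizer under a unit-norm constraint is the leading eigenvector. This sketch omits the sigmoid-smoothed refinement, and the function name is ours:

```python
import numpy as np

def ksh_greedy(Psi, Y, m):
    """Greedy bit-by-bit sketch of (2.52), spectral relaxation only.

    Psi: (n, Q) centered kernel features; Y: (n, n) labels in {-1, 0, +1}.
    For each bit, dropping the sign in (2.58) gives max_w w^T (Psi^T A Psi) w,
    solved by the eigenvector of the largest eigenvalue.
    """
    A = m * Y.astype(float)                  # residuals a_ij(r) of (2.53), r = 1
    W = []
    for _ in range(m):
        M = Psi.T @ A @ Psi                  # symmetric since A is symmetric
        vals, vecs = np.linalg.eigh(M)
        w = vecs[:, -1]                      # leading eigenvector
        h = np.sign(Psi @ w)
        h[h == 0] = 1
        A -= np.outer(h, h)                  # update residuals for the next bit
        W.append(w)
    return np.array(W).T                     # (Q, m) weighting coefficients
```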
In Chapter 6, we introduce our general two-step method (TSH), which can easily
incorporate any Hamming affinity or Hamming distance based loss function, including
the KSH loss function described here, the BRE similarity reconstruction loss function,
and the MLH hinge-like loss function. The Hamming affinity based loss in KSH does
simplify the optimization; however, with the two-step decomposition, hashing learning
in our TSH is much simpler than in KSH, which solves for the hash functions directly.
Moreover, the BRE reconstruction loss and the MLH hinge-like loss can also be
efficiently optimized in our TSH framework.
KSH requires a set of predefined support vectors and does not enforce sparsity on the
weighting coefficients, which is impractical for training on large-scale datasets. A small
number of support vectors is usually insufficient for good prediction accuracy; the
default setting in KSH is only 300 support vectors, which is far from enough. However,
a large number of support vectors significantly slows down both the training and testing
of KSH.
In our two-step method TSH, we train binary classifiers to learn the hash functions, so
any sophisticated kernel classifier can be applied. Note that kernel based binary
classification is well studied: for example, LIBSVM [74] is a widely used implementation
of the kernel SVM classifier, and the recent online learning method in [6] is able to
efficiently train a budgeted kernel SVM with any user-defined sparsity. These
sophisticated kernel methods all provide sparse solutions and can be seamlessly applied
in our TSH. Compared to KSH with its naive kernel solution, our TSH is much more
flexible, useful and effective.
Though kernel hash functions can boost performance, they are generally inefficient to
train and test, even with sophisticated training methods; on large-scale,
high-dimensional data, kernel methods become impractically slow. In Chapter 7, we
extend our TSH framework and propose a fast supervised method for hashing learning
on large-scale, high-dimensional data, referred to as FastHash. FastHash efficiently
learns decision trees as hash functions. Compared to kernel hash functions, decision
tree hash functions can easily handle a very large number of training examples with
high dimensionality (tens of thousands), and provide the desired non-linear mapping.
Chapter 3
Fast Training of Effective
Multi-class Boosting Using
Coordinate Descent
In this chapter, we present a novel column generation based boosting method for multi-
class classification [41]. Our multi-class boosting is formulated as a single optimization
problem. Different from most existing multi-class boosting methods, which use the same
set of weak learners for all the classes, we learn a separate set of weak learners for each
class. In other words, the weak learners are class-specific. We show that using these
class-specific weak learners leads to fast convergence, without introducing additional
computational overhead in the training procedure. To further make the training more
efficient and scalable, we also propose a fast coordinate descent method for solving
the optimization problem at each boosting iteration. The proposed coordinate descent
method is conceptually simple and easy to implement in that it has a closed-form solution
for each coordinate update. Experimental results on a variety of datasets show that,
compared to a range of existing multi-class boosting methods, the proposed method
has a much faster convergence rate and better generalization performance in most cases.
We also empirically show that the proposed fast coordinate descent algorithm is able to
efficiently solve the optimization.
3.1 Introduction
Boosting methods combine a set of weak classifiers (weak learners) to form a strong clas-
sifier. Boosting has been extensively studied [75] and applied to a wide range of applica-
tions due to its robustness and efficiency (e.g., real-time object detection [76]). Despite
the fact that most classification tasks are inherently multi-class problems, the majority
of boosting algorithms are designed for binary classification. A popular approach to
multi-class boosting is to split the multi-class problem into a set of binary
classification problems; a simple example is the one-vs-all approach. The well-known
error correcting output coding (ECOC) methods [77] belong to this category, and
AdaBoost.ECC [78], AdaBoost.MH and AdaBoost.MO [79] can all be viewed as examples
of the ECOC approach. The second approach is to directly formulate multi-class
learning as a single learning task based on pairwise model comparisons between
different classes. The direct multi-class boosting formulation of Shen and Hao [1]
(referred to as MultiBoost) is such an example. From the perspective of optimization,
MultiBoost can be seen as an extension of the binary column generation boosting
framework [59, 60] to the multi-class case. Our work here builds upon MultiBoost.
For most existing multi-class boosting methods, including MultiBoost, different classes
share the same set of weak learners. However, a weak learner usually aims to reduce
the error for a particular class. Sharing a weak learner across different classes usually
leads to a sparse solution of the model parameters and hence slow convergence; we
discuss this point in more detail later. To solve this problem, in this work we propose
a novel formulation
(referred to as MultiBoostcw) for multi-class boosting by using separate sets of weak
learners. Namely, each class uses its own set of weak learners. Compared to MultiBoost,
MultiBoostcw converges much faster, generally has better generalization performance
and does not introduce additional time cost for training. Note that AdaBoost.MO
proposed in [79] also uses different sets of weak classifiers for each class. AdaBoost.MO
is based on ECOC and the code matrix in AdaBoost.MO is specified before learning.
Therefore, the underlying dependence between the fixed code matrix and generated
binary classifiers is not explicitly taken into consideration. In contrast, our MultiBoostcw
is based on the direct formulation of multi-class boosting, which leads to fundamentally
different optimization strategies. More importantly, as shown in our experiments, our
MultiBoostcw is much more scalable than AdaBoost.MO, although both enjoy faster
convergence than most other multi-class boosting methods.
MultiBoost requires sophisticated optimization tools such as MOSEK or LBFGS-B [2]
to solve the resulting optimization problem at each boosting iteration, which is not very
scalable. Here we propose a coordinate descent algorithm, termed Fast Coordinate
Descent (FCD), for fast optimization of the resulting problem at each boosting iteration.
Specifically, FCD chooses one variable at a time and efficiently solves the single-variable
sub-problem. The coordinate descent (CD) technique has been applied to solve many
large-scale optimization problems. Yuan et al. [80] present comprehensive empirical
comparisons of ℓ1-regularized classification algorithms, and conclude that CD methods
are very competitive for solving large-scale problems. In the formulation of MultiBoost (also
in our MultiBoostcw), the number of variables is the product of the number of classes
and the number of weak learners, which can be very large (especially when the number
of classes is large). Therefore CD methods may be a better choice for fast optimization
of multi-class boosting. Our method FCD is specially tailored to the optimization of
MultiBoostcw. We are able to obtain a closed-form solution for each variable update,
so the optimization can be extremely fast. Moreover, the proposed FCD is easy to
implement, and no optimization toolbox is required.
3.1.1 Main Contributions
1. We propose a novel multi-class boosting method (MultiBoostcw) that uses
class-specific weak learners. Unlike MultiBoost, which shares a single set of weak
learners across different classes, our method uses a separate set of weak learners
for each class. We generate K (the number of classes) weak learners in each
boosting iteration, one weak learner for each class. With this mechanism, we are
able to achieve much faster convergence.
2. Similar to MultiBoost [1], we employ column generation to implement the boosting
training. We derive the Lagrange dual problem of the new multi-class boosting
formulation which enables us to design fully corrective multi-class algorithms using
the primal-dual optimization technique.
3. We propose a coordinate descent based method (termed as FCD) for fast training
of MultiBoostcw. We obtain an analytical solution for each variable update. We
use the Karush-Kuhn-Tucker (KKT) conditions to derive effective stop criteria
and construct working sets of violated variables for faster optimization. We show
that FCD can be applied not only to fully corrective optimization, which updates
all variables, but also to fast stage-wise optimization, which updates newly added
variables only. The stage-wise optimization of our multi-class boosting is similar
to the optimization in standard AdaBoost.
3.1.2 Notation
A bold lower-case letter (u, v) denotes a column vector. An element-wise inequality between two vectors or matrices, such as u ≥ v, means u_i ≥ v_i for all i. Let us assume that we have K classes. A weak learner is a function that maps an example x to {−1, +1}. We denote each weak learner by φ_{y,j}(·) ∈ C, in which y = 1, ..., K and j = 1, ..., m; C is the space of all possible weak learners, and m is the number of learned weak learners. We define the column vector

Φ_y(x) = [φ_{y,1}(x), φ_{y,2}(x), ..., φ_{y,m}(x)]^⊤    (3.1)
as the outputs of the weak learners associated with the y-th class on example x. We denote by w_y the weak learner coefficients for class y. Then the strong classifier for class y is

f_y(x) = w_y^⊤ Φ_y(x).    (3.2)

We need to learn K strong classifiers, one for each class. Given a test example x, the classification rule is

y* = argmax_y f_y(x).    (3.3)

1 is a vector with all elements being one; its dimension should be clear from the context. Likewise, 0 is a vector with all elements being zero.
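The class-wise prediction rule above can be sketched in a few lines. This is an illustrative sketch rather than the thesis implementation; the weak-learner outputs and coefficients below are made-up values.

```python
import numpy as np

def predict(Phi, W):
    """Predict y* = argmax_y f_y(x) with f_y(x) = w_y^T Phi_y(x).

    Phi: (K, m) array; Phi[y] holds the m weak-learner outputs (each in
         {-1, +1}) associated with class y on a single example x.
    W:   (K, m) array; W[y] is the coefficient vector w_y for class y.
    """
    scores = np.sum(W * Phi, axis=1)  # one score f_y(x) per class
    return int(np.argmax(scores))

# Toy example: K = 3 classes, m = 2 weak learners per class (hypothetical values).
Phi = np.array([[+1, -1], [+1, +1], [-1, -1]])
W = np.array([[0.5, 0.2], [0.4, 0.9], [0.3, 0.1]])
print(predict(Phi, W))  # class 1 wins with score 0.4 + 0.9 = 1.3
```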
3.2 The Proposed Method
We show how to formulate the multi-class boosting problem in the large-margin learning framework. Analogous to MultiBoost, we can define the multi-class margin associated with the training example (x_i, y_i) as

ρ_(i,y) = w_{y_i}^⊤ Φ_{y_i}(x_i) − w_y^⊤ Φ_y(x_i),    (3.4)

for y ≠ y_i. Intuitively, ρ_(i,y) is the difference between the classification score of the right model and that of a "wrong" model. We want to make this margin as large as possible. MultiBoostcw with the exponential loss can be formulated as:

min_{w≥0}  1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp(−ρ_(i,y))    (3.5a)
s.t.  ∀i = 1, ..., n and ∀y ∈ {1, ..., K}\{y_i}:
      w_{y_i}^⊤ Φ_{y_i}(x_i) − w_y^⊤ Φ_y(x_i) = ρ_(i,y).    (3.5b)

As in conventional boosting, w is constrained to be non-negative, and ℓ1-norm regularization is applied to w. The number of constraints is denoted by p; we have p = n × (K − 1), in which n is the number of training examples. The parameter C controls the complexity of the learned model. The model parameter is:

w = [w_1^⊤, w_2^⊤, ..., w_K^⊤]^⊤ ∈ R^{Km×1}.    (3.6)
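For concreteness, the objective (3.5a), with the margin constraints (3.5b) substituted in, can be evaluated directly from the margins. The sketch below is illustrative, with hypothetical array shapes, not the thesis code.

```python
import numpy as np

def multiboostcw_objective(W, Phi_all, labels, C):
    """Evaluate 1^T w + (C/p) * sum_i sum_{y != y_i} exp(-rho_(i,y)).

    W:       (K, m) non-negative coefficients, one row w_y per class.
    Phi_all: (n, K, m) weak-learner outputs; Phi_all[i, y] = Phi_y(x_i).
    labels:  length-n sequence of ground-truth labels y_i.
    """
    n, K, m = Phi_all.shape
    p = n * (K - 1)                               # number of constraints
    scores = np.einsum('ikm,km->ik', Phi_all, W)  # f_y(x_i) for all i, y
    total = 0.0
    for i, yi in enumerate(labels):
        for y in range(K):
            if y != yi:
                rho = scores[i, yi] - scores[i, y]  # margin (3.4)
                total += np.exp(-rho)
    return W.sum() + (C / p) * total
```

As a sanity check, with w = 0 every margin is zero, so the loss term sums to p and the objective equals C.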
Algorithm 1: CG: Column generation for MultiBoostcw
Input: training examples (x_1, y_1), (x_2, y_2), ...; regularization parameter C; the maximum number of boosting iterations.
Output: K discriminant functions f(x, y; w) = w_y^⊤ Φ_y(x), y = 1, ..., K.
1 Initialize: weak learner working sets H_c = ∅ (c = 1, ..., K); initialize ∀(i, y ≠ y_i): µ_(i,y) = 1.
2 repeat
3   solve (3.10) to find K weak learners φ*_c(·), c = 1, ..., K, and add them to the weak learner working sets H_c;
4   solve the primal problem (3.5) on the current weak learner working sets φ_c ∈ H_c, c = 1, ..., K, to obtain w (we use the coordinate descent of Algorithm 2);
5   update the dual variables µ in (3.11) using the primal solution w and the KKT conditions;
6 until the maximum boosting iteration is reached
Minimizing (3.5) encourages the confidence score of the correct label y_i to be larger than the confidence of any other label. We define Y as the set of K labels: Y = {1, 2, ..., K}. The discriminant function f : X × Y → R we need to learn is:

f(x, y; w) = w_y^⊤ Φ_y(x) = Σ_j w_{y,j} φ_{y,j}(x).    (3.7)

The class label prediction y* for an unknown example x is obtained by maximizing f(x, y; w) over y, that is, by finding the class label with the largest confidence:

y* = argmax_y f(x, y; w) = argmax_y w_y^⊤ Φ_y(x).    (3.8)
MultiBoostcw is an extension of MultiBoost [1] for multi-class classification. In MultiBoost, different classes share the same set of weak learners Φ. In contrast, in our MultiBoostcw, each class is associated with a separate set of weak learners. We show that MultiBoostcw learns a more compact model than MultiBoost. MultiBoostcw is a flexible framework that can easily work with different kinds of loss functions; MultiBoostcw with the hinge loss is described in Appendix A.2.
3.2.1 Column generation for MultiBoostcw
To apply column generation to boosting learning, we need to derive the dual problem of (3.5). The details of deriving the dual problem are described in Appendix A.1. The dual problem of (3.5) is written as (3.9), in which c is the index of class labels and µ_(i,y)
Algorithm 2: FCD: Fast coordinate descent for MultiBoostcw
Input: training examples (x_1, y_1), ..., (x_n, y_n); coordinate descent tolerance ε; weak learner sets H_c, c = 1, ..., K; initial value of w; maximum working set iteration τ_max.
Output: w.
1 Initialize: initialize the variable working set S with the variables in w that correspond to newly added weak learners; initialize µ in (3.30); working set iteration index τ = 0.
2 repeat
3   τ = τ + 1; reset the inner loop index: q = 0;
4   while q < |S| (|S| is the size of S) do
5     q = q + 1;
6     pick one variable index j from S: if τ = 1, sequentially pick one, else randomly pick one;
7     compute V− and V+ in (3.35) using µ;
8     update variable w_j via (3.22) using V− and V+;
9     update µ in (3.34) using the updated w_j;
10  compute the violated values θ in (3.29) for all variables;
11  re-construct the variable working set S in (3.31) using θ;
12 until the stop condition (3.32) is satisfied or the maximum iteration is reached: τ ≥ τ_max
is the dual variable associated with one constraint in (3.5b):

max_µ  Σ_i Σ_{y≠y_i} µ_(i,y) [1 − log(p/C) − log µ_(i,y)]    (3.9a)
s.t.  ∀φ(·) ∈ C and ∀c = 1, ..., K:
      Σ_{i: y_i=c} Σ_{y≠y_i} µ_(i,y) φ_{y_i}(x_i) − Σ_i Σ_{y≠y_i, y=c} µ_(i,y) φ_y(x_i) ≤ 1,    (3.9b)
      ∀i = 1, ..., n:  0 ≤ Σ_{y≠y_i} µ_(i,y) ≤ C/p.    (3.9c)
Following the idea of column generation [59], we divide the original problem (3.5) into a master problem and a sub-problem, and solve them alternately. The master problem is a reduced problem of (3.5) which only considers the generated weak learners. The sub-problem is to generate K weak learners (corresponding to the K classes) by finding the most violated constraint of each class in the dual form (3.9), and add them to the master problem. The sub-problem for finding the most violated constraints is written as:

∀c = 1, ..., K:
φ*_c(·) = argmax_{φ_c(·)∈C}  Σ_{i: y_i=c} Σ_{y≠y_i} µ_(i,y) φ_{y_i}(x_i) − Σ_i Σ_{y≠y_i, y=c} µ_(i,y) φ_y(x_i).    (3.10)
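The score inside (3.10) is just a weighted sum of a candidate weak learner's outputs. A minimal sketch (illustrative, not the thesis code; selecting the argmax over a candidate pool is shown only in the comment):

```python
def weak_learner_score(phi_outputs, labels, mu, c):
    """Score of one candidate weak learner phi for class c, as in (3.10).

    phi_outputs: list of outputs phi(x_i) in {-1, +1}, one per example.
    labels:      list of ground-truth labels y_i.
    mu:          dict mapping (i, y), y != y_i, to the dual value mu_(i,y).
    """
    score = 0.0
    for (i, y), m_iy in mu.items():
        if labels[i] == c:        # first sum: examples whose true class is c
            score += m_iy * phi_outputs[i]
        if y == c:                # second sum: constraints with "wrong" label c
            score -= m_iy * phi_outputs[i]
    return score

# For each class c, the sub-problem picks the candidate maximizing this score:
# phi_star_c = max(candidates, key=lambda phi: weak_learner_score(phi, labels, mu, c))
```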
The column generation procedure for MultiBoostcw is described in Algorithm 1. Essen-
tially, we repeat the following two steps until convergence:
1. We solve the master problem (3.5) with φ_c ∈ H_c, c = 1, ..., K, to obtain the primal solution w. H_c is the working set of generated weak learners associated with the c-th class. The dual solution µ* can be calculated from the primal solution w*. According to the KKT conditions, the dual solution is written as:

µ*_(i,y) = (C/p) exp[ w*_y^⊤ Φ_y(x_i) − w*_{y_i}^⊤ Φ_{y_i}(x_i) ].    (3.11)

2. With the dual solution µ*_(i,y), we solve the sub-problem (3.10) to generate K weak learners φ*_c, c = 1, 2, ..., K, and add them to the weak learner working sets H_c.
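The dual update (3.11) in step 1 is an exponential re-weighting of the constraints. A sketch with hypothetical array shapes (not the thesis code):

```python
import numpy as np

def update_duals(W, Phi_all, labels, C):
    """Dual update (3.11): mu_(i,y) = (C/p) * exp(f_y(x_i) - f_{y_i}(x_i)).

    Returns a dict mapping (i, y), y != y_i, to the dual value mu_(i,y).
    """
    n, K, m = Phi_all.shape
    p = n * (K - 1)
    scores = np.einsum('ikm,km->ik', Phi_all, W)  # f_y(x_i) for all i, y
    return {(i, y): (C / p) * np.exp(scores[i, y] - scores[i, yi])
            for i, yi in enumerate(labels) for y in range(K) if y != yi}
```

Constraints with small margins receive large dual weights, so the next round of weak learners concentrates on the labels that are still confused.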
In MultiBoostcw, K weak learners are generated, one per class, in each iteration, while in MultiBoost, only one weak learner is generated in each column generation iteration and shared by all classes. For MultiBoost, as shown in [1], the sub-problem for finding the most violated constraint in the dual form is:

[φ*(·), c*] = argmax_{φ(·)∈C, c∈{1,...,K}}  Σ_{i: y_i=c} Σ_{y≠y_i} µ_(i,y) φ(x_i) − Σ_i Σ_{y≠y_i, y=c} µ_(i,y) φ(x_i).    (3.12)

In MultiBoost, the above problem (3.12) is solved to generate one weak learner in each column generation iteration. Note that solving (3.12) requires searching over all K classes to find the best weak learner φ*, so the computational cost is the same as for MultiBoostcw. This is why MultiBoostcw does not introduce additional training cost compared to MultiBoost.
In general, the solution [w_1, ..., w_K] of MultiBoost is highly sparse; this can be observed in our empirical study. The weak learner generated by solving (3.12) is actually targeted at one class, so using this weak learner across all classes in MultiBoost leads to a very sparse solution. The sparsity of [w_1, ..., w_K] indicates that one weak learner is usually only useful for the prediction of very few classes (typically only one), but useless for most other classes. In this sense, forcing different classes to use the same set of weak learners may not be necessary, and it usually leads to slow convergence. In contrast, using a separate weak learner set for each class, MultiBoostcw tends to have a dense solution w. With K weak learners generated at each iteration, MultiBoostcw converges much faster.
3.2.2 Fast coordinate descent
To further speed up the training, we propose a fast coordinate descent algorithm (FCD) for solving the primal optimization problem (3.5) of MultiBoostcw at each column generation iteration. This efficient algorithm can also be applied to MultiBoost. The details of FCD are presented in Algorithm 2. The high-level idea is simple: FCD works iteratively, and at each iteration (referred to as a working set iteration), we compute the violated value of the KKT conditions for each variable in w, construct a working set of violated variables (denoted S), and then pick variables from S for update (one variable at a time). We also use the violated values to define stop criteria.
Our FCD is a mix of sequential and stochastic coordinate descent. In the first working set iteration, variables are sequentially picked for update (cyclic CD); in later working set iterations, variables are randomly picked (stochastic CD). In the sequel, we present the details of FCD. First, we describe how to update one variable of w by solving a single-variable optimization problem. For notational simplicity, we define:

δΦ_i(y) = Φ_{y_i}(x_i) ⊗ ρ(y_i) − Φ_y(x_i) ⊗ ρ(y),    (3.13)

and

δφ_i(y) = φ_{y_i}(x_i) ⊗ ρ(y_i) − φ_y(x_i) ⊗ ρ(y),    (3.14)

where ρ(y) is the orthogonal label coding vector:

ρ(y) = [δ(y, 1), δ(y, 2), ..., δ(y, K)]^⊤ ∈ {0, 1}^K.    (3.15)
Here δ(y, k) is the indicator function that returns 1 if y = k, and 0 otherwise; ⊗ denotes the tensor product. MultiBoostcw in (3.5) can be equivalently written as:

min_{w≥0}  1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ].    (3.16)
We assume that binary weak learners are used here: φ(x) ∈ {−1, +1}. Let δφ_{i,j}(y) denote the j-th dimension of δΦ_i(y), and let δΦ_{i,\j}(y) denote the remaining dimensions of δΦ_i(y) excluding the j-th. Obviously, δφ_{i,j}(y) only takes three possible values: δφ_{i,j}(y) ∈ {−1, 0, +1}. For the j-th dimension, we define:

D^j_v = { (i, y) | δφ_{i,j}(y) = v, i ∈ {1, ..., n}, y ∈ Y\{y_i} },  v ∈ {−1, 0, +1};    (3.17)

thus D^j_v is the set of constraint indices (i, y) for which the output of δφ_{i,j}(y) is v. Let w_j denote the j-th variable of w, and let w_{\j} denote the remaining variables of w excluding the j-th. Let g(w) be the objective function of the optimization (3.16). g(w) can be decomposed as:
g(w) = 1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ]
     = 1^⊤ w_{\j} + w_j + (C/p) Σ_{i, y≠y_i} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) − w_j δφ_{i,j}(y) ]
     = 1^⊤ w_{\j} + w_j + (C/p) { exp(w_j) Σ_{(i,y)∈D^j_{−1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ]
         + exp(−w_j) Σ_{(i,y)∈D^j_{+1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ]
         + Σ_{(i,y)∈D^j_0} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ] }
     = 1^⊤ w_{\j} + w_j + (C/p) [ exp(w_j) V_− + exp(−w_j) V_+ + V_0 ].    (3.18)
Here we have defined:

V_− = Σ_{(i,y)∈D^j_{−1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ],   V_0 = Σ_{(i,y)∈D^j_0} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ],    (3.19a)
V_+ = Σ_{(i,y)∈D^j_{+1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ].    (3.19b)
In the variable update step, one variable w_j is picked at a time for updating while the remaining variables w_{\j} are fixed; thus we need to minimize g in (3.18) w.r.t. w_j, which is a single-variable minimization. It can be written as:

min_{w_j≥0}  w_j + (C/p) [ V_− exp(w_j) + V_+ exp(−w_j) ].    (3.20)
The derivative of the objective function in (3.20) with w_j > 0 is:

∂g/∂w_j = 0  ⟹  1 + (C/p) [ V_− exp(w_j) − V_+ exp(−w_j) ] = 0.    (3.21)
By solving (3.21) with the bound constraint w_j ≥ 0, we obtain the analytical solution of the optimization in (3.20) (since V_− > 0):

w*_j = max{ 0,  log( √( V_+ V_− + p²/(4C²) ) − p/(2C) ) − log V_− }.    (3.22)

When C is large, (3.22) can be approximately simplified as:

w*_j = max{ 0,  (1/2) log( V_+ / V_− ) }.    (3.23)
With the analytical solution in (3.22), the update of each dimension of w can be performed extremely efficiently. Note that the main requirement for obtaining the closed-form solution is the use of discrete weak learners.
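The update (3.22) and its large-C limit (3.23) are one-liners; the sketch below is illustrative, not the thesis code. A useful sanity check is to substitute the unclipped result back into the optimality condition (3.21).

```python
import numpy as np

def update_wj(V_minus, V_plus, C, p):
    """Closed-form single-variable update (3.22); assumes V_minus > 0."""
    t = np.sqrt(V_plus * V_minus + p**2 / (4.0 * C**2)) - p / (2.0 * C)
    return max(0.0, np.log(t) - np.log(V_minus))

def update_wj_large_C(V_minus, V_plus):
    """Large-C approximation (3.23), used in the stage-wise setting."""
    return max(0.0, 0.5 * np.log(V_plus / V_minus))

# Sanity check: the unclipped solution should satisfy (3.21).
w = update_wj(V_minus=1.0, V_plus=4.0, C=100.0, p=10.0)
residual = 1.0 + (100.0 / 10.0) * (1.0 * np.exp(w) - 4.0 * np.exp(-w))
print(abs(residual) < 1e-6)  # True
```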
We use the KKT conditions to construct a set of violated variables and to derive meaningful stop criteria. For the optimization of MultiBoostcw (3.16), the KKT conditions are necessary and also sufficient for optimality. The Lagrangian of (3.16) is:

L = 1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ] − α^⊤ w.    (3.24)

According to the KKT conditions, w* is optimal for (3.16) if and only if w* satisfies:

w* ≥ 0,  α* ≥ 0,    (3.25)
∀j:  α*_j w*_j = 0,    (3.26)

and

∀j:  ∇_j L(w*) = 0.    (3.27)
For w_j > 0, we have:

∂L/∂w_j = 0  ⟹  1 − (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ] δφ_{i,j}(y) − α_j = 0.

Considering the complementary slackness α*_j w*_j = 0: if w*_j > 0, we have α*_j = 0; if w*_j = 0, we have α*_j ≥ 0. The optimality conditions can be written as:

∀j:  1 − (C/p) Σ_i Σ_{y≠y_i} exp[ −w*^⊤ δΦ_i(y) ] δφ_{i,j}(y) = 0,  if w*_j > 0;
      1 − (C/p) Σ_i Σ_{y≠y_i} exp[ −w*^⊤ δΦ_i(y) ] δφ_{i,j}(y) ≥ 0,  if w*_j = 0.    (3.28)
For notational simplicity, we define a column vector µ as in (3.30). With the optimality conditions (3.28), we define θ_j in (3.29) as the violated value of the j-th variable of the solution w*:

θ_j = | 1 − (C/p) Σ_i Σ_{y≠y_i} µ_(i,y) δφ_{i,j}(y) |,              if w*_j > 0;
θ_j = max{ 0,  (C/p) Σ_i Σ_{y≠y_i} µ_(i,y) δφ_{i,j}(y) − 1 },    if w*_j = 0,    (3.29)

in which

µ_(i,y) = exp[ −w^⊤ δΦ_i(y) ].    (3.30)
At each working set iteration of FCD, we compute the violated values θ and construct a working set S of violated variables; then we randomly (except in the first iteration) pick one variable from S for update. We repeat the picking |S| times, where |S| is the number of elements in S. S is defined as:

S = { j | θ_j > ε },    (3.31)

where ε is a tolerance parameter. Analogous to [81] and [80], with the definition of the variable violated values θ in (3.29), we can define the stop criterion as:

max_j θ_j ≤ ε,    (3.32)

where ε can be the same tolerance parameter as in the definition of the working set S in (3.31). The stop condition (3.32) says that if the largest violated value is smaller than some threshold, FCD terminates. We can see that using the KKT conditions is actually using the gradient information. An inexact solution for w is acceptable at each column generation iteration, thus we place a maximum iteration number (τ_max in Algorithm 2) on FCD to prevent unnecessary computation.
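The violated values (3.29) and the working set (3.31) reduce to a few vectorized lines. The sketch below assumes the constraints have been flattened into a vector µ and a matrix of δφ values (hypothetical shapes, not the thesis code):

```python
import numpy as np

def violated_values(w, mu, dphi, C, p):
    """theta_j in (3.29) for every variable, from the KKT conditions.

    w:    (d,) current primal variables.
    mu:   (q,) values mu_(i,y) from (3.30), one per constraint (i, y).
    dphi: (q, d) matrix of delta-phi_{i,j}(y) entries in {-1, 0, +1}.
    """
    grad_term = (C / p) * (mu @ dphi)  # (C/p) * sum_{i,y} mu * dphi, per variable j
    return np.where(w > 0,
                    np.abs(1.0 - grad_term),           # interior: gradient must vanish
                    np.maximum(0.0, grad_term - 1.0))  # at the bound: only a descent direction violates

def working_set(theta, eps):
    """Working set S in (3.31); FCD stops when max(theta) <= eps, as in (3.32)."""
    return np.flatnonzero(theta > eps)
```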
We need to compute µ before obtaining θ, but computing µ in (3.30) is expensive. Fortunately, we are able to incrementally update µ after the update of one variable w_j, avoiding the re-computation of (3.30). µ in (3.30) can equivalently be written as:

µ_(i,y) = exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) − w_j δφ_{i,j}(y) ].    (3.33)

The update of µ is then:

µ_(i,y) = µ^old_(i,y) exp[ δφ_{i,j}(y) (w^old_j − w_j) ].    (3.34)

With the definition of µ in (3.33), the values V_− and V_+ for one variable update can be efficiently computed using µ, avoiding the expensive computation in (3.19a) and (3.19b); V_− and V_+ can equivalently be written as:

V_− = Σ_{(i,y)∈D^j_{−1}} µ_(i,y) exp(−w_j),   V_+ = Σ_{(i,y)∈D^j_{+1}} µ_(i,y) exp(w_j).    (3.35)
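The caching scheme in (3.33)–(3.35) can be sketched as follows (illustrative, with µ stored as an array aligned with the vector of δφ_{i,j}(y) values for the chosen j):

```python
import numpy as np

def update_mu(mu, dphi_j, wj_old, wj_new):
    """Incremental update (3.34) of all mu_(i,y) after one change of w_j."""
    return mu * np.exp(dphi_j * (wj_old - wj_new))

def v_terms(mu, dphi_j, wj):
    """V- and V+ in (3.35), recovered from the cached mu values."""
    V_minus = np.sum(mu[dphi_j == -1]) * np.exp(-wj)
    V_plus = np.sum(mu[dphi_j == +1]) * np.exp(wj)
    return V_minus, V_plus
```

Each coordinate update then costs roughly O(q) for q constraints, instead of the full dot products needed to re-evaluate (3.30) from scratch.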
Some discussion on FCD (Algorithm 2) is as follows:
1) Stage-wise optimization is a special case of FCD. Compared to totally corrective optimization, which considers all variables of w for update, stage-wise optimization only considers the newly added variables. We initialize the working set with the newly added variables. In the first working set iteration, we sequentially update the newly added variables. If the maximum working set iteration is set to 1 (τ_max = 1 in Algorithm 2), FCD becomes a stage-wise algorithm. Thus FCD is a generalized algorithm with totally corrective update and stage-wise update as special cases. In the stage-wise setting, a large C (regularization parameter) is usually implicitly enforced, thus we can use the analytical solution in (3.23) for the variable update.
2) Randomly picking one variable for update without any guidance leads to slow local convergence. When the solution gets close to optimality, usually only very few variables need updating, and most picks do not "hit". In column generation (CG), w is initialized with the solution of the last CG iteration; this initialization is already fairly close to optimality. Therefore the slow local convergence of stochastic coordinate descent (CD) is more serious in column generation based boosting. Here we have used the KKT conditions to iteratively construct a working set of violated variables, and only the variables in the working set need updating. This strategy leads to faster CD convergence.
3.3 Experiments
We evaluate our method MultiBoostcw on some UCI datasets and a variety of multi-class image classification applications, including digit recognition, scene recognition, and traffic sign recognition. We compare MultiBoostcw against MultiBoost [1] with the exponential loss, and some popular existing multi-class boosting algorithms: AdaBoost.ECC [78], AdaBoost.MH [79] and AdaBoost.MO [79]. We use FCD as the solver for MultiBoostcw, and LBFGS-B [2] for MultiBoost. We also perform further experiments to evaluate FCD in detail. For all experiments, the best regularization parameter C for MultiBoostcw and MultiBoost is selected from 10² to 10⁵; the tolerance parameter in FCD is set to 0.1 (ε = 0.1). We use MultiBoostcw-1 (CW-1) to denote MultiBoostcw using the stage-wise setting of FCD. The suffix "-1" here means the iteration parameter (τ_max) in Algorithm 2 is set to 1. In MultiBoostcw-1, we fix C to a large value: C = 10⁸.
All experiments are run 5 times. We compare the testing error, the total training
time and solver time on all datasets. The results show that our MultiBoostcw and
MultiBoostcw-1 converge much faster than other methods, use less training time than
MultiBoost, and achieve the best testing error on most datasets.
AdaBoost.MO [79] (Ada.MO) has a similar convergence rate to our method, but it is much slower and becomes intractable for large-scale datasets. We run Ada.MO on some UCI datasets and MNIST; results are shown in Figures 3.1, 3.2 and 3.3. We set a maximum training time (1000 seconds) for Ada.MO; the training time of all other methods is below this maximum on those datasets. If the maximum time is reached, we report the results of the finished iterations.
[Figure: test error, training time, and solver time versus boosting iterations on VOWEL and ISOLET.]
Figure 3.1: Results on 2 UCI datasets: VOWEL and ISOLET. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. The number after each method name is the mean value with standard deviation at the last iteration. Our methods converge much faster and achieve competitive test accuracy. Both the total training time and the solver time of our methods are less than those of MultiBoost [1].
3.3.1 UCI datasets
We use 2 UCI multi-class datasets: VOWEL and ISOLET. For each dataset, we randomly select 75% of the data for training and the rest for testing. Results are shown in Figure 3.1.
[Figure: test error, training time, and solver time versus boosting iterations on USPS and PENDIGITS.]
Figure 3.2: Experiments on 2 handwritten digit recognition datasets: USPS and PENDIGITS. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on all datasets.
3.3.2 Handwritten digit recognition
We use 3 handwritten digit datasets: MNIST, USPS and PENDIGITS. For MNIST, we randomly sample 1000 examples from each class, and use the original test set of 10,000 examples. For USPS and PENDIGITS, we randomly select 75% of the data for training and the rest for testing. Results are shown in Figures 3.2 and 3.3.
[Figure: test error, training time, and solver time versus boosting iterations on MNIST.]
Figure 3.3: Experiments on the handwritten digit recognition dataset MNIST. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on this dataset.
3.3.3 Three Image datasets: PASCAL07, LabelMe, CIFAR10
For PASCAL07, we use the 5 types of features provided in [82]. For LabelMe, we use the subset LabelMe-12-50k¹ [83] and generate GIST features. For these two datasets, we use those images which have only one class label. We use 70% of the data for training and the rest for testing. For CIFAR10², we construct 2 datasets: one uses GIST features and the other uses the raw pixel values. We use the provided test set and 5 training sets for the 5 runs. Results are shown in Figure 3.5.
3.3.4 Scene recognition
We use 2 scene image datasets: Scene15 [84] and SUN [3]. For Scene15, we randomly
select 100 images per class for training, and the rest for testing. We generate histograms
1 http://www.ais.uni-bonn.de/download/datasets.html
2 http://www.cs.toronto.edu/~kriz/cifar.html
[Figure: test error and solver time versus boosting iterations on GTSRB.]
Figure 3.4: Results on a traffic sign dataset: GTSRB. CW and CW-1 (stage-wise setting) are our methods. Our methods converge much faster, achieve the best test error and use less training time.
of code words as features. The code book size is 200. An image is divided into 31 sub-windows in a spatial hierarchical manner. We generate a histogram in each sub-window; thus the histogram feature dimension is 6200. For the SUN dataset, we construct a subset of the original dataset containing 25 categories. For each category, we use the top 200 images, and randomly select 80% of the data for training and the rest for testing. We use the HOG features described in [3]. Results are shown in Figure 3.6.
3.3.5 Traffic sign recognition
We use the GTSRB³ traffic sign dataset. There are 43 classes and more than 50,000 images. We use the provided 3 types of HOG features; thus there are 6052 features in total. We randomly select 100 examples per class for training and use the original test set. Results are shown in Figure 3.4.
3.3.6 FCD evaluation
We perform further experiments to evaluate FCD with different parameter settings, and compare it to the LBFGS-B [2] solver. We use 3 datasets in this section: VOWEL, USPS and SCENE15. We run FCD with different settings of the maximum working set iteration (τ_max in Algorithm 2) to evaluate how this setting affects the performance of FCD. We also run the LBFGS-B [2] solver on the same optimization (3.5) as FCD. We set C = 10⁴ for all cases. Results are shown in Figure 3.7. For LBFGS-B, we use the default convergence setting to obtain a moderately accurate solution. The number after "FCD" in the figure is the setting of τ_max in Algorithm 2 for FCD. Results show that the stage-wise case (τ_max = 1) of FCD is the fastest, as expected. When we set τ_max ≥ 2, the objective value of the optimization (3.5) converges much faster than with LBFGS-B. Thus setting τ_max = 2 is sufficient to achieve a very accurate solution, while at the same time giving faster convergence and less running time than LBFGS-B.

3 http://benchmark.ini.rub.de/
3.4 Conclusion
We have presented a novel multi-class boosting method based on the column generation technique. Different from most existing multi-class boosting methods, we train a separate set of weak learners for each class, which results in much faster convergence. We also develop an efficient coordinate descent method for solving the optimization. A wide range of experiments demonstrates that the proposed multi-class boosting achieves competitive testing accuracy, converges much faster, and has fast training speed due to the proposed efficient coordinate descent algorithm.
[Figure: test error and solver time versus boosting iterations on PASCAL07, LABELME-SUB, CIFAR10-GIST and CIFAR10-RAW.]
Figure 3.5: Experiments on 3 image datasets: PASCAL07, LabelMe and CIFAR10. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time.
[Figure: test error, training time, and solver time versus boosting iterations on SCENE15 and SUN-25.]
Figure 3.6: Experiments on 2 scene recognition datasets: SCENE15 and a subset of SUN. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time.
[Figure: objective function value and solver time versus boosting iterations on VOWEL and USPS, for FCD with various τ_max settings and for LBFGS-B.]
Figure 3.7: Solver comparison between FCD with different parameter settings and LBFGS-B [2]. One column per dataset. The number after "FCD" is the setting of the maximum iteration (τ_max) of FCD. The stage-wise setting of FCD is the fastest. See the text for details.
Chapter 4
StructBoost: Boosting Methods
for Predicting Structured Output
Variables
Boosting is a method for learning a single accurate predictor by linearly combining a set
of less accurate weak learners. Recently, structured learning has found many applications
in computer vision. Inspired by structured support vector machines (SSVM), here we
propose a new boosting algorithm for structured output prediction, which we refer to
as StructBoost [42]. StructBoost supports nonlinear structured learning by combining
a set of weak structured learners.
As SSVM generalizes SVM, our StructBoost generalizes standard boosting approaches
such as AdaBoost or LPBoost to structured learning. The resulting optimization problem of StructBoost is more challenging than that of SSVM in the sense that it may involve
exponentially many variables and constraints. In contrast, for SSVM one usually has
an exponential number of constraints and a cutting-plane method is used. In order to
efficiently solve StructBoost, we formulate an equivalent 1-slack formulation and solve
it using a combination of cutting planes and column generation. We show the versa-
tility and usefulness of StructBoost on a range of problems such as optimizing the tree
loss for hierarchical multi-class classification, optimizing the Pascal overlap criterion
for robust visual tracking and learning conditional random field parameters for image
segmentation.
50 Chapter 4 StructBoost: Boosting Methods for Predicting Structured Output Variables
4.1 Introduction
Structured learning has attracted considerable attention in machine learning and com-
puter vision in recent years (see, for example [4, 7, 10, 15]). Conventional supervised
learning problems, such as classification and regression, aim to learn a function that
predicts a single best output value y ∈ R for an input vector x ∈ R^d. In many applications, however, the outputs are often complex and cannot be well represented by a single scalar; the most appropriate outputs are objects (vectors, sequences, trees,
etc.). The components of the output are inter-dependent. Such problems are referred
to as structured output prediction.
Structured support vector machines (SSVM) [7] generalize the multi-class SVM of [85]
and [61] to the much broader problem of predicting interdependent and structured out-
puts. SSVM uses discriminant functions that take advantage of the dependencies and
structure of outputs. In SSVM, the general form of the learned discriminant function is f(x,y;w) : X × Y → R over input-output pairs, and the prediction is achieved by maximizing f(x,y;w) over all possible y ∈ Y. Note that to introduce non-linearity, the
discriminant function can be defined by an implicit feature mapping function that is
only accessible as a particular inner product in a reproducing kernel Hilbert space. This
is the so-called kernel trick.
On the other hand, boosting algorithms linearly combine a set of moderately accurate
weak learners to form a nonlinear strong predictor, whose prediction performance is
usually highly accurate. Recently, Shen and Hao [1] proposed a direct formulation for
multi-class boosting using the loss functions of multi-class SVMs [61, 85]. Inspired by the
general boosting framework of Shen and Li [59], they implemented multi-class boosting
using column generation. Here we go further by generalizing multi-class boosting of
Shen and Hao to broad structured output prediction problems. StructBoost thus enables
nonlinear structured learning by combining a set of weak structured learners.
The effectiveness of SSVM has been limited by the fact that only the linear kernel is
typically used. This limitation arises largely as a result of the computational expense
of training and applying kernelized SSVMs. Nonlinear kernels often deliver improved
prediction accuracy over that of linear kernels, but at the cost of significantly higher
memory requirements and computation time. This is particularly the case when the
training size is large, because the number of support vectors is linearly proportional to
the size of training data [86]. Boosting, however, learns models which are much faster
to evaluate. Boosting can also select relevant features during the course of learning by
using particular weak learners such as decision stumps or decision trees, while almost all
nonlinear kernels are defined on the entire feature space. It thus remains difficult (if not
impossible) to see how kernel methods can select/learn explicit features. For boosting,
the learning procedure also selects or induces relevant features. The final model learned
by boosting methods is thus often significantly simpler and computationally cheaper. In
this sense, the proposed StructBoost possesses the advantages of both nonlinear SSVM
and boosting methods.
4.1.1 Main contributions
The main contributions of this work are three-fold.
1. We propose StructBoost, a new fully-corrective boosting method that combines a
set of weak structured learners for predicting a broad range of structured outputs.
We also discuss special cases of this general structured learning framework, includ-
ing multi-class classification, ordinal regression, optimization of complex measures
such as the Pascal image overlap criterion and conditional random field (CRF)
parameters learning for image segmentation.
2. To implement StructBoost, we adapt the efficient cutting-plane method—originally
designed for efficient linear SVM training [87]—for our purpose. We equivalently
reformulate the n-slack optimization to 1-slack optimization.
3. We apply the proposed StructBoost to a range of computer vision problems and
show that StructBoost can indeed achieve state-of-the-art performance in some
of the key problems in the field. In particular, we demonstrate a state-of-the-art
object tracker trained by StructBoost. We also demonstrate an application for
CRF and super-pixel based image segmentation, using StructBoost together with
graph cuts for CRF parameter learning.
Since StructBoost builds upon the fully corrective boosting of Shen and Li [59], it inherits
the desirable properties of column generation based boosting, such as a fast convergence
rate and clear explanations from the primal-dual convex optimization perspective.
4.1.2 Related work
The current state-of-the-art structured learning methods are CRF [88] and SSVM [7],
which capture the interdependency among output variables. Note that CRFs formulate
global training for structured prediction as a convex optimization problem. SSVM also
follows this path but employs a different loss function (hinge loss) and optimization
methods. Our StructBoost is directly inspired by SSVM. StructBoost can be seen as
an extension of boosting methods to structured prediction. It therefore builds upon
the column generation approach to boosting from [59] and the direct formulation for
multi-class boosting [1]. Indeed, we show that the multi-class boosting of [1] is a special
case of the general framework presented here.
CRF and SSVM have been applied to various problems in machine learning and com-
puter vision mainly because the learned models can easily integrate prior knowledge
about the problem structure. For example, the linear chain CRF has been widely used
in natural language processing [88, 89]. SSVM takes the context into account using the
joint feature maps over the input-output pairs, where features can be represented equiv-
alently as in CRF [87]. CRF is particularly of interest in computer vision for its success
in semantic image segmentation [90]. A critical issue in semantic image segmentation is
to integrate local and global features for the prediction of local pixel/segment labels. Se-
mantic segmentation is achieved by exploiting the class information with a CRF model.
SSVM can also be used for similar purposes as demonstrated in [91]. Blaschko and
Lampert [10] trained SSVM models to predict the bounding box of objects in a given
image, by optimizing the Pascal bounding box overlap score. The work in [4] introduced
structured learning to real-time object detection and tracking, which also optimizes the
Pascal box overlap score. SSVM has also been used to learn statistics that capture the
spatial arrangements of various object classes in images [92]. The trained model can
then simultaneously predict a structured labeling of the entire image. Based on the idea
of large-margin learning in SSVM, Szummer et al. [8] learned optimal parameters of a
CRF, avoiding tedious cross validation. The survey of [15] provided a comprehensive
review of structured learning and its application in computer vision.
Dietterich et al. [93] learned the CRF energy functions using gradient tree boosting.
There the functional gradient of the CRF conditional likelihood is calculated, such that
a regression tree (weak learner) is induced as in gradient boosting. An ensemble of trees
is produced by iterating this procedure. In contrast, here we show that it is possible to
learn CRF parameters within the large-margin framework, by generalizing the work of
[8, 9] where CRF parameters are learned using SSVM. In our case, we do not require
approximations such as pseudo-likelihood. Another relevant work is [94], where Munoz
et al. used the functional gradient boosting methodology to discriminatively learn max-
margin Markov networks (M3N), as proposed by Taskar et al. [49]. The random fields’
potential functions are learned following gradient boosting [95].
There are a few structured boosting methods in the literature. As we discuss here, all of them are based on gradient boosting, and thus are not as general as the method we propose here. Ratliff et al. [96, 97] proposed a boosting-based approach for imitation
learning based on structured prediction, called maximum margin planning (MMP). Their
method is named MMPBoost. To train MMPBoost, a demonstrated policy is provided
as example behavior as the input, and the problem is to learn a function over features
of the environment that produces policies with similar behavior. Although MMPBoost
is structured learning in that the output is a vector, it differs from ours fundamentally.
First, the optimization procedure of MMPBoost is not directly defined on the joint
function f(x,y;w). Second, MMPBoost is based on gradient descent boosting [95], and
StructBoost is built upon fully corrective boosting of Shen and Li [59, 98].
Parker et al. [99] have also successfully applied gradient tree boosting to learning se-
quence alignment. Later, Parker [100] developed a margin-based structured perceptron
update and showed that it can incorporate general notions of misclassification cost as
well as kernels. In these methods, the objective function typically consists of an expo-
nential number of terms that correspond to all possible pairs of (y,y′). Approxima-
tion is required to make the computation of gradient tractable [99]. Wang et al. [101]
learned a local predictor using standard methods, e.g., SVM, but then achieved im-
proved structured classification by exploiting the influence of misclassified components
after structured prediction, and iteratively re-training the local predictor. This approach
is heuristic and it is more like a post-processing procedure—it does not directly optimize
the structured learning objective.
4.1.3 Notation
A bold lower-case letter (u, v) denotes a column vector. An element-wise inequality
between two vectors or matrices such as u ≥ v means ui ≥ vi for all i. Let x be an
input, y be an output, and the input-output pair be (x,y) ∈ X × Y, with X ⊂ R^d. Unlike classification (Y = {1, 2, . . . , k}) or regression (Y ⊂ R) problems, we are interested in the case where elements of Y are structured variables such as vectors, strings, or graphs.
Recall that the proposed StructBoost is a structured boosting method, which combines
a set of weak structured learners (or weak compatibility functions). We denote by C
the domain of all possible weak structured learners. Note that C is typically very large,
or even infinite. Each weak structured learner ψ(·, ·) ∈ C is a function that maps an
input-output pair (x,y) to a scalar value which measures the compatibility of the input
and output. We define column vector
Ψ(x,y) = [ψ1(x,y), · · · , ψm(x,y)]> (4.1)
to be the vector of outputs of all weak structured learners. Thus Ψ(x,y) plays the same role as the joint mapping vector in SSVM, which relates input x and output y. The form
of a weak structured learner is task-dependent. We show some examples of ψ(·, ·) in
Section 4.3. The discriminant function that we aim to learn is f : X × Y → R, which measures the compatibility over input-output pairs. It has the form of
f(x,y;w) = w>Ψ(x,y) = Σ_j wj ψj(x,y),   (4.2)
with w ≥ 0. As in other structured learning models, the process for predicting a struc-
tured output (or inference) is to find an output y that maximizes the joint compatibility
function:
y* = argmax_y f(x,y;w) = argmax_y w>Ψ(x,y).   (4.3)
We denote by 1 a column vector of all 1’s, whose dimension shall be clear from the
context.
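For a finite, enumerable output set Y, the scoring in (4.2) and the prediction rule in (4.3) can be sketched in a few lines. The two weak compatibility functions and the weights below are hypothetical, chosen purely for illustration:

```python
import numpy as np

def predict(x, Y, weak_learners, w):
    """Structured prediction (4.3): return the y in Y maximizing w^T Psi(x, y),
    where Psi(x, y) collects the outputs of all weak structured learners (4.1)."""
    def score(y):
        psi = np.array([psi_j(x, y) for psi_j in weak_learners])  # Psi(x, y)
        return float(w @ psi)                                     # f(x, y; w)
    return max(Y, key=score)

# Toy example: outputs are class indices; the weak learners are made up.
weak_learners = [lambda x, y: float(x[y]),            # compatibility via x[y]
                 lambda x, y: 1.0 if y == 0 else 0.0]
x = np.array([0.2, 0.9, 0.4])
w = np.array([1.0, 0.5])                              # non-negative weights
best_y = predict(x, [0, 1, 2], weak_learners, w)      # scores: 0.7, 0.9, 0.4
```

In real structured tasks Y is too large to enumerate, and the `max` above is replaced by task-specific inference such as dynamic programming or graph cuts.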
We describe the StructBoost approach in Section 4.2, including how to efficiently solve
the resulting optimization problem. We then highlight applications in various domains
in Section 4.3. Experimental results are shown in Section 4.4 and we conclude this
chapter in the last section.
4.2 Structured boosting
We first introduce the general structured boosting framework, and then apply it to a
range of specific problems: classification, ordinal regression, optimizing special criteria
such as the area under the ROC curve and the Pascal image area overlap ratio, and
learning CRF parameters.
To measure the accuracy of prediction we use a loss function, and as is the case with
SSVM, we accept arbitrary loss functions ∆ : Y × Y → R. ∆(y, z) calculates the loss associated with a prediction z against the true label value y. Note that in general we assume that ∆(y,y) = 0, ∆(y, z) > 0 for any z ≠ y, and that the loss is upper bounded.
The formulation of StructBoost can be written as (n-slack primal):
min_{w≥0, ξ≥0}  1>w + (C/n) 1>ξ   (4.4a)
s.t. ∀i = 1, . . . , n and ∀y ∈ Y :
w>[Ψ(xi,yi) − Ψ(xi,y)] ≥ ∆(yi,y) − ξi.   (4.4b)
Here we have used the ℓ1 norm as the regularization function to control the complexity of the learned model. To simplify the notation, we introduce
δψi(y) = ψ(xi,yi)− ψ(xi,y); (4.5)
and,
δΨi(y) = Ψ(xi,yi)−Ψ(xi,y); (4.6)
then the constraints in (4.4) can be re-written as:
w>δΨi(y) ≥ ∆(yi,y)− ξi. (4.7)
There are two major obstacles to solving problem (4.4). First, as in conventional boosting,
because the set of weak structured learners ψ(·, ·) can be exponentially large or even
infinite, the dimension of w can be exponentially large or infinite. Thus, in general, we
are not able to directly solve for w. Second, as in SSVM, the number of constraints
(4.4b) can be extremely or infinitely large. For example, in the case of multi-label or multi-class classification, the label y can be represented as a binary vector (or string), and the number of possible y is clearly exponential in the length of the vector, up to 2^|Y|. In other words, problem (4.4) can have an extremely or infinitely
large number of variables and constraints. This is significantly more challenging than
solving standard boosting or SSVM in terms of optimization. In standard boosting, one
has a large number of variables while in SSVM, one has a large number of constraints.
For the moment, let us put aside the difficulty of the large number of constraints, and
focus on how to iteratively solve for w using column generation as in boosting methods
[59, 60, 98]. We derive the Lagrange dual of the optimization of (4.4) as:
max_{µ≥0}  Σ_{i,y} µ(i,y) ∆(yi,y)   (4.8a)
s.t. ∀ψ ∈ C :  Σ_{i,y} µ(i,y) δψi(y) ≤ 1,   (4.8b)
∀i = 1, . . . , n :  0 ≤ Σ_y µ(i,y) ≤ C/n.   (4.8c)
Here µ are the Lagrange dual variables (Lagrange multipliers). We denote by µ(i,y) the
dual variable associated with the margin constraints (4.4b) for label y and training pair
(xi,yi). Details for deriving the dual problem are described in Appendix B.1.
The idea of column generation is to split the original primal problem in (4.4) into two
problems: a master problem and a subproblem. The master problem is the original
problem with only a subset of variables (or constraints for the dual form) being consid-
ered. The subproblem is to add new variables (or constraints for the dual form) into the
master problem. With the primal-dual pair of (4.4) and (4.8) and following the general
framework of column generation based boosting [59, 60, 98], we obtain our StructBoost
as follows:
Iterate the following two steps until convergence :
1. Solve the following subproblem, which generates the best weak structured learner
by finding the most violated constraint in the dual:
ψ*(·, ·) = argmax_{ψ(·,·)}  Σ_{i,y} µ(i,y) δψi(y).   (4.9)
2. Add the selected structured weak learner ψ?(·, ·) into the master problem (either
the primal form or the dual form) and re-solve for the primal solution w and dual
solution µ.
The stopping criterion can be that no violated weak learner can be found. Formally, for the selected ψ*(·, ·) from (4.9) and a preset precision εcg > 0, if the following relation holds:
Σ_{i,y} µ(i,y) δψ*_i(y) ≤ 1 − εcg,   (4.10)
we terminate the iteration. Algorithm 3 presents the details of column generation for StructBoost. This approach, however, may not be practical, because it is very expensive to solve the master problem (the reduced problem of (4.4)) at each column generation step (boosting iteration), which can still have extremely many constraints due to the size of
the set y ∈ Y. The direct formulation for multi-class boosting in [1] can be seen as a
specific instance of this approach, which is in general very slow. We therefore propose
to employ the 1-slack formulation for efficient optimization, which is described in the
next section.
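The two alternating steps above can be summarized in a short skeleton. The subproblem and master solvers are task-specific and therefore left abstract here: `find_weak_learner` and `solve_master` are hypothetical callables standing in for (4.9) and the reduced problem of (4.4), respectively.

```python
import numpy as np

def column_generation(find_weak_learner, solve_master,
                      max_iters=100, eps_cg=1e-4):
    """Schematic sketch of StructBoost column generation.

    find_weak_learner(mu) -> (psi, violation): solves subproblem (4.9),
        returning the best weak learner and sum_{i,y} mu(i,y) delta_psi_i(y).
    solve_master(learners) -> (w, mu): re-solves the master problem on the
        current weak learner set, returning primal and dual solutions.
    """
    learners, w, mu = [], np.zeros(0), None
    for _ in range(max_iters):
        psi, violation = find_weak_learner(mu)
        # stopping criterion (4.10): no sufficiently violated weak learner
        if mu is not None and violation <= 1.0 - eps_cg:
            break
        learners.append(psi)
        w, mu = solve_master(learners)
    return learners, w
```

The loop mirrors Algorithm 3: generate a weak learner from the dual solution, add it, and re-solve the master problem until (4.10) holds or the iteration budget is exhausted.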
4.2.1 1-slack formulation for fast optimization
Inspired by the cutting-plane method for fast training of linear SVM [87] and SSVM
[52], we rewrite the above problem into an equivalent “1-slack” form so that the efficient
cutting-plane method can be employed to solve the optimization problem (4.4):
min_{w≥0, ξ≥0}  1>w + Cξ   (4.11a)
s.t. ∀c ∈ {0, 1}^n and ∀y ∈ Y, i = 1, · · · , n :
(1/n) w>[Σ_{i=1}^n ci · δΨi(y)] ≥ (1/n) Σ_{i=1}^n ci ∆(yi,y) − ξ.   (4.11b)
The following theorem shows the equivalence of problems (4.4) and (4.11).
Theorem 4.1. A solution of problem (4.11) is also a solution of problem (4.4), and vice versa. The connections are: w*_(4.11) = w*_(4.4) and ξ*_(4.11) = (1/n) 1>ξ*_(4.4).
Proof. This proof adapts the proof in [87]. Given a fixed w, the slack variables ξ_(4.4) in (4.4) can be solved for as
ξ_{i,(4.4)} = max_y { 0, ∆(yi,y) − w>δΨi(y) },  ∀i.
For (4.11), the optimal ξ_(4.11) given a w can be computed as:
ξ_(4.11) = (1/n) max_{c,y} [ Σ_{i=1}^n ci ∆(yi,y) − w>( Σ_{i=1}^n ci δΨi(y) ) ]
= (1/n) Σ_{i=1}^n max_{ci∈{0,1}, y} [ ci ∆(yi,y) − ci w>δΨi(y) ]
= (1/n) Σ_{i=1}^n max_y { 0, ∆(yi,y) − w>δΨi(y) }
= (1/n) 1>ξ_(4.4).
Note that c ∈ {0, 1}^n in the above equalities. Clearly the objective functions of both problems coincide for any fixed w and the optimal ξ_(4.4) and ξ_(4.11).
As demonstrated in [87] and SSVM [52], cutting-plane methods can be used to solve the
1-slack primal problem (4.11) efficiently. This 1-slack formulation has been used to train
linear SVM in linear time. When solving for w, (4.11) is similar to ℓ1-norm regularized SVM, apart from the extra non-negativity constraint on w in our case.
In order to utilize column generation for designing boosting methods, we need to derive
the Lagrange dual of the 1-slack primal optimization problem, which can be written as
Algorithm 3: Column generation for StructBoost
Input: training examples (x1,y1), (x2,y2), · · · ; trade-off parameter C; termination threshold εcg; the maximum iteration number.
Output: the discriminant function f(x,y;w) = w>Ψ(x,y).
1 Initialize: for each i (i = 1, . . . ,m), randomly pick any y(0)_i ∈ Y; initialize µ(i,y) = C/n for y = y(0)_i, and µ(i,y) = 0 for all y ∈ Y \ {y(0)_i}.
2 repeat
3   Find and add a weak structured learner ψ*(·, ·) by solving the subproblem (4.9) or (4.13);
4   Call Algorithm 4 to obtain w and µ;
5 until either (4.10) is met or the maximum number of iterations is reached;
follows:
max_{λ≥0}  Σ_{c,y} λ(c,y) Σ_{i=1}^n ci ∆(yi,y)   (4.12a)
s.t. ∀ψ ∈ C :  (1/n) Σ_{c,y} λ(c,y) [Σ_{i=1}^n ci · δψi(y)] ≤ 1,   (4.12b)
0 ≤ Σ_{c,y} λ(c,y) ≤ C.   (4.12c)
Here c enumerates all possible vectors in {0, 1}^n. We denote by λ(c,y) the Lagrange dual variable (Lagrange multiplier) associated with the inequality constraint in (4.11b) for c ∈ {0, 1}^n and label y. Details for deriving the dual problem are described in Appendix B.1. The subproblem to find the most violated constraint in the dual form for generating
weak structured learners is:
ψ*(·, ·) = argmax_{ψ(·,·)∈C}  Σ_{c,y} λ(c,y) Σ_i ci δψi(y)
= argmax_{ψ(·,·)∈C}  Σ_{i,y} µ(i,y) δψi(y),  with µ(i,y) := Σ_c λ(c,y) ci.   (4.13)
We have changed the order of summation in order to achieve a similar form as in the
n-slack case.
4.2.2 Cutting-plane optimization for the 1-slack primal
Despite the extra non-negativity constraint w ≥ 0 in our case, it is straightforward to
apply the cutting-plane method in [87] for solving our problem (4.11). The cutting-plane
algorithm for StructBoost is presented in Algorithm 4. A key step in Algorithm 4 is to
solve the maximization for finding an output y′ that corresponds to the most violated
Algorithm 4: Cutting planes for solving the 1-slack primal
Input: cutting-plane tolerance εcp; inputs from Algorithm 3.
Output: w and µ.
1 Initialize: working set W ← ∅; ci = 1, y′i ← any element in Y, for i = 1, . . . ,m.
2 repeat
3   W ← W ∪ {(c1, . . . , cm, y′1, . . . , y′m)};
4   obtain primal and dual solutions w, ξ, λ by solving (4.11) on the current working set W;
5   for i = 1, . . . ,m do
6     y′i = argmax_y ∆(yi,y) − w>δΨi(y);
7     ci = 1 if ∆(yi,y′i) − w>δΨi(y′i) > 0, and ci = 0 otherwise;
8 until (1/n) w>[Σ_{i=1}^n ci δΨi(y′i)] ≥ (1/n) Σ_{i=1}^n ci ∆(yi,y′i) − ξ − εcp;
9 update µ(i,y) = Σ_c λ(c,y) ci for all (c,y) ∈ W;
constraint for every xi (the inference step):
y′i = argmax_y  ∆(yi,y) − w>δΨi(y).   (4.14)
The above maximization problem takes a similar form to that of the output prediction
problem in (4.3). They only differ in the loss term ∆(yi,y). Typically these two prob-
lems can be solved using the same strategy. This inference step usually dominates the
running time for a few applications, e.g., in the application of image segmentation. In
the experiment section, we empirically show that solving (4.11) using cutting planes can
be significantly faster than solving (4.4). Here improved cutting-plane methods such as
[102] can also be adapted to solve our optimization problem at each column generation
boosting iteration.
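For a small, enumerable output set, the loss-augmented inference in (4.14) reduces to a one-line search. The toy multi-class setup below (a one-hot Ψ and a 0/1 loss, both made up for illustration) shows the idea; real applications replace the exhaustive search with task-specific inference:

```python
import numpy as np

def most_violated(x_i, y_i, Y, Psi, Delta, w):
    """Find y' maximizing Delta(y_i, y) - w^T [Psi(x_i, y_i) - Psi(x_i, y)],
    i.e., the most violated margin constraint for example (x_i, y_i)."""
    def violation(y):
        delta_psi = Psi(x_i, y_i) - Psi(x_i, y)
        return Delta(y_i, y) - float(w @ delta_psi)
    return max(Y, key=violation)

# Toy multi-class instance: Psi(x, y) is a one-hot coding of y.
Psi = lambda x, y: np.eye(3)[y]
Delta = lambda y, z: 0.0 if y == z else 1.0
w = np.array([1.0, 0.5, 1.0])
y_prime = most_violated(None, 0, [0, 1, 2], Psi, Delta, w)
```

Here the true label is class 0; class 2 has both unit loss and a score as high as class 0, so it yields the largest violation.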
In terms of implementation of the cutting-plane algorithm, as mentioned in SSVM [52],
a variety of design decisions can have a substantial influence on the practical efficiency of
the algorithm. We have considered some of these design decisions in our implementation.
In our case, we need to call the cutting-plane optimization at each column generation
iteration. Consideration of warm-start initialization between two consecutive column
generations can substantially speed up the training. We re-use the working set in the
cutting-plane algorithm from previous column generation iterations. Finding a new weak
learner in (4.13) is based on the dual solution µ. We need to ensure that the solution of
cutting-plane is able to reach a sufficient precision, such that the generated weak learner
is able to “make progress”. Thus, we can adapt the stopping criterion parameter εcp
in Algorithm 4 according to the cutting-plane precision in the last column generation
iteration.
4.2.3 Discussion
Let us take a closer look at the StructBoost algorithm in Algorithm 3. We can see that
the training loop in Algorithm 3 is almost identical to other fully-corrective boosting
methods (e.g., LPBoost [60] and Shen and Li [59]). Line 3 finds the most violated constraint in the dual form and adds a new weak structured learner to the master problem. The dual solution µ(i,y) defined in (4.13) plays the role of the example weight associated with each training example in conventional boosting methods such as AdaBoost and LPBoost [60]. Then Line 4 solves the master problem, which is the reduced problem of (4.4). Here we can see that the cutting-plane procedure in Algorithm 4 only serves as a solver for the master problem in Line 4 of Algorithm 3. This makes our StructBoost
framework flexible—we are able to replace the cutting-plane optimization by other opti-
mization methods. For example, the bundle methods in [103] may further speed up the
computation.
For the convergence properties of the cutting-plane algorithm in Algorithm 4, readers
may refer to [87] and [52] for details.
Our column generation algorithm in Algorithm 3 is a constraint generation algorithm for the dual problem in (4.8). We can adapt the analysis of the standard constraint generation algorithm for Algorithm 3. In general, for column generation methods, global convergence can be established, but the convergence rate remains unclear unless particular assumptions are made.
4.3 Examples of StructBoost
We consider a few applications of the proposed general structured boosting in this sec-
tion, namely binary classification, ordinal regression, multi-class classification, optimiza-
tion of Pascal overlap score, and CRF parameter learning. We show the particular setup
for each application.
4.3.1 Binary classification
As the simplest example, the LPBoost of [60] for binary classification can be recovered
as follows. The label set is Y = {+1, −1}, and
Ψ(x, y) = (1/2) y Φ(x).   (4.15)
The label cost can be a simple constant; for example, ∆(y, y′) = 1 for y ≠ y′ and 0 for y = y′. Here we have introduced a column vector Φ(x):
Φ(x) = [φ1(x), . . . , φm(x)]>,   (4.16)
which is the vector of outputs of all weak classifiers on example x. The output of a weak classifier, e.g., a decision stump or tree, is usually a binary value: φ(·) ∈ {+1, −1}. In kernel methods, this feature mapping Φ(·) is only known through the so-called kernel trick. Here we explicitly learn this feature mapping. Note that if Φ(x) = x, we have the standard linear SVM.
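With this choice of Ψ, the margin constraint w>δΨi(y) ≥ ∆(yi,y) − ξi for the single wrong label y = −yi collapses to the familiar LPBoost margin yi w>Φ(xi) ≥ 1 − ξi. This identity can be checked numerically; Φ and w below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.choice([-1.0, 1.0], size=6)   # weak classifier outputs Phi(x_i)
w = rng.uniform(0.0, 1.0, size=6)       # non-negative boosting weights
Psi = lambda y: 0.5 * y * Phi           # Psi(x, y) = (1/2) y Phi(x), eq (4.15)

def constraint_lhs(y_i):
    """w^T [Psi(x_i, y_i) - Psi(x_i, -y_i)] for the single wrong label -y_i."""
    return float(w @ (Psi(y_i) - Psi(-y_i)))
```

Since Ψ(x, yi) − Ψ(x, −yi) = yi Φ(x), the left-hand side equals the signed margin yi w>Φ(xi) for either choice of yi.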
4.3.2 Ordinal regression and AUC optimization
In ordinal regression, the labels of the training data are ranks. Let us assume that the label y ∈ R indicates an ordinal scale, and that each pair (i, j) in the set S has the relationship of example i being ranked higher than j, i.e., yi ≻ yj. The primal can be written as
min_{w≥0, ξ≥0}  1>w + (C/n) Σ_{(i,j)∈S} ξij   (4.17a)
s.t. ∀(i, j) ∈ S :  w>[Φ(xi) − Φ(xj)] ≥ 1 − ξij.   (4.17b)
Here Φ(·) is defined in (4.16). Note that (4.17) also optimizes the area under the receiver
operating characteristic (ROC) curve (AUC) criterion. As pointed out in [104], (4.17)
is an instance of the multiclass classification problem. We discuss how the multiclass
classification problem fits in our framework shortly.
Here, the number of constraints is quadratic in the number of training examples, so directly solving (4.17) is feasible only for problems with up to a few thousand training examples.
We can reformulate (4.17) into an equivalent 1-slack problem, and apply the proposed
StructBoost framework to solve the optimization more efficiently.
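Each constraint in (4.17b) asks the scoring function f(x) = w>Φ(x) to separate a ranked pair by a margin, and the fraction of pairs scored in the correct order is the empirical AUC. A small sketch with made-up scores and pairs:

```python
import numpy as np

def pairwise_margins(scores, pairs):
    """Margins f(x_i) - f(x_j) for every (i, j) in S with i ranked above j,
    given precomputed scores f(x) = w^T Phi(x)."""
    return np.array([scores[i] - scores[j] for i, j in pairs])

scores = np.array([2.0, 0.5, 0.4, 0.2])   # hypothetical f(x) values
pairs = [(0, 1), (0, 3), (2, 1), (2, 3)]  # i should outrank j
auc = float((pairwise_margins(scores, pairs) > 0).mean())
```

Here one of the four pairs is mis-ordered (example 2 scores below example 1), so three quarters of the pairwise constraints hold with a positive margin.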
4.3.3 Multi-class boosting
The MultiBoost algorithm of Shen and Hao [1] can be implemented by the StructBoost
framework as follows. Let Y = {1, 2, . . . , k} and
w = [w1>, · · · , wk>]>.   (4.18)
Here [· · ·]> stacks the k per-class vectors into one column vector. As in [1], wy is the model parameter associated with the y-th class. The multi-class discriminant function in [1] writes
f(x, y;w) = wy>Φ(x).   (4.19)
Now let us define the orthogonal label coding vector:
Γ(y) = [1(y, 1), · · · , 1(y, k)]> ∈ {0, 1}^k.   (4.20)
Here 1(y, z) is the indicator function defined as:
1(y, z) = 1 if y = z, and 0 if y ≠ z.   (4.21)
Then the following joint mapping function
Ψ(x, y) = Φ(x) ⊗ Γ(y)   (4.22)
recovers the StructBoost formulation (4.4) for multi-class boosting. The operator ⊗ calculates the tensor product. The multi-class learning can be formulated as
min_{w≥0, ξ≥0}  1>w + (C/n) 1>ξ   (4.23a)
s.t. ∀i = 1, . . . , n and ∀y ∈ {1, . . . , k} :
wyi>Φ(xi) − wy>Φ(xi) ≥ 1 − ξi.   (4.23b)
A new weak classifier φ(·) is generated by solving the argmax problem defined in (4.13), which can be written as:
φ*(·) = argmax_{φ(·)}  Σ_{i,y} µ(i,y) [φ(xi) ⊗ Γ(yi) − φ(xi) ⊗ Γ(y)].   (4.24)
4.3.4 Hierarchical classification with taxonomies
In many applications such as object categorization and document classification [105],
classes of objects are organized in taxonomies or hierarchies. For example, the Ima-
geNet dataset has organized all the classes according to the tree structures of WordNet
[106]. This problem is a classification example in which the output space has interdependent
Figure 4.1: The hierarchy structures of two selected subsets of the SUN dataset [3] used in our experiments for hierarchical image classification. (a) Taxonomy of the 6-scene dataset; (b) taxonomy of the 15-scene dataset.
structures. An example tree structure (taxonomy) of image categories is shown in Figure
4.1.
Now we consider the taxonomy to be a tree, with a partial order ≺, which indicates
if a class is a predecessor of another class. We override the indicator function so that it indicates whether z is a predecessor of y in the label tree:
1(y, z) = 1 if y ≺ z or y = z, and 0 otherwise.   (4.25)
The label coding vector has the same format as in the standard multi-class classification
case:
Γ(y) = [1(y, 1), · · · , 1(y, k)]> ∈ {0, 1}^k.   (4.26)
Thus Γ(y)>Γ(y′) counts the number of common predecessors, while in the case of standard multi-class classification, Γ(y)>Γ(y′) = 0 for y ≠ y′.
Figure 4.2 shows an example of the label coding vector for a given label hierarchy. In
this case, for example, for class 3, Γ(3) = [0, 0, 1, 0, 0, 0, 1, 0, 1]>. The joint mapping
Figure 4.2: Classification with taxonomies (tree loss), corresponding to the first example in Figure 4.1.
function is
Ψ(x, y) = Φ(x)⊗ Γ(y). (4.27)
The tree loss function ∆(y, y′) is the height of the first common ancestor of the arguments
y, y′ in the tree. By re-defining Γ(y) and ∆(y, y′), classification with taxonomies can be
immediately implemented using the standard multi-class classification shown in the last
subsection.
Here we also consider an alternative approach. In [107], the authors show that one
can train a multi-class boosting classifier by projecting data to a label-specific space
and then learning a single model parameter w. The main advantage might be that
the optimization of w is simplified. Similar to [107] we define label-augmented data as
x′y = x ⊗ Γ(y). The max-margin classification can be written as
min_{w≥0, ξ≥0}  1>w + (C/n) 1>ξ
s.t. ∀i = 1, · · · ,m and ∀y :
w>[Φ(x′i,yi) − Φ(x′i,y)] ≥ ∆(yi, y) − ξi.
Compared with the first approach, now the model w ∈ Rn, which is independent of the
number of classes.
4.3.5 Optimization of the Pascal image overlap criterion
Object detection/localization commonly uses the image area overlap as the loss function
[4, 10, 15], e.g., in the Pascal object detection challenge:

∆(y, y′) = 1 − area(y ∩ y′) / area(y ∪ y′),   (4.28)
with y,y′ being the bounding box coordinates. y∩y′ and y∪y′ are the box intersection
and union. Let xy denote an image patch defined by a bounding box y on the image x.
To apply StructBoost, we define
Ψ(x,y) = Φ(xy). (4.29)
Φ(·) is defined in (4.16). Weak learners such as classifiers or regressors φ(·) are trained
on the image features extracted from image patches. For example, we can extract
histograms of oriented gradients (HOG) from the image patch xy and train a decision
stump with the extracted HOG features by solving the argmax in (4.13).
Note that in this case, solving (4.14) (finding the most violated constraint during
training), as well as the prediction inference (4.3), is in general difficult. In [10], a
branch-and-bound search is employed to find the global optimum. Following the simple
sampling strategy of [4], we simplify this problem by evaluating a number of sampled
image patches to find the most violated constraint; the same strategy is used for
prediction.
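The overlap loss of (4.28) can be sketched as follows; the (x1, y1, x2, y2) box convention is an assumption of this sketch, not taken from the thesis.

```python
# Sketch of the Pascal overlap loss (Eq. 4.28): Delta(y, y') = 1 - IoU.
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed convention).
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def overlap_loss(y, yp):
    # intersection rectangle (may be empty, in which case area() returns 0)
    ix1, iy1 = max(y[0], yp[0]), max(y[1], yp[1])
    ix2, iy2 = min(y[2], yp[2]), min(y[3], yp[3])
    inter = area((ix1, iy1, ix2, iy2))
    union = area(y) + area(yp) - inter
    return 1.0 - inter / union
```

Identical boxes give a loss of 0, disjoint boxes a loss of 1; the sampling strategy above simply evaluates this loss (plus the model score) on each sampled box.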
4.3.6 CRF parameter learning
CRFs have found successful applications in many vision problems such as pixel labelling,
stereo matching and image segmentation. Previous work often uses tedious cross-
validation to set the CRF parameters, which is only feasible for a small number of
parameters. Recently, SSVM has been introduced to learn the parameters [8]. We
demonstrate here how to employ the proposed StructBoost for CRF parameter learning
in the image segmentation task, and show the effectiveness of our approach on the
Graz-02 image segmentation dataset.
To speed up computation, super-pixels rather than pixels have been widely adopted in
image segmentation. We define x as an image and y as the segmentation labels of all
super-pixels in the image. We consider the energy E of an image x and segmentation
labels y over the nodes N and edges S, which takes the following form:
E(x, y; w) = Σ_{p∈N} w^{(1)⊤} Φ^{(1)}(U(y^p, x)) + Σ_{(p,q)∈S} w^{(2)⊤} Φ^{(2)}(V(y^p, y^q, x)).   (4.30)
Recall that 1(·, ·) is the indicator function defined in (4.21). p and q are the super-pixel
indices; y^p, y^q are the labels of super-pixels p and q. U is a set of unary potential
functions: U = [U_1, U_2, . . .]⊤; V is a set of pairwise potential functions:
V = [V_1, V_2, . . .]⊤. When we learn the CRF parameters, the learning algorithm sees
only U and V. In other words,
U and V play the role of the input features. Details on how to construct U and V are
described in the experiment section. w^{(1)} and w^{(2)} are the CRF potential weighting
parameters that we aim to learn. Φ^{(1)}(·) and Φ^{(2)}(·) are two sets of weak learners (e.g.,
decision stumps) for the unary part and pairwise part, respectively:

Φ^{(1)}(·) = [φ^{(1)}_1(·), φ^{(1)}_2(·), . . .]⊤,  Φ^{(2)}(·) = [φ^{(2)}_1(·), φ^{(2)}_2(·), . . .]⊤.   (4.31)
To predict the segmentation labels y⋆ of an unseen image x, we solve the energy
minimization problem:

y⋆ = argmin_y E(x, y; w),   (4.32)
which can be solved efficiently by using graph cuts [5, 8].
Consider a segmentation problem with two classes (background versus foreground). It
is desirable to keep the submodular property of the energy function in (4.30). Otherwise
graph cuts cannot be directly applied to achieve globally optimal labelling. Let us
examine the pairwise energy term:
θ_{(p,q)}(y^p, y^q) = w^{(2)⊤} Φ^{(2)}(V(y^p, y^q, x)),   (4.33)
and a super-pixel label y ∈ {0, 1}. It is well known that the energy function in (4.30)
is submodular if, for every pair (p, q) ∈ S,

θ_{(p,q)}(0, 0) + θ_{(p,q)}(1, 1) ≤ θ_{(p,q)}(0, 1) + θ_{(p,q)}(1, 0).   (4.34)
We want to keep the above property.
First, we force each pairwise weak learner φ^{(2)}(·) to output 0 when the two labels are
identical. This can always be achieved by multiplying a conventional weak learner by
(1 − 1(y^p, y^q)). Now we have θ_{(p,q)}(0, 0) = θ_{(p,q)}(1, 1) = 0.
Given that nonnegativity of w is enforced in our model, a sufficient condition for (4.34)
is that the output of each weak learner φ^{(2)}(·) is nonnegative. This is easy to satisfy,
e.g., with a discrete decision stump or tree with outputs in {0, 1}.
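The two design choices above (zero output on equal labels, nonnegative outputs and weights) make the left side of (4.34) zero and the right side nonnegative. A minimal sketch, with hypothetical stump weak learners on a scalar pairwise feature v:

```python
# Check submodularity (Eq. 4.34) of the pairwise term
# theta(yp, yq) = w2 . Phi2(V(yp, yq, x)) under the two constraints in the
# text: weak learners output 0 when yp == yq, outputs and weights nonnegative.
def make_pairwise(weak_learners, w2):
    assert all(c >= 0 for c in w2)  # nonnegative weights, as in the model
    def theta(yp, yq, v):
        if yp == yq:  # effect of multiplying by (1 - 1(yp, yq))
            return 0.0
        # each hypothetical weak learner returns a value in {0, 1}
        return sum(c * phi(v) for c, phi in zip(w2, weak_learners))
    return theta

def is_submodular(theta, v):
    return theta(0, 0, v) + theta(1, 1, v) <= theta(0, 1, v) + theta(1, 0, v)

stumps = [lambda v: 1.0 if v > 0.5 else 0.0,
          lambda v: 1.0 if v > 0.2 else 0.0]
theta = make_pairwise(stumps, w2=[0.3, 0.7])
```

Any nonnegative combination of such weak learners satisfies (4.34), so graph cuts remain applicable.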
By applying weak learners on U and V, our method introduces nonlinearity for the
parameter learning, which is different from most linear CRF learning methods such
as [8]. Recently, Bertelli et al. presented an image segmentation approach that uses
nonlinear kernels for the unary energy term in the CRF model [91]. In our model,
nonlinearity is introduced by applying weak learners on the potential functions' outputs.
This is analogous to the way SVM introduces nonlinearity via the so-called kernel trick,
and the way boosting learns a nonlinear model with nonlinear weak learners. Nowozin et al.
[108] introduced decision tree fields (DTF) to overcome the problem of overly-simplified
modeling of pairwise potentials in most CRF models. In DTF, local interactions between
multiple variables are determined by means of decision trees. In our StructBoost, if we
use decision trees as the weak learners on the pairwise potentials, then StructBoost and
DTF share similar characteristics in that both use decision trees for the same purpose.
However, the starting points of these two methods as well as the training procedures are
entirely different.
To apply StructBoost, the CRF parameter learning problem in a large-margin framework
can then be written as:
min_{w≥0, ξ≥0}  1⊤w + (C/n) 1⊤ξ   (4.35a)
s.t. ∀i = 1, . . . , n and ∀y ∈ Y:
     E(xi, y; w) − E(xi, yi; w) ≥ ∆(yi, y) − ξi.   (4.35b)
Here i indexes images. Intuitively, the optimization in (4.35) encourages the energy
of the ground truth labeling E(xi, yi) to be lower than that of any incorrect labeling
E(xi, y) by a margin of at least ∆(yi, y). We define ∆(yi, y) as the Hamming loss, i.e.,
the number of super-pixels at which the label y differs from the ground truth yi:
∆(yi, y) = Σ_p (1 − 1(y_i^p, y^p)).   (4.36)
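The Hamming loss of (4.36) simply counts disagreeing super-pixels; a minimal sketch (the label sequences are hypothetical):

```python
# Hamming loss (Eq. 4.36): number of super-pixels whose label differs
# from the ground-truth labeling.
def hamming_loss(y_true, y):
    assert len(y_true) == len(y)
    return sum(1 - int(yt == yp) for yt, yp in zip(y_true, y))
```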
We show that the problem (4.35) is a special case of the general formulation of Struct-
Boost (4.4) by defining

w = [w^{(1)}; w^{(2)}],   (4.37)

and

Ψ(x, y) = −[ Σ_{p∈N} Φ^{(1)}(U(y^p, x)); Σ_{(p,q)∈S} Φ^{(2)}(V(y^p, y^q, x)) ],

where [· ; ·] stacks two vectors. With this definition, we have the relation

w⊤Ψ(x, y) = −E(x, y; w).   (4.38)

The minus sign here inverts the order of subtraction in (4.4b). At each column
generation iteration (Algorithm 3), two new weak learners φ^{(1)}(·) and φ^{(2)}(·) are added
to the unary and pairwise weak learner sets, respectively, by solving the argmax problem
defined in (4.13), which can be written as:
φ^{(1)⋆}(·) = argmax_{φ(·)} Σ_{i,y} µ(i,y) Σ_{p∈N} [ φ^{(1)}(U(y^p, xi)) − φ^{(1)}(U(y_i^p, xi)) ];   (4.39)

φ^{(2)⋆}(·) = argmax_{φ(·)} Σ_{i,y} µ(i,y) Σ_{(p,q)∈S} [ φ^{(2)}(V(y^p, y^q, xi)) − φ^{(2)}(V(y_i^p, y_i^q, xi)) ].   (4.40)
The maximization problem of finding the most violated constraint in (4.14) reduces to
the inference:

y′i = argmin_y E(xi, y) − ∆(yi, y),   (4.41)

which is similar to the label prediction inference in (4.32); the only difference is the
labeling loss term ∆(yi, y) in (4.41). Since we use the Hamming loss (4.36), the term
∆(yi, y) can be absorbed into the unary term of the energy function in (4.30) (as in [8]).
The inference in (4.41) can then be
written as:
y′i = argmin_y Σ_{p∈N} [ w^{(1)⊤}Φ^{(1)}(U(y^p, xi)) − (1 − 1(y_i^p, y^p)) ] + Σ_{(p,q)∈S} w^{(2)⊤}Φ^{(2)}(V(y^p, y^q, xi)).   (4.42)
The above minimization (4.42) can also be solved efficiently by using graph cuts.
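The absorption of the Hamming loss into the unary term can be sketched as follows; the energies are hypothetical, the graph is a 3-node chain, and we brute-force the argmin instead of using graph cuts, for illustration only.

```python
# Sketch of loss-augmented inference (Eq. 4.42): each super-pixel's unary
# cost is reduced by 1 for labels that disagree with the ground truth,
# which absorbs -(1 - 1(y_i^p, y^p)) into the unary term.
from itertools import product

def loss_augmented_energy(y, unary, pairwise, y_true):
    # unary[p][label] is a hypothetical unary cost; chain edges (0,1), (1,2)
    e = sum(unary[p][y[p]] - (0 if y[p] == y_true[p] else 1)
            for p in range(len(y)))
    e += sum(pairwise(y[p], y[q]) for p, q in [(0, 1), (1, 2)])
    return e

def most_violated(unary, pairwise, y_true):
    # brute force over all binary labelings (graph cuts would be used in practice)
    labelings = product((0, 1), repeat=len(y_true))
    return min(labelings,
               key=lambda y: loss_augmented_energy(y, unary, pairwise, y_true))
```

Note how the loss term pushes the minimizer away from the ground truth: with a weakly confident unary term, the most violated labeling flips a label relative to the ground truth.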
4.4 Experiments
To evaluate our method, we run various experiments on applications including AUC
maximization (ordinal regression), multi-class image classification, hierarchical image
classification, visual tracking and image segmentation. We mainly compare with the
most relevant method: Structured SVM (SSVM) and some other conventional methods
(e.g., SVM, AdaBoost). Unless otherwise specified, the cutting-plane stopping criterion
(εcp) in our method is set to 0.01.
Table 4.1: AUC maximization. We compare the performance of the n-slack and 1-slack formulations. "−" means that the method is not able to converge within a memory and time limit. We can see that 1-slack achieves similar AUC results on training and testing data as n-slack, while 1-slack is significantly faster than n-slack.

dataset     method    time (sec)    AUC training   AUC test
wine        n-slack   13±1          1.000±0.000    0.994±0.005
            1-slack   3±1           1.000±0.000    0.994±0.006
glass       n-slack   20±1          0.967±0.011    0.849±0.028
            1-slack   6±1           0.955±0.030    0.844±0.039
svmguide2   n-slack   332±6         0.988±0.003    0.905±0.036
            1-slack   106±8         0.988±0.003    0.905±0.036
svmguide4   n-slack   564±79        1.000±0.000    0.982±0.005
            1-slack   106±13        1.000±0.000    0.982±0.005
vowel       n-slack   4051±116      0.999±0.001    0.968±0.013
            1-slack   952±139       0.999±0.001    0.967±0.013
dna         n-slack   −             −              −
            1-slack   1598±643      0.998±0.000    0.992±0.003
segment     n-slack   −             −              −
            1-slack   475±42        1.000±0.000    0.999±0.001
satimage    n-slack   −             −              −
            1-slack   37769±6331    0.999±0.000    0.997±0.002
4.4.1 AUC optimization
In this experiment, we compare the efficiency of the 1-slack (solving (4.11)) and n-slack
(solving (4.4) or its dual) formulations of our method StructBoost. The details of the
AUC optimization are described in Section 4.3.2. We run the experiments on a few
UCI multi-class datasets. To create imbalanced data, we use one class of the multi-class
UCI datasets as positive data and the rest as negative data. The maximum number
of boosting (column generation) iterations is set to 200; the cutting-plane stopping
criterion (εcp) is set to 0.001. Decision stumps are used as weak learners (Φ(·) in (4.17)).
For each data set, we randomly sample 50% for training, 25% for validation and the
rest for testing. The regularization parameter C is chosen from 6 candidates ranging
from 10 to 10^3. Experiments are repeated 5 times on each dataset and the mean and
standard deviation are reported. Table 4.1 reports the results. We can see that the
1-slack formulation of StructBoost achieves similar AUC performance as the n-slack
formulation, while being much faster than n-slack. The values of the objective function
and optimization times are shown in Figure 4.3 for varying column generation iterations.
Again, it shows that the 1-slack version achieves similar objective values to the n-slack
version with less running time.
Note that RankBoost [109], which is designed for solving ranking problems, may also
be applied to this problem.
[Figure 4.3 appears here; plots omitted in this transcript. Panels: SVMGUIDE4 and VOWEL; x-axis: iterations (20 to 200); y-axes: objective function value and optimization time (seconds). Final legend values: SVMGUIDE4 objective, m-slack 4.065±0.221 vs. 1-slack 4.067±0.221; SVMGUIDE4 time, m-slack 564.4±78.5 secs vs. 1-slack 105.7±13.2 secs; VOWEL objective, m-slack 7.562±0.501 vs. 1-slack 7.567±0.500; VOWEL time, m-slack 4051.2±116.0 secs vs. 1-slack 951.8±138.7 secs.]
Figure 4.3: AUC optimization on two UCI datasets. The objective values and optimization time are shown by varying the number of boosting (column generation) iterations. 1-slack achieves similar objective values to n-slack but needs less running time.
4.4.2 Multi-class classification
Multi-class classification is a special case of structured learning. Details are described in
Section 4.3.3. We carry out experiments on some UCI multi-class datasets and MNIST.
We compare with Structured SVM (SSVM), conventional multi-class boosting methods
(namely AdaBoost.ECC and AdaBoost.MH), and the one-vs-all SVM method. For each
dataset, we randomly select 50% data for training, 25% data for validation and the rest
for testing. The maximum number of boosting iterations is set to 200; the regularization
parameter C is chosen from 6 candidates whose values range from 10 to 1000. The
experiments are repeated 5 times for each dataset.
To demonstrate the flexibility of our method, we use decision stumps and linear perceptron
functions as weak learners (Φ(·) in (4.23)). The perceptron weak learner can be written as

φ(x) = sign(v⊤x + b).   (4.43)
[Figure 4.4 appears here; plots omitted in this transcript. Panels: VOWEL, SEGMENT, SATIMAGE, USPS, PENDIGITS and MNIST; x-axis: iterations (20 to 200); y-axis: test error. The final error rates in the legends correspond to the entries of Table 4.2.]
Figure 4.4: Test performance versus the number of boosting iterations for multi-class classification. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. The results of SSVM and SVM are shown as straight lines in the plots. The values shown in the legends are the error rates at the final iteration for each method. Our methods perform better than SSVM in most cases.
Table 4.2: Multi-class classification test errors (%) on several UCI datasets and MNIST. 1-vs-all SVM is the one-vs-all SVM. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. StructBoost outperforms SSVM in most cases and achieves competitive performance compared with other multi-class classifiers.

               glass      svmguide2  svmguide4  vowel      dna       segment   satimage   usps      pendigits  mnist
StBoost-stump  35.8±6.2   21.0±3.9   20.1±2.9   17.5±2.2   6.2±0.7   2.9±0.7   12.1±0.7   6.9±0.6   3.9±0.4    12.5±0.4
StBoost-per    37.3±6.2   22.7±4.8   53.4±6.1   6.8±1.8    6.6±0.6   3.8±0.7   11.4±1.1   4.1±0.6   1.8±0.3    6.5±0.6
Ada.ECC        32.7±4.9   23.3±4.0   19.1±2.3   20.6±1.5   7.6±1.2   2.9±0.8   12.8±0.7   8.4±0.7   8.4±0.7    15.8±0.3
Ada.MH         32.3±5.0   21.9±4.5   19.3±3.0   18.8±2.1   7.1±0.6   3.7±0.7   12.7±0.9   7.4±0.5   7.4±0.5    13.4±0.4
SSVM           38.8±8.7   21.9±5.9   45.7±3.9   25.6±2.5   6.9±0.9   5.3±1.0   14.9±0.1   5.8±0.3   5.2±0.3    9.6±0.2
1-vs-all SVM   40.8±7.0   17.7±3.5   47.0±3.2   54.4±2.2   6.3±0.5   7.7±0.8   17.5±0.4   5.4±0.5   8.1±0.5    9.2±0.2
Table 4.3: Hierarchical classification. Results of the tree loss and the 1/0 loss (classification error rate) on subsets of the SUN dataset. StructBoost-tree uses the hierarchical class formulation with the tree loss, and StructBoost-flat uses the standard multi-class formulation. StructBoost-tree, which minimizes the tree loss, performs best.

dataset    loss       StructBoost-tree  StructBoost-flat  Ada.ECC-SVM   Ada.ECC       Ada.MH
6 scenes   1/0 loss   0.322±0.018       0.343±0.028       0.350±0.013   0.327±0.002   0.315±0.015
           tree loss  0.337±0.014       0.380±0.027       0.377±0.018   0.352±0.023   0.346±0.018
15 scenes  1/0 loss   0.394±0.005       0.396±0.013       0.414±0.012   0.444±0.012   0.418±0.010
           tree loss  0.504±0.007       0.536±0.009       0.565±0.019   0.584±0.017   0.551±0.013
We use a smooth sigmoid function to replace the step function so that gradient-based
optimization can be applied. We solve the argmax problem in (4.24) using the
quasi-Newton L-BFGS solver [2]. We find that decision stumps often provide a good
initialization for L-BFGS learning of the perceptron. Compared with decision stumps,
the perceptron weak learner usually needs fewer boosting iterations to converge.
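The smoothing step can be sketched as follows; we use tanh as the smooth surrogate and the scale s is an assumption of this sketch (the thesis does not specify the exact sigmoid or its scale).

```python
# Sketch of smoothing the perceptron weak learner (Eq. 4.43): replace
# sign(v'x + b) with tanh(s * (v'x + b)) so the argmax in (4.24) can be
# optimized by gradient methods such as L-BFGS. Only the smooth response
# and its gradient are shown; s is a hypothetical scale parameter.
import math

def smooth_perceptron(v, b, x, s=5.0):
    a = sum(vi * xi for vi, xi in zip(v, x)) + b
    return math.tanh(s * a)

def smooth_perceptron_grad(v, b, x, s=5.0):
    """Gradient of tanh(s * (v'x + b)) with respect to (v, b)."""
    a = sum(vi * xi for vi, xi in zip(v, x)) + b
    g = s * (1.0 - math.tanh(s * a) ** 2)  # chain rule through tanh
    return [g * xi for xi in x], g
```

For large |v'x + b| the smooth response approaches the hard sign, so the learned weak learner behaves like the perceptron of (4.43) at prediction time.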
Table 4.2 reports the error rates. Figure 4.4 shows test performance versus the number
of boosting (column generation) iterations. The results demonstrate that our method
outperforms SSVM, and achieves competitive performance compared with other con-
ventional multi-class methods.
StructBoost performs better than SSVM on most datasets, which might be due to the
non-linearity introduced in StructBoost. Results also show that the perceptron weak
learner often achieves better performance than decision stumps on the larger datasets.
4.4.3 Hierarchical multi-class classification
The details of hierarchical multi-class classification are described in Section 4.3.4. We
have constructed two hierarchical image datasets from the SUN dataset [3], containing
6 and 15 scene classes, respectively. The hierarchical tree structures of the two datasets
are shown in Figure 4.1. For each scene class, we use the first 200 images from the
original SUN dataset, giving 1200 images in the first dataset and 3000 images in the
second. We use the HOG features described in [3]. For each dataset, 50% of the examples
are randomly selected for training and the rest for testing. We
Table 4.4: Average bounding box overlap scores on benchmark videos. Struck50 [4] is structured SVM tracking with a buffer size of 50. Our StructBoost performs the best in most cases. Struck performs the second best, which confirms the usefulness of structured output learning.

          StructBoost  AdaBoost   Struck50   Frag       MIL        OAB1       OAB5       VTD
coke      0.79±0.17    0.47±0.19  0.55±0.18  0.07±0.21  0.36±0.23  0.10±0.20  0.04±0.16  0.10±0.23
tiger1    0.75±0.17    0.64±0.16  0.68±0.21  0.21±0.30  0.64±0.18  0.44±0.23  0.23±0.24  0.11±0.24
tiger2    0.74±0.18    0.46±0.18  0.59±0.19  0.16±0.24  0.63±0.14  0.35±0.23  0.18±0.19  0.19±0.22
david     0.86±0.07    0.34±0.23  0.82±0.11  0.18±0.24  0.59±0.13  0.28±0.23  0.21±0.22  0.29±0.27
girl      0.74±0.12    0.41±0.26  0.80±0.10  0.65±0.19  0.56±0.21  0.43±0.18  0.28±0.26  0.63±0.12
sylv      0.66±0.16    0.52±0.18  0.69±0.14  0.61±0.23  0.66±0.18  0.47±0.38  0.05±0.12  0.58±0.30
bird      0.79±0.11    0.67±0.14  0.60±0.26  0.34±0.32  0.58±0.32  0.57±0.29  0.59±0.30  0.11±0.26
walk      0.74±0.19    0.56±0.14  0.59±0.39  0.09±0.25  0.51±0.34  0.54±0.36  0.49±0.34  0.08±0.23
shaking   0.72±0.13    0.49±0.22  0.08±0.19  0.33±0.28  0.61±0.26  0.57±0.28  0.51±0.21  0.69±0.14
singer    0.69±0.10    0.74±0.10  0.34±0.37  0.14±0.30  0.20±0.34  0.20±0.33  0.07±0.18  0.50±0.20
iceball   0.58±0.17    0.05±0.16  0.51±0.33  0.51±0.31  0.35±0.29  0.08±0.23  0.38±0.30  0.57±0.29
run 5 times for each dataset. The regularization parameter is chosen from 6 candidates
ranging from 1 to 10^3.
We use linear SVMs as the weak classifiers in our method. The linear SVM weak
classifier has the same form as (4.43). At each boosting iteration, we solve the argmax
problem by training a linear SVM model. The regularization parameter C of the SVM
is set to a large value (10^8/#examples). To alleviate overfitting when using linear SVMs
as weak learners, we set the 10% smallest non-zero dual solutions µ(i,y) to zero.
We compare StructBoost using the hierarchical multi-class formulation (StructBoost-
tree) with the standard multi-class formulation (StructBoost-flat). For further
comparison, we also run other multi-class methods: Ada.ECC and Ada.MH with decision
stumps, and Ada.ECC using linear SVMs as weak classifiers (labelled Ada.ECC-SVM).
When using SVMs as weak learners, the number of boosting iterations is set to 200;
for decision stumps, it is set to 500. Table 4.3 shows the tree loss and the 1/0 loss
(classification error rate). We observe that StructBoost-tree has the lowest tree loss
among all compared methods, and it also outperforms its counterpart, StructBoost-flat,
in terms of both classification error rate and tree loss. StructBoost-tree makes use of
the class hierarchy and directly optimizes the tree loss, which is likely why it achieves
the best performance.
4.4.4 Visual tracking
A visual tracking method, termed Struck [4], was introduced based on SSVM. The core
idea is to train a tracker by optimizing the Pascal image overlap score using SSVM. Here
we apply StructBoost to visual tracking, following a setting similar to that of Struck [4].
More details can be found in Section 4.3.5.
Table 4.5: Average center errors on benchmark videos. Struck50 [4] is structured SVM tracking with a buffer size of 50. We observe similar results as in Table 4.4: our StructBoost outperforms the others on most sequences, and Struck is the second best.

          StructBoost  AdaBoost    Struck50     Frag         MIL           OAB1          OAB5         VTD
coke      3.7±4.5      9.3±4.2     8.3±5.6      69.5±32.0    17.8±9.6      34.7±15.5     68.1±30.3    46.8±21.8
tiger1    5.4±4.9      7.8±4.4     7.8±9.9      39.6±25.7    8.4±5.9       17.8±16.4     38.9±31.1    68.8±36.4
tiger2    5.2±5.6      12.7±6.3    8.7±6.1      38.5±24.9    7.5±3.6       20.5±14.9     38.3±26.9    38.0±29.6
david     5.2±2.8      43.0±28.2   7.7±5.7      73.8±36.7    19.6±8.2      51.0±30.9     64.4±33.5    66.1±56.3
girl      14.3±7.8     47.1±29.5   10.1±5.5     23.0±22.5    31.6±28.2     43.3±17.8     67.8±32.5    18.4±11.4
sylv      9.1±5.8      14.7±7.8    8.4±5.3      12.2±11.8    9.4±6.5       32.9±36.5     76.4±35.4    21.6±35.7
bird      6.7±3.8      12.7±9.5    17.9±13.9    50.0±43.3    49.0±85.3     47.9±87.7     48.5±86.3    143.9±79.3
walk      8.4±10.3     13.5±5.4    33.9±49.5    102.8±46.3   35.0±47.5     35.7±49.2     38.0±48.7    100.9±47.1
shaking   9.5±5.4      21.6±12.0   123.9±54.5   47.2±40.6    37.8±75.6     26.9±49.3     29.1±48.7    10.5±6.8
singer    5.8±2.2      4.8±2.1     29.5±23.8    172.8±95.2   188.3±120.8   189.9±115.2   158.5±68.6   10.1±7.6
iceball   8.0±4.1      107.9±66.4  15.6±22.1    39.8±72.9    61.6±85.6     97.7±53.5     58.7±84.0    13.5±26.0
Our experiment follows the on-line tracking setting. Here we use the first 3 labeled
frames for initialization and training of our StructBoost tracker. Then the tracker is
updated by re-training the model during the course of tracking. Specifically, in the
i-th frame (represented by xi), we first perform a prediction step (solving (4.3)) to
output the detection box (yi), then collect training data for tracker update. For solving
the prediction inference in (4.3), we simply sample about 2000 bounding boxes around
the prediction box of the last frame (represented by yi−1), and search for the most
confident bounding box over all sampled boxes as the prediction. After the prediction,
we collect training data by sampling about 200 bounding boxes around the current
prediction yi. We use the training data from the most recent 60 frames to re-train the
tracker every 2 frames. Solving (4.14) to find the most violated constraint is similar to the
prediction inference.
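The sampling-based prediction step can be sketched as follows; the scoring function, the sampling radius, and the uniform sampling scheme are hypothetical stand-ins, not details taken from the thesis.

```python
# Sketch of sampling-based tracker prediction: sample candidate boxes
# around the previous prediction, score each with the current strong
# classifier, and keep the most confident one (the argmax in Eq. 4.3).
import random

def sample_boxes(prev_box, n=2000, radius=20, rng=random.Random(0)):
    # translate the previous box by random offsets (hypothetical scheme);
    # the shared seeded rng keeps this sketch deterministic
    x1, y1, x2, y2 = prev_box
    boxes = []
    for _ in range(n):
        dx, dy = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
        boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return boxes

def predict(prev_box, score):
    """Return the most confident sampled box under the given scorer."""
    return max(sample_boxes(prev_box), key=score)
```

Finding the most violated constraint works the same way, except that the sampled boxes are scored by the loss-augmented objective rather than the classifier score alone.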
For StructBoost, decision stumps are used as the weak classifiers; the number of boosting
iterations is set to 300; the regularization parameter C is selected from 10^0.5 to 10^2. We
use down-sampled gray-scale raw pixels and HOG [110] as image features.
For comparison, we also run a simple binary AdaBoost tracker using the same setting as
for our StructBoost tracker. The number of weak classifiers for AdaBoost is set to 500.
When training the AdaBoost tracker, we collect positive boxes that significantly overlap
(overlap score above 0.8) with the prediction box of the current frame, and negative
boxes with small overlap scores (lower or equal to 0.3).
We also compare with a few state-of-the-art tracking methods, including Struck [4] (with
a buffer size of 50), multi-instance tracking (MIL) [111], fragment tracking (Frag) [112],
online AdaBoost tracking (OAB) [113], and visual tracking decomposition (VTD) [114].
Two different settings are used for OAB: one positive example per frame (OAB1) and
five positive examples per frame (OAB5) for training. The test video sequences “coke”,
“tiger1”, “tiger2”, “david”, “girl” and “sylv” were used in [4]; “shaking” and “singer”
are from [114]; the rest are from [115].
[Figure 4.5 appears here; plots omitted in this transcript. Panels: per-frame bounding box overlap for (a) coke, (b) tiger1, (c) tiger2, (d) david, (e) walk, (f) shaking, (g) sylv, (h) girl, (i) bird, (j) singer and (k) iceball; compared trackers: VTD, OAB5, OAB1, MIL, Frag, Struck, AdaBoost and StructBoost.]
Figure 4.5: Bounding box overlap in frames of several video sequences. Our StructBoost often achieves higher box overlap scores compared with the other trackers.
Table 4.4 reports the Pascal overlap scores of the compared methods. Our StructBoost
tracker performs best on most sequences. Compared with the simple binary AdaBoost
tracker, StructBoost, which optimizes the Pascal overlap criterion, performs significantly
better. Note that here Struck uses Haar features. When Struck uses a Gaussian kernel
defined on raw pixels, the performance is slightly different [4], and ours still outperforms
Struck in most cases. This might be because StructBoost selects relevant features (300
features selected here), while the SSVM in Struck [4] uses all the image patch
information, which may contain noise.
The center location errors (in pixels) of the compared methods are shown in Table 4.5.
We can see that optimizing the overlap score also helps to minimize the center location
[Figure 4.6 appears here; plots omitted in this transcript. Panels: per-frame center location error for (a) coke, (b) tiger1, (c) tiger2, (d) david, (e) walk, (f) shaking, (g) sylv, (h) girl, (i) bird, (j) singer and (k) iceball; same compared trackers as in Figure 4.5.]
Figure 4.6: Bounding box center location error in frames of several video sequences. Our StructBoost often achieves lower center location errors compared with the other trackers.
errors. Our method also achieves the best performance.
Figures 4.5 and 4.6 plot the Pascal overlap scores and center location errors frame by
frame on several video sequences. Some tracking examples are shown in Figure 4.7.
4.4.5 CRF parameter learning for image segmentation
We evaluate our method on CRF parameter learning for image segmentation, following
[8], which applies SSVM to learn weighting parameters for different potentials (including
multiple unary and pairwise potentials). The goal of applying
Figure 4.7: Some tracking examples for several video sequences: “coke”, “david”, “walk”, “bird” and “tiger2” (best viewed on screen). The output bounding boxes of our StructBoost overlap the ground truth better than those of the compared methods.
Table 4.6: Image segmentation results on the Graz-02 dataset. The results show the pixel accuracy, intersection/union score (including the foreground and background) and the precision = recall value (as in [5]). Our method StructBoost for nonlinear parameter learning performs better than SSVM and other methods.

intersection/union (foreground, background) (%)
              bike               car                people
AdaBoost      69.2 (57.6, 80.7)  72.2 (51.7, 92.7)  68.9 (51.2, 86.5)
SVM           65.2 (53.0, 77.4)  68.6 (45.0, 92.3)  62.9 (41.0, 84.8)
SSVM          74.5 (64.4, 84.6)  80.2 (64.9, 95.4)  74.3 (58.8, 89.7)
StructBoost   76.5 (66.3, 86.7)  80.8 (66.1, 95.6)  75.7 (61.0, 90.4)

pixel accuracy (foreground, background) (%)
              bike               car                people
AdaBoost      84.4 (83.8, 85.1)  82.9 (69.8, 96.0)  81.0 (70.0, 92.1)
SVM           81.9 (81.8, 82.1)  77.0 (57.2, 96.9)  73.5 (53.8, 93.2)
SSVM          87.9 (87.9, 88.0)  86.9 (75.8, 98.1)  83.5 (71.8, 95.2)
StructBoost   87.4 (83.3, 91.5)  87.6 (77.0, 98.1)  84.6 (73.6, 95.6)

precision = recall (%)
              bike   car    people
M. & S. [116] 61.8   53.8   44.1
F. et al. [5] 72.2   72.2   66.3
AdaBoost      72.7   67.8   67.0
SVM           68.3   63.4   61.2
SSVM          77.3   78.3   74.4
StructBoost   78.9   79.3   75.9
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.8: Some segmentation results on the Graz-02 dataset (car). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.9: Some segmentation results on the Graz-02 dataset (bicycle). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.10: Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.11: Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
StructBoost here is to learn a non-linear weighting for different potentials. Details are
described in Section 4.3.6.
We extend the super-pixel based segmentation method [5] with CRF parameter learning.
The Graz-02 dataset1 is used here which contains 3 categories (bike, car and person).
Following the settings of other methods [5], the first 300 labeled images in each category
are used in the experiment. Images with the odd indices are used for training and the rest
for testing. We generate super-pixels using the same setting as [5]. For each super-pixel,
we generate 5 types of features: visual word histogram [5], color histograms, GIST [117],
LBP2 and HOG [110]. For constructing the visual word histogram, we follow [5] using
a neighborhood size of 2; the code book size is set to 200. For GIST, LBP and HOG,
we extract features from patches centered at the super-pixel with 4 increasing sizes:
4× 4, 8× 8, 12× 12 and 16× 16. The cell size for LBP and HOG is set to a quarter of
the patch size. For GIST, we generate 512 dimensional features for each patch by using
4 scales and the number of orientations is set to 8. In total, we generate 14 groups of
features (including features extracted on patches of different sizes). Using these super-
pixel features, we construct 14 different unary potentials (U = [U1, . . . , U14 ]>) from
AdaBoost classifiers, which are trained on the foreground and background super-pixels.
The number of boosting iterations for AdaBoost is set to 1000. Specifically, we define
F ′(xp) as the discriminant function of AdaBoost on the features of the p-th super-pixel.
Then the unary potential function can be written as:
U(x, yp) = −ypF ′(xp). (4.44)
We also construct 2 pairwise potentials (V = [V1, V2]>): V1 is constructed using color differences, and V2 using the shared boundary length [5], which is able to discourage small isolated segments. Recall that 1(·, ·) is the indicator function defined in (4.21); ‖xp − xq‖2 calculates the ℓ2 norm of the color difference between two super-pixels in the LUV color space; ℓ(xp, xq) is the shared boundary length between two super-pixels. Then V1, V2 can be written as:

V1(yp, yq, x) = exp(−‖xp − xq‖2) [1 − 1(yp, yq)],   (4.45)

V2(yp, yq, x) = ℓ(xp, xq) [1 − 1(yp, yq)].   (4.46)
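In code, these two potentials are straightforward to evaluate once each super-pixel carries a mean LUV color vector and a precomputed shared boundary length; the following is a minimal sketch, with illustrative function and variable names rather than the thesis implementation:

```python
import numpy as np

def pairwise_potentials(color_p, color_q, boundary_len, y_p, y_q):
    """Evaluate V1 (color difference) and V2 (shared boundary length)
    for a pair of neighboring super-pixels, following (4.45) and (4.46).

    color_p, color_q: mean LUV color vectors of the two super-pixels.
    boundary_len: shared boundary length l(x_p, x_q).
    y_p, y_q: labels in {-1, +1}.
    """
    same_label = 1.0 if y_p == y_q else 0.0           # indicator 1(y_p, y_q)
    v1 = np.exp(-np.linalg.norm(color_p - color_q)) * (1.0 - same_label)
    v2 = boundary_len * (1.0 - same_label)
    return v1, v2
```

Both potentials vanish when the labels agree; when they disagree, V1 penalizes cutting between similarly colored super-pixels, while V2 penalizes cutting long shared boundaries, which discourages small isolated segments.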
We apply StructBoost here to learn non-linear weights for combining these potentials.
We use decision stumps as weak learners (Φ(·) in (4.30)) here. The number of boosting
iterations for StructBoost is set to 50.
1 http://www.emt.tugraz.at/~pinz/
2 http://www.vlfeat.org/
For comparison, we use SSVM to learn CRF weighting parameters on exactly the same
potentials as our method. The regularization parameter C in SSVM and our Struct-
Boost is chosen from 6 candidates with values ranging from 0.1 to 10^3. We also run
two simple binary super-pixel classifiers (linear SVM and AdaBoost) trained on visual
word histogram features of foreground and background super-pixels. The regularization
parameter C in SVM is chosen from 10^2 to 10^7. The number of boosting iterations for
AdaBoost is set to 1000.
We use the intersection-union score, pixel accuracy (including the foreground and back-
ground) and precision = recall value (as in [5]) for evaluation. Results are shown in
Table 4.6. Some segmentation examples are shown in Figures 4.8, 4.9, 4.10 and 4.11.
As shown in the results, both StructBoost and SSVM, which learn to combine differ-
ent potential functions, are able to significantly outperform the simple binary models
(AdaBoost and SVM). StructBoost outperforms SSVM since it learns a non-linear com-
bination of potentials. Note that SSVM learns a linear weighting for different potentials.
By employing nonlinear parameter learning, our method gains further performance im-
provement over SSVM.
4.5 Conclusion
We have presented a structured boosting method, which combines a set of weak struc-
tured learners for nonlinear structured output learning, as an alternative to SSVM [7]
and CRF [88]. Analogous to SSVM, where the discriminant function is learned over
a joint feature space of inputs and outputs, the discriminant function of the proposed
StructBoost is a linear combination of weak structured learners defined over a joint space
of input-output pairs.
To efficiently solve the resulting optimization problems, we have introduced a cutting-
plane method, which was originally proposed for fast training of linear SVM. Our ex-
tensive experiments demonstrate that indeed the proposed algorithm is computationally
tractable.
StructBoost is flexible in the sense that it can be used to optimize a wide variety of
loss functions. We have exemplified the application of StructBoost by applying it to
multi-class classification, hierarchical multi-class classification by optimizing the tree
loss, visual tracking that optimizes the Pascal overlap criterion, and learning CRF pa-
rameters for image segmentation. We show that the performance of StructBoost is at least comparable to, and sometimes exceeds, that of conventional approaches for a wide range of applications. We also observe that StructBoost outperforms linear SSVM, demonstrating the
usefulness of our nonlinear structured learning method. Future work will focus on more
applications of this general StructBoost framework.
Chapter 5
Learning Hash Functions Using
Column Generation
In this chapter, we propose a column generation based method [43] for learning data-
dependent hash functions based on relative pairwise similarity information. Given a set
of triplets that encode the pairwise similarity comparison information, our method learns
hash functions that preserve the relative comparison relations in the data within the
large-margin learning framework. The learning procedure is implemented using column
generation and hence is named CGHash. At each iteration of the column generation
procedure, the best hash function is selected. Unlike many other hashing methods, our
method generalizes to new data points naturally. We show that our method with triplet
based formulation and large-margin learning is able to learn high quality hash functions
for similarity search.
5.1 Introduction
A hashing-based approach constructs a set of hash functions that map high-dimensional
data points to low-dimensional binary codes. These binary codes can be easily loaded
into the memory in order to allow rapid retrieval of data points. Moreover, the pairwise
Hamming distance between these binary codes can be efficiently computed by using
bit operations, which are well supported by modern processors, thus enabling efficient
similarity calculation on large-scale datasets. Hash-based approaches have thus found a
wide range of applications, including object recognition [118], information retrieval [33],
local descriptor compression [119], image matching [120], and many more. Recently a
number of effective hashing methods have been developed which construct a variety of
hash functions, mainly on the assumption that similar data points should have similar
binary codes, such as random projection based locality sensitive hashing (LSH) [32, 62,
65], boosting learning-based similarity sensitive coding (SSC) [121], and spectral hashing
of Weiss et al. [34] which is inspired by Laplacian eigenmap.
In more detail, spectral hashing [34] optimizes a graph Laplacian based objective function
such that, in the learned low-dimensional binary space, the local neighborhood structure
of the original dataset is preserved. SSC [121] makes use of boosting to adaptively
learn an embedding of the original space, represented by a set of weak learners or hash
functions. This embedding aims to preserve the pairwise similarity relations. These
approaches have demonstrated that, in general, data-dependent hashing is superior to
data-independent hashing with a typical example being LSH [65].
Following this vein, here we learn hash functions using similarity information that is
generally presented in a set of triplet relations. These triplet relations used for training can be generated in either a supervised or an unsupervised fashion. The fundamental idea
is to learn hash functions such that the Hamming distance between two similar data
points is smaller than that between two dissimilar data points. This type of relative similarity comparison has been successfully applied to learning quadratic distance metrics [122, 123]. Usually such similarity relations do not require explicit class labels and thus are easier to obtain than either the class labels or the actual distances between
data points. For instance, in content based image retrieval, to collect feedback, users
may be required to report whether one image looks more similar to another image than
it is to a third image. This task is typically much easier than labeling each individual image. Formally, letting x denote one data point, we are given a set of triplets:

T = {(i, j, k) | d(xi, xj) < d(xi, xk)},   (5.1)
where d(·, ·) is some distance measure (e.g., Euclidean distance in the original space; or
semantic similarity measure provided by a user). As explained, one may not explicitly
know d(·, ·); instead, one may only be able to provide sparse similarity relations. Using
such a set of constraints, we formulate a learning problem in the large-margin framework.
By using a convex surrogate loss function, a convex optimization problem is obtained,
but has an exponentially large number of variables. Column generation is thus employed
to efficiently solve the formulated optimization problem.
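In the unsupervised case, for instance, such triplets can be generated from Euclidean distances; the sketch below takes nearest neighbors as similar points and farthest points as dissimilar ones, which is one illustrative sampling choice rather than necessarily the scheme used in the experiments:

```python
import numpy as np

def generate_triplets(X, num_near=2, num_far=2):
    """Generate triplets (i, j, k) with d(x_i, x_j) < d(x_i, x_k):
    j is drawn from the nearest neighbors of x_i and k from its
    farthest points, under the Euclidean distance."""
    triplets = []
    for i in range(X.shape[0]):
        d = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(d)                 # order[0] is i itself
        for j in order[1:1 + num_near]:       # similar to x_i
            for k in order[-num_far:]:        # dissimilar to x_i
                triplets.append((i, int(j), int(k)))
    return triplets
```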
5.1.1 Main contribution
The main contribution of this work is to propose a novel hash function learning frame-
work which has the following desirable properties.
1. Our column generation based method with triplet based formulation and large-
margin learning is able to learn high quality hash functions. The learned hash
functions are able to outperform many existing methods in similarity preservation.
2. The proposed framework is flexible and can accommodate various types of con-
straints. We show how to learn hash functions based on similarity comparisons.
Furthermore, the framework can accommodate different types of loss functions as
well as regularization terms.
5.1.2 Related work
Without using any training data, data-independent hashing methods usually generate
a set of hash functions using randomization. For instance, the popular LSH family of methods [32, 62, 65] uses random projection and thresholding to generate binary codes,
where the mutually close data points in the Euclidean space are likely to have similar
binary codes. Recently, Kulis and Grauman [63] propose a kernelized version of LSH,
which is capable of capturing the intrinsic relations between data points using kernels
instead of linear inner products.
In terms of learning methodology, data-dependent hashing methods can make use of
unsupervised, supervised or semi-supervised learning techniques to learn a set of hash
functions that generate the compact binary codes. As for unsupervised learning, two
typical approaches are used to obtain such compact binary codes, including thresholding
the real-valued low-dimensional vectors (after dimensionality reduction) and direct op-
timization of a Hamming distance based objective function (e.g., spectral hashing [34],
self-taught hashing [33]). The spectral hashing (SPH) method directly optimizes a graph
Laplacian objective function in the Hamming space. Inspired by SPH, Zhang et al. [33]
develop the self-taught hashing (STH) method. At the first step of STH, Laplacian graph
embedding is used to generate a sequence of binary codes for each example. By viewing
these binary codes as binary classification labels, it trains linear support vector ma-
chine classifiers as hash functions. Liu et al. [38] propose a scalable graph-based hashing
method which uses a small-size anchor graph to approximate the original neighborhood
graph and alleviates the computational limitation of spectral hashing.
As for the supervised learning case, a number of hashing methods take advantage of la-
beled training examples to build data-dependent hash functions. For example, Salakhut-
dinov and Hinton [124] propose the restricted Boltzmann machine (RBM) hashing
method using a multi-layer deep learning technique for binary code generation. Strecha
et al. [119] use Fisher linear discriminant analysis (LDA) to embed the original data
points into a lower-dimensional space, where the embedded data points are binarized
using thresholding. Boosting methods have also been employed to develop hashing
methods such as SSC [121] and Forgiving Hash [125], both of which learn a set of weak
learners as hash functions in the boosting framework. It is demonstrated in [118] that
some data-dependent hashing methods like stacked RBM and boosting SSC perform
much better than LSH on large-scale databases of millions of images. Wang et al. [17]
propose a semi-supervised hashing method, which aims to ensure small Hamming dis-
tance of similar data examples and large Hamming distance of dissimilar data points.
More recently, Liu et al. [31] introduce a kernel-based supervised hashing method, where
the hashing functions are nonlinear kernel functions.
The closest work to ours might be the boosting based SSC hashing [121], which also
learns a set of weighted hash functions through boosting learning. Our method differs from SSC in the learning procedure. The resulting optimization problem of our CGHash is based on the concept of margin maximization. We derive a meaningful Lagrange dual problem so that column generation can be applied to solve the semi-infinite optimization problem. In contrast, SSC is built on the learning procedure of AdaBoost,
which employs stage-wise coordinate-descent optimization. The weights associated with
selected hash functions (corresponding weak classifiers in AdaBoost) are not fully up-
dated in each iteration. Also, the information used for training is different: we use distance comparison information, whereas SSC uses pairwise information. In addition, our
work can accommodate various types of constraints, and can flexibly adapt to different
types of loss functions as well as regularization terms. It is unclear, for example, how
SSC can accommodate different types of regularization which may encode useful prior
information. In this sense our CGHash is much more flexible. Next, we present our
main results.
5.1.3 Notation
A bold lower-case letter (u, v) denotes a column vector. An element-wise inequality
between two vectors or matrices such as u ≥ v means ui ≥ vi for all i. 0 is the all-zero
column vector and 1 is the all-one column vector.
5.2 The proposed method
Given a set of training examples X = {x1, x2, . . . , xn} ⊂ Rd, we aim to learn a set of m hash functions:

Φ(x) = [h1(x), h2(x), . . . , hm(x)] ∈ {−1, +1}m,   (5.2)
which map these training examples to a low-dimensional binary space. The domain
of hash functions is denoted by C: h(·) ∈ C. The output of one hash function is a binary value: h(x) ∈ {−1, +1}. In the low-dimensional binary space, these binary codes
are supposed to preserve the underlying similarity information in the original high-
dimensional space. Next we learn such hash functions within the large-margin learning
framework.
Formally, suppose that we are given a set of triplets T = {(i, j, k)}. These triplets encode the similarity comparison information in which the distance/dissimilarity between xi and xj is smaller than that between xi and xk. We define the weighted Hamming distance for the learned binary codes as (with the constant multiplier removed):

dhm(xi, xj; w) = ∑_{r=1}^{m} wr |hr(xi) − hr(xj)|,   (5.3)
where wr is a non-negative weighting coefficient associated with the r-th hash function.
We want the constraints:
dhm(xi,xj) < dhm(xi,xk) (5.4)
to be satisfied as well as possible. For notational simplicity, we define
δh(i, j, k) = |h(xi)− h(xk)| − |h(xi)− h(xj)| (5.5)
and
δΦ(i, j, k) = [δh1(i, j, k), δh2(i, j, k), . . . , δhm(i, j, k)]. (5.6)
With the above definitions, the weighted Hamming distance comparison of a triplet can
be written as:
dhm(xi,xk)− dhm(xi,xj) = w>δΦ(i, j, k). (5.7)
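The quantities in (5.3)–(5.6) and the identity (5.7) translate directly into code; a small sketch assuming the binary codes in {−1, +1}^m have already been computed (names are illustrative):

```python
import numpy as np

def weighted_hamming(ci, cj, w):
    """d_hm(x_i, x_j; w) in (5.3); ci, cj are codes in {-1, +1}^m."""
    return np.sum(w * np.abs(ci - cj))

def delta_phi(ci, cj, ck):
    """delta-Phi(i, j, k) in (5.6): one entry per hash function."""
    return np.abs(ci - ck) - np.abs(ci - cj)

# Check the identity (5.7):
# d_hm(x_i, x_k; w) - d_hm(x_i, x_j; w) = w' * delta-Phi(i, j, k).
w = np.array([0.5, 1.0, 0.2, 0.8])
ci = np.array([+1, -1, +1, -1])
cj = np.array([+1, -1, -1, -1])
ck = np.array([-1, +1, +1, -1])
lhs = weighted_hamming(ci, ck, w) - weighted_hamming(ci, cj, w)
rhs = w @ delta_phi(ci, cj, ck)
```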
In what follows, we describe the details of our hashing algorithm using different types
of convex loss functions and regularization norms. In theory, any convex loss and regu-
larization can be used in our hashing framework.
5.2.1 Learning hash functions with squared hinge loss
As a starting example, we first discuss using squared hinge loss function and `1 norm
regularization for hash function learning. Using the squared hinge loss, we define the
following large-margin optimization problem:
min_{w,ξ}  1>w + C ∑_{(i,j,k)∈T} ξ(i,j,k)²   (5.8a)
s.t. ∀(i, j, k) ∈ T : dhm(xi, xk; w) − dhm(xi, xj; w) ≥ 1 − ξ(i,j,k),   (5.8b)
w ≥ 0, ξ ≥ 0.   (5.8c)
Here we have used the ℓ1 norm on w as the regularization term to control the complexity of the learned model; the weighting vector w is defined as:
w = [w1, w2, . . . , wm]>; (5.9)
ξ is the slack variable; C is a parameter controlling the trade-off between the training
error and model complexity. With the definition of weighted Hamming distance in (5.3)
and the notation in (5.6), the optimization problem in (5.8) can be rewritten as:
min_{w,ξ}  1>w + C ∑_{(i,j,k)∈T} ξ(i,j,k)²   (5.10a)
s.t. ∀(i, j, k) ∈ T : w>δΦ(i, j, k) ≥ 1 − ξ(i,j,k),   (5.10b)
w ≥ 0, ξ ≥ 0.   (5.10c)
We aim to solve the above optimization to obtain the weighting vector w and the set of hash functions Φ = [h1, h2, . . . ]. Once the hash functions are obtained, the optimization can be easily solved for w (e.g., using LBFGS-B [2]). In our approach, we apply the column generation technique to alternately solve for w and learn hash functions. Basically, we construct a working set of hash functions and repeat the following two steps until convergence: first we solve for the weighting vector using the current working set of hash functions, and then generate a new hash function and add it to the working set.
Column generation is a technique originally used for large-scale linear programming problems. Demiriz et al. [60] apply this technique to design boosting algorithms. In each iteration, one column (a variable in the primal or a constraint in the dual problem) is added. Once no violated constraint can be found in the dual, the current solution is optimal. In theory, if we run column generation for a sufficient number of iterations, we can obtain a sufficiently accurate solution. Here we only need to run a small number of column generation iterations (e.g., 60) to learn a compact set of hash functions.
To apply the column generation technique for learning hash functions, we derive the dual problem of the optimization in (5.10). The optimization in (5.10) can be equivalently written
as:

min_{w,ρ}  1>w + C ∑_{(i,j,k)∈T} [max(1 − ρ(i,j,k), 0)]²   (5.11a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.11b)
w ≥ 0.   (5.11c)
The Lagrangian of (5.11) can be written as:

L(w, ρ, µ, α) = 1>w + C ∑_{(i,j,k)∈T} [max(1 − ρ(i,j,k), 0)]² + ∑_{(i,j,k)∈T} µ(i,j,k) [ρ(i,j,k) − w>δΦ(i, j, k)] − α>w,   (5.12)
where µ, α are Lagrange multipliers and α ≥ 0. For the optimal primal solution, the following must hold: ∂L/∂w = 0 and ∂L/∂ρ = 0; hence the dual problem can be derived as:

max_µ  ∑_{(i,j,k)∈T} [µ(i,j,k) − µ(i,j,k)²/(4C)]   (5.13a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ 1.   (5.13b)
Here each µ(i,j,k) is a dual variable, which corresponds to one constraint in (5.11b).
The core idea of column generation is to generate a small subset of dual constraints by finding the most violated dual constraint in (5.13). This process is equivalent to adding primal variables into the primal optimization problem (5.11). Here finding the most violated dual constraint amounts to learning one hash function, which can be written as:
h⋆(·) = argmax_{h(·)∈C} ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k)
      = argmax_{h(·)∈C} ∑_{(i,j,k)∈T} µ(i,j,k) [|h(xi) − h(xk)| − |h(xi) − h(xj)|].   (5.14)
In each column generation iteration, we solve the above optimization to generate one
hash function.
Now we give an overview of our approach. Basically, we repeat the following two steps until convergence:
1. Solve the reduced primal problem in (5.11) using the current working set of hash
functions. We obtain the primal solution w and the dual solution µ in this step.
2. With the dual solution µ, we solve the subproblem in (5.14) to learn one hash function, and add it to the working set of hash functions.
Our method is summarized in Algorithm 5. We describe more details for running these
two steps as follows.
In the first step, we need to obtain the dual solution µ, which is required for solving the subproblem in (5.14) of the second step to learn one hash function. In each column generation iteration, we can easily solve the optimization in (5.11) using the current working set of hash functions to obtain the primal solution w, for example using the efficient LBFGS-B solver [2]. According to the Karush-Kuhn-Tucker (KKT) conditions, we have the following relation:

∀(i, j, k) ∈ T : µ⋆(i,j,k) = 2C max[1 − w⋆>δΦ(i, j, k), 0].   (5.15)
From the above, we are able to obtain the dual solution µ? for the primal solution w?.
In the second step, we solve the subproblem in (5.14) for learning one hash function. The hash function h(·) can be any function with a binary output value. When using a decision stump as the hash function, we can usually enumerate all possibilities exhaustively and find the globally best one. However, for many other types of hash functions, e.g., perceptrons and kernel functions, globally solving (5.14) is difficult. In our
experiments, we use the perceptron hash function:
h(x) = sign (v>x+ b). (5.16)
In order to obtain a smoothly differentiable objective function, we reformulate (5.14) into the following equivalent form:

h⋆(·) = argmax_{h(·)∈C} ∑_{(i,j,k)∈T} µ(i,j,k) [(h(xi) − h(xk))² − (h(xi) − h(xj))²].   (5.17)
The non-smooth sign function in (5.16) makes the optimization difficult. We replace the sign function with a smooth sigmoid function, and then locally solve the above optimization (5.17) (e.g., using LBFGS [2]) to learn the parameters of a hash function.
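This relaxation can be sketched as follows, with tanh as the smooth surrogate of sign and a quasi-Newton solver maximizing (5.17) locally; this is a simplified illustration, where the sigmoid scaling and the initialization are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def learn_hash_function(X, triplets, mu, scale=1.0, seed=0):
    """Locally maximize (5.17) for a perceptron hash function
    h(x) = sign(v'x + b), with sign relaxed to tanh."""
    d = X.shape[1]

    def neg_objective(params):
        v, b = params[:d], params[d]
        h = np.tanh(scale * (X @ v + b))       # smooth surrogate of sign
        obj = sum(m_ijk * ((h[i] - h[k]) ** 2 - (h[i] - h[j]) ** 2)
                  for (i, j, k), m_ijk in zip(triplets, mu))
        return -obj                            # minimize the negative

    # random-projection initialization, as in LSH
    x0 = np.random.default_rng(seed).standard_normal(d + 1)
    res = minimize(neg_objective, x0, method="L-BFGS-B")
    v, b = res.x[:d], res.x[d]
    return lambda x: np.sign(x @ v + b)
```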
We can apply a few initialization heuristics for solving (5.17). For example, similar to
LSH, we can generate a number of random planes and choose the best one, which maxi-
mizes the objective in (5.17), as the initial solution. We can also train a decision stump by searching for the best dimension and threshold to maximize the objective on the quantized data. Alternatively, we can employ the spectral relaxation method [38] which drops the
Algorithm 5: CGHash: Hashing using column generation (with squared hinge loss)
Input: training triplets T = {(i, j, k)}; training examples x1, x2, . . .; the number of bits m.
Output: learned hash functions h1, h2, . . . , hm and the associated weights w.
1 Initialize: µ ← 1/|T|.
2 for r = 1 to m do
3   find a new hash function hr(·) by solving the subproblem (5.14);
4   add hr(·) to the working set of hash functions;
5   solve the primal problem in (5.11) for w (using LBFGS-B [2]), and calculate the dual solution µ by (5.15);
sign function and solves a generalized eigenvalue problem to obtain a solution for initial-
ization. In our experiments, we use decision stump training and random projection for
initialization. However, applying the spectral relaxation method for initialization would
further improve the performance.
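The alternating procedure of Algorithm 5 can be sketched for the squared hinge loss as follows: the w-step solves (5.11) with L-BFGS-B, and the dual is recovered via the KKT relation (5.15). Here `learn_hash_fn` stands in for any solver of subproblem (5.14); the sketch is a simplified illustration, not the actual implementation:

```python
import numpy as np
from scipy.optimize import minimize

def solve_primal_w(dPhi, C):
    """Solve (5.11) for w given the current hash functions.
    dPhi: |T| x m matrix whose rows are delta-Phi(i, j, k)."""
    n_t, m = dPhi.shape

    def obj_grad(w):
        slack = np.maximum(1.0 - dPhi @ w, 0.0)       # hinge slacks
        obj = w.sum() + C * np.sum(slack ** 2)
        grad = np.ones(m) - 2.0 * C * (dPhi.T @ slack)
        return obj, grad

    res = minimize(obj_grad, np.zeros(m), jac=True,
                   method="L-BFGS-B", bounds=[(0.0, None)] * m)
    w = res.x
    mu = 2.0 * C * np.maximum(1.0 - dPhi @ w, 0.0)    # KKT relation (5.15)
    return w, mu

def cghash(X, triplets, learn_hash_fn, m_bits, C=1.0):
    """Column generation loop of Algorithm 5."""
    mu = np.full(len(triplets), 1.0 / len(triplets))  # initialize mu
    hash_fns, dphi_cols = [], []
    for _ in range(m_bits):
        h = learn_hash_fn(X, triplets, mu)            # subproblem (5.14)
        codes = h(X)
        dphi_cols.append([abs(codes[i] - codes[k]) - abs(codes[i] - codes[j])
                          for (i, j, k) in triplets])
        hash_fns.append(h)
        w, mu = solve_primal_w(np.array(dphi_cols).T, C)
    return hash_fns, w
```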
5.3 Hashing with general smooth convex loss functions
The previous discussion for squared hinge loss is an example of using smooth convex loss
function in our framework. To take a step forward, here we describe how to incorporate general smooth convex loss functions. We denote by f(·) a general convex loss function which is assumed to be smooth (e.g., the exponential, logistic, or squared hinge loss). Our algorithm can be easily extended to non-smooth loss functions. As an example, we discuss using the hinge loss in Appendix C.
We encourage the following constraints to be satisfied as well as possible:
∀(i, j, k) ∈ T : dhm(xi, xk) − dhm(xi, xj) = w>δΦ(i, j, k) ≥ 0.   (5.18)
These constraints do not have to be all strictly satisfied. Here we define the margin:
ρ(i,j,k) = w>δΦ(i, j, k), (5.19)
and we want to maximize the margin with regularization. Using `1 norm for regulariza-
tion, we define the primal optimization problem as:
min_w  1>w + C ∑_{(i,j,k)∈T} f(ρ(i,j,k))   (5.20a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.20b)
w ≥ 0.   (5.20c)
Here f(·) is a smooth convex loss function. C is a parameter controlling the trade-
off between the training error and model complexity. Without the regularization, one
can always make w arbitrarily large to make the convex loss approach zero when all
constraints are satisfied.
The squared hinge loss which we discussed before is an example of f(·). We can easily
recover the formulation in (5.11) for squared hinge loss by using the following definition:
f(ρ(i,j,k)) = [max(1 − ρ(i,j,k), 0)]².   (5.21)
To apply column generation, we derive the dual problem of (5.20). The Lagrangian of (5.20) can be written as:

L(w, ρ, µ, α) = 1>w + C ∑_{(i,j,k)∈T} f(ρ(i,j,k)) + ∑_{(i,j,k)∈T} µ(i,j,k) [ρ(i,j,k) − w>δΦ(i, j, k)] − α>w,   (5.22)
where µ, α are Lagrange multipliers and α ≥ 0. With the definition of Fenchel conjugate
[126], we have the following dual objective:
inf_{w,ρ} L(w, ρ, µ, α) = inf_ρ [ C ∑_{(i,j,k)∈T} f(ρ(i,j,k)) + ∑_{(i,j,k)∈T} µ(i,j,k) ρ(i,j,k) ]   (5.23)
= − sup_ρ [ −C ∑_{(i,j,k)∈T} f(ρ(i,j,k)) − ∑_{(i,j,k)∈T} µ(i,j,k) ρ(i,j,k) ]   (5.24)
= −C sup_ρ [ ∑_{(i,j,k)∈T} (−µ(i,j,k)/C) ρ(i,j,k) − ∑_{(i,j,k)∈T} f(ρ(i,j,k)) ]   (5.25)
= −C ∑_{(i,j,k)∈T} f∗(−µ(i,j,k)/C).   (5.26)
Here f∗(·) is the Fenchel conjugate of f(·). For the optimal primal solution, the condition ∂L/∂w = 0 must hold; hence we have the following relation:

1 − α − ∑_{(i,j,k)∈T} µ(i,j,k) δΦ(i, j, k) = 0.   (5.27)
Consequently, the corresponding dual problem of (5.20) can be written as:

max_µ  −C ∑_{(i,j,k)∈T} f∗(−µ(i,j,k)/C)   (5.28a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ 1.   (5.28b)
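As a numerical sanity check of this derivation, the conjugate term −C f∗(−µ/C) can be evaluated by a grid search over ρ; for the squared hinge loss it reproduces the term µ − µ²/(4C) in the dual objective (5.13a). A small sketch, where the grid bounds and resolution are illustrative choices:

```python
import numpy as np

def neg_conjugate_term(mu, C, f, grid=None):
    """Numerically evaluate -C * f*(-mu/C), where the Fenchel
    conjugate is f*(u) = sup_rho [u * rho - f(rho)]."""
    if grid is None:
        grid = np.linspace(-50.0, 50.0, 200001)
    u = -mu / C
    f_star = np.max(u * grid - f(grid))    # grid approximation of the sup
    return -C * f_star

sq_hinge = lambda rho: np.maximum(1.0 - rho, 0.0) ** 2
C, mu = 2.0, 1.5
val = neg_conjugate_term(mu, C, sq_hinge)
expected = mu - mu ** 2 / (4.0 * C)        # the corresponding term in (5.13a)
```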
With the above dual problem for general smooth convex loss functions, we generate a new hash function by finding the most violated constraint in (5.28b), in the same way as for the squared hinge loss. Hence, we solve the optimization in (5.14) to generate a new hash function. Using different loss functions will result in different dual solutions. The dual solution is required for generating hash functions.

As aforementioned, in each column generation iteration, we need to obtain the dual
solution before solving (5.14) to generate a hash function. Since we assume that f(·) is
smooth, the Karush-Kuhn-Tucker (KKT) conditions establish the connection between
the primal solution of (5.20) and the dual solution of (5.28):

∀(i, j, k) ∈ T : µ⋆(i,j,k) = −C f′(ρ⋆(i,j,k)),   (5.29)

where

ρ⋆(i,j,k) = w⋆>δΦ(i, j, k).   (5.30)
In other words, the dual variable is determined by the gradient of the loss function in
the primal. According to (5.29), we are able to obtain the dual solution µ? using the
primal solution w?.
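The relation (5.29) is direct to apply in code; a small sketch, taking the squared hinge loss (where it recovers (5.15)) and the logistic loss as examples:

```python
import numpy as np

def dual_from_primal(rho, C, loss_grad):
    """mu* = -C f'(rho*), the KKT relation (5.29)."""
    return -C * loss_grad(rho)

# squared hinge: f(rho) = max(1 - rho, 0)^2, so f'(rho) = -2 max(1 - rho, 0)
sq_hinge_grad = lambda rho: -2.0 * np.maximum(1.0 - rho, 0.0)
# logistic: f(rho) = log(1 + exp(-rho)), so f'(rho) = -1 / (1 + exp(rho))
logistic_grad = lambda rho: -1.0 / (1.0 + np.exp(rho))

rho = np.array([-0.5, 0.2, 1.5])   # illustrative margins w' * deltaPhi
C = 10.0
mu_sq = dual_from_primal(rho, C, sq_hinge_grad)    # matches (5.15)
mu_log = dual_from_primal(rho, C, logistic_grad)
```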
5.3.1 Hashing with logistic loss
It has been shown in (5.21) that the formulation for the squared hinge loss is an instance of the general formulation in (5.20) with smooth convex loss functions. Here we describe
using the logistic loss as another example of the general formulation. The learning
algorithm is similar to the case of using the squared hinge loss which is described before.
We have the following definition for the logistic loss:
f(ρ(i,j,k)) = log (1 + exp (−ρ(i,j,k))). (5.31)
The general result for smooth convex loss function can be applied here. The primal
optimization problem can be written as:
min_{w,ρ}  1>w + C ∑_{(i,j,k)∈T} log(1 + exp(−ρ(i,j,k)))   (5.32a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.32b)
w ≥ 0.   (5.32c)
The corresponding dual problem can be written as:
max_µ  ∑_{(i,j,k)∈T} [ (µ(i,j,k) − C) log(C − µ(i,j,k)) − µ(i,j,k) log(µ(i,j,k)) ]   (5.33a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ 1.   (5.33b)
The dual solution can be calculated by:

∀(i, j, k) ∈ T : µ⋆(i,j,k) = C / (exp(w⋆>δΦ(i, j, k)) + 1).   (5.34)
5.4 Hashing with `∞ norm regularization
The proposed method is flexible in that it can easily incorporate different types of regularization. Here we discuss the ℓ∞ norm regularization as an example. For general
convex loss, the optimization can be written as:
min_w  ‖w‖∞ + C ∑_{(i,j,k)∈T} f(ρ(i,j,k))   (5.35a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.35b)
w ≥ 0.   (5.35c)
This optimization problem can be equivalently written as:
min_w  ∑_{(i,j,k)∈T} f(ρ(i,j,k))   (5.36a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.36b)
0 ≤ w ≤ C′1,   (5.36c)
where C ′ is a constant that controls the regularization trade-off. This optimization can
be efficiently solved using quasi-Newton methods such as LBFGS-B [2] by eliminating
the auxiliary variable ρ. The Lagrangian can be written as:
L(w, ρ, µ, α, β) = ∑_{(i,j,k)∈T} f(ρ(i,j,k)) − α>w + β>(w − C′1) + ∑_{(i,j,k)∈T} µ(i,j,k) [ρ(i,j,k) − w>δΦ(i, j, k)],   (5.37)
where µ, α, β are Lagrange multipliers and α ≥ 0, β ≥ 0. Similar to the case for the ℓ1 norm, the dual problem can be written as:

max_{µ,β}  −C′1>β − ∑_{(i,j,k)∈T} f∗(−µ(i,j,k))   (5.38a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ βh,   (5.38b)
β ≥ 0.   (5.38c)
As in the ℓ1 norm case, the dual solution μ can be calculated using (5.29), and one hash function is generated at each iteration by solving the subproblem in (5.14). Similarly, different loss functions, including the squared hinge loss in (5.21) and the logistic loss in (5.31), can be combined with the ℓ∞ norm regularization. Owing to the flexibility of our framework, we can also use the non-smooth hinge loss with the ℓ∞ norm regularization; please refer to Appendix C for details.
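To make the boxed formulation (5.36) concrete, here is a minimal sketch of solving it with SciPy's L-BFGS-B, which handles the bound constraints 0 ≤ w ≤ C′ directly. The δΦ values are synthetic and the logistic loss is used; all names are our own, not the thesis code:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_triplets, n_hash = 200, 16
dPhi = rng.choice([-2.0, 0.0, 2.0], size=(n_triplets, n_hash))  # toy dPhi(i,j,k) rows
C_box = 10.0   # the box constant C' in (5.36c)

def objective(w):
    """Summed logistic losses f(rho) = log(1 + exp(-rho)), rho = dPhi @ w,
    i.e. (5.36) after eliminating the auxiliary variable rho."""
    rho = dPhi @ w
    value = np.logaddexp(0.0, -rho).sum()          # stable log(1 + exp(-rho))
    grad = dPhi.T @ (-1.0 / (1.0 + np.exp(rho)))   # chain rule through rho
    return value, grad

res = minimize(objective, x0=np.zeros(n_hash), jac=True,
               method="L-BFGS-B", bounds=[(0.0, C_box)] * n_hash)
w_star = res.x   # satisfies 0 <= w <= C' by construction
```

The bound constraints replace both the nonnegativity constraint and the ℓ∞ penalty, which is why no explicit regularization term appears in the objective.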
5.5 Extension of regularization
To demonstrate the flexibility of the proposed framework, we show an example that incorporates additional pairwise information into the hash learning. Assume that we are given pairwise similarity information, with the expectation that the distance between similar data pairs should be minimized. We can easily add a new regularization term to our objective function to leverage this additional information. Formally, let us denote the set of pairwise relations by:

D = {(i, j) | xi is similar to xj}.   (5.39)
We want to minimize the total weighted distance over similar pairs:

∑_{(i,j)∈D} d_hm(xi, xj) = ∑_{r=1}^{m} ∑_{(i,j)∈D} w_r |h_r(xi) − h_r(xj)|.   (5.40)
If we use this term to replace the ℓ1 regularization term 1ᵀw in the optimization (5.20), all of our analysis still holds and Algorithm 5 remains applicable with small modifications, because the new term can simply be seen as a weighted ℓ1 norm.
[Figure 5.1: Precision-recall results (using 60 bits) on the ISOLET, SCENE-15, MNIST, LABELME, USPS and PASCAL07 datasets. The legends report the average precision (AP) ± standard deviation for STH, LSI, LCH, ITQ, AGH, SPH, LSH, BREs, SPLH, SSC and CGHash. Our CGHash performs the best in most cases.]
5.6 Experiments
5.6.1 Dataset description
In order to evaluate the proposed column generation hashing method (referred to as
CGHash), we have conducted a set of experiments on six benchmark datasets. Table 5.1
gives a summary of all datasets used in the experiments. More specifically, the USPS
[Figure 5.2: Precision of top-50 retrieved examples using different numbers of bits on the six datasets. The legends report the mean score ± standard deviation at 60 bits. Our CGHash performs the best in most cases.]
dataset consists of 9,298 images of handwritten digits, each of which is resized to 16 × 16 pixels.
This dataset is split into two subsets at random (70% for training and 30% for testing).
The MNIST dataset is composed of 70,000 images of handwritten digits. The size of
each image is 28×28. This dataset is randomly partitioned into a training subset (66,000
images) and a testing subset (4,000 images). We select 2,000 images from the training
subset to generate a set of triplets used for learning hash functions. In the above two
handwritten image datasets, the original gray-scale intensity values of each image are
used as features.
[Figure 5.3: Nearest-neighbor classification error using different numbers of bits on the six datasets. The legends report the mean error ± standard deviation at 60 bits. Our CGHash performs the best in most cases.]
The ISOLET dataset contains 7,797 recordings of 150 subjects speaking the 26 letters
of the alphabet. Each subject spoke each letter twice. This dataset is randomly divided
into a training subset (5,459 spoken letters) and a testing subset (2,338 spoken letters).
Each letter is represented as a 617-dimensional feature vector.
The SCENE-15 dataset consists of 4,485 images of 9 outdoor scenes and 6 indoor scenes.
Each image is divided into 31 sub-windows, each of which is represented as a histogram
of 200 visual code words. A concatenation of the histograms associated with 31 sub-
windows is used to represent an image, resulting in a 6,200-dimensional feature vector.
[Figure 5.4: Performance of CGHash using different values of K (K ∈ {3, 10, 20, 30}) on the SCENE-15 dataset: precision-recall (using 60 bits), precision of top-50 retrieved examples, and nearest-neighbor classification error.]
This dataset is randomly divided into a training subset (1,500 examples) and a testing subset (2,985 examples).
The LABELME dataset1 [83] is a subset of the original LabelMe dataset, and consists
of 50,000 images that are categorized into 12 classes. Each image has 256× 256 pixels.
We generate 512-dimensional gist features for this dataset.
The PASCAL07 dataset is a subset of the PASCAL VOC 2007 dataset, and contains
9,963 images of 20 object classes. We use the 5 types of features provided in [82]; thus each image is represented by a 2,712-dimensional feature vector. This dataset is randomly separated
into a training subset (70%) and a testing subset (30%).
1http://www.ais.uni-bonn.de/download/datasets.html
Dataset     MNIST    USPS    LABELME [83]   SCENE-15 [84]   ISOLET   PASCAL07 [127]
Size        70,000   9,298   50,000         4,485           7,797    9,963
Dimension   784      256     512            6,200           617      2,712
Classes     10       10      12             15              26       20

Table 5.1: Summary of the 6 datasets used in the experiments.
5.6.2 Experiment setup
Each dataset is randomly split into a training subset and a testing subset. This train-
ing/testing split is repeated 5 times, and the average performance over these 5 trials
is reported here. In the experiments, the proposed hashing method is implemented using the squared hinge loss function with the ℓ1 norm regularization. Moreover, the triplets used for learning hash functions are generated in a similar way to [128]. Specifically, given a training example, we randomly select K similar examples and K dissimilar
examples by multi-class label consistency, then construct triplets based on these similar
and dissimilar data pairs. We choose K = 30 for the SCENE-15 dataset and K = 10
for the other datasets. The regularization trade-off parameter C is chosen by cross-validation. We found that, over a wide range, the setting of C does not have a significant impact on the performance.
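The triplet-generation step just described can be sketched as follows. This is a simplified version with our own function and variable names, not the thesis code:

```python
import numpy as np

def sample_triplets(labels, K, rng=None):
    """For each anchor i, draw up to K same-label examples j and up to K
    different-label examples k, pairing them into triplets (i, j, k)."""
    rng = np.random.default_rng(0) if rng is None else rng
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    triplets = []
    for i in idx:
        similar = idx[(labels == labels[i]) & (idx != i)]
        dissimilar = idx[labels != labels[i]]
        if len(similar) == 0 or len(dissimilar) == 0:
            continue  # anchor has no usable similar or dissimilar partner
        js = rng.choice(similar, size=min(K, len(similar)), replace=False)
        ks = rng.choice(dissimilar, size=min(K, len(dissimilar)), replace=False)
        triplets.extend((i, j, k) for j, k in zip(js, ks))
    return triplets
```

Each triplet (i, j, k) then encodes the relative relation that xi should be closer to xj than to xk in Hamming space.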
5.6.3 Competing methods
We compare our CGHash with several state-of-the-art hashing methods. For simplicity, they are respectively referred to as LSH (Locality Sensitive Hashing [65]), SSC (Supervised Similarity Sensitive Coding [118], a modified version of [121]), LSI (Latent
Semantic Indexing [129]), LCH (Laplacian Co-Hashing [130]), SPH (Spectral Hash-
ing [34]), STH (Self-Taught Hashing [33]), AGH (Anchor Graph Hashing [38]), BREs
(Supervised Binary Reconstructive Embedding [28]), SPLH (Semi-Supervised Learning
Hashing [17]), and ITQ (Iterative Quantization [36]).
5.6.4 Evaluation criteria
For quantitative performance comparison, we apply the following three evaluation mea-
sures:
1. Precision-recall curve. The precision and recall values are calculated as follows:
precision = #retrieved relevant examples / #all retrieved examples,   (5.41)

recall = #retrieved relevant examples / #all relevant examples.   (5.42)

2. Precision of top-K retrieved examples:

precision = #retrieved relevant examples / K.   (5.43)
3. K-nearest-neighbor classification. Each test example is classified using majority
voting of the top-K retrieved examples.
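These three measures can be sketched in a few lines; the helper names are ours, and binary codes are assumed to take values in {−1, +1}:

```python
import numpy as np
from collections import Counter

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query.
    For codes in {-1, +1}, d_h = (m - <q, z>) / 2."""
    m = db_codes.shape[1]
    dists = (m - db_codes @ query_code) / 2
    return np.argsort(dists, kind="stable")

def precision_recall(retrieved, relevant):
    """(5.41)/(5.42): fractions of retrieved-and-relevant examples."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def knn_classify(query_code, db_codes, db_labels, k=3):
    """K-nearest-neighbor classification by majority vote of the top-k retrieved."""
    top = hamming_rank(query_code, db_codes)[:k]
    return Counter(db_labels[i] for i in top).most_common(1)[0][0]
```

Sweeping a threshold over the ranked list yields the full precision-recall curve, and its area gives the average precision reported in the figure legends.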
5.6.5 Quantitative comparison results
The performance of all hashing methods on the six datasets is shown in Figures 5.1, 5.2 and 5.3. In these figures, we report results in the following three aspects:
1) We show the precision-recall result using the maximum bit length. In the legend of each figure, we report the average precision (the area under the curve) and its standard deviation.
2) We show the precision of top-50 retrieved examples using different bit lengths. In
the legend of each figure, we report the mean score and its standard deviation using the
maximum bit length.
3) We show the K-nearest-neighbor classification error using different bit lengths. In
the legend of each figure, we report the mean score and its standard deviation using the
maximum bit length.
These results show that the proposed CGHash achieves the best performance in most cases. In the precision-recall results, CGHash has the best average precision, calculated as the area under the precision-recall curve. CGHash also has the best precision of top-50 retrieved examples in most cases. Moreover, CGHash usually has a lower classification error than the other methods.
We evaluate our CGHash using different settings of K for generating triplets on the SCENE-15 dataset. The results, shown in Figure 5.4, indicate that in general the performance improves as K increases.
Some retrieval examples of our CGHash on the MNIST and LABELME datasets are
shown in Figure 5.5. It shows that CGHash is able to retrieve accurate nearest neighbors.
5.7 Conclusion
We have proposed a novel hashing method that is implemented using column generation
optimization. Our method aims to preserve the triplet-based relative ranking. Such
[Figure 5.5: Two retrieval examples for CGHash on the LABELME and MNIST datasets. Query examples are shown in the left column; the retrieved examples are shown on the right.]
a set of triplet constraints is incorporated into the large-margin learning framework. Hash functions are then learned iteratively using column generation. Experimental results have shown that the proposed hashing method achieves improved similarity-preserving performance compared with many existing hashing methods.
Chapter 6
A General Two-Step Approach to
Learning-Based Hashing
In this chapter, we propose a flexible and general method [44] with a two-step learning
scheme. Most existing approaches to hashing apply a single form of hash function, and
an optimization process which is typically deeply coupled to this specific form. This
tight coupling restricts the flexibility of the method to respond to the data, and can
result in complex optimization problems that are difficult to solve. Here we propose
a flexible yet simple framework that is able to accommodate different types of loss
functions and hash functions. This framework allows a number of existing approaches
to hashing to be placed in context, and simplifies the development of new problem-
specific hashing methods. Our framework decomposes the hashing learning problem
into two steps: hash bit learning and hash function learning based on the learned bits.
The first step can typically be formulated as binary quadratic problems, and the second
step can be accomplished by training standard binary classifiers. Both problems have
been extensively studied in the literature. Our extensive experiments demonstrate that
the proposed framework is effective, flexible and outperforms the state-of-the-art.
6.1 Introduction
In general, hash functions are generated with the aim of preserving some notion of
similarity between data points. One of the seminal approaches in this vein is the random
projection based locality-sensitive hashing (LSH) [32, 62], which randomly generates
hash functions to approximate cosine similarity. Compared to this data-independent
method, recent work has focused on data-dependent approaches for generating more
effective hash functions. In this category, a number of methods have been proposed, for
example: spectral hashing (SPH) [34], multi-dimension spectral hashing (MDSH) [35],
iterative quantization (ITQ) [36] and inductive manifold hashing [64]. These methods
do not rely on labeled data and are thus categorized as unsupervised hashing methods.
Another category is the supervised hashing methods. Recent works include supervised
hashing with kernels (KSH) [31], minimal loss hashing (MLH) [29], supervised binary
reconstructive embeddings (BRE) [28], semi-supervised sequential projection learning
hashing (SPLH) [17] and column generation hashing [43], etc.
Loss functions for hashing are typically defined on the basis of the Hamming distance or
Hamming affinity of similar and dissimilar data pairs. Hamming affinity is calculated by
the inner product of two binary codes (each bit takes a value in {−1, 1}). Existing
methods thus tend to optimize a single form of hash functions, the parameters of which
are directly optimized against the overall loss function. The common forms of hash
functions are linear perceptron functions (MLH, SPLH, LSH), kernel functions (KSH), and eigenfunctions (SPH, MDSH). The optimization procedure is then coupled with the selected family of hash functions. Different types of hash functions offer a trade-off between
testing time and ranking accuracy. For example, compared with kernel functions, the
simple linear perceptron function is usually much more efficient for evaluation but can
have a relatively low accuracy for nearest neighbor search. Moreover, this coupling often
results in a highly non-convex problem which can be very difficult to optimize.
As an example, the loss functions in MDSH, KSH and BRE all take a similar form
that aims to minimize the difference between the Hamming affinity (or distance) and
the ground truth of data pairs. However, the optimization procedures used in these
methods are coupled with the form of hash functions (eigenfunctions, kernel functions)
and thus different optimization techniques are needed.
Self-Taught Hashing (STH) [33] is a method which decomposes the learning procedure
into two steps: binary code generation and hash function learning. We extend this idea
and propose a general two-step approach to hashing of which STH can be seen as a
specific example. Note that STH optimizes the Laplacian affinity loss, which only tries
to pull together those similar data pairs but does not push away those dissimilar data
pairs. As shown in manifold learning, this may lead to inferior performance [71].
Our framework, however, is able to accommodate many different loss functions defined
on the Hamming affinity of data pairs, such as the loss function used in KSH, BRE or
MLH. This more general family of loss functions may consider both similar and dissimilar
data pairs. In order to produce effective binary codes in this first step, we develop a new
technique based on coordinate descent. We show that at each iteration of coordinate
descent, we can formulate the optimization problem of any Hamming affinity loss as a
binary quadratic problem. This formulation unifies different types of objective functions
into the same optimization problem, which significantly simplifies the optimization effort.
Our main contributions are as follows.
1. We propose a flexible hashing framework that decomposes the learning procedure
into two steps: binary codes inference step and hash function learning step. This
decomposition simplifies the problem and enables the use of different types of loss
functions and simplifies the hash function learning problem into a standard binary
classification problem. An arbitrary classifier, such as linear or kernel support
vector machines (SVM), boosting, neural networks, may thus be adopted to train
the hash functions.
2. For binary code inference, we show that optimization using different types of loss
functions (e.g., loss functions in KSH, BRE, MLH) can be solved as a series of
binary quadratic problems. We show that any type of loss function (e.g., the `2
loss, exponential loss, hinge loss) defined on Hamming affinity of data pairs can
be equivalently converted into a standard quadratic function. Based on this key
observation, we propose a general block coordinate descent method that is able to
incorporate many different types of loss functions in a unified manner.
3. The proposed method is simple and easy to implement. We carry out extensive
experiments on nearest neighbor search for image retrieval. To show the flexibility,
we evaluate our method using different types of loss functions and different forms of hash functions (linear SVM, kernel SVM, AdaBoost with decision stumps, etc.).
Experiments show that our method outperforms the state-of-the-art.
6.2 Two-Step Hashing
Given a set of training points X = {x1, x2, . . . , xn} ⊂ ℝᵈ, the goal of hashing is to learn
a set of hash functions that are able to preserve some notion of similarity between data
points. A ground truth affinity (or distance) matrix, Y, is provided (or calculated by a
pre-defined rule) for training, which defines the (dis-)similarity relations between data
pairs. In this case yij is the (i, j)-th element of the matrix Y, which is an affinity value
of the data pair (xi,xj). As a simple example, if the data labels are available, yij can
be defined as 1 for similar data pairs and −1 for dissimilar data pairs. In the case of
unsupervised learning, yij can be defined as the Euclidean distance or Gaussian affinity
on data points. The output of m hash functions is denoted by Φ(x):
Φ(x) = [h1(x), h2(x), . . . , hm(x)], (6.1)
which is a vector of m-bit binary codes: Φ(x) ∈ {−1, 1}ᵐ. In general, the optimization can be written as:

min_{Φ(·)} ∑_{i=1}^{n} ∑_{j=1}^{n} δij L(Φ(xi), Φ(xj); yij).   (6.2)
Here δij ∈ {0, 1} indicates whether the relation between two data points is defined, and L(·) is a loss function that measures how well the binary codes match the expected
affinity (or distance) yij . Many different types of loss functions L(·) have been devised,
and will be discussed in detail in the next section.
Most existing methods try to directly optimize objective (6.2) in order to learn the
parameters of hash functions [28, 29, 31, 35]. This inevitably means that the optimization
process is tightly coupled to the form of hash functions used, which makes it non-trivial
to extend a method to a different form of hash function. Moreover, this
coupling usually results in highly non-convex problems. Following the idea of STH [33],
we decompose the learning procedure into two steps: the first step for binary code
inference and the second step for hash function learning. The first step is to solve the
optimization:
min_Z ∑_{i=1}^{n} ∑_{j=1}^{n} δij L(zi, zj; yij),  s.t. Z ∈ {−1, 1}^{m×n},   (6.3)
where Z is the matrix of m-bit binary codes for all data points, and zi is the binary
code vector corresponding to the i-th data point.
The second step is to learn hash functions based on the binary codes obtained in the
first step, which is achieved by solving the optimization problem:
min_{Φ(·)} ∑_{i=1}^{n} G(zi, Φ(xi)).   (6.4)
Here G(·, ·) is a loss function which evaluates the correctness of the binary label predic-
tion. We solve the above optimization independently for each of the m bits. To learn
the r-th hash function hr(·), the optimization can be written as:

min_{hr(·)} ∑_{i=1}^{n} F(zi,r, hr(xi)).   (6.5)
Here F(·, ·) is a loss function defined on two codes which evaluates their consistency; zi,r is the binary code corresponding to the i-th data point and the r-th bit. Clearly, the above optimization is a binary classification problem, which minimizes a loss given the binary labels. For example, the loss function F(·) can be a zero-one
loss function returning 0 if two inputs have the same value, and 1 otherwise. As in
classification, one can also use a convex surrogate to replace the zero-one loss. Typical
surrogate loss functions are hinge loss, logistic loss, etc. The resulting classifier is the
hash function that we aim to learn. Therefore, we are able to use any form of classifier.
For example, we can learn perceptron hash functions by training a linear SVM. The
linear perceptron hash function has the form:
h(x) = sign(wᵀx + b).   (6.6)
We could also train, for example, an RBF-kernel SVM or AdaBoost as hash functions. Here we describe a kernel hash function that is learned using a linear SVM on kernel-transferred features (referred to as SVM-KF). The hash function learned by SVM-KF has the following form:

h(x) = sign( ∑_{q=1}^{Q} w_q κ(x′_q, x) + b ),   (6.7)

in which X′ = {x′1, . . . , x′Q} are Q data points generated from the training set by random or uniform sampling.
We evaluate a variety of hash functions in the Experiments Section below. These tests show that kernel hash functions often offer better ranking precision but require much more evaluation time than linear perceptron hash functions. The hash functions learned by SVM-KF represent a trade-off between a kernel SVM and a linear SVM.
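Step-2 thus reduces to training one binary classifier per bit on the Step-1 codes. As a minimal self-contained illustration, the sketch below trains a simple perceptron per bit; this is a stand-in for the linear SVM actually used in the thesis, and all names are ours:

```python
import numpy as np

def train_linear_hash(X, z, epochs=100, lr=0.1):
    """Learn one hash function h(x) = sign(w^T x + b) from Step-1 bit
    labels z in {-1, +1}^n, using plain perceptron updates."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            if z[i] * (X[i] @ w + b) <= 0:   # misclassified (or on the boundary)
                w += lr * z[i] * X[i]
                b += lr * z[i]
    return w, b

def hash_codes(X, hash_params):
    """Apply the m learned hash functions to produce m-bit codes in {-1, +1}."""
    return np.column_stack(
        [np.where(X @ w + b >= 0, 1, -1) for (w, b) in hash_params])
```

For the kernel variants (RBF-kernel SVM or SVM-KF), X would simply be replaced by the kernel-transferred features κ(x′_q, x) before training.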
The method proposed here is labeled Two-Step Hashing (TSH); the steps are as follows:

• Step-1: Solve the optimization problem in (6.3) using block coordinate descent (Algorithm 6) to obtain binary codes for each training data point.

• Step-2: Solve the binary classification problem in (6.5) for each bit, based on the binary codes obtained in Step-1.
An illustration of TSH is shown in Figure 6.1.
6.3 Solving binary quadratic problems
Optimizing (6.3) in Step-1 for the entire binary code matrix can be difficult. Instead, we
develop a bit-wise block coordinate descent method so that the problem at each iteration
[Figure 6.1: An illustration of Two-Step Hashing. Step-1 converts the hashing learning optimization into binary quadratic problems, which are solved for the binary codes; Step-2 solves binary classification problems to obtain the hash functions. Loss function options: any Hamming distance or affinity based loss function, e.g., KSH, BRE. Hash function options: any classifier, e.g., linear/kernel SVM, boosting, random forest, neural network.]
Algorithm 6: TSH: binary code inference (Step-1)
Input: affinity matrix Y; bit length m; number of cyclic iterations r.
Output: the matrix of binary codes Z.
1  Initialize the binary code matrix Z.
2  repeat
3      for t = 1, 2, . . . , m do
4          solve the binary quadratic problem (BQP) in (6.16) to obtain the binary codes of the t-th bit;
5          update the codes of the t-th bit in the code matrix Z;
6  until the maximum number of cyclic iterations r is reached
can be solved easily. Moreover, we show that at each iteration, any pairwise Hamming
affinity (or distance) based loss can be equivalently formulated as a binary quadratic
problem. Thus we are able to easily work with different loss functions.
Block coordinate descent (BCD) is a technique that iteratively optimizes a subset of
variables at a time. For each iteration, we pick one bit for optimization in a cyclic
fashion. The optimization for the r-th bit can be written as:

min_{z(r)} ∑_{i=1}^{n} ∑_{j=1}^{n} δij l_r(zi,r, zj,r),  s.t. z(r) ∈ {−1, 1}ⁿ,   (6.8)

where l_r is the loss function defined on the r-th bit:

l_r(zi,r, zj,r) = L(zi,r, zj,r, z̄i, z̄j; yij).   (6.9)

Here z(r) contains the binary codes of the r-th bit; zi,r is the binary code of the i-th data point at the r-th bit; and z̄i denotes the binary codes of the i-th data point excluding the r-th bit.
Thus far, we have not described the form of the loss function L(·). Our optimization method is not restricted to a specific form of loss function. Based on the following proposition, we are able to rewrite any Hamming affinity (or distance) based loss function L(·) as a standard quadratic problem.
Proposition 6.1. For any loss function l(z1, z2) that is defined on a pair of binary input variables z1, z2 ∈ {−1, 1} and satisfies

l(1, 1) = l(−1,−1),  l(1,−1) = l(−1, 1),   (6.10)

we can define a quadratic function g(z1, z2) that is equal to l(z1, z2). We have the following equations:

l(z1, z2) = (1/2) [z1 z2 (l(11) − l(−11)) + l(11) + l(−11)]   (6.11)
          = (1/2) z1 z2 (l(11) − l(−11)) + const   (6.12)
          = g(z1, z2).   (6.13)

Here l(11) and l(−11) are constants: l(11) is the loss output on an identical input pair, l(11) = l(1, 1), and l(−11) is the loss output on a distinct input pair, l(−11) = l(−1, 1).
Proof. This proposition can be proved by exhaustively checking all possible inputs of the loss function. Notice that there are only two possible output values of the loss function. For the input (z1 = 1, z2 = 1):

g(1, 1) = (1/2) [1 × 1 × (l(11) − l(−11)) + l(11) + l(−11)] = l(11) = l(1, 1).

For the input (z1 = −1, z2 = 1):

g(−1, 1) = (1/2) [−1 × 1 × (l(11) − l(−11)) + l(11) + l(−11)] = l(−11) = l(−1, 1).

The input (z1 = −1, z2 = −1) gives the same value as (z1 = 1, z2 = 1), and the input (z1 = 1, z2 = −1) the same as (z1 = −1, z2 = 1). In conclusion, the functions l and g have the same output for all possible inputs.
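Proposition 6.1 is also easy to verify numerically. The sketch below builds the quadratic surrogate g from any admissible l and checks equality on all four input pairs; a KSH-style single-bit ℓ2 loss is used as an example, and the helper name is ours:

```python
def quadratic_surrogate(loss):
    """Given l(z1, z2) with l(1,1) = l(-1,-1) and l(1,-1) = l(-1,1),
    return g(z1, z2) = 0.5 * (z1*z2*(l11 - lm11) + l11 + lm11)."""
    l11, lm11 = loss(1, 1), loss(-1, 1)   # the two constants of (6.11)
    return lambda z1, z2: 0.5 * (z1 * z2 * (l11 - lm11) + l11 + lm11)

# Example: single-bit l2 loss on Hamming affinity for a similar pair (y = 1).
l = lambda z1, z2: (z1 * z2 - 1) ** 2
g = quadratic_surrogate(l)
same_on_all_inputs = all(l(a, b) == g(a, b) for a in (-1, 1) for b in (-1, 1))
```

Since only the agreement of the two bits matters, the four input pairs collapse to two loss values, which is exactly why a quadratic in z1 z2 suffices.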
Any hash loss function l(·, ·) defined on the Hamming affinity or Hamming distance of data pairs meets the requirement that l(1, 1) = l(−1,−1) and l(1,−1) = l(−1, 1). Applying this proposition, the optimization in (6.8) can be equivalently reformulated as:

min_{z(r)∈{−1,1}ⁿ} ∑_{i=1}^{n} ∑_{j=1}^{n} δij (l(11)_{r,i,j} − l(−11)_{r,i,j}) zi,r zj,r.   (6.14)

The above optimization is an unconstrained binary quadratic problem. Let a_{i,j} denote the (i, j)-th element of a matrix A, which we define as:

a_{i,j} = δij (l(11)_{r,i,j} − l(−11)_{r,i,j}).   (6.15)

The optimization (6.14) can then be written in matrix form:

min_{z(r)} z(r)ᵀ A z(r),  s.t. z(r) ∈ {−1, 1}ⁿ.   (6.16)
We have shown that at each iteration, the original optimization in (6.8) can be equiv-
alently reformulated as a binary quadratic problem (BQP) in (6.16). BQP has been
extensively studied. To solve (6.16), we first apply the spectral relaxation to get an
initial solution. Spectral relaxation drops the binary constraints. The optimization
becomes
minz(r)
z>(r)Az(r), s.t. ‖z(r)‖22 = n. (6.17)
The solution (denoted z0(r)) of the above optimization is simply the eigenvector that
corresponds to the minimum eigenvalue of the matrix A. To achieve a better solution,
here we take a step further. We solve the following relaxed problem of (6.16) as follows
minz(r)
z>(r)Az(r), s.t. z(r) ∈= [−1, 1]n. (6.18)
This relaxation is tighter than the spectral relaxation and yields a solution of better quality. To solve the above problem, we use the solution z^0_{(r)} of the spectral relaxation in (6.17) as initialization and run the efficient L-BFGS-B solver [2]. The algorithm for binary code inference in Step 1 is summarized in Algorithm 6.
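As a concrete sketch of this two-stage procedure (not the thesis implementation: a random symmetric matrix stands in for the A of (6.15)), one can compute the spectral solution via an eigendecomposition and then refine it under the box constraints with SciPy's L-BFGS-B:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 60
# Hypothetical symmetric coefficient matrix, standing in for
# a_ij = delta_ij * (l^(11) - l^(-11)) from Eq. (6.15).
A = rng.standard_normal((n, n))
A = (A + A.T) / 2

def obj(z):
    # objective z^T A z and its gradient 2 A z
    return z @ A @ z, 2 * A @ z

# Spectral relaxation (6.17): eigenvector of the smallest eigenvalue,
# scaled so that ||z||_2^2 = n (eigh returns unit-norm eigenvectors).
w, V = np.linalg.eigh(A)
z0 = V[:, 0] * np.sqrt(n)

# Tighter box relaxation (6.18): z in [-1, 1]^n, solved with L-BFGS-B
# starting from the spectral solution clipped into the box.
z0c = np.clip(z0, -1.0, 1.0)
res = minimize(obj, z0c, jac=True, method="L-BFGS-B",
               bounds=[(-1.0, 1.0)] * n)

z_binary = np.sign(res.x)        # final binary codes for this bit
z_binary[z_binary == 0] = 1
```

Since L-BFGS-B only decreases the objective from its starting point, the refined solution is never worse than the (clipped) spectral one.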
The approach proposed above is applicable to many different types of loss functions defined on the Hamming distance or Hamming affinity, such as the ℓ2 loss, the exponential loss and the hinge loss. Here we describe a selection of such loss functions, most of which arise from recently proposed hashing methods. We evaluate these loss functions in the Experiments section below. Note that m is the number of bits, and dh(·, ·) is the
Hamming distance on data pairs. If not specified, yij = 1 if the data pair is similar, and yij = −1 if the data pair is dissimilar. δ(·) ∈ {0, 1} is an indicator function.
• TSH-KSH
The KSH loss function is based on Hamming affinity and uses the ℓ2 loss. MDSH also uses a similar form of loss function (weighted Hamming affinity instead):

L_{KSH}(z_i, z_j) = \big( z_i^\top z_j - m y_{ij} \big)^2. \qquad (6.19)
• TSH-BRE
The BRE loss function is based on Hamming distance using the ℓ2 loss function:

L_{BRE}(z_i, z_j) = \big( d_h(z_i, z_j)/m - \delta(y_{ij} < 0) \big)^2. \qquad (6.20)
• TSH-SPLH
This applies an exponential loss on top of the loss proposed in SPLH, which is based on the Hamming affinity of data pairs:

L_{SPLH}(z_i, z_j) = \exp\left[ \frac{-y_{ij} z_i^\top z_j}{m} \right]. \qquad (6.21)
• TSH-EE
Elastic Embedding (EE) is a dimensionality reduction method proposed in [71]. Here we use their loss function with some modifications: an exponential loss based on the Hamming distance, where λ is a trade-off parameter:

L_{EE}(z_i, z_j) = \delta(y_{ij} > 0)\, d_h(z_i, z_j) + \lambda\, \delta(y_{ij} < 0) \exp\big[ -d_h(z_i, z_j)/m \big]. \qquad (6.22)
• TSH-ExpH
ExpH is an exponential loss function using the Hamming distance:

L_{ExpH}(z_i, z_j) = \exp\left[ \frac{y_{ij}\, d_h(z_i, z_j) + m\, \delta(y_{ij} < 0)}{m} \right]. \qquad (6.23)
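The five losses above can be written compactly in a few lines. The sketch below is an illustrative Python rendering of Eqs. (6.19)–(6.23) on m-bit codes in {−1, 1}^m, using the identity d_h(z_i, z_j) = (m − z_i^⊤ z_j)/2 (the function names are ours, not from the thesis):

```python
import numpy as np

m = 8                                    # number of bits

def hamming_dist(zi, zj):
    # d_h = (m - <zi, zj>) / 2 for codes in {-1, 1}^m
    return (m - zi @ zj) / 2

# Pairwise losses of Eqs. (6.19)-(6.23); y is +1 for a similar pair,
# -1 for a dissimilar one; lam is the EE trade-off parameter.
def L_ksh(zi, zj, y):
    return float((zi @ zj - m * y) ** 2)

def L_bre(zi, zj, y):
    return float((hamming_dist(zi, zj) / m - (y < 0)) ** 2)

def L_splh(zi, zj, y):
    return float(np.exp(-y * (zi @ zj) / m))

def L_ee(zi, zj, y, lam=100.0):
    d = hamming_dist(zi, zj)
    return float((y > 0) * d + lam * (y < 0) * np.exp(-d / m))

def L_exph(zi, zj, y):
    d = hamming_dist(zi, zj)
    return float(np.exp((y * d + m * (y < 0)) / m))
```

Each loss depends on the pair only through z_i^⊤ z_j (equivalently the Hamming distance), which is exactly the property the Step-1 inference exploits.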
Table 6.1: Results (using hash codes of 32 bits) of TSH using different loss functions, and a selection of other supervised and unsupervised methods on 3 datasets. The upper part reports the results on training data and the lower part on testing data. The results show that Step-1 of our method is able to generate effective binary codes that outperform those of competing methods on the training data. On the testing data our method also outperforms the others by a large margin in most cases.

                Precision-Recall           MAP                        Precision at K (K=300)
                LABELME MNIST  CIFAR10    LABELME MNIST  CIFAR10    LABELME MNIST  CIFAR10

Results on training data
TSH-KSH         0.501   1.000  1.000      0.570   1.000  1.000      0.229   0.667  0.667
TSH-BRE         0.527   1.000  1.000      0.600   1.000  1.000      0.230   0.667  0.667
TSH-SPLH        0.504   1.000  1.000      0.524   1.000  1.000      0.230   0.667  0.667
TSH-EE          0.485   1.000  1.000      0.524   1.000  1.000      0.224   0.667  0.667
TSH-ExpH        0.475   1.000  1.000      0.541   1.000  1.000      0.225   0.667  0.667
STHs            0.335   0.800  0.629      0.387   0.882  0.774      0.176   0.575  0.433
KSH             0.283   0.892  0.585      0.316   0.967  0.652      0.168   0.647  0.481
BREs            0.161   0.445  0.220      0.153   0.504  0.190      0.097   0.376  0.171
SPLH            0.166   0.500  0.292      0.153   0.588  0.302      0.092   0.422  0.260
MLH             0.120   0.547  0.190      0.142   0.685  0.235      0.100   0.478  0.200

Results on testing data
TSH-KSH         0.175   0.843  0.282      0.296   0.893  0.440      0.293   0.889  0.410
TSH-BRE         0.169   0.844  0.283      0.293   0.896  0.439      0.293   0.890  0.409
TSH-SPLH        0.174   0.840  0.284      0.291   0.895  0.444      0.288   0.891  0.416
TSH-EE          0.169   0.843  0.280      0.288   0.896  0.438      0.286   0.892  0.410
TSH-ExpH        0.172   0.844  0.282      0.287   0.892  0.441      0.286   0.887  0.410
STHs            0.094   0.385  0.144      0.162   0.639  0.229      0.156   0.634  0.218
STHs-RBF        0.151   0.674  0.178      0.274   0.897  0.354      0.271   0.893  0.352
KSH             0.165   0.781  0.249      0.279   0.884  0.407      0.158   0.881  0.398
BREs            0.106   0.409  0.151      0.178   0.703  0.226      0.171   0.702  0.210
MLH             0.100   0.470  0.150      0.181   0.648  0.264      0.174   0.623  0.215
SPLH            0.093   0.452  0.191      0.168   0.714  0.321      0.158   0.708  0.315
ITQ-CCA         0.077   0.619  0.206      0.143   0.792  0.333      0.133   0.784  0.325
MDSH            0.100   0.298  0.150      0.178   0.691  0.288      0.155   0.685  0.228
SPHER           0.102   0.296  0.152      0.185   0.624  0.244      0.176   0.623  0.233
ITQ             0.116   0.386  0.161      0.206   0.750  0.264      0.197   0.751  0.252
AGH             0.096   0.404  0.144      0.194   0.743  0.252      0.187   0.744  0.244
STH             0.077   0.361  0.135      0.135   0.593  0.216      0.125   0.644  0.204
BRE             0.091   0.323  0.137      0.160   0.651  0.238      0.147   0.582  0.185
LSH             0.069   0.211  0.123      0.116   0.459  0.188      0.103   0.448  0.162
Table 6.2: Training time (in seconds) for TSH using different loss functions, and several other supervised methods on 3 datasets. The value inside brackets is the time used in the first step for inferring the binary codes. The results show that our method is efficient. Note that the second step of learning the hash functions can be easily parallelised.

                LABELME     MNIST       CIFAR10
TSH-KSH         198 (107)   341 (294)   326 (262)
TSH-BRE         133 (33)    309 (264)   234 (175)
TSH-EE          124 (29)    302 (249)   287 (225)
TSH-ExpH        128 (43)    334 (281)   344 (256)
STHs-RBF        133         99          95
KSH             326         355         379
BREs            216         615         231
MLH             670         805         658
Figure 6.2: Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are the top 40 retrieved images in the database. False predictions are marked by red boxes.
Figure 6.3: Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are the top 40 retrieved images in the database. False predictions are marked by red boxes.
[Figure 6.4 panels: LabelMe and CIFAR10; Precision@300, precision within Hamming distance 2, and MAP (1000-NN) versus the number of bits, plus precision versus the number of retrieved samples (32 bits), for SPLH, STHs-RBF, ITQ-CCA, MLH, BREs, KSH and TSH.]
Figure 6.4: Results of supervised methods on 2 datasets. Results show that TSH usually outperforms the others by a large margin. The runner-up methods are STHs-RBF and KSH.
[Figure 6.5 panels: LabelMe and CIFAR10; Precision@300, precision within Hamming distance 2, and MAP (1000-NN) versus the number of bits, plus precision versus the number of retrieved samples (32 bits), for KLSH, SPH, AGH, ITQ, SPHER, MDSH and TSH.]
Figure 6.5: Results on 2 datasets comparing unsupervised methods. Results show that TSH usually outperforms the others by a large margin.
[Figure 6.6 panels: SCENE15, USPS and ISOLET; Precision@100 and MAP (1000-NN) versus the number of bits, for ITQ, SPHER, ITQ-CCA, BREs, KSH and TSH.]
Figure 6.6: Results on SCENE15, USPS and ISOLET comparing with supervised and unsupervised methods. Our TSH performs the best.
[Figure 6.7 panels: LabelMe and CIFAR10; Precision@300, precision within Hamming distance 2, and MAP (1000-NN) versus the number of bits, plus precision versus the number of retrieved samples (32 bits), for TSH-Stump, TSH-LSVM, TSH-KF and TSH-RBF.]
Figure 6.7: Results on 2 datasets of our method using different hash functions. Results show that using kernel hash functions (TSH-RBF and TSH-KF) achieves the best performance.
[Figure 6.8 panels: LabelMe and CIFAR10; database code compression time (seconds) versus binary code length, for TSH-Stump, TSH-LSVM, TSH-KF and TSH-RBF.]
Figure 6.8: Code compression time using different hash functions. Results show that using the kernel-transferred feature (TSH-KF) is much faster than SVM with the RBF kernel (TSH-RBF). Linear SVM is the fastest.
[Figure 6.9 panels: FLICKR1M and TINY580K; precision versus recall (32 bits) and precision versus the number of retrieved samples (32 bits), for SPLH, STHs, ITQ-CCA, MLH, BREs, KSH and TSH.]
Figure 6.9: Comparison of supervised methods on 2 large-scale datasets: Flickr1M and Tiny580k. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform the other methods.
[Figure 6.10 panels: FLICKR1M and TINY580K; precision versus recall (32 bits) and precision versus the number of retrieved samples (32 bits), for KLSH, SPH, AGH, ITQ, SPHER, MDSH and TSH.]
Figure 6.10: Comparison of unsupervised methods on 2 large-scale datasets: Flickr1M and Tiny580k. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform the other methods.
6.4 Experiments
6.4.1 Comparing methods
We compare with a number of state-of-the-art hashing methods, including 6 (semi-)supervised methods: Supervised Hashing with Kernels (KSH) [31], Iterative Quantization with supervised embedding (ITQ-CCA) [36], Minimal Loss Hashing (MLH) [29], Supervised Binary Reconstructive Embeddings (BREs) [28] and its unsupervised version BRE, Supervised Self-Taught Hashing (STHs) [33] and its unsupervised version STH, and Semi-supervised Sequential Projection Learning Hashing (SPLH) [17]; and 7 unsupervised methods: Locality-Sensitive Hashing (LSH) [32], Iterative Quantization (ITQ) [36], Anchor Graph Hashing (AGH) [38], Spectral Hashing (SPH) [34], Spherical Hashing (SPHER) [39], Multi-dimension Spectral Hashing (MDSH) [35], and Kernelized Locality-Sensitive Hashing (KLSH) [63].
For the competing methods, we follow the original papers for parameter settings. For SPLH, the regularization trade-off parameter is picked from 0.01 to 1. We use the hierarchical variant of AGH. For each dataset, the bandwidth parameter of the Gaussian affinity in MDSH and of the RBF kernel in KLSH, KSH and our method TSH is set as σ = t d, where d is the average Euclidean distance of the top 100 nearest neighbours and t is picked from 0.01 to 50. For STHs and our method TSH, the trade-off parameter in the SVM is picked from 10/n to 10^5/n, where n is the number of data points. For TSH-EE, which uses the EE loss function, we simply set the trade-off parameter λ to 100. If not specified, our method TSH uses SVM with the RBF kernel as hash functions. The cyclic iteration number r in Algorithm 6 is simply set to 1.
6.4.2 Dataset description
We use 2 large-scale image datasets and 6 smaller datasets for evaluation. The 2 large image datasets are the 580,000-image tiny image dataset (Tiny-580K) and the Flickr 1 million image dataset (Flickr-1M). The 6 small datasets are CIFAR10, MNIST, LabelMe, SCENE15, and 2 UCI datasets: USPS and ISOLET.
CIFAR10^1 is a subset of the Tiny-80M [131] image dataset and contains 60K examples. We generated 320-dimensional GIST features. MNIST consists of 70K examples with dimension 784. The LabelMe image dataset is used in [29]; it has 22K images and 512-dimensional GIST features. SCENE15 [84] is a dataset of scene images which has 4.5K examples. We extract a 200-visual-word histogram over 31 sub-windows, which results in a 6200-dimensional feature vector. The ISOLET dataset contains around 8K spoken recordings of 26 letters and has 617 dimensions. USPS is a handwritten digits dataset which has 9K examples and 256 dimensions.
The Tiny-580K dataset, described in [36], is a subset of the Tiny-80M dataset [131] which consists of 580,000 images. We use the provided GIST features with 384 dimensions. The Flickr-1M dataset consists of 1 million thumbnail images from MIRFlickr-1M. We generate 320-dimensional GIST features.
For the LabelMe dataset, the ground truth pairwise affinity matrix is provided. For the other small datasets, we use the multi-class labels to define the ground truth affinity by label agreement. For the large datasets Tiny-580K and Flickr-1M, no semantic ground truth affinity is provided. Following the same setting as other hashing methods [17, 31], we generate pseudo-labels for supervised methods according to the ℓ2 distance. In detail, a data point is labelled as a relevant neighbour of the query if it lies in the top 2 percentile of points of the whole database.
1http://www.cs.toronto.edu/˜kriz/cifar.html
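A minimal sketch of this pseudo-label construction (with made-up random vectors standing in for the actual GIST features):

```python
import numpy as np

# A database point counts as a relevant neighbour of a query if it lies
# within the top 2% of the database by Euclidean (l2) distance.
rng = np.random.default_rng(3)
db = rng.standard_normal((500, 16))      # hypothetical database features
query = rng.standard_normal(16)          # hypothetical query feature

dists = np.linalg.norm(db - query, axis=1)
cutoff = np.quantile(dists, 0.02)        # top-2-percentile distance radius
relevant = dists <= cutoff               # pseudo-label y = +1 for these points
```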
For all datasets, following a common setting in many supervised hashing methods [28, 29, 31], we randomly select 2000 examples as testing queries, and the rest serves as the database. We train all methods using a subset of the database: 5000 examples for the large datasets (Tiny-580K and Flickr-1M) and 2000 examples for the rest.
6.4.3 Evaluation measures
We use 4 types of evaluation measures: precision of the top-K retrieved examples (Precision-at-K), Mean Average Precision (MAP), the area under the Precision-Recall curve, and precision of retrieved examples within Hamming distance 2. Precision-at-K is the proportion of relevant data points in the returned top-K results. MAP averages the Precision-at-K scores over the positions K of all relevant data points in a ranking. The Precision-Recall curve measures the overall performance at all positions; it is computed by varying the number of retrieved nearest neighbours (K) and calculating the corresponding precision and recall values.
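For concreteness, Precision-at-K and average precision over a single ranked result list can be computed as follows (an illustrative sketch; the function names are ours, not the thesis' evaluation code):

```python
import numpy as np

def precision_at_k(ranked_relevance, k):
    """Fraction of relevant items among the top-k retrieved results.

    ranked_relevance is a 0/1 array, ordered by retrieval rank."""
    return float(np.mean(ranked_relevance[:k]))

def average_precision(ranked_relevance):
    """Precision@K averaged over the ranks K of the relevant items."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.flatnonzero(rel) + 1           # 1-based positions of hits
    prec = np.cumsum(rel)[ranks - 1] / ranks  # precision at each hit
    return float(prec.mean())
```

MAP is then the mean of `average_precision` over all test queries.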
6.4.4 Using different loss functions
We evaluate the performance of our method TSH using different loss functions on 3 datasets: LabelMe, MNIST and CIFAR10. 3 types of evaluation measures are used here: Precision-at-K, Mean Average Precision (MAP) and the area under the Precision-Recall curve. The loss functions are defined in Section 6.3. In particular, TSH-KSH uses the KSH [31] loss function and TSH-BRE uses the BRE [28] loss function. STHs-RBF is the STHs method using RBF kernel hash functions. Our method also uses SVM with the RBF kernel as hash functions.
First, we evaluate the effectiveness of Step 1 of our method by comparing the quality of the binary codes generated on the training data points. The results are shown in the upper part of Table 6.1. They show that our methods generate high-quality binary codes and outperform the others by a large margin. On CIFAR10 and MNIST, we are able to generate perfect codes that match the ground truth similarity. This demonstrates the effectiveness of the coordinate-descent-based binary code learning procedure (Step 1 of our framework).
Compared to STHs-RBF, even though we use the same form of hash function, our overall objective function and the bit-wise binary code inference algorithm appear to be more effective; thus our method achieves better performance than STHs-RBF.
The lower part of Table 6.1 shows the testing performance. Our method also outperforms the others in most cases. Note that MNIST is an 'easy' dataset and not as challenging as CIFAR10 and LabelMe, so many methods manage to achieve good performance on it. On the challenging CIFAR10 and LabelMe datasets, our method outperforms the others by a large margin.
Overall, for preserving semantic similarity, supervised methods usually perform much better than unsupervised methods, as expected. Our method performs the best, and the runner-up methods are STHs-RBF, KSH and ITQ-CCA.
We show further results using different numbers of bits in Figure 6.4 for supervised methods and Figure 6.5 for unsupervised methods on the CIFAR10 and LabelMe datasets. For the SCENE15, USPS and ISOLET datasets, the results of several supervised and unsupervised methods are shown in Figure 6.6.
In the figures, TSH denotes our method using the BRE loss function. Our method still performs the best in most cases. Some retrieval examples are shown in Figures 6.2 and 6.3.
6.4.5 Training time
In Table 6.2, we compare the training time of different methods. It shows that our method is fast compared to the state-of-the-art. We also report the binary code learning time in the table. Notice that in the second step, learning hash functions by binary classification can easily be parallelised, which would make our method even more efficient.
6.4.6 Using different hash functions
We evaluate our method using different hash functions: SVM with the RBF kernel (TSH-RBF), linear SVM with kernel-transferred features (TSH-KF), linear SVM (TSH-LSVM), and AdaBoost with decision stumps (TSH-Stump, 2000 iterations). Results are shown in Figure 6.7. The testing time for the different hash functions is shown in Figure 6.8.
The kernel hash functions (TSH-RBF and TSH-KF) achieve the best performance in similarity search. However, linear hash functions are much faster to evaluate than kernel hash functions. We also find that the testing of TSH-KF is much faster than that of TSH-RBF; TSH-KF thus offers a trade-off between testing time and search performance.
6.4.7 Results on large datasets
We carry out experiments on 2 large scale datasets: Flickr 1 million image dataset
(Flickr1M) and 580, 000 Tiny image dataset (Tiny580k). Results are shown in Figure 6.9
and Figure 6.10 for the comparison with supervised methods and unsupervised methods
respectively. Our method TSH achieve on par results with KSH. KSH and our TSH
significantly outperform other supervised or unsupervised methods. Notice that there
is no semantic similarity ground truth provided on these two datasets. We generate the
similarity ground truth using the Euclidean distance. Some unsupervised methods are
also able to perform well in this setting (e.g., MDSH, SPHER and ITQ).
6.5 Conclusion
We have shown that it is possible to place a wide variety of learning-based hashing methods into a unified framework. The key insight is that the code generation and hash function learning processes can be seen as separate steps, and that the latter can accurately be formulated as a classification problem. This insight enables the development of new hashing approaches with efficient and simple learning. Experimental testing has validated this approach and shown that it outperforms the state-of-the-art.
Chapter 7
Fast Supervised Hashing with
Decision Trees for
High-Dimensional Data
In this chapter, we propose a hashing method [45] for efficient and effective learning
on large-scale and high-dimensional data, which is an extension of our general two-step
hashing method described in Chapter 6.
Supervised hashing aims to map the original features to compact binary codes that
are able to preserve label based similarity in the Hamming space. Non-linear hash
functions have demonstrated their advantage over linear ones due to their powerful
generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing; they achieve encouraging retrieval performance at the price of slow evaluation and training. Here we propose to use boosted decision trees to achieve non-linearity in hashing; they are fast to train and evaluate, and hence more suitable for hashing with high-dimensional data. In our approach, we first propose sub-modular formulations for the binary code inference problem and an efficient GraphCut-based block search method for solving large-scale inference. Then we learn
hash functions by training boosted decision trees to fit the binary codes. Experiments
demonstrate that our proposed method significantly outperforms most state-of-the-art
methods in retrieval precision and training time. Especially for high-dimensional data,
our method is orders of magnitude faster than many methods in terms of training time.
7.1 Introduction
Hashing methods aim to preserve some notion of similarity (or distance) in the Ham-
ming space. These methods can be roughly categorized as supervised and unsupervised.
Unsupervised hashing methods try to preserve the similarity which is calculated in the
original feature space. For example, random projection based Locality-Sensitive Hash-
ing (LSH) [32] generates random linear hash functions to approximate cosine similarity;
Spectral Hashing [35] learns eigenfunctions that preserve Gaussian affinity; Iterative
Quantization (ITQ) [36] approximates the Euclidean distance in the Hamming space;
and Hashing on manifolds [64] takes the intrinsic manifold structure into consideration.
Supervised hashing is designed to preserve label-based similarity [28, 31, 43, 44]. This might take place, for example, where images from the same category are defined as being semantically similar to each other. Supervised hashing has received increasing attention recently; examples include Supervised Hashing with Kernels (KSH) [31], Two-Step Hashing (TSH) [44] and Binary Reconstructive Embeddings (BRE) [28]. Although supervised hashing is more flexible and appealing for real-world applications, its learning is usually much slower than that of unsupervised hashing. Despite the fact that hashing is only of practical interest where it can be applied to large numbers of high-dimensional features, most supervised hashing approaches are demonstrated only on relatively small numbers of low-dimensional features. For example, codebook-based features have achieved remarkable success in image classification [132, 133], and their feature dimensionality often reaches tens of thousands. To exploit this recent advance in feature learning, it is very desirable for supervised hashing to be able to deal with large-scale data efficiently on such sophisticated high-dimensional features. To bridge this gap, we propose a supervised hashing method which is able to leverage large training sets and efficiently handle high-dimensional features.
Non-linear hash functions, e.g., the kernel hash functions employed in KSH and TSH, have shown much improved performance over linear hash functions. However, kernel functions can be extremely expensive for both training and testing on high-dimensional features. Thus a scalable supervised hashing method with non-linear hash functions is also desirable.
Compared to kernel methods, decision trees involve only simple comparison operations for evaluation; thus they are much more efficient, especially on high-dimensional data. Moreover, decision trees are able to work on quantized data without significant performance loss, and hence consume very little memory during training. Although decision trees could be a good choice of hash function, it remains unclear how to learn decision trees for supervised hashing. Here we propose an efficient method for learning decision-tree hash functions.
Our main contributions are as follows.
1. We propose to use (ensembles of) decision trees as hash functions for supervised hashing. They can easily deal with very large amounts of training data of high dimensionality (tens of thousands) and provide the desired non-linear mapping. To our knowledge, our method is the first general hashing method that uses decision trees as hash functions.

2. In order to learn decision trees for supervised hashing efficiently, we apply a two-step learning strategy which decomposes the learning into binary code inference and the simple binary classification training of decision trees. For binary code inference, we propose sub-modular formulations and an efficient GraphCut [134] based block search method for solving large-scale inference.

3. Our method significantly outperforms many state-of-the-art methods in terms of retrieval precision. For high-dimensional data, our method is usually orders of magnitude faster in terms of training time.
The two-step learning strategy employed in our method is inspired by our general two-step hashing method, TSH [44], described in Chapter 6. Other work [118, 121, 135] also learns hash functions by training classifiers. The spectral method used in TSH for binary code inference does not scale well to large training data, and it may also lead to inferior results due to the loose relaxation of spectral methods. Moreover, TSH only demonstrates satisfactory performance with kernel hash functions on small-scale training data with low dimensionality, which is clearly not practical for large-scale learning on high-dimensional features. In contrast with TSH, we explore efficient decision trees as hash functions and propose an efficient GraphCut-based method for binary code inference. Experiments show that our method significantly outperforms TSH.
7.2 The proposed method
Let X = {x1, ..., xn} ⊂ R^d denote a set of training points. Label-based similarity information is described by an affinity matrix Y, which is the ground truth for supervised learning. The element y_{ij} of Y indicates the similarity of two data points x_i and x_j, and y_{ij} = y_{ji}. Specifically, y_{ij} = 1 if two data points are similar, y_{ij} = −1 if they are dissimilar (irrelevant), and y_{ij} = 0 if the pairwise relation is undefined. We aim to learn a set of
Algorithm 7: An example for constructing blocks
Input: training data points x1, ..., xn; affinity matrix Y.
Output: blocks B1, B2, ....
1: V ← {x1, ..., xn}; t ← 0
2: repeat
3:   t ← t + 1; Bt ← ∅; xi: randomly selected from V
4:   initialize U as the joint of V and the similar examples of xi
5:   for each xj in U do
6:     if xj is not dissimilar to any example in Bt then
7:       add xj to Bt; remove xj from V
8: until V = ∅

Algorithm 8: Step 1: Block GraphCut for binary code inference
Input: affinity matrix Y; bit length r; max inference iterations; blocks B1, B2, ...; binary codes z1, ..., z_{r−1}.
Output: binary codes of one bit: zr.
1: repeat
2:   randomly permute all blocks
3:   for each Bi do
4:     solve the inference in (7.15a) on Bi using GraphCut
5: until max iteration is reached

Algorithm 9: FastHash
Input: training data points x1, ..., xn; affinity matrix Y; bit length m; blocks B1, B2, ....
Output: hash functions Φ = [h1, ..., hm].
1: for r = 1, ..., m do
2:   Step 1: call Algorithm 8 to obtain the binary codes of the r-th bit
3:   Step 2: train trees in (7.22) to obtain the hash function hr
4:   update the binary codes of the r-th bit with the output of hr
hash functions to preserve the label-based similarity in the Hamming space. The m hash functions are denoted as:

\Phi(x) = [h_1(x), h_2(x), \ldots, h_m(x)]. \qquad (7.1)

The output of the hash functions is an m-bit binary code: Φ(x) ∈ {−1, 1}^m.
The Hamming distance between two binary codes is the number of bits taking different values:

d_{hm}(x_i, x_j) = \sum_{r=1}^{m} \big[ 1 - \delta(h_r(x_i), h_r(x_j)) \big], \qquad (7.2)
in which δ(·, ·) ∈ {0, 1} is an indicator function: it outputs 1 if its two inputs are equal, and 0 otherwise. Generally, the formulation of hashing learning is to encourage small Hamming distances for similar data pairs and large ones for dissimilar data pairs. Closely related to the Hamming distance, the Hamming affinity is calculated as the inner product of two binary codes:

s_{hm}(x_i, x_j) = \sum_{r=1}^{m} h_r(x_i) h_r(x_j). \qquad (7.3)
As shown in KSH [31], the Hamming affinity is in one-to-one correspondence with the
Hamming distance. Similar to KSH [31], we formulate hashing learning based on Ham-
ming affinity, which is to encourage positive affinity values for similar data pairs and
negative for dissimilar data pairs. The optimization is written as:
\min_{\Phi(\cdot)} \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \left[ m y_{ij} - \sum_{r=1}^{m} h_r(x_i) h_r(x_j) \right]^2. \qquad (7.4)
Note that KSH does not include the multiplication by |y_{ij}| in the objective. We use |y_{ij}| to prevent undefined pairwise relations from harming the hashing task: if the relation is undefined, |y_{ij}| = 0; otherwise, |y_{ij}| = 1. Intuitively, this optimization encourages the Hamming affinity value of a data pair to be close to the ground truth value. In contrast with KSH, which uses kernel functions, here we employ decision trees as hash functions.
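The one-to-one correspondence between Hamming affinity and Hamming distance noted above (s_hm = m − 2 d_hm for codes in {−1, 1}^m) is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 16
zi = rng.choice([-1, 1], size=m)     # two random m-bit codes
zj = rng.choice([-1, 1], size=m)

d = int(np.sum(zi != zj))            # Hamming distance, Eq. (7.2)
s = int(zi @ zj)                     # Hamming affinity, Eq. (7.3)

# Each agreeing bit contributes +1 to s, each disagreeing bit -1,
# hence s = (m - d) - d = m - 2d: the one-to-one correspondence
# exploited by KSH and by our formulation.
assert s == m - 2 * d
```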
We define each hash function as a linear combination of decision trees, that is,

h(x) = \mathrm{sign}\left[ \sum_{q=1}^{Q} w_q T_q(x) \right]. \qquad (7.5)
Here Q is the number of decision trees and T(·) ∈ {−1, 1} denotes a tree function with binary output. The weights

w = [w_1, w_2, \ldots, w_Q] \qquad (7.6)

and trees

T = [T_1, T_2, \ldots, T_Q] \qquad (7.7)

are the parameters we need to learn for one hash function. Compared to kernel methods, decision trees enjoy faster testing on high-dimensional data as well as non-linear fitting ability.
Optimizing (7.4) directly for learning decision trees is difficult, and the technique used
in KSH is no longer applicable. Inspired by TSH [44], we introduce auxiliary variables
z_{r,i} ∈ {-1, 1} as the output of the r-th hash function on x_i:

z_{r,i} = h_r(x_i).  (7.8)

Clearly, z_{r,i} is the binary code of the i-th data point in the r-th bit. With these auxiliary
variables, the problem (7.4) can be decomposed into two sub-problems:
\min_{Z \in \{-1,1\}^{m \times n}} \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big[ m\, y_{ij} - \sum_{r=1}^{m} z_{r,i} z_{r,j} \Big]^2;  (7.9)
and
\min_{\Phi(\cdot)} \sum_{r=1}^{m} \sum_{i=1}^{n} \big[ 1 - \delta(z_{r,i}, h_r(x_i)) \big].  (7.10)
Here Z is the matrix of m-bit binary codes for all training data points; δ(·, ·) is the
indicator function described in (7.2). Note that (7.9) is a binary code inference
problem, and (7.10) is a simple binary classification problem. In this way, the
complicated decision tree learning for supervised hashing (7.4) now becomes two relatively
simpler tasks: solving (7.9) (Step 1) and (7.10) (Step 2).
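The alternation between the two sub-problems can be sketched as follows. This is a schematic outline only: the callables `solve_binary_codes` (Step 1, problem (7.9)) and `fit_classifier` (Step 2, problem (7.10)) are placeholder names, not the thesis implementation.

```python
# Schematic two-step decomposition: Step 1 infers binary codes per bit,
# Step 2 fits one binary classifier (hash function) per bit.
import numpy as np

def two_step_hashing(X, Y, m, solve_binary_codes, fit_classifier):
    n = X.shape[0]
    Z = np.empty((m, n))              # auxiliary variables z_{r,i}
    hash_funcs = []
    for r in range(m):
        # Step 1: infer the r-th bit, conditioning on previous bits.
        Z[r] = solve_binary_codes(Y, Z[:r])
        # Step 2: fit a binary classifier h_r predicting Z[r] from X.
        h_r = fit_classifier(X, Z[r])
        hash_funcs.append(h_r)
        # FastHash feedback: replace the inferred codes by h_r's output.
        Z[r] = np.sign(h_r(X))
    return hash_funcs, Z
```

The final line of the loop implements the feedback described later in Section 7.2.2: each learned hash function rewrites the codes of its bit before the next bit is inferred.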
7.2.1 Step 1: Binary code inference
For (7.9), we sequentially optimize for one bit at a time, conditioning on previous bits.
When solving for the r-th bit, the cost in (7.9) can be written as:
\sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r} z_{p,i} z_{p,j} \Big)^2  (7.11a)

= \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} - z_{r,i} z_{r,j} \Big)^2  (7.11b)

= \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big[ \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} \Big)^2 + (z_{r,i} z_{r,j})^2 - 2 \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} \Big) z_{r,i} z_{r,j} \Big]  (7.11c)

= \sum_{i=1}^{n} \sum_{j=1}^{n} -2 |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} \Big) z_{r,i} z_{r,j} + \mathrm{const}.  (7.11d)
With the above equations, the optimization for the r-th bit can be formulated as a
binary quadratic problem:
\min_{z_r \in \{-1,1\}^n} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, z_{r,i} z_{r,j},  (7.12a)

where  a_{ij} = -|y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big).  (7.12b)
Here z_r is the vector of binary variables of the r-th bit, which we aim to optimize;
z^*_{p,i} denotes the binary code of the i-th data point in a previous bit p; a_{ij} is a
constant which can be calculated from the binary codes of previous bits.
We use a stage-wise scheme when solving for each bit. Specifically, when solving for the
r-th bit, the bit length is set to r instead of m, as shown in (7.12b) of the above
optimization. This way, the optimization of the current bit depends on the loss incurred
by previous bits, which usually leads to better inference results.
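The coefficients a_{ij} of (7.12b) are cheap to compute from the codes of the previous bits. A hedged NumPy sketch (the function name `quadratic_coeffs` and the toy inputs are ours, not the thesis's):

```python
# Hedged sketch: coefficients a_ij in (7.12b) from the codes of previous
# bits. Y holds y_ij in {-1, 0, +1}; Z_prev is (r-1) x n with entries +-1.
import numpy as np

def quadratic_coeffs(Y, Z_prev):
    r = Z_prev.shape[0] + 1               # stage-wise bit length: r, not m
    residual = r * Y - Z_prev.T @ Z_prev  # r*y_ij - sum_p z_pi * z_pj
    return -np.abs(Y) * residual          # a_ij; zero for undefined pairs

Y = np.array([[1.0, 1.0], [1.0, 1.0]])
Z_prev = np.ones((1, 2))              # one previous bit, both codes +1
A = quadratic_coeffs(Y, Z_prev)       # residual = 2*1 - 1 = 1, so a_ij = -1
```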
Alternatively, one can apply the spectral relaxation method to solve (7.12a), as in TSH.
However, solving eigenvalue problems does not scale to large training sets, and the
spectral relaxation is rather loose (hence leading to inferior results). Here we propose
sub-modular formulations for the hashing binary code inference problem and an efficient
GraphCut-based block search method for solving large-scale inference.
Specifically, we first group data points into a number of blocks, then iteratively optimize
over these blocks until convergence. In each iteration, we randomly pick one block, then
optimize over (update) the corresponding variables of this block, conditioning on the rest
of the variables. In other words, when optimizing over one block, only those variables
which correspond to the data points of the target block are updated; the values of the
variables which are not involved in the target block remain unchanged. Clearly, each
block update never increases the objective.
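The monotonicity of such conditional updates can be illustrated with single-variable blocks (the ICM special case discussed below). This is a toy illustration with random symmetric coefficients, not the thesis's Block-GC; the a_{ij} of (7.12b) would take the place of the random matrix:

```python
# Toy illustration: single-variable conditional updates (ICM-style) on the
# objective sum_ij a_ij z_i z_j never increase it.
import numpy as np

def objective(A, z):
    return float(z @ A @ z)

def icm_pass(A, z):
    for i in range(len(z)):
        # Contribution of z_i (A symmetric): 2 * z_i * sum_{j != i} a_ij z_j.
        field = A[i] @ z - A[i, i] * z[i]
        z[i] = -1.0 if field > 0 else 1.0  # sign minimizing the contribution
    return z

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                          # symmetric coefficient matrix
z = rng.choice([-1.0, 1.0], size=6)
before = objective(A, z)
after = objective(A, icm_pass(A, z))       # after <= before always holds
```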
Formally, let B denote a block of data points, and suppose we want to optimize over the
corresponding binary variables of the block B. We denote by \bar{z}_{r,i} the binary code
(in the r-th bit) of a data point i that is not involved in the target block. First we
rewrite the objective in (7.12a) to separate the variables of the target block from the
other variables. The objective in (7.12a)
can be rewritten as:
\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} z_{r,i} z_{r,j} = \sum_{i \in B} \sum_{j \in B} a_{ij} z_{r,i} z_{r,j} + \sum_{i \in B} \sum_{j \notin B} a_{ij} z_{r,i} \bar{z}_{r,j}  (7.13a)

+ \sum_{i \notin B} \sum_{j \in B} a_{ij} \bar{z}_{r,i} z_{r,j} + \sum_{i \notin B} \sum_{j \notin B} a_{ij} \bar{z}_{r,i} \bar{z}_{r,j}  (7.13b)

= \sum_{i \in B} \sum_{j \in B} a_{ij} z_{r,i} z_{r,j} + 2 \sum_{i \in B} \sum_{j \notin B} a_{ij} z_{r,i} \bar{z}_{r,j} + \sum_{i \notin B} \sum_{j \notin B} a_{ij} \bar{z}_{r,i} \bar{z}_{r,j}.  (7.13c)
When optimizing over one block, the variables which are not involved in the target block
are treated as constants; hence \bar{z}_r is treated as a constant. Removing the constant
part, the optimization for one block can be written as:
\min_{z_{r,B} \in \{-1,1\}^{|B|}} \sum_{i \in B} \sum_{j \in B} a_{ij} z_{r,i} z_{r,j} + 2 \sum_{i \in B} \sum_{j \notin B} a_{ij} z_{r,i} \bar{z}_{r,j}.  (7.14)
Here z_{r,B} is the vector of variables which are involved in the target block B, which
we aim to optimize. Substituting the constant a_{ij} by its definition in (7.12b), the
above optimization is written as:
\min_{z_{r,B} \in \{-1,1\}^{|B|}} \sum_{i \in B} u_i z_{r,i} + \sum_{i \in B} \sum_{j \in B} v_{ij} z_{r,i} z_{r,j},  (7.15a)

where  v_{ij} = -|y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big),  (7.15b)

u_i = -2 \sum_{j \notin B} \bar{z}_{r,j} |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big).  (7.15c)
Here u_i and v_{ij} are constants.
The key to constructing a block is to ensure that (7.15a) restricted to the block is
sub-modular, so that efficient GraphCut can be applied. We refer to this as Block
GraphCut (Block-GC), shown in Algorithm 8. Specifically, in our hashing problem, by
leveraging the similarity information we can easily construct blocks which meet the
sub-modularity requirement, as shown in the following proposition:
Proposition 7.1. ∀i, j ∈ B, if y_{ij} ≥ 0, the optimization in (7.15a) is a sub-modular
problem. In other words, if no data point in the block is dissimilar to any other data
point in the block, then (7.15a) is sub-modular.
Chapter 7 Fast Supervised Hashing with Decision Trees for High-Dimensional Data 135
Proof. If y_{ij} ≥ 0, the following holds (if y_{ij} = 0, then v_{ij} = 0 directly):

r\, y_{ij} \geq \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j}.  (7.16)

Thus we have:

v_{ij} = -|y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big) \leq 0.  (7.17)
With the following definition:

\theta_{ij}(z_{r,i}, z_{r,j}) = v_{ij} z_{r,i} z_{r,j},  (7.18)

the following holds:

\theta_{ij}(-1, 1) = \theta_{ij}(1, -1) = -v_{ij} \geq 0;  (7.19)
\theta_{ij}(1, 1) = \theta_{ij}(-1, -1) = v_{ij} \leq 0.  (7.20)

Hence we have the following relations:

\forall i, j \in B:\quad \theta_{ij}(1, 1) + \theta_{ij}(-1, -1) \leq 0 \leq \theta_{ij}(1, -1) + \theta_{ij}(-1, 1),  (7.21)

which proves the sub-modularity of (7.15a) [136].
Blocks can be constructed in many ways as long as they satisfy the condition in Proposition
7.1. A simple greedy method is shown in Algorithm 7. Note that the blocks can
overlap, and their union needs to cover all n variables. If every block contains exactly
one variable, Block-GC reduces to ICM [137, 138], which optimizes one variable at a time.
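A greedy construction in the spirit of Algorithm 7 can be sketched as follows. This is our own simplified, non-overlapping version, not a transcription of Algorithm 7: a point joins the current block only if it is not dissimilar (y_{ij} = -1) to any point already in it, so the condition of Proposition 7.1 holds within every block.

```python
# Hedged sketch of greedy block construction: every pair inside a block
# satisfies y_ij >= 0, so (7.15a) restricted to the block is sub-modular.
import numpy as np

def greedy_blocks(Y):
    n = Y.shape[0]
    unassigned = set(range(n))
    blocks = []
    while unassigned:
        seed = unassigned.pop()
        block = [seed]
        for i in list(unassigned):
            # Admit i only if it is similar/undefined w.r.t. the whole block.
            if all(Y[i, j] >= 0 for j in block):
                block.append(i)
                unassigned.remove(i)
        blocks.append(block)
    return blocks

Y = np.array([[ 1,  1, -1],
              [ 1,  1, -1],
              [-1, -1,  1]])
blocks = greedy_blocks(Y)  # point 2 cannot share a block with 0 or 1
```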
7.2.2 Step 2: Learning boosted trees as hash functions
For binary classification in (7.10), usually the zero-one loss is replaced by some convex
surrogate loss. Here we use the exponential loss which is common for boosting methods.
The classification problem for learning the r-th hash function is written as:
\min_{w \geq 0} \sum_{i=1}^{n} \exp \Big[ -z_{r,i} \sum_{q=1}^{Q} w_q T_q(x_i) \Big].  (7.22)
We apply AdaBoost to solve the above problem. In each boosting iteration, a decision tree
as well as its weighting coefficient is learned. Every node of a binary decision tree
is a decision stump. Training a stump amounts to finding a feature dimension and threshold
that minimize the weighted classification error. In this sense, we perform feature
selection and hash function learning at the same time. We can easily make use of efficient
decision tree learning techniques available in the literature, which are able to significantly
speed up the training. Here we summarize some techniques that are included in our
implementation:
1. We use the highly efficient stump implementation proposed in the recent work of
[139], which is around 10 times faster than a conventional stump implementation.

2. Feature quantization can significantly speed up tree training without performance
loss in practice, and also largely reduces memory consumption. As in [139], we
linearly quantize feature values into 256 bins.

3. We apply the weight-trimming technique described in [139, 140]. In each boosting
iteration, the smallest 10% of the weights are trimmed (set to 0).

4. We apply the LazyBoost technique, which significantly speeds up the tree learning
process, especially on high-dimensional data: for each node split in tree training,
only a random subset of feature dimensions is evaluated.
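The boosting step can be sketched in a few lines. This is a deliberately simplified version of (7.22): discrete AdaBoost with depth-1 stumps instead of depth-4 trees, and without the quantization, trimming and LazyBoost speedups listed above; the function names are ours.

```python
# Hedged sketch of Step 2: AdaBoost with decision stumps as weak learners,
# minimizing the exponential loss in (7.22); the final hash function takes
# the sign of the weighted tree votes, as in (7.5).
import numpy as np

def fit_stump(X, z, w):
    """Exhaustively find the (feature, threshold, sign) minimizing weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sgn in (1, -1):
                pred = np.where(X[:, f] > thr, sgn, -sgn)
                err = w[pred != z].sum()
                if err < best[0]:
                    best = (err, f, thr, sgn)
    return best

def boost_hash_function(X, z, Q=10):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(Q):
        err, f, thr, sgn = fit_stump(X, z, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, f] > thr, sgn, -sgn)
        w *= np.exp(-alpha * z * pred)     # exponential-loss reweighting
        w /= w.sum()
        stumps.append((f, thr, sgn))
        alphas.append(alpha)
    def h(Xq):                             # the learned hash function
        agg = sum(a * np.where(Xq[:, f] > t, s, -s)
                  for a, (f, t, s) in zip(alphas, stumps))
        return np.sign(agg)
    return h

X = np.array([[0.0], [1.0], [2.0], [3.0]])
z = np.array([-1.0, -1.0, 1.0, 1.0])       # target bit from Step 1
h = boost_hash_function(X, z, Q=3)
```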
Finally, we summarize our hashing method (FastHash) in Algorithm 9. In contrast with
TSH, we alternate Step 1 and Step 2 iteratively. For each bit, the binary code is updated
by applying the learned hash function. Hence, the learned hash function feeds back into
the binary code inference of the next bit, which may lead to better performance.
7.3 Experiments
We here describe the results of comprehensive experiments carried out on several large
image datasets in order to evaluate the performance of the proposed method in terms of
training time, binary encoding time and retrieval performance. We compare to a number
of recent supervised and unsupervised hashing methods. For decision tree learning in
our FastHash, if not specified, the tree depth is set to 4, and the number of boosting
iterations is 200.
The retrieval performance is measured in three ways: the precision of the top-K (K = 100)
retrieved examples (denoted Precision), mean average precision (MAP), and the area
under the precision-recall curve (Prec-Recall).
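These metrics can be computed directly from the relevance flags of a ranked retrieval list. A hedged single-query sketch (the function names are ours; MAP is the mean of the per-query average precision below):

```python
# Hedged sketch of the evaluation metrics: precision of the top-K retrieved
# items, and average precision over a single query's ranked result list.
import numpy as np

def precision_at_k(relevant, k=100):
    rel = np.asarray(relevant[:k], dtype=float)
    return rel.sum() / min(k, len(rel))

def average_precision(relevant):
    rel = np.asarray(relevant, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    prec = hits / np.arange(1, len(rel) + 1)  # precision at each rank
    return float((prec * rel).sum() / rel.sum())

# Relevance of a ranked list: 1 = shares the query's label.
ranked = [1, 1, 0, 1, 0]
p = precision_at_k(ranked, k=3)   # 2/3
ap = average_precision(ranked)    # (1/1 + 2/2 + 3/4) / 3
```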
Figure 7.1: Some retrieval examples of our method FastHash on CIFAR10. The first column shows query images, and the rest are retrieved images in the database.
Figure 7.2: Some retrieval examples of our method FastHash on ESPGAME. The first column shows query images, and the rest are retrieved images in the database. False predictions are marked by red boxes.
Table 7.1: Comparison of KSH and our FastHash. KSH results are reported with different numbers of support vectors. Both our FastHash and FastHash-Full outperform KSH by a large margin in terms of training time, binary encoding time (Test time) and retrieval precision.
Method       #Train   #Support Vectors   Train time (s)   Test time (s)   Precision

CIFAR10 (features: 11200)
KSH          5000     300     1082     22    0.480
KSH          5000     1000    3481     57    0.553
KSH          5000     3000    52747    145   0.590
FastH        5000     N/A     331      21    0.634
FastH-Full   50000    N/A     1794     21    0.763

IAPRTC12 (features: 11200)
KSH          5000     300     1129     7     0.199
KSH          5000     1000    3447     21    0.235
KSH          5000     3000    51927    51    0.273
FastH        5000     N/A     331      9     0.285
FastH-Full   17665    N/A     620      9     0.371

ESPGAME (features: 11200)
KSH          5000     300     1120     8     0.124
KSH          5000     1000    3358     22    0.139
KSH          5000     3000    52115    46    0.163
FastH        5000     N/A     309      9     0.188
FastH-Full   18689    N/A     663      9     0.261

MIRFLICKR (features: 11200)
KSH          5000     300     1036     5     0.387
KSH          5000     1000    3337     13    0.407
KSH          5000     3000    52031    42    0.434
FastH        5000     N/A     278      7     0.555
FastH-Full   12500    N/A     509      7     0.595
Results are reported on 5 image datasets which cover a wide variety of images. The
dataset CIFAR10 1 contains 60000 images. The datasets IAPRTC12 and ESPGAME
[141] contain around 20000 images each, and MIRFLICKR [142] is a collection of 25000
images. SUN397 [3] is a large image dataset which contains more than 100000 scene
images from 397 categories.
For the multi-class datasets CIFAR10 and SUN397, the ground truth pairwise similarity
is defined as multi-class label agreement. For the datasets IAPRTC12, ESPGAME and
MIRFLICKR, for which keyword (tag) annotations are provided in [141], two images
are treated as semantically similar if they are annotated with at least 2 identical keywords
(or tags).
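This ground-truth construction is straightforward to express in code. A hedged sketch (the function name and the sample tags are ours; we set the diagonal to similar by convention):

```python
# Hedged sketch: pairwise labels for multi-label data. Two images are
# similar (y_ij = 1) if they share at least `min_shared` keywords,
# dissimilar (y_ij = -1) otherwise; each image is similar to itself.
def pairwise_labels(tag_sets, min_shared=2):
    n = len(tag_sets)
    Y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = len(tag_sets[i] & tag_sets[j])
            Y[i][j] = 1 if i == j or shared >= min_shared else -1
    return Y

tags = [{"sky", "beach", "sea"}, {"sea", "beach"}, {"car"}]
Y = pairwise_labels(tags)  # images 0 and 1 share 2 tags, so they are similar
```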
Following the conventional setting in [28, 31], a large portion of each dataset is allocated
as an image database for training and retrieval, and the rest is put aside for test queries.
Specifically, for CIFAR10, IAPRTC12, ESPGAME and MIRFLICKR, the provided
splits are used; for SUN397, 8000 images are randomly selected as test queries, while
the remaining 100417 images form the training set. If not specified, 64-bit binary codes
are generated by all compared methods for evaluation.
1http://www.cs.toronto.edu/~kriz/cifar.html
Table 7.2: Comparison of TSH and our FastHash for the binary code inference in Step 1. The proposed Block GraphCut (Block-GC) achieves a much lower objective value and also takes less inference time than the spectral method, and thus performs much better.
Step-1 method        #Train   Block size   Time (s)   Objective

SUN397
Spectral (TSH)       100417   N/A    5281   0.7524
Block-GC-1 (FastH)   100417   1      298    0.6341
Block-GC (FastH)     100417   253    2239   0.5608

CIFAR10
Spectral (TSH)       50000    N/A    1363   0.4912
Block-GC-1 (FastH)   50000    1      158    0.5338
Block-GC (FastH)     50000    5000   788    0.4158

IAPRTC12
Spectral (TSH)       17665    N/A    426    0.7237
Block-GC-1 (FastH)   17665    1      43     0.7316
Block-GC (FastH)     17665    316    70     0.7095

ESPGAME
Spectral (TSH)       18689    N/A    480    0.7373
Block-GC-1 (FastH)   18689    1      45     0.7527
Block-GC (FastH)     18689    336    72     0.7231

MIRFLICKR
Spectral (TSH)       12500    N/A    125    0.5718
Block-GC-1 (FastH)   12500    1      28     0.5851
Block-GC (FastH)     12500    295    40     0.5449
Given the remarkable success with which they have been applied elsewhere, we extract
codebook-based features following the conventional pipeline of [132, 133]: we employ
K-SVD for codebook (dictionary) learning with a codebook size of 800, soft-thresholding
for patch encoding and spatial pooling over 3 levels, which results in 11200-dimensional
features. We also tested increasing the codebook size to 1600, which results in
22400-dimensional features.
7.3.1 Comparison with KSH
KSH [31] has been shown to outperform many state-of-the-art comparators. The fact
that our method employs the same loss function as KSH thus motivates further comparison
against this key method. KSH employs a simple kernel technique: it predefines
a set of support vectors and then learns linear weightings for each hash function. In the
works of [31, 44], KSH is evaluated only on low-dimensional GIST features (512 dimensions)
using a small number of support vectors (300). Here, in contrast, we evaluate
KSH on high-dimensional codebook features, and vary the number of support vectors
from 300 to 3000. For our method, the tree depth is set to 4, and the number of boosting
iterations is set to 200. KSH is trained on a sampled set of 5000 examples. The results
of these tests are summarized in Table 7.1, which shows that increasing the number of
Figure 7.3: Comparison of KSH and our FastHash on all datasets (panels: CIFAR10, IAPRTC12, ESPGAME, MIRFLICKR). The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last two rows. The number after "KSH" is the number of support vectors. Both our FastHash and FastHash-Full outperform KSH by a large margin.
support vectors consistently improves the retrieval performance of KSH. However, even
on this small training set, including more support vectors dramatically increases the
training and binary encoding time of KSH. We have run our FastHash both on the
same sampled training set and on the whole training set (labeled as FastHash-Full) in order
to show that our method can be efficiently trained on the whole dataset. Our FastHash
and FastHash-Full outperform KSH by a large margin both in terms of training speed
and retrieval precision. The results also show that the decision tree hash functions
in FastHash are much more efficient for testing (binary encoding) than the kernel
functions in KSH. Our FastHash is orders of magnitude faster than KSH in training,
and thus much better suited to large training sets and high-dimensional data. For the
low-dimensional GIST features, our FastHash also performs much better than KSH in
retrieval; see Table 7.5 for details. If not specified, the number of support vectors for
KSH is set to 3000.
For the comparison with KSH, we also show the precision-recall curves and the precision
curves of the top-K retrieved examples in Figure 7.3. The number after "KSH" is the number
of support vectors. Both FastHash and FastHash-Full outperform KSH by a large
margin.
Some retrieval examples of our method are shown in Figure 7.1 for the dataset CIFAR10
and Figure 7.2 for the dataset ESPGAME. The codebook features are used here.
7.3.2 Comparison with TSH
TSH [44] is the general two-step learning method which we proposed in Chapter 6. The
proposed FastHash employs a similar two-step approach to that of TSH. We first compare
binary code inference in Step 1: the proposed Block GraphCut (Block-GC) versus the
spectral method in TSH. For all experiments in this paper, the number of iterations of
Block-GC is set to 2. The results are summarized in Table 7.2. We construct
blocks using Algorithm 7; the average block size is reported in the table. We also
evaluate a special case where the block size is set to 1 for Block-GC (labeled as Block-GC-1),
in which case Block-GC reduces to the ICM [137, 138] method. The results show that
as the training set gets larger, the spectral method becomes slow. The objective
value shown in the table is divided by the number of defined pairwise relations. The
proposed Block-GC achieves much lower objective values and takes less inference time,
and hence outperforms the spectral method. The inference time of Block-GC increases
only linearly with the training set size. The results also show that the special case
Block-GC-1 is highly efficient and able to achieve comparably low objective values.
Figure 7.4: Comparison of various combinations of hash functions and binary inference methods (panels: CIFAR10, ESPGAME, IAPRTC12, MIRFLICKR; mAP and precision of 100-NN versus number of bits). Note that the proposed FastHash uses decision trees as hash functions. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2.
Table 7.3: Comparison of combinations of hash functions and binary inference methods. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2.
Step-2          Step-1   Precision   MAP     Prec-Recall

CIFAR10
Linear-SVM      TSH      0.676   0.621   0.436
Linear-SVM      FastH    0.669   0.621   0.435
Tree (FastH)    TSH      0.745   0.726   0.567
Tree (FastH)    FastH    0.763   0.775   0.605

IAPRTC12
Linear-SVM      TSH      0.297   0.213   0.155
Linear-SVM      FastH    0.327   0.238   0.186
Tree (FastH)    TSH      0.328   0.245   0.185
Tree (FastH)    FastH    0.371   0.276   0.210

ESPGAME
Linear-SVM      TSH      0.194   0.137   0.096
Linear-SVM      FastH    0.227   0.157   0.109
Tree (FastH)    TSH      0.220   0.161   0.109
Tree (FastH)    FastH    0.261   0.189   0.126

MIRFLICKR
Linear-SVM      TSH      0.522   0.478   0.331
Linear-SVM      FastH    0.536   0.498   0.344
Tree (FastH)    TSH      0.559   0.526   0.391
Tree (FastH)    FastH    0.595   0.558   0.420
We now provide results comparing different combinations of hash functions (Step 2)
and binary code inference methods (Step 1). We evaluate the linear SVM and the
proposed decision tree hash functions with different binary code inference methods (the
spectral method in TSH and Block-GC in FastHash). The 11200-dimensional codebook
features are used here. The retrieval performance is summarized in Table 7.3; we also
plot the retrieval performance in Figure 7.4 by varying the number of bits. As expected,
the proposed decision tree hash function performs much better than the linear SVM hash
function. The results also show that our FastHash performs much better than TSH when
using the same type of hash function for Step 2 (decision tree or linear SVM), which
indicates that the proposed Block-GC method for binary code inference and the
stage-wise learning strategy are able to generate high-quality binary codes.
We can also train RBF-kernel SVMs as hash functions in Step 2. However, as is the case
here, when applied to large training sets and high-dimensional data, training an RBF
SVM becomes almost intractable. A stochastic method with a support vector budget
(BSGD) for efficient kernel SVM training was recently proposed in [6]. Even with
BSGD, the training and testing costs are still very expensive. We run TSH using BSGD
[6] (TSH-BRBF) and linear SVM (TSH-LSVM) as hash functions, and compare to
Table 7.4: Comparison of TSH and our FastHash. Results of TSH with the linear SVM and the budgeted RBF kernel [6] hash functions (TSH-BRBF) in Step 2 are presented. Our FastHash outperforms TSH by a large margin both in training speed and retrieval performance.
Method      Train time (s)   Test time (s)   Precision   MAP     Prec-Recall

CIFAR10
TSH-BRBF    98961    8994   0.683   0.629   0.448
TSH-LSVM    14567    9      0.676   0.621   0.436
FastHash    1794     21     0.763   0.775   0.605

IAPRTC12
TSH-BRBF    45739    3129   0.276   0.194   0.144
TSH-LSVM    6926     3      0.297   0.213   0.155
FastHash    620      9      0.371   0.276   0.210

ESPGAME
TSH-BRBF    51669    1914   0.167   0.114   0.085
TSH-LSVM    7062     3      0.194   0.137   0.096
FastHash    663      9      0.261   0.189   0.126

MIRFLICKR
TSH-BRBF    21183    1339   0.513   0.455   0.324
TSH-LSVM    7755     2      0.522   0.478   0.331
FastHash    509      7      0.595   0.558   0.420
our FastHash with boosted trees. The results are shown in Table 7.4. The number of
support vectors is set to 100 as the budget. Even with this small number of support
vectors, TSH-BRBF is already very slow in testing and training. Compared to
kernel SVMs, for high-dimensional data, our FastHash with decision trees is much more
efficient both in training and testing (binary encoding). Our FastHash also achieves
the best retrieval performance. It is worth noting that here, for each hash function, a
set of support vectors is learned from the data, which differs from KSH [31], where
predefined support vectors are shared by all hash functions.
7.3.3 Experiments on different features
We compare hashing methods on the low-dimensional (320 or 512) GIST features
and the high-dimensional (11200) codebook features. We extract GIST features of 320
dimensions for CIFAR10, which contains low-resolution images, and 512 dimensions for
the other datasets. Several state-of-the-art supervised methods are included in this
comparison: KSH [31], Supervised Self-Taught Hashing (STHs) [33] and Semi-Supervised
Hashing (SPLH) [17]. The results are presented in Table 7.5. The codebook features
consistently yield better results than the GIST features. The compared methods are trained on a
sampled training set (5000 examples). The results show that the compared methods can be
efficiently trained on the GIST features. However, when applied to high-dimensional
features, even on a small training set (5000), their training time increases dramatically.
Figure 7.5: Results on high-dimensional codebook features (panels: CIFAR10, ESPGAME, IAPRTC12, MIRFLICKR). The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last two rows. Both our FastHash and FastHash-Full outperform their comparators by a large margin.
Table 7.5: Results using two types of features: low-dimensional GIST features and high-dimensional codebook features. Our FastHash and FastHash-Full outperform the comparators by a large margin on both feature types. In terms of training time, our FastHash is also much faster than the others on the high-dimensional codebook features.
Method      #Train  | GIST feature (320/512 dims): Train time (s), Test time (s), Precision, MAP, Prec-Recall | Codebook feature (11200 dims): Train time (s), Test time (s), Precision, MAP, Prec-Recall

CIFAR10
KSH         5000    | 52173   8    0.453  0.350  0.164 | 52747   145  0.590  0.464  0.261
BREs        5000    | 481     1    0.262  0.198  0.082 | 18343   8    0.292  0.216  0.089
SPLH        5000    | 102     1    0.368  0.291  0.138 | 9858    4    0.496  0.396  0.219
STHs        5000    | 380     1    0.197  0.151  0.051 | 6878    4    0.246  0.175  0.058
FastH       5000    | 304     21   0.517  0.462  0.243 | 331     21   0.634  0.575  0.358
FastH-Full  50000   | 1681    21   0.649  0.653  0.450 | 1794    21   0.763  0.775  0.605

IAPRTC12
KSH         5000    | 51864   5    0.182  0.126  0.083 | 51927   51   0.273  0.169  0.123
BREs        5000    | 6052    1    0.138  0.109  0.074 | 6779    3    0.163  0.124  0.097
SPLH        5000    | 154     1    0.160  0.124  0.084 | 10261   2    0.220  0.157  0.119
STHs        5000    | 628     1    0.099  0.092  0.062 | 10108   2    0.160  0.114  0.076
FastH       5000    | 286     9    0.232  0.168  0.117 | 331     9    0.285  0.202  0.146
FastH-Full  17665   | 590     9    0.316  0.240  0.178 | 620     9    0.371  0.276  0.210

ESPGAME
KSH         5000    | 52061   5    0.118  0.077  0.054 | 52115   46   0.163  0.100  0.072
BREs        5000    | 714     1    0.095  0.070  0.050 | 16628   3    0.111  0.076  0.059
SPLH        5000    | 185     1    0.116  0.083  0.062 | 11740   2    0.148  0.104  0.074
STHs        5000    | 616     1    0.061  0.047  0.033 | 11045   2    0.087  0.064  0.042
FastH       5000    | 289     9    0.157  0.106  0.070 | 309     9    0.188  0.125  0.081
FastH-Full  18689   | 448     9    0.228  0.169  0.109 | 663     9    0.261  0.189  0.126

MIRFLICKR
KSH         5000    | 51983   3    0.379  0.321  0.234 | 52031   42   0.434  0.350  0.254
BREs        5000    | 1161    1    0.347  0.310  0.224 | 13671   2    0.399  0.345  0.250
SPLH        5000    | 166     1    0.379  0.337  0.241 | 9824    2    0.444  0.391  0.277
STHs        5000    | 613     1    0.268  0.261  0.172 | 10254   2    0.281  0.272  0.174
FastH       5000    | 307     7    0.477  0.429  0.299 | 338     7    0.555  0.487  0.344
FastH-Full  12500   | 451     7    0.525  0.507  0.345 | 509     7    0.595  0.558  0.420
Large matrix multiplications and solving eigenvalue problems on large matrices may
account for the expensive computation in these compared methods. It would be very
difficult to train these methods on the whole training set. The training time of KSH
mainly depends on the number of support vectors (3000 is used here). We run our
FastHash on the same sampled training set (5000 examples) and on the whole training set
(labeled as FastHash-Full). The results show that FastHash can be efficiently trained on the
whole dataset. FastHash and FastHash-Full outperform the others by a large margin on
both GIST and codebook features. The training of FastHash is also orders of magnitude
faster than the others on the high-dimensional codebook features. The retrieval performance
on the codebook features is also plotted in Figure 7.5.
7.3.4 Comparison with dimension reduction
A possible way to reduce the training cost on high-dimensional data is to apply dimension
reduction. For the compared methods KSH, SPLH and STHs, we reduce the
original 11200-dimensional codebook features to 500 dimensions by applying PCA. We
also compare to CCA+ITQ [36], which combines ITQ with supervised dimension
reduction. Our FastHash still uses the original high-dimensional features. The results are
summarized in Table 7.6. After dimension reduction, most compared methods can be
trained on the whole training set within 24 hours (except KSH on CIFAR10). However,
Table 7.6: Results of methods with dimension reduction. KSH, SPLH and STHs are trained with PCA feature reduction. Our FastHash outperforms the others by a large margin on retrieval performance.
Method      #Train   Train time (s)   Test time (s)   Precision   MAP

CIFAR10
PCA+KSH     50000    -       -     -       -
PCA+SPLH    50000    25984   18    0.482   0.388
PCA+STHs    50000    7980    18    0.287   0.200
CCA+ITQ     50000    1055    7     0.676   0.642
FastH       50000    1794    21    0.763   0.775

IAPRTC12
PCA+KSH     17665    55031   11    0.082   0.103
PCA+SPLH    17665    1855    7     0.239   0.169
PCA+STHs    17665    2463    7     0.174   0.126
CCA+ITQ     17665    804     3     0.332   0.198
FastH       17665    620     9     0.371   0.276

ESPGAME
PCA+KSH     18689    55714   11    0.141   0.084
PCA+SPLH    18689    2409    7     0.153   0.103
PCA+STHs    18689    2777    7     0.098   0.069
CCA+ITQ     18689    814     3     0.216   0.131
FastH       18689    663     9     0.261   0.189

MIRFLICKR
PCA+KSH     12500    54260   8     0.384   0.313
PCA+SPLH    12500    1054    5     0.445   0.391
PCA+STHs    12500    1768    5     0.347   0.301
CCA+ITQ     12500    699     3     0.519   0.408
FastH       12500    509     7     0.595   0.558
they are still much slower than our FastHash. In terms of retrieval performance, the
results of SPLH and STHs improve with more training data, but our FastHash still
significantly outperforms all the others. The proposed decision tree hash functions in
FastHash perform feature selection and hash function learning at the same time, which
yields much better performance than the other hashing methods with dimension
reduction. The runner-up method is CCA+ITQ. Note that supervised feature reduction can
also be applied in our method.
7.3.5 Comparison with unsupervised methods
We compare to several popular unsupervised hashing methods: LSH [32], ITQ [36], Anchor
Graph Hashing (AGH) [38], Spherical Hashing (SPHER) [39] and MDSH [35]. The retrieval
performance is shown in Figure 7.6. Unsupervised methods perform poorly at preserving
label-based similarity, and our FastHash outperforms them by a large margin.
Table 7.7: Performance of our FastHash with more features (22400 dimensions) and more
bits (1024 bits). FastHash can be efficiently trained on high-dimensional features with a
large bit length; its training and binary coding time (Test time) increases only linearly
with the bit length.

Bits   # Train   Features   Train time   Test time   Precision    MAP
CIFAR10
64       50000     11200       1794          21         0.763     0.775
256      50000     22400       5588          71         0.794     0.814
1024     50000     22400      22687         282         0.803     0.826
IAPRTC12
64       17665     11200        320           9         0.371     0.276
256      17665     22400       1987          33         0.439     0.314
1024     17665     22400       7432         134         0.483     0.338
ESPGAME
64       18689     11200        663           9         0.261     0.189
256      18689     22400       1912          34         0.329     0.233
1024     18689     22400       7689         139         0.373     0.257
MIRFLICKR
64       12500     11200        509           7         0.595     0.558
256      12500     22400       1560          28         0.612     0.567
1024     12500     22400       6418         105         0.628     0.576
Table 7.8: Results on the large image dataset SUN397 using 11200-dimensional codebook
features. Our FastHash can be efficiently trained to a large bit length (1024 bits) on this
large training set and outperforms the other methods by a large margin on retrieval
performance.

Method     # Train   Bits   Train time   Test time   Precision    MAP
SUN397
KSH          10000     64      57045        463        0.034     0.023
BREs         10000     64     105240         23        0.019     0.013
SPLH         10000     64      27552         14        0.022     0.015
STHs         10000     64      22914         14        0.010     0.008
ITQ         100417   1024       1686        127        0.030     0.021
SPHER       100417   1024      35954        121        0.039     0.024
LSH              −   1024          −         99        0.028     0.019
CCA+ITQ     100417    512       7484         66        0.113     0.076
CCA+ITQ     100417   1024      15580        127        0.120     0.081
FastH       100417    512      29624        302        0.149     0.142
FastH       100417   1024      62076        536        0.165     0.163
7.3.6 More features and more bits
To further evaluate the training efficiency of our method, we increase the codebook
size to 1600 to generate higher-dimensional features (22400 dimensions) and run up
to 1024 bits. The results are shown in Table 7.7: our FastHash can be efficiently
trained on high-dimensional features with a large bit length, and its training and
binary coding time (Test time) increases only linearly with the bit length. The
retrieval performance improves as the bit length increases.
Figure 7.6: Retrieval precision curves (precision vs. number of retrieved samples, 64 bits)
of the unsupervised methods LSH, AGH, MDSH, SPHER and ITQ against our FastHash on
CIFAR10, ESPGAME, IAPRTC12 and MIRFLICKR. Unsupervised methods perform poorly
at preserving label-based similarity; our FastHash outperforms them by a large margin.
7.3.7 Large dataset: SUN397
The challenging SUN397 dataset is a collection of more than 100000 scene images from
397 categories. We extract 11200-dimensional codebook features on this dataset and
compare with a number of supervised and unsupervised methods; the depth of the
decision trees is set to 6. The results are presented in Table 7.8. The supervised
methods KSH, BREs, SPLH and STHs are trained to 64 bits on a subset of 10K examples.
However, even on this sampled training set, and run only to 64 bits, the training of
these methods is already impractically slow; it would be almost intractable on the
whole training set with a long bit length, and short codes are not able to achieve good
performance on this challenging dataset. In contrast, our method can be efficiently
trained to a large bit length (1024 bits) on the whole training set (more than 100000
training examples), and our FastHash outperforms the other methods by a large margin
on retrieval performance. The runner-up is the supervised method CCA+ITQ [36]. The
unsupervised methods ITQ, SPHER and LSH are also efficient, but they perform poorly
at preserving label-based similarity. The retrieval performance at 1024 bits is also
plotted in Figure 7.7.
Figure 7.7: The precision curve of the top 2000 retrieved examples on the large image
dataset SUN397 using 1024 bits, comparing LSH, SPHER, ITQ, CCA+ITQ and our FastHash.
Here we compare with the methods which can be efficiently trained up to 1024 bits on the
whole training set. Our FastHash outperforms the others by a large margin.
For memory usage, many of the competing methods require a large amount of memory for
large matrix multiplications. In contrast, the decision tree learning in our method only
involves simple comparison operations on quantized feature data (256 bins); thus FastHash
consumes less than 7GB of memory for training, which shows that our method can be easily
applied to large-scale training.
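The quantization step can be sketched as follows; equal-width binning per dimension is an assumption for illustration (the text only states that features are quantized into 256 bins):

```python
import numpy as np

def quantize_uint8(X):
    """Quantize each feature dimension into 256 equal-width bins, one byte per
    value. Equal-width binning is an assumed scheme for this sketch."""
    lo = X.min(axis=0, keepdims=True)
    hi = X.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, 255.0 / (hi - lo), 0.0)
    return np.clip((X - lo) * scale, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 11200))   # float64 features: 8 bytes per value
Q = quantize_uint8(X)                    # 1 byte per value: 8x smaller
```

Tree split search then only compares byte values, so both the memory footprint and the cost of evaluating candidate splits drop substantially.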
7.4 Conclusion
We have proposed an efficient supervised hashing method for large-scale and high-
dimensional data. Our method follows a two-step hashing learning scheme: we develop
an efficient GraphCut based block search method for solving the large-scale inference,
and learn decision tree based hash functions by fitting the binary codes. Our method
can be efficiently trained on large-scale and high-dimensional datasets, and achieves
high retrieval precision. Our comprehensive experiments show the advantages of our
method in retrieval precision and in training and testing speed, indicating its
practical significance in many applications such as large-scale image retrieval.
Chapter 8
Conclusion
This thesis has made practical and theoretical contributions to structured learning and
binary code learning. We have presented our novel learning methods and explored a num-
ber of applications in computer vision, including image classification, image retrieval,
image segmentation, visual tracking and so on.
In the first part of this thesis, we have explored column generation based techniques
for multi-class classification and the more general structured output prediction task.
In Chapter 3, we proposed to learn a separate set of weak learners for each class;
this class-specific weak learner learning achieves much faster convergence than
traditional multi-class boosting methods. To solve the optimization, we developed an
efficient coordinate descent method with a closed-form solution for each update
iteration. The proposed method has been empirically shown to offer fast training and
competitive testing accuracy.
Apart from multi-class classification, in Chapter 4, we have developed a boosting based
approach (StructBoost) for general structured output prediction tasks, as an alternative
to conventional structured learning methods like SSVM [7] and CRF [88]. The pro-
posed boosting based method is able to perform efficient nonlinear structured output
prediction by learning and combining a set of weak structured learners. To solve the
resulting optimization problems, we introduced an efficient cutting-plane method and
combined it with the column generation learning framework. In a wide range of
applications, including multi-class classification, hierarchical multi-class
classification by optimizing the tree loss, visual tracking by optimizing the Pascal
overlap criterion, and learning CRF parameters for image segmentation, we have shown
that StructBoost achieves competitive performance compared to conventional approaches
and significantly outperforms linear SSVM, demonstrating the usefulness of our
nonlinear structured learning method.
In the second part of this thesis, we have introduced three hashing methods which focus
on different aspects of the binary code learning problem. The first method, proposed
in Chapter 5, exploits triplet-based relative ranking relations for hash function
learning. It is based on a large-margin learning framework and incorporates triplet
ranking constraints in the optimization problem; the resulting optimization is solved
by column generation techniques.
The hashing methods presented in Chapters 6 and 7 aim to learn hash functions by
preserving pairwise similarity. Specifically, in Chapter 6 we introduced a general
two-step learning approach, in which the learning task is divided into binary code
inference steps and binary classification steps. We have shown that any Hamming
distance (or Hamming affinity) based loss function can be readily applied in this
general learning framework, and thus we place a wide variety of hashing methods in a
unified framework. As an extension of this general two-step framework, we proposed an
efficient hashing method in Chapter 7, which aims at hash function learning from
large-scale and high-dimensional data. We developed an efficient GraphCut based block
search method for solving non-submodular problems, which can be easily applied to
large-scale binary inference. Moreover, we proposed to learn decision trees as
efficient non-linear hash functions, which are more suitable for high-dimensional data
than kernel hash functions. Comprehensive experiments have shown the advantages of our
method in retrieval precision and in training and testing speed, indicating its
practical significance in many applications such as large-scale image retrieval.
8.1 Future work
Both structured output prediction and binary code learning have been applied in a
variety of computer vision applications. As part of future work, we aim to explore
more computer vision applications of the proposed learning methods. For structured
learning, our boosting approach can be applied to learning the interactions between
the elements of complex outputs, which are involved in many vision tasks such as pose
recognition, action recognition, event detection, context modeling and scene
understanding. The proposed binary code learning approaches can find potential
applications in large-scale nearest neighbor retrieval and in learning efficient
binary code representations; example tasks include nearest-neighbor based
image/object/event retrieval, image classification, object detection and feature
matching.
Another direction is to incorporate advanced feature learning techniques into our
structured learning and binary code learning frameworks to develop end-to-end learning
methods. Such end-to-end learning is able to update both the low-level feature
extraction model and the high-level prediction model according to task-specific
targets. A popular family of feature learning methods is deep neural networks (DNNs),
which are being actively explored in both the research community and industry. It has
been shown that DNNs are able to learn hierarchical (low-level to high-level) visual
concepts, and they have achieved impressive performance in many vision applications,
including large-scale image classification [143], object detection [144, 145] and
segmentation [146]. It would be interesting to investigate how DNNs can be applied
in our structured learning and binary code learning frameworks to gain further
performance improvement.
Appendix A
Appendix for MultiBoostcw
A.1 Dual problem of MultiBoostcw
Here we describe how to derive the dual problem of MultiBoostcw. The proposed method
MultiBoostcw with the exponential loss is written as (A.1):
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \sum_{y \ne y_i} \exp\bigl(-\rho_{(i,y)}\bigr) && \text{(A.1a)}\\
\text{s.t.} \quad & \forall i = 1,\dots,n \ \text{and}\ \forall y \in \{1,\dots,K\}\setminus\{y_i\}:\\
& \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) - \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i) = \rho_{(i,y)}. && \text{(A.1b)}
\end{aligned}
\]
The Lagrangian of (A.1) can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \sum_{y \ne y_i} \exp\bigl(-\rho_{(i,y)}\bigr) - \boldsymbol{\alpha}^{\top}\mathbf{w}
+ \sum_i \sum_{y \ne y_i} \mu_{(i,y)}\bigl[\rho_{(i,y)} - \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) + \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i)\bigr] \tag{A.2}
\]
with α ≥ 0, in which µ,α are Lagrangian multipliers. At optimum, the first derivative
of the Lagrangian w.r.t the primal variables must vanish:
\[
\frac{\partial L}{\partial \rho_{(i,y)}} = 0 \;\Longrightarrow\; -\frac{C}{p}\exp\bigl(-\rho_{(i,y)}\bigr) + \mu_{(i,y)} = 0 \;\Longrightarrow\; \rho_{(i,y)} = -\log\Bigl(\frac{p}{C}\,\mu_{(i,y)}\Bigr)
\]
\[
\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}_c} = 0 \;\Longrightarrow\;& \mathbf{1} - \sum_{i:\,y_i = c}\ \sum_{y \ne y_i} \mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) + \sum_i \sum_{y \ne y_i,\, y = c} \mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) - \boldsymbol{\alpha}_c = \mathbf{0}\\
\Longrightarrow\;& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i} \mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c} \mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) \le \mathbf{1}
\end{aligned}
\]
The dual problem of (A.1) can be written as (A.3), in which c is the index of class labels.
µ(i,y) is the dual variable associated with one constraint in (A.1b):
\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \ne y_i} \mu_{(i,y)}\Bigl[1 - \log\frac{p}{C} - \log\mu_{(i,y)}\Bigr] && \text{(A.3a)}\\
\text{s.t.} \quad & \forall \phi(\cdot) \in \mathcal{C} \ \text{and}\ \forall c = 1,\dots,K:\\
& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i} \mu_{(i,y)}\phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c} \mu_{(i,y)}\phi_{y}(\mathbf{x}_i) \le 1 && \text{(A.3b)}\\
& \forall i = 1,\dots,n:\quad 0 \le \sum_{y \ne y_i} \mu_{(i,y)} \le \frac{C}{p} && \text{(A.3c)}
\end{aligned}
\]
A.2 MultiBoostcw with the hinge loss
MultiBoostcw is a flexible framework that is able to use different loss functions.
Using another smooth loss function (e.g., the squared hinge loss or the logistic loss)
in MultiBoostcw is similar to the case of the exponential loss and can be derived
straightforwardly. Here we discuss a non-smooth case as a different example (denoted
MultiBoostcw-hinge): using the hinge loss with slack-variable sharing in the
constraints (one slack variable ξi is associated with the (K − 1) constraints
corresponding to the example xi). MultiBoostcw-hinge can be formulated as (A.4), in
which p is the number of slack variables ξ (p = m):
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \xi_i && \text{(A.4a)}\\
\text{s.t.} \quad & \forall i = 1,\dots,n \ \text{and}\ \forall y \in \{1,2,\dots,K\}\setminus\{y_i\}:\\
& \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) - \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i) \ge 1 - \xi_i. && \text{(A.4b)}
\end{aligned}
\]
The Lagrangian of (A.4) can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \xi_i - \boldsymbol{\alpha}^{\top}\mathbf{w} - \boldsymbol{\beta}^{\top}\boldsymbol{\xi}
+ \sum_i \sum_{y \ne y_i} \mu_{(i,y)}\bigl[1 - \xi_i - \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) + \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i)\bigr] \tag{A.5}
\]
with µ ≥ 0,α ≥ 0,β ≥ 0, in which µ,α,β are Lagrangian multipliers. At optimum,
the first derivative of the Lagrangian w.r.t the primal variables must vanish:
\[
\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; \frac{C}{p} - \sum_{y \ne y_i}\mu_{(i,y)} - \beta_i = 0 \;\Longrightarrow\; 0 \le \sum_{y \ne y_i}\mu_{(i,y)} \le \frac{C}{p}
\]
\[
\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}_c} = 0 \;\Longrightarrow\;& \mathbf{1} - \sum_{i:\,y_i = c}\ \sum_{y \ne y_i}\mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) + \sum_i \sum_{y \ne y_i,\, y = c}\mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) - \boldsymbol{\alpha}_c = \mathbf{0}\\
\Longrightarrow\;& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i}\mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c}\mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) \le \mathbf{1}
\end{aligned}
\]
Hence the dual problem of (A.4) can be written as (A.6), in which c indicates the c-th
class. µ(i,y) is the dual variable associated with the constraint (A.4b) for label y 6= yi
and training pair (xi, yi):
\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \ne y_i}\mu_{(i,y)} && \text{(A.6a)}\\
\text{s.t.} \quad & \forall \phi(\cdot) \in \mathcal{C} \ \text{and}\ \forall c = 1,\dots,K:\\
& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i}\mu_{(i,y)}\phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c}\mu_{(i,y)}\phi_{y}(\mathbf{x}_i) \le 1 && \text{(A.6b)}\\
& \forall i = 1,\dots,n:\quad 0 \le \sum_{y \ne y_i}\mu_{(i,y)} \le \frac{C}{p} && \text{(A.6c)}
\end{aligned}
\]
With the primal-dual pair of (A.4) and (A.6), we can develop a column generation
algorithm for the hinge loss similar to Algorithm 1 for the exponential loss. Notice
that the dual constraint (A.6b) is the same as (A.3b), thus finding new weak learners
proceeds exactly as in the exponential-loss case. There are only two differences:

1. A different solver is used: the optimization in (A.4) is a linear programming
problem (LP), which can be solved by MOSEK [147] or any other off-the-shelf LP solver.

2. The dual solution µ is obtained in a different way: the primal solution w and the
dual solution µ can be obtained at the same time from the MOSEK solver.
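As an illustration of the second point, an off-the-shelf LP solver returns the primal and dual solutions of a linear program in a single solve. The toy LP below is a stand-in for (A.4), not the actual formulation, and uses SciPy's `linprog` in place of MOSEK:

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: minimize w0 + w1 subject to w0 + 2*w1 >= 4 and w >= 0.
# linprog uses "<=" constraints, so the margin constraint is negated.
c = [1.0, 1.0]
A_ub = [[-1.0, -2.0]]
b_ub = [-4.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2, method="highs")

w = res.x                    # primal solution of the toy LP
mu = -res.ineqlin.marginals  # nonnegative dual of the ">=" constraint
```

Here `w` plays the role of the primal variables of (A.4) and `mu` the role of the dual variables µ(i,y) needed to generate the next weak learner.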
Appendix B
Appendix for StructBoost
B.1 Dual formulation of n-slack
The formulation of StructBoost is written as (n-slack primal):
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + \frac{C}{n}\mathbf{1}^{\top}\boldsymbol{\xi} && \text{(B.1a)}\\
\text{s.t.} \quad & \forall i = 1,\dots,n \ \text{and}\ \forall \mathbf{y} \in \mathcal{Y}:\\
& \mathbf{w}^{\top}\delta\Psi_i(\mathbf{y}) \ge \Delta(\mathbf{y}_i,\mathbf{y}) - \xi_i. && \text{(B.1b)}
\end{aligned}
\]
The Lagrangian of the n-slack primal problem can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + \frac{C}{n}\mathbf{1}^{\top}\boldsymbol{\xi}
- \sum_{i,\mathbf{y}} \mu_{(i,\mathbf{y})}\bigl[\mathbf{w}^{\top}\delta\Psi_i(\mathbf{y}) - \Delta(\mathbf{y}_i,\mathbf{y}) + \xi_i\bigr]
- \boldsymbol{\nu}^{\top}\mathbf{w} - \boldsymbol{\beta}^{\top}\boldsymbol{\xi}, \tag{B.2}
\]
where µ,ν,β are Lagrange multipliers: µ ≥ 0,ν ≥ 0,β ≥ 0. We denote by µ(i,y)
the Lagrange dual multiplier associated with the margin constraints (B.1b) for label y
and training pair (xi,yi). At optimum, the first derivative of the Lagrangian w.r.t. the
primal variables must vanish,
\[
\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; \frac{C}{n} - \sum_{\mathbf{y}}\mu_{(i,\mathbf{y})} - \beta_i = 0
\;\Longrightarrow\; 0 \le \sum_{\mathbf{y}}\mu_{(i,\mathbf{y})} \le \frac{C}{n};
\]
and,
\[
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Longrightarrow\; \mathbf{1} - \sum_{i,\mathbf{y}}\mu_{(i,\mathbf{y})}\,\delta\Psi_i(\mathbf{y}) - \boldsymbol{\nu} = \mathbf{0}
\;\Longrightarrow\; \sum_{i,\mathbf{y}}\mu_{(i,\mathbf{y})}\,\delta\Psi_i(\mathbf{y}) \le \mathbf{1}.
\]
Substituting these back into the Lagrangian (B.2), we obtain the dual problem of
the n-slack formulation (B.1):
\[
\begin{aligned}
\max_{\boldsymbol{\mu} \ge 0} \quad & \sum_{i,\mathbf{y}} \mu_{(i,\mathbf{y})}\,\Delta(\mathbf{y}_i,\mathbf{y}) && \text{(B.3a)}\\
\text{s.t.} \quad & \forall \psi \in \mathcal{C}: \ \sum_{i,\mathbf{y}} \mu_{(i,\mathbf{y})}\,\delta\psi_i(\mathbf{y}) \le 1, && \text{(B.3b)}\\
& \forall i = 1,\dots,n: \ 0 \le \sum_{\mathbf{y}} \mu_{(i,\mathbf{y})} \le \frac{C}{n}. && \text{(B.3c)}
\end{aligned}
\]
B.2 Dual formulation of 1-slack
The 1-slack formulation of StructBoost is written as:
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0,\ \xi \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + C\xi && \text{(B.4a)}\\
\text{s.t.} \quad & \forall \mathbf{c} \in \{0,1\}^n \ \text{and}\ \forall \mathbf{y} \in \mathcal{Y}:\\
& \frac{1}{n}\,\mathbf{w}^{\top}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] \ge \frac{1}{n}\sum_{i=1}^{n} c_i\,\Delta(\mathbf{y}_i,\mathbf{y}) - \xi. && \text{(B.4b)}
\end{aligned}
\]
The Lagrangian of the 1-slack primal problem can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + C\xi
- \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})}\Bigl\{\frac{1}{n}\,\mathbf{w}^{\top}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] - \frac{1}{n}\sum_{i=1}^{n} c_i\,\Delta(\mathbf{y}_i,\mathbf{y}) + \xi\Bigr\}
- \boldsymbol{\nu}^{\top}\mathbf{w} - \beta\xi, \tag{B.5}
\]
where λ, ν, β are Lagrange multipliers: λ ≥ 0, ν ≥ 0, β ≥ 0. We denote by λ(c,y) the
Lagrange multiplier associated with the inequality constraint for c ∈ {0, 1}^n and
label y. At optimum, the first derivative of the Lagrangian w.r.t. the primal
variables must vanish,
\[
\frac{\partial L}{\partial \xi} = 0 \;\Longrightarrow\; C - \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})} - \beta = 0
\;\Longrightarrow\; 0 \le \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})} \le C;
\]
and,
\[
\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Longrightarrow\;& \mathbf{1} - \frac{1}{n}\sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] = \boldsymbol{\nu}\\
\Longrightarrow\;& \frac{1}{n}\sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] \le \mathbf{1}. \qquad \text{(B.6)}
\end{aligned}
\]
The dual problem of (B.4) can be written as:
\[
\begin{aligned}
\max_{\boldsymbol{\lambda} \ge 0} \quad & \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})} \sum_{i=1}^{n} c_i\,\Delta(\mathbf{y}_i,\mathbf{y}) && \text{(B.7a)}\\
\text{s.t.} \quad & \forall \psi \in \mathcal{C}: \ \frac{1}{n}\sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})}\Bigl[\sum_{i=1}^{n} c_i\,\delta\psi_i(\mathbf{y})\Bigr] \le 1, && \text{(B.7b)}\\
& 0 \le \sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})} \le C. && \text{(B.7c)}
\end{aligned}
\]
Appendix C
Appendix for CGHash
C.1 Learning hashing functions with the hinge loss
Our method can be easily extended to non-smooth loss functions. Here we discuss
the hinge loss as an example of a non-smooth loss function.
C.1.1 Using ℓ1 norm regularization
We here discuss the case of using the hinge loss and `1 norm regularization for hashing
learning. When using the hinge loss, we define the following large-margin optimization
problem:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \mathbf{1}^{\top}\mathbf{w} + C \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.1a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}:\\
& d_{\mathrm{hm}}(\mathbf{x}_i,\mathbf{x}_k;\mathbf{w}) - d_{\mathrm{hm}}(\mathbf{x}_i,\mathbf{x}_j;\mathbf{w}) \ge 1 - \xi_{(i,j,k)}, && \text{(C.1b)}\\
& \mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0. && \text{(C.1c)}
\end{aligned}
\]
Here we have used the ℓ1 norm on w as the regularization term to control the complexity
of the learned model. With the definition of the weighted Hamming distance in (5.3)
and the notation in (5.6), the optimization problem in (C.1) can be rewritten as:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \mathbf{1}^{\top}\mathbf{w} + C \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.2a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}: \ \mathbf{w}^{\top}\delta\Phi(i,j,k) \ge 1 - \xi_{(i,j,k)}, && \text{(C.2b)}\\
& \mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0. && \text{(C.2c)}
\end{aligned}
\]
To apply the column generation technique for learning hash functions, we derive the dual
problem of the above optimization. The corresponding dual problem can be written as:
\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)} && \text{(C.3a)}\\
\text{s.t.} \quad & \forall h(\cdot) \in \mathcal{C}: \ \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\,\delta h(i,j,k) \le 1, && \text{(C.3b)}\\
& \forall (i,j,k) \in \mathcal{T}: \ 0 \le \mu_{(i,j,k)} \le C. && \text{(C.3c)}
\end{aligned}
\]
Here µ is one dual variable, which corresponds to one constraint in (C.2b). Similar to
the case of squared hinge loss, we learn a new hash function by finding the most violated
constraint of the dual problem. We solve the following subproblem to learn one hash
function:
\[
\begin{aligned}
h^{\star}(\cdot) &= \operatorname*{argmax}_{h(\cdot) \in \mathcal{C}} \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\,\delta h(i,j,k)\\
&= \operatorname*{argmax}_{h(\cdot) \in \mathcal{C}} \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\Bigl[\bigl|h(\mathbf{x}_i) - h(\mathbf{x}_k)\bigr| - \bigl|h(\mathbf{x}_i) - h(\mathbf{x}_j)\bigr|\Bigr]. \qquad \text{(C.4a)}
\end{aligned}
\]
Solving the above optimization is the same as for the squared hinge loss, which we
have discussed. In each column generation iteration, we need to obtain the primal and
dual solutions. Different from smooth convex loss functions (e.g., the squared hinge
loss), the primal optimization in (C.2) is a linear programming problem. Here we can
use MOSEK [147] to solve the primal optimization in (C.2) and obtain the primal
solution w and the dual solution µ.
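The subproblem (C.4a) scores every candidate hash function by the µ-weighted triplet margin and keeps the best. A minimal sketch with a pool of axis-aligned threshold functions; the stump pool and the random data are illustrative assumptions, not the actual hash function class:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 16))
T = [tuple(rng.integers(0, 100, size=3)) for _ in range(50)]   # triplets (i, j, k)
mu = rng.random(len(T))                                        # current dual values

def violation(h, mu, T):
    """sum over triplets of mu * (|h(x_i)-h(x_k)| - |h(x_i)-h(x_j)|)."""
    return sum(m * (abs(int(h[i]) - int(h[k])) - abs(int(h[i]) - int(h[j])))
               for m, (i, j, k) in zip(mu, T))

# Candidate pool: h(x) = 1[x_d > t] over a few dimensions and thresholds.
pool = [(d, t) for d in range(X.shape[1]) for t in (-0.5, 0.0, 0.5)]
scores = [violation((X[:, d] > t).astype(int), mu, T) for d, t in pool]
d_star, t_star = pool[int(np.argmax(scores))]   # most violated dual constraint
```

The selected function is the one whose dual constraint in (C.3b) is most violated, which is exactly the column generation rule discussed above.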
C.1.2 Using ℓ∞ norm regularization
We can also use the hinge loss with ℓ∞ norm regularization. The primal optimization
can be written as:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \|\mathbf{w}\|_{\infty} + C \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.5a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}: \ \mathbf{w}^{\top}\delta\Phi(i,j,k) \ge 1 - \xi_{(i,j,k)}, && \text{(C.5b)}\\
& \mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0. && \text{(C.5c)}
\end{aligned}
\]
The above optimization can be equivalently written as:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.6a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}: \ \mathbf{w}^{\top}\delta\Phi(i,j,k) \ge 1 - \xi_{(i,j,k)}, && \text{(C.6b)}\\
& 0 \le \boldsymbol{\xi},\quad 0 \le \mathbf{w} \le C'\mathbf{1}. && \text{(C.6c)}
\end{aligned}
\]
Here C ′ is a constant that controls the regularization trade-off. The dual problem can
be derived as:
\[
\begin{aligned}
\max_{\boldsymbol{\mu},\,\boldsymbol{\beta}} \quad & -C'\mathbf{1}^{\top}\boldsymbol{\beta} + \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)} && \text{(C.7a)}\\
\text{s.t.} \quad & \forall h(\cdot) \in \mathcal{C}: \ \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\,\delta h(i,j,k) \le \beta_h, && \text{(C.7b)}\\
& 0 \le \boldsymbol{\beta},\quad 0 \le \boldsymbol{\mu} \le \mathbf{1}. && \text{(C.7c)}
\end{aligned}
\]
Here µ, β are dual variables. From the dual problem, we can see that the rule for
generating one hash function is the same as that for the ℓ1 norm, namely solving the
subproblem in (C.4a). As in the ℓ1 norm case, we can apply MOSEK [147] to solve the
primal optimization in (C.5) and obtain the primal solution w and the dual solution µ.
Bibliography
[1] C. Shen and Z. Hao. A direct formulation for totally-corrective multi-class boost-
ing. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2011.
[2] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran
subroutines for large-scale bound-constrained optimization. ACM T. Math. Softw.,
1997.
[3] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba.
SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[4] S. Hare, A. Saffari, and P. Torr. Struck: Structured output tracking with kernels.
In Proceedings of the International Conference on Computer Vision, 2011.
[5] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization
with superpixel neighborhoods. In Proceedings of the International Conference on
Computer Vision, 2009.
[6] Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Breaking the curse of
kernelization: Budgeted stochastic gradient descent for large-scale SVM training.
The Journal of Machine Learning Research, 2012.
[7] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Al-
tun. Support vector machine learning for interdependent and structured output
spaces. In Proceedings of the International Conference on Machine Learning, 2004.
[8] Martin Szummer, Pushmeet Kohli, and Derek Hoiem. Learning CRFs using graph
cuts. In Proceedings of European Conference on Computer Vision, 2008.
[9] Sebastian Nowozin, Peter V. Gehler, and Christoph H. Lampert. On parameter
learning in CRF-based approaches to object class image segmentation. In Proceed-
ings of European Conference on Computer Vision, 2010.
[10] Matthew B. Blaschko and Christoph H. Lampert. Learning to localize objects with
structured output regression. In Proceedings of European Conference on Computer
Vision, 2008.
[11] Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and A. Van Den Hengel.
Part-based visual tracking with online latent structural learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[12] Lu Zhang and Laurens van der Maaten. Structure preserving object tracking. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2013.
[13] K. Tang, Li Fei-Fei, and D. Koller. Learning latent temporal structure for complex
event detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2012.
[14] Jiang Wang and Ying Wu. Learning maximum margin temporal warping for action
recognition. In Proceedings of the International Conference on Computer Vision,
2013.
[15] Sebastian Nowozin and Christoph H. Lampert. Structured learning and prediction
in computer vision. Foundations & Trends in Computer Graphics & Vision, 2011.
[16] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large data
set for nonparametric object and scene recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2008.
[17] J. Wang, S. Kumar, and S.F. Chang. Semi-supervised hashing for large scale
search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[18] Xavier Boix, Gemma Roig, Christian Leistner, and Luc Van Gool. Nested sparse
quantization for efficient feature coding. In Proceedings of European Conference
on Computer Vision, 2012.
[19] Thomas Dean, Mark A. Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijaya-
narasimhan, and Jay Yagnik. Fast, accurate detection of 100,000 object classes
on a single machine. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2013.
[20] Christoph Strecha, Alex Bronstein, Michael Bronstein, and Pascal Fua. LDAHash:
Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2012.
[21] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation
with parameter-sensitive hashing. In Proceedings of the International Conference
on Computer Vision, 2003.
[22] Herve Jegou, Matthijs Douze, Cordelia Schmid, and Patrick Perez. Aggregating
local descriptors into a compact image representation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2010.
[23] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for
nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2011.
[24] Tomasz Trzcinski, Mario Christoudias, Pascal Fua, and Vincent Lepetit. Boosting
binary keypoint descriptors. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2013.
[25] Wei Dong, Zhe Wang, William Josephson, Moses Charikar, and Kai Li. Modeling
LSH for performance tuning. In Proceedings of the ACM Conference on Information
and Knowledge Management, 2008.
[26] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe
LSH: Efficient indexing for high-dimensional similarity search. In Proceedings of the
International Conference on Very Large Data Bases, 2007.
[27] Ruslan Salakhutdinov and Geoffrey E. Hinton. Learning a nonlinear embedding
by preserving class neighbourhood structure. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2007.
[28] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings.
In Proceedings of Advances in Neural Information Processing Systems, 2009.
[29] M. Norouzi and D.J. Fleet. Minimal loss hashing for compact binary codes. In
Proceedings of the International Conference on Machine Learning, 2011.
[30] D. Zhang, J. Wang, D. Cai, and J. Lu. Extensions to self-taught hashing: kerneli-
sation and supervision. In Proc. ACM SIGIR Workshop on Feature Generation
and Selection for Information Retrieval, pages 19–26, 2010.
[31] W. Liu, J. Wang, R. Ji, Y.G. Jiang, and S.F. Chang. Supervised hashing with
kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2012.
[32] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high
dimensions via hashing. In Proceedings of the International Conference on Very
Large Data Bases, 1999.
[33] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity
search. In Proceedings of the annual international ACM SIGIR conference on
research and development in information retrieval, 2010.
[34] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Advances
in Neural Information Processing Systems, 2008.
[35] Yair Weiss, Rob Fergus, and Antonio Torralba. Multidimensional spectral hashing.
In Proceedings of European Conference on Computer Vision, 2012.
[36] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a
procrustean approach to learning binary codes for large-scale image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2012.
[37] Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving
quantization method for learning binary compact codes. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[38] W. Liu, J. Wang, S. Kumar, and S. F. Chang. Hashing with graphs. In Proceedings
of the International Conference on Machine Learning, 2011.
[39] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon.
Spherical hashing. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2012.
[40] Yunchao Gong, Sanjiv Kumar, Henry A. Rowley, and Svetlana Lazebnik. Learning
binary codes for high-dimensional data using bilinear projections. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[41] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and David Suter. Fast
training of effective multi-class boosting using coordinate descent optimization. In
Proceedings of the Asian Conference on Computer Vision, 2013.
[42] Chunhua Shen, Guosheng Lin, and Anton van den Hengel. StructBoost: Boosting
methods for predicting structured output variables. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2014.
[43] Xi Li, Guosheng Lin, Chunhua Shen, A. van den Hengel, and Anthony Dick. Learn-
ing hash functions using column generation. In Proceedings of the International
Conference on Machine Learning, 2013.
[44] Guosheng Lin, Chunhua Shen, David Suter, and Anton van den Hengel. A general
two-step approach to learning-based hashing. In Proceedings of the International
Conference on Computer Vision, 2013.
[45] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David.
Suter. Fast supervised hashing with decision trees for high-dimensional data. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2014.
[46] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-
cut/max-flow algorithms for energy minimization in vision. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2004.
[47] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence data. In
Proceedings of the International Conference on Machine Learning, 2001.
[48] Jun Zhu and Eric P Xing. Maximum entropy discrimination markov networks.
The Journal of Machine Learning Research, 2009.
[49] Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin markov net-
works. In Proceedings of Advances in Neural Information Processing Systems,
2003.
[50] Tong Zhang. Statistical behavior and consistency of classification methods based
on convex risk minimization. Annals of Statistics, 2004.
[51] David McAllester. Generalization bounds and consistency for structured labeling
in predicting structured data. Predicting Structured Data, 2007.
[52] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane train-
ing of structural SVMs. Mach. Learn., 2009.
[53] Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle
methods for regularized risk minimization. The Journal of Machine Learning
Research, 2010.
[54] Nathan D Ratliff, J Andrew Bagnell, and Martin Zinkevich. (approximate) sub-
gradient methods for structured prediction. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2007.
[55] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal esti-
mated sub-gradient solver for SVM. In Proceedings of the International Conference
on Machine Learning, 2007.
[56] Steve Branson, Oscar Beijbom, and Serge Belongie. Efficient large-scale structured
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2013.
[57] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Comput. Syst. Sci., 1997.
[58] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting
algorithms as gradient descent. In Proceedings of Advances in Neural Information
Processing Systems, 1999.
[59] Chunhua Shen and Hanxi Li. On the dual formulation of boosting algorithms.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[60] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting
via column generation. Mach. Learn., 2002.
[61] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. The Journal of Machine Learning Research, 2001.
[62] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In
Proceedings of the Annual ACM Symposium on Theory of Computing, 2002.
[63] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2012.
[64] Fumin Shen, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and Zhenmin
Tang. Inductive hashing on manifolds. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2013.
[65] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. In Proc. IEEE Symp. Foundations of Computer
Science, 2006.
[66] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-
sensitive hashing scheme based on p-stable distributions. In Proceedings of the
Annual Symposium on Computational Geometry, 2004.
[67] Brian Kulis, Prateek Jain, and Kristen Grauman. Fast similarity search for learned
metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
[68] Fan RK Chung. Spectral graph theory. American Mathematical Society, 1997.
[69] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-François Paiement, Pascal
Vincent, and Marie Ouimet. Learning eigenfunctions links spectral embedding
and kernel PCA. Neural Computation, 2004.
[70] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping
using the Nyström method. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2004.
[71] Miguel A. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality
reduction. In Proceedings of the International Conference on Machine Learning,
2010.
[72] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction
to Information Retrieval. Cambridge University Press, 2008.
[73] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduc-
tion and data representation. Neural Computation, 2003.
[74] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector
machines. ACM Trans. Intell. Syst. Technol., 2011.
[75] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee S. Lee. Boosting the
margin: A new explanation for the effectiveness of voting methods. Annals of
Statistics, 1998.
[76] Paul Viola and Michael J. Jones. Robust real-time face detection. International
Journal of Computer Vision, 2004.
[77] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems
via error-correcting output codes. J. Artif. Int. Res., 1995.
[78] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-
correcting codes. In Proc. Annual Conf. Computational Learning Theory, 1999.
[79] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using
confidence-rated predictions. Mach. Learn., 1999.
[80] Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison
of optimization methods and software for large-scale L1-regularized linear classification.
The Journal of Machine Learning Research, 2010.
[81] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine
Learning Research, 2008.
[82] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning
for image classification. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2010.
[83] R. Uetz and S. Behnke. Large-scale object recognition with CUDA-accelerated hi-
erarchical neural networks. In IEEE Int. Conf. Intelligent Computing & Intelligent
Systems, 2009.
[84] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[85] J. Weston and C. Watkins. Multi-class support vector machines. In Proc. Euro.
Symp. Artificial Neural Networks, 1999.
[86] I. Steinwart. Sparseness of support vector machines. The Journal of Machine
Learning Research, 2003.
[87] Thorsten Joachims. Training linear SVMs in linear time. In Proc. ACM SIGKDD
Int. Conf. Knowledge Discovery and Data Mining, 2006.
[88] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In Proceedings of the
International Conference on Machine Learning, 2001.
[89] Charles Sutton and Andrew McCallum. An introduction to conditional random
fields. Foundations and Trends in Machine Learning, 2012.
[90] N. Plath, M. Toussaint, and S. Nakajima. Multi-class image segmentation using
conditional random fields and global classification. In Proceedings of the Interna-
tional Conference on Machine Learning, 2009.
[91] Luca Bertelli, Tianli Yu, Diem Vu, and Burak Gokturk. Kernelized structural
SVM learning for supervised object segmentation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2011.
[92] Chaitanya Desai, Deva Ramanan, and Charless C. Fowlkes. Discriminative models
for multi-class object layout. International Journal of Computer Vision, 2011.
[93] Thomas G Dietterich, Adam Ashenfelter, and Yaroslav Bulatov. Training condi-
tional random fields via gradient tree boosting. In Proceedings of the International
Conference on Machine Learning, 2004.
[94] Daniel Munoz, James A Bagnell, Nicolas Vandapel, and Martial Hebert. Contextual
classification with functional max-margin Markov networks. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[95] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting
algorithms as gradient descent. In Proceedings of Advances in Neural Information
Processing Systems, 1999.
[96] Nathan Ratliff, David Bradley, J. Andrew Bagnell, and Joel Chestnutt. Boosting
structured prediction for imitation learning. In Proceedings of Advances in Neural
Information Processing Systems, 2007.
[97] Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search: Func-
tional gradient techniques for imitation learning. Autonomous Robots, 2009.
[98] C. Shen, H. Li, and A. van den Hengel. Fully corrective boosting with arbitrary
loss and regularization. Neural Networks, 2013.
[99] Charles Parker, Alan Fern, and Prasad Tadepalli. Gradient boosting for sequence
alignment. In Proc. National Conf. Artificial Intelligence, 2006.
[100] C. Parker. Structured gradient boosting. PhD thesis, Oregon State University,
2007. URL http://hdl.handle.net/1957/6490.
[101] Q. Wang, D. Lin, and D. Schuurmans. Simple training of dependency parsers via
structured boosting. In Proc. Int. Joint Conf. Artificial Intell., 2007.
[102] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector
machines. In Proceedings of the International Conference on Machine Learning,
2008.
[103] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods
for regularized risk minimization. The Journal of Machine Learning Research, 11:
311–365, 2010.
[104] T. Joachims. A support vector method for multivariate performance measures. In
Proceedings of the International Conference on Machine Learning, 2005.
[105] Lijuan Cai and Thomas Hofmann. Hierarchical document categorization with
support vector machines. In Proc. ACM Int. Conf. Information & knowledge
management, 2004.
[106] C. Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
[107] S. Paisitkriangkrai, C. Shen, Q. Shi, and A. van den Hengel. RandomBoost: Sim-
plified multi-class boosting through randomization. IEEE Trans. Neural Networks
and Learning Systems, 2013.
[108] Sebastian Nowozin, Carsten Rother, Shai Bagon, Toby Sharp, Bangpeng Yao,
and Pushmeet Kohli. Decision tree fields. In Proceedings of the International
Conference on Computer Vision, 2011.
[109] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boost-
ing algorithm for combining preferences. The Journal of Machine Learning Re-
search, 2003.
[110] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan.
Object detection with discriminatively trained part-based models. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 2010.
[111] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual tracking with online
multiple instance learning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2009.
[112] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking
using the integral histogram. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2006.
[113] Helmut Grabner, Michael Grabner, and Horst Bischof. Real-time tracking via
on-line boosting. In Proc. British Mach. Vis. Conf., 2006.
[114] Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[115] Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang. Superpixel tracking.
In Proceedings of the International Conference on Computer Vision, 2011.
[116] M. Marszałek and C. Schmid. Accurate object localization with shape masks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[117] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic
representation of the spatial envelope. International Journal of Computer Vision,
2001.
[118] Antonio Torralba, Robert Fergus, and Yair Weiss. Small codes and large image
databases for recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2008.
[119] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved
matching with smaller descriptors. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2011.
[120] S. Korman and S. Avidan. Coherency sensitive hashing. In Proceedings of the
International Conference on Computer Vision, 2011.
[121] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation
with parameter-sensitive hashing. In Proceedings of the International Conference
on Computer Vision, 2003.
[122] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons.
In Proceedings of Advances in Neural Information Processing Systems, 2004.
[123] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric
learning using boosting-like algorithms. J. Machine Learning Research, 2012.
[124] R. Salakhutdinov and G. Hinton. Semantic hashing. Int. J. Approximate Reason-
ing, 2009.
[125] S. Baluja and M. Covell. Learning to hash: forgiving hash functions and applica-
tions. Data Mining & Knowledge Discovery, 2008.
[126] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
[127] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The
PASCAL visual object classes challenge 2007 results, 2007.
[128] K.Q. Weinberger, J. Blitzer, and L.K. Saul. Distance metric learning for large
margin nearest neighbor classification. In Proceedings of Advances in Neural In-
formation Processing Systems, 2006.
[129] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. J. American Society for Information Science,
1990.
[130] D. Zhang, J. Wang, D. Cai, and J. Lu. Laplacian co-hashing of terms and docu-
ments. In Proc. Eur. Conf. Information Retrieval, 2010.
[131] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large data
set for nonparametric object and scene recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2008.
[132] Adam Coates and Andrew Ng. The importance of encoding versus training with
sparse coding and vector quantization. In Proceedings of the International Con-
ference on Machine Learning, 2011.
[133] Ryan Kiros and Csaba Szepesvári. Deep representations and codes for image auto-
annotation. In Proceedings of Advances in Neural Information Processing Systems,
2012.
[134] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy mini-
mization via graph cuts. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2001.
[135] Mohammad Rastegari, Ali Farhadi, and David Forsyth. Attribute discovery via
predictable discriminative binary codes. In Proceedings of European Conference
on Computer Vision, 2012.
[136] Carsten Rother, Vladimir Kolmogorov, Victor Lempitsky, and Martin Szummer.
Optimizing binary MRFs via extended roof duality. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2007.
[137] Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal
Statistical Society, 1986.
[138] Mark Schmidt. UGM: Matlab code for undirected graphical models. URL
http://www.di.ens.fr/~mschmidt/Software/UGM.html, 2012.
[139] Ron Appel, Thomas Fuchs, Piotr Dollar, and Pietro Perona. Quickly boosting
decision trees: Pruning underachieving features early. In Proceedings of the
International Conference on Machine Learning, 2013.
[140] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regres-
sion: a statistical view of boosting (with discussion and a rejoinder by the authors).
Annals of Statistics, 2000.
[141] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid.
Tagprop: Discriminative metric learning in nearest neighbor models for image
auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2009.
[142] Mark J Huiskes and Michael S Lew. The MIR-Flickr retrieval evaluation. In Proc.
ACM Int. Conf. Multimedia Info. Retrieval, 2008.
[143] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Proceedings of Advances in Neural
Information Processing Systems, 2012.
[144] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hi-
erarchies for accurate object detection and semantic segmentation. arXiv preprint
arXiv:1311.2524, 2013.
[145] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks
for object detection. In Proceedings of Advances in Neural Information Processing
Systems, 2013.
[146] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning
hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2013.
[147] The MOSEK optimization software, 2010. URL http://www.mosek.com/.