Structured Output Prediction and Binary Code
Learning in Computer Vision
by
Guosheng Lin
A thesis submitted in fulfillment of the requirements for the
degree of Doctor of Philosophy
in the
Faculty of Engineering, Computer and Mathematical Sciences
School of Computer Science
December 2014
Declaration
I certify that this work contains no material which has been accepted for the award of
any other degree or diploma in any university or other tertiary institution and, to the
best of my knowledge and belief, contains no material previously published or written
by another person, except where due reference has been made in the text. In addition,
I certify that no part of this work will, in the future, be used in a submission for any
other degree or diploma in any university or other tertiary institution without the prior
approval of the University of Adelaide and where applicable, any partner institution
responsible for the joint-award of this degree.
I give consent to this copy of my thesis when deposited in the University Library, being
made available for loan and photocopying, subject to the provisions of the Copyright
Act 1968.
I also give permission for the digital version of my thesis to be made available on the web,
via the University's digital research repository, the Library catalogue and also through
web search engines, unless permission has been granted by the University to restrict
access for a period of time.
Signed:
Date:
Acknowledgements
Above all, I would like to express my profound gratitude to my principal supervisor Prof.
Chunhua Shen for his continued guidance and strong support throughout my time as
his student. He taught me not only fundamental and advanced knowledge of machine
learning, but also scientific thinking and writing skills. He gave me insightful guidance
and creative suggestions on my research problems. He made himself available to me at
any time for discussing research problems. This thesis would not have been possible
without his guidance.
I would like to thank my co-supervisor Prof. David Suter. He encouraged me to work
on my research interests, helped me develop scientific thinking, inspired me to think
outside the box, and improved my presentation skills. I would also like to thank my
co-supervisor Dr. Tat-Jun Chin. The questions he raised in meetings helped me deliver
clear and convincing presentations of my ideas.
I would like to thank Prof. Anton van den Hengel for his great support and invaluable
collaboration. He gave me important suggestions on presentation, and offered great help with
paper writing and revision. I am also grateful to Dr. Qinfeng (Javen) Shi. His insightful
viewpoints and suggestions helped me improve my research methodologies and paper
writing.
Many thanks go to ACVT researchers (current or previous) for their kindness and
beneficial discussions: Dr. Xi Li, Dr. Sakrapee (Paul) Paisitkriangkrai, Dr. Peng Wang, Dr.
Lingqiao Liu and Dr. Jin Yu. I would also like to thank all of my lab mates, especially
Quoc-Huy Tran, Zhenhua Wang, Rui Yao, Fumin Shen and Zhen Zhang.
Finally, my special appreciation is dedicated to my parents, brother and sister for their
everlasting support, and to my girlfriend with whom I share my success and frustration.
Publications
This thesis is based on the content of the following peer-reviewed conference and journal
papers:
1. Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel and David
Suter; “Fast Supervised Hashing with Decision Trees for High-Dimensional Data”;
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2014.
2. Guosheng Lin, Chunhua Shen, Jianxin Wu;
“Optimizing Ranking Measures for Compact Binary Code Learning”; In Proceedings
of the European Conference on Computer Vision (ECCV), 2014.
3. Chunhua Shen, Guosheng Lin, Anton van den Hengel;
“StructBoost: Boosting Methods For Predicting Structured Output Variables”; IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
4. Guosheng Lin, Chunhua Shen, David Suter and Anton van den Hengel;
“A General Two-Step Approach to Learning-Based Hashing”; In Proceedings of the
International Conference on Computer Vision (ICCV), 2013.
5. Xi Li*, Guosheng Lin*, Chunhua Shen, Anton van den Hengel and Anthony Dick
(* indicates equal contribution);
“Learning Hash Functions Using Column Generation”; In Proceedings of the
International Conference on Machine Learning (ICML), 2013.
6. Guosheng Lin, Chunhua Shen and Anton van den Hengel;
“Approximate Constraint Generation for Efficient Structured Boosting”; In
Proceedings of the International Conference on Image Processing (ICIP), 2013.
7. Guosheng Lin, Chunhua Shen, Anton van den Hengel and David Suter;
“Fast Training of Effective Multi-class Boosting Using Coordinate Descent
Optimization”; In Proceedings of the Asian Conference on Computer Vision (ACCV), 2012.
THE UNIVERSITY OF ADELAIDE
Abstract
Faculty of Engineering, Computer and Mathematical Sciences
School of Computer Science
Doctor of Philosophy
by Guosheng Lin
Machine learning techniques play essential roles in many computer vision applications.
This thesis is dedicated to two types of machine learning techniques that are important
to computer vision: structured learning and binary code learning. Structured learning
predicts complex structured outputs whose components are inter-dependent. Structured
outputs are common in real-world applications; the image segmentation mask is one
example. Binary code learning learns hash functions that map data points to binary
codes. The binary code representation is popular for large-scale similarity search,
indexing and storage. This thesis makes practical and theoretical contributions to
both types of learning techniques.
The first part of this thesis focuses on boosting-based structured output prediction.
Boosting is a family of methods that learn a single accurate predictor by linearly
combining a set of less accurate weak learners. Addressing multi-class classification
as a special case of structured learning, we first propose an efficient boosting method
that can be applied to image classification. Unlike many existing multi-class boosting
methods, we train class-specific weak learners by learning a separate set of weak
learners for each class. We also develop a fast coordinate descent method for solving
the optimization problem, in which each coordinate update has a closed-form solution.
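To make the closed-form coordinate update idea concrete, here is a minimal illustrative sketch on a much simpler objective than the boosting problem above: cyclic coordinate descent for ridge regression, where each one-dimensional subproblem has an exact minimizer. This is not the solver developed in Chapter 3; the function name, objective and parameters are chosen purely for illustration.

```python
import numpy as np

# Illustrative only: cyclic coordinate descent with closed-form updates,
# shown on ridge regression rather than the boosting objective of Chapter 3.
# Minimize f(w) = 0.5*||X w - y||^2 + 0.5*lam*||w||^2; fixing all but one
# coordinate leaves a one-dimensional quadratic with an exact minimizer.
def coordinate_descent_ridge(X, y, lam=1.0, n_sweeps=200):
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w            # maintained incrementally below
    col_sq = (X ** 2).sum(axis=0)   # precomputed squared column norms
    for _ in range(n_sweeps):
        for j in range(d):
            # Exact minimizer over w_j with the other coordinates fixed.
            rho = X[:, j] @ residual + col_sq[j] * w[j]
            w_new = rho / (col_sq[j] + lam)
            residual += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w
```

Because the residual is maintained incrementally, each coordinate step costs only O(n); the appeal of closed-form coordinate updates is that no line search or generic solver is needed.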
For general structured output prediction, we propose a new boosting-based method,
which we refer to as StructBoost. StructBoost supports nonlinear structured learning
by combining a set of weak structured learners, and generalizes standard boosting
approaches such as AdaBoost and LPBoost to structured learning. The resulting
optimization problem is challenging in that it may involve exponentially many
variables and constraints. We develop cutting-plane and column generation based
algorithms to solve the optimization efficiently. We show the versatility and usefulness
of StructBoost on a range of problems, such as optimizing the tree loss for hierarchical
multi-class classification, optimizing the Pascal overlap criterion for robust visual
tracking, and learning conditional random field parameters for image segmentation.
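The cutting-plane idea can be sketched on a toy problem: repeatedly solve a restricted master problem over a small working set of constraints, query a separation oracle for the most violated constraint, and add it to the working set. The sketch below is illustrative only; it minimizes the maximum of a large family of affine functions, which merely stands in for the 1-slack structured objective, and all names are invented for the example.

```python
import numpy as np

# Illustrative cutting-plane loop (not the StructBoost solver): minimize
# g(x) = max_i (a[i]*x + b[i]) over x in [-B, B], where the family of
# affine constraints is too large to enumerate in the master problem.
def solve_restricted(a_w, b_w, B):
    # Exact minimizer of the restricted problem over the working set: a
    # convex piecewise-linear function attains its minimum either at a
    # breakpoint (intersection of two lines) or at the interval boundary.
    candidates = [-B, B]
    for i in range(len(a_w)):
        for j in range(i + 1, len(a_w)):
            if a_w[i] != a_w[j]:
                candidates.append((b_w[j] - b_w[i]) / (a_w[i] - a_w[j]))
    xs = np.clip(np.array(candidates), -B, B)
    vals = np.max(np.outer(xs, a_w) + b_w, axis=1)
    k = int(np.argmin(vals))
    return xs[k], vals[k]

def cutting_plane_minimax(a, b, B=10.0, tol=1e-9, max_iter=300):
    working = [int(np.argmax(b))]        # start from a single constraint
    for _ in range(max_iter):
        x, t = solve_restricted(a[working], b[working], B)
        # Separation oracle: most violated constraint at the current x.
        worst = int(np.argmax(a * x + b))
        if a[worst] * x + b[worst] <= t + tol:
            return x, t                  # no violation: (x, t) is optimal
        working.append(worst)
    return x, t
```

Each iteration either terminates or adds a constraint not yet in the working set, so the loop converges after touching only a small fraction of the constraint family; this is the same mechanism that makes the exponentially constrained problem above tractable.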
The last part of this thesis focuses on hashing methods for binary code learning. We
develop three novel hashing methods which focus on different aspects of binary code
learning. We first present a column generation based hash function learning method
for preserving triplet-based relative similarity. Given a set of triplets that encode
pairwise similarity comparisons, our method learns hash functions within the
large-margin learning framework. At each iteration of the column generation
procedure, the best hash function is selected. We show that our method, with its
triplet-based formulation and large-margin learning, is able to learn high-quality hash
functions.
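For intuition, a triplet (q, p, n) asserts that query q should be closer to p than to n; on binary codes, closeness is Hamming distance. The following sketch, with hypothetical helper names (this is not the column generation optimizer of Chapter 5), evaluates the large-margin hinge penalty over such triplets.

```python
import numpy as np

# Hypothetical helpers, for intuition only. A triplet (q, p, n) states that
# query q should be closer in Hamming distance to p (similar) than to n
# (dissimilar).
def hamming(a, b):
    return int(np.count_nonzero(a != b))

def triplet_hinge_loss(codes, triplets, margin=1):
    # Large-margin style penalty: zero when the similar pair is at least
    # `margin` bits closer than the dissimilar pair, linear otherwise.
    loss = 0.0
    for q, p, n in triplets:
        loss += max(0, margin + hamming(codes[q], codes[p])
                              - hamming(codes[q], codes[n]))
    return loss
```

For example, with codes 000, 001 and 111, the triplet (000, 001, 111) incurs zero loss, since the similar pair differs in 1 bit and the dissimilar pair in 3.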
The second hashing method in this thesis is a flexible and general method with
a two-step learning scheme. Most existing approaches to hashing apply a single form
of hash function, and an optimization process that is typically deeply coupled to this
specific form. This tight coupling restricts the flexibility of the method to respond to the
data, and can result in complex optimization problems that are difficult to solve. We
propose a flexible yet simple framework that is able to accommodate different
types of loss functions and hash functions. This framework allows a number of existing
approaches to hashing to be placed in context, and simplifies the development of new
problem-specific hashing methods. Our framework decomposes the hashing learning
problem into two steps: hash bit learning and hash function learning based on the
learned bits. The first step can typically be formulated as binary quadratic problems,
and the second step can be accomplished by training standard binary classifiers. These
two steps can be easily solved by leveraging sophisticated algorithms in the literature.
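A deliberately simplified sketch of the two-step scheme follows. Step 1 infers binary codes from a pairwise affinity matrix via a spectral relaxation, standing in for the binary quadratic problems mentioned above; Step 2 fits one simple predictor per bit, with least squares followed by a sign standing in for a standard binary classifier. All names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

# A deliberately simplified sketch of the two-step scheme (illustrative
# only, not the actual implementation).
# Step 1: infer codes from a pairwise affinity matrix S (+1 similar,
#   -1 dissimilar) by a spectral relaxation of max_Z tr(Z^T S Z), which
#   stands in for the binary quadratic problems described above.
# Step 2: fit one predictor per bit to reproduce the inferred codes;
#   least squares plus a sign stands in for a binary classifier.
def two_step_hashing(X, S, n_bits=4):
    _, eigvecs = np.linalg.eigh(S)
    Z = np.sign(eigvecs[:, -n_bits:])   # binarize the top eigenvectors
    Z[Z == 0] = 1
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)

    def hash_fn(Xq):                    # out-of-sample hash function
        return np.where(Xq @ W >= 0, 1, -1)

    return Z, hash_fn
```

The point of the decomposition is that Step 2 is plain supervised binary classification, so the classifier family (linear, kernel, decision trees) can be swapped without touching Step 1.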
The third hashing method aims at efficient and effective hash function learning
on large-scale, high-dimensional data, and extends our general two-step hashing
method. Non-linear hash functions have demonstrated their advantage over linear
ones due to their powerful generalization capability. In the literature, kernel functions
are typically used to achieve non-linearity in hashing; they achieve encouraging
retrieval performance at the price of slow training and evaluation. We propose to
use boosted decision trees to achieve non-linearity in hashing; they are fast to train
and evaluate, and hence more suitable for hashing with high-dimensional data. In our
approach, we first propose sub-modular formulations for the binary code inference
problem and an efficient GraphCut based block search method for solving large-scale
inference. We then learn hash functions by training boosted decision trees to fit the
binary codes. We show that our method significantly outperforms most existing
methods in both retrieval precision and training time, especially for high-dimensional
data.
Dedicated to my family.
Contents
Declaration iii
Acknowledgements v
Publications vii
Abstract ix
Contents xiii
List of Figures xvii
List of Tables xxi
1 Introduction 1
1.1 Structured Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Binary Code Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background Literature 9
2.1 Structured Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Structured SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Column generation boosting . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Column generation for multi-class boosting . . . . . . . . . . . . . 15
2.2.2.1 MultiBoost with hinge loss . . . . . . . . . . . . . . . . . 16
2.2.2.2 MultiBoost with exponential loss . . . . . . . . . . . . . . 17
2.3 Binary Code Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Locality-sensitive hashing . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Spectral hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Self-taught hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Supervised hashing with kernel . . . . . . . . . . . . . . . . . . . . 24
3 Fast Training of Effective Multi-class Boosting Using Coordinate Descent 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 The Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Column generation for MultiBoostcw . . . . . . . . . . . . . . . . . 33
3.2.2 Fast coordinate descent . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 UCI datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Handwritten digit recognition . . . . . . . . . . . . . . . . . . . . . 42
3.3.3 Three Image datasets: PASCAL07, LabelMe, CIFAR10 . . . . . . 43
3.3.4 Scene recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.5 Traffic sign recognition . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.6 FCD evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 StructBoost: Boosting Methods for Predicting Structured Output Variables 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.1 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Structured boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 1-slack formulation for fast optimization . . . . . . . . . . . . . . . 56
4.2.2 Cutting-plane optimization for the 1-slack primal . . . . . . . . . . 58
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Examples of StructBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Ordinal regression and AUC optimization . . . . . . . . . . . . . . 61
4.3.3 Multi-class boosting . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Hierarchical classification with taxonomies . . . . . . . . . . . . . . 62
4.3.5 Optimization of the Pascal image overlap criterion . . . . . . . . . 64
4.3.6 CRF parameter learning . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 AUC optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 Multi-class classification . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Hierarchical multi-class classification . . . . . . . . . . . . . . . . . 72
4.4.4 Visual tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.5 CRF parameter learning for image segmentation . . . . . . . . . . 76
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 Learning Hash Functions Using Column Generation 85
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.1 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Learning hash functions with squared hinge loss . . . . . . . . . . 89
5.3 Hashing with general smooth convex loss functions . . . . . . . . . . . . . 93
5.3.1 Hashing with logistic loss . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Hashing with ℓ∞ norm regularization . . . . . . . . . . . . . . . . . . . . 96
5.5 Extension of regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.1 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.2 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.3 Competing methods . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.4 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.5 Quantitative comparison results . . . . . . . . . . . . . . . . . . . . 103
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6 A General Two-Step Approach to Learning-Based Hashing 105
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Two-Step Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3 Solving binary quadratic problems . . . . . . . . . . . . . . . . . . . . . . 109
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.1 Comparing methods . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.2 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.3 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.4 Using different loss functions . . . . . . . . . . . . . . . . . . . . . 124
6.4.5 Training time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.6 Using different hash functions . . . . . . . . . . . . . . . . . . . . . 125
6.4.7 Results on large datasets . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7 Fast Supervised Hashing with Decision Trees for High-Dimensional Data 127
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2.1 Step 1: Binary code inference . . . . . . . . . . . . . . . . . . . . . 132
7.2.2 Step 2: Learning boosted trees as hash functions . . . . . . . . . . 135
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3.1 Comparison with KSH . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3.2 Comparison with TSH . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3.3 Experiments on different features . . . . . . . . . . . . . . . . . . . 145
7.3.4 Comparison with dimension reduction . . . . . . . . . . . . . . . . 147
7.3.5 Comparison with unsupervised methods . . . . . . . . . . . . . . . 148
7.3.6 More features and more bits . . . . . . . . . . . . . . . . . . . . . . 149
7.3.7 Large dataset: SUN397 . . . . . . . . . . . . . . . . . . . . . . . . 150
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8 Conclusion 153
8.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
A Appendix for MultiBoostcw 157
A.1 Dual problem of MultiBoostcw . . . . . . . . . . . . . . . . . . . . . . . . 157
A.2 MultiBoostcw with the hinge loss . . . . . . . . . . . . . . . . . . . . . . . 158
B Appendix for StructBoost 161
B.1 Dual formulation of n-slack . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.2 Dual formulation of 1-slack . . . . . . . . . . . . . . . . . . . . . . . . . . 162
C Appendix for CGHash 165
C.1 Learning hashing functions with the hinge loss . . . . . . . . . . . . . . . 165
C.1.1 Using ℓ1 norm regularization . . . . . . . . . . . . . . . . . . . . 165
C.1.2 Using ℓ∞ norm regularization . . . . . . . . . . . . . . . . . . . 166
Bibliography 169
List of Figures
1.1 An example of the image segmentation task. The first row shows the input images and the second row the segmentation label masks. The label mask is the structured output that we aim to predict, which identifies target objects (cars here) from the background. . . . . . . . . . . . . 2
1.2 An illustration of hashing based similarity preserving . . . . . . . . . . . . 4
1.3 An illustration of image retrieval. The first column shows query images, and the rest are retrieved images in the database. These retrieved images are expected to be semantically similar to the corresponding query images. 4
3.1 Results on 2 UCI datasets: VOWEL and ISOLET. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. The number after the method name is the mean value with standard deviation of the last iteration. Our methods converge much faster and achieve competitive test accuracy. The total training time and the solver time of our methods are both less than those of MultiBoost [1]. . . . . . . . . . . . . 41
3.2 Experiments on 2 handwritten digit recognition datasets: USPS, PENDIGITS. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on all datasets. . . . . . . . . . . . 42
3.3 Experiments on a handwritten digit recognition dataset: MNIST. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on this dataset. . . . . . . . . . . . 43
3.4 Results on a traffic sign dataset: GTSRB. CW and CW-1 (stage-wise setting) are our methods. Our methods converge much faster, achieve the best test error and use less training time. . . . . . . . . . . . . . . . . . . 44
3.5 Experiments on 3 image datasets: PASCAL07, LabelMe and CIFAR10. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Experiments on 2 scene recognition datasets: SCENE15 and a subset of SUN. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Solver comparison between FCD with different parameter settings and LBFGS-B [2]. One column for one dataset. The number after “FCD” is the setting for the maximum iteration (τmax) of FCD. The stage-wise setting of FCD is the fastest one. See the text for details. . . . . . . . . . 48
4.1 The hierarchy structures of two selected subsets of the SUN dataset [3] used in our experiments for hierarchical image classification. . . . . . . . . 63
4.2 Classification with taxonomies (tree loss), corresponding to the first example in Figure 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 AUC optimization on two UCI datasets. The objective values and optimization time are shown in the figure by varying boosting (or column generation) iterations. It shows that 1-slack achieves similar objective values as n-slack but needs less running time. . . . . . . . . . . . . . . . . 70
4.4 Test performance versus the number of boosting iterations of multi-class classification. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. The results of SSVM and SVM are shown as straight lines in the plots. The values shown in the legend are the error rates of the final iteration for each method. Our methods perform better than SSVM in most cases. . . . 71
4.5 Bounding box overlap in frames of several video sequences. Our StructBoost often achieves higher scores of box overlap compared with other trackers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Bounding box center location error in frames of several video sequences. Our StructBoost often achieves lower center location errors compared with other trackers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Some tracking examples for several video sequences: “coke”, “david”, “walk”, “bird” and “tiger2” (best viewed on screen). The output bounding boxes of our StructBoost overlap the ground truth better than those of the compared methods. . . . . . . . . . . . . . . . . . . . . . . . 77
4.8 Some segmentation results on the Graz-02 dataset (car). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 78
4.9 Some segmentation results on the Graz-02 dataset (bicycle). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 79
4.10 Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 80
4.11 Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries, and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds. . . . . . . . . . . 81
5.1 Results of precision-recall (using 64 bits). It shows that our CGHash performs the best in most cases. . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Precision of top-50 retrieved examples using different numbers of bits. It shows that our CGHash performs the best in most cases. . . . . . . . . . 99
5.3 Nearest-neighbor classification error using different numbers of bits. It shows that our CGHash performs the best in most cases. . . . . . . . . . 100
5.4 Performance of CGHash using different values of K (K ∈ {3, 10, 20, 30}) on the SCENE-15 dataset. Results of precision-recall (using 60 bits), precision of top-50 retrieved examples and nearest-neighbor classification are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5 Two retrieval examples for CGHash on the LABELME and MNIST datasets. Query examples are shown in the left column, and the retrieved examples are shown in the right part. . . . . . . . . . . . . . . . . . . . . 104
6.1 An illustration of Two-Step Hashing . . . . . . . . . . . . . . . . . . . . . 110
6.2 Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are top 40 retrieved images in the database. False predictions are marked by red boxes. . . . . . . . . . 115
6.3 Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are top 40 retrieved images in the database. False predictions are marked by red boxes. . . . . . . . . . 116
6.4 Results on 2 datasets of supervised methods. Results show that TSH outperforms others, usually by a large margin. The runner-up methods are STHs-RBF and KSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.5 Results on 2 datasets for comparing unsupervised methods. Results show that TSH outperforms others, usually by a large margin. . . . . . . . . . 118
6.6 Results on SCENE15, USPS and ISOLET for comparing with supervised and unsupervised methods. Our TSH performs the best. . . . . . . . . . 119
6.7 Results on 2 datasets of our method using different hash functions. Results show that using kernel hash functions (TSH-RBF and TSH-KF) achieves the best performance. . . . . . . . . . . . . . . . . . . . . . . . . 120
6.8 Code compression time using different hash functions. Results show that using the kernel transferred feature (TSH-KF) is much faster than SVM with the RBF kernel (TSH-RBF). Linear SVM is the fastest one. . . . . 121
6.9 Comparison of supervised methods on 2 large scale datasets: Flickr1M and Tiny580k. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform other methods. . . . . . . . . . . . . . 121
6.10 Comparison of unsupervised methods on 2 large scale datasets: Flickr1M and Tiny580k. The first row shows the results of supervised methods and the second row those of unsupervised methods. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform other methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.1 Some retrieval examples of our method FastHash on CIFAR10. The first column shows query images, and the rest are retrieved images in the database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Some retrieval examples of our method FastHash on ESPGAME. The first column shows query images, and the rest are retrieved images in the database. False predictions are marked by red boxes. . . . . . . . . . . . 138
7.3 Comparison of KSH and our FastHash on all datasets. The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last 2 rows. The number after “KSH” is the number of support vectors. Both our FastHash and FastHash-Full outperform KSH by a large margin. . . . . . . . . . . . . . 141
7.4 Comparison of various combinations of hash functions and binary inference methods. Note that the proposed FastHash uses decision trees as hash functions. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.5 Results on high-dimensional codebook features. The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last 2 rows. Both our FastHash and FastHash-Full outperform their comparators by a large margin. . . . 146
7.6 The retrieval precision results of unsupervised methods. Unsupervised methods perform poorly for preserving label based similarity. Our FastHash outperforms others by a large margin. . . . . . . . . . . . . . . 150
7.7 The precision curve of top 2000 retrieved examples on the large image dataset SUN397 using 1024 bits. Here we compare with those methods which can be efficiently trained up to 1024 bits on the whole training set. Our FastHash outperforms others by a large margin. . . . . . . . . . . . 151
List of Tables
4.1 AUC maximization. We compare the performance of n-slack and 1-slack formulations. “−” means that the method is not able to converge within a memory and time limit. We can see that 1-slack can achieve similar AUC results on training and testing data as n-slack while 1-slack is significantly faster than n-slack. . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Multi-class classification test errors (%) on several UCI and MNIST datasets. 1-v-a SVM is the one-vs-all SVM. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. StructBoost outperforms SSVM in most cases and achieves competitive performance compared with other multi-class classifiers. . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Hierarchical classification. Results of the tree loss and the 1/0 loss (classification error rate) on subsets of the SUN dataset. StructBoost-tree uses the hierarchy class formulation with the tree loss, and StructBoost-flat uses the standard multi-class formulation. StructBoost-tree, which minimizes the tree loss, performs best. . . . . . . . 72
4.4 Average bounding box overlap scores on benchmark videos. Struck50 [4]is structured SVM tracking with a buffer size of 50. Our StructBoostperforms the best in most cases. Struck performs the second best, whichconfirms the usefulness of structured output learning. . . . . . . . . . . . 73
4.5 Average center errors on benchmark videos. Struck50 [4] is structuredSVM tracking with a buffer size of 50. We observe similar results as inTable 4.4: Our StructBoost outperforms others on most sequences, andStruck is the second best. . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6 Image segmentation results on the Graz-02 dataset. The results showthe the pixel accuracy, intersection-union score (including the foregroundand background) and precision = recall value (as in [5]). Our methodStructBoost for nonlinear parameter learning performs better than SSVMand other methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1 Summary of the 6 datasets used in the experiments. . . . . . . . . . . . . 102
6.1 Results (using hash codes of 32 bits) of TSH using different loss functions, and a selection of other supervised and unsupervised methods, on 3 datasets. The upper part reports the results on training data and the lower part on testing data. The results show that Step 1 of our method is able to generate effective binary codes that outperform those of competing methods on the training data. On the testing data our method also outperforms others by a large margin in most cases. . . . 114
6.2 Training time (in seconds) for TSH using different loss functions, and several other supervised methods, on 3 datasets. The value in brackets is the time used in the first step for inferring the binary codes. The results show that our method is efficient. Note that the second step of learning the hash functions can be easily parallelised. . . . 114
7.1 Comparison of KSH and our FastHash. KSH results with different numbers of support vectors are shown. Both our FastHash and FastHash-Full outperform KSH by a large margin in terms of training time, binary encoding time (test time) and retrieval precision. . . . 139
7.2 Comparison of TSH and our FastHash for binary code inference in Step 1. The proposed Block GraphCut (Block-GC) achieves a much lower objective value and takes less inference time than the spectral method, and thus performs much better. . . . 140
7.3 Comparison of combinations of hash functions and binary inference methods. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2. . . . 144
7.4 Comparison of TSH and our FastHash. Results of TSH with the linear SVM and the budgeted RBF kernel [6] hash functions (TSH-BRBF) for Step 2 are presented. Our FastHash outperforms TSH by a large margin in both training speed and retrieval performance. . . . 145
7.5 Results using two types of features: low-dimensional GIST features and high-dimensional codebook features. Our FastHash and FastHash-Full outperform the comparators by a large margin on both feature types. In terms of training time, our FastHash is also much faster than others on the high-dimensional codebook features. . . . 147
7.6 Results of methods with dimension reduction. KSH, SPLH and STHs are trained with PCA feature reduction. Our FastHash outperforms others by a large margin in retrieval performance. . . . 148
7.7 Performance of our FastHash with more features (22400 dimensions) and more bits (1024 bits). It shows that FastHash can be efficiently trained on high-dimensional features with a large bit length. The training and binary coding time (test time) of FastHash increases only linearly with the bit length. . . . 149
7.8 Results on the large image dataset SUN397 using 11200-dimensional codebook features. Our FastHash can be efficiently trained to a large bit length (1024 bits) on this large training set. FastHash outperforms other methods by a large margin in retrieval performance. . . . 149
Chapter 1
Introduction
The general goal of computer vision research is to make computers understand the visual
world. Typical computer vision tasks include image classification, object detection,
image segmentation, visual tracking and image retrieval. Newly emerged tasks include
action recognition, event detection and object retrieval. Machine learning techniques play essential roles in these computer vision tasks. In this thesis we focus on two
types of machine learning techniques: structured learning and binary code learning,
which are important to many computer vision applications.
1.1 Structured Learning
Conventional supervised learning problems, such as classification and regression, aim to learn a function that predicts the best output value y ∈ R for an input vector x ∈ Rd. In many applications, however, the outputs are complex and cannot be well represented by a single scalar; the most appropriate outputs are instead objects (vectors, sequences, trees, etc.) whose components are inter-dependent. Such problems are referred to as structured output prediction.
Let y denote a structured output, and let Y denote the domain of outputs. The structured output y can be any kind of object, and it varies across applications. There are complex interactions/dependencies among the components of the output. These interactions/dependencies within y are usually modeled by directed/undirected graphical models.
For example, in the application of image segmentation, the input x is an image and the
output y is a label matrix (also called a label mask). Each element in y is the label
value of the corresponding pixel in the input image. A simple image segmentation task
Figure 1.1: An example of the image segmentation task. The first row shows the input images and the second row the segmentation label masks. The label mask is the structured output that we aim to predict, which identifies target objects (cars here) from the background.
is shown in Figure 1.1, in which the task is to predict a segmentation label mask. A
label mask identifies target objects from the background.
In structured learning, we aim to find a prediction function g which has a structured
output: y = g(x). This prediction function is assumed to take the following form:
y* = g(x) = argmax_y f(x, y). (1.1)
Here f is a function that measures the consistency of the input and output. The predic-
tion is achieved by maximizing f(x,y) over all possible y ∈ Y. A well-known structured
learning method is Structured SVM (SSVM) [7].
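The prediction rule in (1.1) can be sketched in code. The following is a minimal illustration assuming a toy binary output space small enough to enumerate exhaustively; the compatibility score f here is hypothetical, and real applications replace the enumeration with an efficient inference algorithm such as GraphCut.

```python
from itertools import product

def f(x, y):
    # Hypothetical compatibility score: agreement with a noisy observation x
    # plus a small smoothness bonus for equal neighbouring labels.
    unary = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    pairwise = sum(0.4 for a, b in zip(y, y[1:]) if a == b)
    return unary + pairwise

def predict(x, labels=(0, 1)):
    # y* = g(x) = argmax_y f(x, y), enumerating all |labels|^len(x) outputs.
    return max(product(labels, repeat=len(x)), key=lambda y: f(x, y))

print(predict((1, 0, 1, 1)))  # (1, 0, 1, 1)
```

The exhaustive search makes the argmax explicit; its exponential cost is exactly why structured prediction relies on problem-specific inference algorithms.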
Structured learning is widely applied in computer vision applications, such as image
segmentation [8, 9], object detection [10], visual tracking [11, 12], event detection [13],
and action recognition [14]. The survey of [15] provides a comprehensive review of structured learning and its applications in computer vision.
Most of these structured learning applications learn a linear model for structured output prediction. Non-linear models usually have better generalization performance than linear ones; hence a practical non-linear structured learning method is very desirable. However, non-linear structured learning has received much less attention than linear learning, because learning non-linear models for structured prediction is a great challenge. For example, using kernels in SSVM usually leads to very expensive computation, which greatly limits the applications of kernel SSVM. In this thesis, we propose a boosting based structured learning method for efficient non-linear structured prediction. This boosting based method combines a set of weak structured learners to form a strong structured learner.
1.2 Binary Code Learning
In binary code learning, we aim to learn a set of mapping functions that turn the original
high-dimensional data into binary codes. We refer to these mapping functions as hash
functions, and the corresponding learning methods as hashing methods. Suppose the
learning task is to find m hash functions for mapping input examples into m-bit binary
codes. Let x ∈ Rd denote a data point, and Φ(x) denote the output of m hash functions.
Φ(x) can be written as:
Φ(x) = [h1(x), h2(x), . . . , hm(x)]. (1.2)
Each hash function outputs a binary value:
h(x) ∈ {−1, 1}. (1.3)
The outputs of these hash functions form an m-bit binary code: Φ(x) ∈ {−1, 1}^m. Generally
these hash functions are able to preserve some kind of data similarity in the Hamming
space.
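As an illustration of the mapping Φ, the sketch below uses m linear hash functions of the form h_j(x) = sign(w_j^T x + b_j); the randomly drawn parameters are placeholders for what a hashing method would actually learn.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 16                       # feature dimension, number of bits
W = rng.standard_normal((m, d))    # one weight vector w_j per hash function
b = rng.standard_normal(m)

def encode(x):
    # Phi(x) = [h_1(x), ..., h_m(x)] in {-1, 1}^m
    return np.where(W @ x + b >= 0, 1, -1)

code = encode(rng.standard_normal(d))
print(code.shape, set(code.tolist()) <= {-1, 1})  # (16,) True
```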
The term “hashing” here refers to compact binary encoding [16]. Hashing methods have been particularly successful for similarity search (also known as nearest neighbor search). Hashing based similarity search has been applied in a variety of computer vision applications, such as object recognition [16], image retrieval [17], image classification [18], large-scale object detection [19], image matching [20], fast pose estimation [21] and compact local descriptor learning [22–24]. Content based image retrieval is an important application of hashing methods, in which the goal is to search a large collection of images for images similar to a query based on visual content. An example of the image retrieval application is shown in Figure 1.3. Similarity search techniques are essential to image retrieval applications.
One main advantage of hashing methods for similarity search is their high computational
efficiency. The similarity between two images can be measured by calculating the Hamming distance between their binary codes. The Hamming distance of two binary codes is the number of bits with different values. The Hamming distance requires only simple bit operations, which makes it extremely efficient to compute.
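Concretely, when codes are stored as bit strings, the Hamming distance is an XOR followed by a population count; the codes below are the ones shown in Figure 1.2.

```python
def hamming(a: int, b: int) -> int:
    # XOR marks the differing bits; counting the set bits gives the distance.
    return bin(a ^ b).count("1")   # (a ^ b).bit_count() on Python >= 3.10

print(hamming(0b11001100, 0b11001101))  # 1 (similar pair)
print(hamming(0b11001100, 0b00110011))  # 8 (dissimilar pair)
```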
After obtaining binary codes, there are two common approaches for similarity search [17]: one is using hash look-up tables [25, 26], and the other is Hamming distance ranking. In the first approach, look-up tables are constructed using binary codes as keys. Given a query, data points within a fixed Hamming distance are returned as the retrieved result. In the second approach, given a query, all data points in the database are ranked based
Figure 1.2: An illustration of hashing based similarity preservation. Similar images (e.g. codes 11001100 and 11001101) have a small Hamming distance, while dissimilar images (e.g. 11001100 and 00110011) have a large Hamming distance.
Figure 1.3: An illustration of image retrieval. The first column shows query images, and the rest are retrieved images in the database. These retrieved images are expected to be semantically similar to the corresponding query images.
on their Hamming distance to the query, and then a desired number of top ranked data points are returned as the retrieved result. Both approaches are computationally efficient.
Additionally, compact binary codes are extremely efficient for large-scale data storage.
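The Hamming distance ranking approach can be sketched as follows, assuming the codes are packed into uint8 arrays so that the XOR and popcount run vectorised; the function and variable names are illustrative.

```python
import numpy as np

def top_k(query_bits, db_bits, k):
    # query_bits: (m/8,) uint8; db_bits: (n, m/8) uint8.
    diff = np.bitwise_xor(db_bits, query_bits)        # differing bits per item
    dists = np.unpackbits(diff, axis=1).sum(axis=1)   # popcount per row
    return np.argsort(dists, kind="stable")[:k]       # indices of top-k items

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)  # 1000 64-bit codes
q = db[42].copy()                                          # query = a db item
print(top_k(q, db, 3)[0])  # 42: the identical code ranks first
```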
In general, hash functions are generated with the aim of preserving some notion of
similarity between data points. It is expected that the Hamming distance between two similar data points is smaller than that between two dissimilar data points. An illustrative example
is shown in Figure 1.2.
In computer vision applications, the notion of “similarity” can be roughly categorized into visual similarity and semantic similarity. Visually similar images have similar visual appearance; for example, images with small Euclidean distances in pixel space are visually similar. In contrast, semantically similar images are relevant in terms of high-level semantic content; for example, images containing objects from the same category are semantically similar. Clearly, semantically similar images are not necessarily visually similar, and vice versa. The visual differences arise for many reasons. For example, the visual appearance of an object can change dramatically with variations of viewpoint, illumination, scaling, translation and rotation. Moreover, two objects may come from related sub-categories yet look visually different. In real-world applications, semantic similarity is usually much more desirable.
Generally, to preserve semantic similarity, similarity labels are provided for hash function learning. Hashing methods which aim to preserve label based similarity are referred to as supervised methods [17, 27–31]. Correspondingly, hashing methods that preserve similarity in the original feature space are referred to as unsupervised methods [32–40]. Supervised methods are able to preserve user provided or domain-specific similarities, and are not restricted to Euclidean similarities defined on the feature space. Hence they are more flexible and favorable for real-world applications. We focus on developing supervised hashing methods in this thesis.
The performance of existing supervised hashing methods is still far from satisfactory. These methods usually have difficulty in accurate similarity preservation and large-scale learning.
Nowadays, a large number of images or videos can be easily obtained in many applications. This huge amount of data brings great challenges for hashing learning. Most existing supervised hashing methods become impractically slow when trained on large-scale data; they are only able to learn hash functions from relatively small training sets (up to tens of thousands of examples) and low-dimensional features (up to hundreds of dimensions).
Many supervised methods aim to learn linear hash functions, while non-linear hash function learning has received much less attention. Non-linear hash functions can improve similarity preservation accuracy over linear ones. Kernel hash functions are typically applied to achieve non-linear learning. However, evaluating and learning kernel functions is computationally expensive, especially on high-dimensional data. Overall, it remains a great challenge to learn non-linear hash functions for accurate similarity preservation and large-scale learning.
In this thesis, we aim to propose practical hashing methods for semantic similarity
search, which have good search precision and can be efficiently trained and evaluated on
large-scale high-dimensional datasets.
1.3 Contributions
The contributions of this thesis are in the fields of structured learning, binary code learning and their applications in computer vision. In Chapter 2, we present the literature background of structured learning and binary code learning. Multi-class classification is a special case of structured output prediction, and we first propose an efficient boosting method for multi-class classification in Chapter 3. Then in Chapter 4, we propose a boosting based structured learning method for general structured output prediction. In Chapters 5, 6 and 7, we focus on binary code learning and propose three novel hashing methods. Finally, in Chapter 8, we conclude the thesis and discuss future work. We describe the contributions in more detail as follows:
Chapter 3
We present a novel column generation based boosting method for multi-class classifi-
cation [41]. Our multi-class boosting is formulated in a single optimization problem.
Different from most existing multi-class boosting methods, which use the same set of
weak learners for all the classes, we learn a separate set of weak learners for each class.
We show that using these class-specific weak learners leads to fast convergence, without introducing additional computational overhead in the training procedure. To further
make the training more efficient and scalable, we also propose a fast coordinate descent
method for solving the optimization problem at each boosting iteration. The proposed
coordinate descent method is conceptually simple and easy to implement in that it has
a closed-form solution for each coordinate update. Experimental results on a variety
of datasets show that, compared to a range of existing multi-class boosting methods,
the proposed method has much faster convergence rate and better generalization per-
formance in most cases. We also empirically show that the proposed fast coordinate
descent algorithm is able to efficiently solve the optimization.
Chapter 4
Recently, structured learning has found many applications in computer vision. Inspired
by structured support vector machines (SSVM), here we propose a new boosting algo-
rithm for structured output prediction, which we refer to as StructBoost [42]. Struct-
Boost supports nonlinear structured learning by combining a set of weak structured
learners.
As SSVM generalizes SVM, our StructBoost generalizes standard boosting approaches such as AdaBoost and LPBoost to structured learning. The resulting optimization problem of StructBoost is more challenging than that of SSVM in the sense that it may involve
exponentially many variables and constraints. In contrast, for SSVM one usually has
an exponential number of constraints and a cutting-plane method is used. In order to
efficiently solve StructBoost, we formulate an equivalent 1-slack formulation and solve
it using a combination of cutting planes and column generation. We show the versa-
tility and usefulness of StructBoost on a range of problems such as optimizing the tree
loss for hierarchical multi-class classification, optimizing the Pascal overlap criterion
for robust visual tracking and learning conditional random field parameters for image
segmentation.
Chapter 5
In this chapter, we propose a column generation based method [43] for learning data-
dependent hash functions based on relative pairwise similarity information. Given a set
of triplets that encode the pairwise similarity comparison information, our method learns
hash functions that preserve the relative comparison relations in the data within the
large-margin learning framework. The learning procedure is implemented using column
generation and hence is named CGHash. At each iteration of the column generation
procedure, the best hash function is selected. Unlike many other hashing methods, our
method generalizes to new data points naturally. We show that our method with triplet
based formulation and large-margin learning is able to learn high quality hash functions
for similarity search.
Chapter 6
In this chapter, we propose a flexible and general method [44] with a two-step learning
scheme. Most existing approaches to hashing apply a single form of hash function, and
an optimization process which is typically deeply coupled to this specific form. This
tight coupling restricts the flexibility of the method to respond to the data, and can
result in complex optimization problems that are difficult to solve. Here we propose
a flexible yet simple framework that is able to accommodate different types of loss
functions and hash functions. This framework allows a number of existing approaches
to hashing to be placed in context, and simplifies the development of new problem-
specific hashing methods. Our framework decomposes the hashing learning problem
into two steps: hash bit learning and hash function learning based on the learned bits.
The first step can typically be formulated as binary quadratic problems, and the second
step can be accomplished by training standard binary classifiers. Both problems have
been extensively studied in the literature. Our extensive experiments demonstrate that
the proposed framework is effective, flexible and outperforms the state-of-the-art.
Chapter 7
In this chapter, we propose a hashing method [45] for efficient and effective learning
on large-scale and high-dimensional data, which is an extension of our general two-
step hashing method. Supervised hashing aims to map the original features to compact
binary codes that are able to preserve label based similarity in the Hamming space. Non-
linear hash functions have demonstrated their advantage over linear ones due to their
powerful generalization capability. In the literature, kernel functions are typically used
to achieve non-linearity in hashing, which achieve encouraging retrieval performance at
the price of slow evaluation and training time. Here we propose to use boosted decision
trees for achieving non-linearity in hashing, which are fast to train and evaluate, hence
more suitable for hashing with high dimensional data. In our approach, we first propose
sub-modular formulations for the hashing binary code inference problem and an efficient
GraphCut based block search method for solving large-scale inference. Then we learn
hash functions by training boosted decision trees to fit the binary codes. Experiments
demonstrate that our proposed method significantly outperforms most state-of-the-art
methods in retrieval precision and training time. Especially for high-dimensional data,
our method is orders of magnitude faster than many methods in terms of training time.
Chapter 2
Background Literature
This chapter provides the background of structured learning and binary code learning,
which the later chapters are based on. We explain some basic notations and review some
popular existing methods which are related to the focus of this thesis.
2.1 Structured Learning
Structured learning aims to learn a prediction function g which has a structured output:
y = g(x). This prediction function is assumed to take the following form:
y* = g(x) = argmax_y f(x, y). (2.1)
Here f is a function that measures the consistency of the input and output. The pre-
diction is achieved by solving an inference problem which is finding an output y? that
maximizes f(x,y). Algorithms for solving the prediction in (2.1) depend on applica-
tions. For example, in the application of image segmentation [8], the GraphCut [46]
algorithm is applied to solve the inference problem for prediction. The survey of [15]
provides a comprehensive review of structured learning and its application in computer
vision.
Existing structured learning methods fall into two categories: probabilistic approaches and max-margin approaches. Probabilistic approaches estimate the
distribution of underlying data, hence require an expensive normalization step. Popu-
lar existing methods in this category include Conditional Random Fields (CRFs) [47],
Maximum Entropy Discrimination Markov Networks [48], and so on.
In contrast with probabilistic approaches, max-margin approaches directly learn a discriminative function, and only require solving maximum a posteriori (MAP) inference
problems. These MAP inference problems are usually similar to the structured predic-
tion problem in (2.1), thus similar inference algorithms can be applied. Popular existing
methods in this category include Structured SVM [7], Max-Margin Markov Networks
[49] and so on.
2.1.1 Structured SVM
Structured SVM (SSVM) [7] is a well-known max-margin structured learning method.
Our boosting based structured learning method, described in Chapter 4, can be seen as an extension of SSVM for efficient nonlinear learning.
In structured learning, a prediction function g can be learned by minimizing the regu-
larized structural empirical risk functional, which can be written as:
min_g J(g) := νΩ(g) + Remp(g), (2.2)
where Remp(g) := (1/m) ∑_{i=1}^m ∆(yi, g(xi)). (2.3)
Here g is the prediction function which we aim to learn; Remp is the empirical risk, defined on the input-output pairs (x1, y1), (x2, y2), . . . , (xm, ym) ∈ X × Y; ∆(·) is a structured loss which measures how well the prediction matches the ground truth; and νΩ(g) is a regularization term for controlling the model complexity, in which ν > 0 is a predefined trade-off parameter.
Directly solving the optimization in (2.2) is difficult, because the term ∆(yi, g(xi)) is piecewise constant. SSVM replaces the ∆-loss in (2.3) by a convex upper bound. As shown in [50, 51], minimizing a convex upper bound of (2.3) is sufficient for learning a good prediction function. In SSVM, the prediction function is defined as:
g(x) = argmax_y f(x, y; w) = argmax_y w>Ψ(x, y), (2.4)
where Ψ(x,y) is a joint feature mapping of the input-output pair. The optimization
problem of SSVM is written as:
min_w (1/2)‖w‖² + (C/m) ∑_{i=1}^m l(xi, yi, w), (2.5)
where
l(xi, yi, w) = max_{y∈Y} [∆(yi, y) − f(xi, yi; w) + f(xi, y; w)]. (2.6)
Here `2 norm is used for regularization and C is a regularization trade-off parameter.
Clearly we have:
f(xi, g(xi); w) = max_y f(xi, y; w) ≥ f(xi, yi; w); (2.7)
hence we have the following relations:
∆(yi, g(xi)) ≤ ∆(yi, g(xi)) − f(xi, yi; w) + f(xi, g(xi); w) (2.8a)
≤ max_{y∈Y} [∆(yi, y) − f(xi, yi; w) + f(xi, y; w)] (2.8b)
= l(xi, yi, w). (2.8c)
The above relations show that l(xi, yi, w) is an upper bound of ∆(yi, g(xi)). The loss
formulation of SSVM in (2.5) is equivalent to the conventional slack formulation of
SSVM, which is written as:
min_{w,ξ} (1/2)‖w‖² + (C/m) ∑_{i=1}^m ξi (2.9a)
s.t. ∀i: ξi ≥ 0, (2.9b)
∀i = 1, . . . , m and ∀y ∈ Y: w>Ψ(xi, yi) − w>Ψ(xi, y) ≥ ∆(yi, y) − ξi. (2.9c)
The definition of the prediction loss ∆(yi, y) depends on the application. In multi-class classification, as a special case of structured learning, ∆ is defined as the zero-one loss. In image segmentation, ∆ is usually defined as the Hamming loss between two label masks [8].
The optimization problem of SSVM is convex. However, the loss function in SSVM is not everywhere differentiable, as a result of the max operation. SSVM can be solved by sub-gradient descent, cutting-plane methods [7, 52, 53], online sub-gradient descent [54–56], and so on.
The cutting-plane algorithm constructs a working set of constraints and iteratively adds constraints into this set. Specifically, in each iteration, it finds the most violated constraints in (2.9c) and adds them to the working set, then solves the optimization in (2.9) with only the constraints in the working set. Finding the most violated constraints is
to solve the following MAP inference problem for each input xi:
y* = argmax_y w>Ψ(xi, y) + ∆(yi, y). (2.10)
This MAP inference problem is similar to the prediction inference problem in (2.1);
hence it can be solved in a similar way. This inference step is also required in many
other methods for solving SSVM.
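For the multi-class special case, the most-violated-constraint search in (2.10) can be sketched directly: the joint feature map places x in the block of class y, and ∆ is the zero-one loss. The code below is an illustrative sketch, not the thesis implementation.

```python
import numpy as np

def psi(x, y, K):
    # Joint feature map for multi-class: x is copied into the block of class y.
    out = np.zeros(K * x.size)
    out[y * x.size:(y + 1) * x.size] = x
    return out

def most_violated(w, x, y_true, K):
    # y* = argmax_y  w^T Psi(x, y) + Delta(y_true, y), with the zero-one loss.
    scores = [w @ psi(x, y, K) + (0.0 if y == y_true else 1.0)
              for y in range(K)]
    return int(np.argmax(scores))

w = np.array([2.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # toy model favouring class 0
x = np.array([1.0, 1.0])
print(most_violated(w, x, 1, 3))  # 0: class 0 gives the most violated constraint
```

In richer output spaces the same search is performed by a MAP inference algorithm rather than by enumerating the labels.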
2.2 Boosting
In this thesis, we explore boosting methods for multi-class classification and general
structured learning. We explain the background of boosting learning in this section.
Basically, boosting methods construct a strong learner by combining a set of weak learners. For classification problems, a typical boosting method aims to learn a set of weak classifiers and their corresponding weightings. Each weak classifier on its own has only moderate classification accuracy. The weak classifiers are then combined according to the learned weightings to form a single strong classifier. A well-known example of boosting methods is AdaBoost [57].
2.2.1 Column generation boosting
There are a variety of boosting methods in the literature. These boosting methods can usually be explained by general boosting learning frameworks. For example, the AnyBoost framework [58] shows that many boosting methods perform gradient descent in function space. In this thesis we focus on the column generation framework [59, 60] for boosting learning. In this framework, boosting learning is formulated as an optimization problem which usually involves infinitely many variables, and this optimization can be solved by the column generation technique. Column generation based boosting methods are able to update the weightings of all weak learners when adding new weak learners in each iteration. This kind of boosting learning, which updates the weights of all weak learners in each iteration, is referred to as totally-corrective learning.
Here we describe LPBoost [60] as a simple example of column generation based boosting methods. LPBoost is a boosting method for binary classification, which solves a linear programming optimization problem.
First we describe some basic notation. Inequality between two vectors means element-wise inequality. 1 is a vector with all elements being one; its dimension should be clear from the context. Similarly, 0 is a vector with all elements being zero.
A training example is denoted by x and its class label is denoted by y. In binary
classification, we have y ∈ {−1, +1}. A weak classifier φ is a function that maps an example x to {−1, +1}:
φ(x) ∈ {−1, +1}. (2.11)
The domain of all possible weak learners is denoted by C: φ(·) ∈ C. The outputs of all weak classifiers are collected in a column vector:
Φ(x) = [φ1(x), φ2(x), · · · , φm(x)]>. (2.12)
The weak learner weighting vector is denoted by w, and w ≥ 0. Weak classifiers are
linearly combined to form the final strong classifier:
f(x) = sign[w>Φ(x)] = sign[∑_{j=1}^m wj φj(x)]. (2.13)
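As a toy illustration of (2.13), the sketch below evaluates a strong classifier built from three hypothetical weak classifiers with hand-picked weightings; both stand in for what LPBoost would actually select and learn.

```python
import numpy as np

# Hypothetical weak classifiers (decision stumps on a scalar input).
weak = [lambda x: 1 if x > 0.5 else -1,
        lambda x: 1 if x > 1.5 else -1,
        lambda x: 1 if x < 2.5 else -1]
w = np.array([0.5, 0.3, 0.2])   # nonnegative weightings, as LPBoost requires

def strong(x):
    votes = np.array([phi(x) for phi in weak])   # Phi(x) in {-1, +1}^m
    return int(np.sign(w @ votes))               # f(x) = sign(w^T Phi(x))

print(strong(1.0), strong(0.0))  # 1 -1
```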
LPBoost is a max-margin learning method with the hinge loss. The optimization prob-
lem for LPBoost is written as:
min_{w,ξ} 1>w + (C/n) ∑_{i=1}^n ξi (2.14a)
s.t. w ≥ 0, ξ ≥ 0, (2.14b)
∀i = 1, . . . , n: yi w>Φ(xi) ≥ 1 − ξi. (2.14c)
Here `1 norm is used for regularization; C is a regularization trade-off parameter; n is
the number of training examples. By solving the above optimization, we can learn a set
of weak classifiers Φ(·) and their corresponding weightings w.
From the viewpoint of column generation boosting, all possible weak learners are considered in the optimization in (2.14). The number of weak learners in (2.14) is the size of the weak learner domain, |C|, which can be infinitely large; thus the dimension of the weighting vector w can also be infinitely large. The weightings of all weak learners are initialized to 0, and a weak classifier with zero weighting is not included in the final strong classifier. By solving the optimization in (2.14), we are able to obtain a sparse solution for the weightings w. In this way, we obtain a small number of weak classifiers with nonzero weightings, and these weak classifiers construct the strong classifier.
Column generation is a technique for solving optimization problems which have a large number of variables and cannot be solved directly. Starting from an empty working
set of weak classifiers, column generation iteratively generates new weak classifiers and adds them to the working set. New weak classifiers are generated by finding the most violated constraints in the dual problem of (2.14). The dual problem of (2.14) can be derived as:
max_µ ∑_{i=1}^n µi (2.15a)
s.t. ∀i = 1, . . . , n: 0 ≤ µi ≤ C/n, (2.15b)
∀φ(·) ∈ C: ∑_{i=1}^n µi yi φ(xi) ≤ 1. (2.15c)
Here µi is the dual variable associated with one constraint in (2.14c). In each column generation iteration, we perform the following two steps:
1. Generate a new weak classifier by finding the most violated constraint in the dual problem (2.15), and add it to the weak classifier working set Wφ.
2. Solve the primal problem (2.14) or the dual problem (2.15) using only the weak classifiers in the working set Wφ, obtaining the primal and dual solutions w, µ.
The learning algorithm repeats these two steps until it reaches a predefined number of
iterations.
In the first step, generating a new weak classifier amounts to solving the following optimization:
φ*(·) = argmax_{φ(·)∈C} ∑_{i=1}^n µi yi φ(xi), (2.16)
which finds a most violated constraint in the dual problem (2.15). The above optimization is a weighted binary classification problem, in which the dual solution µ serves as the example weightings. Typically, we can train a decision stump or decision tree classifier as the weak classifier solution.
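The weak-learner search in (2.16) can be sketched with decision stumps as the domain C: among all single-feature thresholds and polarities, pick the stump maximising the µ-weighted edge ∑_i µi yi φ(xi). This is an illustrative brute-force sketch, not the thesis implementation.

```python
import numpy as np

def best_stump(X, y, mu):
    # Exhaustively search stumps phi(x) = s if x[j] <= t else -s, returning
    # the one maximising the weighted edge sum_i mu_i * y_i * phi(x_i).
    best, best_edge = None, -np.inf
    for j in range(X.shape[1]):               # feature index
        for t in np.unique(X[:, j]):          # candidate threshold
            for s in (1, -1):                 # polarity
                phi = s * np.where(X[:, j] <= t, 1, -1)
                edge = float(np.sum(mu * y * phi))
                if edge > best_edge:
                    best, best_edge = (j, t, s), edge
    return best, best_edge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
mu = np.full(4, 0.25)                          # uniform dual weights
(j, t, s), edge = best_stump(X, y, mu)
print(j, t, s, edge)  # 0 1.0 -1 1.0
```

An edge equal to ∑_i µi means the stump perfectly separates the weighted data, i.e. the corresponding dual constraint in (2.15c) is maximally violated.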
We refer to the primal problem (2.14) restricted to the weak classifiers in the working set Wφ as the reduced primal problem. Similarly, we refer to the dual problem (2.15) restricted to the working set Wφ as the reduced dual problem.
In the second step, the reduced primal problem of (2.14) or the reduced dual problem of
(2.15) is a linear program (LP); hence we can use MOSEK or any other off-the-shelf LP
solver to obtain the primal solution w and the dual solution µ. The dual solution
is required for generating new weak classifiers. The weights w of all weak classifiers
in the working set are updated in each iteration; thus LPBoost is a totally-corrective
boosting method.
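A minimal sketch of solving the reduced primal with an off-the-shelf LP solver, here SciPy's `linprog` in place of MOSEK. We assume the standard LPBoost primal form for (2.14), i.e. minimizing 1ᵀw + (C/n)Σᵢξᵢ subject to yᵢΣⱼwⱼφⱼ(xᵢ) ≥ 1 − ξᵢ and w, ξ ≥ 0; the function name is ours:

```python
import numpy as np
from scipy.optimize import linprog

def solve_reduced_primal(H, y, C):
    """Solve the reduced LPBoost primal over the working set with an LP solver.

    H: (n, J) matrix with H[i, j] = phi_j(x_i) for the J weak classifiers in
    the working set; y: (n,) labels in {-1, +1}. Variables are stacked as
    [w_1..w_J, xi_1..xi_n]; returns the weak-classifier weights w.
    """
    n, J = H.shape
    # Objective: 1'w + (C/n) * sum(xi)
    c = np.concatenate([np.ones(J), (C / n) * np.ones(n)])
    # Margin constraints y_i * (H_i @ w) >= 1 - xi_i, rewritten in the
    # A_ub @ x <= b_ub form expected by linprog.
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    # The dual solution mu of (2.15), needed to generate the next weak
    # classifier, can be read from the solver's constraint multipliers
    # (res.ineqlin.marginals in recent SciPy versions).
    return res.x[:J]
```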
2.2.2 Column generation for multi-class boosting
Shen and Hao [1] propose a column generation based boosting method for multi-class
classification, which is an extension of LPBoost (described in Section 2.2.1). We refer
to this method as MultiBoost. In Chapter 4, we extend MultiBoost to boosting based
structured learning. MultiBoost follows the max-margin learning formulation of the
multi-class SVM proposed by Crammer and Singer [61].
We denote the number of classes by K. The class label y takes a value from 1 to K:
y ∈ {1, 2, . . . , K}. MultiBoost learns a weighting vector for each class; hence it learns K
models. We denote by w_y the weighting vector of class y. The classification
score of class y is then:

$$f_y(\mathbf{x}) = \mathbf{w}_y^\top \Phi(\mathbf{x}) = \sum_{j=1}^{m} w_{y,j}\, \phi_j(\mathbf{x}). \tag{2.17}$$
Given a test data point x, the prediction function is:

$$y^\star = \operatorname*{argmax}_{y} f_y(\mathbf{x}) = \operatorname*{argmax}_{y} \mathbf{w}_y^\top \Phi(\mathbf{x}), \tag{2.18}$$

which finds the class label with the largest confidence.
MultiBoost is a flexible framework in which a variety of loss functions can be applied.
Here we describe the hinge loss and the exponential loss as examples. For a training
example, we expect the classification score of the ground-truth class to be higher than
that of any other class. The multi-class margin associated with the training example
(x_i, y_i) is defined as

$$\gamma_{(i,y)} = \mathbf{w}_{y_i}^\top \Phi(\mathbf{x}_i) - \mathbf{w}_{y}^\top \Phi(\mathbf{x}_i). \tag{2.19}$$

Intuitively, γ_(i,y) is the difference between the classification scores of the ground-truth
class and another class. The training of MultiBoost encourages this margin to be large.
2.2.2.1 MultiBoost with hinge loss
The training of MultiBoost with the hinge loss is to solve the following optimization
problem:
$$
\begin{aligned}
\min_{\mathbf{w}, \boldsymbol{\xi}} \quad & \mathbf{1}^\top \mathbf{w} + \frac{C}{n} \sum_{i=1}^{n} \xi_i && \text{(2.20a)}\\
\text{s.t.} \quad & \mathbf{w} \geq 0, \; \boldsymbol{\xi} \geq 0, && \text{(2.20b)}\\
& \forall i = 1, \dots, n \text{ and } \forall y \in \{1, \dots, K\} \setminus \{y_i\}: \\
& \quad \mathbf{w}_{y_i}^\top \Phi(\mathbf{x}_i) - \mathbf{w}_{y}^\top \Phi(\mathbf{x}_i) \geq 1 - \xi_i. && \text{(2.20c)}
\end{aligned}
$$
The corresponding dual problem is written as:
$$
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \neq y_i} \mu_{(i,y)} && \text{(2.21a)}\\
\text{s.t.} \quad & \forall \phi \text{ and } \forall c = 1, \dots, K: \\
& \quad \sum_{i:\, y_i = c} \sum_{y \neq y_i} \mu_{(i,y)} \phi(\mathbf{x}_i) - \sum_i \sum_{y \neq y_i,\, y = c} \mu_{(i,y)} \phi(\mathbf{x}_i) \leq 1, && \text{(2.21b)}\\
& \forall i = 1, \dots, n: \; 0 \leq \sum_{y \neq y_i} \mu_{(i,y)} \leq \frac{C}{n}. && \text{(2.21c)}
\end{aligned}
$$
The column generation algorithm for MultiBoost is similar to that of LPBoost, which
is described in Section 2.2.1. The sub-problem for learning a new weak classifier is to
find the most violated constraint in the dual problem, which is written as:
$$[\phi^\star(\cdot), c^\star] = \operatorname*{argmax}_{\phi(\cdot) \in \mathcal{C},\; c \in \{1, \dots, K\}} \; \sum_{i:\, y_i = c} \sum_{y \neq y_i} \mu_{(i,y)} \phi(\mathbf{x}_i) - \sum_i \sum_{y \neq y_i,\, y = c} \mu_{(i,y)} \phi(\mathbf{x}_i). \tag{2.22}$$
Recall that the reduced primal problem is the primal problem (2.20) restricted to weak
classifiers from the working set. The reduced primal problem can be solved by LP
solvers to obtain the primal and dual solutions, similar to LPBoost.
2.2.2.2 MultiBoost with exponential loss
The training of MultiBoost with the exponential loss is to solve the following optimiza-
tion problem:
$$
\begin{aligned}
\min_{\mathbf{w}} \quad & \mathbf{1}^\top \mathbf{w} + \frac{C}{p} \sum_{i=1}^{n} \sum_{y \neq y_i} \exp(-\gamma_{(i,y)}) && \text{(2.23a)}\\
\text{s.t.} \quad & \mathbf{w} \geq 0, && \text{(2.23b)}\\
& \forall i = 1, \dots, n \text{ and } \forall y \in \{1, \dots, K\} \setminus \{y_i\}: \\
& \quad \mathbf{w}_{y_i}^\top \Phi(\mathbf{x}_i) - \mathbf{w}_{y}^\top \Phi(\mathbf{x}_i) = \gamma_{(i,y)}. && \text{(2.23c)}
\end{aligned}
$$
Here p is the number of constraints: p = n(K − 1). In the above optimization, one
constraint corresponds to one margin variable, which is different from the hinge loss
formulation in (2.20). The corresponding dual problem is written as:
$$
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \neq y_i} \mu_{(i,y)} \Big[ 1 - \log \frac{p}{C} - \log \mu_{(i,y)} \Big] && \text{(2.24a)}\\
\text{s.t.} \quad & \forall \phi(\cdot) \in \mathcal{C} \text{ and } \forall c = 1, \dots, K: \\
& \quad \sum_{i:\, y_i = c} \sum_{y \neq y_i} \mu_{(i,y)} \phi(\mathbf{x}_i) - \sum_i \sum_{y \neq y_i,\, y = c} \mu_{(i,y)} \phi(\mathbf{x}_i) \leq 1, && \text{(2.24b)}\\
& \forall i = 1, \dots, n: \; 0 \leq \sum_{y \neq y_i} \mu_{(i,y)} \leq \frac{C}{p}. && \text{(2.24c)}
\end{aligned}
$$
Similar to the case of the hinge loss, the sub-problem for learning a new weak classifier is
to find the most violated constraint in the dual problem, which is written in (2.22). The
primal problem (2.23) has a smooth objective, which is different from the hinge loss
formulation in (2.20); hence we are able to apply LBFGS-B [2] to solve the reduced
primal problem. The dual solution µ* can be calculated from the primal solution w*.
According to the KKT conditions, the dual solution is written as:

$$\mu^\star_{(i,y)} = \frac{C}{p} \exp\!\big[ \mathbf{w}_y^{\star\top} \Phi(\mathbf{x}_i) - \mathbf{w}_{y_i}^{\star\top} \Phi(\mathbf{x}_i) \big]. \tag{2.25}$$
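The KKT recovery in (2.25) is a simple element-wise computation. A minimal numpy sketch (a helper of ours, using 0-based class labels for indexing convenience):

```python
import numpy as np

def dual_from_primal(F, y, C):
    """Recover the dual variables via (2.25).

    F: (n, K) matrix of classification scores, F[i, y] = w_y^T Phi(x_i);
    y: (n,) ground-truth labels in {0, ..., K-1}. Returns mu of shape (n, K),
    with the unused entries mu[i, y_i] set to 0.
    """
    n, K = F.shape
    p = n * (K - 1)                              # number of constraints
    margins = F - F[np.arange(n), y][:, None]    # w_y^T Phi - w_{y_i}^T Phi
    mu = (C / p) * np.exp(margins)
    mu[np.arange(n), y] = 0.0                    # no dual variable for y = y_i
    return mu
```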
2.3 Binary Code Learning
In binary code learning, we aim to learn a set of mapping functions that turn the
original high-dimensional data into binary codes. We refer to these mapping functions as
hash functions, and the corresponding learning methods as hashing methods. Suppose
the learning task is to find m hash functions for mapping input examples into m-bit
binary codes. Let x ∈ R^d denote a data point, and Φ(x) denote the output of the m
hash functions. Φ(x) is written as:

$$\Phi(\mathbf{x}) = [h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_m(\mathbf{x})]. \tag{2.26}$$

Each hash function outputs a binary value:

$$h(\mathbf{x}) \in \{-1, 1\}. \tag{2.27}$$

The output of these hash functions is an m-bit binary code: Φ(x) ∈ {−1, 1}^m. These
hash functions are designed to preserve some kind of data similarity in the Hamming space.
Various types of loss functions and hash functions have been used in existing methods.
Generally, the formulation of hashing learning encourages small Hamming distances
for similar data pairs and large Hamming distances for dissimilar data pairs. The
Hamming distance between two binary codes is the number of bits taking different values:

$$d_{\mathrm{hm}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r=1}^{m} \big[ 1 - \delta(h_r(\mathbf{x}_i), h_r(\mathbf{x}_j)) \big], \tag{2.28}$$
in which δ(·, ·) ∈ {0, 1} is an indicator function: it outputs 1 if its two inputs are
equal, and 0 otherwise. Closely related to the Hamming distance, the Hamming affinity
is calculated by the inner product of two binary codes:

$$s_{\mathrm{hm}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r=1}^{m} h_r(\mathbf{x}_i)\, h_r(\mathbf{x}_j). \tag{2.29}$$
In terms of Hamming affinity, hashing learning encourages positive Hamming affinity
values for similar data pairs and negative values for dissimilar data pairs. Loss functions
are typically defined on the basis of the Hamming distance or Hamming affinity of similar
and dissimilar data pairs.
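For m-bit ±1 codes the two quantities are linked by s_hm = m − 2 d_hm, so a loss expressed in one can be rewritten in the other. A minimal numpy sketch of (2.28) and (2.29):

```python
import numpy as np

def hamming_distance(zi, zj):
    """Number of disagreeing bits between two codes in {-1,+1}^m, as in (2.28)."""
    return int(np.sum(zi != zj))

def hamming_affinity(zi, zj):
    """Inner product of two codes in {-1,+1}^m, as in (2.29)."""
    return int(zi @ zj)
```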
Existing hashing methods can be roughly categorized into unsupervised and supervised
methods. Unsupervised methods [32–40] aim to preserve the similarity calculated in
the original feature space, while supervised methods [17, 27–31] aim to preserve
label based similarity. Supervised methods require similarity labels (usually pairwise or
triplet based) for supervised learning.
Various types of loss functions have been used in existing methods. For example, the
Laplacian affinity loss is used in SPH [34], STH [33], AGH [38]; the quantization loss
is used in ITQ [36] and K-means Hashing [37]; the Hamming distance or Hamming
affinity based similarity regression loss is used in MDSH [35], KSH [31], BRE [28], semi-
supervised hashing [17]; a hinge-like loss is used in MLH [29].
Different types of hash functions have been used in existing methods. For example, linear
perceptron functions are used in random projection LSH [32, 62], semi-supervised hashing
[17] and ITQ [36]; kernel functions are used in KSH [31] and KLSH [63]; eigenfunctions
are used in SPH [34] and MDSH [35].
A variety of techniques have been proposed for hashing learning. For example, random
projection is used in LSH [32, 62] and KLSH [63]; spectral graph analysis for exploring
the data manifold is used in SPH [34], MDSH [35], STH [33], AGH [38] and inductive
hashing [64]; vector quantization is used in ITQ [36], K-means Hashing [37] and bilinear
hashing [40]; kernel methods are used in KSH [31] and KLSH [63].
Next, we review a number of popular existing unsupervised and supervised hashing
methods, and discuss their strengths and weaknesses.
2.3.1 Locality-sensitive hashing
Locality-sensitive hashing (LSH) [32] is pioneering work among hashing methods. The
LSH family generates random hash functions for mapping data points into binary codes.
A popular class of LSH uses random projection to generate hash functions [62, 65], which
preserves the cosine similarity calculated in the input feature space. The hash
function generated by random projection is written as:

$$h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). \tag{2.30}$$
Here the vector w and the scalar b are randomly generated. When hash table look-up
is applied for nearest neighbor retrieval, using multiple hash tables improves the
retrieval precision of LSH. LSH has been extended to preserve a variety of similarity
measures, such as the p-norm distance [66], the Mahalanobis distance [67], and kernel
similarity [63].
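A minimal sketch of generating random-projection hash functions of the form (2.30); the function name and tie-breaking convention are ours:

```python
import numpy as np

def make_lsh(d, m, seed=0):
    """Generate m random-projection hash functions of the form (2.30)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, m))   # random projection directions w
    b = rng.standard_normal(m)        # random offsets b
    def hash_fn(X):
        codes = np.sign(X @ W + b)    # (n, m) codes
        codes[codes == 0] = 1         # break ties toward +1
        return codes
    return hash_fn
```

Because the functions are data-independent, nearby points agree on most bits only in expectation, which is why LSH typically needs long codes or multiple tables for good precision.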
Randomly generated hash functions are data-independent. A drawback of LSH is that
it usually requires long binary codes to achieve good retrieval precision. Rather than
randomly generating hash functions as in LSH, data-dependent methods learn meaningful
hash functions from the data, and are thus able to generate more compact and effective
binary codes. Data-dependent methods can be divided into unsupervised and supervised
methods. Next we describe some popular data-dependent methods.
2.3.2 Spectral hashing
Spectral hashing (SPH) [34] is a data-dependent unsupervised method for learning
uncorrelated binary codes that preserve Gaussian similarity. Let {x_1, x_2, . . . , x_n}
denote a set of training examples, and {z_1, z_2, . . . , z_n} denote the corresponding
binary codes. We have z ∈ {−1, +1}^m, where m is the number of bits. The pairwise
similarity information is provided by a similarity matrix Y, whose element y_ij measures
the similarity between the examples x_i and x_j. Here y_ij is calculated by a Gaussian
kernel on the input features:

$$y_{ij} = \exp\Big( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma^2} \Big). \tag{2.31}$$
The Hamming distances between similar neighbors should be small. SPH aims to learn
uncorrelated binary codes which minimize the Hamming distances. It solves the follow-
ing optimization problem to generate the binary codes of the training examples:
$$
\begin{aligned}
\min_{\mathbf{Z}} \quad & \sum_{i,j=1}^{n} y_{ij} \|\mathbf{z}_i - \mathbf{z}_j\|^2 && \text{(2.32a)}\\
\text{s.t.} \quad & \forall i = 1, \dots, n: \; \mathbf{z}_i \in \{-1, +1\}^m, && \text{(2.32b)}\\
& \sum_{i=1}^{n} \mathbf{z}_i = \mathbf{0}, && \text{(2.32c)}\\
& \frac{1}{n} \sum_{i=1}^{n} \mathbf{z}_i \mathbf{z}_i^\top = \mathbf{I}. && \text{(2.32d)}
\end{aligned}
$$
Here Z is the matrix of the binary codes of all training examples, and I is an identity
matrix. The constraint (2.32c) requires balanced binary codes, meaning that the values
of each bit are evenly distributed over the training set; the constraint (2.32d) requires
the bits to be uncorrelated. Uncorrelated codes help to reduce redundancy, and balanced
codes help to distribute examples evenly into hash buckets. The above optimization is
difficult to solve due to the binary constraint (2.32b). After spectral relaxation, which
drops the binary constraints in (2.32b), the optimization can be solved by finding graph
Laplacian eigenvectors [68]. The final binary codes can be generated by thresholding
the Laplacian eigenvectors.
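The relaxed training-code computation can be sketched as follows: build the Gaussian similarity (2.31), form the graph Laplacian, take the eigenvectors with the smallest non-trivial eigenvalues and threshold them. This is an illustrative dense-matrix sketch of the relaxation only (it omits SPH's eigenfunction-based extension to new examples; the function name is ours):

```python
import numpy as np

def sph_train_codes(X, m, sigma=1.0):
    """Spectral-relaxation sketch of (2.32): codes from Laplacian eigenvectors.

    X: (n, d) training data; m: number of bits. Thresholds at zero the m
    eigenvectors with the smallest non-trivial eigenvalues.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Y = np.exp(-sq / sigma ** 2)            # Gaussian similarity (2.31)
    L = np.diag(Y.sum(axis=1)) - Y          # graph Laplacian L = D - Y
    vals, vecs = np.linalg.eigh(L)          # ascending eigenvalues
    V = vecs[:, 1:m + 1]                    # skip the trivial constant eigenvector
    return np.where(V >= 0, 1, -1)          # threshold to binary codes
```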
Solving the above optimization only yields the binary codes of the training examples.
More importantly, we need to know how to generate binary codes for new examples;
in other words, we need to learn hash functions that can be efficiently evaluated on new
examples. In manifold learning, the problem of generating representations for new
examples is called out-of-sample extension. Nyström methods [69, 70] are often applied
for out-of-sample extension, but they are computationally expensive for large datasets.
To enable efficient out-of-sample extension, SPH assumes that the data follow a uniform
distribution. When the data follow a uniform distribution and the pairwise similarity is
defined by a Gaussian kernel, the hash function solutions are a type of eigenfunctions
that have an outer product form and can be efficiently evaluated on new examples. A
new example is evaluated by the learned eigenfunctions and then thresholded to output
binary codes.
SPH has a number of disadvantages. For real-world data, the assumption of a uniform
distribution is usually not true. SPH optimizes the Laplacian affinity loss, which only
pulls similar data pairs together but does not push dissimilar data pairs apart; as shown
in manifold learning, this may lead to inferior performance [71]. Moreover, the constraint
of uncorrelated binary codes may not help to improve similarity search accuracy. Indeed,
in some recent supervised methods, binary codes learned without this constraint show
good similarity preserving performance [17, 29, 31, 44, 45].
2.3.3 Self-taught hashing
Self-taught hashing (STH) [33] applies a two-step learning scheme to hash function
learning. In the first step, STH infers the binary codes of all training examples; in
the second step, it trains binary classifiers as hash functions to fit those binary codes.
For binary code inference, STH optimizes the Laplacian affinity loss, similar to
spectral hashing (SPH) [34], which was described in the previous section.
Let {x_1, x_2, . . . , x_n} denote a set of training examples, and {z_1, z_2, . . . , z_n} denote
the corresponding binary codes. We have z ∈ {−1, +1}^m, where m is the number of
bits. A similarity matrix Y provides the pairwise similarity information; its element
y_ij measures the similarity between the examples x_i and x_j. STH uses local similarity
information [72]: the similarity relations of an example are defined only on a small
number of its nearest neighbors, so the similarity matrix Y is highly sparse. The
pairwise similarity value is calculated using the cosine similarity of the input features
[72]. Let S denote the set of defined pairwise relations. The pairwise similarity values
are written as:

$$\forall (i,j) \in S: \; y_{ij} = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\| \|\mathbf{x}_j\|}, \tag{2.33}$$

$$\forall (i,j) \notin S: \; y_{ij} = 0. \tag{2.34}$$
We introduce a diagonal n × n matrix D whose diagonal entries are given by:

$$D(i,i) = \sum_{j=1}^{n} y_{ij}. \tag{2.35}$$
In the first step, for binary code inference, STH minimizes the following objective:

$$\sum_{i,j=1}^{n} y_{ij} \|\mathbf{z}_i - \mathbf{z}_j\|^2. \tag{2.36}$$

We define a matrix L as:

$$\mathbf{L} = \mathbf{D} - \mathbf{Y}. \tag{2.37}$$

We refer to L as the graph Laplacian [68]. With some simple algebraic manipulation,
the objective (2.36) can be rewritten as:

$$\operatorname{trace}(\mathbf{Z}^\top \mathbf{L} \mathbf{Z}). \tag{2.38}$$
Overall, the optimization for binary code inference is written as:

$$
\begin{aligned}
\min_{\mathbf{Z}} \quad & \operatorname{trace}(\mathbf{Z}^\top \mathbf{L} \mathbf{Z}) && \text{(2.39a)}\\
\text{s.t.} \quad & \mathbf{Z} \in \{-1, +1\}^{n \times m}, && \text{(2.39b)}\\
& \mathbf{Z}^\top \mathbf{D} \mathbf{1} = \mathbf{0}, && \text{(2.39c)}\\
& \mathbf{Z}^\top \mathbf{D} \mathbf{Z} = \mathbf{I}. && \text{(2.39d)}
\end{aligned}
$$

Here Z is the matrix of the binary codes of all training examples, with one row per
example (so that the constraints (2.39c) and (2.39d) are well defined), and I is an
identity matrix. The above optimization looks similar to that of SPH in (2.32), except
that the diagonal matrix D appears in the constraints. The problem can be relaxed by
dropping the binary constraints in (2.39b); the relaxed problem is the Laplacian
eigenmap (LapEig) problem [73] in manifold learning. We solve the generalized
eigenvalue problem

$$\mathbf{L}\mathbf{v} = \lambda \mathbf{D}\mathbf{v} \tag{2.40}$$

to obtain the m eigenvectors [v_1, . . . , v_m] corresponding to the m smallest eigenvalues;
these eigenvectors are the solution of the relaxed problem. The final binary code
solution Z is obtained by thresholding the eigenvectors, with threshold values given by
the medians of the eigenvectors, which results in balanced binary codes.
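The binary code inference step can be sketched as follows, using SciPy's generalized symmetric eigensolver for (2.40) and median thresholding (a dense-matrix illustration; the function name is ours, and we assume a similarity matrix with strictly positive row sums so that D is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def sth_infer_codes(Y, m):
    """Relaxed solution of (2.39): solve Lv = lambda * Dv, then median-threshold.

    Y: (n, n) symmetric similarity matrix (dense here for simplicity).
    Returns (n, m) binary codes; median thresholding balances each bit.
    """
    D = np.diag(Y.sum(axis=1))
    L = D - Y
    vals, vecs = eigh(L, D)                 # generalized eigenproblem (2.40)
    V = vecs[:, 1:m + 1]                    # m smallest non-trivial eigenvectors
    med = np.median(V, axis=0)              # per-bit median threshold
    return np.where(V >= med, 1, -1)
```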
In the second step, STH trains linear SVM binary classifiers as hash functions to fit
the binary codes obtained in the first step. The SVM classification problem is written
as:
$$
\begin{aligned}
\min_{\mathbf{w}, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i && \text{(2.41a)}\\
\text{s.t.} \quad & \forall i = 1, \dots, n: \; z_i \mathbf{w}^\top \mathbf{x}_i \geq 1 - \xi_i, && \text{(2.41b)}\\
& \boldsymbol{\xi} \geq 0. && \text{(2.41c)}
\end{aligned}
$$
We need to train m classifiers in the second step. The learned hash functions have the
following form:

$$h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). \tag{2.42}$$
In contrast to SPH, STH trains simple classifiers as hash functions to solve the
out-of-sample extension problem, and makes no restrictive assumption about the data
distribution. It has been shown empirically that this simple approach to hash function
learning significantly outperforms SPH.
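The second step can be sketched as fitting one linear classifier per bit. To keep the sketch dependency-free we use a ridge-regression stand-in for the linear SVM of (2.41); in practice any linear SVM solver would be used, and the function names are ours:

```python
import numpy as np

def sth_train_hash_functions(X, Z, lam=1e-3):
    """Second STH step: fit one linear classifier per bit to the inferred codes.

    A ridge-regression stand-in for the linear SVM in (2.41). X: (n, d),
    Z: (n, m) codes in {-1,+1}. Returns weights W (d+1, m) incl. a bias row.
    """
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])    # append a bias feature
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d + 1), Xb.T @ Z)
    return W

def sth_hash(X, W):
    """Hash functions of the form (2.42): sign of the linear scores."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    codes = np.sign(Xb @ W)
    codes[codes == 0] = 1                   # break ties toward +1
    return codes
```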
SPH requires Gaussian kernel similarity in order to obtain efficient eigenfunctions. In
contrast, STH is not limited to a particular type of similarity definition. The work in
[30] extends STH to supervised hashing with a label based similarity definition; the
training of supervised STH accepts any user-defined similarity matrix.
The two-step learning scheme in STH provides an interesting paradigm for hashing
learning. Compared to SPH, this approach has a better and simpler solution to the
out-of-sample extension problem. It also avoids the complex optimization of highly
non-convex problems found in some recent methods [28, 29, 31] that directly learn
hash functions.
STH optimizes the Laplacian affinity loss, which only pulls similar data pairs together
but does not push dissimilar data pairs apart; as shown in manifold learning, this may
lead to inferior performance [71]. Moreover, it is not clear how to infer binary codes
under other meaningful loss functions, such as the KSH [31], BRE [28] and MLH [29]
loss functions. In Chapter 6, we propose a general two-step learning framework that
extends STH to a much more general setting. Our approach can easily incorporate any
Hamming distance or Hamming affinity based loss function. For example, using the
KSH or BRE loss function, we are able to outperform STH with its Laplacian affinity
loss function.
2.3.4 Supervised hashing with kernels
Preserving semantic similarity is desirable in many computer vision applications. Images
from the same category or of the same object may be far apart (e.g., in Euclidean
distance) in the feature space because of variations in viewpoint, illumination, scale
and so on. Similarity measured in the input feature space (e.g., Euclidean distance,
cosine similarity, Gaussian affinity) may therefore fail to reveal semantic similarity.
Supervised hashing is designed to preserve label based similarity.
Supervised hashing with kernels (KSH) [31] is a supervised method for learning kernel
hash functions. KSH has shown better performance than many other supervised
methods, e.g., MLH [29] and BRE [28]. KSH minimizes a Hamming affinity loss function
for hash function learning. The Hamming affinity is closely related to the Hamming
distance; however, Hamming distance based loss functions usually require complex
optimization, as in MLH and BRE. As shown in KSH, using a Hamming affinity loss
simplifies the optimization.
Let m denote the number of hash functions to be learned. The output of these m hash
functions is denoted by Φ(x):

$$\Phi(\mathbf{x}) = [h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_m(\mathbf{x})]. \tag{2.43}$$

Each hash function h(x) has binary output: h(x) ∈ {−1, +1}. The Hamming affinity is
calculated by the inner product of two binary codes:

$$s_{\mathrm{hm}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r=1}^{m} h_r(\mathbf{x}_i)\, h_r(\mathbf{x}_j). \tag{2.44}$$
Intuitively, the optimization should encourage positive affinity values for similar data
pairs and negative affinity values for dissimilar data pairs. A similarity matrix Y
encoding pairwise similarity information is provided for supervised learning. Its element
y_ij measures the similarity of the data points x_i and x_j: if y_ij = 1, the two data
points are similar; if y_ij = −1, they are dissimilar; if y_ij = 0, their relation is
undefined. KSH solves the following optimization problem:
$$\min_{\Phi(\cdot)} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ m\, y_{ij} - \sum_{r=1}^{m} h_r(\mathbf{x}_i)\, h_r(\mathbf{x}_j) \Big]^2. \tag{2.45}$$
KSH uses kernel based hash functions. One hash function is written as:

$$h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{q=1}^{Q} w_q\, \kappa(\mathbf{x}'_q, \mathbf{x}) + b \Big). \tag{2.46}$$

Here X' = {x'_1, . . . , x'_Q} is a set of Q support vectors, κ(·, ·) is a kernel function,
w_q is a weighting coefficient and b is a bias term. The implementation of KSH uses
the RBF kernel:

$$\kappa(\mathbf{x}'_q, \mathbf{x}) = \exp\Big( -\frac{\|\mathbf{x} - \mathbf{x}'_q\|^2}{2\sigma^2} \Big), \tag{2.47}$$
in which σ is a predefined kernel parameter. Similar to the Kernelized Locality-Sensitive
Hashing (KLSH) [63] method, these support vectors are uniformly sampled from training
examples.
Before the optimization step, KSH evaluates the kernel responses of all training examples
at the predefined support vectors; in this way the input features are transformed into
kernel features, whose dimension is clearly the number of support vectors. To simplify
the optimization, the kernel features are centered by subtracting their means in all
dimensions, and the bias term in (2.46) is removed. The centered kernel feature for the
q-th support vector is defined as:

$$
\begin{aligned}
\psi_q(\mathbf{x}) &= \bar{\kappa}(\mathbf{x}'_q, \mathbf{x}) && \text{(2.48)}\\
&= \kappa(\mathbf{x}'_q, \mathbf{x}) - \frac{1}{n} \sum_{i=1}^{n} \kappa(\mathbf{x}'_q, \mathbf{x}_i), && \text{(2.49)}
\end{aligned}
$$

where $\bar{\kappa}$ denotes the centered kernel. The kernel features at all support
vectors are collected in Ψ(x):

$$\Psi(\mathbf{x}) = [\psi_1(\mathbf{x}), \psi_2(\mathbf{x}), \dots, \psi_Q(\mathbf{x})]. \tag{2.50}$$
The hash function formulation becomes:

$$h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{q=1}^{Q} w_q\, \psi_q(\mathbf{x}) \Big) = \operatorname{sign}(\mathbf{w}^\top \Psi(\mathbf{x})). \tag{2.51}$$
With this hash function formulation, the optimization (2.45) can be rewritten as:

$$\min_{\mathbf{W}} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ m\, y_{ij} - \sum_{r=1}^{m} \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2. \tag{2.52}$$
We need to solve the above optimization to obtain the kernel weighting coefficients w
of all hash functions. KSH solves it greedily, optimizing one bit at a time based on the
solution of the previous bits. When solving for the r-th bit, we define a residual value
a_ijr as:

$$a_{ijr} = m\, y_{ij} - \sum_{p=1}^{r-1} \operatorname{sign}(\mathbf{w}_p^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_p^\top \Psi(\mathbf{x}_j)). \tag{2.53}$$
Thus the optimization for the r-th bit is written as:

$$\min_{\mathbf{w}_r} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ a_{ijr} - \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2. \tag{2.54}$$
To further simplify the optimization, we have the following relations:

$$
\begin{aligned}
& \Big[ a_{ijr} - \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2 && \text{(2.55)}\\
={}& -2 a_{ijr}\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) + \Big[ \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) \Big]^2 + a_{ijr}^2 && \text{(2.56)}\\
={}& -2 a_{ijr}\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)) + \mathrm{const}, && \text{(2.57)}
\end{aligned}
$$

where the last step uses the fact that the squared sign product is always 1 and that
$a_{ijr}^2$ does not depend on $\mathbf{w}_r$.
Hence, the optimization in (2.54) is equivalent to:

$$\max_{\mathbf{w}_r} \; \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ijr}\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_i))\, \operatorname{sign}(\mathbf{w}_r^\top \Psi(\mathbf{x}_j)). \tag{2.58}$$
To solve the above optimization, KSH first applies spectral relaxation, which essentially
drops the sign function, and solves the relaxed problem to obtain an initial solution w_0.
The relaxed problem is a standard generalized eigenvalue problem whose solution is the
eigenvector corresponding to the largest eigenvalue. KSH then forms a smoothed
problem by replacing the sign function with a sigmoid function, and finally, starting
from the initial solution w_0, solves the smoothed problem with an accelerated gradient
descent algorithm.
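The greedy, bit-by-bit procedure with the spectral-relaxation step only can be sketched as follows. Dropping the sign functions in (2.58) gives max_w wᵀ(ΨᵀAΨ)w, with A the matrix of residuals a_ijr, whose maximizer under a unit-norm constraint is the leading eigenvector. This sketch omits the sigmoid-smoothed refinement, and the function name is ours:

```python
import numpy as np

def ksh_greedy(Psi, Y, m):
    """Greedy bit-by-bit sketch of (2.52), spectral relaxation only.

    Psi: (n, Q) centered kernel features; Y: (n, n) labels in {-1, 0, +1}.
    For each bit, dropping the sign in (2.58) gives max_w w^T (Psi^T A Psi) w,
    solved by the eigenvector of the largest eigenvalue.
    """
    A = m * Y.astype(float)                  # residuals a_ij(r) of (2.53), r = 1
    W = []
    for _ in range(m):
        M = Psi.T @ A @ Psi                  # symmetric since A is symmetric
        vals, vecs = np.linalg.eigh(M)
        w = vecs[:, -1]                      # leading eigenvector
        h = np.sign(Psi @ w)
        h[h == 0] = 1
        A -= np.outer(h, h)                  # update residuals for the next bit
        W.append(w)
    return np.array(W).T                     # (Q, m) weighting coefficients
```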
In Chapter 6, we introduce our general two-step method (TSH), which can easily
incorporate any Hamming affinity or Hamming distance based loss function, including
the KSH loss function described here, the BRE similarity reconstruction loss function,
and the MLH hinge-like loss function. The Hamming affinity based loss in KSH does
simplify the optimization; however, with the two-step decomposition, hashing learning
in our TSH is much simpler than in KSH, which solves for the hash functions directly.
Moreover, the BRE reconstruction loss and the MLH hinge-like loss can also be
efficiently optimized in our TSH framework.
KSH requires a set of predefined support vectors and does not enforce sparsity on the
weighting coefficients, which is impractical for training on large-scale datasets. A small
number of support vectors is usually insufficient for good prediction accuracy; the
default setting in KSH is only 300 support vectors, which is far from enough. However,
a large number of support vectors significantly slows down both the training and testing
of KSH.
In our two-step method TSH, we train binary classifiers to learn the hash functions, so
any sophisticated kernel classifier can be applied. Note that kernel based binary
classification is well studied: for example, LIBSVM [74] is a widely used implementation
of the kernel SVM classifier, and the recent online learning method in [6] is able to
efficiently train a budgeted kernel SVM with any user-defined sparsity. These
sophisticated kernel methods all provide sparse solutions and can be seamlessly applied
in our TSH. Compared to KSH with its naive kernel solution, our TSH is much more
flexible, useful and effective.
Though kernel hash functions can boost performance, they are generally inefficient to
train and test, even with sophisticated training methods; on large-scale,
high-dimensional data, kernel methods become impractically slow. In Chapter 7, we
extend our TSH framework and propose a fast supervised method for hashing learning
on large-scale, high-dimensional data, referred to as FastHash. FastHash efficiently
learns decision trees as hash functions. Compared to kernel hash functions, decision
tree hash functions can easily handle a very large number of training examples with
high dimensionality (tens of thousands), and provide the desired non-linear mapping.
Chapter 3
Fast Training of Effective
Multi-class Boosting Using
Coordinate Descent
In this chapter, we present a novel column generation based boosting method for multi-
class classification [41]. Our multi-class boosting is formulated as a single optimization
problem. Different from most existing multi-class boosting methods, which use the same
set of weak learners for all the classes, we learn a separate set of weak learners for each
class. In other words, the weak learners are class-specific. We show that using these
class-specific weak learners leads to fast convergence, without introducing additional
computational overhead in the training procedure. To further make the training more
efficient and scalable, we also propose a fast coordinate descent method for solving
the optimization problem at each boosting iteration. The proposed coordinate descent
method is conceptually simple and easy to implement in that it has a closed-form solution
for each coordinate update. Experimental results on a variety of datasets show that,
compared to a range of existing multi-class boosting methods, the proposed method
has a much faster convergence rate and better generalization performance in most cases.
We also empirically show that the proposed fast coordinate descent algorithm is able to
efficiently solve the optimization.
3.1 Introduction
Boosting methods combine a set of weak classifiers (weak learners) to form a strong clas-
sifier. Boosting has been extensively studied [75] and applied to a wide range of applica-
tions due to its robustness and efficiency (e.g., real-time object detection [76]). Despite
the fact that most classification tasks are inherently multi-class problems, the majority
of boosting algorithms are designed for binary classification. A popular approach to
multi-class boosting is to split the multi-class problem into a set of binary
classification problems; a simple example is the one-vs-all approach. The well-known
error correcting output coding (ECOC) methods [77] belong to this category, and
AdaBoost.ECC [78], AdaBoost.MH and AdaBoost.MO [79] can all be viewed as examples
of the ECOC approach. The second approach is to directly formulate multi-class
learning as a single learning task based on pairwise model comparisons between
different classes. The direct multi-class boosting formulation of Shen and Hao [1]
(referred to as MultiBoost) is such an example. From the perspective of optimization,
MultiBoost can be seen as an extension of the binary column generation boosting
framework [59, 60] to the multi-class case. Our work here builds upon MultiBoost.
For most existing multi-class boosting methods, including MultiBoost, different classes
share the same set of weak learners. However, a weak learner usually aims to reduce
the error for a particular class. Sharing a weak learner across different classes usually
leads to a sparse solution of the model parameters and hence slow convergence; we
discuss this point in more detail later. To solve this problem, in this work we propose
a novel formulation
(referred to as MultiBoostcw) for multi-class boosting by using separate sets of weak
learners. Namely, each class uses its own set of weak learners. Compared to MultiBoost,
MultiBoostcw converges much faster, generally has better generalization performance
and does not introduce additional time cost for training. Note that AdaBoost.MO
proposed in [79] also uses different sets of weak classifiers for each class. AdaBoost.MO
is based on ECOC and the code matrix in AdaBoost.MO is specified before learning.
Therefore, the underlying dependence between the fixed code matrix and generated
binary classifiers is not explicitly taken into consideration. In contrast, our MultiBoostcw
is based on the direct formulation of multi-class boosting, which leads to fundamentally
different optimization strategies. More importantly, as shown in our experiments, our
MultiBoostcw is much more scalable than AdaBoost.MO, although both enjoy faster
convergence than most other multi-class boosting methods.
MultiBoost requires sophisticated optimization tools such as MOSEK or LBFGS-B [2]
to solve the resulting optimization problem at each boosting iteration, which is not very
scalable. Here we propose a coordinate descent algorithm, termed Fast Coordinate
Descent (FCD), for fast optimization of the resulting problem at each boosting iteration.
Specifically, FCD chooses one variable at a time and efficiently solves the single-variable
sub-problem. The coordinate descent (CD) technique has been applied to solve many
large-scale optimization problems. Yuan et al. [80] present comprehensive empirical
comparisons of ℓ1-regularized classification algorithms, and conclude that CD methods
are very competitive for solving large-scale problems. In the formulation of MultiBoost (also
in our MultiBoostcw), the number of variables is the product of the number of classes
and the number of weak learners, which can be very large (especially when the number
of classes is large). Therefore CD methods may be a better choice for fast optimization
of multi-class boosting. Our method FCD is specially tailored to the optimization of
MultiBoostcw. We are able to obtain a closed-form solution for each variable update,
so the optimization can be extremely fast. Moreover, the proposed FCD is easy to
implement, and no optimization toolbox is required.
3.1.1 Main Contributions
1. We propose a novel multi-class boosting method (MultiBoostcw) that uses
class-specific weak learners. Unlike MultiBoost, which shares a single set of weak
learners across different classes, our method uses a separate set of weak learners
for each class. We generate K (the number of classes) weak learners in each
boosting iteration, one weak learner for each class. With this mechanism, we are
able to achieve much faster convergence.
2. Similar to MultiBoost [1], we employ column generation to implement the boosting
training. We derive the Lagrange dual problem of the new multi-class boosting
formulation which enables us to design fully corrective multi-class algorithms using
the primal-dual optimization technique.
3. We propose a coordinate descent based method (termed as FCD) for fast training
of MultiBoostcw. We obtain an analytical solution for each variable update. We
use the Karush-Kuhn-Tucker (KKT) conditions to derive effective stop criteria
and construct working sets of violated variables for faster optimization. We show
that FCD can be applied not only to fully corrective optimization, which updates
all variables, but also to fast stage-wise optimization, which updates newly added
variables only. The stage-wise optimization of our multi-class boosting is similar
to the optimization in standard AdaBoost.
3.1.2 Notation
A bold lower-case letter (u, v) denotes a column vector. An element-wise inequality between two vectors or matrices, such as u ≥ v, means u_i ≥ v_i for all i. Let us assume that we have K classes. A weak learner is a function that maps an example x to {−1, +1}. We denote each weak learner by φ_{y,j}(·) ∈ C, in which y = 1, ..., K and j = 1, ..., m; C is the space of all possible weak learners, and m is the number of learned weak learners. We define the column vector

Φ_y(x) = [φ_{y,1}(x), φ_{y,2}(x), ..., φ_{y,m}(x)]^⊤    (3.1)
as the outputs of the weak learners associated with the y-th class on example x. We denote by w_y the weak learner coefficients for class y. Then the strong classifier for class y is

f_y(x) = w_y^⊤ Φ_y(x).    (3.2)

We need to learn K strong classifiers, one for each class. Given a test example x, the classification rule is

y* = argmax_y f_y(x).    (3.3)

1 is a vector with all elements being one; its dimension should be clear from the context. Likewise, 0 is a vector with all elements being zero.
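The class-wise prediction rule above can be sketched in a few lines. This is an illustrative sketch rather than the thesis implementation; the weak-learner outputs and coefficients below are made-up values.

```python
import numpy as np

def predict(Phi, W):
    """Predict y* = argmax_y f_y(x) with f_y(x) = w_y^T Phi_y(x).

    Phi: (K, m) array; Phi[y] holds the m weak-learner outputs (each in
         {-1, +1}) associated with class y on a single example x.
    W:   (K, m) array; W[y] is the coefficient vector w_y for class y.
    """
    scores = np.sum(W * Phi, axis=1)  # one score f_y(x) per class
    return int(np.argmax(scores))

# Toy example: K = 3 classes, m = 2 weak learners per class (hypothetical values).
Phi = np.array([[+1, -1], [+1, +1], [-1, -1]])
W = np.array([[0.5, 0.2], [0.4, 0.9], [0.3, 0.1]])
print(predict(Phi, W))  # class 1 wins with score 0.4 + 0.9 = 1.3
```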
3.2 The Proposed Method
We show how to formulate the multi-class boosting problem in the large-margin learning framework. Analogous to MultiBoost, we can define the multi-class margin associated with the training example (x_i, y_i) as

ρ_(i,y) = w_{y_i}^⊤ Φ_{y_i}(x_i) − w_y^⊤ Φ_y(x_i),    (3.4)

for y ≠ y_i. Intuitively, ρ_(i,y) is the difference between the classification score of the right model and that of a "wrong" model. We want to make this margin as large as possible. MultiBoostcw with the exponential loss can be formulated as:

min_{w≥0}  1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp(−ρ_(i,y))    (3.5a)
s.t.  ∀i = 1, ..., n and ∀y ∈ {1, ..., K}\{y_i}:
      w_{y_i}^⊤ Φ_{y_i}(x_i) − w_y^⊤ Φ_y(x_i) = ρ_(i,y).    (3.5b)

As in conventional boosting, w is constrained to be non-negative, and ℓ1-norm regularization is applied to w. The number of constraints is denoted by p; we have p = n × (K − 1), in which n is the number of training examples. The parameter C controls the complexity of the learned model. The model parameter is:

w = [w_1^⊤, w_2^⊤, ..., w_K^⊤]^⊤ ∈ R^{Km×1}.    (3.6)
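For concreteness, the objective (3.5a), with the margin constraints (3.5b) substituted in, can be evaluated directly from the margins. The sketch below is illustrative, with hypothetical array shapes, not the thesis code.

```python
import numpy as np

def multiboostcw_objective(W, Phi_all, labels, C):
    """Evaluate 1^T w + (C/p) * sum_i sum_{y != y_i} exp(-rho_(i,y)).

    W:       (K, m) non-negative coefficients, one row w_y per class.
    Phi_all: (n, K, m) weak-learner outputs; Phi_all[i, y] = Phi_y(x_i).
    labels:  length-n sequence of ground-truth labels y_i.
    """
    n, K, m = Phi_all.shape
    p = n * (K - 1)                               # number of constraints
    scores = np.einsum('ikm,km->ik', Phi_all, W)  # f_y(x_i) for all i, y
    total = 0.0
    for i, yi in enumerate(labels):
        for y in range(K):
            if y != yi:
                rho = scores[i, yi] - scores[i, y]  # margin (3.4)
                total += np.exp(-rho)
    return W.sum() + (C / p) * total
```

As a sanity check, with w = 0 every margin is zero, so the loss term sums to p and the objective equals C.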
Algorithm 1: CG: Column generation for MultiBoostcw
Input: training examples (x_1, y_1), (x_2, y_2), ...; regularization parameter C; the maximum number of boosting iterations.
Output: K discriminant functions f(x, y; w) = w_y^⊤ Φ_y(x), y = 1, ..., K.
1 Initialize: weak learner working sets H_c = ∅ (c = 1, ..., K); initialize ∀(i, y ≠ y_i): µ_(i,y) = 1.
2 repeat
3   solve (3.10) to find K weak learners φ*_c(·), c = 1, ..., K, and add them to the weak learner working sets H_c;
4   solve the primal problem (3.5) on the current weak learner working sets φ_c ∈ H_c, c = 1, ..., K, to obtain w (we use the coordinate descent of Algorithm 2);
5   update the dual variables µ in (3.11) using the primal solution w and the KKT conditions;
6 until the maximum boosting iteration is reached
Minimizing (3.5) encourages the confidence score of the correct label y_i to be larger than the confidence of any other label. We define Y as the set of K labels: Y = {1, 2, ..., K}. The discriminant function f : X × Y → R we need to learn is:

f(x, y; w) = w_y^⊤ Φ_y(x) = Σ_j w_{y,j} φ_{y,j}(x).    (3.7)

The class label prediction y* for an unknown example x is obtained by maximizing f(x, y; w) over y, that is, by finding the class label with the largest confidence:

y* = argmax_y f(x, y; w) = argmax_y w_y^⊤ Φ_y(x).    (3.8)
MultiBoostcw is an extension of MultiBoost [1] for multi-class classification. In MultiBoost, different classes share the same set of weak learners Φ. In contrast, in our MultiBoostcw, each class is associated with a separate set of weak learners. We show that MultiBoostcw learns a more compact model than MultiBoost. MultiBoostcw is a flexible framework that can easily work with different kinds of loss functions; MultiBoostcw with the hinge loss is described in Appendix A.2.
3.2.1 Column generation for MultiBoostcw
To apply column generation to boosting learning, we need to derive the dual problem of (3.5). The details of deriving the dual problem are described in Appendix A.1. The dual problem of (3.5) is written as (3.9), in which c is the index of class labels and µ_(i,y)
Algorithm 2: FCD: Fast coordinate descent for MultiBoostcw
Input: training examples (x_1, y_1), ..., (x_n, y_n); coordinate descent tolerance ε; weak learner sets H_c, c = 1, ..., K; initial value of w; maximum working set iteration τ_max.
Output: w.
1 Initialize: initialize the variable working set S with the variables in w that correspond to newly added weak learners; initialize µ in (3.30); working set iteration index τ = 0.
2 repeat
3   τ = τ + 1; reset the inner loop index: q = 0;
4   while q < |S| (|S| is the size of S) do
5     q = q + 1;
6     pick one variable index j from S: if τ = 1, sequentially pick one, else randomly pick one;
7     compute V− and V+ in (3.35) using µ;
8     update variable w_j via (3.22) using V− and V+;
9     update µ in (3.34) using the updated w_j;
10  compute the violated values θ in (3.29) for all variables;
11  re-construct the variable working set S in (3.31) using θ;
12 until the stop condition (3.32) is satisfied or the maximum iteration is reached: τ ≥ τ_max
is the dual variable associated with one constraint in (3.5b):

max_µ  Σ_i Σ_{y≠y_i} µ_(i,y) [1 − log(p/C) − log µ_(i,y)]    (3.9a)
s.t.  ∀φ(·) ∈ C and ∀c = 1, ..., K:
      Σ_{i: y_i=c} Σ_{y≠y_i} µ_(i,y) φ_{y_i}(x_i) − Σ_i Σ_{y≠y_i, y=c} µ_(i,y) φ_y(x_i) ≤ 1,    (3.9b)
      ∀i = 1, ..., n:  0 ≤ Σ_{y≠y_i} µ_(i,y) ≤ C/p.    (3.9c)
Following the idea of column generation [59], we divide the original problem (3.5) into a master problem and a sub-problem, and solve them alternately. The master problem is a reduced problem of (3.5) which only considers the generated weak learners. The sub-problem is to generate K weak learners (corresponding to the K classes) by finding the most violated constraint of each class in the dual form (3.9), and add them to the master problem. The sub-problem for finding the most violated constraints is written as:

∀c = 1, ..., K:
φ*_c(·) = argmax_{φ_c(·)∈C}  Σ_{i: y_i=c} Σ_{y≠y_i} µ_(i,y) φ_{y_i}(x_i) − Σ_i Σ_{y≠y_i, y=c} µ_(i,y) φ_y(x_i).    (3.10)
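The score inside (3.10) is just a weighted sum of a candidate weak learner's outputs. A minimal sketch (illustrative, not the thesis code; selecting the argmax over a candidate pool is shown only in the comment):

```python
def weak_learner_score(phi_outputs, labels, mu, c):
    """Score of one candidate weak learner phi for class c, as in (3.10).

    phi_outputs: list of outputs phi(x_i) in {-1, +1}, one per example.
    labels:      list of ground-truth labels y_i.
    mu:          dict mapping (i, y), y != y_i, to the dual value mu_(i,y).
    """
    score = 0.0
    for (i, y), m_iy in mu.items():
        if labels[i] == c:        # first sum: examples whose true class is c
            score += m_iy * phi_outputs[i]
        if y == c:                # second sum: constraints with "wrong" label c
            score -= m_iy * phi_outputs[i]
    return score

# For each class c, the sub-problem picks the candidate maximizing this score:
# phi_star_c = max(candidates, key=lambda phi: weak_learner_score(phi, labels, mu, c))
```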
The column generation procedure for MultiBoostcw is described in Algorithm 1. Essen-
tially, we repeat the following two steps until convergence:
1. We solve the master problem (3.5) with φ_c ∈ H_c, c = 1, ..., K, to obtain the primal solution w. H_c is the working set of generated weak learners associated with the c-th class. The dual solution µ* can be calculated from the primal solution w*. According to the KKT conditions, the dual solution is written as:

µ*_(i,y) = (C/p) exp[ w*_y^⊤ Φ_y(x_i) − w*_{y_i}^⊤ Φ_{y_i}(x_i) ].    (3.11)

2. With the dual solution µ*_(i,y), we solve the sub-problem (3.10) to generate K weak learners φ*_c, c = 1, 2, ..., K, and add them to the weak learner working sets H_c.
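The dual update (3.11) in step 1 is an exponential re-weighting of the constraints. A sketch with hypothetical array shapes (not the thesis code):

```python
import numpy as np

def update_duals(W, Phi_all, labels, C):
    """Dual update (3.11): mu_(i,y) = (C/p) * exp(f_y(x_i) - f_{y_i}(x_i)).

    Returns a dict mapping (i, y), y != y_i, to the dual value mu_(i,y).
    """
    n, K, m = Phi_all.shape
    p = n * (K - 1)
    scores = np.einsum('ikm,km->ik', Phi_all, W)  # f_y(x_i) for all i, y
    return {(i, y): (C / p) * np.exp(scores[i, y] - scores[i, yi])
            for i, yi in enumerate(labels) for y in range(K) if y != yi}
```

Constraints with small margins receive large dual weights, so the next round of weak learners concentrates on the labels that are still confused.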
In MultiBoostcw, K weak learners are generated, one per class, in each iteration, while in MultiBoost, only one weak learner is generated in each column generation iteration and shared by all classes. For MultiBoost, as shown in [1], the sub-problem for finding the most violated constraint in the dual form is:

[φ*(·), c*] = argmax_{φ(·)∈C, c∈{1,...,K}}  Σ_{i: y_i=c} Σ_{y≠y_i} µ_(i,y) φ(x_i) − Σ_i Σ_{y≠y_i, y=c} µ_(i,y) φ(x_i).    (3.12)

In MultiBoost, the above problem (3.12) is solved to generate one weak learner in each column generation iteration. Note that solving (3.12) requires searching over all K classes to find the best weak learner φ*, so the computational cost is the same as for MultiBoostcw. This is why MultiBoostcw does not introduce additional training cost compared to MultiBoost.
In general, the solution [w_1, ..., w_K] of MultiBoost is highly sparse; this can be observed in our empirical study. The weak learner generated by solving (3.12) is actually targeted at one class, so using this weak learner across all classes in MultiBoost leads to a very sparse solution. The sparsity of [w_1, ..., w_K] indicates that one weak learner is usually only useful for the prediction of very few classes (typically only one), but useless for most other classes. In this sense, forcing different classes to use the same set of weak learners may not be necessary, and it usually leads to slow convergence. In contrast, using a separate weak learner set for each class, MultiBoostcw tends to have a dense solution w. With K weak learners generated at each iteration, MultiBoostcw converges much faster.
3.2.2 Fast coordinate descent
To further speed up the training, we propose a fast coordinate descent algorithm (FCD) for solving the primal optimization problem (3.5) of MultiBoostcw at each column generation iteration. This efficient algorithm can also be applied to MultiBoost. The details of FCD are presented in Algorithm 2. The high-level idea is simple: FCD works iteratively, and at each iteration (referred to as a working set iteration), we compute the violated value of the KKT conditions for each variable in w, construct a working set of violated variables (denoted S), and then pick variables from S for update (one variable at a time). We also use the violated values to define stop criteria.
Our FCD is a mix of sequential and stochastic coordinate descent. In the first working set iteration, variables are sequentially picked for update (cyclic CD); in later working set iterations, variables are randomly picked (stochastic CD). In the sequel, we present the details of FCD. First, we describe how to update one variable of w by solving a single-variable optimization problem. For notational simplicity, we define:

δΦ_i(y) = Φ_{y_i}(x_i) ⊗ ρ(y_i) − Φ_y(x_i) ⊗ ρ(y),    (3.13)

and

δφ_i(y) = φ_{y_i}(x_i) ⊗ ρ(y_i) − φ_y(x_i) ⊗ ρ(y),    (3.14)

where ρ(y) is the orthogonal label coding vector:

ρ(y) = [δ(y, 1), δ(y, 2), ..., δ(y, K)]^⊤ ∈ {0, 1}^K.    (3.15)
Here δ(y, k) is the indicator function that returns 1 if y = k, and 0 otherwise; ⊗ denotes the tensor product. MultiBoostcw in (3.5) can be equivalently written as:

min_{w≥0}  1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ].    (3.16)
We assume that binary weak learners are used here: φ(x) ∈ {−1, +1}. Let δφ_{i,j}(y) denote the j-th dimension of δΦ_i(y), and let δΦ_{i,\j}(y) denote the remaining dimensions of δΦ_i(y) excluding the j-th. Obviously, δφ_{i,j}(y) only takes three possible values: δφ_{i,j}(y) ∈ {−1, 0, +1}. For the j-th dimension, we define:

D^j_v = { (i, y) | δφ_{i,j}(y) = v, i ∈ {1, ..., n}, y ∈ Y\{y_i} },  v ∈ {−1, 0, +1};    (3.17)

thus D^j_v is the set of constraint indices (i, y) for which the output of δφ_{i,j}(y) is v. Let w_j denote the j-th variable of w, and let w_{\j} denote the remaining variables of w excluding the j-th. Let g(w) be the objective function of the optimization (3.16). g(w) can be decomposed as:
g(w) = 1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ]
     = 1^⊤ w_{\j} + w_j + (C/p) Σ_{i, y≠y_i} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) − w_j δφ_{i,j}(y) ]
     = 1^⊤ w_{\j} + w_j + (C/p) { exp(w_j) Σ_{(i,y)∈D^j_{−1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ]
         + exp(−w_j) Σ_{(i,y)∈D^j_{+1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ]
         + Σ_{(i,y)∈D^j_0} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ] }
     = 1^⊤ w_{\j} + w_j + (C/p) [ exp(w_j) V_− + exp(−w_j) V_+ + V_0 ].    (3.18)
Here we have defined:

V_− = Σ_{(i,y)∈D^j_{−1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ],   V_0 = Σ_{(i,y)∈D^j_0} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ],    (3.19a)
V_+ = Σ_{(i,y)∈D^j_{+1}} exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) ].    (3.19b)
In the variable update step, one variable w_j is picked at a time for updating while the remaining variables w_{\j} are fixed; thus we need to minimize g in (3.18) w.r.t. w_j, which is a single-variable minimization. It can be written as:

min_{w_j≥0}  w_j + (C/p) [ V_− exp(w_j) + V_+ exp(−w_j) ].    (3.20)
The derivative of the objective function in (3.20) with w_j > 0 is:

∂g/∂w_j = 0  ⟹  1 + (C/p) [ V_− exp(w_j) − V_+ exp(−w_j) ] = 0.    (3.21)
By solving (3.21) with the bound constraint w_j ≥ 0, we obtain the analytical solution of the optimization in (3.20) (since V_− > 0):

w*_j = max{ 0,  log( √( V_+ V_− + p²/(4C²) ) − p/(2C) ) − log V_− }.    (3.22)

When C is large, (3.22) can be approximately simplified as:

w*_j = max{ 0,  (1/2) log( V_+ / V_− ) }.    (3.23)
With the analytical solution in (3.22), the update of each dimension of w can be performed extremely efficiently. Note that the main requirement for obtaining the closed-form solution is the use of discrete weak learners.
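The update (3.22) and its large-C limit (3.23) are one-liners; the sketch below is illustrative, not the thesis code. A useful sanity check is to substitute the unclipped result back into the optimality condition (3.21).

```python
import numpy as np

def update_wj(V_minus, V_plus, C, p):
    """Closed-form single-variable update (3.22); assumes V_minus > 0."""
    t = np.sqrt(V_plus * V_minus + p**2 / (4.0 * C**2)) - p / (2.0 * C)
    return max(0.0, np.log(t) - np.log(V_minus))

def update_wj_large_C(V_minus, V_plus):
    """Large-C approximation (3.23), used in the stage-wise setting."""
    return max(0.0, 0.5 * np.log(V_plus / V_minus))

# Sanity check: the unclipped solution should satisfy (3.21).
w = update_wj(V_minus=1.0, V_plus=4.0, C=100.0, p=10.0)
residual = 1.0 + (100.0 / 10.0) * (1.0 * np.exp(w) - 4.0 * np.exp(-w))
print(abs(residual) < 1e-6)  # True
```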
We use the KKT conditions to construct a set of violated variables and to derive meaningful stop criteria. For the optimization of MultiBoostcw (3.16), the KKT conditions are necessary and also sufficient for optimality. The Lagrangian of (3.16) is:

L = 1^⊤ w + (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ] − α^⊤ w.    (3.24)

According to the KKT conditions, w* is optimal for (3.16) if and only if w* satisfies:

w* ≥ 0,  α* ≥ 0,    (3.25)
∀j:  α*_j w*_j = 0,    (3.26)

and

∀j:  ∇_j L(w*) = 0.    (3.27)
For w_j > 0, we have:

∂L/∂w_j = 0  ⟹  1 − (C/p) Σ_i Σ_{y≠y_i} exp[ −w^⊤ δΦ_i(y) ] δφ_{i,j}(y) − α_j = 0.

Considering the complementary slackness α*_j w*_j = 0: if w*_j > 0, we have α*_j = 0; if w*_j = 0, we have α*_j ≥ 0. The optimality conditions can be written as:

∀j:  1 − (C/p) Σ_i Σ_{y≠y_i} exp[ −w*^⊤ δΦ_i(y) ] δφ_{i,j}(y) = 0,  if w*_j > 0;
      1 − (C/p) Σ_i Σ_{y≠y_i} exp[ −w*^⊤ δΦ_i(y) ] δφ_{i,j}(y) ≥ 0,  if w*_j = 0.    (3.28)
For notational simplicity, we define a column vector µ as in (3.30). With the optimality conditions (3.28), we define θ_j in (3.29) as the violated value of the j-th variable of the solution w*:

θ_j = | 1 − (C/p) Σ_i Σ_{y≠y_i} µ_(i,y) δφ_{i,j}(y) |,              if w*_j > 0;
θ_j = max{ 0,  (C/p) Σ_i Σ_{y≠y_i} µ_(i,y) δφ_{i,j}(y) − 1 },    if w*_j = 0,    (3.29)

in which

µ_(i,y) = exp[ −w^⊤ δΦ_i(y) ].    (3.30)
At each working set iteration of FCD, we compute the violated values θ and construct a working set S of violated variables; then we randomly (except in the first iteration) pick one variable from S for update. We repeat the picking |S| times, where |S| is the number of elements in S. S is defined as:

S = { j | θ_j > ε },    (3.31)

where ε is a tolerance parameter. Analogous to [81] and [80], with the definition of the variable violated values θ in (3.29), we can define the stop criterion as:

max_j θ_j ≤ ε,    (3.32)

where ε can be the same tolerance parameter as in the definition of the working set S in (3.31). The stop condition (3.32) says that if the largest violated value is smaller than some threshold, FCD terminates. We can see that using the KKT conditions is actually using the gradient information. An inexact solution for w is acceptable at each column generation iteration, thus we place a maximum iteration number (τ_max in Algorithm 2) on FCD to prevent unnecessary computation.
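The violated values (3.29) and the working set (3.31) reduce to a few vectorized lines. The sketch below assumes the constraints have been flattened into a vector µ and a matrix of δφ values (hypothetical shapes, not the thesis code):

```python
import numpy as np

def violated_values(w, mu, dphi, C, p):
    """theta_j in (3.29) for every variable, from the KKT conditions.

    w:    (d,) current primal variables.
    mu:   (q,) values mu_(i,y) from (3.30), one per constraint (i, y).
    dphi: (q, d) matrix of delta-phi_{i,j}(y) entries in {-1, 0, +1}.
    """
    grad_term = (C / p) * (mu @ dphi)  # (C/p) * sum_{i,y} mu * dphi, per variable j
    return np.where(w > 0,
                    np.abs(1.0 - grad_term),           # interior: gradient must vanish
                    np.maximum(0.0, grad_term - 1.0))  # at the bound: only a descent direction violates

def working_set(theta, eps):
    """Working set S in (3.31); FCD stops when max(theta) <= eps, as in (3.32)."""
    return np.flatnonzero(theta > eps)
```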
We need to compute µ before obtaining θ, but computing µ in (3.30) is expensive. Fortunately, we are able to incrementally update µ after the update of one variable w_j, avoiding the re-computation of (3.30). µ in (3.30) can equivalently be written as:

µ_(i,y) = exp[ −w_{\j}^⊤ δΦ_{i,\j}(y) − w_j δφ_{i,j}(y) ].    (3.33)

The update of µ is then:

µ_(i,y) = µ^old_(i,y) exp[ δφ_{i,j}(y) (w^old_j − w_j) ].    (3.34)

With the definition of µ in (3.33), the values V_− and V_+ for one variable update can be efficiently computed using µ, avoiding the expensive computation in (3.19a) and (3.19b); V_− and V_+ can equivalently be written as:

V_− = Σ_{(i,y)∈D^j_{−1}} µ_(i,y) exp(−w_j),   V_+ = Σ_{(i,y)∈D^j_{+1}} µ_(i,y) exp(w_j).    (3.35)
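The caching scheme in (3.33)–(3.35) can be sketched as follows (illustrative, with µ stored as an array aligned with the vector of δφ_{i,j}(y) values for the chosen j):

```python
import numpy as np

def update_mu(mu, dphi_j, wj_old, wj_new):
    """Incremental update (3.34) of all mu_(i,y) after one change of w_j."""
    return mu * np.exp(dphi_j * (wj_old - wj_new))

def v_terms(mu, dphi_j, wj):
    """V- and V+ in (3.35), recovered from the cached mu values."""
    V_minus = np.sum(mu[dphi_j == -1]) * np.exp(-wj)
    V_plus = np.sum(mu[dphi_j == +1]) * np.exp(wj)
    return V_minus, V_plus
```

Each coordinate update then costs roughly O(q) for q constraints, instead of the full dot products needed to re-evaluate (3.30) from scratch.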
Some discussion on FCD (Algorithm 2) is as follows:
1) Stage-wise optimization is a special case of FCD. Compared to totally corrective optimization, which considers all variables of w for update, stage-wise optimization only considers the newly added variables. We initialize the working set with the newly added variables. In the first working set iteration, we sequentially update the newly added variables. If the maximum working set iteration is set to 1 (τ_max = 1 in Algorithm 2), FCD becomes a stage-wise algorithm. Thus FCD is a generalized algorithm with totally corrective update and stage-wise update as special cases. In the stage-wise setting, a large C (regularization parameter) is usually implicitly enforced, thus we can use the analytical solution in (3.23) for the variable update.
2) Randomly picking one variable for update without any guidance leads to slow local convergence. When the solution gets close to optimality, usually only very few variables need updating, and most picks do not "hit". In column generation (CG), w is initialized with the solution of the last CG iteration; this initialization is already fairly close to optimality. Therefore the slow local convergence of stochastic coordinate descent (CD) is more serious in column generation based boosting. Here we have used the KKT conditions to iteratively construct a working set of violated variables, and only the variables in the working set need updating. This strategy leads to faster CD convergence.
3.3 Experiments
We evaluate our method MultiBoostcw on some UCI datasets and a variety of multi-class image classification applications, including digit recognition, scene recognition, and traffic sign recognition. We compare MultiBoostcw against MultiBoost [1] with the exponential loss, and some popular existing multi-class boosting algorithms: AdaBoost.ECC [78], AdaBoost.MH [79] and AdaBoost.MO [79]. We use FCD as the solver for MultiBoostcw, and LBFGS-B [2] for MultiBoost. We also perform further experiments to evaluate FCD in detail. For all experiments, the best regularization parameter C for MultiBoostcw and MultiBoost is selected from 10² to 10⁵; the tolerance parameter in FCD is set to 0.1 (ε = 0.1). We use MultiBoostcw-1 (CW-1) to denote MultiBoostcw using the stage-wise setting of FCD. The suffix "-1" here means the iteration parameter (τ_max) in Algorithm 2 is set to 1. In MultiBoostcw-1, we fix C to a large value: C = 10⁸.
All experiments are run 5 times. We compare the testing error, the total training
time and solver time on all datasets. The results show that our MultiBoostcw and
MultiBoostcw-1 converge much faster than other methods, use less training time than
MultiBoost, and achieve the best testing error on most datasets.
AdaBoost.MO [79] (Ada.MO) has a similar convergence rate to our method, but it is much slower and becomes intractable for large-scale datasets. We run Ada.MO on some UCI datasets and MNIST; results are shown in Figures 3.1, 3.2 and 3.3. We set a maximum training time (1000 seconds) for Ada.MO; the training time of all other methods is below this maximum on those datasets. If the maximum time is reached, we report the results of the finished iterations.
[Figure: test error, training time, and solver time versus boosting iterations on VOWEL and ISOLET.]
Figure 3.1: Results on 2 UCI datasets: VOWEL and ISOLET. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. The number after each method name is the mean value with standard deviation at the last iteration. Our methods converge much faster and achieve competitive test accuracy. Both the total training time and the solver time of our methods are less than those of MultiBoost [1].
3.3.1 UCI datasets
We use 2 UCI multi-class datasets: VOWEL and ISOLET. For each dataset, we randomly select 75% of the data for training and the rest for testing. Results are shown in Figure 3.1.
[Figure: test error, training time, and solver time versus boosting iterations on USPS and PENDIGITS.]
Figure 3.2: Experiments on 2 handwritten digit recognition datasets: USPS and PENDIGITS. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on all datasets.
3.3.2 Handwritten digit recognition
We use 3 handwritten digit datasets: MNIST, USPS and PENDIGITS. For MNIST, we randomly sample 1000 examples from each class, and use the original test set of 10,000 examples. For USPS and PENDIGITS, we randomly select 75% of the data for training and the rest for testing. Results are shown in Figures 3.2 and 3.3.
[Figure: test error, training time, and solver time versus boosting iterations on MNIST.]
Figure 3.3: Experiments on the handwritten digit recognition dataset MNIST. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on this dataset.
3.3.3 Three Image datasets: PASCAL07, LabelMe, CIFAR10
For PASCAL07, we use the 5 types of features provided in [82]. For LabelMe, we use the subset LabelMe-12-50k¹ [83] and generate GIST features. For these two datasets, we use those images which have only one class label. We use 70% of the data for training and the rest for testing. For CIFAR10², we construct 2 datasets: one uses GIST features and the other uses the raw pixel values. We use the provided test set and 5 training sets for the 5 runs. Results are shown in Figure 3.5.
3.3.4 Scene recognition
We use 2 scene image datasets: Scene15 [84] and SUN [3]. For Scene15, we randomly
select 100 images per class for training, and the rest for testing. We generate histograms
1 http://www.ais.uni-bonn.de/download/datasets.html
2 http://www.cs.toronto.edu/~kriz/cifar.html
[Figure: test error and solver time versus boosting iterations on GTSRB.]
Figure 3.4: Results on a traffic sign dataset: GTSRB. CW and CW-1 (stage-wise setting) are our methods. Our methods converge much faster, achieve the best test error and use less training time.
of code words as features. The code book size is 200. An image is divided into 31 sub-windows in a spatial hierarchical manner. We generate a histogram in each sub-window; thus the histogram feature dimension is 6200. For the SUN dataset, we construct a subset of the original dataset containing 25 categories. For each category, we use the top 200 images, and randomly select 80% of the data for training and the rest for testing. We use the HOG features described in [3]. Results are shown in Figure 3.6.
3.3.5 Traffic sign recognition
We use the GTSRB³ traffic sign dataset. There are 43 classes and more than 50,000 images. We use the provided 3 types of HOG features; thus there are 6052 features in total. We randomly select 100 examples per class for training and use the original test set. Results are shown in Figure 3.4.
3.3.6 FCD evaluation
We perform further experiments to evaluate FCD with different parameter settings, and compare it to the LBFGS-B [2] solver. We use 3 datasets in this section: VOWEL, USPS and SCENE15. We run FCD with different settings of the maximum working set iteration (τ_max in Algorithm 2) to evaluate how this setting affects the performance of FCD. We also run the LBFGS-B [2] solver on the same optimization (3.5) as FCD. We set C = 10⁴ for all cases. Results are shown in Figure 3.7. For LBFGS-B, we use the default convergence setting to obtain a moderately accurate solution. The number after "FCD" in the figure is the setting of τ_max in Algorithm 2 for FCD. Results show that the stage-wise case (τ_max = 1) of FCD is the fastest, as expected. When we set τ_max ≥ 2, the objective value of the optimization (3.5) converges much faster than with LBFGS-B. Thus setting τ_max = 2 is sufficient to achieve a very accurate solution, while at the same time giving faster convergence and less running time than LBFGS-B.

3 http://benchmark.ini.rub.de/
3.4 Conclusion
We have presented a novel multi-class boosting method based on the column generation technique. Different from most existing multi-class boosting methods, we train a separate set of weak learners for each class, which results in much faster convergence. We also develop an efficient coordinate descent method for solving the optimization. A wide range of experiments demonstrates that the proposed multi-class boosting achieves competitive testing accuracy, converges much faster, and has fast training speed due to the proposed efficient coordinate descent algorithm.
[Figure: test error and solver time versus boosting iterations on PASCAL07, LABELME-SUB, CIFAR10-GIST and CIFAR10-RAW.]
Figure 3.5: Experiments on 3 image datasets: PASCAL07, LabelMe and CIFAR10. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time.
[Figure: test error, training time, and solver time versus boosting iterations on SCENE15 and SUN-25.]
Figure 3.6: Experiments on 2 scene recognition datasets: SCENE15 and a subset of SUN. CW and CW-1 are our methods. CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time.
[Figure: objective function value and solver time versus boosting iterations on VOWEL and USPS, for FCD with various τ_max settings and for LBFGS-B.]
Figure 3.7: Solver comparison between FCD with different parameter settings and LBFGS-B [2]. One column per dataset. The number after "FCD" is the setting of the maximum iteration (τ_max) of FCD. The stage-wise setting of FCD is the fastest. See the text for details.
Chapter 4
StructBoost: Boosting Methods
for Predicting Structured Output
Variables
Boosting is a method for learning a single accurate predictor by linearly combining a set
of less accurate weak learners. Recently, structured learning has found many applications
in computer vision. Inspired by structured support vector machines (SSVM), here we
propose a new boosting algorithm for structured output prediction, which we refer to
as StructBoost [42]. StructBoost supports nonlinear structured learning by combining
a set of weak structured learners.
As SSVM generalizes SVM, our StructBoost generalizes standard boosting approaches
such as AdaBoost or LPBoost to structured learning. The resulting optimization problem of StructBoost is more challenging than that of SSVM in the sense that it may involve
exponentially many variables and constraints. In contrast, for SSVM one usually has
an exponential number of constraints and a cutting-plane method is used. In order to
efficiently solve StructBoost, we formulate an equivalent 1-slack formulation and solve
it using a combination of cutting planes and column generation. We show the versa-
tility and usefulness of StructBoost on a range of problems such as optimizing the tree
loss for hierarchical multi-class classification, optimizing the Pascal overlap criterion
for robust visual tracking and learning conditional random field parameters for image
segmentation.
50 Chapter 4 StructBoost: Boosting Methods for Predicting Structured Output Variables
4.1 Introduction
Structured learning has attracted considerable attention in machine learning and com-
puter vision in recent years (see, for example [4, 7, 10, 15]). Conventional supervised
learning problems, such as classification and regression, aim to learn a function that
predicts a single best output value y ∈ R for an input vector x ∈ R^d. In many applications, however, the outputs are often complex and cannot be well represented by a single scalar; the most appropriate outputs are objects (vectors, sequences, trees,
etc.). The components of the output are inter-dependent. Such problems are referred
to as structured output prediction.
Structured support vector machines (SSVM) [7] generalize the multi-class SVM of [85]
and [61] to the much broader problem of predicting interdependent and structured out-
puts. SSVM uses discriminant functions that take advantage of the dependencies and
structure of outputs. In SSVM, the general form of the learned discriminant function is f(x,y;w) : X × Y → R over input-output pairs, and the prediction is achieved by maximizing f(x,y;w) over all possible y ∈ Y. Note that to introduce non-linearity, the
discriminant function can be defined by an implicit feature mapping function that is
only accessible as a particular inner product in a reproducing kernel Hilbert space. This
is the so-called kernel trick.
On the other hand, boosting algorithms linearly combine a set of moderately accurate
weak learners to form a nonlinear strong predictor, whose prediction performance is
usually highly accurate. Recently, Shen and Hao [1] proposed a direct formulation for
multi-class boosting using the loss functions of multi-class SVMs [61, 85]. Inspired by the
general boosting framework of Shen and Li [59], they implemented multi-class boosting
using column generation. Here we go further by generalizing multi-class boosting of
Shen and Hao to broad structured output prediction problems. StructBoost thus enables
nonlinear structured learning by combining a set of weak structured learners.
The effectiveness of SSVM has been limited by the fact that only the linear kernel is
typically used. This limitation arises largely as a result of the computational expense
of training and applying kernelized SSVMs. Nonlinear kernels often deliver improved
prediction accuracy over that of linear kernels, but at the cost of significantly higher
memory requirements and computation time. This is particularly the case when the
training size is large, because the number of support vectors is linearly proportional to
the size of training data [86]. Boosting, however, learns models which are much faster
to evaluate. Boosting can also select relevant features during the course of learning by
using particular weak learners such as decision stumps or decision trees, while almost all
nonlinear kernels are defined on the entire feature space. It thus remains difficult (if not
impossible) to see how kernel methods can select/learn explicit features. For boosting,
the learning procedure also selects or induces relevant features. The final model learned
by boosting methods is thus often significantly simpler and computationally cheaper. In
this sense, the proposed StructBoost possesses the advantages of both nonlinear SSVM
and boosting methods.
4.1.1 Main contributions
The main contributions of this work are three-fold.
1. We propose StructBoost, a new fully-corrective boosting method that combines a
set of weak structured learners for predicting a broad range of structured outputs.
We also discuss special cases of this general structured learning framework, includ-
ing multi-class classification, ordinal regression, optimization of complex measures
such as the Pascal image overlap criterion and conditional random field (CRF)
parameters learning for image segmentation.
2. To implement StructBoost, we adapt the efficient cutting-plane method—originally
designed for efficient linear SVM training [87]—for our purpose. We equivalently
reformulate the n-slack optimization to 1-slack optimization.
3. We apply the proposed StructBoost to a range of computer vision problems and
show that StructBoost can indeed achieve state-of-the-art performance in some
of the key problems in the field. In particular, we demonstrate a state-of-the-art
object tracker trained by StructBoost. We also demonstrate an application for
CRF and super-pixel based image segmentation, using StructBoost together with
graph cuts for CRF parameter learning.
Since StructBoost builds upon the fully corrective boosting of Shen and Li [59], it inherits
the desirable properties of column generation based boosting, such as a fast convergence
rate and clear explanations from the primal-dual convex optimization perspective.
4.1.2 Related work
The current state-of-the-art structured learning methods are CRF [88] and SSVM [7],
which capture the interdependency among output variables. Note that CRFs formulate
global training for structured prediction as a convex optimization problem. SSVM also
follows this path but employs a different loss function (hinge loss) and optimization
methods. Our StructBoost is directly inspired by SSVM. StructBoost can be seen as
an extension of boosting methods to structured prediction. It therefore builds upon
the column generation approach to boosting from [59] and the direct formulation for
multi-class boosting [1]. Indeed, we show that the multi-class boosting of [1] is a special
case of the general framework presented here.
CRF and SSVM have been applied to various problems in machine learning and com-
puter vision mainly because the learned models can easily integrate prior knowledge
about the problem structure. For example, the linear chain CRF has been widely used
in natural language processing [88, 89]. SSVM takes the context into account using the
joint feature maps over the input-output pairs, where features can be represented equiv-
alently as in CRF [87]. CRF is particularly of interest in computer vision for its success
in semantic image segmentation [90]. A critical issue in semantic image segmentation is
to integrate local and global features for the prediction of local pixel/segment labels. Se-
mantic segmentation is achieved by exploiting the class information with a CRF model.
SSVM can also be used for similar purposes as demonstrated in [91]. Blaschko and
Lampert [10] trained SSVM models to predict the bounding box of objects in a given
image, by optimizing the Pascal bounding box overlap score. The work in [4] introduced
structured learning to real-time object detection and tracking, which also optimizes the
Pascal box overlap score. SSVM has also been used to learn statistics that capture the
spatial arrangements of various object classes in images [92]. The trained model can
then simultaneously predict a structured labeling of the entire image. Based on the idea
of large-margin learning in SSVM, Szummer et al. [8] learned optimal parameters of a
CRF, avoiding tedious cross validation. The survey of [15] provided a comprehensive
review of structured learning and its application in computer vision.
Dietterich et al. [93] learned the CRF energy functions using gradient tree boosting.
There the functional gradient of the CRF conditional likelihood is calculated, such that
a regression tree (weak learner) is induced as in gradient boosting. An ensemble of trees
is produced by iterating this procedure. In contrast, here we show that it is possible to
learn CRF parameters within the large-margin framework, by generalizing the work of
[8, 9] where CRF parameters are learned using SSVM. In our case, we do not require
approximations such as pseudo-likelihood. Another relevant work is [94], where Munoz
et al. used the functional gradient boosting methodology to discriminatively learn max-
margin Markov networks (M3N), as proposed by Taskar et al. [49]. The random fields’
potential functions are learned following gradient boosting [95].
There are a few structured boosting methods in the literature. As we discuss here, all of them are based on gradient boosting, and thus are not as general as the method we propose here. Ratliff et al. [96, 97] proposed a boosting-based approach for imitation
learning based on structured prediction, called maximum margin planning (MMP). Their
method is named MMPBoost. To train MMPBoost, a demonstrated policy is provided
as example behavior as the input, and the problem is to learn a function over features
of the environment that produces policies with similar behavior. Although MMPBoost
is structured learning in that the output is a vector, it differs from ours fundamentally.
First, the optimization procedure of MMPBoost is not directly defined on the joint
function f(x,y;w). Second, MMPBoost is based on gradient descent boosting [95], and
StructBoost is built upon fully corrective boosting of Shen and Li [59, 98].
Parker et al. [99] have also successfully applied gradient tree boosting to learning se-
quence alignment. Later, Parker [100] developed a margin-based structured perceptron
update and showed that it can incorporate general notions of misclassification cost as
well as kernels. In these methods, the objective function typically consists of an expo-
nential number of terms that correspond to all possible pairs of (y,y′). Approxima-
tion is required to make the computation of gradient tractable [99]. Wang et al. [101]
learned a local predictor using standard methods, e.g., SVM, but then achieved im-
proved structured classification by exploiting the influence of misclassified components
after structured prediction, and iteratively re-training the local predictor. This approach
is heuristic and it is more like a post-processing procedure—it does not directly optimize
the structured learning objective.
4.1.3 Notation
A bold lower-case letter (u, v) denotes a column vector. An element-wise inequality
between two vectors or matrices such as u ≥ v means ui ≥ vi for all i. Let x be an
input, y be an output, and the input-output pair be (x,y) ∈ X × Y, with X ⊂ R^d. Unlike classification (Y = {1, 2, . . . , k}) or regression (Y ⊂ R) problems, we are interested in the case where elements of Y are structured variables such as vectors, strings, or graphs.
Recall that the proposed StructBoost is a structured boosting method, which combines
a set of weak structured learners (or weak compatibility functions). We denote by C
the domain of all possible weak structured learners. Note that C is typically very large,
or even infinite. Each weak structured learner ψ(·, ·) ∈ C is a function that maps an
input-output pair (x,y) to a scalar value which measures the compatibility of the input
and output. We define column vector
Ψ(x,y) = [ψ1(x,y), · · · , ψm(x,y)]> (4.1)
to be the vector of outputs of all weak structured learners. Thus Ψ(x,y) plays the same role as the joint mapping vector in SSVM, which relates input x and output y. The form
of a weak structured learner is task-dependent. We show some examples of ψ(·, ·) in
Section 4.3. The discriminant function that we aim to learn is f : X × Y → R, which measures the compatibility over input-output pairs. It has the form of
f(x,y;w) = w>Ψ(x,y) = Σ_j wj ψj(x,y),   (4.2)
with w ≥ 0. As in other structured learning models, the process for predicting a struc-
tured output (or inference) is to find an output y that maximizes the joint compatibility
function:
y* = argmax_y f(x,y;w) = argmax_y w>Ψ(x,y).   (4.3)
We denote by 1 a column vector of all 1’s, whose dimension shall be clear from the
context.
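For a finite, enumerable output set Y, the scoring in (4.2) and the prediction rule in (4.3) can be sketched in a few lines. The two weak compatibility functions and the weights below are hypothetical, chosen purely for illustration:

```python
import numpy as np

def predict(x, Y, weak_learners, w):
    """Structured prediction (4.3): return the y in Y maximizing w^T Psi(x, y),
    where Psi(x, y) collects the outputs of all weak structured learners (4.1)."""
    def score(y):
        psi = np.array([psi_j(x, y) for psi_j in weak_learners])  # Psi(x, y)
        return float(w @ psi)                                     # f(x, y; w)
    return max(Y, key=score)

# Toy example: outputs are class indices; the weak learners are made up.
weak_learners = [lambda x, y: float(x[y]),            # compatibility via x[y]
                 lambda x, y: 1.0 if y == 0 else 0.0]
x = np.array([0.2, 0.9, 0.4])
w = np.array([1.0, 0.5])                              # non-negative weights
best_y = predict(x, [0, 1, 2], weak_learners, w)      # scores: 0.7, 0.9, 0.4
```

In real structured tasks Y is too large to enumerate, and the `max` above is replaced by task-specific inference such as dynamic programming or graph cuts.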
We describe the StructBoost approach in Section 4.2, including how to efficiently solve
the resulting optimization problem. We then highlight applications in various domains
in Section 4.3. Experimental results are shown in Section 4.4 and we conclude this
chapter in the last section.
4.2 Structured boosting
We first introduce the general structured boosting framework, and then apply it to a
range of specific problems: classification, ordinal regression, optimizing special criteria
such as the area under the ROC curve and the Pascal image area overlap ratio, and
learning CRF parameters.
To measure the accuracy of prediction we use a loss function, and as is the case with
SSVM, we accept arbitrary loss functions ∆ : Y × Y → R. ∆(y, z) calculates the loss associated with a prediction z against the true label value y. Note that in general we assume that ∆(y,y) = 0, ∆(y, z) > 0 for any z ≠ y, and that the loss is upper bounded.
The formulation of StructBoost can be written as (n-slack primal):
min_{w≥0, ξ≥0}  1>w + (C/n) 1>ξ   (4.4a)
s.t. ∀i = 1, . . . , n and ∀y ∈ Y :
w>[Ψ(xi,yi) − Ψ(xi,y)] ≥ ∆(yi,y) − ξi.   (4.4b)
Here we have used the ℓ1 norm as the regularization function to control the complexity of the learned model. To simplify the notation, we introduce
δψi(y) = ψ(xi,yi)− ψ(xi,y); (4.5)
and,
δΨi(y) = Ψ(xi,yi)−Ψ(xi,y); (4.6)
then the constraints in (4.4) can be re-written as:
w>δΨi(y) ≥ ∆(yi,y)− ξi. (4.7)
There are two major obstacles to solving problem (4.4). First, as in conventional boosting,
because the set of weak structured learners ψ(·, ·) can be exponentially large or even
infinite, the dimension of w can be exponentially large or infinite. Thus, in general, we
are not able to directly solve for w. Second, as in SSVM, the number of constraints
(4.4b) can be extremely or infinitely large. For example, in the case of multi-label or multi-class classification, the label y can be represented as a binary vector (or string), and the number of possible y is clearly exponential in the length of the vector, up to 2^|Y|. In other words, problem (4.4) can have an extremely or infinitely
large number of variables and constraints. This is significantly more challenging than
solving standard boosting or SSVM in terms of optimization. In standard boosting, one
has a large number of variables while in SSVM, one has a large number of constraints.
For the moment, let us put aside the difficulty of the large number of constraints, and
focus on how to iteratively solve for w using column generation as in boosting methods
[59, 60, 98]. We derive the Lagrange dual of the optimization of (4.4) as:
max_{µ≥0}  Σ_{i,y} µ(i,y) ∆(yi,y)   (4.8a)
s.t. ∀ψ ∈ C :  Σ_{i,y} µ(i,y) δψi(y) ≤ 1,   (4.8b)
∀i = 1, . . . , n :  0 ≤ Σ_y µ(i,y) ≤ C/n.   (4.8c)
Here µ are the Lagrange dual variables (Lagrange multipliers). We denote by µ(i,y) the
dual variable associated with the margin constraints (4.4b) for label y and training pair
(xi,yi). Details for deriving the dual problem are described in Appendix B.1.
The idea of column generation is to split the original primal problem in (4.4) into two
problems: a master problem and a subproblem. The master problem is the original
problem with only a subset of variables (or constraints for the dual form) being consid-
ered. The subproblem is to add new variables (or constraints for the dual form) into the
master problem. With the primal-dual pair of (4.4) and (4.8) and following the general
framework of column generation based boosting [59, 60, 98], we obtain our StructBoost
as follows:
Iterate the following two steps until convergence :
1. Solve the following subproblem, which generates the best weak structured learner
by finding the most violated constraint in the dual:
ψ*(·, ·) = argmax_{ψ(·,·)}  Σ_{i,y} µ(i,y) δψi(y).   (4.9)
2. Add the selected structured weak learner ψ?(·, ·) into the master problem (either
the primal form or the dual form) and re-solve for the primal solution w and dual
solution µ.
The stopping criterion can be that no violated weak learner can be found. Formally, for the selected ψ*(·, ·) from (4.9) and a preset precision εcg > 0, if the following relation holds:
Σ_{i,y} µ(i,y) δψ*_i(y) ≤ 1 − εcg,   (4.10)
we terminate the iteration. Algorithm 3 presents the details of column generation for StructBoost. This approach, however, may not be practical, because it is very expensive to solve the master problem (the reduced problem of (4.4)) at each column generation step (boosting iteration), which can still have extremely many constraints due to the size of
the set y ∈ Y. The direct formulation for multi-class boosting in [1] can be seen as a
specific instance of this approach, which is in general very slow. We therefore propose
to employ the 1-slack formulation for efficient optimization, which is described in the
next section.
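The two alternating steps above can be summarized in a short skeleton. The subproblem and master solvers are task-specific and therefore left abstract here: `find_weak_learner` and `solve_master` are hypothetical callables standing in for (4.9) and the reduced problem of (4.4), respectively.

```python
import numpy as np

def column_generation(find_weak_learner, solve_master,
                      max_iters=100, eps_cg=1e-4):
    """Schematic sketch of StructBoost column generation.

    find_weak_learner(mu) -> (psi, violation): solves subproblem (4.9),
        returning the best weak learner and sum_{i,y} mu(i,y) delta_psi_i(y).
    solve_master(learners) -> (w, mu): re-solves the master problem on the
        current weak learner set, returning primal and dual solutions.
    """
    learners, w, mu = [], np.zeros(0), None
    for _ in range(max_iters):
        psi, violation = find_weak_learner(mu)
        # stopping criterion (4.10): no sufficiently violated weak learner
        if mu is not None and violation <= 1.0 - eps_cg:
            break
        learners.append(psi)
        w, mu = solve_master(learners)
    return learners, w
```

The loop mirrors Algorithm 3: generate a weak learner from the dual solution, add it, and re-solve the master problem until (4.10) holds or the iteration budget is exhausted.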
4.2.1 1-slack formulation for fast optimization
Inspired by the cutting-plane method for fast training of linear SVM [87] and SSVM
[52], we rewrite the above problem into an equivalent “1-slack” form so that the efficient
cutting-plane method can be employed to solve the optimization problem (4.4):
min_{w≥0, ξ≥0}  1>w + Cξ   (4.11a)
s.t. ∀c ∈ {0, 1}^n and ∀y ∈ Y, i = 1, · · · , n :
(1/n) w>[Σ_{i=1}^n ci · δΨi(y)] ≥ (1/n) Σ_{i=1}^n ci ∆(yi,y) − ξ.   (4.11b)
The following theorem shows the equivalence of problems (4.4) and (4.11).
Theorem 4.1. A solution of problem (4.11) is also a solution of problem (4.4), and vice versa. The connections are: w*_(4.11) = w*_(4.4) and ξ*_(4.11) = (1/n) 1>ξ*_(4.4).
Proof. This proof adapts the proof in [87]. Given a fixed w, the slack variables ξ_(4.4) in (4.4) can be solved for as
ξ_{i,(4.4)} = max_y { 0, ∆(yi,y) − w>δΨi(y) },  ∀i.
For (4.11), the optimal ξ_(4.11) given a w can be computed as:
ξ_(4.11) = (1/n) max_{c,y} [ Σ_{i=1}^n ci ∆(yi,y) − w>( Σ_{i=1}^n ci δΨi(y) ) ]
= (1/n) Σ_{i=1}^n max_{ci∈{0,1}, y} [ ci ∆(yi,y) − ci w>δΨi(y) ]
= (1/n) Σ_{i=1}^n max_y { 0, ∆(yi,y) − w>δΨi(y) }
= (1/n) 1>ξ_(4.4).
Note that c ∈ {0, 1}^n in the above equalities. Clearly the objective functions of both problems coincide for any fixed w and the optimal ξ_(4.4) and ξ_(4.11).
As demonstrated in [87] and SSVM [52], cutting-plane methods can be used to solve the
1-slack primal problem (4.11) efficiently. This 1-slack formulation has been used to train
linear SVM in linear time. When solving for w, (4.11) is similar to ℓ1-norm regularized SVM, apart from the extra non-negativity constraint on w in our case.
In order to utilize column generation for designing boosting methods, we need to derive
the Lagrange dual of the 1-slack primal optimization problem, which can be written as
Algorithm 3: Column generation for StructBoost
Input: training examples (x1,y1), (x2,y2), · · · ; trade-off parameter C; termination threshold εcg; the maximum iteration number.
Output: the discriminant function f(x,y;w) = w>Ψ(x,y).
1 Initialize: for each i (i = 1, . . . ,m), randomly pick any y(0)_i ∈ Y; initialize µ(i,y) = C/n for y = y(0)_i, and µ(i,y) = 0 for all y ∈ Y \ {y(0)_i}.
2 repeat
3   Find and add a weak structured learner ψ*(·, ·) by solving the subproblem (4.9) or (4.13);
4   Call Algorithm 4 to obtain w and µ;
5 until either (4.10) is met or the maximum number of iterations is reached;
follows:
max_{λ≥0}  Σ_{c,y} λ(c,y) Σ_{i=1}^n ci ∆(yi,y)   (4.12a)
s.t. ∀ψ ∈ C :  (1/n) Σ_{c,y} λ(c,y) [Σ_{i=1}^n ci · δψi(y)] ≤ 1,   (4.12b)
0 ≤ Σ_{c,y} λ(c,y) ≤ C.   (4.12c)
Here c enumerates all possible vectors in {0, 1}^n. We denote by λ(c,y) the Lagrange dual variable (Lagrange multiplier) associated with the inequality constraint in (4.11b) for c ∈ {0, 1}^n and label y. Details for deriving the dual problem are described in Appendix B.1. The subproblem to find the most violated constraint in the dual form for generating
weak structured learners is:
ψ*(·, ·) = argmax_{ψ(·,·)∈C}  Σ_{c,y} λ(c,y) Σ_i ci δψi(y)
= argmax_{ψ(·,·)∈C}  Σ_{i,y} µ(i,y) δψi(y),  with µ(i,y) := Σ_c λ(c,y) ci.   (4.13)
We have changed the order of summation in order to achieve a similar form as in the
n-slack case.
4.2.2 Cutting-plane optimization for the 1-slack primal
Despite the extra non-negativity constraint w ≥ 0 in our case, it is straightforward to
apply the cutting-plane method in [87] for solving our problem (4.11). The cutting-plane
algorithm for StructBoost is presented in Algorithm 4. A key step in Algorithm 4 is to
solve the maximization for finding an output y′ that corresponds to the most violated
Algorithm 4: Cutting planes for solving the 1-slack primal
Input: cutting-plane tolerance εcp; inputs from Algorithm 3.
Output: w and µ.
1 Initialize: working set W ← ∅; ci = 1, y′i ← any element in Y, for i = 1, . . . ,m.
2 repeat
3   W ← W ∪ {(c1, . . . , cm, y′1, . . . , y′m)};
4   obtain primal and dual solutions w, ξ, λ by solving (4.11) on the current working set W;
5   for i = 1, . . . ,m do
6     y′i = argmax_y ∆(yi,y) − w>δΨi(y);
7     ci = 1 if ∆(yi,y′i) − w>δΨi(y′i) > 0, and ci = 0 otherwise;
8 until (1/n) w>[Σ_{i=1}^n ci δΨi(y′i)] ≥ (1/n) Σ_{i=1}^n ci ∆(yi,y′i) − ξ − εcp;
9 update µ(i,y) = Σ_c λ(c,y) ci for all (c,y) ∈ W;
constraint for every xi (the inference step):
y′i = argmax_y  ∆(yi,y) − w>δΨi(y).   (4.14)
The above maximization problem takes a similar form to that of the output prediction
problem in (4.3). They only differ in the loss term ∆(yi,y). Typically these two prob-
lems can be solved using the same strategy. This inference step usually dominates the
running time for a few applications, e.g., in the application of image segmentation. In
the experiment section, we empirically show that solving (4.11) using cutting planes can
be significantly faster than solving (4.4). Here improved cutting-plane methods such as
[102] can also be adapted to solve our optimization problem at each column generation
boosting iteration.
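For a small, enumerable output set, the loss-augmented inference in (4.14) reduces to a one-line search. The toy multi-class setup below (a one-hot Ψ and a 0/1 loss, both made up for illustration) shows the idea; real applications replace the exhaustive search with task-specific inference:

```python
import numpy as np

def most_violated(x_i, y_i, Y, Psi, Delta, w):
    """Find y' maximizing Delta(y_i, y) - w^T [Psi(x_i, y_i) - Psi(x_i, y)],
    i.e., the most violated margin constraint for example (x_i, y_i)."""
    def violation(y):
        delta_psi = Psi(x_i, y_i) - Psi(x_i, y)
        return Delta(y_i, y) - float(w @ delta_psi)
    return max(Y, key=violation)

# Toy multi-class instance: Psi(x, y) is a one-hot coding of y.
Psi = lambda x, y: np.eye(3)[y]
Delta = lambda y, z: 0.0 if y == z else 1.0
w = np.array([1.0, 0.5, 1.0])
y_prime = most_violated(None, 0, [0, 1, 2], Psi, Delta, w)
```

Here the true label is class 0; class 2 has both unit loss and a score as high as class 0, so it yields the largest violation.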
In terms of implementation of the cutting-plane algorithm, as mentioned in SSVM [52],
a variety of design decisions can have a substantial influence on the practical efficiency of
the algorithm. We have considered some of these design decisions in our implementation.
In our case, we need to call the cutting-plane optimization at each column generation
iteration. Consideration of warm-start initialization between two consecutive column
generations can substantially speed up the training. We re-use the working set in the
cutting-plane algorithm from previous column generation iterations. Finding a new weak
learner in (4.13) is based on the dual solution µ. We need to ensure that the solution of
cutting-plane is able to reach a sufficient precision, such that the generated weak learner
is able to “make progress”. Thus, we can adapt the stopping criterion parameter εcp
in Algorithm 4 according to the cutting-plane precision in the last column generation
iteration.
4.2.3 Discussion
Let us take a closer look at the StructBoost algorithm in Algorithm 3. We can see that
the training loop in Algorithm 3 is almost identical to other fully-corrective boosting
methods (e.g., LPBoost [60] and Shen and Li [59]). Line 3 finds the most violated constraint in the dual form and adds a new weak structured learner to the master problem. The dual solution µ(i,y) defined in (4.13) plays the role of the example weight associated with each training example in conventional boosting methods such as AdaBoost and LPBoost [60]. Then Line 4 solves the master problem, which is the reduced problem of (4.4). Here we can see that the cutting-plane procedure in Algorithm 4 only serves as a solver for the master problem in Line 4 of Algorithm 3. This makes our StructBoost
framework flexible—we are able to replace the cutting-plane optimization by other opti-
mization methods. For example, the bundle methods in [103] may further speed up the
computation.
For the convergence properties of the cutting-plane algorithm in Algorithm 4, readers
may refer to [87] and [52] for details.
Our column generation algorithm in Algorithm 3 is a constraint generation algorithm for the dual problem in (4.8). We can adapt the analysis of the standard constraint generation algorithm for Algorithm 3. In general, for column generation methods, global convergence can be established, but the convergence rate remains unclear unless particular assumptions are made.
4.3 Examples of StructBoost
We consider a few applications of the proposed general structured boosting in this sec-
tion, namely binary classification, ordinal regression, multi-class classification, optimiza-
tion of Pascal overlap score, and CRF parameter learning. We show the particular setup
for each application.
4.3.1 Binary classification
As the simplest example, the LPBoost of [60] for binary classification can be recovered
as follows. The label set is Y = {+1, −1}, and
Ψ(x, y) = (1/2) y Φ(x).   (4.15)
The label cost can be a simple constant; for example, ∆(y, y′) = 1 for y ≠ y′ and 0 for y = y′. Here we have introduced a column vector Φ(x):
Φ(x) = [φ1(x), . . . , φm(x)]>,   (4.16)
which is the vector of outputs of all weak classifiers on example x. The output of a weak classifier, e.g., a decision stump or tree, is usually a binary value: φ(·) ∈ {+1, −1}. In kernel methods, this feature mapping Φ(·) is only known through the so-called kernel trick. Here we explicitly learn this feature mapping. Note that if Φ(x) = x, we have the standard linear SVM.
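With this choice of Ψ, the margin constraint w>δΨi(y) ≥ ∆(yi,y) − ξi for the single wrong label y = −yi collapses to the familiar LPBoost margin yi w>Φ(xi) ≥ 1 − ξi. This identity can be checked numerically; Φ and w below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.choice([-1.0, 1.0], size=6)   # weak classifier outputs Phi(x_i)
w = rng.uniform(0.0, 1.0, size=6)       # non-negative boosting weights
Psi = lambda y: 0.5 * y * Phi           # Psi(x, y) = (1/2) y Phi(x), eq (4.15)

def constraint_lhs(y_i):
    """w^T [Psi(x_i, y_i) - Psi(x_i, -y_i)] for the single wrong label -y_i."""
    return float(w @ (Psi(y_i) - Psi(-y_i)))
```

Since Ψ(x, yi) − Ψ(x, −yi) = yi Φ(x), the left-hand side equals the signed margin yi w>Φ(xi) for either choice of yi.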
4.3.2 Ordinal regression and AUC optimization
In ordinal regression, the labels of the training data are ranks. Let us assume that the label y ∈ R indicates an ordinal scale, and that each pair (i, j) in the set S has the relationship of example i being ranked higher than j, i.e., yi ≻ yj. The primal can be written as
min_{w≥0, ξ≥0}  1>w + (C/n) Σ_{(i,j)∈S} ξij   (4.17a)
s.t. ∀(i, j) ∈ S :  w>[Φ(xi) − Φ(xj)] ≥ 1 − ξij.   (4.17b)
Here Φ(·) is defined in (4.16). Note that (4.17) also optimizes the area under the receiver
operating characteristic (ROC) curve (AUC) criterion. As pointed out in [104], (4.17)
is an instance of the multiclass classification problem. We discuss how the multiclass
classification problem fits in our framework shortly.
Here, the number of constraints is quadratic in the number of training examples, so directly solving (4.17) is feasible only for problems with up to a few thousand training examples.
We can reformulate (4.17) into an equivalent 1-slack problem, and apply the proposed
StructBoost framework to solve the optimization more efficiently.
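Each constraint in (4.17b) asks the scoring function f(x) = w>Φ(x) to separate a ranked pair by a margin, and the fraction of pairs scored in the correct order is the empirical AUC. A small sketch with made-up scores and pairs:

```python
import numpy as np

def pairwise_margins(scores, pairs):
    """Margins f(x_i) - f(x_j) for every (i, j) in S with i ranked above j,
    given precomputed scores f(x) = w^T Phi(x)."""
    return np.array([scores[i] - scores[j] for i, j in pairs])

scores = np.array([2.0, 0.5, 0.4, 0.2])   # hypothetical f(x) values
pairs = [(0, 1), (0, 3), (2, 1), (2, 3)]  # i should outrank j
auc = float((pairwise_margins(scores, pairs) > 0).mean())
```

Here one of the four pairs is mis-ordered (example 2 scores below example 1), so three quarters of the pairwise constraints hold with a positive margin.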
4.3.3 Multi-class boosting
The MultiBoost algorithm of Shen and Hao [1] can be implemented by the StructBoost
framework as follows. Let Y = {1, 2, . . . , k} and
w = [w1>, · · · , wk>]>.   (4.18)
Here [· · ·]> stacks the k per-class vectors into one column vector. As in [1], wy is the model parameter associated with the y-th class. The multi-class discriminant function in [1] writes
f(x, y;w) = wy>Φ(x).   (4.19)
Now let us define the orthogonal label coding vector:
Γ(y) = [1(y, 1), · · · , 1(y, k)]> ∈ {0, 1}^k.   (4.20)
Here 1(y, z) is the indicator function defined as:
1(y, z) = 1 if y = z, and 0 if y ≠ z.   (4.21)
Then the following joint mapping function
Ψ(x, y) = Φ(x) ⊗ Γ(y)   (4.22)
recovers the StructBoost formulation (4.4) for multi-class boosting. The operator ⊗ calculates the tensor product. The multi-class learning can be formulated as
min_{w≥0, ξ≥0}  1>w + (C/n) 1>ξ   (4.23a)
s.t. ∀i = 1, . . . , n and ∀y ∈ {1, . . . , k} :
wyi>Φ(xi) − wy>Φ(xi) ≥ 1 − ξi.   (4.23b)
A new weak classifier φ(·) is generated by solving the argmax problem defined in (4.13), which can be written as:
φ*(·) = argmax_{φ(·)}  Σ_{i,y} µ(i,y) [φ(xi) ⊗ Γ(yi) − φ(xi) ⊗ Γ(y)].   (4.24)
4.3.4 Hierarchical classification with taxonomies
In many applications such as object categorization and document classification [105],
classes of objects are organized in taxonomies or hierarchies. For example, the Ima-
geNet dataset has organized all the classes according to the tree structures of WordNet
[106]. This problem is a classification example in which the output space has interdependent
Figure 4.1: The hierarchy structures of two selected subsets of the SUN dataset [3] used in our experiments for hierarchical image classification. (a) Taxonomy of the 6-scene dataset; (b) taxonomy of the 15-scene dataset.
structures. An example tree structure (taxonomy) of image categories is shown in Figure
4.1.
Now we consider the taxonomy to be a tree, with a partial order ≺, which indicates
if a class is a predecessor of another class. We override the indicator function so that it indicates whether z is a predecessor of y in the label tree:
1(y, z) = 1 if y ≺ z or y = z, and 0 otherwise.   (4.25)
The label coding vector has the same format as in the standard multi-class classification
case:
Γ(y) = [1(y, 1), · · · , 1(y, k)]> ∈ {0, 1}^k.   (4.26)
Thus Γ(y)>Γ(y′) counts the number of common predecessors, while in the case of standard multi-class classification, Γ(y)>Γ(y′) = 0 for y ≠ y′.
Figure 4.2 shows an example of the label coding vector for a given label hierarchy. In
this case, for example, for class 3, Γ(3) = [0, 0, 1, 0, 0, 0, 1, 0, 1]>. The joint mapping
Figure 4.2: Classification with taxonomies (tree loss), corresponding to the first example in Figure 4.1.
function is
Ψ(x, y) = Φ(x)⊗ Γ(y). (4.27)
The tree loss function ∆(y, y′) is the height of the first common ancestor of the arguments
y, y′ in the tree. By re-defining Γ(y) and ∆(y, y′), classification with taxonomies can be
immediately implemented using the standard multi-class classification shown in the last
subsection.
Here we also consider an alternative approach. In [107], the authors show that one
can train a multi-class boosting classifier by projecting data to a label-specific space
and then learning a single model parameter w. The main advantage might be that
the optimization of w is simplified. Similar to [107] we define label-augmented data as
x′y = x ⊗ Γ(y). The max-margin classification can be written as
min_{w≥0, ξ≥0}  1>w + (C/n) 1>ξ
s.t. ∀i = 1, · · · ,m and ∀y :
w>[Φ(x′i,yi) − Φ(x′i,y)] ≥ ∆(yi, y) − ξi.
Compared with the first approach, now the model w ∈ Rn, which is independent of the
number of classes.
4.3.5 Optimization of the Pascal image overlap criterion
Object detection/localization commonly uses the image area overlap as the loss function
[4, 10, 15], e.g., in the Pascal object detection challenge:

∆(y, y′) = 1 − area(y ∩ y′) / area(y ∪ y′),   (4.28)
with y,y′ being the bounding box coordinates. y∩y′ and y∪y′ are the box intersection
and union. Let xy denote an image patch defined by a bounding box y on the image x.
To apply StructBoost, we define
Ψ(x,y) = Φ(xy). (4.29)
Φ(·) is defined in (4.16). Weak learners such as classifiers or regressors φ(·) are trained
on the image features extracted from image patches. For example, we can extract
histograms of oriented gradients (HOG) from the image patch xy and train a decision
stump with the extracted HOG features by solving the argmax in (4.13).
Note that in this case, solving (4.14) (finding the most violated constraint during
training), as well as the prediction inference (4.3), is in general difficult. In [10], a
branch-and-bound search is employed to find the global optimum. Following the simple
sampling strategy of [4], we simplify this problem by evaluating a number of sampled
image patches to find the most violated constraint; the same strategy is used for
prediction.
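The overlap loss of (4.28) can be sketched as follows; the (x1, y1, x2, y2) box convention is an assumption of this sketch, not taken from the thesis.

```python
# Sketch of the Pascal overlap loss (Eq. 4.28): Delta(y, y') = 1 - IoU.
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed convention).
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def overlap_loss(y, yp):
    # intersection rectangle (may be empty, in which case area() returns 0)
    ix1, iy1 = max(y[0], yp[0]), max(y[1], yp[1])
    ix2, iy2 = min(y[2], yp[2]), min(y[3], yp[3])
    inter = area((ix1, iy1, ix2, iy2))
    union = area(y) + area(yp) - inter
    return 1.0 - inter / union
```

Identical boxes give a loss of 0, disjoint boxes a loss of 1; the sampling strategy above simply evaluates this loss (plus the model score) on each sampled box.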
4.3.6 CRF parameter learning
CRFs have found successful applications in many vision problems such as pixel labelling,
stereo matching and image segmentation. Previous work often uses tedious cross-
validation to set the CRF parameters, which is only feasible for a small number of
parameters. Recently, SSVM has been introduced to learn the parameters [8]. We
demonstrate here how to employ the proposed StructBoost for CRF parameter learning
in the image segmentation task, and show the effectiveness of our approach on the
Graz-02 image segmentation dataset.
To speed up computation, super-pixels rather than pixels have been widely adopted in
image segmentation. We define x as an image and y as the segmentation labels of all
super-pixels in the image. We consider the energy E of an image x and segmentation
labels y over the nodes N and edges S, which takes the following form:
E(x, y; w) = Σ_{p∈N} w^{(1)⊤} Φ^{(1)}(U(y^p, x)) + Σ_{(p,q)∈S} w^{(2)⊤} Φ^{(2)}(V(y^p, y^q, x)).   (4.30)
Recall that 1(·, ·) is the indicator function defined in (4.21). p and q are the super-pixel
indices; y^p, y^q are the labels of super-pixels p and q. U is a set of unary potential
functions: U = [U_1, U_2, . . .]⊤; V is a set of pairwise potential functions:
V = [V_1, V_2, . . .]⊤. When we learn the CRF parameters, the learning algorithm sees
only U and V. In other words,
U and V play the role of the input features. Details on how to construct U and V are
described in the experiment section. w^{(1)} and w^{(2)} are the CRF potential weighting
parameters that we aim to learn. Φ^{(1)}(·) and Φ^{(2)}(·) are two sets of weak learners (e.g.,
decision stumps) for the unary part and pairwise part, respectively:

Φ^{(1)}(·) = [φ^{(1)}_1(·), φ^{(1)}_2(·), . . .]⊤,  Φ^{(2)}(·) = [φ^{(2)}_1(·), φ^{(2)}_2(·), . . .]⊤.   (4.31)
To predict the segmentation labels y⋆ of an unseen image x, we solve the energy
minimization problem:

y⋆ = argmin_y E(x, y; w),   (4.32)
which can be solved efficiently by using graph cuts [5, 8].
Consider a segmentation problem with two classes (background versus foreground). It
is desirable to keep the submodular property of the energy function in (4.30). Otherwise
graph cuts cannot be directly applied to achieve globally optimal labelling. Let us
examine the pairwise energy term:
θ_{(p,q)}(y^p, y^q) = w^{(2)⊤} Φ^{(2)}(V(y^p, y^q, x)),   (4.33)
and a super-pixel label y ∈ {0, 1}. It is well known that the energy function in (4.30)
is submodular if, for every pair (p, q) ∈ S,

θ_{(p,q)}(0, 0) + θ_{(p,q)}(1, 1) ≤ θ_{(p,q)}(0, 1) + θ_{(p,q)}(1, 0).   (4.34)
We want to keep the above property.
First, we force each pairwise weak learner φ^{(2)}(·) to output 0 when the two labels are
identical. This can always be achieved by multiplying a conventional weak learner by
(1 − 1(y^p, y^q)). Now we have θ_{(p,q)}(0, 0) = θ_{(p,q)}(1, 1) = 0.
Given that nonnegativity of w is enforced in our model, a sufficient condition for (4.34)
is that the output of each weak learner φ^{(2)}(·) is nonnegative. This is easy to satisfy,
e.g., with a discrete decision stump or tree with outputs in {0, 1}.
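The two design choices above (zero output on equal labels, nonnegative outputs and weights) make the left side of (4.34) zero and the right side nonnegative. A minimal sketch, with hypothetical stump weak learners on a scalar pairwise feature v:

```python
# Check submodularity (Eq. 4.34) of the pairwise term
# theta(yp, yq) = w2 . Phi2(V(yp, yq, x)) under the two constraints in the
# text: weak learners output 0 when yp == yq, outputs and weights nonnegative.
def make_pairwise(weak_learners, w2):
    assert all(c >= 0 for c in w2)  # nonnegative weights, as in the model
    def theta(yp, yq, v):
        if yp == yq:  # effect of multiplying by (1 - 1(yp, yq))
            return 0.0
        # each hypothetical weak learner returns a value in {0, 1}
        return sum(c * phi(v) for c, phi in zip(w2, weak_learners))
    return theta

def is_submodular(theta, v):
    return theta(0, 0, v) + theta(1, 1, v) <= theta(0, 1, v) + theta(1, 0, v)

stumps = [lambda v: 1.0 if v > 0.5 else 0.0,
          lambda v: 1.0 if v > 0.2 else 0.0]
theta = make_pairwise(stumps, w2=[0.3, 0.7])
```

Any nonnegative combination of such weak learners satisfies (4.34), so graph cuts remain applicable.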
By applying weak learners on U and V, our method introduces nonlinearity for the
parameter learning, which is different from most linear CRF learning methods such
as [8]. Recently, Bertelli et al. presented an image segmentation approach that uses
nonlinear kernels for the unary energy term in the CRF model [91]. In our model,
nonlinearity is introduced by applying weak learners on the potential functions' outputs.
This is analogous to the way SVM introduces nonlinearity via the so-called kernel trick,
and the way boosting learns a nonlinear model with nonlinear weak learners. Nowozin et al.
[108] introduced decision tree fields (DTF) to overcome the problem of overly-simplified
modeling of pairwise potentials in most CRF models. In DTF, local interactions between
multiple variables are determined by means of decision trees. In our StructBoost, if we
use decision trees as the weak learners on the pairwise potentials, then StructBoost and
DTF share similar characteristics in that both use decision trees for the same purpose.
However, the starting points of these two methods as well as the training procedures are
entirely different.
To apply StructBoost, the CRF parameter learning problem in a large-margin framework
can then be written as:
min_{w≥0, ξ≥0}  1⊤w + (C/n) 1⊤ξ   (4.35a)
s.t. ∀i = 1, . . . , n and ∀y ∈ Y:
     E(xi, y; w) − E(xi, yi; w) ≥ ∆(yi, y) − ξi.   (4.35b)
Here i indexes images. Intuitively, the optimization in (4.35) encourages the energy
of the ground truth labeling E(xi, yi) to be lower than that of any incorrect labeling
E(xi, y) by a margin of at least ∆(yi, y). We define ∆(yi, y) as the Hamming loss, i.e.,
the number of super-pixels at which the label y differs from the ground truth yi:
∆(yi, y) = Σ_p (1 − 1(y_i^p, y^p)).   (4.36)
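The Hamming loss of (4.36) simply counts disagreeing super-pixels; a minimal sketch (the label sequences are hypothetical):

```python
# Hamming loss (Eq. 4.36): number of super-pixels whose label differs
# from the ground-truth labeling.
def hamming_loss(y_true, y):
    assert len(y_true) == len(y)
    return sum(1 - int(yt == yp) for yt, yp in zip(y_true, y))
```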
We show that the problem (4.35) is a special case of the general formulation of Struct-
Boost (4.4) by defining

w = [w^{(1)}; w^{(2)}],   (4.37)

and

Ψ(x, y) = −[ Σ_{p∈N} Φ^{(1)}(U(y^p, x)); Σ_{(p,q)∈S} Φ^{(2)}(V(y^p, y^q, x)) ],

where [· ; ·] stacks two vectors. With this definition, we have the relation

w⊤Ψ(x, y) = −E(x, y; w).   (4.38)

The minus sign here inverts the order of subtraction in (4.4b). At each column
generation iteration (Algorithm 3), two new weak learners φ^{(1)}(·) and φ^{(2)}(·) are added
to the unary and pairwise weak learner sets, respectively, by solving the argmax problem
defined in (4.13), which can be written as:
φ^{(1)⋆}(·) = argmax_{φ(·)} Σ_{i,y} µ(i,y) Σ_{p∈N} [ φ^{(1)}(U(y^p, xi)) − φ^{(1)}(U(y_i^p, xi)) ];   (4.39)

φ^{(2)⋆}(·) = argmax_{φ(·)} Σ_{i,y} µ(i,y) Σ_{(p,q)∈S} [ φ^{(2)}(V(y^p, y^q, xi)) − φ^{(2)}(V(y_i^p, y_i^q, xi)) ].   (4.40)
The maximization problem of finding the most violated constraint in (4.14) reduces to
the inference:

y′i = argmin_y E(xi, y) − ∆(yi, y),   (4.41)

which is similar to the label prediction inference in (4.32); the only difference is the
labeling loss term ∆(yi, y) in (4.41). Since we use the Hamming loss (4.36), the term
∆(yi, y) can be absorbed into the unary term of the energy function in (4.30) (as in [8]).
The inference in (4.41) can then be
written as:
y′i = argmin_y Σ_{p∈N} [ w^{(1)⊤}Φ^{(1)}(U(y^p, xi)) − (1 − 1(y_i^p, y^p)) ] + Σ_{(p,q)∈S} w^{(2)⊤}Φ^{(2)}(V(y^p, y^q, xi)).   (4.42)
The above minimization (4.42) can also be solved efficiently by using graph cuts.
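The absorption of the Hamming loss into the unary term can be sketched as follows; the energies are hypothetical, the graph is a 3-node chain, and we brute-force the argmin instead of using graph cuts, for illustration only.

```python
# Sketch of loss-augmented inference (Eq. 4.42): each super-pixel's unary
# cost is reduced by 1 for labels that disagree with the ground truth,
# which absorbs -(1 - 1(y_i^p, y^p)) into the unary term.
from itertools import product

def loss_augmented_energy(y, unary, pairwise, y_true):
    # unary[p][label] is a hypothetical unary cost; chain edges (0,1), (1,2)
    e = sum(unary[p][y[p]] - (0 if y[p] == y_true[p] else 1)
            for p in range(len(y)))
    e += sum(pairwise(y[p], y[q]) for p, q in [(0, 1), (1, 2)])
    return e

def most_violated(unary, pairwise, y_true):
    # brute force over all binary labelings (graph cuts would be used in practice)
    labelings = product((0, 1), repeat=len(y_true))
    return min(labelings,
               key=lambda y: loss_augmented_energy(y, unary, pairwise, y_true))
```

Note how the loss term pushes the minimizer away from the ground truth: with a weakly confident unary term, the most violated labeling flips a label relative to the ground truth.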
4.4 Experiments
To evaluate our method, we run various experiments on applications including AUC
maximization (ordinal regression), multi-class image classification, hierarchical image
classification, visual tracking and image segmentation. We mainly compare with the
most relevant method: Structured SVM (SSVM) and some other conventional methods
(e.g., SVM, AdaBoost). Unless otherwise specified, the cutting-plane stopping criterion
(εcp) in our method is set to 0.01.
Table 4.1: AUC maximization. We compare the performance of the n-slack and 1-slack formulations. "−" means that the method is not able to converge within a memory and time limit. We can see that 1-slack achieves similar AUC results on training and testing data as n-slack, while 1-slack is significantly faster than n-slack.

dataset     method    time (sec)    AUC training   AUC test
wine        n-slack   13±1          1.000±0.000    0.994±0.005
            1-slack   3±1           1.000±0.000    0.994±0.006
glass       n-slack   20±1          0.967±0.011    0.849±0.028
            1-slack   6±1           0.955±0.030    0.844±0.039
svmguide2   n-slack   332±6         0.988±0.003    0.905±0.036
            1-slack   106±8         0.988±0.003    0.905±0.036
svmguide4   n-slack   564±79        1.000±0.000    0.982±0.005
            1-slack   106±13        1.000±0.000    0.982±0.005
vowel       n-slack   4051±116      0.999±0.001    0.968±0.013
            1-slack   952±139       0.999±0.001    0.967±0.013
dna         n-slack   −             −              −
            1-slack   1598±643      0.998±0.000    0.992±0.003
segment     n-slack   −             −              −
            1-slack   475±42        1.000±0.000    0.999±0.001
satimage    n-slack   −             −              −
            1-slack   37769±6331    0.999±0.000    0.997±0.002
4.4.1 AUC optimization
In this experiment, we compare the efficiency of the 1-slack (solving (4.11)) and n-slack
(solving (4.4) or its dual) formulations of our method StructBoost. The details of the
AUC optimization are described in Section 4.3.2. We run the experiments on a few
UCI multi-class datasets. To create imbalanced data, we use one class of the multi-class
UCI datasets as positive data and the rest as negative data. The maximum number
of boosting (column generation) iterations is set to 200; the cutting-plane stopping
criterion (εcp) is set to 0.001. Decision stumps are used as weak learners (Φ(·) in (4.17)).
For each data set, we randomly sample 50% for training, 25% for validation and the
rest for testing. The regularization parameter C is chosen from 6 candidates ranging
from 10 to 10^3. Experiments are repeated 5 times on each dataset and the mean and
standard deviation are reported. Table 4.1 reports the results. We can see that the
1-slack formulation of StructBoost achieves similar AUC performance as the n-slack
formulation, while being much faster than n-slack. The values of the objective function
and optimization times are shown in Figure 4.3 for varying column generation iterations.
Again, it shows that the 1-slack version achieves similar objective values to the n-slack
version with less running time.
Note that RankBoost [109], which is designed for solving ranking problems, may also
be applied to this problem.
[Figure 4.3 appears here; plots omitted in this transcript. Panels: SVMGUIDE4 and VOWEL; x-axis: iterations (20 to 200); y-axes: objective function value and optimization time (seconds). Final legend values: SVMGUIDE4 objective, m-slack 4.065±0.221 vs. 1-slack 4.067±0.221; SVMGUIDE4 time, m-slack 564.4±78.5 secs vs. 1-slack 105.7±13.2 secs; VOWEL objective, m-slack 7.562±0.501 vs. 1-slack 7.567±0.500; VOWEL time, m-slack 4051.2±116.0 secs vs. 1-slack 951.8±138.7 secs.]
Figure 4.3: AUC optimization on two UCI datasets. The objective values and optimization time are shown by varying the number of boosting (column generation) iterations. 1-slack achieves similar objective values to n-slack but needs less running time.
4.4.2 Multi-class classification
Multi-class classification is a special case of structured learning. Details are described in
Section 4.3.3. We carry out experiments on some UCI multi-class datasets and MNIST.
We compare with Structured SVM (SSVM), conventional multi-class boosting methods
(namely AdaBoost.ECC and AdaBoost.MH), and the one-vs-all SVM method. For each
dataset, we randomly select 50% data for training, 25% data for validation and the rest
for testing. The maximum number of boosting iterations is set to 200; the regularization
parameter C is chosen from 6 candidates whose values range from 10 to 1000. The
experiments are repeated 5 times for each dataset.
To demonstrate the flexibility of our method, we use decision stumps and linear perceptron
functions as weak learners (Φ(·) in (4.23)). The perceptron weak learner can be written as

φ(x) = sign(v⊤x + b).   (4.43)
[Figure 4.4 appears here; plots omitted in this transcript. Panels: VOWEL, SEGMENT, SATIMAGE, USPS, PENDIGITS and MNIST; x-axis: iterations (20 to 200); y-axis: test error. The final error rates in the legends correspond to the entries of Table 4.2.]
Figure 4.4: Test performance versus the number of boosting iterations for multi-class classification. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. The results of SSVM and SVM are shown as straight lines in the plots. The values shown in the legends are the error rates at the final iteration for each method. Our methods perform better than SSVM in most cases.
Table 4.2: Multi-class classification test errors (%) on several UCI datasets and MNIST. 1-vs-all SVM is the one-vs-all SVM. StBoost-stump and StBoost-per denote our StructBoost using decision stumps and perceptrons as weak learners, respectively. StructBoost outperforms SSVM in most cases and achieves competitive performance compared with other multi-class classifiers.

               glass      svmguide2  svmguide4  vowel      dna       segment   satimage   usps      pendigits  mnist
StBoost-stump  35.8±6.2   21.0±3.9   20.1±2.9   17.5±2.2   6.2±0.7   2.9±0.7   12.1±0.7   6.9±0.6   3.9±0.4    12.5±0.4
StBoost-per    37.3±6.2   22.7±4.8   53.4±6.1   6.8±1.8    6.6±0.6   3.8±0.7   11.4±1.1   4.1±0.6   1.8±0.3    6.5±0.6
Ada.ECC        32.7±4.9   23.3±4.0   19.1±2.3   20.6±1.5   7.6±1.2   2.9±0.8   12.8±0.7   8.4±0.7   8.4±0.7    15.8±0.3
Ada.MH         32.3±5.0   21.9±4.5   19.3±3.0   18.8±2.1   7.1±0.6   3.7±0.7   12.7±0.9   7.4±0.5   7.4±0.5    13.4±0.4
SSVM           38.8±8.7   21.9±5.9   45.7±3.9   25.6±2.5   6.9±0.9   5.3±1.0   14.9±0.1   5.8±0.3   5.2±0.3    9.6±0.2
1-vs-all SVM   40.8±7.0   17.7±3.5   47.0±3.2   54.4±2.2   6.3±0.5   7.7±0.8   17.5±0.4   5.4±0.5   8.1±0.5    9.2±0.2
Table 4.3: Hierarchical classification. Results of the tree loss and the 1/0 loss (classification error rate) on subsets of the SUN dataset. StructBoost-tree uses the hierarchical class formulation with the tree loss, and StructBoost-flat uses the standard multi-class formulation. StructBoost-tree, which minimizes the tree loss, performs best.

dataset    loss       StructBoost-tree  StructBoost-flat  Ada.ECC-SVM   Ada.ECC       Ada.MH
6 scenes   1/0 loss   0.322±0.018       0.343±0.028       0.350±0.013   0.327±0.002   0.315±0.015
           tree loss  0.337±0.014       0.380±0.027       0.377±0.018   0.352±0.023   0.346±0.018
15 scenes  1/0 loss   0.394±0.005       0.396±0.013       0.414±0.012   0.444±0.012   0.418±0.010
           tree loss  0.504±0.007       0.536±0.009       0.565±0.019   0.584±0.017   0.551±0.013
We use a smooth sigmoid function to replace the step function so that gradient-based
optimization can be applied. We solve the argmax problem in (4.24) using the
quasi-Newton L-BFGS solver [2]. We find that decision stumps often provide a good
initialization for L-BFGS learning of the perceptron. Compared with decision stumps,
the perceptron weak learner usually needs fewer boosting iterations to converge.
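The smoothing step can be sketched as follows; we use tanh as the smooth surrogate and the scale s is an assumption of this sketch (the thesis does not specify the exact sigmoid or its scale).

```python
# Sketch of smoothing the perceptron weak learner (Eq. 4.43): replace
# sign(v'x + b) with tanh(s * (v'x + b)) so the argmax in (4.24) can be
# optimized by gradient methods such as L-BFGS. Only the smooth response
# and its gradient are shown; s is a hypothetical scale parameter.
import math

def smooth_perceptron(v, b, x, s=5.0):
    a = sum(vi * xi for vi, xi in zip(v, x)) + b
    return math.tanh(s * a)

def smooth_perceptron_grad(v, b, x, s=5.0):
    """Gradient of tanh(s * (v'x + b)) with respect to (v, b)."""
    a = sum(vi * xi for vi, xi in zip(v, x)) + b
    g = s * (1.0 - math.tanh(s * a) ** 2)  # chain rule through tanh
    return [g * xi for xi in x], g
```

For large |v'x + b| the smooth response approaches the hard sign, so the learned weak learner behaves like the perceptron of (4.43) at prediction time.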
Table 4.2 reports the error rates. Figure 4.4 shows test performance versus the number
of boosting (column generation) iterations. The results demonstrate that our method
outperforms SSVM, and achieves competitive performance compared with other con-
ventional multi-class methods.
StructBoost performs better than SSVM on most datasets, which might be due to the
non-linearity introduced in StructBoost. Results also show that the perceptron weak
learner often achieves better performance than decision stumps on the larger datasets.
4.4.3 Hierarchical multi-class classification
The details of hierarchical multi-class classification are described in Section 4.3.4. We
have constructed two hierarchical image datasets from the SUN dataset [3], containing
6 and 15 scene classes, respectively. The hierarchical tree structures of the two datasets
are shown in Figure 4.1. For each scene class, we use the first 200 images from the
original SUN dataset, giving 1200 images in the first dataset and 3000 images in the
second. We use the HOG features described in [3]. For each dataset, 50% of the examples
are randomly selected for training and the rest for testing. We
Table 4.4: Average bounding box overlap scores on benchmark videos. Struck50 [4] is structured SVM tracking with a buffer size of 50. Our StructBoost performs the best in most cases. Struck performs the second best, which confirms the usefulness of structured output learning.

          StructBoost  AdaBoost   Struck50   Frag       MIL        OAB1       OAB5       VTD
coke      0.79±0.17    0.47±0.19  0.55±0.18  0.07±0.21  0.36±0.23  0.10±0.20  0.04±0.16  0.10±0.23
tiger1    0.75±0.17    0.64±0.16  0.68±0.21  0.21±0.30  0.64±0.18  0.44±0.23  0.23±0.24  0.11±0.24
tiger2    0.74±0.18    0.46±0.18  0.59±0.19  0.16±0.24  0.63±0.14  0.35±0.23  0.18±0.19  0.19±0.22
david     0.86±0.07    0.34±0.23  0.82±0.11  0.18±0.24  0.59±0.13  0.28±0.23  0.21±0.22  0.29±0.27
girl      0.74±0.12    0.41±0.26  0.80±0.10  0.65±0.19  0.56±0.21  0.43±0.18  0.28±0.26  0.63±0.12
sylv      0.66±0.16    0.52±0.18  0.69±0.14  0.61±0.23  0.66±0.18  0.47±0.38  0.05±0.12  0.58±0.30
bird      0.79±0.11    0.67±0.14  0.60±0.26  0.34±0.32  0.58±0.32  0.57±0.29  0.59±0.30  0.11±0.26
walk      0.74±0.19    0.56±0.14  0.59±0.39  0.09±0.25  0.51±0.34  0.54±0.36  0.49±0.34  0.08±0.23
shaking   0.72±0.13    0.49±0.22  0.08±0.19  0.33±0.28  0.61±0.26  0.57±0.28  0.51±0.21  0.69±0.14
singer    0.69±0.10    0.74±0.10  0.34±0.37  0.14±0.30  0.20±0.34  0.20±0.33  0.07±0.18  0.50±0.20
iceball   0.58±0.17    0.05±0.16  0.51±0.33  0.51±0.31  0.35±0.29  0.08±0.23  0.38±0.30  0.57±0.29
run 5 times for each dataset. The regularization parameter is chosen from 6 candidates
ranging from 1 to 10^3.
We use linear SVMs as the weak classifiers in our method. The linear SVM weak
classifier has the same form as (4.43). At each boosting iteration, we solve the argmax
problem by training a linear SVM model. The regularization parameter C of the SVM
is set to a large value (10^8/#examples). To alleviate overfitting when using linear SVMs
as weak learners, we set the 10% smallest non-zero dual solutions µ(i,y) to zero.
We compare StructBoost using the hierarchical multi-class formulation (StructBoost-
tree) with the standard multi-class formulation (StructBoost-flat). For further
comparison, we also run other multi-class methods: Ada.ECC and Ada.MH with decision
stumps, and Ada.ECC using linear SVMs as weak classifiers (labelled Ada.ECC-SVM).
When using SVMs as weak learners, the number of boosting iterations is set to 200;
for decision stumps, it is set to 500. Table 4.3 shows the tree loss and the 1/0 loss
(classification error rate). We observe that StructBoost-tree has the lowest tree loss
among all compared methods, and it also outperforms its counterpart, StructBoost-flat,
in terms of both classification error rate and tree loss. StructBoost-tree makes use of
the class hierarchy and directly optimizes the tree loss, which is likely why it achieves
the best performance.
4.4.4 Visual tracking
A visual tracking method, termed Struck [4], was introduced based on SSVM. The core
idea is to train a tracker by optimizing the Pascal image overlap score using SSVM. Here
we apply StructBoost to visual tracking, following a setting similar to that of Struck [4].
More details can be found in Section 4.3.5.
Table 4.5: Average center errors on benchmark videos. Struck50 [4] is structured SVM tracking with a buffer size of 50. We observe similar results as in Table 4.4: our StructBoost outperforms the others on most sequences, and Struck is the second best.

          StructBoost  AdaBoost    Struck50     Frag         MIL           OAB1          OAB5         VTD
coke      3.7±4.5      9.3±4.2     8.3±5.6      69.5±32.0    17.8±9.6      34.7±15.5     68.1±30.3    46.8±21.8
tiger1    5.4±4.9      7.8±4.4     7.8±9.9      39.6±25.7    8.4±5.9       17.8±16.4     38.9±31.1    68.8±36.4
tiger2    5.2±5.6      12.7±6.3    8.7±6.1      38.5±24.9    7.5±3.6       20.5±14.9     38.3±26.9    38.0±29.6
david     5.2±2.8      43.0±28.2   7.7±5.7      73.8±36.7    19.6±8.2      51.0±30.9     64.4±33.5    66.1±56.3
girl      14.3±7.8     47.1±29.5   10.1±5.5     23.0±22.5    31.6±28.2     43.3±17.8     67.8±32.5    18.4±11.4
sylv      9.1±5.8      14.7±7.8    8.4±5.3      12.2±11.8    9.4±6.5       32.9±36.5     76.4±35.4    21.6±35.7
bird      6.7±3.8      12.7±9.5    17.9±13.9    50.0±43.3    49.0±85.3     47.9±87.7     48.5±86.3    143.9±79.3
walk      8.4±10.3     13.5±5.4    33.9±49.5    102.8±46.3   35.0±47.5     35.7±49.2     38.0±48.7    100.9±47.1
shaking   9.5±5.4      21.6±12.0   123.9±54.5   47.2±40.6    37.8±75.6     26.9±49.3     29.1±48.7    10.5±6.8
singer    5.8±2.2      4.8±2.1     29.5±23.8    172.8±95.2   188.3±120.8   189.9±115.2   158.5±68.6   10.1±7.6
iceball   8.0±4.1      107.9±66.4  15.6±22.1    39.8±72.9    61.6±85.6     97.7±53.5     58.7±84.0    13.5±26.0
Our experiment follows the on-line tracking setting. Here we use the first 3 labeled
frames for initialization and training of our StructBoost tracker. Then the tracker is
updated by re-training the model during the course of tracking. Specifically, in the
i-th frame (represented by xi), we first perform a prediction step (solving (4.3)) to
output the detection box (yi), then collect training data for tracker update. For solving
the prediction inference in (4.3), we simply sample about 2000 bounding boxes around
the prediction box of the last frame (represented by yi−1), and search for the most
confident bounding box over all sampled boxes as the prediction. After the prediction,
we collect training data by sampling about 200 bounding boxes around the current
prediction yi. We use the training data from the most recent 60 frames to re-train the
tracker every 2 frames. Solving (4.14) to find the most violated constraint is similar to the
prediction inference.
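The sampling-based prediction step can be sketched as follows; the scoring function, the sampling radius, and the uniform sampling scheme are hypothetical stand-ins, not details taken from the thesis.

```python
# Sketch of sampling-based tracker prediction: sample candidate boxes
# around the previous prediction, score each with the current strong
# classifier, and keep the most confident one (the argmax in Eq. 4.3).
import random

def sample_boxes(prev_box, n=2000, radius=20, rng=random.Random(0)):
    # translate the previous box by random offsets (hypothetical scheme);
    # the shared seeded rng keeps this sketch deterministic
    x1, y1, x2, y2 = prev_box
    boxes = []
    for _ in range(n):
        dx, dy = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
        boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return boxes

def predict(prev_box, score):
    """Return the most confident sampled box under the given scorer."""
    return max(sample_boxes(prev_box), key=score)
```

Finding the most violated constraint works the same way, except that the sampled boxes are scored by the loss-augmented objective rather than the classifier score alone.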
For StructBoost, decision stumps are used as the weak classifiers; the number of boosting
iterations is set to 300; the regularization parameter C is selected from 10^0.5 to 10^2. We
use down-sampled gray-scale raw pixels and HOG [110] as image features.
For comparison, we also run a simple binary AdaBoost tracker using the same setting as
for our StructBoost tracker. The number of weak classifiers for AdaBoost is set to 500.
When training the AdaBoost tracker, we collect positive boxes that significantly overlap
(overlap score above 0.8) with the prediction box of the current frame, and negative
boxes with small overlap scores (lower or equal to 0.3).
We also compare with a few state-of-the-art tracking methods, including Struck [4] (with
a buffer size of 50), multi-instance tracking (MIL) [111], fragment tracking (Frag) [112],
online AdaBoost tracking (OAB) [113], and visual tracking decomposition (VTD) [114].
Two different settings are used for OAB: one positive example per frame (OAB1) and
five positive examples per frame (OAB5) for training. The test video sequences “coke”,
“tiger1”, “tiger2”, “david”, “girl” and “sylv” were used in [4]; “shaking” and “singer”
are from [114]; the rest are from [115].
[Figure 4.5 appears here; plots omitted in this transcript. Panels: per-frame bounding box overlap for (a) coke, (b) tiger1, (c) tiger2, (d) david, (e) walk, (f) shaking, (g) sylv, (h) girl, (i) bird, (j) singer and (k) iceball; compared trackers: VTD, OAB5, OAB1, MIL, Frag, Struck, AdaBoost and StructBoost.]
Figure 4.5: Bounding box overlap in frames of several video sequences. Our StructBoost often achieves higher box overlap scores compared with the other trackers.
Table 4.4 reports the Pascal overlap scores of the compared methods. Our StructBoost
tracker performs best on most sequences. Compared with the simple binary AdaBoost
tracker, StructBoost, which optimizes the Pascal overlap criterion, performs significantly
better. Note that here Struck uses Haar features. When Struck uses a Gaussian kernel
defined on raw pixels, the performance is slightly different [4], and ours still outperforms
Struck in most cases. This might be because StructBoost selects relevant features (300
features selected here), while the SSVM in Struck [4] uses all the image patch
information, which may contain noise.
The center location errors (in pixels) of the compared methods are shown in Table 4.5.
We can see that optimizing the overlap score also helps to minimize the center location
[Figure 4.6 appears here; plots omitted in this transcript. Panels: per-frame center location error for (a) coke, (b) tiger1, (c) tiger2, (d) david, (e) walk, (f) shaking, (g) sylv, (h) girl, (i) bird, (j) singer and (k) iceball; same compared trackers as in Figure 4.5.]
Figure 4.6: Bounding box center location error in frames of several video sequences. Our StructBoost often achieves lower center location errors compared with the other trackers.
errors. Our method also achieves the best performance.
Figures 4.5 and 4.6 plot the Pascal overlap scores and center location errors frame by
frame on several video sequences. Some tracking examples are shown in Figure 4.7.
4.4.5 CRF parameter learning for image segmentation
We evaluate our method on CRF parameter learning for image segmentation, following
[8], which applies SSVM to learn weighting parameters for different potentials (including
multiple unary and pairwise potentials). The goal of applying
Figure 4.7: Some tracking examples for several video sequences: “coke”, “david”, “walk”, “bird” and “tiger2” (best viewed on screen). The output bounding boxes of our StructBoost overlap the ground truth better than those of the compared methods.
Table 4.6: Image segmentation results on the Graz-02 dataset. The results show the pixel accuracy, intersection/union score (including the foreground and background) and the precision = recall value (as in [5]). Our method StructBoost for nonlinear parameter learning performs better than SSVM and other methods.

intersection/union (foreground, background) (%)
              bike               car                people
AdaBoost      69.2 (57.6, 80.7)  72.2 (51.7, 92.7)  68.9 (51.2, 86.5)
SVM           65.2 (53.0, 77.4)  68.6 (45.0, 92.3)  62.9 (41.0, 84.8)
SSVM          74.5 (64.4, 84.6)  80.2 (64.9, 95.4)  74.3 (58.8, 89.7)
StructBoost   76.5 (66.3, 86.7)  80.8 (66.1, 95.6)  75.7 (61.0, 90.4)

pixel accuracy (foreground, background) (%)
              bike               car                people
AdaBoost      84.4 (83.8, 85.1)  82.9 (69.8, 96.0)  81.0 (70.0, 92.1)
SVM           81.9 (81.8, 82.1)  77.0 (57.2, 96.9)  73.5 (53.8, 93.2)
SSVM          87.9 (87.9, 88.0)  86.9 (75.8, 98.1)  83.5 (71.8, 95.2)
StructBoost   87.4 (83.3, 91.5)  87.6 (77.0, 98.1)  84.6 (73.6, 95.6)

precision = recall (%)
              bike   car    people
M. & S. [116] 61.8   53.8   44.1
F. et al. [5] 72.2   72.2   66.3
AdaBoost      72.7   67.8   67.0
SVM           68.3   63.4   61.2
SSVM          77.3   78.3   74.4
StructBoost   78.9   79.3   75.9
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.8: Some segmentation results on the Graz-02 dataset (car). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.9: Some segmentation results on the Graz-02 dataset (bicycle). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.10: Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
(a) Testing (b) Truth (c) AdaBoost (d) SSVM (e) StructBoost
Figure 4.11: Some segmentation results on the Graz-02 dataset (person). Compared with AdaBoost, structured output learning methods (StructBoost and SSVM) present sharper segmentation boundaries and better spatial regularization. Compared with SSVM, our StructBoost with non-linear parameter learning performs better, demonstrating more accurate foreground object boundaries and cleaner backgrounds.
StructBoost here is to learn a non-linear weighting for different potentials. Details are
described in Section 4.3.6.
We extend the super-pixel based segmentation method [5] with CRF parameter learning.
The Graz-02 dataset1 is used here which contains 3 categories (bike, car and person).
Following the settings of other methods [5], the first 300 labeled images in each category
are used in the experiment. Images with the odd indices are used for training and the rest
for testing. We generate super-pixels using the same setting as [5]. For each super-pixel,
we generate 5 types of features: visual word histogram [5], color histograms, GIST [117],
LBP2 and HOG [110]. For constructing the visual word histogram, we follow [5] using
a neighborhood size of 2; the code book size is set to 200. For GIST, LBP and HOG,
we extract features from patches centered at the super-pixel with 4 increasing sizes:
4× 4, 8× 8, 12× 12 and 16× 16. The cell size for LBP and HOG is set to a quarter of
the patch size. For GIST, we generate 512 dimensional features for each patch by using
4 scales and the number of orientations is set to 8. In total, we generate 14 groups of
features (including features extracted on patches of different sizes). Using these super-
pixel features, we construct 14 different unary potentials (U = [U1, . . . , U14 ]>) from
AdaBoost classifiers, which are trained on the foreground and background super-pixels.
The number of boosting iterations for AdaBoost is set to 1000. Specifically, we define
F ′(xp) as the discriminant function of AdaBoost on the features of the p-th super-pixel.
Then the unary potential function can be written as:
U(x, yp) = −ypF ′(xp). (4.44)
We also construct 2 pairwise potentials (V = [V1, V2]>): V1 is constructed using color differences, and V2 using the shared boundary length [5], which is able to discourage small isolated segments. Recall that 1(·, ·) is the indicator function defined in (4.21); ‖xp − xq‖2 calculates the ℓ2 norm of the color difference between two super-pixels in the LUV color space; ℓ(xp, xq) is the shared boundary length between two super-pixels. Then V1, V2 can be written as:

V1(yp, yq, x) = exp(−‖xp − xq‖2) [1 − 1(yp, yq)],   (4.45)

V2(yp, yq, x) = ℓ(xp, xq) [1 − 1(yp, yq)].   (4.46)
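In code, these two potentials are straightforward to evaluate once each super-pixel carries a mean LUV color vector and a precomputed shared boundary length; the following is a minimal sketch, with illustrative function and variable names rather than the thesis implementation:

```python
import numpy as np

def pairwise_potentials(color_p, color_q, boundary_len, y_p, y_q):
    """Evaluate V1 (color difference) and V2 (shared boundary length)
    for a pair of neighboring super-pixels, following (4.45) and (4.46).

    color_p, color_q: mean LUV color vectors of the two super-pixels.
    boundary_len: shared boundary length l(x_p, x_q).
    y_p, y_q: labels in {-1, +1}.
    """
    same_label = 1.0 if y_p == y_q else 0.0           # indicator 1(y_p, y_q)
    v1 = np.exp(-np.linalg.norm(color_p - color_q)) * (1.0 - same_label)
    v2 = boundary_len * (1.0 - same_label)
    return v1, v2
```

Both potentials vanish when the labels agree; when they disagree, V1 penalizes cutting between similarly colored super-pixels, while V2 penalizes cutting long shared boundaries, which discourages small isolated segments.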
We apply StructBoost here to learn non-linear weights for combining these potentials.
We use decision stumps as weak learners (Φ(·) in (4.30)) here. The number of boosting
iterations for StructBoost is set to 50.
1 http://www.emt.tugraz.at/~pinz/
2 http://www.vlfeat.org/
For comparison, we use SSVM to learn CRF weighting parameters on exactly the same
potentials as our method. The regularization parameter C in SSVM and our Struct-
Boost is chosen from 6 candidates with values ranging from 0.1 to 10^3. We also run
two simple binary super-pixel classifiers (linear SVM and AdaBoost) trained on visual
word histogram features of foreground and background super-pixels. The regularization
parameter C in SVM is chosen from 10^2 to 10^7. The number of boosting iterations for
AdaBoost is set to 1000.
We use the intersection-union score, pixel accuracy (including the foreground and back-
ground) and precision = recall value (as in [5]) for evaluation. Results are shown in
Table 4.6. Some segmentation examples are shown in Figures 4.8, 4.9, 4.10 and 4.11.
As shown in the results, both StructBoost and SSVM, which learn to combine differ-
ent potential functions, are able to significantly outperform the simple binary models
(AdaBoost and SVM). StructBoost outperforms SSVM since it learns a non-linear com-
bination of potentials. Note that SSVM learns a linear weighting for different potentials.
By employing nonlinear parameter learning, our method gains further performance im-
provement over SSVM.
4.5 Conclusion
We have presented a structured boosting method, which combines a set of weak struc-
tured learners for nonlinear structured output learning, as an alternative to SSVM [7]
and CRF [88]. Analogous to SSVM, where the discriminant function is learned over
a joint feature space of inputs and outputs, the discriminant function of the proposed
StructBoost is a linear combination of weak structured learners defined over a joint space
of input-output pairs.
To efficiently solve the resulting optimization problems, we have introduced a cutting-
plane method, which was originally proposed for fast training of linear SVM. Our ex-
tensive experiments demonstrate that indeed the proposed algorithm is computationally
tractable.
StructBoost is flexible in the sense that it can be used to optimize a wide variety of
loss functions. We have exemplified the application of StructBoost by applying it to
multi-class classification, hierarchical multi-class classification by optimizing the tree
loss, visual tracking that optimizes the Pascal overlap criterion, and learning CRF pa-
rameters for image segmentation. We show that the performance of StructBoost is at least comparable to, and sometimes exceeds, that of conventional approaches for a wide range of applications. We also observe that StructBoost outperforms linear SSVM, demonstrating the
usefulness of our nonlinear structured learning method. Future work will focus on more
applications of this general StructBoost framework.
Chapter 5
Learning Hash Functions Using
Column Generation
In this chapter, we propose a column generation based method [43] for learning data-
dependent hash functions based on relative pairwise similarity information. Given a set
of triplets that encode the pairwise similarity comparison information, our method learns
hash functions that preserve the relative comparison relations in the data within the
large-margin learning framework. The learning procedure is implemented using column
generation and hence is named CGHash. At each iteration of the column generation
procedure, the best hash function is selected. Unlike many other hashing methods, our
method generalizes to new data points naturally. We show that our method with triplet
based formulation and large-margin learning is able to learn high quality hash functions
for similarity search.
5.1 Introduction
A hashing-based approach constructs a set of hash functions that map high-dimensional
data points to low-dimensional binary codes. These binary codes can be easily loaded
into the memory in order to allow rapid retrieval of data points. Moreover, the pairwise
Hamming distance between these binary codes can be efficiently computed by using
bit operations, which are well supported by modern processors, thus enabling efficient
similarity calculation on large-scale datasets. Hash-based approaches have thus found a
wide range of applications, including object recognition [118], information retrieval [33],
local descriptor compression [119], image matching [120], and many more. Recently a
number of effective hashing methods have been developed which construct a variety of
hash functions, mainly on the assumption that similar data points should have similar
binary codes, such as random projection based locality sensitive hashing (LSH) [32, 62,
65], boosting learning-based similarity sensitive coding (SSC) [121], and spectral hashing
of Weiss et al. [34] which is inspired by Laplacian eigenmap.
In more detail, spectral hashing [34] optimizes a graph Laplacian based objective function
such that, in the learned low-dimensional binary space, the local neighborhood structure
of the original dataset is preserved. SSC [121] makes use of boosting to adaptively
learn an embedding of the original space, represented by a set of weak learners or hash
functions. This embedding aims to preserve the pairwise similarity relations. These
approaches have demonstrated that, in general, data-dependent hashing is superior to
data-independent hashing with a typical example being LSH [65].
Following this vein, here we learn hash functions using similarity information that is
generally presented in a set of triplet relations. These triplet relations used for training can be generated in either a supervised or an unsupervised fashion. The fundamental idea
is to learn hash functions such that the Hamming distance between two similar data
points is smaller than that between two dissimilar data points. This type of relative similarity comparison has been successfully applied to learning quadratic distance metrics [122, 123]. Usually such similarity relations do not require explicit class labels and thus are easier to obtain than either the class labels or the actual distances between
data points. For instance, in content based image retrieval, to collect feedback, users
may be required to report whether one image looks more similar to another image than
it is to a third image. This task is typically much easier than labeling each individual image. Formally, letting x denote one data point, we are given a set of triplets:

T = {(i, j, k) | d(xi, xj) < d(xi, xk)},   (5.1)
where d(·, ·) is some distance measure (e.g., Euclidean distance in the original space; or
semantic similarity measure provided by a user). As explained, one may not explicitly
know d(·, ·); instead, one may only be able to provide sparse similarity relations. Using
such a set of constraints, we formulate a learning problem in the large-margin framework.
By using a convex surrogate loss function, a convex optimization problem is obtained,
but has an exponentially large number of variables. Column generation is thus employed
to efficiently solve the formulated optimization problem.
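In the unsupervised case, for instance, such triplets can be generated from Euclidean distances; the sketch below takes nearest neighbors as similar points and farthest points as dissimilar ones, which is one illustrative sampling choice rather than necessarily the scheme used in the experiments:

```python
import numpy as np

def generate_triplets(X, num_near=2, num_far=2):
    """Generate triplets (i, j, k) with d(x_i, x_j) < d(x_i, x_k):
    j is drawn from the nearest neighbors of x_i and k from its
    farthest points, under the Euclidean distance."""
    triplets = []
    for i in range(X.shape[0]):
        d = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(d)                 # order[0] is i itself
        for j in order[1:1 + num_near]:       # similar to x_i
            for k in order[-num_far:]:        # dissimilar to x_i
                triplets.append((i, int(j), int(k)))
    return triplets
```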
5.1.1 Main contribution
The main contribution of this work is to propose a novel hash function learning frame-
work which has the following desirable properties.
1. Our column generation based method with triplet based formulation and large-
margin learning is able to learn high quality hash functions. The learned hash
functions are able to outperform many existing methods in similarity preservation.
2. The proposed framework is flexible and can accommodate various types of con-
straints. We show how to learn hash functions based on similarity comparisons.
Furthermore, the framework can accommodate different types of loss functions as
well as regularization terms.
5.1.2 Related work
Without using any training data, data-independent hashing methods usually generate
a set of hash functions using randomization. For instance, the popular LSH family of methods [32, 62, 65] uses random projection and thresholding to generate binary codes,
where the mutually close data points in the Euclidean space are likely to have similar
binary codes. Recently, Kulis and Grauman [63] propose a kernelized version of LSH,
which is capable of capturing the intrinsic relations between data points using kernels
instead of linear inner products.
In terms of learning methodology, data-dependent hashing methods can make use of
unsupervised, supervised or semi-supervised learning techniques to learn a set of hash
functions that generate the compact binary codes. As for unsupervised learning, two
typical approaches are used to obtain such compact binary codes, including thresholding
the real-valued low-dimensional vectors (after dimensionality reduction) and direct op-
timization of a Hamming distance based objective function (e.g., spectral hashing [34],
self-taught hashing [33]). The spectral hashing (SPH) method directly optimizes a graph
Laplacian objective function in the Hamming space. Inspired by SPH, Zhang et al. [33]
develop the self-taught hashing (STH) method. At the first step of STH, Laplacian graph
embedding is used to generate a sequence of binary codes for each example. By viewing
these binary codes as binary classification labels, it trains linear support vector ma-
chine classifiers as hash functions. Liu et al. [38] propose a scalable graph-based hashing
method which uses a small-size anchor graph to approximate the original neighborhood
graph and alleviates the computational limitation of spectral hashing.
As for the supervised learning case, a number of hashing methods take advantage of la-
beled training examples to build data-dependent hash functions. For example, Salakhut-
dinov and Hinton [124] propose the restricted Boltzmann machine (RBM) hashing
method using a multi-layer deep learning technique for binary code generation. Strecha
et al. [119] use Fisher linear discriminant analysis (LDA) to embed the original data
points into a lower-dimensional space, where the embedded data points are binarized
using thresholding. Boosting methods have also been employed to develop hashing
methods such as SSC [121] and Forgiving Hash [125], both of which learn a set of weak
learners as hash functions in the boosting framework. It is demonstrated in [118] that
some data-dependent hashing methods like stacked RBM and boosting SSC perform
much better than LSH on large-scale databases of millions of images. Wang et al. [17]
propose a semi-supervised hashing method, which aims to ensure small Hamming dis-
tance of similar data examples and large Hamming distance of dissimilar data points.
More recently, Liu et al. [31] introduce a kernel-based supervised hashing method, where
the hashing functions are nonlinear kernel functions.
The closest work to ours might be the boosting based SSC hashing [121], which also
learns a set of weighted hash functions through boosting learning. Our method differs from SSC in the learning procedure. The resulting optimization problem of our CGHash is based on the concept of margin maximization. We derive a meaningful Lagrange dual problem so that column generation can be applied to solve the semi-infinite optimization problem. In contrast, SSC is built on the learning procedure of AdaBoost,
which employs stage-wise coordinate-descent optimization. The weights associated with
selected hash functions (corresponding weak classifiers in AdaBoost) are not fully up-
dated in each iteration. Also, the information used for training is different: we use distance comparison information, whereas SSC uses pairwise information. In addition, our
work can accommodate various types of constraints, and can flexibly adapt to different
types of loss functions as well as regularization terms. It is unclear, for example, how
SSC can accommodate different types of regularization which may encode useful prior
information. In this sense our CGHash is much more flexible. Next, we present our
main results.
5.1.3 Notation
A bold lower-case letter (u, v) denotes a column vector. An element-wise inequality
between two vectors or matrices such as u ≥ v means ui ≥ vi for all i. 0 is the all-zero
column vector and 1 is the all-one column vector.
5.2 The proposed method
Given a set of training examples X = {x1, x2, . . . , xn} ⊂ Rd, we aim to learn a set of m hash functions:

Φ(x) = [h1(x), h2(x), . . . , hm(x)] ∈ {−1, +1}m,   (5.2)
which map these training examples to a low-dimensional binary space. The domain
of hash functions is denoted by C: h(·) ∈ C. The output of one hash function is a binary value: h(x) ∈ {−1, +1}. In the low-dimensional binary space, these binary codes
are supposed to preserve the underlying similarity information in the original high-
dimensional space. Next we learn such hash functions within the large-margin learning
framework.
Formally, suppose that we are given a set of triplets T = {(i, j, k)}. These triplets encode the similarity comparison information in which the distance/dissimilarity between xi and xj is smaller than that between xi and xk. We define the weighted Hamming distance for the learned binary codes as (with the constant multiplier removed):

dhm(xi, xj; w) = ∑_{r=1}^{m} wr |hr(xi) − hr(xj)|,   (5.3)
where wr is a non-negative weighting coefficient associated with the r-th hash function.
We want the constraints:
dhm(xi,xj) < dhm(xi,xk) (5.4)
to be satisfied as well as possible. For notational simplicity, we define
δh(i, j, k) = |h(xi)− h(xk)| − |h(xi)− h(xj)| (5.5)
and
δΦ(i, j, k) = [δh1(i, j, k), δh2(i, j, k), . . . , δhm(i, j, k)]. (5.6)
With the above definitions, the weighted Hamming distance comparison of a triplet can
be written as:
dhm(xi,xk)− dhm(xi,xj) = w>δΦ(i, j, k). (5.7)
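The quantities in (5.3)–(5.6) and the identity (5.7) translate directly into code; a small sketch assuming the binary codes in {−1, +1}^m have already been computed (names are illustrative):

```python
import numpy as np

def weighted_hamming(ci, cj, w):
    """d_hm(x_i, x_j; w) in (5.3); ci, cj are codes in {-1, +1}^m."""
    return np.sum(w * np.abs(ci - cj))

def delta_phi(ci, cj, ck):
    """delta-Phi(i, j, k) in (5.6): one entry per hash function."""
    return np.abs(ci - ck) - np.abs(ci - cj)

# Check the identity (5.7):
# d_hm(x_i, x_k; w) - d_hm(x_i, x_j; w) = w' * delta-Phi(i, j, k).
w = np.array([0.5, 1.0, 0.2, 0.8])
ci = np.array([+1, -1, +1, -1])
cj = np.array([+1, -1, -1, -1])
ck = np.array([-1, +1, +1, -1])
lhs = weighted_hamming(ci, ck, w) - weighted_hamming(ci, cj, w)
rhs = w @ delta_phi(ci, cj, ck)
```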
In what follows, we describe the details of our hashing algorithm using different types
of convex loss functions and regularization norms. In theory, any convex loss and regu-
larization can be used in our hashing framework.
5.2.1 Learning hash functions with squared hinge loss
As a starting example, we first discuss using squared hinge loss function and `1 norm
regularization for hash function learning. Using the squared hinge loss, we define the
following large-margin optimization problem:
min_{w,ξ}  1>w + C ∑_{(i,j,k)∈T} ξ(i,j,k)²   (5.8a)
s.t. ∀(i, j, k) ∈ T : dhm(xi, xk; w) − dhm(xi, xj; w) ≥ 1 − ξ(i,j,k),   (5.8b)
w ≥ 0, ξ ≥ 0.   (5.8c)
Here we have used the ℓ1 norm on w as the regularization term to control the complexity of the learned model; the weighting vector w is defined as:
w = [w1, w2, . . . , wm]>; (5.9)
ξ is the slack variable; C is a parameter controlling the trade-off between the training
error and model complexity. With the definition of weighted Hamming distance in (5.3)
and the notation in (5.6), the optimization problem in (5.8) can be rewritten as:
min_{w,ξ}  1>w + C ∑_{(i,j,k)∈T} ξ(i,j,k)²   (5.10a)
s.t. ∀(i, j, k) ∈ T : w>δΦ(i, j, k) ≥ 1 − ξ(i,j,k),   (5.10b)
w ≥ 0, ξ ≥ 0.   (5.10c)
We aim to solve the above optimization to obtain the weighting vector w and the set of hash functions Φ = [h1, h2, . . . ]. Once the hash functions are obtained, the optimization can be easily solved for w (e.g., using LBFGS-B [2]). In our approach, we apply the column generation technique to alternately solve for w and learn hash functions. Basically, we construct a working set of hash functions and repeat the following two steps until convergence: first we solve for the weighting vector using the current working set of hash functions, and then generate a new hash function and add it to the working set.
Column generation is a technique originally used for large-scale linear programming problems. Demiriz et al. [60] apply this technique to design boosting algorithms. In each iteration, one column (a variable in the primal or a constraint in the dual problem) is added. Once no violated constraint can be found in the dual, the current solution is optimal. In theory, if we run column generation for a sufficient number of iterations, we can obtain a sufficiently accurate solution. Here we only need to run a small number of column generation iterations (e.g., 60) to learn a compact set of hash functions.
To apply the column generation technique for learning hash functions, we derive the dual problem of the optimization in (5.10). The optimization in (5.10) can be equivalently written
as:

min_{w,ρ}  1>w + C ∑_{(i,j,k)∈T} [max(1 − ρ(i,j,k), 0)]²   (5.11a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.11b)
w ≥ 0.   (5.11c)
The Lagrangian of (5.11) can be written as:

L(w, ρ, µ, α) = 1>w + C ∑_{(i,j,k)∈T} [max(1 − ρ(i,j,k), 0)]² + ∑_{(i,j,k)∈T} µ(i,j,k) [ρ(i,j,k) − w>δΦ(i, j, k)] − α>w,   (5.12)
where µ, α are Lagrange multipliers and α ≥ 0. For the optimal primal solution, the following must hold: ∂L/∂w = 0 and ∂L/∂ρ = 0; hence the dual problem can be derived as:

max_µ  ∑_{(i,j,k)∈T} [µ(i,j,k) − µ(i,j,k)²/(4C)]   (5.13a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ 1.   (5.13b)
Here each µ(i,j,k) is a dual variable, which corresponds to one constraint in (5.11b).
The core idea of column generation is to generate a small subset of dual constraints by finding the most violated dual constraint in (5.13). This process is equivalent to adding primal variables into the primal optimization problem (5.11). Here finding the most violated dual constraint amounts to learning one hash function, which can be written as:
h⋆(·) = argmax_{h(·)∈C} ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k)
      = argmax_{h(·)∈C} ∑_{(i,j,k)∈T} µ(i,j,k) [|h(xi) − h(xk)| − |h(xi) − h(xj)|].   (5.14)
In each column generation iteration, we solve the above optimization to generate one
hash function.
Now we give an overview of our approach. Basically, we repeat the following two steps until convergence:
1. Solve the reduced primal problem in (5.11) using the current working set of hash
functions. We obtain the primal solution w and the dual solution µ in this step.
2. With the dual solution µ, we solve the subproblem in (5.14) to learn one hash function, and add it to the working set of hash functions.
Our method is summarized in Algorithm 5. We describe more details for running these
two steps as follows.
In the first step, we need to obtain the dual solution µ, which is required for solving the subproblem in (5.14) of the second step to learn one hash function. In each column generation iteration, we can easily solve the optimization in (5.11) using the current working set of hash functions to obtain the primal solution w, for example using the efficient LBFGS-B solver [2]. According to the Karush-Kuhn-Tucker (KKT) conditions, we have the following relation:

∀(i, j, k) ∈ T : µ⋆(i,j,k) = 2C max[1 − w⋆>δΦ(i, j, k), 0].   (5.15)
From the above, we are able to obtain the dual solution µ? for the primal solution w?.
In the second step, we solve the subproblem in (5.14) for learning one hash function. The hash function h(·) can be any function with a binary output value. When using a decision stump as the hash function, we can usually enumerate all possibilities exhaustively and find the globally best one. However, for many other types of hash functions, e.g., perceptrons and kernel functions, globally solving (5.14) is difficult. In our
experiments, we use the perceptron hash function:
h(x) = sign (v>x+ b). (5.16)
In order to obtain a smoothly differentiable objective function, we reformulate (5.14) into the following equivalent form:

h⋆(·) = argmax_{h(·)∈C} ∑_{(i,j,k)∈T} µ(i,j,k) [(h(xi) − h(xk))² − (h(xi) − h(xj))²].   (5.17)
The non-smooth sign function in (5.16) makes the optimization difficult. We replace the sign function with a smooth sigmoid function, and then locally solve the above optimization (5.17) (e.g., using LBFGS [2]) to learn the parameters of a hash function.
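This relaxation can be sketched as follows, with tanh as the smooth surrogate of sign and a quasi-Newton solver maximizing (5.17) locally; this is a simplified illustration, where the sigmoid scaling and the initialization are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def learn_hash_function(X, triplets, mu, scale=1.0, seed=0):
    """Locally maximize (5.17) for a perceptron hash function
    h(x) = sign(v'x + b), with sign relaxed to tanh."""
    d = X.shape[1]

    def neg_objective(params):
        v, b = params[:d], params[d]
        h = np.tanh(scale * (X @ v + b))       # smooth surrogate of sign
        obj = sum(m_ijk * ((h[i] - h[k]) ** 2 - (h[i] - h[j]) ** 2)
                  for (i, j, k), m_ijk in zip(triplets, mu))
        return -obj                            # minimize the negative

    # random-projection initialization, as in LSH
    x0 = np.random.default_rng(seed).standard_normal(d + 1)
    res = minimize(neg_objective, x0, method="L-BFGS-B")
    v, b = res.x[:d], res.x[d]
    return lambda x: np.sign(x @ v + b)
```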
We can apply a few initialization heuristics for solving (5.17). For example, similar to
LSH, we can generate a number of random planes and choose the best one, which maxi-
mizes the objective in (5.17), as the initial solution. We can also train a decision stump by searching for the best dimension and threshold to maximize the objective on the quantized data. Alternatively, we can employ the spectral relaxation method [38] which drops the
Algorithm 5: CGHash: Hashing using column generation (with squared hinge loss)
Input: training triplets T = {(i, j, k)}; training examples x1, x2, . . .; the number of bits m.
Output: learned hash functions h1, h2, . . . , hm and the associated weights w.
1 Initialize: µ ← 1/|T|.
2 for r = 1 to m do
3   find a new hash function hr(·) by solving the subproblem (5.14);
4   add hr(·) to the working set of hash functions;
5   solve the primal problem in (5.11) for w (using LBFGS-B [2]), and calculate the dual solution µ by (5.15);
sign function and solves a generalized eigenvalue problem to obtain a solution for initial-
ization. In our experiments, we use decision stump training and random projection for
initialization. However, applying the spectral relaxation method for initialization would
further improve the performance.
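The alternating procedure of Algorithm 5 can be sketched for the squared hinge loss as follows: the w-step solves (5.11) with L-BFGS-B, and the dual is recovered via the KKT relation (5.15). Here `learn_hash_fn` stands in for any solver of subproblem (5.14); the sketch is a simplified illustration, not the actual implementation:

```python
import numpy as np
from scipy.optimize import minimize

def solve_primal_w(dPhi, C):
    """Solve (5.11) for w given the current hash functions.
    dPhi: |T| x m matrix whose rows are delta-Phi(i, j, k)."""
    n_t, m = dPhi.shape

    def obj_grad(w):
        slack = np.maximum(1.0 - dPhi @ w, 0.0)       # hinge slacks
        obj = w.sum() + C * np.sum(slack ** 2)
        grad = np.ones(m) - 2.0 * C * (dPhi.T @ slack)
        return obj, grad

    res = minimize(obj_grad, np.zeros(m), jac=True,
                   method="L-BFGS-B", bounds=[(0.0, None)] * m)
    w = res.x
    mu = 2.0 * C * np.maximum(1.0 - dPhi @ w, 0.0)    # KKT relation (5.15)
    return w, mu

def cghash(X, triplets, learn_hash_fn, m_bits, C=1.0):
    """Column generation loop of Algorithm 5."""
    mu = np.full(len(triplets), 1.0 / len(triplets))  # initialize mu
    hash_fns, dphi_cols = [], []
    for _ in range(m_bits):
        h = learn_hash_fn(X, triplets, mu)            # subproblem (5.14)
        codes = h(X)
        dphi_cols.append([abs(codes[i] - codes[k]) - abs(codes[i] - codes[j])
                          for (i, j, k) in triplets])
        hash_fns.append(h)
        w, mu = solve_primal_w(np.array(dphi_cols).T, C)
    return hash_fns, w
```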
5.3 Hashing with general smooth convex loss functions
The previous discussion for squared hinge loss is an example of using smooth convex loss
function in our framework. To take a step forward, here we describe how to incorporate general smooth convex loss functions. We denote by f(·) a general convex loss function which is assumed to be smooth (e.g., the exponential, logistic, or squared hinge loss). Our algorithm can be easily extended to non-smooth loss functions. As an example, we discuss using the hinge loss in Appendix C.
We encourage the following constraints to be satisfied as well as possible:
∀(i, j, k) ∈ T : dhm(xi, xk) − dhm(xi, xj) = w>δΦ(i, j, k) ≥ 0.   (5.18)
These constraints do not have to be all strictly satisfied. Here we define the margin:
ρ(i,j,k) = w>δΦ(i, j, k), (5.19)
and we want to maximize the margin with regularization. Using `1 norm for regulariza-
tion, we define the primal optimization problem as:
min_w  1>w + C ∑_{(i,j,k)∈T} f(ρ(i,j,k))   (5.20a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.20b)
w ≥ 0.   (5.20c)
Here f(·) is a smooth convex loss function. C is a parameter controlling the trade-
off between the training error and model complexity. Without the regularization, one
can always make w arbitrarily large to make the convex loss approach zero when all
constraints are satisfied.
The squared hinge loss which we discussed before is an example of f(·). We can easily
recover the formulation in (5.11) for squared hinge loss by using the following definition:
f(ρ(i,j,k)) = [max(1 − ρ(i,j,k), 0)]².   (5.21)
To apply column generation, we derive the dual problem of (5.20). The Lagrangian of (5.20) can be written as:

L(w, ρ, µ, α) = 1>w + C ∑_{(i,j,k)∈T} f(ρ(i,j,k)) + ∑_{(i,j,k)∈T} µ(i,j,k) [ρ(i,j,k) − w>δΦ(i, j, k)] − α>w,   (5.22)
where µ, α are Lagrange multipliers and α ≥ 0. With the definition of Fenchel conjugate
[126], we have the following dual objective:
inf_{w,ρ} L(w, ρ, µ, α) = inf_ρ [ C ∑_{(i,j,k)∈T} f(ρ(i,j,k)) + ∑_{(i,j,k)∈T} µ(i,j,k) ρ(i,j,k) ]   (5.23)
= − sup_ρ [ −C ∑_{(i,j,k)∈T} f(ρ(i,j,k)) − ∑_{(i,j,k)∈T} µ(i,j,k) ρ(i,j,k) ]   (5.24)
= −C sup_ρ [ ∑_{(i,j,k)∈T} (−µ(i,j,k)/C) ρ(i,j,k) − ∑_{(i,j,k)∈T} f(ρ(i,j,k)) ]   (5.25)
= −C ∑_{(i,j,k)∈T} f∗(−µ(i,j,k)/C).   (5.26)
Here f∗(·) is the Fenchel conjugate of f(·). For the optimal primal solution, the condition ∂L/∂w = 0 must hold; hence we have the following relation:

1 − α − ∑_{(i,j,k)∈T} µ(i,j,k) δΦ(i, j, k) = 0.   (5.27)
Consequently, the corresponding dual problem of (5.20) can be written as:

max_µ  −C ∑_{(i,j,k)∈T} f∗(−µ(i,j,k)/C)   (5.28a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ 1.   (5.28b)
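As a numerical sanity check of this derivation, the conjugate term −C f∗(−µ/C) can be evaluated by a grid search over ρ; for the squared hinge loss it reproduces the term µ − µ²/(4C) in the dual objective (5.13a). A small sketch, where the grid bounds and resolution are illustrative choices:

```python
import numpy as np

def neg_conjugate_term(mu, C, f, grid=None):
    """Numerically evaluate -C * f*(-mu/C), where the Fenchel
    conjugate is f*(u) = sup_rho [u * rho - f(rho)]."""
    if grid is None:
        grid = np.linspace(-50.0, 50.0, 200001)
    u = -mu / C
    f_star = np.max(u * grid - f(grid))    # grid approximation of the sup
    return -C * f_star

sq_hinge = lambda rho: np.maximum(1.0 - rho, 0.0) ** 2
C, mu = 2.0, 1.5
val = neg_conjugate_term(mu, C, sq_hinge)
expected = mu - mu ** 2 / (4.0 * C)        # the corresponding term in (5.13a)
```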
With the above dual problem for general smooth convex loss functions, we generate a new hash function by finding the most violated constraint in (5.28b), in the same way as for the squared hinge loss. Hence, we solve the optimization in (5.14) to generate a new hash function. Using different loss functions will result in different dual solutions. The dual solution is required for generating hash functions.

As aforementioned, in each column generation iteration, we need to obtain the dual
solution before solving (5.14) to generate a hash function. Since we assume that f(·) is
smooth, the Karush-Kuhn-Tucker (KKT) conditions establish the connection between
the primal solution of (5.20) and the dual solution of (5.28):

∀(i, j, k) ∈ T : µ⋆(i,j,k) = −C f′(ρ⋆(i,j,k)),   (5.29)

where

ρ⋆(i,j,k) = w⋆>δΦ(i, j, k).   (5.30)
In other words, the dual variable is determined by the gradient of the loss function in
the primal. According to (5.29), we are able to obtain the dual solution µ? using the
primal solution w?.
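The relation (5.29) is direct to apply in code; a small sketch, taking the squared hinge loss (where it recovers (5.15)) and the logistic loss as examples:

```python
import numpy as np

def dual_from_primal(rho, C, loss_grad):
    """mu* = -C f'(rho*), the KKT relation (5.29)."""
    return -C * loss_grad(rho)

# squared hinge: f(rho) = max(1 - rho, 0)^2, so f'(rho) = -2 max(1 - rho, 0)
sq_hinge_grad = lambda rho: -2.0 * np.maximum(1.0 - rho, 0.0)
# logistic: f(rho) = log(1 + exp(-rho)), so f'(rho) = -1 / (1 + exp(rho))
logistic_grad = lambda rho: -1.0 / (1.0 + np.exp(rho))

rho = np.array([-0.5, 0.2, 1.5])   # illustrative margins w' * deltaPhi
C = 10.0
mu_sq = dual_from_primal(rho, C, sq_hinge_grad)    # matches (5.15)
mu_log = dual_from_primal(rho, C, logistic_grad)
```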
5.3.1 Hashing with logistic loss
It has been shown in (5.21) that the formulation for the squared hinge loss is an instance of the general formulation in (5.20) with smooth convex loss functions. Here we describe
using the logistic loss as another example of the general formulation. The learning
algorithm is similar to the case of using the squared hinge loss which is described before.
We have the following definition for the logistic loss:
f(ρ(i,j,k)) = log (1 + exp (−ρ(i,j,k))). (5.31)
The general result for smooth convex loss function can be applied here. The primal
optimization problem can be written as:
min_{w,ρ}  1>w + C ∑_{(i,j,k)∈T} log(1 + exp(−ρ(i,j,k)))   (5.32a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.32b)
w ≥ 0.   (5.32c)
The corresponding dual problem can be written as:
max_µ  ∑_{(i,j,k)∈T} [ (µ(i,j,k) − C) log(C − µ(i,j,k)) − µ(i,j,k) log(µ(i,j,k)) ]   (5.33a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ 1.   (5.33b)
The dual solution can be calculated by:

∀(i, j, k) ∈ T : µ⋆(i,j,k) = C / (exp(w⋆>δΦ(i, j, k)) + 1).   (5.34)
5.4 Hashing with `∞ norm regularization
The proposed method is flexible in that it can easily incorporate different types of regularization. Here we discuss the ℓ∞ norm regularization as an example. For general
convex loss, the optimization can be written as:
min_w  ‖w‖∞ + C ∑_{(i,j,k)∈T} f(ρ(i,j,k))   (5.35a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.35b)
w ≥ 0.   (5.35c)
This optimization problem can be equivalently written as:
min_w  ∑_{(i,j,k)∈T} f(ρ(i,j,k))   (5.36a)
s.t. ∀(i, j, k) ∈ T : ρ(i,j,k) = w>δΦ(i, j, k),   (5.36b)
0 ≤ w ≤ C′1,   (5.36c)
where C ′ is a constant that controls the regularization trade-off. This optimization can
be efficiently solved using quasi-Newton methods such as LBFGS-B [2] by eliminating
the auxiliary variable ρ. The Lagrangian can be written as:
L(w, ρ, µ, α, β) = ∑_{(i,j,k)∈T} f(ρ(i,j,k)) − α>w + β>(w − C′1) + ∑_{(i,j,k)∈T} µ(i,j,k) [ρ(i,j,k) − w>δΦ(i, j, k)],   (5.37)
where µ, α, β are Lagrange multipliers and α ≥ 0, β ≥ 0. Similar to the case for the ℓ1 norm, the dual problem can be written as:

max_{µ,β}  −C′1>β − ∑_{(i,j,k)∈T} f∗(−µ(i,j,k))   (5.38a)
s.t. ∀h(·) ∈ C : ∑_{(i,j,k)∈T} µ(i,j,k) δh(i, j, k) ≤ βh,   (5.38b)
β ≥ 0.   (5.38c)
As in the ℓ1 norm case, the dual solution μ can be calculated using (5.29), and one hash function is generated at each iteration by solving the subproblem in (5.14). Similarly, different loss functions, including the squared hinge loss in (5.21) and the logistic loss in (5.31), can be combined with the ℓ∞ norm regularization. Owing to the flexibility of our framework, we can also use the non-smooth hinge loss with the ℓ∞ norm regularization; please refer to Appendix C for details.
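To make the boxed formulation (5.36) concrete, here is a minimal sketch of solving it with SciPy's L-BFGS-B, which handles the bound constraints 0 ≤ w ≤ C′ directly. The δΦ values are synthetic and the logistic loss is used; all names are our own, not the thesis code:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_triplets, n_hash = 200, 16
dPhi = rng.choice([-2.0, 0.0, 2.0], size=(n_triplets, n_hash))  # toy dPhi(i,j,k) rows
C_box = 10.0   # the box constant C' in (5.36c)

def objective(w):
    """Summed logistic losses f(rho) = log(1 + exp(-rho)), rho = dPhi @ w,
    i.e. (5.36) after eliminating the auxiliary variable rho."""
    rho = dPhi @ w
    value = np.logaddexp(0.0, -rho).sum()          # stable log(1 + exp(-rho))
    grad = dPhi.T @ (-1.0 / (1.0 + np.exp(rho)))   # chain rule through rho
    return value, grad

res = minimize(objective, x0=np.zeros(n_hash), jac=True,
               method="L-BFGS-B", bounds=[(0.0, C_box)] * n_hash)
w_star = res.x   # satisfies 0 <= w <= C' by construction
```

The bound constraints replace both the nonnegativity constraint and the ℓ∞ penalty, which is why no explicit regularization term appears in the objective.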
5.5 Extension of regularization
To demonstrate the flexibility of the proposed framework, we show an example that incorporates additional pairwise information into the hash learning. Assume that we are given pairwise similarity information, with the expectation that the distance between similar data pairs should be minimized. We can easily add a new regularization term to our objective function to leverage this additional information. Formally, let us denote the set of pairwise relations by:

D = {(i, j) | xi is similar to xj}.   (5.39)
We want to minimize the total weighted distance over similar pairs:

∑_{(i,j)∈D} d_hm(xi, xj) = ∑_{r=1}^{m} ∑_{(i,j)∈D} w_r |h_r(xi) − h_r(xj)|.   (5.40)
If we use this term to replace the ℓ1 regularization term 1ᵀw in the optimization (5.20), all of our analysis still holds and Algorithm 5 remains applicable with small modifications, because the new term can simply be seen as a weighted ℓ1 norm.
[Figure 5.1: Precision-recall results (using 60 bits) on the ISOLET, SCENE-15, MNIST, LABELME, USPS and PASCAL07 datasets. The legends report the average precision (AP) ± standard deviation for STH, LSI, LCH, ITQ, AGH, SPH, LSH, BREs, SPLH, SSC and CGHash. Our CGHash performs the best in most cases.]
5.6 Experiments
5.6.1 Dataset description
In order to evaluate the proposed column generation hashing method (referred to as
CGHash), we have conducted a set of experiments on six benchmark datasets. Table 5.1
gives a summary of all datasets used in the experiments. More specifically, the USPS
[Figure 5.2: Precision of top-50 retrieved examples using different numbers of bits on the six datasets. The legends report the mean score ± standard deviation at 60 bits. Our CGHash performs the best in most cases.]
dataset consists of 9,298 images of handwritten digits, each of which is resized to 16 × 16 pixels.
This dataset is split into two subsets at random (70% for training and 30% for testing).
The MNIST dataset is composed of 70,000 images of handwritten digits. The size of
each image is 28×28. This dataset is randomly partitioned into a training subset (66,000
images) and a testing subset (4,000 images). We select 2,000 images from the training
subset to generate a set of triplets used for learning hash functions. In the above two
handwritten image datasets, the original gray-scale intensity values of each image are
used as features.
[Figure 5.3: Nearest-neighbor classification error using different numbers of bits on the six datasets. The legends report the mean error ± standard deviation at 60 bits. Our CGHash performs the best in most cases.]
The ISOLET dataset contains 7,797 recordings of 150 subjects speaking the 26 letters
of the alphabet. Each subject spoke each letter twice. This dataset is randomly divided
into a training subset (5,459 spoken letters) and a testing subset (2,338 spoken letters).
Each letter is represented as a 617-dimensional feature vector.
The SCENE-15 dataset consists of 4,485 images of 9 outdoor scenes and 6 indoor scenes.
Each image is divided into 31 sub-windows, each of which is represented as a histogram
of 200 visual code words. A concatenation of the histograms associated with 31 sub-
windows is used to represent an image, resulting in a 6,200-dimensional feature vector.
[Figure 5.4: Performance of CGHash using different values of K (K ∈ {3, 10, 20, 30}) on the SCENE-15 dataset: precision-recall (using 60 bits), precision of top-50 retrieved examples, and nearest-neighbor classification error.]
This dataset is randomly divided into a training subset (1,500 examples) and a testing subset (2,985 examples).
The LABELME dataset1 [83] is a subset of the original LabelMe dataset, and consists
of 50,000 images that are categorized into 12 classes. Each image has 256× 256 pixels.
We generate 512-dimensional gist features for this dataset.
The PASCAL07 dataset is a subset of the PASCAL VOC 2007 dataset, and contains
9,963 images of 20 object classes. We use the 5 types of features provided in [82]; thus each image is represented by a 2,712-dimensional feature vector. This dataset is randomly separated
into a training subset (70%) and a testing subset (30%).
1http://www.ais.uni-bonn.de/download/datasets.html
Dataset     MNIST    USPS    LABELME [83]   SCENE-15 [84]   ISOLET   PASCAL07 [127]
Size        70,000   9,298   50,000         4,485           7,797    9,963
Dimension   784      256     512            6,200           617      2,712
Classes     10       10      12             15              26       20

Table 5.1: Summary of the 6 datasets used in the experiments.
5.6.2 Experiment setup
Each dataset is randomly split into a training subset and a testing subset. This train-
ing/testing split is repeated 5 times, and the average performance over these 5 trials
is reported here. In the experiments, the proposed hashing method is implemented using the squared hinge loss function with the ℓ1 norm regularization. Moreover, the triplets used for learning hash functions are generated in a similar way to [128]. Specifically, given a training example, we randomly select K similar examples and K dissimilar
examples by multi-class label consistency, then construct triplets based on these similar
and dissimilar data pairs. We choose K = 30 for the SCENE-15 dataset and K = 10
for the other datasets. The regularization trade-off parameter C is chosen by cross-validation. We found that, over a wide range, the setting of C does not have a significant impact on the performance.
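The triplet-generation step just described can be sketched as follows. This is a simplified version with our own function and variable names, not the thesis code:

```python
import numpy as np

def sample_triplets(labels, K, rng=None):
    """For each anchor i, draw up to K same-label examples j and up to K
    different-label examples k, pairing them into triplets (i, j, k)."""
    rng = np.random.default_rng(0) if rng is None else rng
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    triplets = []
    for i in idx:
        similar = idx[(labels == labels[i]) & (idx != i)]
        dissimilar = idx[labels != labels[i]]
        if len(similar) == 0 or len(dissimilar) == 0:
            continue  # anchor has no usable similar or dissimilar partner
        js = rng.choice(similar, size=min(K, len(similar)), replace=False)
        ks = rng.choice(dissimilar, size=min(K, len(dissimilar)), replace=False)
        triplets.extend((i, j, k) for j, k in zip(js, ks))
    return triplets
```

Each triplet (i, j, k) then encodes the relative relation that xi should be closer to xj than to xk in Hamming space.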
5.6.3 Competing methods
We compare our CGHash with several state-of-the-art hashing methods. For simplicity, they are respectively referred to as LSH (Locality Sensitive Hashing [65]), SSC (Supervised Similarity Sensitive Coding [118], a modified version of [121]), LSI (Latent
Semantic Indexing [129]), LCH (Laplacian Co-Hashing [130]), SPH (Spectral Hash-
ing [34]), STH (Self-Taught Hashing [33]), AGH (Anchor Graph Hashing [38]), BREs
(Supervised Binary Reconstructive Embedding [28]), SPLH (Semi-Supervised Learning
Hashing [17]), and ITQ (Iterative Quantization [36]).
5.6.4 Evaluation criteria
For quantitative performance comparison, we apply the following three evaluation mea-
sures:
1. Precision-recall curve. The precision and recall values are calculated as follows:
precision = #retrieved relevant examples / #all retrieved examples,   (5.41)

recall = #retrieved relevant examples / #all relevant examples.   (5.42)

2. Precision of top-K retrieved examples:

precision = #retrieved relevant examples / K.   (5.43)
3. K-nearest-neighbor classification. Each test example is classified using majority
voting of the top-K retrieved examples.
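These three measures can be sketched in a few lines; the helper names are ours, and binary codes are assumed to take values in {−1, +1}:

```python
import numpy as np
from collections import Counter

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query.
    For codes in {-1, +1}, d_h = (m - <q, z>) / 2."""
    m = db_codes.shape[1]
    dists = (m - db_codes @ query_code) / 2
    return np.argsort(dists, kind="stable")

def precision_recall(retrieved, relevant):
    """(5.41)/(5.42): fractions of retrieved-and-relevant examples."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def knn_classify(query_code, db_codes, db_labels, k=3):
    """K-nearest-neighbor classification by majority vote of the top-k retrieved."""
    top = hamming_rank(query_code, db_codes)[:k]
    return Counter(db_labels[i] for i in top).most_common(1)[0][0]
```

Sweeping a threshold over the ranked list yields the full precision-recall curve, and its area gives the average precision reported in the figure legends.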
5.6.5 Quantitative comparison results
The performance of all hashing methods on the six datasets is shown in Figures 5.1, 5.2 and 5.3. In these figures, we report results in the following three aspects:
1) We show the precision-recall result using the maximum bit length. In the legend of each figure, we report the average precision (the area under the curve) and its standard deviation.
2) We show the precision of top-50 retrieved examples using different bit lengths. In
the legend of each figure, we report the mean score and its standard deviation using the
maximum bit length.
3) We show the K-nearest-neighbor classification error using different bit lengths. In
the legend of each figure, we report the mean score and its standard deviation using the
maximum bit length.
These results show that the proposed CGHash achieves the best performance in most cases. In the precision-recall results, CGHash has the best average precision, calculated as the area under the precision-recall curve. CGHash also has the best precision of top-50 retrieved examples in most cases. Moreover, CGHash usually has a lower classification error than the other methods.
We evaluate our CGHash using different settings of K for generating triplets on the SCENE-15 dataset. The results, shown in Figure 5.4, indicate that in general the performance improves as K increases.
Some retrieval examples of our CGHash on the MNIST and LABELME datasets are
shown in Figure 5.5. It shows that CGHash is able to retrieve accurate nearest neighbors.
5.7 Conclusion
We have proposed a novel hashing method that is implemented using column generation
optimization. Our method aims to preserve the triplet-based relative ranking. Such
[Figure 5.5: Two retrieval examples for CGHash on the LABELME and MNIST datasets. Query examples are shown in the left column; the retrieved examples are shown on the right.]
a set of triplet constraints is incorporated into the large-margin learning framework. Hash functions are then learned iteratively using column generation. Experimental results have shown that the proposed hashing method achieves improved similarity-preserving performance compared with many existing hashing methods.
Chapter 6
A General Two-Step Approach to
Learning-Based Hashing
In this chapter, we propose a flexible and general method [44] with a two-step learning
scheme. Most existing approaches to hashing apply a single form of hash function, and
an optimization process which is typically deeply coupled to this specific form. This
tight coupling restricts the flexibility of the method to respond to the data, and can
result in complex optimization problems that are difficult to solve. Here we propose
a flexible yet simple framework that is able to accommodate different types of loss
functions and hash functions. This framework allows a number of existing approaches
to hashing to be placed in context, and simplifies the development of new problem-
specific hashing methods. Our framework decomposes the hashing learning problem
into two steps: hash bit learning and hash function learning based on the learned bits.
The first step can typically be formulated as binary quadratic problems, and the second
step can be accomplished by training standard binary classifiers. Both problems have
been extensively studied in the literature. Our extensive experiments demonstrate that
the proposed framework is effective, flexible and outperforms the state-of-the-art.
6.1 Introduction
In general, hash functions are generated with the aim of preserving some notion of
similarity between data points. One of the seminal approaches in this vein is the random
projection based locality-sensitive hashing (LSH) [32, 62], which randomly generates
hash functions to approximate cosine similarity. Compared to this data-independent
method, recent work has focused on data-dependent approaches for generating more
effective hash functions. In this category, a number of methods have been proposed, for
example: spectral hashing (SPH) [34], multi-dimension spectral hashing (MDSH) [35],
iterative quantization (ITQ) [36] and inductive manifold hashing [64]. These methods
do not rely on labeled data and are thus categorized as unsupervised hashing methods.
Another category is the supervised hashing methods. Recent works include supervised
hashing with kernels (KSH) [31], minimal loss hashing (MLH) [29], supervised binary
reconstructive embeddings (BRE) [28], semi-supervised sequential projection learning
hashing (SPLH) [17] and column generation hashing [43], etc.
Loss functions for hashing are typically defined on the basis of the Hamming distance or
Hamming affinity of similar and dissimilar data pairs. Hamming affinity is calculated by
the inner product of two binary codes (each bit takes a value in {−1, 1}). Existing
methods thus tend to optimize a single form of hash functions, the parameters of which
are directly optimized against the overall loss function. The common forms of hash
functions are linear perceptron functions (MLH, SPLH, LSH), kernel functions (KSH), and eigenfunctions (SPH, MDSH). The optimization procedure is then coupled with the selected family of hash functions. Different types of hash functions offer a trade-off between
testing time and ranking accuracy. For example, compared with kernel functions, the
simple linear perceptron function is usually much more efficient for evaluation but can
have a relatively low accuracy for nearest neighbor search. Moreover, this coupling often
results in a highly non-convex problem which can be very difficult to optimize.
As an example, the loss functions in MDSH, KSH and BRE all take a similar form
that aims to minimize the difference between the Hamming affinity (or distance) and
the ground truth of data pairs. However, the optimization procedures used in these
methods are coupled with the form of hash functions (eigenfunctions, kernel functions)
and thus different optimization techniques are needed.
Self-Taught Hashing (STH) [33] is a method which decomposes the learning procedure
into two steps: binary code generation and hash function learning. We extend this idea
and propose a general two-step approach to hashing of which STH can be seen as a
specific example. Note that STH optimizes the Laplacian affinity loss, which only tries
to pull together those similar data pairs but does not push away those dissimilar data
pairs. As shown in manifold learning, this may lead to inferior performance [71].
Our framework, however, is able to accommodate many different loss functions defined
on the Hamming affinity of data pairs, such as the loss function used in KSH, BRE or
MLH. This more general family of loss functions may consider both similar and dissimilar
data pairs. In order to produce effective binary codes in this first step, we develop a new
technique based on coordinate descent. We show that at each iteration of coordinate
descent, we can formulate the optimization problem of any Hamming affinity loss as a
binary quadratic problem. This formulation unifies different types of objective functions
into the same optimization problem, which significantly simplifies the optimization effort.
Our main contributions are as follows.
1. We propose a flexible hashing framework that decomposes the learning procedure
into two steps: binary codes inference step and hash function learning step. This
decomposition simplifies the problem and enables the use of different types of loss
functions and simplifies the hash function learning problem into a standard binary
classification problem. An arbitrary classifier, such as linear or kernel support
vector machines (SVM), boosting, neural networks, may thus be adopted to train
the hash functions.
2. For binary code inference, we show that optimization using different types of loss
functions (e.g., loss functions in KSH, BRE, MLH) can be solved as a series of
binary quadratic problems. We show that any type of loss function (e.g., the `2
loss, exponential loss, hinge loss) defined on Hamming affinity of data pairs can
be equivalently converted into a standard quadratic function. Based on this key
observation, we propose a general block coordinate descent method that is able to
incorporate many different types of loss functions in a unified manner.
3. The proposed method is simple and easy to implement. We carry out extensive
experiments on nearest neighbor search for image retrieval. To show the flexibility,
we evaluate our method using different types of loss functions and different forms of hash functions (linear SVM, kernel SVM, AdaBoost with decision stumps, etc.).
Experiments show that our method outperforms the state-of-the-art.
6.2 Two-Step Hashing
Given a set of training points X = {x1, x2, . . . , xn} ⊂ ℝᵈ, the goal of hashing is to learn
a set of hash functions that are able to preserve some notion of similarity between data
points. A ground truth affinity (or distance) matrix, Y, is provided (or calculated by a
pre-defined rule) for training, which defines the (dis-)similarity relations between data
pairs. In this case yij is the (i, j)-th element of the matrix Y, which is an affinity value
of the data pair (xi,xj). As a simple example, if the data labels are available, yij can
be defined as 1 for similar data pairs and −1 for dissimilar data pairs. In the case of
unsupervised learning, yij can be defined as the Euclidean distance or Gaussian affinity
on data points. The output of m hash functions is denoted by Φ(x):
Φ(x) = [h1(x), h2(x), . . . , hm(x)], (6.1)
which is a vector of m-bit binary codes: Φ(x) ∈ {−1, 1}ᵐ. In general, the optimization can be written as:

min_{Φ(·)} ∑_{i=1}^{n} ∑_{j=1}^{n} δij L(Φ(xi), Φ(xj); yij).   (6.2)
Here δij ∈ {0, 1} indicates whether the relation between two data points is defined, and L(·) is a loss function that measures how well the binary codes match the expected
affinity (or distance) yij . Many different types of loss functions L(·) have been devised,
and will be discussed in detail in the next section.
Most existing methods try to directly optimize objective (6.2) in order to learn the
parameters of hash functions [28, 29, 31, 35]. This inevitably means that the optimization
process is tightly coupled to the form of hash functions used, which makes it non-trivial
to extend a method to a different form of hash function. Moreover, this
coupling usually results in highly non-convex problems. Following the idea of STH [33],
we decompose the learning procedure into two steps: the first step for binary code
inference and the second step for hash function learning. The first step is to solve the
optimization:
min_Z ∑_{i=1}^{n} ∑_{j=1}^{n} δij L(zi, zj; yij),  s.t. Z ∈ {−1, 1}^{m×n},   (6.3)
where Z is the matrix of m-bit binary codes for all data points, and zi is the binary
code vector corresponding to the i-th data point.
The second step is to learn hash functions based on the binary codes obtained in the
first step, which is achieved by solving the optimization problem:
min_{Φ(·)} ∑_{i=1}^{n} G(zi, Φ(xi)).   (6.4)
Here G(·, ·) is a loss function which evaluates the correctness of the binary label predic-
tion. We solve the above optimization independently for each of the m bits. To learn
the r-th hash function hr(·), the optimization can be written as:

min_{hr(·)} ∑_{i=1}^{n} F(zi,r, hr(xi)).   (6.5)
Here F(·, ·) is a loss function defined on two codes which evaluates their consistency; zi,r is the binary code corresponding to the i-th data point and the r-th bit. Clearly, the above optimization is a binary classification problem, which minimizes a loss given the binary labels. For example, the loss function F(·) can be a zero-one
loss function returning 0 if two inputs have the same value, and 1 otherwise. As in
classification, one can also use a convex surrogate to replace the zero-one loss. Typical
surrogate loss functions are hinge loss, logistic loss, etc. The resulting classifier is the
hash function that we aim to learn. Therefore, we are able to use any form of classifier.
For example, we can learn perceptron hash functions by training a linear SVM. The
linear perceptron hash function has the form:
h(x) = sign(wᵀx + b).   (6.6)
We could also train, for example, an RBF-kernel SVM or AdaBoost as hash functions. Here we describe a kernel hash function that is learned using a linear SVM on kernel-transferred features (referred to as SVM-KF). The hash function learned by SVM-KF has the following form:

h(x) = sign( ∑_{q=1}^{Q} w_q κ(x′_q, x) + b ),   (6.7)

in which X′ = {x′1, . . . , x′Q} are Q data points generated from the training set by random or uniform sampling.
We evaluate a variety of hash functions in the Experiments Section below. These tests show that kernel hash functions often offer better ranking precision but require much more evaluation time than linear perceptron hash functions. The hash functions learned by SVM-KF represent a trade-off between a kernel SVM and a linear SVM.
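Step-2 thus reduces to training one binary classifier per bit on the Step-1 codes. As a minimal self-contained illustration, the sketch below trains a simple perceptron per bit; this is a stand-in for the linear SVM actually used in the thesis, and all names are ours:

```python
import numpy as np

def train_linear_hash(X, z, epochs=100, lr=0.1):
    """Learn one hash function h(x) = sign(w^T x + b) from Step-1 bit
    labels z in {-1, +1}^n, using plain perceptron updates."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            if z[i] * (X[i] @ w + b) <= 0:   # misclassified (or on the boundary)
                w += lr * z[i] * X[i]
                b += lr * z[i]
    return w, b

def hash_codes(X, hash_params):
    """Apply the m learned hash functions to produce m-bit codes in {-1, +1}."""
    return np.column_stack(
        [np.where(X @ w + b >= 0, 1, -1) for (w, b) in hash_params])
```

For the kernel variants (RBF-kernel SVM or SVM-KF), X would simply be replaced by the kernel-transferred features κ(x′_q, x) before training.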
The method proposed here is labeled Two-Step Hashing (TSH); the steps are as follows:

• Step-1: Solve the optimization problem in (6.3) using block coordinate descent (Algorithm 6) to obtain binary codes for each training data point.

• Step-2: Solve the binary classification problem in (6.5) for each bit, based on the binary codes obtained in Step-1.
An illustration of TSH is shown in Figure 6.1.
6.3 Solving binary quadratic problems
Optimizing (6.3) in Step-1 for the entire binary code matrix can be difficult. Instead, we
develop a bit-wise block coordinate descent method so that the problem at each iteration
[Figure 6.1: An illustration of Two-Step Hashing. Step-1 converts the hashing learning optimization into binary quadratic problems, which are solved for the binary codes; Step-2 solves binary classification problems to obtain the hash functions. Loss function options: any Hamming distance or affinity based loss function, e.g., KSH, BRE. Hash function options: any classifier, e.g., linear/kernel SVM, boosting, random forest, neural network.]
Algorithm 6: TSH: binary code inference (Step-1)
Input: affinity matrix Y; bit length m; number of cyclic iterations r.
Output: the matrix of binary codes Z.
1  Initialize the binary code matrix Z.
2  repeat
3      for t = 1, 2, . . . , m do
4          solve the binary quadratic problem (BQP) in (6.16) to obtain the binary codes of the t-th bit;
5          update the codes of the t-th bit in the code matrix Z;
6  until the maximum number of cyclic iterations r is reached
can be solved easily. Moreover, we show that at each iteration, any pairwise Hamming
affinity (or distance) based loss can be equivalently formulated as a binary quadratic
problem. Thus we are able to easily work with different loss functions.
Block coordinate descent (BCD) is a technique that iteratively optimizes a subset of
variables at a time. For each iteration, we pick one bit for optimization in a cyclic
fashion. The optimization for the r-th bit can be written as:

min_{z(r)} ∑_{i=1}^{n} ∑_{j=1}^{n} δij l_r(zi,r, zj,r),  s.t. z(r) ∈ {−1, 1}ⁿ,   (6.8)

where l_r is the loss function defined on the r-th bit:

l_r(zi,r, zj,r) = L(zi,r, zj,r, z̄i, z̄j; yij).   (6.9)

Here z(r) contains the binary codes of the r-th bit; zi,r is the binary code of the i-th data point at the r-th bit; and z̄i denotes the binary codes of the i-th data point excluding the r-th bit.
Thus far, we have not described the form of the loss function L(·). Our optimization method is not restricted to a specific form of loss function. Based on the following proposition, we are able to rewrite any Hamming affinity (or distance) based loss function L(·) as a standard quadratic problem.
Proposition 6.1. For any loss function l(z1, z2) that is defined on a pair of binary input variables z1, z2 ∈ {−1, 1} and satisfies

l(1, 1) = l(−1,−1),  l(1,−1) = l(−1, 1),   (6.10)

we can define a quadratic function g(z1, z2) that is equal to l(z1, z2). We have the following equations:

l(z1, z2) = (1/2) [z1 z2 (l(11) − l(−11)) + l(11) + l(−11)]   (6.11)
          = (1/2) z1 z2 (l(11) − l(−11)) + const   (6.12)
          = g(z1, z2).   (6.13)

Here l(11) and l(−11) are constants: l(11) is the loss output on an identical input pair, l(11) = l(1, 1), and l(−11) is the loss output on a distinct input pair, l(−11) = l(−1, 1).
Proof. This proposition can be proved by exhaustively checking all possible inputs of the loss function. Notice that there are only two possible output values of the loss function. For the input (z1 = 1, z2 = 1):

g(1, 1) = (1/2) [1 × 1 × (l(11) − l(−11)) + l(11) + l(−11)] = l(11) = l(1, 1).

For the input (z1 = −1, z2 = 1):

g(−1, 1) = (1/2) [−1 × 1 × (l(11) − l(−11)) + l(11) + l(−11)] = l(−11) = l(−1, 1).

The input (z1 = −1, z2 = −1) gives the same value as (z1 = 1, z2 = 1), and the input (z1 = 1, z2 = −1) the same as (z1 = −1, z2 = 1). In conclusion, the functions l and g have the same output for all possible inputs.
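Proposition 6.1 is also easy to verify numerically. The sketch below builds the quadratic surrogate g from any admissible l and checks equality on all four input pairs; a KSH-style single-bit ℓ2 loss is used as an example, and the helper name is ours:

```python
def quadratic_surrogate(loss):
    """Given l(z1, z2) with l(1,1) = l(-1,-1) and l(1,-1) = l(-1,1),
    return g(z1, z2) = 0.5 * (z1*z2*(l11 - lm11) + l11 + lm11)."""
    l11, lm11 = loss(1, 1), loss(-1, 1)   # the two constants of (6.11)
    return lambda z1, z2: 0.5 * (z1 * z2 * (l11 - lm11) + l11 + lm11)

# Example: single-bit l2 loss on Hamming affinity for a similar pair (y = 1).
l = lambda z1, z2: (z1 * z2 - 1) ** 2
g = quadratic_surrogate(l)
same_on_all_inputs = all(l(a, b) == g(a, b) for a in (-1, 1) for b in (-1, 1))
```

Since only the agreement of the two bits matters, the four input pairs collapse to two loss values, which is exactly why a quadratic in z1 z2 suffices.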
Any hash loss function l(·, ·) defined on the Hamming affinity or Hamming distance of data pairs meets the requirement that l(1, 1) = l(−1,−1) and l(1,−1) = l(−1, 1). Applying this proposition, the optimization in (6.8) can be equivalently reformulated as:

min_{z(r)∈{−1,1}ⁿ} ∑_{i=1}^{n} ∑_{j=1}^{n} δij (l(11)_{r,i,j} − l(−11)_{r,i,j}) zi,r zj,r.   (6.14)

The above optimization is an unconstrained binary quadratic problem. Let a_{i,j} denote the (i, j)-th element of a matrix A, which we define as:

a_{i,j} = δij (l(11)_{r,i,j} − l(−11)_{r,i,j}).   (6.15)

The optimization (6.14) can then be written in matrix form:

min_{z(r)} z(r)ᵀ A z(r),  s.t. z(r) ∈ {−1, 1}ⁿ.   (6.16)
We have shown that at each iteration, the original optimization in (6.8) can be equiv-
alently reformulated as a binary quadratic problem (BQP) in (6.16). BQP has been
extensively studied. To solve (6.16), we first apply the spectral relaxation to get an
initial solution. Spectral relaxation drops the binary constraints. The optimization
becomes
minz(r)
z>(r)Az(r), s.t. ‖z(r)‖22 = n. (6.17)
The solution (denoted z0(r)) of the above optimization is simply the eigenvector that
corresponds to the minimum eigenvalue of the matrix A. To achieve a better solution,
here we take a step further. We solve the following relaxed problem of (6.16) as follows
minz(r)
z>(r)Az(r), s.t. z(r) ∈= [−1, 1]n. (6.18)
This relaxation is tighter than the spectral relaxation and yields a solution of better quality. To solve the above problem, we use the solution z^0_{(r)} of the spectral relaxation in (6.17) as initialization and run the efficient L-BFGS-B solver [2]. The algorithm for binary code inference in Step 1 is summarized in Algorithm 6.
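As a concrete sketch of this two-stage procedure (not the thesis implementation: a random symmetric matrix stands in for the A of (6.15)), one can compute the spectral solution via an eigendecomposition and then refine it under the box constraints with SciPy's L-BFGS-B:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 60
# Hypothetical symmetric coefficient matrix, standing in for
# a_ij = delta_ij * (l^(11) - l^(-11)) from Eq. (6.15).
A = rng.standard_normal((n, n))
A = (A + A.T) / 2

def obj(z):
    # objective z^T A z and its gradient 2 A z
    return z @ A @ z, 2 * A @ z

# Spectral relaxation (6.17): eigenvector of the smallest eigenvalue,
# scaled so that ||z||_2^2 = n (eigh returns unit-norm eigenvectors).
w, V = np.linalg.eigh(A)
z0 = V[:, 0] * np.sqrt(n)

# Tighter box relaxation (6.18): z in [-1, 1]^n, solved with L-BFGS-B
# starting from the spectral solution clipped into the box.
z0c = np.clip(z0, -1.0, 1.0)
res = minimize(obj, z0c, jac=True, method="L-BFGS-B",
               bounds=[(-1.0, 1.0)] * n)

z_binary = np.sign(res.x)        # final binary codes for this bit
z_binary[z_binary == 0] = 1
```

Since L-BFGS-B only decreases the objective from its starting point, the refined solution is never worse than the (clipped) spectral one.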
The approach proposed above is applicable to many different types of loss functions defined on the Hamming distance or Hamming affinity, such as the ℓ2 loss, the exponential loss and the hinge loss. Here we describe a selection of such loss functions, most of which arise from recently proposed hashing methods. We evaluate these loss functions in the Experiments section below. Note that m is the number of bits, and dh(·, ·) is the
Hamming distance on data pairs. If not specified, yij = 1 if the data pair is similar, and yij = −1 if the data pair is dissimilar. δ(·) ∈ {0, 1} is an indicator function.
• TSH-KSH
The KSH loss function is based on Hamming affinity and uses the ℓ2 loss. MDSH also uses a similar form of loss function (weighted Hamming affinity instead):

L_{KSH}(z_i, z_j) = \big( z_i^\top z_j - m y_{ij} \big)^2. \qquad (6.19)
• TSH-BRE
The BRE loss function is based on Hamming distance using the ℓ2 loss function:

L_{BRE}(z_i, z_j) = \big( d_h(z_i, z_j)/m - \delta(y_{ij} < 0) \big)^2. \qquad (6.20)
• TSH-SPLH
This applies an exponential loss on top of the loss proposed in SPLH, which is based on the Hamming affinity of data pairs:

L_{SPLH}(z_i, z_j) = \exp\left[ \frac{-y_{ij} z_i^\top z_j}{m} \right]. \qquad (6.21)
• TSH-EE
Elastic Embedding (EE) is a dimensionality reduction method proposed in [71]. Here we use their loss function with some modifications: an exponential loss based on the Hamming distance, where λ is a trade-off parameter:

L_{EE}(z_i, z_j) = \delta(y_{ij} > 0)\, d_h(z_i, z_j) + \lambda\, \delta(y_{ij} < 0) \exp\big[ -d_h(z_i, z_j)/m \big]. \qquad (6.22)
• TSH-ExpH
ExpH is an exponential loss function using the Hamming distance:

L_{ExpH}(z_i, z_j) = \exp\left[ \frac{y_{ij}\, d_h(z_i, z_j) + m\, \delta(y_{ij} < 0)}{m} \right]. \qquad (6.23)
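The five losses above can be written compactly in a few lines. The sketch below is an illustrative Python rendering of Eqs. (6.19)–(6.23) on m-bit codes in {−1, 1}^m, using the identity d_h(z_i, z_j) = (m − z_i^⊤ z_j)/2 (the function names are ours, not from the thesis):

```python
import numpy as np

m = 8                                    # number of bits

def hamming_dist(zi, zj):
    # d_h = (m - <zi, zj>) / 2 for codes in {-1, 1}^m
    return (m - zi @ zj) / 2

# Pairwise losses of Eqs. (6.19)-(6.23); y is +1 for a similar pair,
# -1 for a dissimilar one; lam is the EE trade-off parameter.
def L_ksh(zi, zj, y):
    return float((zi @ zj - m * y) ** 2)

def L_bre(zi, zj, y):
    return float((hamming_dist(zi, zj) / m - (y < 0)) ** 2)

def L_splh(zi, zj, y):
    return float(np.exp(-y * (zi @ zj) / m))

def L_ee(zi, zj, y, lam=100.0):
    d = hamming_dist(zi, zj)
    return float((y > 0) * d + lam * (y < 0) * np.exp(-d / m))

def L_exph(zi, zj, y):
    d = hamming_dist(zi, zj)
    return float(np.exp((y * d + m * (y < 0)) / m))
```

Each loss depends on the pair only through z_i^⊤ z_j (equivalently the Hamming distance), which is exactly the property the Step-1 inference exploits.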
Table 6.1: Results (using hash codes of 32 bits) of TSH using different loss functions, and a selection of other supervised and unsupervised methods on 3 datasets. The upper part reports the results on training data and the lower part on testing data. The results show that Step-1 of our method is able to generate effective binary codes that outperform those of competing methods on the training data. On the testing data our method also outperforms the others by a large margin in most cases.

                Precision-Recall           MAP                        Precision at K (K=300)
                LABELME MNIST  CIFAR10    LABELME MNIST  CIFAR10    LABELME MNIST  CIFAR10

Results on training data
TSH-KSH         0.501   1.000  1.000      0.570   1.000  1.000      0.229   0.667  0.667
TSH-BRE         0.527   1.000  1.000      0.600   1.000  1.000      0.230   0.667  0.667
TSH-SPLH        0.504   1.000  1.000      0.524   1.000  1.000      0.230   0.667  0.667
TSH-EE          0.485   1.000  1.000      0.524   1.000  1.000      0.224   0.667  0.667
TSH-ExpH        0.475   1.000  1.000      0.541   1.000  1.000      0.225   0.667  0.667
STHs            0.335   0.800  0.629      0.387   0.882  0.774      0.176   0.575  0.433
KSH             0.283   0.892  0.585      0.316   0.967  0.652      0.168   0.647  0.481
BREs            0.161   0.445  0.220      0.153   0.504  0.190      0.097   0.376  0.171
SPLH            0.166   0.500  0.292      0.153   0.588  0.302      0.092   0.422  0.260
MLH             0.120   0.547  0.190      0.142   0.685  0.235      0.100   0.478  0.200

Results on testing data
TSH-KSH         0.175   0.843  0.282      0.296   0.893  0.440      0.293   0.889  0.410
TSH-BRE         0.169   0.844  0.283      0.293   0.896  0.439      0.293   0.890  0.409
TSH-SPLH        0.174   0.840  0.284      0.291   0.895  0.444      0.288   0.891  0.416
TSH-EE          0.169   0.843  0.280      0.288   0.896  0.438      0.286   0.892  0.410
TSH-ExpH        0.172   0.844  0.282      0.287   0.892  0.441      0.286   0.887  0.410
STHs            0.094   0.385  0.144      0.162   0.639  0.229      0.156   0.634  0.218
STHs-RBF        0.151   0.674  0.178      0.274   0.897  0.354      0.271   0.893  0.352
KSH             0.165   0.781  0.249      0.279   0.884  0.407      0.158   0.881  0.398
BREs            0.106   0.409  0.151      0.178   0.703  0.226      0.171   0.702  0.210
MLH             0.100   0.470  0.150      0.181   0.648  0.264      0.174   0.623  0.215
SPLH            0.093   0.452  0.191      0.168   0.714  0.321      0.158   0.708  0.315
ITQ-CCA         0.077   0.619  0.206      0.143   0.792  0.333      0.133   0.784  0.325
MDSH            0.100   0.298  0.150      0.178   0.691  0.288      0.155   0.685  0.228
SPHER           0.102   0.296  0.152      0.185   0.624  0.244      0.176   0.623  0.233
ITQ             0.116   0.386  0.161      0.206   0.750  0.264      0.197   0.751  0.252
AGH             0.096   0.404  0.144      0.194   0.743  0.252      0.187   0.744  0.244
STH             0.077   0.361  0.135      0.135   0.593  0.216      0.125   0.644  0.204
BRE             0.091   0.323  0.137      0.160   0.651  0.238      0.147   0.582  0.185
LSH             0.069   0.211  0.123      0.116   0.459  0.188      0.103   0.448  0.162
Table 6.2: Training time (in seconds) for TSH using different loss functions, and several other supervised methods on 3 datasets. The value inside brackets is the time used in the first step for inferring the binary codes. The results show that our method is efficient. Note that the second step of learning the hash functions can be easily parallelised.

                LABELME     MNIST       CIFAR10
TSH-KSH         198 (107)   341 (294)   326 (262)
TSH-BRE         133 (33)    309 (264)   234 (175)
TSH-EE          124 (29)    302 (249)   287 (225)
TSH-ExpH        128 (43)    334 (281)   344 (256)
STHs-RBF        133         99          95
KSH             326         355         379
BREs            216         615         231
MLH             670         805         658
Figure 6.2: Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are the top 40 retrieved images in the database. False predictions are marked by red boxes.
Figure 6.3: Some retrieval examples of our method TSH on CIFAR10. The first column shows query images, and the rest are the top 40 retrieved images in the database. False predictions are marked by red boxes.
[Figure 6.4 panels: LabelMe and CIFAR10; Precision@300, precision within Hamming distance 2, and MAP (1000-NN) versus the number of bits, plus precision versus the number of retrieved samples (32 bits), for SPLH, STHs-RBF, ITQ-CCA, MLH, BREs, KSH and TSH.]
Figure 6.4: Results of supervised methods on 2 datasets. Results show that TSH usually outperforms the others by a large margin. The runner-up methods are STHs-RBF and KSH.
[Figure 6.5 panels: LabelMe and CIFAR10; Precision@300, precision within Hamming distance 2, and MAP (1000-NN) versus the number of bits, plus precision versus the number of retrieved samples (32 bits), for KLSH, SPH, AGH, ITQ, SPHER, MDSH and TSH.]
Figure 6.5: Results on 2 datasets comparing unsupervised methods. Results show that TSH usually outperforms the others by a large margin.
[Figure 6.6 panels: SCENE15, USPS and ISOLET; Precision@100 and MAP (1000-NN) versus the number of bits, for ITQ, SPHER, ITQ-CCA, BREs, KSH and TSH.]
Figure 6.6: Results on SCENE15, USPS and ISOLET comparing with supervised and unsupervised methods. Our TSH performs the best.
[Figure 6.7 panels: LabelMe and CIFAR10; Precision@300, precision within Hamming distance 2, and MAP (1000-NN) versus the number of bits, plus precision versus the number of retrieved samples (32 bits), for TSH-Stump, TSH-LSVM, TSH-KF and TSH-RBF.]
Figure 6.7: Results on 2 datasets of our method using different hash functions. Results show that using kernel hash functions (TSH-RBF and TSH-KF) achieves the best performance.
[Figure 6.8 panels: LabelMe and CIFAR10; database code compression time (seconds) versus binary code length, for TSH-Stump, TSH-LSVM, TSH-KF and TSH-RBF.]
Figure 6.8: Code compression time using different hash functions. Results show that using the kernel-transferred feature (TSH-KF) is much faster than SVM with the RBF kernel (TSH-RBF). Linear SVM is the fastest.
[Figure 6.9 panels: FLICKR1M and TINY580K; precision versus recall (32 bits) and precision versus the number of retrieved samples (32 bits), for SPLH, STHs, ITQ-CCA, MLH, BREs, KSH and TSH.]
Figure 6.9: Comparison of supervised methods on 2 large-scale datasets: Flickr1M and Tiny580k. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform the other methods.
[Figure 6.10 panels: FLICKR1M and TINY580K; precision versus recall (32 bits) and precision versus the number of retrieved samples (32 bits), for KLSH, SPH, AGH, ITQ, SPHER, MDSH and TSH.]
Figure 6.10: Comparison of unsupervised methods on 2 large-scale datasets: Flickr1M and Tiny580k. Our method TSH achieves on-par results with KSH. TSH and KSH significantly outperform the other methods.
6.4 Experiments
6.4.1 Comparing methods
We compare with a number of state-of-the-art hashing methods, including 6 (semi-)supervised methods: Supervised Hashing with Kernels (KSH) [31], Iterative Quantization with supervised embedding (ITQ-CCA) [36], Minimal Loss Hashing (MLH) [29], Supervised Binary Reconstructive Embeddings (BREs) [28] and its unsupervised version BRE, Supervised Self-Taught Hashing (STHs) [33] and its unsupervised version STH, and Semi-supervised Sequential Projection Learning Hashing (SPLH) [17]; and 7 unsupervised methods: Locality-Sensitive Hashing (LSH) [32], Iterative Quantization (ITQ) [36], Anchor Graph Hashing (AGH) [38], Spectral Hashing (SPH) [34], Spherical Hashing (SPHER) [39], Multi-dimension Spectral Hashing (MDSH) [35], and Kernelized Locality-Sensitive Hashing (KLSH) [63].
For the competing methods, we follow the original papers for parameter settings. For SPLH, the regularization trade-off parameter is picked from 0.01 to 1. We use the hierarchical variant of AGH. For each dataset, the bandwidth parameter of the Gaussian affinity in MDSH and of the RBF kernel in KLSH, KSH and our method TSH is set as σ = t d, where d is the average Euclidean distance of the top 100 nearest neighbours and t is picked from 0.01 to 50. For STHs and our method TSH, the trade-off parameter in the SVM is picked from 10/n to 10^5/n, where n is the number of data points. For TSH-EE, which uses the EE loss function, we simply set the trade-off parameter λ to 100. If not specified, our method TSH uses SVM with the RBF kernel as hash functions. The cyclic iteration number r in Algorithm 6 is simply set to 1.
6.4.2 Dataset description
We use 2 large-scale image datasets and 6 smaller datasets for evaluation. The 2 large image datasets are the 580,000-image tiny image dataset (Tiny-580K) and the Flickr 1 million image dataset (Flickr-1M). The 6 small datasets are CIFAR10, MNIST, LabelMe, SCENE15, and 2 UCI datasets: USPS and ISOLET.
CIFAR10^1 is a subset of the Tiny-80M [131] image dataset and contains 60K examples. We generated 320-dimensional GIST features. MNIST consists of 70K examples with dimension 784. The LabelMe image dataset is used in [29]; it has 22K images and 512-dimensional GIST features. SCENE15 [84] is a dataset of scene images which has 4.5K examples. We extract a 200-visual-word histogram over 31 sub-windows, which results in a 6200-dimensional feature vector. The ISOLET dataset contains around 8K spoken recordings of 26 letters and has 617 dimensions. USPS is a handwritten digits dataset which has 9K examples and 256 dimensions.
The Tiny-580K dataset, described in [36], is a subset of the Tiny-80M dataset [131] which consists of 580,000 images. We use the provided GIST features with 384 dimensions. The Flickr-1M dataset consists of 1 million thumbnail images from MIRFlickr-1M. We generate 320-dimensional GIST features.
For the LabelMe dataset, the ground truth pairwise affinity matrix is provided. For the other small datasets, we use the multi-class labels to define the ground truth affinity by label agreement. For the large datasets Tiny-580K and Flickr-1M, no semantic ground truth affinity is provided. Following the same setting as other hashing methods [17, 31], we generate pseudo-labels for supervised methods according to the ℓ2 distance. In detail, a data point is labelled as a relevant neighbour of the query if it lies in the top 2 percentile of points of the whole database.
1http://www.cs.toronto.edu/˜kriz/cifar.html
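A minimal sketch of this pseudo-label construction (with made-up random vectors standing in for the actual GIST features):

```python
import numpy as np

# A database point counts as a relevant neighbour of a query if it lies
# within the top 2% of the database by Euclidean (l2) distance.
rng = np.random.default_rng(3)
db = rng.standard_normal((500, 16))      # hypothetical database features
query = rng.standard_normal(16)          # hypothetical query feature

dists = np.linalg.norm(db - query, axis=1)
cutoff = np.quantile(dists, 0.02)        # top-2-percentile distance radius
relevant = dists <= cutoff               # pseudo-label y = +1 for these points
```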
For all datasets, following a common setting in many supervised hashing methods [28, 29, 31], we randomly select 2000 examples as testing queries, and the rest serves as the database. We train all methods using a subset of the database: 5000 examples for the large datasets (Tiny-580K and Flickr-1M) and 2000 examples for the rest.
6.4.3 Evaluation measures
We use 4 types of evaluation measures: precision of the top-K retrieved examples (Precision-at-K), Mean Average Precision (MAP), the area under the Precision-Recall curve, and precision of retrieved examples within Hamming distance 2. Precision-at-K is the proportion of relevant data points in the returned top-K results. MAP averages the Precision-at-K scores over the positions K of all relevant data points in a ranking. The Precision-Recall curve measures the overall performance at all positions; it is computed by varying the number of retrieved nearest neighbours (K) and calculating the corresponding precision and recall values.
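For concreteness, Precision-at-K and average precision over a single ranked result list can be computed as follows (an illustrative sketch; the function names are ours, not the thesis' evaluation code):

```python
import numpy as np

def precision_at_k(ranked_relevance, k):
    """Fraction of relevant items among the top-k retrieved results.

    ranked_relevance is a 0/1 array, ordered by retrieval rank."""
    return float(np.mean(ranked_relevance[:k]))

def average_precision(ranked_relevance):
    """Precision@K averaged over the ranks K of the relevant items."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.flatnonzero(rel) + 1           # 1-based positions of hits
    prec = np.cumsum(rel)[ranks - 1] / ranks  # precision at each hit
    return float(prec.mean())
```

MAP is then the mean of `average_precision` over all test queries.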
6.4.4 Using different loss functions
We evaluate the performance of our method TSH using different loss functions on 3 datasets: LabelMe, MNIST and CIFAR10. 3 types of evaluation measures are used here: Precision-at-K, Mean Average Precision (MAP) and the area under the Precision-Recall curve. The loss functions are defined in Section 6.3. In particular, TSH-KSH uses the KSH [31] loss function and TSH-BRE uses the BRE [28] loss function. STHs-RBF is the STHs method using RBF kernel hash functions. Our method also uses SVM with the RBF kernel as hash functions.
First, we evaluate the effectiveness of Step 1 of our method by comparing the quality of the binary codes generated on the training data points. The results are shown in the upper part of Table 6.1. They show that our methods generate high-quality binary codes and outperform the others by a large margin. On CIFAR10 and MNIST, we are able to generate perfect codes that match the ground truth similarity. This demonstrates the effectiveness of the coordinate-descent-based binary code learning procedure (Step 1 of our framework).
Compared to STHs-RBF, even though we use the same form of hash function, our overall objective function and the bit-wise binary code inference algorithm appear to be more effective; thus our method achieves better performance than STHs-RBF.
The lower part of Table 6.1 shows the testing performance. Our method also outperforms the others in most cases. Note that MNIST is an 'easy' dataset and not as challenging as CIFAR10 and LabelMe, so many methods manage to achieve good performance on it. On the challenging CIFAR10 and LabelMe datasets, our method outperforms the others by a large margin.
Overall, for preserving semantic similarity, supervised methods usually perform much better than unsupervised methods, as expected. Our method performs the best, and the runner-up methods are STHs-RBF, KSH and ITQ-CCA.
We show further results using different numbers of bits in Figure 6.4 for supervised methods and Figure 6.5 for unsupervised methods on the CIFAR10 and LabelMe datasets. For the SCENE15, USPS and ISOLET datasets, the results of several supervised and unsupervised methods are shown in Figure 6.6.
In the figures, TSH denotes our method using the BRE loss function. Our method still performs the best in most cases. Some retrieval examples are shown in Figures 6.2 and 6.3.
6.4.5 Training time
In Table 6.2, we compare the training time of different methods. It shows that our method is fast compared to the state-of-the-art. We also report the binary code learning time in the table. Notice that in the second step, learning hash functions by binary classification can easily be parallelised, which would make our method even more efficient.
6.4.6 Using different hash functions
We evaluate our method using different hash functions: SVM with the RBF kernel (TSH-RBF), linear SVM with kernel-transferred features (TSH-KF), linear SVM (TSH-LSVM), and AdaBoost with decision stumps (TSH-Stump, 2000 iterations). Results are shown in Figure 6.7. The testing time for the different hash functions is shown in Figure 6.8.
The kernel hash functions (TSH-RBF and TSH-KF) achieve the best performance in similarity search. However, linear hash functions are much faster to evaluate than kernel hash functions. We also find that the testing of TSH-KF is much faster than that of TSH-RBF; TSH-KF thus offers a trade-off between testing time and search performance.
6.4.7 Results on large datasets
We carry out experiments on 2 large scale datasets: Flickr 1 million image dataset
(Flickr1M) and 580, 000 Tiny image dataset (Tiny580k). Results are shown in Figure 6.9
and Figure 6.10 for the comparison with supervised methods and unsupervised methods
respectively. Our method TSH achieve on par results with KSH. KSH and our TSH
significantly outperform other supervised or unsupervised methods. Notice that there
is no semantic similarity ground truth provided on these two datasets. We generate the
similarity ground truth using the Euclidean distance. Some unsupervised methods are
also able to perform well in this setting (e.g., MDSH, SPHER and ITQ).
6.5 Conclusion
We have shown that it is possible to place a wide variety of learning-based hashing methods into a unified framework. The key insight is that the code generation and hash function learning processes can be seen as separate steps, and that the latter can accurately be formulated as a classification problem. This insight enables the development of new hashing approaches with efficient and simple learning. Experimental testing has validated this approach and shown that it outperforms the state-of-the-art.
Chapter 7
Fast Supervised Hashing with
Decision Trees for
High-Dimensional Data
In this chapter, we propose a hashing method [45] for efficient and effective learning
on large-scale and high-dimensional data, which is an extension of our general two-step
hashing method described in Chapter 6.
Supervised hashing aims to map the original features to compact binary codes that
are able to preserve label based similarity in the Hamming space. Non-linear hash
functions have demonstrated their advantage over linear ones due to their powerful
generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing; they achieve encouraging retrieval performance at the price of slow evaluation and training. Here we propose to use boosted decision trees to achieve non-linearity in hashing; they are fast to train and evaluate, and hence more suitable for hashing with high-dimensional data. In our approach, we first propose sub-modular formulations for the binary code inference problem and an efficient GraphCut-based block search method for solving large-scale inference. Then we learn
hash functions by training boosted decision trees to fit the binary codes. Experiments
demonstrate that our proposed method significantly outperforms most state-of-the-art
methods in retrieval precision and training time. Especially for high-dimensional data,
our method is orders of magnitude faster than many methods in terms of training time.
7.1 Introduction
Hashing methods aim to preserve some notion of similarity (or distance) in the Ham-
ming space. These methods can be roughly categorized as supervised and unsupervised.
Unsupervised hashing methods try to preserve the similarity which is calculated in the
original feature space. For example, random projection based Locality-Sensitive Hash-
ing (LSH) [32] generates random linear hash functions to approximate cosine similarity;
Spectral Hashing [35] learns eigenfunctions that preserve Gaussian affinity; Iterative
Quantization (ITQ) [36] approximates the Euclidean distance in the Hamming space;
and Hashing on manifolds [64] takes the intrinsic manifold structure into consideration.
Supervised hashing is designed to preserve label-based similarity [28, 31, 43, 44]. This might take place, for example, where images from the same category are defined as being semantically similar to each other. Supervised hashing has received increasing attention recently; examples include Supervised Hashing with Kernels (KSH) [31], Two-Step Hashing (TSH) [44] and Binary Reconstructive Embeddings (BRE) [28]. Although supervised hashing is more flexible and appealing for real-world applications, its learning is usually much slower than that of unsupervised hashing. Despite the fact that hashing is only of practical interest where it can be applied to large numbers of high-dimensional features, most supervised hashing approaches are demonstrated only on relatively small numbers of low-dimensional features. For example, codebook-based features have achieved remarkable success in image classification [132, 133], and their feature dimensionality often reaches tens of thousands. To exploit this recent advance in feature learning, it is very desirable for supervised hashing to be able to deal with large-scale data efficiently on such sophisticated high-dimensional features. To bridge this gap, we propose a supervised hashing method which is able to leverage large training sets and efficiently handle high-dimensional features.
Non-linear hash functions, e.g., the kernel hash functions employed in KSH and TSH, have shown much improved performance over linear hash functions. However, kernel functions can be extremely expensive for both training and testing on high-dimensional features. Thus a scalable supervised hashing method with non-linear hash functions is also desirable.
Compared to kernel methods, decision trees involve only simple comparison operations for evaluation; thus they are much more efficient, especially on high-dimensional data. Moreover, decision trees are able to work on quantized data without significant performance loss, and hence consume very little memory during training. Although decision trees could be a good choice of hash function, it remains unclear how to learn decision trees for supervised hashing. Here we propose an efficient method for learning decision-tree hash functions.
Our main contributions are as follows.
1. We propose to use (ensembles of) decision trees as hash functions for supervised hashing. They can easily deal with very large amounts of training data of high dimensionality (tens of thousands) and provide the desired non-linear mapping. To our knowledge, our method is the first general hashing method that uses decision trees as hash functions.

2. In order to learn decision trees for supervised hashing efficiently, we apply a two-step learning strategy which decomposes the learning into binary code inference and the simple binary classification training of decision trees. For binary code inference, we propose sub-modular formulations and an efficient GraphCut [134] based block search method for solving large-scale inference.

3. Our method significantly outperforms many state-of-the-art methods in terms of retrieval precision. For high-dimensional data, our method is usually orders of magnitude faster in terms of training time.
The two-step learning strategy employed in our method is inspired by our general two-step hashing method, TSH [44], described in Chapter 6. Other work [118, 121, 135] also learns hash functions by training classifiers. The spectral method used in TSH for binary code inference does not scale well to large training data, and it may also lead to inferior results due to the loose relaxation of spectral methods. Moreover, TSH only demonstrates satisfactory performance with kernel hash functions on small-scale training data with low dimensionality, which is clearly not practical for large-scale learning on high-dimensional features. In contrast with TSH, we explore efficient decision trees as hash functions and propose an efficient GraphCut-based method for binary code inference. Experiments show that our method significantly outperforms TSH.
7.2 The proposed method
Let X = {x1, ..., xn} ⊂ R^d denote a set of training points. Label-based similarity information is described by an affinity matrix Y, which is the ground truth for supervised learning. The element y_{ij} of Y indicates the similarity of two data points x_i and x_j, and y_{ij} = y_{ji}. Specifically, y_{ij} = 1 if two data points are similar, y_{ij} = −1 if they are dissimilar (irrelevant), and y_{ij} = 0 if the pairwise relation is undefined. We aim to learn a set of
Algorithm 7: An example for constructing blocks
Input: training data points x1, ..., xn; affinity matrix Y.
Output: blocks B1, B2, ....
1: V ← {x1, ..., xn}; t ← 0
2: repeat
3:   t ← t + 1; Bt ← ∅; xi: randomly selected from V
4:   initialize U as the joint of V and the similar examples of xi
5:   for each xj in U do
6:     if xj is not dissimilar to any example in Bt then
7:       add xj to Bt; remove xj from V
8: until V = ∅

Algorithm 8: Step 1: Block GraphCut for binary code inference
Input: affinity matrix Y; bit length r; max inference iterations; blocks B1, B2, ...; binary codes z1, ..., z_{r−1}.
Output: binary codes of one bit: zr.
1: repeat
2:   randomly permute all blocks
3:   for each Bi do
4:     solve the inference in (7.15a) on Bi using GraphCut
5: until max iteration is reached

Algorithm 9: FastHash
Input: training data points x1, ..., xn; affinity matrix Y; bit length m; blocks B1, B2, ....
Output: hash functions Φ = [h1, ..., hm].
1: for r = 1, ..., m do
2:   Step 1: call Algorithm 8 to obtain the binary codes of the r-th bit
3:   Step 2: train trees in (7.22) to obtain the hash function hr
4:   update the binary codes of the r-th bit with the output of hr
hash functions to preserve the label-based similarity in the Hamming space. The m hash functions are denoted as:

\Phi(x) = [h_1(x), h_2(x), \ldots, h_m(x)]. \qquad (7.1)

The output of the hash functions is an m-bit binary code: Φ(x) ∈ {−1, 1}^m.
The Hamming distance between two binary codes is the number of bits taking different values:

d_{hm}(x_i, x_j) = \sum_{r=1}^{m} \big[ 1 - \delta(h_r(x_i), h_r(x_j)) \big], \qquad (7.2)
in which δ(·, ·) ∈ {0, 1} is an indicator function: it outputs 1 if its two inputs are equal, and 0 otherwise. Generally, the formulation of hashing learning is to encourage small Hamming distances for similar data pairs and large ones for dissimilar data pairs. Closely related to the Hamming distance, the Hamming affinity is calculated as the inner product of two binary codes:

s_{hm}(x_i, x_j) = \sum_{r=1}^{m} h_r(x_i) h_r(x_j). \qquad (7.3)
As shown in KSH [31], the Hamming affinity is in one-to-one correspondence with the
Hamming distance. Similar to KSH [31], we formulate hashing learning based on Ham-
ming affinity, which is to encourage positive affinity values for similar data pairs and
negative for dissimilar data pairs. The optimization is written as:
\min_{\Phi(\cdot)} \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \left[ m y_{ij} - \sum_{r=1}^{m} h_r(x_i) h_r(x_j) \right]^2. \qquad (7.4)
Note that KSH does not include the multiplication by |y_{ij}| in the objective. We use |y_{ij}| to prevent undefined pairwise relations from harming the hashing task: if the relation is undefined, |y_{ij}| = 0; otherwise, |y_{ij}| = 1. Intuitively, this optimization encourages the Hamming affinity value of a data pair to be close to the ground truth value. In contrast with KSH, which uses kernel functions, here we employ decision trees as hash functions.
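The one-to-one correspondence between Hamming affinity and Hamming distance noted above (s_hm = m − 2 d_hm for codes in {−1, 1}^m) is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 16
zi = rng.choice([-1, 1], size=m)     # two random m-bit codes
zj = rng.choice([-1, 1], size=m)

d = int(np.sum(zi != zj))            # Hamming distance, Eq. (7.2)
s = int(zi @ zj)                     # Hamming affinity, Eq. (7.3)

# Each agreeing bit contributes +1 to s, each disagreeing bit -1,
# hence s = (m - d) - d = m - 2d: the one-to-one correspondence
# exploited by KSH and by our formulation.
assert s == m - 2 * d
```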
We define each hash function as a linear combination of decision trees, that is,

h(x) = \mathrm{sign}\left[ \sum_{q=1}^{Q} w_q T_q(x) \right]. \qquad (7.5)
Here Q is the number of decision trees and T(·) ∈ {−1, 1} denotes a tree function with binary output. The weights

w = [w_1, w_2, \ldots, w_Q] \qquad (7.6)

and trees

T = [T_1, T_2, \ldots, T_Q] \qquad (7.7)

are the parameters we need to learn for one hash function. Compared to kernel methods, decision trees enjoy faster testing on high-dimensional data as well as non-linear fitting ability.
Optimizing (7.4) directly for learning decision trees is difficult, and the technique used
in KSH is no longer applicable. Inspired by TSH [44], we introduce auxiliary variables
z_{r,i} ∈ {-1, 1} as the output of the r-th hash function on x_i:

z_{r,i} = h_r(x_i).  (7.8)

Clearly, z_{r,i} is the binary code of the i-th data point in the r-th bit. With these auxiliary
variables, the problem (7.4) can be decomposed into two sub-problems:
\min_{Z \in \{-1,1\}^{m \times n}} \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big[ m\, y_{ij} - \sum_{r=1}^{m} z_{r,i} z_{r,j} \Big]^2;  (7.9)
and
\min_{\Phi(\cdot)} \sum_{r=1}^{m} \sum_{i=1}^{n} \big[ 1 - \delta(z_{r,i}, h_r(x_i)) \big].  (7.10)
Here Z is the matrix of m-bit binary codes for all training data points; δ(·, ·) is the
indicator function described in (7.2). Note that (7.9) is a binary code inference
problem, and (7.10) is a simple binary classification problem. In this way, the
complicated decision tree learning for supervised hashing (7.4) now becomes two relatively
simpler tasks: solving (7.9) (Step 1) and (7.10) (Step 2).
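The alternation between the two sub-problems can be sketched as follows. This is a schematic outline only: the callables `solve_binary_codes` (Step 1, problem (7.9)) and `fit_classifier` (Step 2, problem (7.10)) are placeholder names, not the thesis implementation.

```python
# Schematic two-step decomposition: Step 1 infers binary codes per bit,
# Step 2 fits one binary classifier (hash function) per bit.
import numpy as np

def two_step_hashing(X, Y, m, solve_binary_codes, fit_classifier):
    n = X.shape[0]
    Z = np.empty((m, n))              # auxiliary variables z_{r,i}
    hash_funcs = []
    for r in range(m):
        # Step 1: infer the r-th bit, conditioning on previous bits.
        Z[r] = solve_binary_codes(Y, Z[:r])
        # Step 2: fit a binary classifier h_r predicting Z[r] from X.
        h_r = fit_classifier(X, Z[r])
        hash_funcs.append(h_r)
        # FastHash feedback: replace the inferred codes by h_r's output.
        Z[r] = np.sign(h_r(X))
    return hash_funcs, Z
```

The final line of the loop implements the feedback described later in Section 7.2.2: each learned hash function rewrites the codes of its bit before the next bit is inferred.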
7.2.1 Step 1: Binary code inference
For (7.9), we sequentially optimize for one bit at a time, conditioning on previous bits.
When solving for the r-th bit, the cost in (7.9) can be written as:
\sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r} z_{p,i} z_{p,j} \Big)^2  (7.11a)

= \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} - z_{r,i} z_{r,j} \Big)^2  (7.11b)

= \sum_{i=1}^{n} \sum_{j=1}^{n} |y_{ij}| \Big[ \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} \Big)^2 + (z_{r,i} z_{r,j})^2 - 2 \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} \Big) z_{r,i} z_{r,j} \Big]  (7.11c)

= \sum_{i=1}^{n} \sum_{j=1}^{n} -2 |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z_{p,i} z_{p,j} \Big) z_{r,i} z_{r,j} + \mathrm{const}.  (7.11d)
With the above equations, the optimization for the r-th bit can be formulated as a
binary quadratic problem:
\min_{z_r \in \{-1,1\}^n} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, z_{r,i} z_{r,j},  (7.12a)

where  a_{ij} = -|y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big).  (7.12b)
Here z_r is the vector of binary variables of the r-th bit, which we aim to optimize;
z^*_{p,i} denotes the binary code of the i-th data point in a previous bit p; a_{ij} is a
constant which can be calculated from the binary codes of previous bits.
We use a stage-wise scheme when solving for each bit. Specifically, when solving for the
r-th bit, the bit length is set to r instead of m, as shown in (7.12b) of the above
optimization. This way, the optimization of the current bit depends on the loss incurred
by previous bits, which usually leads to better inference results.
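The coefficients a_{ij} of (7.12b) are cheap to compute from the codes of the previous bits. A hedged NumPy sketch (the function name `quadratic_coeffs` and the toy inputs are ours, not the thesis's):

```python
# Hedged sketch: coefficients a_ij in (7.12b) from the codes of previous
# bits. Y holds y_ij in {-1, 0, +1}; Z_prev is (r-1) x n with entries +-1.
import numpy as np

def quadratic_coeffs(Y, Z_prev):
    r = Z_prev.shape[0] + 1               # stage-wise bit length: r, not m
    residual = r * Y - Z_prev.T @ Z_prev  # r*y_ij - sum_p z_pi * z_pj
    return -np.abs(Y) * residual          # a_ij; zero for undefined pairs

Y = np.array([[1.0, 1.0], [1.0, 1.0]])
Z_prev = np.ones((1, 2))              # one previous bit, both codes +1
A = quadratic_coeffs(Y, Z_prev)       # residual = 2*1 - 1 = 1, so a_ij = -1
```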
Alternatively, one can apply the spectral relaxation method to solve (7.12a), as in TSH.
However, solving eigenvalue problems does not scale to large training sets, and the
spectral relaxation is rather loose (hence leading to inferior results). Here we propose
sub-modular formulations for the hashing binary code inference problem and an efficient
GraphCut-based block search method for solving large-scale inference.
Specifically, we first group data points into a number of blocks, then iteratively optimize
over these blocks until convergence. In each iteration, we randomly pick one block, then
optimize over (update) the corresponding variables of this block, conditioning on the rest
of the variables. In other words, when optimizing over one block, only those variables
which correspond to the data points of the target block are updated; the values of the
variables which are not involved in the target block remain unchanged. Clearly, each
block update never increases the objective.
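The monotonicity of such conditional updates can be illustrated with single-variable blocks (the ICM special case discussed below). This is a toy illustration with random symmetric coefficients, not the thesis's Block-GC; the a_{ij} of (7.12b) would take the place of the random matrix:

```python
# Toy illustration: single-variable conditional updates (ICM-style) on the
# objective sum_ij a_ij z_i z_j never increase it.
import numpy as np

def objective(A, z):
    return float(z @ A @ z)

def icm_pass(A, z):
    for i in range(len(z)):
        # Contribution of z_i (A symmetric): 2 * z_i * sum_{j != i} a_ij z_j.
        field = A[i] @ z - A[i, i] * z[i]
        z[i] = -1.0 if field > 0 else 1.0  # sign minimizing the contribution
    return z

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                          # symmetric coefficient matrix
z = rng.choice([-1.0, 1.0], size=6)
before = objective(A, z)
after = objective(A, icm_pass(A, z))       # after <= before always holds
```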
Formally, let B denote a block of data points, and suppose we want to optimize over the
corresponding binary variables of the block B. We denote by \bar{z}_{r,i} the binary code
(in the r-th bit) of a data point i that is not involved in the target block. First we
rewrite the objective in (7.12a) to separate the variables of the target block from the
other variables. The objective in (7.12a)
can be rewritten as:
\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} z_{r,i} z_{r,j} = \sum_{i \in B} \sum_{j \in B} a_{ij} z_{r,i} z_{r,j} + \sum_{i \in B} \sum_{j \notin B} a_{ij} z_{r,i} \bar{z}_{r,j}  (7.13a)

+ \sum_{i \notin B} \sum_{j \in B} a_{ij} \bar{z}_{r,i} z_{r,j} + \sum_{i \notin B} \sum_{j \notin B} a_{ij} \bar{z}_{r,i} \bar{z}_{r,j}  (7.13b)

= \sum_{i \in B} \sum_{j \in B} a_{ij} z_{r,i} z_{r,j} + 2 \sum_{i \in B} \sum_{j \notin B} a_{ij} z_{r,i} \bar{z}_{r,j} + \sum_{i \notin B} \sum_{j \notin B} a_{ij} \bar{z}_{r,i} \bar{z}_{r,j}.  (7.13c)
When optimizing over one block, the variables which are not involved in the target block
are treated as constants; hence \bar{z}_r is treated as a constant. Removing the constant
part, the optimization for one block can be written as:
\min_{z_{r,B} \in \{-1,1\}^{|B|}} \sum_{i \in B} \sum_{j \in B} a_{ij} z_{r,i} z_{r,j} + 2 \sum_{i \in B} \sum_{j \notin B} a_{ij} z_{r,i} \bar{z}_{r,j}.  (7.14)
Here z_{r,B} is the vector of variables which are involved in the target block B, which
we aim to optimize. Substituting the constant a_{ij} by its definition in (7.12b), the
above optimization is written as:
\min_{z_{r,B} \in \{-1,1\}^{|B|}} \sum_{i \in B} u_i z_{r,i} + \sum_{i \in B} \sum_{j \in B} v_{ij} z_{r,i} z_{r,j},  (7.15a)

where  v_{ij} = -|y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big),  (7.15b)

u_i = -2 \sum_{j \notin B} \bar{z}_{r,j} |y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big).  (7.15c)
Here u_i and v_{ij} are constants.
The key to constructing a block is to ensure that (7.15a) restricted to the block is
sub-modular, so that efficient GraphCut can be applied. We refer to this as Block
GraphCut (Block-GC), shown in Algorithm 8. Specifically, in our hashing problem, by
leveraging the similarity information we can easily construct blocks which meet the
sub-modularity requirement, as shown in the following proposition:
Proposition 7.1. ∀i, j ∈ B, if y_{ij} ≥ 0, the optimization in (7.15a) is a sub-modular
problem. In other words, if no data point in the block is dissimilar to any other data
point in the block, then (7.15a) is sub-modular.
Chapter 7 Fast Supervised Hashing with Decision Trees for High-Dimensional Data 135
Proof. If y_{ij} ≥ 0, the following holds (if y_{ij} = 0, then v_{ij} = 0 directly):

r\, y_{ij} \geq \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j}.  (7.16)

Thus we have:

v_{ij} = -|y_{ij}| \Big( r\, y_{ij} - \sum_{p=1}^{r-1} z^*_{p,i} z^*_{p,j} \Big) \leq 0.  (7.17)
With the following definition:

\theta_{ij}(z_{r,i}, z_{r,j}) = v_{ij} z_{r,i} z_{r,j},  (7.18)

the following holds:

\theta_{ij}(-1, 1) = \theta_{ij}(1, -1) = -v_{ij} \geq 0;  (7.19)
\theta_{ij}(1, 1) = \theta_{ij}(-1, -1) = v_{ij} \leq 0.  (7.20)

Hence we have the following relations:

\forall i, j \in B:\quad \theta_{ij}(1, 1) + \theta_{ij}(-1, -1) \leq 0 \leq \theta_{ij}(1, -1) + \theta_{ij}(-1, 1),  (7.21)

which proves the sub-modularity of (7.15a) [136].
Blocks can be constructed in many ways as long as they satisfy the condition in Proposition
7.1. A simple greedy method is shown in Algorithm 7. Note that the blocks can
overlap, and their union needs to cover all n variables. If every block contains exactly
one variable, Block-GC reduces to ICM [137, 138], which optimizes one variable at a time.
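A greedy construction in the spirit of Algorithm 7 can be sketched as follows. This is our own simplified, non-overlapping version, not a transcription of Algorithm 7: a point joins the current block only if it is not dissimilar (y_{ij} = -1) to any point already in it, so the condition of Proposition 7.1 holds within every block.

```python
# Hedged sketch of greedy block construction: every pair inside a block
# satisfies y_ij >= 0, so (7.15a) restricted to the block is sub-modular.
import numpy as np

def greedy_blocks(Y):
    n = Y.shape[0]
    unassigned = set(range(n))
    blocks = []
    while unassigned:
        seed = unassigned.pop()
        block = [seed]
        for i in list(unassigned):
            # Admit i only if it is similar/undefined w.r.t. the whole block.
            if all(Y[i, j] >= 0 for j in block):
                block.append(i)
                unassigned.remove(i)
        blocks.append(block)
    return blocks

Y = np.array([[ 1,  1, -1],
              [ 1,  1, -1],
              [-1, -1,  1]])
blocks = greedy_blocks(Y)  # point 2 cannot share a block with 0 or 1
```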
7.2.2 Step 2: Learning boosted trees as hash functions
For binary classification in (7.10), usually the zero-one loss is replaced by some convex
surrogate loss. Here we use the exponential loss which is common for boosting methods.
The classification problem for learning the r-th hash function is written as:
\min_{w \geq 0} \sum_{i=1}^{n} \exp \Big[ -z_{r,i} \sum_{q=1}^{Q} w_q T_q(x_i) \Big].  (7.22)
We apply AdaBoost to solve the above problem. In each boosting iteration, a decision tree
as well as its weighting coefficient is learned. Every node of a binary decision tree
is a decision stump. Training a stump amounts to finding a feature dimension and threshold
that minimize the weighted classification error. In this sense, we perform feature
selection and hash function learning at the same time. We can easily make use of efficient
decision tree learning techniques available in the literature, which are able to significantly
speed up the training. Here we summarize some techniques that are included in our
implementation:
1. We use the highly efficient stump implementation proposed in the recent work of
[139], which is around 10 times faster than a conventional stump implementation.

2. Feature quantization can significantly speed up tree training without performance
loss in practice, and also largely reduces memory consumption. As in [139], we
linearly quantize feature values into 256 bins.

3. We apply the weight-trimming technique described in [139, 140]. In each boosting
iteration, the smallest 10% of the weights are trimmed (set to 0).

4. We apply the LazyBoost technique, which significantly speeds up the tree learning
process, especially on high-dimensional data: for each node split in tree training,
only a random subset of feature dimensions is evaluated.
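The boosting step can be sketched in a few lines. This is a deliberately simplified version of (7.22): discrete AdaBoost with depth-1 stumps instead of depth-4 trees, and without the quantization, trimming and LazyBoost speedups listed above; the function names are ours.

```python
# Hedged sketch of Step 2: AdaBoost with decision stumps as weak learners,
# minimizing the exponential loss in (7.22); the final hash function takes
# the sign of the weighted tree votes, as in (7.5).
import numpy as np

def fit_stump(X, z, w):
    """Exhaustively find the (feature, threshold, sign) minimizing weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sgn in (1, -1):
                pred = np.where(X[:, f] > thr, sgn, -sgn)
                err = w[pred != z].sum()
                if err < best[0]:
                    best = (err, f, thr, sgn)
    return best

def boost_hash_function(X, z, Q=10):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(Q):
        err, f, thr, sgn = fit_stump(X, z, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, f] > thr, sgn, -sgn)
        w *= np.exp(-alpha * z * pred)     # exponential-loss reweighting
        w /= w.sum()
        stumps.append((f, thr, sgn))
        alphas.append(alpha)
    def h(Xq):                             # the learned hash function
        agg = sum(a * np.where(Xq[:, f] > t, s, -s)
                  for a, (f, t, s) in zip(alphas, stumps))
        return np.sign(agg)
    return h

X = np.array([[0.0], [1.0], [2.0], [3.0]])
z = np.array([-1.0, -1.0, 1.0, 1.0])       # target bit from Step 1
h = boost_hash_function(X, z, Q=3)
```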
Finally, we summarize our hashing method (FastHash) in Algorithm 9. In contrast with
TSH, we alternate Step 1 and Step 2 iteratively. For each bit, the binary code is updated
by applying the learned hash function. Hence, the learned hash function feeds back into
the binary code inference of the next bit, which may lead to better performance.
7.3 Experiments
We here describe the results of comprehensive experiments carried out on several large
image datasets in order to evaluate the performance of the proposed method in terms of
training time, binary encoding time and retrieval performance. We compare to a number
of recent supervised and unsupervised hashing methods. For decision tree learning in
our FastHash, if not specified, the tree depth is set to 4, and the number of boosting
iterations is 200.
The retrieval performance is measured in three ways: the precision of the top-K (K = 100)
retrieved examples (denoted Precision), mean average precision (MAP), and the area
under the precision-recall curve (Prec-Recall).
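These metrics can be computed directly from the relevance flags of a ranked retrieval list. A hedged single-query sketch (the function names are ours; MAP is the mean of the per-query average precision below):

```python
# Hedged sketch of the evaluation metrics: precision of the top-K retrieved
# items, and average precision over a single query's ranked result list.
import numpy as np

def precision_at_k(relevant, k=100):
    rel = np.asarray(relevant[:k], dtype=float)
    return rel.sum() / min(k, len(rel))

def average_precision(relevant):
    rel = np.asarray(relevant, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    prec = hits / np.arange(1, len(rel) + 1)  # precision at each rank
    return float((prec * rel).sum() / rel.sum())

# Relevance of a ranked list: 1 = shares the query's label.
ranked = [1, 1, 0, 1, 0]
p = precision_at_k(ranked, k=3)   # 2/3
ap = average_precision(ranked)    # (1/1 + 2/2 + 3/4) / 3
```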
Figure 7.1: Some retrieval examples of our method FastHash on CIFAR10. The first column shows query images, and the rest are retrieved images in the database.
Figure 7.2: Some retrieval examples of our method FastHash on ESPGAME. The first column shows query images, and the rest are retrieved images in the database. False predictions are marked by red boxes.
Table 7.1: Comparison of KSH and our FastHash. KSH results are reported with different numbers of support vectors. Both our FastHash and FastHash-Full outperform KSH by a large margin in terms of training time, binary encoding time (Test time) and retrieval precision.
Method       #Train   #Support Vectors   Train time (s)   Test time (s)   Precision

CIFAR10 (features: 11200)
KSH          5000     300     1082     22    0.480
KSH          5000     1000    3481     57    0.553
KSH          5000     3000    52747    145   0.590
FastH        5000     N/A     331      21    0.634
FastH-Full   50000    N/A     1794     21    0.763

IAPRTC12 (features: 11200)
KSH          5000     300     1129     7     0.199
KSH          5000     1000    3447     21    0.235
KSH          5000     3000    51927    51    0.273
FastH        5000     N/A     331      9     0.285
FastH-Full   17665    N/A     620      9     0.371

ESPGAME (features: 11200)
KSH          5000     300     1120     8     0.124
KSH          5000     1000    3358     22    0.139
KSH          5000     3000    52115    46    0.163
FastH        5000     N/A     309      9     0.188
FastH-Full   18689    N/A     663      9     0.261

MIRFLICKR (features: 11200)
KSH          5000     300     1036     5     0.387
KSH          5000     1000    3337     13    0.407
KSH          5000     3000    52031    42    0.434
FastH        5000     N/A     278      7     0.555
FastH-Full   12500    N/A     509      7     0.595
Results are reported on 5 image datasets which cover a wide variety of images. The
dataset CIFAR10 1 contains 60000 images. The datasets IAPRTC12 and ESPGAME
[141] contain around 20000 images each, and MIRFLICKR [142] is a collection of 25000
images. SUN397 [3] is a large image dataset which contains more than 100000 scene
images from 397 categories.
For the multi-class datasets CIFAR10 and SUN397, the ground truth pairwise similarity
is defined as multi-class label agreement. For the datasets IAPRTC12, ESPGAME and
MIRFLICKR, for which keyword (tag) annotations are provided in [141], two images
are treated as semantically similar if they are annotated with at least 2 identical keywords
(or tags).
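This ground-truth construction is straightforward to express in code. A hedged sketch (the function name and the sample tags are ours; we set the diagonal to similar by convention):

```python
# Hedged sketch: pairwise labels for multi-label data. Two images are
# similar (y_ij = 1) if they share at least `min_shared` keywords,
# dissimilar (y_ij = -1) otherwise; each image is similar to itself.
def pairwise_labels(tag_sets, min_shared=2):
    n = len(tag_sets)
    Y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = len(tag_sets[i] & tag_sets[j])
            Y[i][j] = 1 if i == j or shared >= min_shared else -1
    return Y

tags = [{"sky", "beach", "sea"}, {"sea", "beach"}, {"car"}]
Y = pairwise_labels(tags)  # images 0 and 1 share 2 tags, so they are similar
```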
Following the conventional setting in [28, 31], a large portion of each dataset is allocated
as an image database for training and retrieval, and the rest is put aside for test queries.
Specifically, for CIFAR10, IAPRTC12, ESPGAME and MIRFLICKR, the provided
splits are used; for SUN397, 8000 images are randomly selected as test queries, while
the remaining 100417 images form the training set. If not specified, 64-bit binary codes
are generated by all compared methods for evaluation.
1http://www.cs.toronto.edu/~kriz/cifar.html
Table 7.2: Comparison of TSH and our FastHash for the binary code inference in Step 1. The proposed Block GraphCut (Block-GC) achieves a much lower objective value and also takes less inference time than the spectral method, and thus performs much better.
Step-1 method        #Train   Block size   Time (s)   Objective

SUN397
Spectral (TSH)       100417   N/A    5281   0.7524
Block-GC-1 (FastH)   100417   1      298    0.6341
Block-GC (FastH)     100417   253    2239   0.5608

CIFAR10
Spectral (TSH)       50000    N/A    1363   0.4912
Block-GC-1 (FastH)   50000    1      158    0.5338
Block-GC (FastH)     50000    5000   788    0.4158

IAPRTC12
Spectral (TSH)       17665    N/A    426    0.7237
Block-GC-1 (FastH)   17665    1      43     0.7316
Block-GC (FastH)     17665    316    70     0.7095

ESPGAME
Spectral (TSH)       18689    N/A    480    0.7373
Block-GC-1 (FastH)   18689    1      45     0.7527
Block-GC (FastH)     18689    336    72     0.7231

MIRFLICKR
Spectral (TSH)       12500    N/A    125    0.5718
Block-GC-1 (FastH)   12500    1      28     0.5851
Block-GC (FastH)     12500    295    40     0.5449
Given the remarkable success with which they have been applied elsewhere, we extract
codebook-based features following the conventional pipeline of [132, 133]: we employ
K-SVD for codebook (dictionary) learning with a codebook size of 800, soft-thresholding
for patch encoding and spatial pooling over 3 levels, which results in 11200-dimensional
features. We also tested increasing the codebook size to 1600, which results in
22400-dimensional features.
7.3.1 Comparison with KSH
KSH [31] has been shown to outperform many state-of-the-art comparators. The fact
that our method employs the same loss function as KSH thus motivates further comparison
against this key method. KSH employs a simple kernel technique: it predefines
a set of support vectors and then learns linear weightings for each hash function. In the
works of [31, 44], KSH is evaluated only on low-dimensional GIST features (512 dimensions)
using a small number of support vectors (300). Here, in contrast, we evaluate
KSH on high-dimensional codebook features, and vary the number of support vectors
from 300 to 3000. For our method, the tree depth is set to 4, and the number of boosting
iterations is set to 200. KSH is trained on a sampled set of 5000 examples. The results
of these tests are summarized in Table 7.1, which shows that increasing the number of
Figure 7.3: Comparison of KSH and our FastHash on all datasets (panels: CIFAR10, IAPRTC12, ESPGAME, MIRFLICKR). The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last two rows. The number after "KSH" is the number of support vectors. Both our FastHash and FastHash-Full outperform KSH by a large margin.
support vectors consistently improves the retrieval performance of KSH. However, even
on this small training set, including more support vectors dramatically increases the
training and binary encoding time of KSH. We have run our FastHash both on the
same sampled training set and on the whole training set (labeled as FastHash-Full) in order
to show that our method can be efficiently trained on the whole dataset. Our FastHash
and FastHash-Full outperform KSH by a large margin both in terms of training speed
and retrieval precision. The results also show that the decision tree hash functions
in FastHash are much more efficient for testing (binary encoding) than the kernel
functions in KSH. Our FastHash is orders of magnitude faster than KSH in training,
and thus much better suited to large training sets and high-dimensional data. For the
low-dimensional GIST features, our FastHash also performs much better than KSH in
retrieval; see Table 7.5 for details. If not specified, the number of support vectors for
KSH is set to 3000.
For the comparison with KSH, we also show the precision-recall curves and the precision
curves of the top-K retrieved examples in Figure 7.3. The number after "KSH" is the number
of support vectors. Both FastHash and FastHash-Full outperform KSH by a large
margin.
Some retrieval examples of our method are shown in Figure 7.1 for the dataset CIFAR10
and Figure 7.2 for the dataset ESPGAME. The codebook features are used here.
7.3.2 Comparison with TSH
TSH [44] is the general two-step learning method which we proposed in Chapter 6. The
proposed FastHash employs a similar two-step approach to that of TSH. We first compare
binary code inference in Step 1: the proposed Block GraphCut (Block-GC) versus the
spectral method in TSH. For all experiments in this paper, the number of iterations of
Block-GC is set to 2. The results are summarized in Table 7.2. We construct
blocks using Algorithm 7; the average block size is reported in the table. We also
evaluate a special case where the block size is set to 1 for Block-GC (labeled as Block-GC-1),
in which case Block-GC reduces to the ICM [137, 138] method. The results show that
as the training set gets larger, the spectral method becomes slow. The objective
value shown in the table is divided by the number of defined pairwise relations. The
proposed Block-GC achieves much lower objective values and takes less inference time,
and hence outperforms the spectral method. The inference time of Block-GC increases
only linearly with the training set size. The results also show that the special case
Block-GC-1 is highly efficient and able to achieve comparably low objective values.
Figure 7.4: Comparison of various combinations of hash functions and binary inference methods (panels: CIFAR10, ESPGAME, IAPRTC12, MIRFLICKR; mAP and precision of 100-NN versus number of bits). Note that the proposed FastHash uses decision trees as hash functions. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2.
Table 7.3: Comparison of combinations of hash functions and binary inference methods. The proposed decision tree hash function performs much better than the linear SVM hash function. Moreover, our FastHash performs much better than TSH when using the same hash function in Step 2.
Step-2          Step-1   Precision   MAP     Prec-Recall

CIFAR10
Linear-SVM      TSH      0.676   0.621   0.436
Linear-SVM      FastH    0.669   0.621   0.435
Tree (FastH)    TSH      0.745   0.726   0.567
Tree (FastH)    FastH    0.763   0.775   0.605

IAPRTC12
Linear-SVM      TSH      0.297   0.213   0.155
Linear-SVM      FastH    0.327   0.238   0.186
Tree (FastH)    TSH      0.328   0.245   0.185
Tree (FastH)    FastH    0.371   0.276   0.210

ESPGAME
Linear-SVM      TSH      0.194   0.137   0.096
Linear-SVM      FastH    0.227   0.157   0.109
Tree (FastH)    TSH      0.220   0.161   0.109
Tree (FastH)    FastH    0.261   0.189   0.126

MIRFLICKR
Linear-SVM      TSH      0.522   0.478   0.331
Linear-SVM      FastH    0.536   0.498   0.344
Tree (FastH)    TSH      0.559   0.526   0.391
Tree (FastH)    FastH    0.595   0.558   0.420
We now provide results comparing different combinations of hash functions (Step 2)
and binary code inference methods (Step 1). We evaluate the linear SVM and the
proposed decision tree hash functions with different binary code inference methods (the
spectral method in TSH and Block-GC in FastHash). The 11200-dimensional codebook
features are used here. The retrieval performance is summarized in Table 7.3; we also
plot the retrieval performance in Figure 7.4 by varying the number of bits. As expected,
the proposed decision tree hash function performs much better than the linear SVM hash
function. The results also show that our FastHash performs much better than TSH when
using the same type of hash function for Step 2 (decision tree or linear SVM), which
indicates that the proposed Block-GC method for binary code inference and the
stage-wise learning strategy are able to generate high-quality binary codes.
We can also train RBF-kernel SVMs as hash functions in Step 2. However, as is the case
here, when applied to large training sets and high-dimensional data, training an RBF
SVM becomes almost intractable. A stochastic method with a support vector budget
(BSGD) for efficient kernel SVM training was recently proposed in [6]. Even with
BSGD, the training and testing costs are still very expensive. We run TSH using BSGD
[6] (TSH-BRBF) and linear SVM (TSH-LSVM) as hash functions, and compare to
Table 7.4: Comparison of TSH and our FastHash. Results of TSH with the linear SVM and the budgeted RBF kernel [6] hash functions (TSH-BRBF) in Step 2 are presented. Our FastHash outperforms TSH by a large margin both in training speed and retrieval performance.
Method      Train time (s)   Test time (s)   Precision   MAP     Prec-Recall

CIFAR10
TSH-BRBF    98961    8994   0.683   0.629   0.448
TSH-LSVM    14567    9      0.676   0.621   0.436
FastHash    1794     21     0.763   0.775   0.605

IAPRTC12
TSH-BRBF    45739    3129   0.276   0.194   0.144
TSH-LSVM    6926     3      0.297   0.213   0.155
FastHash    620      9      0.371   0.276   0.210

ESPGAME
TSH-BRBF    51669    1914   0.167   0.114   0.085
TSH-LSVM    7062     3      0.194   0.137   0.096
FastHash    663      9      0.261   0.189   0.126

MIRFLICKR
TSH-BRBF    21183    1339   0.513   0.455   0.324
TSH-LSVM    7755     2      0.522   0.478   0.331
FastHash    509      7      0.595   0.558   0.420
our FastHash with boosted trees. The results are shown in Table 7.4. The number of
support vectors is set to 100 as the budget. Even with this small number of support
vectors, TSH-BRBF is already very slow in testing and training. Compared to
kernel SVMs, for high-dimensional data, our FastHash with decision trees is much more
efficient both in training and testing (binary encoding). Our FastHash also achieves
the best retrieval performance. It is worth noting that here, for each hash function, a
set of support vectors is learned from the data, which differs from KSH [31], where
predefined support vectors are shared by all hash functions.
7.3.3 Experiments on different features
We compare hashing methods on the low-dimensional (320 or 512) GIST features
and the high-dimensional (11200) codebook features. We extract GIST features of 320
dimensions for CIFAR10, which contains low-resolution images, and 512 dimensions for
the other datasets. Several state-of-the-art supervised methods are included in this
comparison: KSH [31], Supervised Self-Taught Hashing (STHs) [33] and Semi-Supervised
Hashing (SPLH) [17]. The results are presented in Table 7.5. The codebook features
consistently yield better results than the GIST features. The compared methods are trained on a
sampled training set (5000 examples). The results show that the compared methods can be
efficiently trained on the GIST features. However, when applied to high-dimensional
features, even on a small training set (5000), their training time increases dramatically.
Figure 7.5: Results on high-dimensional codebook features (panels: CIFAR10, ESPGAME, IAPRTC12, MIRFLICKR). The precision and recall curves are given in the first two rows. The precision curves of the top 2000 retrieved examples are given in the last two rows. Both our FastHash and FastHash-Full outperform their comparators by a large margin.
Table 7.5: Results using two types of features: low-dimensional GIST features and high-dimensional codebook features. Our FastHash and FastHash-Full outperform the comparators by a large margin on both feature types. In terms of training time, our FastHash is also much faster than the others on the high-dimensional codebook features.
Method      #Train  | GIST feature (320/512 dims): Train time (s), Test time (s), Precision, MAP, Prec-Recall | Codebook feature (11200 dims): Train time (s), Test time (s), Precision, MAP, Prec-Recall

CIFAR10
KSH         5000    | 52173   8    0.453  0.350  0.164 | 52747   145  0.590  0.464  0.261
BREs        5000    | 481     1    0.262  0.198  0.082 | 18343   8    0.292  0.216  0.089
SPLH        5000    | 102     1    0.368  0.291  0.138 | 9858    4    0.496  0.396  0.219
STHs        5000    | 380     1    0.197  0.151  0.051 | 6878    4    0.246  0.175  0.058
FastH       5000    | 304     21   0.517  0.462  0.243 | 331     21   0.634  0.575  0.358
FastH-Full  50000   | 1681    21   0.649  0.653  0.450 | 1794    21   0.763  0.775  0.605

IAPRTC12
KSH         5000    | 51864   5    0.182  0.126  0.083 | 51927   51   0.273  0.169  0.123
BREs        5000    | 6052    1    0.138  0.109  0.074 | 6779    3    0.163  0.124  0.097
SPLH        5000    | 154     1    0.160  0.124  0.084 | 10261   2    0.220  0.157  0.119
STHs        5000    | 628     1    0.099  0.092  0.062 | 10108   2    0.160  0.114  0.076
FastH       5000    | 286     9    0.232  0.168  0.117 | 331     9    0.285  0.202  0.146
FastH-Full  17665   | 590     9    0.316  0.240  0.178 | 620     9    0.371  0.276  0.210

ESPGAME
KSH         5000    | 52061   5    0.118  0.077  0.054 | 52115   46   0.163  0.100  0.072
BREs        5000    | 714     1    0.095  0.070  0.050 | 16628   3    0.111  0.076  0.059
SPLH        5000    | 185     1    0.116  0.083  0.062 | 11740   2    0.148  0.104  0.074
STHs        5000    | 616     1    0.061  0.047  0.033 | 11045   2    0.087  0.064  0.042
FastH       5000    | 289     9    0.157  0.106  0.070 | 309     9    0.188  0.125  0.081
FastH-Full  18689   | 448     9    0.228  0.169  0.109 | 663     9    0.261  0.189  0.126

MIRFLICKR
KSH         5000    | 51983   3    0.379  0.321  0.234 | 52031   42   0.434  0.350  0.254
BREs        5000    | 1161    1    0.347  0.310  0.224 | 13671   2    0.399  0.345  0.250
SPLH        5000    | 166     1    0.379  0.337  0.241 | 9824    2    0.444  0.391  0.277
STHs        5000    | 613     1    0.268  0.261  0.172 | 10254   2    0.281  0.272  0.174
FastH       5000    | 307     7    0.477  0.429  0.299 | 338     7    0.555  0.487  0.344
FastH-Full  12500   | 451     7    0.525  0.507  0.345 | 509     7    0.595  0.558  0.420
Large matrix multiplications and solving eigenvalue problems on large matrices may
account for the expensive computation in these compared methods. It would be very
difficult to train these methods on the whole training set. The training time of KSH
mainly depends on the number of support vectors (3000 is used here). We run our
FastHash on the same sampled training set (5000 examples) and on the whole training set
(labeled as FastHash-Full). The results show that FastHash can be efficiently trained on the
whole dataset. FastHash and FastHash-Full outperform the others by a large margin on
both GIST and codebook features. The training of FastHash is also orders of magnitude
faster than the others on the high-dimensional codebook features. The retrieval performance
on the codebook features is also plotted in Figure 7.5.
7.3.4 Comparison with dimension reduction
A possible way to reduce the training cost on high-dimensional data is to apply dimension
reduction. For the compared methods KSH, SPLH and STHs, we reduce the
original 11200-dimensional codebook features to 500 dimensions by applying PCA. We
also compare to CCA+ITQ [36], which combines ITQ with supervised dimension
reduction. Our FastHash still uses the original high-dimensional features. The results are
summarized in Table 7.6. After dimension reduction, most compared methods can be
trained on the whole training set within 24 hours (except KSH on CIFAR10). However,
Table 7.6: Results of methods with dimension reduction. KSH, SPLH and STHs are trained with PCA feature reduction. Our FastHash outperforms the others by a large margin on retrieval performance.
Method      #Train   Train time (s)   Test time (s)   Precision   MAP

CIFAR10
PCA+KSH     50000    -       -     -       -
PCA+SPLH    50000    25984   18    0.482   0.388
PCA+STHs    50000    7980    18    0.287   0.200
CCA+ITQ     50000    1055    7     0.676   0.642
FastH       50000    1794    21    0.763   0.775

IAPRTC12
PCA+KSH     17665    55031   11    0.082   0.103
PCA+SPLH    17665    1855    7     0.239   0.169
PCA+STHs    17665    2463    7     0.174   0.126
CCA+ITQ     17665    804     3     0.332   0.198
FastH       17665    620     9     0.371   0.276

ESPGAME
PCA+KSH     18689    55714   11    0.141   0.084
PCA+SPLH    18689    2409    7     0.153   0.103
PCA+STHs    18689    2777    7     0.098   0.069
CCA+ITQ     18689    814     3     0.216   0.131
FastH       18689    663     9     0.261   0.189

MIRFLICKR
PCA+KSH     12500    54260   8     0.384   0.313
PCA+SPLH    12500    1054    5     0.445   0.391
PCA+STHs    12500    1768    5     0.347   0.301
CCA+ITQ     12500    699     3     0.519   0.408
FastH       12500    509     7     0.595   0.558
they are still much slower than our FastHash. In terms of retrieval performance, the
results of SPLH and STHs improve with more training data, but our FastHash still
significantly outperforms all the others. The proposed decision tree hash functions in
FastHash perform feature selection and hash function learning at the same time, which
yields much better performance than the other hashing methods with dimension
reduction. The runner-up method is CCA+ITQ. Note that supervised feature reduction can
also be applied in our method.
7.3.5 Comparison with unsupervised methods
We compare to several popular unsupervised hashing methods: LSH [32], ITQ [36], Anchor
Graph Hashing (AGH) [38], Spherical Hashing (SPHER) [39] and MDSH [35]. The retrieval
performance is shown in Figure 7.6. Unsupervised methods perform poorly at preserving
label-based similarity, and our FastHash outperforms them by a large margin.
Table 7.7: Performance of our FastHash with more features (22400 dimensions) and more
bits (1024 bits). FastHash can be efficiently trained on high-dimensional features with a
large bit length; its training and binary coding time (Test time) increases only linearly
with the bit length.

Bits   # Train   Features   Train time   Test time   Precision    MAP
CIFAR10
64       50000     11200       1794          21         0.763     0.775
256      50000     22400       5588          71         0.794     0.814
1024     50000     22400      22687         282         0.803     0.826
IAPRTC12
64       17665     11200        320           9         0.371     0.276
256      17665     22400       1987          33         0.439     0.314
1024     17665     22400       7432         134         0.483     0.338
ESPGAME
64       18689     11200        663           9         0.261     0.189
256      18689     22400       1912          34         0.329     0.233
1024     18689     22400       7689         139         0.373     0.257
MIRFLICKR
64       12500     11200        509           7         0.595     0.558
256      12500     22400       1560          28         0.612     0.567
1024     12500     22400       6418         105         0.628     0.576
Table 7.8: Results on the large image dataset SUN397 using 11200-dimensional codebook
features. Our FastHash can be efficiently trained to a large bit length (1024 bits) on this
large training set and outperforms the other methods by a large margin on retrieval
performance.

Method     # Train   Bits   Train time   Test time   Precision    MAP
SUN397
KSH          10000     64      57045        463        0.034     0.023
BREs         10000     64     105240         23        0.019     0.013
SPLH         10000     64      27552         14        0.022     0.015
STHs         10000     64      22914         14        0.010     0.008
ITQ         100417   1024       1686        127        0.030     0.021
SPHER       100417   1024      35954        121        0.039     0.024
LSH              −   1024          −         99        0.028     0.019
CCA+ITQ     100417    512       7484         66        0.113     0.076
CCA+ITQ     100417   1024      15580        127        0.120     0.081
FastH       100417    512      29624        302        0.149     0.142
FastH       100417   1024      62076        536        0.165     0.163
7.3.6 More features and more bits
To further evaluate the training efficiency of our method, we increase the codebook
size to 1600 to generate higher-dimensional features (22400 dimensions) and run up
to 1024 bits. The results are shown in Table 7.7: our FastHash can be efficiently
trained on high-dimensional features with a large bit length, and its training and
binary coding time (Test time) increases only linearly with the bit length. The
retrieval performance improves as the bit length increases.
Figure 7.6: Retrieval precision curves (precision vs. number of retrieved samples, 64 bits)
of the unsupervised methods LSH, AGH, MDSH, SPHER and ITQ against our FastHash on
CIFAR10, ESPGAME, IAPRTC12 and MIRFLICKR. Unsupervised methods perform poorly
at preserving label-based similarity; our FastHash outperforms them by a large margin.
7.3.7 Large dataset: SUN397
The challenging SUN397 dataset is a collection of more than 100000 scene images from
397 categories. We extract 11200-dimensional codebook features on this dataset and
compare with a number of supervised and unsupervised methods; the depth of the
decision trees is set to 6. The results are presented in Table 7.8. The supervised
methods KSH, BREs, SPLH and STHs are trained to 64 bits on a subset of 10K examples.
However, even on this sampled training set, and run only to 64 bits, the training of
these methods is already impractically slow; it would be almost intractable on the
whole training set with a long bit length, and short codes are not able to achieve good
performance on this challenging dataset. In contrast, our method can be efficiently
trained to a large bit length (1024 bits) on the whole training set (more than 100000
training examples), and our FastHash outperforms the other methods by a large margin
on retrieval performance. The runner-up is the supervised method CCA+ITQ [36]. The
unsupervised methods ITQ, SPHER and LSH are also efficient, but they perform poorly
at preserving label-based similarity. The retrieval performance at 1024 bits is also
plotted in Figure 7.7.
Figure 7.7: The precision curve of the top 2000 retrieved examples on the large image
dataset SUN397 using 1024 bits, comparing LSH, SPHER, ITQ, CCA+ITQ and our FastHash.
Here we compare with the methods which can be efficiently trained up to 1024 bits on the
whole training set. Our FastHash outperforms the others by a large margin.
For memory usage, many of the competing methods require a large amount of memory for
large matrix multiplications. In contrast, the decision tree learning in our method only
involves simple comparison operations on quantized feature data (256 bins); thus FastHash
consumes less than 7GB of memory for training, which shows that our method can be easily
applied to large-scale training.
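The quantization step can be sketched as follows; equal-width binning per dimension is an assumption for illustration (the text only states that features are quantized into 256 bins):

```python
import numpy as np

def quantize_uint8(X):
    """Quantize each feature dimension into 256 equal-width bins, one byte per
    value. Equal-width binning is an assumed scheme for this sketch."""
    lo = X.min(axis=0, keepdims=True)
    hi = X.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, 255.0 / (hi - lo), 0.0)
    return np.clip((X - lo) * scale, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 11200))   # float64 features: 8 bytes per value
Q = quantize_uint8(X)                    # 1 byte per value: 8x smaller
```

Tree split search then only compares byte values, so both the memory footprint and the cost of evaluating candidate splits drop substantially.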
7.4 Conclusion
We have proposed an efficient supervised hashing method for large-scale and high-
dimensional data. Our method follows a two-step hashing learning scheme: we develop
an efficient GraphCut based block search method for solving the large-scale inference,
and learn decision tree based hash functions by fitting the binary codes. Our method
can be efficiently trained on large-scale and high-dimensional datasets, and achieves
high retrieval precision. Our comprehensive experiments show the advantages of our
method in retrieval precision and in training and testing speed, indicating its
practical significance in many applications such as large-scale image retrieval.
Chapter 8
Conclusion
This thesis has made practical and theoretical contributions to structured learning and
binary code learning. We have presented our novel learning methods and explored a num-
ber of applications in computer vision, including image classification, image retrieval,
image segmentation, visual tracking and so on.
In the first part of this thesis, we have explored column generation based techniques
for multi-class classification and the more general structured output prediction task.
In Chapter 3, we proposed to learn a separate set of weak learners for each class;
this class-specific weak learner learning achieves much faster convergence than
traditional multi-class boosting methods. To solve the optimization, we developed an
efficient coordinate descent method with a closed-form solution for each update
iteration. The proposed method has been empirically shown to offer fast training and
competitive testing accuracy.
Apart from multi-class classification, in Chapter 4, we have developed a boosting based
approach (StructBoost) for general structured output prediction tasks, as an alternative
to conventional structured learning methods like SSVM [7] and CRF [88]. The pro-
posed boosting based method is able to perform efficient nonlinear structured output
prediction by learning and combining a set of weak structured learners. To solve the
resulting optimization problems, we introduced an efficient cutting-plane method and
combined it with the column generation learning framework. In a wide range of
applications, including multi-class classification, hierarchical multi-class
classification by optimizing the tree loss, visual tracking by optimizing the Pascal
overlap criterion, and learning CRF parameters for image segmentation, we have shown
that StructBoost achieves competitive performance compared to conventional approaches
and significantly outperforms linear SSVM, demonstrating the usefulness of our
nonlinear structured learning method.
In the second part of this thesis, we have introduced three hashing methods which focus
on different aspects of the binary code learning problem. The first method, proposed
in Chapter 5, exploits triplet-based relative ranking relations for hash function
learning. It is based on a large-margin learning framework and incorporates triplet
ranking constraints in the optimization problem; the resulting optimization is solved
by column generation techniques.
The hashing methods presented in Chapters 6 and 7 aim to learn hash functions by
preserving pairwise similarity. Specifically, in Chapter 6 we introduced a general
two-step learning approach, in which the learning task is divided into binary code
inference steps and binary classification steps. We have shown that any Hamming
distance (or Hamming affinity) based loss function can be readily applied in this
general learning framework, and thus we place a wide variety of hashing methods in a
unified framework. As an extension of this general two-step framework, we proposed an
efficient hashing method in Chapter 7, which aims at hash function learning from
large-scale and high-dimensional data. We developed an efficient GraphCut based block
search method for solving non-submodular problems, which can be easily applied to
large-scale binary inference. Moreover, we proposed to learn decision trees as
efficient non-linear hash functions, which are more suitable for high-dimensional data
than kernel hash functions. Comprehensive experiments have shown the advantages of our
method in retrieval precision and in training and testing speed, indicating its
practical significance in many applications such as large-scale image retrieval.
8.1 Future work
Both structured output prediction and binary code learning have been applied in a
variety of computer vision applications. As part of future work, we aim to explore
more computer vision applications of the proposed learning methods. For structured
learning, our boosting approach can be applied to learning the interactions between
the elements of complex outputs, which are involved in many vision tasks such as pose
recognition, action recognition, event detection, context modeling and scene
understanding. The proposed binary code learning approaches can find potential
applications in large-scale nearest neighbor retrieval and in learning efficient
binary code representations; example tasks include nearest-neighbor based
image/object/event retrieval, image classification, object detection and feature
matching.
Another direction is to incorporate advanced feature learning techniques into our
structured learning and binary code learning frameworks to develop end-to-end learning
methods. Such end-to-end learning is able to update both the low-level feature
extraction model and the high-level prediction model according to task-specific
targets. A popular family of feature learning methods is deep neural networks (DNNs),
which are being actively explored in both the research community and industry. It has
been shown that DNNs are able to learn hierarchical (low-level to high-level) visual
concepts, and they have achieved impressive performance in many vision applications,
including large-scale image classification [143], object detection [144, 145] and
segmentation [146]. It would be interesting to investigate how DNNs can be applied
in our structured learning and binary code learning frameworks to gain further
performance improvement.
Appendix A
Appendix for MultiBoostcw
A.1 Dual problem of MultiBoostcw
Here we describe how to derive the dual problem of MultiBoostcw. The proposed method
MultiBoostcw with the exponential loss is written as (A.1):
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \sum_{y \ne y_i} \exp\bigl(-\rho_{(i,y)}\bigr) && \text{(A.1a)}\\
\text{s.t.} \quad & \forall i = 1,\dots,n \ \text{and}\ \forall y \in \{1,\dots,K\}\setminus\{y_i\}:\\
& \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) - \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i) = \rho_{(i,y)}. && \text{(A.1b)}
\end{aligned}
\]
The Lagrangian of (A.1) can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \sum_{y \ne y_i} \exp\bigl(-\rho_{(i,y)}\bigr) - \boldsymbol{\alpha}^{\top}\mathbf{w}
+ \sum_i \sum_{y \ne y_i} \mu_{(i,y)}\bigl[\rho_{(i,y)} - \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) + \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i)\bigr] \tag{A.2}
\]
with α ≥ 0, in which µ,α are Lagrangian multipliers. At optimum, the first derivative
of the Lagrangian w.r.t the primal variables must vanish:
\[
\frac{\partial L}{\partial \rho_{(i,y)}} = 0 \;\Longrightarrow\; -\frac{C}{p}\exp\bigl(-\rho_{(i,y)}\bigr) + \mu_{(i,y)} = 0 \;\Longrightarrow\; \rho_{(i,y)} = -\log\Bigl(\frac{p}{C}\,\mu_{(i,y)}\Bigr)
\]
\[
\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}_c} = 0 \;\Longrightarrow\;& \mathbf{1} - \sum_{i:\,y_i = c}\ \sum_{y \ne y_i} \mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) + \sum_i \sum_{y \ne y_i,\, y = c} \mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) - \boldsymbol{\alpha}_c = \mathbf{0}\\
\Longrightarrow\;& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i} \mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c} \mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) \le \mathbf{1}
\end{aligned}
\]
The dual problem of (A.1) can be written as (A.3), in which c is the index of class labels.
µ(i,y) is the dual variable associated with one constraint in (A.1b):
\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \ne y_i} \mu_{(i,y)}\Bigl[1 - \log\frac{p}{C} - \log\mu_{(i,y)}\Bigr] && \text{(A.3a)}\\
\text{s.t.} \quad & \forall \phi(\cdot) \in \mathcal{C} \ \text{and}\ \forall c = 1,\dots,K:\\
& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i} \mu_{(i,y)}\phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c} \mu_{(i,y)}\phi_{y}(\mathbf{x}_i) \le 1 && \text{(A.3b)}\\
& \forall i = 1,\dots,n:\quad 0 \le \sum_{y \ne y_i} \mu_{(i,y)} \le \frac{C}{p} && \text{(A.3c)}
\end{aligned}
\]
A.2 MultiBoostcw with the hinge loss
MultiBoostcw is a flexible framework that is able to use different loss functions.
Using another smooth loss function (e.g., the squared hinge loss or the logistic loss)
in MultiBoostcw is similar to the case of the exponential loss and can be derived
straightforwardly. Here we discuss a non-smooth case as a different example (denoted
MultiBoostcw-hinge): using the hinge loss with slack-variable sharing in the
constraints (one slack variable ξi is associated with the (K − 1) constraints
corresponding to the example xi). MultiBoostcw-hinge can be formulated as (A.4), in
which p is the number of slack variables ξ (p = m):
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \xi_i && \text{(A.4a)}\\
\text{s.t.} \quad & \forall i = 1,\dots,n \ \text{and}\ \forall y \in \{1,2,\dots,K\}\setminus\{y_i\}:\\
& \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) - \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i) \ge 1 - \xi_i. && \text{(A.4b)}
\end{aligned}
\]
The Lagrangian of (A.4) can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + \frac{C}{p}\sum_i \xi_i - \boldsymbol{\alpha}^{\top}\mathbf{w} - \boldsymbol{\beta}^{\top}\boldsymbol{\xi}
+ \sum_i \sum_{y \ne y_i} \mu_{(i,y)}\bigl[1 - \xi_i - \mathbf{w}_{y_i}^{\top}\Phi_{y_i}(\mathbf{x}_i) + \mathbf{w}_{y}^{\top}\Phi_{y}(\mathbf{x}_i)\bigr] \tag{A.5}
\]
with µ ≥ 0,α ≥ 0,β ≥ 0, in which µ,α,β are Lagrangian multipliers. At optimum,
the first derivative of the Lagrangian w.r.t the primal variables must vanish:
\[
\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; \frac{C}{p} - \sum_{y \ne y_i}\mu_{(i,y)} - \beta_i = 0 \;\Longrightarrow\; 0 \le \sum_{y \ne y_i}\mu_{(i,y)} \le \frac{C}{p}
\]
\[
\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}_c} = 0 \;\Longrightarrow\;& \mathbf{1} - \sum_{i:\,y_i = c}\ \sum_{y \ne y_i}\mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) + \sum_i \sum_{y \ne y_i,\, y = c}\mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) - \boldsymbol{\alpha}_c = \mathbf{0}\\
\Longrightarrow\;& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i}\mu_{(i,y)}\Phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c}\mu_{(i,y)}\Phi_{y}(\mathbf{x}_i) \le \mathbf{1}
\end{aligned}
\]
Hence the dual problem of (A.4) can be written as (A.6), in which c indicates the c-th
class. µ(i,y) is the dual variable associated with the constraint (A.4b) for label y 6= yi
and training pair (xi, yi):
\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_i \sum_{y \ne y_i}\mu_{(i,y)} && \text{(A.6a)}\\
\text{s.t.} \quad & \forall \phi(\cdot) \in \mathcal{C} \ \text{and}\ \forall c = 1,\dots,K:\\
& \sum_{i:\,y_i = c}\ \sum_{y \ne y_i}\mu_{(i,y)}\phi_{y_i}(\mathbf{x}_i) - \sum_i \sum_{y \ne y_i,\, y = c}\mu_{(i,y)}\phi_{y}(\mathbf{x}_i) \le 1 && \text{(A.6b)}\\
& \forall i = 1,\dots,n:\quad 0 \le \sum_{y \ne y_i}\mu_{(i,y)} \le \frac{C}{p} && \text{(A.6c)}
\end{aligned}
\]
With the primal-dual pair of (A.4) and (A.6), we can develop a column generation
algorithm for the hinge loss similar to Algorithm 1 for the exponential loss. Notice
that the dual constraint (A.6b) is the same as (A.3b), thus finding new weak learners
proceeds exactly as in the exponential-loss case. There are only two differences:

1. A different solver is used: the optimization in (A.4) is a linear programming
problem (LP), which can be solved by MOSEK [147] or any other off-the-shelf LP solver.

2. The dual solution µ is obtained in a different way: the primal solution w and the
dual solution µ can be obtained at the same time from the MOSEK solver.
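As an illustration of the second point, an off-the-shelf LP solver returns the primal and dual solutions of a linear program in a single solve. The toy LP below is a stand-in for (A.4), not the actual formulation, and uses SciPy's `linprog` in place of MOSEK:

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: minimize w0 + w1 subject to w0 + 2*w1 >= 4 and w >= 0.
# linprog uses "<=" constraints, so the margin constraint is negated.
c = [1.0, 1.0]
A_ub = [[-1.0, -2.0]]
b_ub = [-4.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2, method="highs")

w = res.x                    # primal solution of the toy LP
mu = -res.ineqlin.marginals  # nonnegative dual of the ">=" constraint
```

Here `w` plays the role of the primal variables of (A.4) and `mu` the role of the dual variables µ(i,y) needed to generate the next weak learner.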
Appendix B
Appendix for StructBoost
B.1 Dual formulation of n-slack
The formulation of StructBoost is written as (n-slack primal):
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + \frac{C}{n}\mathbf{1}^{\top}\boldsymbol{\xi} && \text{(B.1a)}\\
\text{s.t.} \quad & \forall i = 1,\dots,n \ \text{and}\ \forall \mathbf{y} \in \mathcal{Y}:\\
& \mathbf{w}^{\top}\delta\Psi_i(\mathbf{y}) \ge \Delta(\mathbf{y}_i,\mathbf{y}) - \xi_i. && \text{(B.1b)}
\end{aligned}
\]
The Lagrangian of the n-slack primal problem can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + \frac{C}{n}\mathbf{1}^{\top}\boldsymbol{\xi}
- \sum_{i,\mathbf{y}} \mu_{(i,\mathbf{y})}\bigl[\mathbf{w}^{\top}\delta\Psi_i(\mathbf{y}) - \Delta(\mathbf{y}_i,\mathbf{y}) + \xi_i\bigr]
- \boldsymbol{\nu}^{\top}\mathbf{w} - \boldsymbol{\beta}^{\top}\boldsymbol{\xi}, \tag{B.2}
\]
where µ,ν,β are Lagrange multipliers: µ ≥ 0,ν ≥ 0,β ≥ 0. We denote by µ(i,y)
the Lagrange dual multiplier associated with the margin constraints (B.1b) for label y
and training pair (xi,yi). At optimum, the first derivative of the Lagrangian w.r.t. the
primal variables must vanish,
\[
\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; \frac{C}{n} - \sum_{\mathbf{y}}\mu_{(i,\mathbf{y})} - \beta_i = 0
\;\Longrightarrow\; 0 \le \sum_{\mathbf{y}}\mu_{(i,\mathbf{y})} \le \frac{C}{n};
\]
and,
\[
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Longrightarrow\; \mathbf{1} - \sum_{i,\mathbf{y}}\mu_{(i,\mathbf{y})}\,\delta\Psi_i(\mathbf{y}) - \boldsymbol{\nu} = \mathbf{0}
\;\Longrightarrow\; \sum_{i,\mathbf{y}}\mu_{(i,\mathbf{y})}\,\delta\Psi_i(\mathbf{y}) \le \mathbf{1}.
\]
Substituting these back into the Lagrangian (B.2), we obtain the dual problem of
the n-slack formulation (B.1):
\[
\begin{aligned}
\max_{\boldsymbol{\mu} \ge 0} \quad & \sum_{i,\mathbf{y}} \mu_{(i,\mathbf{y})}\,\Delta(\mathbf{y}_i,\mathbf{y}) && \text{(B.3a)}\\
\text{s.t.} \quad & \forall \psi \in \mathcal{C}: \ \sum_{i,\mathbf{y}} \mu_{(i,\mathbf{y})}\,\delta\psi_i(\mathbf{y}) \le 1, && \text{(B.3b)}\\
& \forall i = 1,\dots,n: \ 0 \le \sum_{\mathbf{y}} \mu_{(i,\mathbf{y})} \le \frac{C}{n}. && \text{(B.3c)}
\end{aligned}
\]
B.2 Dual formulation of 1-slack
The 1-slack formulation of StructBoost is written as:
\[
\begin{aligned}
\min_{\mathbf{w} \ge 0,\ \xi \ge 0} \quad & \mathbf{1}^{\top}\mathbf{w} + C\xi && \text{(B.4a)}\\
\text{s.t.} \quad & \forall \mathbf{c} \in \{0,1\}^n \ \text{and}\ \forall \mathbf{y} \in \mathcal{Y}:\\
& \frac{1}{n}\,\mathbf{w}^{\top}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] \ge \frac{1}{n}\sum_{i=1}^{n} c_i\,\Delta(\mathbf{y}_i,\mathbf{y}) - \xi. && \text{(B.4b)}
\end{aligned}
\]
The Lagrangian of the 1-slack primal problem can be written as:
\[
L = \mathbf{1}^{\top}\mathbf{w} + C\xi
- \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})}\Bigl\{\frac{1}{n}\,\mathbf{w}^{\top}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] - \frac{1}{n}\sum_{i=1}^{n} c_i\,\Delta(\mathbf{y}_i,\mathbf{y}) + \xi\Bigr\}
- \boldsymbol{\nu}^{\top}\mathbf{w} - \beta\xi, \tag{B.5}
\]
where λ, ν, β are Lagrange multipliers: λ ≥ 0, ν ≥ 0, β ≥ 0. We denote by λ(c,y) the
Lagrange multiplier associated with the inequality constraint for c ∈ {0, 1}^n and
label y. At optimum, the first derivative of the Lagrangian w.r.t. the primal
variables must vanish,
\[
\frac{\partial L}{\partial \xi} = 0 \;\Longrightarrow\; C - \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})} - \beta = 0
\;\Longrightarrow\; 0 \le \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})} \le C;
\]
and,
\[
\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Longrightarrow\;& \mathbf{1} - \frac{1}{n}\sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] = \boldsymbol{\nu}\\
\Longrightarrow\;& \frac{1}{n}\sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})}\Bigl[\sum_{i=1}^{n} c_i\,\delta\Psi_i(\mathbf{y})\Bigr] \le \mathbf{1}. \qquad \text{(B.6)}
\end{aligned}
\]
The dual problem of (B.4) can be written as:
\[
\begin{aligned}
\max_{\boldsymbol{\lambda} \ge 0} \quad & \sum_{\mathbf{c},\mathbf{y}} \lambda_{(\mathbf{c},\mathbf{y})} \sum_{i=1}^{n} c_i\,\Delta(\mathbf{y}_i,\mathbf{y}) && \text{(B.7a)}\\
\text{s.t.} \quad & \forall \psi \in \mathcal{C}: \ \frac{1}{n}\sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})}\Bigl[\sum_{i=1}^{n} c_i\,\delta\psi_i(\mathbf{y})\Bigr] \le 1, && \text{(B.7b)}\\
& 0 \le \sum_{\mathbf{c},\mathbf{y}}\lambda_{(\mathbf{c},\mathbf{y})} \le C. && \text{(B.7c)}
\end{aligned}
\]
Appendix C
Appendix for CGHash
C.1 Learning hashing functions with the hinge loss
Our method can be easily extended to non-smooth loss functions. Here we discuss
the hinge loss as an example of a non-smooth loss function.
C.1.1 Using ℓ1 norm regularization
We here discuss the case of using the hinge loss and `1 norm regularization for hashing
learning. When using the hinge loss, we define the following large-margin optimization
problem:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \mathbf{1}^{\top}\mathbf{w} + C \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.1a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}:\\
& d_{\mathrm{hm}}(\mathbf{x}_i,\mathbf{x}_k;\mathbf{w}) - d_{\mathrm{hm}}(\mathbf{x}_i,\mathbf{x}_j;\mathbf{w}) \ge 1 - \xi_{(i,j,k)}, && \text{(C.1b)}\\
& \mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0. && \text{(C.1c)}
\end{aligned}
\]
Here we have used the ℓ1 norm on w as the regularization term to control the complexity
of the learned model. With the definition of the weighted Hamming distance in (5.3)
and the notation in (5.6), the optimization problem in (C.1) can be rewritten as:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \mathbf{1}^{\top}\mathbf{w} + C \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.2a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}: \ \mathbf{w}^{\top}\delta\Phi(i,j,k) \ge 1 - \xi_{(i,j,k)}, && \text{(C.2b)}\\
& \mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0. && \text{(C.2c)}
\end{aligned}
\]
To apply the column generation technique for learning hash functions, we derive the dual
problem of the above optimization. The corresponding dual problem can be written as:
\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)} && \text{(C.3a)}\\
\text{s.t.} \quad & \forall h(\cdot) \in \mathcal{C}: \ \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\,\delta h(i,j,k) \le 1, && \text{(C.3b)}\\
& \forall (i,j,k) \in \mathcal{T}: \ 0 \le \mu_{(i,j,k)} \le C. && \text{(C.3c)}
\end{aligned}
\]
Here µ is one dual variable, which corresponds to one constraint in (C.2b). Similar to
the case of squared hinge loss, we learn a new hash function by finding the most violated
constraint of the dual problem. We solve the following subproblem to learn one hash
function:
\[
\begin{aligned}
h^{\star}(\cdot) &= \operatorname*{argmax}_{h(\cdot) \in \mathcal{C}} \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\,\delta h(i,j,k)\\
&= \operatorname*{argmax}_{h(\cdot) \in \mathcal{C}} \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\Bigl[\bigl|h(\mathbf{x}_i) - h(\mathbf{x}_k)\bigr| - \bigl|h(\mathbf{x}_i) - h(\mathbf{x}_j)\bigr|\Bigr]. \qquad \text{(C.4a)}
\end{aligned}
\]
Solving the above optimization is the same as for the squared hinge loss, which we
have discussed. In each column generation iteration, we need to obtain the primal and
dual solutions. Different from smooth convex loss functions (e.g., the squared hinge
loss), the primal optimization in (C.2) is a linear programming problem. Here we can
use MOSEK [147] to solve the primal optimization in (C.2) and obtain the primal
solution w and the dual solution µ.
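The subproblem (C.4a) scores every candidate hash function by the µ-weighted triplet margin and keeps the best. A minimal sketch with a pool of axis-aligned threshold functions; the stump pool and the random data are illustrative assumptions, not the actual hash function class:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 16))
T = [tuple(rng.integers(0, 100, size=3)) for _ in range(50)]   # triplets (i, j, k)
mu = rng.random(len(T))                                        # current dual values

def violation(h, mu, T):
    """sum over triplets of mu * (|h(x_i)-h(x_k)| - |h(x_i)-h(x_j)|)."""
    return sum(m * (abs(int(h[i]) - int(h[k])) - abs(int(h[i]) - int(h[j])))
               for m, (i, j, k) in zip(mu, T))

# Candidate pool: h(x) = 1[x_d > t] over a few dimensions and thresholds.
pool = [(d, t) for d in range(X.shape[1]) for t in (-0.5, 0.0, 0.5)]
scores = [violation((X[:, d] > t).astype(int), mu, T) for d, t in pool]
d_star, t_star = pool[int(np.argmax(scores))]   # most violated dual constraint
```

The selected function is the one whose dual constraint in (C.3b) is most violated, which is exactly the column generation rule discussed above.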
C.1.2 Using ℓ∞ norm regularization
We can also use the hinge loss with ℓ∞ norm regularization. The primal optimization
can be written as:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \|\mathbf{w}\|_{\infty} + C \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.5a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}: \ \mathbf{w}^{\top}\delta\Phi(i,j,k) \ge 1 - \xi_{(i,j,k)}, && \text{(C.5b)}\\
& \mathbf{w} \ge 0,\ \boldsymbol{\xi} \ge 0. && \text{(C.5c)}
\end{aligned}
\]
The above optimization can be equivalently written as:
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad & \sum_{(i,j,k) \in \mathcal{T}} \xi_{(i,j,k)} && \text{(C.6a)}\\
\text{s.t.} \quad & \forall (i,j,k) \in \mathcal{T}: \ \mathbf{w}^{\top}\delta\Phi(i,j,k) \ge 1 - \xi_{(i,j,k)}, && \text{(C.6b)}\\
& 0 \le \boldsymbol{\xi},\quad 0 \le \mathbf{w} \le C'\mathbf{1}. && \text{(C.6c)}
\end{aligned}
\]
Here C ′ is a constant that controls the regularization trade-off. The dual problem can
be derived as:
\[
\begin{aligned}
\max_{\boldsymbol{\mu},\,\boldsymbol{\beta}} \quad & -C'\mathbf{1}^{\top}\boldsymbol{\beta} + \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)} && \text{(C.7a)}\\
\text{s.t.} \quad & \forall h(\cdot) \in \mathcal{C}: \ \sum_{(i,j,k) \in \mathcal{T}} \mu_{(i,j,k)}\,\delta h(i,j,k) \le \beta_h, && \text{(C.7b)}\\
& 0 \le \boldsymbol{\beta},\quad 0 \le \boldsymbol{\mu} \le \mathbf{1}. && \text{(C.7c)}
\end{aligned}
\]
Here µ, β are dual variables. From the dual problem, we can see that the rule for
generating one hash function is the same as that for the ℓ1 norm, namely solving the
subproblem in (C.4a). As in the ℓ1 norm case, we can apply MOSEK [147] to solve the
primal optimization in (C.5) and obtain the primal solution w and the dual solution µ.
Bibliography
[1] C. Shen and Z. Hao. A direct formulation for totally-corrective multi-class boost-
ing. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2011.
[2] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran
subroutines for large-scale bound-constrained optimization. ACM T. Math. Softw.,
1997.
[3] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba.
SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[4] S. Hare, A. Saffari, and P. Torr. Struck: Structured output tracking with kernels.
In Proceedings of the International Conference on Computer Vision, 2011.
[5] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization
with superpixel neighborhoods. In Proceedings of the International Conference on
Computer Vision, 2009.
[6] Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Breaking the curse of
kernelization: Budgeted stochastic gradient descent for large-scale SVM training.
The Journal of Machine Learning Research, 2012.
[7] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Al-
tun. Support vector machine learning for interdependent and structured output
spaces. In Proceedings of the International Conference on Machine Learning, 2004.
[8] Martin Szummer, Pushmeet Kohli, and Derek Hoiem. Learning CRFs using graph
cuts. In Proceedings of European Conference on Computer Vision, 2008.
[9] Sebastian Nowozin, Peter V. Gehler, and Christoph H. Lampert. On parameter
learning in CRF-based approaches to object class image segmentation. In Proceed-
ings of European Conference on Computer Vision, 2010.
[10] Matthew B. Blaschko and Christoph H. Lampert. Learning to localize objects with
structured output regression. In Proceedings of European Conference on Computer
Vision, 2008.
[11] Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and A. Van Den Hengel.
Part-based visual tracking with online latent structural learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[12] Lu Zhang and Laurens van der Maaten. Structure preserving object tracking. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2013.
[13] K. Tang, Li Fei-Fei, and D. Koller. Learning latent temporal structure for complex
event detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2012.
[14] Jiang Wang and Ying Wu. Learning maximum margin temporal warping for action
recognition. In Proceedings of the International Conference on Computer Vision,
2013.
[15] Sebastian Nowozin and Christoph H. Lampert. Structured learning and prediction
in computer vision. Foundations & Trends in Computer Graphics & Vision, 2011.
[16] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large data
set for nonparametric object and scene recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2008.
[17] J. Wang, S. Kumar, and S.F. Chang. Semi-supervised hashing for large scale
search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[18] Xavier Boix, Gemma Roig, Christian Leistner, and Luc Van Gool. Nested sparse
quantization for efficient feature coding. In Proceedings of European Conference
on Computer Vision, 2012.
[19] Thomas Dean, Mark A. Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijaya-
narasimhan, and Jay Yagnik. Fast, accurate detection of 100,000 object classes
on a single machine. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2013.
[20] Christoph Strecha, Alex Bronstein, Michael Bronstein, and Pascal Fua. LDAHash:
Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2012.
[21] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation
with parameter-sensitive hashing. In Proceedings of the International Conference
on Computer Vision, 2003.
[22] Herve Jegou, Matthijs Douze, Cordelia Schmid, and Patrick Perez. Aggregating
local descriptors into a compact image representation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2010.
[23] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for
nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2011.
[24] Tomasz Trzcinski, Mario Christoudias, Pascal Fua, and Vincent Lepetit. Boosting
binary keypoint descriptors. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2013.
[25] Wei Dong, Zhe Wang, William Josephson, Moses Charikar, and Kai Li. Modeling
LSH for performance tuning. In Proceedings of the ACM Conference on Information
and Knowledge Management, 2008.
[26] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe
LSH: Efficient indexing for high-dimensional similarity search. In Proceedings of the
International Conference on Very Large Data Bases, 2007.
[27] Ruslan Salakhutdinov and Geoffrey E. Hinton. Learning a nonlinear embedding
by preserving class neighbourhood structure. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2007.
[28] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings.
In Proceedings of Advances in Neural Information Processing Systems, 2009.
[29] M. Norouzi and D.J. Fleet. Minimal loss hashing for compact binary codes. In
Proceedings of the International Conference on Machine Learning, 2011.
[30] D. Zhang, J. Wang, D. Cai, and J. Lu. Extensions to self-taught hashing: kerneli-
sation and supervision. In Proc. ACM SIGIR Workshop on Feature Generation
and Selection for Information Retrieval, pages 19–26, 2010.
[31] W. Liu, J. Wang, R. Ji, Y.G. Jiang, and S.F. Chang. Supervised hashing with
kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2012.
[32] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high
dimensions via hashing. In Proceedings of the International Conference on Very
Large Data Bases, 1999.
[33] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity
search. In Proceedings of the annual international ACM SIGIR conference on
research and development in information retrieval, 2010.
[34] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Advances
in Neural Information Processing Systems, 2008.
[35] Yair Weiss, Rob Fergus, and Antonio Torralba. Multidimensional spectral hashing.
In Proceedings of European Conference on Computer Vision, 2012.
[36] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a
procrustean approach to learning binary codes for large-scale image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2012.
[37] Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving
quantization method for learning binary compact codes. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[38] W. Liu, J. Wang, S. Kumar, and S. F. Chang. Hashing with graphs. In Proceedings
of the International Conference on Machine Learning, 2011.
[39] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon.
Spherical hashing. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2012.
[40] Yunchao Gong, Sanjiv Kumar, Henry A. Rowley, and Svetlana Lazebnik. Learning
binary codes for high-dimensional data using bilinear projections. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[41] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and David Suter. Fast
training of effective multi-class boosting using coordinate descent optimization. In
Proceedings of the Asian Conference on Computer Vision, 2013.
[42] Chunhua Shen, Guosheng Lin, and Anton van den Hengel. StructBoost: Boosting
methods for predicting structured output variables. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2014.
[43] Xi Li, Guosheng Lin, Chunhua Shen, A. van den Hengel, and Anthony Dick. Learn-
ing hash functions using column generation. In Proceedings of the International
Conference on Machine Learning, 2013.
[44] Guosheng Lin, Chunhua Shen, David Suter, and Anton van den Hengel. A general
two-step approach to learning-based hashing. In Proceedings of the International
Conference on Computer Vision, 2013.
[45] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David.
Suter. Fast supervised hashing with decision trees for high-dimensional data. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2014.
[46] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-
cut/max-flow algorithms for energy minimization in vision. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2004.
[47] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence data. In
Proceedings of the International Conference on Machine Learning, 2001.
[48] Jun Zhu and Eric P Xing. Maximum entropy discrimination markov networks.
The Journal of Machine Learning Research, 2009.
[49] Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin markov net-
works. In Proceedings of Advances in Neural Information Processing Systems,
2003.
[50] Tong Zhang. Statistical behavior and consistency of classification methods based
on convex risk minimization. Annals of Statistics, 2004.
[51] David McAllester. Generalization bounds and consistency for structured labeling
in predicting structured data. Predicting Structured Data, 2007.
[52] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane train-
ing of structural SVMs. Mach. Learn., 2009.
[53] Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle
methods for regularized risk minimization. The Journal of Machine Learning
Research, 2010.
[54] Nathan D Ratliff, J Andrew Bagnell, and Martin Zinkevich. (approximate) sub-
gradient methods for structured prediction. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2007.
[55] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal esti-
mated sub-gradient solver for SVM. In Proceedings of the International Conference
on Machine Learning, 2007.
[56] Steve Branson, Oscar Beijbom, and Serge Belongie. Efficient large-scale structured
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2013.
[57] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Comput. Syst. Sci., 1997.
[58] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting
algorithms as gradient descent. In Proceedings of Advances in Neural Information
Processing Systems, 1999.
[59] Chunhua Shen and Hanxi Li. On the dual formulation of boosting algorithms.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[60] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting
via column generation. Mach. Learn., 2002.
[61] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. The Journal of Machine Learning Research, 2001.
[62] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In
Proceedings of the Annual ACM Symposium on Theory of Computing, 2002.
[63] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2012.
[64] Fumin Shen, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and Zhenmin
Tang. Inductive hashing on manifolds. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2013.
[65] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. In Proc. IEEE Symp. Foundations of Computer
Science, 2006.
[66] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-
sensitive hashing scheme based on p-stable distributions. In Proceedings of the
Annual Symposium on Computational Geometry, 2004.
[67] Brian Kulis, Prateek Jain, and Kristen Grauman. Fast similarity search for learned
metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
[68] Fan RK Chung. Spectral graph theory. American Mathematical Society, 1997.
[69] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-François Paiement, Pascal
Vincent, and Marie Ouimet. Learning eigenfunctions links spectral embedding
and kernel PCA. Neural Computation, 2004.
[70] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping
using the Nyström method. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2004.
[71] Miguel A. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality
reduction. In Proceedings of the International Conference on Machine Learning,
2010.
[72] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction
to Information Retrieval. Cambridge University Press, 2008.
[73] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduc-
tion and data representation. Neural Computation, 2003.
[74] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector
machines. ACM Trans. Intell. Syst. Technol., 2011.
[75] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee S. Lee. Boosting the
margin: A new explanation for the effectiveness of voting methods. Annals of
Statistics, 1998.
[76] Paul Viola and Michael J. Jones. Robust real-time face detection. International
Journal of Computer Vision, 2004.
[77] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems
via error-correcting output codes. J. Artif. Int. Res., 1995.
[78] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-
correcting codes. In Proc. Annual Conf. Computational Learning Theory, 1999.
[79] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using
confidence-rated predictions. Mach. Learn., 1999.
[80] Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison
of optimization methods and software for large-scale L1-regularized linear classification.
The Journal of Machine Learning Research, 2010.
[81] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine
Learning Research, 2008.
[82] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning
for image classification. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2010.
[83] R. Uetz and S. Behnke. Large-scale object recognition with CUDA-accelerated hi-
erarchical neural networks. In IEEE Int. Conf. Intelligent Computing & Intelligent
Systems, 2009.
[84] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[85] J. Weston and C. Watkins. Multi-class support vector machines. In Proc. Euro.
Symp. Artificial Neural Networks, 1999.
[86] I. Steinwart. Sparseness of support vector machines. The Journal of Machine
Learning Research, 2003.
[87] Thorsten Joachims. Training linear SVMs in linear time. In Proc. ACM SIGKDD
Int. Conf. Knowledge Discovery and Data Mining, 2006.
[88] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In Proceedings of the
International Conference on Machine Learning, 2001.
[89] Charles Sutton and Andrew McCallum. An introduction to conditional random
fields. Foundations and Trends in Machine Learning, 2012.
[90] N. Plath, M. Toussaint, and S. Nakajima. Multi-class image segmentation using
conditional random fields and global classification. In Proceedings of the Interna-
tional Conference on Machine Learning, 2009.
[91] Luca Bertelli, Tianli Yu, Diem Vu, and Burak Gokturk. Kernelized structural
SVM learning for supervised object segmentation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2011.
[92] Chaitanya Desai, Deva Ramanan, and Charless C. Fowlkes. Discriminative models
for multi-class object layout. International Journal of Computer Vision, 2011.
[93] Thomas G Dietterich, Adam Ashenfelter, and Yaroslav Bulatov. Training condi-
tional random fields via gradient tree boosting. In Proceedings of the International
Conference on Machine Learning, 2004.
[94] Daniel Munoz, James A Bagnell, Nicolas Vandapel, and Martial Hebert. Contextual
classification with functional max-margin Markov networks. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[95] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting
algorithms as gradient descent. In Proceedings of Advances in Neural Information
Processing Systems, 1999.
[96] Nathan Ratliff, David Bradley, J. Andrew Bagnell, and Joel Chestnutt. Boosting
structured prediction for imitation learning. In Proceedings of Advances in Neural
Information Processing Systems, 2007.
[97] Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search: Func-
tional gradient techniques for imitation learning. Autonomous Robots, 2009.
[98] C. Shen, H. Li, and A. van den Hengel. Fully corrective boosting with arbitrary
loss and regularization. Neural Networks, 2013.
[99] Charles Parker, Alan Fern, and Prasad Tadepalli. Gradient boosting for sequence
alignment. In Proc. National Conf. Artificial Intelligence, 2006.
[100] C. Parker. Structured gradient boosting. PhD thesis, Oregon State University,
2007. URL http://hdl.handle.net/1957/6490.
[101] Q. Wang, D. Lin, and D. Schuurmans. Simple training of dependency parsers via
structured boosting. In Proc. Int. Joint Conf. Artificial Intell., 2007.
[102] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector
machines. In Proceedings of the International Conference on Machine Learning,
2008.
[103] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods
for regularized risk minimization. The Journal of Machine Learning Research, 11:
311–365, 2010.
[104] T. Joachims. A support vector method for multivariate performance measures. In
Proceedings of the International Conference on Machine Learning, 2005.
[105] Lijuan Cai and Thomas Hofmann. Hierarchical document categorization with
support vector machines. In Proc. ACM Int. Conf. Information & knowledge
management, 2004.
[106] C. Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
[107] S. Paisitkriangkrai, C. Shen, Q. Shi, and A. van den Hengel. RandomBoost: Sim-
plified multi-class boosting through randomization. IEEE Trans. Neural Networks
and Learning Systems, 2013.
[108] Sebastian Nowozin, Carsten Rother, Shai Bagon, Toby Sharp, Bangpeng Yao,
and Pushmeet Kohli. Decision tree fields. In Proceedings of the International
Conference on Computer Vision, 2011.
[109] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boost-
ing algorithm for combining preferences. The Journal of Machine Learning Re-
search, 2003.
[110] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan.
Object detection with discriminatively trained part-based models. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 2010.
[111] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual tracking with online
multiple instance learning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2009.
[112] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking
using the integral histogram. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2006.
[113] Helmut Grabner, Michael Grabner, and Horst Bischof. Real-time tracking via
on-line boosting. In Proc. British Mach. Vis. Conf., 2006.
[114] Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[115] Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang. Superpixel tracking.
In Proceedings of the International Conference on Computer Vision, 2011.
[116] M. Marszałek and C. Schmid. Accurate object localization with shape masks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[117] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic
representation of the spatial envelope. International Journal of Computer Vision,
2001.
[118] Antonio Torralba, Robert Fergus, and Yair Weiss. Small codes and large image
databases for recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2008.
[119] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved
matching with smaller descriptors. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2011.
[120] S. Korman and S. Avidan. Coherency sensitive hashing. In Proceedings of the
International Conference on Computer Vision, 2011.
[121] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation
with parameter-sensitive hashing. In Proceedings of the International Conference
on Computer Vision, 2003.
[122] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons.
In Proceedings of Advances in Neural Information Processing Systems, 2004.
[123] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric
learning using boosting-like algorithms. J. Machine Learning Research, 2012.
[124] R. Salakhutdinov and G. Hinton. Semantic hashing. Int. J. Approximate Reason-
ing, 2009.
[125] S. Baluja and M. Covell. Learning to hash: forgiving hash functions and applica-
tions. Data Mining & Knowledge Discovery, 2008.
[126] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
[127] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The
PASCAL visual object classes challenge 2007 results, 2007.
[128] K.Q. Weinberger, J. Blitzer, and L.K. Saul. Distance metric learning for large
margin nearest neighbor classification. In Proceedings of Advances in Neural In-
formation Processing Systems, 2006.
[129] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. J. American Society for Information Science,
1990.
[130] D. Zhang, J. Wang, D. Cai, and J. Lu. Laplacian co-hashing of terms and docu-
ments. In Proc. Eur. Conf. Information Retrieval, 2010.
[131] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large data
set for nonparametric object and scene recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2008.
[132] Adam Coates and Andrew Ng. The importance of encoding versus training with
sparse coding and vector quantization. In Proceedings of the International Con-
ference on Machine Learning, 2011.
[133] Ryan Kiros and Csaba Szepesvári. Deep representations and codes for image auto-
annotation. In Proceedings of Advances in Neural Information Processing Systems,
2012.
[134] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy mini-
mization via graph cuts. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2001.
[135] Mohammad Rastegari, Ali Farhadi, and David Forsyth. Attribute discovery via
predictable discriminative binary codes. In Proceedings of European Conference
on Computer Vision, 2012.
[136] Carsten Rother, Vladimir Kolmogorov, Victor Lempitsky, and Martin Szummer.
Optimizing binary MRFs via extended roof duality. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2007.
[137] Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal
Statistical Society, 1986.
[138] Mark Schmidt. UGM: Matlab code for undirected graphical models. URL
http://www.di.ens.fr/~mschmidt/Software/UGM.html, 2012.
[139] Ron Appel, Thomas Fuchs, Piotr Dollar, and Pietro Perona. Quickly boosting
decision trees: Pruning underachieving features early. In Proceedings of the
International Conference on Machine Learning, 2013.
[140] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regres-
sion: a statistical view of boosting (with discussion and a rejoinder by the authors).
Annals of Statistics, 2000.
[141] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid.
Tagprop: Discriminative metric learning in nearest neighbor models for image
auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2009.
[142] Mark J Huiskes and Michael S Lew. The MIR-Flickr retrieval evaluation. In Proc.
ACM Int. Conf. Multimedia Info. Retrieval, 2008.
[143] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Proceedings of Advances in Neural
Information Processing Systems, 2012.
[144] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hi-
erarchies for accurate object detection and semantic segmentation. arXiv preprint
arXiv:1311.2524, 2013.
[145] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks
for object detection. In Proceedings of Advances in Neural Information Processing
Systems, 2013.
[146] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning
hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2013.
[147] The MOSEK optimization software, 2010. URL http://www.mosek.com/.