PhD thesis defence slides
TRANSCRIPT
Learning to Hash for Large Scale Image Retrieval
Sean Moran
Pre-Viva Talk
27th November 2015
Learning to Hash for Large Scale Image Retrieval
Overview
Evaluation Protocol
Learning Multiple Quantisation Thresholds
Learning the Hashing Hypersurfaces
Conclusions
Efficient Retrieval in Large Image Archives
Applications
[Figure: example applications: recommendation, near-duplicate detection, content-based IR, noun clustering, location recognition.]
Locality Sensitive Hashing
[Diagram sequence: database images are passed through a hash function H, producing binary hashcodes (e.g. 110101, 010111, 010101, 111101, ...) that index buckets of a hash table. A query image is hashed with the same function H, the contents of its bucket are retrieved from the database, similarity to the query is computed, and the nearest neighbours are returned.]
[Application figures: content-based IR, location recognition, near-duplicate detection. Images: Imense Ltd, Doersch et al., Xu et al.]
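A minimal sketch of the pipeline in the diagrams above: database items are hashed into a table, the query is hashed with the same function, and only the query's bucket is scored. The sign-random-projection hash, the synthetic data and the code length are illustrative assumptions, not details from the slides.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))      # database descriptors (hypothetical)
q = rng.normal(size=128)              # query descriptor

W = rng.normal(size=(128, 6))         # 6 random hyperplanes -> 6-bit codes

def hashcode(v):
    """Project onto the hyperplane normals and take the sign as bits."""
    return tuple((v @ W > 0).astype(int))

# Index: bucket database items by their hashcode.
table = defaultdict(list)
for i, x in enumerate(X):
    table[hashcode(x)].append(i)

# Query: look only in the query's bucket, then rank by true similarity.
candidates = table[hashcode(q)]
ranked = sorted(candidates, key=lambda i: np.linalg.norm(X[i] - q))
print(ranked[:10])                    # approximate nearest neighbours
```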
Toy 2D Feature Space
[Figure: data-points plotted in a toy 2D feature space with axes x and y.]
Feature Space Partitioning
[Figure: two hyperplanes h1 and h2, with normal vectors w1 and w2, partition the 2D feature space containing data-points a and b.]
Step 1: Projection onto Normal Vectors
[Figure: the data-points are projected onto the hyperplane normal vectors (e.g. w2).]
Step 2: Binary Quantisation
[Figure: the projected values of points a and b along projected dimensions #1 and #2 (y1, y2) are thresholded at t1 and t2, assigning a bit of 0 or 1 per dimension.]
2-Bit Hashcodes of Data-Points
[Figure: the two thresholded hyperplanes h1, h2 (normals w1, w2) divide the feature space into four regions labelled 00, 01, 10 and 11; data-points a and b take the code of the region they fall in, and these 2-bit codes index the buckets (00, 01, 10, 11, ...) of a hash table.]
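The two steps above can be written down directly. A toy sketch, with hypothetical points a and b, hyperplane normals w1, w2 and thresholds t1 = t2 = 0 (the values are illustrative, not taken from the figure):

```python
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([-0.5, 1.5])
w1, w2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # hyperplane normals
t1, t2 = 0.0, 0.0                                      # quantisation thresholds

def two_bit_code(x):
    # Step 1: project onto the normal vectors; Step 2: threshold each projection.
    y1, y2 = x @ w1, x @ w2
    return f"{int(y1 > t1)}{int(y2 > t2)}"

print(two_bit_code(a), two_bit_code(b))   # hash-table bucket keys, e.g. '11' and '01'
```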
Thesis Statement
“The retrieval effectiveness of hashing-based ANN search can be improved by learning the quantisation thresholds and hashing hyperplanes in a manner that is influenced by the distribution of the data”
Relaxing Common Assumptions (A1-A3)
- Three limiting assumptions permeate the literature:
  - A1: A single static threshold placed at zero
  - A2: Uniform allocation of thresholds across each dimension
  - A3: Linear hypersurfaces (hyperplanes) positioned randomly
Thesis Contributions
- Five novel contributions to address the assumptions:
  - Learning Multiple Quantisation Thresholds [1]
  - Learning Variable Quantisation Thresholds [2]
  - Learning Unimodal Hashing Hypersurfaces [3]
  - Learning Cross-Modal Hashing Hypersurfaces [4]
  - Learning Hypersurfaces and Thresholds together
[1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
[2] Sean Moran, Victor Lavrenko, Miles Osborne. Variable Bit Quantisation for LSH. In ACL’13.
[3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.
[4] Sean Moran and Victor Lavrenko. Regularised Cross Modal Hashing. In SIGIR’15.
Thesis Contributions (This Talk)
- Two contributions discussed in this talk:
  - Learning Multiple Quantisation Thresholds [1]
  - Learning Variable Quantisation Thresholds [2]
  - Learning Unimodal Hashing Hypersurfaces [3]
  - Learning Cross-Modal Hashing Hypersurfaces [4]
  - Learning Hypersurfaces and Thresholds together
[1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
[2] Sean Moran, Victor Lavrenko, Miles Osborne. Variable Bit Quantisation for LSH. In ACL’13.
[3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.
[4] Sean Moran and Victor Lavrenko. Regularised Cross Modal Hashing. In SIGIR’15.
Evaluation Protocol
Overview
Evaluation Protocol
Learning Multiple Quantisation Thresholds
Learning the Hashing Hypersurfaces
Conclusions
Dataset and Features
- Standard evaluation datasets [5,6]:
  - CIFAR-10: 60,000 images, 512-D GIST descriptors¹
  - NUSWIDE: 270,000 images, 500-D BoW descriptors²
  - SIFT-1M: 1,000,000 images, 128-D SIFT descriptors³
[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
¹ http://www.cs.toronto.edu/~kriz/cifar.html
² http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
³ http://corpus-texmex.irisa.fr/
Example NUSWIDE Images [15]
ε-Nearest Neighbour Groundtruth [6]
[Figure: a query point in the 2D feature space; all points within a radius ε of the query are its true nearest neighbours.]
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
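A sketch of how ε-NN groundtruth can be generated for a query: everything within Euclidean distance ε of the query counts as a true neighbour. The descriptors and the value of ε below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 512))          # database descriptors, e.g. GIST (hypothetical)
q = rng.normal(size=512)                 # query descriptor
epsilon = 20.0                           # illustrative radius

dists = np.linalg.norm(X - q, axis=1)            # Euclidean distance to the query
groundtruth = np.flatnonzero(dists <= epsilon)   # indices of true nearest neighbours
print(groundtruth)
```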
Class Based Groundtruth [6]
[Figure: points in the 2D feature space grouped into Classes 1-4; points sharing a class label count as true nearest neighbours.]
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
Hamming Ranking Evaluation Paradigm [5,6]
[Figure: the query hashcode (01011) is compared to the database hashcodes (01011, 01010, 11000, 11000, 11100, 10100, 10100); the images are ranked by Hamming distance (0, 1, 3, 3, 4, 5, 5) and marked as relevant (+) or non-relevant (-).]
[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
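A minimal sketch of the ranking step: Hamming distance is the number of disagreeing bits, and the database is sorted by it. The codes below are hypothetical 5-bit examples in the spirit of the figure.

```python
import numpy as np

query_code = np.array([0, 1, 0, 1, 1])
db_codes = np.array([[0, 1, 0, 1, 1],
                     [0, 1, 0, 1, 0],
                     [1, 1, 0, 0, 0],
                     [1, 0, 1, 0, 0]])

hamming = (db_codes != query_code).sum(axis=1)   # bit disagreements per database item
ranking = np.argsort(hamming, kind="stable")     # closest codes first
print(list(zip(ranking, hamming[ranking])))      # e.g. [(0, 0), (1, 1), (2, 3), (3, 5)]
```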
Learning to Hash for Large Scale Image Retrieval
Overview
Evaluation Protocol
Learning Multiple Quantisation Thresholds
Learning the Hashing Hypersurfaces
Conclusions
Why Learn the Quantisation Threshold(s)?
- A threshold at zero lies in the region of highest point density:
[Figure: histogram of the distribution of projected values (LSH, CIFAR-10); count vs. projected value, with the bulk of the mass concentrated around zero.]
Previous work
- Single Bit Quantisation (SBQ): single static threshold placed at zero (the default used in most previous work) [13].
- Double Bit Quantisation (DBQ): multiple thresholds optimised by minimising a squared-error objective [7].
- Manhattan Hashing Quantisation (MHQ): multiple thresholds optimised by k-means [8].
[7] Weihao Kong, Wu-Jun Li. Double Bit Quantisation. In AAAI’12.
[8] Weihao Kong, Wu-Jun Li. Manhattan Hashing for Large-Scale Image Retrieval. In SIGIR’12.
[13] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98.
Previous work

Method     Data-Dependent   # Thresholds    Codebook
SBQ [13]                    1               0/1
DBQ [7]    X                2               00/11/10
MHQ [8]    X                2^B − 1         NBC
NPQ [1]    X                1, 2, 2^B − 1   Any

B: # of bits per projected dimension, e.g. 1, 2, 3, 4 bits
NBC: Natural Binary Code

[1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
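To make the "# Thresholds / Codebook" columns concrete, here is a sketch of multi-threshold quantisation with a natural binary codebook (as used by MHQ and supported by NPQ): with B bits per projected dimension, 2^B − 1 thresholds carve the projected axis into 2^B regions. The threshold values below are illustrative, not learned.

```python
import numpy as np

def quantise(y, thresholds, bits=2):
    """Map a projected value y to a B-bit natural binary codeword."""
    region = int(np.searchsorted(np.sort(thresholds), y))  # region index 0 .. 2^B - 1
    return format(region, f"0{bits}b")

thresholds = [-1.0, 0.0, 1.0]          # 3 thresholds -> regions 00 / 01 / 10 / 11
for y in (-2.3, -0.4, 0.2, 1.7):
    print(y, "->", quantise(y, thresholds))
```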
Semi-Supervised Objective Function
- Idea: learn the thresholds by fusing two sources of information:
  1. Supervised: distances (or labels) of points in the original D-dimensional feature space
  2. Unsupervised: distances between pairs of points in the 1D projected space
- Next few slides: constructing the semi-supervised objective function
Adjacency Matrix S
- Adjacency matrix S ∈ {0, 1}^{N_trd × N_trd}:

  S_ij = 1 if x_i, x_j are true NNs, and 0 otherwise.   (1)

- Note: N_trd ≪ N and the matrix S is highly sparse (~1% of elements non-zero)
Adjacency Matrix S
[Figure: data-points a-h on projected dimension y1, partitioned by thresholds t1, t2, t3 into regions 00, 01, 10, 11.]

S    a   b   c   d   e   f   g   h
a    0  +1   0   0   0   0   0   0
b   +1   0   0   0   0   0   0   0
c    0   0   0   0   0  +1   0   0
d    0   0   0   0   0   0   0  +1
e    0   0   0   0   0   0  +1   0
f    0   0  +1   0   0   0   0   0
g    0   0   0   0  +1   0   0   0
h    0   0   0  +1   0   0   0   0
Indicator Matrix P
- Indicator matrix P ∈ {0, 1}^{N_trd × N_trd}:

  P^k_ij = 1 if ∃γ s.t. t_kγ ≤ y^k_i, y^k_j ≤ t_k(γ+1), and 0 otherwise.   (2)

  (i.e. P^k_ij = 1 when the projections of x_i and x_j both fall between the same pair of adjacent thresholds on dimension k)
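A sketch of building P for a single projected dimension from Equation (2): two points are marked when they fall between the same pair of adjacent thresholds. The projected values below are hypothetical, chosen so that the result reproduces the matrix shown on the next slide.

```python
import numpy as np

# Illustrative projected values for points a..h and thresholds t1, t2, t3.
y = np.array([-2.0, -1.8, -1.6, 0.5, 1.2, 1.4, 1.6, 1.8])
thresholds = np.array([-1.0, 0.0, 1.0])

regions = np.searchsorted(thresholds, y)                  # region index per point
P = (regions[:, None] == regions[None, :]).astype(int)    # same-region indicator
np.fill_diagonal(P, 0)                                    # a point is not its own pair
print(P)
```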
Indicator Matrix P
[Figure: data-points a-h on projected dimension y1, partitioned by thresholds t1, t2, t3 into regions 00, 01, 10, 11; pairs of points falling in the same region are marked in P.]

P    a   b   c   d   e   f   g   h
a    0  +1  +1   0   0   0   0   0
b   +1   0  +1   0   0   0   0   0
c   +1  +1   0   0   0   0   0   0
d    0   0   0   0   0   0   0   0
e    0   0   0   0   0  +1  +1  +1
f    0   0   0   0  +1   0  +1  +1
g    0   0   0   0  +1  +1   0  +1
h    0   0   0   0  +1  +1  +1   0
Supervised Part: F1-Measure Maximisation
- Integrate the supervisory signal by maximising the F1-measure:

  F1 = 2TP / (2TP + FP + FN)

- True positives (TPs): true NNs in the same region
- False positives (FPs): non-NNs in the same region
- False negatives (FNs): true NNs in different regions
Supervised Part: F1-Measure Maximisation
- True Positives (TPs):

  TP = (1/2) Σ_ij P_ij S_ij = (1/2) ‖P ∘ S‖_1

P ∘ S   a   b   c   d   e   f   g   h
a       0  +1   0   0   0   0   0   0
b      +1   0   0   0   0   0   0   0
c       0   0   0   0   0   0   0   0
d       0   0   0   0   0   0   0   0
e       0   0   0   0   0   0  +1   0
f       0   0   0   0   0   0   0   0
g       0   0   0   0  +1   0   0   0
h       0   0   0   0   0   0   0   0
Supervised Part: F1-Measure Maximisation
- True Positives (TPs):

  TP = (1/2) ‖P ∘ S‖_1 = 4/2 = 2

[Figure: the two TP pairs, (a, b) and (e, g), are true NNs that fall within the same thresholded regions of projected dimension y1.]
Supervised Part: F1-Measure Maximisation
- False Negatives (FNs):

  FN = (1/2) Σ_ij S_ij − TP = (1/2) ‖S‖_1 − TP

[The adjacency matrix S is as shown previously.]
Supervised Part: F1-Measure Maximisation
- False Negatives (FNs):

  FN = (1/2) ‖S‖_1 − TP = 8/2 − 2 = 2

[Figure: the two FN pairs, (c, f) and (d, h), are true NNs whose members fall in different thresholded regions.]
Supervised Part: F1-Measure Maximisation
- False Positives (FPs):

  FP = (1/2) Σ_ij P_ij − TP = (1/2) ‖P‖_1 − TP

[The indicator matrix P is as shown previously.]
Supervised Part: F1-Measure Maximisation
- False Positives (FPs):

  FP = (1/2) Σ_ij P_ij − TP = 9 − 2 = 7

[Figure: the remaining same-region pairs among {a, b, c} and {e, f, g, h} that are not true NNs give 7 false positives.]
- Note: not implemented like this in practice, as P may be dense.
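Regarding the note above: ‖P‖_1 never needs the dense matrix. If n_r points share region r they contribute n_r(n_r − 1) off-diagonal ones to P, so the norm follows from the region occupancy counts alone. A sketch with a toy region assignment for points a-h (the specific region indices are illustrative, consistent with the figure):

```python
import numpy as np

regions = np.array([0, 0, 0, 2, 3, 3, 3, 3])         # region index of points a..h
counts = np.bincount(regions)
P_norm = int((counts * (counts - 1)).sum())          # 3*2 + 0 + 1*0 + 4*3 = 18 = ||P||_1
FP = 0.5 * P_norm - 2                                # with TP = 2 from the example -> FP = 7
print(P_norm, FP)
```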
Semi-Supervised Objective Function
- F1-measure term:

  F1(t_k) = 2 ‖P ∘ S‖_1 / (‖S‖_1 + ‖P‖_1) = 4/13

- Unsupervised term:

  Ω(t_k) = (1/σ_k) Σ_{j=1}^{T+1} Σ_{i: y^k_i ∈ r_j} (y^k_i − μ_j)^2

- Maximise the objective J_npq:

  J_npq(t_k) = α F1(t_k) + (1 − α)(1 − Ω(t_k))
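A sketch of evaluating J_npq for one projected dimension and one candidate threshold vector. The F1 term follows the slide exactly; the unsupervised term is computed here as a within-region squared deviation scaled by 1/σ_k, and that normalisation, plus the toy data in the check, are assumptions made for illustration only.

```python
import numpy as np

def j_npq(y, thresholds, S, alpha=0.5):
    """Score one candidate threshold vector for one projected dimension."""
    regions = np.searchsorted(np.sort(thresholds), y)
    same = (regions[:, None] == regions[None, :]).astype(int)   # indicator matrix P
    np.fill_diagonal(same, 0)

    # Supervised term: F1 = 2||P o S||_1 / (||S||_1 + ||P||_1).
    f1 = 2.0 * (same * S).sum() / (S.sum() + same.sum())

    # Unsupervised term: within-region squared deviation, scaled by 1/sigma_k.
    omega = sum(((y[regions == r] - y[regions == r].mean()) ** 2).sum()
                for r in np.unique(regions)) / y.std()

    return alpha * f1 + (1 - alpha) * (1 - omega)

# Toy check with 4 points, where (0,1) and (2,3) are the true NN pairs.
y = np.array([-1.5, -1.2, 0.8, 1.1])
S = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
print(j_npq(y, np.array([0.0]), S))
```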
Experimental Results
- Three research questions (more in the thesis):
  - R1: Do we outperform a single statically placed threshold?
  - R2: When is more than one threshold beneficial to effectiveness?
  - R3: Can our method generalise to multiple thresholds and many different projection functions?
Learning Curve (AUPRC vs. Amount of Supervision)
[Figure: validation AUPRC vs. number of supervisory data-points (0 to 1,000) on CIFAR and NUSWIDE. Annotation: only a small amount of supervision is needed.]
Benefit of Optimising a Single Threshold (R1)
[Figure: bar chart of test AUPRC for RND, SBQ and NPQ on CIFAR and NUS-WIDE, with annotated gains of +55% and +35%. Annotation: better to learn the thresholds.]
Benefit of Multiple Thresholds (R2)
[Figure: bar chart of test AUPRC for LSH with T=1, T=3 and T=7 thresholds; the multi-threshold variants are annotated with -36% and -38%. Annotation: one threshold is optimal for LSH.]
Benefit of Multiple Thresholds (R2)
[Figure: bar chart of test AUPRC for PCA with T=1, T=3 and T=7 thresholds; the multi-threshold variants are annotated with +14% and +19%. Annotation: two or more thresholds are optimal for PCA.]
Comparison to the State-of-the-Art (R3)
[Figure: bar charts of test AUPRC for DBQ, MHQ and NPQ under four projection functions (LSH, SH, PCA, SKLSH), with an annotated gain of +67%. Annotation: the approach generalises to other projection functions.]
End of Part 1 of Talk
- Takeaway messages:
  - R1: Optimising one threshold is much better than placing the threshold at zero.
  - R2: Two or more thresholds are beneficial for eigendecomposition-based projections.
  - R3: The benefits of threshold learning generalise to other projection functions.
Learning to Hash for Large Scale Image Retrieval
Overview
Evaluation Protocol
Learning Multiple Quantisation Thresholds
Learning the Hashing Hypersurfaces
Conclusions
Why Learn the Hashing Hypersurfaces?
[Figure pair: a 2D feature space split by a hyperplane h1 (normal w1) into a 0 half-space and a 1 half-space, shown with two different hyperplane placements.]
Previous work
- Data-independent: Locality Sensitive Hashing (LSH) [13]
- Data-dependent (unsupervised): Anchor Graph Hashing (AGH) [11], Spectral Hashing (SH) [9]
- Data-dependent (supervised): Self Taught Hashing (STH) [12], Supervised Hashing with Kernels (KSH) [5], ITQ + CCA [6], Binary Reconstructive Embedding (BRE) [10]
[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
[9] Yair Weiss, Antonio Torralba, Rob Fergus. Spectral Hashing. In NIPS’08.
[10] Brian Kulis and Trevor Darrell. Learning to Hash with Binary Reconstructive Embeddings. In NIPS’09.
[11] Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang. Hashing with Graphs. In ICML’11.
[12] Dell Zhang, Jun Wang, Deng Cai, Jinsong Lu. Self-Taught Hashing for Fast Similarity Search. In SIGIR’10.
[13] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98.
Previous work

Method        Learnt   Supervised   Scalable   Effectiveness
LSH [13]                            X          Low
SH [9]        X                                Low
STH [12]      X        X                       Medium
BRE [10]      X        X                       Medium
ITQ+CCA [6]   X        X                       Medium
KSH [5]       X        X                       High
GRH [3]       X        X            X          High

[3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.
Learning the Hashing Hypersurfaces
- Two-step iterative hashing model:
- Step A: Graph Regularisation

  L_m ← sgn(α S D^{-1} L_{m-1} + (1 − α) L_0)

- Step B: Data-Space Partitioning

  for k = 1...K:  min ‖w_k‖^2 + C Σ_{i=1}^N ξ_ik
  s.t.  L_ik (w_k^T x_i + b_k) ≥ 1 − ξ_ik  for i = 1...N

- Repeat for a set number of iterations (M)
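A compact sketch of the two-step loop, assuming the initial bits L0, the adjacency matrix S and the training points X are given. scikit-learn's LinearSVC stands in for the per-bit max-margin partitioning step; the loop structure shown, the solver and the hyperparameters are assumptions for illustration rather than details from the thesis.

```python
import numpy as np
from sklearn.svm import LinearSVC

def grh(X, L0, S, alpha=0.9, iterations=2):
    """Graph regularisation of the bits, then one linear hyperplane per bit."""
    D_inv = np.diag(1.0 / S.sum(axis=1))            # inverse degree matrix
    L = L0.copy()
    for _ in range(iterations):
        # Step A: regularise the bits over the graph.
        L = np.sign(alpha * S @ D_inv @ L + (1 - alpha) * L0)
        L[L == 0] = 1                               # break ties towards +1
        # Step B: fit one max-margin hyperplane per bit to the regularised bits.
        svms = [LinearSVC(C=1.0).fit(X, L[:, k]) for k in range(L.shape[1])]
    return svms

def hash_query(q, svms):
    """Out-of-sample extension: the sign of each learned hyperplane."""
    return np.array([int(svm.decision_function(q.reshape(1, -1))[0] > 0) for svm in svms])
```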
Learning the Hashing Hypersurfaces
- Step A: Graph Regularisation [14]

  L_m ← sgn(α S D^{-1} L_{m-1} + (1 − α) L_0)

- S: affinity (adjacency) matrix
- D: diagonal degree matrix
- L: binary bits at the specified iteration
- α: interpolation parameter (0 ≤ α ≤ 1)
[14] Fernando Diaz. Regularizing query-based retrieval scores. In IR’07.
Learning the Hashing Hypersurfaces
[Figure: a three-node graph with nodes a (-1 -1 -1), b (-1 1 1) and c (1 1 1); a-b and b-c are connected.]

S      a     b     c
a      1     1     0
b      1     1     1
c      0     1     1

D^{-1}   a      b      c
a        0.5    0      0
b        0      0.33   0
c        0      0      0.5

L_0   b1   b2   b3
a     −1   −1   −1
b     −1    1    1
c      1    1    1
Learning the Hashing Hypersurfaces
[Figure: the same three-node graph before the update: a (-1 -1 -1), b (-1 1 1), c (1 1 1).]

L_1 = sgn( [ −1      0     0
             −0.33   0.33  0.33
              0      1     1    ] )
Learning the Hashing Hypersurfaces
[Figure: after the update the graph nodes read a (-1 1 1), b (-1 1 1), c (1 1 1).]

L_1   b1   b2   b3
a     −1    1    1
b     −1    1    1
c      1    1    1
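A numeric check of the worked example (illustrative): with all the weight on the graph term, normalising the adjacency row-wise (D^{-1} S) reproduces the pre-sign matrix shown above, and taking the sign with ties broken towards +1 gives the L_1 table.

```python
import numpy as np

S = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
D_inv = np.diag([0.5, 1/3, 0.5])
L0 = np.array([[-1., -1., -1.],
               [-1.,  1.,  1.],
               [ 1.,  1.,  1.]])

pre = D_inv @ S @ L0        # rows: [-1, 0, 0], [-0.33, 0.33, 0.33], [0, 1, 1]
L1 = np.sign(pre)
L1[L1 == 0] = 1             # ties broken towards +1
print(L1)                   # a: [-1, 1, 1], b: [-1, 1, 1], c: [1, 1, 1]
```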
Learning the Hashing Hypersurfaces
- Step B: Data-Space Partitioning

  for k = 1...K:  min ‖w_k‖^2 + C Σ_{i=1}^N ξ_ik
  s.t.  L_ik (w_k^T x_i + t_k) ≥ 1 − ξ_ik  for i = 1...N

- w_k: normal vector of hyperplane k
- t_k: bias of hyperplane k
- x_i: data-point i
- L_ik: bit k of data-point i
- ξ_ik: slack variable ik
- K: # bits
- N: # data-points
Learning the Hashing Hypersurfaces
[Figure sequence: eight data-points a-h in the feature space are shown with their current 3-bit codes (e.g. -1 1 1, 1 1 1, -1 -1 -1, 1 1 -1, 1 -1 -1, ...); after graph regularisation, bits of some points flip to agree with their neighbours (annotations: "First bit flipped", "Second bit flipped").]
Learning the Hashing Hypersurfaces
[Figure: hyperplanes are then fitted to the regularised bits; w1 · x − t1 = 0 and w2 · x − t2 = 0 split the space into positive (+1) and negative (−1) half-spaces, with the data-points' codes on either side.]
Experimental Results
- Three research questions (more in the thesis):
  - R1: Does the graph regularisation step help us learn more effective hashcodes?
  - R2: Is it necessary to solve an eigenvalue problem in order to learn effective hashcodes?
  - R3: Do non-linear hypersurfaces provide a more effective bucketing of the space than linear hypersurfaces?
Learning Curve (mAP vs. Amount of Supervision)
[Figure: validation mAP vs. number of supervisory data-points (0 to 10,000). Annotation: a small amount of supervision, around 5% of the dataset size, suffices.]
Effect of α and Iterations (CIFAR-10 @ 32 bits)
[Figure: validation mAP vs. number of iterations (0 to 5) for α = 0.7, 0.8 and 0.9. Annotation: fast convergence, with most weight placed on the supervisory signal.]
GRH vs Literature (CIFAR-10 @ 32 bits)
[Figure: bar chart of test mAP for LSH, BRE, STH, KSH, GRH (Linear) and GRH (RBF). Annotations: linear GRH sits -8% below the non-linear KSH ("linear hypersurfaces do really well"), while non-linear GRH is +14% ("best performance with non-linear hypersurfaces").]
GRH vs. Initialisation Strategy (CIFAR-10 @ 32 bits)
[Figure: bar chart of test mAP for GRH (Linear) and GRH (RBF) initialised with LSH versus ITQ+CCA hashcodes. Annotation: solving an eigenvalue problem is not really necessary.]
GRH Timing (CIFAR-10 @ 32 bits)

Timings (s)*
Method     Train     Test    Total    % rel. change vs. KSH
GRH        8.008     0.033   8.041    -90%
KSH [5]    74.024    0.103   82.27    –
BRE [10]   227.841   0.370   231.4    +181%

* All times obtained on an idle Intel 2.7GHz, 16GB RAM machine and averaged over 10 runs.
[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[10] Brian Kulis and Trevor Darrell. Learning to Hash with Binary Reconstructive Embeddings. In NIPS’09.
Example of Top 10 Retrieved Images
[Figure: ranked lists of the top 10 images retrieved by GRH and by LSH for two example queries.]
End of Part 2 of Talk
- Takeaway messages:
  - R1: Regularising bits over a graph is effective (and efficient) for hashcode learning.
  - R2: An intermediate eigendecomposition step is not necessary.
  - R3: Non-linear hypersurfaces yield a more effective bucketing of the space than hyperplanes.
Learning to Hash for Large Scale Image Retrieval
Overview
Evaluation Protocol
Learning Multiple Quantisation Thresholds
Learning the Hashing Hypersurfaces
Conclusions
Thesis Statement (Recap)
“The retrieval effectiveness of hashing-based ANN search can be improved by learning the quantisation thresholds and hashing hyperplanes in a manner that is influenced by the distribution of the data”
Conclusions
- Learning the thresholds and hypersurfaces:
  - A semi-supervised clustering algorithm for learning quantisation thresholds.
  - An EM-like algorithm for positioning hashing hypersurfaces based on a supervisory signal.
- Additional contributions in the thesis:
  1. Learning multi-modal hashing hypersurfaces.
  2. Learning variable quantisation thresholds.
  3. Combining threshold and hypersurface learning.
Future Work
- Extend the threshold and hypersurface learning to an online streaming-data scenario, e.g. event detection in Twitter.
- Leverage deep learning to optimise the feature representation and hash functions end-to-end rather than using hand-crafted features (e.g. SIFT).
Thank you for your attention!
References
- [1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
- [2] Sean Moran, Victor Lavrenko, Miles Osborne. Variable Bit Quantisation for LSH. In ACL’13.
- [3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.
- [4] Sean Moran and Victor Lavrenko. Regularised Cross Modal Hashing. In SIGIR’15.
References
- [5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
- [6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
- [7] Weihao Kong, Wu-Jun Li. Double Bit Quantisation. In AAAI’12.
- [8] Weihao Kong, Wu-Jun Li. Manhattan Hashing for Large-Scale Image Retrieval. In SIGIR’12.
References
- [9] Yair Weiss, Antonio Torralba, Rob Fergus. Spectral Hashing. In NIPS’08.
- [10] Brian Kulis and Trevor Darrell. Learning to Hash with Binary Reconstructive Embeddings. In NIPS’09.
- [11] Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang. Hashing with Graphs. In ICML’11.
- [12] Dell Zhang, Jun Wang, Deng Cai, Jinsong Lu. Self-Taught Hashing for Fast Similarity Search. In SIGIR’10.
References
- [13] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98.
- [14] Fernando Diaz. Regularizing query-based retrieval scores. In IR’07.
- [15] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In ICIVR’09.