
Learning to Hash for Large Scale Image Retrieval

Sean Moran

Pre-Viva Talk

27th November 2015

Learning to Hash for Large Scale Image Retrieval

Overview

Evaluation Protocol

Learning Multiple Quantisation Thresholds

Learning the Hashing Hypersurfaces

Conclusions


Efficient Retrieval in Large Image Archives

Applications

- Recommendation
- Near duplicate detection
- Content-based IR
- Noun clustering
- Location recognition

Locality Sensitive Hashing

[Figure, shown over several build slides: database images are passed through a hash function H to produce binary hashcodes (e.g. 110101, 010111, ...), which index buckets of a hash table; the query is hashed with the same function H, the database items that collide in its bucket are retrieved, similarity is computed only against those candidates, and the nearest neighbours are returned. Application thumbnails: content-based IR, location recognition, near duplicate detection (images: Imense Ltd; Doersch et al.; Xu et al.).]

Toy 2D Feature Space

[Figure: data-points plotted in a toy 2D feature space with axes x and y]

Feature Space Partitioning

[Figure: two hyperplanes h1 and h2, with normal vectors w1 and w2, partition the 2D feature space; a and b mark two nearby data-points]

Step 1: Projection onto Normal Vectors

[Figure: each data-point is projected onto the normal vectors (e.g. w2), giving one projected dimension per hyperplane]

Step 2: Binary Quantisation

[Figure: projected dimensions #1 and #2 (y1, y2) are each quantised at a threshold (t1, t2); values on either side of the threshold receive bit 0 or 1, illustrated for points a and b]

2-Bit Hashcodes of Data-Points

[Figure: the two hyperplanes h1 and h2 (normals w1, w2) split the space into four regions labelled 00, 01, 10, 11; points a and b receive the 2-bit hashcode of the region they fall in, and the codes index buckets 00, 01, 10, 11 of a hash table]
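The toy example above can be made concrete with a short sketch (illustrative Python/NumPy, not code from the thesis): sample random hyperplane normals, project the data onto them, quantise each projected dimension at a single threshold (zero by default), and group identical 2-bit codes into hash-table buckets.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Toy 2D data (rows are data-points such as a and b in the figures)
X = rng.normal(size=(10, 2))

# Sample K random hyperplane normals (K = number of bits)
K = 2
W = rng.normal(size=(2, K))          # columns play the role of w1, w2

# Step 1: project every point onto each normal
Y = X @ W                            # shape (10, K): one projected dimension per bit

# Step 2: binary quantisation, one threshold per projected dimension
t = np.zeros(K)                      # the common default: a single threshold at zero
B = (Y > t).astype(int)              # 0/1 bits, shape (10, K)

# Group points with identical hashcodes into the same hash-table bucket
table = defaultdict(list)
for i, bits in enumerate(B):
    table["".join(map(str, bits))].append(i)

print(table)                         # e.g. {'10': [0, 3, ...], '01': [...], ...}
```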

Thesis Statement

“The retrieval effectiveness of hashing-based ANN search can be improved by learning the quantisation thresholds and hashing hyperplanes in a manner that is influenced by the distribution of the data”

Relaxing Common Assumptions (A1-A3)

- Three limiting assumptions permeate the literature:
  - A1: A single static threshold placed at zero
  - A2: Uniform allocation of thresholds across each dimension
  - A3: Linear hypersurfaces (hyperplanes) positioned randomly

Thesis Contributions

- Five novel contributions to address the assumptions:
  - Learning Multiple Quantisation Thresholds [1]
  - Learning Variable Quantisation Thresholds [2]
  - Learning Unimodal Hashing Hypersurfaces [3]
  - Learning Cross-Modal Hashing Hypersurfaces [4]
  - Learning Hypersurfaces and Thresholds together

[1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
[2] Sean Moran, Victor Lavrenko, Miles Osborne. Variable Bit Quantisation for LSH. In ACL’13.
[3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.
[4] Sean Moran and Victor Lavrenko. Regularised Cross Modal Hashing. In SIGIR’15.

Thesis Contributions (This Talk)

- Two contributions are discussed in this talk (the remainder are covered in the thesis):
  - Learning Multiple Quantisation Thresholds [1] (this talk)
  - Learning Variable Quantisation Thresholds [2]
  - Learning Unimodal Hashing Hypersurfaces [3] (this talk)
  - Learning Cross-Modal Hashing Hypersurfaces [4]
  - Learning Hypersurfaces and Thresholds together

Evaluation Protocol


Dataset and Features

- Standard evaluation datasets [5,6]:
  - CIFAR-10: 60,000 images, 512-D GIST descriptors (http://www.cs.toronto.edu/~kriz/cifar.html)
  - NUSWIDE: 270,000 images, 500-D BoW descriptors (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm)
  - SIFT-1M: 1,000,000 images, 128-D SIFT descriptors (http://corpus-texmex.irisa.fr/)

[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.

Example NUSWIDE Images [15]

ε-Nearest Neighbour Groundtruth [6]

[Figure: in the 2D feature space (axes x, y), the points falling within a radius ε of the query are taken as its true nearest neighbours]

[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
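As a rough sketch of how ε-neighbour ground truth of this kind can be constructed (my own illustration; the exact radius-selection protocol of [6] is not reproduced here, and the data and ε below are made up):

```python
import numpy as np

def epsilon_groundtruth(X, eps):
    """Return a boolean matrix G with G[i, j] = True if x_j lies within
    Euclidean distance eps of x_i (excluding the point itself)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    G = d2 <= eps ** 2
    np.fill_diagonal(G, False)
    return G

X = np.random.default_rng(1).normal(size=(100, 8))
G = epsilon_groundtruth(X, eps=2.5)      # eps is a free parameter in this sketch
print(G.sum(axis=1)[:5])                 # number of true neighbours per query
```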

Class Based Groundtruth [6]

[Figure: the feature space contains points drawn from Classes 1–4; images sharing the query's class label are taken as its true nearest neighbours]

[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.

Hamming Ranking Evaluation Paradigm [5,6]

Query hashcode: 01011. Database images are ranked by the Hamming distance between their hashcode and the query's:

Hashcode   Hamming distance   Relevant
01011      0                  +
01010      1                  +
11000      3                  -
11000      3                  -
11100      4                  -
10100      5                  -
10100      5                  -

[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
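A minimal sketch of Hamming ranking (illustrative code, not from the thesis): rank database hashcodes by Hamming distance to the query code. The toy codes below reproduce the distances in the figure.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query hashcode.
    Codes are 0/1 integer arrays of equal length."""
    dists = (db_codes != query_code).sum(axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists[order]

# The toy codes from the slide: query 01011 against seven database items
query = np.array([0, 1, 0, 1, 1])
db = np.array([[0, 1, 0, 1, 1],
               [0, 1, 0, 1, 0],
               [1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [1, 1, 1, 0, 0],
               [1, 0, 1, 0, 0],
               [1, 0, 1, 0, 0]])

order, dists = hamming_rank(query, db)
print(order)   # database items sorted by Hamming distance
print(dists)   # [0 1 3 3 4 5 5], matching the slide
```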

Learning Multiple Quantisation Thresholds

Why Learn the Quantisation Threshold(s)?

- A threshold placed at zero lies in the region of highest point density:

[Figure: histogram of the distribution of projected values (LSH, CIFAR-10); x-axis: projected value (−4 to 4), y-axis: count (0 to 1600); the mass is concentrated around zero]

Previous work

- Single Bit Quantisation (SBQ): a single static threshold placed at zero (the default used in most previous work) [13].
- Double Bit Quantisation (DBQ): multiple thresholds optimised by minimising a squared-error objective [7].
- Manhattan Hashing Quantisation (MHQ): multiple thresholds optimised by k-means [8].

[7] Weihao Kong, Wu-Jun Li. Double Bit Quantisation. In AAAI’12.
[8] Weihao Kong, Wu-Jun Li. Manhattan Hashing for Large-Scale Image Retrieval. In SIGIR’12.
[13] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98.

Previous work

Method      Data-Dependent   # Thresholds     Codebook
SBQ [13]                     1                0/1
DBQ [7]     ✓                2                00/11/10
MHQ [8]     ✓                2^B − 1          NBC
NPQ [1]     ✓                1, 2, 2^B − 1    Any

B: # of bits per projected dimension, e.g. 1, 2, 3, 4 bits
NBC: Natural Binary Code

[1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
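To make the table concrete, here is a small sketch contrasting single-threshold quantisation with multi-threshold quantisation on one projected dimension (illustrative only: the threshold positions and B = 2 are made up, and the codebooks are simplified relative to [7, 8]):

```python
import numpy as np

y = np.array([-2.1, -0.4, 0.3, 1.7, 2.9])       # projected values on one dimension

# SBQ-style: a single threshold at zero -> 1 bit per dimension
sbq_bits = (y > 0).astype(int)                   # [0 0 1 1 1]

# MHQ-style: 2^B - 1 thresholds -> B bits per dimension (here B = 2)
thresholds = np.array([-1.0, 0.5, 2.0])          # illustrative threshold positions
regions = np.digitize(y, thresholds)             # region index 0..3 per value
nbc_codes = [format(r, "02b") for r in regions]  # natural binary codes: 00,01,10,11

print(sbq_bits, regions, nbc_codes)
```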

Semi-Supervised Objective Function

- Idea: learn the thresholds by fusing two sources of information:

1. Supervised: distances (or labels) of points in the original D-dimensional feature space
2. Unsupervised: distances between pairs of points in the 1D projected space

- Next few slides: constructing the semi-supervised objective function

Adjacency Matrix S

- Adjacency matrix S ∈ {0, 1}^(Ntrd × Ntrd):

  S_ij = 1 if x_i, x_j are true NNs; 0 otherwise.   (1)

- Note: Ntrd ≪ N, and the matrix S is highly sparse (around 1% of elements are non-zero)
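A sketch of how such a sparse adjacency matrix could be assembled from labelled true-NN pairs (illustrative SciPy code; the pair list and Ntrd below are invented, chosen to match the worked example on the next slide):

```python
import numpy as np
from scipy.sparse import coo_matrix

n_trd = 8                                        # small training subset (Ntrd << N)
true_nn_pairs = [(0, 1), (2, 5), (3, 7), (4, 6)]  # e.g. (a,b), (c,f), (d,h), (e,g)

rows, cols = zip(*true_nn_pairs)
data = np.ones(len(true_nn_pairs))
# Symmetrise: if (i, j) are true NNs then so are (j, i)
S = coo_matrix((np.concatenate([data, data]),
                (np.concatenate([rows, cols]), np.concatenate([cols, rows]))),
               shape=(n_trd, n_trd)).tocsr()

print(S.toarray().astype(int))                   # reproduces the 8x8 example matrix
```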

Adjacency Matrix S (worked example)

[Figure: one projected dimension y1 is split by thresholds t1, t2, t3 into regions 00, 01, 10, 11 containing the points a–h]

S    a    b    c    d    e    f    g    h
a    0   +1    0    0    0    0    0    0
b   +1    0    0    0    0    0    0    0
c    0    0    0    0    0   +1    0    0
d    0    0    0    0    0    0    0   +1
e    0    0    0    0    0    0   +1    0
f    0    0   +1    0    0    0    0    0
g    0    0    0    0   +1    0    0    0
h    0    0    0   +1    0    0    0    0

Indicator Matrix P

- Indicator matrix P ∈ {0, 1}^(Ntrd × Ntrd):

  P^k_ij = 1 if ∃ γ such that t_kγ ≤ y^k_i, y^k_j ≤ t_k(γ+1); 0 otherwise.   (2)

  (i.e. P^k_ij = 1 when the projections of x_i and x_j both fall between the same pair of adjacent thresholds on dimension k)
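A sketch of how P can be computed for one projected dimension (my own illustration; the projected values below are made up, but chosen so the result matches the worked example that follows):

```python
import numpy as np

def indicator_matrix(y_k, thresholds):
    """P[i, j] = 1 iff projected values y_k[i] and y_k[j] fall between the
    same pair of adjacent thresholds (i.e. share a quantisation region)."""
    regions = np.digitize(y_k, np.sort(thresholds))
    P = (regions[:, None] == regions[None, :]).astype(int)
    np.fill_diagonal(P, 0)          # as in the worked example, ignore self-pairs
    return P

y1 = np.array([-2.0, -1.8, -1.5, 0.2, 1.1, 1.3, 1.6, 1.9])   # points a..h (illustrative)
t = np.array([-1.0, 0.5, 1.0])                                 # thresholds t1 < t2 < t3
print(indicator_matrix(y1, t))
```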

Indicator Matrix P (worked example)

[Figure: the same projected dimension y1 with thresholds t1, t2, t3 and regions 00, 01, 10, 11; points a, b, c share one region, d sits in a region on its own, and e, f, g, h share another]

P    a    b    c    d    e    f    g    h
a    0   +1   +1    0    0    0    0    0
b   +1    0   +1    0    0    0    0    0
c   +1   +1    0    0    0    0    0    0
d    0    0    0    0    0    0    0    0
e    0    0    0    0    0   +1   +1   +1
f    0    0    0    0   +1    0   +1   +1
g    0    0    0    0   +1   +1    0   +1
h    0    0    0    0   +1   +1   +1    0

Supervised Part: F1-Measure Maximisation

- Integrate the supervisory signal by maximising the F1-measure:

  F1 = 2TP / (2TP + FP + FN)

- True positives (TPs): true NNs in the same region
- False positives (FPs): non-NNs in the same region
- False negatives (FNs): true NNs in different regions

Supervised Part: F1-Measure Maximisation

- True Positives (TPs):

  TP = (1/2) Σ_ij P_ij S_ij = (1/2) ‖P ∘ S‖_1

P ∘ S   a    b    c    d    e    f    g    h
a       0   +1    0    0    0    0    0    0
b      +1    0    0    0    0    0    0    0
c       0    0    0    0    0    0    0    0
d       0    0    0    0    0    0    0    0
e       0    0    0    0    0    0   +1    0
f       0    0    0    0    0    0    0    0
g       0    0    0    0   +1    0    0    0
h       0    0    0    0    0    0    0    0

- In the worked example: TP = (1/2) ‖P ∘ S‖_1 = 4/2 = 2 (the true-NN pairs (a, b) and (e, g) fall in the same region)

Supervised Part: F1-Measure Maximisation

- False Negatives (FNs):

  FN = (1/2) Σ_ij S_ij − TP = (1/2) ‖S‖_1 − TP

- In the worked example: FN = (1/2) ‖S‖_1 − TP = 8/2 − 2 = 2 (the true-NN pairs (c, f) and (d, h) end up in different regions; S is the adjacency matrix shown earlier)

Supervised Part: F1-Measure Maximisation

- False Positives (FPs):

  FP = (1/2) Σ_ij P_ij − TP = (1/2) ‖P‖_1 − TP

- In the worked example: FP = (1/2) Σ_ij P_ij − TP = 9 − 2 = 7 (the seven same-region pairs that are not true NNs, e.g. (a, c), (b, c), (e, f); P is the indicator matrix shown earlier)

- Note: the method is not implemented like this in practice, as P may be dense (see the sketch below).
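Following the note above, here is a sketch of how the counts can be obtained without materialising a dense P: tally region sizes for ‖P‖_1, and count true-NN pairs that share a region for TP (illustrative code, not the thesis implementation).

```python
import numpy as np

def f1_counts(regions, S):
    """regions: region index per point (after thresholding one projected dim).
    S: 0/1 adjacency matrix of true nearest neighbours (dense here for clarity).
    Returns (TP, FP, FN) as defined in the slides, counting unordered pairs."""
    # ||P||_1 without building P: a region with n points holds n*(n-1) ordered pairs
    _, counts = np.unique(regions, return_counts=True)
    norm_P = (counts * (counts - 1)).sum()

    ii, jj = np.nonzero(S)                        # ordered true-NN pairs
    TP = (regions[ii] == regions[jj]).sum() / 2   # each unordered pair appears twice
    FN = S.sum() / 2 - TP
    FP = norm_P / 2 - TP
    return TP, FP, FN

def f1_measure(TP, FP, FN):
    return 2 * TP / (2 * TP + FP + FN)

# Reproduce the worked example: region of each point a..h and the true-NN pairs
regions = np.array([0, 0, 0, 1, 3, 3, 3, 3])      # a,b,c | d | e,f,g,h
S = np.zeros((8, 8), int)
for i, j in [(0, 1), (2, 5), (3, 7), (4, 6)]:     # (a,b), (c,f), (d,h), (e,g)
    S[i, j] = S[j, i] = 1

TP, FP, FN = f1_counts(regions, S)
print(TP, FP, FN, f1_measure(TP, FP, FN))         # 2.0 7.0 2.0 0.3077 (= 4/13)
```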

Semi-Supervised Objective Function

- F1-measure term (supervised):

  F1(t_k) = 2 ‖P ∘ S‖_1 / (‖S‖_1 + ‖P‖_1) = 4/13 in the worked example

- Unsupervised term (within-region squared error):

  Ω(t_k) = (1/σ_k) Σ_{j=1..T+1} Σ_{i : y^k_i ∈ r_j} (y^k_i − μ_j)^2

- Maximise the objective J_npq:

  J_npq(t_k) = α F1(t_k) + (1 − α)(1 − Ω(t_k))
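A sketch of how J_npq could be evaluated for one candidate threshold set (my own simplified rendering: α, the candidate thresholds, the toy S, and the normaliser used for Ω are all assumptions rather than the thesis' exact choices):

```python
import numpy as np

def npq_objective(y_k, thresholds, S, alpha=0.7):
    """J_npq(t_k) = alpha * F1(t_k) + (1 - alpha) * (1 - Omega(t_k)) for one
    projected dimension y_k; S is the true-NN adjacency matrix."""
    regions = np.digitize(y_k, np.sort(thresholds))

    # Supervised term: F1 = 2*||P o S||_1 / (||S||_1 + ||P||_1), without storing P
    same = regions[:, None] == regions[None, :]
    np.fill_diagonal(same, False)
    f1 = 2 * (same * S).sum() / (S.sum() + same.sum())

    # Unsupervised term: within-region squared error, scaled here so that it lies
    # in [0, 1] (the thesis normaliser sigma_k may differ)
    omega = sum(((y_k[regions == r] - y_k[regions == r].mean()) ** 2).sum()
                for r in np.unique(regions)) / y_k.var() / len(y_k)

    return alpha * f1 + (1 - alpha) * (1 - omega)

# Toy usage: one could evaluate a grid of candidate threshold sets and keep the best
rng = np.random.default_rng(0)
y_k = rng.normal(size=50)
S = (np.abs(y_k[:, None] - y_k[None, :]) < 0.3).astype(int)   # toy "true NN" matrix
np.fill_diagonal(S, 0)
print(npq_objective(y_k, thresholds=np.array([-0.5, 0.0, 0.5]), S=S))
```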

Experimental Results

- Three research questions (more in the thesis):
  - R1: Do we outperform a single statically placed threshold?
  - R2: When is more than one threshold beneficial to effectiveness?
  - R3: Can our method generalise to multiple thresholds and many different projection functions?

Learning Curve (AUPRC vs. Amount of Supervision)

[Figure: validation AUPRC (0.00–0.50) against the number of supervisory data-points (0–1000) for CIFAR and NUSWIDE; annotation: only a small amount of supervision is needed]

Benefit of Optimising a Single Threshold (R1)

[Figure: test AUPRC (0.00–0.50) for RND, SBQ and NPQ on CIFAR and NUS-WIDE; annotated improvements of +55% and +35%; takeaway: it is better to learn the threshold than to place it statically]

Benefit of Multiple Thresholds (R2)

[Figure: test AUPRC (0.00–0.16) for LSH with T=1, T=3 and T=7 thresholds; annotations: −36% and −38% for the multi-threshold settings; takeaway: one threshold is optimal for LSH]

Benefit of Multiple Thresholds (R2)

[Figure: test AUPRC (0.05–0.21) for PCA with T=1, T=3 and T=7 thresholds; annotations: +14% and +19% for the multi-threshold settings; takeaway: two or more thresholds are optimal for PCA]

Comparison to the State-of-the-Art (R3)

[Figures, shown over several build slides: test AUPRC for DBQ, MHQ and NPQ across four projection functions (LSH, SH, PCA, SKLSH); annotation: +67%; takeaway: NPQ generalises to other projection functions]

End of Part 1 of Talk

- Takeaway messages:
  - R1: Optimising one threshold is much better than placing the threshold at zero.
  - R2: Two or more thresholds are beneficial for eigendecomposition-based projections.
  - R3: The benefits of threshold learning generalise to other projection functions.

Learning the Hashing Hypersurfaces

Why Learn the Hashing Hypersurfaces?

[Figures: a hyperplane h1 with normal w1 splits the 2D feature space into a 0 half-space and a 1 half-space; the two placements shown contrast how a randomly positioned hyperplane and a data-informed one assign bits to neighbouring points]

Previous work

- Data-independent: Locality Sensitive Hashing (LSH) [13]
- Data-dependent (unsupervised): Anchor Graph Hashing (AGH) [11], Spectral Hashing (SH) [9]
- Data-dependent (supervised): Self-Taught Hashing (STH) [12], Supervised Hashing with Kernels (KSH) [5], ITQ + CCA [6], Binary Reconstructive Embedding (BRE) [10]

[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
[9] Yair Weiss, Antonio Torralba, Rob Fergus. Spectral Hashing. In NIPS’08.
[10] Brian Kulis and Trevor Darrell. Learning to Hash with Binary Reconstructive Embeddings. In NIPS’09.
[11] Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang. Hashing with Graphs. In ICML’11.
[12] Dell Zhang, Jun Wang, Deng Cai, Jinsong Lu. Self-Taught Hashing for Fast Similarity Search. In SIGIR’10.
[13] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98.

Previous work

Method        Learnt   Supervised   Scalable   Effectiveness
LSH [13]                            ✓          Low
SH [9]        ✓                                Low
STH [12]      ✓        ✓                       Medium
BRE [10]      ✓        ✓                       Medium
ITQ+CCA [6]   ✓        ✓                       Medium
KSH [5]       ✓        ✓                       High
GRH [3]       ✓        ✓            ✓          High

[3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.

Learning the Hashing Hypersurfaces

- Two-step iterative hashing model:

- Step A: Graph Regularisation

  L_m ← sgn( α S D^(−1) L_(m−1) + (1 − α) L_0 )

- Step B: Data-Space Partitioning

  for k = 1…K:  min ‖w_k‖^2 + C Σ_{i=1..N} ξ_ik
  s.t.  L_ik (w_k^T x_i + b_k) ≥ 1 − ξ_ik  for i = 1…N

- Repeat for a set number of iterations (M)

Learning the Hashing Hypersurfaces

- Step A: Graph Regularisation [14]

  L_m ← sgn( α S D^(−1) L_(m−1) + (1 − α) L_0 )

  - S: affinity (adjacency) matrix
  - D: diagonal degree matrix
  - L_m: binary bits at iteration m
  - α: interpolation parameter (0 ≤ α ≤ 1)

[14] Fernando Diaz. Regularizing query-based retrieval scores. In IR’07.
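A sketch of Step A as written above (illustrative NumPy; the toy affinity matrix, the initial bits and the choice of treating sgn(0) as +1 are my own assumptions, not the thesis code):

```python
import numpy as np

def graph_regularise(S, L0, alpha=0.9, iterations=2):
    """Step A of GRH: L_m <- sgn( alpha * S D^{-1} L_{m-1} + (1 - alpha) * L_0 ).
    S: symmetric affinity matrix; L0: initial {-1,+1} bit matrix (N x K)."""
    D_inv = np.diag(1.0 / S.sum(axis=1))      # inverse of the diagonal degree matrix
    L = L0.copy()
    for _ in range(iterations):
        L = np.sign(alpha * S @ D_inv @ L + (1 - alpha) * L0)
        L[L == 0] = 1                          # keep bits in {-1,+1} by mapping sgn(0) to +1
    return L

# Illustrative usage: a small random neighbourhood graph and LSH-style initial bits
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
S = (np.linalg.norm(X[:, None] - X[None, :], axis=-1) < 2.0).astype(float)  # toy affinities
L0 = np.sign(X @ rng.normal(size=(5, 4)))      # 4 initial bits per point
L0[L0 == 0] = 1
print(graph_regularise(S, L0, alpha=0.9, iterations=2))
```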


Learning the Hashing Hypersurfaces (worked example)

[Figure: three points a, b, c connected in a chain a–b–c, with initial bits a = (−1, −1, −1), b = (−1, 1, 1), c = (1, 1, 1)]

S        a      b      c
a        1      1      0
b        1      1      1
c        0      1      1

D^(−1)   a      b      c
a        0.5    0      0
b        0      0.33   0
c        0      0      0.5

L_0      b1     b2     b3
a        −1     −1     −1
b        −1      1      1
c         1      1      1

Learning the Hashing Hypersurfaces (worked example, continued)

- Applying the graph regularisation update to the toy example:

  L_1 = sgn( [ −1      0      0
               −0.33   0.33   0.33
                0      1      1   ] )

  L_1      b1     b2     b3
  a        −1      1      1
  b        −1      1      1
  c         1      1      1

- After one update, point a's second and third bits flip to agree with its neighbour b (sgn(0) is treated as +1 here).

Learning the Hashing Hypersurfaces

- Step B: Data-Space Partitioning

  for k = 1…K:  min ‖w_k‖^2 + C Σ_{i=1..N} ξ_ik
  s.t.  L_ik (w_k^T x_i + t_k) ≥ 1 − ξ_ik  for i = 1…N

  - w_k: normal vector of hyperplane k
  - t_k: bias of hyperplane k
  - x_i: data-point i
  - L_ik: bit k of data-point i
  - ξ_ik: slack variable for (i, k)
  - K: # bits
  - N: # data-points
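A sketch of Step B using one linear SVM per bit (illustrative, built on scikit-learn's LinearSVC; the thesis' solver and settings may differ, and swapping in an RBF-kernel SVM would give the non-linear hypersurfaces discussed later):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_hypersurfaces(X, L, C=1.0):
    """Step B of GRH: for each bit k, fit a max-margin hyperplane (w_k, t_k)
    to predict the regularised bits L[:, k] from the features X."""
    return [LinearSVC(C=C).fit(X, L[:, k]) for k in range(L.shape[1])]

def hash_points(svms, X_new):
    """Hash unseen points: bit k is the side of hyperplane k the point falls on."""
    return np.stack([svm.predict(X_new) for svm in svms], axis=1)

# Toy usage: random features with {-1,+1} bits standing in for the output of Step A
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
L = np.where(rng.normal(size=(200, 3)) > 0, 1.0, -1.0)
svms = fit_hypersurfaces(X, L)
codes = hash_points(svms, rng.normal(size=(5, 16)))
print(codes)
```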


Learning the Hashing Hypersurfaces (illustration)

[Figures, shown over several build slides: eight 2D data-points a–h, each labelled with a 3-bit code such as (−1, 1, 1) or (1, 1, −1); graph regularisation flips bits that disagree with a point's neighbours (annotations: "first bit flipped", "second bit flipped"); each bit k then defines a max-margin hyperplane w_k · x − t_k = 0 separating its positive (+1) half-space from its negative (−1) half-space, shown for w_1 and w_2]

Experimental Results

- Three research questions (more in the thesis):
  - R1: Does the graph regularisation step help us learn more effective hashcodes?
  - R2: Is it necessary to solve an eigenvalue problem in order to learn effective hashcodes?
  - R3: Do non-linear hypersurfaces provide a more effective bucketing of the space compared to linear hypersurfaces?

Learning Curve (mAP vs. Amount of Supervision)

[Figure: validation mAP (0.10–0.30) against the number of supervisory data-points (0–10,000); annotation: only a small amount of supervision is needed, about 5% of the dataset size]

Effect of α and Iterations (CIFAR-10 @ 32 bits)

[Figure: validation mAP (0.12–0.28) against the number of iterations (0–5) for α = 0.7, 0.8, 0.9; annotation: fast convergence, with most weight placed on the supervisory signal]

GRH vs Literature (CIFAR-10 @ 32 bits)

[Figures, shown over several build slides: test mAP for LSH, BRE, STH, KSH, GRH (Linear) and GRH (RBF); annotations: linear GRH is within −8% of the non-linear KSH, and non-linear GRH (RBF) improves on it by +14%; takeaways: linear hypersurfaces do really well, and the best performance comes from non-linear hypersurfaces]

GRH vs. Initialisation Strategy (CIFAR-10 @ 32 bits)

[Figure: test mAP (0.00–0.30) for GRH (Linear) and GRH (RBF) when initialised with LSH versus ITQ+CCA hashcodes; annotation: solving an eigenvalue problem is not really necessary]

GRH Timing (CIFAR-10 @ 32 bits)

Timings (s)*
Method      Train     Test    Total    % rel. change vs. KSH
GRH         8.008     0.033   8.041    −90%
KSH [5]     74.024    0.103   82.27    –
BRE [10]    227.841   0.370   231.4    +181%

* All times obtained on an idle Intel 2.7GHz, 16GB RAM machine and averaged over 10 runs.

[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[10] Brian Kulis and Trevor Darrell. Learning to Hash with Binary Reconstructive Embeddings. In NIPS’09.

Example of Top 10 Retrieved Images

[Figure: example queries with the ranked lists of the top 10 images retrieved by GRH and by LSH]

End of Part 2 of Talk

- Takeaway messages:
  - R1: Regularising bits over a graph is effective (and efficient) for hashcode learning.
  - R2: An intermediate eigendecomposition step is not necessary.
  - R3: Non-linear hypersurfaces yield a more effective bucketing of the space than hyperplanes.


Thesis Statement (Recap)

“The retrieval effectiveness of hashing-based ANN search can be improved by learning the quantisation thresholds and hashing hyperplanes in a manner that is influenced by the distribution of the data”

Conclusions

- Learning the thresholds and hypersurfaces:
  - A semi-supervised clustering algorithm for learning quantisation thresholds.
  - An EM-like algorithm for positioning hashing hypersurfaces based on a supervisory signal.

- Additional contributions in the thesis:
  1. Learning multi-modal hashing hypersurfaces.
  2. Learning variable quantisation thresholds.
  3. Combining threshold and hypersurface learning.

Future Work

- Extend the threshold and hypersurface learning to an online streaming-data scenario, e.g. event detection in Twitter.
- Leverage deep learning to optimise the feature representation and hash functions end-to-end, rather than using hand-crafted features (e.g. SIFT).

Thank you for your attention!

References

[1] Sean Moran, Victor Lavrenko, Miles Osborne. Neighbourhood Preserving Quantisation for LSH. In SIGIR’13.
[2] Sean Moran, Victor Lavrenko, Miles Osborne. Variable Bit Quantisation for LSH. In ACL’13.
[3] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR’15.
[4] Sean Moran and Victor Lavrenko. Regularised Cross Modal Hashing. In SIGIR’15.
[5] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. In CVPR’12.
[6] Yunchao Gong, Svetlana Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-scale Image Retrieval. In CVPR’11.
[7] Weihao Kong, Wu-Jun Li. Double Bit Quantisation. In AAAI’12.
[8] Weihao Kong, Wu-Jun Li. Manhattan Hashing for Large-Scale Image Retrieval. In SIGIR’12.
[9] Yair Weiss, Antonio Torralba, Rob Fergus. Spectral Hashing. In NIPS’08.
[10] Brian Kulis and Trevor Darrell. Learning to Hash with Binary Reconstructive Embeddings. In NIPS’09.
[11] Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang. Hashing with Graphs. In ICML’11.
[12] Dell Zhang, Jun Wang, Deng Cai, Jinsong Lu. Self-Taught Hashing for Fast Similarity Search. In SIGIR’10.
[13] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98.
[14] Fernando Diaz. Regularizing query-based retrieval scores. In IR’07.
[15] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, Yan-Tao Zheng. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In CIVR’09.
