
Diverse Image Annotation

Baoyuan Wu†,‡ Fan Jia† Wei Liu‡ Bernard Ghanem†

†King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

‡Tencent AI Lab, Shenzhen, China


Abstract

In this work, we study a new image annotation task

called diverse image annotation (DIA). Its goal is to de-

scribe an image using a limited number of tags, whereby

the retrieved tags need to cover as much useful information

about the image as possible. As compared to the conven-

tional image annotation task, DIA requires the tags to be

not only representative of the image but also diverse from

each other, so as to reduce redundancy. To this end, we treat

DIA as a subset selection problem, based on the conditional

determinantal point process (DPP) model, which encodes

representation and diversity jointly. We further explore se-

mantic hierarchy and synonyms among candidate tags to

define weighted semantic paths. Two tags on the same semantic path are discouraged from being retrieved simultaneously for the same image. This restriction is embedded

into the algorithm used to sample from the learned condi-

tional DPP model. Interestingly, we find that conventional

metrics for image annotation (e.g., precision, recall, and

F1 score) only consider an overall representative capac-

ity of all the retrieved tags, while ignoring their diversity.

Thus, we propose new semantic metrics based on our pro-

posed weighted semantic paths. An extensive subject study

verifies that the proposed metrics are much more consistent

with human evaluation than conventional annotation met-

rics. Experiments on two benchmark datasets show that the

proposed method produces more representative and diverse

tags, compared with existing methods.

1. Introduction

Image annotation aims to provide keyword tags to de-

scribe an image. It not only presents a simple way to un-

derstand the image’s content, but also provides useful in-

formation for other tasks, such as object detection [12] or

caption generation [6][33][14]. Many existing methods for

image annotation are designed to produce a complete list of

tags that cover all contents of an image, such as ML-MG

[30] and FastTag [5]. We argue that such a complete list

Figure 1. This image with its complete tag list is extracted from

the ESP Game [26] dataset. We also show the annotated tags by

ML-MG [30] and our method using 3 and 5 tags, respectively.

Obviously our tags are more representative and diverse than the

tags of ML-MG. Note that the tag list of our method is obtained

from sampling, so the tag orders in 3 and 5 tags could be different.

is unnecessary in many cases, especially when redundancy

exists among the retrieved tags. We also argue that con-

ventional metrics used to evaluate annotation methods (e.g.,

precision, recall or F1 score) should be modified to discour-

age such redundancy. We will motivate these two points

with an example. In Figure 1, the drummer image has the

following complete (ground truth) tag list: {“band”, “mu-

sic”, “light”, “man”, “people”, “person”, “colors”, “red”,

“wheel”}. Obviously, there are several redundancies in this

list, including “people” and “person” or “colors” and “red”.

Clearly, this image can be described using a more compact

list (e.g., {“band”, “light”, “man”, “red”, “wheel”}), which

describes the same content of the image as the complete

list, but it is more diverse as it avoids redundancy. More-

over, there is usually an upper limit to the number of tags

in the retrieved list for real-world applications. For exam-

ple, in a crowd-sourced image annotation task, we may ask

the human annotator to give at most k (e.g., 3 or 5) tags

for each image. Also, this upper limit naturally arises even

when annotators are not confined to a specific number of

tags, since they choose not to generate a longer list than

necessary. As we will show in our experiments, the aver-

age size of the tag subsets to describe every image in ESP

Game [26] and IAPRTC-12 [11] is around 5 (see Table 1).

Based on our subject studies (see Section 5.4), annotators do tend to choose more diverse tags, which suggests that they choose a compact tag list that covers as much of the image’s content as they see necessary. However, when

comparing this strategy to the top-k (in terms of individual

tag prediction score) retrieved tags of automated annotation

methods, we observe a serious discrepancy. For example,

as shown in Figure 1, the top-3 tags of the recent annotation

method ML-MG [30] are quite redundant. Similar to many

other methods, ML-MG focuses on predicting highly rep-

resentative individual tags, while ignoring diversity within

the retrieved tag list.

Due to the discrepancy between human image annota-

tions and those of existing methods, we propose a new task,

called diverse image annotation (DIA), whose goal is to au-

tomatically generate a list of k (e.g. 3 or 5) tags that jointly

cover as much useful information as possible in the image.

We also propose a method that tackles DIA, by produc-

ing individually informative tags that are also diverse. As

shown in Figure 1, the predicted tags of ML-MG and our

method are quite different, primarily due to their diversities.

To quantitatively evaluate methods for DIA, we have to pro-

pose a new measure that, on one hand, discriminates based

on the aggregate semantic value of the retrieved tags and,

on the other hand, correlates well with human judgment.

We treat the DIA task as a subset selection problem,

where a k sized subset of tags should be retrieved from all

possible tags. The conditional determinantal point process

(DPP) [17] suitably models such a selection problem. DPP

is a probabilistic distribution over subsets of a fixed ground

set, and it enforces diversity among elements within a sub-

set, by utilizing global negative correlations among them.

The parameters of this DPP model are learned from training

samples of images and tags. Once the DPP distribution is

learned, the most probable (i.e., the most jointly represen-

tative and diverse) k tags can be sampled from it for each

testing image. However, for meaningful sampling, we ex-

ploit semantic relationships between candidate tags, namely

their semantic hierarchies and whether or not they are syn-

onyms. These relationships are used to define weighted semantic paths for different tags. Two tags are discouraged from being sampled together if they belong to the same path, thus

reducing redundancy in annotation results. These semantic

paths are also used to define a similarity measure between a

retrieved tag list and the ground truth, which is shown to be

more consistent with human annotation than conventional

measures (e.g., precision, recall and F1).

Contributions. Our main contributions are three-fold. (i)

We propose a new task called diverse image annotation.

(ii) We propose a new annotation method that treats DIA

as a subset selection problem and uses the conditional DPP model, as well as tag-specific semantic paths, to address it.

(iii) We define and validate (through subject studies) new

semantic metrics to evaluate annotations based on the qual-

ity of representation and diversity.

2. Related Work

In this section, we review the main directions in image

annotation, including: learning features, exploring tag cor-

relations, designing loss functions, and handling incomplete

tags in training data. Then, we show the connections and

differences between DIA and these directions.

The first direction focuses on learning better image fea-

tures for annotations, especially based on convolutional

neural networks (CNNs) [18]. Such networks learn very

promising features for many tasks, such as image classifi-

cation [15] and object detection [23]. Global CNN-based

image features have been used for image annotation too

[13]; however, some recent work [10] [27] learns local fea-

tures for detected bounding boxes, so as to extract more dis-

criminative object-centric features rather than from back-

ground. The second direction focuses on exploring and

exploiting tag correlations. As such, image annotation is

treated as a multi-label learning problem, where tag cor-

relations play a key role. Most common tag correlations

involve tag-level smoothness [30, 32] (i.e., the prediction

scores of two semantically similar tags should be similar in

the same image), image-level smoothness [13, 30, 32, 20]

(i.e., visually similar images have similar tags), low rank

assumption [2] (i.e., the whole tag space is spanned by a

lower-dimensional space), and semantic hierarchy [30, 25]

(i.e. parent tags in a hierarchy are as probable as their chil-

dren). Note that most of these methods only focus on posi-

tive tag correlations, while negative correlations have rarely

been explored, such as mutual exclusion [7, 3] and diver-

sity. The third direction focuses on designing loss functions

that encourage certain types of annotation solutions, such as

the (weighted) hamming loss [30, 34] or the pairwise rank-

ing loss [1]. The fourth direction handles incomplete tags

in training, which has been studied in many recent works

[30, 34, 5, 29, 31, 19]. The basic idea is to utilize correla-

tions between provided tags and missing ones to propagate

information.

Our DIA task does not exactly belong to any of the above

directions. However, there are connections and differences

between DIA and these directions, which can help us understand DIA more clearly. The feature learning and loss function design directions can be seen as independent of DIA. Any progress in these two directions can be seamlessly incorporated into our proposed annotation task. In

this work, we adopt global CNN-based features to represent

images and the softmax loss function. The second direction

is the most related, as tag correlations also play a key role in

DIA. However, the intrinsic difference is that existing work

focuses on positive correlations, while DIA considers negative ones. Although mutual exclusion falls into this category too, it only involves a pair of tags. A related work presented in [22] utilizes the pairwise redundancy between tags in a sequential tag selection process, for the image retagging task in the social media scenario. In contrast, DIA takes into account overall negative correlations across all tags. Interestingly, handling incomplete/missing tags seems to have the opposite goal to DIA, since the former seeks the complete tag list from a subset, while DIA seeks a subset of the complete list. However, the two are not contradictory, because they target different challenges.

The motivation of handling incomplete tags is that the num-

ber of fully labeled images is insufficient, while most web

images are partially labeled. Thus, learning from massive

partially labeled images becomes valuable. But again, the

tag diversity is not considered. In contrast, DIA provides a

compact tag list that is not only individually representative

but also diverse, thus, trying to bridge the gap between au-

tomatic image annotation and human annotation. Actually

these two tasks can be combined together, where a complete

tag list is firstly predicted, then DIA extracts a representa-

tive and diverse subset from it. Moreover, the DPP model

has been applied to many computer vision tasks, where the

diversity is required, such as image retrieval [16][17] and

video summarization [9][35]. However, to the best of our knowledge, this work is the first attempt to apply DPP to image annotation.

3. Task and Model

3.1. Diverse Image Annotation (DIA)

The training image set is denoted as $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_j \in \mathbb{R}^d$ is the $d$-dimensional feature representing the $j$-th image. For each image $\mathbf{x}_j$, a ground-truth tag subset $\mathcal{Y}_j \subset \mathcal{T} = \{1, 2, \ldots, m\}$ is also provided, with $\mathcal{T}$ being the whole tag set including $m$ candidate tags. Our task is to learn a model from all pairs $\{(\mathbf{x}_j, \mathcal{Y}_j)\}_{j=1}^{n}$ to predict a representative and diverse tag subset with at most $k$ (a user-defined integer) tags for each testing image.

3.2. Conditional DPP Model

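The parametric conditional DPP with respect to an image and tag subset pair $(\mathbf{x}, \mathcal{Y})$ is formulated as follows [17]:

$$P_{\mathbf{W}}(\mathcal{Y}|\mathbf{x}) = \frac{\det(\mathbf{L}_{\mathcal{Y}}(\mathbf{x};\mathbf{W}))}{\det(\mathbf{L}(\mathbf{x};\mathbf{W}) + \mathbf{I})}, \tag{1}$$

where the kernel matrix for all $m$ tags $\mathbf{L}(\mathbf{x};\mathbf{W}) \in \mathbb{R}^{m \times m}$ is positive semi-definite with $\mathbf{W}$ being its parameters. $\mathbf{L}(\mathbf{x};\mathbf{W})$ can also be denoted as $\mathbf{L}_{\mathcal{T}}(\mathbf{x};\mathbf{W})$, and $\mathbf{I}$ is the identity matrix. $\mathbf{L}_{\mathcal{Y}}(\mathbf{x};\mathbf{W}) \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ is the sub-matrix generated by extracting from $\mathbf{L}(\mathbf{x};\mathbf{W})$ the rows and columns corresponding to the tags in $\mathcal{Y} \subset \mathcal{T}$. $\det(\mathbf{L}_{\mathcal{Y}})$ denotes the determinant of $\mathbf{L}_{\mathcal{Y}}$, which is well suited to encoding negative correlations. Consider a simple example in which $\mathbf{L}_{\mathcal{Y}} = [a_{11}, a_{12}; a_{21}, a_{22}]$, where $a_{11}$ and $a_{22}$ are the individual scores of two tags, while $a_{12}$ and $a_{21}$ denote their correlation. Its determinant is $\det(\mathbf{L}_{\mathcal{Y}}) = a_{11}a_{22} - a_{12}a_{21}$. If $\det(\mathbf{L}_{\mathcal{Y}})$ is small, indicating that these two tags are highly correlated, then the probability $P_{\mathbf{W}}(\mathcal{Y}|\mathbf{x})$ is small; if $\det(\mathbf{L}_{\mathcal{Y}}) = 0$, indicating that the two tags are fully correlated, then $P_{\mathbf{W}}(\mathcal{Y}|\mathbf{x})$ is 0. For a general $\mathbf{L}_{\mathcal{Y}}$, if it is rank deficient because the included tags are fully correlated, then its probability is also 0. Hence the model (1) discourages tag subsets with redundant tags.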
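To make the 2×2 example concrete, the following minimal Python sketch evaluates the determinant of a symmetric kernel as the off-diagonal correlation grows. The function name `det2` and the numeric scores are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def det2(a11, a22, a12):
    """Determinant of the symmetric 2x2 kernel [[a11, a12], [a12, a22]]."""
    return float(np.linalg.det(np.array([[a11, a12], [a12, a22]])))


# Illustrative (assumed) individual scores of two tags.
a11, a22 = 4.0, 2.25

# a12 = 3.0 = sqrt(a11 * a22) is the fully correlated case.
for a12 in (0.0, 1.5, 2.7, 3.0):
    print(f"a12={a12:.1f}  det(L_Y)={det2(a11, a22, a12):.4f}")
```

The determinant shrinks as the correlation term grows and vanishes when the two tags are fully correlated, which is the property Eq. (1) relies on to penalize redundant subsets.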

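Using the quality/diversity decomposition [17] (here “quality” refers to “representation”), we have

$$L_{ij}(\mathbf{x};\mathbf{W}) = q_i(\mathbf{x})\,\phi_i(\mathbf{x})^{\top}\phi_j(\mathbf{x})\,q_j(\mathbf{x}), \tag{2}$$

where $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \ldots\ \mathbf{w}_m]$ denotes the set of quality parameters, one for each tag. $q_i(\mathbf{x};\mathbf{w}_i) = \exp(0.5\,\mathbf{w}_i^{\top}\mathbf{x})$ is the quality term, indicating the individual score of $\mathbf{x}$ with respect to tag $i$. $\phi_i(\mathbf{x}) \in \mathbb{R}^m$ is a normalized diversity feature vector, with $\|\phi_i(\mathbf{x})\| = 1$. $\mathbf{S}(\mathbf{x}) \in \mathbb{R}^{m \times m}$, with entries $S_{ij}(\mathbf{x}) = \phi_i(\mathbf{x})^{\top}\phi_j(\mathbf{x})$, is the similarity matrix among tags. In this work, we adopt a similarity matrix independent of $\mathbf{x}$, thus we denote it as $\mathbf{S}$ for clarity. Specifically, we adopt the cosine similarity, shifted and rescaled to lie in $[0, 1]$,

$$S(i, j) = \frac{1}{2} + \frac{\langle \mathbf{t}_i, \mathbf{t}_j \rangle}{2\,\|\mathbf{t}_i\|_2\,\|\mathbf{t}_j\|_2} \in [0, 1] \quad \forall\, i, j \in \mathcal{T}, \tag{3}$$

where the tag representation $\mathbf{t}_i \in \mathbb{R}^{50}$ is derived from the GloVe algorithm [21]. Then, Eq. (1) can be reformulated as

$$P_{\mathbf{W}}(\mathcal{Y}|\mathbf{x}) = \frac{\prod_{i \in \mathcal{Y}} \exp(\mathbf{w}_i^{\top}\mathbf{x})\,\det(\mathbf{S}_{\mathcal{Y}})}{\sum_{\mathcal{Y}' \subseteq \mathcal{T}} \prod_{i \in \mathcal{Y}'} \exp(\mathbf{w}_i^{\top}\mathbf{x})\,\det(\mathbf{S}_{\mathcal{Y}'})}, \tag{4}$$

where $\mathbf{S}_{\mathcal{Y}} \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ and $\mathbf{S}_{\mathcal{Y}'} \in \mathbb{R}^{|\mathcal{Y}'| \times |\mathcal{Y}'|}$ are sub-matrices of $\mathbf{S}$ corresponding to tag subsets $\mathcal{Y}, \mathcal{Y}' \subseteq \mathcal{T}$.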
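As a rough illustration of Eqs. (1)-(3), the sketch below builds the similarity matrix from tag vectors, forms the kernel from quality scores, and evaluates a subset probability. Random vectors stand in for the GloVe embeddings, and all function names, sizes and seeds are assumptions made for this example only.

```python
import numpy as np


def similarity_matrix(T):
    """Eq. (3): S(i, j) = 1/2 + cos(t_i, t_j) / 2, rows of T are tag vectors."""
    U = T / np.linalg.norm(T, axis=1, keepdims=True)
    return 0.5 + 0.5 * (U @ U.T)


def dpp_kernel(W, x, S):
    """Eq. (2): L = diag(q) S diag(q) with q_i = exp(0.5 * w_i^T x)."""
    q = np.exp(0.5 * (W.T @ x))
    return q[:, None] * S * q[None, :]


def subset_probability(L, Y):
    """Eq. (1): P(Y | x) = det(L_Y) / det(L + I); the empty set has det = 1."""
    Y = list(Y)
    numerator = np.linalg.det(L[np.ix_(Y, Y)]) if Y else 1.0
    return float(numerator / np.linalg.det(L + np.eye(L.shape[0])))


# Toy problem: m = 4 tags, d = 3 image features (assumed sizes).
rng = np.random.default_rng(0)
m, d = 4, 3
T = rng.normal(size=(m, 50))   # stand-in for 50-d GloVe tag vectors
W = rng.normal(size=(d, m))    # one column w_i per tag
x = rng.normal(size=d)         # image feature

L = dpp_kernel(W, x, similarity_matrix(T))
print(subset_probability(L, [0, 2]))
```

Summing `subset_probability` over all subsets of the tag set should give 1, because the sum of all principal minors of L equals det(L + I); this is what makes the denominator of Eq. (4) computable in closed form.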

3.3. Learning

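Assuming that the diversity kernel $\mathbf{S}$ is given and following [17], we learn the parameter $\mathbf{W}$ by minimizing the negative log-likelihood with an $\ell_2$ regularization,

$$\begin{aligned}
\mathcal{L}(\mathbf{W}) &= -\frac{1}{n}\sum_{j=1}^{n} \log P_{\mathbf{W}}(\mathcal{Y}_j|\mathbf{x}_j) + \frac{\eta}{2}\sum_{i=1}^{m} \|\mathbf{w}_i\|_2^2 \\
&= \frac{\eta}{2}\sum_{i=1}^{m} \|\mathbf{w}_i\|_2^2 - \frac{1}{n}\sum_{j=1}^{n}\Big[\sum_{i \in \mathcal{Y}_j} \mathbf{w}_i^{\top}\mathbf{x}_j + \log\det(\mathbf{S}_{\mathcal{Y}_j})\Big] \\
&\quad + \frac{1}{n}\sum_{j=1}^{n} \log\Big[\sum_{\mathcal{Y}'_j \subseteq \mathcal{T}} \prod_{i \in \mathcal{Y}'_j} \exp(\mathbf{w}_i^{\top}\mathbf{x}_j)\,\det(\mathbf{S}_{\mathcal{Y}'_j})\Big].
\end{aligned} \tag{5}$$

It is easy to prove that $\mathcal{L}(\mathbf{W})$ is a convex function with respect to each $\mathbf{w}_i$. Thus, we adopt a simple gradient-based method to minimize $\mathcal{L}(\mathbf{W})$. The gradient with respect to $\mathbf{w}_i$ is computed as follows: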

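$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_i} &= \eta\,\mathbf{w}_i - \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j\,\mathbb{I}_{i \in \mathcal{Y}_j} + \frac{1}{n}\sum_{j=1}^{n} \frac{\sum_{\mathcal{Y}'_j \subseteq \mathcal{T}} \prod_{i' \in \mathcal{Y}'_j} \exp(\mathbf{w}_{i'}^{\top}\mathbf{x}_j)\,\det(\mathbf{S}_{\mathcal{Y}'_j})\,\mathbf{x}_j\,\mathbb{I}_{i \in \mathcal{Y}'_j}}{\sum_{\mathcal{Y}'_j \subseteq \mathcal{T}} \prod_{i' \in \mathcal{Y}'_j} \exp(\mathbf{w}_{i'}^{\top}\mathbf{x}_j)\,\det(\mathbf{S}_{\mathcal{Y}'_j})} \\
&= \eta\,\mathbf{w}_i + \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j \Big[\sum_{\mathcal{Y}'_j \subseteq \mathcal{T}} P_{\mathbf{W}}(\mathcal{Y}'_j|\mathbf{x}_j)\,\mathbb{I}_{i \in \mathcal{Y}'_j} - \mathbb{I}_{i \in \mathcal{Y}_j}\Big],
\end{aligned} \tag{6}$$

where the indicator function $\mathbb{I}_{i \in \mathcal{Y}_j}$ is 1 if $i \in \mathcal{Y}_j$, and 0 otherwise. $\sum_{\mathcal{Y}'_j \subseteq \mathcal{T}} P_{\mathbf{W}}(\mathcal{Y}'_j|\mathbf{x}_j)\,\mathbb{I}_{i \in \mathcal{Y}'_j}$ can be seen as the marginal probability of $\mathbf{x}_j$ with respect to tag $i$. It is equivalent to the diagonal entry $K_{ii}$ of the marginal kernel $\mathbf{K}(\mathbf{x}_j;\mathbf{W}) = \mathbf{L}(\mathbf{x}_j;\mathbf{W})\,(\mathbf{L}(\mathbf{x}_j;\mathbf{W}) + \mathbf{I})^{-1}$, with $K_{ii}(\mathbf{x}_j) = \sum_{i'=1}^{m} \frac{\lambda_{i'}}{\lambda_{i'} + 1}\,\upsilon_{i'}(i)^2$, where $\lambda_{i'}$ and $\boldsymbol{\upsilon}_{i'}$ are the $i'$-th eigenvalue and eigenvector of the kernel $\mathbf{L}(\mathbf{x}_j;\mathbf{W})$ respectively, derived from its eigendecomposition (which coincides with the SVD for a symmetric positive semi-definite matrix): $\mathbf{L}(\mathbf{x}_j;\mathbf{W}) = \sum_{i'=1}^{m} \lambda_{i'}\,\boldsymbol{\upsilon}_{i'}\boldsymbol{\upsilon}_{i'}^{\top}$. Then we obtain the gradient as follows:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_i} = \eta\,\mathbf{w}_i + \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j \big[K_{ii}(\mathbf{x}_j) - \mathbb{I}_{i \in \mathcal{Y}_j}\big]. \tag{7}$$

Given $\partial \mathcal{L} / \partial \mathbf{w}_i$, the back-propagation and stochastic gradient descent algorithm [24] are adopted to optimize $\mathbf{W}$.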
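The gradient in Eq. (7) only needs the diagonal of the marginal kernel, which can be read off an eigendecomposition. The following is a minimal full-batch sketch under that reading; the function names are hypothetical, and the paper's actual training loop (mini-batches, momentum, learning-rate decay) is not reproduced here.

```python
import numpy as np


def marginal_diagonal(L):
    """K_ii = sum_k lam_k / (lam_k + 1) * v_k(i)^2 for a symmetric PSD kernel L."""
    lam, V = np.linalg.eigh(L)          # columns of V are eigenvectors
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
    return (V ** 2) @ (lam / (lam + 1.0))


def nll_gradient(W, X, Ys, S, eta):
    """Eq. (7): dL/dw_i = eta * w_i + (1/n) * sum_j x_j * (K_ii(x_j) - 1[i in Y_j]).

    W: (d, m) parameters, X: (n, d) image features,
    Ys: list of ground-truth tag index lists, S: (m, m) similarity matrix.
    """
    n, m = X.shape[0], W.shape[1]
    G = eta * W
    for x, Y in zip(X, Ys):
        q = np.exp(0.5 * (W.T @ x))                 # quality scores
        K_diag = marginal_diagonal(q[:, None] * S * q[None, :])
        indicator = np.zeros(m)
        indicator[list(Y)] = 1.0
        G = G + np.outer(x, K_diag - indicator) / n
    return G
```

A plain gradient step would then be `W -= lr * nll_gradient(W, X, Ys, S, eta)`, with `lr` chosen by the user.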

Note that if we replace S by the identity matrix I,

then the above learning can be seen as a standard multi-

label learning, where the tag subset is transformed to a la-

bel powerset. We refer the reader to [36] for details about

label powerset based multi-label learning. In this case, the

tag correlations are not utilized at all. In fact, S in the

DPP model serves to encourage negative correlations be-

tween tags, since it penalizes the probability of the subset

including highly-correlated tags. Thus, a subset with repre-

sentative (from q) and diverse (from S) tags is encouraged.
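One way to see why no tag correlation is used when S = I: the kernel becomes diagonal, so each tag's marginal probability reduces to a sigmoid of its own score and no longer depends on the other tags. The short sketch below checks this numerically; the function name and the random inputs are assumptions for illustration.

```python
import numpy as np


def marginals_with_identity_similarity(W, x):
    """With S = I, L = diag(q_i^2) and K_ii = q_i^2 / (1 + q_i^2) = sigmoid(w_i^T x)."""
    scores = W.T @ x
    L = np.diag(np.exp(scores))                       # q_i^2 = exp(w_i^T x)
    K = L @ np.linalg.inv(L + np.eye(len(scores)))    # marginal kernel
    sigmoid = 1.0 / (1.0 + np.exp(-scores))
    return np.diag(K), sigmoid


rng = np.random.default_rng(0)
K_diag, sigmoid = marginals_with_identity_similarity(rng.normal(size=(3, 5)), rng.normal(size=3))
print(np.allclose(K_diag, sigmoid))
```

With a non-diagonal S, the off-diagonal entries of L couple the tags, and this per-tag factorization no longer holds.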

4. Sampling

Given the learned conditional DPP model, we can pro-

duce a representative and diverse tag subset for each testing

image by sampling from the learned distribution. Before de-

tailing the sampling algorithm in Section 4.3, we first intro-

duce a few pertinent concepts, namely semantic hierarchy,

synonyms (Section 4.1) and weighted semantic path (Section 4.2),

because they play important roles in sampling.

4.1. Semantic Hierarchy and Synonyms

Semantic hierarchy (SH) was explored in ML-MG [30]

for image annotation. It describes the semantic dependen-

cies among tags. For example, “woman” is a “person”. Fig-

ure 2-left shows a part of the semantic hierarchies of ESP

Game [26]. Please refer to [30] for the detailed definition

of semantic hierarchy. In ML-MG, it is assumed that if the

descendant tag (e.g., “woman”) exists, then all its ancestor

tags (e.g., “person” and “people”) must exist too. In con-

trast, the usage of SH in our sampling algorithm is different.

We assume that two tags with semantic dependency cannot

be selected together, thus, reducing redundancy. Also, we

will use SH to define semantic metrics for DIA evaluation.

Synonymy refers to the case where two tags have the

same or similar meaning, such as “people” and “person”,

or “rock” and “stone”. We find that in many benchmark

image datasets, such as ESP Game [26] and IAPRTC-12

[11], there are many pairs of synonymous tags, according to

Wordnet [8]. In [28], synonyms are utilized to modify the

evaluation metric. In this work, synonyms are not only used

to define semantic metrics, but also utilized to discourage

synonymous tags from being selected simultaneously.

4.2. Weighted Semantic Path

Here, we propose a new concept, called weighted semantic path (SP), to merge the ideas of SH and synonyms. We present a simple example in Figure 2 to illustrate some definitions of SP. Firstly, as shown in Figure 2-left, we can find some directed paths among the 5 candidate tags, such as [“person”→“woman”→“lady”]. However, some paths may represent the same or similar meaning, as their constituent tags are synonyms, such as [“person”→“woman”→“lady”] and [“people”→“woman”→“lady”]. We propose that if two directed paths differ only at synonymous tags, then they should be merged into one path, such as [(“person”, “people”)→“woman”→“lady”], as shown in Figure 2-middle. The set of all semantic paths corresponding to the whole tag set $\mathcal{T}$ is denoted as $SP_{\mathcal{T}} = \{sp_1, \ldots, sp_r\}$.

For each semantic path, we focus on two of its important properties, namely the hierarchy layers and tag weights. For the first path shown in Figure 2-middle, the tag layers are [(2, 2), 1, 0], respectively. As for tag weights, we seek a model that scores a tag based on the content it represents for an image. To motivate such a model, we make two observations. (i) A descendant tag can tell more specific information than its ancestor tags (e.g. “woman” is more informative than “person”). Therefore, the weight of the descendant tag should be higher than the weights of its ancestor tags. (ii) The number of descendants of each tag is also considered. For example, in the SH of IAPRTC-12, “person” has 15 descendants, while “sport” has 3. As such, one can assume that “person” is less discriminative than “sport”. Thus, we model the tag weight to be inversely proportional to its number of descendants. Combining both observations, we define the weight of tag $y_i$ in path $sp_j$ as $\omega_{ij} = \tau^{l_{ij}} / |d_i|$, where $|d_i|$ indicates the number of descendants of $y_i$, $l_{ij}$ represents the layer of $y_i$ in $sp_j$, and $\tau \in (0, 1)$ denotes the decay factor between layers. In this work, we set $\tau = 0.7$. Consequently, the weight of tag $y_i$ in the whole set of semantic paths is defined as the sum of its weights in all semantic paths, i.e., $\omega_i = \sum_{j=1}^{|SP_{\mathcal{T}}|} \omega_{ij}$. As shown in Figure 2-middle, the weight of “people” is 0.3966 = 0.1633 + 0.2333, as it exists in two paths. The weights of all tags can be concatenated into one vector: $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_m)$.

Figure 2. An example of constructing the semantic paths from the semantic hierarchy and synonyms. Left: The semantic hierarchy and synonyms of the whole tag set $\mathcal{T}$. Middle: The semantic paths of $\mathcal{T}$, and tag weights in each path. Right: The semantic paths of one tag subset $\mathcal{Y}$, and tag weights in each path.

In the above paragraph we have introduced the construction of the semantic paths $SP_{\mathcal{T}}$ of the whole tag set $\mathcal{T}$. In the following, we also define the semantic paths $SP_{\mathcal{Y}}$ of the tag subset $\mathcal{Y}$ of one image, as shown in Figure 2-right, where we set $\mathcal{Y}$ = {“people”, “person”, “woman”}. Firstly, from $SP_{\mathcal{T}}$, we crop the partial paths where tags in $\mathcal{Y}$ occur, i.e., [(“person”, “people”)→“woman”]. Then, to ensure that the leaf tag weight in each path of $SP_{\mathcal{Y}}$ is 1 (such that each independent path tells the same amount of content), we rescale the tag weights. Thus, the weight of “woman” is changed from 0.7 to 1, and the change factor is 1/0.7. Using the same factor, we adjust the weights of “people” and “person” from 0.1633 to 0.2333 = 0.1633 ∗ (1/0.7).
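The weight arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the descendant counts (3 for “person” and “people”, 1 for “woman”) are inferred from the quoted weights rather than stated in the text, the layer of “people” in its second path is likewise inferred, and treating a leaf tag as having a divisor of 1 is an assumption made so that a leaf gets weight 1.

```python
TAU = 0.7  # decay factor between layers, as set in the text


def path_weight(layer, num_descendants, tau=TAU):
    """omega_ij = tau^l_ij / |d_i|; max(1, .) is an assumption so a leaf gets weight 1."""
    return tau ** layer / max(1, num_descendants)


def rescale_to_leaf(weights, leaf):
    """Rescale a cropped path so that its leaf tag has weight 1."""
    factor = 1.0 / weights[leaf]
    return {tag: w * factor for tag, w in weights.items()}


# Path [("person", "people") -> "woman" -> "lady"] with layers [(2, 2), 1, 0].
# Descendant counts are assumptions chosen to match the quoted weights.
path = {
    "person": path_weight(2, 3),
    "people": path_weight(2, 3),
    "woman": path_weight(1, 1),
    "lady": path_weight(0, 0),
}

# "people" also appears in a second path, assumed here to be at layer 1.
people_total = path["people"] + path_weight(1, 3)

# Cropped path for Y = {"people", "person", "woman"}; "woman" becomes the leaf.
cropped = rescale_to_leaf({t: path[t] for t in ("person", "people", "woman")}, "woman")

print({t: round(w, 4) for t, w in path.items()})
print(round(people_total, 4))
print({t: round(w, 4) for t, w in cropped.items()})
```

Note that summing the unrounded terms gives a total that rounds to 0.3967, whereas the text adds the two rounded terms to obtain 0.3966.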

4.3. DPP Sampling with Weighted Semantic Paths

Here, we present the sampling algorithm based on the learned conditional DPP model (see Section 3.3) and the weighted semantic paths SP. The pseudo-code is shown in Algorithm 1. The inputs are the testing image feature $\mathbf{x}$, the learned parameters $\mathbf{W}$, the similarity matrix $\mathbf{S}$, two integers $k_1, k_2$ with $m > k_1 > k_2 > 0$, and the semantic paths $SP_{\mathcal{T}}$ with tag weights $\boldsymbol{\omega}$. The output is the tag subset $\mathcal{Y}_{k_2}$ with at most $k_2$ tags for this testing image. Although the number $k_2$ should be provided a priori, it is not a strict requirement, as $k_2$ only serves as an upper limit on the number of sampled tags, rather than requiring exactly $k_2$ tags. In practice, $k_2$ can be determined according to the user’s requirements or experience.

Algorithm 1: DPP Sampling with Weighted Semantic Paths

Input: $\mathbf{x}, \mathbf{W}, \mathbf{S}, k_1, k_2, SP_{\mathcal{T}}, \boldsymbol{\omega}$.
Output: A tag subset $\mathcal{Y}_{k_2}$.
1  compute the quality score $q_i(\mathbf{x};\mathbf{w}_i) = \exp(\frac{1}{2}\mathbf{w}_i^{\top}\mathbf{x})$, and the kernel matrix $\mathbf{L} = \mathrm{diag}(\mathbf{q}) \cdot \mathbf{S} \cdot \mathrm{diag}(\mathbf{q})$ with $\mathbf{q} = (\ldots; q_i(\mathbf{x}, \mathbf{w}_i); \ldots) \in \mathbb{R}^m$;
2  determine the tag set $\mathcal{Y}_{k_1}$ corresponding to the largest $k_1$ entries in $\mathbf{q}$, and the sub-kernel $\mathbf{L}_{\mathcal{Y}_{k_1}} = \mathbf{L}(\mathcal{Y}_{k_1}, \mathcal{Y}_{k_1})$;
3  compute the eigenvalues $\{\lambda_j\}$ of $\mathbf{L}_{\mathcal{Y}_{k_1}}$, and $e_k^N = \sum_{\mathcal{Y}_k \subseteq [N], |\mathcal{Y}_k| = k} \prod_{j \in \mathcal{Y}_k} \lambda_j$ for $N = 0, 1, \ldots, k_1$ and $k = 0, 1, \ldots, k_2$;
4  for $t = 1, \ldots, 10$ do
5      $\mathcal{Y}_t = \emptyset$, $l = k_2$
6      for $i = k_1, \ldots, 1$ do
7          if $\mathcal{Y}_{k_1}(i)$ is in the same semantic path in $SP_{\mathcal{T}}$ with any tag in $\mathcal{Y}_t$ then
8              skip to the next iteration
9          end
10         if $u \sim U[0, 1] < \lambda_i\, e_{l-1}^{i-1} / e_l^{i}$ then
11             $\mathcal{Y}_t \leftarrow \mathcal{Y}_t \cup \mathcal{Y}_{k_1}(i)$, $l \leftarrow l - 1$
12         end
13         if $l = 0$ then break end
14     end
15     compute the tag weights $\boldsymbol{\omega}_{\mathcal{Y}_t} = \boldsymbol{\omega}(\mathcal{Y}_t)$, and the weight summation $\omega_{\mathcal{Y}_t} = \sum_{j=1}^{|\mathcal{Y}_t|} \boldsymbol{\omega}_{\mathcal{Y}_t}(j)$
16 end
17 return $\mathcal{Y}_{k_2} = \arg\max_{\mathcal{Y}_{t'}} \omega_{\mathcal{Y}_{t'}}$.

Data              C1      C2     C3    C4    C5    C6   C7    C8
ESP Game [26]     18689   2081   268   597   129   9    106   4.56
IAPRTC-12 [11]    17495   1957   291   536   178   11   139   5.85

Table 1. Details of the semantic hierarchies, synonyms and semantic paths of two benchmark datasets. The notations C1 to C8 indicate the numbers of: training images (C1), testing images (C2), candidate tags (C3), feature dimensions (C4), parent-child pairs in SH (C5), synonym pairs (C6), semantic paths corresponding to the whole set of tags (C7), and the average number of semantic paths corresponding to the tag subset of each image (C8).

Algorithm 1 is a modified version of the standard k-DPP sampling algorithm [17], obtained by embedding the weighted semantic paths in Lines 7-9. It consists of two stages. The first stage ranges from Line 1 to Line 3 and computes the eigenvalues $\{\lambda_j\}$ and the elementary symmetric polynomials $\{e_k^N\}$, which form the normalization term of the k-DPP model; we refer the reader to [17] for more details. $\{\lambda_j\}$ and $\{e_k^N\}$ will be used to compute the sampling probability of each tag. Note that we pick $k_1$ candidate tags with the $k_1$ largest entries in $\mathbf{q}$, and the output $k_2$ tags are sampled from these $k_1$ tags, rather than from $\mathcal{T}$. As a result, the sampling cost is significantly reduced, and most negative labels will not be sampled, at the cost that some positive tags may also be removed. The second stage is sampling (Lines 4-17). Since the sampling may produce different subsets, we run 10 samplings to produce 10 subsets. Lines 7-9 ensure that two tags in the same semantic path will not be selected together. $\lambda_i\, e_{l-1}^{i-1} / e_l^{i}$ in Line 10 indicates the probability of adding tag $\mathcal{Y}_{k_1}(i)$ into the subset, given the current subset $\mathcal{Y}_t$. At the end of each sampling process, the tag weight summation in the sampled subset is computed (see Line 15). Finally, we pick the subset with the largest weight summation among the 10 sampled subsets (see Line 17), as we believe that a larger weight summation indicates more content.
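The control flow of Algorithm 1 can be sketched in Python as follows. This is a hedged reading of the pseudo-code, not the authors' implementation: the encoding of semantic paths as sets of path ids (`path_ids`), the pairing of the i-th eigenvalue with the i-th candidate tag (taken literally from Lines 10-11), and the handling of a zero denominator (treated as acceptance probability 1) are all assumptions of this example.

```python
import numpy as np


def elementary_symmetric(lams, k):
    """e[l, n] = e_l^n, the degree-l elementary symmetric polynomial of lams[:n]."""
    e = np.zeros((k + 1, len(lams) + 1))
    e[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, len(lams) + 1):
            e[l, n] = e[l, n - 1] + lams[n - 1] * e[l - 1, n - 1]
    return e


def sample_tags(q, S, k1, k2, path_ids, weights, rng, n_runs=10):
    """Return at most k2 tags, no two of which share a semantic path.

    q: (m,) quality scores, S: (m, m) similarity matrix,
    path_ids[t]: set of semantic-path ids containing tag t (hypothetical encoding),
    weights[t]: tag weight omega_t.
    """
    cand = np.argsort(q)[::-1][:k1]                    # Line 2: top-k1 tags by quality
    qc = q[cand]
    L = qc[:, None] * S[np.ix_(cand, cand)] * qc[None, :]
    lams = np.clip(np.linalg.eigvalsh(L), 0.0, None)   # Line 3: eigenvalues
    e = elementary_symmetric(lams, k2)

    best, best_weight = [], -1.0
    for _ in range(n_runs):                            # Line 4
        chosen, l = [], k2
        for i in range(k1, 0, -1):                     # Line 6
            tag = int(cand[i - 1])
            if any(path_ids[tag] & path_ids[c] for c in chosen):
                continue                               # Lines 7-9: same path, skip
            denom = e[l, i]
            # Assumption: accept with probability 1 if the denominator vanishes.
            prob = 1.0 if denom <= 0 else lams[i - 1] * e[l - 1, i - 1] / denom
            if rng.random() < prob:                    # Line 10
                chosen.append(tag)
                l -= 1
            if l == 0:
                break
        total = sum(weights[t] for t in chosen)        # Line 15
        if total > best_weight:                        # Line 17: keep the heaviest subset
            best, best_weight = chosen, total
    return best
```

A call such as `sample_tags(q, S, k1=6, k2=3, path_ids=path_ids, weights=omega, rng=np.random.default_rng(0))` would mirror the 3-tag setting described in Section 5.3, where 3 tags are drawn from the top-6 candidates.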

5. Experiments

5.1. Experimental Settings

Datasets. We run experiments on two benchmark datasets

in image annotation, namely ESP Game (20770 images,

268 tags) [26] and IAPRTC-12 (19452 images, 291 tags)

[11]. Regarding features, we extract a 4096-dimensional feature vector for each image, using the pre-trained VGG-F model [4] (footnote 1). Then, we perform dimensionality reduction using PCA to maintain 80% of the feature variance. As described in Section 4, we construct the semantic hierarchies (footnote 2), synonyms and the weighted semantic paths. The basic statistics of these terms in the two datasets are shown in Table 1. The complete sets of semantic hierarchies, synonyms and weighted semantic paths of both datasets are provided in the supplementary material. Note that we find many repeating images in IAPRTC-12, so we remove these redundant images (170 training and 5 testing images) in experiments.

Parameters. The parameters of stochastic gradient descent

for learning W are set as follows: the initial learning rate is

20, and the decay is 0.02. The learning rate is updated every

50 iterations with momentum 0.9 and batch size 1024. The

maximum number of epochs is 5 and the parameter of the

ℓ2 regularization is η = 0.0001 (see Eq (5)).

Compared Methods. We first compare with existing image

annotation methods, namely ML-MG [30] and LEML [34].

Also, we compare three variants of the proposed method,

including DPP-I-topk, DPP-S-topk, and DPP-S-sampling.

DPP-I-topk denotes the case when the S matrix is replaced

by identity during the learning phase, and then the tags

with the top-k largest quality scores are retrieved. DPP-S-

topk denotes the case where we learn the conditional DPP

model with S, then retrieve tags with the top-k largest qual-

ity scores. DPP-S-sampling means that we learn the con-

ditional DPP model with S, and then use Algorithm 1 to

retrieve at most k tags for the testing image.

1 It is downloaded from http://www.vlfeat.org/matconvnet/pretrained/

2 The semantic hierarchies of ESP Game and IAPRTC-12 are provided by the author of ML-MG [30].

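Algorithm 2: Semantic Metrics

Input: The ground-truth tag subset $\mathcal{Y}$, the predicted tag subset $\mathcal{Y}'$.
Output: $P_{sp}$, $R_{sp}$ and $F_{1-sp}$.
1  construct the semantic paths $SP_{\mathcal{Y}}$ and $SP_{\mathcal{Y}'}$;
2  for $sp_j \in SP_{\mathcal{Y}}$ do
3      for $y_i \in sp_j$ do
4          if $y_i \in \mathcal{Y}'$ then
5              $s_{y_i,j} = \omega_{y_i,j}$
6          else
7              $s_{y_i,j} = 0$
8          end
9      end
10     $s_j = \max_{y_i \in sp_j} s_{y_i,j}$
11 end
12 $P_{sp} = \sum_{j=1}^{|SP_{\mathcal{Y}}|} s_j / |SP_{\mathcal{Y}'}|$;
13 $R_{sp} = \sum_{j=1}^{|SP_{\mathcal{Y}}|} s_j / |SP_{\mathcal{Y}}|$;
14 $F_{1-sp} = 2(P_{sp} \cdot R_{sp}) / (P_{sp} + R_{sp})$;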

5.2. Semantic Metrics

Many evaluation metrics have been used in image anno-

tation and multi-label learning, such as the example-based

precision, recall and F1 score [36]. However, these metrics

are not very suitable for the DIA task, as they treat every

tag equally and independently. In other words, they focus on evaluating representation but ignore diversity. Thus,

we propose semantic metrics to evaluate representation and

diversity jointly. Semantic metrics are defined based on the

semantic paths (see Section 4.2), rather than individual tag

metrics. Algorithm 2 shows how to compute the scores of

semantic metrics for one testing image. In experiments we

will report the average score over all testing images.
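Algorithm 2 translates almost line by line into code. The sketch below assumes the semantic paths have already been constructed and rescaled as in Section 4.2; the input encoding (one tag-to-weight dictionary per ground-truth path, plus the number of predicted paths) and the example tags and weights are hypothetical choices for illustration.

```python
def semantic_metrics(gt_paths, predicted, n_pred_paths):
    """Compute (P_sp, R_sp, F1_sp) for one image.

    gt_paths: list of {tag: weight} dicts, one per path in SP_Y.
    predicted: set of predicted tags Y'.
    n_pred_paths: |SP_Y'|, the number of semantic paths of the predicted subset.
    """
    # Lines 2-11: each ground-truth path scores the best-weighted tag it shares with Y'.
    path_scores = [
        max((w for tag, w in path.items() if tag in predicted), default=0.0)
        for path in gt_paths
    ]
    total = sum(path_scores)
    precision = total / n_pred_paths if n_pred_paths else 0.0   # Line 12
    recall = total / len(gt_paths) if gt_paths else 0.0         # Line 13
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0   # Line 14
    return precision, recall, f1


# Hypothetical example: two ground-truth paths, three predicted tags on three paths.
gt_paths = [
    {"person": 0.2333, "people": 0.2333, "woman": 1.0},
    {"band": 1.0},
]
p_sp, r_sp, f1_sp = semantic_metrics(gt_paths, {"person", "band", "light"}, 3)
print(p_sp, r_sp, f1_sp)
```

In this example, predicting the ancestor “person” instead of the leaf “woman” earns only 0.2333 for the first path, which is how the metric rewards more specific tags, while a redundant second tag on the same path would add nothing because only the maximum per path counts.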

5.3. Results

The results evaluated by semantic metrics on ESP Game

and IAPRTC-12 are shown in Table 2. Since the compared

methods belong to different categories, in the following we

present the comparisons with different groups. The first

category is ML-MG, which utilizes the linear inequality

constraint to encourage the tag order satisfying the semantic

hierarchy. Thus, the ancestor tags are always ranked before

their descendant tags. Besides, ML-MG also utilizes the

tag co-occurrence to encourage similar tags to have similar

scores. Then the tags in one semantic path will be ranked

close to each other. As a result, if we pick the top-3 or top-

5 tags from the tag ranking list of ML-MG, it is expected

that more ancestor tags (corresponding to lower weights in

the semantic path), and tags from fewer paths will be ob-

tained. Such tags will cover less-representative and less in-

formation. This is why ML-MG shows the worst perfor-

mance evaluated by semantic metrics. In the second cate-

gory, LEML utilizes the empirical risk minimization (ERM)

framework with a decomposed loss over each tag; DPP-

I-topk can be seen as a label-powerset-based multi-label

method. They don’t consider the ranking relationships (as

2564

did in ML-MG), nor the tag diversity (as did in DPP-S-sampling). Thus their performances fall between that of ML-MG and those of DPP-S-topk and DPP-S-sampling. The last category includes DPP-S-topk and DPP-S-sampling, which take diversity into account. The difference is that, given the learned DPP model with S, DPP-S-sampling uses Algorithm 1 to obtain the tag subset, while DPP-S-topk chooses the top-k tags according to the quality scores. They show the best performance in most cases. In detail, in the case of 3 tags, DPP-S-sampling gives significant relative improvements in F1−sp over DPP-S-topk: 13.94% on ESP Game (33.43 vs. 29.34) and 19% on IAPRTC-12 (30.13 vs. 25.32). This verifies the efficacy of the proposed sampling algorithm. Notably, DPP-S-sampling always achieves a higher recall Rsp (see Line 13 in Algorithm 2) than the other methods. The reason is that DPP-S-sampling encourages covering more diverse tags from different semantic paths, so the numerator value s_j of Rsp is very likely to be higher than that of the other methods. Besides, the denominator value |SP_Y| of Rsp, i.e., the number of ground-truth semantic paths, is the same for all methods. Hence, DPP-S-sampling gives higher recall than the others. However, we also observe a significant decrease in the Psp of DPP-S-sampling from 3 tags to 5 tags. The first reason is that Algorithm 1 ensures that the number of semantic paths of the sampled subset (i.e., |SP_Y′|) equals the number of included tags, as two tags in the same path cannot be selected simultaneously. In contrast, the number of paths of the subset produced by DPP-S-topk and the other compared methods is likely to be smaller than the number of included tags, as tags in the same path can be selected together. Thus, when computing Psp (see Line 12 in Algorithm 2), the denominator value |SP_Y′| of DPP-S-sampling is never smaller than that of the other methods (in fact, it is usually larger). Meanwhile, since the 3 tags and 5 tags are sampled from the top-6 and top-8 candidate tags respectively (see Algorithm 1) for DPP-S-sampling, if the additional 2 candidate tags include no positive tags in different semantic paths, or only one, then the numerator value of Psp will not increase much. Hence, the Psp of DPP-S-sampling can be lower than that of DPP-S-topk in the case of 5 tags.
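The denominator argument above can be illustrated with a deliberately simplified sketch. It assumes each tag belongs to exactly one semantic path and ignores the tag weights used in Algorithm 2, so it is only an unweighted stand-in for Psp and Rsp; the tag names, path ids, and the helper `path_precision_recall` are hypothetical.

```python
def path_precision_recall(predicted_tags, ground_truth_tags, tag_to_path):
    """Unweighted stand-in for Psp and Rsp: count distinct semantic paths."""
    predicted_paths = {tag_to_path[t] for t in predicted_tags}
    truth_paths = {tag_to_path[t] for t in ground_truth_tags}
    hit_paths = predicted_paths & truth_paths
    precision = len(hit_paths) / len(predicted_paths)
    recall = len(hit_paths) / len(truth_paths)
    return precision, recall


# Hypothetical vocabulary: tags with the same id lie on the same semantic path.
tag_to_path = {
    "dog": 0, "puppy": 0, "animal": 0,
    "grass": 1, "lawn": 1,
    "ball": 2, "sky": 3, "car": 4, "tree": 5,
}
ground_truth = ["dog", "grass", "ball", "sky"]  # covers paths 0, 1, 2, 3

# A top-k style subset may spend several of its 5 tags on the same path.
topk_subset = ["dog", "puppy", "animal", "grass", "lawn"]
# A path-constrained subset covers 5 distinct paths, two of them wrong.
sampled_subset = ["dog", "grass", "ball", "car", "tree"]

topk_p, topk_r = path_precision_recall(topk_subset, ground_truth, tag_to_path)
samp_p, samp_r = path_precision_recall(sampled_subset, ground_truth, tag_to_path)
print(f"top-k:   precision={topk_p:.2f}, recall={topk_r:.2f}")
print(f"sampled: precision={samp_p:.2f}, recall={samp_r:.2f}")
```

In this toy case the path-constrained subset reaches more ground-truth paths (higher recall), while its larger number of covered paths pulls its precision below that of the redundant top-k subset.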

Moreover, the comparison between DPP-S-topk and DPP-I-topk highlights the influence of S. S influences the tag ranking, i.e., the quality scores of two similar (or highly related) tags should not be too close. As shown in Table 2, DPP-S-topk shows improvements over DPP-I-topk in most cases. This indicates that S indeed contributes to producing more representative and diverse tags. Meanwhile, the limited improvements remind us that this S, derived from the cosine similarity between GloVe vectors, is not good enough. Exploring a better S will be a future direction of our research. Due to the space limit, we provide some additional results in the supplementary material, including: a) the evaluation results by conventional metrics, b) the results of combining our sampling algorithm with ML-MG and LEML, to verify the diversity contribution of the sampling algorithm, and c) qualitative results of some images with predicted tag subsets, as well as the evaluation scores.

                         3 tags                      5 tags
Method             Psp     Rsp     F1−sp       Psp     Rsp     F1−sp

ESP Game
ML-MG [30]         30.51   16.55   19.73       36.61   29.63   30.59
LEML [34]          45.16   23.61   28.31       41.82   33.87   34.58
DPP-I-topk         47.39   23.77   29.02       44.79   35.37   36.77
DPP-S-topk         48.07*  23.93   29.34       45.33*  35.6    37.04*
DPP-S-sampling     42.37   30.48*  33.43*      36.15   40.1*   35.96

IAPRTC-12
ML-MG [30]         35.74   17.99   21.89       41.95   29.56   31.98
LEML [34]          43.03   19.54   24.86       47.27*  29.76   33.67
DPP-I-topk         42.88   20.24   25.3        46.64   31.06   34.35
DPP-S-topk         42.95   20.2    25.32       47.14   31.13   34.56*
DPP-S-sampling     44.01*  25.16*  30.13*      38.91   34.21*  34.23

Table 2. Results (%) evaluated by semantic metrics on ESP Game and IAPRTC-12. Higher values indicate better performance; the best result in each column for each dataset is marked with *.
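The F1−sp improvements quoted above for 3 tags (13.94% and 19%) appear to be relative gains of DPP-S-sampling over DPP-S-topk. The short check below re-derives them from the Table 2 entries; the helper name `relative_improvement` is an illustrative assumption, not part of the original evaluation code.

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score


# F1-sp at 3 tags, copied from Table 2 (DPP-S-sampling vs. DPP-S-topk).
esp_gain = relative_improvement(33.43, 29.34)
iaprtc_gain = relative_improvement(30.13, 25.32)

print(f"ESP Game:  {esp_gain:.2f}%")
print(f"IAPRTC-12: {iaprtc_gain:.2f}%")
```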

5.4. Subject Study

To evaluate the efficacy of the proposed semantic metrics, a subject study via Amazon Mechanical Turk is conducted for two algorithmic comparisons: DPP-S-sampling vs ML-MG and DPP-S-sampling vs DPP-S-topk. For each image, we present the two tag subsets produced by the two methods, and ask the annotator to judge “which tag subset can tell more useful contents about the image”. To avoid random choices by the annotator, we pick a subset of testing images for the study as follows. An image is picked if the F1−sp values of both tag subsets are larger than 0.2 (i.e., both are representative of the image content) and the absolute difference between the two values is larger than 0.15 (i.e., the two results differ enough that the annotator does not need to choose randomly). We collect 7 judgments from 7 different persons for each testing image. Then we determine the better subset through majority vote, and label the better one as 1 and the other as 0. Consequently, we obtain a binary vector for the tag subsets produced by method-1 over all testing images. Meanwhile, we compute the evaluation scores of the two subsets, using the semantic metric F1−sp and the conventional metric F1. From each of these two scores, we likewise obtain a binary vector for method-1. Then we compute the consistency (i.e., 1 − Hamming loss) between the binary vector from the subject study and each of the two binary vectors from F1−sp and F1. A higher consistency (ranging from 0 to 1) indicates that the metric is closer to human evaluation.
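The protocol above can be sketched in a few lines. This is a minimal illustration on made-up scores and votes, not the study's actual code: all function names and the toy data are hypothetical, and treating a metric tie as a disagreement with the human label is an assumption made here.

```python
def is_selected(f1sp_a, f1sp_b, min_score=0.2, min_gap=0.15):
    """Keep an image only if both subsets are representative and clearly differ."""
    return min(f1sp_a, f1sp_b) > min_score and abs(f1sp_a - f1sp_b) > min_gap


def majority_vote(votes):
    """votes: 0/1 judgments, 1 meaning the annotator preferred method-1."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def metric_preference(score_a, score_b):
    """1 if the metric prefers method-1, 0 if method-2, None on a tie."""
    if score_a == score_b:
        return None
    return 1 if score_a > score_b else 0


def consistency(human, metric):
    """1 - Hamming loss; a tie (None) never matches a human label (assumption)."""
    mismatches = sum(h != m for h, m in zip(human, metric))
    return 1.0 - mismatches / len(human)


# Toy data: (F1-sp method-1, F1-sp method-2, F1 method-1, F1 method-2, 7 votes)
toy_images = [
    (0.60, 0.30, 0.40, 0.40, [1, 1, 1, 0, 1, 1, 0]),
    (0.25, 0.55, 0.50, 0.33, [0, 0, 1, 0, 0, 0, 1]),
    (0.50, 0.45, 0.40, 0.40, [1, 0, 1, 0, 1, 0, 1]),  # gap too small: dropped
    (0.70, 0.40, 0.67, 0.33, [1, 1, 1, 1, 0, 1, 1]),
]

selected = [img for img in toy_images if is_selected(img[0], img[1])]
human = [majority_vote(img[4]) for img in selected]
by_f1sp = [metric_preference(img[0], img[1]) for img in selected]
by_f1 = [metric_preference(img[2], img[3]) for img in selected]

consistency_f1sp = consistency(human, by_f1sp)
consistency_f1 = consistency(human, by_f1)
print(f"F1-sp consistency: {consistency_f1sp:.2f}")
print(f"F1 consistency:    {consistency_f1:.2f}")
```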

The subject study results of DPP-S-sampling vs ML-MG on ESP Game are shown in the top sub-table of Table 3. In the case of 3 tags, 437 images are studied. According to the subject study, DPP-S-sampling wins on 250 images, while ML-MG wins on 187 images. We also present the judgments given by the standard F1 score and by F1−sp. F1 is consistent with the subject study on 139 images, i.e., 31.81% consistency. F1−sp is consistent with the subject study on 298 images, i.e., 68.19% consistency. Note that the conventional F1 judges the DPP-S-sampling tags and the ML-MG tags to be equivalent on 208 images, since every tag is treated equally and independently: as long as the numbers of correct tags in the two tag subsets are the same, their F1 scores are the same. In contrast, as each tag in each semantic path has a different weight when calculating F1−sp, it is less likely to give the same score to two different tag subsets. This subject study shows that the semantic metric F1−sp is much closer to human annotation than the standard F1 score.

The results of DPP-S-sampling vs DPP-S-topk on ESP Game are shown in the second sub-table of Table 3. This comparison is more challenging: the tags produced by DPP-S-sampling and ML-MG are quite different, whereas the tags produced by DPP-S-sampling and DPP-S-topk are more similar, and on many images they differ only in tags within the same semantic path. Even in this case, F1−sp gives much higher consistency with the subject study than F1, i.e., 71.95% vs 15.65% at 3 tags and 58.6% vs 22.5% at 5 tags. The subject study results on IAPRTC-12 are also shown in Table 3. F1−sp gives much higher consistency with human judgment than F1 in all cases. The above comparisons show that the proposed semantic metrics are much more consistent with human annotation than the standard metrics, and that they are suitable for quantitative DIA evaluation. Moreover, in all cases the human annotators judge that DPP-S-sampling wins on many more images than the compared methods (ML-MG and DPP-S-topk). This validates the good performance of DPP-S-sampling for DIA.

ESP Game: DPP-S-sampling vs. ML-MG
# tags  metric           DPP-S-sampling wins  ML-MG wins  equivalent  consistency
3       subject study    250                  187         0           100%
3       conventional F1  16 / 19              123 / 210   208         31.81%
3       F1−sp            231 / 351            67 / 86     0           68.19%
5       subject study    494                  53          0           100%
5       conventional F1  46 / 49              47 / 394    104         17%
5       F1−sp            341 / 357            37 / 190    0           69.1%

ESP Game: DPP-S-sampling vs. DPP-S-topk
# tags  metric           DPP-S-sampling wins  DPP-S-topk wins  equivalent  consistency
3       subject study    445                  47               0           100%
3       conventional F1  40 / 41              37 / 239         212         15.65%
3       F1−sp            324 / 341            30 / 151         0           71.95%
5       subject study    447                  82               0           100%
5       conventional F1  45 / 47              74 / 376         106         22.5%
5       F1−sp            254 / 280            56 / 249         0           58.6%

IAPRTC-12: DPP-S-sampling vs. ML-MG
# tags  metric           DPP-S-sampling wins  ML-MG wins  equivalent  consistency
3       subject study    251                  91          0           100%
3       conventional F1  15 / 20              52 / 162    160         19.59%
3       F1−sp            193 / 256            28 / 86     0           64.62%
5       subject study    388                  116         0           100%
5       conventional F1  19 / 28              83 / 374    102         20.24%
5       F1−sp            237 / 291            62 / 213    0           59.33%

IAPRTC-12: DPP-S-sampling vs. DPP-S-topk
# tags  metric           DPP-S-sampling wins  DPP-S-topk wins  equivalent  consistency
3       subject study    269                  108              0           100%
3       conventional F1  19 / 21              66 / 171         185         22.55%
3       F1−sp            213 / 270            51 / 107         0           70.03%
5       subject study    333                  121              0           100%
5       conventional F1  22 / 28              98 / 339         87          26.43%
5       F1−sp            192 / 234            79 / 220         0           59.69%

Table 3. Subject study on ESP Game and IAPRTC-12, of DPP-S-sampling vs. ML-MG, and DPP-S-sampling vs. DPP-S-topk. The entry “231 / 351” for the metric F1−sp under “DPP-S-sampling wins” in the top sub-table means: according to the F1−sp score, DPP-S-sampling wins on 351 images, on 231 of which DPP-S-sampling also wins according to the subject study, i.e., 231 is the number of consistent judgments between F1−sp and the subject study. The consistency 68.19% is computed as (231 + 67)/(351 + 86).
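The consistency values in Table 3 can be re-derived from the reported counts. The sketch below does so for the ESP Game, 3-tag comparison of DPP-S-sampling vs ML-MG; the helper name `consistency_percent` is hypothetical, and counting "equivalent" images as disagreements is inferred from the reported numbers rather than stated explicitly.

```python
def consistency_percent(agree_method1, agree_method2, total_images):
    """Share of images where the metric agrees with the subject study, in percent."""
    return 100.0 * (agree_method1 + agree_method2) / total_images


# Subject study: DPP-S-sampling wins 250 images, ML-MG wins 187.
total_images = 250 + 187

# Conventional F1: "16 / 19", "123 / 210", and 208 equivalent images.
assert 19 + 210 + 208 == total_images
f1_consistency = consistency_percent(16, 123, total_images)

# F1-sp: "231 / 351", "67 / 86", and no equivalent images.
assert 351 + 86 == total_images
f1sp_consistency = consistency_percent(231, 67, total_images)

print(f"F1:    {f1_consistency:.2f}%")
print(f"F1-sp: {f1sp_consistency:.2f}%")
```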

6. Conclusions

This work studied a new task called diverse image annotation (DIA), where an image is annotated using a limited number of tags that attempt to cover as much semantic image information as possible. This task inherently requires that the few retrieved tags are not only representative of the image but also diverse. To this end, we treated the new task as a subset selection problem and modeled it using a conditional DPP model, which naturally incorporates representation and diversity jointly. Further, we proposed a modified DPP sampling algorithm, which incorporates semantic paths. We also proposed new metrics based on these semantic paths to evaluate the quality of the diverse tag list. The experiments on two benchmarks demonstrate that our proposed method is superior to state-of-the-art image annotation approaches. An extensive subject study validates the claim that our proposed semantic metrics are much more consistent with human annotation than traditional metrics.

However, many interesting issues about the new DIA task deserve to be studied in the future. Firstly, the similarity matrix S in the DPP model is assumed to be pre-computed in this work. That is why the contribution of S is not very significant, compared with the contribution of semantic paths in sampling. In future work, we plan to learn S and W jointly. Secondly, there is still a sizeable gap between the semantic metrics and human evaluation. To bridge this gap, we will focus on updating the way that the semantic paths are constructed and weighted, based on a more detailed analysis of the path structure and tag weights. We will make the new semantic metrics available to the community as an online toolkit³. Consequently, the evaluation of DIA can be standardized for fair comparison amongst future annotation methods.

Acknowledgements. This work is supported by the King

Abdullah University of Science and Technology (KAUST)

Office of Sponsored Research. Baoyuan Wu is partially

supported by Tencent AI Lab. We thank Fabian Caba for

his help in conducting the online subject studies.

³ https://sites.google.com/site/baoyuanwu2015/


References

[1] S. S. Bucak, R. Jin, and A. K. Jain. Multi-label learning with incomplete class assignments. In CVPR, pages 2801–2808. IEEE, 2011.

[2] R. S. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino. Matrix completion for multi-label image classification. In NIPS, pages 190–198, 2011.

[3] X. Cao, H. Zhang, X. Guo, S. Liu, and D. Meng. SLED: Semantic label embedding dictionary representation for multi-label image annotation. IEEE Transactions on Image Processing, 24(9):2746–2759, 2015.

[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.

[5] M. Chen, A. Zheng, and K. Weinberger. Fast image tagging. In ICML, pages 1274–1282, 2013.

[6] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.

[7] X. Chen, X.-T. Yuan, Q. Chen, S. Yan, and T.-S. Chua. Multi-label visual classification with label exclusive context. In ICCV, pages 834–841, 2011.

[8] C. Fellbaum. WordNet. Wiley Online Library, 1998.

[9] B. Gong, W.-L. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In NIPS, pages 2069–2077, 2014.

[10] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.

[11] M. Grubinger, P. Clough, H. Muller, and T. Deselaers. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, pages 13–23, 2006.

[12] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, pages 297–312. Springer, 2014.

[13] J. Johnson, L. Ballan, and L. Fei-Fei. Love thy neighbors: Image annotation by exploiting image metadata. In ICCV, pages 4624–4632, 2015.

[14] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[16] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, pages 1193–1200, 2011.

[17] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012.

[18] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[19] Y. Li, B. Wu, B. Ghanem, Y. Zhao, H. Yao, and Q. Ji. Facial action unit recognition under incomplete data based on multi-label learning with missing labels. Pattern Recognition, 60:890–900, 2016.

[20] S. Liu, S. Yan, T. Zhang, C. Xu, J. Liu, and H. Lu. Weakly supervised graph propagation towards collective image parsing. IEEE Transactions on Multimedia, 14(2):361–373, 2012.

[21] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–43, 2014.

[22] X. Qian, X.-S. Hua, Y. Y. Tang, and T. Mei. Social image tagging with diverse semantics. IEEE Transactions on Cybernetics, 44(12):2493–2508, 2014.

[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

[25] A.-M. Tousch, S. Herbin, and J.-Y. Audibert. Semantic hierarchies for image annotation: A survey. Pattern Recognition, 45(1):333–345, 2012.

[26] L. Von Ahn and L. Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319–326. ACM, 2004.

[27] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. CNN: Single-label to multi-label. arXiv preprint arXiv:1406.5726, 2014.

[28] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.

[29] B. Wu, Z. Liu, S. Wang, B.-G. Hu, and Q. Ji. Multi-label learning with missing labels. In ICPR, 2014.

[30] B. Wu, S. Lyu, and B. Ghanem. ML-MG: Multi-label learning with missing labels using a mixed graph. In ICCV, pages 4157–4165, 2015.

[31] B. Wu, S. Lyu, and B. Ghanem. Constrained submodular minimization for missing labels and class imbalance in multi-label learning. In AAAI, pages 2229–2236, 2016.

[32] B. Wu, S. Lyu, B.-G. Hu, and Q. Ji. Multi-label learning with missing labels for image annotation and facial action unit recognition. Pattern Recognition, 48(7):2279–2289, 2015.

[33] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.

[34] H.-F. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In ICML, pages 593–601, 2014.

[35] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In CVPR, 2016.

[36] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
