978-1-4244-6516-3/10/$26.00 ©2010 IEEE
2010 3rd International Congress on Image and Signal Processing (CISP2010)
Ensemble Of Multiple Descriptors For Automatic Image Annotation
Dongjian He
College of Mechanical and Electronic Engineering, Northwest A&F University
Yangling, Shaanxi 712100, China
Yu Zheng, Shirui Pan, Jinglei Tang
College of Information Engineering, Northwest A&F University
Yangling, Shaanxi 712100, China
Abstract—Automatic image annotation (AIA) plays an important role and attracts much research attention in image understanding and retrieval. Annotation can be posed as a classification problem in which each annotation keyword is defined by a group of database images labeled with a semantic word. It has been shown that establishing a one-to-one correspondence between image regions and semantic keywords is a feasible approach to automatic image annotation. In this paper, we propose a novel algorithm, EMDAIA, for automatic image annotation based on an ensemble of descriptors. EMDAIA regards the annotation process as multi-class image classification. The procedure of EMDAIA is as follows. First, each image is segmented into a collection of image regions. For each region, a variety of low-level visual descriptors are extracted. All regions are then clustered into k categories, with each cluster associated with an annotation keyword. For an unlabeled instance, the distance between the instance and each cluster center is measured, and the nearest category's keyword is chosen to annotate it. Experimental results on LabelMe, a benchmark dataset, show that EMDAIA outperforms some recent state-of-the-art automatic image annotation algorithms.
Keywords-Automatic Image Annotation; Ensemble Descriptors; Classification; Feature Extraction
I. INTRODUCTION
Nowadays, the number of digital images is growing rapidly because of the increasing number of digital cameras, which makes image management very challenging for researchers. Automatic image annotation aims to develop methods that can predict, for a new image, the relevant keywords from an annotation vocabulary. The ultimate goal of automatic image annotation is mostly to assist image retrieval by supplying users with semantic keywords for search, which makes large image databases easier to organize.
Image annotation has been extensively researched for more than a decade [1, 3, 4, 5, 6, 7, 10, 13, 14]. There are mainly two approaches to automatic image annotation: statistical models and classification approaches. Statistical models [3, 4, 7, 10, 16] annotate images by computing the joint probability between words and image features. AIA can also be regarded as a type of multi-class image classification [6, 13, 14, 17, 18] with a very large number of classes, as large as the vocabulary size. Therefore, automatic image annotation can be considered a multi-class object recognition problem, which is an extremely challenging task and still remains an open problem in computer vision.
In spite of the many algorithms proposed with different motivations, the underlying questions are still not well solved: 1) Most automatic image annotation systems utilize a single set of features to train a single learning classifier. The problem is that a feature set which represents one image category well may fail to represent other categories. For example, the semantic categories "flower" and "tree" may differ in color, so color features may work best, but "tree" and "grass", which have similar color, may be better distinguished by texture features. This problem degrades the performance of the automatic image annotation system as the number of categories increases. 2) For each image, we often have keywords assigned to the whole image, without knowing which regions of the image correspond to these keywords. For example, the image annotation system SADE proposed in [13] utilizes a two-layer classifier for global image classification. After classification, SADE chooses the top K categories for an unlabeled image and forms a frequency list of all the words pertaining to these K categories. The probability of each word appearing is then computed, and the words whose probabilities exceed a predefined threshold are selected for the annotation. While this method successfully infers high-level semantic concepts based on global features, identification of more specific categories and objects remains a challenge.
In this study, we propose a novel automatic image annotation system which can tackle the problems mentioned above: 1) our algorithm combines different kinds of feature descriptors to boost the annotation; 2) it segments each image into several regions and establishes a one-to-one correspondence between image regions and annotation keywords. For this purpose, an ensemble of multiple descriptors for automatic image annotation (EMDAIA) is proposed. For EMDAIA, a training set with pre-annotated keywords is first constructed. Then, each image in the training set is segmented into a collection of regions, where each image region is associated with a keyword. For each region, a variety of low-level visual descriptors are extracted. Next, all regions are clustered into k categories, with each cluster associated with an annotation keyword, and the cluster centers are calculated for the annotation process. To annotate a new image region, we measure the distance between the instance and every cluster center and choose the nearest category's keyword to annotate it. Each descriptor predicts a category for the instance, so we obtain m annotation words by using m descriptors. Finally, we combine the results by calculating the probability of each keyword and select the keyword with the highest probability for this region.
The rest of this paper is organized as follows: In the next section, details of the proposed system are given. The experimental results are presented in Section 3. In Section 4 we present our conclusions and directions for future research.
II. FRAMEWORK OF EMDAIA
In this section, we introduce the EMDAIA framework for automatic image annotation. The entire framework includes two components: training and testing. Fig. 1 shows the framework of EMDAIA. The details of the EMDAIA algorithm are given as follows:
A. Training
Given a set of images I = {I_1, I_2, ..., I_N} in the training set, all the images in I are segmented into a collection of image regions R = {R_1, R_2, ..., R_{N_B}}, where N_B is the total number of image regions. W = {w_1, w_2, ..., w_T} is the set of words associated with all the images in the training set. m feature extraction algorithms D = {D_1(·), D_2(·), ..., D_m(·)} are used by EMDAIA to extract m distinct feature vectors, called descriptors. An image region can then be represented as a set of feature vectors:

R_i = {d_ij | i ∈ (1, 2, ..., n); j ∈ (1, 2, ..., m)}   (1)

where d_ij denotes the descriptor extracted from the ith image region using the jth feature extractor.
The dataset used by EMDAIA is pre-defined by a set of words, and after segmentation each region also carries a labeled keyword. Formally speaking, we are given a training set S = {(R_1, l_1), (R_2, l_2), ..., (R_{N_B}, l_{N_B})}, where R_i denotes the ith image region and l_i ∈ W is the keyword label associated with the image region R_i. One example of this type of dataset is the LabelMe dataset, which will be introduced in the experiment section.
All the image regions are then grouped into k categories, where image regions with the same label are clustered into the same category. For all image regions in the kth category, the cluster center is then computed and denoted as C_km, which represents the kth category center calculated from the mth descriptor. We use m descriptors in this paper, so each category corresponds to m different centers.
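The per-category, per-descriptor centers described above can be sketched as follows. This is an illustrative sketch, not the authors' code; the function name `compute_centers` and the toy regions are our own, and each region is assumed to already be represented by its m descriptor vectors:

```python
import numpy as np

def compute_centers(regions, labels):
    """One center per (category, descriptor) pair, as in Sec. II-A.

    regions: list of lists -- regions[i][j] is the j-th descriptor vector
             (a 1-D numpy array) extracted from the i-th region.
    labels:  list of keyword labels l_i, one per region.
    Returns a dict mapping (label, j) -> mean descriptor vector C_kj.
    """
    centers = {}
    m = len(regions[0])                      # number of descriptors
    for word in set(labels):
        idx = [i for i, l in enumerate(labels) if l == word]
        for j in range(m):
            centers[(word, j)] = np.mean([regions[i][j] for i in idx], axis=0)
    return centers

# toy example: 4 regions, 2 descriptors each, 2 keywords
regs = [[np.array([1.0, 0.0]), np.array([0.0])],
        [np.array([3.0, 0.0]), np.array([2.0])],
        [np.array([0.0, 4.0]), np.array([6.0])],
        [np.array([0.0, 6.0]), np.array([8.0])]]
labs = ["sky", "sky", "tree", "tree"]
C = compute_centers(regs, labs)
```

Because the centers are indexed by both keyword and descriptor, each of the k categories indeed carries m centers, matching the paper's C_km notation.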
B. Testing
For an unlabeled image I_j, EMDAIA first segments the image into n image regions. Features are extracted from each image region, which can be represented as a set of feature vectors d_{r_i m}, where d_{r_i m} denotes the feature vector extracted from the ith image region using the mth descriptor. Then the distances between every image region and the cluster centers are calculated.
1) Measure Similarity
The distance measure proposed in [15] is used to compute the distance between an image region and the cluster centers. Given two instances x_1 and x_2, the Mahalanobis distance is defined as:

dist_A(x_1, x_2) = ||x_1 - x_2||_A^2 = (x_1 - x_2)^T A (x_1 - x_2)   (2)

where A ∈ S_+^{d×d} is the distance metric to be learned (S_+^{d×d} is the space of all d × d positive semi-definite matrices). We define the distance between the image region R_i and all the cluster centers as the minimum distance, i.e.,

dist_{r_i m} = min_{1≤k≤K} ||d_{r_i m} - C_{km}||_A^2,   1 ≤ i ≤ n   (3)
Fig. 1 The framework of EMDAIA. There are two components: training and testing. In the training section, images with annotation keywords are given as input, and all the images are segmented into a collection of image regions. Then a separate set of low-level feature descriptors is extracted for clustering. Image regions are grouped into k categories, and each class has a cluster center. In the testing section, an unlabeled image is given as input. The image is first segmented into regions, and for each region m descriptors are extracted. By computing the distance between each region and every cluster center, the class word with the nearest distance is chosen for the prediction.
where dist_{r_i m} denotes the minimum distance for the ith region computed using the mth descriptor.
2) Ensemble the Predictions of the Visual Descriptors
From formula (3) we can see that each image region gets m nearest distances by using m descriptors for the classification. Each nearest distance corresponds to a class associated with a keyword, so the image region R_i gets m predicted words, denoted as:

v_{r_i} = {v_{r_i 1}, v_{r_i 2}, ..., v_{r_i m}}   (4)

Here, v_{r_i k} ∈ W. Among these m keywords v_{r_i k}, we must choose one keyword for the ith region, so a measure is adopted to fuse the predicted results. Assume that the probability of a word v_{r_i k} is denoted as P(v_{r_i k}); then a word v_{r_i k} appearing n_j times among the predicted words can be approximated by the following equation:

P(v_{r_i k}) = α · n_j · p_{v_{r_i k}}   (5)

where p_{v_{r_i k}} = m/n gives the ratio of categories annotated with the word v_{r_i k} to all categories, and α is set to 0.8 by experience in this paper.

Finally, the word with the maximum probability is chosen as the prediction for R_i, given by:

z_{r_i} = argmax_{k ∈ [1, m]} P(v_{r_i k})   (6)

For the unlabeled image I_j, which has been segmented into n image regions, we finally obtain the predicted words as follows:

predict(I_j) = {z_{r_1}, z_{r_2}, ..., z_{r_n}}   (7)
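The fusion step of Eqs. (4)-(6) can be sketched as follows. This is an illustrative sketch; the function name `fuse_predictions`, the toy votes, and the `ratio` table are our own, and the category ratio p_v is assumed to be precomputed:

```python
from collections import Counter

def fuse_predictions(votes, ratio, alpha=0.8):
    """Fuse the m per-descriptor keyword votes for one region, Eqs. (5)-(6).

    votes: list of m predicted keywords for region r_i (one per descriptor).
    ratio: dict word -> p_v, the ratio of categories annotated with that
           word to all categories.
    alpha: empirical constant, set to 0.8 in the paper.
    """
    counts = Counter(votes)                       # n_j per word, Eq. (5)
    score = {w: alpha * counts[w] * ratio.get(w, 0.0) for w in counts}
    return max(score, key=score.get)              # z_{r_i}, Eq. (6)

# 5 descriptor votes for one region; "tree" is both frequent among the
# votes and common among the categories, so it wins the fusion
votes = ["tree", "tree", "grass", "tree", "grass"]
ratio = {"tree": 0.3, "grass": 0.2}
z = fuse_predictions(votes, ratio)
```

Note that because every candidate score shares the same factor α, the constant does not change which word wins for a single region; it only scales the probabilities.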
The resulting automatic image annotation approach of EMDAIA is illustrated in Algorithm 1.
III. EXPERIMENTS AND RESULTS
EMDAIA is tested on the annotated image dataset created by Antonio Torralba and Bryan Russell [9]. We use the LabelMe Matlab toolbox to obtain the spatial_envelope_256x256_static_8outdoorcategories images from the following 8 classes: "coast", "forest", "highway", "mountain", "open country", "street", "inside city", and "tall building". The dataset contains 2688 images and 29424 image regions. All the images are 256×256 pixels and have been segmented into 3 to 25 image regions, with each region associated with a labeled keyword. We randomly selected 200 images from each class for training, and the remaining images are used for testing. The dictionary contains 795 words that appear in the dataset. In order to reduce the number of classes, we use the LabelMe Matlab function LMaddtags to reduce the variability of object labels used to describe the same object class; for example, "person walking", "person standing", "person occluded", "people", etc. can all be replaced by the same word "person". After this step, the total number of words is reduced to 97. Fig. 2 gives the top 20 category keywords and the number of times these objects appear in the LabelMe data used in this paper. We removed annotation terms that occurred fewer than 3 times. Fig. 3 shows some images and their labels from the LabelMe dataset. From Fig. 3 we can see that, in this dataset, each image has been segmented into several regions marked by different colors, with each region associated with an annotation keyword, which makes the annotation process much easier.
A. Feature descriptors
Both MPEG-7 descriptors and color descriptors are used in the experiments. The feature vectors corresponding to the MPEG-7 visual descriptors were extracted with the MPEG-7 eXperimentation Model (XM) software version 5.5 [12]. A total of six different MPEG-7 descriptors could be extracted
Algorithm 1 The EMDAIA algorithm
Training:
1: Segment the images I = {I_1, I_2, ..., I_N} in the training set into regions.
2: for each image region R_i (i from 1 to N_B) do
3:   Apply D_j(·) to extract descriptors, so that each image region is represented as a set of feature vectors R_i = {d_ij | i ∈ (1, ..., n); j ∈ (1, ..., m)}.
4:   Cluster all the segments into k categories.
5:   Calculate each cluster center C_km.
6: end for
Testing:
1: for an unlabeled image I_j do
2:   Segment image I_j into n image regions, where the regions are represented as sets of feature vectors d_{r_i m}.
3:   for each image region r_i do
4:     Calculate the distance dist_{r_i m} = min_{1≤k≤K} ||d_{r_i m} - C_{km}||_A^2 between r_i and every cluster center.
5:     The keyword associated with the nearest class is chosen as this region's prediction: v_{r_i} = {v_{r_i 1}, v_{r_i 2}, ..., v_{r_i m}}, where v_{r_i k} ∈ W.
6:     Compute the probability of each word v_{r_i k}: P(v_{r_i k}) = α · n_j · p_{v_{r_i k}}, where p_{v_{r_i k}} = m/n gives the ratio of categories annotated with the word v_{r_i k} to all categories, α is set to 0.8, and n_j is the number of times the word v_{r_i k} appears in the predicted words.
7:     The word z_{r_i} = argmax_{k ∈ [1, m]} P(v_{r_i k}) with maximum probability is chosen as the prediction for image region r_i.
8:   end for
9:   return the prediction predict(I_j) = {z_{r_1}, z_{r_2}, ..., z_{r_n}} for this image.
10: end for
and we chose five of them (SCD, CLD, CSD, EHD, HTD) for classification, because these five have fixed dimensionality for images of different sizes. SCD and CSD are 128-dimensional, CLD is 120-dimensional, EHD is 80-dimensional, and HTD is 62-dimensional.
The Scalable Color Descriptor (SCD) is one of the most basic color feature descriptions, describing the color distribution in images. The Color Layout Descriptor (CLD) specifies a spatial distribution of colors: the image is divided into 8×8 blocks and the dominant colors are determined for each block in the YCbCr color system. The Color Structure Descriptor (CSD) captures both color content and the structure of this content, by means of a structuring element that is slid over the image. The Edge Histogram Descriptor (EHD) is designed to capture the spatial distribution of edges by dividing the image into 16 non-overlapping blocks and then calculating 5 edge directions in each block. The Homogeneous Texture Descriptor (HTD) filters the image with a bank of orientation- and scale-tuned filters modeled using Gabor functions; the first and second moments of the energy in the corresponding frequency-domain sub-bands are then used as the components of the texture descriptor.
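As a rough illustration of the idea behind EHD (this is not the MPEG-7 XM implementation, which uses specific edge filters and quantization tables), the following sketch splits an image into a 4×4 grid of blocks and histograms gradient orientations into 5 direction bins per block, yielding an 80-dimensional vector like the EHD:

```python
import numpy as np

def edge_histogram(img, grid=4, bins=5):
    """Simplified EHD-style descriptor: per-block orientation histograms,
    weighted by gradient magnitude and normalized within each block."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # orientations in [0, pi)
    h, w = img.shape
    feats = []
    for bi in range(grid):
        for bj in range(grid):
            ys = slice(bi * h // grid, (bi + 1) * h // grid)
            xs = slice(bj * w // grid, (bj + 1) * w // grid)
            hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                   range=(0, np.pi), weights=mag[ys, xs])
            total = hist.sum()
            feats.extend(hist / total if total > 0 else hist)
    return np.array(feats)                       # grid*grid*bins values

img = np.tile(np.arange(16.0), (16, 1))          # pure horizontal ramp
f = edge_histogram(img)
```

On the horizontal ramp, all gradient energy falls into the first orientation bin of every block, so the descriptor concentrates on one direction, which is the behavior the EHD is designed to capture.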
In this paper, we also extracted color descriptors using the ColorDescriptor software v2.0 created by Koen van de Sande [2]. We chose six descriptors, i.e., RGB histogram, Opponent histogram, Hue histogram, Color moments, SIFT, and HueSIFT. The dense sampling detector is used, which samples at every 6th pixel in the image. For these six color descriptors, the total number of sampled points differs between image regions because the regions are not the same size, so we use the method in [8] to run the k-means algorithm to obtain the codewords and codebook.
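The codebook step can be sketched as follows. This is an illustrative stand-in for the codebook construction of [8]: plain Lloyd's k-means learns the codewords, and a normalized histogram of nearest-codeword assignments turns a variable number of sampled points into a fixed-length region descriptor. Function names and toy data are our own:

```python
import numpy as np

def build_codebook(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) to learn k codewords."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                # nearest codeword per point
        for j in range(k):
            if np.any(assign == j):              # skip empty clusters
                centers[j] = points[assign == j].mean(axis=0)
    return centers

def bag_of_words(points, centers):
    """Normalized histogram of nearest codewords: a fixed-length vector
    regardless of how many points were sampled in the region."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# two well-separated clouds of sampled 2-D "color features"
pts = np.vstack([np.random.default_rng(1).normal(0, 0.1, (30, 2)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 2))])
cb = build_codebook(pts, k=2)
h = bag_of_words(pts, cb)
```

The normalization is what makes regions of different sizes comparable: a region with 30 sampled points and one with 300 both map to a histogram that sums to 1.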
B. Ensemble of Descriptors vs. Single Descriptor
In this sub-section, we compare our EMDAIA algorithm, which uses multiple descriptors, with single descriptors such as CSD, SCD, and HTD. The experimental results are given in Table I.
Table I clearly shows that our algorithm EMDAIA gives more reliable annotation results than the single-descriptor algorithms.
Automatic image annotation performance is usually measured by precision and recall [3]. Precision of w_i is defined as the number of images correctly annotated with w_i divided by the total number of images annotated with w_i. Recall of a word w_i is defined as the number of images correctly annotated with w_i divided by the number of images that have w_i in the ground-truth annotation.
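The per-word precision and recall defined above can be computed as in this sketch (the function name and toy annotations are our own, not from the paper):

```python
def precision_recall(predicted, truth, word):
    """Per-word precision and recall as defined in [3].

    predicted, truth: lists of keyword sets, one per image.
    precision = # images correctly annotated with `word`
                / # images annotated with `word`
    recall    = # images correctly annotated with `word`
                / # images whose ground truth contains `word`
    """
    correct = sum(1 for p, t in zip(predicted, truth)
                  if word in p and word in t)
    n_pred = sum(1 for p in predicted if word in p)
    n_true = sum(1 for t in truth if word in t)
    prec = correct / n_pred if n_pred else 0.0
    rec = correct / n_true if n_true else 0.0
    return prec, rec

pred = [{"sky", "tree"}, {"sky"}, {"grass"}]
true = [{"sky", "tree"}, {"tree"}, {"sky", "grass"}]
p, r = precision_recall(pred, true, "sky")
```

Averaging these per-word values over the vocabulary gives the curves plotted in Fig. 4.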
Fig. 4 shows the performance of the five different descriptors and our algorithm EMDAIA. From Fig. 4 we can see that the performance of automatic image annotation can be significantly improved by an ensemble of the descriptors.
C. Comparison of EMDAIA with the State-of-the-Art
TABLE I
COMPARISON OF DIFFERENT DESCRIPTORS' ANNOTATIONS

Method | Image 1 | Image 2 | Image 3
LabelMe annotation | sand, sea, sky | path, brushes, brushes, tree | sky, mountain, tree, tree, field
EMDAIA | sand, sea, sky | path, brushes, leaf, tree | sky, mountain, tree, brushes, field
CSD | sand, sea, sky | building, lamp, leaf, tree | sky, mountain, tree, grass, field
SCD | chair, sea, sun | path, lamp, leaf, leaf | sky, waterfall, tree, brushes, grass
CLD | bridge, sea, sky | path, plant, lamp, tree | sky, mountain, tree, plant, field
HTD | sand, sky, sun | brushes, brushes, brushes, brushes | sky, rock, bushes, tree, field
EHD | sea, sea, mountain | grass, brushes, brushes, tree | sky, field, brushes, tree, path
RGB histogram | sand, sky, sky | path, tree, leaf, tree | sky, rock, grass, brushes, grass
Opponent histogram | road, sea, sea | building, plant, bushes, brushes | sky, mountain, brushes, brushes, grass
Hue histogram | sand, sky, sea | path, brushes, leaf, tree | sky, rock, tree, brushes, grass
Color moments | sand, sea, sky | sand, leaf, leaf, leaf | sky, mountain, tree, tree, field
SIFT | sun, sea, sky | wall, tree, tree, tree | sky, field, brushes, plant, grass
HueSIFT | sand, sky, sky | path, plant, tree, tree | sky, bridge, tree, brushes, grass
Fig. 3 Example images from the LabelMe dataset
Fig. 2 The top 20 category keywords in the LabelMe dataset
In order to show the effectiveness of EMDAIA, we compare it with the state-of-the-art sLDA system proposed in [3]. Table II shows the comparison between sLDA and EMDAIA.
From Table II, it can be seen that for the first image, which contains multiple objects, our algorithm EMDAIA gives much more detailed information by annotating more keywords. For the second and third images, our algorithm is comparable to the state-of-the-art algorithm sLDA.
IV. CONCLUSIONS
In this paper, we propose a novel approach, EMDAIA, for automatic image annotation. EMDAIA first segments all the images into regions and then uses a set of extraction algorithms to obtain multiple descriptors for the image regions. We cluster all the regions into k clusters, with each cluster associated with one keyword and m cluster centers. In the test phase, we measure the distance between each test region and the cluster centers and finally choose the nearest category's keyword to annotate the region.
Experimental results on a benchmark dataset show that our proposed algorithm significantly outperforms algorithms that use only a single descriptor. Moreover, our algorithm is comparable with the state-of-the-art algorithm.
There are several open problems for further study. Firstly, it would be promising to integrate a state-of-the-art image segmentation algorithm, with which we could predict keywords for a new image not contained in the LabelMe dataset. Secondly, we currently annotate only low-level semantic concepts, e.g. "rock", "tree", "sky", "sea", "water"; from the relationships among these low-level semantic keywords, it would be interesting to derive high-level concepts like "beach" and "outdoor". In the future, we will look into these areas.
ACKNOWLEDGMENT This work is funded by the National Natural Science
Foundation of China Grant #60975007.
REFERENCES
[1] P. Duygulu and K. Barnard. "Object recognition as machine translation: learning a lexicon for a fixed image vocabulary". ECCV, 2002.
[2] Koen E. A. van de Sande, Theo Gevers and Cees G. M. Snoek, “Evaluating Color Descriptors for Object and Scene Recognition”. IEEE Transactions on Pattern Analysis and Machine Intelligence (in press), 2010.
[3] Chong Wang, David Blei, and Li fei-Fei. “Simultaneous Image Classification and Annotation”. CVPR, 2009.
[4] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. “TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation”. ICCV, 2009.
[5] Ameesh Makadia , Vladimir Pavlovic , Sanjiv Kumar. “A New Baseline for Image Annotation”. ECCV, 2008.
[6] E. Chang, G. Kingshy, G. Sychay, G. Wu. “CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines”. IEEE Transactions on Circuits and System for Video Technology, 2003.
[7] Gustavo Carneiro , Antoni B. Chan , Pedro J. Moreno , Nuno Vasconcelos. “Supervised Learning of Semantic Classes for Image Annotation and Retrieval”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[8] Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. "Visual Word Ambiguity". IEEE Transactions on Pattern Analysis and Machine Intelligence (in press), 2010.
[9] Bryan C. Russell , Antonio Torralba , Kevin P. Murphy , William T. Freeman. “LabelMe: A Database and Web-Based Tool for Image Annotation”. International Journal of Computer Vision, Vol. 77, No. 1, pp. 157-173. 1 May 2008.
[10] Li-Jia Li, Richard Socher, Li Fei-Fei. “Towards Total Scene Understanding Classification, Annotation and Segmentation in an Automatic Framework”. CVPR, 2009.
[11] Frank Moosmann, Bill Triggs, Frederic Jurie. "Fast Discriminative Visual Codebooks using Randomized Clustering Forests". Neural Information Processing Systems (NIPS), 2006.
[12] Thomas Sikora. “The MPEG-7 Visual Standard for Content Description—An Overview”. IEEE Transactions on Circuits and System for Video Technology, VOL. 11, NO. 6, 2001.
[13] Emre Akbas, Fatos T. Yarman Vural. “Automatic Image Annotation by Ensemble of Visual Descriptors”. CVPR, 2007.
[14] Heidy-Marisol Marin-Castro, L. Enrique Sucar, Eduardo F. Morales. “Automatic image annotation using a semi-supervised ensemble of classifiers”. Lecture Notes in Computer Science, 2008.
[15] Rong Jin, Shijun Wang, Zhi-Hua Zhou, “Learning a Distance Metric from Multi-instance Multi-label Data”. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[16] J. Jeon, V. Lavrenko, R. Manmatha. “Automatic Image Annotation and Retrieval using Cross-Media Relevance Models”. SIGIR, 2003.
[17] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. J. V. Gool. "Modeling scenes with local descriptors and latent aspects". ICCV, 2005.
[18] B. Russell, A. Efros, J. Sivic, W. Freeman, and A. Zisserman. “Using multiple segmentations to discover objects and their extent in image collections”. CVPR, 2006.
TABLE II
COMPARISON OF sLDA AND EMDAIA ANNOTATIONS

Method | Image 1 | Image 2 | Image 3
sLDA | car, sign, road, trees | snow mountain, sea water | field, buildings occluded, window
EMDAIA | sky, mountain, road, car, fence | sky, mountain, snow | sky, building, tree, window
Fig. 4 Performance on the LabelMe dataset of our algorithm EMDAIA and the single descriptors (CSD, CLD, HTD, Hue histogram, SIFT), plotted as precision versus recall.