978-1-4244-6516-3/10/$26.00 ©2010 IEEE
2010 3rd International Congress on Image and Signal Processing (CISP2010)
Ensemble Of Multiple Descriptors For Automatic Image Annotation
Dongjian He
College of Mechanical and Electronic Engineering, Northwest A&F University
Yangling, Shaanxi 712100, China
Yu Zheng, Shirui Pan, Jinglei Tang
College of Information Engineering, Northwest A&F University
Yangling, Shaanxi 712100, China
Abstract—Automatic image annotation (AIA) plays an important role and attracts much research attention in image understanding and retrieval. Annotation can be posed as a classification problem in which each annotation keyword is defined by a group of database images labeled with a semantic word. It has been shown that establishing a one-to-one correspondence between image regions and semantic keywords is a feasible approach to automatic image annotation. In this paper, we propose a novel algorithm, EMDAIA, for automatic image annotation based on an ensemble of descriptors. EMDAIA regards the annotation process as multi-class image classification. The procedure of EMDAIA is as follows. First, each image is segmented into a collection of image regions. For each region, a variety of low-level visual descriptors are extracted. All regions are then clustered into k categories, with each cluster associated with an annotation keyword. For an unlabeled instance, the distance between the instance and each cluster center is measured, and the nearest category's keyword is chosen to annotate it. Experimental results on LabelMe, a benchmark dataset, show that EMDAIA outperforms some recent state-of-the-art automatic image annotation algorithms.
Keywords-Automatic Image Annotation; Ensemble Descriptors; Classification; Feature Extraction
I. INTRODUCTION
Nowadays, the number of digital images is growing rapidly because of the increasing number of digital cameras, which makes image management very challenging for researchers. Automatic image annotation aims to develop methods that can predict, for a new image, the relevant keywords from an annotation vocabulary. The ultimate goal of automatic image annotation is mostly to assist image retrieval by supplying users with semantic keywords for search, which makes large image databases easier to organize.
Image annotation has been extensively researched for more than a decade [1, 3, 4, 5, 6, 7, 10, 13, 14]. There are mainly two approaches to automatic image annotation: statistical models and classification approaches. Statistical models [3, 4, 7, 10, 16] annotate images by computing the joint probability between words and image features. AIA can also be regarded as a type of multi-class image classification [6, 13, 14, 17, 18] with a very large number of classes, as large as the vocabulary size. Therefore, automatic image annotation can be considered a multi-class object recognition problem, which is an extremely challenging task and still remains an open problem in computer vision.
In spite of the many algorithms proposed with different motivations, the underlying questions are still not well solved: 1) Most automatic image annotation systems utilize a single set of features to train a single learning classifier. The problem is that a feature set which represents one image category well may fail to represent other categories. For example, the semantic categories "flower" and "tree" may differ in color, so color features may work best, but "tree" and "grass", which have similar color, may be better distinguished by texture features. This problem degrades the performance of the automatic image annotation system as the number of categories increases. 2) For each image, we often have keywords assigned to the whole image, without knowing which regions of the image correspond to these keywords. For example, the image annotation system SADE proposed in [13] utilizes a two-layer classifier for global image classification. After classification, SADE chooses the top K categories for an unlabeled image and forms a frequency list of all the words pertaining to these K categories. The probability of each word appearing is then computed, and the words whose probabilities exceed a predefined threshold are selected for the annotation. While this method successfully infers high-level semantic concepts based on global features, identification of more specific categories and objects remains a challenge.
In this study, we propose a novel automatic image annotation system which can tackle the problems mentioned above: 1) our algorithm combines different kinds of feature descriptors to boost the annotation; 2) it segments each image into several regions and establishes a one-to-one correspondence between image regions and annotation keywords. For this purpose, an ensemble of multiple descriptors for automatic image annotation (EMDAIA) is proposed. For EMDAIA, a training set with pre-annotated keywords is first constructed. Then, each image in the training set is segmented into a collection of regions, where each image region is associated with a keyword. For each region, a variety of low-level visual descriptors are extracted. Next, all regions are clustered into k categories, with each cluster associated with an annotation keyword, and the cluster centers are calculated for the annotation process. To annotate a new image region, we measure the distance between the instance and every cluster center and choose the nearest category's keyword to annotate it. Each descriptor predicts a category for the instance, so we obtain m annotation words by using m descriptors. Finally, we combine the results by calculating the probability of each keyword and select the keyword with the highest probability for this region.
The rest of this paper is organized as follows: In the next section, details of the proposed system are given. The experimental results are presented in Section 3. In Section 4 we present our conclusions and directions for future research.
II. FRAMEWORK OF EMDAIA
In this section, we introduce the EMDAIA framework for automatic image annotation. The entire framework includes two components: training and testing. Fig. 1 shows the framework of EMDAIA. The details of the EMDAIA algorithm are given as follows:
A. Training
Given a set of images I = {I_1, I_2, ..., I_N} in the training set, all the images in I are segmented into a collection of image regions R = {R_1, R_2, ..., R_{N_B}}, where N_B is the total number of image regions. W = {w_1, w_2, ..., w_T} is the set of words associated with all the images in the training set. m feature extraction algorithms D = {D_1(·), D_2(·), ..., D_m(·)} are used by EMDAIA to extract m distinct feature vectors, called descriptors. An image region can then be represented as a set of feature vectors:

R_i = {d_ij | i ∈ (1, 2, ..., n); j ∈ (1, 2, ..., m)}   (1)

where d_ij denotes the descriptor extracted from the ith image region using the jth feature extractor.
The dataset used by EMDAIA is pre-defined by a set of words, and after segmentation each region also carries a labeled keyword. Formally speaking, we are given a training set S = {(R_1, l_1), (R_2, l_2), ..., (R_{N_B}, l_{N_B})}, where R_i denotes the ith image region and l_i ∈ W is the keyword label associated with the image region R_i. One example of this type of dataset is the LabelMe dataset, which will be introduced in the experiment section.
All the image regions are then grouped into k categories, where image regions with the same label are clustered into the same category. For all image regions in the kth category, the cluster center is then computed and denoted as C_km, which represents the kth category center calculated from the mth descriptor. We use m descriptors in this paper, so each category corresponds to m different centers.
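The per-category, per-descriptor centers described above can be sketched as follows. This is an illustrative sketch, not the authors' code; the function name `compute_centers` and the toy regions are our own, and each region is assumed to already be represented by its m descriptor vectors:

```python
import numpy as np

def compute_centers(regions, labels):
    """One center per (category, descriptor) pair, as in Sec. II-A.

    regions: list of lists -- regions[i][j] is the j-th descriptor vector
             (a 1-D numpy array) extracted from the i-th region.
    labels:  list of keyword labels l_i, one per region.
    Returns a dict mapping (label, j) -> mean descriptor vector C_kj.
    """
    centers = {}
    m = len(regions[0])                      # number of descriptors
    for word in set(labels):
        idx = [i for i, l in enumerate(labels) if l == word]
        for j in range(m):
            centers[(word, j)] = np.mean([regions[i][j] for i in idx], axis=0)
    return centers

# toy example: 4 regions, 2 descriptors each, 2 keywords
regs = [[np.array([1.0, 0.0]), np.array([0.0])],
        [np.array([3.0, 0.0]), np.array([2.0])],
        [np.array([0.0, 4.0]), np.array([6.0])],
        [np.array([0.0, 6.0]), np.array([8.0])]]
labs = ["sky", "sky", "tree", "tree"]
C = compute_centers(regs, labs)
```

Because the centers are indexed by both keyword and descriptor, each of the k categories indeed carries m centers, matching the paper's C_km notation.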
B. Testing
For an unlabeled image I_j, EMDAIA first segments the image into n image regions. Features are extracted from each image region, which can be represented as a set of feature vectors d_{r_i m}, where d_{r_i m} denotes the feature vector extracted from the ith image region using the mth descriptor. Then the distances between every image region and the cluster centers are calculated.
1) Measure Similarity
The distance measure proposed in [15] is used to compute the distance between an image region and the cluster centers. Given two instances x_1 and x_2, the Mahalanobis distance is defined as:

dist_A(x_1, x_2) = ||x_1 - x_2||_A^2 = (x_1 - x_2)^T A (x_1 - x_2)   (2)

where A ∈ S_+^{d×d} is the distance metric to be learned (S_+^{d×d} is the space of all d × d positive semi-definite matrices). We define the distance between the image region R_i and all the cluster centers as the minimum distance, i.e.,

dist_{r_i m} = min_{1≤k≤K} ||d_{r_i m} - C_{km}||_A^2,   1 ≤ i ≤ n   (3)
Fig. 1 The framework of EMDAIA. There are two components: training and testing. In the training section, images with annotation keywords are given as input, and all the images are segmented into a collection of image regions. Then a separate set of low-level feature descriptors is extracted for clustering. Image regions are grouped into k categories, and each class has a cluster center. In the testing section, an unlabeled image is given as input. The image is first segmented into regions, and for each region m descriptors are extracted. By computing the distance between each region and every cluster center, the class word with the nearest distance is chosen for the prediction.
where dist_{r_i m} denotes the minimum distance for the ith region computed using the mth descriptor.
2) Ensemble the Predictions of the Visual Descriptors
From formula (3) we can see that each image region gets m nearest distances by using m descriptors for the classification. Each nearest distance corresponds to a class associated with a keyword, so the image region R_i gets m predicted words, denoted as:

v_{r_i} = {v_{r_i 1}, v_{r_i 2}, ..., v_{r_i m}}   (4)

Here, v_{r_i k} ∈ W. Among these m keywords v_{r_i k}, we must choose one keyword for the ith region, so a measure is adopted to fuse the predicted results. Assume that the probability of a word v_{r_i k} is denoted as P(v_{r_i k}); then a word v_{r_i k} appearing n_j times among the predicted words can be approximated by the following equation:

P(v_{r_i k}) = α · n_j · p_{v_{r_i k}}   (5)

where p_{v_{r_i k}} = m/n gives the ratio of categories annotated with the word v_{r_i k} to all categories, and α is set to 0.8 by experience in this paper.

Finally, the word with the maximum probability is chosen as the prediction for R_i, given by:

z_{r_i} = argmax_{k ∈ [1, m]} P(v_{r_i k})   (6)

For the unlabeled image I_j, which has been segmented into n image regions, we finally obtain the predicted words as follows:

predict(I_j) = {z_{r_1}, z_{r_2}, ..., z_{r_n}}   (7)
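The fusion step of Eqs. (4)-(6) can be sketched as follows. This is an illustrative sketch; the function name `fuse_predictions`, the toy votes, and the `ratio` table are our own, and the category ratio p_v is assumed to be precomputed:

```python
from collections import Counter

def fuse_predictions(votes, ratio, alpha=0.8):
    """Fuse the m per-descriptor keyword votes for one region, Eqs. (5)-(6).

    votes: list of m predicted keywords for region r_i (one per descriptor).
    ratio: dict word -> p_v, the ratio of categories annotated with that
           word to all categories.
    alpha: empirical constant, set to 0.8 in the paper.
    """
    counts = Counter(votes)                       # n_j per word, Eq. (5)
    score = {w: alpha * counts[w] * ratio.get(w, 0.0) for w in counts}
    return max(score, key=score.get)              # z_{r_i}, Eq. (6)

# 5 descriptor votes for one region; "tree" is both frequent among the
# votes and common among the categories, so it wins the fusion
votes = ["tree", "tree", "grass", "tree", "grass"]
ratio = {"tree": 0.3, "grass": 0.2}
z = fuse_predictions(votes, ratio)
```

Note that because every candidate score shares the same factor α, the constant does not change which word wins for a single region; it only scales the probabilities.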
The resulting automatic image annotation approach of EMDAIA is illustrated in Algorithm 1.
III. EXPERIMENTS AND RESULTS
EMDAIA is tested on the annotated image dataset created by Antonio Torralba and Bryan Russell [9]. We use the LabelMe Matlab toolbox to obtain the spatial_envelope_256x256_static_8outdoorcategories images from the following 8 classes: "coast", "forest", "highway", "mountain", "open country", "street", "inside city", and "tall building". The dataset contains 2688 images and 29424 image regions. All the images are 256×256 pixels and have been segmented into 3 to 25 image regions, with each region associated with a labeled keyword. We randomly selected 200 images from each class for training, and the remaining images are used for testing. The dictionary contains 795 words that appear in the dataset. In order to reduce the number of classes, we use the LabelMe Matlab function LMaddtags to reduce the variability of object labels used to describe the same object class; for example, "person walking", "person standing", "person occluded", "people", etc. can all be replaced by the same word "person". After this step, the total number of words is reduced to 97. Fig. 2 gives the top 20 category keywords and the number of times these objects appear in the LabelMe data used in this paper. We removed annotation terms that occurred fewer than 3 times. Fig. 3 shows some images and their labels from the LabelMe dataset. From Fig. 3 we can see that, in this dataset, each image has been segmented into several regions marked by different colors, with each region associated with an annotation keyword, which makes the annotation process much easier.
A. Feature descriptors
Both MPEG-7 descriptors and color descriptors are used in the experiments. The feature vectors corresponding to the MPEG-7 visual descriptors were extracted with the MPEG-7 eXperimentation Model (XM) software version 5.5 [12]. A total of six different MPEG-7 descriptors could be extracted
Algorithm 1 The EMDAIA algorithm
Training:
1: Segment the images I = {I_1, I_2, ..., I_N} in the training set into regions.
2: for each image region R_i (i from 1 to N_B) do
3:   Apply D_j(·) to extract descriptors, so that each image region is represented as a set of feature vectors R_i = {d_ij | i ∈ (1, ..., n); j ∈ (1, ..., m)}.
4:   Cluster all the segments into k categories.
5:   Calculate each cluster center C_km.
6: end for
Testing:
1: for an unlabeled image I_j do
2:   Segment image I_j into n image regions, where the regions are represented as sets of feature vectors d_{r_i m}.
3:   for each image region r_i do
4:     Calculate the distance dist_{r_i m} = min_{1≤k≤K} ||d_{r_i m} - C_{km}||_A^2 between r_i and every cluster center.
5:     The keyword associated with the nearest class is chosen as this region's prediction: v_{r_i} = {v_{r_i 1}, v_{r_i 2}, ..., v_{r_i m}}, where v_{r_i k} ∈ W.
6:     Compute the probability of each word v_{r_i k}: P(v_{r_i k}) = α · n_j · p_{v_{r_i k}}, where p_{v_{r_i k}} = m/n gives the ratio of categories annotated with the word v_{r_i k} to all categories, α is set to 0.8, and n_j is the number of times the word v_{r_i k} appears in the predicted words.
7:     The word z_{r_i} = argmax_{k ∈ [1, m]} P(v_{r_i k}) with maximum probability is chosen as the prediction for image region r_i.
8:   end for
9:   return the prediction predict(I_j) = {z_{r_1}, z_{r_2}, ..., z_{r_n}} for this image.
10: end for
and we chose five of them (SCD, CLD, CSD, EHD, HTD) for classification, because these five have fixed dimensionality for images of different sizes. SCD and CSD are 128-dimensional, CLD is 120-dimensional, EHD is 80-dimensional, and HTD is 62-dimensional.
The Scalable Color Descriptor (SCD) is one of the most basic color feature descriptions, describing the color distribution in images. The Color Layout Descriptor (CLD) specifies a spatial distribution of colors: the image is divided into 8×8 blocks and the dominant colors are determined for each block in the YCbCr color system. The Color Structure Descriptor (CSD) captures both color content and the structure of this content, by means of a structuring element that is slid over the image. The Edge Histogram Descriptor (EHD) is designed to capture the spatial distribution of edges by dividing the image into 16 non-overlapping blocks and then calculating 5 edge directions in each block. The Homogeneous Texture Descriptor (HTD) filters the image with a bank of orientation- and scale-tuned filters modeled using Gabor functions; the first and second moments of the energy in the corresponding frequency-domain sub-bands are then used as the components of the texture descriptor.
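As a rough illustration of the idea behind EHD (this is not the MPEG-7 XM implementation, which uses specific edge filters and quantization tables), the following sketch splits an image into a 4×4 grid of blocks and histograms gradient orientations into 5 direction bins per block, yielding an 80-dimensional vector like the EHD:

```python
import numpy as np

def edge_histogram(img, grid=4, bins=5):
    """Simplified EHD-style descriptor: per-block orientation histograms,
    weighted by gradient magnitude and normalized within each block."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # orientations in [0, pi)
    h, w = img.shape
    feats = []
    for bi in range(grid):
        for bj in range(grid):
            ys = slice(bi * h // grid, (bi + 1) * h // grid)
            xs = slice(bj * w // grid, (bj + 1) * w // grid)
            hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                   range=(0, np.pi), weights=mag[ys, xs])
            total = hist.sum()
            feats.extend(hist / total if total > 0 else hist)
    return np.array(feats)                       # grid*grid*bins values

img = np.tile(np.arange(16.0), (16, 1))          # pure horizontal ramp
f = edge_histogram(img)
```

On the horizontal ramp, all gradient energy falls into the first orientation bin of every block, so the descriptor concentrates on one direction, which is the behavior the EHD is designed to capture.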
In this paper, we also extracted color descriptors using the ColorDescriptor software v2.0 created by Koen van de Sande [2]. We chose six descriptors, i.e., RGB histogram, Opponent histogram, Hue histogram, Color moments, SIFT, and HueSIFT. The dense sampling detector is used, which samples at every 6th pixel in the image. For these six color descriptors, the total number of sampled points differs between image regions because the regions are not the same size, so we use the method in [8] to run the k-means algorithm to obtain the codewords and codebook.
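The codebook step can be sketched as follows. This is an illustrative stand-in for the codebook construction of [8]: plain Lloyd's k-means learns the codewords, and a normalized histogram of nearest-codeword assignments turns a variable number of sampled points into a fixed-length region descriptor. Function names and toy data are our own:

```python
import numpy as np

def build_codebook(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) to learn k codewords."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                # nearest codeword per point
        for j in range(k):
            if np.any(assign == j):              # skip empty clusters
                centers[j] = points[assign == j].mean(axis=0)
    return centers

def bag_of_words(points, centers):
    """Normalized histogram of nearest codewords: a fixed-length vector
    regardless of how many points were sampled in the region."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# two well-separated clouds of sampled 2-D "color features"
pts = np.vstack([np.random.default_rng(1).normal(0, 0.1, (30, 2)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 2))])
cb = build_codebook(pts, k=2)
h = bag_of_words(pts, cb)
```

The normalization is what makes regions of different sizes comparable: a region with 30 sampled points and one with 300 both map to a histogram that sums to 1.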
B. Ensemble of Descriptors vs. Single Descriptor
In this sub-section, we compare our EMDAIA algorithm, which uses multiple descriptors, with single descriptors such as CSD, SCD, and HTD. The experimental results are given in Table I.
Table I clearly shows that our algorithm EMDAIA gives more reliable annotation results than the single-descriptor algorithms.
Automatic image annotation performance is usually measured by precision and recall [3]. Precision of w_i is defined as the number of images correctly annotated with w_i divided by the total number of images annotated with w_i. Recall of a word w_i is defined as the number of images correctly annotated with w_i divided by the number of images that have w_i in the ground-truth annotation.
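The per-word precision and recall defined above can be computed as in this sketch (the function name and toy annotations are our own, not from the paper):

```python
def precision_recall(predicted, truth, word):
    """Per-word precision and recall as defined in [3].

    predicted, truth: lists of keyword sets, one per image.
    precision = # images correctly annotated with `word`
                / # images annotated with `word`
    recall    = # images correctly annotated with `word`
                / # images whose ground truth contains `word`
    """
    correct = sum(1 for p, t in zip(predicted, truth)
                  if word in p and word in t)
    n_pred = sum(1 for p in predicted if word in p)
    n_true = sum(1 for t in truth if word in t)
    prec = correct / n_pred if n_pred else 0.0
    rec = correct / n_true if n_true else 0.0
    return prec, rec

pred = [{"sky", "tree"}, {"sky"}, {"grass"}]
true = [{"sky", "tree"}, {"tree"}, {"sky", "grass"}]
p, r = precision_recall(pred, true, "sky")
```

Averaging these per-word values over the vocabulary gives the curves plotted in Fig. 4.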
Fig. 4 shows the performance of the five different descriptors and our algorithm EMDAIA. From Fig. 4 we can see that the performance of automatic image annotation can be significantly improved by an ensemble of the descriptors.
C. Comparison of EMDAIA with the State-of-the-Art
TABLE I
COMPARISON OF DIFFERENT DESCRIPTORS' ANNOTATIONS

Method | Image 1 | Image 2 | Image 3
LabelMe annotation | sand, sea, sky | path, brushes, brushes, tree | sky, mountain, tree, tree, field
EMDAIA | sand, sea, sky | path, brushes, leaf, tree | sky, mountain, tree, brushes, field
CSD | sand, sea, sky | building, lamp, leaf, tree | sky, mountain, tree, grass, field
SCD | chair, sea, sun | path, lamp, leaf, leaf | sky, waterfall, tree, brushes, grass
CLD | bridge, sea, sky | path, plant, lamp, tree | sky, mountain, tree, plant, field
HTD | sand, sky, sun | brushes, brushes, brushes, brushes | sky, rock, bushes, tree, field
EHD | sea, sea, mountain | grass, brushes, brushes, tree | sky, field, brushes, tree, path
RGB histogram | sand, sky, sky | path, tree, leaf, tree | sky, rock, grass, brushes, grass
Opponent histogram | road, sea, sea | building, plant, bushes, brushes | sky, mountain, brushes, brushes, grass
Hue histogram | sand, sky, sea | path, brushes, leaf, tree | sky, rock, tree, brushes, grass
Color moments | sand, sea, sky | sand, leaf, leaf, leaf | sky, mountain, tree, tree, field
SIFT | sun, sea, sky | wall, tree, tree, tree | sky, field, brushes, plant, grass
HueSIFT | sand, sky, sky | path, plant, tree, tree | sky, bridge, tree, brushes, grass
Fig. 3 Example images from the LabelMe dataset
Fig. 2 The top 20 category keywords in the LabelMe dataset
In order to show the effectiveness of EMDAIA, we compare it with the state-of-the-art sLDA system proposed in [3]. Table II shows the comparison between sLDA and EMDAIA.
From Table II, it can be seen that for the first image, which contains multiple objects, our algorithm EMDAIA gives much more detailed information by annotating more keywords. For the second and third images, our algorithm is comparable to the state-of-the-art algorithm sLDA.
IV. CONCLUSIONS
In this paper, we propose a novel approach, EMDAIA, for automatic image annotation. EMDAIA first segments all the images into regions and then uses a set of extraction algorithms to obtain multiple descriptors for the image regions. We cluster all the regions into k clusters, with each cluster associated with one keyword and m cluster centers. In the test phase, we measure the distance between each test region and the cluster centers and finally choose the nearest category's keyword to annotate the region.
Experimental results on a benchmark dataset show that our proposed algorithm significantly outperforms algorithms that use only a single descriptor. Moreover, our algorithm is comparable with the state-of-the-art algorithm.
There are several open problems for further study. Firstly, it would be promising to integrate a state-of-the-art image segmentation algorithm, with which we could predict keywords for a new image not contained in the LabelMe dataset. Secondly, we currently annotate only low-level semantic concepts, e.g. "rock", "tree", "sky", "sea", "water"; from the relationships among these low-level semantic keywords, it would be interesting to derive high-level concepts like "beach" and "outdoor". In the future, we will look into these areas.
ACKNOWLEDGMENT This work is funded by the National Natural Science
Foundation of China Grant #60975007.
REFERENCES
[1] P. Duygulu and K. Barnard. "Object recognition as machine translation: learning a lexicon for a fixed image vocabulary". ECCV, 2002.
[2] Koen E. A. van de Sande, Theo Gevers and Cees G. M. Snoek, “Evaluating Color Descriptors for Object and Scene Recognition”. IEEE Transactions on Pattern Analysis and Machine Intelligence (in press), 2010.
[3] Chong Wang, David Blei, and Li fei-Fei. “Simultaneous Image Classification and Annotation”. CVPR, 2009.
[4] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. “TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation”. ICCV, 2009.
[5] Ameesh Makadia , Vladimir Pavlovic , Sanjiv Kumar. “A New Baseline for Image Annotation”. ECCV, 2008.
[6] E. Chang, G. Kingshy, G. Sychay, G. Wu. “CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines”. IEEE Transactions on Circuits and System for Video Technology, 2003.
[7] Gustavo Carneiro , Antoni B. Chan , Pedro J. Moreno , Nuno Vasconcelos. “Supervised Learning of Semantic Classes for Image Annotation and Retrieval”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[8] Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. "Visual Word Ambiguity". IEEE Transactions on Pattern Analysis and Machine Intelligence (in press), 2010.
[9] Bryan C. Russell , Antonio Torralba , Kevin P. Murphy , William T. Freeman. “LabelMe: A Database and Web-Based Tool for Image Annotation”. International Journal of Computer Vision, Vol. 77, No. 1, pp. 157-173. 1 May 2008.
[10] Li-Jia Li, Richard Socher, Li Fei-Fei. “Towards Total Scene Understanding Classification, Annotation and Segmentation in an Automatic Framework”. CVPR, 2009.
[11] Frank Moosmann, Bill Triggs, Frederic Jurie. "Fast Discriminative Visual Codebooks using Randomized Clustering Forests". Neural Information Processing Systems (NIPS), 2006.
[12] Thomas Sikora. “The MPEG-7 Visual Standard for Content Description—An Overview”. IEEE Transactions on Circuits and System for Video Technology, VOL. 11, NO. 6, 2001.
[13] Emre Akbas, Fatos T. Yarman Vural. “Automatic Image Annotation by Ensemble of Visual Descriptors”. CVPR, 2007.
[14] Heidy-Marisol Marin-Castro, L. Enrique Sucar, Eduardo F. Morales. “Automatic image annotation using a semi-supervised ensemble of classifiers”. Lecture Notes in Computer Science, 2008.
[15] Rong Jin, Shijun Wang, Zhi-Hua Zhou, “Learning a Distance Metric from Multi-instance Multi-label Data”. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[16] J. Jeon, V. Lavrenko, R. Manmatha. “Automatic Image Annotation and Retrieval using Cross-Media Relevance Models”. SIGIR, 2003.
[17] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. J. V. Gool. "Modeling scenes with local descriptors and latent aspects". ICCV, 2005.
[18] B. Russell, A. Efros, J. Sivic, W. Freeman, and A. Zisserman. “Using multiple segmentations to discover objects and their extent in image collections”. CVPR, 2006.
TABLE II
COMPARISON OF sLDA AND EMDAIA ANNOTATIONS

Method | Image 1 | Image 2 | Image 3
sLDA | car, sign, road, trees | snow mountain, sea water | field, buildings occluded, window
EMDAIA | sky, mountain, road, car, fence | sky, mountain, snow | sky, building, tree, window
Fig. 4 Performance on the LabelMe dataset of our algorithm EMDAIA and the single descriptors (CSD, CLD, HTD, Hue histogram, SIFT), plotted as precision versus recall.