Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, TX 76019
Robust Content-based Image Indexing
Using Contextual Clues and Automatic Pseudo-Feedback
Technical Report CSE-2003-29
Y. Alp Aslandogan, Ravishankar Mysore
Dept. of Computer Science and Engineering, The University of Texas at Arlington
[email protected]
Clement T. Yu, Bo Liu
Dept. of Computer Science, The University of Illinois at Chicago
[email protected]
Abstract
In this paper we present a robust information integration approach to identifying images of persons in large collections, such as the web. The underlying system relies on combining content analysis, which involves face detection and recognition, with context analysis, which involves extraction of text or HTML features. Two aspects are explored to test the robustness of this approach: sensitivity of the retrieval performance to the context analysis parameters, and automatic construction of a facial image database via automatic pseudo-feedback. For the sensitivity testing, we reevaluate system performance while varying context analysis parameters. This is compared with a learning approach where association rules among textual feature values and image relevance are learned via the CN2 algorithm. A face database is constructed by clustering after an initial retrieval relying on face detection and context analysis alone. Experimental results indicate that the approach is robust for identifying and indexing person images.
1 Introduction
Automatic image and text analysis techniques have certain limitations when used alone to index multimedia. One difficulty is bridging the so-called semantic gap [11, 9, 23, 24, 5]. Semantic concepts that are close to the user's intent but far from the low-level, automatically extracted features are hard to identify by relying on visual features alone. While textual context or descriptions can be used to generate semantic metadata for images, it is not possible to rely solely on context analysis either. Since the textual context is essentially separate from the non-textual media content, there is always the possibility that the descriptive information is inaccurate or insufficient. Furthermore, the textual context is a single snapshot of a particular agent's comments on the media content, and is hence highly subjective. The combination of content and context analysis, when possible, provides a richer environment for content indexing.
From the image analysis perspective, a major difficulty is having to deal with uncontrolled image
acquisition conditions. In the case of person images accessible over the web, there is no control over
the conditions of how the images are taken or processed. Many facial images have gross variations
in resolution, viewpoint, illumination and size, making most face recognizers unusable. Recently,
researchers have incorporated pre-processing and normalization steps to the recognition process in
order to account for the variations in images collected in uncontrolled environments [17]. However,
even these approaches assume an existing face database. When a known image database does not exist prior to retrieval, recognition cannot be used initially.
In this work, we investigate the robustness of an information integration approach for dealing
with the aforementioned problems. The approach is to use the textual context in guiding the visual
analysis of the image contents. Currently available textual contexts for images and video include
the following: (1) the World Wide Web, (2) TV closed captions, (3) displayed text on TV and video, (4) verbal annotations such as transcriptions of physicians' impressions of medical images, and (5)
news stories accompanying photographs.
The integration technique was implemented in the image search agent Diogenes1 [2] where the
use of text analysis, face detection, face recognition with an existing face database and heuristically
set parameters produced an overall precision of 95% for twenty celebrity queries [1]. Two questions
arose in that context: Would the system be able to obtain similar results if the parameter values
were changed, and in the absence of a face database would the system be able to use recognition?
In the following we address these two issues. Specifically, we compare the performance of the search
agent under varying contextual analysis parameters and with parameter learning. This allows us to
examine the sensitivity of the context analysis to these parameters. The experimental results show
that the approach is fairly insensitive to fine tuning of these parameters. Secondly, we implement
a pseudo-feedback method for automatic face database construction via clustering. This allows us
to determine system performance in the absence of a known image database. A key contribution of
this paper is the demonstration of the automatic pseudo-feedback method in the context of content
based image retrieval.
The paper is organized as follows: In Section 2 we give an overview of the architecture of image
search agent Diogenes upon which the techniques described in this paper are built. Section 3
provides the background for the information integration approach we have implemented. In Section
4 we examine the sensitivity of the retrieval performance to context analysis parameters. We first
experiment with different sets of weights and then we describe a parameter learning experiment
where association rules between the contextual feature values and the image relevance are learned
via the CN2 algorithm. In Section 5 we describe an automatic pseudo-feedback method that is
facilitated by evidence combination and its evaluation. Section 6 reviews related work and, finally, Section 7 gives a summary of the key points of the paper and discusses future research possibilities.
1 After the philosopher Diogenes of Sinope (d. c. 320 B.C.), who is said to have gone about Athens with a lantern in daytime looking for an honest man.
2 Background
The information integration approach that is evaluated in the following sections is implemented on top of the image search agent Diogenes. Diogenes is a web-based image search agent designed for identifying and indexing person images. The user types in the first name and the last name of a person as a query. The system returns images of that person ranked by their estimated relevance. The query page on the web interface for Diogenes is shown in Figure 1. The results of the "Abraham Lincoln" query are depicted in Figure 2.
Figure 1: Image search query page of Diogenes.
2.1 Visual and Contextual Features
In analyzing web pages, Diogenes analyzes both the images themselves and the text/HTML context
around them. The content-based visual features are: (1) Whether a human face is present in the
image, and (2) whether the image matches one of the known images in the facial image database.
Figure 2: The results of the "Abraham Lincoln" query.
Diogenes also extracts several (con)textual features to establish a degree of association among
image and person name pairs. These features include (1) the frequency of a person name on the page, (2) a match between a person name and an image alternate text, (3) a match between a person name and an image caption text, (4) a match between a person name and an image path or URL, and (5) the number of shared HTML tags between a person name and an image. The visual (content-based)
features and contextual features are analyzed and integrated in a formal evidence combination
framework.
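For illustration, the visual and contextual features above can be gathered into one record per image-name pair. The field names here are our own invention for the sketch, not Diogenes' actual data model:

```python
from dataclasses import dataclass

@dataclass
class ImageNameEvidence:
    # Content-based (visual) features
    face_present: bool       # a face detector fired on the image
    face_match_score: float  # similarity to known faces; 0.0 without a database
    # Contextual (text/HTML) features
    name_frequency: int      # occurrences of the person name on the page
    alt_text_match: bool     # name appears in the image alternate text
    caption_match: bool      # name appears in the image caption
    path_match: bool         # name appears in the image path or URL
    shared_tags: int         # HTML tags shared by the name and the image

ev = ImageNameEvidence(True, 0.8, 12, True, False, True, 3)
```

One such record per image-name pair is what the evidence combination step consumes.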
2.2 System Architecture
The system consists of a number of modules. Figure 3 illustrates Diogenes’ system architecture. We
will describe some of the important modules briefly.
Figure 3: Diogenes system architecture.
1. Search Engine Driver: This module interfaces with web text search engines and obtains a list of
URLs for a given name. The user types a query name in the form of first and last name. This
module then submits the query string according to the conventions of several different text
search engines and concatenates the results. The engines that were used for the experimental
results reported here included AltaVista, Lycos, Google, WebCrawler, and others.
2. Web Crawler: This module actually issues HTTP requests to visit each URL obtained by the
search engine driver. It retrieves the text of the URLs and then retrieves each of the images
referenced on those pages. It then saves all this information under a directory generated from
a unique time-stamp.
3. Face Detector: This neural-network based module [20] detects whether a human face exists in
an image. If one or more faces are present, it reports the locations of those faces in terms of
rectangular coordinates.
4. Face Recognizer: This module identifies new facial images by using a database of known facial
images. It computes distance values between the query image and a subset of the known
images. For the experiments reported in this paper, the face recognition module was only used
in the second stage to improve upon the detection/text analysis-only results.
5. HTML Analyzer: This module analyzes the HTML structure of the downloaded web pages. It
examines features such as common HTML tags, caption fields, and alternate fields.
6. Feedback Processor: This module processes the results of the initial retrieval, forms image
clusters and generates a face image database for use by face recognition. It then drives the
evidence combination module with the output of the context (text/HTML) analysis and the
face recognition modules. This pseudo-feedback process takes place without user interaction.
7. Evidence Combination Module: This module combines the evidence produced by the context analysis and face recognition modules using different combination mechanisms.
8. HTML Composer: This module prepares the results page based on the image links found in
the search.
3 Integrating Content-based and Contextual Information
The content-based visual evidence and the contextual evidence obtained from surrounding text are
integrated in the formal framework of Dempster-Shafer Theory of Evidence, also known as the Math-
ematical Theory Of Evidence [25]. The Dempster-Shafer theory is intended to be a generalization
of Bayesian theory of subjective probability. Since the details of this theory are beyond the scope
of this article we refer the interested reader to the relevant literature [10] and focus on Dempster’s
formula for evidence combination which we use for integrating context and content information.
3.1 Dempster’s Rule for Evidence Combination
Suppose we are interested in finding the combined evidence for a hypothesis C. We may think of C as a class assignment in pattern recognition. C is a member of 2^Θ, where Θ is our frame of discernment, the set of hypotheses under consideration. Given two independent sources of evidence
m1 and m2, Dempster’s rule for their combination is as follows:
m_{1,2}(C) = \frac{\sum_{A,B \subseteq \Theta,\; A \cap B = C} m_1(A)\, m_2(B)}{\sum_{A,B \subseteq \Theta,\; A \cap B \neq \emptyset} m_1(A)\, m_2(B)}
Here m1,2(C) is the combined Dempster-Shafer probability for C. m1 and m2 are the basic prob-
abilities assigned to sets A and B respectively by two independent sources of evidence. A and B
are supersets of C. A and B are not necessarily proper supersets and they may as well be equal to
C or to the frame of discernment Θ. The numerator accumulates the evidence which supports a
particular hypothesis and the denominator conditions it on the total evidence for those hypotheses
supported by both sources.
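To make the combination concrete, the following sketch implements Dempster's rule for basic probability assignments over subsets of Θ, encoded as frozensets. This is a generic illustration of the rule, not code from Diogenes:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: m1 and m2 map frozenset hypotheses to basic probability mass."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = a & b                    # intersection of the two focal elements
        if c:
            combined[c] = combined.get(c, 0.0) + wa * wb
        else:
            conflict += wa * wb      # mass falling on the empty intersection
    norm = 1.0 - conflict            # denominator: total non-conflicting mass
    return {c: w / norm for c, w in combined.items()}

# Two sources over a frame {P, Q}: each assigns mass to {P} and to Theta itself.
theta = frozenset({"P", "Q"})
m = combine({frozenset({"P"}): 0.6, theta: 0.4},
            {frozenset({"P"}): 0.5, theta: 0.5})
```

Here m[frozenset({"P"})] accumulates the three products supporting P (0.30 + 0.30 + 0.20 = 0.80), matching the numerator of the formula above.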
3.2 Using Dempster-Shafer Theory in Image Retrieval
The content-based visual feature, namely the relevance score obtained from the face detection/recognition
(FR) module, and the contextual feature, namely the relevance score obtained from the text/HTML
analysis module (TA) represent the two sources of evidence we have for image classification. We
assume that if more than one person appears in an image, identifying one of them is sufficient. We
designate the two pieces of evidence as mFR and mTA respectively. By default, these two modules
operate independently: the results of the face recognition module do not affect the text/HTML score and vice versa. Hence the independence assumption of the theory holds. The text/HTML analysis
module determines a degree of association between a personal name and facial image on the web
page. Similarly, the face detection/recognition module determines a degree of relevance for an image,
given a known image database. Let us assume that P represents the hypothesis that, in the context of a query (a person name P), an image I is relevant. Using Dempster's Rule for combination of
evidence we get the following:
m_{FR,TA}(P) = \frac{\sum_{A,B \subseteq \Theta,\; A \cap B = P} m_{FR}(A)\, m_{TA}(B)}{\sum_{A,B \subseteq \Theta,\; A \cap B \neq \emptyset} m_{FR}(A)\, m_{TA}(B)}
Again, P designates a hypothesis which is an element of 2^Θ. In the case of classification of personal
images, it is possible to simplify this formulation. Our face recognition and text/HTML analysis
modules give us information about the relevance of a particular image and the uncertainty of the
recognition/analysis. This means we have only beliefs for singleton classes (persons) and m(Θ), the uncertainty in the body of the evidence. With this observation we can simplify the combined
evidence.
m_{FR,TA}(P) = \frac{m_{FR}(P)\, m_{TA}(P) + m_{FR}(\Theta)\, m_{TA}(P) + m_{FR}(P)\, m_{TA}(\Theta)}{\sum_{A,B \subseteq \Theta,\; A \cap B \neq \emptyset} m_{FR}(A)\, m_{TA}(B)}
Since we are interested in the ranking of the hypotheses and the denominator is independent of any
particular hypothesis (i.e. same for all) we can ignore the denominator and compare the support
for hypotheses on the basis of the numerator only:
rank(I, P) \propto m_{FR}(P)\, m_{TA}(P) + m_{FR}(\Theta)\, m_{TA}(P) + m_{FR}(P)\, m_{TA}(\Theta)
Here ∝ represents the "is proportional to" relationship; mFR(Θ) and mTA(Θ) represent the uncertainty in the bodies of evidence mFR and mTA, respectively. Both the face recognition and text analysis uncertainties are obtained locally, i.e., for each retrieval and automatically without user interaction, in contrast to applications where the users provide the uncertainties [12].
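In code, the ranking score is then just the three products of the numerator. A minimal sketch with our own function and argument names, not the system's actual implementation:

```python
def rank_score(m_fr_p, m_fr_theta, m_ta_p, m_ta_theta):
    """Unnormalized combined support for 'image I is relevant to person P':
    m_FR(P)*m_TA(P) + m_FR(Theta)*m_TA(P) + m_FR(P)*m_TA(Theta)."""
    return m_fr_p * m_ta_p + m_fr_theta * m_ta_p + m_fr_p * m_ta_theta

# Retrieved images are sorted by this score in decreasing order.
score = rank_score(0.5, 0.2, 0.4, 0.3)   # 0.20 + 0.08 + 0.15
```

Note that an image supported by both sources outranks one supported by only one of them, which is exactly the behavior the combination is meant to produce.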
3.3 Evidence from Content Analysis
Analysis of the image content yields two kinds of information: whether a human face is present in an image and the likely identity of this face. The first piece of information is obtained by a face detector and is used for screening out pages that do not contain any facial images. The second piece is obtained by a face recognition module and is used in determining the rank of this image in the context
of a query. Suppose mFR(P ) is the evidence from face recognition module for the hypothesis that
the target image belongs to a person P named in the user’s query. It is computed by the following
formula:
m_{FR}(P) = C_{FR} \cdot d_{FR}(I, P)

and

d_{FR}(I, P) = 1 - \frac{minDistance(I, P)}{distance_{MAX}}
where dFR(I, P ) is the degree of association, according to the face recognition module, between
person P named in the query and the image I. minDistance(I, P ) is the minimum distance among
the distances between the images of person P in the training database and image I. distanceMAX
is a global constant maximum distance which is set to 10000 for the eigen-face recognition module.
Any distance value greater than distanceMAX is set equal to distanceMAX.
The multiplier constant CFR is obtained as follows:
C_{FR} = \frac{1 - m_{FR}(\Theta)}{\sum_{P \in \Phi} d_{FR}(I, P)}
mFR(Θ) represents the uncertainty in the body of evidence mFR and is obtained as follows: The
eigen-face based face recognition module used in our initial experiments provides a “distance from
face space” (DFFS) value for each recognition. This value is the distance of the target image to the
space of eigen-faces formed from the training images [31]. Diogenes uses the DFFS value to estimate
the uncertainty associated with face recognition. If the DFFS value is small, the recognition is good
(uncertainty is low) and vice versa. The following is Diogenes’ formula for the uncertainty in face
recognition:
m_{FR}(\Theta) = 1 - \frac{1}{\ln(e + DFFS)}
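The face-recognition evidence can be sketched as follows. The function and variable names are our own, with distanceMAX = 10000 taken from the text; this is an illustrative sketch, not the module's actual code:

```python
import math

DISTANCE_MAX = 10000.0  # global cap for eigenface distances, per the text

def d_fr(min_distance):
    """Degree of association from the closest training image of person P."""
    return 1.0 - min(min_distance, DISTANCE_MAX) / DISTANCE_MAX

def m_fr_theta(dffs):
    """Recognizer uncertainty from the distance-from-face-space (DFFS) value."""
    return 1.0 - 1.0 / math.log(math.e + dffs)

def m_fr(min_distances, query_person, dffs):
    """m_FR(P): scale d_FR so the singleton masses plus m(Theta) sum to one."""
    theta = m_fr_theta(dffs)
    degrees = {p: d_fr(d) for p, d in min_distances.items()}
    c_fr = (1.0 - theta) / sum(degrees.values())   # the constant C_FR
    return c_fr * degrees[query_person]
```

A small DFFS drives m_FR(Θ) toward zero, so nearly all mass goes to the singleton hypotheses, as intended.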
3.4 Evidence from Context Analysis
The context analysis module of Diogenes assigns weights to the text/HTML features found on the
web pages. Some of these features are local while others are global. The local features are specific
features associating a personal name with a particular image. Global features are features of personal
names that are not related to any specific image. Weights associated with four of the important
features are the following:
• (wfreq) Name Frequency Weight: This is the weight associated with the global feature of
name frequency. When a person’s name occurs with a high frequency on a page, that name is
assumed to be related to the images that appear on that page.
• (wtag) Shared HTML Tags Weight: As the number of HTML tags that are shared by the
image and a name increases, the degree of association between the image and the name is also
assumed to increase.
• (wpath) Image Path Match Weight: If part of the image path matches part or all of the person
name, then it is assumed that there is an association between the image and the name.
• (walt) Alternate Text Match Weight: If a person’s name appears fully or in part in the alternate
text for an image, an association is assumed.
The text/HTML analysis process proceeds as follows: When a page is retrieved, a part-of-speech
tagger (Brill’s tagger[3]) tags all the words that are part of a proper name on the page. The
occurrence frequency of these words are recorded. For each such word, and for each image on the
page, a degree of association is established. The frequency of the word serves as the starting point
for this score. Then the HTML analysis module analyzes the HTML structure of the page. If an
image and a word share some common tags, their degree of association is increased. If the word is a
substring of the image name or if the word is part of the alternate text for the image, the association
is increased further. The formula for calculating the degree of association between a word w and
image I is
d(w, I) = \omega_{freq} \cdot s_{freq} + \omega_{tag} \cdot s_{tag} + \omega_{path} \cdot s_{path} + \omega_{alt} \cdot s_{alt}
where d(w, I) is the degree of association between the word and the image; ωfreq, ωtag, ωpath,
and ωalt are the relative weights of word frequency, shared HTML tags, image name substring
property and image alternate text substring property respectively. The sfreq, stag, spath, salt are
the corresponding scores for word frequency, number of shared HTML tags, whether the word is
part of the image name, and whether the word is part of the alternate text, respectively. Since
the text/HTML analysis module assigns degrees of association to individual words, at the time of
evidence combination, a weighed combination of the scores of the two words (the first name and the
last name) that make up a personal name P is calculated to get a single text/HTML score.
d(I, P) = \alpha \cdot d(firstName(P), I) + \beta \cdot d(lastName(P), I)

For the experimental results reported here, these coefficients were \alpha = .25 and \beta = .75, respectively.
The contextual evidence mTA(P ) for the relevance of a particular image I in the context of a
person name P is obtained by normalizing d(I, P ).
m_{TA}(P) = C_{TA} \cdot \frac{d(I, P)}{D_{max}}
where Dmax is a global normalization constant. The multiplier constant CTA is obtained as follows:
C_{TA} = \frac{1 - m_{TA}(\Theta)}{\sum_{P \in \Phi} d_{TA}(I, P)}
For text analysis, uncertainty is assumed to be inversely proportional to the maximum value among
the set of degree of association values assigned to name-image combinations.
m_{TA}(\Theta) = \frac{1}{\ln(e + d_{max})}
where dmax is the local maximum numeric "degree of association" value assigned to a personal name with respect to a facial image among other names.
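The context-analysis formulas can be sketched in the same style. The weight values below are illustrative placeholders (Section 4 varies them), and the names are ours:

```python
import math

WEIGHTS = {"freq": 0.3, "tag": 0.2, "path": 0.3, "alt": 0.2}  # assumed example weights

def word_image_score(s):
    """d(w, I): weighted sum of the four contextual feature scores for one word."""
    return sum(WEIGHTS[f] * s[f] for f in WEIGHTS)

def name_image_score(first_scores, last_scores, alpha=0.25, beta=0.75):
    """d(I, P): weighted combination of first-name and last-name word scores."""
    return alpha * word_image_score(first_scores) + beta * word_image_score(last_scores)

def m_ta_theta(d_max_local):
    """Context uncertainty: shrinks as the best local degree of association grows."""
    return 1.0 / math.log(math.e + d_max_local)
```

As in the face-recognition case, a strong local association drives the uncertainty m_TA(Θ) down, letting the singleton hypotheses carry most of the mass.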
4 Sensitivity to Context Analysis Parameters
In testing the sensitivity of the retrieval accuracy to the particular weight values, we follow two
approaches. In the first approach, the weights associated with the four parameters are varied and
the overall retrieval precision is evaluated. In the second approach, instead of using fixed weights
associated with feature values, we employ the CN2 algorithm to learn association rules among feature
values and the image relevance. Tables 1, 2, and 4 show the results of the first set of experiments.
In these tables the average precision numbers are rounded to two decimal places.
In Table 1 the results of retrievals with twenty different weight sets are reported. The average
precision ranges from .89 to .95. The first observation is that there are multiple weight sets that
achieve the best or near-best results. Namely, weight set 10 produces the best average precision
of .95, weight sets 5 and 7 produce an average precision of .94, weight sets 3, 4, 8, 18, 19 and
20 produce an average precision of .93. These results suggest that the performance of the context
analysis module does not depend on fine tuning of these parameters. The next table shows the
aggregated average precision for individuals over 20 weight sets. Even averaged over 20 different
weight sets, some of which do not produce competitive results, the overall average precision is still a
competitive value of .92.

Table 1: Different Weight Sets and Resulting Average Precision.
Weight Set   Freq  Tag   Path  Alt   Avg. Precision
1            0.60  0.10  0.20  0.10  .92
2            0.30  0.30  0.10  0.30  .91
3            0.10  0.10  0.35  0.45  .93
4            0.20  0.05  0.35  0.40  .93
5            0.10  0.15  0.40  0.35  .94
6            0.40  0.05  0.20  0.35  .92
7            0.50  0.05  0.40  0.05  .94
8            0.20  0.25  0.30  0.25  .93
9            0.10  0.35  0.25  0.30  .91
10           0.05  0.45  0.45  0.05  .95
11           0.25  0.50  0.05  0.20  .89
12           0.20  0.60  0.10  0.10  .91
13           0.10  0.70  0.10  0.10  .91
14           0.05  0.80  0.05  0.10  .90
15           0.02  0.90  0.02  0.06  .90
16           0.10  0.10  0.10  0.70  .92
17           0.25  0.25  0.15  0.35  .92
18           0.15  0.10  0.25  0.50  .93
19           0.35  0.30  0.15  0.20  .93
20           0.15  0.05  0.40  0.40  .93

In the next experiment we look at individual queries to confirm previous observations and to better understand why this is the case. The query names were obtained from
the Time magazine’s Top 100 most influential people list2. The weight sets are shown in Table 3.
In Table 3 the columns labelled with feature names show the weight for that feature. The "Avg. Precision" column shows the average precision over 20 queries with this weight set. The column labelled "Num. Best" shows how many times this particular weight set provided the best average precision. Similarly, the column labelled "Num. Worst" shows the number of queries where this
particular weight set produced the worst result. Table 4 shows the results for individuals. The best precision(s) in each row is (are) shown in bold.

2 http://www.time.com/time/time100/

Table 2: Average Precision Over 20 Weight Sets for Individual Queries.
Query                  Avg. Precision
Jay Leno               .57
Michael Jordan         .85
Dalai Lama             .92
Grant Hill             .75
Dick Cheney            .93
Pete Sampras           .97
Martina Hingis         1.0
Martina Navratilova    .91
Albert Einstein        .99
Andre Agassi           .98
Diego Maradona         .89
Tiger Woods            .88
Tom Cruise             1.0
Demi Moore             1.0
Deng Xiaoping          .82
Julia Roberts          .96
Pamela Anderson Lee    1.0
Brooke Shields         1.0
Sharon Stone           1.0
Sylvester Stallone     .99
Overall                .92

A review of these results indicates that while the
system is not sensitive to small changes in individual weights, some textual features are indeed more
significant than others. For instance, the weight sets where the frequency and path match features
are assigned relatively higher weights than the other two features produce better results than others.
Similarly, the two weight sets where the weights for these features are set to 0.0 produce relatively
worse results. We have also looked at the textual feature values for the individual query results.
This analysis shows that for the top 20-40 results most images tend to have the majority of the
textual features. Therefore a variation of the textual weights does not produce a significant impact.
Below the top 40, however, certain features are less frequent, and weight sets emphasizing those
features are likely to produce more accurate results.
Table 3: Weight Sets for Individual Evaluation and Average Precision.
Weight Set   Freq  Tag   Path  Alt   Avg. Precision  Num. Best  Num. Worst
1            0.30  0.00  0.35  0.35  .91             9          3
2            0.00  0.35  0.35  0.30  .89             1          8
3            0.35  0.35  0.00  0.30  .90             2          4
4            0.35  0.35  0.30  0.00  .90             4          4
5            0.25  0.25  0.25  0.25  .89             2          7
6            0.30  0.20  0.30  0.20  .92             17         0
7            0.50  0.00  0.25  0.25  .89             2          9
Table 4: Impact of Different Weight Sets on Individual Queries.
Query                Set1  Set2  Set3  Set4  Set5  Set6  Set7
Lucille Ball         1.0   0.94  0.96  0.96  0.96  1.0   0.94
Nelson Mandela       0.94  0.9   0.9   0.9   0.92  0.96  0.92
Enrico Fermi         0.94  0.92  0.94  0.94  0.92  0.96  0.92
Marilyn Monroe       0.92  0.88  0.92  0.92  0.92  0.94  0.94
Oprah Winfrey        0.92  0.9   0.9   0.92  0.9   0.94  0.9
Mahatma Gandhi       0.92  0.92  0.9   0.92  0.9   0.94  0.88
Sigmund Freud        0.92  0.92  0.92  0.9   0.9   0.94  0.9
Albert Einstein      0.89  0.92  0.9   0.88  0.9   0.92  0.86
Mother Teresa        0.92  0.88  0.9   0.92  0.92  0.92  0.85
Charlie Chaplin      0.88  0.88  0.9   0.9   0.92  0.92  0.88
Frank Sinatra        0.9   0.84  0.88  0.9   0.88  0.9   0.86
Mao Zedong           0.8   0.83  0.87  0.86  0.86  0.9   0.86
Adolf Hitler         0.9   0.88  0.88  0.86  0.88  0.9   0.88
Mikhail Gorbachev    0.93  0.89  0.9   0.92  0.9   0.9   0.9
Henry Ford           0.92  0.9   0.9   0.88  0.88  0.9   0.88
Marlon Brando        0.9   0.88  0.9   0.9   0.86  0.9   0.88
Mikhail Gorbachev    0.9   0.86  0.86  0.88  0.86  0.9   0.86
Louis Armstrong      0.88  0.88  0.9   0.88  0.88  0.9   0.86
Bill Gates           0.92  0.89  0.88  0.88  0.87  0.88  0.85
Theodore Roosevelt   0.86  0.87  0.86  0.88  0.86  0.88  0.88
Average Precision    0.90  0.89  0.90  0.9   0.89  0.92  0.89
4.1 Automatic Parameter Learning
In our earlier experiments [1] we reported results obtained with a heuristically selected set of weights. In the previous section the effects of varying the weight sets on the retrieval results were
examined. In this section, we report the results of an experiment where association rules are used
instead of weighting.
A set of examples is used to train the system. Each example includes a set of contextual
attribute values for an image and its relevance with respect to a query (a personal name). The
program then induces a set of rules based on this data. In our application, the attributes are the
values of the context features described above. The training sample was produced as follows: For
each person, 20 different sets of weights were used. For each set of weights, the relevance information for the top 20 returned images was recorded. Thus for each person a set of 400 (20 times 20) sample
lines were produced. Each line has 4 double values and one discrete value. The double values are
the values of the context features and the discrete value is one of “Relevant” or “Irrelevant”. A
sample line from the example file is given below:
mfreq mtag mpath malt Relevance
.97 1.0 0.75 0.25 Relevant
where mfreq represents the frequency of the person name on a page divided by a global maximum
frequency, mtag represents the status of tag match (0 for no match, 0.25 for first name match, 0.75
for last name match, 1.0 for full name match); mpath represents person name match with the image path (one of 0, 0.25, 0.75 or 1.0, as in the name match), and finally malt represents person name match
with image alternate text. For ten-fold validation, the 20 persons were partitioned into 10 groups.
Each group contained 2 persons. Each group in turn was removed from the set of 20 person names
to form the test set. The remaining 18 persons’ sample files were merged to form the training set.
Rules learned from each of the training sets were saved in a file. These steps were executed ten
times to produce 10 sample (training) files and 10 sets of rules. An example rule could be of the
form:
if mfreq ≥ 0.50 ∧ mtag ≥ 0.50 ∧ mpath ≥ 0.50 ∧ malt ≥ 0.50 then relevant
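Rules of this form amount to a disjunction of conjunctive threshold tests. The sketch below applies a hand-written, hypothetical rule set of that shape; the real rules are induced by CN2 and then screened:

```python
# Each rule is a set of lower-bound thresholds; firing any rule labels the image
# "Relevant" (rules concluding "Irrelevant" are discarded during screening).
RULES = [
    {"mfreq": 0.50, "mtag": 0.50, "mpath": 0.50, "malt": 0.50},
    {"mfreq": 0.75, "mpath": 0.75},   # hypothetical second rule for illustration
]

def classify(example):
    """Return 'Relevant' if any screened rule fires on the feature values."""
    for rule in RULES:
        if all(example[attr] >= t for attr, t in rule.items()):
            return "Relevant"
    return "Irrelevant"
```

The sample line above (.97, 1.0, 0.75, 0.25) fails the first rule on malt but fires the hypothetical second rule, so it would be labelled relevant by this rule set.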
During the screening of the rules that were produced, rules whose output was "Irrelevant" were removed, as well as rules that were induced on fewer than 50 sample lines. Finally, 10 sets of screened
rules were obtained. Each of the rules is an induction on the relationship of a particular set of
attribute values and the resulting images’ being relevant. These 10 sets of learned association rules
were then used to perform test runs for the test groups. The top 20 images returned on the test sets were analyzed and the average precision over this top 20 was recorded. The total average precision
for each query is shown in Table 5. It can be observed that while the average precision remains
high, one particular person, namely Jay Leno, has low precision. This behavior is further explained
in Section 5.2.

Table 5: Learning Results.
Query                  Before Learning  After Learning
Jay Leno               .50              .55
Michael Jordan         .90              .90
Dalai Lama             .85              .80
Grant Hill             .70              .75
Dick Cheney            .95              .95
Pete Sampras           1.0              1.0
Martina Hingis         1.0              1.0
Martina Navratilova    1.0              0.95
Albert Einstein        1.0              1.0
Andre Agassi           .95              .95
Diego Maradona         .95              .95
Tiger Woods            0.80             0.85
Tom Cruise             1.0              1.0
Demi Moore             1.0              1.0
Deng Xiaoping          1.0              .75
Julia Roberts          .95              1.0
Pamela Anderson Lee    1.0              1.0
Brooke Shields         1.0              1.0
Sharon Stone           1.0              1.0
Sylvester Stallone     1.0              1.0
Average                .9275            .92

The table shows that the main retrieval mechanism of the system, namely context analysis combined with face detection, showed little sensitivity to particular parameters of context analysis.
5 Automatic Visual Pseudo-Feedback
A pseudo-feedback mechanism is implemented in Diogenes to construct a facial image database
automatically without user input. Different forms of feedback have been used successfully in the
past to improve search results in both text and multimedia retrieval systems. In interactive feedback,
the user of an information retrieval system can provide feedback to the system by designating the
results returned in response to a query as relevant or irrelevant. This kind of feedback mechanism
has been demonstrated to improve retrieval performance in both text and image retrieval systems
[6, 30, 26, 30, 14, 21, 22, 15, 4, 33]. One drawback of interactive feedback is that it requires involving the user in the process after the initial query. Usage statistics of web-based search engines indicate
that users prefer very short queries and minimal interaction when searching for a document or a
multimedia element3.
The two modes of evidence used by Diogenes facilitate another type of feedback known as automatic pseudo-feedback. This type of feedback mechanism is considered pseudo-feedback because the user does not actually go over the initial retrieval results and mark them as relevant or irrelevant. The relevance of the images in the initial retrieval results is not known; instead, certain assumptions are made about the initial accuracy.
In this method a user can initiate a retrieval with a textual query without having any facial
images of a person. The system uses the face detector to filter out images without any faces and
then ranks the images containing faces based solely on their context score. The majority of the top
ranking images retrieved with this method are relevant. The system then does a similarity analysis
3 From a presentation by a system engineer of Excite Inc. at the SIGIR '97 conference. Also, see http://www.alexa.com.
by using a face recognition module and a clustering algorithm. The purpose of this step is to group
similar images into clusters. Images taken from relatively large clusters are then used as an initial
face database for the person. Since relevant images belong to the same person and irrelevant images
tend to belong to different persons, relevant images are expected to form larger clusters around them
while irrelevant images remain alone or form only small clusters.
Figure 4: Diogenes feedback process.

Figure 4 illustrates this process. The search process starts when the user issues a query (Step 1). The system retrieves an initial set of pages and analyzes them via face detection and
context analysis (Step 2). The initial ranking is based on context analysis, provided a facial image
is found (Step 3). The goal of the next step in the pseudo-feedback process is to cluster images from
this top set based on their visual similarity to each other. The idea is that the irrelevant images in
this set will not form large clusters since they typically do not belong to a single person. Instead,
they will remain solitary or form small clusters. In our experiments we have regarded a cluster size
of three or more as a relatively large cluster. To cluster images based on their similarity to each other, the distances between all pairs of images are computed first. The images are then clustered iteratively until there are N images in clusters of size three or more, where N is the size of the face database we want to form. The hierarchical clustering algorithm used here is called UPGMA, or Unweighted Pair-Group Method using Arithmetic Averages. We present this method briefly below; details and references can be found in [8]. When the clustering is done, N images that belong to clusters of size three or more are designated as the training set for face recognition (Step 4). If there are more than N images in clusters of size three or more, the first N of these are selected. The size of the training set, N, is selected according to a performance/precision trade-off. Initially, as N grows, the accuracy of the recognition increases, but the recognition time grows as well. Beyond a certain threshold, increasing N does not lead to any further improvement in accuracy; instead, the accuracy begins to degrade. One reason for this behavior is that as more and more images from the top are included, they may contain an increasing number of irrelevant images. Since the whole process is executed without any user interaction, it is not possible to identify the irrelevant images a priori. In our experiments we found that a training set of size 10 provided the best accuracy-performance trade-off with the wavelet-based face recognition program we used.
After preparing the training set for face recognition, the system reevaluates all of the existing
images and gives them face recognition scores (Step 5). These visual scores are then combined with
the text scores via the combination mechanisms described earlier. This re-evaluation results in a
new ranking of the images (Step 6).
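Steps 5 and 6 can be illustrated with a simplified combination rule. The actual system combines evidence with the mechanisms described earlier; the weighted sum below, and the weight `alpha`, are only hypothetical stand-ins for illustration:

```python
def rerank(images, context_score, face_score, alpha=0.5):
    """Re-rank images by a combined text + face score.

    A weighted sum is used here purely for illustration; the paper's system
    combines the two modes of evidence with the mechanisms described earlier.
    """
    combined = {img: alpha * face_score[img] + (1 - alpha) * context_score[img]
                for img in images}
    return sorted(images, key=combined.get, reverse=True)

# Toy scores: x.jpg looks good textually but the face recognizer demotes it.
context = {"x.jpg": 0.9, "y.jpg": 0.6}
face = {"x.jpg": 0.1, "y.jpg": 0.8}
reranked = rerank(["x.jpg", "y.jpg"], context, face)
```

With these toy values, y.jpg moves ahead of x.jpg once the visual evidence is taken into account, which is exactly the effect the re-evaluation step is designed to produce.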
5.1 Clustering Using UPGMA
The method starts by forming a cluster from each image (containing only one image) and proceeds
as follows:
1. Compute the distance between each pair of clusters. The distance between two clusters consisting of single images is the distance between those two images; the distance between two images is obtained from the wavelet-based face recognition module [18] and is explained further below. The distance between two clusters that contain more than one image is the average of the pairwise image distances.

2. Find the pair of clusters Ci and Cj with the minimum distance among all possible pairs, and merge them. If there is a tie, merge the first pair found.

3. Repeat steps 1 and 2 until an exit criterion is met.
In our application of the UPGMA algorithm, the exit criterion was to have a total of at least N images in clusters of size three or more. This condition can be met in a number of ways, for instance with one cluster of size N or with two clusters of size N/2 each.
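The steps above can be sketched in Python as follows. This is a minimal illustration: in the system the pairwise distances come from the face-recognition module, but here they are supplied directly as toy values.

```python
def upgma(distances, n_images, target_n, min_cluster=3):
    """Agglomerative clustering with average linkage (UPGMA).

    distances: dict mapping frozenset({i, j}) -> distance between images i, j.
    Repeatedly merges the closest pair of clusters until at least target_n
    images sit in clusters of size >= min_cluster (the paper's exit criterion).
    """
    clusters = [[i] for i in range(n_images)]

    def cluster_dist(a, b):
        # Average of the pairwise image distances between the two clusters.
        return sum(distances[frozenset((i, j))]
                   for i in a for j in b) / (len(a) * len(b))

    def enough():
        return sum(len(c) for c in clusters if len(c) >= min_cluster) >= target_n

    while not enough() and len(clusters) > 1:
        # Closest pair wins; min() keeps the first pair found on ties.
        pairs = [(a, b) for idx, a in enumerate(clusters)
                 for b in clusters[idx + 1:]]
        a, b = min(pairs, key=lambda p: cluster_dist(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

# Toy distances: images 0-2 depict the same person, image 3 is an outlier.
d = {frozenset(p): v for p, v in
     {(0, 1): 0.1, (0, 2): 0.2, (1, 2): 0.15,
      (0, 3): 0.9, (1, 3): 0.85, (2, 3): 0.95}.items()}
clusters = upgma(d, n_images=4, target_n=3)
training = sorted(i for c in clusters if len(c) >= 3 for i in c)  # → [0, 1, 2]
```

As expected, the three mutually similar images form a cluster of size three and become the training set, while the outlier remains solitary.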
The computational cost of the automatic pseudo-feedback process consists of three components: the initial clustering, the face recognition, and the relevance score re-computation. The first component has a computational cost of O(M²), where M is the number of initial images to be clustered; this step can be performed in parallel or distributed mode. The second component is O(N), where N is the number of images to be reevaluated. This number is domain dependent; in our experience, for the general celebrity domain a value between 60 and 120 is typically sufficient to obtain an average precision above 90%. The third component is also O(N) and is negligible. While the total computational cost of pseudo-feedback is not excessive, it may be infeasible for online queries. In a distributed environment where the indices are pre-generated, this cost can be mitigated.
5.2 Experimental Evaluation
The process outlined in Figure 4 was implemented to examine its feasibility. The results of these
experiments are summarized in Table 6 and Figure 5. In Table 6 the number in each cell shows the
Query                 No FB   DioFB   Ditto     Google
Jay Leno              .65     .80     .60       .80
David Letterman       .55     .80     .75       .65
Michael Jordan        .85     .90     .85       .70
Dalai Lama            .95     .95     .75       .80
Grant Hill            .85     .85     .50       .85
Steve Forbes          1.0     1.0     1.0 (4)   .85
Dick Cheney           .90     .95     .25 (4)   .90
Bill Gates            1.0     1.0     .80       .80
Bill Clinton          .95     .95     .95       .75
Hillary Clinton       .95     1.0     .80       .70
Pete Sampras          1.0     1.0     1.0       .90
Martina Hingis        1.0     1.0     .90       1.0
Martina Navratilova   .95     1.0     N/A       1.0
Albert Einstein       .95     .95     1.0       .90
Andre Agassi          .90     .90     .90       1.0
Diego Maradona        1.0     1.0     .83 (6)   .95
Tiger Woods           .95     .95     1.0       .90
Tom Cruise            1.0     1.0     .85       .85
Al Gore               1.0     1.0     .90       .95
Princess Diana        1.0     1.0     .90       .70
Average               .92     .95     .82       .85
Table 6: Automatic Feedback Experimental Results.
average precision of the retrieval computed over the top 20 images. A value of .90 indicates that 18 of the top 20 images were relevant. As can be seen in Table 6, the feedback-enabled Diogenes system achieved the best precision in 17 of the 20 queries. Its average precision is also higher than those of both Ditto and Google, as well as its own previous results without feedback. In Table 6, some average precision numbers for Ditto are shown next to a number in parentheses. For these queries Ditto returned fewer than 20 images, and hence the average precision was computed over this smaller total.
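The precision figures in Table 6 are computed as in the following small example; the relevance flags here are invented purely for illustration:

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top-k retrieved images that are relevant.

    relevant_flags: list of 1 (relevant) / 0 (irrelevant) in rank order.
    """
    top = relevant_flags[:k]
    return sum(top) / len(top)

# 18 relevant images among the top 20 gives precision .90, as in the text.
flags = [1] * 18 + [0] * 2
p = precision_at_k(flags, 20)  # → 0.9
```

For the queries where Ditto returned fewer than 20 images, k is simply the number of images actually returned, which is the number shown in parentheses in Table 6.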
In Figure 5 the average precision of retrieval results before and after feedback is plotted as a
function of the number of images from the top. The values .92 and .95, which are also reported as the
average precision in Table 6, correspond to the average precision computed over the top 20 retrieved
[Figure 5 plot: average precision (y-axis, 0.5 to 1.0) versus number of images from the top (x-axis, 10 to 40), with one curve for before feedback and one for after feedback.]

Figure 5: Average precision plots of before and after feedback retrieval results for Diogenes as a function of number of images from the top.
images.⁴ If we assume ten images per page, then these results correspond to a user’s viewing two
result pages. A higher number of images from the top corresponds to a user’s viewing more pages of
retrieval results. The plot reflects the proportion of the images viewed by the user that are relevant.
Note that when the user views only the first page of results (top ten images), the average precision
of Diogenes, while using the automatic pseudo-feedback method, reaches .97.
The experimental results show that the automatic pseudo-feedback mechanism did provide an
improvement in average precision without any need for user interaction. The average precision was
improved in six queries and remained the same in the others. Especially worthy of notice are the first
two queries: Jay Leno and David Letterman. These two queries posed the greatest challenge to our
context analysis module. Some of the heuristics used by the context module were invalidated by the
pages devoted to guests of these talk show hosts.⁵ Image names, captions and alternate text fields
⁴ In our earlier experiments, we were able to obtain a score of .95 for queries not involving talk show hosts. In those experiments the system that came closest to Diogenes in terms of retrieval precision was Ditto, with an average score of .79. Details of these experiments are reported in [1].
⁵ Analysis of retrieval results by Ditto and Google showed that these search engines also suffered from similar problems.
provided misleading clues about the owners of images in those pages. An example of such an image is shown in Figure 6. The image belongs to singer Shania Twain and was taken when she was a guest
Figure 6: Shania Twain as a guest on David Letterman’s show.
on David Letterman’s show. On the page containing the image, the words David and Letterman
have the highest frequency. The image is named “letterman18s.jpg” and the alternate text for the
image reads “Letterman”. According to the heuristics employed by the context module, this image
is highly likely to belong to David Letterman. Although to the human eye it is obvious that the
image doesn’t belong to David Letterman, it is not possible to confirm this without an existing face
database. As demonstrated by the above results, the combination strategy employed by Diogenes
was able to take advantage of the visual clues by forming a face database for David Letterman
automatically and then using this source of evidence to lower the score of the above image. Overall,
Diogenes was able to improve the context-only results without user interaction. Below we show four
additional example images from image search results of Google (top 20) where textual clues are
misleading. The first and the last images belong to people who are named Jenny Jones and Larry
King respectively. While in a context-free query these can be regarded as legitimate results, if the
user’s intent is known (e.g. talk show host) these can be eliminated with the method proposed in
this work. The second image has all the textual clues for Nelson Mandela (alternate text match,
path match, shared tags, frequency) but the image does not feature Mandela himself. The third image belongs to a public speaker who presented Henry Ford at an event. These two images could also be easily eliminated with the proposed method.
⁵ (cont.) Ditto, for instance, had four guest images for Jay Leno in the top twenty; Google had two guest images for David Letterman in the top twenty.
Figure 7: Example images where (con)textual clues are misleading. Left to right: Jenny Jones, Nelson Mandela, Henry Ford and Larry King.
6 Related Work
Multimedia content has played an important role in the World Wide Web’s emergence as a popular medium of information exchange. The sheer volume of this content has created a challenge for effective and efficient information access. A number of research projects and some commercial products have been
developed to help users locate relevant multimedia elements. Some of these systems incorporate
interactive feedback features. WebSEEk⁶ [29] was among the pioneers of content based indexing of
multimedia on the web. It provides access to images and videos on the web via keywords and similarity queries. WebSEEk and Amore⁷ [19] categorize their images into conceptual categories such
as arts, sports, celebrities, movies etc. With WebSEEk, the user can start a querying session with a
random image and search by similarity; or he can type keywords to be used in conjunction with the
visual features such as color and shape. The user can also indicate the relative significance of those
features. ImageScape⁸ and ImageRover [30, 14] both use textual and visual information to classify
⁶ http://www.ctr.columbia.edu/webseek
⁷ http://www.ccrl.com/amore/
⁸ http://www.liacs.nl/home/lim/image.scape.html
images. While ImageScape aims at indexing arbitrary images found on the web, ImageRover is fo-
cused on nature images. The relevance feedback mechanism in ImageRover is based on recomputing
the distance metrics that are used in determining image similarity based on user feedback. In [21]
the content based image retrieval system MARS is presented for retrieving images using automat-
ically extracted visual features. This system incorporates the term weighting approach developed
in the Information Retrieval field for reevaluating and hence improving image retrieval results. A
Bayesian approach is used in the feedback mechanism proposed in PicHunter [4]. In the partial
labelling approach described in [23] textual and visual information are integrated to label a portion
of an image database and serve as seeds to enable further visual searches. The pseudo-feedback
method employed by Diogenes is intended for applications where interactive feedback is infeasible.
Commercial search engines such as AltaVista⁹ and Lycos¹⁰ also provide image search capability but no feedback mechanism. Another commercial image search engine, Ditto¹¹, emphasizes timely celebrity image searches. A recent addition to the highly successful Google search engine is its image search interface.¹² While none of these search engines provides a feedback mechanism, they
do provide the advanced search features offered for text searches. In the SIMPLIcity search engine
[32], region-based wavelet signatures are used for content based image retrieval and the detection of
various types of objects in images.
7 Conclusion and Future Work
Content-based indexing of multimedia in large collections is a challenging task. In this paper we have
presented a robust information integration (evidence combination) approach for indexing person
images. In the proposed approach contextual evidence is combined with content-based evidence to
⁹ http://www.altavista.com/sites/search/simage
¹⁰ http://multimedia.lycos.com
¹¹ http://www.ditto.lycos.com
¹² http://images.google.com
achieve accurate identification of person images. In any retrieval and indexing system that relies on a set of parameters and an existing database of known objects, two questions arise. First, are the system parameters fine-tuned for a particular set of queries, and if so, would the performance degrade for another set of queries or another set of weights? Second, in the absence of a known object database, would recognition be possible? In the experiments reported here we have shown that face
detection combined with context analysis is a robust way to identify person images on the web.
Namely, we have shown that the approach is not sensitive to fine-tuning of textual feature weights,
and furthermore, in the absence of a known person image database, the system can construct a
database automatically and possibly improve previous retrieval results that did not make use of
recognition.
The proposed method was demonstrated in the domain of person image indexing. However,
a number of other applications are obvious. Replacing the face detector used in the system with
another object detector and replacing the contextual analysis module to reflect heuristics about that
object type would render a new system capable of handling a new object type. Neural networks
have been shown to be very effective in detecting a multitude of objects [7, 13, 16]. Region-based
wavelet signatures have also been shown to be effective in object detection in images [32].
An important problem in content based multimedia indexing and retrieval is that of the semantic
gap between the user’s intent in querying and what an automated indexing system can identify in
multimedia. The MPEG-7 standard provides a common mechanism to describe the contents of video
data in terms closer to the users’ semantic concepts. However, generation of these high level concepts
is an open research problem [11]. Evidence combination and feedback are likely to be very useful
in enabling automated semantic content extraction for video as well as for other forms of media
such as radio archives, XML document archives, images and computer generated structured media
elements. The Transferable Belief Model (TBM) [27, 28] appears to be another good candidate for
evidence combination in addition to the Dempster-Shafer model used in the present system.
References
[1] Y. Alp Aslandogan and Clement Yu. Experiments in Using Visual and Textual Clues for Image Hunting on the Web. In Proceedings of VISUAL 2000, Lyon, France, pages 108–119, November 2000.
[2] Y. Alp Aslandogan and Clement Yu. Multiple Evidence Combination in Image Retrieval: Diogenes Searches for People on the Web. In Proceedings of ACM SIGIR 2000, Athens, Greece, pages 88–95, July 2000.
[3] Eric Brill. Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 722–727, 1994.
[4] I. Cox, M. Miller, S. Omohundro, and P. Yianilos. PicHunter: Bayesian relevance feedback for image retrieval. Volume 3, pages 361–369, 1996.
[5] Chitra Dorai and Svetha Venkatesh. Bridging the semantic gap in content management systems: Computational media aesthetics. In Proceedings of COSIGN 2001: Computational Semiotics for Games and New Media, pages 33–52, 2001.
[6] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and Effective Querying by Image Content. Journal of Intelligent Information Systems, 3(1):231–262, 1994.
[7] B. A. Golomb, D. T. Lawrence, and T. J. Sejnowski. SexNet: A Neural Network Identifies Sex from Human Faces. In Advances in Neural Information Processing Systems 3, 1991.
[8] Earl Gose, Richard Johnsonbaugh, and Steve Jost. Pattern Recognition and Image Analysis. Prentice Hall, 1996.
[9] William I. Grosky and Rong Zhao. Negotiating the semantic gap: From feature maps to semantic landscapes. In Conference on Current Trends in Theory and Practice of Informatics, pages 33–52, 2001.
[10] David L. Hall. Mathematical Techniques in Multisensor Data Fusion. Artech House, 1992.
[11] R. Jain and A. Hampapur. Metadata in video databases. SIGMOD Record (ACM Special Interest Group on Management of Data), 23(4):27–33, 1994.
[12] Joemon M. Jose, Jonathan Furner, and David J. Harper. Spatial Querying for Image Retrieval: A User Oriented Evaluation. In ACM SIGIR, pages 232–240, 1998.
[13] A. Katz and P. Thrift. Hybrid neural network classifiers for automatic target detection, 1993.
[14] Marco LaCascia, Saratendu Sethi, and Stan Sclaroff. Combining Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web. In Proceedings of IEEE Workshop on Content-Based Access of Image and Video Libraries, June 1998.
[15] Christophe Meilhac and Chahab Nastar. Relevance feedback and category search in image databases. In ICMCS, Vol. 1, pages 512–517, 1999.
[16] Tom M. Mitchell. Machine Learning. McGraw Hill, New York, US, 1996.
[17] Baback Moghaddam and Alex Pentland. Face Recognition using View-Based and Modular Eigenspaces. Automatic Systems for the Identification and Inspection of Humans, SPIE, 2277, July 1994.
[18] Xiaoyan Mu, Mehmet Artiklar, Metin Artiklar, Mohamad Hassoun, and Paul Watta. Training Algorithms for Robust Face Recognition using a Template-matching Approach. In Proceedings of IJCNN '01, July 2001.
[19] Sougata Mukherjea, Kyoji Hirata, and Yoshinori Hara. AMORE: A World Wide Web Image Retrieval Engine. World Wide Web, 2(3):115–132, 1999.
[20] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, January 1998.
[21] Y. Rui, T. Huang, and S. Mehrotra. Content-based image retrieval with relevance feedback in MARS. In Proceedings of the IEEE International Conference on Image Processing, 1997.
[22] Y. Rui, T. S. Huang, and S. Mehrotra. Relevance feedback techniques in interactive content-based image retrieval. In Storage and Retrieval for Image and Video Databases (SPIE), pages 25–36, 1998.
[23] Simone Santini. The integration of textual and visual search in image databases. In First International Workshop on Intelligent Multimedia Computing and Networking, 2000.
[24] Simone Santini, Amarnath Gupta, and Ramesh Jain. Emergent semantics through interaction in image databases. Knowledge and Data Engineering, 13(3):337–351, 2001.
[25] Glenn Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
[26] Alan F. Smeaton and Ian Quigley. Experiments on Using Semantic Distances Between Words in Image Caption Retrieval. In Proceedings of the ACM SIGIR Conference, 1996.
[27] Ph. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 66:191–234, 1994.
[28] Ph. Smets and R. Kennes. The transferable belief model for quantified belief representation. Handbook of Defeasible Reasoning and Uncertainty Management Systems, 1:267–301, 1998.
[29] J. R. Smith and S. F. Chang. Visually Searching the Web for Content. IEEE Multimedia, 4(3):12–20, July–September 1997.
[30] Leonid Taycher, Marco LaCascia, and Stan Sclaroff. Image Digestion and Relevance Feedback in the ImageRover WWW Search Engine. In Proceedings of SPIE Visual 97, 1997.
[31] M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[32] James Ze Wang, Jia Li, and Gio Wiederhold. SIMPLIcity: Semantics-sensitive Integrated Matching for Picture Libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001.
[33] M. E. J. Wood, N. W. Campbell, and B. T. Thomas. Iterative refinement by relevance feedback in content-based digital image retrieval. In ACM Multimedia 98, pages 13–20, Bristol, UK, 1998. ACM.